Monday, October 14, 2024

Kubernetes: How to deploy different pods close to each other (same node or zone etc.)

In my last two posts, I discussed how to avoid and how to enforce pod scheduling on specific nodes. You can read them here:

Avoid scheduling pods on certain nodes

Schedule pods on certain nodes

In this post, I will talk about how to schedule your pods near other pods, or in other words, how to schedule a pod on a node that is already running some other pod. I will use some terminology from my earlier post about scheduling pods on certain nodes, so please make sure you have gone through that post first.

The idea here is that when your pod is scheduled, it should be placed on a node that is already running some other specific pod (or, in the anti-affinity case, on a node that is not running that pod). First, let's talk about possible use cases where you might need this.

Use cases

  • An application can have pods that frequently talk to each other, so you want them placed on the same node to avoid latency. For example, in an e-commerce application, every time you place an order, the order service needs to check inventory before accepting it. Every order placement triggers an inventory-check service invocation, so you might want to make sure that a node running an order service pod also runs an inventory service pod.
  • Another example could be placing a cache pod next to a web application pod for faster access to cache contents.
  • You may need to make sure that no more than one replica of a pod is scheduled on a node, so that your pods are spread out as much as possible.

Pod Affinity: Schedule pods closer to already running pods

In my previous post, I talked about node affinity to schedule pods on a specific set of nodes. In this post, I will talk about another kind of affinity: pod affinity. Here is how pod affinity and anti-affinity work:

Schedule a pod (or don’t schedule it, in the case of anti-affinity) onto a node in topology domain X if a pod matching rule Y is already running there. Here X is a topology domain identified by a node label key (a single node, a zone, and so on), and Y is a label selector that matches the already running pods.

Don’t panic if the above statement doesn’t make much sense yet. We will discuss it in detail.

Like node affinity, pod affinity rules come in two types:

  • requiredDuringSchedulingIgnoredDuringExecution: This setting means that the specified affinity or anti-affinity rule must be met for the pod to be scheduled onto a node. If the criteria cannot be fulfilled, the pod will remain unscheduled, waiting for the conditions to be satisfied. This can be useful for critical applications that require specific placement for performance or regulatory reasons but might lead to unscheduled pods if the cluster does not have the capacity or the right node configuration.
  • preferredDuringSchedulingIgnoredDuringExecution: This setting tells the scheduler that it should try to enforce the rules, but if it can't, the pod can still be scheduled onto a node that doesn't meet the criteria. This approach provides more flexibility and is generally recommended for use in larger or more dynamic environments where strict adherence to the rules may not always be possible or optimal.

The choice between required and preferred rules will depend on the specific needs of your applications and the desired balance between strict scheduling requirements and the flexibility of pod placement.

To understand pod affinity, let’s discuss a scenario from the use cases above, where you want to place a web-app pod on the same node where a cache pod is already running (or vice versa, depending on your application).

Here is the YAML definition to achieve this:

apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: name
            operator: In
            values:
            - web-cache
        topologyKey: my-node-label  
  containers:
  - name: web-app
    image: <container-image>

This YAML ensures that the web-app pod is scheduled on a node that already runs a pod labeled name=web-cache. The topologyKey defines the topology domain (here, a node label key) within which the co-location rule is evaluated.

Let's talk about some key elements used in the above YAML.

podAffinity: This contains all the affinity rules the scheduler evaluates when placing this pod.

topologyKey: This is a label key on the target nodes. Remember, a label has a key and a value, and here we are talking only about the key (the value does not matter): all nodes that share the same value for this key form one topology domain. It can be any label key, whether set by you or by your cloud vendor.

With the above pod definition file, we are deploying a pod named “web-app”, and we want to place it on a node that is already running a cache pod labeled “web-cache”. Under podAffinity, we are telling the scheduler to schedule this pod on a node that has a label with key “my-node-label” and that is already running a pod with label name=web-cache.

While finding a suitable node for this pod, the scheduler takes the above requirement into account, and whether the pod gets scheduled or not depends on the podAffinity type used. In our example, we used requiredDuringSchedulingIgnoredDuringExecution, so if there is no node running the cache pod, the web pod will not be scheduled and will stay in the Pending state until a suitable node is found. It is therefore important to understand how your application works and which pods get scheduled before others, and to pick the appropriate podAffinity type accordingly.
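
As a side note, for same-node or same-zone placement you don't have to invent your own node label: Kubernetes already sets well-known labels on every node, such as kubernetes.io/hostname and topology.kubernetes.io/zone. Here is a minimal sketch of the same affinity rule using the built-in hostname label, which makes "close to each other" mean "on the same node":

apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: name
            operator: In
            values:
            - web-cache
        topologyKey: kubernetes.io/hostname  # use topology.kubernetes.io/zone for same-zone placement
  containers:
  - name: web-app
    image: <container-image>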

Pod Anti Affinity: Avoid scheduling pods on nodes which are already running certain pods

To understand pod anti-affinity, let’s discuss another scenario from the use cases above, where we want to make sure that no more than one pod of a given type runs on a node. The YAML definition below achieves this.

apiVersion: v1
kind: Pod
metadata:
  name: web-app
  labels:
    app: web-app
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web-app
        topologyKey: my-node-label  
  containers:
  - name: web-app
    image: <container-image>

The only new element is podAntiAffinity, which is the opposite of podAffinity: podAffinity attracts a pod towards nodes already running matching pods, while podAntiAffinity keeps it away from them.

In the above YAML definition, we are deploying a pod named “web-app”, and the goal is to make sure that no more than one “web-app” pod runs on a node.

Under podAntiAffinity, we are telling the scheduler to schedule this pod on a node that has a label with key “my-node-label” only if no pod labeled app=web-app is already running on that node. So, if a pod with the same label is running on a node, the new pod will be scheduled on some other node.
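
In practice you usually attach such an anti-affinity rule to a Deployment rather than to a standalone pod, so that every replica repels every other replica. The following is a minimal sketch of that idea (the Deployment wrapper is my addition, not part of the original example), using the built-in kubernetes.io/hostname label as the topology key so that at most one replica lands on each node:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-app
            topologyKey: kubernetes.io/hostname  # at most one replica per node
      containers:
      - name: web-app
        image: <container-image>

With the required rule, a fourth replica on a three-node cluster would stay Pending; switch to preferredDuringSchedulingIgnoredDuringExecution if you would rather co-locate replicas than block them.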

A word of caution

Though podAffinity and podAntiAffinity let you control the placement of pods, they also have a downside: they require a substantial amount of processing, which can slow down scheduling in large clusters significantly. They are not recommended for clusters larger than several hundred nodes.

Million dollar question: why not place containers in the same pod

Why should we go through all this complexity to place two pods together? Why not place multiple containers in a single pod; after all, a pod is nothing but a wrapper around containers. So, rather than going through all of the above, should we just put multiple containers in one pod? The short answer is: it depends. The longer answer: you need to keep the following things in mind:

  1. Nothing stops you from putting multiple containers in a pod, but it all depends on your application requirements. If you think your business model and the way your application operates would benefit from it, go ahead and do it.
  2. If your pod crashes due to a problem in one of its containers, you also lose the other application, which might not be at fault at all.
  3. When scaling horizontally, you will be scaling both applications together, even if only one of them needed to scale. That means you will pay for more resources.

Conclusion

Pod affinity and anti-affinity are powerful tools for controlling pod placement in Kubernetes. However, in large clusters, be mindful of the performance impact. Always balance the need for specific pod placement with scalability and performance concerns.

Please use the comments section if you have any questions or feedback.

Friday, September 27, 2024

Kubernetes: How to schedule pods on certain nodes

In my previous post, I talked about how you can avoid pods being scheduled on certain nodes. In this post, I will discuss the opposite: how you can schedule pods on certain nodes. This can be a hard requirement (failing which the pod won’t be scheduled) or a soft requirement (the scheduler will try to meet the requirement, but if it can’t, the pod will be scheduled on some other node). Let’s discuss how we can do this.

Use case

There are a couple of use cases where you want some pods to always be scheduled on certain nodes. Some of them are:

  1. Pods have specific hardware/resource requirements, and you want those pods to always be scheduled on nodes that have the supporting resources.
  2. Pods have specific security requirements to align with certain industry standards, and they should always go to nodes that satisfy those requirements.

There are 2 ways of assigning a pod to a node:

  1. Node selector, which can do the job but is not very expressive
  2. Node affinity, which is a bit more complex than node selector but gives you more features as well.

We will discuss both these approaches.

First approach, using Node selector

With a node selector, you decorate the pod with a label (key-value pair) of the node on which you want this pod to be scheduled. Once you specify the node selector, the pod will be scheduled on a node that has the corresponding matching label(s).

Node selector is a field in the pod spec; you specify it as part of the pod definition. Here are the steps to do this.

Step 1: Label the node

You can assign labels to a node using the following command:

kubectl label nodes <node-name> <label-key>=<label-value>

In the above command, you need to replace the placeholders with appropriate values. <node-name> is the name of one of the nodes in your cluster. You can get the list of nodes in your cluster using the following command:

kubectl get nodes 

<label-key> and <label-value> can be any arbitrary key and value (see the example below).
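
For instance, assuming a node named worker-node-1 (a made-up name) that has SSD-backed storage, you could label it and verify the label like this:

kubectl label nodes worker-node-1 disktype=ssd
kubectl get nodes --show-labels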

Step 2: Add node selector to pod configuration

See the pod configuration below. The nodeSelector is defined in the pod spec, after the container definition.

apiVersion: v1
kind: Pod
metadata:
  name: <pod-name>
spec:
  containers:
  - name: <pod-name>
    image: <image-name>
  nodeSelector:
    <label-key>: <label-value>

Replace the placeholders with appropriate values. Make sure the key and value in the node selector match the label key and value of the node, as set in step 1 above.

Once this pod is created, it will be scheduled on a node having the specified label.

You can verify this by using the following command and checking the node on which the pod is scheduled.

kubectl get pods -o wide

Multiple node selectors

You can specify multiple key-value pairs in the node selector, and in that case the scheduler will try to find a node with all of the key-value pairs. If it can’t find any node with all the labels mentioned in the node selector, the pod will not be scheduled. Here is how you specify multiple node selector entries:

apiVersion: v1
kind: Pod
metadata:
  name: <pod-name>
spec:
  containers:
  - name: <pod-name>
    image: <image-name>
  nodeSelector:
    <label-key-1>: <label-value>
    <label-key-2>: <label-value>

Since nodeSelector is a map, the key must be different for every entry.

Second approach, using Node affinity

We can use a node selector for scheduling pods on certain nodes, but a node selector is not very expressive. Instead, we can use node affinity, which is more expressive. Here are a few differences between them:

  1. A node selector only supports exact-match AND semantics, whereas node affinity also supports operators such as In, NotIn, Exists, DoesNotExist, Gt and Lt.
  2. With node affinity you can specify a requirement as either a hard or a soft rule. We will discuss shortly what hard and soft rules are.

With flexibility comes complexity. Node affinity rules are more complex than node selector rules, but once you understand them you will be able to write them fairly easily.

Hard and soft rules

Hard rules are conditions that the scheduler looks for before scheduling a pod on a node; if they are not met, the pod will not be scheduled. Hard requirements are specified with requiredDuringSchedulingIgnoredDuringExecution.

Soft rules are conditions that the scheduler tries to match before scheduling a pod on a node, but if it can’t find such a node, it will still schedule the pod on the most preferred node available. Soft requirements are specified with preferredDuringSchedulingIgnoredDuringExecution.

Using node affinity

NodeAffinity has 2 fields as mentioned above:

  • requiredDuringSchedulingIgnoredDuringExecution
  • preferredDuringSchedulingIgnoredDuringExecution

If you look carefully, one starts with required and the other starts with preferred, and the rest of the name is the same in both fields. Both names refer to two phases of the pod lifecycle: scheduling and execution. The first half of each name describes the conditions the scheduler must (or should) satisfy before scheduling the pod, and the second half describes what happens during execution of the pod on a node. “IgnoredDuringExecution” means that if the labels on the node change after a pod has been scheduled there, the pod is not affected and continues to run.

Using hard rules

Like nodeSelector, nodeAffinity is also a field of the pod spec. Here is how you can use it. Replace the placeholders with appropriate values.

apiVersion: v1
kind: Pod
metadata:
  name: <pod-name>
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: <label-key>
            operator: In
            values:
            - <label-value-1>
            - <label-value-2>

  containers:
  - name: <pod-name>
    image: <image-name>

For a pod definition with the above affinity rules, the scheduler will try to find a node with label key “<label-key>” and a label value of either “<label-value-1>” or “<label-value-2>”. Since we have specified a hard rule, if no node exists with either of these label values, the pod will not be scheduled.

Multiple nodeSelectorTerms vs matchExpressions

If you look carefully, both nodeSelectorTerms and matchExpressions are lists. So how do you decide whether to specify multiple nodeSelectorTerms or multiple matchExpressions? Multiple nodeSelectorTerms are ORed: if any one of the terms matches a node, the pod can be scheduled on that node. The expressions inside a single matchExpressions list are ANDed: all of them must match for a node to be eligible to host the pod. The sketch below illustrates the difference.
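
Here is a minimal sketch contrasting the two; the label keys and values (disktype, zone) are made up for illustration. The first block schedules the pod on a node matching either term, while the second requires a single node to satisfy both expressions.

# ORed: a node with either disktype=ssd OR zone=us-east-1a is acceptable
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: disktype
      operator: In
      values: ["ssd"]
  - matchExpressions:
    - key: zone
      operator: In
      values: ["us-east-1a"]

# ANDed: the node must have disktype=ssd AND zone=us-east-1a
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: disktype
      operator: In
      values: ["ssd"]
    - key: zone
      operator: In
      values: ["us-east-1a"]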

Using soft rules

For soft rules, we use the “preferred…” field of node affinity. Here is how you can use it. Replace the placeholders with appropriate values.

apiVersion: v1
kind: Pod
metadata:
  name: <pod-name>
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: <label-key>
            operator: In
            values:
            - <label-value-1>

  containers:
  - name: <pod-name>
    image: <image-name>

For a pod definition with the above affinity rules, the scheduler will try to find a node with label key “<label-key>” and label value “<label-value-1>”. Since we have specified a soft rule, the scheduler will try to schedule the pod on a node with the matching label, but otherwise it will schedule the pod on some other node.

When multiple nodes match the preferred criteria, it is the weight field which decides which node the pod will be scheduled on. Weight is in the range 1–100. For each node that meets all of the scheduling requirements (resource request, RequiredDuringScheduling affinity expressions, etc.), the scheduler will compute a sum by iterating through the elements of this field and adding “weight” to the sum if the node matches the corresponding MatchExpressions. This score is then combined with the scores of other priority functions for the node. The node(s) with the highest total score are the most preferred.
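
As a small sketch of how the weights play out (the label keys are again made up for illustration): with the two preferences below, a node carrying both labels scores 80 + 20 = 100 for this rule, a node with only disktype=ssd scores 80, and a node with only zone=us-east-1a scores 20, so, other scores being equal, the SSD nodes win.

preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80
  preference:
    matchExpressions:
    - key: disktype
      operator: In
      values: ["ssd"]
- weight: 20
  preference:
    matchExpressions:
    - key: zone
      operator: In
      values: ["us-east-1a"]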

Using NodeSelector and NodeAffinity together

If you specify both a node selector and node affinity on a pod, then a node must satisfy both the node selector and the node affinity rules for the pod to be scheduled on it. A combined sketch is shown below.
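
For completeness, here is a minimal sketch with both mechanisms on one pod (disktype and zone are illustrative label keys, as before): the node must carry disktype=ssd to pass the nodeSelector and must additionally match the affinity expression.

apiVersion: v1
kind: Pod
metadata:
  name: <pod-name>
spec:
  nodeSelector:
    disktype: ssd
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1a
  containers:
  - name: <pod-name>
    image: <image-name>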

Conclusion

Pods with specific requirements can be steered to certain nodes based on the labels attached to the nodes. You can specify a node selector, node affinity, or both to assign pods to certain nodes. A node selector is simple to use but not very expressive and limited in scope. Node affinity rules are a bit more complex, but they let you choose between hard and soft rules and give you multiple operators to express your requirements.

Monday, September 16, 2024

Kubernetes: How to avoid scheduling pods on certain nodes

When new pods are created in a cluster (either because existing pods failed or to scale the system horizontally), these pods are placed on some node. If an existing node has capacity in line with the pod's resource requirements, the pod is scheduled on that node; otherwise it stays pending (or a new node is created, if your cluster autoscales).

Generally, an application will have multiple pods and every pod will have different resource requirements. Depending on the resource requirements of your pods, you decide the node size. But if one of your pods has very different resource requirements compared to the others (e.g., a database), you might be tempted to use a different node configuration for that particular pod. In such scenarios you want to make sure that only certain pods (e.g., the database) land on that node and no others, so that the node has sufficient capacity for the target pod. We will see how to keep other pods from landing on these nodes.

 

We can do this by using taints and tolerations.

Taints and Tolerations

Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes.

Taint

Taints are applied to nodes. They allow a node to repel a set of pods.

Tolerations

Tolerations are applied to pods, and allow (but do not require) the pods to be scheduled onto nodes with matching taints.

Core concept

The idea is: you taint a node on which you want only certain pods to be scheduled, and the pods that should be allowed to run on that node carry a toleration for the taint applied to it.

Gate security analogy

It is the same as when you want to restrict entry to some premises to certain people only: you put security on the gate, which will not let just anyone enter. Only those who have the required pass will be allowed in by security. So, taints are like the security on nodes, and tolerations are the gate passes for that security.

Put security on the gate: Apply a taint to the node

To restrict a node to accept only certain types of pods, we need to apply a taint to the node. You can apply the taint using kubectl taint.

kubectl taint nodes <node-name> type=db:NoSchedule

You need to replace the <node-name> placeholder with the name of a node. The above command places a taint on node <node-name>. The taint has key “type”, value “db”, and taint effect “NoSchedule”. This means that no pod will be scheduled on node <node-name> unless it has a matching toleration. We will shortly see what a taint effect is and what the different types of effects are. You can use any key and value; in my case, I chose “type” as the key and “db” as the value.

List taints on a node

You can list the taints applied to a node using kubectl describe and then filtering the output. The command below lists all the taints applied to the specified node. Replace the <node-name> placeholder with the actual name of a node in your cluster.

kubectl describe node <node-name> | grep Taints

Give a pass to some people: Apply tolerations to pods

In the following pod definition, notice the tolerations field under spec. This toleration matches the taint and hence acts as a gate pass for the security (we are treating the taint as security). Since tolerations is a list, you can apply multiple tolerations to a pod.

apiVersion: v1
kind: Pod
metadata:
  name: <pod-name>
  labels:
    app: taint-test
spec:
  containers:
  - name: <container-name>
    image: <image>
  tolerations:
  - key: "type"
    operator: "Equal"
    value: "db"
    effect: "NoSchedule"

An important thing to note when applying a toleration is that it should exactly match the taint you are trying to tolerate. Notice that in the above toleration, the key, value and effect are exactly the same as in the taint. The Equal operator tells the scheduler to match both the key and the value.

Once the above toleration is applied, this pod can be scheduled on a node with the matching taint. Effectively, this pod has the gate pass to get onto that node. Any pod that does not have this toleration can’t be scheduled on that node.

It is important to understand that applying a toleration to a pod means the pod can be scheduled on a node with the matching taint, but it does not mean the pod can’t go to any other node in the cluster. The pod can still land on any other node; the toleration only makes the tainted node an option, it does not pin the pod to it.

Before we move further, let’s discuss the various taint effects and toleration operators.

Taint effects

There are 3 taint effects: NoSchedule, PreferNoSchedule and NoExecute

  • NoSchedule: Pods that don't tolerate this taint are not scheduled on the node.
  • PreferNoSchedule: Kubernetes tries to avoid scheduling the pods that don't tolerate this taint but may schedule them if there are no other options. 
  • NoExecute: Pods that don't tolerate this taint are evicted immediately if they are already running on the node. This effect prevents new pods from being scheduled on the node and also removes existing ones.

Toleration operators

There are 2 operators for tolerations in pods:

  • Equal: Matches both the key and the value, making sure they are the same as the ones specified in the taint.
  • Exists: Only checks that a taint with the given key exists on the node; the value is not considered and can be anything (see the sketch below).
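
For example, a minimal sketch of an Exists toleration for the taint used earlier (key "type", any value) looks like this:

tolerations:
- key: "type"
  operator: "Exists"
  effect: "NoSchedule"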

Taints and master node/control plane

If you have noticed, in a multi-node cluster pods are not scheduled on the master node. How is this controlled? You guessed it right: using a taint on the master node.

The master (control plane) node has a taint like the following applied to it, with the NoSchedule effect (on newer Kubernetes versions the key is node-role.kubernetes.io/control-plane instead of node-role.kubernetes.io/master):

node-role.kubernetes.io/master:NoSchedule

You can check this by describing the node and filtering taint as mentioned above.

and since no pod has a toleration for this taint, no pod is scheduled on the master node. You can schedule pods on the master node by removing the taint from the node, as described in the following section.

Remove taint from node

To remove the taint added by the command above, you can run:

kubectl taint nodes <node-name> type=db:NoSchedule-

It is exactly the same command that was used to apply the taint, followed by a “-” at the end.
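
For example, if you want to allow regular pods on the control plane node discussed earlier, the same pattern removes its built-in taint. Check the exact taint key on your node first with kubectl describe, since older clusters use node-role.kubernetes.io/master instead:

kubectl taint nodes <node-name> node-role.kubernetes.io/control-plane:NoSchedule-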

That’s all about how you can avoid pods being scheduled on certain nodes.

Conclusion

Because of the special resource requirements of some pods, you may launch nodes with a higher configuration and want to make sure those nodes don’t accept just any pod that comes their way; instead, you want to restrict which pods can be scheduled on them. You do this by applying a taint to the node. A taint on a node prevents any pod from being scheduled on that node unless the pod has a toleration for that taint. Pods with the appropriate toleration can be scheduled on that node.

So, it is a 2 step process:

  1. Apply taint on node
  2. Mention toleration on pod for the taint

I hope this helps.

 


Wednesday, September 4, 2024

Migrating Virtual box VM to Hyper-V

I started using VirtualBox around 4–5 years back, and I got so comfortable with it that I refused to use any other hypervisor, be it VMware or Hyper-V. Recently, when I was trying to set up a minikube (Kubernetes on local machine) cluster on my local machine using VirtualBox, I ran into many problems, and after spending a good amount of time I didn’t get any success.

So, I decided to use Hyper-V for virtualization to set up minikube, but I knew that once I enabled Hyper-V, I wouldn’t be able to use VirtualBox, and then what about all my data on the VMs in VirtualBox? So, I was wondering how I could migrate my VMs from VirtualBox to Hyper-V. After a lot of searching, I was able to successfully migrate my VirtualBox VM to Hyper-V. In this post, I will describe all the steps required to do so.

At a high level, you need to perform 5 steps to migrate a VM from VirtualBox to Hyper-V:

  1. Export the VM from VirtualBox into VHD format
  2. Convert VHD to VHDX
  3. Create a new VM in Hyper-V without disk
  4. Attach Hard Disk with Virtual machine
  5. Set the Boot options

1. Export the VM from VirtualBox into VHD format

Though VirtualBox has an option for this in the UI, it didn’t work for me. I was using VirtualBox version 6.1.4. You may want to try with your version of VirtualBox, but for me it was generating a VDI file instead of a VHD file. So, I decided to use the following command, and it worked like a charm for me.

Format of command

VBoxManage clonehd <absolute-path-of-vdi-file> <vhd-destination> --format vhd

Example:

VBoxManage clonehd C:\users\myuser\VirtualBoxVMs\ubuntu.vdi D:\virtualbox-export\ubuntu.vhd --format vhd


Note 1: If your absolute path contains spaces, you will have to wrap the path in double quotes.

Note 2: VBoxManage might not be in your PATH, so you will have to navigate to the path where VirtualBox is installed. On my machine, this happens to be C:\Program Files\Oracle\VirtualBox.


For those who want to try with the VirtualBox UI, here are the steps:


Select the VDI you want to convert, right-click and select Copy.

Select the location where you want to save the VHD and enter a name. In Disk image file type, select VHD and click Copy. This should generate the VHD.



2. Convert VHD to VHDX

  1. Launch Hyper-V Manager and select the server in the left pane
  2. Under Actions, select Edit Disk…
  3. Click Next on the ‘Before You Begin’ screen
  4. Browse for the copied file. The file will have a .VHD extension. Select the file and click the Next button.
  5. On the Choose Action window, select Convert and click the Next button



6. Select VHDX format and click Next



7. Select Dynamically expanding and click Next

8. Select a name and location for the file and click Next

9. Click Finish and wait for the conversion process to complete

3. Create a new VM in Hyper-V without disk

From Actions in the top menu, select New → Virtual Machine and follow the wizard. Select the default values during the different steps of the wizard, or change them as per your convenience.




There is one screen in the wizard that you need to take care of: the Connect Virtual Hard Disk screen. On this screen, select “Attach a virtual hard disk later”.




4. Attach Hard Disk with Virtual machine

In Hyper-V Manager, right-click the VM and select Settings. In the left pane select IDE Controller 0, select Hard Drive in the right pane and click Add.




Browse to the path of the VHDX file created in the previous step and select it. Click Apply.


5. Set the Boot options

Select BIOS in the left pane, and in the right pane move the IDE option to the top and click OK.



After going through all the above steps, start the VM and connect to it, and you will be able to use the VM through Hyper-V. The VM will have all the packages and data that you installed on it while working with VirtualBox.

Hope this helps. 

Sunday, August 18, 2024

Postgres - How I improved data insertion speed by a factor of more than 1000x

PostgreSQL is one of the most popular RDBMSs around and is widely used. Saving data to the database is one of the most basic operations, but there is always a desire for better performance. In this post I will talk about how I managed to improve the performance of database inserts by a factor of more than 1000x.

I will walk you through everything I did, starting from the beginning, and how I achieved the desired performance.

TechStack

  • Java 21
  • Spring Boot 3.x
  • PostgreSQL 15

Machine configuration

  • OS - Windows 11
  • CPU - i5 11th generation 2.4 GHz
  • RAM - 16 GB

Assumptions

This post assumes that you are familiar with the Java programming language and how the Spring Boot ecosystem works.

Git Repo

All the code discussed in this post is available in the following git repo:


Let's start

Creating Spring boot application

I started off by creating a Spring Boot application. The easiest way to create one is by using https://start.spring.io/

Architecture

I started off with an MVC-style architecture, creating a controller, a service and a repository.

Entity

I created an entity class named Person. Here is how it looks. For all the tests, we will save instances of this entity to the database.
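
The original screenshot of the class is not reproduced here, so below is a minimal sketch of what such an entity could look like; the field names are assumptions based on the name-and-address test data described later. Note that the id generation strategy matters for the batching optimization further down: with GenerationType.IDENTITY, Hibernate cannot batch inserts, so a sequence-based strategy is assumed here.

import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;

// Hypothetical reconstruction of the Person entity used in the tests
@Entity
public class Person {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE)
    private Long id;

    private String firstName;
    private String lastName;
    private String address;

    // getters and setters omitted for brevity
}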



Repository

This is the simplest possible repository created with Spring Boot. Here is how it looks:
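
The original snippet was an image, so here is a minimal sketch of what a Spring Data JPA repository for the Person entity typically looks like (the interface name is my assumption):

import org.springframework.data.jpa.repository.JpaRepository;

// Hypothetical reconstruction of the repository interface
public interface PersonRepository extends JpaRepository<Person, Long> {
}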


Test Data

I generated some fake data related to person names and addresses in CSV format. We will use the same data for testing all scenarios. The test data is shared in the same git repo under the resources folder, in a file named person-list.csv.

Getting off the blocks - Save all records one at a time

Read all records from the CSV and save them to the database one at a time. As you can imagine, this is very basic and probably the most expensive way of saving to the DB. It took approximately 2166 seconds (36 minutes) to save 100K records. This operation is so expensive that I didn't even try to save all 1 million rows. The fact that it took 36 minutes for 100K records shows how inefficient this approach is. The code is straightforward: invoke the save method on your repository, passing in an instance of the Person entity. Here is how it looks:
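
Since the code screenshot is not included here, the following is a minimal sketch of the naive approach under the assumptions above (PersonRepository comes from the earlier sketch, and CSV parsing is assumed to have already produced the list):

// Naive approach: one repository call, and hence one round trip, per record
public void saveOneByOne(List<Person> people, PersonRepository repository) {
    for (Person person : people) {
        repository.save(person);  // each call hits the database separately
    }
}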



Optimization 1 - Use batching

As a first optimization, I reduced the number of database calls by introducing batching. I used batch sizes of 25K, 30K and 50K to measure the performance, and with that the total time to save to the database came down considerably.

Here are the times recorded for different batch sizes for saving all 1 million rows to DB.

  • 25K - 58 sec
  • 30K - 65 sec
  • 50K - 79 sec

Note that, as we increase the batch size, time goes up. This is opposite to what we would expect. 

Note: the numbers can vary from machine to machine, but at some point I expect you to see the same behaviour as well, possibly at different batch sizes. I encourage you to play around with different batch sizes.

For batching to work, we need to add the required configuration to enable it. Add the following configuration to your application.yaml file (sketched below).
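
The original configuration snippet was an image; a typical Hibernate batching setup in application.yaml looks roughly like this (the batch size shown is just one of the values tested above):

spring:
  jpa:
    properties:
      hibernate:
        jdbc:
          batch_size: 25000  # batch size used by Hibernate for JDBC batching
        order_inserts: true   # group inserts for the same entity together
        order_updates: true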



and use the following code to save the data in batches (sketched below).
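
Again, the original code was a screenshot; a minimal sketch of chunked saving with Spring Data (using saveAll per chunk) could look like this:

// Batched approach: split the records into chunks and save each chunk in one go
public void saveInBatches(List<Person> people, PersonRepository repository, int batchSize) {
    for (int start = 0; start < people.size(); start += batchSize) {
        int end = Math.min(start + batchSize, people.size());
        repository.saveAll(people.subList(start, end));  // Hibernate flushes these as JDBC batches
    }
}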


Optimization 2 (and last one) - Use Postgres CopyManager

In the previous two approaches, I used Spring Data JPA for all database interactions. For this last optimization, I decided to use the Postgres API directly: the CopyManager API from the PostgreSQL JDBC driver. It exposes Postgres's native COPY functionality, which is used to bulk-load data into the database and is also available from the Postgres command-line interface (psql).

With CopyManager I was able to load all 1 million rows into the database in 4 seconds. Yes, you read that right: 4 seconds. That is an improvement of more than 1000x over the basic approach and about 15x over the batch approach.

To use the CopyManager API, we first need to write a COPY statement that names the target table and columns and sets a few options like the delimiter and character encoding (a sketch follows).
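
The original statement was shown as an image; under the table and column names assumed in the entity sketch above, the COPY statement could look roughly like this:

// Hypothetical COPY statement matching the assumed person table layout.
// FORMAT text uses a tab as the default column delimiter and a newline as the
// row separator, which matches how the data is prepared below.
private static final String COPY_SQL =
        "COPY person (first_name, last_name, address) FROM STDIN "
        + "WITH (FORMAT text, ENCODING 'UTF8')";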



The next step is to get a connection to your database.




Next, prepare the data to be inserted into the database: build a string in which the values within a row are "\t"-delimited and rows are separated by "\n".


and the last step is to use the CopyManager API and send the data to the DB. A combined sketch of these last steps follows.
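
Since the original snippets were screenshots, here is a minimal end-to-end sketch of these last three steps under the same assumptions (a Spring-managed DataSource, the COPY_SQL constant above, and the getters from the Person sketch); unwrapping the JDBC connection to PGConnection is how the PostgreSQL JDBC driver exposes its CopyManager:

import java.io.StringReader;
import java.sql.Connection;
import java.util.List;
import javax.sql.DataSource;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

// Inside a service class that has access to COPY_SQL and a DataSource
public long copyPeople(DataSource dataSource, List<Person> people) throws Exception {
    // Build the tab-delimited, newline-separated payload expected by COPY ... FROM STDIN
    StringBuilder rows = new StringBuilder();
    for (Person p : people) {
        rows.append(p.getFirstName()).append('\t')
            .append(p.getLastName()).append('\t')
            .append(p.getAddress()).append('\n');
    }

    try (Connection connection = dataSource.getConnection()) {
        // Unwrap to the driver-specific PGConnection to reach the CopyManager API
        CopyManager copyManager = connection.unwrap(PGConnection.class).getCopyAPI();
        // Stream all rows to the server in a single COPY operation and return the row count
        return copyManager.copyIn(COPY_SQL, new StringReader(rows.toString()));
    }
}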



Here is the graph showing the performance difference between 3 approaches. 



Cons of using CopyManager:

There are no free lunches

  • You lose all the benefits of the ORM framework
  • You need to list and take care of all the columns for which data should be inserted. With an ORM in play, you don't have to worry about this
  • For any change in the table structure, you need to remember to update it wherever you use the CopyManager API
  • With an ORM, when you save something to the DB, you get back an object carrying the generated id; with this approach you won't get that.
If the performance gain you get from using CopyManager outweighs the drawbacks mentioned above, go ahead and use it, but evaluate carefully what you gain against what you lose.

Why CopyManager is so fast

CopyManager is fast because it cuts down the number of database round trips. In a production setup, where your application server and database server are not on the same machine, this means it cuts down network calls, and network latency is one of the major factors when two servers interact. So, in a production setup your gain will be even bigger than what you see on a personal laptop, where the application and the database are on the same machine.

Summary

When working with Spring Boot and a Postgres database, saving to the database is a straightforward operation, but when we need to save a large dataset, we need some optimizations to improve the speed. Batching serves us well, but when the number of rows runs into the millions, even batching has its limitations. To overcome that, we can use the Postgres-native API called CopyManager. With CopyManager we can bulk-load data into the database, cutting down the number of database trips, and hence network calls and latency, and giving us much better performance.


Sunday, August 4, 2024

Erasure Coding - This is how cloud providers store multiple copies of data at low cost

How cloud providers store multiple copies of data at low cost

Have you ever wondered how object storage systems like AWS S3, Azure Blob Storage, etc. are able to provide such high data availability at such a low cost?

If they were really storing multiple full copies of your data for high availability, how could they provide this at such a low cost? At the end of the day, it's a money game. 😉

Well, the answer is that they do not store multiple full copies of your data. Then how do they provide such high availability? The answer is that they use a technique called erasure coding.

In this post, I will talk about what erasure coding is, how it works, and how cloud providers use it to provide high availability without storing multiple copies of your data, and at the end I will point you to an open source GitHub repo that provides a good implementation.

What is Erasure coding

Erasure coding is a data protection technique that breaks data into chunks, encodes the chunks and creates some extra redundant information. This redundancy allows the data to be reconstructed even if some of the fragments are lost or damaged.

In simpler words, it gives you the ability to recover the data from failures at the cost of some storage overhead.

How Erasure coding works

Working of Erasure coding can be summarized in 3 steps:

  • Splitting the data: Erasure coding breaks down your data file into smaller chunks called data chunks
  • Creating extra bits: It then uses complex math to create additional pieces of data called “parity bits” or “erasure codes”.
  • Spreading the pieces: The original data chunks and the parity bits are then distributed across multiple storage locations

What are different techniques

There are many different erasure coding techniques, each with its own advantages and disadvantages. Some of the most common schemes include:

  • Reed-Solomon coding
  • Low-density parity-check (LDPC) codes
  • Turbo codes

In this post we will talk more about Reed-Solomon.

How Erasure coding is different from replication

Replication creates full copies of your data and stores them on separate storage devices. In case of failure, we can use any copy to get the data.

Erasure coding, on the other hand, breaks your data down into smaller chunks and generates additional “parity” or “erasure” codes. These parity pieces are then distributed alongside the data chunks across multiple storage devices, and in case of failure the parity pieces can be used for data recovery.

Comparing the two, erasure coding is much more storage-efficient than replication. For example:

In Reed-Solomon (10, 4) technique, 10 MB of user data is divided into ten 1MB blocks. Then, four additional 1 MB parity blocks are created to provide redundancy. This can tolerate up to 4 concurrent failures. The storage overhead here is 14/10 = 1.4X.

In the case of a fully replicated system, the 10 MB of user data will have to be stored as 5 full copies (the original plus 4 replicas) to tolerate up to 4 concurrent failures. The storage overhead in that case is 50/10 = 5X.

A storage overhead of 1.4X versus 5X largely explains how public clouds provide high availability at such low cost.

How Reed-Solomon works

Reed-Solomon splits the incoming data into, say, k data chunks. It then creates, say, p additional chunks called parity chunks (or parity codes), taking the total number of chunks to k+p, and distributes these k+p chunks across different storage devices. The parity chunks give us the ability to recover the data in case some chunks are lost: as long as any k chunks out of the total k+p are available, the data can be recovered successfully.



How many parity chunks to create depends on your requirements: the number of parity chunks should be equal to the number of failures you want to tolerate. For example, if there are 10 data chunks and you want to withstand 2 failures, you will have to create 2 parity chunks.

A more detailed description of how it works can be found here. I encourage you to go through this link for a much better understanding.

Implementation

One of the core principles of software engineering is DRY (Don't Repeat Yourself). So, rather than trying to reinvent the wheel and implement Reed-Solomon myself, I looked for an easy-to-use implementation that could provide a good starting point, which I could then change if required.

In this process I stumbled upon this GitHub repo, which provides a good starting point. I highly encourage you to clone this repo and play around with the code.

Since this repo was last updated 3 years ago, it is not up to date with the latest versions of the tools. On my machine, I use Java 21, and in order to make it work with Java 21, I had to make some trivial changes. I will mention those changes to save you some time and help you get started quickly.

This repo uses Gradle 6.9, which is not compatible with Java 21, so you need to upgrade to a newer version of Gradle. To do this, locate the gradle-wrapper.properties file in the gradle/wrapper folder and change the value of distributionUrl to:

https\://services.gradle.org/distributions/gradle-8.9-bin.zip

Another required change is in build.gradle: replace testCompile with testImplementation, as the former has been removed in recent Gradle versions. That's it.

Run the following command from the root of the project to build a library you can use within your own project.

./gradlew clean build

The repo also provides examples of how to use the encoder and decoder; take a look at the classes named SampleEncoder.java and SampleDecoder.java.
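
If the repo you end up with follows the same shape as those sample classes (Backblaze's JavaReedSolomon has exactly this SampleEncoder/SampleDecoder layout), the core encode step looks roughly like the sketch below. Treat the package, class and method names as assumptions and verify them against SampleEncoder.java in your copy of the repo.

// Very rough sketch of the encode step, assuming a Backblaze-style ReedSolomon API
import com.backblaze.erasure.ReedSolomon;

public class EncodeSketch {
    public static void main(String[] args) {
        final int dataShards = 4;
        final int parityShards = 2;
        final int shardSize = 1024;

        // shards[0..3] hold the split input data, shards[4..5] will receive the parity bytes
        byte[][] shards = new byte[dataShards + parityShards][shardSize];
        // ... fill shards[0..dataShards-1] with the file contents ...

        ReedSolomon codec = ReedSolomon.create(dataShards, parityShards);
        codec.encodeParity(shards, 0, shardSize);  // compute the parity shards in place

        // As long as any 4 of the 6 shards survive, the decoder can rebuild the original data.
    }
}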

I used them and they work perfectly fine: I encoded a file, deleted some chunks, and the decoder was still able to recover the original file from the leftover chunks.

That's it for this Post. Happy eating (Learning 😉)

References:

Reed Solomon tutorial

Summary of how Reed Solomon works
