Sunday, August 18, 2024

Postgres - How I improved data insertion speed by a factor of more than 1000x

PostgreSQL is one of the most popular and widely used RDBMSs. Saving data to a database is one of the most basic operations, yet there is always a desire for better performance. In this post I will talk about how I improved database insert performance by a factor of more than 1000x.

I will walk you through everything I did, starting from the beginning, and how I achieved the desired performance.

Tech Stack

  • Java 21
  • Spring Boot 3.x
  • PostgreSQL 15

Machine configuration

  • OS - Windows 11
  • CPU - i5 11th generation, 2.4 GHz
  • RAM - 16 GB

Assumptions

This post assumes that you are familiar with the Java programming language and how the Spring Boot ecosystem works.

Git Repo

All the code discussed in this post is available in the following Git repo: 


Let's start

Creating Spring boot application

I started off by creating a Spring Boot application. The easiest way to create one is by using https://start.spring.io/

Architecture

I went with the standard MVC architecture, creating a Controller, a Service, and a Repository.

Entity

I created an entity class named Person. For all the testing we will use this entity class to save to the database. Here is how it looks.
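The exact fields live in the repo; as a minimal sketch, assuming simple name and address columns derived from the CSV data, it could look roughly like this:

import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Table;

// Minimal sketch of the entity; the field names are assumptions based on the CSV data.
@Entity
@Table(name = "person")
public class Person {

    @Id
    // SEQUENCE (rather than IDENTITY) so that Hibernate JDBC batching can kick in later.
    @GeneratedValue(strategy = GenerationType.SEQUENCE)
    private Long id;

    private String firstName;
    private String lastName;
    private String address;

    // getters and setters omitted for brevity
}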



Repository

This is the simplest possible repository you can create with Spring Boot. Here is how it looks.
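A minimal sketch, assuming the Person entity above with a Long id (the interface name is mine, not necessarily the one in the repo):

import org.springframework.data.jpa.repository.JpaRepository;

// Spring Data JPA generates the implementation, including save() and saveAll().
public interface PersonRepository extends JpaRepository<Person, Long> {
}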


Test Data

I generated some fake data with person names and addresses in CSV format. We will use the same data for testing all the scenarios. The test data is available in the same Git repo, under the resources folder, in a file named person-list.csv.

Getting off the blocks - Save all records one at a time

Read all records from the CSV and save them to the database one at a time. As you can imagine, this is very basic and probably the most expensive way of saving to the DB: it took approximately 2166 seconds (36 minutes) to save 100K records. It is such an expensive and inefficient operation that I didn't even try to save all 1 million rows. The code is very straightforward: invoke the save method on your repository class, passing in an instance of the Person entity. Here is how it looks.
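A minimal sketch of that naive loop; readPersonsFromCsv is a hypothetical helper that parses person-list.csv into Person objects:

// One save() call per record means one round trip (and one INSERT) per record.
List<Person> persons = readPersonsFromCsv("person-list.csv"); // hypothetical CSV helper
for (Person person : persons) {
    personRepository.save(person);
}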



Optimization 1 - Use batching

As a first optimization, I reduced the number of database calls by introducing batching. I tried different batch sizes of 25K, 30K and 50K to measure the performance, and with that the total time to save to the database came down considerably.

Here are the times recorded for different batch sizes for saving all 1 million rows to DB.

  • 25K - 58 sec
  • 30K - 65 sec
  • 50K - 79 sec

Note that as we increase the batch size, the time goes up. This is the opposite of what we would expect.

Note: The numbers can vary from machine to machine, but at some point I expect you to see the same behaviour, possibly at different batch sizes. I encourage you to play around with different batch sizes.

For batching to work, we need to add the required configuration to enable it. Add the following configuration to your application.yaml file.
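A minimal sketch of the relevant settings (the batch size value is just an example; datasource properties are omitted). Note that Hibernate silently disables JDBC batching when ids use IDENTITY generation, which is why the entity sketch above uses SEQUENCE.

spring:
  jpa:
    properties:
      hibernate:
        jdbc:
          batch_size: 30
        order_inserts: true
        order_updates: true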



and use the following code to save the data in batches.
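A minimal sketch of batched saving, reusing the hypothetical CSV helper from above:

// saveAll() on a sub-list lets Hibernate group the inserts into JDBC batches.
int batchSize = 25_000;
List<Person> persons = readPersonsFromCsv("person-list.csv"); // hypothetical CSV helper
for (int start = 0; start < persons.size(); start += batchSize) {
    int end = Math.min(start + batchSize, persons.size());
    personRepository.saveAll(persons.subList(start, end));
}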


Optimization 2 (and the last one) - Use Postgres CopyManager

In the previous two approaches I used Spring Data JPA for all database interactions. For this last optimization, I decided to use a Postgres-specific API: the CopyManager API from the PostgreSQL JDBC driver. CopyManager exposes Postgres's native COPY functionality, which is used to bulk load data into the database; the same functionality is also available from the Postgres command line interface (psql).

With CopyManager I was able to load all 1 million rows into the database in 4 seconds. Yes, you read that right: 4 seconds. That is an improvement of more than 1000x over the basic approach and about 15x over the batch approach.

To use the CopyManager API, we first need to write a COPY statement that names the target table and its columns and sets a few options, such as the delimiter and character encoding.
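A minimal sketch of such a statement for the person table (the column list and options are assumptions; adjust them to your schema):

// COPY ... FROM STDIN streams rows from the client over the existing connection.
String copySql =
        "COPY person (first_name, last_name, address) "
        + "FROM STDIN WITH (FORMAT text, DELIMITER E'\\t', ENCODING 'UTF8')";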



The next step is to get a connection to your database and obtain a CopyManager from it.
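A minimal sketch, assuming a Spring-injected DataSource and the standard PostgreSQL JDBC driver:

import java.sql.Connection;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

// Unwrap the vendor-specific connection to reach the driver's COPY API.
// (Assume the surrounding method declares throws SQLException.)
try (Connection connection = dataSource.getConnection()) {
    CopyManager copyManager = connection.unwrap(PGConnection.class).getCopyAPI();
    // ... build the data and call copyIn() here (next two steps)
}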




Next, prepare the data to be inserted into the database: build a string in which the values within a row are "\t" (tab) delimited and rows are separated by "\n" (newline).
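A minimal sketch of building that string from the Person list (field names as assumed earlier):

// One line per row: columns separated by '\t', rows terminated by '\n'.
StringBuilder rows = new StringBuilder();
for (Person person : persons) {
    rows.append(person.getFirstName()).append('\t')
        .append(person.getLastName()).append('\t')
        .append(person.getAddress()).append('\n');
}
String data = rows.toString();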


and the last step is to use CopyManager API and send the data to DB.
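A minimal sketch of that final call, reusing copySql, copyManager and data from the previous steps:

import java.io.StringReader;

// Sends all rows to the server in a single COPY operation and returns the row count.
// (copyIn throws SQLException and IOException; handle or declare them as needed.)
long rowsInserted = copyManager.copyIn(copySql, new StringReader(data));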



Here is a graph showing the performance difference between the 3 approaches.



Cons of using CopyManager:

There are no free lunches

  • You lose all the benefits of an ORM framework
  • You need to list, and take care of, all the columns for which data should be inserted; with an ORM in play, you don't have to worry about this
  • For any change in the table structure, you need to remember to update the statement wherever you use the CopyManager API
  • With an ORM, when you save something to the DB, you get back an object containing its generated id; with this approach you won't get that.

If the performance gain you get from CopyManager outweighs the drawbacks mentioned above, you can use it, but you need to evaluate carefully what you gain against what you give up.

Why CopyManager is so fast

CopyManager is fast because it cuts down the number of database visits. In a production setup, where your application server and database server won't be on the same machine, this means it cuts down the number of network calls. Network latency is one of the major factors when two servers interact, so in a production setup your gain will be even bigger than what you get on a personal laptop where the application and database are on the same machine.

Summary

When working with Spring Boot and a Postgres database, saving to the database is a straightforward operation, but when we need to save a large dataset, we need some optimizations to improve the speed. Batching serves us well, but when the number of rows runs into the millions, even batching has its limitations. To overcome that, we can use the Postgres native API called CopyManager. With CopyManager we can bulk load data into the database, cutting down the number of database trips, and hence network calls and latency, giving us much better performance.


Sunday, August 4, 2024

Erasure Coding - This is how cloud providers store multiple copies of data at low cost

How cloud providers store multiple copies of data at low cost

Have you ever wondered how object storage systems like AWS S3, Azure Blob Storage, etc. are able to provide such high data availability at such a low cost?

If they are really storing multiple copies of your data for high availability, then how can they provide it at such a low cost? At the end of the day, it's a money game. 😉

Well, the answer is that they do not store multiple copies of your data. Then how do they provide such high availability without storing multiple copies? The answer is that they use a technique named Erasure Coding.

In this post, I will talk about what erasure coding is, how it works, and how cloud providers use it to provide high availability without storing multiple copies of your data, and at the end I will point you to an open source GitHub repo which provides a good implementation.

What is Erasure coding

Erasure coding is a data protection technique that breaks data into chunks, encodes the chunks and creates some extra redundant information. This redundancy allows the data to be reconstructed even if some of the fragments are lost or damaged.

In simpler words, it gives you the ability to recover the data from failures at the cost of some storage overhead.

How Erasure coding works

Working of Erasure coding can be summarized in 3 steps:

  • Splitting the data: Erasure coding breaks down your data file into smaller chunks called data chunks
  • Creating extra bits: It then uses complex math to create additional pieces of data called “parity bits” or “erasure codes”
  • Spreading the pieces: The original data chunks and the parity bits are then distributed across multiple storage locations

What are the different techniques

There are many different erasure coding techniques, each with its own advantages and disadvantages. Some of the most common schemes include:

  • Reed-Solomon coding
  • Low-density parity-check (LDPC) codes
  • Turbo codes

In this post we will talk in more detail about Reed-Solomon.

How Erasure coding is different from replication

Replication creates full copies of your data and stores them on separate storage devices. In case of failure, we can use any copy to get the data.

Erasure coding, on the other hand, breaks your data down into smaller chunks and generates additional “parity” or “erasure codes.” These parity pieces are then distributed alongside the data chunks across multiple storage devices, and in case of failure the parity codes can be used for data recovery.

Comparing the two, erasure coding is much more storage efficient than replication. For example:

In Reed-Solomon (10, 4) technique, 10 MB of user data is divided into ten 1MB blocks. Then, four additional 1 MB parity blocks are created to provide redundancy. This can tolerate up to 4 concurrent failures. The storage overhead here is 14/10 = 1.4X.

In the case of a fully replicated system, the 10 MB of user data would have to be replicated 4 times (5 copies in total, including the original) to tolerate up to 4 concurrent failures. The storage overhead in that case is 50/10 = 5X.

A cost overhead of 1.4X vs 5X is self-explanatory as to how public clouds provide high availability at such a low cost.

How Reed-Solomon works

Reed-Solomon splits the incoming data into, say, k data chunks. It then creates, say, p additional chunks called parity chunks (or parity codes), taking the total number of chunks to k+p, and distributes these k+p chunks across different storage devices. The additional chunks give us the ability to recover the data in case some chunks are lost: as long as any k chunks out of the total k+p are available, the data can be recovered successfully.



How many parity chunks are created depends on your requirements: the number of parity chunks should equal the number of failures you want to tolerate. For example, if you have 10 data chunks and want to withstand 2 failures, you will have to create 2 parity chunks.

A more detailed description of how it works can be found here. I encourage you to go through the link for a much better understanding.

Implementation

One of the core principles of software engineering is DRY (Don't Repeat Yourself). So, rather than trying to reinvent the wheel and implement Reed-Solomon myself, I looked for an easy-to-use implementation that could provide a good starting point and be changed if required.

In the process I stumbled upon this GitHub repo, which provides a good starting point. I highly encourage you to clone it and play around with the code.

Since this repo was last updated 3 years ago, it is not up to date with the latest versions of the tools. On my machine I use Java 21, and in order to make the repo work with Java 21 I had to make some trivial changes. I will mention those changes here to save you some time and help you get started quickly.

The repo uses Gradle version 6.9, which is not compatible with Java 21, so you need to upgrade to a newer version of Gradle. To do this, locate the gradle-wrapper.properties file in the gradle/wrapper folder and change the value of distributionUrl to:

https\://services.gradle.org/distributions/gradle-8.9-bin.zip

Another required change is in build.gradle: replace testCompile with testImplementation, as the former has been removed in recent Gradle versions. That's it.

Run the following command from the root of the project to build a library that you can use within your own project.

./gradlew clean build

The repo also provides examples of how to use the encoder and decoder; take a look at the classes named SampleEncoder.java and SampleDecoder.java.

I used them and they work perfectly fine: I encoded a file, deleted some of the chunks, and the decoder was still able to recover the original file from the leftover chunks.
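To give a flavour of how such an encoder/decoder is used, here is a minimal, self-contained sketch. It assumes a Backblaze-style JavaReedSolomon API (ReedSolomon.create, encodeParity, decodeMissing in the com.backblaze.erasure package); if the repo you clone exposes different names, treat these as assumptions and adapt accordingly.

import com.backblaze.erasure.ReedSolomon; // assumed package; adjust to the repo you use
import java.util.Arrays;

public class ReedSolomonDemo {

    public static void main(String[] args) {
        final int dataShards = 4;
        final int parityShards = 2;
        final int shardSize = 1024;

        // k + p shards; the first k hold data, the rest will hold parity.
        byte[][] shards = new byte[dataShards + parityShards][shardSize];
        for (int i = 0; i < dataShards; i++) {
            // Fill data shards with something recognisable.
            Arrays.fill(shards[i], (byte) ('A' + i));
        }

        ReedSolomon reedSolomon = ReedSolomon.create(dataShards, parityShards);
        reedSolomon.encodeParity(shards, 0, shardSize);

        // Simulate losing up to 'parityShards' shards (one data shard, one parity shard).
        boolean[] shardPresent = new boolean[dataShards + parityShards];
        Arrays.fill(shardPresent, true);
        shards[1] = new byte[shardSize];
        shardPresent[1] = false;
        shards[4] = new byte[shardSize];
        shardPresent[4] = false;

        // As long as any k shards survive, the missing ones can be rebuilt.
        reedSolomon.decodeMissing(shards, shardPresent, 0, shardSize);
        System.out.println("Recovered shard 1 starts with: " + (char) shards[1][0]);
    }
}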

That's it for this Post. Happy eating (Learning 😉)

References:

Reed Solomon tutorial

Summary of how Reed Solomon works

Kubernetes: How to deploy different pods close to each other (same node or zone etc.)

In my last two posts, I discussed how to avoid and enforce pod scheduling on specific nodes. You can read them here: Avoid scheduling pods ...