Sunday, August 18, 2024

Postgres - How I improved data insertion speed by a factor of more than 1000x

PostgreSQL is one of the most popular RDBMSes around and is widely used. Saving data to a database is one of the most basic operations, yet there is always a desire for better performance. In this post I will talk about how I managed to improve the performance of database inserts by a factor of more than 1000x.

I will walk you through everything I did, starting from the beginning, and how I achieved the desired performance.

Tech Stack

  • Java 21
  • Spring Boot 3.x
  • PostgreSQL 15

Machine configuration

  • OS - Windows 11
  • CPU - Intel i5, 11th generation, 2.4 GHz
  • RAM - 16 GB

Assumptions

This post assumes that you are familiar with the Java programming language and how the Spring Boot ecosystem works.

Git Repo

All the code discussed in this post is available in the following git repo:


Let's start

Creating a Spring Boot application

I started off by creating a Spring Boot application. The easiest way to create one is by using https://start.spring.io/

Architecture

I started with the MVC architecture, creating a Controller, a Service, and a Repository.

Entity

I created an entity class named Person. Here is how it looks. For all the testing, we will use this entity class to save to the database.
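
A minimal sketch of such an entity, assuming name and address fields to match the test data (the exact fields are an assumption). A SEQUENCE-based id matters here, because IDENTITY generation would prevent Hibernate from batching inserts later on.

import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;

@Entity
public class Person {

    // SEQUENCE (not IDENTITY) so Hibernate can batch inserts later
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE)
    private Long id;

    private String name;
    private String address;

    // getters and setters omitted for brevity
}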



Repository

This is the simplest possible repository created with Spring Data JPA. Here is how it looks:
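
A minimal sketch, assuming Spring Data JPA (the interface name is an assumption):

import org.springframework.data.jpa.repository.JpaRepository;

// Spring Data generates the implementation, including save and saveAll
public interface PersonRepository extends JpaRepository<Person, Long> {
}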


Test Data

I generated some fake data (person names and addresses) in CSV format. We will use the same data for testing all scenarios. The test data is shared in the same git repo, under the resources folder, in a file named person-list.csv.

Getting off the blocks - Save all records one at a time

Read all records from the CSV and save them to the database one at a time. As you can imagine, this is very basic and probably the most expensive way of saving to the DB. It took approx. 2166 seconds (36 minutes) to save 100K records. This operation is so expensive that I didn't even try to save all 1 million rows. The code is straightforward and simple: invoke the save method on your repository class, passing in an instance of the Person entity. Here is how it looks:
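
A minimal sketch of this approach (variable names are illustrative):

for (Person person : persons) {
    // one repository call, and hence one INSERT statement, per record
    personRepository.save(person);
}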



Optimization 1 - Use batching

As a first optimization, I reduced the number of database calls by introducing batching. I tried batch sizes of 25K, 30K, and 50K to measure the performance, and with that the total time to save to the database came down considerably.

Here are the times recorded for different batch sizes for saving all 1 million rows to DB.

  • 25K - 58 sec
  • 30K - 65 sec
  • 50K - 79 sec

Note that as we increase the batch size, the time goes up. This is the opposite of what we would expect.

Note: The numbers will vary from machine to machine, but I expect you will see the same behaviour at some point, possibly at different batch sizes. I encourage you to play around with different batch sizes.

For batching to work, we need to add the required configuration to enable it. Add the following configuration to your application.yaml file
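
A minimal sketch of the relevant settings, assuming Hibernate as the JPA provider (the 25K value matches the best-performing batch size above):

spring:
  jpa:
    properties:
      hibernate:
        jdbc:
          # Hibernate groups this many INSERTs into each JDBC batch
          batch_size: 25000
        # group inserts by entity type so batches aren't broken up
        order_inserts: true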



and use the following code to save the data in batches.
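
A minimal sketch of chunked saving (method and variable names are assumptions); saveAll hands each chunk to Hibernate, which turns it into JDBC batches:

import java.util.ArrayList;
import java.util.List;

public void saveInBatches(List<Person> persons, int batchSize) {
    List<Person> batch = new ArrayList<>(batchSize);
    for (Person person : persons) {
        batch.add(person);
        if (batch.size() == batchSize) {
            personRepository.saveAll(batch);  // one round of batched INSERTs
            batch.clear();
        }
    }
    if (!batch.isEmpty()) {
        personRepository.saveAll(batch);      // flush the remainder
    }
}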


Optimization 2 (and the last one) - Use Postgres CopyManager

In the previous two approaches I used Spring Data JPA for all database interactions. For this last optimization, I decided to use a Postgres API instead: the CopyManager API from the PostgreSQL JDBC driver. It exposes Postgres's native COPY functionality, which is used to bulk-load data into the database and is also available as a command from the Postgres command line interface.

With CopyManager I was able to load all 1 million rows into the database in 4 seconds. Yes, you read that right: 4 seconds. That is an improvement of more than 1000x over the basic approach and roughly 15x over the batch approach.

To use the CopyManager API, first we need to write a COPY statement that lists the columns of your table and a few options like the delimiter and character encoding.
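
A minimal sketch of that statement, assuming a person table with name and address columns (table and column names are assumptions):

// COPY ... FROM STDIN lets us stream rows to the server over the same connection
String copySql =
        "COPY person (name, address) FROM STDIN " +
        "WITH (FORMAT text, DELIMITER E'\\t', ENCODING 'UTF8')";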



The next step is to get a connection to your database and obtain a CopyManager from it.
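
A minimal sketch, assuming a Spring-managed DataSource is injected; CopyManager lives on the driver's PGConnection, so the JDBC connection is unwrapped first:

import java.sql.Connection;
import javax.sql.DataSource;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

try (Connection connection = dataSource.getConnection()) {
    // unwrap the Postgres-specific connection to reach the COPY API
    PGConnection pgConnection = connection.unwrap(PGConnection.class);
    CopyManager copyManager = pgConnection.getCopyAPI();
    // ... prepare data and call copyIn (shown below)
}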




Next, prepare the data to be inserted into the database: build a string in which the values within a row are "\t" delimited and rows are separated by "\n".
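
A minimal sketch of that preparation, assuming getName() and getAddress() accessors on Person:

// one line per row: value<TAB>value<NEWLINE>, matching the COPY options above
StringBuilder sb = new StringBuilder();
for (Person person : persons) {
    sb.append(person.getName())
      .append('\t')
      .append(person.getAddress())
      .append('\n');
}
String data = sb.toString();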


and the last step is to use the CopyManager API to send the data to the DB.
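
A minimal sketch of the final call; copyIn streams the prepared string to the server and returns the number of rows written:

import java.io.StringReader;

long rowsInserted = copyManager.copyIn(copySql, new StringReader(data));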



Here is how the three approaches compare:

  • Save one at a time - 100K rows - 2166 sec
  • Batching (25K batch size) - 1 million rows - 58 sec
  • CopyManager - 1 million rows - 4 sec

Cons of using CopyManager:

There are no free lunches

  • You lose all the benefits of the ORM framework
  • You need to specify and take care of all the columns for which data should be inserted. With an ORM in play, you don't have to worry about this
  • For any change in the table structure, you need to remember to update the COPY statement wherever you use the CopyManager API
  • With an ORM, when you save something to the DB, you get back an object carrying its generated id; with this approach you won't get that

If the performance gain you get from using CopyManager outweighs the drawbacks mentioned above, go ahead and use it, but evaluate carefully what you gain against what you give up.

Why CopyManager is so fast

CopyManager is fast because it cuts down the number of database round trips. In a production setup, where your application server and database server won't be on the same machine, this means it cuts down network calls. Network latency is one of the major factors when two servers interact, so in production your gain will be even bigger than what you see on a personal laptop where the application and the database are on the same machine.

Summary

When working with Spring Boot and a Postgres database, saving to the database is a straightforward operation, but when we need to save a large dataset we need some optimizations to improve the speed. Batching serves us well, but when the number of rows runs into the millions, even batching has its limitations. To overcome that, we can use a Postgres native API called CopyManager. With CopyManager we can bulk-load data into the database, cutting down the number of database trips, and hence network calls and latency, giving us much better performance.

