PostgreSQL is one of the most popular and widely used RDBMSs. Saving data to a database is one of the most primitive operations, yet there is always a desire for better performance. In this post I will talk about how I improved the performance of database inserts by a factor of more than 1000x.
I will walk you through everything I did, starting from the beginning, and how I achieved the desired performance.
Tech Stack
- Java 21
- Spring Boot 3.x
- PostgreSQL 15
Machine configuration
- OS - Windows 11
- CPU - i5 11th generation, 2.4 GHz
- RAM - 16 GB
Assumptions
Git Repo
Let's start
Creating a Spring Boot application
Architecture
I started off with the MVC architecture, creating a Controller, a Service and a Repository.
Entity
I created an entity class named Person. Here is how it looks. For all the tests we will use this entity class to save to the database.
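A minimal sketch of such an entity; the exact fields are my assumption since the original listing is not reproduced here, and I use a sequence-based id so that the JDBC batching introduced later can actually kick in (IDENTITY generation disables it):

```java
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.SequenceGenerator;
import jakarta.persistence.Table;

@Entity
@Table(name = "person")
public class Person {

    // Sequence-based id: IDENTITY generation would silently disable Hibernate's insert batching
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "person_seq")
    @SequenceGenerator(name = "person_seq", sequenceName = "person_seq", allocationSize = 50)
    private Long id;

    private String firstName;
    private String lastName;
    private int age;

    // getters and setters omitted for brevity
}
```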
Repository
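The repository is a plain Spring Data JPA interface; something along these lines (the exact name is my assumption):

```java
import org.springframework.data.jpa.repository.JpaRepository;

public interface PersonRepository extends JpaRepository<Person, Long> {
}
```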
Test Data
Getting off the blocks - Save all records one at a time
Read all records from the CSV and save them to the database one at a time. As you can imagine, this is very basic and probably the most expensive way of saving to the DB: it took approx 2166 seconds (36 mins) to save just 100K records, so I didn't even try to save all 1 million rows. The code is very straightforward: invoke the save method on your repository, passing in an instance of the Person entity. Here is how it looks.
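A sketch of what that loop might look like; the CSV parsing and column order here are my assumptions, since the original listing is not reproduced in this section:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Inside the service: one repository.save() call, i.e. one round trip, per record
public void saveOneAtATime(Path csvFile) throws IOException {
    try (BufferedReader reader = Files.newBufferedReader(csvFile)) {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] cols = line.split(",");
            Person person = new Person();
            person.setFirstName(cols[0]);
            person.setLastName(cols[1]);
            person.setAge(Integer.parseInt(cols[2]));
            personRepository.save(person); // hits the database for every single row
        }
    }
}
```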
Optimization 1 - Use batching
As a first optimization, I reduced the number of database calls by introducing batching. I tried batch sizes of 25K, 30K and 50K to measure the performance, and with that the total time to save to the database came down considerably.
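A sketch of how the batched save might look, assuming the rows have already been parsed into a list; the chunk size is just an example, and the configuration that actually enables JDBC batching is shown further below:

```java
import java.util.List;

// Save in chunks instead of one row at a time; with batching enabled,
// Hibernate groups the inserts of each chunk into JDBC batches
public void saveInBatches(List<Person> persons, int batchSize) {
    for (int start = 0; start < persons.size(); start += batchSize) {
        int end = Math.min(start + batchSize, persons.size());
        personRepository.saveAll(persons.subList(start, end));
    }
}
```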
Here are the times recorded for different batch sizes for saving all 1 million rows to DB.
- 25K - 58 sec
- 30K - 65 sec
- 50K - 79 sec
Note that as we increase the batch size, the time goes up, which is the opposite of what we would expect.
Note: Numbers can vary from machine to machine, but at some point I expect you to see the same behaviour, possibly at different batch sizes. I encourage you to play around with different batch sizes.
For batching to work, we need to add the required configuration to enable it. Add the following configuration to your application.yaml file.
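Something along these lines; these are the standard Hibernate batching properties, and the batch size value is only an example:

```yaml
spring:
  jpa:
    properties:
      hibernate:
        jdbc:
          batch_size: 50        # how many inserts Hibernate groups into one JDBC batch
        order_inserts: true     # reorder statements so inserts for the same table batch together
        order_updates: true
```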
Optimization 2 (and the last one) - Use Postgres CopyManager
In the previous 2 approaches I used Spring Data JPA for all database interactions. For this last optimization, I decided to use a Postgres-specific API: CopyManager from the PostgreSQL JDBC driver. It exposes Postgres's native COPY functionality, which is used to bulk-load data into the database. COPY is also available from the Postgres command line interface.
With CopyManager I was able to load all 1 million rows into the database in 4 seconds. Yes, you read that right, 4 seconds. That is an improvement of more than 1000x over the basic approach and about 15x over the batching approach.
To use the CopyManager API, we first need to write a COPY SQL statement which lists the target table's columns and a few options like the delimiter and character encoding.
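A sketch of such a statement, assuming the table and column names from the Person entity above:

```java
// COPY statement: target columns plus the delimiter and encoding of the incoming data
String copySql =
        "COPY person (first_name, last_name, age) " +
        "FROM STDIN WITH (FORMAT text, DELIMITER E'\\t', ENCODING 'UTF8')";
```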
The next step is to get a connection to your database.
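One way to do that in a Spring application is to borrow a connection from the DataSource and unwrap the PostgreSQL-specific connection; the injected dataSource field is my assumption:

```java
import java.sql.Connection;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

// Borrow a pooled JDBC connection and unwrap the Postgres-specific one to get a CopyManager
try (Connection connection = dataSource.getConnection()) {
    PGConnection pgConnection = connection.unwrap(PGConnection.class);
    CopyManager copyManager = pgConnection.getCopyAPI();
    // ... hand the data to copyManager (next step)
}
```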
Next, prepare the data to be inserted into the database: build a string in which the values within a row are "\t" delimited and rows are separated by "\n".
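Putting it together, assuming the persons list, copySql and copyManager from the previous snippets:

```java
import java.io.StringReader;

// Build one payload: values within a row are tab-separated, rows end with a newline
StringBuilder data = new StringBuilder();
for (Person p : persons) {
    data.append(p.getFirstName()).append('\t')
        .append(p.getLastName()).append('\t')
        .append(p.getAge()).append('\n');
}

// Stream everything to the server in a single COPY operation
long rowsInserted = copyManager.copyIn(copySql, new StringReader(data.toString()));
```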
Here is a graph showing the performance difference between the 3 approaches.
Cons of using CopyManager:
There are no free lunches
- You lose all the benefits of the ORM framework
- You need to list, and keep track of, all the columns for which data should be inserted. With an ORM in play, you don't have to worry about this
- For any change in the table structure, you need to remember to update the column list wherever you use the CopyManager API
- With an ORM, when you save something to the DB, you get back an object carrying its generated id; with this approach you won't get that
Why CopyManager is so fast
CopyManager is fast because it cuts down the number of database round trips. In a production setup, where your application server and database server won't be on the same machine, this means it cuts down network calls, and network latency is one of the major factors when two servers interact. So, in a production setup, your gain will be even bigger than what you see on a personal laptop where the application and the database are on the same machine.