Sunday, August 4, 2024

Erasure Coding - This is how cloud providers store multiple copies of data at low cost


Have you ever wondered how object storage systems like AWS S3, Azure Blob Storage, etc. are able to provide such high data availability at such a low cost?

If they really store multiple copies of your data for high availability, then how can they provide this at such a low cost? At the end of the day, it's a money game. 😉

Well, the answer is that they do not store multiple copies of your data. Then how do they provide such high availability? They use a technique called Erasure Coding.

In this post, I will talk about what erasure coding is, how it works, and how cloud providers use it to provide high availability without storing multiple copies of your data. At the end, I will point you to an open-source GitHub repo that provides a good implementation.

What is Erasure coding

Erasure coding is a data protection technique that breaks data into chunks, encodes the chunks and creates some extra redundant information. This redundancy allows the data to be reconstructed even if some of the fragments are lost or damaged.

In simpler words, it gives you the ability to recover the data from failures at the cost of some storage overhead.

How Erasure coding works

The working of erasure coding can be summarized in three steps:

  • Splitting the data: Erasure coding breaks down your data file into smaller chunks called data chunks
  • Creating extra bits: It then uses some math to create additional pieces of data called “parity bits” or “erasure codes”
  • Spreading the pieces: The original data chunks and the parity bits are then distributed across multiple storage locations
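The three steps above can be sketched with the simplest possible erasure code: a single XOR parity chunk (the same idea RAID-5 uses). This is a toy illustration of my own, not what cloud providers actually run; with one parity chunk it can survive exactly one lost chunk:

```java
import java.util.Arrays;

public class XorParityDemo {

    // Step 1 + 2: split data into k chunks of chunkSize bytes,
    // then append one parity chunk that is the XOR of all data chunks.
    static byte[][] encode(byte[] data, int chunkSize) {
        int k = data.length / chunkSize;
        byte[][] chunks = new byte[k + 1][chunkSize];   // last slot is parity
        for (int i = 0; i < k; i++)
            chunks[i] = Arrays.copyOfRange(data, i * chunkSize, (i + 1) * chunkSize);
        for (int i = 0; i < k; i++)
            for (int j = 0; j < chunkSize; j++)
                chunks[k][j] ^= chunks[i][j];           // parity = XOR of data chunks
        return chunks;
    }

    // Step 3 (recovery): rebuild one lost chunk by XOR-ing all the survivors,
    // because x ^ x = 0 cancels everything except the missing chunk.
    static byte[] rebuild(byte[][] chunks, int lostIndex) {
        byte[] rebuilt = new byte[chunks[0].length];
        for (int i = 0; i < chunks.length; i++) {
            if (i == lostIndex) continue;               // this chunk is gone
            for (int j = 0; j < rebuilt.length; j++)
                rebuilt[j] ^= chunks[i][j];
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        byte[][] chunks = encode("ERASURE CODING!!".getBytes(), 4);
        byte[] lost = chunks[2].clone();                // pretend chunk 2 is destroyed
        System.out.println(Arrays.equals(lost, rebuild(chunks, 2))); // prints: true
    }
}
```

In a real system, the five chunks would live on five different storage devices, so losing any one device loses at most one chunk.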

What are the different techniques

There are many different erasure coding techniques, each with its own advantages and disadvantages. Some of the most common schemes include:

  • Reed-Solomon coding
  • Low-density parity-check (LDPC) codes
  • Turbo codes

In this post, we will talk in more detail about Reed-Solomon.

How Erasure coding is different from replication

Replication creates full copies of your data and stores them on separate storage devices. In case of failure, we can use any copy to get the data.

Erasure coding, on the other hand, breaks down your data into smaller chunks and generates additional “parity” or “erasure” codes. These parity pieces are then distributed alongside the data chunks across multiple storage devices, and in case of failure they can be used for data recovery.

Comparing the two, erasure coding is much more storage efficient than replication. For example:

In Reed-Solomon (10, 4) technique, 10 MB of user data is divided into ten 1MB blocks. Then, four additional 1 MB parity blocks are created to provide redundancy. This can tolerate up to 4 concurrent failures. The storage overhead here is 14/10 = 1.4X.

In the case of a fully replicated system, the 10 MB of user data would need 5 full copies (the original plus 4 replicas) to tolerate up to 4 concurrent failures. The storage overhead in that case is 50/10 = 5X.

A storage overhead of 1.4X vs. 5X makes it self-explanatory how public clouds provide high availability at such low cost.
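The arithmetic above generalizes: Reed-Solomon(k, p) stores (k + p)/k times the original data, while replication needs f + 1 full copies to survive f failures. A trivial sketch (the method names are mine, not from any library):

```java
public class OverheadCompare {

    // Reed-Solomon(k, p) stores k + p chunks for every k chunks of real data.
    static double erasureOverhead(int k, int p) {
        return (double) (k + p) / k;
    }

    // Replication must keep f + 1 full copies to survive f concurrent failures.
    static double replicationOverhead(int f) {
        return f + 1;
    }

    public static void main(String[] args) {
        System.out.println(erasureOverhead(10, 4));   // prints: 1.4
        System.out.println(replicationOverhead(4));   // prints: 5.0
    }
}
```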

How Reed-Solomon works

Reed-Solomon splits the incoming data into, let's say, k data chunks. It then creates additional chunks called parity chunks, let's say p of them, taking the total number of chunks to k + p. It then distributes these k + p chunks across different storage devices. The parity chunks give us the ability to recover the data even when some chunks are lost: as long as any k out of the k + p chunks are available, the data can be recovered successfully.



How many parity chunks to create depends on your requirements: the number of parity chunks should equal the number of failures you want to tolerate. For example, if you have 10 data chunks and want to withstand 2 failures, you will have to create 2 parity chunks.
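Real Reed-Solomon does its math over a finite field GF(2^8), but the "any k of k + p" property is easiest to see through the underlying polynomial idea: k data values define a degree-(k - 1) polynomial, each parity chunk is just one more evaluation of that polynomial, and any k surviving evaluations pin the polynomial down again. Here is a toy sketch of that idea using small integers and floating-point Lagrange interpolation (illustration only, not production math):

```java
public class PolyErasureDemo {

    // Lagrange-interpolate the polynomial through (xs[i], ys[i]) and evaluate it at x.
    static double lagrangeAt(int[] xs, long[] ys, int x) {
        double result = 0;
        for (int i = 0; i < xs.length; i++) {
            double term = ys[i];
            for (int j = 0; j < xs.length; j++) {
                if (j != i)
                    term *= (double) (x - xs[j]) / (xs[i] - xs[j]);
            }
            result += term;
        }
        return result;
    }

    public static void main(String[] args) {
        // k = 4 data chunks, viewed as evaluations of a degree-3 polynomial at x = 0..3.
        long[] data = {42, 7, 19, 88};
        int[] dataXs = {0, 1, 2, 3};

        // p = 2 parity chunks: the same polynomial evaluated at x = 4 and x = 5.
        long parity4 = Math.round(lagrangeAt(dataXs, data, 4));
        long parity5 = Math.round(lagrangeAt(dataXs, data, 5));

        // Lose the chunks at x = 1 and x = 4 (any two losses are fine, since p = 2).
        // Any k = 4 surviving evaluations determine the polynomial again.
        int[] surviveXs = {0, 2, 3, 5};
        long[] surviveYs = {data[0], data[2], data[3], parity5};
        long recovered = Math.round(lagrangeAt(surviveXs, surviveYs, 1));
        System.out.println(recovered == data[1]);   // prints: true
    }
}
```

Finite fields replace the floating-point arithmetic here so that every chunk stays an exact byte array, but the recovery logic is the same shape.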

A more detailed description of how it works can be found here. I encourage you to go through that link for a much better understanding.

Implementation

One of the core principles of software engineering is DRY (Don't Repeat Yourself). So, rather than reinventing the wheel and implementing Reed-Solomon myself, I looked for an easy-to-use implementation that could provide a good starting point and be modified if required.

In this process I stumbled upon this GitHub repo, which provides a good starting point. I highly encourage you to clone this repo and play around with the code.

Since this repo was last updated 3 years ago, it is not up to date with the latest versions of tools. On my machine I use Java 21, and in order to make the repo work with Java 21, I had to make some trivial changes. I will mention those changes here to save you some time and help you get started quickly.

This repo uses Gradle version 6.9, which is not compatible with Java 21. You need to upgrade to a newer version of Gradle. To do this, locate the gradle-wrapper.properties file in the gradle/wrapper folder and change the value of distributionUrl to:

https\://services.gradle.org/distributions/gradle-8.9-bin.zip

Another required change is in build.gradle: replace testCompile with testImplementation, as the former has been removed in recent Gradle versions. That's it.
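For reference, the change looks roughly like this in build.gradle (the exact dependency coordinates depend on the repo; treat this as a sketch, not the repo's actual file):

```groovy
dependencies {
    // 'testCompile' was removed in Gradle 7+; 'testImplementation' is the replacement.
    // testCompile 'junit:junit:4.12'        // old configuration, fails on Gradle 8.x
    testImplementation 'junit:junit:4.12'    // new equivalent
}
```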

Run the following command from the root of the project to create a lib for use within your project.

./gradlew clean build

The repo also provides examples of how to use the Encoder and Decoder. Take a look at the classes named SampleEncoder.java and SampleDecoder.java.

I used them and they work perfectly fine. I encoded a file, deleted some chunks, and the decoder was still able to recover the original file from the leftover chunks.

That's it for this post. Happy eating (learning 😉)

References:

Reed-Solomon tutorial

Summary of how Reed Solomon works
