If you have been working with DynamoDB long enough, and your table is big enough (like 200+ GB), you will see that sometimes when querying data, whatever you tried, you couldn’t get the maximum provisioned read capacity on the table. You might allocate 4000 read capacity units, but you can only consumed around 300 units, and of course a lot of throttled requests. So you had to pay for 4000 units while actually using merely 300.
Then you started tearing your hair out, wondering why and if DynamoDB really worths it. Trust me, I have been there and done that.
However, there is a reason behind. The long answer can be found in this documentation, but keep reading this post because I will tell you how we solved the problem, not only on DynamoDB side, but also on other aspects.
First of all, you have to be aware that data on your table is split into different partitions. Roughly speaking, the maximum size of each partition is 10GB, so if you have 200 GB of data, there will be 20 partitions. The data items (rows, if you prefer) are divided into partitions based on its hash key. All the items that have the same hash key are assigned to the same partition.
N.B. 1: Actually I don’t know what DynamoDB does if there is a hash key with more than 10GB of data, I guess it will allocate that hash key into multiple partitions based on the range key, but I am not sure.
N.B. 2: It is more complicated than that. Each partition can only handle maximum 3000 read capacity units and 1000 write capacity units, hence if you allocate a very big number or provisioned capacity units, more partitions will be needed to handle that much capacity. The final number of the partitions, therefore, depends on both the size and the provisioned capacity of the table.
Now the interesting part: the provisioned capacity is divided equally for all partitions. So if you allocate 4000 read capacity for a table of 20 partitions, each partition will have a maximum read capacity of 4000/20 = 200 units.
And the pain comes because of this design. Let’s assume for a given hash key, you have 800 MB of data, and you want to take all the data of that hash key. You write a program to repeatedly query that hash key. How long will it take? Well, no less than 1000 seconds, which is 16 minutes.
(For those of you who haven’t worked it out: the capacity for reading the same hash key is 200 units, which means you can only read maximum 200*4 KB = 800 KB per second. So you need 1000 seconds to read 800 MB of data).
Now you have 250 such hash keys to read (because you have 200 GB of data), it will take 4000 minutes, which is about 4 days. Simply unreasonable.
The problem is because you simply querying the same hash key repeatedly and hitting the read capacity limit of each partition. This is generally a bad thing to do with DynamoDB.
The solution is simple. Given 250 hash keys, you should loop through all of them and send exactly 1 query. Regardless if the query is success (you get maximum 1MB of data per request) or failed (because you hit the limit), you still move to the next key. Keep looping until you read all the data of all hash keys.
This is a clever idea, but be prepared to change a lot in your query code. You cannot simply take a single hash key and fetching all of its data. With the new strategy, that function should take a list of hash keys, and sending 1 request for each key repeatedly. Using the “Exclusive Start Key” parameter in the DynamoDB query() function, you can come back and query the same hash key starting from where you stopped.
Nonetheless, we still couldnot get better than 1000 read capacity. There are no throttled requests on DynamoDB side, but we noticed a large query latency. So the problem turns to be network latency. Our program has to wait for the messages coming back from the network before sending the next request, which prevents it from sending some more request in a single second.
The solution for network latency is multi-processing. We launched 4 processes in parallel working for all the hash keys, and we can hit 4000 read capacity units on DynamoDB side. Perfect solution so far. With 4000 units, we can read 200GB of data in around 3 hours. Of course we can increase the number of processes running in parallel.
With this, we solved the problem. However let’s take one moment to blame Amazon for their freaky design of DynamoDB. The whole complicated stuffs about provisioned capacity is only for a single purpose, which is Amazon’s billing strategy. The most “shocking” thing for me was the fact that they brutally divide the capacity we bought by the number of partitions, which makes it slow to repeatedly query data belonging to the same partitions! Holy crap. If they don’t charge us based on the provisioned capacity, the whole problem just disappear!
But well, you have to admit that charging based on the data bandwidth was a wise choice, so you either have to like it or hate it.
For even bigger dataset, the solution would be moving to Hadoop where we can get a lot of processes running in parallel to fight with the size of the data.
So the take-away lesson, at least for me, is that when querying DynamoDB, make sure that you don’t repeatedly query the same hash key. That is the key technique to make sure you can get high data throughput.
However, there was also another lesson about analysing the performance of a big system to know why the program was slow. Simply speaking, when the program is slow, it might be because of I/O or processing power. But in practice, even I/O alone might cover a lot of things: database speed, network latency, hard disk access… so on and so forth. The thing is we have to correctly figure out the bottle-neck, and the solution will come out from that.