Protobuf in Amazon DynamoDB

After defending my thesis (and unofficially graduating), I had the chance to get involved in an interesting project which is not very challenging, but also not that straightforward. Among other things, the project includes a Java-based service running on Amazon EC2 and manipulating data stored in DynamoDB. Pretty easy, huh? Uuhhmm, not really…

Although I am quite familiar with NoSQL systems, this was actually my first time with DynamoDB. Amazon puts some “weird” limits on DynamoDB which significantly affected our design choices. In addition, some requirements of our project made it far less obvious than I initially thought. Here is a brief overview of DynamoDB’s constraints as well as our requirements:

  • First of all, for those who don’t know DynamoDB, Amazon charges you based on the read/write throughput you consume. The more throughput you use, the more money you have to pay. Each table in DynamoDB can be assigned a threshold for read/write throughput, and you are charged for the total throughput used across all the tables. Quite strange at first, but it turns out to be obvious and easy to understand.
  • Here are their limits: each item (or row, if you prefer) cannot be greater than 64KB in size, and each read/write request cannot exceed 1MB. Those are the two weirdest limits; more of them can be found here.
  • We used the Java AWS SDK and it only supports strings, numbers (stored as strings) and byte arrays (ByteBuffer, specifically).
  • In our application, we had to deal with complex objects (arrays, lists…) and it is awkward to store them in DynamoDB. One obvious approach is to iterate over all elements in a list and store each one under an indexed attribute name. For instance, an array
    int[] scores = {14, 23, 32}
    can be stored as
    {scores_0=14; scores_1=23; scores_2=32}
    but this is obviously tedious to code and maintain (see the sketch after this list). Moreover, this might make the item much bigger than the 64KB limit of DynamoDB.
  • And for the best part, our application deals with time-series data. The data comes from external sources, and normally the clients only care about data from a small interval around the current time. Data from the past (like 3 days ago) will barely ever be read again. While NoSQL can be seen as a “perfect” approach for dealing with time-series data, our application’s access pattern makes it a little bit trickier to implement.
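
To make the flattening approach above concrete, here is a minimal sketch of what it would look like with the Java AWS SDK (v1 is assumed; the table name, key and client setup are made up for illustration):

import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;

public class FlattenedScores
{
    public static void main(String[] args)
    {
        int[] scores = {14, 23, 32};

        Map<String, AttributeValue> item = new HashMap<String, AttributeValue>();
        item.put("userId", new AttributeValue().withS("some-user"));

        // One attribute per element: scores_0, scores_1, scores_2...
        for (int i = 0; i < scores.length; i++)
        {
            // DynamoDB stores numbers as strings, hence withN(String)
            item.put("scores_" + i, new AttributeValue().withN(Integer.toString(scores[i])));
        }

        AmazonDynamoDB client = new AmazonDynamoDBClient();
        client.putItem("scores", item);
    }
}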

Fortunately, we came up with a reasonably good solution. Here are our design choices:

Firstly, we used protobuf to serialize our POJO classes into byte arrays and store them in DynamoDB. Some properties of those POJOs are also stored as keys and secondary indices, but the whole object is always stored in binary form. This not only saved me the effort of writing serialization/deserialization code (I don’t need to manually serialize each field of the POJO), but also helped us deal with the 64KB limit efficiently. Similar libraries like Thrift, MessagePack… could also be used, but I decided to go with protobuf simply because I personally prefer it. An easier solution would be to construct a JSON string from the POJO, zip it and store the zipped string in DynamoDB. However, that would add the overhead of zipping/unzipping, and JSON is far less compact than protobuf in terms of serialized message length.
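
For illustration, here is a rough sketch of what storing a protobuf-serialized object as a binary attribute looks like (assuming the Java AWS SDK v1 and a generated protobuf class ScoreData; the table and attribute names are made up):

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;

public class ScoreStore
{
    private final AmazonDynamoDB _client = new AmazonDynamoDBClient();

    public void save(String userId, ScoreData data)
    {
        Map<String, AttributeValue> item = new HashMap<String, AttributeValue>();

        // Keys (and secondary indices) stay as plain strings/numbers...
        item.put("userId", new AttributeValue().withS(userId));

        // ...while the whole object is stored once, in binary form
        byte[] bytes = data.toByteArray();
        item.put("payload", new AttributeValue().withB(ByteBuffer.wrap(bytes)));

        _client.putItem("scores", item);
    }
}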

Secondly, using protobuf’s objects directly is not always convenient. Instead of using the raw objects generated by protobuf, I wrapped them into corresponding classes. So our POJO is something like this:

class Score
{
    private ScoreData _data;        // ScoreData is protobuf's class

    public ScoreData getData()
    {
        return _data;
    }

    public Score(ScoreData data)
    {
        _data = data;
    }

    // other stuff...
}

The important idea is that Score does not hold any data fields of its own. All the data fields are already stored in ScoreData, so Score only implements supplemental/utility functions which help keep the code in other parts of the project concise. Of course, one might be tempted to make _data totally private inside Score, but in that case one would need to wrap all the functions already provided by ScoreData. While that is very good OOP design, it would require a lot of coding effort. I am kind of lazy, so I chose the medium trade-off: ScoreData can be accessed via the getData() function. With this approach, the data fields can be modified directly (which violates the encapsulation principle of OO design), but because I am the only one who is going to use it, that is okay, as long as I pay enough attention… The best part is that I don’t have to spend too much coding effort on this.
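
As an example of such utility functions, the wrapper might expose helpers for converting to and from the byte arrays stored in DynamoDB (just a sketch; these method names are illustrative and not from the original project):

// Inside the Score class
public static Score fromBytes(byte[] bytes) throws com.google.protobuf.InvalidProtocolBufferException
{
    // Rebuild the wrapper from the binary attribute read from DynamoDB
    return new Score(ScoreData.parseFrom(bytes));
}

public byte[] toBytes()
{
    // Serialize the wrapped protobuf object for storage
    return _data.toByteArray();
}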

Thirdly, in order to deal with the freaking data access pattern, I store the data of each day separately. That means I create a table for each day. With this approach, I can simply delete the tables which are too far in the past, like more than 3 days old. This reduces the total size of the data stored in DynamoDB and, of course, reduces the cost. It also follows the best practices recommended by Amazon.
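
A rough sketch of the one-table-per-day scheme (the table name format and the 3-day retention window are assumptions for illustration; the real naming scheme may differ):

import java.text.SimpleDateFormat;
import java.util.Calendar;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;

public class DailyTables
{
    private static final String PREFIX = "scores-";

    // e.g. "scores-2014-03-17" for data written on that day
    public static String tableNameFor(Calendar day)
    {
        return PREFIX + new SimpleDateFormat("yyyy-MM-dd").format(day.getTime());
    }

    // Drop the table that has just fallen outside the retention window
    public static void dropExpired(AmazonDynamoDB client, int retentionDays)
    {
        Calendar old = Calendar.getInstance();
        old.add(Calendar.DAY_OF_MONTH, -retentionDays);
        client.deleteTable(tableNameFor(old));
    }
}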

To sum up, I think the best decision I have made so far is to use protobuf to serialize our POJOs and wrap the protobuf objects in dummy POJOs (is this a kind of “design pattern” for using protobuf?). But how much encapsulation to apply to those dummy POJOs has to be considered carefully. In our case, I was quite lazy, but in other scenarios a more principled OO approach should be employed.

I believe that coding is much easier than design… Okay, I have to be more specific: writing good code given a good design is much easier than building a good design given the mess of business requirements. Personally, I always enjoy the design phase, but I am not that demanding about having a perfect design. Very often, a “perfect” OOP design results in clumsy pieces of code which are very difficult to maintain. The point of OOP design, at least for me, is to (partly) isolate the code so that it can be re-used in other projects in the future. With that point of view, a trade-off between good design and good code is always a good thing to have.

How about you? Have you used DynamoDB before? If yes, would you mind sharing your experience in the comments?
