Big Data

Random notes on stream-first architecture, Flink and friends

Use-cases and Technical requirements

The main use-cases for stream processing systems are applications that need to process a large amount of data coming at high throughput (say high velocity) with very low latency. Low latency is important, because in some use-cases where computations don’t need to be real-time, e.g. computing monthly cost on AWS, preparing quarterly business reports, re-indexing websites for a search engine, etc… then a traditional batch processing systems like MapReduce or Spark would be a better fit. Time-critical applications like processing events in online games or financial applications, monitoring and predicting device failures, processing critical performance metrics, etc.., however, requires real-time or near real-time insights. In those cases, a stream processing system would be more suitable.

That being said, when we say stream processing systems, we mean tools that allow users to perform computations on-the-fly. Related technologies for simple message queues (without computations), especially in the context of the microservice madness, can sometimes overlap with stream processing technologies, but they should be considered separately in various facets and contexts.

The pioneer in stream processing systems was Apache Storm. Storm can provide low latency, but word on the streets is that it is hard to get high throughput. Moreover, Storm doesn’t guarantees exactly-once semantic (more on this later).

A clever compromise to get exactly-once semantic at a relatively low latency was to cut the data stream into small batches (streaming as a special case of batch). Since computations are done on batches, it can be redone when failure happens. With small batches, this can approximate streaming pretty well and is suitable for applications that aren’t too picky on latency. However, cutting the stream into batches have inherit problems dealing with the semantic of times, and can lead to incorrect results when events do not arrive in order (which happens quite often in distributed systems). Nonetheless, this is the approach taken by Spark Streaming. Storm Trident is an improvement over Storm, also seems to take this micro-batch approach.

Message transport systems

In order to feed data into stream processing engines, we need a message transport system, which does what it should: transport messages. Although it might sound simple, doing this at scale is quite challenging. An ideal message transport system for a stream applications must:

  • provide persistence, i.e. the ability to rewind and replay the stream (i.e. time travel). It was once thought that a message transport system can’t have both performance and persistence. Lately, it turns out that this tradeoff is not fundamental, and systems like Apache Kafka can provide high throughput delivery with persistence. This is one of the important revolutions for the streaming community.
  • decouple producers and consumers. A good transport system doesn’t send messages point-to-point, because doing so in a scalable and fault-tolerant way is difficult. One of the solution that Kafka employs is to allow users to create topics, to which producers can send messages to, and consumers can subscribe from.

Having a good message transport system is critical for the stream processing engine down the road.

The notion of Time

In streaming applications, there are 2 notions of time: event time, which is the time that the message was generated in real-world, often stored as a timestamp along with the message, and processing time, which is the time that the message is observed by the machine processing it, normally measured by the system clock of that machine.

Now messing up with times is dangerous. To perform computations, stream processing engines divide the streams into (sliding) windows, based on the time of each message. It is therefore important to specify which time they are using, otherwise it can lead to incorrect results, as demonstrated in an interesting experiment here.

Micro-batching system, like Spark Streaming, only accepts time windows, and it is therefore tricky to get things right.

The bible in handling times in streaming applications is this VLDB paper from Google Dataflow (now Apache Beam), which was one of the inspirations of Flink. The following section was written based on this paper.

Windows, Triggers, and Watermarks

There are different kinds of windows supported by modern stream processing engines:

  • Time windows
  • Count windows
  • Session windows (or timeout windows): based on the gaps between messages, suitable for streams of user interactions on a website, for instance.

All kinds of windows are special cases of triggers – which define when the content of a window is aggregated and returned to the output stream. Good stream processing engines should allow users to specify custom trigger logic.

Another aspect of time is time travel – the ability to rewind the clock and repeat the computation at some certain time in the past. This is important in real-world applications where users need to replace their applications with a new version, or do A/B testing between different implementations of an algorithm, etc… Obviously if we only use processing times, time travel would be impossible, because the new version of the application would just simply process the latest messages coming right now. Therefore, serious stream processing applications must support event times (and coupled with a message transport that provides persistence).

Event times are tricky to work with though. Since events can arrive to the system out-of-order, when it sees the event happened at time 10:01 and then at 10:10, how does it knows that there won’t be another event happened at 10:05 coming to the system? In other words, there needs to be a clock defined by the data, so that event time windows can be correctly defined.

Stream processing engines deal with this problem by the concept of Watermarks, which are embedded messages in the stream, bearing a time marker t, notifying that there won’t be any other event happened before t coming to the stream. With watermarks, event time progresses independently from processing time. If a watermark is late (in processing time), the results will still be computed correctly, albeit it will be late in the result stream.

Stream processing engines would allow developers to define watermarks, because that requires some understanding of the domain, e.g. they know that their messages can’t be late more than 5 seconds, then the watermark will be the current time minus 5 seconds. This can get as elaborated/fancy as having a Machine Learning model to predict how late the messages come to the stream.

Stateful computation and exactly-once semantic

Most interesting stream-based applications are stateful. If the engine doesn’t support stateful, application developers would need to take care of that (by recruiting a database, for instance). More useful engines, however, can take care of the states and recomputing it in case of failures. Doing this properly will make sure the engine provides exactly-once semantics (the output stream will be correctly computed even in case of failures), but normally it is difficult to do, both at the semantic and implementation levels.

Folks behind Flink came up with a solution described in this paper, called Asynchronous Distributed Snapshots. This algorithm allows states to be restored correctly in case of failure, even when there are multiple workers.

The core idea is simple to understand. They inject checkpoints periodically into the coming stream. If a checkpoint event is processed, each worker will dump its states into a persistence storage. When failure happens on one of the worker, it will notify every other workers, everyone resets their states at the same previous checkpoint, and start the computation again (keep in mind that they can do time travel thanks to event times and watermarks). In order for this to work, the checkpoint ID needs to be synchronized across workers. Some tricks also need to be implemented at the output stage to make sure the output stream doesn’t contain duplicated entries. At the end, this checkpointing mechanism will ensure exactly-once semantic with relatively minimal effect on overall latency.

The same checkpointing mechanism allow Flink to create savepoints, at which moment the state of the whole system is saved, and can be loaded later. This is extremely useful in practice since it allows new version of the streaming application to pick up (reasonably) arbitrary savepoints of previous versions.

Now that we have stateful exactly-one semantics, we can even allow users to query the states directly, which makes stream processors behave more or less like a traditional database. This is still on-going research.

Stream-first architecture, Lambda architecture, Kappa architecture

Fancy words for fancy tech. They are all about employing a message transport, like Kafka, as a courier of messages, and some form of stream processors. Sometimes the message transport can dump data periodically into a persistence layer (like HDFS), and batch jobs will be scheduled to pick up latest data and provide exact insights. However, again, this approach is only suitable for applications that don’t require real-time insights.

An interesting use-case of streaming is to predict failures on a stream of sensor data. A feasible scalable approach would be to write a Flink application that monitor the sensor data stream and output a stream of statistics, stored in a Kafka topic. This stream will be then picked up by another Flink application that does the Machine Learning job, and raise alert if anomaly is detected.


Stream processors might have difficulties in batch computation (although batches can be obviously seen as special cases of streams – they are streams that have a definite start and end), e.g. running SQL queries on Flink is still an open effort. Doing SQL properly on streams has always been tricky.

Other computations traditionally suited for batch processors, like Machine Learning, is also tricky to do on systems designed for streaming.

Finally, like any other emerging technologies, streaming might be hyped. Picking the right solution should be based on the nature of the problems, not on the trends.


What Goes Around Comes Around: Databases, Big Data and the likes

Over the years, I have the privilege of working with some pretty damn good people. One of those guys is a PhD in Database Research, used to be a professor, but apparently so passionate and so good at teaching that he eventually quits academia to join the industry.

He did an PhD in XML database, and even though XML database turned out to be completely useless nowadays, it doesn’t mean a PhD in XML Database couldn’t teach you anything (in fact, a good PhD could teach you quite many useful things). One interesting thing I learned from him was the evolution of database technology, which originates in an essay by Michael Stonebraker called What goes around comes around.

Michael Stonebraker is a big name in Database Research and has been around in Database research for a good 35 years, so you could easily expect him to be highly opinionated on a variety of things. The first few lines in the essay read like this:

In addition, we present the lessons learned from the exploration of the proposals in each era. Most current researchers were not around for many of the previous eras, and have limited (if any) understanding of what was previously learned. There is an old adage that he who does not understand history is condemned to repeat it. By presenting “ancient history”, we hope to allow future researchers to avoid replaying history.

Unfortunately, the main proposal in the current XML era bears a striking resemblance to the CODASYL proposal from the early 1970’s, which failed because of its complexity. Hence, the current era is replaying history, and “what goes around comes around”. Hopefully the next era will be smarter.

His work, among others, include PostgreSQL. Hence, after reading the essay, it becomes obvious to me why there are so many highly opinionated design decisions being implemented in Postgres.

The essay is a very good read. You get to see how database technologies evolved from naive data models to unnecessarily complicated models, then thanks to a good mathematician named Edgar F. Codd, it got way more beautiful and highly theoretically-grounded. After a few alternatives, a new wave of XML databases come back bearing a lot of complications. (Along the way, you also get to see how Michael Stonebraker managed to sell several startups, but that isn’t the main story – or is it?)

There are many interesting lesson learned. The most two interesting for me are:

  • XML database doesn’t take off because it is exceedingly complicated, and there is no way to efficiently store and index it using our best data structures like B-trees and the likes.
    I learned XML databases and I was told that XML databases failed because it lacks a theoretical foundation like the Relational model. Now in retrospect, I think that isn’t the main issue. The problem with XML is that it allows too much flexibility in the language, so implementing a good query optimizer for it is extremely difficult.
    A bit more ironically, this is how Michael Stonebraker puts it:

    We close with one final cynical note. A couple of years ago OLE-DB was being pushed hard by Microsoft; now it is “X stuff”. OLE-DB was pushed by Microsoft, in large part, because it did not control ODBC and perceived a competitive advantage in OLE-DB. Now Microsoft perceives a big threat from Java and its various cross platform extensions, such as J2EE. Hence, it is pushing hard on the XML and Soap front to try to blunt the success of Java

    It sounds very reasonable to me. At some point around 2000-2010, I remember hearing things like XML is everywhere in Windows. It has to be someone like Microsoft keeps pushing hard to make it quite phenomenal. When Microsoft started the .NET effort to directly compete with Java, the XML database movement faded away.
    One thing Michael Stonebraker got wrong though. In the essay, he said XML (and SOAP) is gonna be the data exchange format of the future, but it turns out XML is still overly complicated for this purpose, and people ended up with JSON and RESTful instead.

  • The whole competitive advantage of PostgreSQL was about UDTs and UDFs, a somewhat generalization of stored procedures. Stored procedures, though, are soon out-of-fashion because people realize it is difficult to maintain their code in multiple places, both in application code and store procedures in DBMS. However, the idea of bringing code close to data (instead of bringing data to code) is a good one, and has a big consequence on the Big Data movement.

Speaking of Big Data, Stonebraker must have something to say about it. For anyone who is in Big Data, you should really see this if you haven’t:

The talk presents a highly organized view on various aspects of Big Data and how people solved them (and of course mentions a few startups founded by our very Michael Stonebraker).

He mentioned Spark at some point. If you look at Spark nowadays, it’s nothing more than an in-memory distributed SQL engine (for traditional business intelligence queries), along with a pretty good Machine Learning library (for advanced analytics). From a database point of view, Spark looks like a toy: you can’t share tables, tables don’t have indices, etc… but the key idea is still there: you bring computation to the data.

Of course I don’t think Spark wants to become a database eventually, so I am not sure if Spark plans to fix those issues at all, but adding catalog (i.e. data schema), and supporting a somewhat full-fledged SQL engine were pretty wise decisions.

There are several other good insights about the Big Data ecosystem as well: why MapReduce sucks, what are other approaches to solve the Big Volume problem (besides Spark), how to solve the Big Velocity problem with streaming, SQL, NoSQL and NewSQL, why the hardest problem in Big Data is Variety, etc…  I should’ve written a better summary of those, but you could simply check it out.

Count Featurizer

Trong “sự nghiệp” ngắn ngủi đi làm Data Scientist dạo, mình từng thấy rất nhiều kĩ thuật feature engineering khá lạ lùng. Một trong những kĩ thuật lạ lùng nhất là hashing. Chẳng hạn có 1 feature là địa chỉ nhà (của khách hàng), thông thường là một string, và thật khó để extract bất kì thông tin nào. Hashing đơn giản chỉ là biến đổi feature đó như sau:

f = hash(s) % M

và sau đó f được dùng trong one-hot encoding. M càng lớn thì chi phí càng cao, nhưng chất lượng có thể “tốt” hơn. Kĩ thuật này đơn giản nhưng hiệu quả, vì nó sẽ hoạt động tốt ngay cả trên test  data. Về mặt kĩ thuật,  thực chất đây rất gần giống với việc gán các địa chỉ 1 cách “ngẫu nhiên” vào M ngăn, tuy nhiên nhờ tính deterministic của hàm hash nên ta có thể đảm bảo một chuỗi sẽ chỉ được gán vào đúng 1 ngăn.

Một kĩ thuật feature engineering khác cũng “lạ lùng” không kém là Count-based featurizer. Mặc dù đơn giản nhưng kĩ thuật này tỏ ra rất  hiệu quả trong nhiều bài toán lớn như Online advertisement, fraud detection. Count featurizer phù hợp để encode các feature dạng categorical mà số lượng category quá lớn, đến nỗi khó có thể dùng One-hot encoding. Các feature dạng này có thể là địa chỉ, là ID của khách hàng v.v… cụ thể như sau.


Data science open class

Đây ( là lớp Data Science dạy cho sinh viên undergrad ở Harvard mấy tháng trước. Toàn bộ video lectures đã up và có thể xem miễn phí.
Xem video dạng này vẫn có vẻ thú vị hơn các lớp Data Science trên coursera.

GraphLab – A New Parallel Framework for ML

(Yet) another great software from CMU. This time, GraphLab is a Machine learning library for Big Data. It is quite “new”, the first version was released at the end of 2010, but GraphLab has a variety range of algorithms optimized for Big Data. It’s not  like MapReduce, not cumbersome like MPI/OpenMP, it’s just a high-scalability library for processing large amounts of data, which is often the case in many Data Mining applications nowadays.