Random notes on stream-first architecture, Flink and friends

Use-cases and Technical requirements

The main use-cases for stream processing systems are applications that need to process a large amount of data coming at high throughput (say high velocity) with very low latency. Low latency is important, because in some use-cases where computations don’t need to be real-time, e.g. computing monthly cost on AWS, preparing quarterly business reports, re-indexing websites for a search engine, etc… then a traditional batch processing systems like MapReduce or Spark would be a better fit. Time-critical applications like processing events in online games or financial applications, monitoring and predicting device failures, processing critical performance metrics, etc.., however, requires real-time or near real-time insights. In those cases, a stream processing system would be more suitable.

That being said, when we say stream processing systems, we mean tools that allow users to perform computations on-the-fly. Related technologies for simple message queues (without computations), especially in the context of the microservice madness, can sometimes overlap with stream processing technologies, but they should be considered separately in various facets and contexts.

The pioneer in stream processing systems was Apache Storm. Storm can provide low latency, but word on the streets is that it is hard to get high throughput. Moreover, Storm doesn’t guarantees exactly-once semantic (more on this later).

A clever compromise to get exactly-once semantic at a relatively low latency was to cut the data stream into small batches (streaming as a special case of batch). Since computations are done on batches, it can be redone when failure happens. With small batches, this can approximate streaming pretty well and is suitable for applications that aren’t too picky on latency. However, cutting the stream into batches have inherit problems dealing with the semantic of times, and can lead to incorrect results when events do not arrive in order (which happens quite often in distributed systems). Nonetheless, this is the approach taken by Spark Streaming. Storm Trident is an improvement over Storm, also seems to take this micro-batch approach.

Message transport systems

In order to feed data into stream processing engines, we need a message transport system, which does what it should: transport messages. Although it might sound simple, doing this at scale is quite challenging. An ideal message transport system for a stream applications must:

  • provide persistence, i.e. the ability to rewind and replay the stream (i.e. time travel). It was once thought that a message transport system can’t have both performance and persistence. Lately, it turns out that this tradeoff is not fundamental, and systems like Apache Kafka can provide high throughput delivery with persistence. This is one of the important revolutions for the streaming community.
  • decouple producers and consumers. A good transport system doesn’t send messages point-to-point, because doing so in a scalable and fault-tolerant way is difficult. One of the solution that Kafka employs is to allow users to create topics, to which producers can send messages to, and consumers can subscribe from.

Having a good message transport system is critical for the stream processing engine down the road.

The notion of Time

In streaming applications, there are 2 notions of time: event time, which is the time that the message was generated in real-world, often stored as a timestamp along with the message, and processing time, which is the time that the message is observed by the machine processing it, normally measured by the system clock of that machine.

Now messing up with times is dangerous. To perform computations, stream processing engines divide the streams into (sliding) windows, based on the time of each message. It is therefore important to specify which time they are using, otherwise it can lead to incorrect results, as demonstrated in an interesting experiment here.

Micro-batching system, like Spark Streaming, only accepts time windows, and it is therefore tricky to get things right.

The bible in handling times in streaming applications is this VLDB paper from Google Dataflow (now Apache Beam), which was one of the inspirations of Flink. The following section was written based on this paper.

Windows, Triggers, and Watermarks

There are different kinds of windows supported by modern stream processing engines:

  • Time windows
  • Count windows
  • Session windows (or timeout windows): based on the gaps between messages, suitable for streams of user interactions on a website, for instance.

All kinds of windows are special cases of triggers – which define when the content of a window is aggregated and returned to the output stream. Good stream processing engines should allow users to specify custom trigger logic.

Another aspect of time is time travel – the ability to rewind the clock and repeat the computation at some certain time in the past. This is important in real-world applications where users need to replace their applications with a new version, or do A/B testing between different implementations of an algorithm, etc… Obviously if we only use processing times, time travel would be impossible, because the new version of the application would just simply process the latest messages coming right now. Therefore, serious stream processing applications must support event times (and coupled with a message transport that provides persistence).

Event times are tricky to work with though. Since events can arrive to the system out-of-order, when it sees the event happened at time 10:01 and then at 10:10, how does it knows that there won’t be another event happened at 10:05 coming to the system? In other words, there needs to be a clock defined by the data, so that event time windows can be correctly defined.

Stream processing engines deal with this problem by the concept of Watermarks, which are embedded messages in the stream, bearing a time marker t, notifying that there won’t be any other event happened before t coming to the stream. With watermarks, event time progresses independently from processing time. If a watermark is late (in processing time), the results will still be computed correctly, albeit it will be late in the result stream.

Stream processing engines would allow developers to define watermarks, because that requires some understanding of the domain, e.g. they know that their messages can’t be late more than 5 seconds, then the watermark will be the current time minus 5 seconds. This can get as elaborated/fancy as having a Machine Learning model to predict how late the messages come to the stream.

Stateful computation and exactly-once semantic

Most interesting stream-based applications are stateful. If the engine doesn’t support stateful, application developers would need to take care of that (by recruiting a database, for instance). More useful engines, however, can take care of the states and recomputing it in case of failures. Doing this properly will make sure the engine provides exactly-once semantics (the output stream will be correctly computed even in case of failures), but normally it is difficult to do, both at the semantic and implementation levels.

Folks behind Flink came up with a solution described in this paper, called Asynchronous Distributed Snapshots. This algorithm allows states to be restored correctly in case of failure, even when there are multiple workers.

The core idea is simple to understand. They inject checkpoints periodically into the coming stream. If a checkpoint event is processed, each worker will dump its states into a persistence storage. When failure happens on one of the worker, it will notify every other workers, everyone resets their states at the same previous checkpoint, and start the computation again (keep in mind that they can do time travel thanks to event times and watermarks). In order for this to work, the checkpoint ID needs to be synchronized across workers. Some tricks also need to be implemented at the output stage to make sure the output stream doesn’t contain duplicated entries. At the end, this checkpointing mechanism will ensure exactly-once semantic with relatively minimal effect on overall latency.

The same checkpointing mechanism allow Flink to create savepoints, at which moment the state of the whole system is saved, and can be loaded later. This is extremely useful in practice since it allows new version of the streaming application to pick up (reasonably) arbitrary savepoints of previous versions.

Now that we have stateful exactly-one semantics, we can even allow users to query the states directly, which makes stream processors behave more or less like a traditional database. This is still on-going research.

Stream-first architecture, Lambda architecture, Kappa architecture

Fancy words for fancy tech. They are all about employing a message transport, like Kafka, as a courier of messages, and some form of stream processors. Sometimes the message transport can dump data periodically into a persistence layer (like HDFS), and batch jobs will be scheduled to pick up latest data and provide exact insights. However, again, this approach is only suitable for applications that don’t require real-time insights.

An interesting use-case of streaming is to predict failures on a stream of sensor data. A feasible scalable approach would be to write a Flink application that monitor the sensor data stream and output a stream of statistics, stored in a Kafka topic. This stream will be then picked up by another Flink application that does the Machine Learning job, and raise alert if anomaly is detected.


Stream processors might have difficulties in batch computation (although batches can be obviously seen as special cases of streams – they are streams that have a definite start and end), e.g. running SQL queries on Flink is still an open effort. Doing SQL properly on streams has always been tricky.

Other computations traditionally suited for batch processors, like Machine Learning, is also tricky to do on systems designed for streaming.

Finally, like any other emerging technologies, streaming might be hyped. Picking the right solution should be based on the nature of the problems, not on the trends.


Simpson’s paradox

I learned about the Simpson’s paradox fairly recently, and I found it quite disturbing, not because of the mere “paradox” itself, but mainly because I felt it was something I should have known already.

In case you haven’t heard about it, one instance of the paradox is a real-world medical study for comparing the success rate of two treatments for kidney stones (from Wikipedia):


Overall, Treatment B is better because its success rate is 83%, compared to 78% of Treatment A. However, when they split the patients into 2 groups: those with small stones and those with large stones, then Treatment A is better than Treatment B in both subgroups. Paradoxical enough?

Well, it’s not. It turns out that for severe cases (large stones), doctors tend to give the more effective Treatment A, while for milder cases with small stones, they tend to give the inferior Treatment B. Therefore the sum is dominant by group 2 and group 3, while the other groups contribute little to the final sums. So the results can be interpreted more accurately as: when Treatment B is more frequently applied to less severe cases, it can appear to be more effective.

Now, knowing that Treatment and Stone size are not independent, this should not come up as a paradox. In fact, we can visualize the problem as a graphical model like this

Untitled drawing

All the numbers in the table above can be expressed as conditional probabilities like so:

  • Group 1: p\left(S=true \vert T=A, St=small\right) = 0.93
  • Group 2: p\left(S=true \vert T=B, St=small\right) = 0.87
  • Group 3: p\left(S=true \vert T=A, St=large\right) = 0.73
  • Group 4: p\left(S=true \vert T=B, St=large\right) = 0.69
  • p\left(S=true \vert T=A\right) = 0.78
  • p\left(S=true \vert T=B\right) = 0.83

For any of us who studied Probability, it is no surprise that the probabilities might turn up-side-down whenever some conditional variables are stripped out of the equations. In this particular case, since S depends on both St and T, the last 2 equations do not bring any new knowledge about S.

So what is this “paradox” about? Isn’t it nothing more than the problem of confounding/lurking variables, something that most people in Probability/Statistics already known? In this particular case, Stone size is the lurking variable that dictates both Treatment and Success, therefore the scientists who designed the experiment should have taken it into account since the beginning. It is well-known among Statistic practitioners that they must try their best to identify and eliminate the effect of any lurking variables in their experiments, or at least keep them fixed, before drawing any meaningful conclusion.

From a slightly different perspective, the paradox can be understood once we understand the human bias of drawing causal relations. Human, perhaps for the sake of survival, constantly look for causal relations and often tend to ignore rates or proportions. Once we conceived something as being causal (Treatment B gives higher success rate than Treatment A in general), which might be wrong, we continue to assume a causal relation and proceed with that assumption in mind. Obviously with this assumption, we will find the success rates for the subgroups of patients to be highly counter-intuitive, or even paradoxical.

In fact, the connection of this paradox to human intuitions is so important that Judea Pearl dedicated a whole section in his book for it. Modern Statistical textbooks and curriculum, however, don’t even mention it. Instead they will generally present the topic along with lurking/confounding variables.

Therefore, if you haven’t heard about this, it is probably for a good reason, or perhaps you are simply too young.

Albert Einstein and random thoughts on Machine Learning


I read Einstein’s biography with as much enthusiasm as I did with Stephen Hawking’s A brief history of Time and Domigos’ The Master Algorithm. It’s not only because the book is recommended by, among others, Elon Musk, but probably more because of my childhood dream of becoming a physicist back when I was in high school. Although I was too dumb for physics, nothing could prevent me from admiring its beauty.

The book was excellent. Walter Isaacson did a great job in depicting Albert Einstein from his complicated personality to his belief, religion, politics, and, of course, his scientific achievements.

As a human being, Einstein is a typical introvert. He was always a loner, enjoyed discussing ideas more than any personal or emotional entanglements. During the most difficult periods of life, he would rather immerse into science rather than get up and do anything about his real-life struggles. To quote Isaacson, “the stubborn patience that Einstein displayed when dealing with scientific problems was equaled by his impatience when dealing with personal entanglements”, those that put “emotional burdens” on him. Some may criticise and regard him as being “cold-hearted”, but perhaps for him, it was way easier to use the analytical brain rather than the emotional brain to deal with daily mundane affairs. This, often times, resulted in what we can consider as brutal acts, like when he gave Mileva Maric a list of harsh terms in order to stay with him, or when he did almost nothing for his first kid, and let it die in Serbia. For this aspect, though, he probably deserves more sympathy than condemnation. He was surely a complicated man, and expecting him to be well-rounded in handling personal affairs is perhaps as unreasonable as impossible.

Now, it is perhaps worth emphasizing that Einstein is a one-in-a-million genius who happens to have those personality traits. It does not imply those who have those traits are genius. Correlation does not imply causation 😉

Einstein made his mark in physics back in 1905 when he challenged Newton’s classical physics. He was bold and stubborn in challenging long-standing beliefs in science that are not backed by any experimental results. Unfortunately during his 40s, quantum physics made the same rationale, and he just couldn’t take it, although he contributed a great amount of results in its development (to this he amusingly commented: a good joke should not be repeated too often). His quest to look for a unified field theory that can explain both gravitational field and electromagnetic field by a consise set of rule was failed, and eventually quantum mechanics, with a probabilistic approach, was widely accepted. This saga tells us a lot:

  • The argument Einstein had with the rest of physicists back in 1910s on his breakthrough in relativity theory was almost exactly the same with the argument Neils Bohr had in 1930s on quantum theory, except that in 1930s, Einstein was on the conservative side. In 1910s, people believed time is absolute, Einstein shown that was wrong. In 1930s, Neils Bohr used probabilistic models to describe subatomic world, while Einstein resisted, because he didn’t believe Nature was “playing dice”.
    Perhaps amusingly, one can draw some analogies in Machine Learning. Einstein’s quest to describe physics in a set of rules sounds like Machine Learners trying to build rule-based systems back in 1980s. That effort failed and probabilistic models took advantages until today. The world is perhaps too complicated to be captured in a deterministic systems, while looking at it from a different angle, probability provides a neat mathematical framework to describe uncertainties that Nature seems to carry. While it seems impossible to describe any complicated system deterministically, it is perhaps okay to describe them probabilistically, although it might not explain how the system was created in the first place.
  • During the 1930s, in a series of lonely, depressing attempts to unify field theories, Einstein sounds a lot like… Geoff Hinton who attempted to explain how the human brain works. Actually, those are perhaps not too far from each other. The brain is eventually the 3-pound universe of mankind, and completely understanding the brain is probably as hard as understanding the universe.
  • Being a theorist his whole life, Einstein’s approach to physics is quite remarkable. He never started from experimental results, but often drawn insights at the abstract level, then proceed with intuitive thought experiments, and then went on with rigorous mathematical frameworks. He would often end his papers with a series of experimental studies that could be used to confirm his theory. This top-down approach is very appealing and became widely adopted in physics for quite a long time.
    On the contrary, many researches in Machine Learning are often bottom-up. Even the master algorithm proposed in Domigos’ book is too bottom-up to be useful. Computer Science, after all, is an applied science in which empirical results are often too emphasized. In particular, Machine Learning research are heavily based on experiments, and theories that justify those experiments often came long after, if there was any. To be fair, there are some work that come from rigorous mathematical inference, like LSTM, SELU and similar ideas, but a lot of breakthroughs in the field are empirical, like Convolutional nets, GANs and so on.
    Looking forward, drawing insights from Neuroscience is probably a promising way of designing Machine Learning systems in a top-down fashion. After all, human brain is the only instance of general intelligence that we known of by far, and the distribution of those generally intelligent devices might be highly skewed and sparse, hence drawing insights from Neuroscience is perhaps our best hope.
  • The way Einstein became an international celebrity was quite bizarre. He officially became celebrity after he paid visits to America for a series of fund-raising events for a Zionist cause in Israel. The world at the time was heavily divided after World War I, and the media was desperately looking for a international symbol to heal the wounds. Einstein, with his self-awareness, twinkling eyes and a good sense of humour, was too ready to become one. American media is surely good in this business, and the rest became history.
  • Einstein’s quest of understanding universe bears a lot of similarities with Machine Learner’s quest of building general AI systems. However, while computer scientists are meddling with our tiny superficial simulations on computers, physicists are looking for ways to understand the universe. Putting our work along side physicists’, we should probably feel humbled and perhaps a bit embarrassing.

It was amazing and refreshing to revise Einstein’s scientific journey about 100 years ago, and with a bit of creativity, one could draw many lessons that are still relevant to the research community today. Not only that, the book gives a well-informed picture about Einstein as a human, with all flaws and weaknesses. Those flaws do not undermine his genius, but on the contrary, make us readers respect him even more. Therefore Einstein is, among others, an exemplar for how much an introvert can contribute to the humankind.

For those of us who happen to live in Berlin, any time you sit in Einstein Kaffee and sip a cup of delighting coffee, perhaps you could pay some thoughts to the man who lived a well-lived life, achieved so much and also missed so much (although the Kaffe itself has nothing to do with Einstein). Berlin, after all, is where Einstein spent 17 years of his life. It is where he produced the general relativity theory – the most important work in his career, it is the only city he considered to be home throughout his bohemian life.

[RL4a] Policy Optimization

I thought I would write about some theory behind Reinforcement Learning, with eligibility traces, contraction mapping, POMDP and so on, but then I realized if I go down that rabbit hole, I would probably never finish this. So here are some practical stuff that people are actually using these days.

Two main approaches in model-free RL are policy-based, where we build a model of the policy function, and value-based, where we learn the value function. For many practical problems, the structure of the (unknown) policy function might be easier to learn than the (unknown) value function, in which case it makes more sense to use policy-based methods. Moreover, with policy-based methods, we will have an explicit policy, which is the ultimate goal of Reinforcement learning to control (the other type being learn-to-predict, more on that some other time). With value-based methods, like DQN, we will need to do an additional inference step to get the optimal policy.

The hybrid of policy-based and value-based is called Actor-Critic methods, which hopefully will be mentioned in another post.

One of the most straightforward approach in policy-based RL is, unsurprisingly, evolutionary algorithms. In this approach, a population of policies is maintained and evolutionized over time. People show that this works pretty well, e.g. for the Tetris game. However, due to the randomness, this is apparently only applied to problems where the number of parameters of the policy function is small.

A big part of policy-based methods, however, is based on Policy Gradient, where an exact estimate of the gradient of the expected future reward can be computed. Roughly speaking, there is an exact formulation for the gradient of the policy, which we can then use to optimize the policy. Since Deep Learning people basically worship gradients (pun intended), this method suites very well and became trendy about 2 years ago.

So, what is it all about? It is based on a very simple trick called Likelihood Ratio.


On “The Sympathizer”


I made great progress and almost finish The Sympathizer. There is only one final knot to be untied, but I would write something about it now, otherwise I will be too lazy when I am done with it.

Usually I am highly reluctant to read trending books. Books, especially fictions, are written to sustain the test of time, hence if a book is good for today, it shouldn’t be too bad 10 years later, otherwise it isn’t worth it. Therefore, it usually isn’t worth the effort to read a book when you are not sure if it would last for 10 years (or 5 years, perhaps).

The Sympathizer is different though. Reading a fiction of one of your countrymen in a foreign language is a pretty weird experience, so weird that I simply couldn’t resist, especially when I was in short supply of good Vietnamese books.

Without spoiling the content, here are a few random comments on the book. It was very enjoyable and turned out to be a good investment.

The story told in the book was inspired by many events that are not too unfamiliar with many Vietnamese. Even the way the war was explained, although totally different from the way it was taught in Vietnam, is in fact, well-informed and thoughtful. Therefore, if you are a self-respected Vietnamese who cares to learn about history more than what being taught in schools, the story wouldn’t be too surprising.

The surprise for me though, was the writing style. Being a debut fiction, the book was remarkable. Readers are left with the feeling that the author puts effort in every single word appeared in the book. He would use bachelor to describe someone in celibacy, or use naïveté instead of naivety, perhaps just to make the narrator sounds a bit more French. In other scenario though, he would use tummy instead of stomach, just to highlight the intimacy of the plot being told. Sentences are often short, but he does not hesitate to write sentences that are one-page long, sometimes just to make a point. I haven’t read too many fictions in English, except a few from Charles Dickens, Jack London and Dan Brown (yea, I read Dan Brown too), so I might be bias, but this kind of dedication makes the book a pleasant read.

Many people praised the book for its satire and sense of humour, but those probably come from the brutal honesty of the unnamed narrator, speaking of whom, is quite a unique character.

The narrator is a hybrid, whose parent is a French priest and a Vietnamese maid. During the war, he found himself being an assistant to a General of the Army of South Vietnam, although he is actually a sleeper agent of the North. Like any other human being, he has his own weaknesses, in this case being his bastard status, which drives him nut every time it is mentioned by other people. Having studied in the US, he consumed the Western values and culture. The whole book is, therefore in some ways, his fight to find his true identity, the true home that he really belongs to. These existentialist questions are echoed by the fact that the book was opened with a quote from Friedrich Nietzsche.

Having such a complicated background, readers could easily expect him to be quite a man they could possibly have a beer with. He would make smart, provoking comments on every single chance, from the name of the USA to that of the USSR, from sex workers to how dating works, from wines to guns, from Saigon to Hollywood, from military to, you bet, politics, philosophy and arts. He could draw, or perhaps more precisely throw, deep philosophical thoughts on seemingly random events and stories. Having seen everything from both sides, perhaps multiple sides, his opinions are well-informed, brutal and amusing at the same time. He would take every chance to reflect and show the differences, or correspondences, between Oriental and Western world, as part of his identity crisis.

I still have couples of chapters left to work on, and therefore haven’t seen everything from the book yet. However, if there is anything to criticize, I would perhaps be concerned about how naive the narrator was when it comes to his loyalty with the North. Just in the same way he cracked the American politics and culture, as well as the war, it would be amazing if he spends a bit more effort to expose the Communist side. That would make the book a fair treatment on many sides involved in this bloody war.

Moreover, although the author was tactically smart about where to let the story speaks for itself and where to make comments, sometimes he made too much of a comment, making some part of the novel a bit heavy and overdone.

Nonetheless, The Sympathizer was a good book. For many Vietnamese who are not yet exposed to the minuscule details of the Vietnam war aftermaths, this is certainly a good read. For others, this is a refreshing book that probably will keep them thinking for a while after finishing it.

Kalman filters (and how they relate to HMMs)

Kalman filters are insanely popular in many engineering fields, especially those involve sensors and motion tracking. Consider how to design a radar system to track military aircrafts (or warships, submarines, … for that matter), how to track people or vehicles in a video stream, how to predict location of a vehicle carrying a GPS sensor… In all these cases, some (advanced) variation of Kalman filter is probably what you would need.

Learning and teaching Kalman filters is therefore quite challenging, not only because of the mere complexity of the algorithms, but also because there are many variations of them.

With a Computer Science background, I encountered Kalman filters several years ago, but never managed to put them into the global context of the field. I had chances to look at them again recently, and rediscovered yet another way to present and explain Kalman filters. It made a lot of sense to me, and hopefully it does to you too.

Note that there are a lot of details missing from this post (if you are building a radar system to track military aircrafts, look somewhere else!). I was just unhappy to see many introductory material on Kalman filters are either too engineering or too simplified. I want something more Machine Learning-friendly, so this is my attempt.

Let’s say you want to track an enemy’s aircraft. All you have is a lame radar (bought from Russia probably) which, when oriented properly, will give you a 3-tuple of range, angle and angular velocity [r \;\phi\;\dot{\phi}]^{T} of the aircraft. This vector is called the observation \mathbf{z}_k (subscript k because it depends on time). The actual position of the aircraft, though, is a vector in cartesian coordinates \mathbf{x}_k = [x_1\;x_2\;x_3]^{T}. Since it is an enemy’s aircraft, you can only observe \mathbf{z}_k, and you want to track the state vector \mathbf{x}_k over time, every time you receive a measurement \mathbf{z}_k from the radar.

Visualised as a Bayesian network, it looks like this:

Untitled Diagram (1)

With all the Markov properties hold, i.e. \mathbf{x}_k only depends on \mathbf{x}_{k-1} and \mathbf{z}_k only depends on \mathbf{x}_k, does this look familiar?


On GPU architecture and why it matters

I had a nice conversation recently around the architecture of CPUs versus that of GPUs. It was so good that I still remember the day after, so it is probably worth writing down.

Note that a lot of the following are still several levels of abstraction away from the hardware, and this is in no way a rigorous discussion of modern hardware design. Still, from the software development point of view, they are adequate for everything we need to know.

It started out of the difference in allocating transistors to different components on the chip of CPU and GPU. Roughly speaking, on CPUs, a lot of transistors are reserved for the cache (several levels of those), while on GPUs, most of transistors are used for the ALUs, and cache is not very well-developed. Moreover, a modern CPU merely has a few dozen cores, while GPUs might have thousands.

Why is that? The simple answer is because CPUs are MIMD, while GPUs are SIMD (although modern nVidia GPUs are closer to MIMD).

The long answer is CPUs are designed for the Von-neumann architecture, where data and instructions are stored on RAM and then fetched to the chip on demand. The bandwidth between RAM and CPU is limited (so-called data bus and instruction bus, whose bandwidth are typically ~100 bits on modern computers). For each clock cycle, only ~100bits of data can be transfer from RAM to the chip. If an instruction or data element needed by the CPU is not on the chip, the CPU might need to wait for a few cycles before the data is fetched from RAM. Therefore, a cache is highly needed, and the bigger the cache, the better. Modern CPUs have around 3 levels of cache, unsurprisingly named L1, L2, L3… with higher level cache sits closer to the processor. Data and instructions will first be fetched to the caches, and CPU can read from the cache with much lower latency (cache is expensive though, but that is another story). In short, in order to keep the CPU processors busy, cache is used to reduce the latency of reading from RAM.

GPUs are different. Designed for graphic processing, GPUs need to compute the same, often simple, arithmetic operations on a large amount of data points, because this is what happens in 3D rendering where there are thousands of vertices need to be processed in the shader (for those who are not familiar with computer graphics, that is to compute the color values of each vertex in the scene). Each vertex can be computed independently, therefore it makes sense to have thousands of cores running in parallel. For this to be scalable, all the cores should run the same computation, hence SIMD (otherwise it is a mess to schedule thousands of cores).

For CPUs, even with caches, there are still chances that the chip requires some data or commands that are not in the cache yet, and it would need to wait for a few cycles for the data to be read from RAM. This is obviously wasteful. Modern CPUs have pretty smart and complicated prediction on where to prefetch the data from RAM to minimize latency. For example, when it enters a FOR loop, it could fetch data around the arrays being accessed and the commands around the loops. Nonetheless, even with all those tricks, there are still chances for cache misses!

One simple way to keep the CPU cores busy is context switching. While the CPU is waiting for data from RAM, it can work on something else, and this eventually keeps the cores busy, while also provides the multi-tasking feature. We are not going to dive into context switching, but basically it is about to store the current stack, restore the stack trace, reload the registers, reset the instruction counter, etc…

Let’s talk about GPUs. A typical fragment of data that GPUs have to work with are in the order of megabytes in size, so it could easily take hundreds of cycles for the data to be fetched to the cores. The question then is how to keep the cores busy.

CPUs deal with this problem by context switching. GPUs don’t do that. The threads on GPUs are not switching, because it would be problematic to switch context at the scale of thousands of cores. For the sake of efficiency, there is little of locking mechanism between GPU cores, so context switching is difficult to implement efficiently.
– In fact, the GPUs don’t try to be too smart in this regards. It simply leaves the problem to be solved at the higher level, i.e. the application level.

Talking of applications, GPUs are designed for a very specific set of applications anyway, so can we do something smarter to keep the cores busy? In graphical rendering, the usual workflow is the cores read a big chunk of data from RAM, do computation on each element of the data and write the results back to RAM (sounds like Map Reduce? Actually it is not too far from that, we can talk about GPGPU algorithms in another post). For this to be efficient, both the reading and writing phases should be efficient. Writing is tricky, but reading can be made way faster with, unsurprisingly, a cache. However, the biggest cache system on GPUs are read-only, because writable cache is messy, especially when you have thousands of cores. Historically it is called texture cache, because it is where the graphical application would write the texture (typically a bitmap) for the cores to use to shade the vertices. The cores cant write to this cache because it would not need to, but it is writable from the CPU. When people move to GPGPU, the texture cache is normally used to store constants, where they can be read by multiple cores simultaneously with low latency.

To summarize, the whole point of the discussion was about to avoid the cores being idle because of memory latency. Cache is the answer to both CPUs and GPUs, but cache on GPUs are read-only to the cores due to their massive number of cores. When cache is certainly helpful, CPUs also do context switching to further increase core utilization. GPUs, to the best of my knowledge, don’t do that much. It is left to the developers to design their algorithms so that the cores are fed with enough computation to hide the memory latency (which, by the way, also includes the transfer from RAM to GPU memory via PCIExpress – way slower and hasn’t been discussed so far).

The proper way to optimize GPGPU algorithms is, therefore, to use the data transfer latency as the guide to optimize.

Nowadays, frameworks like tensorflow or torch hide all of these details, but at the price of being a bit inefficient. Tensorflow community is aware of this and trying their best, but still much left to be done.