On GPU architecture and why it matters

I had a nice conversation recently about the architecture of CPUs versus that of GPUs. It was so good that I still remembered it the day after, so it is probably worth writing down.

Note that much of what follows is still several levels of abstraction away from the hardware, and this is in no way a rigorous discussion of modern hardware design. Still, from a software development point of view, it is adequate for everything we need to know.

It started from the difference in how transistors are allocated to different components on CPU and GPU chips. Roughly speaking, on CPUs a lot of transistors are reserved for the cache (several levels of it), while on GPUs most of the transistors are used for the ALUs, and the cache is not very well developed. Moreover, a modern CPU has merely a few dozen cores, while GPUs might have thousands.

Why is that? The simple answer is that CPUs are MIMD, while GPUs are SIMD (although modern NVIDIA GPUs are closer to MIMD).

The long answer is that CPUs are designed for the von Neumann architecture, where data and instructions are stored in RAM and then fetched to the chip on demand. The bandwidth between RAM and CPU is limited (the so-called data bus and instruction bus, whose width is typically ~100 bits on modern computers). In each clock cycle, only ~100 bits of data can be transferred from RAM to the chip. If an instruction or data element needed by the CPU is not on the chip, the CPU has to wait for some cycles until it is fetched from RAM. Therefore a cache is highly desirable, and the bigger the cache, the better. Modern CPUs have around 3 levels of cache, unsurprisingly named L1, L2, L3…, with the lower-numbered levels sitting closer to the processor. Data and instructions are first fetched into the caches, and the CPU can read from the cache with much lower latency (cache is expensive though, but that is another story). In short, in order to keep the CPU cores busy, cache is used to reduce the latency of reading from RAM.
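To put a rough number on why a cache helps, here is a back-of-the-envelope sketch in Python using the textbook average-memory-access-time formula (hit time plus miss rate times miss penalty). The cycle counts and the 95% hit rate are made-up illustrative values, not measurements of any real CPU.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time = hit time + miss rate * miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


# Hypothetical numbers, chosen only for illustration.
HIT_CYCLES = 4      # assumed latency of a cache hit
RAM_CYCLES = 200    # assumed extra latency when going out to RAM

# Without a cache every access pays the full RAM latency.
no_cache = average_access_cycles(0, 1.0, RAM_CYCLES)
# With a cache that hits 95% of the time, most accesses are cheap.
with_cache = average_access_cycles(HIT_CYCLES, 0.05, RAM_CYCLES)

print(f"no cache: {no_cache:.1f} cycles, with cache: {with_cache:.1f} cycles")
```

Even with these invented numbers, the point is visible: the average cost is dominated by how often you miss, which is why so many CPU transistors go to cache.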

GPUs are different. Designed for graphics processing, GPUs need to compute the same, often simple, arithmetic operations on a large number of data points, because this is what happens in 3D rendering, where thousands of vertices need to be processed in the shader (for those who are not familiar with computer graphics, that is to compute the color value of each vertex in the scene). Each vertex can be computed independently, so it makes sense to have thousands of cores running in parallel. For this to be scalable, all the cores should run the same computation, hence SIMD (otherwise it would be a mess to schedule thousands of cores).

For CPUs, even with caches, there are still chances that the chip requires some data or instructions that are not in the cache yet, and it would need to wait for some cycles for them to be read from RAM. This is obviously wasteful. Modern CPUs have pretty smart and complicated prediction logic for prefetching data from RAM to minimize latency. For example, when entering a for loop, the CPU could fetch the data around the arrays being accessed and the instructions around the loop. Nonetheless, even with all those tricks, there are still chances of cache misses!

One simple way to keep the CPU cores busy is context switching. While the CPU is waiting for data from RAM, it can work on something else, which keeps the cores busy while also providing the multi-tasking feature. We are not going to dive into context switching, but basically it is about saving the current stack, restoring another one, reloading the registers, resetting the instruction counter, etc.

Let’s talk about GPUs. A typical chunk of data that GPUs have to work with is on the order of megabytes in size, so it could easily take hundreds of cycles for the data to be fetched to the cores. The question then is how to keep the cores busy.

CPUs deal with this problem by context switching. GPUs don’t do that. The threads on GPUs do not switch, because it would be problematic to switch context at the scale of thousands of cores. For the sake of efficiency, there are few locking mechanisms between GPU cores, so context switching is difficult to implement efficiently. In fact, GPUs don’t try to be too smart in this regard. They simply leave the problem to be solved at a higher level, i.e. the application level.

Speaking of applications, GPUs are designed for a very specific set of applications anyway, so can we do something smarter to keep the cores busy? In graphics rendering, the usual workflow is that the cores read a big chunk of data from RAM, do some computation on each element of the data, and write the results back to RAM (sounds like MapReduce? Actually it is not too far from that; we can talk about GPGPU algorithms in another post). For this to be efficient, both the reading and writing phases should be efficient. Writing is tricky, but reading can be made way faster with, unsurprisingly, a cache. However, the biggest cache on GPUs is read-only, because a writable cache is messy, especially when you have thousands of cores. Historically it is called the texture cache, because it is where the graphics application would write the texture (typically a bitmap) for the cores to use when shading the vertices. The cores can’t write to this cache because they do not need to, but it is writable from the CPU. When people moved to GPGPU, the texture cache was commonly used to store constants, which can then be read by multiple cores simultaneously with low latency.

To summarize, the whole point of the discussion was how to avoid the cores being idle because of memory latency. Cache is the answer for both CPUs and GPUs, but the cache on GPUs is read-only to the cores because of their massive number. While cache is certainly helpful, CPUs also do context switching to further increase core utilization. GPUs, to the best of my knowledge, don’t do that much. It is left to the developers to design their algorithms so that the cores are fed with enough computation to hide the memory latency (which, by the way, also includes the transfer from RAM to GPU memory via PCI Express, which is way slower and hasn’t been discussed so far).

The proper way to optimize GPGPU algorithms is, therefore, to use the data transfer latency as the guide to optimize.
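As a rough illustration of using transfer latency as the guide, the following Python sketch compares the time to copy a buffer to the device with the time to compute on it. The link bandwidth and device throughput are hypothetical round numbers picked for the arithmetic, not the specs of any real GPU or bus.

```python
def transfer_seconds(num_bytes, bytes_per_second):
    """Time to move num_bytes over a link with the given bandwidth."""
    return num_bytes / bytes_per_second


def compute_seconds(num_ops, ops_per_second):
    """Time to execute num_ops at the given arithmetic throughput."""
    return num_ops / ops_per_second


def ops_per_byte_to_hide_transfer(bytes_per_second, ops_per_second):
    """Operations per transferred byte at which compute time equals copy time."""
    return ops_per_second / bytes_per_second


# Hypothetical hardware numbers (assumptions, for illustration only).
LINK_BYTES_PER_S = 10e9    # host-to-device link
DEVICE_OPS_PER_S = 1e12    # device arithmetic throughput

n_elements = 100_000_000           # 100M float32 values
n_bytes = 4 * n_elements
n_ops = n_elements                 # one operation per element

t_copy = transfer_seconds(n_bytes, LINK_BYTES_PER_S)
t_compute = compute_seconds(n_ops, DEVICE_OPS_PER_S)
break_even = ops_per_byte_to_hide_transfer(LINK_BYTES_PER_S, DEVICE_OPS_PER_S)

print(f"copy: {t_copy:.4f}s, compute: {t_compute:.4f}s")
print(f"need about {break_even:.0f} ops per byte to hide the copy")
```

Under these assumptions, one operation per element is nowhere near enough work to hide the copy, so the algorithm would be dominated by data movement rather than arithmetic.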

Nowadays, frameworks like TensorFlow or Torch hide all of these details, but at the price of being a bit inefficient. The TensorFlow community is aware of this and is trying its best, but there is still much left to be done.

Self-driving cars, again

This is my second take on self-driving cars, a bit more serious than last time. You might be surprised to know that it is a combination of a lot of old-school stuff in Computer Vision and Machine Learning, like perspective transform, thresholding, image warping, sliding windows, HoG, linear SVM, etc.

Three months ago I kept wondering how self-driving cars would work in Vietnam.

Now I am certain that it will never work, at least for the next 20 years (in Vietnam or in India, for that matter).

Secured private Docker registry in Kubernetes

If you run a Docker-based Kubernetes cluster yourself, sooner or later you will find out that you need a Docker registry to store the docker images. You might start out with a public registry out there, but often you might want to keep your images away from the public. Now if your cluster is on the cloud, you can just use the Container Registry provided by AWS EC2 or Google Cloud Platform. If your cluster is on-prem however, then you might want to keep the registry close to your cluster, hence deploying your own registry might be a good idea.

For starters, you can always use the registry addon shipped with Kubernetes. The default setup will give you an unsecured registry, so you will need to set up a DaemonSet to route a local port to the registry, so that, to the workers, your registry runs on localhost:PORT, which will not trigger the secure-registry checks of the Docker daemon. Check the link for more information.

This setup is rather bad though. If a user wants to push an image from their machine to the registry, they have to use kubectl to set up a proxy to the registry service, so that the service is available on their machine at localhost:PORT. This is rather inconvenient and tedious. We need a registry available at a separate host name, so that it can receive images from any machine in the network, and serve images to any worker in the Kubernetes cluster.


Yes, you should tune your pesky learning rate

For a ConvNN I trained recently, these are the learning curves when using the Adam optimizer with initial learning rate = 0.01:


When using the traditional SGD with initial learning rate = 0.01, momentum = 0.9 and decaying learning rate every 3 epochs with decay rate of 0.96, the learning curves become:
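For readers who have not used this setup, here is a minimal pure-Python sketch of SGD with momentum plus a stepwise learning-rate decay using the numbers above (0.01, 0.9, decay of 0.96 every 3 epochs). The staircase form of the decay, the function names, the toy objective f(x) = x² and the 100 steps per epoch are my assumptions for illustration; this is not the training code behind the curves.

```python
def decayed_learning_rate(epoch, initial_lr=0.01, decay_rate=0.96, decay_every=3):
    """Staircase decay: multiply the rate by decay_rate once every decay_every epochs."""
    return initial_lr * decay_rate ** (epoch // decay_every)


def momentum_step(param, velocity, grad, lr, momentum=0.9):
    """Classical momentum: accumulate a velocity, then move the parameter along it."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity


# Toy problem (assumption): minimize f(x) = x^2, whose gradient is 2x.
x, v = 5.0, 0.0
for epoch in range(20):
    lr = decayed_learning_rate(epoch)
    for _ in range(100):            # 100 updates per "epoch", arbitrary
        x, v = momentum_step(x, v, 2.0 * x, lr)

print(f"final x = {x:.3e}, final lr = {decayed_learning_rate(19):.6f}")
```

The knobs here (initial rate, momentum, decay rate, decay interval) are exactly the ones that have to be tuned by hand.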


I hope you see the drastic difference. With momentum, we got 10% error rate after 5 epochs, while with Adam, we got ~30% error rate after 20 epochs.

Now it might happen that Adam would work better if we added fancy stuff like Batch Norm and the like to the network, which I didn’t try. However, with everything else being equal, it feels to me that Adam was a bit aggressive in decreasing the learning rate, which made learning progress slow after a while.
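For reference, the standard Adam update for a single scalar parameter can be sketched as follows. The default hyper-parameters are the commonly quoted ones, and the toy gradient is an assumption for illustration. The step is the learning rate multiplied by a ratio of running gradient averages, so the initial learning rate still sets the overall scale of every update.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy problem (assumption): f(x) = x^2 with gradient 2x, starting at x = 5.
x0 = 5.0
x1, m, v = adam_step(x0, 2.0 * x0, m=0.0, v=0.0, t=1)
first_step = x0 - x1

print(f"first step size = {first_step:.6f}")
```

On the very first update the bias-corrected ratio is essentially the sign of the gradient, so the step size is approximately the learning rate itself regardless of how large the gradient is.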

Since the learning rate is more strongly regulated in Adam, perhaps we can be more lax in setting the initial learning rate? These are the learning curves for Adam with initial learning rate = 0.1:


It went wild eventually.

But with initial learning rate = 0.001, Adam gives this:


It is much better now.

Over the years, momentum and a decaying learning rate have been my first choice for tuning the learning rate. I sometimes use Adagrad/RMSProp/Adam for cross-checking, but the best results are usually found with momentum, often with fewer training epochs.

The take-away message is that you should really tune your learning rate hard. It is still one of the most important hyper-parameters. Although methods like Adam/Adagrad might give the impression that tuning the learning rate is easy, in fact it is very problem-dependent. Momentum has many more knobs to tune, but when used wisely, it is very flexible and powerful. Often you will end up in the same ballpark with any of those optimizers.

Jenkins server, or why you should never trust engineer’s estimation

Recently, like any other serious engineer, I realized I needed a CI server. I didn’t want to spend money on travis-ci or similar 3rd-party services, so I decided to roll my own Jenkins server. I thought it would just be a matter of starting a new instance, installing Jenkins, configuring some plugins, and up it goes! C’mon, how hard could that possibly be? I could probably get it running in a couple of hours!

I’ve never been so wrong in my entire life.


Installing Jenkins is indeed a piece of cake, but configuring the thing is a real pain in the butt.

The first weekend passed by and I still hadn’t got it running. I thought: okay, this is annoying, but I can probably get it done in the next couple of hours.

But when the second weekend passed by and Jenkins still didn’t work, I was not annoyed anymore. I was frustrated, so frustrated that I decided to rant about it (after I got it done, of course).

Seriously, can’t you make this a bit more verbose and tell me what could possibly be wrong? All I want is just a simple Jenkins server that builds every Pull Request, and gets triggered when something happens on my git server. Is that too much to ask? Also, can someone please write some tutorials that are up-to-date?

The cool thing about Jenkins is that it is very flexible, but so flexible that any single plugin can break your system. I ran into a NullPointerException, and when that was fixed, my build never got triggered. At some point I got it running, but naively I updated the plugin to the latest version and it stopped working. I never thought Jenkins would be this troublesome.

So here you go, this is how I got it done (at last, yea, I managed to get it done without committing suicide).

  1. Number 1 rule: use the GitHub Integration plugin, instead of the probably outdated GitHub pull request builder.
  2. In fact, if you happen to have installed the GitHub pull request builder, you need to disable Auto-manage webhooks in the Jenkins configuration, otherwise all the hooks will be handled by that plugin instead of the GitHub Integration plugin.
  3. Configuring the GitHub Integration plugin is pretty easy, although its author was pretty terse in the documentation.
  4. But then don’t upgrade the plugin! I initially used GitHub Integration plugin v0.1.0-rc9, and it worked. But v0.1.0-rc10 was released a couple of days ago, and it stopped working when I upgraded. So this is the lesson: Jenkins plugins might stop working when you upgrade them, so think twice before doing that.
  5. As a bonus, use the Environment Script plugin to specify environment variables needed for your build. Don’t use a Shell script step, because in Jenkins shell scripts are executed with sh -ex, so anything you do with export ABC=… will go away once the script finishes executing.
    But the Environment Script plugin is pretty quirky: you set the environment variables with echo "ABC=…", instead of export ABC=….
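The reason an export disappears is, as far as I understand it, plain operating-system behaviour rather than anything Jenkins-specific: each script runs in its own child process, and a child cannot change the environment of its parent or siblings. The Python sketch below demonstrates that general behaviour. The variable name and the helper function are hypothetical, and it does not touch Jenkins at all.

```python
import os
import subprocess
import sys

VAR = "DEMO_BUILD_VAR"  # hypothetical variable name, used only for this demo


def run_child(code):
    """Run a Python snippet in a fresh child process and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


os.environ.pop(VAR, None)  # make sure the parent does not define it

# First "build step": sets the variable inside its own process.
first = run_child(f"import os; os.environ['{VAR}'] = '42'; print(os.environ['{VAR}'])")
# Second "build step": a new process, which does not see the earlier assignment.
second = run_child(f"import os; print(os.environ.get('{VAR}', 'unset'))")

print(first, second, VAR in os.environ)
```

This is why a mechanism that feeds variables back into the build itself, like the plugin above, is needed.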

Since it took me way more time than I estimated, I decided to go the extra mile. Here are a few things that will make your build a bit cooler:

  1. Use Scoverage plugin to publish test coverage results (produced by sbt-scoverage, for instance)
  2. Use Embedded Build Status plugin to display those tiny build-status logos in your README file on GitHub. Honestly, those logos are quite satisfying to have on your repo.
  3. Use Slack plugin for better integration with Slack.

When something you thought was as “trivial” as setting up Jenkins takes you this much time, you start worrying about all the other estimates you have made in your entire life…


Injecting environment variables with Guice

This is rather mundane, but I couldn’t resist.

Dependency Injection is cool, as it allows you to dynamically configure the dependency graph of the classes in your application. Now very often, your application also needs to take into account some environment variables to do something useful: the address and port of the microservice to connect to, the username and password of some database, etc. Since this is just another kind of configuration, it is probably a good idea to “guice” the environment variables directly into the classes, so that you don’t need to write a separate Configuration class and thread it through every corner of the codebase.

The easiest way to do this is to use some form of Instance Binding, but I don’t like the “Named” annotation because it uses a string as the identifier, which is error-prone. Moreover, it would be nice if we could see all the environment variables in one place, instead of scattered all over the codebase.

guice-property does just that. It allows you to define all the environment variables in one place:

SERVER("MY_SERVER", "my.server", "localhost", "The server"),
PORT("", "my.port", "1234", "The port"),

and then use a special annotation @Prop to inject them into wherever they are needed:

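import com.google.inject.Inject
import phvu.prop.{Prop, Property}

// The constructor parameters are filled in by Guice from the Property definitions.
class PropertyUser @Inject()
(@Prop(Property.SERVER) val server: String, @Prop(Property.PORT) val port: Int) {

  def run() = {
    println(s"I am on $server:$port")
  }
}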

Simple enough?

Check it out and let me know what you think.

Count Featurizer

In my short “career” as an itinerant Data Scientist, I have seen many rather strange feature engineering techniques. One of the strangest is hashing. Suppose, for example, that one feature is the (customer’s) home address, which is usually a string from which it is really hard to extract any information. Hashing simply transforms that feature as follows:

f = hash(s) % M

and then f is used in one-hot encoding. The larger M is, the higher the cost, but the quality may be “better”. This technique is simple but effective, because it works well even on test data. Technically, it is very close to assigning the addresses “randomly” to M bins; however, thanks to the deterministic nature of the hash function, we can guarantee that a given string is always assigned to exactly one bin.
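A minimal Python sketch of the hashing trick, f = hash(s) % M followed by one-hot encoding. It uses hashlib rather than the built-in hash(), because Python salts string hashes per process by default, which would break the determinism the technique relies on. The function names, the choice of MD5, M = 16 and the address string are all assumptions made for illustration.

```python
import hashlib


def hash_bucket(value, num_buckets):
    """Map a string to a bucket index in [0, num_buckets) deterministically."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


M = 16                              # number of bins (assumed)
address = "12 Example Street"       # made-up address
f = hash_bucket(address, M)
features = one_hot(f, M)

print(f, features)
```

An address never seen during training still lands in one of the M bins, which is why the encoding keeps working on test data.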

Another feature engineering technique that is no less “strange” is the count-based featurizer. Although simple, this technique has proven very effective in many large-scale problems such as online advertising and fraud detection. The count featurizer is suited to encoding categorical features whose number of categories is so large that one-hot encoding is hardly usable. Such features can be addresses, customer IDs, etc. The details are as follows.
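A minimal Python sketch of one common variant of count-based featurization for a binary label: each category value is replaced by how often it was seen with each label in the training data, plus a smoothed log-odds. The exact set of statistics, the smoothing prior, the function names and the toy data are my assumptions, not necessarily the formulation this post goes on to describe.

```python
import math
from collections import defaultdict


def fit_counts(categories, labels):
    """Count, for each category value, how many examples had label 0 and label 1."""
    counts = defaultdict(lambda: [0, 0])
    for cat, label in zip(categories, labels):
        counts[cat][label] += 1
    return counts


def count_features(cat, counts, prior=1.0):
    """Encode a category as [negatives, positives, smoothed log-odds]."""
    neg, pos = counts[cat] if cat in counts else (0, 0)
    log_odds = math.log((pos + prior) / (neg + prior))
    return [neg, pos, log_odds]


# Toy training data (made up): customer IDs and a binary label.
train_ids = ["u1", "u1", "u2", "u1", "u3"]
train_labels = [1, 0, 1, 1, 0]
counts = fit_counts(train_ids, train_labels)

seen = count_features("u1", counts)
unseen = count_features("never-seen", counts)
print(seen, unseen)
```

However many distinct IDs there are, every example ends up with the same small, fixed number of numeric features, which is the appeal compared with one-hot encoding.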