On GPU architecture and why it matters

I had a nice conversation recently about the architecture of CPUs versus that of GPUs. It was so good that I still remembered it the day after, so it is probably worth writing down.

Note that much of the following is still several levels of abstraction away from the hardware, and this is in no way a rigorous discussion of modern hardware design. Still, from a software development point of view, it is adequate for everything we need to know.

It started from the difference in how transistors are allocated to the different components on CPU and GPU chips. Roughly speaking, on CPUs a lot of transistors are reserved for the cache (several levels of it), while on GPUs most of the transistors are used for the ALUs, and the cache is not very well developed. Moreover, a modern CPU has merely a few dozen cores, while a GPU might have thousands.

Why is that? The simple answer is that CPUs are MIMD, while GPUs are SIMD (although modern NVIDIA GPUs are closer to MIMD).

The long answer is that CPUs are designed for the von Neumann architecture, where data and instructions are stored in RAM and fetched to the chip on demand. The bandwidth between RAM and CPU is limited (the so-called data bus and instruction bus, whose widths are typically on the order of 100 bits on modern computers), so in each clock cycle only about 100 bits of data can be transferred from RAM to the chip. If an instruction or data element needed by the CPU is not on the chip, the CPU has to wait for a number of cycles until it is fetched from RAM. Therefore a cache is badly needed, and the bigger the cache, the better. Modern CPUs have around 3 levels of cache, unsurprisingly named L1, L2, L3…, with the lower-numbered levels sitting closer to the processor. Data and instructions are first fetched into the caches, and the CPU can then read from the cache with much lower latency (cache is expensive though, but that is another story). In short, in order to keep the CPU cores busy, cache is used to reduce the latency of reading from RAM.
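To put rough numbers on why the cache matters, here is a small sketch of the expected cost of one memory access through a cache hierarchy. The cycle counts and hit rates are made up for illustration and are not measurements of any real chip; the function name `average_access_cycles` is mine.

```python
def average_access_cycles(levels, ram_cycles):
    """Expected cycles per memory access for a cache hierarchy.

    levels: list of (lookup_cycles, hit_rate) pairs, ordered from the
            cache closest to the core (L1) outward.
    ram_cycles: cost of going all the way to RAM.
    """
    expected = 0.0
    reach_probability = 1.0  # chance that an access gets as far as this level
    for lookup_cycles, hit_rate in levels:
        expected += reach_probability * lookup_cycles  # pay the lookup on arrival
        reach_probability *= 1.0 - hit_rate            # only misses go further out
    return expected + reach_probability * ram_cycles


# Illustrative numbers only (assumptions, not real hardware figures).
hierarchy = [(4, 0.90), (12, 0.80), (40, 0.50)]
print(average_access_cycles(hierarchy, ram_cycles=200))
print(average_access_cycles([], ram_cycles=200))  # no cache: every access pays 200
```

With these assumed numbers, the average access is far cheaper than a trip to RAM, even though a small fraction of accesses still go all the way out.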

GPUs are different. Designed for graphics processing, GPUs need to apply the same, often simple, arithmetic operations to a large number of data points. This is what happens in 3D rendering, where thousands of vertices need to be processed in the shader (for those who are not familiar with computer graphics, that means computing the color values of each vertex in the scene). Each vertex can be computed independently, so it makes sense to have thousands of cores running in parallel. For this to be scalable, all the cores should run the same computation, hence SIMD (otherwise scheduling thousands of cores would be a mess).
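As a toy picture of that workload, the sketch below runs one small function over every vertex. It is plain sequential Python, so it only shows the shape of the computation (same instructions, independent elements), not actual parallel execution; `shade` is a hypothetical stand-in for a real shader.

```python
def shade(normal, light):
    """Toy per-vertex shader: brightness is dot(normal, light), clamped at 0."""
    nx, ny, nz = normal
    lx, ly, lz = light
    return max(0.0, nx * lx + ny * ly + nz * lz)


def shade_all(normals, light):
    # Every vertex goes through exactly the same instructions and no vertex
    # depends on another, which is what makes the job easy to spread over
    # many cores running in lockstep.
    return [shade(n, light) for n in normals]


normals = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (1.0, 0.0, 0.0)]
print(shade_all(normals, light=(0.0, 0.0, 1.0)))
```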

For CPUs, even with caches, there is still a chance that the chip requires some data or instructions that are not in the cache yet, and it then needs to wait for a number of cycles for them to be read from RAM. This is obviously wasteful. Modern CPUs have pretty smart and complicated predictors that decide what to prefetch from RAM to minimize latency. For example, when the CPU enters a FOR loop, it could fetch the data around the arrays being accessed and the instructions around the loop. Nonetheless, even with all those tricks, cache misses still happen!

One simple way to keep the CPU cores busy is context switching. While the CPU is waiting for data from RAM, it can work on something else, which keeps the cores busy and also provides multi-tasking. We are not going to dive into context switching, but basically it is about saving the state of the current task (its stack, registers and instruction counter) and restoring the state of another one.
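The idea can be mimicked in a few lines with Python generators, where each `yield` plays the role of a stall on memory and the generator object holds the saved state of the task. This is only an analogy under my own naming (`task`, `run_round_robin`), not how an operating system or a CPU implements it.

```python
from collections import deque


def task(name, steps, log):
    """A toy task: do one unit of work, then stall waiting for memory."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # stall; the generator keeps our local state (the saved context)


def run_round_robin(tasks):
    """Whenever the running task stalls, switch to the next ready one."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # restore the context, run until the next stall
            ready.append(current)  # stalled: park it and pick another task
        except StopIteration:
            pass                   # task finished, drop it


log = []
run_round_robin([task("A", 2, log), task("B", 2, log)])
print(log)
```

The work of A and B ends up interleaved, so the single "core" always has something to do while one of them is waiting.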

Let’s talk about GPUs. A typical chunk of data that GPUs have to work with is on the order of megabytes in size, so it could easily take hundreds of cycles for the data to be fetched to the cores. The question then is how to keep the cores busy.

CPUs deal with this problem by context switching. GPUs don’t do that. The threads on GPUs do not switch, because it would be problematic to switch context at the scale of thousands of cores. For the sake of efficiency, there is little in the way of locking mechanisms between GPU cores, so context switching is difficult to implement efficiently. In fact, GPUs don’t try to be too smart in this regard. They simply leave the problem to be solved at a higher level, i.e. the application level.

Talking of applications, GPUs are designed for a very specific set of applications anyway, so can we do something smarter to keep the cores busy? In graphics rendering, the usual workflow is that the cores read a big chunk of data from RAM, do some computation on each element and write the results back to RAM (sounds like MapReduce? Actually it is not too far from that; we can talk about GPGPU algorithms in another post). For this to be efficient, both the reading and the writing phases should be efficient. Writing is tricky, but reading can be made way faster with, unsurprisingly, a cache. However, the biggest cache on GPUs is read-only, because a writable cache is messy, especially when you have thousands of cores. Historically it is called the texture cache, because it is where the graphics application would put the texture (typically a bitmap) for the cores to use when shading the vertices. The cores can’t write to this cache because they do not need to, but it is writable from the CPU. When people moved to GPGPU, the texture cache came to be used for constants, which can be read by multiple cores simultaneously with low latency.

To summarize, the whole point of the discussion was how to avoid cores sitting idle because of memory latency. Cache is the answer for both CPUs and GPUs, but the cache on GPUs is read-only to the cores because there are so many of them. While cache is certainly helpful, CPUs also do context switching to further increase core utilization. GPUs, to the best of my knowledge, don’t do that much. It is left to the developers to design their algorithms so that the cores are fed with enough computation to hide the memory latency (which, by the way, also includes the transfer from RAM to GPU memory via PCI Express, which is way slower and hasn’t been discussed so far).

The proper way to optimize GPGPU algorithms is, therefore, to use the data transfer latency as the guide.
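A back-of-the-envelope version of that guide is to compare the time spent moving data with the time spent computing on it. The sketch below does exactly that; the bandwidth and throughput figures are hypothetical round numbers, and the model assumes the result is as large as the input and that the kernel runs at peak rate.

```python
def offload_times(n_bytes, flops_per_byte, pcie_bytes_per_s, gpu_flops_per_s):
    """Return (transfer_seconds, compute_seconds) for one offloaded job.

    Assumes the data crosses the bus twice (input in, result out) and that
    the kernel runs at the GPU's peak rate.
    """
    transfer_s = 2.0 * n_bytes / pcie_bytes_per_s
    compute_s = n_bytes * flops_per_byte / gpu_flops_per_s
    return transfer_s, compute_s


def break_even_flops_per_byte(pcie_bytes_per_s, gpu_flops_per_s):
    """Operations per byte at which compute time equals transfer time."""
    return 2.0 * gpu_flops_per_s / pcie_bytes_per_s


# Hypothetical round numbers: 8 GB/s over the bus, 1 TFLOP/s on the GPU.
PCIE, GPU = 8e9, 1e12
print(break_even_flops_per_byte(PCIE, GPU))  # operations needed per byte moved
print(offload_times(1e9, 10, PCIE, GPU))     # a light kernel: transfer dominates
```

Under these assumptions, a kernel that does only a handful of operations per byte spends most of its time waiting on the bus, so the data had better stay on the GPU across several steps.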

Nowadays, frameworks like TensorFlow or Torch hide all of these details, but at the price of being a bit inefficient. The TensorFlow community is aware of this and is trying its best, but there is still much left to be done.


Deep Learning for the masses

Andrew Ng’s group has just published a paper that shows how to build giant neural networks on low-cost GPUs. Excerpt:

In this paper, we present technical details and results from our own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI. Our system is able to train 1 billion parameter networks on just 3 machines in a couple of days, and we show that it can scale to networks with over 11 billion parameters using just 16 machines. As this infrastructure is much more easily marshaled by others, the approach enables much wider-spread research with extremely large neural networks.

More information can be found in the full article on Wired.

I am just wondering who is going to need this much computational power, apart from Google and a few other big companies. For me, even a single GTX 580 would already be overwhelming… if I ever had one.

What happens if you over-train a neural net?

When you let a machine learn too much, it may end up doing worse. It is just like us: as human beings, we start forgetting things, or even go crazy, when we are forced to study excessively.

All jokes aside, I have had some chances to play with deep and (reasonably) big neural networks, and I found out exactly what was said above.

In the first experiment, I trained a feed-forward neural network on MNIST. The net has 3 hidden layers with 1024 – 1024 – 2048 hidden units, fully connected, trained by stochastic gradient descent with momentum, L2 weight decay and a decaying learning rate. The cost function is cross entropy. The net is similar to the one described here, but 2 layers deeper. The numbers of errors on the training/validation/test sets are displayed in the figure below.
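For reference, one update of that kind of training rule can be sketched as below. This is a generic textbook form, not the exact rule or hyperparameters used in the experiment; the default values and the `1 / (1 + lr_decay * step)` schedule are assumptions chosen for illustration.

```python
def sgd_momentum_step(weights, grads, velocity, step,
                      base_lr=0.1, momentum=0.9,
                      weight_decay=1e-4, lr_decay=1e-3):
    """One SGD step with momentum, L2 weight decay and a decaying learning rate.

    All hyperparameter defaults are illustrative assumptions.
    """
    lr = base_lr / (1.0 + lr_decay * step)  # learning rate shrinks over time
    new_velocity = [
        momentum * v - lr * (g + weight_decay * w)  # L2 adds weight_decay * w to the gradient
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy usage: minimise f(w) = w^2, whose gradient is 2w.
w, v = [1.0], [0.0]
for step in range(5):
    w, v = sgd_momentum_step(w, [2.0 * w[0]], v, step)
print(w)
```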

Number of errors of the 3-hidden-layer neural net, without dropout, on MNIST. The vertical axis displays the actual number of errors, instead of the error percentage as mistakenly shown in the figure.

It went wild, eventually. After 700*2000 = 1400000 iterations, the number of errors jumped to almost 10000, which is about 99% of the test set. The more we train it, the more stupid it gets.

Looking at the first few iterations, it can be seen that the net was actually learning well. The number of errors kept steadily decreasing, and after 50*2000 = 100000 iterations, the net achieved a reasonably good performance with around 170 errors on the test set. Remember that the best result reported so far on MNIST (without applying any data augmentation trick) is 160 errors. This is just to say that obtaining a competitive result on MNIST is quite simple with a deep neural network.


The first few iterations of training the 3-layer deep neural network, without dropout, on MNIST. Again, the vertical axis displays the number of errors, instead of the error percentage.

But what the heck happened after 1400000 iterations? Why did the error increase so wildly? Backpropagation (with stochastic gradient descent) is supposed to ensure that the more we learn, the more we decrease the error, or at least that we stay at the current best state. So why did it get worse in this case?

Well, the reason is that I used L2 weight decay. After training for a while, the cross entropy falls into a local optimum and the net stops learning. However, the L2 weight regularization is still there, so the net just keeps pushing all the weights toward zero. After some point, this totally ruins the net.
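The effect is easy to reproduce on a single weight. In the sketch below the gradient of the data term is assumed to be exactly zero (the "stopped learning" situation), so the only force left is the decay term, and the weight shrinks by a factor of `1 - lr * weight_decay` at every step. The learning rate and decay values are illustrative assumptions, not the ones from the experiment.

```python
def decay_only_training(weight, lr, weight_decay, iterations):
    """Run gradient steps where the data gradient has vanished.

    Only the L2 term is left, so each step multiplies the weight by
    (1 - lr * weight_decay).
    """
    for _ in range(iterations):
        data_grad = 0.0  # assumption: the cross-entropy term no longer pushes
        weight -= lr * (data_grad + weight_decay * weight)
    return weight


# Illustrative values only.
for iterations in (0, 1000, 10000, 100000):
    print(iterations, decay_only_training(1.0, lr=0.1, weight_decay=0.01,
                                          iterations=iterations))
```

Each step barely changes anything, but over hundreds of thousands of iterations the weight is driven to essentially nothing.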

I then performed another experiment with the same network architecture, but this time I used dropout as the regularization method, together with another weight norm technique. I got very similar behavior.

Training of deep neural network with dropout, on MNIST. Again, the vertical axis displays the number of errors, instead of error percentage.

The first few steps of training a deep neural network with dropout on MNIST. Again, the vertical axis displays the number of errors, instead of error percentage.

The net just went wild after a while. Looking at the first few steps, we can see that with dropout the training error decreases more slowly, and the test error gets very close to 160, which is the best reported result. This shows that our network has overfitted, but also that different regularization methods seem to be powerful in controlling the behavior of the net. Of course, in any case we can manually stop the training process and select the “just-right” model, or even stop early and pick a good model after the first few steps.
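Picking the "just-right" model does not have to be manual. A minimal sketch of early stopping on the validation error is given below; the error curve in the usage example is invented for illustration and is not data from these experiments, and `patience` is a parameter I introduce here.

```python
def early_stopping_index(validation_errors, patience):
    """Return (index, errors) of the best checkpoint seen before stopping.

    Scanning stops once `patience` checkpoints in a row have failed to
    improve on the best validation error so far.
    """
    best_index, best_errors = None, float("inf")
    for i, errors in enumerate(validation_errors):
        if errors < best_errors:
            best_index, best_errors = i, errors  # new best: remember this checkpoint
        elif best_index is not None and i - best_index >= patience:
            break                                # no improvement for too long
    return best_index, best_errors


# Made-up validation curve: improves, plateaus, then blows up.
curve = [500, 300, 180, 170, 175, 172, 9000]
print(early_stopping_index(curve, patience=2))
```

The blow-up at the end of the curve is never reached, because training would already have been stopped and the earlier checkpoint kept.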

The nice thing is that, thanks to deepnet, a recently published library for training NNs on GPUs, none of those experiments took much time. Actually it took only 5 minutes to complete the first 80*2000 = 160000 iterations. After that many iterations, the network was already well trained.

So the moral of the story is: do not study too much, or it might drive you crazy, just like an over-trained neural net.