
On GPU architecture and why it matters

I had a nice conversation recently about the architecture of CPUs versus that of GPUs. It was good enough that I still remembered it the day after, so it is probably worth writing down.

Note that a lot of the following is still several levels of abstraction away from the hardware, and this is in no way a rigorous discussion of modern hardware design. Still, from a software development point of view, it is adequate for everything we need to know.

It started from the difference in how transistors are allocated to different components on the chips of CPUs and GPUs. Roughly speaking, on CPUs a lot of transistors are reserved for the cache (several levels of it), while on GPUs most of the transistors are used for the ALUs and the cache is not very well developed. Moreover, a modern CPU has merely a few dozen cores, while a GPU might have thousands.

Why is that? The simple answer is that CPUs are MIMD, while GPUs are SIMD (although modern nVidia GPUs are closer to MIMD).

The long answer is that CPUs are designed for the von Neumann architecture, where data and instructions are stored in RAM and fetched to the chip on demand. The bandwidth between RAM and the CPU is limited (the so-called data bus and instruction bus, whose widths are typically on the order of 100 bits on modern machines). In each clock cycle, only about that much data can be transferred from RAM to the chip. If an instruction or data element needed by the CPU is not on the chip, the CPU might have to wait several cycles for it to be fetched from RAM. Therefore a cache is essential, and the bigger the cache, the better. Modern CPUs have around 3 levels of cache, unsurprisingly named L1, L2, L3…, with L1 sitting closest to the processor cores. Data and instructions are first fetched into the caches, and the CPU can then read them with much lower latency (cache is expensive though, but that is another story). In short, in order to keep the CPU processors busy, the cache is used to hide the latency of reading from RAM.
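To make the cost of going to RAM a bit more concrete, here is a rough, machine-dependent sketch in Python (numpy is assumed, and a typical 64-byte cache line): the strided sum below touches only 1/16 of the elements, yet it is nowhere near 16x faster, because every access still drags a whole cache line in from memory.

```python
import time
import numpy as np

a = np.random.rand(32_000_000).astype(np.float32)   # ~128 MB, far bigger than any cache

t0 = time.perf_counter()
_ = a.sum()          # contiguous: streams nicely through the cache hierarchy
t1 = time.perf_counter()
_ = a[::16].sum()    # stride of 16 floats = one element per 64-byte cache line
t2 = time.perf_counter()

print(f"contiguous: {t1 - t0:.3f}s, strided (1/16 of the elements): {t2 - t1:.3f}s")
```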

GPUs are different. Designed for graphics processing, GPUs need to compute the same, often simple, arithmetic operations on a large number of data points, because this is what happens in 3D rendering, where thousands of vertices need to be processed in the shader (for those who are not familiar with computer graphics, that means computing the color value of each vertex in the scene). Each vertex can be computed independently, therefore it makes sense to have thousands of cores running in parallel. For this to be scalable, all the cores should run the same computation, hence SIMD (otherwise it is a mess to schedule thousands of cores).
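As a toy illustration of the SIMD mindset (this is not how a real shader is written), here is a hypothetical numpy sketch that applies the exact same arithmetic to a hundred thousand vertices at once instead of looping over them one by one:

```python
import numpy as np

# Same instruction stream, many data points: a crude Lambert-style shading pass.
vertices = np.random.rand(100_000, 3)                 # toy vertex positions
light_dir = np.array([0.0, 0.0, 1.0])

normals = vertices / np.linalg.norm(vertices, axis=1, keepdims=True)
brightness = np.clip(normals @ light_dir, 0.0, 1.0)   # one value per vertex, computed in bulk
```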

For CPUs, even with caches, there is still a chance that the chip needs some data or instructions that are not in the cache yet, and it then has to wait several cycles for them to be read from RAM. This is obviously wasteful. Modern CPUs have pretty smart and complicated prediction logic for prefetching data from RAM to minimize latency. For example, when entering a for loop, they can fetch the data around the arrays being accessed and the instructions around the loop. Nonetheless, even with all those tricks, cache misses still happen!

One simple way to keep the CPU cores busy is context switching. While the CPU is waiting for data from RAM, it can work on something else; this keeps the cores busy while also providing multi-tasking. We are not going to dive into context switching, but basically it is about storing the current stack, restoring the stack of another task, reloading the registers, resetting the instruction counter, etc.

Let’s talk about GPUs. A typical chunk of data that GPUs have to work with is on the order of megabytes in size, so it can easily take hundreds of cycles for the data to be fetched to the cores. The question then is how to keep the cores busy.

CPUs deal with this problem by context switching. GPUs don’t do that. Threads on GPUs are not switched in that way, because it would be problematic to switch context at the scale of thousands of cores. For the sake of efficiency, there is very little locking machinery between GPU cores, so context switching is difficult to implement efficiently. In fact, GPUs don’t try to be too smart in this regard. They simply leave the problem to be solved at a higher level, i.e. the application level.

Talking of applications, GPUs are designed for a very specific set of applications anyway, so can we do something smarter to keep the cores busy? In graphical rendering, the usual workflow is that the cores read a big chunk of data from RAM, do some computation on each element of the data, and write the results back to RAM (sounds like MapReduce? It is actually not too far from that; we can talk about GPGPU algorithms in another post). For this to be efficient, both the reading and writing phases should be efficient. Writing is tricky, but reading can be made way faster with, unsurprisingly, a cache. However, the biggest cache on GPUs is read-only to the cores, because a writable cache gets messy, especially when you have thousands of cores. Historically it is called the texture cache, because it is where a graphical application would put the texture (typically a bitmap) for the cores to use when shading the vertices. The cores can’t write to this cache because they don’t need to, but it is writable from the CPU. When people moved to GPGPU, the texture cache became the natural place to store constants, which can be read by many cores simultaneously with low latency.

To summarize, the whole point of the discussion was how to avoid the cores sitting idle because of memory latency. Cache is the answer for both CPUs and GPUs, but caches on GPUs are read-only to the cores due to the massive number of cores. While caches certainly help, CPUs also do context switching to further increase core utilization. GPUs, to the best of my knowledge, don’t do that much. It is left to the developers to design their algorithms so that the cores are fed with enough computation to hide the memory latency (which, by the way, also includes the transfer from RAM to GPU memory via PCI Express – way slower, and not discussed so far).

The proper way to optimize GPGPU algorithms is, therefore, to use the data transfer latency as the guide for optimization.
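As a rough sketch of what that can look like in practice (assuming a CUDA-capable GPU and the CuPy library; the exact numbers are entirely machine-dependent), one can simply time the host-to-device copy against the computation it feeds:

```python
import time
import numpy as np
import cupy as cp   # assumes a CUDA GPU with CuPy installed

x = np.random.rand(4096, 4096).astype(np.float32)

t0 = time.perf_counter()
gx = cp.asarray(x)                  # host -> device copy over PCI Express
cp.cuda.Device().synchronize()
t1 = time.perf_counter()
gy = gx @ gx                        # the actual computation on the GPU
cp.cuda.Device().synchronize()
t2 = time.perf_counter()

print(f"transfer: {t1 - t0:.4f}s, compute: {t2 - t1:.4f}s")
```

If the transfer dominates, the usual fixes are to do more computation per byte transferred, or to keep the data resident on the GPU across steps.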

Nowadays, frameworks like Tensorflow or Torch hide all of these details, at the price of being a bit inefficient. The Tensorflow community is aware of this and trying their best, but there is still much left to be done.

Variational Autoencoders 3: Training, Inference and comparison with other models

Variational Autoencoders 1: Overview
Variational Autoencoders 2: Maths
Variational Autoencoders 3: Training, Inference and comparison with other models

Recalling that the backbone of VAEs is the following equation:

\log P\left(X\right) - \mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log P\left(X\vert z\right)\right] - \mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right]

In order to use gradient descent on the right-hand side, we need a tractable way to compute it:

  • The first part E_{z\sim Q}\left[\log P\left(X\vert z\right)\right] is tricky, because a good approximation of the expectation requires passing multiple samples of z through f (and this is expensive). However, we can just take one sample of z, pass it through f and use it as an estimate of E_{z\sim Q}\left[\log P\left(X\vert z\right)\right]. We are doing stochastic gradient descent over different samples X in the training set anyway.
  • The second part \mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right] is even trickier. By design, we fix P\left(z\right) to be the standard normal distribution \mathcal{N}\left(0,I\right) (read part 1 to see why). Therefore, we need a way to parameterize Q\left(z\vert X\right) so that the KL divergence is tractable.

Here comes perhaps the most important approximation in VAEs. Since P\left(z\right) is a standard Gaussian, it is convenient to make Q\left(z\vert X\right) Gaussian as well, with mean \mu\left(X\right) and diagonal covariance \sigma\left(X\right)I, i.e. Q\left(z\vert X\right) = \mathcal{N}\left(z;\mu\left(X\right), \sigma\left(X\right)I\right), where \mu\left(X\right) and \sigma\left(X\right) are two vectors computed by a neural network. This is the original formulation of VAEs in section 3 of this paper.

This parameterization is preferred because the KL divergence now has a closed form:

\displaystyle \mathcal{D}\left[\mathcal{N}\left(\mu\left(X\right), \sigma\left(X\right)I\right)\vert\vert P\left(z\right)\right] = \frac{1}{2}\left[\mathrm{tr}\left(\sigma\left(X\right)I\right) +\left(\mu\left(X\right)\right)^T\mu\left(X\right) - k - \log \det \left(\sigma\left(X\right)I\right) \right]

Although this looks like magic, it is quite natural if you apply the definition of the KL divergence to two normal distributions. Working it out will teach you a bit of calculus.
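For concreteness, here is a minimal numpy sketch of that closed form (a hypothetical helper, treating \sigma\left(X\right) as the diagonal of the covariance matrix):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL[ N(mu, diag(sigma)) || N(0, I) ], with sigma the (positive)
    diagonal of the covariance matrix, following the formula above."""
    k = mu.shape[0]
    return 0.5 * (np.sum(sigma) + mu @ mu - k - np.sum(np.log(sigma)))

# Sanity check: Q identical to the prior gives zero divergence.
print(kl_to_standard_normal(np.zeros(4), np.ones(4)))   # 0.0
```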

So we have all the ingredients. We use a feedforward net to predict \mu\left(X\right) and \sigma\left(X\right) given an input sample X drawn from the training set. With those vectors, we can compute the KL divergence and \log P\left(X\vert z\right), which, in terms of optimization, translates into something similar to \Vert X - f\left(z\right)\Vert^2.

It is worth pausing here for a moment to see what we just did. Basically we used a constrained Gaussian (with diagonal covariance matrix) to parameterize Q. Moreover, by using \Vert X - f\left(z\right)\Vert^2 as one of the training criteria, we implicitly assume P\left(X\vert z\right) to be Gaussian as well. So although the maths that lead to VAEs are generic and beautiful, at the end of the day, to make things tractable, we end up using those severe approximations. Whether those approximations are good enough depends entirely on the practical application.

There is an important detail though. Once we have \mu\left(X\right) and \sigma\left(X\right) from the encoder, we need to sample z from a Gaussian distribution parameterized by those vectors. z is needed for the decoder to reconstruct \hat{X}, which will then be optimized to be as close to X as possible via gradient descent. Unfortunately, the “sample” step is not differentiable, therefore we need a trick called reparameterization: we don’t sample z directly from \mathcal{N}\left(\mu\left(X\right), \sigma\left(X\right)I\right), but first sample z' from \mathcal{N}\left(0, I\right), and then compute z = \mu\left(X\right) + \left(\sigma\left(X\right)I\right)^{1/2}z'. This makes the whole computation differentiable and we can apply gradient descent as usual.
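A minimal numpy sketch of this reparameterization step, under the same assumption that \sigma\left(X\right) is the diagonal of the covariance:

```python
import numpy as np

rng = np.random.default_rng()

def reparameterize(mu, sigma):
    """Draw z ~ N(mu, diag(sigma)) as a deterministic function of (mu, sigma)
    plus independent standard-normal noise, so gradients can flow through both."""
    z_prime = rng.standard_normal(mu.shape)   # z' ~ N(0, I)
    return mu + np.sqrt(sigma) * z_prime      # z = mu + Sigma^{1/2} z'
```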

The cool thing is that during inference, you won’t need the encoder to compute \mu\left(X\right) and \sigma\left(X\right) at all! Remember that during training we pull Q towards P\left(z\right) (which is standard normal), so during inference we can just inject \epsilon \sim \mathcal{N}\left(0, I\right) directly into the decoder and get a sample of X. This is how we leverage the “generation” power of VAEs.

There are various extensions to VAEs, like Conditional VAEs and so on, but once you understand the basics, everything else is just nuts and bolts.

To sum up the series, this is the conceptual graph of VAEs during training, compared to some other models. Of course there are many details in those graphs that are left out, but you should get a rough idea about how they work.

[Figure: conceptual graphs of VAEs during training, compared to other generative models]

In the case of VAEs, I added the additional cost term in blue to highlight it. The cost term for the other models, except GANs, is the usual L2 norm \Vert X - \hat{X}\Vert^2.

GSN is an extension of the Denoising Autoencoder with explicit hidden variables; however, it requires forming a fairly complicated Markov Chain. We may have another post for it.

With this diagram, hopefully you will see how lame GANs are. They are even simpler than the humble RBM. However, the simplicity of GANs makes them so powerful, while the complexity of VAEs makes them quite an effort just to understand. Moreover, VAEs make quite a few severe approximations, which might explain why samples generated from VAEs are far less realistic than those from GANs.

That’s quite enough for now. Next time we will switch to another topic I’ve been looking into recently.

Variational Autoencoders 2: Maths

Variational Autoencoders 1: Overview
Variational Autoencoders 2: Maths
Variational Autoencoders 3: Training, Inference and comparison with other models

Last time we saw the probability distribution of X with a latent variable z as follows:

\displaystyle P(X) = \int P\left(X\vert z; \theta\right)P(z)dz  (1)

and we said the key idea behind VAEs is not to sample z from the whole distribution P\left(z\right), but from a simpler distribution Q\left(z\vert X\right). The reason is that most values of z will likely give P\left(X\vert z\right) close to zero, and therefore make little contribution to the estimation of P\left(X\right). If instead we sample z \sim Q\left(z\vert X\right), those values of z are more likely to generate the X in the training set. Moreover, we hope that Q will have fewer modes than P\left(z\right), and therefore be easier to sample from. The intuition is that the locations of the modes of Q\left(z\vert X\right) depend on X, and this flexibility compensates for the fact that Q\left(z\vert X\right) is simpler than P\left(z\right).

But how can Q\left(z\vert X\right) help with modelling P\left(X\right)? If z is sampled from Q, then using f we get E_{z \sim Q}P\left(X\vert z\right). We then need to show the relationship between this quantity and P\left(X\right), which is the actual quantity we want to estimate. The relationship between E_{z \sim Q}P\left(X\vert z\right) and P\left(X\right) is the backbone of VAEs.

We start with the KL divergence between Q\left(z\vert X\right) and P\left(z\vert X\right):

\mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log Q\left(z\vert X\right) - \log P\left(z\vert X\right)\right]

The unknown quantity in this equation is P\left(z\vert X\right), but at least we can apply Bayes’ rule to it:

\mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log Q\left(z\vert X\right) - \log P\left(X\vert z\right) - \log P\left(z\right)\right] + \log P\left(X\right)
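For completeness, the substitution used here is Bayes’ rule in log form:

\log P\left(z\vert X\right) = \log P\left(X\vert z\right) + \log P\left(z\right) - \log P\left(X\right)

and since \log P\left(X\right) does not depend on z, it can be pulled out of the expectation.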

Rearranging things a bit and applying the definition of the KL divergence between Q\left(z\vert X\right) and P\left(z\right), we have:

\log P\left(X\right) - \mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log P\left(X\vert z\right)\right] - \mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right]    (2)

If you forget everything else, this formula is the one thing you should remember. It is therefore important to understand what it means:

  • The left-hand side is exactly what we want to optimize, minus an error term. The smaller this error term is, the better we are at maximizing P\left(X\right). In other words, the right-hand side is a lower bound on what we want to optimize, hence the name variational (Bayesian).
  • If Q is a differentiable function, the right-hand side is something we can optimize with gradient descent (we will see how later). Note that the right-hand side happens to take the form of an encoder and a decoder, where Q encodes X into z, and then P decodes z to reconstruct X, hence the name “Autoencoder”. However, VAEs don’t really belong to the family of Denoising and Sparse Autoencoders, although there are indeed some connections.
  • Note that P\left(z\vert X\right) on the left-hand side is intractable. However, by maximizing the right-hand side, we simultaneously push \log P\left(X\right) up and \mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right] down, and therefore pull Q\left(z\vert X\right) closer to P\left(z\vert X\right). If we use a flexible model for Q, then we can use Q as an approximation of P\left(z\vert X\right). This is a nice side effect of the whole framework.

Actually, the above maths existed way before VAEs. The new trick was to use a feedforward network for Q, which is what gave rise to VAEs a few years ago.

Next time, we will see how to do that, and hopefully conclude this series. Then we can move on with something more interesting.

Self-driving cars, again

This is my second take on self-driving cars, a bit more serious than last time. You might be surprised to learn that it is a combination of many old-school techniques in Computer Vision and Machine Learning: perspective transform, thresholding, image warping, sliding windows, HoG, linear SVM, etc.

Three months ago I kept wondering how self-driving cars would work in Vietnam.

Now I am certain that it will never work, at least for the next 20 years (in Vietnam or in India, for that matter).

Variational Autoencoders 1: Overview

In a previous post, we briefly mentioned some recent approaches to Generative Modeling. Among those, RBMs and DBMs are probably the trickiest, because estimating the gradients in those models relies on good MCMC mixing, which tends to get worse during training as the model distribution gets sharper. Autoregressive models like PixelRNN, WaveNet, etc. are easier to train but have no latent variables, which makes them somewhat less powerful. Therefore, the current frontier in Generative Modelling is probably GANs and Variational Autoencoders (VAEs).

While GANs are too mainstream, I thought I could write a post or two about Variational Autoencoders, at least to clear up some of the confusion I had about them.

Formally, generative modeling is the area of Machine Learning that deals with models of distributions P(X), defined over datapoints X in some high-dimensional space \mathcal{X}. The whole idea is to construct models of P(X) that assign high probabilities to data points similar to those in the training set, and low probabilities everywhere else. For example, a generative model of images of cows should assign low probabilities to images of humans.

However, computing the probability of a given example is not the most exciting thing about generative models. More often, we want to use the model to generate new samples that look like those in the training set. This “creativity” is something unique to generative models and does not exist in, for instance, discriminative models. More formally, say we have a training set sampled from an unknown distribution P_\text{org}(X), and we want to train a model P that we can sample from, such that P is as close as possible to P_\text{org}.

Needless to say, this is a difficult problem. To make it tractable, traditional approaches in Machine Learning often have to 1) make strong assumptions about the structure of the data, or 2) make severe approximations, leading to suboptimal models, or 3) rely on expensive sampling procedures like MCMC. Those limitations are what make generative modeling a long-standing problem in ML research.

Without further ado, let’s get to the point. When \mathcal{X} is a high-dimensional space, modeling is difficult mostly because it is tricky to handle the inter-dependencies between dimensions. For instance, if the left half of an image contains half of a horse, then the right half likely contains the rest of that horse.

To reduce this complexity, we add a latent variable z in a high-dimensional space \mathcal{Z} that we can easily sample from, according to a distribution P(z) defined over \mathcal{Z}. Then say we have a family of deterministic functions f(z;\theta), parameterized by a vector \theta in some space \Theta, where f: \mathcal{Z} \times \Theta \rightarrow \mathcal{X}. Now f is deterministic, but since z is a random variable, f(z;\theta) is a random variable in \mathcal{X}.

We then sample z from P(z) and train \theta such that f(z;\theta) is close to samples in the training set. Mathematically, we want to maximize the following probability for every sample X in the training set:

\displaystyle P(X) = \int P\left(X\vert z; \theta\right)P(z)dz   (1)

This is the good old maximum likelihood framework, but we replace f(z;\theta) by P\left(X\vert z;\theta\right) (called the output distribution) to explicitly indicate that X depends on z, so that we can use the integral to make it a proper probability distribution.

There are a few things to note here:

  • In VAEs, the output distribution is often chosen to be Gaussian, i.e. P\left(X\vert z;\theta\right) = \mathcal{N}\left(X; f(z;\theta), \sigma^2 I\right), meaning it is a Gaussian distribution with mean f(z;\theta) and diagonal covariance matrix \sigma^2 I, where \sigma is a scalar hyper-parameter. This particular choice has some important motivations:
    • We need the output distribution to be continuous, so that we can use gradient descent on the whole model. It wouldn’t be possible if we used something discontinuous like the Dirac delta, i.e. using exactly the output value of f(z;\theta) for X.
    • We don’t really need to train the model so that f(z;\theta) produces exactly some sample X in the training set. Instead, we want it to produce samples that are merely like X. At the beginning of training, there is no way for f to give exact samples from the training set, so by using a Gaussian, we allow the model to gradually (and gently) learn to produce samples that are more and more like those in the training set.
    • It doesn’t have to be Gaussian though. For instance, if X is binary, we can make P\left(X\vert z;\theta\right) a Bernoulli distribution parameterized by f(z;\theta). The important property is that P\left(X\vert z\right) can be computed and is continuous in \theta.
  • The distribution of z is simply the standard normal distribution, i.e. P(z) = \mathcal{N}\left(0,I\right). Why? How is that possible? Is there any limitation to this? A related question is why we don’t use several levels of latent variables, which could potentially help in modelling complicated processes.
    All those questions can be answered by the key observation that any distribution in d dimensions can be generated by taking d variables from the standard normal distribution and mapping them through a sufficiently complicated function.
    Let that sink in for a moment. Readers who are interested in the mathematical details can have a look at the conditional distribution method described in this paper. You can also convince yourself if you remember how we can sample from any Gaussian, as described in an earlier post.
    Now, this observation means we don’t need more than one level of latent variables, on the condition that we have a sufficiently complicated function for f(z;\theta). Since deep neural nets have been shown to be powerful function approximators, it makes a lot of sense to use a deep neural net to model f.
  • Now the only remaining business is to maximize (1). Using the law of large numbers, we can approximate the integral by an average over a large number of samples. So the plan would be to take a very large sample \left\{z_1, ..., z_n\right\} from P(z), then compute P(X) \approx \frac{1}{n}\sum_i P\left(X\vert z_i;\theta\right). Unfortunately this plan is infeasible, because in high-dimensional spaces n needs to be very large to get a good enough approximation of P(X) (imagine how many samples you would need for 200 \times 200 \times 3 images, which live in a 120K-dimensional space). A naive version of this estimator is sketched right after this list.
    Now the key thing to realize is that we don’t need to sample z from all over P(z). In fact, we only need to sample z such that f(z;\theta) is likely to be similar to samples in the training set. Moreover, for most values of z, P(X\vert z) is nearly zero, and therefore contributes very little to the estimation of P(X). So the question is: is there any way to sample z such that it is likely to generate X, and estimate P(X) from those samples only?
    This is the key idea behind VAEs.
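Here is a tiny, hypothetical sketch of the naive Monte Carlo estimate from the list above, with a 1-D toy decoder f(z) = \theta z standing in for a neural net; it only works because in one dimension the curse of dimensionality has not kicked in yet:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_x_given_z(x, z, theta, sigma=0.1):
    # Gaussian output distribution N(x; f(z), sigma^2), with toy decoder f(z) = theta * z
    return -0.5 * ((x - theta * z) ** 2 / sigma ** 2 + np.log(2 * np.pi * sigma ** 2))

def naive_p_x(x, theta, n=100_000):
    z = rng.standard_normal(n)                          # z_i ~ P(z) = N(0, 1)
    return np.exp(log_p_x_given_z(x, z, theta)).mean()  # (1/n) * sum_i P(x | z_i)

print(naive_p_x(x=0.5, theta=1.0))
```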

That’s quite enough for an overview. Next time we will do some maths and see how we go about maximizing (1). Hopefully I can then convince you that VAEs, GANs and GSNs are really not far away from each other, at least in their core ideas.

Metalearning: Learning to learn by gradient descent by gradient descent

So I read the Learning to learn paper a while ago, and I was surprised that the Decoupled Neural Interfaces paper didn’t cite it. To me the ideas are pretty close: you try to predict the gradient used in each step of gradient descent, instead of computing it by backpropagation. Given that they are both from DeepMind, wouldn’t it be nice to cite each other and increase the impact of both?

Nevertheless, I enjoyed the paper. The key idea is that instead of doing the normal update \theta_{t+1} = \theta_{t} - \alpha_t \nabla f\left(\theta_t\right), we do \theta_{t+1} = \theta_{t} + g_t\left(\nabla f\left(\theta_t\right), \phi\right), where g_t is some function parameterized by \phi.

Now one can use any function approximator for g_t (called the optimizer, to distinguish it from f\left(\theta\right) – the optimizee), but using an RNN has a particularly interesting intuition: we hope that the RNN can remember the gradient history and mimic the behavior of, for instance, momentum.

The convenient thing about this framework is that the objective function for training the optimizer is the expected weighted sum of the outputs of the optimizee f\left(\theta\right). Apart from this main idea, everything else is nuts and bolts, which of course are equally important.

The first obstacle they had to solve is how to deal with big models of perhaps millions of parameters. In such cases, g_t would have to take as input and produce as output vectors of millions of dimensions. The authors solved this problem very nicely by only working with one coordinate at a time: the optimizer takes as input one element of the gradient vector and outputs the update for that element. However, since the optimizer is an LSTM, the recurrent state is maintained separately for each coordinate. This also has the nice side effect of reducing the size of the optimizer, and you can potentially re-use the optimizer for different optimizees.
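A minimal, hypothetical sketch of this coordinate-wise update loop (the optimizer_net callable below is a stand-in for the trained LSTM and is not from the paper’s code):

```python
import numpy as np

def learned_step(theta, grad, states, optimizer_net):
    """One update theta_{t+1} = theta_t + g_t(grad, phi), applied coordinate-wise.

    optimizer_net(g, state) -> (update, new_state) stands in for the LSTM
    optimizer; `states` holds one recurrent state per coordinate.
    """
    updates, new_states = [], []
    for g, s in zip(grad, states):            # one coordinate at a time
        u, s_next = optimizer_net(g, s)
        updates.append(u)
        new_states.append(s_next)
    return theta + np.asarray(updates), new_states
```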

The next two nuts and bolts are less obvious. To mimic the L2 gradient-clipping trick, they used a so-called global averaging cell (GAC), where the outgoing activations of the LSTM cells are averaged at each step across all coordinates. To mimic Hessian-based optimization algorithms, they wire the LSTM optimizer to an external memory unit, hoping that the optimizer will learn to store second-order derivatives in the memory.

Although the experimental results look pretty promising, many people have doubts about the whole idea of learning to learn. I was at the Learning to learn panel discussion at NIPS, and it wasn’t particularly fruitful (people were drinking sangria the whole time). It will be interesting to see the follow-ups on this line of work, if there are any.

 

From self-driving cars to model and dataset sizes

So I am done with teaching a vehicle to drive itself!

Errh, not quite there yet. I did it in a simulator, in an easy environment with only one lane and no other traffic. This is very far from an actual self-driving vehicle.

Nevertheless, I had a lot of fun. It was actually way easier than I initially thought: it is simply a regression problem, where a CNN is trained to predict the steering angle. A vanilla CNN with a significant amount of training data will do the job quite easily. Although it sounds simple, this is essentially how nVidia drives a car with their DAVE-2 system.

In practice, self-driving cars are a bit more complicated. For example, nVidia’s paper doesn’t show how they would handle traffic lights. I guess the Deep Learning way would be to collect a lot more data at crossroads, but I feel that would not be enough. At some point, you will need traditional engineering methods like sensor fusion to precisely locate the car on the road (more precisely than GPS can), path finding for planning, and all kinds of other signals.

However, every time I apply Deep Learning to a new domain, I learn something new. For this project, it was the following:

  • On the vehicle, there are 3 cameras: one in the middle, one on the left and one on the right. Normally you just train the CNN to map the image from the center camera to the steering angle and be done with it. However, it turns out that you can use the side cameras to teach the vehicle to recover from mistakes. For example, if the car is taking a left turn, then you can use the image from the left camera to teach it to do a softer left turn, and the image from the right camera to do a harder left turn. With this approach, during inference you only need to run the network on the center image. How much softer or harder should be determined empirically.
    You might think that you could read all 3 images at the same time and feed all three into the network, but then you would also need 3 images during inference, which might slow it down.
    In fact, the above technique is used by nVidia in their paper, and it helps the vehicle recover from mistakes, for example when it gets close to the edge of the road.
    Another data augmentation technique is to mirror the images horizontally and negate the steering angle. Using both techniques, you can augment the training set by a factor of 6 (a sketch of this is shown right after this list).
  • Inference time is crucial. In the beginning, I struggled a lot to make the model work. Then at some point I realized that it took around 0.1 seconds to evaluate the model, which might be too slow to drive a car. I then reduced the size of the model until it took about 0.01 seconds to evaluate, and the vehicle started driving smoothly.
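Here is a rough sketch of that augmentation step (the correction of 0.2 is a made-up placeholder, and the sign convention depends on how your steering angles are recorded; the images are assumed to be H x W x C arrays):

```python
def augment(center, left, right, angle, correction=0.2):
    """Turn one 3-camera sample into 6 training pairs: three viewpoints
    with adjusted steering angles, plus their horizontal mirrors."""
    images = [center, left, right]
    angles = [angle, angle + correction, angle - correction]
    images += [img[:, ::-1] for img in images]   # mirror each image left-right
    angles += [-a for a in angles[:3]]           # mirrored image -> negated angle
    return images, angles
```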

So how small (or big) should your model be? This obviously depends on the training set, but is there any rule of thumb? A related question that some people also ask me is how big the training set should be. We keep saying Deep Learning needs big datasets, but how big is big, or how big should it be before you can expect some sensible performance? I hope the rest of this post can answer those questions.

How big should the model be?

Let’s say you have a training set of N samples. If I used a simple array of bits to store those samples, I would need N bits (the first bit is ON for the first sample, and so on). More strictly, I could say I only need \log_2\left(N\right) bits to distinguish N samples, because that many bits give N different configurations.

In Deep Learning, we have long graduated from speaking in bits, but the same principle still holds. The easiest answer is that you need to construct your model so that it has N parameters to learn a training set of N samples.

That is still too lax though. Recall that a parameter in a neural net is a 32-bit floating point number, so a model of N parameters holds 32N bits in total. Shouldn’t you then only need a model of \frac{N}{32} parameters?

Not quite. Although the parameters in neural nets are floating points, their values are often small, typically in the range of -0.3 to 0.3 (depending on how you normalize the data). This is due to various tricks we apply to the nets, like careful initialization and small learning rates, in order to make optimization easier.

Since their values are restricted, probably only a few bits in each parameter carry useful information. How many? People typically think it is about 8 or 16 bits. The evidence is that when you quantize a net to low precision (8 or 16 bits), its performance doesn’t decrease much.

So, as a (wild) rule of thumb, you should be able to overfit a training set of size N with a model of \frac{N}{4} parameters. If you cannot overfit the training set, you are doing something really wrong with your initialization, learning rate or regularizer.

So you need to know how to count the number of parameters in a deep net. For fully connected layers, that is simply the size of the weight matrix plus the biases. For convolutional layers, it is the size of one filter multiplied by the number of filters, plus one bias per filter if biases are used (some setups omit them, for instance when a batch-norm layer follows, so keep that in mind if you want to be precise). A vanilla RNN can be counted similarly.

LSTMs are a bit trickier, because there are a few variants: whether peepholes are enabled, whether the forget bias is fixed, whether it is a multi-dimensional LSTM, etc., so the exact number might vary. In general, though, the number of parameters of an LSTM layer with p units and q-dimensional inputs is on the order of 4p\left(p+q\right).
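As a quick sanity-check tool, here is a small sketch of such parameter counting (standard layers only, no peepholes; treat it as a back-of-the-envelope helper rather than a framework-exact count):

```python
def dense_params(n_in, n_out, bias=True):
    # weight matrix + optional bias vector
    return n_in * n_out + (n_out if bias else 0)

def conv2d_params(k_h, k_w, c_in, c_out, bias=True):
    # one (k_h x k_w x c_in) filter per output channel, + optional bias per filter
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)

def lstm_params(q, p):
    # 4 gates, each with input weights (q*p), recurrent weights (p*p) and a bias
    return 4 * (q * p + p * p + p)

print(dense_params(784, 128))      # 100480
print(conv2d_params(3, 3, 3, 64))  # 1792
print(lstm_params(128, 256))       # 394240
```

The LSTM count matches the rough 4p(p+q) figure above once the biases are included.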

Some time ago I wrote a python script to compute the exact number of parameters in an MDLSTM cell, but looking at it now, it took me a while to understand it again.

I hope this makes the point that a key advantage of Deep Learning, compared to traditional methods, is that we can make the model as big as we want, depending on the dataset. This is not easily doable with other models like SVMs and the like.

How big should the training set be?

Using similar reasoning, you can also answer this pretty easily.

Assume that your input is an N-dimensional vector; then the maximum number of configurations in that space is 2^{32N}, which is enormous (sorry for using that word, you have Donald Trump to blame).

Of course, that is the number of distinct configurations for all possible inputs. Your input domain is likely a manifold in that high-dimensional space, meaning it probably has far fewer degrees of freedom. So let’s say 2^{N/10}.

Now you don’t need every sample in your input domain to train a deep model. As long as your input domain is relatively smooth and the training set covers the most important modes of the data distribution, the model should be able to figure out the missing regions. So again, you probably only need a fraction of those, say around 2^{N/50} samples.

For instance, with MNIST the input has 28 * 28 = 784 dimensions, so you should have around 2^{784/50} \approx 50000 samples. In fact, there are 50000 samples in the MNIST training set.
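A throwaway helper for this (very coarse) rule of thumb, just to make the arithmetic explicit:

```python
def rough_sample_count(n_dims):
    # ~2^(N/50) samples for an N-dimensional input, per the hand-wavy estimate above
    return 2 ** (n_dims / 50)

print(f"MNIST (784 dims): ~{rough_sample_count(784):,.0f} samples")   # prints roughly 52,000
```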

In general, I think the rule of thumb is around tens of thousands of samples for a typical problem, before you can expect some optimistic results.

Note that these calculations are very coarse and should only be used to build some intuition. They shouldn’t be taken as exact.

The problem is worse with time series and sequential data in general. Using the same calculation, you would end up with pretty big numbers, because you need to multiply by the length of the sequence. I don’t think the same calculation applies to sequential data, though, because the correlation between consecutive elements also plays a big role in learning, which might loosen or limit the degrees of freedom of the data. However, I used to work with small sequence datasets of around tens of thousands of samples. For difficult datasets, we might need half a million samples.

The more you work on modelling, the more you learn about it. As always, I would love to hear your experience!