This is my second take on self-driving cars, a bit more serious than last time. You might be surprised to know that it is a combination of many old-school techniques in Computer Vision and Machine Learning, like perspective transform, thresholding, image warping, sliding windows, HoG, linear SVM, etc.

Three months ago I kept wondering how self-driving cars would work in Vietnam.

Now I am certain that they will never work, at least for the next 20 years (in Vietnam or in India, for that matter).

So I am done with teaching a vehicle to drive itself!

Errh, not quite there yet. I did it on a simulator, in an easy environment where there is only one lane, and no other traffic. This is very far from an actual self-driving vehicle.

Nevertheless, I had a lot of fun. It was actually way easier than I initially thought. It is simply a regression problem, where a CNN is trained to predict the steering angle. A vanilla CNN with a significant amount of training data would do the job quite easily. Although it sounds simple, this is essentially how nVidia drives a car with their DAVE-2 system.

In practice, a self-driving car is a bit more complicated. For example, nVidia’s paper didn’t show how they would handle traffic lights. I guess the Deep Learning way for that would be to collect a lot more data at crossroads, but I feel that would not be enough. At some point, you will need traditional engineering methods like sensor fusion to precisely locate the car on the road (more precisely than what GPS provides), path finding for planning, and all kinds of other signals.

However, every time I apply Deep Learning to a new domain, I learn something new. For this project, it is the following:

On the vehicle, there are 3 cameras: one in the middle, one on the left and one on the right. Normally you just need to train the CNN to map the image collected from the center camera to the steering angle, and be done with it. However, it turns out that you can use the side cameras to teach the vehicle to recover from mistakes. For example, if the car is taking a left turn, then you can use the image from the left camera to teach it to do a softer left turn, and the image from the right camera to teach it to do a harder left turn. With this approach, you only need the center image during inference. How much softer and harder should be determined empirically.
You might think that you can read all 3 images at the same time and feed them into the network, but that would require 3 images during inference, which might slow it down.
In fact, the above technique is used by nVidia in their paper, and it helps the vehicle recover from mistakes, for example when it is close to the edge of the road.
Another data augmentation technique is to flip the images horizontally (mirroring left and right) and negate the steering angle. Using both techniques, you can augment the training set by a factor of 6.
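To make the factor of 6 concrete, here is a minimal sketch of the two augmentations combined. It assumes images are plain lists of pixel rows, that a negative angle means steering left, and that flipping means mirroring each image left to right. The `SIDE_CORRECTION` value and the function names are hypothetical; the correction has to be tuned empirically.

```python
# Hypothetical steering correction for the side cameras; tune it empirically.
SIDE_CORRECTION = 0.2


def mirror(image):
    """Mirror an image left to right; the image is a list of pixel rows."""
    return [row[::-1] for row in image]


def augment(center, left, right, angle, correction=SIDE_CORRECTION):
    """Turn one recorded frame into six (image, steering angle) pairs.

    Assumed sign convention: negative angle = steer left, positive = right.
    """
    samples = [
        (center, angle),
        (left, angle + correction),   # left camera: steer a bit more to the right
        (right, angle - correction),  # right camera: steer a bit more to the left
    ]
    # Mirroring swaps left and right, so the steering angle changes sign.
    samples += [(mirror(img), -a) for img, a in samples]
    return samples
```

With a left turn of -0.5, the left camera image gets the softer -0.3 and the right camera image the harder -0.7, and each of the three also appears mirrored with the opposite sign.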

Inference time is crucial. In the beginning, I struggled a lot to make the model work. Then at some point I realized that it took around 0.1 seconds to evaluate the model, which might be too slow to drive a car. I then reduced the size of the model until it took 0.01 seconds to evaluate, and the vehicle started driving smoothly.

So how small (or big) should your model be? This obviously depends on the training set, but is there any rule of thumb? A related question that some people have also asked me is how big the training set should be. We keep saying Deep Learning needs big datasets, but how big is big, or how big should it be to expect some sensible performance? I hope the rest of this post could answer those questions.

How big should the model be?

Let’s say you have a training set of N samples. Now if I use a simple array of bits to store those samples, then I would need N bits to store N samples (the first bit is ON given the first sample, and so on). More strictly, I could say I only need $\log_2 N$ bits to store N samples, because I could have N different configurations with $\log_2 N$ bits.

In Deep Learning, we have long graduated from speaking in bits, but the same principle still holds. The easiest answer is that you will need to construct your model so that it has N parameters to learn a training set of N samples.

That is still too lax though. Recall that a parameter in a neural net is a 32-bit floating point number, so a model of N parameters will have $32N$ bits in total. So would you only need a model of $N/32$ parameters?

Not that strict. Although the parameters in neural nets are floating-point numbers, their values are often small, typically in the range of -0.3 to 0.3 (depending on how you normalize the data). This is due to various tricks we apply to the nets, like initialization and a small learning rate, in order to make optimization easier.

Since their values are restricted, probably only a few bits in each parameter are carrying useful information. How many is that? Typically people think it is about 8 or 16 bits. The evidence for that is that when you quantize the nets to low precision (8 or 16 bits), the performance of the net doesn’t decrease much.

So, as a typical (wild) rule of thumb, you should be able to overfit a training set of size N with a model of roughly $N/8$ parameters (assuming about 8 useful bits per parameter). If you cannot overfit the training set, you are doing something really wrong with your initialization, learning rate or regularizer.

So you need to know how to count the number of parameters in a deep net. For fully connected layers, that is simply the size of the weight matrix plus the biases. For convolutional layers, it is the size of the filter, multiplied by the number of filters. Convolutional layers are sometimes used without biases, but many implementations keep one bias per filter, so keep that in mind if you want to be very precise. The parameters of a vanilla RNN can be counted similarly.
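The counting rules above can be written down as a few small helpers. This is only a sketch: the function names are made up for illustration, and the "size of the filter" is taken to include the input channels.

```python
def dense_params(n_in, n_out, bias=True):
    """Fully connected layer: weight matrix plus optional biases."""
    return n_in * n_out + (n_out if bias else 0)


def conv2d_params(kernel_h, kernel_w, in_channels, n_filters, bias=False):
    """Convolutional layer: filter size times the number of filters,
    plus one optional bias per filter."""
    filter_size = kernel_h * kernel_w * in_channels
    return filter_size * n_filters + (n_filters if bias else 0)


def rnn_params(n_in, n_units, bias=True):
    """Vanilla RNN: input weights, recurrent weights and optional biases."""
    return n_units * (n_in + n_units) + (n_units if bias else 0)
```

For example, a fully connected layer from 784 inputs to 10 outputs has 784 * 10 + 10 = 7850 parameters, and a bank of 64 filters of size 3x3 over 3 input channels has 1728 weights (1792 with biases).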

LSTM is a bit more tricky, because there are a few variants: whether peepholes are enabled, whether the forget bias is fixed, whether it is a multi-dimensional LSTM, etc., so the exact number might vary. However, in general, the number of parameters of an LSTM layer of p units with q inputs should be on the order of $4(pq + p^2)$.

Some time ago I wrote a Python script to compute the exact number of parameters in an MDLSTM cell, but looking at it now, it took me some time to understand it.
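As a rough illustration for the plain one-dimensional case, the sketch below counts the parameters of a standard LSTM layer. It assumes four blocks (three gates plus the cell candidate), each with input weights, recurrent weights and a bias, and optional diagonal peepholes. Other variants, including MDLSTM, will differ, and the function name is hypothetical.

```python
def lstm_params(p, q, peepholes=False):
    """Parameter count of a standard LSTM layer with p units and q inputs."""
    # Input, forget and output gates plus the cell candidate: 4 blocks,
    # each with p*q input weights, p*p recurrent weights and p biases.
    n = 4 * (p * q + p * p + p)
    if peepholes:
        n += 3 * p  # one diagonal peephole vector per gate
    return n
```

With p = 128 units and q = 64 inputs this gives 98816 parameters, or 99200 with peepholes.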

I hope this points out that the key advantage of Deep Learning, compared to traditional methods, is that we can engineer the model to be as big as we want, depending on the dataset. This is not easily doable with other models like SVM and the like.

How big is the training set?

Using a similar reasoning, you could also answer this pretty easily.

Assume that your input is an N-dimensional vector, then the maximum number of configurations in that space is $2^N$ (taking each dimension as binary), which is enormous (sorry for using the word, you have Donald Trump to blame).

Of course that is the number of distinct configurations for all possible inputs. Your input domain is likely going to be a manifold in that high-dimensional space, meaning it will probably only take a tenth of that many degrees of freedom. So let’s say $2^{N/10}$.

Now you don’t need every sample in your input domain to train a deep model. As long as your input domain is relatively smooth, and the training set covers the most important modes in the data distribution, the model should be able to figure out the missing regions. So again, probably you only need a fifth of those, meaning around $2^{N/50}$ samples.

For instance in MNIST, the input is of $28 \times 28 = 784$ dimensions, so you should have around $2^{784/50} \approx 52000$ samples. In fact there are 50000 samples in the commonly used MNIST training split (60000 if you include the validation portion).
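The arithmetic can be checked with a few lines. This sketch assumes the estimate is read as $2^{N/50}$, i.e. binary dimensions, a tenth of the degrees of freedom for the manifold, and then a fifth of those for coverage. The function and parameter names are hypothetical, and the result is only as meaningful as the (very coarse) rule itself.

```python
def rough_training_set_size(n_dims, manifold_fraction=0.1, coverage_fraction=0.2):
    """Coarse rule of thumb: 2 ** (n_dims * 0.1 * 0.2) = 2 ** (n_dims / 50)."""
    return 2 ** (n_dims * manifold_fraction * coverage_fraction)


mnist_estimate = rough_training_set_size(28 * 28)
print(round(mnist_estimate))  # on the order of 50 thousand samples
```

Note how quickly the estimate explodes as the dimensionality grows, which is one more reason to treat it as intuition rather than a formula.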

In general, I think the rule of thumb would be around tens of thousands of samples for a typical problem, so that you can expect some optimistic results.

Note that those calculations are very coarse, and should only be used to give some intuition. They shouldn’t be used as an exact calculation as-is.

The problem is worse with time series and sequential data in general. Using the same calculation, you would end up with pretty big numbers because you need to multiply the numbers by the length of the sequence. I don’t think the same calculation can be applied to sequential data, because in sequences, the correlation between consecutive elements also plays a big role in learning, so that might relax or limit the degrees of freedom of the data. However, I have worked with small sequence datasets of around tens of thousands of samples. For difficult datasets, we might need half a million samples.

The more you work on modelling, the more you learn about it. As always, I would love to hear your experience!

For a ConvNet I trained recently, these are the learning curves when using the Adam optimizer with initial learning rate = 0.01:

When using the traditional SGD with initial learning rate = 0.01, momentum = 0.9, and the learning rate decayed every 3 epochs with a decay rate of 0.96, the learning curves become:

I hope you see the drastic difference. With momentum, we got a 10% error rate after 5 epochs, while with Adam, we got a ~30% error rate after 20 epochs.

Now it might happen that Adam would work better if we added fancy stuff like Batch Norm and the like to the network, which I didn’t try. However, with everything else being equal, it feels to me that Adam was a bit aggressive in decreasing the learning rate, which makes learning progress slowly after a while.

Since the learning rate is more strongly regulated in Adam, perhaps we can be more lax in setting the initial learning rate? These are the learning curves for Adam with initial learning rate = 0.1:
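For reference, here is a minimal sketch of a single Adam update for one scalar parameter, to show where the per-parameter scaling of the step comes from. The default values of `beta1`, `beta2` and `eps` are the commonly quoted ones and are assumed here; they are not necessarily what was used in the experiment above.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

On the very first step the update has magnitude close to `lr` regardless of how large the gradient is, which is one reason the initial learning rate matters so much with this optimizer.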

It went wild eventually.

But with initial learning rate = 0.001, Adam gives this:

It is much better now.

Over the years, momentum and a decaying learning rate have been my first choice for tuning the learning rate. I sometimes use Adagrad/RMSProp/Adam for cross-checking, but the best results are usually found with momentum, often with fewer training epochs.

The take-away message is that you should really tune your learning rate hard. It is still one of the most important hyper-parameters. Although methods like Adam/Adagrad might give the impression that tuning the learning rate is easy, in fact it is very problem-dependent. Momentum has more knobs to tune, but when used wisely, it is very flexible and powerful. Often you will end up in the same ballpark with any of those optimizers.
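The momentum setup described earlier (initial learning rate 0.01, momentum 0.9, decay of 0.96 every 3 epochs) can be sketched as follows. The staircase form of the decay and the function names are assumptions; the post does not say exactly how the decay was implemented.

```python
def decayed_lr(epoch, base_lr=0.01, decay_rate=0.96, decay_every=3):
    """Staircase decay: multiply the learning rate by decay_rate every 3 epochs."""
    return base_lr * decay_rate ** (epoch // decay_every)


def momentum_step(param, velocity, grad, lr, momentum=0.9):
    """Classical SGD with momentum for a single scalar parameter."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity
```

The knobs mentioned in the text are all visible here: the base learning rate, the momentum coefficient, and the decay rate and interval.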

Since the rise of Deep Learning, I have lagged quite far behind in the Object Detection domain. As far as I knew, the “state-of-the-art” circa 2010 was the Deformable Part Models. After that I had no idea, mostly because I haven’t done anything serious with this (except for HoG, which was used for some hobby-ish projects).

Turns out it advanced quite a bit. We had a reading group this week, where I was educated on recent advances in Object Detection, and this is a recap.

One key problem in Object Detection is how to detect the bounding box around the objects to be detected in the image. Traditionally, people use sliding windows for this. The good news is we have already graduated from that. Recent works use the region proposal approach, where some form of heuristics is used to propose category-independent regions. Some methods for region proposals are Selective Search, CPMC, etc. The proposed regions are then scored by a classifier to tell whether each one is an object or a background patch.

Naturally you would like to use a CNN on top of region proposals. In case you are in a hurry (rushing a paper to graduate from grad school), you would just take the pre-trained AlexNet, extract the features from AlexNet and train a simple SVM for classification. No kidding, this is how the R-CNN was built. With some fine-tuning, it got 53.3% mAP on VOC 2012, the best result at that time.

But R-CNN is slow. The Region proposals can easily create ~2000 candidates for each image. Then for each of the proposed regions, you would need to run AlexNet to extract the features. So for each image, you will need to run AlexNet ~2000 times. This is quite costly.
One way to fix that is to make sure the image is fed into the convolutional layers only once, and the information about regions is applied in the feature space (after all the convolutions), not in the image space. This is the original idea of SPPnet and Fast R-CNN (the method is called ROI pooling). Using the VGG architecture and a multi-task loss function, Fast R-CNN gave better results than R-CNN.
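To illustrate the idea of pooling a region in feature space, here is a simplified sketch of ROI max pooling on a single-channel feature map. It assumes the region is given in feature-map coordinates as a half-open box `(x1, y1, x2, y2)` that is at least as large as the output grid. The real layer also handles the mapping from image coordinates and multiple channels, which is omitted here.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of a 2D feature map into a fixed out_h x out_w grid."""
    x1, y1, x2, y2 = roi  # columns x1..x2-1, rows y1..y2-1
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        r0 = y1 + (i * h) // out_h
        r1 = max(y1 + ((i + 1) * h) // out_h, r0 + 1)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * w) // out_w
            c1 = max(x1 + ((j + 1) * w) // out_w, c0 + 1)
            # Each output cell is the max over its bin of the region.
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Whatever the size of the proposed region, the output has the same fixed shape, so the convolutional features are computed once per image and every region can reuse them.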

In Fast R-CNN, the information about the regions is somehow hand-coded in the data that is fed into the network. Since we are Deep learners, we want the network to learn the regions as well. Faster R-CNN does that. By cleverly designing a Region Proposal Network (RPN) that shares the same convolutional layers with the main network (check Fig. 2), the whole network can be trained end-to-end. The RPN module works based on k anchors. With 3 scales and 3 aspect ratios, there are k = 9 anchors, which are “scanned” over the feature maps of the image to propose the regions. For each region there are 4 numbers encoding the coordinates. Those coordinates are compared against the groundtruth using a smooth L1 loss function. This loss (for the bounding-box regressor), combined with the usual softmax loss for classification, is the multi-task loss function to be optimized.
Btw, the smooth L1 loss is quite clever. It is the absolute value function, smoothed at zero by the square function.
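Both pieces are small enough to sketch. The anchor shapes below use scales of 128, 256 and 512 pixels and aspect ratios of 1:2, 1:1 and 2:1, which are the values I recall from the Faster R-CNN paper; treat them, and the function names, as assumptions. The smooth L1 function is the per-coordinate loss, quadratic near zero and linear elsewhere.

```python
import math


def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) for each anchor; ratio = height / width,
    and each anchor keeps an area of scale ** 2."""
    anchors = []
    for s in scales:
        for r in ratios:
            anchors.append((s / math.sqrt(r), s * math.sqrt(r)))
    return anchors


def smooth_l1(x):
    """Absolute value, smoothed around zero by a square."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5
```

`make_anchors()` returns the k = 9 shapes, and `smooth_l1` is continuous at |x| = 1, where both branches give 0.5.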

Using VGG, Faster R-CNN runs at 5fps, approaching real-time object detection.

The whole point in object detection is how to detect the regions. We have seen various approaches, but the approach used in the Single-Shot detector (SSD) is way more “deep-learning-style”. Similar to the Fully Convolutional Nets (FCN), it applies a cascade of convolutional layers, whose outputs are all combined into a long 7308-dimensional vector. At the last layer, the receptive field of each neuron will be a region of the image at some specific location and scale. This, in fact, does the heavy-lifting job of region detection (see Fig. 2). Each unit will be scored on whether it is an object or a background patch. The whole thing can again be trained end-to-end in the deep learning style. Finally there will be a non-maximum suppression step for filtering the overlapping regions, where only the region with the maximal score is kept.
SSD runs at 59fps and gives 74.3% mAP.
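The final filtering step can be sketched as a greedy non-maximum suppression. This is the generic textbook version, not SSD's exact implementation; boxes are assumed to be `(x1, y1, x2, y2)` tuples and the 0.5 overlap threshold is an arbitrary choice for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the best-scoring box, drop boxes overlapping it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, with two heavily overlapping boxes and one box far away, only the higher-scoring box of the overlapping pair and the distant box survive.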

The Inside-Outside Net (ION) is another revelation. After the ROI pooling step (like in Fast R-CNN), they feed the features into a four-directional IRNN to extract a context feature, which is then combined with the convolutional features in a “cascade” style (similar to the Single-Shot detector), and the result is evaluated in a multi-task loss function.
The four-directional IRNN is quite clever. They have 2 RNN layers. The first RNN layer scans the whole feature map from left to right, right to left, top to bottom and bottom to top. It processes a whole column (when scanning left-right) or a whole row (when scanning top-bottom) at once, instead of one pixel at a time. This is to facilitate parallelization.
The second RNN layer is the same but applied differently. The output of the left-to-right scan from the first RNN layer will be fed into the top-to-bottom scan of the second layer. This feels like rotating the photo and applying the same thing again, which hopefully gives the context feature the full context around each region in the image.
The IRNN thing caught my eye. It is probably the first time I have seen IRNN being used. However, at the end of the paper, they said that even when they don’t train the IRNN weights (i.e. keeping the recurrent weights as the identity matrix forever), it gives approximately the same performance. This is quite curious, and I think it would be interesting to try out LSTM/GRU for this architecture.
Apparently ION is the state-of-the-art with 76.4% mAP on Pascal VOC.

That’s about it. Let me know if I missed any paper.

On this blog, I once mentioned Google’s breakthrough in Machine Translation, in which an “encoder” LSTM-RNN is used to directly “read” a source sentence and translate it into a fixed-length feature representation, and those features are then fed to another “decoder” LSTM-RNN to produce the target sentence.

The idea can be applied for images, too. We can simply use a DNN to “read” an image, translate it into a fixed-length feature representation which will then be used to feed the “decoder” RNN. The ability of DNN in encoding images into fixed-length feature vectors is almost indisputable, hence this approach is promising.

Without further ado, it simply works, as shown in a recent paper, which is also featured on NYTimes.

Update: This is a similar study on video http://lanl.arxiv.org/pdf/1411.4389.pdf

You might be familiar with the Raspberry Pi, a Single-Board Computer (SBC) made for hobbyists and DIYers.

Now the SBC market gets somewhat more exciting with Intel’s Minnowboard Max, which runs a 1.91 GHz Atom processor, at only $99.

But the most interesting one is nVidia’s Jetson TK1, which features an ARM Cortex A15 CPU and a Kepler-class GPU with 192 CUDA cores, at $192. This means your CUDA code can now run efficiently on a 127 mm x 127 mm board.