Since the rise of Deep Learning, I have lagged quite a bit behind in the Object Detection domain. The state of the art circa 2010 was Deformable Part Models. After that I had no idea, mostly because I hadn't done anything serious in this area (except for HoG, which I used for some hobby-ish projects).
Turns out it has advanced quite a bit. We had a reading group this week, where I was brought up to speed on recent advances in Object Detection, and this is a recap.
One key problem in Object Detection is how to find the bounding boxes around the objects in the image. Traditionally, people used sliding windows for this. The good news is we have already graduated from that. Recent works use a region-proposal approach, where some form of heuristic proposes category-independent regions; Selective Search and CPMC are two such methods. The proposed regions are then scored by a classifier, which tells whether each one is an object or a background patch.
- Naturally you would like to use a CNN on top of the region proposals. In case you are in a hurry (rushing for a paper so you can graduate from grad school), you would just take the pre-trained AlexNet, extract features with it, and train a simple SVM for classification. No kidding, this is how R-CNN was built. With some fine-tuning, it got 53.3% mAP on VOC 2012, the best result at the time.
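The pipeline shape can be sketched in a few lines. Everything here is a stand-in: random vectors replace the AlexNet fc7 activations, and a toy hinge-loss linear classifier replaces the per-class SVMs the paper actually trains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for AlexNet features: in the real R-CNN, each warped region
# proposal is pushed through the pre-trained CNN and the 4096-d
# activations are taken as the feature vector. Random vectors here,
# for shape only.
def extract_features(n_regions, dim=4096):
    return rng.standard_normal((n_regions, dim))

# Tiny linear SVM trained by sub-gradient descent on the hinge loss
# (a toy replacement for the SVMs used in the paper).
def train_svm(X, y, lam=0.01, lr=0.01, epochs=50):
    y_pm = 2 * y - 1                      # labels in {-1, +1}
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y_pm * (X @ w)
        # sub-gradient of L2-regularized hinge loss
        grad = lam * w - (X * y_pm[:, None])[margins < 1].sum(axis=0) / len(y)
        w -= lr * grad
    return w

X = extract_features(200)                 # ~2000 proposals per image in practice
y = rng.integers(0, 2, 200)               # 1 = object, 0 = background
w = train_svm(X, y)
scores = X @ w                            # score every proposal
```

The expensive part is `extract_features`: in the real pipeline it means one full CNN forward pass per proposal, which is exactly the bottleneck discussed next.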
- But R-CNN is slow. The region proposal step can easily create ~2000 candidates per image, and for each proposed region you need to run AlexNet to extract the features. So for each image, you need to run AlexNet ~2000 times. This is quite costly.
One way to fix that is to feed the image into the convolutional layers only once, and apply the information about regions in feature space (after all the convolutions), not in image space. This is the original idea of SPPnet and Fast R-CNN (the mechanism is called RoI pooling). Using the VGG architecture and a multi-task loss function, Fast R-CNN gave better results than R-CNN.
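RoI pooling itself is simple: crop a region out of the shared feature map and max-pool it down to a fixed grid, so that regions of any size produce same-sized inputs for the classifier head. A minimal numpy sketch, assuming the RoI coordinates are already scaled down to feature-map space:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one region of a conv feature map to a fixed grid.

    feature_map: (C, H, W) array from the conv layers (run once per image).
    roi: (x0, y0, x1, y1) in feature-map coordinates (assumed already
         divided by the network stride).
    """
    x0, y0, x1, y1 = roi
    region = feature_map[:, y0:y1, x0:x1]
    C, h, w = region.shape
    out_h, out_w = output_size
    # split the region into an out_h x out_w grid and max-pool each cell
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.zeros((C, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cell = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            if cell.size:
                out[:, i, j] = cell.max(axis=(1, 2))
    return out
```

With this, ~2000 proposals cost ~2000 cheap pooling operations instead of ~2000 CNN forward passes.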
- In Fast R-CNN, the information about the regions is still hand-coded in the data fed into the network. Since we are Deep Learners, we want the network to learn the regions as well. Faster R-CNN does exactly that. By cleverly designing a Region Proposal Network (RPN) that shares its convolutional layers with the main network (check Fig. 2), the whole network can be trained end-to-end. The RPN works with k anchors: with 3 scales and 3 aspect ratios, there are k = 9 anchors, which are "scanned" over the feature maps of the image to propose regions. Each region is encoded by 4 coordinates, which are compared against the ground truth using a smooth L1 loss. This loss (called the bounding-box regressor), combined with the usual softmax loss for classification, forms the multi-task loss function to be optimized.
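The 9 anchors are just 9 reference boxes of fixed areas and aspect ratios, replicated at every feature-map position. A sketch of how they might be generated (the 3 scales / 3 ratios follow the paper; the exact `base_size` and scale values are assumptions here):

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate the k = len(scales) * len(ratios) = 9 anchor boxes
    centered at the origin, as (x0, y0, x1, y1)."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # keep the anchor area fixed at (base_size * scale)^2
            # while varying the aspect ratio h/w = ratio
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)

anchors = make_anchors()   # shape (9, 4), slid over every feature-map position
```

The RPN then predicts, for each anchor at each position, an objectness score and 4 offsets relative to that anchor.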
Btw, the smooth L1 loss is quite clever: it is the absolute-value function, smoothed around zero by the square function.
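Concretely, with the threshold at 1 as in the Fast R-CNN formulation:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss: 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise.
    Quadratic (and differentiable) near zero, linear in the tails,
    so large regression errors do not blow up the gradient like L2 would."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * ax ** 2, ax - 0.5)
```

The two branches meet at |x| = 1 with matching value (0.5) and slope (1), which is what makes the join smooth.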
Using VGG, Faster R-CNN runs at 5 fps, approaching real-time object detection.
- The whole point of object detection is how to detect the regions. We have seen various approaches, but the one used in the Single-Shot Detector (SSD) is way more "deep-learning-style". Similar to Fully Convolutional Nets (FCN), it applies a cascade of convolutional layers, all of which are combined into a long 7308-dimensional vector. At the last layer, the receptive field of each neuron corresponds to a region of the image at some specific location and scale; this, in fact, does the heavy lifting of region detection (see Fig. 2). Each unit is scored on whether it is an object or a background patch. The whole thing can again be trained end-to-end in the deep learning style. Finally, a non-maximum suppression step filters the overlapping regions, keeping, among the boxes that overlap, only the one with the maximal score.
SSD runs at 59 fps and gives 74.3% mAP.
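The non-maximum suppression step at the end is the standard greedy algorithm, and it fits in a few lines. A sketch, assuming boxes are given as (x0, y0, x1, y1) corners:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and drop every remaining box whose IoU with it exceeds
    iou_threshold. boxes: (N, 4) array of (x0, y0, x1, y1)."""
    order = np.argsort(scores)[::-1]      # indices, best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of box i with all remaining boxes
        x0 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y0 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x1 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y1 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_threshold]
    return keep
```

So two heavily overlapping detections collapse to the higher-scoring one, while well-separated detections all survive.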
- The Inside-Outside Net (ION) is another revelation. After the RoI pooling step (as in Fast R-CNN), they feed the features into a four-directional IRNN to extract a context feature, which is then combined with all the convolutional features in a "cascade" style (similar to the Single-Shot Detector), and finally evaluated with a multi-task loss function.
The four-directional IRNN is something quite clever. There are 2 RNN layers. The first RNN layer scans the whole feature map from left to right, right to left, top to bottom, and bottom to top. It processes a whole column at once (when scanning left-right) or a whole row at once (when scanning top-bottom), instead of one pixel at a time. This is to facilitate parallelization.
The second RNN layer is the same but wired differently: the output of the left-to-right scan from the first layer is fed into the top-to-bottom scan of the second layer. This feels like rotating the photo and applying the same thing again, which hopefully gives the context feature the full context around each region in the image.
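One directional pass of such an IRNN is easy to sketch. The IRNN recurrence is h_t = relu(W_hh h_{t-1} + x_t) with W_hh initialized to the identity; below the recurrent weights are kept as the identity (the degenerate case the paper also reports), and the input-to-hidden projection is assumed to be folded into the features:

```python
import numpy as np

def irnn_left_to_right(feature_map):
    """One directional IRNN pass over an (H, W, C) feature map.

    Recurrence: h = relu(h + x), i.e. identity recurrent weights + ReLU.
    All H rows are updated in parallel, one column per step, matching
    the column-at-a-time scanning described above."""
    H, W, C = feature_map.shape
    h = np.zeros((H, C))
    out = np.zeros_like(feature_map)
    for x in range(W):                                 # scan columns left to right
        h = np.maximum(h + feature_map[:, x, :], 0)    # identity recurrence + ReLU
        out[:, x, :] = h
    return out
```

The four-directional version runs this in all four directions and concatenates the outputs; with identity weights and non-negative inputs, each output position accumulates everything to its left, which is exactly the "context" being propagated.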
The IRNN thing caught my eye. It is probably the first time I have seen an IRNN being used. However, at the end of the paper, they report that even when they don't train the IRNN weights (i.e., keeping the recurrent weights as the identity matrix forever), they get approximately the same performance. This is quite curious, and I think it would be interesting to try LSTM/GRU in this architecture.
Apparently ION is the state of the art with 76.4% mAP on Pascal VOC.
That’s about it. Let me know if I missed any paper.