# A neural net telling the story behind every photo

On this blog I have previously mentioned Google's breakthrough in Machine Translation, in which an "encoder" LSTM-RNN directly "reads" a source sentence and translates it into a fixed-length feature representation, which is then fed to a "decoder" LSTM-RNN to produce the target sentence.

The idea can be applied to images, too. We can simply use a DNN to "read" an image and translate it into a fixed-length feature representation, which is then fed to the "decoder" RNN. The ability of DNNs to encode images into fixed-length feature vectors is almost indisputable, so this approach is promising.

Without further ado: it simply works, as shown in a recent paper, which was also featured in the NYTimes.

Update: This is a similar study on video http://lanl.arxiv.org/pdf/1411.4389.pdf

# LSTM for Speech recognition

LSTM-RNNs have been the state of the art in handwriting recognition for quite a long time. Now they have been shown to outperform ReLU DNNs in speech recognition as well, at least on TIMIT. The nice thing about the LSTM in this setting is that it is a much smaller architecture, with only 2 layers. The paper can be accessed at http://arxiv.org/abs/1402.1128

In other news, here is some progress on translating long sentences using neural nets: http://arxiv.org/abs/1409.0473

# Multilingual acoustic models for speech recognition

A few weeks ago, Google announced support for Vietnamese voice and handwriting search. It was quite a surprise to me, since I didn't think the Vietnamese market was worth that much attention from Google. Sure, more than 90 million people speak Vietnamese, but the number of Google users who do is certainly much smaller, and they might well prefer English when making queries. In any case, Google seems to want to stay ahead of its brand-new rival in Vietnamese search, coccoc.com. Speech and handwriting recognition is something coccoc.com cannot afford, at least in the near future.

Still, I wondered how Google could support Vietnamese speech and handwriting recognition in such a short time. I doubt Google would spend a lot of money collecting big datasets for Vietnamese (no such dataset has been available up to now), then train some models (whatever they might be) and deploy them just for the Vietnamese market.

Well, it turns out that Google can do that using recent advances in deep learning. This paper, presented at ICASSP 2013, shows how one can use data from many different languages to train shared high-level features, and then perform discriminative training for each language independently. Concretely, one builds a big neural network with multiple output layers (one per language), all sharing the same lower layers. Every output layer benefits from the shared features, which are learned so as to be generally helpful for any language. Figure 2 in the paper illustrates this strategy clearly. This is called multilingual training, and it was actually one of the main themes at ICASSP 2013. With this kind of technique, we can build a recognition system that is reasonably good even when little labelled data is available. Of course it would be ideal to have a lot of labelled data, in which case the usual acoustic-modeling techniques apply. But collecting labelled data is expensive (even for companies like Google), so multilingual training seems a promising approach when we don't have much data. It is in fact a special case of the so-called multitask learning.
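As a rough sketch of this architecture (this is not Google's actual system; the layer sizes and phone-set sizes below are made up for illustration), the forward pass of a network with shared hidden layers and one softmax output layer per language might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared hidden layers: frames from every language pass through these weights,
# so the features they compute must be useful across all languages.
n_in, n_hid = 40, 128                    # e.g. 40 filter-bank features per frame
W1 = rng.normal(0, 0.1, (n_in, n_hid))
W2 = rng.normal(0, 0.1, (n_hid, n_hid))

# One output layer (softmax head) per language, each with its own phone set.
phone_counts = {"en": 45, "fr": 38, "vi": 50}   # hypothetical phone-set sizes
heads = {lang: rng.normal(0, 0.1, (n_hid, n)) for lang, n in phone_counts.items()}

def forward(x, lang):
    """Run a batch of frames through the shared stack, then the language's head."""
    h = relu(relu(x @ W1) @ W2)          # shared multilingual features
    return softmax(h @ heads[lang])      # language-specific phone posteriors

# A batch of 8 (random, stand-in) frames scored under the Vietnamese head:
probs = forward(rng.normal(size=(8, n_in)), "vi")
print(probs.shape)   # (8, 50): one distribution over Vietnamese phones per frame
```

During training, each mini-batch would update the shared weights and only the head of the language it came from, which is how a low-resource language benefits from the others' data.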

There is no evidence that Google actually used this technique for Vietnamese voice search, but given that the Vietnamese dataset Google collected is quite small (500 speakers), it seems likely. With this approach, Google obviously did not need to spend much time or effort to support Vietnamese.

There are some open questions about multilingual training, though. I wonder whether there are constraints on which languages can be combined in the multilingual setting. For instance, Latin-alphabet languages (English, French, Spanish, Italian…) are quite similar to each other, so combining them is probably fine, since the shared high-level hidden features should be helpful across all of them. But what if we combine them with a totally different language, like Chinese, Japanese or Arabic? One might suspect that shared features learned on Latin-alphabet languages would not be very useful for Chinese, so jointly training a model on both might not be a good idea. This should be tested, in any case.

Another important message from the speech-recognition work published at ICASSP is that when one uses a deep neural network, low-level filter-bank features turn out to be much more helpful than MFCCs. MFCCs have been widely used over the last 10 years in HMM-based recognition systems, but several groups showed that deep neural networks work better with filter-bank features. That might be because filter banks are low-level features that do not throw away as much information in the speech signal as MFCCs do. This once more seems to be evidence of the ability of deep neural nets to learn high-level features automatically.
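The information loss is easy to see: MFCCs are essentially an orthonormal DCT of the log filter-bank energies truncated to the first few coefficients, so going from, say, 40 filter-bank channels to 13 MFCCs discards most of the spectral detail. A small sketch with toy data (random values standing in for real log filter-bank energies):

```python
import numpy as np

def dct2(x, n_keep):
    """Orthonormal DCT-II of each row, keeping only the first n_keep coefficients."""
    N = x.shape[-1]
    n = np.arange(N)
    k = np.arange(n_keep)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * N))          # (n_keep, N)
    scale = np.where(k == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return x @ (scale * basis).T

# Toy "log filter-bank energies": 100 frames x 40 mel channels.
rng = np.random.default_rng(0)
log_fbank = rng.normal(size=(100, 40))

mfcc = dct2(log_fbank, n_keep=13)       # the standard 13 cepstral coefficients
print(log_fbank.shape, mfcc.shape)      # (100, 40) (100, 13)
```

Feeding the DNN the 40 filter-bank values per frame keeps the information that the truncation step throws away, and lets the network learn its own decorrelating transform instead of a fixed DCT.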

Regarding Vietnamese handwriting recognition: since Google is recognizing short words written on a touch screen, this is an instance of online handwriting recognition. In this case the system has extra information, namely the coordinates of each point in each character, so the problem is considered "easier". The Vietnamese alphabet is also very similar to those of English and French, so I think Google would only need to handle some additional diacritics (not that many) and a proper language model (a simple one like n-grams would do the job). That seems enough for reasonable performance.

# Local normalization in Neural networks

Local normalization is a relatively recent technique in neural networks. Used with the ConvNN in Hinton's ImageNet 2012 paper, it helped reduce the error rate by 1 to 2%.

Conceptually, local normalization can be seen as a regularization technique for neural networks. However, instead of modifying the Backpropagation algorithm as many other techniques do, it changes the network architecture directly. It is particularly effective with unbounded non-linearities such as the rectified linear unit (ReLU), because it prevents the activation of a neuron from growing much larger than those of its neighbors.

That said, as we will see shortly, local normalization adds 4 hyper-parameters to the network, adding to the burden of hyper-parameter tuning, which is already a "nightmare" in neural network training. Every normalization layer comes with 4 hyper-parameters of its own.

This post records the formulas and corresponding derivatives of the various kinds of local normalization. Details will be written up later, when time permits.

## 1. Local response normalization

Across maps:

$\displaystyle b_{x,y}^i = \frac{a_{x,y}^i}{\displaystyle \left(k + \alpha\sum_{j=\max\left(0,i-\frac{n}{2}\right)}^{\min\left(N-1,i+\frac{n}{2}\right)}\left(a_{x,y}^j\right)^2\right)^\beta}$

Same map:

$\displaystyle b_{x,y}^i = \frac{a_{x,y}^i}{\displaystyle \left(k + \alpha\sum_{\left(u, v\right)=\left(\max\left(0,x-\frac{n}{2}\right), \max\left(0,y-\frac{n}{2}\right)\right)}^{\left(\min\left(S-1,x+\frac{n}{2}\right), \min\left(S-1,y+\frac{n}{2}\right)\right)}\left(a_{u,v}^i\right)^2\right)^\beta}$

Derivative:
Since local normalization has no parameters, we only need the derivative of the output with respect to the input. This is slightly tricky, however, because there are two components to compute:

- the derivative of $b_{x, y}^i$ with respect to $a_{x, y}^i$
- the derivative of $b_{x, y}^j$ with respect to $a_{x, y}^i$, where $b_{x, y}^j$ is an output within the neighborhood of $b_{x, y}^i$ and, of course, $j \neq i$

We have:

${\displaystyle \frac{\partial b_{x,y}^{i}}{\partial a_{x,y}^{i}}=\frac{1}{{\displaystyle \left(d_{x,y}^{i}\right)^{\beta}}}-2\alpha\beta a_{x,y}^{i}\frac{{\displaystyle b_{x,y}^{i}}}{d_{x,y}^{i}}}$

${\displaystyle \frac{\partial b_{x,y}^{j}}{\partial a_{x,y}^{i}}=-2\alpha\beta a_{x,y}^{i}\frac{{\displaystyle b_{x,y}^{j}}}{d_{x,y}^{j}}}\quad\quad\left(j\neq i\right)$

The derivative of the output with respect to $a_{x, y}^i$ is therefore the sum of the partial derivatives at every position in its local neighborhood:

$\displaystyle \frac{\partial b}{\partial a_{x,y}^i} = \displaystyle \sum_{j \in \mathcal{N}\left(i\right)}\frac{\partial b_{x,y}^j}{\partial a_{x,y}^i} = \frac{1}{\left(d_{x,y}^i\right)^\beta} - 2\alpha\beta a_{x,y}^i\sum_{j \in \mathcal{N}\left(i\right)}\frac{b_{x,y}^j}{d_{x,y}^j}$

where $d_{x,y}^i$ is shorthand for the quantity inside the parentheses of the denominator above:

$\displaystyle d_{x,y}^{i}=k+\alpha\sum_{j=\max\left(0,i-\frac{n}{2}\right)}^{\min\left(N-1,i+\frac{n}{2}\right)}\left(a_{x,y}^{j}\right)^{2}$

In an implementation of Backpropagation, what we actually need is the derivative of the cost function (call it $C$) with respect to the input $a_{x, y}^i$. Using the chain rule:

$\begin{array}{rl} \displaystyle \frac{\partial C}{\partial a_{x,y}^{i}} & \displaystyle =\sum_{j}\frac{\partial C}{\partial b_{x,y}^{j}}\frac{\partial b_{x,y}^{j}}{\partial a_{x,y}^{i}} \\ & \displaystyle =\frac{\partial C}{\partial b_{x,y}^{i}}\frac{1}{\left(d_{x,y}^{i}\right)^{\beta}}-2\alpha\beta a_{x,y}^{i}\sum_{j}\frac{\partial C}{\partial b_{x,y}^{j}}\frac{b_{x,y}^{j}}{d_{x,y}^{j}}\end{array}$

This is the final formula. It is written for the across-maps case; the same-map case is entirely analogous.
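To make the formulas concrete, here is a small numpy sketch of the across-maps case for a single $(x, y)$ position (the hyper-parameter values are just illustrative), with the backward pass implemented from the chain-rule formula above and checked against a numerical gradient:

```python
import numpy as np

def lrn_forward(a, k=2.0, alpha=1e-2, beta=0.75, n=5):
    """Across-maps local response normalization at one (x, y) position.
    a: vector of activations over the N feature maps. Returns (b, d)."""
    N = len(a)
    d = np.empty(N)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        d[i] = k + alpha * np.sum(a[lo:hi + 1] ** 2)
    return a / d ** beta, d

def lrn_backward(a, b, d, grad_b, alpha=1e-2, beta=0.75, n=5):
    """dC/da from dC/db, following the final chain-rule formula above."""
    N = len(a)
    grad_a = np.empty(N)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        # sum over all j whose neighborhood contains i (symmetric window)
        s = np.sum(grad_b[lo:hi + 1] * b[lo:hi + 1] / d[lo:hi + 1])
        grad_a[i] = grad_b[i] / d[i] ** beta - 2 * alpha * beta * a[i] * s
    return grad_a

# Check against a central-difference numerical gradient for C = sum(b).
rng = np.random.default_rng(0)
a = rng.normal(size=10)
b, d = lrn_forward(a)
grad = lrn_backward(a, b, d, np.ones_like(b))

eps = 1e-6
num = np.array([(lrn_forward(a + eps * np.eye(10)[i])[0].sum()
                 - lrn_forward(a - eps * np.eye(10)[i])[0].sum()) / (2 * eps)
                for i in range(10)])
print(np.max(np.abs(grad - num)))   # tiny: analytic and numerical gradients agree
```

Note that the term $\frac{\partial C}{\partial b^i}\frac{1}{(d^i)^\beta}$ appears inside the same loop as the neighborhood sum, since $i$ belongs to its own neighborhood.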

## 2. Local contrast normalization

While local response normalization uses the (uncentered) sum of squares over the neighborhood, local contrast normalization computes a variance-like quantity by additionally subtracting the mean $m_{x,y}^i$ of the neighborhood. This detail considerably complicates both the derivatives and the implementation.

Across maps:

$\displaystyle b_{x,y}^i = \frac{a_{x,y}^i}{\displaystyle \left(k + \alpha\sum_{j=\max\left(0,i-\frac{n}{2}\right)}^{\min\left(N-1,i+\frac{n}{2}\right)}\left(a_{x,y}^j - m_{x,y}^i\right)^2\right)^\beta}$

Same map:

${\displaystyle b_{x,y}^{i}=\frac{a_{x,y}^{i}}{{\displaystyle \left(k+\alpha\sum_{\left(u,v\right)=\left(\max\left(0,x-\frac{n}{2}\right),\max\left(0,y-\frac{n}{2}\right)\right)}^{\left(\min\left(S-1,x+\frac{n}{2}\right),\min\left(S-1,y+\frac{n}{2}\right)\right)}\left(a_{u,v}^{i}-m_{x,y}^{i}\right)^{2}\right)^{\beta}}}}$

Derivative:

As with local response normalization, the derivative has two components:

${\displaystyle \frac{\partial b_{x,y}^{i}}{\partial a_{x,y}^{i}}=\frac{1}{{\displaystyle \left(d_{x,y}^{i}\right)^{\beta}}}-2\alpha\beta\left(a_{x,y}^{i}-m_{x,y}^{i}\right)\frac{{\displaystyle b_{x,y}^{i}}}{d_{x,y}^{i}}}$

${\displaystyle \frac{\partial b_{x,y}^{j}}{\partial a_{x,y}^{i}}=-2\alpha\beta\left(a_{x,y}^{i}-m_{x,y}^{j}\right)\frac{{\displaystyle b_{x,y}^{j}}}{d_{x,y}^{j}}}\quad\quad\left(j\neq i\right)$

And the derivative of the cost function with respect to the input is:

$\begin{array}{rl}{\displaystyle \frac{\partial C}{\partial a_{x,y}^{i}}} & ={\displaystyle \sum_{j}\frac{\partial C}{\partial b_{x,y}^{j}}\frac{\partial b_{x,y}^{j}}{\partial a_{x,y}^{i}}}\\ & ={\displaystyle \frac{\partial C}{\partial b_{x,y}^{i}}\frac{1}{\left(d_{x,y}^{i}\right)^{\beta}}-2\alpha\beta\sum_{j}\left(a_{x,y}^{i}-m_{x,y}^{j}\right)\frac{\partial C}{\partial b_{x,y}^{j}}\frac{b_{x,y}^{j}}{d_{x,y}^{j}}}\end{array}$
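One subtlety behind these formulas: the mean $m_{x,y}^i$ itself depends on $a_{x,y}^i$, but that dependence drops out, because the deviations $a - m$ sum to zero over the neighborhood, leaving $\partial d^i / \partial a^i = 2\alpha\,(a^i - m^i)$, exactly the factor used above. A small numerical check of this cancellation (hyper-parameter values are illustrative), written for a 1-D across-maps neighborhood:

```python
import numpy as np

def d_of(a, i, k=2.0, alpha=1e-2, n=5):
    """Denominator base d_i = k + alpha * sum_j (a_j - m_i)^2 over the neighborhood,
    where m_i is the mean of that same neighborhood."""
    N = len(a)
    lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
    window = a[lo:hi + 1]
    m = window.mean()
    return k + alpha * np.sum((window - m) ** 2)

rng = np.random.default_rng(0)
a = rng.normal(size=10)
i, eps, alpha = 4, 1e-6, 1e-2

# Numerical derivative of d_i with respect to a_i (m_i varies with a_i too) ...
e = np.eye(10)[i]
num = (d_of(a + eps * e, i) - d_of(a - eps * e, i)) / (2 * eps)

# ... matches 2*alpha*(a_i - m_i): the dm/da term cancels since sum(a_j - m) = 0.
lo, hi = max(0, i - 2), min(9, i + 2)
m = a[lo:hi + 1].mean()
print(num, 2 * alpha * (a[i] - m))
```

Without this cancellation, every derivative would pick up an extra $-\frac{2\alpha}{n}\sum_j (a^j - m)$ correction term; since it vanishes, the formulas stay as compact as in the local response case.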