Multilingual acoustic models for speech recognition

Some weeks ago, Google announced support for Vietnamese voice and handwriting search. It was quite a surprise for me, since I didn’t think the Vietnamese market was worth that much attention from Google. Okay, there are more than 90 million people speaking Vietnamese, but the number of Google users speaking Vietnamese is certainly much smaller, and many of them probably prefer English when making queries. Anyway, Google seems to want to stay ahead of its brand-new rival in the Vietnamese search market. Speech and handwriting recognition is something that rival could not afford, at least in the near future.

But still, I was wondering how Google could support Vietnamese speech and handwriting recognition in such a short time. I don’t think Google would spend lots of money to collect big datasets for Vietnamese (no such dataset is available so far), then train some models (whatever the heck they are) and deploy them just for the Vietnamese market.

Well, it turns out that Google can do that using recent advances in deep learning. This paper, presented at ICASSP 2013, shows how one can use data from several languages to train shared high-level features, and then perform discriminative training for each language independently. That is, one big neural network has multiple output layers (one per language), all sharing the same hierarchy of lower layers. All the output layers benefit from the shared features, which are learned so that they are helpful across languages. Figure 2 in the paper clearly illustrates this strategy. This is called multilingual training, and it was actually one of the main themes at ICASSP 2013. Using this kind of technique, we can build a recognition system that is reasonably good even when little labelled data is available for a language. Of course, it would be ideal to have a lot of labelled data; in that case we could use the usual techniques for acoustic modeling. However, collecting labelled data is expensive (even for companies like Google), hence multilingual training seems to be a promising approach when we don’t have much data. This is actually a special case of so-called multitask learning.
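To make the architecture concrete, here is a minimal sketch of the forward pass of such a network in plain Python. It is only my illustration of the idea, not the code from the paper: the class name `MultilingualNet`, the layer sizes, the tanh activation, the language codes and the number of output units per language are all assumptions, and the weights are random rather than trained.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weight matrix (n_out rows of n_in weights) plus zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def affine(layer, x):
    """Compute W x + b for one layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

class MultilingualNet:
    """Shared hidden layers with one softmax output layer per language (hypothetical sketch)."""

    def __init__(self, n_features, hidden_sizes, outputs_per_language, seed=0):
        rng = random.Random(seed)
        # Hidden layers shared by every language.
        self.shared = []
        n_in = n_features
        for n_hidden in hidden_sizes:
            self.shared.append(make_layer(n_in, n_hidden, rng))
            n_in = n_hidden
        # One language-specific output layer on top of the last shared layer.
        self.heads = {lang: make_layer(n_in, n_out, rng)
                      for lang, n_out in outputs_per_language.items()}

    def hidden(self, x):
        """Shared features: identical computation whatever the language."""
        h = x
        for layer in self.shared:
            h = [math.tanh(v) for v in affine(layer, h)]
        return h

    def predict(self, x, language):
        """Posterior over that language's output classes for one input frame."""
        return softmax(affine(self.heads[language], self.hidden(x)))

# Assumed sizes: 40 input features, two hidden layers, three languages.
net = MultilingualNet(n_features=40, hidden_sizes=[64, 64],
                      outputs_per_language={"en": 30, "fr": 25, "vi": 20})
frame = [0.01 * i for i in range(40)]      # a made-up acoustic frame
posteriors_vi = net.predict(frame, "vi")   # 20 probabilities that sum to 1
```

In training (not shown), a frame from a given language would presumably update only that language’s output layer plus the shared layers, which is how a low-resource language gets to reuse features learned mostly from the others.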

There is no evidence that Google actually used this technique for Vietnamese voice search. However, given that the Vietnamese dataset collected by Google is quite small (500 people), it seems likely. With this approach, Google would not have needed to spend much time and effort to support Vietnamese.

There are some open questions about multilingual training though. I am wondering if there is any constraint on the languages that can be used together in the multilingual setting. For instance, European languages (English, French, Spanish, Italian…) are quite similar to each other, so it is probably possible to combine them, because the shared high-level hidden features are likely to be helpful for all of them. But what if we combine European languages with a totally different language, like Chinese, Japanese or Arabic? The shared features learned on European languages might not be very useful for Chinese, so jointly training one model for European languages and Chinese may not be a very good idea. This should be tested anyway.

Another important message from the speech recognition work published at ICASSP is that when one uses a deep neural network for speech recognition, low-level filter-bank features turn out to be much more helpful than MFCCs. MFCCs have been widely used for well over a decade in HMM-based recognition systems, but people showed that deep neural networks work better with filter-bank features. That might be due to the fact that filter-bank features are low-level and do not throw away as much of the information in the speech signal as MFCCs do. This once more seems to give evidence for the ability of deep neural nets to learn high-level features automatically.
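The information loss is easy to see numerically. In the textbook recipe, MFCCs are obtained by applying a DCT to the log filter-bank energies and keeping only the first few coefficients. The sketch below uses made-up log energies for 40 filters and keeps 13 coefficients (both numbers are common choices, but here they are my assumptions), then checks how well the filter-bank values can be reconstructed.

```python
import math

def dct2(x):
    """Unnormalised DCT-II of a sequence."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(n)]

def idct2(c, n):
    """Inverse of dct2 for length n; coefficients beyond len(c) are treated as zero."""
    return [(c[0] / 2 + sum(c[k] * math.cos(math.pi * k * (i + 0.5) / n)
                            for k in range(1, len(c)))) * 2 / n
            for i in range(n)]

def max_abs_error(a, b):
    return max(abs(u - v) for u, v in zip(a, b))

# Made-up log filter-bank energies for 40 filters (not real speech).
log_fbank = [math.log(1.0 + (i % 7) + 0.5 * math.sin(i)) for i in range(40)]

full = dct2(log_fbank)   # all 40 coefficients: nothing is lost
mfcc = full[:13]         # MFCC-style truncation to the first 13 coefficients

err_full = max_abs_error(log_fbank, idct2(full, 40))
err_truncated = max_abs_error(log_fbank, idct2(mfcc, 40))
print(err_full, err_truncated)
```

With all coefficients the reconstruction error is at the level of floating-point noise, while the truncated version cannot recover the fine detail across filters. A deep network fed with the filter-bank values directly gets to decide for itself which of that detail matters.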

Regarding Vietnamese handwriting recognition, since Google is recognizing short words written on a touch screen, this is an example of online handwriting recognition. In this case, the system has more information (the coordinates of each point in each character as it is written), hence it is considered to be “easier” than recognizing scanned handwriting. The Vietnamese alphabet is also very similar to the English and French alphabets, hence I think Google would just need to take care of some special diacritics (not that many) and a proper language model (a simple model like n-grams would do the job). That seems to be enough to give reasonable performance.
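As a rough illustration of what such a simple language model could look like, here is a character-bigram model with add-one smoothing that scores candidate words. Everything in it is an assumption of mine (the toy word list, the `^` and `$` boundary markers, the function names); I have no idea what Google actually uses.

```python
import math
from collections import Counter

def train_char_bigrams(words):
    """Count character bigrams, with ^ and $ marking word start and end."""
    contexts, bigrams = Counter(), Counter()
    for word in words:
        chars = ["^"] + list(word) + ["$"]
        for a, b in zip(chars, chars[1:]):
            contexts[a] += 1          # how often a appears as the left character
            bigrams[(a, b)] += 1
    vocab = {c for w in words for c in w} | {"^", "$"}
    return contexts, bigrams, vocab

def log_prob(word, model):
    """Log-probability of a word under the bigram model (add-one smoothing)."""
    contexts, bigrams, vocab = model
    chars = ["^"] + list(word) + ["$"]
    return sum(math.log((bigrams[(a, b)] + 1) / (contexts[a] + len(vocab)))
               for a, b in zip(chars, chars[1:]))

# Toy training list (hypothetical); a real system would use a large corpus.
model = train_char_bigrams(["việt", "nam", "tiếng", "việt", "viết", "năm"])

# Two candidates a recognizer might hesitate between: "i" versus "l".
candidates = ["việt", "vlệt"]
best = max(candidates, key=lambda w: log_prob(w, model))
```

The recognizer would combine a score like this with its own confidence for each candidate, so that an ambiguous stroke is resolved in favour of the character sequence that is plausible in Vietnamese.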



  1. Hi,
    I don’t know whether you and I are in the same major. Anyway, I have found so many interesting things in your blog (related to my project as well as the subjects I am studying). I have been living in Germany for 4 months, studying for a Master’s in Information Processing. How about you?

    1. Hi Thuy,
      My major is Machine Learning and Data Mining. Of course I am always available for discussion about common interests. After all, that’s the reason why I am maintaining this blog 😉
      Feel free to comment or drop me an email.
