On this blog, I previously mentioned Google's breakthrough in Machine Translation, in which an "encoder" LSTM-RNN is used to directly "read" a source sentence and translate it into a fixed-length feature representation, which is then fed to another "decoder" LSTM-RNN that produces the target sentence.
The same idea can be applied to images. We can simply use a DNN to "read" an image and translate it into a fixed-length feature representation, which is then fed to the "decoder" RNN. The ability of DNNs to encode images into fixed-length feature vectors is almost indisputable, hence this approach is promising.
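To make the scheme concrete, here is a minimal sketch (not any particular paper's exact model) in PyTorch: a pretrained CNN backbone encodes the image into a single fixed-length vector, and that vector seeds the hidden state of an LSTM decoder that emits the caption word by word. All sizes (vocabulary, embedding, hidden dimensions) are placeholders.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """CNN 'reads' the image and compresses it into a fixed-length vector."""
    def __init__(self, feature_dim=256):
        super().__init__()
        resnet = models.resnet18(weights=None)  # pretrained weights optional
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop classifier
        self.fc = nn.Linear(resnet.fc.in_features, feature_dim)

    def forward(self, images):                        # images: (B, 3, H, W)
        with torch.no_grad():                         # keep the backbone frozen
            feats = self.backbone(images).flatten(1)  # (B, 512)
        return self.fc(feats)                         # fixed-length vector (B, feature_dim)

class DecoderRNN(nn.Module):
    """LSTM generates the caption, conditioned on the image features."""
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512, feature_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.init_h = nn.Linear(feature_dim, hidden_dim)
        self.init_c = nn.Linear(feature_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):            # captions: (B, T) word ids
        h0 = self.init_h(features).unsqueeze(0)       # seed the LSTM state
        c0 = self.init_c(features).unsqueeze(0)       #   with the image vector
        emb = self.embed(captions)                    # (B, T, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                       # (B, T, vocab_size) logits

# Toy forward pass with random data
encoder, decoder = EncoderCNN(), DecoderRNN()
images = torch.randn(4, 3, 224, 224)
captions = torch.randint(0, 10000, (4, 12))
logits = decoder(encoder(images), captions)
print(logits.shape)  # torch.Size([4, 12, 10000])
```

In practice the decoder would be trained with a cross-entropy loss over the next word, and at test time it would generate the caption greedily or with beam search instead of being fed ground-truth captions.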
Update: here is a similar study on video: http://lanl.arxiv.org/pdf/1411.4389.pdf