Metalearning: Learning to learn by gradient descent by gradient descent

So I read the Learning to learn paper a while ago, and I was surprised that the Decoupled Neural Interfaces paper didn’t cite it. To me the ideas are pretty close: you try to predict the gradient used in each step of gradient descent, instead of computing it by backpropagation. Given that both papers are from DeepMind, wouldn’t it be nice if they cited each other and increased the impact of both?

Nevertheless, I enjoyed the paper. The key idea is that instead of doing the normal update \theta_{t+1} = \theta_{t} - \alpha_t \nabla f\left(\theta_t\right), we do \theta_{t+1} = \theta_{t} + g_t\left(\nabla f\left(\theta_t\right), \phi\right), where g_t is some function parameterized by \phi.
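To make the two update rules concrete, here is a minimal sketch in plain Python. The toy optimizee f(\theta) = \sum_i \theta_i^2 and the one-parameter function scaled_negative_gradient are my own assumptions for illustration, not anything from the paper; the only point is that plain gradient descent is the special case where g_t just rescales the negative gradient.

```python
from typing import Callable, List

Vector = List[float]


def grad_f(theta: Vector) -> Vector:
    """Gradient of the toy optimizee f(theta) = sum(theta_i^2) (an assumption)."""
    return [2.0 * t for t in theta]


def sgd_step(theta: Vector, alpha: float) -> Vector:
    """Normal update: theta - alpha * grad f(theta)."""
    return [t - alpha * g for t, g in zip(theta, grad_f(theta))]


def learned_step(theta: Vector,
                 g_t: Callable[[Vector, Vector], Vector],
                 phi: Vector) -> Vector:
    """Learned update: theta + g_t(grad f(theta), phi)."""
    update = g_t(grad_f(theta), phi)
    return [t + u for t, u in zip(theta, update)]


def scaled_negative_gradient(grad: Vector, phi: Vector) -> Vector:
    """Hypothetical g_t with a single parameter phi[0]; it reduces to plain SGD."""
    return [-phi[0] * g for g in grad]


theta0 = [1.0, -2.0]
sgd_theta = sgd_step(theta0, alpha=0.1)
learned_theta = learned_step(theta0, scaled_negative_gradient, phi=[0.1])
```

With phi = [0.1] both calls take the same step; anything more expressive than a fixed rescaling is where the learning comes in.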

Now one can use any function approximator for g_t (called the optimizer, to distinguish it from f\left(\theta\right), the optimizee), but using RNNs has a particularly interesting intuition: we hope that the RNN can remember the gradient history and mimic the behavior of, for instance, momentum.

The convenient thing about this framework is that the objective function for training the optimizer is simply the expected weighted sum of the optimizee values f\left(\theta_t\right) along the optimization trajectory. Apart from this main idea, everything else is nuts and bolts, which of course are equally important.
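Here is a rough sketch of that objective, under heavy assumptions: a single fixed toy optimizee instead of an expectation over many, a hypothetical one-parameter optimizer g, and a finite-difference slope standing in for backpropagation through the unrolled steps. It is only meant to show what "weighted sum of f along the trajectory" looks like as a loss for \phi.

```python
from typing import List


def f(theta: List[float]) -> float:
    """Toy optimizee (an assumption): f(theta) = sum(theta_i^2)."""
    return sum(t * t for t in theta)


def grad_f(theta: List[float]) -> List[float]:
    return [2.0 * t for t in theta]


def g(grad: List[float], phi: float) -> List[float]:
    """Hypothetical optimizer with one scalar parameter phi."""
    return [-phi * g_i for g_i in grad]


def meta_loss(phi: float, theta0: List[float], weights: List[float]) -> float:
    """Weighted sum of f(theta_t) over the unrolled optimization steps."""
    theta = list(theta0)
    total = 0.0
    for w_t in weights:
        theta = [t + u for t, u in zip(theta, g(grad_f(theta), phi))]
        total += w_t * f(theta)
    return total


def meta_train(phi: float, theta0: List[float], weights: List[float],
               meta_lr: float = 0.001, steps: int = 200,
               eps: float = 1e-5) -> float:
    """Gradient descent on phi; the slope is a central finite difference."""
    for _ in range(steps):
        slope = (meta_loss(phi + eps, theta0, weights)
                 - meta_loss(phi - eps, theta0, weights)) / (2.0 * eps)
        phi -= meta_lr * slope
    return phi


theta0 = [1.0, -2.0]
weights = [1.0] * 5          # w_t = 1 for each of 5 unrolled steps
phi_init = 0.1
phi_trained = meta_train(phi_init, theta0, weights)
```

For this particular quadratic the best fixed step size is 0.5 (it lands on the minimum in one step), so phi_trained should drift from 0.1 toward 0.5 and the meta loss should drop accordingly.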

The first obstacle they had to solve is how to deal with big models of perhaps millions of parameters. In such cases, g_t would have to take and produce vectors with millions of dimensions. The authors solved this problem very nicely by working with one parameter at a time, i.e. the optimizer takes as input only one element of the gradient vector and outputs the update for that element. However, since the optimizer is an LSTM, a separate hidden state is maintained for each gradient coordinate, while the LSTM weights are shared across coordinates. This also has the nice side effect of keeping the optimizer small, and you can potentially re-use the optimizer for different optimizees.
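A small sketch of the coordinatewise trick, with a big caveat: the cell below is a two-parameter momentum-like recurrence that I made up as a stand-in for the LSTM, and the optimizee is again the toy f(\theta) = \sum_i \theta_i^2. What it does show is the structure: one shared \phi, one state per coordinate.

```python
from typing import List, Tuple

Phi = Tuple[float, float]  # (decay, step_size): hypothetical shared parameters


def cell(grad_i: float, state_i: float, phi: Phi) -> Tuple[float, float]:
    """Recurrent stand-in for the LSTM, applied to ONE coordinate at a time."""
    decay, step_size = phi
    new_state = decay * state_i + grad_i      # remembers the gradient history
    return -step_size * new_state, new_state  # (update, new state)


def coordinatewise_step(theta: List[float], grads: List[float],
                        states: List[float], phi: Phi
                        ) -> Tuple[List[float], List[float]]:
    """Same phi for every coordinate, but a separate state per coordinate."""
    new_theta, new_states = [], []
    for t, g, s in zip(theta, grads, states):
        update, s_new = cell(g, s, phi)
        new_theta.append(t + update)
        new_states.append(s_new)
    return new_theta, new_states


def run(theta: List[float], phi: Phi, steps: int = 50
        ) -> Tuple[List[float], List[float]]:
    states = [0.0] * len(theta)
    for _ in range(steps):
        grads = [2.0 * t for t in theta]  # gradient of the toy optimizee
        theta, states = coordinatewise_step(theta, grads, states, phi)
    return theta, states


phi = (0.5, 0.1)
small_theta, small_states = run([1.0, -2.0, 0.5], phi)  # 3 parameters
large_theta, large_states = run([0.3] * 1000, phi)      # 1000 parameters
```

The optimizer has two parameters whether the optimizee has 3 or 1000; only the number of per-coordinate states grows. That is also why the same optimizer can, in principle, be reused across optimizees.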

The next two nuts and bolts are not so obvious. To mimic the L2 gradient clipping trick, they used the so-called global averaging cell (GAC), where the outgoing activations of LSTM cells are averaged at each step across all coordinates. To mimic Hessian-based optimization algorithms, they wired the LSTM optimizer with an external memory unit, hoping that the optimizer will learn to store the second-order derivatives in the memory.
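Why would averaging across coordinates help with L2 clipping? A coordinatewise optimizer sees one gradient element at a time, but clipping needs the norm of the whole gradient, which is just a rescaled average of per-coordinate squares. The sketch below is not the paper's GAC (no LSTM, nothing learned, and the function names are mine); it only illustrates that a single averaged signal shared by all coordinates is enough to do the clipping.

```python
import math
from typing import List


def global_average(activations: List[float]) -> List[float]:
    """Replace every coordinate's activation by the mean over all coordinates."""
    mean = sum(activations) / len(activations)
    return [mean] * len(activations)


def clip_by_l2_norm(grads: List[float], max_norm: float) -> List[float]:
    """L2 clipping where each coordinate only uses its own gradient
    plus the globally averaged squared gradient."""
    n = len(grads)
    mean_sq = global_average([g * g for g in grads])  # the shared signal
    clipped = []
    for g, m in zip(grads, mean_sq):
        norm = math.sqrt(m * n)  # recover the global L2 norm from the average
        scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
        clipped.append(g * scale)
    return clipped


big = clip_by_l2_norm([3.0, 4.0], max_norm=1.0)    # norm 5, gets rescaled
small = clip_by_l2_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5, left alone
```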

Although the experimental results look pretty promising, many people have raised doubts about the whole idea of learning to learn. I was at the panel discussion on Learning to learn at NIPS, and it wasn’t particularly fruitful (people were drinking sangria all the time). It will be interesting to see the follow-ups on this line of work, if there are any.


NIPS 2016

So, NIPS 2016, the record-breaking NIPS with more than 6000 attendees, the massive recruiting event, the densest collection of great men with huge egos, whatever you call it.

I gotta write about this. Maybe several hundred people will also write something about NIPS, so I will start with something personal, before going to the usual march through papers and ideas, you know…

One of the cool things about this NIPS is I got to listen directly to the very men who taught me so many things during the last several years. Hearing Nando de Freitas talking on stage, I could easily recall his voice, the accent when he says thee-ta (\theta ) was so familiar. Listening to Rajesh Rao talking, I couldn’t help recalling the joke with the adventurer hat he made in order to “moisturise” his Neuroscience lectures. Sorry professor, nice try, but the joke didn’t quite work.

And of course, Yoshua Bengio with his usual hard-to-be-impressed style (although he hasn’t changed much since last time we talked). Also Alex Graves whose works wowed me so many times.

One of the highlights of those days was when Jurgen Schmidhuber, with his deep, machine-generated voice, told a deep joke. The joke goes like this:

Three men were sentenced to death for inventing a technology that causes mass unemployment in certain industries. They were a French guy named LeCun, a British guy named Hinton and a German guy named Schmidhuber.

Before the execution, Death asked them: “Any last words?”
– The French guy said: Je veux … (blah blah, in French, I couldn’t get it)
– The German guy: I want to give a final speech about the history of Deep Learning!
– The British guy: Please shoot me before Schmidhuber gives his goddamn speech!

As some of my friends put it: when he can make a joke about himself, he is probably still mentally healthy (pun intended).