Machine Learning

Albert Einstein and random thoughts on Machine Learning


I read Einstein’s biography with as much enthusiasm as I did with Stephen Hawking’s A brief history of Time and Domigos’ The Master Algorithm. It’s not only because the book is recommended by, among others, Elon Musk, but probably more because of my childhood dream of becoming a physicist back when I was in high school. Although I was too dumb for physics, nothing could prevent me from admiring its beauty.

The book was excellent. Walter Isaacson did a great job in depicting Albert Einstein from his complicated personality to his belief, religion, politics, and, of course, his scientific achievements.

As a human being, Einstein is a typical introvert. He was always a loner, enjoyed discussing ideas more than any personal or emotional entanglements. During the most difficult periods of life, he would rather immerse into science rather than get up and do anything about his real-life struggles. To quote Isaacson, “the stubborn patience that Einstein displayed when dealing with scientific problems was equaled by his impatience when dealing with personal entanglements”, those that put “emotional burdens” on him. Some may criticise and regard him as being “cold-hearted”, but perhaps for him, it was way easier to use the analytical brain rather than the emotional brain to deal with daily mundane affairs. This, often times, resulted in what we can consider as brutal acts, like when he gave Mileva Maric a list of harsh terms in order to stay with him, or when he did almost nothing for his first kid, and let it die in Serbia. For this aspect, though, he probably deserves more sympathy than condemnation. He was surely a complicated man, and expecting him to be well-rounded in handling personal affairs is perhaps as unreasonable as impossible.

Now, it is perhaps worth emphasizing that Einstein is a one-in-a-million genius who happens to have those personality traits. It does not imply those who have those traits are genius. Correlation does not imply causation 😉

Einstein made his mark in physics back in 1905 when he challenged Newton’s classical physics. He was bold and stubborn in challenging long-standing beliefs in science that are not backed by any experimental results. Unfortunately during his 40s, quantum physics made the same rationale, and he just couldn’t take it, although he contributed a great amount of results in its development (to this he amusingly commented: a good joke should not be repeated too often). His quest to look for a unified field theory that can explain both gravitational field and electromagnetic field by a consise set of rule was failed, and eventually quantum mechanics, with a probabilistic approach, was widely accepted. This saga tells us a lot:

  • The argument Einstein had with the rest of physicists back in 1910s on his breakthrough in relativity theory was almost exactly the same with the argument Neils Bohr had in 1930s on quantum theory, except that in 1930s, Einstein was on the conservative side. In 1910s, people believed time is absolute, Einstein shown that was wrong. In 1930s, Neils Bohr used probabilistic models to describe subatomic world, while Einstein resisted, because he didn’t believe Nature was “playing dice”.
    Perhaps amusingly, one can draw some analogies in Machine Learning. Einstein’s quest to describe physics in a set of rules sounds like Machine Learners trying to build rule-based systems back in 1980s. That effort failed and probabilistic models took advantages until today. The world is perhaps too complicated to be captured in a deterministic systems, while looking at it from a different angle, probability provides a neat mathematical framework to describe uncertainties that Nature seems to carry. While it seems impossible to describe any complicated system deterministically, it is perhaps okay to describe them probabilistically, although it might not explain how the system was created in the first place.
  • During the 1930s, in a series of lonely, depressing attempts to unify field theories, Einstein sounds a lot like… Geoff Hinton who attempted to explain how the human brain works. Actually, those are perhaps not too far from each other. The brain is eventually the 3-pound universe of mankind, and completely understanding the brain is probably as hard as understanding the universe.
  • Being a theorist his whole life, Einstein’s approach to physics is quite remarkable. He never started from experimental results, but often drawn insights at the abstract level, then proceed with intuitive thought experiments, and then went on with rigorous mathematical frameworks. He would often end his papers with a series of experimental studies that could be used to confirm his theory. This top-down approach is very appealing and became widely adopted in physics for quite a long time.
    On the contrary, many researches in Machine Learning are often bottom-up. Even the master algorithm proposed in Domigos’ book is too bottom-up to be useful. Computer Science, after all, is an applied science in which empirical results are often too emphasized. In particular, Machine Learning research are heavily based on experiments, and theories that justify those experiments often came long after, if there was any. To be fair, there are some work that come from rigorous mathematical inference, like LSTM, SELU and similar ideas, but a lot of breakthroughs in the field are empirical, like Convolutional nets, GANs and so on.
    Looking forward, drawing insights from Neuroscience is probably a promising way of designing Machine Learning systems in a top-down fashion. After all, human brain is the only instance of general intelligence that we known of by far, and the distribution of those generally intelligent devices might be highly skewed and sparse, hence drawing insights from Neuroscience is perhaps our best hope.
  • The way Einstein became an international celebrity was quite bizarre. He officially became celebrity after he paid visits to America for a series of fund-raising events for a Zionist cause in Israel. The world at the time was heavily divided after World War I, and the media was desperately looking for a international symbol to heal the wounds. Einstein, with his self-awareness, twinkling eyes and a good sense of humour, was too ready to become one. American media is surely good in this business, and the rest became history.
  • Einstein’s quest of understanding universe bears a lot of similarities with Machine Learner’s quest of building general AI systems. However, while computer scientists are meddling with our tiny superficial simulations on computers, physicists are looking for ways to understand the universe. Putting our work along side physicists’, we should probably feel humbled and perhaps a bit embarrassing.

It was amazing and refreshing to revise Einstein’s scientific journey about 100 years ago, and with a bit of creativity, one could draw many lessons that are still relevant to the research community today. Not only that, the book gives a well-informed picture about Einstein as a human, with all flaws and weaknesses. Those flaws do not undermine his genius, but on the contrary, make us readers respect him even more. Therefore Einstein is, among others, an exemplar for how much an introvert can contribute to the humankind.

For those of us who happen to live in Berlin, any time you sit in Einstein Kaffee and sip a cup of delighting coffee, perhaps you could pay some thoughts to the man who lived a well-lived life, achieved so much and also missed so much (although the Kaffe itself has nothing to do with Einstein). Berlin, after all, is where Einstein spent 17 years of his life. It is where he produced the general relativity theory – the most important work in his career, it is the only city he considered to be home throughout his bohemian life.

[RL4a] Policy Optimization

I thought I would write about some theory behind Reinforcement Learning, with eligibility traces, contraction mapping, POMDP and so on, but then I realized if I go down that rabbit hole, I would probably never finish this. So here are some practical stuff that people are actually using these days.

Two main approaches in model-free RL are policy-based, where we build a model of the policy function, and value-based, where we learn the value function. For many practical problems, the structure of the (unknown) policy function might be easier to learn than the (unknown) value function, in which case it makes more sense to use policy-based methods. Moreover, with policy-based methods, we will have an explicit policy, which is the ultimate goal of Reinforcement learning to control (the other type being learn-to-predict, more on that some other time). With value-based methods, like DQN, we will need to do an additional inference step to get the optimal policy.

The hybrid of policy-based and value-based is called Actor-Critic methods, which hopefully will be mentioned in another post.

One of the most straightforward approach in policy-based RL is, unsurprisingly, evolutionary algorithms. In this approach, a population of policies is maintained and evolutionized over time. People show that this works pretty well, e.g. for the Tetris game. However, due to the randomness, this is apparently only applied to problems where the number of parameters of the policy function is small.

A big part of policy-based methods, however, is based on Policy Gradient, where an exact estimate of the gradient of the expected future reward can be computed. Roughly speaking, there is an exact formulation for the gradient of the policy, which we can then use to optimize the policy. Since Deep Learning people basically worship gradients (pun intended), this method suites very well and became trendy about 2 years ago.

So, what is it all about? It is based on a very simple trick called Likelihood Ratio.


Kalman filters (and how they relate to HMMs)

Kalman filters are insanely popular in many engineering fields, especially those involve sensors and motion tracking. Consider how to design a radar system to track military aircrafts (or warships, submarines, … for that matter), how to track people or vehicles in a video stream, how to predict location of a vehicle carrying a GPS sensor… In all these cases, some (advanced) variation of Kalman filter is probably what you would need.

Learning and teaching Kalman filters is therefore quite challenging, not only because of the mere complexity of the algorithms, but also because there are many variations of them.

With a Computer Science background, I encountered Kalman filters several years ago, but never managed to put them into the global context of the field. I had chances to look at them again recently, and rediscovered yet another way to present and explain Kalman filters. It made a lot of sense to me, and hopefully it does to you too.

Note that there are a lot of details missing from this post (if you are building a radar system to track military aircrafts, look somewhere else!). I was just unhappy to see many introductory material on Kalman filters are either too engineering or too simplified. I want something more Machine Learning-friendly, so this is my attempt.

Let’s say you want to track an enemy’s aircraft. All you have is a lame radar (bought from Russia probably) which, when oriented properly, will give you a 3-tuple of range, angle and angular velocity [r \;\phi\;\dot{\phi}]^{T} of the aircraft. This vector is called the observation \mathbf{z}_k (subscript k because it depends on time). The actual position of the aircraft, though, is a vector in cartesian coordinates \mathbf{x}_k = [x_1\;x_2\;x_3]^{T}. Since it is an enemy’s aircraft, you can only observe \mathbf{z}_k, and you want to track the state vector \mathbf{x}_k over time, every time you receive a measurement \mathbf{z}_k from the radar.

Visualised as a Bayesian network, it looks like this:

Untitled Diagram (1)

With all the Markov properties hold, i.e. \mathbf{x}_k only depends on \mathbf{x}_{k-1} and \mathbf{z}_k only depends on \mathbf{x}_k, does this look familiar?


Variational Autoencoders 3: Training, Inference and comparison with other models

Variational Autoencoders 1: Overview
Variational Autoencoders 2: Maths
Variational Autoencoders 3: Training, Inference and comparison with other models

Recalling that the backbone of VAEs is the following equation:

\log P\left(X\right) - \mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log P\left(X\vert z\right)\right] - \mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right]

In order to use gradient descent for the right hand side, we need a tractable way to compute it:

  • The first part E_{z\sim Q}\left[\log P\left(X\vert z\right)\right] is tricky, because that requires passing multiple samples of z through f in order to have a good approximation for the expectation (and this is expensive). However, we can just take one sample of z, then pass it through f and use it as an estimation for E_{z\sim Q}\left[\log P\left(X\vert z\right)\right] . Eventually we are doing stochastic gradient descent over different sample X in the training set anyway.
  • The second part \mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right] is even more tricky. By design, we fix P\left(z\right) to be the standard normal distribution \mathcal{N}\left(0,I\right) (read part 1 to know why). Therefore, we need a way to parameterize Q\left(z\vert X\right) so that the KL divergence is tractable.

Here comes perhaps the most important approximation of VAEs. Since P\left(z\right) is standard Gaussian, it is convenient to have Q\left(z\vert X\right) also Gaussian. One popular way to parameterize Q is to make it also Gaussian with mean \mu\left(X\right) and diagonal covariance \sigma\left(X\right)I, i.e. Q\left(z\vert X\right) = \mathcal{N}\left(z;\mu\left(X\right), \sigma\left(X\right)I\right), where \mu\left(X\right) and \sigma\left(X\right) are two vectors computed by a neural network. This is the original formulation of VAEs in section 3 of this paper.

This parameterization is preferred because the KL divergence now becomes closed-form:

\displaystyle \mathcal{D}\left[\mathcal{N}\left(\mu\left(X\right), \sigma\left(X\right)I\right)\vert\vert P\left(z\right)\right] = \frac{1}{2}\left[\left(\sigma\left(X\right)\right)^T\left(\sigma\left(X\right)\right) +\left(\mu\left(X\right)\right)^T\left(\mu\left(X\right)\right) - k - \log \det \left(\sigma\left(X\right)I\right) \right]

Although this looks like magic, but it is quite natural if you apply the definition of KL divergence on two normal distributions. Doing so will teach you a bit of calculus.

So we have all the ingredients. We use a feedforward net to predict \mu\left(X\right) and \sigma\left(X\right) given an input sample X draw from the training set. With those vectors, we can compute the KL divergence and \log P\left(X\vert z\right), which, in term of optimization, will translate into something similar to \Vert X - f\left(z\right)\Vert^2.

It is worth to pause here for a moment and see what we just did. Basically we used a constrained Gaussian (with diagonal covariance matrix) to parameterize Q. Moreover, by using \Vert X - f\left(z\right)\Vert^2 for one of the training criteria, we implicitly assume P\left(X\vert z\right) to be also Gaussian. So although the maths that lead to VAEs are generic and beautiful, at the end of the day, to make things tractable, we ended up using those severe approximations. Whether those approximations are good enough totally depend on practical applications.

There is an important detail though. Once we have \mu\left(X\right) and \sigma\left(X\right) from the encoder, we will need to sample z from a Gaussian distribution parameterized by those vectors. z is needed for the decoder to reconstruct \hat{X}, which will then be optimized to be as close to X as possible via gradient descent. Unfortunately, the “sample” step is not differentiable, therefore we will need a trick call reparameterization, where we don’t sample z directly from \mathcal{N}\left(\mu\left(X\right), \sigma\left(X\right)\right), but first sample z' from \mathcal{N}\left(0, I\right), and then compute z = \mu\left(X\right) + \mu\left(X\right)Iz'. This will make the whole computation differentiable and we can apply gradient descent as usual.

The cool thing is during inference, you won’t need the encoder to compute \mu\left(X\right) and \sigma\left(X\right) at all! Remember that during training, we try to pull Q to be close to P\left(z\right) (which is standard normal), so during inference, we can just inject \epsilon \sim \mathcal{N}\left(0, I\right) directly into the decoder and get a sample of X. This is how we can leverage the power of “generation” from VAEs.

There are various extensions to VAEs like Conditional VAEs and so on, but once you understand the basic, everything else is just nuts and bolts.

To sum up the series, this is the conceptual graph of VAEs during training, compared to some other models. Of course there are many details in those graphs that are left out, but you should get a rough idea about how they work.


In the case of VAEs, I added the additional cost term in blue to highlight it. The cost term for other models, except GANs, are the usual L2 norm \Vert X - \hat{X}\Vert^2.

GSN is an extension to Denoising Autoencoder with explicit hidden variables, however that requires to form a fairly complicated Markov Chain. We may have another post  for it.

With this diagram, hopefully you will see how lame GAN is. It is even simpler than the humble RBM. However, the simplicity of GANs makes it so powerful, while the complexity of VAE makes it quite an effort just to understand. Moreover, VAEs make quite a few severe approximation, which might explain why samples generated from VAEs are far less realistic than those from GANs.

That’s quite enough for now. Next time we will switch to another topic I’ve been looking into recently.

Variational Autoencoders 2: Maths

Variational Autoencoders 1: Overview
Variational Autoencoders 2: Maths
Variational Autoencoders 3: Training, Inference and comparison with other models

Last time we saw the probability distribution of X with a latent variable z as follows:

\displaystyle P(X) = \int P\left(X\vert z; \theta\right)P(z)dz  (1)

and we said the key idea behind VAEs is to not sample z from the whole distribution P\left(z\right), but actually from a simpler distribution Q\left(z\vert X\right). The reason is because most of z will likely to give P\left(X\vert z\right) close to zero, and therefore making little contribution to the estimation of P\left(X\right). Now if we sample z \sim Q\left(z\vert X\right), those values of z will more likely to generate X in the training set. Moreover, we hope that Q will has less modes than P\left(z\right), and therefore easier to sample from. The intuition of this is the locations of the modes of Q\left(z\vert X\right) depends on X, and this flexibility will compensate the limitation of the fact that Q\left(z\vert X\right) is simpler than P\left(z\right).

But how Q\left(z\vert X\right) can help with modelling P\left(X\right)? If z is sampled from Q, then using f we will get E_{z \sim Q}P\left(X\vert z\right). We will then need to show the relationship of this quantity with P\left(X\right), which is the actual quantity we want to estimate. The relationship between E_{z \sim Q}P\left(X\vert z\right) and P\left(X\right) is the backbone of VAEs.

We start with the KL divergence of Q\left(z\vert X\right) and P\left(z\vert X\right):

\mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log Q\left(z\vert X\right) - log P\left(z\vert X\right)\right]

The unknown quantity in this equation is P\left(z\vert X\right), but at least we can use Bayes rule for it:

\mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log Q\left(z\vert X\right) - log P\left(X\vert z\right) - \log P\left(z\right)\right] + \log P\left(X\right)

Rearrange things a bit, and apply the definition of KL divergence between Q\left(z\vert X\right) and P\left(z\right), we have:

\log P\left(X\right) - \mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log P\left(X\vert z\right)\right] - \mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right]    (2)

If you forget everything, this formula is the thing you should remember. It is therefore important to understand what it means:

  • The left-hand-side is exactly what we want to optimize, plus an error term. The smaller this error term is, the better we are in mazimizing P\left(X\right). In other words, the left-hand-side is a lower-bound of what we want to optimize, hence the name variational (Bayesian).
  • If Q happens to be a differentiable function, the right-hand-side is something we can optimize with gradient descent (we will see how to do it later). Note that the right-hand-side happens to take the form of encoder and decoder, where Q encodes X into z, and then P decodes z to reconstruct X, hence the name “Autoencoder”. However, VAEs don’t really belong to the family of Denoising and Sparse Autoencoders, although there are indeed some connections.
  • Note that P\left(z\vert X\right) on the left hand side is something intractable. However, by maximizing the left hand side, we simultaneously minimize \mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right], and therefore pull Q\left(z\vert X\right) closer to P\left(z\vert X\right). If we use a flexible model for Q, then we can use Q as an approximation for P\left(z\vert X\right). This is a nice side effect of the whole framework.

Actually the above maths existed way before VAEs. However the trick was to use a feedforward network for Q, which gave rise to VAEs several years ago.

Next time, we will see how to do that, and hopefully conclude this series. Then we can move on with something more interesting.

Seq2Seq and recent advances

This is the slides I used for a talk I did recently in our reading group. The slides, particularly the Attention part, was based on one of Quoc Le’s talks on the same topic. I couldn’t come up with any better visual than what he did.

It has been quite a while since the last time I look at this topic, unfortunately I never managed to fully anticipate its beauty. Seq2Seq is one of those simple-ideas-that-actually-work in Deep Learning, which opened up a whole lot of possibilities and enabled many interesting work in the field.

A friend of mine did Variational Inference for his PhD, and once he said Variational Inference is one of those mathematically-beautiful-but-don’t-work things in Machine Learning.

Indeed, there are stuff like Variational, Bayesian inference, Sum-Product Nets etc… that come with beautiful mathematical frameworks, but don’t really work at scale, and stuff like Convolutional nets, GANs, etc.. that are a bit slippery in their mathematical foundation, often empirically discovered, but work really well in practice.

So even though many people might not really like the idea of GANs, for example, but given this “empirical tradition” in Deep Learning literature, probably they are here to stay.

Self-driving cars, again

This is my second take on Self-driving cars, a bit more serious than last time. You might be surprised to know that it is a combination of many old-school stuff in Computer Vision and Machine Learning like Perspective Transform, thresholding, Image warping,  sliding windows, HoG, linear SVM, etc…

Three months ago I kept wondering how would Self-driving cars work in Vietnam.

Now I am certain that it will never work, at least for the next 20 years (in Vietnam or in India, for that matter).