# Optimization in Machine Learning, again

This is an excerpt from a tutorial on Bayesian Optimization I wrote recently. In this post, I am only gonna talk about the overall picture of optimization and parameter estimation.

This is a brief introduction to how Optimization is used in Machine Learning. The readers are expected to have some background (or at least some exposure) in Machine Learning, but haven’t got a global picture about how it all fits together. This article hopefully fills the gap.

## 1. Optimization for a layman

Optimization is essentially everywhere. Every day, almost every single decision we make is an optimization problem. Let it be as simple as going out for a meal (maximizing your appetite given a fixed budget and time), taking a day off at work (get as much stuff done as possible while avoid being fired); or as important as buying a house (maximizing your life’s leisure and productivity, given a fixed income and a proper retirement plan). Raising a kid is also one of the most challenging optimization problems probably.

Jokes aside, here we are going to deal with mathematical optimization problems, in which we need to find the value of  $x$ that maximizes (or minimizes) a function $f\left(x\right)$, given a set of constraints $g\left(x\right)$.

More formally, an optimization problem is formulated as:

$\displaystyle x^* = \arg \min_x f\left(x\right), \; \text{s.t.} \; g\left(x\right)$

Note that a value $x$ that maximizes $-f\left(x\right)$ will also maximizes the function $f\left(x\right)$, so any maximization problem can be easily converted into the form above.

The objective function $f\left(x\right)$ is typically a real-valued function, meaning it gives a single number for every value of $x$.

The constraints $g\left(x\right)$ limits the space in which we search for the optimal $x$. If no constraints are given, we have to search on the whole domain of $x$.

Now, depending on the characteristics of the variable $x$, the objective function $f\left(x\right)$ and the constraints $g\left(x\right)$, there are different classes of optimization problems:

• If the constraint is not specified, it is called unconstrained optimization. Remember how you find the stationary point of the quadratic function $f\left(x\right) = ax^2 + bx + c$ in your high school? That is unconstrained optimization. Otherwise we have constrained optimization.
• If $x$ can only take discrete values, we have discrete optimization. Moreover when the search space is finite, we call it combinatorial optimization. Remember the Travelling salesman problem? It is combinatorial.
• When the domain of $x$ is continuous, and the objective function is convex, we have convex optimization. Otherwise it is nonconvex optimization.
• Then we have linear optimization if $f\left(x\right)$ is linear, or nonlinear optimization otherwise. If the constraints $g\left(x\right)$ is linear we have linearly constrained, otherwise nonlinearly constrained.

Note that this taxonomy is based on different characteristics of the components in an optimization problem, so they are not exclusive. For instance, finding the optimum value of a quadratic function $f\left(x\right) = ax^2 + bx + c$ is a unconstrained nonlinear convex optimization problem.

But why all of this complicated taxonomy? The reason is there is no single generic optimization algorithm (called solver by optimization people) which works for all optimization problems. People find specific algorithms that work for some specific scenarios, and that’s why it is helpful to maintain a taxonomy so we know which algorithm to use.

In Machine Learning, most of the time we deal with continuous optimization, in which the variable $x$ is continuous. Moreover, the objective function $f\left(x\right)$ often have to be at least Lipschitz-continuous. Depending on the model, the objective function can be either convex (SVM) or non-convex (Neural networks). In some very rare cases (e.g. Linear Regression), the solution is in closed-form, making it easy to optimize. Most of the time, however, we assume that the objective function is relatively cheap to evaluate, which is true for most ML models. Still, there are cases when the objective function is computationally expensive, where we need special methods to deal with.

The following analogy is a great way to think about optimization in Machine Learning. I call it man-in-the-meadow (sorry for being sexist, couldn’t resist).

Imagine the surface of the objective function is like a meadow, and you are standing in the middle of it, seeking for its lowest point (because there is a good amount of fortune there). Now:

• If it is in daytime and you see clearly the lowest point, you will just head there. This is when we know the exact solution of the problem, i.e. having closed-form solution of it. Solving linear regression with the Normal Equation falls into this category.
• If it is night time, you can’t see the lowest point, but you have a torch in your hands, and you see a limited area around the location you are at, then you can make small steps downhill until you reach some low points. Now it might not get you to the real lowest point, but that’s the best you could do, right? The family of gradient descent methods falls into this category. The light of the torch is an analogy to the gradient vector. Most of ML algorithms currently are in this category, because it is common and easy to solve.
• If the torch is expensive to light (it uses expensive fuel), you can somehow light it up a bit every now and then, knowing that what you see is not very reliable, you can still make some small steps. This is where you can’t compute the exact gradient of the objective function. You can approximate the gradient by using local gradient approximation, or in some extreme cases where the gradient is intractable (Restricted Boltzmann Machines, or Energy-based models in general) because it involves intractable integrals, you have to use Markov Chain Monte Carlo to approximate the gradient. In any case, algorithms in this category tend to be expensive computationally.
• Now if you don’t even have the torch, and can’t see anything, things get a bit messier. In optimization, this is called black-box objective function. However, lucky you, you are superhuman and can teleport to any location on the meadow (and know the height of that location). In this case, you can probably teleport around and figure out some clever strategy to get to the lowest point. In optimization terminology, it means you can sample the objective function at any location you want.
• Life is hard. To do the teleport, you have to use a lot of your energy (or mana, or whatever), so you don’t want to jump around too much. In optimization, this means the objective function is expensive to sample. In this case, you can either construct a surrogate function which is cheaper to sample (surrogate-based optimization), or utilize some kinds of prior knowledge on the objective function, making smarter choices about where to sample next (Bayesian Optimization).
• As in the photo, there might be some bushes on the meadow, which means you might be careful to prevent those bushes from interfering your reasoning about the meadow. In optimization, this is called noisy objective function, in the sense that the samples you get from the objective function might be noisy. This happens quite often in physics and other fields. In order to properly optimize noisy functions, very often the algorithm has to explicitly model the noise.

That should cover almost all optimization algorithms you might encounter in Machine Learning.

But concretely in Machine Learning, what the hell are we going to optimize? Basically for supervised learning, the objective function is the difference between the model output and the target training data. How to formulate that difference, however, depends on whether you are a Bayesian or a frequentist, as discussed in the next section.

## 2. Frequentist vs. Bayesian: two schools of thoughts in Statistics, and how they invade Machine Learning

We hereby formulate the supervised learning problem in statistical Machine Learning. Before the raise of Machine Learning, this was long known as Parameter Estimation in Statistics.

Generally, given a dataset $\left\{x_i\right\}_{i=1}^N, x_i \in \mathbb{R}^d$, we assume that the data was sampled from a distribution $\displaystyle p\left(x \vert \theta\right)$, where $\theta$ denotes all the parameters of this distribution.

The distribution $\displaystyle p\left(x \vert \theta\right)$ can be as simple as a histogram ($\theta$ will be the frequencies in each bin), or a Gaussian distribution ($\theta$ will be the mean and variance), or as complicated as linear regression, logistic regression, neural networks, etc…

Now, given the dataset $\left\{x_i\right\}_{i=1}^N, x_i \in \mathbb{R}^d$ and the model $\displaystyle p\left(x \vert \theta\right)$, what are we going to optimize, in order to find the best $\theta$?

### 2.1. Frequentist: Maximum Likelihood Estimation (MLE)

Naturally we will want to find $\theta$ that maximizes $\displaystyle p\left(x_i \vert \theta\right)\;$ for all $x_i\;$ in the dataset, which means to maximize the product of all probabilities $\displaystyle p\left(x_i \vert \theta\right)$ of all the $x_i$‘s. We write it in a single formula like this

$\displaystyle \theta^* = \arg\max_\theta \prod_{i=1}^N p\left(x_i\vert \theta\right)$

(Technically, this is only true if the samples in the training set are independent and identically distributed, which is the basic assumption of all ML training algorithm).

The function $L\left(\theta \vert X \right) = \prod_{i=1}^N p\left(x_i\vert \theta\right)$ is called the likelihood function, because it represents, given the model parameters $\theta$, how likely the model generated the given training data.

Since the product of probabilities will get tiny when there are too many terms, we will optimize the logarithm of that product, which will turn into a sum of logarithms. Moreover, to make it a canonical minimization problem, we will negate the sum and obtain this formulation:

$\theta^{*}_{ML} = \arg\min_\theta \left( -\sum_{i=1}^N \log\left(p\left(x_i \vert \theta\right)\right)\right)$

The objective function being optimized is, naturally, called the Negative Log-likelihood function. This approach of maximizing the likelihood is, therefore, called Maximum Likelihood Estimation.

Obviously this is an unconstrained optimization problem. The solution of this problem depends on the nature of of the model $\displaystyle p\left(x \vert \theta\right)$:

• For simple models (e.g. Gaussian distribution), we can directly work out the analytical formula for estimating the parameters. This is called closed-form solution.
• For slightly more complicated models (e.g. Logistic regression, neural networks…), there is no analytical formula for the parameters, but since we can compute the gradient at every location, we can use gradient descent to optimize.
• For even more complicated models (e.g. Restricted Boltzmann Machines, and other Energy-based models), the Negative Log Likelihood function is intractable, hence people has to use some approximation methods, notably Markov Chain Monte Carlo, to estimate the gradient.

### 2.2. Bayesian: Maximum A-Posterior

In maximum likelihood estimation (and frequentist methods in general), we treat the model parameters as a fixed but unknown variable. Bayesian approaches, in other hands, treat the model parameters as a random variable. This sounds simple, but it condenses the essential difference between Frequentist and Bayesian, and totally changes the way we approach the problem.

Now, since we treat the model parameters $\theta$ as a random variable, let’s say it follows a distribution $p\left(\theta\right)$. This distribution represents our prior knowledge about the parameters, and it is called the prior distribution.

Naturally, using the Bayes formula, we have:

$p\left(\theta\vert x_1, x_2, ..., x_N\right) = \frac{\displaystyle \left(\prod_{i=1}^N p\left(x_i\vert \theta \right)\right) p \left(\theta\right)}{\displaystyle \int \prod_{i=1}^N p\left(x_i\vert \theta' \right) p \left(\theta'\right)d\theta'}$

This is so-called the posterior distribution: it is the full distribution of the model parameters, given the training data. Knowing the full distribution of the parameters is something very powerful that we couldn’t have in the frequentist approach. Unfortunately, this formula involves an intractable integral in the denominator. This integral has to be computed over the whole domain of the model parameters, which is often infeasible.

However, we only need to find $\theta$ that maximizes the posterior distribution, and note that the denominator doesn’t depend on any particular value of  $\theta$, hence with the Bayesian approach, we can estimate the parameters using this formulation:

$\displaystyle \theta_{MAP}^{*} = \arg\max_\theta p\left(\theta \vert x_1, x_2, ..., x_N\right) = \arg\max_\theta \left(\prod_{i=1}^N p\left(x_i\vert\theta\right)\right) p\left(\theta\right)$

This is called Maximum-a-posterior Estimation, because we are optimizing the posterior distribution. Note that this formula is very similar to the case of MLE, except that it has an additional term for the distribution $p\left(\theta\right)$.

Note that whether it is Frequentist or Bayesian, it is just a matter of which criteria we are optimizing for. In many other settings in Machine Learning, the objective functions do not belong to either Frequentist or Bayesian, but they can still be formulated as am optimization problem, and people solve it using standard optimization techniques.