**TL;DR**

If you have a target distribution $p(x)$ that you want to model, but you can't compute it (maybe because it involves a partition function), then starting with an initial set of particles $\{x_i^0\}_{i=1}^n$, you can iteratively update them as follows:

$$x_i^{\ell+1} \leftarrow x_i^{\ell} + \epsilon_\ell\,\hat{\phi}^*(x_i^{\ell}), \qquad \hat{\phi}^*(x) = \frac{1}{n}\sum_{j=1}^{n}\Big[ k(x_j^{\ell}, x)\,\nabla_{x_j^{\ell}} \log p(x_j^{\ell}) + \nabla_{x_j^{\ell}} k(x_j^{\ell}, x)\Big]$$

where:

- $\epsilon_\ell$ is a small step size at iteration $\ell$
- $\hat{\phi}^*(\cdot)$ is the steepest update direction
- $k(\cdot, \cdot)$ is a positive-definite kernel, typically an RBF kernel.

At the end, this update rule gives a set of particles $\{x_i\}_{i=1}^n$ that approximates the target distribution $p$.

Note that to compute the update direction, you only need the derivative of the log of the (unnormalized) target distribution, $\nabla_x \log p(x)$, evaluated at the current set of particles (the normalization constant drops out of this gradient).
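For concreteness, here is a minimal NumPy sketch of this update with an RBF kernel. The `grad_log_p` callback and the median-style bandwidth heuristic are assumptions made for illustration, not a reference implementation from the papers below.

```python
import numpy as np

def rbf_kernel(X, bandwidth=None):
    """RBF kernel matrix K[j, i] = k(x_j, x_i) and its gradient w.r.t. x_j."""
    diff = X[:, None, :] - X[None, :, :]             # (n, n, d)
    sq_dists = np.sum(diff ** 2, axis=-1)            # (n, n)
    if bandwidth is None:
        # A common median-style heuristic for choosing the bandwidth
        bandwidth = np.sqrt(0.5 * np.median(sq_dists) / np.log(X.shape[0] + 1) + 1e-8)
    K = np.exp(-sq_dists / (2 * bandwidth ** 2))
    grad_K = -diff * K[:, :, None] / bandwidth ** 2  # d k(x_j, x_i) / d x_j
    return K, grad_K

def svgd_step(X, grad_log_p, step_size=1e-2):
    """One SVGD update. X: (n, d) particles; grad_log_p(X): (n, d) scores
    of the *unnormalized* target evaluated at the particles."""
    n = X.shape[0]
    K, grad_K = rbf_kernel(X)
    # First term pulls particles toward high-density regions of p,
    # second (repulsive) term keeps them spread apart.
    phi = (K @ grad_log_p(X) + grad_K.sum(axis=0)) / n
    return X + step_size * phi
```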

This result is significant because it is analogous to gradient descent, but one that minimizes a KL divergence. The set of particles, therefore, normally comes from another black-box model.

An interesting intuition is that the first term (the kernel-weighted gradient $\nabla_x \log p(x)$) pushes the particles into regions with high values of $p(x)$, while the second term (the derivatives of the kernel), in the case of the RBF kernel, has a "regularization" effect: it keeps the particles away from each other and prevents them from collapsing onto the same mode. This is arguably where the method improves on pure MCMC.
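A toy run (reusing the hypothetical `svgd_step` sketch above, with an arbitrarily chosen step size and iteration count) on a two-mode 1-D target illustrates this: even if all particles start between the modes, the repulsive term spreads them over both.

```python
import numpy as np

# Unnormalized 1-D mixture p(x) ∝ exp(-(x+2)²/2) + exp(-(x-2)²/2);
# only the score (gradient of log p) is needed, so the normalizer is irrelevant.
def grad_log_p(X):
    a = np.exp(-0.5 * (X + 2) ** 2)
    b = np.exp(-0.5 * (X - 2) ** 2)
    return (-(X + 2) * a - (X - 2) * b) / (a + b)

X = 0.1 * np.random.randn(100, 1)      # all particles start near 0
for _ in range(1000):
    X = svgd_step(X, grad_log_p, step_size=0.05)
# The particle histogram now covers both modes at -2 and +2
# instead of collapsing onto a single one.
```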

**The short story**

Let $x$ be a continuous random variable in $\mathcal{X} \subseteq \mathbb{R}^d$ and let $p(x)$ be the (intractable) target distribution. For a smooth vector-valued function $\phi(x) = [\phi_1(x), \dots, \phi_d(x)]^\top$, the so-called *Stein's identity* says:

$$\mathbb{E}_{x \sim p}\big[\mathcal{A}_p \phi(x)\big] = 0$$

where $\mathcal{A}_p \phi(x) = \phi(x)\,\nabla_x \log p(x)^\top + \nabla_x \phi(x)$, and $\mathcal{A}_p$ is called the *Stein operator*.
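As a quick sanity check, Stein's identity is easy to verify by Monte Carlo; here is a tiny sketch with a standard Gaussian $p$ and the (arbitrarily chosen) scalar test function $\phi(x) = \sin(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)        # samples from p = N(0, 1)
score = -x                                # ∇_x log p(x) for a standard Gaussian
stein = np.sin(x) * score + np.cos(x)     # A_p φ(x) = φ(x) ∇ log p(x) + φ'(x)
print(stein.mean())                       # ≈ 0, as Stein's identity predicts
```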

Let $q$ be another distribution. Now $\mathbb{E}_{x \sim q}\big[\mathcal{A}_p \phi(x)\big]$ will in general no longer be zero. It turns out this can be used to define a *discrepancy* between the two distributions *p* and *q*:

$$\mathbb{S}(q, p) = \max_{\phi \in \mathcal{F}} \Big[\mathbb{E}_{x \sim q}\,\mathrm{trace}\big(\mathcal{A}_p \phi(x)\big)\Big]^2$$

Meaning we consider all possible smooth functions $\phi$ in some family $\mathcal{F}$ and use the one that maximizes the violation of Stein's identity. This *maximum violation* is defined to be the *Stein discrepancy* between *q* and *p*.

Considering all possible $\phi$ is impractical. But it turns out that if we restrict $\phi$ to the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}^d$ associated with a kernel $k(\cdot, \cdot)$, then the *kernelized Stein discrepancy* is defined as:

$$\mathbb{S}(q, p) = \max_{\phi \in \mathcal{H}^d,\ \|\phi\|_{\mathcal{H}^d} \le 1} \Big[\mathbb{E}_{x \sim q}\,\mathrm{trace}\big(\mathcal{A}_p \phi(x)\big)\Big]^2$$

and this has a closed-form solution: the maximizer is proportional to $\phi^*_{q,p}(\cdot) = \mathbb{E}_{x \sim q}\big[\mathcal{A}_p k(x, \cdot)\big]$, with $\mathbb{S}(q, p) = \|\phi^*_{q,p}\|_{\mathcal{H}^d}^2$.
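For completeness, [1] also gives this closed form as a double expectation under $q$ (writing $s_p(x) = \nabla_x \log p(x)$ for the score), which is what makes the discrepancy computable from samples:

$$\mathbb{S}(q, p) = \mathbb{E}_{x, x' \sim q}\big[\kappa_p(x, x')\big],$$

$$\kappa_p(x, x') = s_p(x)^\top k(x, x')\, s_p(x') + s_p(x)^\top \nabla_{x'} k(x, x') + \nabla_x k(x, x')^\top s_p(x') + \mathrm{trace}\big(\nabla_x \nabla_{x'} k(x, x')\big).$$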

Now if we take $T(x) = x + \epsilon\,\phi(x)$ and let $q_{[T]}$ be the distribution of $T(x)$ when $x \sim q$, then [2] shows that the derivative of the KL divergence between $q_{[T]}$ and *p* has an interesting form:

$$\nabla_\epsilon\, \mathrm{KL}\big(q_{[T]} \,\|\, p\big)\Big|_{\epsilon = 0} = -\,\mathbb{E}_{x \sim q}\Big[\mathrm{trace}\big(\mathcal{A}_p \phi(x)\big)\Big]$$

Maximizing this rate of decrease over the unit ball of $\mathcal{H}^d$ gives exactly the kernelized Stein discrepancy, but *without the square*.

Relating all of the above, the direction of steepest descent on the KL divergence is the expectation of the Stein operator applied to the kernel:

$$\phi^*_{q,p}(\cdot) = \mathbb{E}_{x \sim q}\big[\mathcal{A}_p k(x, \cdot)\big] = \mathbb{E}_{x \sim q}\big[k(x, \cdot)\,\nabla_x \log p(x) + \nabla_x k(x, \cdot)\big]$$

Approximating this expectation with the empirical distribution of the current particles gives exactly the update rule from the TL;DR.

**The long story**

The above is a huge simplification, so don't take it too seriously. If you are dying for more details, the following might help:

[1] https://arxiv.org/abs/1602.03253 introduces the kernelized Stein discrepancy.

[2] https://arxiv.org/abs/1608.04471 presents the SVGD algorithm.

[3] https://arxiv.org/abs/1611.01722 uses the algorithm to estimate the parameters of an energy-based deep neural net.