Let us, for simplicity, assume that the event space $X$ is finite, with $K$ elements; to fix notation, say $X = \{1, \ldots, K\}$.
Suppose we have a probabilistic model on this space, that is, a parametrised probability distribution, which I denote by \begin{equation} p_θ(x) = p(x|θ) \end{equation}
Suppose now that we observe data. That means points $x_1,\ldots,x_N \in X$.
Likelihood
A natural idea is to try to maximise the likelihood that this data appears. Perhaps a better way to formulate this is that we want to minimise the surprise that this data occurs, given our probability model. We'll see that there is a very good definition of this “surprise”.
The likelihood of the data is just: \begin{equation} p(x_1|θ) p(x_2|θ) \cdots p(x_N|θ) \end{equation} It is more convenient to consider instead the log-likelihood: \begin{equation} \log(p(x_1|θ)) + \cdots + \log(p(x_N|θ)) \end{equation}
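As a quick numerical illustration (just a sketch; the categorical model and its parameters below are made up for the example), the likelihood and log-likelihood can be computed directly:

```python
import numpy as np

# Hypothetical model: a categorical distribution on X = {1, 2, 3}
# with made-up parameters theta = (0.2, 0.5, 0.3).
theta = np.array([0.2, 0.5, 0.3])

# Hypothetical observed data points x_1, ..., x_N.
data = np.array([2, 2, 1, 3, 2, 1, 2, 3])

likelihood = np.prod(theta[data - 1])              # p(x_1|theta) * ... * p(x_N|theta)
log_likelihood = np.sum(np.log(theta[data - 1]))   # sum of the logs

print(likelihood, log_likelihood)
print(np.isclose(np.log(likelihood), log_likelihood))  # True
```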
But now for a key observation: we can group together the terms corresponding to the same observed value. So, for instance, for the value three, let us write \[ [x = 3] := \# \{i \mid x_i = 3\}, \] the number of observations equal to three.
Now the log-likelihood becomes \begin{equation} [x = 1]\log(p(x=1|θ)) + \cdots + [x = K]\log(p(x=K|θ)) \end{equation}
Here comes the insight: the really important information here is carried by the empirical distribution, defined as \begin{equation} q(x = i) := \frac{[x = i]}{N} \end{equation}
That is simply the observed frequency of each value.
And now the log-likelihood, rescaled by $1/N$, takes a very interesting form: \begin{equation} q(x=1) \log(p(x=1|θ)) + \cdots + q(x=K) \log(p(x=K| θ)) \end{equation}
In other words, it so happens that the rescaled log-likelihood is just the empirical mean of the log of the model:
\begin{equation} E_q [\log(p_θ)] \end{equation}
This is quite beautiful!
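Here is a small check of this identity, continuing the made-up categorical example from above: the log-likelihood divided by $N$ coincides with $E_q[\log(p_θ)]$ computed from the empirical distribution.

```python
import numpy as np

theta = np.array([0.2, 0.5, 0.3])          # hypothetical model parameters on X = {1, 2, 3}
data = np.array([2, 2, 1, 3, 2, 1, 2, 3])  # hypothetical observations
N = len(data)

# Rescaled log-likelihood: (1/N) * sum_i log p(x_i | theta).
rescaled_loglik = np.sum(np.log(theta[data - 1])) / N

# Empirical distribution: q(x = i) = [x = i] / N.
q = np.array([np.sum(data == v) for v in (1, 2, 3)]) / N

# Expectation of log p_theta under q.
print(np.isclose(rescaled_loglik, np.sum(q * np.log(theta))))  # True
```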
Cross and Relative Entropy
There is more to this: suppose we could choose $p(\cdot |θ)$ in any way we wanted? In other words, suppose our model is universal (the distribution $p$ itself is the parameter). Then it turns out that the maximum possible value is attained when $p = q$! In equations: \begin{equation} E_q[\log(p)] ≤ E_q[\log(q)] \end{equation} (The quantity $-E_q[\log(p)]$ is known as the cross entropy of $q$ and $p$.)
By rewriting a bit, we obtain
\begin{equation} E_q [\log(q/p)] ≥ 0 \end{equation}
The quantity above is called the relative entropy of $q$ with respect to $p$ (also known as the Kullback–Leibler divergence), and is denoted by $D(q || p)$.
Perhaps an alternative name would be the surprise of “observing” the empirical distribution $q$ (remember, this is just the data), given a model $p$. Maximising the likelihood is then just minimising the surprise.
Note that the inequality \[ D(q||p) ≥0 \] is fundamental in statistics. Perhaps it should be called the fundamental inequality in statistics.
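A quick numerical sanity check of this inequality (a sketch with randomly generated distributions, nothing specific to any model):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_entropy(q, p):
    """D(q || p) = E_q[log(q / p)] for distributions on a finite set (all entries > 0)."""
    return np.sum(q * np.log(q / p))

# Two random distributions on a 5-element event space.
q = rng.dirichlet(np.ones(5))
p = rng.dirichlet(np.ones(5))

print(relative_entropy(q, p) >= 0)              # True: D(q||p) >= 0
print(np.isclose(relative_entropy(q, q), 0.0))  # True: equality when p = q
```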
Why is the relative entropy nonnegative?
So, why is it that $D(q||p) ≥0$? There are several possible proofs. Here is one. Notice that $\log$ is a concave function, or equivalently that $-\log$ is a convex function. Essentially the definition of convexity of a function $f\colon V \to \RR$ ($V$ is a vector space) is that for any random variable $Y$ taking values in $V$, \[ f(E[Y]) ≤ E[f(Y)] \] (this is Jensen's inequality).
Now, take \(V = \RR\), and let \(Z\) be a random variable distributed according to \(q\). Define the random variable $Y := p(Z)/q(Z)$. Then $E[Y] = \int (p(x)/q(x)) q(x) \dd x = \int p(x) \dd x = 1$ (in our finite setting the integrals are just sums over $X$). Now, for any function \(f \colon \RR \to \RR\), $E[f(Y)] = \int f(p(x)/q(x)) q(x) \dd x$. In particular, with \(f = -\log\), this gives \(E[f(Y)] = D(q||p)\). Finally, \(f(E[Y]) = f(1) = 0\), and since \(f\) is convex, the inequality above gives \(E[f(Y)] ≥ f(E[Y]) = 0\), that is, precisely $D(q||p) ≥ 0$.
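The same argument can be mirrored numerically (again with made-up finite distributions): taking $Z \sim q$ and $Y = p(Z)/q(Z)$, we check that $E[Y] = 1$ and that $E[-\log(Y)] = D(q||p) ≥ -\log(E[Y]) = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up distributions q and p on a 4-element space.
q = rng.dirichlet(np.ones(4))
p = rng.dirichlet(np.ones(4))

# Y = p(Z)/q(Z) with Z ~ q; expectations over Z are sums weighted by q.
y = p / q
E_Y = np.sum(q * y)                # = sum_x p(x) = 1
D_qp = np.sum(q * (-np.log(y)))    # = E[-log(Y)] = D(q || p)

print(np.isclose(E_Y, 1.0))        # True
print(D_qp >= -np.log(E_Y))        # True: Jensen, E[f(Y)] >= f(E[Y]) with f = -log
```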
Prior and Regularisation
What about maximum a posteriori estimation? That is, what if we have a prior $π(θ)$ on the parameter $θ$?
There is no big difference: the quantity to maximise is then \begin{equation} E_q[\log(p(\cdot | θ))] + \frac{1}{N}\log(π(θ)) \end{equation} where $π(θ)$ is the prior distribution on $θ$. Since the entropy of $q$ does not depend on $θ$, we can also reformulate this as minimising \[ D(q||p_θ) - \frac{1}{N}\log(π(θ)) \] Here, one minimises the surprise $D(q||p_θ)$, with $-\frac{1}{N}\log(π(θ))$ acting as a regularisation term.
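As an illustration (a minimal sketch; the Bernoulli model and the Beta(2, 2) prior are my own choices for the example, not anything prescribed above), the maximum a posteriori estimate can be found by minimising $D(q||p_θ) - \frac{1}{N}\log(π(θ))$ over a grid of parameter values:

```python
import numpy as np

# Hypothetical setup: Bernoulli model p_theta on X = {0, 1}, Beta(2, 2) prior on theta.
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])
N = len(data)
q = np.array([np.mean(data == 0), np.mean(data == 1)])  # empirical distribution

def kl(q, p):
    return np.sum(q * np.log(q / p))

thetas = np.linspace(0.01, 0.99, 981)               # grid over the parameter
log_prior = np.log(thetas) + np.log(1 - thetas)     # Beta(2, 2), up to an additive constant

# Objective: D(q || p_theta) - (1/N) log(pi(theta)); the prior's constant does not matter.
objective = np.array([kl(q, np.array([1 - t, t])) for t in thetas]) - log_prior / N

theta_map = thetas[np.argmin(objective)]
print(theta_map)  # about 0.7 = (sum(data) + 1) / (N + 2), the usual Beta-Bernoulli MAP
```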
What if you want to go full Bayesian? The posterior distribution is again a function of the empirical distribution: \[ \log(p(θ \mid x_1, \ldots, x_N)) = -N\big(D(q||p_θ) + H(q)\big) + \log(π(θ)) - A \] where $H(q) := -E_q[\log(q)]$ is the entropy of $q$, and $A$ is a normalising constant, simply defined as \[ A := \log \Big( \int \exp\Big(-N\big(D(q||p_θ) + H(q)\big) + \log(π(θ))\Big) \dd θ \Big) \]
Here again, the dependence on the data is only through the empirical distribution $q$ (and the sample size $N$).
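To make this concrete, here is a sketch (continuing the hypothetical Bernoulli/Beta example from above) that computes the posterior on a grid using only $q$, $N$ and the prior:

```python
import numpy as np

# Continuing the hypothetical Bernoulli model with a Beta(2, 2) prior.
q = np.array([0.25, 0.75])   # empirical distribution of the data
N = 8                        # sample size

def kl(q, p):
    return np.sum(q * np.log(q / p))

thetas = np.linspace(0.001, 0.999, 999)
dtheta = thetas[1] - thetas[0]

H_q = -np.sum(q * np.log(q))                       # entropy of q
log_prior = np.log(thetas) + np.log(1 - thetas)    # Beta(2, 2), up to a constant

# Unnormalised log-posterior: -N (D(q||p_theta) + H(q)) + log pi(theta).
log_post = np.array([-N * (kl(q, np.array([1 - t, t])) + H_q) for t in thetas]) + log_prior

# Normalising constant A, computed with a simple Riemann sum over the grid.
A = np.log(np.sum(np.exp(log_post)) * dtheta)
posterior = np.exp(log_post - A)

print(np.isclose(np.sum(posterior) * dtheta, 1.0))  # True: the posterior integrates to 1
```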
References
Some references on information theory and Bayesian estimation are
- On cross/relative entropy: Elements of Information Theory by Cover & Thomas.
- On Bayesian estimation in general: Machine Learning: A Probabilistic Perspective by Murphy.