Some notes on the elementary properties of exponential families.
Basic Probability
Some basic notions of probability theory adapted to the particular goal of exponential families.
State Space
We first assume that we have a state space (a set) $X$ with a support measure $δx$: a positive measure, possibly of infinite mass.
Two typical examples could be:
- $X$ a discrete set, and $δx$ the counting measure
- $X = \RR$, and $δx = \dd x$, the Lebesgue measure on $\RR$
Two concrete examples:
- $X = [0,+∞)$, and $δx$ is the Lebesgue measure; it is the support measure for the exponential distribution
- $X = \NN$, and $δx$ the measure giving mass $1/x!$ to the point $x$; this is the support measure for the Poisson distribution
Probability Distribution
A probability distribution is then a function $p \colon X \to \RR$ such that \[ p(x) ≥ 0 \qquad \forall x \in X \] \[ \int_X p(x) δx = 1 \]
If the set is discrete, the last normalisation condition means that \[ \sum_{x\in X} p(x) = 1 \]
If the state space is, for example, $\RR$ with Lebesgue measure as above, this means that \[ \int p(x) \dd x = 1 \]
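As a quick numerical sanity check (not part of the notes proper), here is a short Python sketch of both normalisation conditions; the Poisson and exponential distributions below are just illustrative choices, and SciPy is assumed to be available.

```python
# Numerical check of the normalisation condition: sum over X, or integral over X, equals 1.
import math
from scipy.integrate import quad

# Discrete case: the Poisson density p(x) = exp(-lam) * lam**x with respect to the
# support measure delta_x = 1/x!  (so we sum p(x) * (1/x!)).
lam = 2.0
total_discrete = sum(math.exp(-lam) * lam**x / math.factorial(x) for x in range(100))
print(total_discrete)  # ~ 1.0

# Continuous case: the exponential distribution on X = [0, +inf) with Lebesgue measure.
rate = 1.5
total_continuous, _ = quad(lambda x: rate * math.exp(-rate * x), 0, math.inf)
print(total_continuous)  # ~ 1.0
```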
Mean
Given an arbitrary function \[ E \colon X \to \RR^n \] which we will call the energy function (or energy functions), we can compute its mean with: \[ \langle E \rangle_p := \int_X E(x) p(x) δx \]
Again, in the discrete case, we get \[ \langle E \rangle_p = \sum_{x\in X} p(x) E(x)\] while on $\RR$ we get \[ \langle E \rangle_p = \int E(x) p(x) \dd x \]
Entropy
This leads us to the important notion of entropy, defined for any probability distribution as: \[ H(p) := \langle -\log(p) \rangle_p \]
Essentially, the entropy of a distribution measures how far it is from being deterministic. One of the best introductions to entropy is in the discussion of the weighing problem, in the beginning of Chapter 4 in "Information Theory, Inference, and Learning Algorithms" by the late David MacKay.
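For concreteness, here is a tiny Python sketch computing both the mean of an energy function and the entropy of a discrete distribution; the numbers below are arbitrary example values.

```python
# Mean <E>_p and entropy H(p) = <-log p>_p for a discrete distribution (counting measure).
import math

p = [0.1, 0.2, 0.3, 0.4]      # an example distribution on X = {0, 1, 2, 3}
E = [0.0, 1.0, 2.0, 3.0]      # an example energy function E : X -> R

mean_E  = sum(pi * Ei for pi, Ei in zip(p, E))             # <E>_p
entropy = sum(-pi * math.log(pi) for pi in p if pi > 0)    # H(p)
print(mean_E, entropy)
```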
Maximum Entropy Distributions
If the state space $X$ has finite mass (i.e., $δx(X) < ∞$), then the maximum entropy distribution is the uniform distribution given by \[ p(x) = \frac{1}{δx(X)} \]
In any other case, however, there is no such maximum entropy distribution!
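To illustrate the finite-mass case numerically, here is a small sketch (the five-point state space is an arbitrary choice) showing that the uniform distribution has larger entropy than random distributions on the same set:

```python
# On a finite set with counting measure, the uniform distribution maximises the entropy.
import math, random

def entropy(p):
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

n = 5
uniform = [1.0 / n] * n
print(entropy(uniform))       # log(5), the maximal value

random.seed(0)
for _ in range(3):
    w = [random.random() for _ in range(n)]
    q = [wi / sum(w) for wi in w]            # a random distribution on the same set
    print(entropy(q) <= entropy(uniform))    # always True
```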
The idea is to look for a distribution of maximum entropy given some constraints.
Derivation
Given an energy function $E$, we look for:
\[ \mathrm{arg}\max_{\langle E \rangle_p = μ} H(p) \]
In order to find the solution, define the Lagrangian
\[ L(p, θ, A) := H(p) + θ \Bigl(\int_X E(x) p(x) δx - μ\Bigr) - A \Bigl(\int_X p(x)δx - 1\Bigr) \]
As $\langle \dd (-\log(p))\rangle_p = 0$, we get \[ \dd L = \int_X (-\log(p(x)) + θE(x) - A) \dd p δx + \cdots \dd θ + \cdots \dd A \]
We obtain that an extremum point must satisfy $p(x) = \exp(θE(x) - A)$, and $A$ is implicitly defined by the normalisation condition of $p$.
We reformulate the definition of $p$, with explicit dependence on $θ$, as:
\[p_{θ}(x) := \exp\big( θ \cdot E(x) - A(θ)\big) \]
where $θ \in \RR^n$ is a kind of temperature, which is a function of the constraint parameter $μ$. It is implicitly defined by the relation $μ = \langle E \rangle_{p_θ}$.
$A(θ)$ is for the moment just a normalising factor defined by
\[ \exp(A(θ)) := \int_X \exp(θ \cdot E(x)) δx \]
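As a concrete sketch of the resulting family, here is a small Python example on the toy state space $X = \{0,1,2\}$ with counting measure and $E(x) = x$ (these choices are purely illustrative). It checks that $A(θ)$ normalises $p_θ$, and recovers $θ$ from the constraint $μ = \langle E \rangle_{p_θ}$ by bisection:

```python
# p_theta(x) = exp(theta*E(x) - A(theta)) on X = {0, 1, 2}, E(x) = x, counting measure.
import math

X = [0, 1, 2]

def E(x):
    return float(x)

def A(theta):
    return math.log(sum(math.exp(theta * E(x)) for x in X))

def p(theta):
    return [math.exp(theta * E(x) - A(theta)) for x in X]

def mean(theta):
    return sum(E(x) * px for x, px in zip(X, p(theta)))

print(sum(p(0.7)))            # ~ 1.0: A(theta) is exactly the normalising factor

# Solve <E>_{p_theta} = mu for theta by bisection (mean(theta) is increasing in theta).
mu, lo, hi = 0.5, -20.0, 20.0
for _ in range(200):
    mid = (lo + hi) / 2.0
    lo, hi = (mid, hi) if mean(mid) < mu else (lo, mid)
print(mid, mean(mid))         # the temperature theta associated with mu = 0.5
```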
Optimality
How do we make sure that the solution above actually maximises the entropy? Let us compute the relative entropy between an arbitrary state $q$ and the state $p_θ$ defined above. The general definition of the relative entropy is \[ D(q || p) := -H(q) - \langle \log(p) \rangle_{q}\] One striking property is that it is always non-negative, and zero only if $q = p$.
Let us compute it for the particular $p_{θ}$ given above. This gives \[ D(q || p_{θ}) = -H(q) - θ \cdot \langle E \rangle_{q} + A(θ) \]
This is valid for any state $q$; if we choose in particular $q = p_{θ}$, then $D(p_θ || p_θ) = 0$ gives $0 = -H(p_θ) - θ \cdot \langle E \rangle_{p_{θ}} + A(θ)$, that is
\[A (θ) = H(p_θ) + θ \cdot \langle E \rangle_{p_{θ}} \]
We see that it roughly corresponds to the free energy in thermodynamics.
Substituting this expression for $A(θ)$ back into $D(q || p_θ)$, and using the non-negativity of the relative entropy, we obtain
\[H(q) \leq H(p_{θ}) + θ \cdot (\langle E \rangle_{p_θ} - \langle E \rangle_q) \]
So, if $\langle E \rangle_{p_θ} = \langle E \rangle_q$, we obtain $H(q) \leq H(p_θ)$, which shows optimality.
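Here is a quick numerical check of both facts, on the same kind of toy discrete example; the perturbation $v$ below is a hypothetical direction chosen so that $q$ keeps the same mean energy as $p_θ$.

```python
# D(q || p_theta) >= 0, and H(q) <= H(p_theta) whenever <E>_q = <E>_{p_theta}.
import math

E = [0.0, 1.0, 2.0]          # X = {0, 1, 2} with counting measure, E(x) = x

def entropy(p):
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

def relative_entropy(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

theta = -0.3
Z = sum(math.exp(theta * e) for e in E)
p_theta = [math.exp(theta * e) / Z for e in E]

# A perturbed distribution q with the same mean energy: v sums to 0 and has zero mean energy.
v = [1.0, -2.0, 1.0]
t = 0.05
q = [pe + t * vi for pe, vi in zip(p_theta, v)]

print(relative_entropy(q, p_theta) >= 0)       # True
print(entropy(q) <= entropy(p_theta))          # True: p_theta maximises the entropy
```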
Parameter Sets
The mean parameter space denotes all the values that the mean energy can take:
\[ \mathcal{M} := \{ μ\in\mathbb{R}^n \ |\ \exists p,\quad \langle E \rangle_p = μ \} \]
A seemingly unrelated set is the canonical parameter space (or temperature space), defined by
\[ Ω := \{ θ \ | \ A(θ) < ∞ \} \]
Example: the Exponential Distribution
Let us figure out the case of the exponential distribution. The sample space is $X = [0,∞)$, and $δx = \dd x$. The energy function is $E(x) = x$.
The set of possible mean energies is thus $\mathcal{M} = (0,∞)$. This is because for any $μ \in (0,∞)$, the uniform distribution on $[0,2μ]$, namely $p(x) := \frac{1}{2μ} \mathbf{1}_{[0,2μ]}(x)$, is such that $\langle E \rangle_p = μ$. Suppose on the other hand that $\int_X x p(x) \dd x = 0$: this implies that $p(x) = 0$ for almost every $x>0$, but this contradicts the normalisation condition of $p$, so we conclude that $0 \not\in \mathcal{M}$.
What about the set of temperatures? We compute: $\int_X \exp(θ x) \dd x = \frac{1}{θ}[ \exp(θ x)]_{0}^{∞}$, which is only finite if $θ < 0$, in which case it equals $-\frac{1}{θ}$.
We conclude that \[ \mathcal{M} = (0,∞) \qquad Ω = (-∞,0) \]
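A small numerical confirmation (using SciPy for the integral): for $θ < 0$ the partition integral is finite and equals $-1/θ$, while for $θ \ge 0$ the integrand does not decay and the integral diverges.

```python
# The partition integral int_0^inf exp(theta*x) dx for the exponential family example.
import math
from scipy.integrate import quad

for theta in (-2.0, -0.5):
    value, _ = quad(lambda x: math.exp(theta * x), 0, math.inf)
    print(theta, value, -1.0 / theta)   # finite and equal to -1/theta: theta is in Omega

# For theta >= 0 the integrand does not decay, so the integral diverges and theta is not in Omega.
```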
Free Energy Function
Recall that the free energy is the function $A$ which was defined by \[ \exp(A(θ)) := \int_X \exp(θ \cdot E(x)) δx \]
Mean
The mean is the derivative of the free energy:
\[ \frac{∂A}{∂θ_i} = \langle E_i \rangle_{p_θ} (= μ_i) \]
Indeed: \[ \frac{∂A}{∂θ_i} = \exp(-A(θ)) \int_X E_i(x) \exp(θ \cdot E(x)) δx = \langle E_i \rangle_{p_θ} \]
Covariance
The covariance is the second order derivative of the free energy:
\[ \frac{∂^2 A}{∂θ_i ∂θ_j} = \langle E_i E_j \rangle_θ - \langle E_i \rangle_θ \langle E_j \rangle_θ ≥ 0 \]
Indeed: \[ \frac{∂}{∂θ_j} p_{θ} = \Bigl(E_j(x) - \frac{∂A}{∂θ_j}\Bigr) p_{θ} = (E_j(x) - μ_j) p_{θ} \] so \[ \frac{∂μ_i}{∂θ_j} = \frac{∂}{∂θ_j} \int_X E_i(x) p_{θ} δx = \langle E_i(E_j - μ_j)\rangle_{p_θ} \]
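Here is a quick finite-difference check of both identities, on the toy example $X = \{0,1,2\}$, $E(x) = x$ (so $θ$ is one-dimensional and the covariance is just a variance):

```python
# A'(theta) ~ mean and A''(theta) ~ variance, checked by finite differences.
import math

X = [0.0, 1.0, 2.0]     # here E(x) = x

def A(theta):
    return math.log(sum(math.exp(theta * x) for x in X))

def moments(theta):
    Z = math.exp(A(theta))
    p = [math.exp(theta * x) / Z for x in X]
    m = sum(x * px for x, px in zip(X, p))
    v = sum((x - m) ** 2 * px for x, px in zip(X, p))
    return m, v

theta, h = 0.4, 1e-4
m, v = moments(theta)
dA  = (A(theta + h) - A(theta - h)) / (2 * h)                 # ~ <E>_{p_theta}
d2A = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h ** 2   # ~ variance of E
print(m, dA)
print(v, d2A)
```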
Duality
General Convex Duality
The general idea, described in "Making Sense of the Legendre Transform" or in "A graphical derivation of the Legendre transform", is that
- if a function is convex, then its derivative is increasing
- as the derivative is increasing, it defines a bijection
- the inverse of this bijection is still increasing
- by integrating it, we thus obtain another convex function
If we summarise the idea in one dimension, given a convex function $A(θ)$, one computes the derivative:
\[ \dd A = μ \dd θ\]
This defines $μ(θ)$ as the derivative of $A$ at $θ$.
If we define $A^{*}(μ)$ by $\dd A^{*}(μ) = θ \dd μ$, this defines $A^{*}$ up to a constant. We then notice that $\dd A + \dd A^{*} = μ \dd θ + θ \dd μ = \dd (μ θ)$, which suggests to define $A^{*}$ by
\[ A^*(μ) := μ θ - A(θ)\]
Notice finally that this extends to
- arbitrary dimensions
- not necessarily smooth convex functions
- convex functions not defined on the whole of $\RR^n$ (one can assume that the function is infinite where it is not defined)
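To make this concrete, here is a small numerical sketch approximating the conjugate by a crude grid search, for the convex example $A(θ) = \log(1 + e^θ)$ (the free energy of a two-point example with $E(x) = x$); its conjugate is the negative binary entropy.

```python
# A*(mu) = sup_theta (mu*theta - A(theta)), approximated by a grid search.
import math

def A(theta):
    return math.log(1.0 + math.exp(theta))

def A_star(mu):
    thetas = [t / 100.0 for t in range(-3000, 3001)]     # crude grid over [-30, 30]
    return max(mu * t - A(t) for t in thetas)

mu = 0.3
# For this A, the conjugate is mu*log(mu) + (1 - mu)*log(1 - mu).
print(A_star(mu), mu * math.log(mu) + (1 - mu) * math.log(1 - mu))
```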
Entropy as Conjugate to the Free Energy
Let us compute the entropy: \[ H(p_θ) = \langle -\log(p_θ) \rangle_{p_θ} = \langle A(θ) - θ \cdot E(x) \rangle_{θ} = A(θ) - θ \cdot μ \]
So we obtain that the convex conjugate to the free energy is the entropy:
\[A^*(μ) = - H(p_θ) \]
where $μ = \langle E \rangle_{θ}$.
Example: the Exponential Distribution
Let us figure out what happens in the exponential distribution case: $X = [0,∞)$, $δx = \dd x$, $E(x) = x$.
We have essentially already computed $A(θ) = \log(-\frac{1}{θ}) = -\log(-θ)$, which is defined on $Ω = (-∞,0)$ (alternatively: $A$ is defined on $\RR$, but $A(θ) = ∞$ when $θ ≥0$).
This gives $μ = A'(θ) = -\frac{1}{θ}$, and thus $θ = - \frac{1}{μ}$. Using the convex conjugate identity, we obtain $A^{*}(μ) = θμ - A(θ) = -1 + \log(-θ) = -1 - \log(μ)$.
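As a sanity check, a short numerical sketch recovers this conjugate by a direct grid search over $θ < 0$:

```python
# Check: for A(theta) = -log(-theta) on (-inf, 0), A*(mu) = -1 - log(mu).
import math

def A(theta):                 # only finite for theta < 0
    return -math.log(-theta)

def A_star(mu):
    thetas = [-t / 1000.0 for t in range(1, 100001)]   # grid over (-100, 0)
    return max(mu * t - A(t) for t in thetas)

for mu in (0.5, 1.0, 2.0):
    print(mu, A_star(mu), -1.0 - math.log(mu))
```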
Final Remarks
There are many other aspects that would deserve to be mentioned.
- The set of all probability distributions on a finite set forms an exponential family (see "How the simplex is a vector space")
- When the energy function is constant (and the underlying measure has finite mass), one recovers the uniform, maximum entropy distribution. It is a nice exercise to see what happens to the convex duality in that case.
- Members of the exponential families have conjugate priors (see these notes by David M. Blei)
- Check out the list of exponential families on Wikipedia
Finally, these notes are much inspired by the excellent monograph "Graphical Models, Exponential Families, and Variational Inference" by Wainwright and Jordan.
Edit (2020-01-04): add an optimality section.