Some notes on the elementary properties of exponential families.
Basic Probability
Some basic notions of probability theory adapted to the particular goal of exponential families.
State Space
We first assume that we have a state space (a set) $X$ with a support measure $δx$: a positive measure, possibly of infinite mass.
Two typical examples could be:
- $X$ a discrete set, and $δx$ the counting measure
- $X = \RR$, and $δx = \dd x$, the Lebesgue measure on $\RR$
Two concrete examples:
- $X = [0,+∞)$, and $δx$ is the Lebesgue measure; it is the support measure for the exponential distribution
- $X = \NN$, and $δx$ the measure giving mass $1/x!$ to the point $x$; this is the support measure for the Poisson distribution
Probability Distribution
A probability distribution is then a function $p \colon X \to \RR$ such that \[ p(x) ≥ 0 \qquad \forall x \in X \] \[ \int_X p(x) δx = 1 \]
If the set is discrete, the last normalisation condition means that \[ \sum_{x\in X} p(x) = 1 \]
If the state space is, for example, $\RR$ with Lebesgue measure as above, this means that \[ \int p(x) \dd x = 1 \]
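As a quick numerical sanity check (not part of the notes proper), here is a short Python sketch of both normalisation conditions; the Poisson and exponential distributions below are just illustrative choices, and SciPy is assumed to be available.

```python
# Numerical check of the normalisation condition: sum over X, or integral over X, equals 1.
import math
from scipy.integrate import quad

# Discrete case: the Poisson density p(x) = exp(-lam) * lam**x with respect to the
# support measure delta_x = 1/x!  (so we sum p(x) * (1/x!)).
lam = 2.0
total_discrete = sum(math.exp(-lam) * lam**x / math.factorial(x) for x in range(100))
print(total_discrete)  # ~ 1.0

# Continuous case: the exponential distribution on X = [0, +inf) with Lebesgue measure.
rate = 1.5
total_continuous, _ = quad(lambda x: rate * math.exp(-rate * x), 0, math.inf)
print(total_continuous)  # ~ 1.0
```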
Mean
Given an arbitrary function \[ E \colon X \to \RR^n \] which we will call the energy function (or energy functions), we can compute its mean with: \[ \langle E \rangle_p := \int_X E(x) p(x) δx \]
Again, in the discrete case, we get \[ \langle E \rangle_p = \sum_{x\in X} p(x) E(x)\] while on $\RR$ we get \[ \langle E \rangle_p = \int E(x) p(x) \dd x \]
Entropy
This leads us to the important notion of entropy, defined for any probability distribution as: \[ H(p) := \langle -\log(p) \rangle_p \]
Essentially, the entropy of a distribution measures how far it is from being deterministic. One of the best introductions to entropy is in the discussion of the weighing problem, in the beginning of Chapter 4 in "Information Theory, Inference, and Learning Algorithms" by the late David MacKay.
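For concreteness, here is a tiny Python sketch computing both the mean of an energy function and the entropy of a discrete distribution; the numbers below are arbitrary example values.

```python
# Mean <E>_p and entropy H(p) = <-log p>_p for a discrete distribution (counting measure).
import math

p = [0.1, 0.2, 0.3, 0.4]      # an example distribution on X = {0, 1, 2, 3}
E = [0.0, 1.0, 2.0, 3.0]      # an example energy function E : X -> R

mean_E  = sum(pi * Ei for pi, Ei in zip(p, E))             # <E>_p
entropy = sum(-pi * math.log(pi) for pi in p if pi > 0)    # H(p)
print(mean_E, entropy)
```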
Maximum Entropy Distributions
If the state space $X$ has finite mass (i.e., $δx(X) < ∞$), then the maximum entropy distribution is the uniform distribution given by \[ p(x) = \frac{1}{δx(X)} \]
In any other case, however, there is no such maximum entropy distribution!
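To illustrate the finite-mass case numerically, here is a small sketch (the five-point state space is an arbitrary choice) showing that the uniform distribution has larger entropy than random distributions on the same set:

```python
# On a finite set with counting measure, the uniform distribution maximises the entropy.
import math, random

def entropy(p):
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

n = 5
uniform = [1.0 / n] * n
print(entropy(uniform))       # log(5), the maximal value

random.seed(0)
for _ in range(3):
    w = [random.random() for _ in range(n)]
    q = [wi / sum(w) for wi in w]            # a random distribution on the same set
    print(entropy(q) <= entropy(uniform))    # always True
```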
The idea is to look for a distribution of maximum entropy given some constraints.
Derivation
Given an energy function $E$, we look for:
\[ \mathrm{arg}\max_{\langle E \rangle_p = μ} H(p) \]
In order to find the solution, define the Lagrangian
\[ L(p, θ, A) := H(p) + θ \Bigl(\int_X E(x) p(x) δx - μ\Bigr) - A \Bigl(\int_X p(x)δx - 1\Bigr) \]
As $\langle \dd (-\log(p))\rangle_p = 0$, we get \[ \dd L = \int_X (-\log(p(x)) + θE(x) - A) \dd p δx + \cdots \dd θ + \cdots \dd A \]
We obtain that an extremum point must satisfy $p(x) = \exp(θE(x) - A)$, and $A$ is implicitly defined by the normalisation condition of $p$.
We reformulate the definition of $p$, with explicit dependence on $θ$, as:
\[p_{θ}(x) := \exp\big( θ \cdot E(x) - A(θ)\big) \]
where $θ \in \RR^n$ is a kind of temperature, which is a function of the constraint parameter $μ$. It is implicitly defined by the relation $μ = \langle E \rangle_{p_θ}$.
$A(θ)$ is for the moment just a normalising factor defined by
\[ \exp(A(θ)) := \int_X \exp(θ \cdot E(x)) δx \]
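As a concrete sketch of the resulting family, here is a small Python example on the toy state space $X = \{0,1,2\}$ with counting measure and $E(x) = x$ (these choices are purely illustrative). It checks that $A(θ)$ normalises $p_θ$, and recovers $θ$ from the constraint $μ = \langle E \rangle_{p_θ}$ by bisection:

```python
# p_theta(x) = exp(theta*E(x) - A(theta)) on X = {0, 1, 2}, E(x) = x, counting measure.
import math

X = [0, 1, 2]

def E(x):
    return float(x)

def A(theta):
    return math.log(sum(math.exp(theta * E(x)) for x in X))

def p(theta):
    return [math.exp(theta * E(x) - A(theta)) for x in X]

def mean(theta):
    return sum(E(x) * px for x, px in zip(X, p(theta)))

print(sum(p(0.7)))            # ~ 1.0: A(theta) is exactly the normalising factor

# Solve <E>_{p_theta} = mu for theta by bisection (mean(theta) is increasing in theta).
mu, lo, hi = 0.5, -20.0, 20.0
for _ in range(200):
    mid = (lo + hi) / 2.0
    lo, hi = (mid, hi) if mean(mid) < mu else (lo, mid)
print(mid, mean(mid))         # the temperature theta associated with mu = 0.5
```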
Optimality
How do we make sure that the solution above actually maximises the entropy? Let us compute the relative entropy between an arbitrary state $q$ and the state $p_θ$ defined above. The general definition of the relative entropy is \[ D(q || p) := -H(q) - \langle \log(p) \rangle_{q}\] One striking property is that it is always non-negative, and zero only if $q = p$.
Let us compute it for the particular $p_{θ}$ given above. This gives \[ D(q || p_{θ}) = -H(q) - θ \cdot \langle E \rangle_{q} + A(θ) \]
This is valid for any state $q$; if we choose in particular $q = p_{θ}$, then $D(p_θ || p_θ) = 0$ gives $0 = -H(p_θ) - θ \cdot \langle E \rangle_{p_{θ}} + A(θ)$, that is
\[A (θ) = H(p_θ) + θ \cdot \langle E \rangle_{p_{θ}} \]
We see that it roughly corresponds to the free energy in thermodynamics.
Substituting this expression for $A(θ)$ back into $D(q || p_θ)$, and using the non-negativity of the relative entropy, we obtain
\[H(q) \leq H(p_{θ}) + θ \cdot (\langle E \rangle_{p_θ} - \langle E \rangle_q) \]
So, if $\langle E \rangle_{p_θ} = \langle E \rangle_q$, we obtain $H(q) \leq H(p_θ)$, which shows optimality.
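Here is a quick numerical check of both facts, on the same kind of toy discrete example; the perturbation $v$ below is a hypothetical direction chosen so that $q$ keeps the same mean energy as $p_θ$.

```python
# D(q || p_theta) >= 0, and H(q) <= H(p_theta) whenever <E>_q = <E>_{p_theta}.
import math

E = [0.0, 1.0, 2.0]          # X = {0, 1, 2} with counting measure, E(x) = x

def entropy(p):
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

def relative_entropy(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

theta = -0.3
Z = sum(math.exp(theta * e) for e in E)
p_theta = [math.exp(theta * e) / Z for e in E]

# A perturbed distribution q with the same mean energy: v sums to 0 and has zero mean energy.
v = [1.0, -2.0, 1.0]
t = 0.05
q = [pe + t * vi for pe, vi in zip(p_theta, v)]

print(relative_entropy(q, p_theta) >= 0)       # True
print(entropy(q) <= entropy(p_theta))          # True: p_theta maximises the entropy
```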
Parameter Sets
The mean parameter space denotes all the values that the mean energy can take:
\[ \mathcal{M} := \{ μ\in\mathbb{R}^n \ |\ \exists p,\quad \langle E \rangle_p = μ \} \]
A seemingly unrelated set is the canonical parameter space (or temperature space), defined by
\[ Ω := \{ θ \ | \ A(θ) < ∞ \} \]
Example: the Exponential Distribution
Let us figure out the case of the exponential distribution. The sample space is $X = [0,∞)$, and $δx = \dd x$. The energy function is $E(x) = x$.
The set of possible mean energies is thus $\mathcal{M} = (0,∞)$. This is because for any $μ \in (0,∞)$, the uniform distribution on $[0,2μ]$, namely $p(x) := \frac{1}{2μ} \mathbf{1}_{[0,2μ]}(x)$, is such that $\langle E \rangle_p = μ$. Suppose on the other hand that $\int_X x p(x) \dd x = 0$: this implies that $p(x) = 0$ for almost every $x>0$, but this contradicts the normalisation condition of $p$, so we conclude that $0 \not\in \mathcal{M}$.
What about the set of temperatures? We compute: $\int_X \exp(θ x) \dd x = \frac{1}{θ}[ \exp(θ x)]_{0}^{∞}$, which is only finite if $θ < 0$, in which case it equals $-\frac{1}{θ}$.
We conclude that \[ \mathcal{M} = (0,∞) \qquad Ω = (-∞,0) \]
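A small numerical confirmation (using SciPy for the integral): for $θ < 0$ the partition integral is finite and equals $-1/θ$, while for $θ \ge 0$ the integrand does not decay and the integral diverges.

```python
# The partition integral int_0^inf exp(theta*x) dx for the exponential family example.
import math
from scipy.integrate import quad

for theta in (-2.0, -0.5):
    value, _ = quad(lambda x: math.exp(theta * x), 0, math.inf)
    print(theta, value, -1.0 / theta)   # finite and equal to -1/theta: theta is in Omega

# For theta >= 0 the integrand does not decay, so the integral diverges and theta is not in Omega.
```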
Free Energy Function
Recall that the free energy is the function $A$ which was defined by \[ \exp(A(θ)) := \int_X \exp(θ \cdot E(x)) δx \]
Mean
The mean is the derivative of the free energy:
\[ \frac{∂A}{∂θ_i} = \langle E_i \rangle_{p_θ} (= μ_i) \]
Indeed: \[ \frac{∂A}{∂θ_i} = \exp(-A(θ)) \int_X E_i(x) \exp(θ \cdot E(x)) δx = \langle E_i \rangle_{p_θ} \]
Covariance
The covariance is the second order derivative of the free energy:
\[ \frac{∂^2 A}{∂θ_i ∂θ_j} = \langle E_i E_j \rangle_θ - \langle E_i \rangle_θ \langle E_j \rangle_θ ≥ 0 \]
Indeed: \[ \frac{∂}{∂θ_j} p_{θ} = \Bigl(E_j(x) - \frac{∂A}{∂θ_j}\Bigr) p_{θ} = (E_j(x) - μ_j) p_{θ} \] so \[ \frac{∂μ_i}{∂θ_j} = \frac{∂}{∂θ_j} \int_X E_i(x) p_{θ} δx = \langle E_i(E_j - μ_j)\rangle_{p_θ} \]
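Here is a quick finite-difference check of both identities, on the toy example $X = \{0,1,2\}$, $E(x) = x$ (so $θ$ is one-dimensional and the covariance is just a variance):

```python
# A'(theta) ~ mean and A''(theta) ~ variance, checked by finite differences.
import math

X = [0.0, 1.0, 2.0]     # here E(x) = x

def A(theta):
    return math.log(sum(math.exp(theta * x) for x in X))

def moments(theta):
    Z = math.exp(A(theta))
    p = [math.exp(theta * x) / Z for x in X]
    m = sum(x * px for x, px in zip(X, p))
    v = sum((x - m) ** 2 * px for x, px in zip(X, p))
    return m, v

theta, h = 0.4, 1e-4
m, v = moments(theta)
dA  = (A(theta + h) - A(theta - h)) / (2 * h)                 # ~ <E>_{p_theta}
d2A = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h ** 2   # ~ variance of E
print(m, dA)
print(v, d2A)
```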
Duality
General Convex Duality
The general idea, described in "Making Sense of the Legendre Transform" or in "A graphical derivation of the Legendre transform", is that
- if a function is convex, then its derivative is increasing
- as the derivative is increasing, it defines a bijection
- the inverse of this bijection is still increasing
- by integrating it, we thus obtain another convex function
If we summarise the idea in one dimension, given a convex function $A(θ)$, one computes the derivative:
\[ \dd A = μ \dd θ\]
This defines $μ(θ)$ as the derivative of $A$ at $θ$.
If we define $A^{*}(μ)$ by $\dd A^{*}(μ) = θ \dd μ$, this defines $A^{*}$ up to a constant. We then notice that $\dd A + \dd A^{*} = μ \dd θ + θ \dd μ = \dd (μ θ)$, which suggests to define $A^{*}$ by
\[ A^*(μ) := μ θ - A(θ)\]
Notice finally that this extends to
- arbitrary dimensions
- not necessarily smooth convex functions
- convex functions not defined on the whole of $\RR^n$ (one can assume that the function is infinite where it is not defined)
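To make this concrete, here is a small numerical sketch approximating the conjugate by a crude grid search, for the convex example $A(θ) = \log(1 + e^θ)$ (the free energy of a two-point example with $E(x) = x$); its conjugate is the negative binary entropy.

```python
# A*(mu) = sup_theta (mu*theta - A(theta)), approximated by a grid search.
import math

def A(theta):
    return math.log(1.0 + math.exp(theta))

def A_star(mu):
    thetas = [t / 100.0 for t in range(-3000, 3001)]     # crude grid over [-30, 30]
    return max(mu * t - A(t) for t in thetas)

mu = 0.3
# For this A, the conjugate is mu*log(mu) + (1 - mu)*log(1 - mu).
print(A_star(mu), mu * math.log(mu) + (1 - mu) * math.log(1 - mu))
```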
Entropy as Conjugate to the Free Energy
Let us compute the entropy: \[ H(p_θ) = \langle -\log(p_θ) \rangle_{p_θ} = \langle A(θ) - θ \cdot E(x) \rangle_{θ} = A(θ) - θ \cdot μ \]
So we obtain that the convex conjugate to the free energy is the entropy:
\[A^*(μ) = - H(p_θ) \]
where $μ = \langle E \rangle_{θ}$.
Example: the Exponential Distribution
Let us figure out what happens in the exponential distribution case: $X = [0,∞)$, $δx = \dd x$, $E(x) = x$.
We have essentially already computed $A(θ) = \log(-\frac{1}{θ}) = -\log(-θ)$, which is defined on $Ω = (-∞,0)$ (alternatively: $A$ is defined on $\RR$, but $A(θ) = ∞$ when $θ ≥0$).
This gives $μ = A'(θ) = -\frac{1}{θ}$, and thus $θ = - \frac{1}{μ}$. Using the convex conjugate identity, we obtain $A^{*}(μ) = θμ - A(θ) = -1 + \log(-θ) = -1 - \log(μ)$.
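As a sanity check, a short numerical sketch recovers this conjugate by a direct grid search over $θ < 0$:

```python
# Check: for A(theta) = -log(-theta) on (-inf, 0), A*(mu) = -1 - log(mu).
import math

def A(theta):                 # only finite for theta < 0
    return -math.log(-theta)

def A_star(mu):
    thetas = [-t / 1000.0 for t in range(1, 100001)]   # grid over (-100, 0)
    return max(mu * t - A(t) for t in thetas)

for mu in (0.5, 1.0, 2.0):
    print(mu, A_star(mu), -1.0 - math.log(mu))
```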
Final Remarks
There are many other aspects that would deserve to be mentioned.
- The set of all probability distributions on a finite set forms an exponential family (see "How the simplex is a vector space")
- When the energy function is constant (and the underlying measure has finite mass), one recovers the uniform, maximum entropy distribution. It is a nice exercise to see what happens to the convex duality in that case.
- Members of the exponential families have conjugate priors (see these notes by David M. Blei)
- Check out the list of exponential families on Wikipedia
Finally, these notes are much inspired by the excellent monograph "Graphical Models, Exponential Families, and Variational Inference" by Wainwright and Jordan.
Edit (2020-01-04): add an optimality section.