Some notes on the elementary properties of exponential families.

## Basic Probability

Some basic notions of probability theory adapted to the particular goal of exponential families.

### State Space

We first assume that we have a state space (a set) $X$ with support measure $δx$, a positive measure, of possibly infinite mass.

Two typical examples could be:

• $X$ a discrete set, and $δx$ is the counting, discrete measure
• $X = \RR$, and $δx = \dd x$, the Lebesgue measure on $\RR$

Two concrete examples:

### Probability Distribution

A probability distribution is then a function $p \colon X \to \RR$ such that $p(x) ≥ 0$ for all $x \in X$, and $\int_X p(x) δx = 1$

If the set is discrete, the last normalisation condition means that $\sum_{x\in X} p(x) = 1$

If the state space is, for example, $\RR$ with Lebesgue measure as above, this means that $\int p(x) \dd x = 1$

### Mean

Given an arbitrary function $E \colon X \to \RR^n$, which we will call the energy function (or energy functions, one per component), we can compute its mean: $\langle E \rangle_p := \int_X E(x) p(x) δx$

Again, in the discrete case, we get $\langle E \rangle_p = \sum_{x\in X} p(x) E(x)$ while on $\RR$ we get $\langle E \rangle_p = \int E(x) p(x) \dd x$
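As a minimal sketch of the discrete formula, assuming a scalar energy ($n = 1$) and a distribution stored as a dictionary from states to probabilities: the fair die and the names `mean`, `fair_die` and `die_mean` are illustrative choices, not part of the notes.

```python
def mean(p, E):
    """Mean <E>_p of a scalar energy E under a discrete distribution p.

    p maps each state x to its probability p(x); E maps a state to a number.
    """
    return sum(prob * E(x) for x, prob in p.items())


# Hypothetical example: a fair six-sided die with energy E(x) = x.
fair_die = {x: 1 / 6 for x in range(1, 7)}
die_mean = mean(fair_die, lambda x: x)
```

For a vector-valued energy, the same sum would be taken component by component.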

### Entropy

This leads us to the important notion of entropy, defined for any probability distribution as: $H(p) := \langle -\log(p) \rangle_p$

Essentially, the entropy of a distribution measures how far it is from being deterministic. One of the best introductions to entropy is in the discussion of the weighing problem, in the beginning of Chapter 4 in "Information Theory, Inference, and Learning Algorithms" by the late David MacKay.
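The following sketch illustrates this on a discrete set with three states, using the natural logarithm. The three distributions and the function name `entropy` are assumptions made for the example.

```python
import math


def entropy(p):
    """Entropy H(p) = <-log p>_p of a discrete distribution (natural log)."""
    # States with zero probability are skipped: 0 * log(0) is taken to be 0.
    return -sum(prob * math.log(prob) for prob in p.values() if prob > 0)


# Hypothetical distributions on the three states "a", "b", "c".
deterministic = {"a": 1.0, "b": 0.0, "c": 0.0}
skewed = {"a": 0.8, "b": 0.1, "c": 0.1}
uniform = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
```

The deterministic distribution has zero entropy, the uniform one has entropy $\log 3$, and the skewed one sits in between.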

## Maximum Entropy Distributions

If the state space $X$ has finite mass (i.e., $δx(X) < ∞$), then the maximum entropy distribution is the uniform distribution given by $p(x) = \frac{1}{δx(X)}$

In any other case, however, there is no such maximum entropy distribution!

The idea is to look for a distribution of maximum entropy given some constraints.

### Derivation

Given an energy function $E$, we look for:

$\mathrm{arg}\max_{\langle E \rangle_p = μ} H(p)$

In order to find the solution, define the Lagrangian

$L(p, θ, A) := H(p) + θ \Bigl(\int_X E(x) p(x) δx - μ\Bigr) - A \Bigl(\int_X p(x)δx - 1\Bigr)$

As $\langle \dd (-\log(p))\rangle_p = -\int_X \dd p\, δx = 0$ for variations that preserve the normalisation, we get $\dd L = \int_X (-\log(p(x)) + θE(x) - A) \dd p δx + \cdots \dd θ + \cdots \dd A$

We obtain that an extremum point must satisfy $p(x) = \exp(θE(x) - A)$, and $A$ is implicitly defined by the normalisation condition of $p$.

We reformulate the definition of $p$ with explicit dependency upon $θ$ as:

$p_{θ}(x) := \exp\big( θ \cdot E(x) - A(θ)\big)$

where $θ \in \RR^n$ is a kind of temperature, which is a function of the constraint parameter $μ$. It is implicitly defined by the relation $μ = \langle E \rangle_{p_θ}$.

$A(θ)$ is for the moment just a normalising factor defined by

$\exp(A(θ)) := \int_X \exp(θ \cdot E(x)) δx$

Finally, note that the derivation above is only heuristic: it is much harder to prove that the exponential-family distribution we obtained really maximises entropy (see "Probability Distributions and Maximum Entropy" by Keith Conrad).
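To make the relation between $μ$ and $θ$ concrete, here is a small sketch on a discrete state space with the counting measure. The die example, the target mean $4.5$, the bisection bounds and all function names are assumptions chosen for illustration. Bisection is used because, in this example, the mean energy increases with $θ$.

```python
import math


def free_energy(theta, states, E):
    """A(theta) = log sum_x exp(theta * E(x)), for the counting measure."""
    return math.log(sum(math.exp(theta * E(x)) for x in states))


def p_theta(theta, states, E):
    """The distribution p_theta(x) = exp(theta * E(x) - A(theta))."""
    A = free_energy(theta, states, E)
    return {x: math.exp(theta * E(x) - A) for x in states}


def mean_energy(theta, states, E):
    """Mean of E under p_theta."""
    return sum(prob * E(x) for x, prob in p_theta(theta, states, E).items())


def solve_theta(mu, states, E, lo=-20.0, hi=20.0, tol=1e-12):
    """Find theta with <E>_{p_theta} = mu by bisection (assumed bracket [lo, hi])."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_energy(mid, states, E) < mu:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical example: a die whose mean is constrained to be 4.5.
die = range(1, 7)
theta_star = solve_theta(4.5, die, lambda x: x)
p_star = p_theta(theta_star, die, lambda x: x)
```

With the constraint $μ = 3.5$ the solver returns $θ ≈ 0$, which recovers the uniform distribution on the die.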

### Parameter Sets

The mean parameter space denotes all the values that the mean energy can take:

$\mathcal{M} := \{ μ\in\mathbb{R}^n \ |\ \exists p,\quad \langle E \rangle_p = μ \}$

A seemingly unrelated set is the canonical parameter space (or temperature space), defined by

$Ω := \{ θ \ | \ A(θ) < ∞ \}$

Let us figure out the case of the exponential distribution. The sample space is $X = [0,∞)$, and $δx = \dd x$. The energy function is $E(x) = x$.

The set of possible mean energies is thus $\mathcal{M} = (0,∞)$. This is because for any $μ \in (0,∞)$, the uniform distribution on $[0, 2μ]$, that is $p(x) := \frac{1}{2μ}$ for $0 ≤ x ≤ 2μ$ and $p(x) := 0$ otherwise, is such that $\langle E \rangle_p = μ$. Suppose on the other hand that $\int_X x p(x) \dd x = 0$: this implies that $p(x) = 0$ for almost every $x>0$, but this contradicts the normalisation condition of $p$, so we conclude that $0 \not\in \mathcal{M}$.

What about the set of temperatures? We compute: $\int_X \exp(θ x) \dd x = \frac{1}{θ}[ \exp(θ x)]_{0}^{∞}$, which is finite only if $θ < 0$, in which case it equals $-\frac{1}{θ}$.

We conclude that $\mathcal{M} = (0,∞) \qquad Ω = (-∞,0)$
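A quick numerical illustration of why $Ω$ stops at zero, under the assumption that we truncate the integral to $[0, R]$ and use its closed form. The function name `truncated_partition` and the sample values of $θ$ and $R$ are hypothetical.

```python
import math


def truncated_partition(theta, R):
    """Integral of exp(theta * x) over [0, R], in closed form (theta != 0)."""
    return math.expm1(theta * R) / theta


# For theta < 0 the truncated integral settles at -1/theta as R grows...
converging = [truncated_partition(-2.0, R) for R in (1, 10, 100)]
# ...while for theta > 0 it grows without bound.
diverging = [truncated_partition(0.5, R) for R in (1, 10, 100)]
```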

## Free Energy Function

We call free energy the function $A$ which was defined by $\exp(A(θ)) := \int_X \exp(θ \cdot E(x)) δx$

### Mean

The mean is the derivative of the free energy:

$\frac{∂A}{∂θ_i} = \langle E_i \rangle_{p_θ} (= μ_i)$

Indeed: $\frac{∂A}{∂θ_i} = \exp(-A(θ)) \int_X E_i(x) \exp(θ \cdot E(x)) δx = \langle E_i \rangle_{p_θ}$
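This identity can be checked numerically with a finite difference. The sketch below assumes a discrete state space (the faces of a die) with $E(x) = x$. The names `A`, `mean` and `central_difference` are illustrative.

```python
import math

STATES = range(1, 7)  # hypothetical state space: faces of a die, with E(x) = x


def A(theta):
    """Free energy A(theta) = log sum_x exp(theta * x)."""
    return math.log(sum(math.exp(theta * x) for x in STATES))


def mean(theta):
    """Mean energy under p_theta(x) = exp(theta * x - A(theta))."""
    a = A(theta)
    return sum(x * math.exp(theta * x - a) for x in STATES)


def central_difference(f, theta, h=1e-5):
    """Symmetric finite-difference approximation of f'(theta)."""
    return (f(theta + h) - f(theta - h)) / (2 * h)
```

Comparing `central_difference(A, theta)` with `mean(theta)` for a few values of `theta` should show agreement up to the finite-difference error.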

### Covariance

The covariance is the second order derivative of the free energy:

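$\frac{∂^2 A}{∂θ_i ∂θ_j} = \langle E_i E_j \rangle_{p_θ} - \langle E_i \rangle_{p_θ} \langle E_j \rangle_{p_θ}$

As a covariance matrix, this Hessian is positive semi-definite (its individual off-diagonal entries may be negative), so $A$ is convex.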

Indeed: $\frac{∂}{∂θ_j} p_{θ} = \Bigl(E_j(x) - \frac{∂A}{∂θ_j}\Bigr) p_{θ} = (E_j(x) - μ_j) p_{θ}$ so $\frac{∂μ_i}{∂θ_j} = \frac{∂}{∂θ_j} \int_X E_i(x) p_{θ} δx = \langle E_i(E_j - μ_j)\rangle_{p_θ} = \langle E_i E_j \rangle_{p_θ} - μ_i μ_j$
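In one dimension the statement says that the derivative of the mean with respect to $θ$ is the variance of the energy. A minimal numerical check, again assuming a die with $E(x) = x$ and with illustrative function names:

```python
import math

STATES = range(1, 7)  # hypothetical state space: faces of a die, with E(x) = x


def A(theta):
    """Free energy A(theta) = log sum_x exp(theta * x)."""
    return math.log(sum(math.exp(theta * x) for x in STATES))


def moment(theta, k):
    """k-th moment <E^k> under p_theta(x) = exp(theta * x - A(theta))."""
    a = A(theta)
    return sum(x**k * math.exp(theta * x - a) for x in STATES)


def variance(theta):
    """<E^2> - <E>^2 under p_theta."""
    return moment(theta, 2) - moment(theta, 1) ** 2


def mean_derivative(theta, h=1e-5):
    """Symmetric finite-difference approximation of d mu / d theta."""
    return (moment(theta + h, 1) - moment(theta - h, 1)) / (2 * h)
```

At $θ = 0$ both quantities are close to $35/12$, the variance of a fair die.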

## Duality

### General Convex Duality

The general idea, described in "Making Sense of the Legendre Transform" or in "A graphical derivation of the Legendre transform", is that

1. if a function is strictly convex, then its derivative is strictly increasing
2. as the derivative is strictly increasing, it defines a bijection onto its image
3. the inverse of this bijection is still increasing
4. by integrating it, we thus obtain another convex function

If we summarise the idea in one dimension, given a convex function $A(θ)$, one computes the derivative:

$\dd A = μ \dd θ$

This defines $μ(θ)$ as the derivative of $A$ at $θ$.

If we define $A^{*}(μ)$ by $\dd A^{*}(μ) = θ \dd μ$, this defines $A^{*}$ up to a constant. We then notice that $\dd A + \dd A^{*} = μ \dd θ + θ \dd μ = \dd (μ θ)$, which suggests defining $A^{*}$ by

$A^*(μ) := μ θ - A(θ)$

Notice finally that this extends to

• arbitrary dimensions
• convex functions that are not necessarily smooth
• convex functions not defined on the whole of $\RR^n$ (one can assume that the function is infinite where it is not defined)
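The one-dimensional construction can be sketched numerically: solve $A'(θ) = μ$ for $θ$, then evaluate $μ θ - A(θ)$. The example function $A(θ) = \log(1 + e^{θ})$, the bisection bracket and the name `legendre` are assumptions made for illustration. Bisection works here because the derivative of a convex function is increasing.

```python
import math


def legendre(A, dA, mu, lo=-30.0, hi=30.0, tol=1e-12):
    """Return (A*(mu), theta), where theta solves dA(theta) = mu.

    Assumes dA is increasing and that the solution lies in [lo, hi].
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if dA(mid) < mu:
            lo = mid
        else:
            hi = mid
    theta = (lo + hi) / 2
    return mu * theta - A(theta), theta


# Hypothetical convex function and its derivative (the logistic function).
def A(theta):
    return math.log1p(math.exp(theta))


def dA(theta):
    return 1.0 / (1.0 + math.exp(-theta))


conjugate_value, theta_at_mu = legendre(A, dA, 0.3)
```

For this particular $A$, the conjugate has the closed form $μ \log μ + (1 - μ) \log(1 - μ)$, which the numerical value should reproduce.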

### Entropy as Conjugate to the Free Energy

Let us compute the entropy: $H(p_θ) = \langle -\log(p_θ) \rangle_{p_θ} = \langle A(θ) - θ \cdot E \rangle_{p_θ} = A(θ) - θ \cdot μ$

So we obtain that the convex conjugate of the free energy is the negative entropy:

$A^*(μ) = - H(p_θ)$

where $μ = \langle E \rangle_{p_θ}$.

Let us figure out what happens in the exponential distribution case: $X = [0,∞)$, $δx = \dd x$, $E(x) = x$.

We have essentially already computed $A(θ) = \log(-\frac{1}{θ}) = -\log(-θ)$, which is defined on $Ω = (-∞,0)$ (alternatively: $A$ is defined on $\RR$, but $A(θ) = ∞$ when $θ ≥0$).

This gives $μ = A'(θ) = -\frac{1}{θ}$, and thus $θ = - \frac{1}{μ}$. Using the convex conjugate identity, we obtain $A^{*}(μ) = θμ - A(θ) = -1 + \log(-θ) = -1 - \log(μ)$. The entropy of the exponential distribution with mean $μ$ is therefore $H(p_θ) = 1 + \log(μ)$.
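As a sanity check, the entropy of the exponential distribution can be integrated numerically and compared with $-A^{*}(μ)$. This is only a sketch: the midpoint rule, the cutoff at $50 μ$, the number of steps and the function names are all assumptions.

```python
import math


def A(theta):
    """Free energy of the exponential family on [0, inf), for theta < 0."""
    return -math.log(-theta)


def A_star(mu):
    """Convex conjugate theta * mu - A(theta), evaluated at theta = -1/mu."""
    theta = -1.0 / mu
    return theta * mu - A(theta)


def entropy_numeric(mu, n=200_000, cutoff=50.0):
    """Midpoint-rule approximation of the integral of -p log p over [0, cutoff * mu]."""
    theta = -1.0 / mu
    a = A(theta)
    h = cutoff * mu / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        log_p = theta * x - a  # log p_theta(x)
        total += -log_p * math.exp(log_p) * h
    return total
```

For example, with $μ = 2$ the numerical entropy and $-A^{*}(2) = 1 + \log 2$ should agree to several digits.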

## Final Remarks

There are many other aspects that would deserve to be mentioned.

Finally, these notes are much inspired by the excellent monograph "Graphical Models, Exponential Families, and Variational Inference" by Wainwright and Jordan.