Some notes on the elementary properties of exponential families.

## Basic Probability

Some basic notions of probability theory, adapted to the particular goal of studying exponential families.

### State Space

We first assume that we have a state space (a set) $X$ equipped with a *support measure* $δx$: a positive measure of possibly infinite mass.

Two typical examples could be:

- $X$ a discrete set, and $δx$ the counting measure
- $X = \RR$, and $δx = \dd x$, the Lebesgue measure on $\RR$

Two concrete examples:

- $X = [0,+∞)$, and $δx$ is the Lebesgue measure; it is the support measure for the exponential distribution
- $X = \NN$, and $δx$ is the measure giving mass $1/x!$ to each point $x$; this is the support measure for the Poisson distribution

### Probability Distribution

A probability distribution is then a function $p \colon X \to \RR$ such that \[ p(x) ≥ 0 \qquad \forall x \in X \] \[ \int_X p(x) δx = 1 \]

If the state space is discrete with counting measure, the last normalisation condition means that \[ \sum_{x\in X} p(x) = 1 \]

If the state space is, for example, $\RR$ with Lebesgue measure as above, this means that \[ \int p(x) \dd x = 1 \]

### Mean

Given an arbitrary function
\[ E \colon X \to \RR^n \]
which we will call the *energy function* (or, componentwise, the *energy functions*), we can compute its mean as:
\[ \langle E \rangle_p := \int_X E(x) p(x) δx \]
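As a quick numerical illustration, here is a minimal Python sketch using the Poisson example from above. The density $p(x) = \exp(x \log λ - λ)$ with respect to the support measure $δx = 1/x!$ and the value $λ = 3$ are assumptions made for the example, not derived here; the infinite sum is truncated.

```python
import math

lam = 3.0                                        # assumed Poisson parameter
delta = lambda x: 1.0 / math.factorial(x)        # support measure δx = 1/x! on ℕ
p = lambda x: math.exp(x * math.log(lam) - lam)  # Poisson density w.r.t. δx
E = lambda x: x                                  # energy function E(x) = x

# ⟨E⟩_p = Σ_x E(x) p(x) δx, truncating the infinite sum
mean = sum(E(x) * p(x) * delta(x) for x in range(50))
print(mean)  # ≈ 3.0 (= lam)
```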

### Entropy

This leads us to the important notion of **entropy**, defined for any probability distribution as:
\[ H(p) := \langle -\log(p) \rangle_p \]
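Continuing the sketch above, the entropy can be approximated the same way (note that entropy relative to a base measure, like differential entropy, may be negative):

```python
# H(p) = ⟨-log(p)⟩_p = Σ_x (-log p(x)) p(x) δx, truncated as before
H = sum(-math.log(p(x)) * p(x) * delta(x) for x in range(50))
print(H)  # ≈ lam - lam*log(lam) ≈ -0.296
```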

Essentially, the entropy of a distribution measures how far it is from being deterministic. One of the best introductions to entropy is the discussion of the weighing problem at the beginning of Chapter 4 of "Information Theory, Inference, and Learning Algorithms" by the late David MacKay.

## Maximum Entropy Distributions

If the state space $X$ has finite mass (i.e., $δx(X) < ∞$), then the maximum entropy distribution is the *uniform distribution* given by
\[ p(x) = \frac{1}{δx(X)} \]
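For instance, if $X$ is a finite set with the counting measure, then $δx(X) = |X|$, the maximum entropy distribution is the uniform $p(x) = 1/|X|$, and its entropy is \[ H(p) = \langle \log |X| \rangle_p = \log |X| \]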

In any other case, however, there is no such maximum entropy distribution!

The idea is to look for a distribution of maximum entropy *given some constraints*.

### Derivation

Given an energy function $E$, we look for:

\[ \mathrm{arg}\max_{\langle E \rangle_p = μ} H(p) \]

In order to find the solution, define the Lagrangian

\[ L(p, θ, A) := H(p) + θ \cdot \Bigl(\int_X E(x) p(x) δx - μ\Bigr) - A \Bigl(\int_X p(x)δx - 1\Bigr) \]

As $\langle \dd (-\log(p))\rangle_p = -\int_X \dd p\, δx$, a term which can be absorbed into the multiplier $A$, we get \[ \dd L = \int_X (-\log(p(x)) + θ \cdot E(x) - A) \dd p\, δx + \cdots \dd θ + \cdots \dd A \]

We obtain that an extremum point must satisfy $p(x) = \exp(θ \cdot E(x) - A)$, and $A$ is implicitly defined by the normalisation condition of $p$.

We reformulate the definition of $p$ with explicit dependency upon $θ$ as:

\[p_{θ}(x) := \exp\big( θ \cdot E(x) - A(θ)\big) \]

where $θ \in \RR^n$ is a kind of *temperature*, which is a function of the constraint parameter $μ$.
It is implicitly defined by the relation $μ = \langle E \rangle_{p_θ}$.

$A(θ)$ is for the moment just a normalising factor defined by

\[ \exp(A(θ)) := \int_X \exp(θ \cdot E(x)) δx \]
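As a sanity check, here is a minimal Python sketch. It assumes, as in the example treated below, $X = [0,∞)$ with Lebesgue measure, $E(x) = x$, and an arbitrary value of $θ$; it computes $A(θ)$ by numerical quadrature and verifies that $p_θ$ is normalised and that $μ = \langle E \rangle_{p_θ}$.

```python
import numpy as np
from scipy.integrate import quad

theta = -0.5                                            # assumed temperature (must be < 0 here)
expA, _ = quad(lambda x: np.exp(theta * x), 0, np.inf)  # exp(A(θ)) = ∫ exp(θ·E(x)) δx
A = np.log(expA)
p_theta = lambda x: np.exp(theta * x - A)               # p_θ(x) = exp(θ·E(x) - A(θ))

print(quad(p_theta, 0, np.inf)[0])                   # ≈ 1.0: p_θ is normalised
print(quad(lambda x: x * p_theta(x), 0, np.inf)[0])  # μ = ⟨E⟩_{p_θ} ≈ 2.0 = -1/θ
```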

Finally, note that the derivation above is only heuristic: it is much harder to be sure that the exponential distribution we obtained actually maximises entropy (see "Probability Distributions and Maximum Entropy" by Keith Conrad).

### Parameter Sets

The *mean parameter space* denotes all the values that the mean energy can take:

\[ \mathcal{M} := \{ μ\in\mathbb{R}^n \ |\ \exists p,\quad \langle E \rangle_p = μ \} \]

A seemingly unrelated set is the *canonical parameter space* (or temperature space), defined by

\[ Ω := \{ θ \ | \ A(θ) < ∞ \} \]

Let us figure out the case of the exponential distribution. The state space is $X = [0,∞)$, and $δx = \dd x$. The energy function is $E(x) = x$.

The set of possible mean energies is thus $\mathcal{M} = (0,∞)$. This is because for any $μ \in (0,∞)$, the uniform distribution $p(x) := \frac{1}{2μ} \mathbf{1}_{[0,2μ]}(x)$ is such that $\langle E \rangle_p = μ$. Suppose on the other hand that $\int_X x p(x) \dd x = 0$: this implies that $p(x) = 0$ for $x>0$, but this contradicts the normalisation condition of $p$, so we conclude that $0 \not\in \mathcal{M}$.

What about the set of temperatures? We compute: $\int_X \exp(θ x) \dd x = \frac{1}{θ}[ \exp(θ x)]_{0}^{∞}$, which is only finite if $θ < 0$, in which case it equals $-\frac{1}{θ}$.

We conclude that \[ \mathcal{M} = (0,∞) \qquad Ω = (-∞,0) \]
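A quick numerical confirmation of the computation above (a sketch using scipy.integrate.quad; the specific values of $θ$ are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

# exp(A(θ)) = ∫_0^∞ exp(θx) dx equals -1/θ for θ < 0, and diverges for θ ≥ 0
for theta in (-2.0, -0.5, -0.1):
    val, _ = quad(lambda x: np.exp(theta * x), 0, np.inf)
    print(theta, val, -1 / theta)  # val ≈ -1/θ
```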

## Free Energy Function

We call **free energy** the function $A$ which was defined by
\[ \exp(A(θ)) := \int_X \exp(θ \cdot E(x)) δx \]

### Mean

The mean is the derivative of the free energy:

\[ \frac{∂A}{∂θ_i} = \langle E_i \rangle_{p_θ} (= μ_i) \]

Indeed: \[ \frac{∂A}{∂θ_i} = \exp(-A(θ)) \int_X E_i(x) \exp(θ \cdot E(x)) δx = \langle E_i \rangle_{p_θ} \]
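For the exponential example this can be checked numerically. A minimal sketch, with $A$ computed by quadrature and the derivative approximated by a central finite difference at an arbitrary $θ$:

```python
import numpy as np
from scipy.integrate import quad

def A(theta):
    # free energy of the exponential example: A(θ) = log ∫_0^∞ exp(θx) dx (= -log(-θ))
    return np.log(quad(lambda x: np.exp(theta * x), 0, np.inf)[0])

theta, h = -0.5, 1e-5
dA = (A(theta + h) - A(theta - h)) / (2 * h)  # ∂A/∂θ by central difference
print(dA, -1 / theta)                         # both ≈ μ = 2.0
```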

### Covariance

The covariance is the second order derivative of the free energy:

\[ \frac{∂^2 A}{∂θ_i ∂θ_j} = \langle E_i E_j \rangle_θ - \langle E_i \rangle_θ \langle E_j \rangle_θ ≥ 0 \]

(positive semidefinite as a matrix). In particular, $A$ is a convex function of $θ$, which is what makes the duality of the next section work.

Indeed: \[ \frac{∂}{∂θ_j} p_{θ} = \Bigl(E_j(x) - \frac{∂A}{∂θ_j}\Bigr) p_{θ} = (E_j(x) - μ_j) p_{θ} \] so \[ \frac{∂μ_i}{∂θ_j} = \frac{∂}{∂θ_j} \int_X E_i(x) p_{θ} δx = \langle E_i(E_j - μ_j)\rangle_{p_θ} \]
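Continuing the sketch above, the second derivative matches the variance, here $1/θ^2$ (a slightly larger step keeps the quadrature noise under control):

```python
# second central difference of the free energy A defined in the previous sketch
h = 1e-3
d2A = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h**2
print(d2A, 1 / theta**2)  # both ≈ 4.0, the variance of E under p_θ
```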

## Duality

### General Convex Duality

The general idea, described in "Making Sense of the Legendre Transform" or in "A graphical derivation of the Legendre transform", is that

- if a function is convex, then its derivative is increasing
- as the derivative is increasing, it defines a bijection
- the *inverse* of this bijection is still increasing
- by integrating it, we thus obtain *another* convex function

If we summarise the idea in one dimension, given a convex function $A(θ)$, one computes the derivative:

\[ \dd A = μ \dd θ\]

This defines $μ(θ)$ as the derivative of $A$ at $θ$.

If we define $A^{*}(μ)$ by $\dd A^{*}(μ) = θ \dd μ$, this defines $A^{*}$ up to a constant. We then notice that $\dd A + \dd A^{*} = μ \dd θ + θ \dd μ = \dd (μ θ)$, which suggests defining $A^{*}$ by

\[ A^*(μ) := μ θ - A(θ)\]

Notice finally that this extends to

- arbitrary dimensions
- not necessarily *smooth* convex functions
- convex functions not defined on the whole of $\RR^n$ (one can assume that the function is infinite where it is not defined)
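For convex $A$, the differential definition above agrees with the Legendre-Fenchel formula $A^*(μ) = \sup_θ\,(μθ - A(θ))$, which is easy to approximate numerically. A minimal Python sketch, using the free energy of the exponential example (computed analytically in the next subsection):

```python
import numpy as np

A = lambda th: -np.log(-th)           # free energy of the exponential example
thetas = -np.logspace(-3, 3, 200001)  # grid over Ω = (-∞, 0)

def A_star(mu):
    # Legendre-Fenchel transform A*(μ) = sup_θ (μθ - A(θ)), approximated on the grid
    return np.max(mu * thetas - A(thetas))

print(A_star(2.0), -1 - np.log(2.0))  # both ≈ -1 - log(μ)
```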

### Entropy as Conjugate to the Free Energy

Let us compute the entropy: \[ H(p_θ) = \langle -\log(p_θ) \rangle_{p_θ} = \langle A(θ) - θ \cdot E(x) \rangle_{θ} = A(θ) - θ \cdot μ \]

So we obtain that the convex conjugate of the free energy is the negative of the entropy:

\[A^*(μ) = - H(p_θ) \]

where $μ = \langle E \rangle_{θ}$.

Let us figure out what happens in the exponential distribution case: $X = [0,∞)$, $δx = \dd x$, $E(x) = x$.

We have essentially already computed $A(θ) = \log(-\frac{1}{θ}) = -\log(-θ)$, which is defined on $Ω = (-∞,0)$ (alternatively: $A$ is defined on $\RR$, but $A(θ) = ∞$ when $θ ≥0$).

This gives $μ = A'(θ) = -\frac{1}{θ}$, and thus $θ = - \frac{1}{μ}$. Using the convex conjugate identity, we obtain $A^{*}(μ) = θμ - A(θ) = -1 + \log(-θ) = -1 - \log(μ)$.
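As a consistency check, combining $A^* = -H$ with the formula just obtained gives \[ H(p_θ) = -A^{*}(μ) = 1 + \log(μ) \] which is indeed the well-known entropy of the exponential distribution with mean $μ$.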

## Final Remarks

There are many other aspects that would deserve to be mentioned.

- On a finite state space, every probability distribution is a member of an exponential family (see "How the simplex is a vector space")
- When the energy function is constant (and the underlying measure has finite mass), one recovers the uniform maximum entropy distribution. It is a nice exercise to see what happens to the convex duality in that case.
- Members of exponential families have conjugate priors (see these notes by David M. Blei)
- Check out the list of exponential families on Wikipedia

Finally, these notes are largely inspired by the excellent monograph "Graphical Models, Exponential Families, and Variational Inference" by Wainwright and Jordan.