The Evidence Approximation

Our linear regression model currently depends on the weights $\mathbf{w}$ and on the hyperparameters $\alpha$ and $\beta$ (see prev. paragraphs). A fully Bayesian treatment would introduce prior distributions over all the parameters and hyperparameters, and calculate the predictive distribution by marginalizing over all of them. However, the integral of this full marginalization is analytically intractable.

If we introduce two priors over $\alpha$ and $\beta$ (hyperpriors), then the predictive distribution is obtained by marginalizing over $\mathbf{w}$, $\alpha$ and $\beta$ as follows:

$$p(t \mid \mathbf{t}) = \iiint p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, p(\alpha, \beta \mid \mathbf{t})\, \mathrm{d}\mathbf{w}\, \mathrm{d}\alpha\, \mathrm{d}\beta$$

where $p(t \mid \mathbf{w}, \beta)$ is the likelihood function (given by 3.8), $p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)$ is the posterior over the weights (the Gaussian with mean $\mathbf{m}_N$ and covariance matrix $\mathbf{S}_N$), and $p(\alpha, \beta \mid \mathbf{t})$ is the posterior over the hyperparameters.

An approximation, called empirical Bayes (this is the evidence approximation of the section title), is given by the following two steps:

  1. Obtaining the marginal likelihood $p(\mathbf{t} \mid \alpha, \beta)$ by integrating over the weights $\mathbf{w}$
  2. Maximizing this marginal likelihood to obtain the hyperparameters $\hat{\alpha}$ and $\hat{\beta}$ (a rough sketch of the procedure follows)
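
A minimal numpy/scipy sketch of these two steps, assuming a `log_evidence(alpha, beta, Phi, t)` function like the one sketched further down in these notes (all names here are mine, not the book's):

```python
import numpy as np
from scipy.optimize import minimize

def empirical_bayes(Phi, t, log_evidence):
    """Step 2: numerically maximize the marginal likelihood over (alpha, beta).

    `log_evidence` is assumed to implement step 1 (the analytic integration
    over w), e.g. the sketch given later in these notes."""
    def neg_log_ev(params):
        # Optimize in log-space so alpha and beta stay positive.
        alpha, beta = np.exp(params)
        return -log_evidence(alpha, beta, Phi, t)

    res = minimize(neg_log_ev, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
    alpha_hat, beta_hat = np.exp(res.x)
    return alpha_hat, beta_hat
```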

Another approximation can be used if the posterior $p(\alpha, \beta \mid \mathbf{t})$ is sharply peaked around the values $\hat{\alpha}$, $\hat{\beta}$. In this case we just obtain these two values, plug them into the marginalization, and marginalize over $\mathbf{w}$ only:

$$p(t \mid \mathbf{t}) \simeq p(t \mid \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t \mid \mathbf{w}, \hat{\beta})\, p(\mathbf{w} \mid \mathbf{t}, \hat{\alpha}, \hat{\beta})\, \mathrm{d}\mathbf{w}$$
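
A minimal numpy sketch of this plug-in predictive distribution, using the Gaussian posterior results from the previous sections (function and variable names are mine):

```python
import numpy as np

def predictive(phi_x, Phi, t, alpha_hat, beta_hat):
    """Predictive mean and variance at a new input, with phi_x = phi(x),
    using fixed hyperparameter estimates (the peaked-posterior approximation)."""
    M = Phi.shape[1]
    # Posterior over w for the plugged-in hyperparameters: N(w | m_N, S_N)
    A = alpha_hat * np.eye(M) + beta_hat * Phi.T @ Phi   # this is S_N^{-1}
    S_N = np.linalg.inv(A)
    m_N = beta_hat * S_N @ Phi.T @ t
    # Integrating the Gaussian likelihood against this Gaussian posterior
    # gives another Gaussian (the predictive results from earlier in the chapter).
    mean = m_N @ phi_x
    var = 1.0 / beta_hat + phi_x @ S_N @ phi_x
    return mean, var
```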

From Bayes' theorem we know that:

$$p(\alpha, \beta \mid \mathbf{t}) \propto p(\mathbf{t} \mid \alpha, \beta)\, p(\alpha, \beta)$$

If the prior $p(\alpha, \beta)$ is relatively flat, then $\hat{\alpha}$ and $\hat{\beta}$ can be obtained by maximizing the marginal likelihood $p(\mathbf{t} \mid \alpha, \beta)$ instead of the posterior $p(\alpha, \beta \mid \mathbf{t})$.

But how do we compute the likelihood $p(\mathbf{t} \mid \alpha, \beta)$? Let's marginalize over $\mathbf{w}$:

$$p(\mathbf{t} \mid \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)\, \mathrm{d}\mathbf{w} = \left(\frac{\beta}{2\pi}\right)^{N/2} \left(\frac{\alpha}{2\pi}\right)^{M/2} \int \exp\{-E(\mathbf{w})\}\, \mathrm{d}\mathbf{w}$$

where

$$E(\mathbf{w}) = \beta E_D(\mathbf{w}) + \alpha E_W(\mathbf{w}) = \frac{\beta}{2}\,\lVert \mathbf{t} - \boldsymbol{\Phi}\mathbf{w} \rVert^2 + \frac{\alpha}{2}\,\mathbf{w}^{\mathrm{T}}\mathbf{w}$$

If you want to see the intermediate calculation behind the second equality, read the content of this image:

calcs
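
If the image is unavailable, the identity can also be checked numerically; a small sketch on random toy data (my own names, not the book's):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 4
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
w = rng.normal(size=M)
alpha, beta = 0.5, 2.0

def E(w):
    return beta / 2 * np.sum((t - Phi @ w) ** 2) + alpha / 2 * w @ w

# log p(t | w, beta): product of N univariate Gaussians N(t_n | w^T phi_n, 1/beta)
log_lik = N / 2 * np.log(beta / (2 * np.pi)) - beta / 2 * np.sum((t - Phi @ w) ** 2)
# log p(w | alpha): zero-mean isotropic Gaussian prior N(w | 0, alpha^{-1} I)
log_prior = M / 2 * np.log(alpha / (2 * np.pi)) - alpha / 2 * w @ w

lhs = log_lik + log_prior
rhs = N / 2 * np.log(beta / (2 * np.pi)) + M / 2 * np.log(alpha / (2 * np.pi)) - E(w)
print(np.allclose(lhs, rhs))  # True: the integrand really is const * exp(-E(w))
```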

After this, the book does a little bit of magic. It defines:

  1. $\mathbf{A} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$, which is also $\nabla\nabla E(\mathbf{w})$ (the Hessian of $E(\mathbf{w})$)

And then derives the "completing the square" form:

$$E(\mathbf{w}) = E(\mathbf{m}_N) + \frac{1}{2}(\mathbf{w} - \mathbf{m}_N)^{\mathrm{T}}\mathbf{A}(\mathbf{w} - \mathbf{m}_N)$$

where $\mathbf{m}_N = \beta\mathbf{A}^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}$ and $E(\mathbf{m}_N) = \frac{\beta}{2}\lVert \mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N \rVert^2 + \frac{\alpha}{2}\mathbf{m}_N^{\mathrm{T}}\mathbf{m}_N$.

The steps are depicted in the online exercise solutions provided by the author:

steps

There is a connection between these quantities and the posterior distribution: we can see that $\mathbf{A} = \mathbf{S}_N^{-1}$ and that $\mathbf{m}_N$ is exactly the posterior mean.
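
A quick numerical sanity check of the completing-the-square step on random toy data (a sketch, with my own variable names):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 30, 5
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
alpha, beta = 0.5, 2.0

A = alpha * np.eye(M) + beta * Phi.T @ Phi      # Hessian of E(w); same expression as S_N^{-1}
m_N = beta * np.linalg.solve(A, Phi.T @ t)      # same expression as the posterior mean
E = lambda w: beta / 2 * np.sum((t - Phi @ w) ** 2) + alpha / 2 * w @ w

w = rng.normal(size=M)
quad = 0.5 * (w - m_N) @ A @ (w - m_N)
print(np.allclose(E(w), E(m_N) + quad))         # True: completing the square holds
```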

We can now solve the integral inside the likelihood function $p(\mathbf{t} \mid \alpha, \beta)$:

$$\int \exp\{-E(\mathbf{w})\}\, \mathrm{d}\mathbf{w} = \exp\{-E(\mathbf{m}_N)\}\,(2\pi)^{M/2}\,\lvert\mathbf{A}\rvert^{-1/2}$$

Now we can substitute this back into the likelihood formula:

$$p(\mathbf{t} \mid \alpha, \beta) = \left(\frac{\beta}{2\pi}\right)^{N/2} \left(\frac{\alpha}{2\pi}\right)^{M/2} \exp\{-E(\mathbf{m}_N)\}\,(2\pi)^{M/2}\,\lvert\mathbf{A}\rvert^{-1/2}$$

And we can also calculate the log likelihood:

$$\ln p(\mathbf{t} \mid \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln\lvert\mathbf{A}\rvert - \frac{N}{2}\ln(2\pi)$$
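
The log marginal likelihood above translates almost line by line into numpy; a minimal sketch (the function name `log_evidence` is mine):

```python
import numpy as np

def log_evidence(alpha, beta, Phi, t):
    """Log marginal likelihood ln p(t | alpha, beta) from the formula above."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = beta / 2 * np.sum((t - Phi @ m_N) ** 2) + alpha / 2 * m_N @ m_N
    sign, logdet_A = np.linalg.slogdet(A)
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta)
            - E_mN - logdet_A / 2 - N / 2 * np.log(2 * np.pi))
```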

The book does an example with polynomials as basis functions and shows how the evidence embodies a trade-off between model accuracy and complexity, with lower values for high-order polynomials.
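
Using the `log_evidence` sketch above, the same qualitative behaviour can be reproduced on toy data; this is not the book's exact experiment, just an illustration with assumed hyperparameter values:

```python
# Fit polynomials of increasing order to noisy sin(2*pi*x) samples
# and compare their (log) evidence.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
alpha, beta = 5e-3, 1 / 0.2**2            # assumed values, kept fixed for the comparison

for order in range(10):
    Phi = np.vander(x, order + 1, increasing=True)   # columns 1, x, x^2, ..., x^order
    print(order, round(log_evidence(alpha, beta, Phi, t), 2))
# The evidence typically peaks at a moderate order and drops again for
# high-order polynomials, penalizing unnecessary complexity.
```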

Finding the hyperparameters

Now that we have obtained the log likelihood expression $\ln p(\mathbf{t} \mid \alpha, \beta)$, we want to maximize it with respect to $\alpha$ and $\beta$.

Maximizing for $\alpha$

When we differentiate w.r.t. $\alpha$, the term that needs some care is $\ln\lvert\mathbf{A}\rvert$. The eigenvalues of matrix $\mathbf{A}$ have the form $\lambda_i + \alpha$, with $(\beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi})\mathbf{u}_i = \lambda_i\mathbf{u}_i$. The derivative is:

$$\frac{\mathrm{d}}{\mathrm{d}\alpha}\ln\lvert\mathbf{A}\rvert = \frac{\mathrm{d}}{\mathrm{d}\alpha}\sum_i \ln(\lambda_i + \alpha) = \sum_i \frac{1}{\lambda_i + \alpha}$$
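
Anticipating where this derivative leads, the book arrives at the fixed-point re-estimation $\alpha = \gamma / (\mathbf{m}_N^{\mathrm{T}}\mathbf{m}_N)$ with $\gamma = \sum_i \lambda_i/(\lambda_i + \alpha)$; a minimal iterative sketch (names are mine, $\beta$ kept fixed):

```python
import numpy as np

def estimate_alpha(Phi, t, beta, alpha=1.0, n_iter=50):
    """Iteratively re-estimate alpha by maximizing the evidence (beta fixed)."""
    # Eigenvalues lambda_i of beta * Phi^T Phi; A then has eigenvalues lambda_i + alpha.
    lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)
    M = Phi.shape[1]
    for _ in range(n_iter):
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        gamma = np.sum(lam / (lam + alpha))    # effective number of parameters
        alpha = gamma / (m_N @ m_N)            # fixed-point update
    return alpha
```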