Probability distributions

This chapter focuses on the problem of density estimation, which consists in estimating the probability distribution $p(\mathbf{x})$ from a set of independent and identically distributed data points drawn from it. There are two main ways of doing that. The first is parametric density estimation, where you choose a known parametric distribution (e.g., Gaussian) and try to find the parameters that best fit the data; this method assumes that the chosen parametric distribution is suitable for the data, which is not always the case. The other way is to use non-parametric density estimation techniques (e.g., histograms, nearest neighbours, kernels).

Bernoulli experiment

Suppose we have a data set $\mathcal{D} = \{x_1, \dots, x_N\}$ of i.i.d. observed values of $x \in \{0, 1\}$, with $p(x = 1 \mid \mu) = \mu$. We can estimate the parameter $\mu$ from the sample in a frequentist way, by maximizing the likelihood (or the log-likelihood):

$$p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1-\mu)^{1-x_n}, \qquad \ln p(\mathcal{D} \mid \mu) = \sum_{n=1}^{N} \left\{ x_n \ln \mu + (1 - x_n) \ln(1 - \mu) \right\}$$

To find $\mu_{ML}$, let's set the derivative of the log-likelihood w.r.t. $\mu$ to 0:

$$\frac{\partial}{\partial \mu} \ln p(\mathcal{D} \mid \mu) = \sum_{n=1}^{N} \left( \frac{x_n}{\mu} - \frac{1 - x_n}{1 - \mu} \right) = 0$$

Since the number of heads is $m = \sum_{n=1}^{N} x_n$, then:

$$\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n = \frac{m}{N}$$

So $\mu_{ML}$ is estimated by the sample mean. In this case, the sample mean is an example of a sufficient statistic for the model, i.e. computing other statistics from the sample would not add any information beyond it.
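A minimal sanity check of this result (assuming NumPy; the parameter value and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 0.7                                  # assumed "true" parameter, for illustration only
x = rng.binomial(n=1, p=mu_true, size=1000)    # i.i.d. Bernoulli observations

# Maximum-likelihood estimate: mu_ML = m / N = sample mean
mu_ml = x.mean()
print(mu_ml)   # should be close to 0.7
```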

Binomial distribution

Sticking with the coin flips example, the binomial distribution models the probability of obtaining $m$ heads out of $N$ total coin flips:

$$\text{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^{m} (1-\mu)^{N-m}$$

Where $\binom{N}{m}$ represents all the possible ways of obtaining $m$ heads out of $N$ coin flips. The mean and variance of a binomial variable can be derived by knowing that, for i.i.d. events, the mean of the sum is the sum of the means and the variance of the sum is the sum of the variances. Because $\mathbb{E}[x] = \mu$ and $\text{var}[x] = \mu(1-\mu)$ for a single Bernoulli trial, then:

$$\mathbb{E}[m] = N\mu, \qquad \text{var}[m] = N\mu(1-\mu)$$
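A quick Monte Carlo sanity check of these two identities (assuming NumPy; the values of $N$ and $\mu$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, mu = 10, 0.3                               # illustrative values
m = rng.binomial(n=N, p=mu, size=100_000)     # many draws of "heads out of N flips"

print(m.mean(), N * mu)                 # empirical mean vs. N*mu
print(m.var(), N * mu * (1 - mu))       # empirical variance vs. N*mu*(1-mu)
```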

Beta distribution

Please read the estimating_parameters_using_a_bayesian_approach notebook. Some quick notes here: the beta distribution is defined as

$$\text{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \mu^{a-1} (1-\mu)^{b-1}$$

and its mean and variance are

$$\mathbb{E}[\mu] = \frac{a}{a+b}, \qquad \text{var}[\mu] = \frac{ab}{(a+b)^2 (a+b+1)}$$
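As a hedged sketch of the Bayesian update covered in that notebook (assuming SciPy; the hyperparameters and counts are illustrative): the Beta prior is conjugate to the Bernoulli likelihood, so the posterior is again a Beta with the observed counts added to the hyperparameters.

```python
from scipy import stats

a, b = 2.0, 2.0          # illustrative Beta prior hyperparameters
heads, tails = 7, 3      # observed coin flips (illustrative)

# Conjugate update: posterior is Beta(a + heads, b + tails)
posterior = stats.beta(a + heads, b + tails)
print(posterior.mean())  # posterior mean of mu = (a + heads) / (a + b + heads + tails)
```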

Multinomial variables

It's a generalization of the Bernoulli distribution where a random variable has $K$ possible values instead of being binary. We can represent the variable as a $K$-dimensional binary vector $\mathbf{x}$ in which exactly one component is asserted (1-of-$K$ encoding), e.g. for $K = 6$:

$$\mathbf{x} = (0, 0, 1, 0, 0, 0)^{\mathsf T}$$

The probability of each component being asserted is regulated by a probability vector $\boldsymbol{\mu} = (\mu_1, \dots, \mu_K)^{\mathsf T}$, so that basically $p(x_k = 1) = \mu_k$. Since the vector $\boldsymbol{\mu}$ represents a probability distribution, then:

$$\mu_k \geq 0 \quad \text{and} \quad \sum_{k=1}^{K} \mu_k = 1$$

The multinomial-type distribution of $\mathbf{x}$ is given by:

$$p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}$$

And the expected value is $\mathbb{E}[\mathbf{x} \mid \boldsymbol{\mu}] = \boldsymbol{\mu}$. Let's consider a dataset $\mathcal{D}$ of $N$ independent observations $\mathbf{x}_1, \dots, \mathbf{x}_N$; then the likelihood function is:

$$p(\mathcal{D} \mid \boldsymbol{\mu}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{m_k}$$

where $m_k = \sum_{n=1}^{N} x_{nk}$ is the number of observations for which $x_k = 1$.

If we want to find $\boldsymbol{\mu}_{ML}$ from $\mathcal{D}$ by maximizing the (log) likelihood, we have to constrain $\boldsymbol{\mu}$ to be a probability distribution, and therefore we can use a Lagrange multiplier $\lambda$:

$$\sum_{k=1}^{K} m_k \ln \mu_k + \lambda \left( \sum_{k=1}^{K} \mu_k - 1 \right)$$

Setting the derivative w.r.t. $\mu_k$ to zero we get $\mu_k = -m_k / \lambda$. We can solve for the Lagrange multiplier by substituting this result into the constraint $\sum_k \mu_k = 1$, obtaining $\lambda = -N$ and

$$\mu_k^{ML} = \frac{m_k}{N}$$
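A minimal sketch (NumPy; the one-hot data is illustrative) of the constrained ML solution $\mu_k^{ML} = m_k / N$:

```python
import numpy as np

# Illustrative one-hot encoded observations (N=5, K=3)
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])

m = X.sum(axis=0)        # counts m_k
mu_ml = m / X.shape[0]   # mu_k = m_k / N, the constrained ML solution
print(mu_ml)             # [0.2, 0.6, 0.2]
```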

We can also consider the joint distribution of the quantities $m_1, \dots, m_K$ (the multinomial distribution) conditioned on the parameter $\boldsymbol{\mu}$ and on the number $N$ of observations:

$$\text{Mult}(m_1, \dots, m_K \mid \boldsymbol{\mu}, N) = \binom{N}{m_1\, m_2 \dots m_K} \prod_{k=1}^{K} \mu_k^{m_k}$$

where

$$\binom{N}{m_1\, m_2 \dots m_K} = \frac{N!}{m_1!\, m_2! \cdots m_K!}$$

and the variables $m_k$ are subject to the constraint $\sum_{k=1}^{K} m_k = N$.

Short description of the Lagrange multiplier's utility, taken from Quora: You are trying to maximize or minimize some function $f$ (distance to treasure), while keeping some other function $g$ fixed at a certain value (stay on the path). At this point, the gradient $\nabla f$ (the compass needle) must be parallel to the gradient $\nabla g$ (the arrows on the signs), but the two vectors will not generally have the same length. The test for whether or not they’re parallel is $\nabla f = \lambda \nabla g$, where $\lambda$ is whatever multiplier is needed to have them match; it will still only be satisfiable if they’re parallel (you can resize the compass needle however you want to make it match the sign arrow, but you have to be at a spot where the directions agree).

Dirichlet distribution

While the beta distribution is a prior of the Bernoulli parameter $\mu$, the Dirichlet distribution is a prior of the multinomial probability vector $\boldsymbol{\mu}$. The definition is:

$$\text{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$

Where $\alpha_0 = \sum_{k=1}^{K} \alpha_k$. Since the parameters are bounded by $0 \leq \mu_k \leq 1$ and $\sum_k \mu_k = 1$, the distribution is confined to a simplex of dimensionality $K-1$.

By multiplying the likelihood function (which is the multinomial distribution) by the prior (which is a Dirichlet distribution) we get something that is proportional to the posterior $p(\boldsymbol{\mu} \mid \mathcal{D}, \boldsymbol{\alpha})$. Assuming a conjugate prior, the posterior has the same form, and hence we can derive the normalization constant by comparison with the Dirichlet distribution definition. The posterior is defined as:

$$p(\boldsymbol{\mu} \mid \mathcal{D}, \boldsymbol{\alpha}) = \text{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha} + \mathbf{m}) = \frac{\Gamma(\alpha_0 + N)}{\Gamma(\alpha_1 + m_1) \cdots \Gamma(\alpha_K + m_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}$$
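A minimal sketch (NumPy; the prior parameters and counts are illustrative) of the conjugate Dirichlet update, which reduces to adding the observed counts to the prior parameters:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # illustrative Dirichlet prior parameters
m = np.array([10, 3, 7])            # observed counts m_k

alpha_post = alpha + m              # conjugacy: posterior is Dir(mu | alpha + m)
post_mean = alpha_post / alpha_post.sum()   # E[mu_k] = (alpha_k + m_k) / (alpha_0 + N)
print(post_mean)
```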

Gaussian distribution

Univariate Gaussian distribution:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}$$

Where $\mu$ and $\sigma^2$ are the mean and the variance of the population.

Multivariate Gaussian distribution:

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\mathsf T} \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}$$

Where $D$ is the dimensionality of $\mathbf{x}$, $\boldsymbol{\mu}$ is the mean vector and $\Sigma$ is the $D \times D$ covariance matrix.

Central Limit Theorem. Subject to certain mild conditions, the sum of a set of random variables, which is of course itself a random variable, has a distribution that becomes increasingly Gaussian as the number of terms in the sum increases.

Observation n.1 - The covariance matrix is always positive semi-definite. This means that the eigenvalues are non-negative.

Observation n.2 - To be well-defined, a Gaussian must have a positive definite covariance matrix, which means that all the eigenvalues are strictly positive. If the covariance matrix has one or more null eigenvalues (positive semi-definite), then the distribution is singular and is confined to a subspace of lower dimensionality.

Observation n.3 - Given a 2D Gaussian distribution, the points of constant density lie on ellipses whose axes are given by the eigenvectors of the covariance matrix, and whose axis lengths are proportional to the square roots of the corresponding eigenvalues. On the ellipse whose semi-axes have length equal to the square roots of the eigenvalues, the density is $e^{-1/2}$ times its value at the mean.

This is a hint at my preferred interpretation of eigenvectors and eigenvalues calculated from data: the eigenvectors represent the axes that capture most of the variability, and the corresponding eigenvalues are an indicator of the amount of variability along each axis (see PCA).

(Figure: 2D Gaussian density contours)
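A small NumPy sketch (the covariance matrix is illustrative) of extracting the ellipse axes and their lengths from a 2D covariance:

```python
import numpy as np

Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])              # illustrative 2D covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigh: eigendecomposition for symmetric matrices
print(eigvecs)                  # columns = directions of the ellipse axes
print(np.sqrt(eigvals))         # proportional to the ellipse semi-axis lengths
```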

Observation n.4 - A multivariate Gaussian can be decomposed as a product of independent univariate Gaussians in the coordinate system defined by the eigenvectors of the covariance matrix.

Observation n.5 - The book provides a formal proof that $\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu}$ and $\text{cov}[\mathbf{x}] = \Sigma$.

Observation n.6 - A general symmetric covariance matrix $\Sigma$ has $D(D+1)/2$ independent parameters, and there are another $D$ independent parameters in $\boldsymbol{\mu}$, giving $D(D+3)/2$ parameters in total. This means that the number of parameters grows quadratically with the dimension $D$.

One way to reduce the number of parameters is to use restricted forms of the covariance matrix. For example, with a diagonal covariance matrix (figure b) we have a linear dependency between the number of parameters and the dimensionality, but we only capture variability along the feature axes. Another possibility is to use an isotropic covariance $\Sigma = \sigma^2 I$ (figure c), where we have a single covariance parameter, but we discard differences in variability across axes. We have a trade-off between model complexity and flexibility that must be addressed based on the application.

(Figure: Gaussian density contours for (a) a general, (b) a diagonal and (c) an isotropic covariance matrix)
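As a hedged sketch of the parameter-count trade-off described above (plain Python; the helper function is illustrative):

```python
def gaussian_param_count(D, covariance="full"):
    """Number of free parameters of a D-dimensional Gaussian (mean included)."""
    if covariance == "full":        # D for the mean + D(D+1)/2 for a symmetric Sigma
        return D + D * (D + 1) // 2
    if covariance == "diagonal":    # D for the mean + D variances
        return 2 * D
    if covariance == "isotropic":   # D for the mean + a single variance
        return D + 1
    raise ValueError(covariance)

for D in (2, 10, 100):
    print(D, [gaussian_param_count(D, c) for c in ("full", "diagonal", "isotropic")])
```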

Limitations of the Gaussian - Gaussian is unimodal (only 1 maximum) and thus is not good at representing multimodal data.

Conditional & Marginal Gaussian Distributions

Given a joint Gaussian distribution $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma)$ with precision matrix $\Lambda \equiv \Sigma^{-1}$ and a partition

$$\mathbf{x} = \begin{pmatrix} \mathbf{x}_a \\ \mathbf{x}_b \end{pmatrix}, \qquad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix}$$

where

$$\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} \qquad \text{and} \qquad \Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}$$

Then we have that the conditional distribution is

$$p(\mathbf{x}_a \mid \mathbf{x}_b) = \mathcal{N}(\mathbf{x}_a \mid \boldsymbol{\mu}_{a|b}, \Lambda_{aa}^{-1})$$

where

$$\boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a - \Lambda_{aa}^{-1} \Lambda_{ab} (\mathbf{x}_b - \boldsymbol{\mu}_b)$$

Note that the conditional mean $\boldsymbol{\mu}_{a|b}$ is a linear function of $\mathbf{x}_b$, while the conditional covariance $\Lambda_{aa}^{-1}$ does not depend on $\mathbf{x}_b$.

And the marginal distribution $p(\mathbf{x}_a)$ is:

$$p(\mathbf{x}_a) = \mathcal{N}(\mathbf{x}_a \mid \boldsymbol{\mu}_a, \Sigma_{aa})$$

All the derivations focus on the quadratic dependence of the exponent on $\mathbf{x}$ and are detailed starting from page 85. The point is that the conditional and marginal distributions of a joint Gaussian distribution are again Gaussian distributions.
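A minimal sketch (NumPy; the numbers are illustrative) of computing $p(\mathbf{x}_a \mid \mathbf{x}_b)$ written in terms of the covariance blocks, which is equivalent to the precision-matrix expressions above:

```python
import numpy as np

# Illustrative partitioned joint Gaussian over (x_a, x_b), both 1-dimensional here
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
a, b = [0], [1]                      # index sets of the two blocks

x_b = np.array([2.0])                # observed value of x_b

# Conditional p(x_a | x_b): mu_{a|b} = mu_a + S_ab S_bb^{-1} (x_b - mu_b),
# Sigma_{a|b} = S_aa - S_ab S_bb^{-1} S_ba
S_bb_inv = np.linalg.inv(Sigma[np.ix_(b, b)])
mu_cond = mu[a] + Sigma[np.ix_(a, b)] @ S_bb_inv @ (x_b - mu[b])
Sigma_cond = Sigma[np.ix_(a, a)] - Sigma[np.ix_(a, b)] @ S_bb_inv @ Sigma[np.ix_(b, a)]
print(mu_cond, Sigma_cond)
```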

Given a marginal Gaussian distribution for $\mathbf{x}$ and a conditional Gaussian distribution for $\mathbf{y}$ given $\mathbf{x}$ in the form:

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Lambda^{-1}), \qquad p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid A\mathbf{x} + \mathbf{b}, L^{-1})$$

Where $A\mathbf{x} + \mathbf{b}$ expresses the fact that the mean of the conditional distribution of $\mathbf{y}$ given $\mathbf{x}$ is a linear function of $\mathbf{x}$, and $L$ is another precision matrix. The marginal distribution of $\mathbf{y}$ and the conditional distribution of $\mathbf{x}$ given $\mathbf{y}$ are given by:

$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid A\boldsymbol{\mu} + \mathbf{b},\; L^{-1} + A \Lambda^{-1} A^{\mathsf T})$$

$$p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}\!\left(\mathbf{x} \mid \Sigma\{A^{\mathsf T} L (\mathbf{y} - \mathbf{b}) + \Lambda \boldsymbol{\mu}\},\; \Sigma\right)$$

where

$$\Sigma = (\Lambda + A^{\mathsf T} L A)^{-1}$$
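A minimal Monte Carlo sanity check of the marginal $p(\mathbf{y})$ in the scalar case (assuming NumPy; the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1D linear-Gaussian model: p(x) = N(mu, 1/Lambda), p(y|x) = N(A x + b, 1/L)
mu, Lam = 1.0, 4.0
A, b, L = 2.0, 0.5, 1.0

x = rng.normal(mu, np.sqrt(1 / Lam), size=200_000)
y = rng.normal(A * x + b, np.sqrt(1 / L))

# Compare with the closed-form marginal p(y) = N(A mu + b, 1/L + A Lambda^-1 A)
print(y.mean(), A * mu + b)
print(y.var(), 1 / L + A * (1 / Lam) * A)
```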

Maximum likelihood for the Gaussian

Let $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_N)^{\mathsf T}$ be a dataset of $N$ observations drawn independently from a multivariate Gaussian distribution. We can estimate the parameters by maximizing the log-likelihood:

$$\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol{\mu})^{\mathsf T}\Sigma^{-1}(\mathbf{x}_n - \boldsymbol{\mu})$$

By setting the derivatives to zero, we can compute:

$$\boldsymbol{\mu}_{ML} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n, \qquad \Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol{\mu}_{ML})(\mathbf{x}_n - \boldsymbol{\mu}_{ML})^{\mathsf T}$$

Where we can calculate the two parameters in sequential steps, since the maximization with respect to $\boldsymbol{\mu}$ does not depend on $\Sigma$. By taking the expectations we see that:

$$\mathbb{E}[\boldsymbol{\mu}_{ML}] = \boldsymbol{\mu}, \qquad \mathbb{E}[\Sigma_{ML}] = \frac{N-1}{N}\Sigma$$

We see that $\Sigma_{ML}$ is a biased estimator, and we can correct it by:

$$\widetilde{\Sigma} = \frac{1}{N-1}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol{\mu}_{ML})(\mathbf{x}_n - \boldsymbol{\mu}_{ML})^{\mathsf T}$$

Now $\widetilde{\Sigma}$ is an unbiased estimator of the true covariance.
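A quick sketch with NumPy (the data is illustrative); `np.cov` exposes exactly this choice through its `ddof` argument:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))        # illustrative small dataset (N=10, D=3)

mu_ml = X.mean(axis=0)
Sigma_ml = np.cov(X, rowvar=False, ddof=0)        # divides by N   (biased, ML)
Sigma_unbiased = np.cov(X, rowvar=False, ddof=1)  # divides by N-1 (unbiased)

print(np.allclose(Sigma_unbiased, Sigma_ml * 10 / 9))   # True: the two differ by N/(N-1)
```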

Student's t distribution

In order to estimate the precision $\tau$ of a Gaussian distribution using a Bayesian approach, we can use the Gamma distribution $\text{Gam}(\tau \mid a, b)$ as a prior. Then we can marginalize the precision out to obtain

$$p(x \mid \mu, a, b) = \int_0^{\infty} \mathcal{N}(x \mid \mu, \tau^{-1}) \, \text{Gam}(\tau \mid a, b) \, d\tau$$

By performing a number of steps, we can show that this is a Student's t distribution:

$$\text{St}(x \mid \mu, \lambda, \nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \left( \frac{\lambda}{\pi \nu} \right)^{1/2} \left[ 1 + \frac{\lambda (x - \mu)^2}{\nu} \right]^{-\nu/2 - 1/2}$$

where $\lambda = a/b$ (precision) and $\nu = 2a$ (degrees of freedom). Note that the precision of the t distribution does not correspond to the inverse of its variance!

The parameters can be estimated by Expectation-Maximization, and the result has the property of robustness: outliers do not severely affect the estimated distribution.
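A hedged sketch of this robustness property (assuming SciPy; the data and outliers are synthetic and purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 500),
                       rng.normal(20.0, 1.0, 50)])   # 50 illustrative outliers

loc_gauss, scale_gauss = stats.norm.fit(data)        # Gaussian ML fit (sensitive to outliers)
df_t, loc_t, scale_t = stats.t.fit(data)             # Student's t ML fit (robust)

print(loc_gauss)   # pulled towards the outliers
print(loc_t)       # stays close to 0
```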

The multivariate version of the t distribution is:

$$\text{St}(\mathbf{x} \mid \boldsymbol{\mu}, \Lambda, \nu) = \frac{\Gamma(D/2 + \nu/2)}{\Gamma(\nu/2)} \frac{|\Lambda|^{1/2}}{(\pi \nu)^{D/2}} \left[ 1 + \frac{\Delta^2}{\nu} \right]^{-D/2 - \nu/2}$$

where $\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^{\mathsf T} \Lambda (\mathbf{x} - \boldsymbol{\mu})$ is the squared Mahalanobis distance. Its statistics are

$$\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu} \;\;(\nu > 1), \qquad \text{cov}[\mathbf{x}] = \frac{\nu}{\nu - 2}\Lambda^{-1} \;\;(\nu > 2), \qquad \text{mode}[\mathbf{x}] = \boldsymbol{\mu}$$

Exponential family

A distribution that is part of the exponential family can be represented as:

$$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \, g(\boldsymbol{\eta}) \exp\{\boldsymbol{\eta}^{\mathsf T} \mathbf{u}(\mathbf{x})\}$$

Where $\boldsymbol{\eta}$ are the natural parameters of the distribution and $\mathbf{u}(\mathbf{x})$ is some function of $\mathbf{x}$. The function $g(\boldsymbol{\eta})$ ensures that the distribution is normalized:

$$g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^{\mathsf T} \mathbf{u}(\mathbf{x})\} \, d\mathbf{x} = 1$$

The Bernoulli (with natural parameter $\eta = \ln\frac{\mu}{1-\mu}$), the multinomial (with $\eta_k = \ln \mu_k$) and the Gaussian distributions are all part of this family, and the PRML book proves this at page 113.

If we want to estimate $\boldsymbol{\eta}$, we can do that by maximum likelihood. First, let's take the gradient of both sides of the normalization condition above w.r.t. $\boldsymbol{\eta}$; rearranging the terms yields

$$-\nabla \ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]$$

(a) To understand this step, remember that the integration is w.r.t. $\mathbf{x}$, while the differentiation is w.r.t. $\boldsymbol{\eta}$, so we can move the differentiation inside the integral (the derivative of a sum is the sum of the derivatives).

(b) Recall that $g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^{\mathsf T} \mathbf{u}(\mathbf{x})\}\,d\mathbf{x} = 1$; this means that $\int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^{\mathsf T} \mathbf{u}(\mathbf{x})\}\,d\mathbf{x} = 1 / g(\boldsymbol{\eta})$.

Point (c) is provided by the identity $\nabla \ln g(\boldsymbol{\eta}) = \nabla g(\boldsymbol{\eta}) / g(\boldsymbol{\eta})$ (chain rule).

We will use this result later on. Now suppose we have a set of i.i.d. observations $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ drawn from an exponential-family distribution $p(\mathbf{x} \mid \boldsymbol{\eta})$. The likelihood function is given by:

$$p(\mathbf{X} \mid \boldsymbol{\eta}) = \left( \prod_{n=1}^{N} h(\mathbf{x}_n) \right) g(\boldsymbol{\eta})^{N} \exp\left\{ \boldsymbol{\eta}^{\mathsf T} \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n) \right\}$$

Setting $\nabla_{\boldsymbol{\eta}} \ln p(\mathbf{X} \mid \boldsymbol{\eta}) = 0$ we get:

$$-\nabla \ln g(\boldsymbol{\eta}_{ML}) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)$$

which can in principle be solved to obtain $\boldsymbol{\eta}_{ML}$. The solution depends on the data only through $\sum_n \mathbf{u}(\mathbf{x}_n)$, which is therefore called the sufficient statistic of the distribution. For $N \to \infty$ the right-hand side becomes $\mathbb{E}[\mathbf{u}(\mathbf{x})]$, and therefore $\boldsymbol{\eta}_{ML} \to \boldsymbol{\eta}$.
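A small illustrative sketch (NumPy) of the Bernoulli case written in exponential-family form, showing that the ML solution only needs the sufficient statistic $\sum_n x_n$:

```python
import numpy as np

# Bernoulli in exponential-family form (illustrative):
#   p(x|mu) = (1-mu) * exp(eta * x),  with natural parameter eta = ln(mu / (1-mu))
mu = 0.7
eta = np.log(mu / (1 - mu))          # natural parameter (logit)

rng = np.random.default_rng(0)
x = rng.binomial(1, mu, size=10_000)

# The ML solution depends on the data only through the sufficient statistic sum(u(x)) = sum(x)
suff_stat = x.sum()
mu_ml = suff_stat / x.size
eta_ml = np.log(mu_ml / (1 - mu_ml))
print(eta, eta_ml)
```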

For each exponential-family distribution of the previous form, there exists a conjugate prior distribution over the parameters of the following form:

$$p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu) \, g(\boldsymbol{\eta})^{\nu} \exp\{\nu \boldsymbol{\eta}^{\mathsf T} \boldsymbol{\chi}\}$$

Where $f(\boldsymbol{\chi}, \nu)$ is a normalization coefficient, $g(\boldsymbol{\eta})$ is the same function appearing in the exponential-family distribution, and $\nu$ can be interpreted as an effective number of pseudo-observations in the prior, each of which has a value of the sufficient statistic $\mathbf{u}(\mathbf{x})$ given by $\boldsymbol{\chi}$.

Again - why do we need conjugate priors??

A prior which is conjugate to the likelihood produces a posterior that has the same functional form as the chosen prior. This allows us to derive a closed-form expression for the posterior distribution (otherwise you need to compute the normalization coefficient by integration, YOU DON'T WANT TO DO THAT, RIGHT?)

Noninformative prior

If we have no prior information, we want a prior with minimal influence on the inference. We call such a prior a noninformative prior. The Bayes/Laplace postulate, stated about 200 years ago, says the following:

The principle of insufficient reason. When nothing is known about $\theta$ in advance, let the prior $p(\theta)$ be a uniform distribution, that is, let all possible outcomes of $\theta$ have the same probability.

One noninformative prior could be the uniform distribution, but there are two problems:

  1. If the domain of the parameter is unbounded, the prior distribution cannot be correctly normalized because the integral over the parameter diverges. In that case, we have an improper prior. In practice, improper priors can be used as long as the posterior is proper, i.e. correctly normalized.
  2. The second problem is that if we perform a non-linear change of variable, then the resulting density will not be constant (recall the Jacobian factor).

Nonparametric Methods

The distributions we have seen so far are governed by parameters that are estimated from the data. This is called the parametric approach to density modelling.

In this section we talk about nonparametric approaches to density estimation (only simple frequentist methods).

Consider a continuous variable $x$. The simplest way to model its distribution is to partition the observations of $x$ into different bins of width $\Delta_i$ (often the same width $\Delta$ for every bin), and then count the number $n_i$ of observations of $x$ falling in bin $i$. To turn this count into a normalized probability density:

$$p_i = \frac{n_i}{N \Delta_i}$$
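A minimal sketch of this normalization using NumPy (the sample and bin edges are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                       # illustrative 1D sample
edges = np.linspace(-4, 4, 33)                  # 32 bins of equal width Delta

counts, _ = np.histogram(x, bins=edges)
delta = np.diff(edges)
p = counts / (x.size * delta)                   # p_i = n_i / (N * Delta_i)

print(np.sum(p * delta))                        # integrates to 1 (up to points outside the range)
```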

Problems:

  • If we choose $\Delta$ too small, the resulting distribution will be too spiky (i.e. it will show structure that is not present in the real distribution)
  • If we choose $\Delta$ too big, the resulting distribution may fail to capture the structure of the real distribution

(Figure: histogram density estimates for different values of the bin width $\Delta$)

Advantages:

  • Good visualization of the distribution
  • The dataset can be discarded once the histogram is built
  • Good setup if data points are arriving sequentially

Disadvantages:

  • The estimated density has discontinuities at the bin edges
  • Does not scale with dimensionality (curse of dimensionality): the amount of data needed to work in high-dimensional spaces is prohibitive

Good ideas:

  • To estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point (see the sketch below)
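As a hedged sketch of this idea (NumPy; the function name, bandwidth `h` and data are illustrative), a simple Gaussian kernel density estimator averages a kernel centred on each data point, so only points in a local neighbourhood of the query contribute significantly:

```python
import numpy as np

def kde(x_query, data, h=0.3):
    """Gaussian kernel density estimate: average of kernels centred on the data
    points, so only points near x_query contribute much to the estimate."""
    z = (x_query[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (data.size * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
data = rng.normal(size=500)
grid = np.linspace(-3, 3, 7)
print(kde(grid, data))          # smooth density estimate, no bin-edge discontinuities
```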