Linear models for regression

The goal of regression is to predict the value of one or more continuous target variables given the value of a $D$-dimensional vector of input variables.

By linear models we mean that the model is a linear function of the adjustable parameters. For example, the polynomial curve-fitting algorithm builds a linear model. The simplest form of linear regression models are also linear functions of the input variables.

We get a more useful class of functions by taking linear combinations of a fixed set of nonlinear functions of the input variables, known as basis functions. Such models are linear functions of the parameters (which gives simple analytical properties) and yet can be nonlinear with respect to the input variables.

Given a dataset of $N$ observations $\{\mathbf{x}_n\}$, where $n = 1, \dots, N$, together with the corresponding target values $\{t_n\}$, the goal is to predict the value of $t$ for a new value of $\mathbf{x}$.

  • Simple approach: Find an appropriate function $y(\mathbf{x})$
  • General approach: Find the predictive distribution $p(t \mid \mathbf{x})$ to get the uncertainty of a prediction

Linear Basis Function Models

The simplest linear model involves a linear combination of the input variables, also called linear regression:

$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D$$

This is:

  • A linear function of the parameters (good for optimization)
  • A linear function of the inputs (bad for expressiveness)

Extend the concept of linear combination to combine fixed nonlinear functions of the input:

$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x})$$

where $\phi_j(\mathbf{x})$ are known as basis functions. By having $M-1$ basis-function components, the total number of parameters is $M$ (counting the bias $w_0$).

If we consider an additional dummy basis function $\phi_0(\mathbf{x}) = 1$, then we can write:

$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$

These linear models are:

  • A linear function of the parameters (good for optimization)
  • A nonlinear function of the inputs (good for expressiveness)

Polynomials are basis functions of the form $\phi_j(x) = x^j$. The problem with polynomials is that they are global functions: a change in one region of the input space affects all the other regions.

Other choices for the basis functions are possible. One option is the Gaussian basis function:

$$\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2s^2} \right)$$

In this case, $\mu_j$ is the location in the input space and $s$ is the scale. This basis function is not required to have a probabilistic interpretation.

Another possibility is the sigmoidal basis function:

$$\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)$$

where $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is the logistic function, but we can also use the tanh function.

We can also use Fourier basis functions, so that the regression function is an expansion in sinusoidal functions at given frequencies. Combining basis functions localized in both space and frequency leads to a class of functions known as wavelets.
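
As a concrete sketch (the helper names and parameter values below are my own choices, not from the book), the basis functions above can be turned into $N \times M$ design matrices as follows:

```python
import numpy as np

def polynomial_basis(x, degree):
    """phi_j(x) = x**j for j = 0..degree (phi_0 = 1 acts as the bias)."""
    return np.stack([x**j for j in range(degree + 1)], axis=1)

def gaussian_basis(x, centres, s):
    """phi_j(x) = exp(-(x - mu_j)**2 / (2 s**2)), plus a constant bias column."""
    phi = np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))
    return np.hstack([np.ones((len(x), 1)), phi])

def sigmoid_basis(x, centres, s):
    """phi_j(x) = sigma((x - mu_j) / s), with sigma the logistic function."""
    a = (x[:, None] - centres[None, :]) / s
    return np.hstack([np.ones((len(x), 1)), 1.0 / (1.0 + np.exp(-a))])

x = np.linspace(0, 1, 10)
Phi = gaussian_basis(x, centres=np.linspace(0, 1, 9), s=0.1)
print(Phi.shape)   # (N, M) design matrix: N = 10 observations, M = 10 parameters
```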

Maximum likelihood and least squares

Let $y(\mathbf{x}, \mathbf{w})$ be a deterministic function, and let $\epsilon$ be a zero-mean Gaussian random variable with precision $\beta$. We assume that the target variable $t$ is given by:

$$t = y(\mathbf{x}, \mathbf{w}) + \epsilon$$

The conditional distribution of $t$ will then be

$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$$

For a new value $\mathbf{x}$, the optimal prediction of $t$ (under a squared loss) is given by the conditional mean:

$$\mathbb{E}[t \mid \mathbf{x}] = \int t \, p(t \mid \mathbf{x}) \, dt = y(\mathbf{x}, \mathbf{w})$$

For a dataset $\{\mathbf{x}_n, t_n\}_{n=1}^{N}$, let $\mathbf{t} = (t_1, \dots, t_N)^T$. Assuming that $y$ is given by the linear model $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$, the likelihood of the target variables is given by:

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$$

The log-likelihood is:

$$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(\mathbf{w})$$

where $E_D(\mathbf{w})$ is the sum-of-squares error function:

$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\}^2$$

We now estimate the parameters $\mathbf{w}$ by maximum likelihood. The gradient of the log-likelihood w.r.t. $\mathbf{w}$ is:

$$\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta \sum_{n=1}^{N} \{t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\} \boldsymbol{\phi}(\mathbf{x}_n)^T$$

By setting the gradient to zero and solving for $\mathbf{w}$ we find:

$$\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}$$

which are known as the normal equations for the least-squares problem. Here $\boldsymbol{\Phi}$ is an $N \times M$ matrix called the design matrix, whose elements are given by $\Phi_{nj} = \phi_j(\mathbf{x}_n)$.

The quantity

$$\boldsymbol{\Phi}^{\dagger} \equiv (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T$$

is known as the Moore-Penrose pseudo-inverse of the matrix $\boldsymbol{\Phi}$, which is a generalization of the inverse for non-square matrices.

If we make the bias parameter $w_0$ explicit and solve for it, we obtain $w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j$, where $\bar{t}$ and $\bar{\phi}_j$ are averages over the training set. This suggests that the bias compensates for the difference between the average of the target values and the weighted sum of the averages of the basis function values.

Maximizing the likelihood w.r.t. the precision parameter $\beta$ we get:

$$\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{t_n - \mathbf{w}_{ML}^T \boldsymbol{\phi}(\mathbf{x}_n)\}^2$$

so $\beta_{ML}$ is simply the precision (inverse variance) of the residuals.
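
A minimal NumPy sketch of the maximum-likelihood solution under these formulas, on synthetic sinusoidal data of my own choosing (the Gaussian basis mirrors the earlier sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: noisy samples of sin(2*pi*x) (a toy choice, not from the book).
N = 25
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)

# Design matrix with Gaussian basis functions plus a bias column.
centres, s = np.linspace(0, 1, 9), 0.1
Phi = np.hstack([np.ones((N, 1)),
                 np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))])

# Normal equations: w_ML = pinv(Phi) @ t, with pinv the Moore-Penrose pseudo-inverse.
w_ml = np.linalg.pinv(Phi) @ t

# ML estimate of the noise: 1/beta_ML is the mean squared residual.
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals**2)
```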

Geometric interpretation of the least-squares solution

Consider an $N$-dimensional space, with one dimension per observation in the dataset. Let $\mathbf{t}$ be the vector in that space whose $N$ components are the ground-truth target values $t_1, \dots, t_N$ of the $N$ observations we are trying to predict.

The input variable $\mathbf{x}$ is $D$-dimensional, while the basis-function vector $\boldsymbol{\phi}(\mathbf{x})$ is $M$-dimensional. If we consider each basis function $\phi_j$ evaluated on all the $N$ observations of our dataset, we obtain $M$ vectors $\boldsymbol{\varphi}_j$ in the $N$-dimensional space, which span a subspace $\mathcal{S}$ of dimension $M$.

The target value is predicted by combining the basis-function outputs using the weights $\mathbf{w}$, and therefore the $N$-dimensional vector $\mathbf{y}$ made of the predicted target values for all observations in the dataset is a linear combination of the vectors $\boldsymbol{\varphi}_j$ and lies inside the subspace $\mathcal{S}$.

The book demonstrates that the least-squares solution corresponds to the orthogonal projection of $\mathbf{t}$ onto the $M$-dimensional subspace $\mathcal{S}$.
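
This projection property is easy to check numerically: at the least-squares solution, the residual vector $\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}_{ML}$ is orthogonal to every column of the design matrix. A small sketch on toy data (the polynomial basis and data generator are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 25
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)
Phi = np.stack([x**j for j in range(4)], axis=1)   # simple polynomial design matrix

w_ml = np.linalg.pinv(Phi) @ t          # least-squares solution
y = Phi @ w_ml                          # projection of t onto the column space of Phi

# The residual is orthogonal to the subspace spanned by the columns of Phi.
print(np.allclose(Phi.T @ (t - y), 0))  # True (up to numerical precision)
```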

Sequential Least Squares

The book suggests using stochastic gradient descent to obtain the least-squares solution sequentially (one observation at a time). Given the sum-of-squares loss, the update of the weights is

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta \left( t_n - \mathbf{w}^{(\tau)T} \boldsymbol{\phi}(\mathbf{x}_n) \right) \boldsymbol{\phi}(\mathbf{x}_n)$$

where $\eta$ is the learning rate; this is known as the least-mean-squares (LMS) algorithm.
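
A minimal sketch of this sequential (LMS) update; the learning rate and the toy data stream are my own choices:

```python
import numpy as np

def lms_update(w, phi_n, t_n, eta=0.05):
    """One sequential least-squares (LMS) step: w <- w + eta * (t_n - w^T phi_n) * phi_n."""
    return w + eta * (t_n - w @ phi_n) * phi_n

# Toy usage: stream the dataset one observation at a time.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 100)
Phi = np.stack([x**j for j in range(4)], axis=1)

w = np.zeros(Phi.shape[1])
for phi_n, t_n in zip(Phi, t):
    w = lms_update(w, phi_n, t_n)
```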

Regularized least squares

Adding a regularization term to the loss function helps avoid overfitting the data. The most famous regularization term for least squares is weight decay, where the optimization process is forced to produce small weights unless supported by the data. The general form of the regularized error is:

$$\frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$$

For $q = 2$ we have the classic quadratic regularizer, which keeps the error a quadratic function of $\mathbf{w}$ and admits the closed-form solution $\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}$. For $q = 1$ we have the lasso regularizer, which has the property (for $\lambda$ sufficiently large) of driving some of the weights to zero, leading to a sparse model. Regularization is useful to avoid overfitting when we have a small dataset, even though the problem then becomes finding a suitable $\lambda$.
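
A minimal sketch of the closed-form quadratic ($q = 2$) solution; the data and $\lambda$ are my own choices, and for the lasso ($q = 1$) there is no closed form, so an iterative solver (e.g. scikit-learn's `Lasso`) would typically be used instead:

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 25, 1e-3
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)
Phi = np.stack([x**j for j in range(10)], axis=1)   # deliberately flexible model

# Quadratic (q = 2) weight decay: w = (lambda * I + Phi^T Phi)^-1 Phi^T t
M = Phi.shape[1]
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```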

Multiple outputs

Given a regression problem with a multivariate output, the book demonstrates how the solution decouples across the different target variables (they all share the same pseudo-inverse matrix, assuming the target variables are distributed according to an isotropic Gaussian). Most of the time, we can therefore work with a single target variable and easily generalize to the multivariate case.

Bias-variance decomposition

Suppose we want to find a function $y(\mathbf{x})$ that approximates the target value $t$ for the input $\mathbf{x}$. We model the relation between the input and the target value as $t = h(\mathbf{x}) + \text{noise}$. We assume that $t$ contains random noise, so it is a random variable distributed according to $p(t \mid \mathbf{x})$. We want to find $y(\mathbf{x})$. Let $L(t, y(\mathbf{x}))$ be a loss function that measures the prediction error; then the average loss is:

$$\mathbb{E}[L] = \int \int L(t, y(\mathbf{x})) \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt$$

If the loss is the squared error, then we have:

$$\mathbb{E}[L] = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2 p(\mathbf{x}) \, d\mathbf{x} + \int \int \{h(\mathbf{x}) - t\}^2 p(\mathbf{x}, t) \, d\mathbf{x} \, dt$$

Here $h(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] = \int t \, p(t \mid \mathbf{x}) \, dt$ is the expected value of $t$, which is considered a random variable since we assume it contains random noise. The conditioning on $\mathbf{x}$ reflects the fact that the distribution of $t$ is centered at $h(\mathbf{x})$, which depends on $\mathbf{x}$.

  • The first term depends on $y(\mathbf{x})$ and can be reduced to zero with an unlimited amount of data.
  • The second term depends on the noise in the data, so it can't be changed by acting on $y(\mathbf{x})$; it is the minimum achievable value of the expected loss.

Now let's consider $K$ different datasets $\mathcal{D}$ drawn independently from the same distribution $p(t, \mathbf{x})$. We estimate a different function $y(\mathbf{x}; \mathcal{D})$ for each dataset, since each dataset contains different random noise. We can define the average prediction as $\mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})]$. Now consider the squared loss for one dataset and add and subtract the term $\mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})]$:

$$\{y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] + \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\}^2$$

If we take the expectation of this term w.r.t. the dataset $\mathcal{D}$, the cross-term vanishes and we have:

$$\mathbb{E}_{\mathcal{D}}\big[\{y(\mathbf{x}; \mathcal{D}) - h(\mathbf{x})\}^2\big] = \{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\}^2 + \mathbb{E}_{\mathcal{D}}\big[\{y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})]\}^2\big]$$

The expected squared difference between the model predictions and the desired regression function can thus be expressed as the sum of two terms, the squared bias and the variance.

  • The squared bias term represents the extent to which the average prediction over all datasets differs from the desired regression function $h(\mathbf{x})$
  • The variance term measures the extent to which the solutions for individual datasets vary around their average (sensitivity to the choice of dataset)

If we apply this observation to the expected loss shown before, we have the following decomposition:

$$\text{expected loss} = (\text{bias})^2 + \text{variance} + \text{noise}$$

where

$$(\text{bias})^2 = \int \{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\}^2 \, p(\mathbf{x}) \, d\mathbf{x}$$

$$\text{variance} = \int \mathbb{E}_{\mathcal{D}}\big[\{y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})]\}^2\big] \, p(\mathbf{x}) \, d\mathbf{x}$$

$$\text{noise} = \int \int \{h(\mathbf{x}) - t\}^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt$$

Mathematically: recall the decomposition of the expected loss into two terms; we took the first term and further decomposed it into squared bias plus variance. The expectation in that step is taken w.r.t. the datasets, but we still need to average the result over the input distribution $p(\mathbf{x})$.

In practice, the bias-variance decomposition can be estimated numerically by replacing the expectations with averages over the observed data, as in the sketch below. The method requires multiple datasets, but if we actually had them we could simply merge them into a single large dataset, which by itself would produce less over-fitted models. The bias-variance decomposition is therefore not the best way to validate our models, but it is useful to understand how overfitting works.
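
A sketch of this numerical estimate on synthetic data (the generator $\sin(2\pi x)$, the number of datasets $K$, and the regularized Gaussian-basis model are all my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, lam = 100, 25, 1e-3
x_grid = np.linspace(0, 1, 200)
h = np.sin(2 * np.pi * x_grid)                      # "true" regression function (toy choice)

def design(x, centres=np.linspace(0, 1, 24), s=0.1):
    return np.hstack([np.ones((len(x), 1)),
                      np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))])

# Fit one regularized model per dataset and record its predictions on the grid.
preds = []
for _ in range(K):
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
    Phi = design(x)
    w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
    preds.append(design(x_grid) @ w)
preds = np.array(preds)                             # shape (K, len(x_grid))

avg_pred = preds.mean(axis=0)                       # E_D[y(x; D)], averaged over datasets
bias2 = np.mean((avg_pred - h)**2)                  # integrated squared bias
variance = np.mean(preds.var(axis=0))               # integrated variance
```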

Bayesian Linear Regression

We introduce a Bayesian treatment for linear regression, which will avoid over-fitting and will lead to automatic methods of determining model complexity using training data alone.

Parameter distribution

The likelihood function defined previously is the exponential of a quadratic function of the parameters $\mathbf{w}$,

$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$$

where $\mathbf{t}$ are all the target values in the dataset and $\beta$ is the noise precision. Therefore, the conjugate prior over $\mathbf{w}$ is given by a Gaussian distribution of the form:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$$

where $\mathbf{m}_0$ and $\mathbf{S}_0$ are the mean and covariance.

The posterior is a Gaussian distribution (we are using a conjugate prior) proportional to the product of the likelihood and the prior. We calculate its parameters using result 2.116 from PRML:

$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$

where

$$\mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^T \mathbf{t}), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}$$

Since the posterior is a Gaussian, its mode coincides with its mean, thus the maximum a posteriori weight vector is simply given by $\mathbf{w}_{MAP} = \mathbf{m}_N$.

The Bayesian approach is automatically regularized. Assume the prior to be a zero-mean isotropic Gaussian governed by a single precision parameter $\alpha$:

$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$$

The parameters of the posterior distribution will then be given by:

$$\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^T \mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}$$

The log of the posterior distribution is given by:

$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\}^2 - \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} + \text{const}$$

The maximization of the posterior is therefore equivalent to the minimization of the sum-of-squares error with the addition of a quadratic regularization term with $\lambda = \alpha / \beta$.
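
A minimal sketch of these posterior updates, assuming known $\alpha$ and $\beta$ and a toy polynomial model of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                 # prior precision and noise precision (assumed known)
N = 25
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)
Phi = np.stack([x**j for j in range(6)], axis=1)
M = Phi.shape[1]

# Posterior over the weights: p(w | t) = N(w | m_N, S_N) with
# S_N^-1 = alpha * I + beta * Phi^T Phi,  m_N = beta * S_N Phi^T t.
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t            # also the MAP / regularized solution (lambda = alpha/beta)
```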

Predictive distribution

Once we have the posterior distribution over the weights $\mathbf{w}$, how do we estimate the target value $t$ for a new point $\mathbf{x}$? We use the predictive distribution:

$$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta) \, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta) \, d\mathbf{w}$$

where we recall that $p(t \mid \mathbf{w}, \beta) = \mathcal{N}(t \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \beta^{-1})$ and $p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$. The solution to this integral is explained in (2.115). We have

$$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^T \boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x}))$$

where the variance is

$$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x})$$

Because the noise process and the distribution of $\mathbf{w}$ are independent, the variances are additive. For $N \to \infty$, the second term goes to zero, and the variance of the predictive distribution is given only by the noise in the data.

The more data we have, the narrower the predictive distribution; in fact it can be shown that $\sigma_{N+1}^2(\mathbf{x}) \leq \sigma_N^2(\mathbf{x})$ (Qazaz et al., 1997).
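
A minimal sketch of the predictive mean and variance under the same assumptions as the previous sketch (known $\alpha$ and $\beta$, toy polynomial model):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 25)
Phi = np.stack([x**j for j in range(6)], axis=1)

S_N = np.linalg.inv(alpha * np.eye(6) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive distribution at new inputs: mean m_N^T phi(x) and
# variance sigma_N^2(x) = 1/beta + phi(x)^T S_N phi(x).
x_new = np.linspace(0, 1, 5)
Phi_new = np.stack([x_new**j for j in range(6)], axis=1)
pred_mean = Phi_new @ m_N
pred_var = 1.0 / beta + np.einsum('ni,ij,nj->n', Phi_new, S_N, Phi_new)
```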

Equivalent kernel

To perform inference using the predictive distribution, we return the mean value, which can be written in the form:

$$y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^T \boldsymbol{\phi}(\mathbf{x}) = \beta \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\Phi}^T \mathbf{t} = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n) t_n$$

The mean of the predictive distribution is therefore a linear combination of the target variables from the training set, where

$$k(\mathbf{x}, \mathbf{x}') = \beta \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}')$$

is called the smoother matrix or equivalent kernel. Regression functions that make inference by taking linear combinations of the training target values are called linear smoothers. Such kernels have a localization property: the response increases when $\mathbf{x}$ and $\mathbf{x}'$ are closer.

An alternative approach to linear regression is to directly define an equivalent kernel instead of working with the basis functions. This leads to Gaussian processes.

Some properties of the equivalent kernel are that (1) the weights sum to one, $\sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n) = 1$, and (2) the kernel can be expressed as an inner product, $k(\mathbf{x}, \mathbf{z}) = \boldsymbol{\psi}(\mathbf{x})^T \boldsymbol{\psi}(\mathbf{z})$, where $\boldsymbol{\psi}$ is a nonlinear function.
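
A small numerical check of the equivalent-kernel view, reusing the toy Bayesian setup from the previous sketches: the predictive mean at a point equals the kernel-weighted sum of the training targets, and the weights sum approximately to one.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 25)
Phi = np.stack([x**j for j in range(6)], axis=1)
S_N = np.linalg.inv(alpha * np.eye(6) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Equivalent kernel evaluated at x0 against every training point: k(x0, x_n) = beta * phi(x0)^T S_N phi(x_n).
x0 = np.array([0.5])
phi0 = np.stack([x0**j for j in range(6)], axis=1)     # shape (1, 6)
k = beta * (phi0 @ S_N @ Phi.T).ravel()                # weights over the training targets

# The predictive mean is the kernel-weighted sum of the training targets ...
print(np.allclose(k @ t, (phi0 @ m_N)[0]))             # True
# ... and the weights sum approximately to one.
print(k.sum())
```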

Bayesian Model Comparison

Suppose we want to compare models $\{\mathcal{M}_i\}$, $i = 1, \dots, L$, where each model represents a different probability distribution over the observed data $\mathcal{D}$. The uncertainty about the model is expressed by a prior distribution $p(\mathcal{M}_i)$ (which we can assume to be uniform). Given the dataset $\mathcal{D}$, we want to evaluate the posterior distribution:

$$p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{M}_i) \, p(\mathcal{D} \mid \mathcal{M}_i)$$

The term $p(\mathcal{D} \mid \mathcal{M}_i)$ is called the model evidence or marginal likelihood, since it can be viewed as a likelihood function over the space of models, in which the parameters have been marginalized out.

The ratio of model evidences $p(\mathcal{D} \mid \mathcal{M}_i) / p(\mathcal{D} \mid \mathcal{M}_j)$ is called the Bayes factor.

Given the posterior $p(\mathcal{M}_i \mid \mathcal{D})$, the predictive distribution is given by the sum and product rules:

$$p(t \mid \mathbf{x}, \mathcal{D}) = \sum_{i=1}^{L} p(t \mid \mathbf{x}, \mathcal{M}_i, \mathcal{D}) \, p(\mathcal{M}_i \mid \mathcal{D})$$

This is an example of a mixture distribution, obtained by averaging the predictive distributions of individual models weighted by the posterior probabilities of those models.

An approximation of model averaging is to use the most probable model alone to make predictions. This is called model selection.

Now we focus on the model evidence / marginal likelihood. For a model governed by parameters $\mathbf{w}$, the evidence is:

$$p(\mathcal{D} \mid \mathcal{M}_i) = \int p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_i) \, p(\mathbf{w} \mid \mathcal{M}_i) \, d\mathbf{w}$$

  1. The evidence can be viewed as the probability of generating the dataset $\mathcal{D}$ from the model $\mathcal{M}_i$ by randomly sampling the parameters $\mathbf{w}$ from their prior.
  2. The evidence appears as the normalization coefficient in the posterior over the parameters, $p(\mathbf{w} \mid \mathcal{D}, \mathcal{M}_i)$.

Bayesian model complexity. The evidence is useful to evaluate the complexity of the model. Let's approximate the integral above by assuming that the posterior is peaked at the most probable value $w_{MAP}$ with a width of $\Delta w_{\text{posterior}}$, and that the prior is flat with width $\Delta w_{\text{prior}}$. In this case, for a single parameter the integral can be approximated by:

$$p(\mathcal{D}) \simeq p(\mathcal{D} \mid w_{MAP}) \, \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}$$

If we take the logs (with $M$ parameters sharing the same ratio of widths):

$$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D} \mid \mathbf{w}_{MAP}) + M \ln \left( \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}} \right)$$

If the model complexity increases, then the first term increases because the model better fits the data, but the second term decreases, because $\Delta w_{\text{posterior}}$ becomes smaller and the ratio approaches zero. In general, the Bayesian approach favours the best trade-off between accuracy and complexity.
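
As a sanity check (not from the book), here is a toy 1-D numerical illustration of this approximation, with an arbitrary Gaussian-shaped likelihood and a flat prior; the posterior width is taken as the effective width of the likelihood peak:

```python
import numpy as np

# Toy 1-D illustration of  p(D) ~= p(D | w_MAP) * (delta_w_posterior / delta_w_prior).
w = np.linspace(-20, 20, 200001)
dw = w[1] - w[0]

# "Likelihood" p(D | w): a peak at w_MAP with peak value 0.3 (all numbers are arbitrary).
w_map, sigma = 1.0, 0.5
likelihood = 0.3 * np.exp(-0.5 * ((w - w_map) / sigma)**2)
delta_post = sigma * np.sqrt(2 * np.pi)                      # effective width of the peak

# Flat prior of width delta_prior, centred at zero.
delta_prior = 10.0
prior = np.where(np.abs(w) <= delta_prior / 2, 1.0 / delta_prior, 0.0)

evidence_exact = np.sum(likelihood * prior) * dw             # p(D) = integral of p(D|w) p(w)
evidence_approx = 0.3 * delta_post / delta_prior             # peaked-posterior approximation
print(evidence_exact, evidence_approx)                       # nearly identical

# A broader prior (a "more complex" model with more parameter freedom) lowers the evidence.
```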