Linear models for classification

4.1.3 Least squares for classification

Matrices legend.

| Matrix | Dimension |
| --- | --- |
| $\tilde{\mathbf{W}}$ (weights, one column $\tilde{\mathbf{w}}_k$ per class) | $(D+1) \times K$ |
| $\tilde{\mathbf{X}}$ (inputs, one row $\tilde{\mathbf{x}}_n^T$ per observation) | $N \times (D+1)$ |
| $\mathbf{T}$ (targets, one one-hot row $\mathbf{t}_n^T$ per observation) | $N \times K$ |

Consider a classification task with $K$ classes, and let $\mathbf{t}$ be a one-hot encoded target vector. Each class $C_k$ is described by its own linear model, so that

$$y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}, \qquad k = 1, \dots, K$$

By using vector notation, we can combine them together:

$$y(\mathbf{x}) = \tilde{\mathbf{W}}^T \tilde{\mathbf{x}}$$

where $\tilde{\mathbf{W}}$ is a $(D+1) \times K$ matrix such that the $k$-th column is $\tilde{\mathbf{w}}_k = (w_{k0}, \mathbf{w}_k^T)^T$, and $\tilde{\mathbf{x}} = (1, \mathbf{x}^T)^T$.

Objective: determine the parameters of $\tilde{\mathbf{W}}$ by minimizing a sum-of-squares loss function.

Consider a training dataset $\{\mathbf{x}_n, \mathbf{t}_n\}$, $n = 1, \dots, N$, and define two matrices:

  • $\mathbf{T}$, of dimension $N \times K$, such that the $n$-th row is the binary one-hot-encoded vector $\mathbf{t}_n^T$.
  • $\tilde{\mathbf{X}}$, of dimension $N \times (D+1)$, such that the $n$-th row is $\tilde{\mathbf{x}}_n^T$.

The sum-of-squares loss function can be written as:

$$E_D(\tilde{\mathbf{W}}) = \frac{1}{2} \operatorname{Tr}\left\{ (\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^T (\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T}) \right\}$$

Question: why do we use the trace? Because the $k$-th diagonal element of $(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^T (\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})$ is the sum of squared errors of the $k$-th output over all observations, so the trace sums the squared errors over all outputs and all data points.

Set the derivative of $E_D$ w.r.t. $\tilde{\mathbf{W}}$ to zero and obtain the following solution:

$$\tilde{\mathbf{W}} = (\tilde{\mathbf{X}}^T \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \mathbf{T} = \tilde{\mathbf{X}}^{\dagger} \mathbf{T}$$

where $\tilde{\mathbf{X}}^{\dagger}$ is the pseudo-inverse of $\tilde{\mathbf{X}}$.

If we want to obtain the result without using too much matrix calculus, we can note that the loss decouples over the columns of $\tilde{\mathbf{W}}$: the $k$-th column $\tilde{\mathbf{w}}_k$ only affects the squared errors against the $k$-th column of $\mathbf{T}$, so each column is the solution of an ordinary least-squares regression, $\tilde{\mathbf{w}}_k = (\tilde{\mathbf{X}}^T \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \mathbf{t}^{(k)}$, and stacking these columns gives the matrix solution above.

The discriminant function will be:

$$y(\mathbf{x}) = \tilde{\mathbf{W}}^T \tilde{\mathbf{x}} = \mathbf{T}^T (\tilde{\mathbf{X}}^{\dagger})^T \tilde{\mathbf{x}}$$
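As a concrete illustration, here is a minimal NumPy sketch of the least-squares classifier described above. The function names (`fit_least_squares_classifier`, `predict`) and the use of `np.linalg.pinv` are my own choices, not from the notes.

```python
import numpy as np

def fit_least_squares_classifier(X, T):
    """W_tilde = (X_tilde^T X_tilde)^{-1} X_tilde^T T, computed via the pseudo-inverse.

    X: (N, D) inputs, T: (N, K) one-hot targets. Returns W_tilde of shape (D+1, K).
    """
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the bias feature x_0 = 1
    return np.linalg.pinv(X_tilde) @ T                  # pseudo-inverse solution

def predict(W_tilde, X):
    """y(x) = W_tilde^T x_tilde; assign each point to the class with the largest output."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    Y = X_tilde @ W_tilde                               # (N, K) real-valued outputs
    return Y, Y.argmax(axis=1)
```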

Problems with the discriminant function obtained through minimization of SSE:

  • Sensitive to outliers
  • Poor performance, since least squares corresponds to maximum likelihood under a Gaussian noise assumption, which is clearly wrong when the target is a binary vector

An interesting property

Every target vector in the training set satisfies some linear constraint:

$$\mathbf{a}^T \mathbf{t}_n + b = 0 \qquad \forall n$$

for some constants $\mathbf{a}$ and $b$. The model prediction for any value of $\mathbf{x}$ will satisfy the same constraint:

$$\mathbf{a}^T y(\mathbf{x}) + b = 0$$

If we use a one-hot-encoding scheme for $\mathbf{t}$, then the components of $y(\mathbf{x})$ will sum up to 1. However, this is not enough to interpret $y(\mathbf{x})$ as a probability distribution, since its components are not bound to the interval $(0, 1)$.
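A quick numerical check of this property, reusing the hypothetical `fit_least_squares_classifier` and `predict` helpers sketched earlier (the three-blob toy dataset is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(30, 2)) for m in means])
labels = np.repeat(np.arange(3), 30)
T = np.eye(3)[labels]                       # one-hot targets

W_tilde = fit_least_squares_classifier(X, T)
Y, _ = predict(W_tilde, X)

print(np.allclose(Y.sum(axis=1), 1.0))      # True: components always sum to 1
print(Y.min(), Y.max())                     # ...but individual components can leave [0, 1]
```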

4.1.4 Fisher's linear discriminant

Suppose we have 2 classes. The idea is to project the D-dimensional input $\mathbf{x}$ to a scalar value $y = \mathbf{w}^T \mathbf{x}$ and classify it as class $C_1$ if $y \geq -w_0$, and as class $C_2$ otherwise.

The problem is that projecting the input from D dimensions to 1 dimension entails a significant loss of information: even if the classes are well separated in the high-dimensional space, they can overlap in the 1-dimensional space. However, we can optimize $\mathbf{w}$ in order to maximize the separation between the classes in the 1-dimensional space.

One way to do this is to consider the class mean vectors:

$$\mathbf{m}_1 = \frac{1}{N_1} \sum_{n \in C_1} \mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2} \sum_{n \in C_2} \mathbf{x}_n$$

and maximize the separation of the projected class means:

$$m_2 - m_1 = \mathbf{w}^T (\mathbf{m}_2 - \mathbf{m}_1)$$

where $m_k = \mathbf{w}^T \mathbf{m}_k$. One problem is that we can make $m_2 - m_1$ arbitrarily large by increasing the magnitude of $\mathbf{w}$. This can be solved by constraining $\mathbf{w}$ to a fixed magnitude, $\sum_i w_i^2 = 1$. To enforce this constraint during optimization, we can use Lagrange multipliers. We find that $\mathbf{w} \propto (\mathbf{m}_2 - \mathbf{m}_1)$.
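The Lagrange-multiplier step, written out (my own expansion of the argument, with $\lambda$ as the multiplier):

$$L(\mathbf{w}, \lambda) = \mathbf{w}^T (\mathbf{m}_2 - \mathbf{m}_1) + \lambda \left(1 - \mathbf{w}^T \mathbf{w}\right), \qquad \frac{\partial L}{\partial \mathbf{w}} = (\mathbf{m}_2 - \mathbf{m}_1) - 2\lambda \mathbf{w} = 0 \;\Longrightarrow\; \mathbf{w} = \frac{1}{2\lambda}(\mathbf{m}_2 - \mathbf{m}_1) \propto (\mathbf{m}_2 - \mathbf{m}_1)$$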

The problem with this simple approach is that it doesn't take the variance into account, and the data points may overlap in the 1-dimensional space (e.g., when their distribution has a strongly nondiagonal covariance). See the figure on the left below.

[Figure: on the left, projection onto the line joining the class means, with considerable class overlap; on the right, projection onto the Fisher linear discriminant.]

The discriminant function on the right is obtained using the Fisher linear discriminant, which introduces the within-class variance into the objective to optimize.

Define the variance of class $C_k$ in the projected space as:

$$s_k^2 = \sum_{n \in C_k} (y_n - m_k)^2$$

where $y_n = \mathbf{w}^T \mathbf{x}_n$. The total within-class variance for the whole dataset is simply $s_1^2 + s_2^2$. The Fisher criterion to maximize is defined as the ratio of the between-class variance to the within-class variance:

$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$$

We can rewrite the Fisher criterion in the following form to make the dependence on $\mathbf{w}$ explicit:

$$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}$$

where $\mathbf{S}_B$ is the between-class covariance matrix:

$$\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$$

and $\mathbf{S}_W$ is the total within-class covariance matrix:

$$\mathbf{S}_W = \sum_{n \in C_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in C_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T$$

By differentiation, we find that $J(\mathbf{w})$ is maximized when

$$(\mathbf{w}^T \mathbf{S}_B \mathbf{w})\, \mathbf{S}_W \mathbf{w} = (\mathbf{w}^T \mathbf{S}_W \mathbf{w})\, \mathbf{S}_B \mathbf{w}$$

Since we only care about the direction of $\mathbf{w}$, we can drop the scalar factors $(\mathbf{w}^T \mathbf{S}_B \mathbf{w})$ and $(\mathbf{w}^T \mathbf{S}_W \mathbf{w})$, and note that $\mathbf{S}_B \mathbf{w}$ is always in the direction of $(\mathbf{m}_2 - \mathbf{m}_1)$. Multiplying both sides by $\mathbf{S}_W^{-1}$, we obtain the Fisher linear discriminant:

$$\mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$$

If the within-class covariance is isotropic, so that $\mathbf{S}_W$ is proportional to the unit matrix, then we find that $\mathbf{w}$ is proportional to the difference of the class means.

The projection $y = \mathbf{w}^T \mathbf{x}$ is not really a discriminant, but we can construct a discriminant by choosing a threshold $y_0$ to classify the projected points.
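A minimal NumPy sketch of the two-class Fisher direction, under the definitions above; the function names and the unit-norm scaling are my own choices:

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Fisher direction w proportional to S_W^{-1} (m2 - m1) for two classes.

    X1: (N1, D) samples of class C1, X2: (N2, D) samples of class C2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Total within-class covariance S_W (unnormalized scatter matrices).
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)     # solve S_W w = (m2 - m1)
    return w / np.linalg.norm(w)          # only the direction matters

def project(w, X):
    """Project D-dimensional inputs onto the scalar y = w^T x."""
    return X @ w
```

A threshold $y_0$ on the projected values (e.g., chosen from class-conditional Gaussians fitted to the projections) then turns the projection into a discriminant.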

Relation to least squares

For the two class problem, the Fisher criterion can be obtained as a special case of least squares.

Let $N$ be the total number of observations, $N_1$ the number of observations from class $C_1$, and $N_2$ the number from class $C_2$. Reparameterize the target values as:

  1. $t_n = N / N_1$ if $\mathbf{x}_n$ belongs to class $C_1$
  2. $t_n = -N / N_2$ if $\mathbf{x}_n$ belongs to class $C_2$

Write the sum-of-squares error function:

$$E = \frac{1}{2} \sum_{n=1}^{N} \left( \mathbf{w}^T \mathbf{x}_n + w_0 - t_n \right)^2$$

By setting $\partial E / \partial w_0 = 0$ and $\partial E / \partial \mathbf{w} = 0$, after some algebraic manipulations, we find:

$$\left( \mathbf{S}_W + \frac{N_1 N_2}{N} \mathbf{S}_B \right) \mathbf{w} = N (\mathbf{m}_1 - \mathbf{m}_2) \quad\Longrightarrow\quad \mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$$

where $\mathbf{w}$ corresponds (up to an irrelevant scale factor) to the solution of the Fisher criterion, and we have also obtained an expression for the threshold, $w_0 = -\mathbf{w}^T \mathbf{m}$, where $\mathbf{m}$ is the mean of all the observations. For the intermediate steps, check page 210.
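A quick numerical check of this equivalence, reusing the hypothetical `fisher_lda_direction` helper sketched above (the two-blob data and the target coding $N/N_1$, $-N/N_2$ are for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 3.0], size=(40, 2))   # class C1
X2 = rng.normal(loc=[3.0, 1.0], scale=[1.0, 3.0], size=(60, 2))   # class C2
X = np.vstack([X1, X2])
N, N1, N2 = len(X), len(X1), len(X2)

# Least squares with the special target coding t = N/N1 for C1 and t = -N/N2 for C2.
t = np.concatenate([np.full(N1, N / N1), np.full(N2, -N / N2)])
X_tilde = np.hstack([np.ones((N, 1)), X])
w0, *w_ls = np.linalg.lstsq(X_tilde, t, rcond=None)[0]
w_ls = np.array(w_ls)

# Compare (up to sign and scale) with the Fisher direction S_W^{-1}(m2 - m1).
w_fisher = fisher_lda_direction(X1, X2)
cos_angle = abs(w_ls @ w_fisher) / np.linalg.norm(w_ls)
print(np.isclose(cos_angle, 1.0))        # True: the two directions are parallel
```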

Fisher discriminant for multiple classes

Suppose $K > 2$ classes. Let $\mathbf{x} \in \mathbb{R}^D$, where $D > K$. We want to project the observation $\mathbf{x}$ onto $D' > 1$ linear "features" $\mathbf{y} \in \mathbb{R}^{D'}$, and we can calculate each component as follows:

$$y_k = \mathbf{w}_k^T \mathbf{x}, \qquad k = 1, \dots, D'$$

where no bias parameters are included. We can group the vectors $\mathbf{w}_k$, for $k = 1, \dots, D'$, as columns of a matrix $\mathbf{W}$ of dimension $D \times D'$, and calculate the vector $\mathbf{y}$ in one step as:

$$\mathbf{y} = \mathbf{W}^T \mathbf{x}$$

Let's define the within and between class covariances for the multi-class problem.

The within-class covariance is:

$$\mathbf{S}_W = \sum_{k=1}^{K} \mathbf{S}_k, \qquad \mathbf{S}_k = \sum_{n \in C_k} (\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^T, \qquad \mathbf{m}_k = \frac{1}{N_k} \sum_{n \in C_k} \mathbf{x}_n$$

The between-class covariance is:

$$\mathbf{S}_B = \sum_{k=1}^{K} N_k (\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^T, \qquad \mathbf{m} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$$

Now let's define the same covariances in the projected space of $\mathbf{y}$.

The within-class covariance is:

$$\mathbf{s}_W = \sum_{k=1}^{K} \sum_{n \in C_k} (\mathbf{y}_n - \boldsymbol{\mu}_k)(\mathbf{y}_n - \boldsymbol{\mu}_k)^T, \qquad \boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n \in C_k} \mathbf{y}_n$$

The between-class covariance is:

$$\mathbf{s}_B = \sum_{k=1}^{K} N_k (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T, \qquad \boldsymbol{\mu} = \frac{1}{N} \sum_{k=1}^{K} N_k \boldsymbol{\mu}_k$$

There exist different objective functions to maximize, but here we use the one from (Fukunaga, 1990), which is:

$$J(\mathbf{W}) = \operatorname{Tr}\left\{ \mathbf{s}_W^{-1} \mathbf{s}_B \right\}$$

Rewriting with explicit dependence on $\mathbf{W}$:

$$J(\mathbf{W}) = \operatorname{Tr}\left\{ (\mathbf{W}^T \mathbf{S}_W \mathbf{W})^{-1} (\mathbf{W}^T \mathbf{S}_B \mathbf{W}) \right\}$$

Solution: the columns of the matrix $\mathbf{W}$ that maximizes $J(\mathbf{W})$ are the eigenvectors corresponding to the $D'$ largest eigenvalues of the matrix $\mathbf{S}_W^{-1} \mathbf{S}_B$.

Observation: since the outer product of non-zero vectors always has rank 1, since $\mathbf{S}_B$ is the sum of $K$ rank-1 matrices, and since only $K - 1$ of these matrices are independent, $\mathbf{S}_B$ has rank at most $K - 1$, and therefore there are at most $K - 1$ non-zero eigenvalues. This means that we are unable to find more than $K - 1$ linear features by this method (Fukunaga, 1990).
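To see the dependence among the rank-1 terms explicitly (my own expansion of the argument): since $\mathbf{m} = \frac{1}{N} \sum_k N_k \mathbf{m}_k$,

$$\sum_{k=1}^{K} N_k (\mathbf{m}_k - \mathbf{m}) = \sum_{k=1}^{K} N_k \mathbf{m}_k - N \mathbf{m} = \mathbf{0},$$

so the $K$ vectors $\mathbf{m}_k - \mathbf{m}$ span a subspace of dimension at most $K - 1$, and hence $\operatorname{rank}(\mathbf{S}_B) \leq K - 1$.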

Right now we have only reduced the dimensionality of the data. Where is the discriminant function?
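A NumPy sketch of the dimensionality-reduction step just described (only the projection, no discriminant); the function name and the use of `np.linalg.eig` on $\mathbf{S}_W^{-1} \mathbf{S}_B$ are my own choices:

```python
import numpy as np

def fisher_projection(X, labels, d_prime):
    """Columns of W = eigenvectors of S_W^{-1} S_B with the d' largest eigenvalues.

    X: (N, D) data, labels: (N,) integer class labels in {0, ..., K-1}.
    Returns W of shape (D, d_prime); project with Y = X @ W. Note d_prime <= K - 1.
    """
    D = X.shape[1]
    m = X.mean(axis=0)                              # global mean
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)              # within-class scatter
        S_B += len(Xk) * np.outer(mk - m, mk - m)   # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]          # sort by decreasing eigenvalue
    return eigvecs[:, order[:d_prime]].real
```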

4.1.7 Perceptron

Find my notes about the Perceptron here (in Italian).

4.2 Probabilistic Generative Models

Consider the binary classification task. We want to compute the posterior probability $p(C_1 \mid \mathbf{x})$:

$$p(C_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_1)\, p(C_1) + p(\mathbf{x} \mid C_2)\, p(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a), \qquad a = \ln \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_2)\, p(C_2)}$$

where $\sigma(a) = \frac{1}{1 + e^{-a}}$ is the logistic sigmoid function, which has the symmetry property $\sigma(-a) = 1 - \sigma(a)$.

The inverse of the logistic sigmoid is given by the logit function:

$$a = \ln \left( \frac{\sigma}{1 - \sigma} \right)$$

It represents the log of the ratio of probabilities for the two classes, $\ln \left[ p(C_1 \mid \mathbf{x}) / p(C_2 \mid \mathbf{x}) \right]$, also known as the log odds.

For $K > 2$ classes, we have:

$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{\sum_j p(\mathbf{x} \mid C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$

where the quantities $a_k$ are defined as:

$$a_k = \ln \big( p(\mathbf{x} \mid C_k)\, p(C_k) \big)$$

This function, called normalized exponential, is also known as the softmax function.
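A small NumPy sketch of these activation functions (the max-subtraction trick in `softmax` is a standard numerical-stability choice of mine, not from the notes):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def logit(p):
    """Inverse of the sigmoid: a = ln(p / (1 - p)), i.e. the log odds."""
    return np.log(p / (1.0 - p))

def softmax(a):
    """Normalized exponential; subtracting the max keeps exp() from overflowing."""
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

a = np.array([0.3, -1.2, 2.0])
print(sigmoid(-a) + sigmoid(a))   # symmetry sigma(-a) = 1 - sigma(a): prints all ones
print(logit(sigmoid(a)))          # recovers a
print(softmax(a).sum())           # the K posteriors sum to 1
```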

Continuous inputs

Assume that

  • $\mathbf{x}$ is continuous
  • The class-conditional densities $p(\mathbf{x} \mid C_k)$ are Gaussian
  • They share the same covariance matrix $\boldsymbol{\Sigma}$

Consider the binary classification task. From the results above, we have:

$$p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + w_0)$$

where:

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

$$w_0 = -\frac{1}{2} \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 + \ln \frac{p(C_1)}{p(C_2)}$$

Hence the argument of the activation function is linear w.r.t. the input $\mathbf{x}$.

Since the decision boundaries are defined as the surfaces where $p(C_k \mid \mathbf{x})$ is constant, and since this depends only on the argument of the sigmoid, which is linear w.r.t. $\mathbf{x}$, the decision boundaries are also linear w.r.t. $\mathbf{x}$.

Changing the prior probabilities will only shift the decision boundaries, since the priors appear only in the bias parameter $w_0$.

For $K$ classes, we have

$$a_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}$$

where

$$\mathbf{w}_k = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k, \qquad w_{k0} = -\frac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \ln p(C_k)$$

Relaxing the assumption of a shared covariance matrix and considering a different covariance matrix $\boldsymbol{\Sigma}_k$ for each class results in a quadratic dependency of $a_k(\mathbf{x})$ on $\mathbf{x}$ (also called a quadratic discriminant).
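A small NumPy sketch of the two-class case above, computing $\mathbf{w}$ and $w_0$ from given class means, shared covariance, and priors (the function name is mine):

```python
import numpy as np

def shared_cov_posterior_params(mu1, mu2, Sigma, prior1, prior2):
    """Parameters of p(C1 | x) = sigma(w^T x + w0) under shared-covariance Gaussians."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / prior2))
    return w, w0
```

Note that changing `prior1` / `prior2` only changes `w0`, i.e. it shifts the linear decision boundary $\mathbf{w}^T \mathbf{x} + w_0 = 0$ without rotating it, as stated above.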

Maximum likelihood solutions

See the derivations at page 220. Suppose we have $N$ samples $\mathbf{x}_n$ and their binary class labels $t_n \in \{0, 1\}$, and suppose $N_1$ samples belong to class $C_1$ and $N_2$ to class $C_2$. If we consider Gaussian class-conditional densities $p(\mathbf{x} \mid C_1)$ and $p(\mathbf{x} \mid C_2)$ with shared covariance matrix $\boldsymbol{\Sigma}$, then the maximum likelihood solutions are (a NumPy sketch follows the list):

  • $\boldsymbol{\mu}_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n \mathbf{x}_n$: centroid of the samples from class $C_1$
  • $\boldsymbol{\mu}_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) \mathbf{x}_n$: centroid of the samples from class $C_2$
  • $\boldsymbol{\Sigma} = \frac{N_1}{N} \mathbf{S}_1 + \frac{N_2}{N} \mathbf{S}_2$: sum of the covariance matrices $\mathbf{S}_1$ and $\mathbf{S}_2$ of the two classes, weighted by the priors $N_1/N$ and $N_2/N$
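The sketch below computes these estimates from labeled data, assuming $t_n = 1$ marks class $C_1$ and $t_n = 0$ marks class $C_2$ (the function name and the use of `np.cov(..., bias=True)` for the ML covariances are my own choices):

```python
import numpy as np

def ml_estimates(X, t):
    """Maximum likelihood estimates for the shared-covariance Gaussian model.

    X: (N, D) inputs, t: (N,) binary labels (1 for class C1, 0 for class C2).
    """
    N = len(t)
    N1 = int(t.sum())
    N2 = N - N1
    prior1 = N1 / N                                   # ML prior p(C1) = N1 / N
    mu1 = X[t == 1].mean(axis=0)                      # centroid of class C1
    mu2 = X[t == 0].mean(axis=0)                      # centroid of class C2
    S1 = np.cov(X[t == 1], rowvar=False, bias=True)   # per-class ML covariances
    S2 = np.cov(X[t == 0], rowvar=False, bias=True)
    Sigma = (N1 / N) * S1 + (N2 / N) * S2             # prior-weighted combination
    return mu1, mu2, Sigma, prior1
```

These estimates can then be plugged into the posterior parameterization $\sigma(\mathbf{w}^T \mathbf{x} + w_0)$ sketched in the previous block.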

4.3 Probabilistic Discriminative Models

Probit regression