The basics of Gaussian Processes

Def. A stochastic process $y(\mathbf{x})$ is specified by giving the joint probability distribution for any finite set of values $y(\mathbf{x}_1), \dots, y(\mathbf{x}_N)$ in a consistent manner.

Def. A Gaussian process is defined as a probability distribution over functions $y(\mathbf{x})$ such that the values of $y(\mathbf{x})$ evaluated at an arbitrary set of points $\mathbf{x}_1, \dots, \mathbf{x}_N$ jointly have a Gaussian distribution.

Suppose we have a training set $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and that we try to predict the target values using the following model:

$$y(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$

where $\boldsymbol{\phi}(\mathbf{x})$ is a vector of nonlinear basis functions (a feature map) and $\mathbf{w}$ is the parameter vector of our model. If we define a probability distribution over $\mathbf{w}$ as

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$$

where $\alpha$ is a hyperparameter, we are technically defining a probability distribution over our function $y(\mathbf{x})$, which relies on $\mathbf{w}$.
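A minimal NumPy sketch of this weight-space view may help; the Gaussian-bump feature map, its centers, and the value of $\alpha$ below are arbitrary choices for illustration, not part of the derivation above:

```python
import numpy as np

alpha = 2.0                      # prior precision for w (illustrative value)
centers = np.linspace(-1, 1, 9)  # centers of the Gaussian basis functions (arbitrary choice)

def phi(x):
    """Nonlinear feature map: one Gaussian bump per center, evaluated at every x."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / 0.3) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
Phi = phi(x)                     # design matrix, i-th row is phi(x_i)

# One draw of w from the prior p(w) = N(0, alpha^{-1} I) defines one function y = Phi w.
w = rng.normal(scale=alpha ** -0.5, size=centers.size)
y = Phi @ w
```

Each new draw of `w` gives a different function, which is exactly the sense in which the prior over $\mathbf{w}$ induces a distribution over functions.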

Let's do something unusual. Let $\mathbf{y}$ be a vector such that $y_n = y(\mathbf{x}_n)$. We can obtain this vector as $\mathbf{y} = \boldsymbol{\Phi} \mathbf{w}$, where $\boldsymbol{\Phi}$ is the design matrix (i.e., a matrix whose $i$-th row corresponds to $\boldsymbol{\phi}(\mathbf{x}_i)^T$). As $\mathbf{y}$ is a linear transformation of the normally distributed random variable $\mathbf{w}$, its distribution is also Gaussian. The parameters can be obtained as follows:

  • $\mathbb{E}[\mathbf{y}] = \boldsymbol{\Phi}\,\mathbb{E}[\mathbf{w}] = \mathbf{0}$
  • $\mathrm{cov}[\mathbf{y}] = \mathbb{E}[\mathbf{y}\mathbf{y}^T] = \boldsymbol{\Phi}\,\mathbb{E}[\mathbf{w}\mathbf{w}^T]\,\boldsymbol{\Phi}^T = \frac{1}{\alpha}\boldsymbol{\Phi}\boldsymbol{\Phi}^T = \mathbf{K}$ (the Gram matrix)

The cool thing to observe here is that the covariance matrix is entirely composed of kernel evaluations:

$$K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m) = \frac{1}{\alpha}\,\boldsymbol{\phi}(\mathbf{x}_n)^T \boldsymbol{\phi}(\mathbf{x}_m)$$

In this case the kernel is very simple, but we can replace it with any other valid kernel. This allows us to build more complex models and is a great feature of Gaussian processes.
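As a sketch of that idea, the snippet below builds the Gram matrix directly from kernel evaluations (assuming a squared-exponential kernel with an arbitrary length scale) and samples functions from $\mathcal{N}(\mathbf{0}, \mathbf{K})$ without ever constructing $\boldsymbol{\phi}$ or $\mathbf{w}$:

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=0.3):
    """Squared-exponential kernel k(x, x') (length scale chosen arbitrarily)."""
    return np.exp(-0.5 * ((xa[:, None] - xb[None, :]) / length_scale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)

# Gram matrix of kernel evaluations; a small jitter keeps it numerically positive definite.
K = rbf_kernel(x, x) + 1e-10 * np.eye(x.size)

# Three functions drawn from the prior y ~ N(0, K), one per row.
samples = rng.multivariate_normal(mean=np.zeros(x.size), cov=K, size=3)
```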

Gaussian processes for regression

In our model, we are going to assume that every target variable $t_n$ has an independent noise term separating our prediction $y_n = y(\mathbf{x}_n)$ from the observed value:

$$t_n = y_n + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \beta^{-1})$$

Therefore the conditional probability distribution of the target variable is:

$$p(t_n \mid y_n) = \mathcal{N}(t_n \mid y_n, \beta^{-1})$$

Similarly to what we did in the previous section, let's define the vector $\mathbf{t}$ such that its $n$-th component is $t_n$. Thanks to the assumption of independent noise, the previous equation can be generalized as:

$$p(\mathbf{t} \mid \mathbf{y}) = \mathcal{N}(\mathbf{t} \mid \mathbf{y}, \beta^{-1} \mathbf{I}_N)$$

It turns out this can be marginalized easily:

$$p(\mathbf{t}) = \int p(\mathbf{t} \mid \mathbf{y})\, p(\mathbf{y})\, d\mathbf{y} = \mathcal{N}(\mathbf{t} \mid \mathbf{0}, \mathbf{C}_N)$$

where $\mathbf{C}_N$ is an $N \times N$ matrix such that

$$C_{nm} = k(\mathbf{x}_n, \mathbf{x}_m) + \beta^{-1} \delta_{nm}$$

Now suppose we want to infer $t_{N+1}$ for a test point $\mathbf{x}_{N+1}$. Build $\mathbf{t}_{N+1}$ by extending $\mathbf{t}$ with another component $t_{N+1}$; then the distribution will be:

$$p(\mathbf{t}_{N+1}) = \mathcal{N}(\mathbf{t}_{N+1} \mid \mathbf{0}, \mathbf{C}_{N+1})$$

where $\mathbf{C}_{N+1}$ is a matrix built like this:

$$\mathbf{C}_{N+1} = \begin{pmatrix} \mathbf{C}_N & \mathbf{k} \\ \mathbf{k}^T & c \end{pmatrix}$$

We already know $\mathbf{C}_N$. $\mathbf{k}$ is a vector such that $k_n = k(\mathbf{x}_n, \mathbf{x}_{N+1})$, and finally $c = k(\mathbf{x}_{N+1}, \mathbf{x}_{N+1}) + \beta^{-1}$.

We know the joint distribution $p(\mathbf{t}_{N+1})$, but we are interested in predicting $t_{N+1}$, so what we actually need is the conditional $p(t_{N+1} \mid \mathbf{t})$, which is another Gaussian distribution. Results 2.81 and 2.82 from the PRML show how to compute its parameters:

$$m(\mathbf{x}_{N+1}) = \mathbf{k}^T \mathbf{C}_N^{-1} \mathbf{t}$$

$$\sigma^2(\mathbf{x}_{N+1}) = c - \mathbf{k}^T \mathbf{C}_N^{-1} \mathbf{k}$$

Both quantities depend on $\mathbf{x}_{N+1}$, since the $\mathbf{k}$ and $c$ terms depend on it. This is it (for now). We can use an arbitrary valid kernel as long as the resulting matrix $\mathbf{C}_N$ is invertible.
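To make the predictive equations concrete, here is a minimal NumPy sketch, again assuming a squared-exponential kernel and toy values for $\beta$ and the training data (all illustrative choices, not prescribed by the text above):

```python
import numpy as np

def kernel(xa, xb, length_scale=0.3):
    """Squared-exponential kernel (length scale is an arbitrary choice)."""
    return np.exp(-0.5 * ((xa[:, None] - xb[None, :]) / length_scale) ** 2)

beta = 25.0                                   # noise precision (illustrative value)
rng = np.random.default_rng(0)

# Toy training set: noisy observations of a sine wave.
X = np.linspace(-1, 1, 20)
t = np.sin(3 * X) + rng.normal(scale=beta ** -0.5, size=X.size)

# C_N = K + beta^{-1} I, the covariance of the marginal p(t).
C_N = kernel(X, X) + np.eye(X.size) / beta
C_inv = np.linalg.inv(C_N)

def predict(x_new):
    """Predictive mean m = k^T C_N^{-1} t and variance sigma^2 = c - k^T C_N^{-1} k."""
    k = kernel(X, np.atleast_1d(x_new))[:, 0]                          # k_n = k(x_n, x_new)
    c = kernel(np.atleast_1d(x_new), np.atleast_1d(x_new))[0, 0] + 1 / beta
    return k @ C_inv @ t, c - k @ C_inv @ k

mean, var = predict(0.5)   # predictive mean and variance at a test point
```

Inverting $\mathbf{C}_N$ directly is fine at this toy scale; for larger datasets a Cholesky factorization of $\mathbf{C}_N$ is the standard, more stable choice.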