The basics of Gaussian Processes

Def. A stochastic process $y(\mathbf{x})$ is specified by giving the joint probability distribution for any finite set of values $y(\mathbf{x}_1), \dots, y(\mathbf{x}_N)$ in a consistent manner.

Def. A Gaussian process is defined as a probability distribution over functions $y(\mathbf{x})$ such that the values of $y(\mathbf{x})$ evaluated at an arbitrary set of points $\mathbf{x}_1, \dots, \mathbf{x}_N$ jointly have a Gaussian distribution.

Suppose we have a training set $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and that we try to predict the target values using the following model:

$$y(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$

where $\boldsymbol{\phi}(\mathbf{x})$ is a vector of nonlinear basis functions (a feature map) and $\mathbf{w}$ is the parameter vector of our model. If we define a probability distribution over $\mathbf{w}$ as

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$$

where $\alpha$ is a hyperparameter, we are technically defining a probability distribution over our function $y(\mathbf{x})$, which relies on $\mathbf{w}$.
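A minimal NumPy sketch of this weight-space view may help; the Gaussian-bump feature map, its centers, and the value of $\alpha$ below are arbitrary choices for illustration, not part of the derivation above:

```python
import numpy as np

alpha = 2.0                      # prior precision for w (illustrative value)
centers = np.linspace(-1, 1, 9)  # centers of the Gaussian basis functions (arbitrary choice)

def phi(x):
    """Nonlinear feature map: one Gaussian bump per center, evaluated at every x."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / 0.3) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
Phi = phi(x)                     # design matrix, i-th row is phi(x_i)

# One draw of w from the prior p(w) = N(0, alpha^{-1} I) defines one function y = Phi w.
w = rng.normal(scale=alpha ** -0.5, size=centers.size)
y = Phi @ w
```

Each new draw of `w` gives a different function, which is exactly the sense in which the prior over $\mathbf{w}$ induces a distribution over functions.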

Let's do something unusual. Let $\mathbf{y}$ be a vector such that $y_n = y(\mathbf{x}_n)$. We can obtain this vector as $\mathbf{y} = \boldsymbol{\Phi} \mathbf{w}$, where $\boldsymbol{\Phi}$ is the design matrix (i.e., a matrix whose $i$-th row corresponds to $\boldsymbol{\phi}(\mathbf{x}_i)^T$). As $\mathbf{y}$ is a linear transformation of the normally distributed random variable $\mathbf{w}$, its distribution is also Gaussian. The parameters can be obtained as follows:

  • $\mathbb{E}[\mathbf{y}] = \boldsymbol{\Phi}\,\mathbb{E}[\mathbf{w}] = \mathbf{0}$
  • $\mathrm{cov}[\mathbf{y}] = \mathbb{E}[\mathbf{y}\mathbf{y}^T] = \boldsymbol{\Phi}\,\mathbb{E}[\mathbf{w}\mathbf{w}^T]\,\boldsymbol{\Phi}^T = \frac{1}{\alpha}\boldsymbol{\Phi}\boldsymbol{\Phi}^T = \mathbf{K}$ (the Gram matrix)

The cool thing to observe here is that the covariance matrix is entirely composed of kernel evaluations:

$$K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m) = \frac{1}{\alpha}\,\boldsymbol{\phi}(\mathbf{x}_n)^T \boldsymbol{\phi}(\mathbf{x}_m)$$

In this case the kernel is very simple, but we can replace it with any other valid kernel. This allows us to build more complex models and is a great feature of Gaussian processes.
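As a sketch of that idea, the snippet below builds the Gram matrix directly from kernel evaluations (assuming a squared-exponential kernel with an arbitrary length scale) and samples functions from $\mathcal{N}(\mathbf{0}, \mathbf{K})$ without ever constructing $\boldsymbol{\phi}$ or $\mathbf{w}$:

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=0.3):
    """Squared-exponential kernel k(x, x') (length scale chosen arbitrarily)."""
    return np.exp(-0.5 * ((xa[:, None] - xb[None, :]) / length_scale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)

# Gram matrix of kernel evaluations; a small jitter keeps it numerically positive definite.
K = rbf_kernel(x, x) + 1e-10 * np.eye(x.size)

# Three functions drawn from the prior y ~ N(0, K), one per row.
samples = rng.multivariate_normal(mean=np.zeros(x.size), cov=K, size=3)
```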

Gaussian processes for regression

In our model, we are going to assume that every target variable $t_n$ has an independent noise term separating our prediction $y_n = y(\mathbf{x}_n)$ from the observed value:

$$t_n = y_n + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \beta^{-1})$$

Therefore the conditional probability distribution of the target variable is:

$$p(t_n \mid y_n) = \mathcal{N}(t_n \mid y_n, \beta^{-1})$$

Similarly to what we did in the previous section, let's define the vector $\mathbf{t}$ such that its $n$-th component is $t_n$. Thanks to the assumption of independent noise, the previous equation can be generalized as:

$$p(\mathbf{t} \mid \mathbf{y}) = \mathcal{N}(\mathbf{t} \mid \mathbf{y}, \beta^{-1} \mathbf{I}_N)$$

It turns out this can be marginalized easily:

$$p(\mathbf{t}) = \int p(\mathbf{t} \mid \mathbf{y})\, p(\mathbf{y})\, d\mathbf{y} = \mathcal{N}(\mathbf{t} \mid \mathbf{0}, \mathbf{C}_N)$$

where $\mathbf{C}_N$ is an $N \times N$ matrix such that

$$C_{nm} = k(\mathbf{x}_n, \mathbf{x}_m) + \beta^{-1} \delta_{nm}$$

Now suppose we want to infer $t_{N+1}$ for a test point $\mathbf{x}_{N+1}$. Build $\mathbf{t}_{N+1}$ by extending $\mathbf{t}$ with another component $t_{N+1}$; then the distribution will be:

$$p(\mathbf{t}_{N+1}) = \mathcal{N}(\mathbf{t}_{N+1} \mid \mathbf{0}, \mathbf{C}_{N+1})$$

where $\mathbf{C}_{N+1}$ is a matrix built like this:

$$\mathbf{C}_{N+1} = \begin{pmatrix} \mathbf{C}_N & \mathbf{k} \\ \mathbf{k}^T & c \end{pmatrix}$$

We already know $\mathbf{C}_N$. $\mathbf{k}$ is a vector such that $k_n = k(\mathbf{x}_n, \mathbf{x}_{N+1})$, and finally $c = k(\mathbf{x}_{N+1}, \mathbf{x}_{N+1}) + \beta^{-1}$.

We know the joint distribution $p(\mathbf{t}_{N+1})$, but we are interested in predicting $t_{N+1}$, so what we actually need is the conditional $p(t_{N+1} \mid \mathbf{t})$, which is another Gaussian distribution. Results 2.81 and 2.82 from the PRML show how to compute its parameters:

$$m(\mathbf{x}_{N+1}) = \mathbf{k}^T \mathbf{C}_N^{-1} \mathbf{t}$$

$$\sigma^2(\mathbf{x}_{N+1}) = c - \mathbf{k}^T \mathbf{C}_N^{-1} \mathbf{k}$$

Both quantities depend on $\mathbf{x}_{N+1}$, since the $\mathbf{k}$ and $c$ terms depend on it. This is it (for now). We can use an arbitrary valid kernel as long as the resulting matrix $\mathbf{C}_N$ is invertible.
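To make the predictive equations concrete, here is a minimal NumPy sketch, again assuming a squared-exponential kernel and toy values for $\beta$ and the training data (all illustrative choices, not prescribed by the text above):

```python
import numpy as np

def kernel(xa, xb, length_scale=0.3):
    """Squared-exponential kernel (length scale is an arbitrary choice)."""
    return np.exp(-0.5 * ((xa[:, None] - xb[None, :]) / length_scale) ** 2)

beta = 25.0                                   # noise precision (illustrative value)
rng = np.random.default_rng(0)

# Toy training set: noisy observations of a sine wave.
X = np.linspace(-1, 1, 20)
t = np.sin(3 * X) + rng.normal(scale=beta ** -0.5, size=X.size)

# C_N = K + beta^{-1} I, the covariance of the marginal p(t).
C_N = kernel(X, X) + np.eye(X.size) / beta
C_inv = np.linalg.inv(C_N)

def predict(x_new):
    """Predictive mean m = k^T C_N^{-1} t and variance sigma^2 = c - k^T C_N^{-1} k."""
    k = kernel(X, np.atleast_1d(x_new))[:, 0]                          # k_n = k(x_n, x_new)
    c = kernel(np.atleast_1d(x_new), np.atleast_1d(x_new))[0, 0] + 1 / beta
    return k @ C_inv @ t, c - k @ C_inv @ k

mean, var = predict(0.5)   # predictive mean and variance at a test point
```

Inverting $\mathbf{C}_N$ directly is fine at this toy scale; for larger datasets a Cholesky factorization of $\mathbf{C}_N$ is the standard, more stable choice.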