Compare GPs and kernel ridge regression (KRR)#
In this section we show that GPs and KRR produce the same weights and predictions when using the same kernel and noise level; they only differ in the way hyperparameter optimization is done. We use sklearn
for all code examples.
Below we give an introduction, and in the next section we explore the comparison using code examples.
Kernel / covariance function#
We use the radial basis function (RBF) kernel function

$$\kappa(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\lVert \mathbf{x} - \mathbf{x}' \rVert^2}{2\,\ell^2}\right)$$

with length scale parameter $\ell$ (length_scale below). Note that there are many other RBFs, but sklearn
only implements this one.
from sklearn.gaussian_process.kernels import RBF
RBF(length_scale=...)
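The kernel object is callable and returns the kernel matrix that enters the equations below. A small sketch, assuming X is an (n_samples, n_features) array:

K = RBF(length_scale=1.0)(X)  # kernel matrix with K[i, j] = kappa(x_i, x_j)
K.shape                       # (n_samples, n_samples); RBF has ones on the diagonal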
KernelRidge#
We solve

$$(\mathbf{K} + \eta\,\mathbf{I})\,\boldsymbol{\alpha} = \mathbf{y} \qquad (1)$$

for the weights $\boldsymbol{\alpha}$ (KernelRidge.dual_coef_
in sklearn
), where $\mathbf{K}$ is the kernel matrix with $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and $\eta$ is the noise (a.k.a. regularization) parameter.
We specify $\eta$ (noise_level
) as the alpha
parameter:
from sklearn.kernel_ridge import KernelRidge
krr = KernelRidge(
alpha=noise_level,
kernel=RBF(length_scale=length_scale),
)
Calling
krr.fit(X, y)
solves (1) once.
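As a quick sanity check (a sketch, assuming X, y, length_scale and noise_level are defined and krr has been fitted as above), the fitted weights match a direct solve of (1):

import numpy as np

K = RBF(length_scale=length_scale)(X)
manual_weights = np.linalg.solve(K + noise_level * np.eye(len(X)), y)
print(np.allclose(manual_weights, krr.dual_coef_))  # True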
GaussianProcessRegressor#
In addition to RBF
, we can use a WhiteKernel
“to learn the global noise”, so
the kernel we use is a combination of two kernels which are responsible for
modeling different aspects of the data (i.e. “kernel engineering”). The
resulting kernel matrix is the same as the above, where
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.gaussian_process import GaussianProcessRegressor
gp = GaussianProcessRegressor(
kernel=RBF(length_scale=length_scale) + WhiteKernel(noise_level=noise_level),
...
)
results in

$$\mathbf{K} + \eta\,\mathbf{I},$$

i.e. the WhiteKernel
adds $\eta$ (noise_level
) to the diagonal, and we solve (1) for the weights (GaussianProcessRegressor.alpha_
).
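We can verify that the WhiteKernel only adds noise_level to the diagonal. A sketch, assuming X, length_scale and noise_level are defined:

import numpy as np

k_rbf = RBF(length_scale=length_scale)
k_sum = k_rbf + WhiteKernel(noise_level=noise_level)
print(np.allclose(k_sum(X), k_rbf(X) + noise_level * np.eye(len(X))))  # True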
Differences to KRR#
Noise in WhiteKernel#
The way to specify noise_level
is to use WhiteKernel
as in
gp = GaussianProcessRegressor(
kernel=RBF(length_scale=length_scale) + WhiteKernel(noise_level=noise_level),
...
)
One can also specify the noise via the alpha
parameter, as in KernelRidge
:
gp = GaussianProcessRegressor(
alpha=noise_level,
kernel=RBF(length_scale=length_scale),
...)
(in fact the default alpha
value is not zero but a small 1e-10
for numerical stability). This results in the same kernel matrix as in KernelRidge
when solving for the weights. Thus, when calling
y_pred = gp.predict(X)
we get the same prediction as we would with
y_pred = krr.predict(X)
However, the covariance matrix we get from
y_pred, y_cov = gp.predict(X, return_cov=True)
will be different. For more details, see the sklearn
part
of this section.
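A quick way to see the difference, using the imports from above (a sketch; gp_white and gp_alpha are names chosen here for illustration, and optimizer=None keeps the hyperparameters fixed so that both models use the same values):

import numpy as np

gp_white = GaussianProcessRegressor(
    kernel=RBF(length_scale=length_scale) + WhiteKernel(noise_level=noise_level),
    optimizer=None,
).fit(X, y)
gp_alpha = GaussianProcessRegressor(
    alpha=noise_level,
    kernel=RBF(length_scale=length_scale),
    optimizer=None,
).fit(X, y)

y_white, cov_white = gp_white.predict(X, return_cov=True)
y_alpha, cov_alpha = gp_alpha.predict(X, return_cov=True)
print(np.allclose(y_white, y_alpha))      # True (up to the tiny default alpha=1e-10 jitter)
print(np.allclose(cov_white, cov_alpha))  # False: covariances differ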
GP optimizer#
In contrast to KernelRidge
, calling
gp.fit(X, y)
does not solve for the weights once but, by default (i.e. as long as
GaussianProcessRegressor(optimizer=...)
is not None
), uses a local
optimizer to optimize all kernel hyperparameters, solving (1) in
each step.
Which parameters are optimized depends on what kernel
we use:
kernel | optimized hyperparameters |
---|---|
RBF(length_scale=length_scale) | length_scale |
RBF(length_scale=length_scale) + WhiteKernel(noise_level=noise_level) | length_scale, noise_level |
What this means is that the GaussianProcessRegressor
optimizer cannot
optimize the noise when it is passed as the regularization parameter alpha
. To have it optimized, we must make it part of the kernel via
WhiteKernel(noise_level=...)
, where we interpret it as noise, while setting the
regularization parameter alpha=0
. This is a technicality and might be a bit
confusing since, from a textbook point of view, both end up in the same place,
namely as $\eta$ on the diagonal of the kernel matrix in (1).
Hyperopt objective function#
The difference to KRR is that the GP implementation optimizes the kernel’s
params, here length_scale
and noise_level
, by maximizing the log marginal likelihood (LML) rather than a
cross-validation score. On top of that, the GP gives us uncertainty estimates
y_std
or y_cov
if we want, so of course the GP is in general
the preferred solution. Additionally, the GP can use the LML’s gradient to do
local optimization, which can be fast if the LML evaluation is fast, and the
result can be the global optimum if the LML surface is convex, at least in the
neighborhood of a good start guess.
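The LML and its gradient with respect to the log-transformed hyperparameters are exposed by sklearn. A sketch, assuming gp has been fitted as above:

# The internal optimizer minimizes the negative of this quantity.
lml, lml_grad = gp.log_marginal_likelihood(gp.kernel_.theta, eval_gradient=True)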
Changing the GP optimizer#
In addition to using GaussianProcessRegressor
’s internal optimizer code path,
either by using the default l_bfgs_b
local optimizer or by setting a custom
one using optimizer=my_optimizer
, we show how to optimize the GP’s
hyperparameters in the same way as any other model by setting
GaussianProcessRegressor(optimizer=None)
in combination with an external
optimizer. With that, we can use either
GaussianProcessRegressor(alpha=noise_level)
, i.e. treat it as
KernelRidge(alpha=noise_level)
, or WhiteKernel(noise_level=noise_level)
and
get the exact same results.
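With optimizer=None, fit() solves (1) exactly once for the given hyperparameters, just like KernelRidge.fit(). An external hyperopt loop would then set length_scale and noise_level, call fit() and score the result, for instance via the LML (a sketch using the names from above):

gp_fixed = GaussianProcessRegressor(
    kernel=RBF(length_scale=length_scale) + WhiteKernel(noise_level=noise_level),
    optimizer=None,  # no internal hyperparameter optimization
).fit(X, y)
# LML at the fixed hyperparameters, usable as an external hyperopt objective
print(gp_fixed.log_marginal_likelihood_value_)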
We define a custom GP optimizer using scipy.optimize.differential_evolution
(i) to show how this can be done in general and (ii) because the default local
optimizer (l_bfgs_b
), even with n_restarts_optimizer>0
, can sometimes get stuck in
local optima or on flat plateaus. “Sometimes” because the start guess
is randomly selected from the bounds (the docs are vague on how to fix the RNG for
that, so we don’t).
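A minimal sketch of such a custom optimizer, following sklearn’s optimizer callable convention optimizer(obj_func, initial_theta, bounds) -> (theta_opt, func_min); the seed and tol values here are only for illustration:

from scipy.optimize import differential_evolution

def de_optimizer(obj_func, initial_theta, bounds):
    # obj_func returns the negative LML (and its gradient if eval_gradient=True);
    # differential_evolution is gradient-free, so we only request the value.
    # bounds are on the log-transformed hyperparameters theta.
    opt_result = differential_evolution(
        lambda theta: obj_func(theta, eval_gradient=False),
        bounds=bounds,
        seed=123,
        tol=0.001,
    )
    return opt_result.x, opt_result.fun

gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=length_scale) + WhiteKernel(noise_level=noise_level),
    optimizer=de_optimizer,
)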
Example results of optimized models#
GP, using the internal optimizer API and RBF+WhiteKernel
:
k1
is the Gaussian RBF kernel. length_scale
is the optimized kernel width
parameter. k2
is the WhiteKernel
with its optimized noise_level
parameter.
>>> gp.kernel_.get_params()
{'k1': RBF(length_scale=0.147),
'k1__length_scale': 0.14696558218508174,
'k1__length_scale_bounds': (1e-05, 2),
'k2': WhiteKernel(noise_level=0.0882),
'k2__noise_level': 0.08820850820059796,
'k2__noise_level_bounds': (0.001, 1)}
Fitted GP weights can be accessed via
GaussianProcessRegressor.alpha_
and optimized kernel hyperparameters via
GaussianProcessRegressor.kernel_.k1.length_scale
and
GaussianProcessRegressor.kernel_.k2.noise_level
, where trailing underscores denote values set by fit()
: the weights alpha_
and the optimized kernel kernel_
.
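In code (gp being the fitted instance from above):

gp.alpha_                   # fitted weights
gp.kernel_.k1.length_scale  # optimized RBF length scale
gp.kernel_.k2.noise_level   # optimized WhiteKernel noise level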
Why the GP and KRR hyperparameters are different after optimization#
We use both models to solve (1), and therefore one might expect the results of the
hyperopt (length_scale
and noise_level
) to be the same. They are not, however.
The reason is, as mentioned above already, that KRR has to resort to something
like cross validation (CV) to get a useful optimization objective, while GPs
can use maximization of the LML (see also this part of the sklearn
docs). They can be equivalent, given one performs a very
particular and super costly variant of CV involving an “exhaustive leave-p-out
cross-validation averaged over all values of p and all held-out test sets when
using the log posterior predictive probability as the scoring rule”, see
https://arxiv.org/abs/1905.08737 for details. This is nice but hard to do in
practice. Instead, we use KFold
(try to replace KFold
by LeavePOut
with
p>1
and then wait …). This basically means that any form of practically
usable CV is an approximation of the LML with varying quality.
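For instance, a KFold-based hyperopt for KRR could look like the sketch below; the grid values and number of splits are only for illustration, and the actual workflow may use a different search strategy:

import numpy as np
from sklearn.model_selection import GridSearchCV, KFold

param_grid = dict(
    alpha=np.logspace(-3, 0, 20),
    kernel=[RBF(length_scale=ell) for ell in np.logspace(-2, 1, 20)],
)
krr_search = GridSearchCV(
    KernelRidge(),
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
krr_search.fit(X, y)
print(krr_search.best_params_)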
We plot the CV and -LML surface as a function of length_scale
and noise_level
(note that sklearn
uses the log of these, sklearn.gaussian_process.kernels.Kernel.theta
, where theta=[length_scale, noise_level]
here):
Note that theta are typically the log-transformed values of the
kernel's hyperparameters as this representation of the search space
is more amenable for hyperparameter search, as hyperparameters like
length-scales naturally live on a log-scale.
We do the same if HyperOpt(..., logscale=True)
.
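To see the log transform in action, a small sketch:

import numpy as np
k = RBF(length_scale=0.5) + WhiteKernel(noise_level=0.1)
print(k.theta)                                   # log-transformed hyperparameters
print(np.allclose(k.theta, np.log([0.5, 0.1])))  # True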
Data scaling#
For both KRR and GP below, we work with the same scaled data (esp. zero mean).
KernelRidge
has no constant offset term to fit, i.e. there is no

* fit_intercept
  as in Ridge
* normalize_y
  as in GaussianProcessRegressor

which is why we use GaussianProcessRegressor(normalize_y=False)
to ensure a correct comparison.
Still, KRR can fit the data when the mean is very non-zero (e.g. y += 1000
)
since the hyperopt still finds correct params, also with fixed params. A GP with
GaussianProcessRegressor(normalize_y=False)
, in contrast, fails
because the LML is changed such that we can’t find a global opt any longer, even
for large param bounds up to say [1e-10, 1000]
for both length_scale
and noise_level
. With normalize_y=True
, the GP implementation zeros the data
mean before doing anything since it implements Alg. 2.1 from
[RW06], which assumes a zero mean
function in the calculation of weights and LML. In the prediction the mean is
added back at the end.
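Conceptually, the mean handling of normalize_y=True amounts to the following sketch (recent sklearn versions also scale y by its standard deviation; X_test is a hypothetical array of prediction points):

y_mean = y.mean(axis=0)
gp.fit(X, y - y_mean)                 # weights and LML computed for zero-mean targets
y_pred = gp.predict(X_test) + y_mean  # the mean is added back after prediction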