Radial Basis Function interpolation and regression

Note

While the material here is a useful reference and the code has been used in production, we actually recommend using Gaussian process regression instead, e.g. sklearn.gaussian_process.GaussianProcessRegressor, or, if you want to replicate rbf, kernel ridge regression (sklearn.kernel_ridge.KernelRidge). See https://github.com/elcorto/gp_playground for a detailed comparison of GPs and KRR.

This section gives some background information on the method implemented in rbf. For code examples, see the doc string of Rbf and examples/rbf.

Theory

The goal is to interpolate or fit an unordered set of \(M\) ND data points \(\mathbf x_i\) and values \(z_i\) so as to obtain \(z=f(\mathbf x)\). In radial basis function (RBF) interpolation, the interpolating function \(f(\mathbf x)\) is a linear combination of RBFs \(\phi(r)\)

\[f(\mathbf x) = \sum_j w_j\,\phi(|\mathbf x - \mathbf c_j|)\]

with the weights \(w_j\) and the center points \(\mathbf c_j\).

An RBF \(\phi(r)\) is a function of the distance \(r=|\mathbf x - \mathbf c_j|\) between \(\mathbf x\) and a center point. Common functions include

\[\begin{split}\begin{align} & \phi(r) = \exp\left(-\frac{r^2}{2\,p^2}\right) && \text{Gaussian}\\ & \phi(r) = \sqrt{r^2 + p^2} && \text{multiquadric}\\ & \phi(r) = \frac{1}{\sqrt{r^2 + p^2}} && \text{inverse multiquadric} \end{align}\end{split}\]

All RBFs have a single parameter \(p\), which defines the length scale of the function.

../../_images/rbfs.png
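To make the definitions concrete, here is a minimal NumPy sketch of the three RBFs above (the function names are chosen only for this example and are not part of the rbf API)::

    import numpy as np

    def rbf_gauss(r, p):
        # phi(r) = exp(-r^2 / (2 p^2))
        return np.exp(-0.5 * (r / p)**2)

    def rbf_multi(r, p):
        # phi(r) = sqrt(r^2 + p^2)
        return np.sqrt(r**2 + p**2)

    def rbf_inv_multi(r, p):
        # phi(r) = 1 / sqrt(r^2 + p^2)
        return 1.0 / np.sqrt(r**2 + p**2)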

In interpolation the center points \(\mathbf c_j\) are equal to the data points \(\mathbf x_j\) such that

\[\begin{split}\begin{gather} z_i = f(\mathbf x_i) = \sum_j w_j\,\phi(|\mathbf x_i - \mathbf x_j|) = \sum_j w_j\,K_{ij}\\ \mathbf K\,\mathbf w = \mathbf z\\ \end{gather}\end{split}\]

with \(\mathbf K\) the \(M\times M\) matrix of RBF function values. The weights \(\mathbf w = (w_j)\) are found by solving the linear system \(\mathbf K\,\mathbf w = \mathbf z\).
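As an illustration of the equations above (a sketch from first principles, not the pwtools implementation itself), the weights for Gaussian RBF interpolation can be computed and used like this::

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.linalg import solve

    def fit_interp(points, z, p=1.0):
        # K_ij = phi(|x_i - x_j|), here with the Gaussian RBF
        K = np.exp(-0.5 * (cdist(points, points) / p)**2)
        # solve K w = z for the weights
        return solve(K, z)

    def predict(points_new, points, w, p=1.0):
        # f(x) = sum_j w_j phi(|x - x_j|)
        K_new = np.exp(-0.5 * (cdist(points_new, points) / p)**2)
        return K_new @ w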

Note that for certain values of \(p\), especially large ones, \(\mathbf K\) may become singular, the limit being \(p\rightarrow\infty\Rightarrow\phi\rightarrow\text{const}\). In that case one should solve a regression problem instead (see below).

RBF parameter \(p\)

Each RBF has a single “length scale” parameter \(p\) (Rbf(p=...), attribute Rbf.p). While in interpolation \(f(\mathbf x)\) goes through all points (\(\mathbf x_i\), \(z_i\)) by definition, the behavior between points is determined by \(p\): very narrow functions \(\phi\), for instance, can lead to oscillations between points. Therefore \(p\) must be optimized for the data set at hand, which is best done by cross-validation as explained below.

Nevertheless we provide two methods to estimate reasonable start values.

The scipy implementation \(p_{\text{scipy}}\) in scipy.interpolate.Rbf uses something like the mean nearest neighbor distance. We provide this as Rbf(points, values, p='scipy') or rbf.estimate_p(points, 'scipy'). The default in pwtools.rbf.Rbf, however, is the mean distance of all points

\[p_{\text{pwtools}}=\frac{1}{M^2}\,\sum_{ij} |\mathbf x_i - \mathbf x_j|\]

Use Rbf(points, values, p='mean') or rbf.estimate_p(points, 'mean'). This is always larger than \(p_{\text{scipy}}\) and can be a better initial guess for \(p\), especially in the case of regression.
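For reference, the 'mean' estimate above can be written down in a few lines of NumPy (a sketch, not the pwtools code itself)::

    import numpy as np
    from scipy.spatial.distance import pdist

    def estimate_p_mean(points):
        # p = 1/M^2 * sum_ij |x_i - x_j|; pdist() returns each pair once,
        # so multiply by 2 (the i=j terms are zero)
        M = len(points)
        return 2.0 * pdist(points).sum() / M**2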

Interpolation vs. regression, regularization and numerical stability

In case of noisy data, we may want to do regression. Similar to the s parameter of scipy.interpolate.bisplrep() or the smooth parameter for regularization in scipy.interpolate.Rbf, we solve a regularized version of the linear system

\[(\mathbf K + r\,\mathbf I)\,\mathbf w = \mathbf z\]

using, for instance, Rbf(points, values, r=1e-8) (attribute Rbf.r). The default however is \(r=0\) (interpolation). \(r>0\) creates a more “stiff” (low curvature) function \(f(\mathbf x)\) which does not interpolate all points, where \(r\) is a measure of the assumed noise. If the RBF happens to imply a Mercer kernel (which the Gaussian one does), then this is equal to kernel ridge regression (KRR).

Apart from smoothing, the regularization also mitigates the numerical instability of solving \(\mathbf K\,\mathbf w = \mathbf z\).

One can also switch from scipy.linalg.solve() to scipy.linalg.lstsq() and solve the system in a least squares sense without regularization by setting Rbf(..., r=None). sklearn.kernel_ridge.KernelRidge switches to least squares automatically when a singular \(\mathbf K\) is detected. However, we observed that in many cases this results in numerical noise in the solution, especially when \(\mathbf K\) is (near) singular. So while we provide that feature for experimentation, we recommend not using it in production; instead use a KRR setup and properly determine \(p\) and \(r\) as described below.
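The following sketch (again from first principles, not the pwtools code) shows the regularized solve and the least squares variant side by side; the Gaussian RBF and the parameter names p and r mirror the notation used here::

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.linalg import solve, lstsq

    def fit_weights(points, z, p=1.0, r=1e-8):
        K = np.exp(-0.5 * (cdist(points, points) / p)**2)
        if r is None:
            # least squares solution, no regularization
            w, *rest = lstsq(K, z)
        else:
            # regularized system (K + r I) w = z
            w = solve(K + r * np.eye(len(z)), z)
        return w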

How to determine \(p\) and \(r\)

\(p\) (and \(r\)) should always be tuned to the data set at hand by minimizing a cross validation (CV) score. You can do this with pwtools.rbf.hyperopt.fit_opt(). Here is an example where we optimize only \(p\), with \(r\) kept constant at a very small value such as 1e-11. See examples/rbf/overfit.py.

../../_images/overfit_reg.png

We observe a flat CV score for \(p>1\) without a pronounced global minimum, even though we find a shallow one at around \(p=13\), while the MSE loss (fit error) would suggest a very small \(p\) value (i.e. interpolation or “zero training loss” in ML terms).

Now we investigate \(p\) and \(r\) (see examples/rbf/crossval_map_p_r.py) by plotting the CV score as function of both. This is the result for the Gaussian RBF.

../../_images/crossval_pr_gauss.png

When \(r < 10^{-6}\), it basically doesn’t matter which \(p\) we use, as long as it is big enough to not overfit, i.e. \(p>1\). In these regions, both \(p\) and \(r\) provide stiffness. The large \(p\)-\(r\) valley of low CV score is flat and a bit rugged, such that there is no pronounced global optimum, as shown above for \(r=10^{-11}\). As we increase \(r\) (stronger regularization = stronger smoothing), we are restricted to lower \(p\) (narrow RBFs which can overfit), since the now higher \(r\) already provides enough stiffness to prevent overfitting.
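Since the Gaussian RBF with regularization is equivalent to KRR, one way to run such a scan over \(p\) and \(r\), as an alternative to pwtools.rbf.hyperopt.fit_opt(), is a CV grid search with sklearn. This is a hedged sketch only: alpha plays the role of \(r\), and gamma = 1/(2 p^2) maps \(p\) to sklearn's RBF kernel \(\exp(-\gamma\,r^2)\)::

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.model_selection import GridSearchCV

    p_grid = np.logspace(-1, 2, 20)
    param_grid = dict(
        gamma=1.0 / (2.0 * p_grid**2),     # gamma = 1/(2 p^2)
        alpha=np.logspace(-12, -1, 12),    # plays the role of r
    )
    search = GridSearchCV(
        KernelRidge(kernel="rbf"),
        param_grid,
        cv=5,
        scoring="neg_mean_squared_error",
    )
    # search.fit(X, z)    # X: (M, N) array of points, z: (M,) values
    # search.best_params_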

Data scaling

In pwtools.num.PolyFit (see the scale keyword) we scale \(\mathbf x\) and \(z\) to \([0,1]\). This is not implemented here, but we suggest always scaling at least \(z\). Since, in contrast to polynomials, the model here has no constant bias term, one should normalize to zero mean instead (rather than scaling to \([0,1]\)), using something like sklearn.preprocessing.StandardScaler.
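A minimal sketch of scaling only \(z\) with StandardScaler (the data and variable names are just for illustration)::

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    z = rng.normal(loc=5.0, scale=2.0, size=50)           # example values

    scaler = StandardScaler()
    z_scaled = scaler.fit_transform(z[:, None]).ravel()   # zero mean, unit variance
    # ... fit the RBF model with z_scaled, predict z_pred_scaled ...
    # z_pred = scaler.inverse_transform(z_pred_scaled[:, None]).ravel()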

Other implementations