.. include:: ../refs.rst

.. _rbf:

Radial Basis Function interpolation and regression
==================================================

Some background information on the method implemented in :mod:`~pwtools.rbf`.
For code examples, see the doc string of :class:`~pwtools.rbf.core.Rbf` and
``examples/rbf``.

Theory
------

The goal is to interpolate or fit an unordered set of :math:`M` ND data points
:math:`\mathbf x_i` and values :math:`z_i` so as to obtain
:math:`z=f(\mathbf x)`. In radial basis function (RBF) interpolation, the
interpolating function :math:`f(\mathbf x)` is a linear combination of RBFs
:math:`\phi(r)`

.. math::

   f(\mathbf x) = \sum_j w_j\,\phi(|\mathbf x - \mathbf c_j|)

with the weights :math:`w_j` and the center points :math:`\mathbf c_j`.

An RBF :math:`\phi(r)` is a function of the distance
:math:`r=|\mathbf x - \mathbf c_j|` between :math:`\mathbf x` and a center
point. Common functions include

.. math::

   \begin{align}
       & \phi(r) = \exp\left(-\frac{r^2}{2\,p^2}\right) && \text{Gaussian}\\
       & \phi(r) = \sqrt{r^2 + p^2} && \text{multiquadric}\\
       & \phi(r) = \frac{1}{\sqrt{r^2 + p^2}} && \text{inverse multiquadric}
   \end{align}

All RBFs contain a single parameter :math:`p` which defines the length scale
of the function.

.. image:: ../../_static/rbfs.png

In interpolation the center points :math:`\mathbf c_j` are equal to the data
points :math:`\mathbf x_j` such that

.. math::

   \begin{gather}
       z_i = f(\mathbf x_i) = \sum_j w_j\,\phi(|\mathbf x_i - \mathbf x_j|)
           = \sum_j w_j\,K_{ij}\\
       \mathbf K\,\mathbf w = \mathbf z
   \end{gather}

with :math:`\mathbf K` the :math:`M\times M` matrix of RBF values. The weights
:math:`\mathbf w = (w_j)` are found by solving the linear system
:math:`\mathbf K\,\mathbf w = \mathbf z`. Note that for certain values of
:math:`p`, :math:`\mathbf K` may become singular, especially large ones, with
the limit being
:math:`p\rightarrow\infty\Rightarrow\phi\rightarrow\text{const}`. Then one
should solve a regression problem instead (see below).

RBF parameter :math:`p`
-----------------------

Each RBF has a single "length scale" parameter :math:`p` (``Rbf(p=...)``,
attribute ``Rbf.p``). While in interpolation :math:`f(\mathbf x)` goes through
all points (:math:`\mathbf x_i`, :math:`z_i`) by definition, the behavior
between points is determined by :math:`p`, where e.g. very narrow functions
:math:`\phi` can lead to oscillations between points. Therefore :math:`p` must
be optimized for the data set at hand, which should be done by cross
validation as explained below. Nevertheless, we provide two methods to
estimate reasonable start values.

The scipy implementation :math:`p_{\text{scipy}}` in
:class:`scipy.interpolate.Rbf` uses something like the mean nearest neighbor
distance. We provide this as ``Rbf(points, values, p='scipy')`` or
``rbf.estimate_p(points, 'scipy')``. The default in :class:`pwtools.rbf.Rbf`
however is the mean distance of all points

.. math::

   p_{\text{pwtools}} = \frac{1}{M^2}\,\sum_{ij} |\mathbf x_i - \mathbf x_j|

Use ``Rbf(points, values, p='mean')`` or ``rbf.estimate_p(points, 'mean')``.
This is always bigger than :math:`p_{\text{scipy}}` and can be a better start
guess for :math:`p`, especially in case of regression.

Interpolation vs. regression, regularization and numerical stability
---------------------------------------------------------------------

In case of noisy data, we may want to do regression. Similar to the `s`
parameter of :func:`scipy.interpolate.bisplrep` or the `smooth` parameter for
regularization in :class:`scipy.interpolate.Rbf`, we solve a regularized
version of the linear system

.. math::

   (\mathbf K + r\,\mathbf I)\,\mathbf w = \mathbf z

using, for instance, ``Rbf(points, values, r=1e-8)`` (attribute ``Rbf.r``).
The default however is :math:`r=0` (interpolation). :math:`r>0` creates a more
"stiff" (low curvature) function :math:`f(\mathbf x)` which does not
interpolate all points, where ``r`` is a measure of the assumed noise. If the
RBF happens to imply a Mercer kernel (which the Gaussian one does), then this
is equal to kernel ridge regression (KRR).

Apart from smoothing, the regularization also deals with the numerical
instability of :math:`\mathbf K\,\mathbf w = \mathbf z`. One can also switch
from :func:`scipy.linalg.solve` to :func:`scipy.linalg.lstsq` and solve the
system in a least squares sense without regularization by setting
``Rbf(..., r=None)``. In :class:`sklearn.kernel_ridge.KernelRidge` there is an
automatic switch to least squares when a singular :math:`\mathbf K` is
detected. However, we observed that in many cases this results in numerical
noise in the solution, especially when :math:`\mathbf K` is (near) singular.
So while we provide that feature for experimentation, we recommend not using
it in production. Instead, use a KRR setup and properly determine :math:`p`
and :math:`r` as described below.

How to determine :math:`p` and :math:`r`
----------------------------------------

:math:`p` (and :math:`r`) should always be tuned to the data set at hand by
minimization of a cross validation (CV) score. You can do this with
:func:`pwtools.rbf.hyperopt.fit_opt`. Here is an example where we optimize
only :math:`p`, with :math:`r` kept constant and very small (e.g. ``1e-11``).
See ``examples/rbf/overfit.py``.

.. image:: ../../_static/overfit_reg.png

We observe a flat CV score for :math:`p>1` without a pronounced global
minimum, even though we find a shallow one at around :math:`p=13`, while the
MSE loss (fit error) would suggest a very small :math:`p` value (i.e.
interpolation or "zero training loss" in ML terms).

Now we investigate :math:`p` and :math:`r` together (see
``examples/rbf/crossval_map_p_r.py``) by plotting the CV score as a function
of both. This is the result for the Gaussian RBF.

.. image:: ../../_static/crossval_pr_gauss.png

When :math:`r < 10^{-6}`, it basically doesn't matter which :math:`p` we use,
as long as it is big enough to not overfit, i.e. :math:`p>1`. In these
regions, both :math:`p` and :math:`r` provide stiffness. The large
:math:`p`-:math:`r` valley of low CV score is flat and a bit rugged, such that
there is no pronounced global optimum, as shown above for :math:`r=10^{-11}`.
As we increase :math:`r` (stronger regularization = stronger smoothing), we
are restricted to lower :math:`p` (narrow RBFs which can overfit) since the
now higher :math:`r` already provides enough stiffness to prevent overfitting.
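As a concrete illustration of such a CV scan, here is a minimal hand-rolled
sketch of the kind of optimization that :func:`pwtools.rbf.hyperopt.fit_opt`
automates (check its doc string and ``examples/rbf`` for the actual API). We
assume that ``Rbf`` instances are callable on new points; the synthetic data,
the grid and the helper ``cv_score`` are made up for illustration::

    import numpy as np
    from sklearn.model_selection import KFold
    from pwtools import rbf

    rng = np.random.default_rng(42)

    # Noisy 1D toy data: M=100 points, points has shape (M, 1).
    points = rng.uniform(0, 10, size=(100, 1))
    values = np.sin(points[:, 0]) + 0.1 * rng.normal(size=100)

    def cv_score(p, r, n_splits=5):
        """Mean squared CV error of an Rbf fit with given p and r."""
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        errs = []
        for train, test in cv.split(points):
            f = rbf.Rbf(points[train], values[train], p=p, r=r)
            errs.append(((f(points[test]) - values[test])**2).mean())
        return np.mean(errs)

    # Keep r fixed and small, scan p on a log grid, pick the minimum.
    p_grid = np.logspace(-1, 2, 30)
    scores = [cv_score(p, r=1e-11) for p in p_grid]
    p_opt = p_grid[np.argmin(scores)]

    # Final model, fitted on all data with the optimized p.
    f_opt = rbf.Rbf(points, values, p=p_opt, r=1e-11)

For real use, prefer :func:`pwtools.rbf.hyperopt.fit_opt`, which performs this
kind of optimization for you.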
Data scaling
------------

In :class:`pwtools.num.PolyFit` (see the `scale` keyword) we scale
:math:`\mathbf x` and :math:`z` to :math:`[0,1]`. Here we don't have that
implemented, but we suggest to always scale at least :math:`z`. Since, in
contrast to polynomials, the model here has no constant bias term, one should
normalize to zero mean instead, using something like
:class:`sklearn.preprocessing.StandardScaler`.

Other implementations
---------------------

* :class:`scipy.interpolate.Rbf`
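Finally, to tie the pieces above together, here is a short sketch of the
zero-mean scaling of :math:`z` suggested in the data scaling section, combined
with a regression fit. The data is again synthetic, and the parameter values
(``p='mean'``, ``r=1e-8``) are only placeholders; in practice one would
determine them by CV as shown above::

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from pwtools import rbf

    rng = np.random.default_rng(0)
    points = rng.uniform(-3, 3, size=(200, 2))
    # Values with a large constant offset, i.e. a strongly non-zero mean.
    values = np.exp(-(points**2).sum(axis=1)) + 5.0

    # StandardScaler expects 2D arrays, hence the [:, None] reshape.
    scaler = StandardScaler()
    values_scaled = scaler.fit_transform(values[:, None])[:, 0]

    # Fit in the scaled space (zero mean, unit variance).
    f = rbf.Rbf(points, values_scaled, p='mean', r=1e-8)

    # Evaluate at new points and transform back to the original scale.
    new_points = rng.uniform(-3, 3, size=(5, 2))
    predictions = scaler.inverse_transform(f(new_points)[:, None])[:, 0]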