Multiple-output Gaussian Process regression in scikit-learn

First of all, if the parameter used is "sigma", that's referring to standard deviation, not variance (recall, variance is just standard deviation squared).

It's easier to conceptualize using variance, since variance is defined as the Euclidean distance from a data point to the mean of the set.

In your case, you have a set of 2D points. If you think of these as points on a 2D plane, then the variance is just the distance from each point to the mean. The standard deviation than would be the positive root of the variance.

In this case, you have 16 test points, and 16 values of standard deviation. This makes perfect sense, since each test point has its own defined distance from the mean of the set.

If you want to compute the variance of the SET of points, you can do that by summing the variance of each point individually, dividing that by the number of points, then subtracting the mean squared. The positive root of this number will yield the standard deviation of the set.

ASIDE: this also means that if you change the set through insertion, deletion, or substitution, the standard deviation of EVERY point will change. This is because the mean will be recomputed to accommodate the new data. This iterative process is the fundamental force behind k-means clustering.


Well, you have inadvertently hit on an iceberg indeed...

As a prelude, let's make clear that the concepts of variance & standard deviation are defined only for scalar variables; for vector variables (like your own 3d output here), the concept of variance is no longer meaningful, and the covariance matrix is used instead (Wikipedia, Wolfram).

Continuing on the prelude, the shape of your sigma is indeed as expected according to the scikit-learn docs on the predict method (i.e. there is no coding error in your case):

Returns:

y_mean : array, shape = (n_samples, [n_output_dims])

Mean of predictive distribution a query points

y_std : array, shape = (n_samples,), optional

Standard deviation of predictive distribution at query points. Only returned when return_std is True.

y_cov : array, shape = (n_samples, n_samples), optional

Covariance of joint predictive distribution a query points. Only returned when return_cov is True.

Combined with my previous remark about the covariance matrix, the first choice would be to try the predict function with the argument return_cov=True instead (since asking for the variance of a vector variable is meaningless); but again, this will lead to a 16x16 matrix, instead of a 3x3 one (the expected shape of a covariance matrix for 3 output variables)...

Having clarified these details, let's proceed to the essence of the issue.


At the heart of your issue lies something rarely mentioned (or even hinted at) in practice and in relevant tutorials: Gaussian Process regression with multiple outputs is highly non-trivial and still a field of active research. Arguably, scikit-learn cannot really handle the case, despite the fact that it will superficially appear to do so, without issuing at least some relevant warning.

Let's look for some corroboration of this claim in the recent scientific literature:

Gaussian process regression with multiple response variables (2015) - quoting (emphasis mine):

most GPR implementations model only a single response variable, due to the difficulty in the formulation of covariance function for correlated multiple response variables, which describes not only the correlation between data points, but also the correlation between responses. In the paper we propose a direct formulation of the covariance function for multi-response GPR, based on the idea that [...]

Despite the high uptake of GPR for various modelling tasks, there still exists some outstanding issues with the GPR method. Of particular interest in this paper is the need to model multiple response variables. Traditionally, one response variable is treated as a Gaussian process, and multiple responses are modelled independently without considering their correlation. This pragmatic and straightforward approach was taken in many applications (e.g. [7, 26, 27]), though it is not ideal. A key to modelling multi-response Gaussian processes is the formulation of covariance function that describes not only the correlation between data points, but also the correlation between responses.

Remarks on multi-output Gaussian process regression (2018) - quoting (emphasis in the original):

Typical GPs are usually designed for single-output scenarios wherein the output is a scalar. However, the multi-output problems have arisen in various fields, [...]. Suppose that we attempt to approximate T outputs {f(t}, 1 ≤t ≤T , one intuitive idea is to use the single-output GP (SOGP) to approximate them individually using the associated training data D(t) = { X(t), y(t) }, see Fig. 1(a). Considering that the outputs are correlated in some way, modeling them individually may result in the loss of valuable information. Hence, an increasing diversity of engineering applications are embarking on the use of multi-output GP (MOGP), which is conceptually depicted in Fig. 1(b), for surrogate modeling.

The study of MOGP has a long history and is known as multivariate Kriging or Co-Kriging in the geostatistic community; [...] The MOGP handles problems with the basic assumption that the outputs are correlated in some way. Hence, a key issue in MOGP is to exploit the output correlations such that the outputs can leverage information from one another in order to provide more accurate predictions in comparison to modeling them individually.

enter image description here

Physics-Based Covariance Models for Gaussian Processes with Multiple Outputs (2013) - quoting:

Gaussian process analysis of processes with multiple outputs is limited by the fact that far fewer good classes of covariance functions exist compared with the scalar (single-output) case. [...]

The difficulty of finding “good” covariance models for multiple outputs can have important practical consequences. An incorrect structure of the covariance matrix can significantly reduce the efficiency of the uncertainty quantification process, as well as the forecast efficiency in kriging inferences [16]. Therefore, we argue, the covariance model may play an even more profound role in co-kriging [7, 17]. This argument applies when the covariance structure is inferred from data, as is typically the case.


Hence, my understanding, as I said, is that sckit-learn is not really capable of handling such cases, despite the fact that something like that is not mentioned or hinted at in the documentation (it may be interesting to open a relevant issue at the project page). This seems to be the conclusion in this relevant SO thread, too, as well as in this CrossValidated thread regarding the GPML (Matlab) toolbox.

Having said that, and apart from reverting to the choice of simply modeling each output separately (not an invalid choice, as long as you keep in mind that you may be throwing away useful information from the correlation between your 3-D output elements), there is at least one Python toolbox which seems capable of modeling multiple-output GPs, namely the runlmc (paper, code, documentation).