Missing Data (part three)
Nov 21, 2011
In the first and second installments of this series, we considered aspects of using an existing PCA model to replace missing variables. In this third part, we’ll move on to using PLS models.
Although it was shown previously that PCA can be used to perfectly impute missing values in rank-deficient, noise-free data, it’s not hard to guess that PCA might be suboptimal with regard to imputing missing elements in real, noisy data. The goal of PCA, after all, is to estimate the data subspace, not to predict particular elements. Prediction is typically the goal of regression methods, such as Partial Least Squares (PLS). In fact, regression models can be used to construct estimates of any and all variables in a data set based on the remaining variables. In our 1989 AIChE paper we proposed comparing those estimates to the actual values for the purpose of fault detection. This approach later became known as regression adjusted variables, as in Hawkins, 1991.
There is a little-known function in PLS_Toolbox, plsrsgn (included since the first version in 1989 or 90), that can be used to develop collections of PLS models in which each variable in a data set is predicted by the remaining variables. The regression vectors are mapped into a matrix that generates the residuals between the actual and predicted values in much the same way as the I - PP' matrix from PCA.
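The idea of mapping the regression vectors into a residual-generating matrix can be sketched in a few lines of NumPy. This is not the plsrsgn code; it is a minimal illustration that assumes mean-centered data, a basic PLS1 (NIPALS) fit, and a row-vector convention in which the residuals are X @ R. The names pls1_coef and residual_matrix are hypothetical, and the data are synthetic.

```python
import numpy as np

def pls1_coef(X, y, n_lv):
    """Regression vector for a single y from a basic PLS1 (NIPALS) fit.

    X and y are assumed to be mean-centered already.
    """
    X = X.copy()
    y = y.copy()
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = X.T @ y
        w /= np.linalg.norm(w)      # weight vector
        t = X @ w                   # scores
        tt = t @ t
        p = X.T @ t / tt            # X loadings
        c = (y @ t) / tt            # inner regression coefficient
        X -= np.outer(t, p)         # deflate X
        y -= c * t                  # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)


def residual_matrix(X, n_lv):
    """Build R so that X @ R holds actual minus predicted for every variable."""
    n_vars = X.shape[1]
    R = np.eye(n_vars)
    for j in range(n_vars):
        others = np.delete(np.arange(n_vars), j)
        b = pls1_coef(X[:, others], X[:, j], n_lv)
        R[others, j] = -b           # column j: 1 on the diagonal, -b elsewhere
    return R


# Synthetic, noise-free data of rank 2 in 5 variables (illustration only)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 5))
X -= X.mean(axis=0)

R = residual_matrix(X, n_lv=2)
E = X @ R   # column j is x_j minus its prediction from the other variables
```

Because each column of R comes from a separate regression, R is in general not symmetric, which is the main practical difference from I - PP'.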
We can compare the results of using these collections of PLS models to the PCA results obtained previously. Here we created the coeff matrix using (a conservative) 3 LVs in each of the PLS submodels. Each submodel could of course be optimized individually, but for illustration purposes this will be adequate. The reconstruction error of the PLS models is compared with PCA in the figure shown at left, where the error for the collection of PLS models is shown in red, superimposed over the reconstruction error of the PCA model, in blue. The PLS models’ error is lower for each variable, in some cases substantially, e.g., for variables 3-5.
The second figure, at left, shows the estimate of variable 5 for both the PLS (green) and PCA (red) methods compared to the measured values (blue). It is clear that the PLS model tracks the actual value much better.
Because the estimation error is smaller, collections of PLS models can be much more sensitive to process faults than PCA models, particularly individual sensor faults.
It is also possible to replace missing variables based on these collections of PLS models in (nearly) exactly the same manner as in PCA. The difference is that, unlike in PCA, the matrix which generates the residuals is not symmetric, so the $R_{12}$ term (see part one) does not equal $R_{21}'$. The solution is to calculate b using their average, thus
$$b = 0.5\,(R_{12} + R_{21}')\,R_{11}^{-1}$$
Curiously, unlike the PCA case, the residuals on the replaced variables will not be zero except in the unlikely case that $R_{12} = R_{21}'$.
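A small numerical sketch may help make the last point concrete. It assumes a row-vector convention in which the residuals are x @ R, with block 1 holding the missing variables and block 2 the measured ones; under that convention the averaged off-diagonal block appears as 0.5 (R21 + R12'), and the ordering and transposes may differ from the notation of part one. The function name replace_missing and all of the data are hypothetical.

```python
import numpy as np

def replace_missing(x, R, miss):
    """Fill x[miss] from the remaining entries using a residual matrix R.

    Assumed convention (not taken from the post): x is a row vector and
    the residuals are e = x @ R.
    """
    x = np.asarray(x, dtype=float).copy()
    miss = np.asarray(miss)
    keep = np.setdiff1d(np.arange(x.size), miss)
    R11 = R[np.ix_(miss, miss)]
    R12 = R[np.ix_(miss, keep)]
    R21 = R[np.ix_(keep, miss)]
    # Average the two off-diagonal blocks; for symmetric R this is just R21.
    R21_avg = 0.5 * (R21 + R12.T)
    # x_miss = -x_keep @ R21_avg @ inv(R11), computed with a linear solve
    x[miss] = -np.linalg.solve(R11.T, x[keep] @ R21_avg)
    return x


rng = np.random.default_rng(1)
miss = np.array([0, 1])
keep = np.array([2, 3, 4])

# Case 1: symmetric R from a PCA model, R = I - P P'
P, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # orthonormal loadings
R_pca = np.eye(5) - P @ P.T
x_true = rng.standard_normal(2) @ P.T               # lies in the model subspace
x_in = x_true.copy()
x_in[miss] = 0.0                                    # placeholder for missing values
x_pca = replace_missing(x_in, R_pca, miss)
res_pca = (x_pca @ R_pca)[miss]                     # residuals on replaced variables

# Case 2: a non-symmetric R of the form I - B, with zero diagonal in B
B = rng.uniform(-0.3, 0.3, size=(5, 5))
np.fill_diagonal(B, 0.0)
R_ns = np.eye(5) - B
x_ns = replace_missing(x_in, R_ns, miss)
res_ns = (x_ns @ R_ns)[miss]                        # generally not zero
```

In the symmetric case the replaced values reproduce x_true and res_pca vanishes. In the non-symmetric case res_ns works out to 0.5 * x_in[keep] @ (R21 - R12.T), which is zero only when the two off-diagonal blocks are transposes of each other.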
In the case of an existing single PLS model, it is of course possible to use this methodology to estimate the values of missing variables based on the PLS loadings. (Or, if you insist, on the PLS weights. Given that residuals based on weights are larger than residuals based on loadings, I’d expect better luck reconstructing from the loadings but I offer that here without proof.)
In the next installment of this series, we will consider the more challenging problem of building models on incomplete data records.
B.M. Wise, N.L. Ricker, and D.J. Veltkamp, “Upset and Sensor Failure Detection in Multivariate Processes,” AIChE Annual Meeting, 1989.
D.M. Hawkins, “Multivariate Quality Control Based on Regression Adjusted Variables,” Technometrics, Vol. 33, No. 1, 1991.