Use of Unlabeled Data in Regression Modeling
Jun 7, 2009
In 1995, Edward V. Thomas published “Incorporating Auxiliary Predictor Variation in Principal Components Regression” in J. Chemometrics (vol. 9, no. 6, pp. 471-481). Thomas demonstrated how additional samples, without matching reference values, can be used when building a principal components regression (PCR) model. These samples, commonly called “unlabeled data,” help stabilize the estimates of the principal components, so their use often results in slightly better models than using only the labeled data.
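For readers who want to see the idea in code, here is a minimal sketch in Python with NumPy. It is not Thomas's published procedure in detail: the function name pcr_with_unlabeled, the choice to mean-center with the pooled mean, and the explicit intercept term are all assumptions made for illustration. The loadings are estimated from the labeled and unlabeled samples together, and the regression step uses only the labeled samples.

```python
import numpy as np

def pcr_with_unlabeled(X_lab, y_lab, X_unlab, n_components):
    """Illustrative PCR where unlabeled samples help estimate the loadings.

    Hypothetical helper, not taken from the paper. Returns (b, b0) so that
    predictions are X @ b + b0.
    """
    # Estimate the mean and the principal components from ALL samples.
    X_all = np.vstack([X_lab, X_unlab])
    mean_x = X_all.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_all - mean_x, full_matrices=False)
    P = Vt[:n_components].T  # loadings, shape (variables, components)

    # Regress the reference values on the scores of the LABELED samples only.
    # An intercept column is included because the labeled scores need not
    # have zero mean after centering with the pooled mean.
    T_lab = (X_lab - mean_x) @ P
    design = np.column_stack([np.ones(len(T_lab)), T_lab])
    coef, *_ = np.linalg.lstsq(design, y_lab, rcond=None)

    b = P @ coef[1:]             # regression vector in the original variables
    b0 = coef[0] - mean_x @ b    # intercept in the original variables
    return b, b0
```

Calling it with an empty X_unlab of shape (0, variables) reduces this to PCR on the labeled data, apart from the explicit intercept, which makes it easy to compare the two models on the same test set.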
While at EPFL last fall, I worked with Paman Gujral, Michael Amrhein and Dominique Bonvin on methods for updating regression models. Often, one of the problems with updating models is a lack of reference values, i.e., all the new data is unlabeled. Thus, it seemed natural to see how “Edward’s PCR” worked in this situation.
The result is the study “On the bias-variance trade-off in principal component regression with unlabeled data,” which will be presented this week as a poster at SSC-11. The study shows that “Edward’s PCR” works great if the new data is still in the same subspace as the old data. This might occur, for instance, in spectroscopic applications where the same set of analytes still exists but their range has been expanded. But when new analytes are thrown into the mix, thus expanding the subspace, this method leads to even larger prediction biases than not updating the model at all. This is because the new unlabeled samples rotate the principal components (PCs), and ultimately the regression vector, toward the new analytes, while there are no reference values to tell the model to ignore this part of the subspace.
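The rotation of the PCs can be seen in a small noise-free toy example. This is a sketch under stated assumptions, not the simulation from the poster: the random “pure spectra” in S, the concentration ranges, and the helper names pc_basis and largest_angle_deg are all hypothetical. It compares the two-component subspace estimated from the labeled data alone with the one estimated after pooling in unlabeled data, first when the same two analytes simply cover a wider range, then when a third analyte appears.

```python
import numpy as np

def pc_basis(X, n_components):
    """Orthonormal loadings (variables x components) of mean-centered X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_components].T

def largest_angle_deg(A, B):
    """Largest principal angle, in degrees, between span(A) and span(B).

    A and B must have orthonormal columns and the same number of columns.
    """
    residual = B - A @ (A.T @ B)  # part of B lying outside span(A)
    sine = min(1.0, np.linalg.norm(residual, 2))
    return float(np.degrees(np.arcsin(sine)))

rng = np.random.default_rng(0)
S = rng.normal(size=(3, 50))  # hypothetical pure spectra of three analytes

# Labeled calibration set: analytes 1 and 2 only.
C_lab = np.column_stack([rng.uniform(0, 1, size=(30, 2)), np.zeros(30)])
X_lab = C_lab @ S

# Case 1: unlabeled data with the same two analytes over a wider range.
C_wide = np.column_stack([rng.uniform(0, 3, size=(30, 2)), np.zeros(30)])
# Case 2: unlabeled data in which a third analyte appears.
C_new = np.column_stack([rng.uniform(0, 1, size=(30, 2)),
                         rng.uniform(0, 5, size=30)])

P_old = pc_basis(X_lab, 2)
angle_same = largest_angle_deg(P_old, pc_basis(np.vstack([X_lab, C_wide @ S]), 2))
angle_new = largest_angle_deg(P_old, pc_basis(np.vstack([X_lab, C_new @ S]), 2))

print(f"same analytes, wider range: {angle_same:.2e} degrees")
print(f"new analyte added:          {angle_new:.2f} degrees")
```

In the first case the pooled subspace coincides with the original one up to rounding error, while in the second it tilts toward the third analyte. The sketch only shows the rotation itself. How much prediction bias results depends on the noise level, the number of components, and how strongly the new analyte varies.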
Hope you enjoy the poster, and I hope everybody has a good week at SSC-11!