Category Archives: Chemometrics

Chemometrics news and issues.

A History of PLS_Toolbox

Jan 24, 2008

I started graduate school at the University of Washington Department of Chemical Engineering in the Fall of 1985. Sometime around Fall of 1986 somebody showed me MATLAB. Wow. That was the last day I ever wrote anything in Basic or Fortran; it was MATLAB from there on out. In 1987 I finished my MS in ChemE and started on a new project which became my dissertation, “Adapting Multivariate Analysis for Modeling and Monitoring Dynamic Systems”. In order to do this research I needed to develop multivariate analysis routines and process simulations, so MATLAB was the logical tool of choice.

At some point late in 1989 I realized that I had created a significant number of routines that might be of use to other researchers. I collected these functions, wrote sensible help files for them, and wrote a brief manual. I’d been working a lot with Partial Least Squares (PLS) regression, and the bulk of the functions I’d created related to that, so I decided (for better or worse) to call it PLS_Toolbox. Why the underscore? Honestly, I don’t remember. It may have had to do with the inability of some operating systems to deal with path names that included whitespace. And I didn’t like running it together, PLSToolbox, because that reads like PL Stoolbox, and I didn’t like the connotation.

So in the fall of 1989 I printed up some manuals for PLS_Toolbox 1.0 and started distributing it around the Chemical Engineering Department and the Center for Process Analytical Chemistry. The rest, as they say, is history. After graduating from UW in 1991, I continued to update PLS_Toolbox and distribute it under the company Eigenvector Technologies. Battelle Pacific Northwest National Laboratory, my employer, had no interest in it. So I worked on it evenings and weekends and continued to release updates.

I founded Eigenvector Research, Inc. with Neal Gallagher on January 1, 1995, though PLS_Toolbox still came out under Eigenvector Technologies until version 2.0. A complete list of releases is given below.

PLS_Toolbox 1.0 late 1989 or early 1990
PLS_Toolbox 1.1 1990
PLS_Toolbox 1.2 1991
PLS_Toolbox 1.3 1993
PLS_Toolbox 1.4 1994 (July)
PLS_Toolbox 1.5 1995 (July–added author Neal B. Gallagher)
PLS_Toolbox 2.0 1998 (April–first version under Eigenvector Research)
PLS_Toolbox 2.1 2000 (November)
PLS_Toolbox 3.0 2002 (December–added authors Rasmus Bro and Jeremy M. Shaver)
PLS_Toolbox 3.5 2004 (August–added authors Willem Windig and R. Scott Koch)
PLS_Toolbox 4.0 2006 (May)
PLS_Toolbox 4.1 2007 (June)
PLS_Toolbox 4.2 2008 (January)

The release of PLS_Toolbox 4.2 this month brings the total number of versions to 13. We’ve been pretty stingy with our version numbers, changing them in increments of only 0.1 even when we added significant functionality. In other software companies PLS_Toolbox 4.2 would probably be known as version 9.1 or something like that.

Hope you enjoyed the history lesson, and thanks for checking in!


A Recipe for Failure

Jun 6, 2007

Recently, I’ve seen numerous examples of what we call the “throw the data over the wall” model of data analysis, mostly in the form of “challenge problems.” Typically, participants are given data sets, without much background information, and asked to analyze them or develop models to make predictions on new data sets. Tests of this sort are of questionable value, as this mode of data analysis is really a recipe for failure. In some instances, I’ve wondered if the tests weren’t set up intentionally to produce failure. That got me thinking about how to set up a test to best meet this sinister objective.

If you want to assure failure in data analysis, just follow this recipe:

1) Don’t tell the analyst anything about the chemistry or physics of the system that generated the data. Just present it as a table of numbers. This way the analyst won’t be able to use any physical reasoning to help make choices in data pre-treatment or interpretation of results.

2) Be sure to not include any information about the noise levels in measurements or the reference values. The analyst will have a much harder time determining when the models are over-fit or under-fit. And for sure don’t tell the analyst if the noise has an unusual distribution.

3) Leave the data in a form that makes it non-linear. Alternatively, preprocess it in a way that “adds to the rank.”

4) Don’t tell the analyst anything about how the data was obtained, e.g. whether it is “happenstance” or designed data. If it’s designed, make sure that the data is taken in such a way that system drift and/or other environmental effects are confounded with the factors of interest.

5) Make sure the data sets are “short and fat,” i.e. with many times more variables than samples. This will make it much harder for the analyst to recognize issues related to 2), 3) and 4). And it will make it especially fun for the analyst if the problem has anything to do with variable selection.

6) Compare the results obtained to those from an expert who knows everything about the pedigree of the data. Additionally, it’s useful if the expert has a lot of preformed “opinions” about what the results of the analysis should be.
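
To put a rough number on why item 5) is so effective, here is a minimal Python sketch. It is not anything from PLS_Toolbox; the function names, sample sizes and seed are made up for illustration. It generates pure noise for both the variables and the response, then reports the largest absolute correlation any single variable has with the response, once for a “short and fat” table and once for a “tall and thin” one.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def max_chance_correlation(n_samples, n_variables, seed=0):
    """Largest |r| between a pure-noise response and any pure-noise variable.

    Hypothetical helper for illustration: there is no real signal here,
    so any correlation it finds is chance alone.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_variables):
        x = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(x, y)))
    return best


if __name__ == "__main__":
    # Short and fat: 10 samples, 1000 variables.
    print("short and fat:", round(max_chance_correlation(10, 1000), 2))
    # Tall and thin: 1000 samples, 10 variables.
    print("tall and thin:", round(max_chance_correlation(1000, 10), 2))
```

In the short and fat case, some noise variable will typically look strongly correlated with the response, while in the tall and thin case none will. That is the trap waiting for anyone doing variable selection without knowing where the data came from.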

If you stick to this recipe, you can certainly improve the odds of making the analyst, and/or the software they are using, look bad. Admittedly, it won’t guarantee failure, as some analysts are very knowledgeable about the limitations of their methods and are very good with their tools. Obviously, if your goal is to make software and/or chemometric methods unsuccessful, choose a less knowledgeable analyst in addition to 1) – 6) above.

Of course, if you actually want to succeed at data analysis, try not to follow the recipe. At EVRI we’re firm believers in understanding where the data comes from, and anything about it that will help us make intelligent modeling choices. Software can do a lot, and we’re making it better all the time, but you just can’t replace a well-educated and informed analyst.