
Category Archives: Chemometrics

Chemometrics news and issues.

Model Transparency, Validation & Model_Exporter

Feb 20, 2008

With the advent of the US Food and Drug Administration’s (FDA) Process Analytical Technology (PAT) Initiative, the possibilities for putting multivariate models on-line in pharmaceutical applications increased dramatically. In fact, the Guidance for Industry on PAT explicitly lists “multivariate tools for design, data acquisition and analysis” as PAT tools. This opens the door for the use of analytical techniques which rely on multivariate calibration to produce estimates of product quality. An example would be using NIR spectroscopy with PLS regression to obtain the concentration of an active pharmaceutical ingredient (API) in a blending operation.

That said, any multivariate model that is run in a regulated environment is going to have to be validated. I found a good definition of validate on the web: “to give evidence that a solution or process is correct.” So how do you show that a model is correct? It seems to me that the first step is to understand what it is doing. A multivariate model is really nothing more than a numerical recipe for turning a measurement into an answer. What’s the recipe?

Enter Model_Exporter. Model_Exporter is an add-on to our existing multivariate modeling packages, PLS_Toolbox and Solo. It takes models generated by those packages and turns them into a numerical recipe, in an XML format, that can be implemented in almost any modern computer language. It also generates m-scripts that can be run in MATLAB or Octave, and Tcl scripts for use with Symbion.

But the main point here is that Model_Exporter makes models transparent. All of the mathematical steps (and the coefficients used in them), including preprocessing, are right there for review. Is the model physically and chemically sensible? Look and see.
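To make the “recipe” idea concrete, here is a minimal sketch, in m-code, of what a simple mean-centered PLS model reduces to at prediction time. The variable names, the numbers and the single preprocessing step are my own illustration, not the actual output of Model_Exporter:

% A hypothetical PLS model written out as a numerical recipe. The variable
% names and the single preprocessing step (mean centering) are illustrative,
% not the format Model_Exporter actually writes.
nvars = 100;                % number of spectral channels
xmean = rand(1, nvars);     % stored with the model: mean calibration spectrum
b     = randn(nvars, 1);    % stored with the model: PLS regression vector
ymean = 42.0;               % stored with the model: mean of the calibration y

x    = rand(1, nvars);      % a new measured spectrum (1 x nvars)
xp   = x - xmean;           % step 1: preprocessing (mean centering)
yhat = xp*b + ymean;        % step 2: apply regression vector, add offset

Every number the prediction depends on is sitting in plain sight, which is the point.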

The next step in validation is to show that the model behaves as expected. This would include showing that, once implemented, the model produces the same results on the training data as the software that produced the model. One should also show that the model produces the same (acceptable) results on additional test sets that were not used in the model development.
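In practice that first comparison can be as simple as running both implementations over the training spectra and checking that the predictions agree to within a tight tolerance. A minimal m-code sketch, with stand-in prediction vectors and an arbitrary tolerance of my own choosing:

% Hypothetical check that an exported model reproduces the predictions of
% the software that built it. The two vectors here are stand-ins; in practice
% they would come from the original package and from the exported recipe,
% applied to the same training spectra.
yhat_orig     = [1.02; 2.98; 5.01];   % predictions from the modeling software
yhat_exported = [1.02; 2.98; 5.01];   % predictions from the exported recipe
tol = 1e-8;                           % agreement tolerance (arbitrary)
maxdiff = max(abs(yhat_orig - yhat_exported));
if maxdiff < tol
  disp('Exported model reproduces the original predictions.')
else
  fprintf('Implementations disagree: max abs difference = %g\n', maxdiff)
end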

What about the software that produced the model to begin with? Should it be validated? Going back to the definition of validate, that would require showing that the modeling software produced answers that are correct. OK, well, for PLS regression, “correct” would have to mean that it calculates factors that maximize the covariance between the scores in X and the scores in Y. That’s great, but what does it have to do with whether the model actually performs as expected or not? Really, not much. Does that mean it’s not important? No, but assuring software accuracy won’t assure model performance.
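For a single y-variable that criterion is easy to state: the first PLS weight vector is proportional to X'y, and no other unit-length direction gives X-scores with a larger covariance with y. A small m-code sketch with random data, purely as an illustration of the criterion (it is not PLS_Toolbox code):

% Illustration of the "correctness" criterion for PLS in the single-y case:
% the first weight vector w = X'*y/norm(X'*y) gives X-scores t = X*w with
% the largest covariance with y over all unit-length directions.
% Random, mean-centered data for demonstration only.
n = 50;  p = 20;
X = randn(n, p);  X = X - ones(n,1)*mean(X);
y = randn(n, 1);  y = y - mean(y);

w_pls   = X'*y / norm(X'*y);        % first PLS weight vector
cov_pls = (X*w_pls)'*y / (n-1);     % covariance of its scores with y

cov_rand = zeros(100, 1);           % covariances for random unit directions
for k = 1:100
  w = randn(p, 1);  w = w / norm(w);
  cov_rand(k) = (X*w)'*y / (n-1);
end
fprintf('PLS direction: %.3f   best of 100 random directions: %.3f\n', ...
        cov_pls, max(cov_rand))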

Upon reading a draft of this post, Rasmus wrote:

Currently software validation focuses on whether the algorithm achieves what’s claimed, e.g. that a correlation is correctly calculated. This is naturally important and also a focus point for software developers anyhow. However, this sort of validation is not terribly important for validating that a specific process model is doing what it is supposed to. This is similar to thoroughly checking the production facility for guitars in order to check that Elvis is making good music. There are so many decisions and steps involved in producing a good prediction model, and the quality of any correlation estimates in the numerical algorithms is of quite insignificant importance compared to all the other aspects. Even with a ‘lousy’ PLS algorithm excellent models could be made if there is a good understanding of the problem.

So when you start thinking about preprocessing options and how many ways there are to get to models with different recipes but similar performance, and also how it’s possible, by making bad modeling choices, to get to a bad model with software that’s totally accurate, it’s clear that models should be validated, not the software that produces them. And that’s why Model_Exporter is so useful: it makes models transparent, which simplifies model validation.

Some Thoughts on Freeware in Chemometrics

Feb 6, 2008

Once again there is a discussion on the chemometrics listserv (ICS-L) concerning freeware in chemometrics. There have been some good comments, and it’s certainly nice to see some activity on the list! I’ll add my thoughts here.

On Feb 5, 2008, at 3:39 PM, David Lee Duewer wrote:
I and the others in Kowalski’s Koven built ARTHUR as part of our PhuDs; it was distributed for a time as freeware. It eventually became semi-commercial as a community of users developed who wanted/needed help and advice. Likewise, Barry’s first PLS_Toolbox was his thesis and was (maybe still is?) freeware.

No, it’s not freeware, but it is still open source. One of my pet peeves is that “freeware” and “open source” are often used synonymously, but they aren’t the same thing.

PLS_Toolbox is open source, so you can see exactly what it’s doing (no secret, hidden meta-parameters), and you can modify it for your own uses. (Please don’t ask us to help you debug your modified code, though!) You can also compile PLS_Toolbox into other applications IFF (if and only if) you have a license from us for doing so. And of course PLS_Toolbox is supported, regularly updated, etc., etc. If something doesn’t work as advertised, you can complain to us and we’ll fix it, pronto.

I think we occupy a sweet spot between the free but unsupported (must rely on the good will of others) model and the commercial but closed-source (not always sure what it’s doing and can’t modify it) model.

OK, end of commercial!

But the problem with freeware projects is that there have to be enough people involved, in a quite coordinated way, to reach the critical mass required to make a product that is very sophisticated. Yes, it’s possible for a single person or a few people to make a bunch of useful routines (e.g. PLS_Toolbox 1.4, ca. 1994). But a fully GUI-fied tool that does more than a couple of things is another story. PLS_Toolbox takes several man-years per year to keep it supported, maintained and moving forward. And if it wasn’t based on MATLAB, it would take considerably more.

On Feb 5, 2008, at 4:17 PM, Scott Ramos wrote:
… the vast majority of folk doing chemometrics fall into Dave’s category of tool-users. This is the audience that the commercial developers address. Participants in this discussion list fall mostly into the tool-builder category. Thus, the discussion around free or shareware packages and tools is focused more on this niche of chemometricians.

And that’s the problem. Like it or not, chemometrics is a bit of a niche market. So getting enough people together to make freeware that is commercial-worthy, that tool-users are willing to rely on, is going to be even tougher than in other, broader markets. The most successful open-source/freeware projects that I’m aware of are tools for software developers themselves: tools by software geeks for software geeks. Version control tools are a great example of this (like the copy of svnX that I use, and, hey, WordPress, which I’m using to write this blog).

MATLAB is interesting in that it occupies a middle ground: it is both a development environment and an end-user tool. You can pretty much say the same for PLS_Toolbox.

On Feb 5, 2008, at 2:32 PM, Rick Dempster wrote:
I was taught not to reinvent the wheel many years ago and that point seems to have stuck with me.

That’s good practice. But it seems to me that a substantial fraction of the freeware effort out there really is just reinventing things that exist elsewhere. The most obvious example is Octave, which is a MATLAB clone. I notice that most of the freeware proponents out there have .edu and .org email addresses, and likely don’t have the same perspective as most of us .com folks do on what it’s worth doing ourselves versus paying for. And they might get credit in the academic world for recreating a commercial product as freeware:

On Feb 5, 2008, at 2:50 PM, Thaden, John J wrote:
…but I can’t help dreaming of creating solutions to my problems that I can also share with communities facing similar problems — part of this is more than a dream, it’s the publish-or-perish dictum of academia…

Isn’t that what Octave is really all about? At this point it is just starting to get to the functionality of the MATLAB 5.x series (from ~10 years ago?). This is pretty obvious if you read Bjørn K. Alsberg and Ole Jacob Hagen, “How octave can replace MATLAB in chemometrics,” ChemoLab, Volume 84, pp. 195-200, 2006. I’d like Octave to succeed; heck, we could probably charge more for PLS_Toolbox if people didn’t have to pay for MATLAB too. But at this point using Octave would be like writing with charcoal from my fireplace because I didn’t want to pay for pencils. The decrease in productivity wouldn’t make up for the cost savings on software. I don’t know about some of the other freeware/open-source packages discussed, such as R, but one should think hard about the cost/productivity trade-offs before launching into a project with them.

Thanks for stopping by!

BMW

A History of PLS_Toolbox

Jan 24, 2008

I started graduate school at the University of Washington Department of Chemical Engineering in the Fall of 1985. Sometime around Fall of 1986 somebody showed me MATLAB. Wow. That was the last day I ever wrote anything in Basic or Fortran–it was MATLAB from there on out. In 1987 I finished my MS in ChemE and started on a new project which became my dissertation, “Adapting Multivariate Analysis for Modeling and Monitoring Dynamic Systems.” In order to do this research I needed to develop multivariate analysis routines and process simulations, so MATLAB was the logical tool of choice.

At some point late in 1989 I realized that I had created a significant number of routines that might be of use to other researchers. I collected these functions, wrote sensible help files for them, and wrote a brief manual. I’d been working a lot with Partial Least Squares (PLS) regression, and the bulk of the functions I’d created related to that, so I decided (for better or worse) to call it PLS_Toolbox. Why the underscore? Honestly, I don’t remember. It may have had to do with the inability of some operating systems to deal with path names that included whitespace. And I didn’t like running it together, PLSToolbox, because that reads like PL Stoolbox, and I didn’t like the connotation.

So in the fall of 1989 I printed up some manuals for PLS_Toolbox 1.0 and started distributing it around the Chemical Engineering Department and the Center for Process Analytical Chemistry. The rest, as they say, is history. After graduating from UW in 1991, I continued to update PLS_Toolbox and distribute it under the company Eigenvector Technologies. Battelle Pacific Northwest National Laboratory, my employer, had no interest in it. So I worked on it evenings and weekends and continued to release updates.

I founded Eigenvector Research, Inc. with Neal Gallagher on January 1, 1995, though PLS_Toolbox still came out under Eigenvector Technologies until version 2.0. A complete list of releases is given below.

PLS_Toolbox 1.0 late 1989 or early 1990
PLS_Toolbox 1.1 1990
PLS_Toolbox 1.2 1991
PLS_Toolbox 1.3 1993
PLS_Toolbox 1.4 1994 (July)
PLS_Toolbox 1.5 1995 (July; added author Neal B. Gallagher)
PLS_Toolbox 2.0 1998 (April; first version under Eigenvector Research)
PLS_Toolbox 2.1 2000 (November)
PLS_Toolbox 3.0 2002 (December; added authors Rasmus Bro and Jeremy M. Shaver)
PLS_Toolbox 3.5 2004 (August; added authors Willem Windig and R. Scott Koch)
PLS_Toolbox 4.0 2006 (May)
PLS_Toolbox 4.1 2007 (June)
PLS_Toolbox 4.2 2008 (January)

The release of PLS_Toolbox 4.2 this month brings the total number of versions to 13. We’ve been pretty stingy with our version numbers, changing them in increments of only 0.1 even when we added significant functionality. In other software companies PLS_Toolbox 4.2 would probably be known as version 9.1 or something like that.

Hope you enjoyed the history lesson, and thanks for checking in!

BMW

A Recipe for Failure

Jun 6, 2007

Recently, I’ve seen numerous examples of what we call the “throw the data over the wall” model of data analysis, mostly in the form of “challenge problems.” Typically, participants are given data sets, without much background information, and asked to analyze them or develop models to make predictions on new data sets. Tests of this sort are of questionable value, as this mode of data analysis is really a recipe for failure. In some instances, I’ve wondered if the tests weren’t set up intentionally to produce failure. So it got me to thinking about how to set up a test to best meet this sinister objective.

If you want to assure failure in data analysis, just follow this recipe:

1) Don’t tell the analyst anything about the chemistry or physics of the system that generated the data. Just present it as a table of numbers. This way the analyst won’t be able to use any physical reasoning to help make choices in data pre-treatment or interpretation of results.

2) Be sure not to include any information about the noise levels in the measurements or the reference values. The analyst will have a much harder time determining when the models are over-fit or under-fit. And for sure don’t tell the analyst if the noise has an unusual distribution.

3) Leave the data in a form that makes it non-linear. Alternatively, preprocess it in a way that “adds to the rank.”

4) Don’t tell the analyst anything about how the data was obtained, e.g. whether it is “happenstance” or designed data. If it’s designed, make sure that the data is taken in such a way that system drift and/or other environmental effects are confounded with the factors of interest.

5) Make sure the data sets are “short and fat,” i.e. with many times more variables than samples (see the sketch after this list). This will make it much harder for the analyst to recognize issues related to 2), 3) and 4). And it will make it especially fun for the analyst if the problem has anything to do with variable selection.

6) Compare the results obtained to those from an expert who knows everything about the pedigree of the data. Additionally, it’s useful if the expert has a lot of preformed “opinions” about what the results of the analysis should be.
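Item 5 is worth a small demonstration. A minimal m-code sketch, using random numbers and plain least squares rather than any particular chemometric method, shows how a “short and fat” data set can be fit essentially perfectly while predicting nothing:

% Illustration of the "short and fat" pitfall: with many more variables than
% samples, even pure noise can be fit essentially perfectly, so a good
% calibration fit by itself says nothing about predictive ability.
n = 20;  p = 500;                       % 20 samples, 500 variables
X  = randn(n, p);   y  = randn(n, 1);   % pure-noise "calibration" data
Xt = randn(n, p);   yt = randn(n, 1);   % pure-noise test data

b = pinv(X)*y;                          % minimum-norm least-squares "model"

rmsec = sqrt(mean((y  - X*b).^2));      % error of calibration (essentially zero)
rmsep = sqrt(mean((yt - Xt*b).^2));     % error of prediction (about as large as y itself)
fprintf('RMSEC = %.2e   RMSEP = %.2f\n', rmsec, rmsep)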

If you stick to this recipe, you can certainly improve the odds of making the analyst, and/or the software they are using, look bad. It won’t guarantee it, though, as some analysts are very knowledgeable about the limitations of their methods and are very good with their tools. Obviously, if your goal is to make software and/or chemometric methods look unsuccessful, choose a less knowledgeable analyst in addition to following 1) – 6) above.

Of course, if you actually want to succeed at data analysis, try not to follow the recipe. At EVRI we’re firm believers in understanding where the data comes from, and in learning anything about it that will help us make intelligent modeling choices. Software can do a lot, and we’re making it better all the time, but you just can’t replace a well-educated and informed analyst.

BMW