Category Archives: Chemometrics

Chemometrics news and issues.

Use of Unlabeled Data in Regression Modeling

Jun 7, 2009

In 1995 Edward V. Thomas published “Incorporating Auxiliary Predictor Variation in Principal Components Regression” in J. Chemometrics (Vol. 9, No. 6, pp. 471-481). Thomas demonstrated how additional samples, without matching reference values, can be used when building a PCR model. These samples, commonly called “unlabeled data,” help stabilize the estimates of the principal components, so their use often results in slightly better models than using the labeled data alone.
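The basic idea fits in a few lines of MATLAB. Below is a minimal sketch of the concept only (my own illustration, not Thomas’s exact implementation and not PLS_Toolbox code): the principal components are estimated from the pooled labeled and unlabeled spectra, while the regression onto the scores uses only the samples that have reference values.

    % Sketch: PCR using unlabeled data to stabilize the principal components.
    % Xlab (n x m) has reference values y (n x 1); Xun (nu x m) does not.
    function [b, mx, my, P] = pcr_unlabeled(Xlab, y, Xun, k)
      Xall = [Xlab; Xun];                         % pool labeled and unlabeled data
      mx = mean(Xall);                            % mean from the pooled data
      Xc = Xall - ones(size(Xall,1),1)*mx;        % mean-center
      [U,S,V] = svd(Xc, 'econ');
      P = V(:,1:k);                               % PCs benefit from the extra samples
      Tlab = (Xlab - ones(size(Xlab,1),1)*mx)*P;  % scores of labeled samples only
      my = mean(y);
      q = Tlab \ (y - my);                        % regress y on labeled scores
      b = P*q;                                    % regression vector, spectral space
    end

A new spectrum x is then predicted as yhat = (x - mx)*b + my. The failure mode described below also falls straight out of this sketch: if the unlabeled block contains variation from new analytes, P rotates toward directions the labeled y values can say nothing about.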

While at EPFL last fall, I worked with Paman Gujral, Michael Amrhein and Dominique Bonvin on methods for updating regression models. One common problem with updating models is a lack of reference values, i.e. all the new data is unlabeled. Thus, it seemed natural to see how “Edward’s PCR” worked in this situation.

The result is the study “On the bias-variance trade-off in principal component regression with unlabeled data,” which will be presented this week as a poster at SSC-11. The study shows that “Edward’s PCR” works great if the new data is still in the same subspace as the old data. This might occur, for instance, in spectroscopic applications where the same set of analytes still exists but their range has been expanded. But when new analytes are thrown into the mix, thus expanding the subspace, this method leads to even larger prediction biases than not updating models at all. This is because new unlabeled samples rotate the PCs, and ultimately the regression vector, more towards the new analytes while not having any reference values to tell the model to ignore this subspace.

Hope you enjoy the poster, and I hope everybody has a good week at SSC-11!

BMW

EAS Chemometrics Award Session for Romà Tauler

Jun 4, 2009

As announced in a previous post, Romà Tauler was selected as this year’s recipient of the EAS Award for Achievements in Chemometrics. Romà asked me to organize the award session, and I have happily obliged. The session will be on Tuesday afternoon, November 17. The theme of the session is “Uncertainties, Ambiguities, and Chemometrics.” Talks will include:

• Peter Wentzell, Dalhousie University, “Exploratory Data Analysis with Noisy Data”
• Anna de Juan, Universitat de Barcelona, “Using Noise Structure Knowledge in MCR Process Analysis”
• Willem Windig, Eigenvector Research, Inc., “How Being Negative Can Be Good”
• Age Smilde, University of Amsterdam, “Modeling Dynamic Metabolomics Data Using Prior Knowledge”
• Romà Tauler, Spanish Council of Scientific Research, “Ambiguities and Error Propagation Effects on Multivariate Curve Resolution Solutions”

We’re looking forward to a great session. See you in the fall!

BMW

EigenU Redux: Chemometrics in Wenatchee, July 13-16

Jun 4, 2009

We had a number of people who just couldn’t make it to EigenU last month and wanted a chemometrics course this summer. So we’re planning a course here in Wenatchee, WA, July 13-16, 2009. We’re doing our “Basic Chemometrics” course on Monday-Wednesday, including:

Linear Algebra for Chemometricians
MATLAB for Chemometricians
Chemometrics I: Principal Components Analysis
Chemometrics II: Partial Least Squares and Regression

On the optional fourth day, Thursday, we’ll go over some special topics, including:

Instrument Standardization and Calibration Transfer
Variable Selection
Advanced Preprocessing

The cost of the course will be $1475/$650 (industrial/academic) for the first three days, and $475/$225 for the optional fourth day.

Please come with a laptop with either MATLAB + PLS_Toolbox or Solo installed. (The free demo versions are just fine for this.) Let me know if this is a problem and I’ll try to help you out.

For further information and to register, just contact me.

BMW

Chemometrics in Cultural Heritage

May 29, 2009

Last fall I had the pleasure of teaching a chemometrics course in Rome with Rasmus Bro. The choice of location was a result of Rasmus simply wanting to go to Rome, and of my email acquaintance with Prof. Giovanni Visco of the University of Rome (La Sapienza). In 2008 Giovanni was organizing the second CMA4CH meeting, which is a rather un-obvious acronym for “Application of Multivariate Analysis and Chemometry to Cultural Heritage and Environment.” We gave Giovanni a copy of Solo for the Best Presentation Prize at CMA4CH 2008, and a friendship was born. So when we decided to do a course in Rome (Rasmus, along with my wife and daughters, had easily convinced me), I contacted Giovanni and he figured it all out for us. Between Giovanni and his colleague Federico Marini, we were very well taken care of during our stay in Rome!

Giovanni is now in the process of organizing CMA4CH 2010, which will be held on the island of Sicily on September 26-29. He was kind enough to ask me to be the Co-chair for Chemometrics, and I gladly agreed. While somewhat specific, this meeting considers in depth a rather important intersection between a scientific method and an application.

Of course, Italy is THE place for a meeting focused on cultural heritage; they have more of it than just about anybody. And there are so many potential applications of chemometric methods in this arena (identification of artifacts, provenance, fraud detection, effects of climate and pollution, restoration, etc.) that there should be plenty to discuss! We’re looking forward to it.

BMW

Chemometrics and Fortune 500 Companies

May 28, 2009

The other day I was updating my bio for a conference and was working on some sentences regarding our experience teaching chemometrics. It included a reference to teaching employees of Fortune 500 companies. So I decided to try to figure out how many of these companies had sent employees to our courses.

Scanning the list (and just going from memory), I came up with 50+ companies we have either taught in-house courses for or who have sent people to our open courses, including: 3M Company, Abbott Laboratories, Advanced Micro Devices, Agilent Technologies, Air Products and Chemicals, Alcoa, Amgen, Applied Materials, AT&T, Avery Dennison, Becton, Dickinson & Co., Boston Scientific, Boeing, Bristol-Myers Squibb, Chevron, Colgate-Palmolive, ConocoPhillips, Corning, Delphi, Dow Chemical, E. I. du Pont de Nemours & Co., Eastman Chemical, Eastman Kodak, Eli Lilly & Co., ExxonMobil, Ford, General Electric, General Motors, Goodrich, Goodyear Tire & Rubber, Hershey, Hewlett-Packard, Honeywell, Huntsman, Intel, IBM, International Paper, Johnson & Johnson, Kraft, Lockheed Martin, Lucent, Merck & Co., Micron, Owens Corning, Pfizer, Praxair, Procter & Gamble, Rohm & Haas, Schering-Plough, Sunoco, Texas Instruments, Weyerhaeuser and Wyeth.

First off, I’m pleased that personnel at all of these companies have thought enough of us to come to our courses (such as EigenU). But beyond that, it shows that chemometrics is an important and widely applicable discipline. In these companies chemometric methods play a critical role in many aspects of their product life cycles, from basic research, through product development and scale-up, to manufacturing. Multivariate methods improve efficiency and, therefore, are part of these companies’ competitive advantage.

BMW

Another EigenU Complete!

May 26, 2009

The fourth edition of Eigenvector University came to a close last Friday, May 22. The week-long EigenU 2009 had 25 participants from a wide variety of industries and universities. This was somewhat smaller than 2008, but still a good showing given the current state of the economy. Of course, we think it’s the really smart companies that use the slow times to improve the skill sets of their employees!

As usual, the kind folks at the WAC took good care of us. Many thanks to Rick, Amanda, Wayne, Joe, Joshua, Timothy, Bernie, Randall, Eddie, Quentin and all the rest that kept us hydrated and well-fed.

Speaking of well-fed, we all enjoyed Thursday evening’s workshop dinner at Torchy’s. Here everyone contemplates the dinner choices while discussing the day’s courses, which included Chuck and Bob’s “Implementing Chemometrics in PAT,” Rasmus’ “Variable Selection,” and Willem’s “Chemometrics in Mass Spectrometry.”

EigenU 2009 Dinner at Torchy’s

Thanks to all the course participants for braving the swine flu panic to join us! Also, to the 8 “Eigenvectorian” instructors (Scott, Jeremy, Willem, Neal, Chuck, Bob, Rasmus and myself) for developing and leading the courses.

EigenU 2010 is tentatively scheduled for May 16-21, 2010. In the meantime, catch one of our courses at SIMS XVII, FACSS or EAS, or contact me to schedule in-house training!

BMW

Congratulations Romà!

Apr 17, 2009

This year’s Eastern Analytical Symposium Award for Achievements in Chemometrics goes to Romà Tauler. Romà is a Research Professor with CSIC, the Spanish Council of Scientific Research, at the Institute of Chemical and Environmental Research in Barcelona, Spain.

Romà continues to be a pioneer in Multivariate Curve Resolution, the collection of techniques used for decomposing spectral data into its physically meaningful underlying components. Romà has published an astounding number of papers concerning both the theoretical and practical aspects of MCR, in addition to many other papers in the general field of chemometrics.

Romà is also Editor in Chief of Chemometrics and Intelligent Laboratory Systems.

Professor Tauler joins the previous EAS Chemometrics Award winners (a distinguished group if I may say so myself) listed below:

    1996-Steven D. Brown
    1997-Tormod Næs
    1998-Edmund R. Malinowski
    1999-Harald Martens
    2000-Svante Wold
    2001-Barry M. Wise
    2002-Paul Geladi
    2003-Paul Gemperline
    2004-Rasmus Bro
    2005-David Haaland
    2006-Age Smilde
    2007-Philip Hopke
    2008-John F. MacGregor

A special session honoring Romà’s achievements will be presented at this year’s EAS in November.

Eigenvector has been the sponsor of the Chemometrics Award since 2002, and we’re pleased to do it again this year. Congratulations Romà!

BMW

EigenU Registration Open

Apr 9, 2009

The fourth edition of Eigenvector University will be held in Seattle May 17-22. We’re excited about this edition in part because we have four new courses: Robust Methods, Correlation Spectroscopy, Common Mistakes in Chemometrics (and how not to make them), and Implementing Chemometrics in PAT.

But we’re also excited because EigenU is about the only time during the year where we get the whole Eigenvector staff together. This is a benefit for us–we like to see each other and exchange ideas on consulting projects, talk about software development, etc. But it’s also a benefit for the attendees–a chance to talk to all the Eigenvectorians and find whichever one of us has the most experience on your problem.

While we’ve been somewhat worried that the current economy would affect our attendance, I’m pleased to report that registrations are coming in and we’re up to 17 participants as of April 9. Apparently, there are some companies out there that realize that the best time to sharpen the saw is before all the orders for logs come in.

Early registration for EigenU ends on April 17. After that, prices go up, so get your training plan in order now!

BMW

EigenGuys at FACSS in Reno

Oct 20, 2008

This was the first year in a long time that I didn’t make it to FACSS, but that doesn’t mean that Eigenvector wasn’t there. The EigenGuys attending included Neal Gallagher, Jeremy Shaver, Chuck Miller and Scott Koch.

As usual, EVRI taught some courses: Neal took the lead on our popular Chemometrics without Equations, and introduced a new course, Advanced Chemometrics without Equations. As its name implies, ACWE explains concepts such as advanced preprocessing and variable selection in words and pictures rather than equations.

The EigenGuys also gave a number of talks. Jeremy presented “Making Do: Weighted regression models for use with less-than-perfect data.” This work describes a strategy for developing models based on historical data when the most interesting or critical data is underrepresented.

Chuck presented our still-not-quite-complete study of preprocessing and calibration transfer methods, “Combining Calibration Transfer and Preprocessing: What Methods, What Order?” The good news is, as far as our examples go, it doesn’t matter whether you preprocess and then do calibration transfer or the other way around. (If you have data where you think it makes a real difference, please drop us a line.) Chuck’s other offering, “Analytical Chemistry and Multi-Block Modeling for Improved NIR Spectral Interpretation,” demonstrated how PLS2 can be used to analyze data from multiple analytical instruments in order to improve understanding. This deeper knowledge can be used in turn to improve model performance.

Scott headed up the trade show aspect of the conference, manning our booth. Scott’s main task was doing demos of our new PLS_Toolbox 5.0, which was just released last week. Look for Solo 5.0 shortly!

BMW

Properties of PLS

Sep 28, 2008

As part of my sabbatical here at the Automatic Control Laboratory at EPFL, I was asked to give a seminar. I wanted to talk about some of the work I’d done lately concerning properties of PLS, and differences between PLS algorithms, pretty much the same material I’d presented as a poster at CAC-2008 in Montpellier.

I was asked to make the presentation a little more tutorial in nature, so I included more background on multivariate calibration. The result is “Properties of Partial Least Squares Regression and Differences between Algorithms,” which I presented Friday, September 26, 2008. Enjoy!

BMW

Chemometrics Short Course in Rome, October 27-29, 2008

Jul 16, 2008

Some time ago I asked Rasmus Bro if he would be interested in teaching a short course with me in Europe this fall. He said, “Yes, and I really want to go to Rome!” Fortunately, I’d been in contact lately with Dr. Giovanni Visco of Rome University Chemistry Department regarding the CMA4CH meeting.

Dr. Visco has been kind enough to put us in touch with CASPUR, the nearby “Interuniversity Consortium for Supercomputing and Research,” which has good facilities for teaching a computer based course. We’re currently planning on teaching an introductory 3-day course October 27-29, 2008. The course will include:

Obviously, we’ll have to do a little editing of these courses to fit this 4.5 days worth of material into 3 days! If you have questions, please drop me a line (bmw@eigenvector.com).

See you in Rome this fall!

BMW

IUPAC Glossary of Chemometric Terms and Concepts

Jul 16, 2008

Nomenclature has been a subject of some discussion within the chemometrics community, such as on the list ICS-L. I recall exchanges dealing with the definition of various terms such as “factor,” “latent variable,” and “principal component.” It’s clear that we don’t all use these terms in exactly the same way. For the most part, this doesn’t bother me. Authors should be free to use terms as they wish provided that they define them unambiguously in their text.

However, it would be useful for the community to have a set of generally agreed upon definitions for commonly used terms and concepts. Enter IUPAC, the International Union of Pure and Applied Chemistry. I remember first hearing of IUPAC when I was an undergrad learning organic chemistry. Learning the IUPAC names for compounds was always straightforward as they were very systematic. This was in contrast to learning common names, which, it seemed at times, were pretty much random.

Professor D. Brynn Hibbert, of the University of New South Wales, has received funding for a small IUPAC project to develop a glossary of concepts and terms in chemometrics. He presented a brief introduction to this project at CAC-2008. His collaborators on this include Professor Pentti Minkkinen, Lappeenranta University of Technology, Dr. Klaas Faber, Chemometry Consultancy, and myself.

The initial project goal is to establish the scope of the problem, and to develop a draft glossary and a consultation process. To do this we plan to set up a “wiki” where members of the community could edit terms or add new ones. We’ve had several offers of existing glossaries which could be used to populate the wiki initially. We’ll do that and then let everybody have at it. The wiki software will keep track of all the edits submitted, so we’ll know what terms are particularly contentious. Once it has settled down, the project team will create a consensus list for eventual presentation to IUPAC.

An IUPAC glossary would make it easier for authors as they could simply state that they will adhere to IUPAC definitions, and thus not have to define terms further. But perhaps more importantly, it would make things easier for students of chemometrics, who could learn a common set of terms and then only have to worry about the exceptions as they come up. Ultimately, it should be good for the field of chemometrics.

It’s Eigenvector’s job to get the wiki set up. I’ll let you know when it becomes available.

BMW

CAC-2008 Poster Prize Winners

Jul 4, 2008

Eigenvector was pleased to sponsor the “Best Poster” prize at CAC-2008. The top three poster presenters all received a certificate good for a copy of PLS_Toolbox or Solo (well, OK, it wasn’t exactly a certificate, it was one of my business cards with “Good for one PLS_Toolbox” written on the back!). The top poster also got $500USD, which equates to 320€.

There were 160 posters presented at CAC, so this was quite a contest! The winners, selected by the CAC scientific committee, represent some exceptional efforts selected from a very large body of good work.

The third place poster was “Drift compensation of gas sensor array data by Orthogonal Signal Correction” by M. Padilla, A. Perera, I. Montoliu, A. Chaudry, K. Persaud and S. Marco. This is a nice application of OSC. We’ve used it for spectroscopic instrument standardization and found it to work well in that application. It makes sense that it would work well for electronic noses as well.

Second place went to Pat Wiegand, Randy Pell and Enric Comas, all of Dow, for “Simultaneous Variable and Sample Selection for PLS Calibrations Using a Robust Genetic Algorithm.” This work addressed the problem of having both samples and variables that are irrelevant for building a predictive model for a given property. Most previous work addresses either the variable selection or the sample selection problem, but not both. The robustness of their algorithm comes, in part, from a robust PLS algorithm from the LIBRA Toolbox, developed by Sabine Verboven and Mia Hubert. This toolbox is what provides the robust options for PCA and PLS in PLS_Toolbox, so of course we think that was a very good choice!

Emma Peré-Trepat accepted the first place prize on behalf of herself and co-workers I. Montoliu, F.P. Martin, S. Rezzi and S. Kochhar, all of Nestlé Research Center. They presented “Data fusion strategies for nutrimetabonomics.” Nutrimetabonomics, the application of metabonomics to nutritional sciences, is the study of metabolic responses to the consumption of specific foods and ingredients. Their approach used hierarchical modeling to fuse NMR and meta-data.

Congratulations again to the winners!

BMW

More from CAC-2008

Jul 4, 2008

It’s been a long week, absolutely packed. I haven’t gotten to every session, but I thought I’d include a few notes about several more talks I really enjoyed.

Selena Richards presented “Self-Modeling Curve Resolution: a new approach to recovering temporal metabolite signal modulation in NMR spectroscopic data: Application to a life-long caloric restriction in dogs.” It’s been known for some time that restricting caloric intake lengthens the life span of most mammals. This talk was concerned with finding the metabolomic signature of this effect. Besides the novel use of MCR, I enjoyed the talk because the subjects were Labrador Retrievers. We’ve been trying to keep our yellow lab, Jenny, thin, in part because she has some joint problems that would be exacerbated if she were overweight. But man, labs will eat anything, so keeping them out of the food can be a challenge! I’m not sure how calorie restriction works in humans, but I’m sure life seems longer!

Steven Short talked on “Determination of Figures of Merit for Near-Infrared and Raman Spectrophotometers by Net Analyte Signal Analysis for a Four Compound Solid Dosage System.” This work discussed how NAS can be used to compare analytical instruments. I took a look at NAS some years ago after Avi Lorber published “Net analyte signal calculation in multivariate calibration.” My main disappointment with NAS, when calculated based on a regression model, is that it’s a function of the number of factors in the model, and it isn’t particularly useful for picking the number of factors. Short gave a nice application where NAS can be truly useful.
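For reference, the regression-vector-based calculation I’m grumbling about takes only a couple of lines. This is a sketch of my reading of Lorber’s formulation, not code from any particular package; note that everything depends on the regression vector b, and thus on the number of factors used to compute it.

    % Net analyte signal from a regression vector b (m x 1), per my reading
    % of Lorber. x is a (1 x m) measured spectrum.
    nas  = @(x, b) (x*b)/norm(b);   % scalar NAS: the part of x useful for prediction
    sens = @(b) 1/norm(b);          % sensitivity figure of merit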

“Resolution of hyperspectral images. Pre-, in- and post-processing” was presented by Anna de Juan. The talk was something of an overview of past work, but it summarized very well many of the possibilities of using MCR on images. Much of this talk is included in her article (with Maeder, Hancewicz and Tauler) “Use of local rank-based spatial information for resolution of spectroscopic images,” J. Chemometrics, Vol. 22, pp. 291-298, 2008. I think the work is a good guide for users of PLS_Toolbox/MIA_Toolbox in that it shows a lot of what you can do with the tools.

All in all it was a very good conference. The only down side was that it was sometimes a victim of its own success–there were simply too many talks, posters and people I wanted to talk with to get to them all!

BMW

Update from CAC-2008

Jul 3, 2008

Greetings from Montpellier, where Jeremy and I are attending CAC-2008. We’re now into our third day of the conference, and it has gotten off to a good start. I thought I’d just take a minute and highlight several talks that I really enjoyed.

Brynn Hibbert presented “Analysis of variance of complex data sets using GEMANOVA: An example using kill kinetics data.” GEMANOVA is essentially a variant of PARAFAC, used like ANOVA to determine what effects are significant, but in multi-way data. The talk made me want to make sure that we can get PARAFAC working in this way for our users. The trick is in setting the constraint options, and in automating the building of sequences of models with different constraints. In any case, this talk demonstrates that PARAFAC, in the right hands, is a very powerful and versatile technique.

“New proposals for PCA model building with missing data” was delivered by Alberto Ferrer. As usual, Alberto gave a very clear presentation–a nice talk to listen to. Alberto showed how methods for imputing missing data in PCA models, when a model already exists, can also be used to develop new PCA models in the face of missing data. PLS_Toolbox, incidentally, uses one of these methods. It was also shown that the NIPALS method for building models with missing data does not work well in comparison to the other methods.
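The imputation idea is simple enough to sketch. Here is a generic alternating version (an illustration of the general approach, not necessarily the method in PLS_Toolbox nor the new proposals from the talk): fill the holes, fit a PCA, replace the fills with the model’s reconstruction, and repeat until nothing changes.

    % PCA on data with missing values (NaNs) by iterative imputation.
    function [T, P, Xhat] = pcamiss(X, k)
      miss = isnan(X);
      Xhat = X;
      for j = 1:size(X,2)                           % start from column means
        Xhat(miss(:,j), j) = mean(X(~miss(:,j), j));
      end
      for it = 1:500
        mx = mean(Xhat);                            % mean-center current estimate
        Xc = Xhat - ones(size(X,1),1)*mx;
        [U,S,V] = svd(Xc, 'econ');
        T = U(:,1:k)*S(1:k,1:k);                    % scores
        P = V(:,1:k);                               % loadings
        Xrec = T*P' + ones(size(X,1),1)*mx;         % rank-k reconstruction
        change = norm(Xrec(miss) - Xhat(miss));
        Xhat(miss) = Xrec(miss);                    % re-impute from the model
        if change < 1e-9*(1 + norm(Xhat(miss))), break; end
      end
    end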

I also really enjoyed Henri Tapp’s talk, “OPLS: an ideal tool for interpreting PLS regression models?” Henri discussed why, in his opinion, there really isn’t much advantage to OPLS, even in interpretability. (Its creator, Johan Trygg, admits that it does not improve predictive ability over conventional PLS.) Another interesting point in Tapp’s talk was the bibliographic survey of papers citing the original OPLS paper, which showed that OPLS is mostly referenced by Umeå/Umetrics authors and Imperial College. I wonder, how much do you suppose the patent on OPLS has to do with this rather inbred distribution?

My own talk, “Tools for Multivariate Calibration Robustness Testing with Observations on Effects of Data Preprocessing,” was reasonably well-received (at least I wasn’t booed off the stage) and sparked some discussion. I’ve learned over the years that a relatively simple talk with some nice graphics is a good thing to present in the right-after-lunch spot, when conferees are suffering from PLS (post-lunch syndrome). And of course the always energetic and enthusiastic Jeremy did a great job with “Automatic Sample Weighting for Inferential Modeling of Historical In-Control Process Data.”

So far, so good. More later!

BMW

NIPALS versus Lanczos Bidiagonalization

Jun 24, 2008

In 2007, Randy Pell, Scott Ramos and Rolf Manne (PRM) ignited a controversy when they published “The model space in PLS regression.” Their paper pointed out that the X-block residuals in different PLS packages were not the same. Specifically, packages which use the NIPALS or SIMPLS method for PLS (including PLS_Toolbox/Solo, Unscrambler and SIMCA-P) produce different residuals than those that use Lanczos Bidiagonalization (primarily Pirouette). PRM claimed that the residuals in NIPALS were “inconsistent” and made the rather inflammatory statement that NIPALS “amounted to giving up mathematics.”

As you might imagine, this has resulted in a considerable amount of activity in the chemometrics community. And it really has been useful because many of us, including myself, have learned quite a bit about PLS, a subject we thought we already understood pretty well.

There will be a crop of articles on this subject in the upcoming issue of Journal of Chemometrics. This will include a letter to the editor by Svante Wold et al., “The PLS model space revisited,” which takes a theoretical/philosophical look at how PLS via NIPALS is derived and shows that, in this light, it is not inconsistent. Rasmus Bro and Lars Eldén’s contribution, “PLS Works,” shows that while the PLS NIPALS residual space is orthogonal to the model scores, and thus the fitted y-values, this is not true of Bidiag. I understand that there will also be a paper in the upcoming issue from Rolf Ergon, though I don’t know the title yet.
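Bro and Eldén’s orthogonality point is easy to check numerically. Here is a bare-bones NIPALS PLS1 (a textbook sketch, not the PLS_Toolbox implementation): after fitting, T'*E is zero to machine precision, which is exactly the property that fails to hold for the Bidiag residuals.

    % NIPALS PLS1: k latent variables from (centered) X and y.
    function [b, T, P, W, E] = pls1_nipals(X, y, k)
      [n, m] = size(X);  E = X;  f = y;
      T = zeros(n,k); P = zeros(m,k); W = zeros(m,k); q = zeros(k,1);
      for a = 1:k
        w = E'*f;  w = w/norm(w);        % weight vector
        t = E*w;                         % scores
        p = E'*t/(t'*t);                 % X loadings
        q(a) = f'*t/(t'*t);              % y loading
        E = E - t*p';                    % deflate X: E is the X-block residual
        f = f - t*q(a);                  % deflate y
        T(:,a) = t; P(:,a) = p; W(:,a) = w;
      end
      b = W/(P'*W)*q;                    % regression vector
    end

With any data you like, disp(norm(T'*E)) comes back at round-off level.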

The work of Bro and Eldén served as a launching point for an investigation of my own regarding how and why Bidiag residuals are correlated with scores. The result is a poster which I will show at CAC-2008 next week, “Properties of PLS, and Differences between NIPALS and Lanczos Bidiagonalization.” The poster shows why and when NIPALS and Bidiag residuals are different, and shows some examples of when Bidiag residuals are strongly correlated with the scores. This includes the main example given in PRM, where, as it turns out, the main difference in the residuals is due to the 3rd factor in the Bidiag model being quite correlated with the residuals.

If you are attending CAC, please drop by and talk to me during the poster presentation. I’m sure we’ll have a lively discussion!

BMW

References:
R.J. Pell, L.S. Ramos and R. Manne, “The model space in PLS regression,” J. Chemometrics, Vol. 21, pp. 165-172, 2007.
R. Bro and L. Eldén, “PLS Works,” J. Chemometrics, in press, 2008.
S. Wold, M. Høy, H. Martens, J. Trygg, F. Westad, J. MacGregor and B.M. Wise, “The PLS model space revisited,” J. Chemometrics, in press, 2008.
B.M. Wise, “Properties of PLS, and Differences between NIPALS and Lanczos Bidiagonalization,” CAC-2008, Montpellier, France, 2008.

CAC-2008 in Montpellier, France

Jun 23, 2008

The Eleventh Conference on Chemometrics in Analytical Chemistry, CAC-2008, begins next week in Montpellier, France. The conference runs from June 30 through July 4.

All indications are that it will be a great conference. The organizers say that attendance will be close to 350, which must be a record for CAC.

Eigenvector will be there, of course. Our Jeremy Shaver will present “Automatic Sample Weighting for Inferential Modeling of Historical In-Control Process Data,” which is concerned with the problem of developing calibration models from data where the bulk of the samples are tightly clustered, with only a few samples exhibiting significant variation.

I’ll be there as well, presenting “Tools for Multivariate Calibration Robustness Testing with Observations on Effects of Data Preprocessing.” We all want calibration models that are robust, and thus have good longevity. But how do you tell how brittle a model is? This talk demonstrates some tools for assessing model performance in the face of changes in the samples and instruments.

I’m also presenting a poster, “Properties of PLS, and Differences between NIPALS and Lanczos Bidiagonalization.” I’ll write about this a little more in my next post, but suffice it to say that there is a bit of controversy of late about various algorithms for Partial Least Squares Regression and the residuals they generate.

Eigenvector is of course proud to be a sponsor of CAC. We are sponsoring the Best Poster Contest, and will present the winner with $500USD (about 322€ today). I personally really like poster sessions. It’s a great time to really talk with people about their research, and it’s generally much more of an exchange of scientific ideas than a talk, which is primarily one-way communication.

So, if you are going to CAC, look us up. Jeremy and I are always happy to answer questions about our products and services, and are always looking for user input on features for PLS_Toolbox, Solo, etc.

See you at CAC!

BMW

Chemometrics Software Prices

Jun 13, 2008

I was doing a little market research the other day, trying to find out what our competitors charge for their software. As it turns out, it’s somewhat difficult to get prices for several of them.

EVRI publishes its price list, as does Infometrix. So if you want to find out the price of PLS_Toolbox, or the price of Pirouette, it’s just a click away. But if you want to get a price on Unscrambler from CAMO, SIMCA-P+ from MKS/Umetrics, or GRAMS from Thermo Scientific, that’s a little tougher. You have to write for a quote.

So, on Wednesday, June 11, I wrote for quotes on Unscrambler and SIMCA-P+. I’m still waiting to hear back. I’ll let you know if/when I get my quotes.

But it’s my understanding that SIMCA-P+, Unscrambler, and GRAMS are all priced similarly to Pirouette, which is $4500 or more. We think that our PLS_Toolbox and Solo products are a much better value.

If you already have MATLAB (and more than one million people do), PLS_Toolbox is an absolute steal at $995. And if you don’t, Solo, at $1695, offers the point-and-click interfaces of PLS_Toolbox without requiring MATLAB. Both Solo and PLS_Toolbox are easy to use, offer sophisticated data preprocessing techniques, and include many tools not found in other packages, such as PARAFAC and calibration transfer tools.

And with the current exchange rate of $1.00 = 0.65€, our prices are at historic lows for our European customers.

So if you’re in the market for multivariate software solutions, be sure to check out EVRI. It’s where the value is!

BMW

Model Transparency, Validation & Model_Exporter

Feb 20, 2008

With the advent of the US Food and Drug Administration’s (FDA) Process Analytical Technology (PAT) Initiative, the possibilities for putting multivariate models on-line in pharmaceutical applications increased dramatically. In fact, the Guidance for Industry on PAT explicitly lists “multivariate tools for design, data acquisition and analysis” as PAT tools. This opens the door for the use of analytical techniques which rely on multivariate calibration to produce estimates of product quality. An example would be using NIR with PLS regression to obtain the concentration of API in a blending operation.

That said, any multivariate model that is run in a regulated environment is going to have to be validated. I found a good definition of “validate” on the web: to give evidence that a solution or process is correct. So how do you show that a model is correct? It seems to me that the first step is to understand what it is doing. A multivariate model is really nothing more than a numerical recipe for turning a measurement into an answer. What’s the recipe?

Enter Model_Exporter. Model_Exporter is an add-on to our existing multivariate modeling packages PLS_Toolbox and Solo. Model_Exporter takes models generated by PLS_Toolbox and Solo and turns them into a numerical recipe in an XML format that can be implemented in almost any modern computer language. It also generates m-scripts that can be run in MATLAB or Octave, and Tcl for use with Symbion.

But the main point here is that Model_Exporter makes models transparent. All of the mathematical steps (and the coefficients used in them), including preprocessing, are right there for review. Is the model physically and chemically sensible? Look and see.
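To make “recipe” concrete, here is the flavor of what an exported linear model boils down to in m-code. This is a hypothetical example written for illustration (made-up variable names, not actual Model_Exporter output); the real preprocessing steps depend on the model.

    % Apply a fully specified linear "recipe" to a new measurement x (1 x m).
    % All constants (mn, sd, b, b0) would be read from the exported model.
    function yhat = apply_exported_model(x, mn, sd, b, b0)
      x = (x - mn)./sd;     % preprocessing: autoscale with stored mean/std
      yhat = x*b + b0;      % prediction: inner product with regression vector
    end

Every number in mn, sd, b and b0 is sitting there in the exported file, so a reviewer can check whether, for example, the regression vector loads on spectral regions that make chemical sense.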

The next step in validation is to show that the model behaves as expected. This would include showing that, once implemented, the model produces the same results on the training data as the software that produced the model. One should also show that the model produces the same (acceptable) results on additional test sets that were not used in the model development.

What about the software that produced the model to begin with? Should it be validated? Going back to the definition of validate, that would require showing that the modeling software produces answers that are correct. OK, well, for PLS regression, “correct” would have to mean that it calculates factors that maximize the covariance between scores in X and scores in Y. That’s great, but what does it have to do with whether the model actually performs as expected or not? Really, not much. Does that mean it’s not important? No, but assuring software accuracy won’t assure model performance.

Upon reading a draft of this post, Rasmus wrote:

Currently software validation focuses on whether the algorithm achieves what’s claimed, e.g. that a correlation is correctly calculated. This is naturally important and also a focus point for software developers anyhow. However, this sort of validation is not terribly important for validating that a specific process model is doing what it is supposed to. This is similar to thoroughly checking the production facility for guitars in order to check that Elvis is making good music. There are so many decisions and steps involved in producing a good prediction model and the quality of any correlation estimates in the numerical algorithms are of quite insignificant importance compared to all the other aspects. Even with a ‘lousy’ PLS algorithm excellent models could be made if there is a good understanding of the problem.

So when you start thinking about preprocessing options, about how many ways there are to get to models with different recipes but similar performance, and about how it’s possible, through bad modeling choices, to get a bad model with software that’s totally accurate, it’s clear that models should be validated, not the software that produces them. And that’s why Model_Exporter is so useful: it makes models transparent, which simplifies model validation.

BMW

Some Thoughts on Freeware in Chemometrics

Feb 6, 2008

Once again there is a discussion on the chemometrics listserv (ICS-L) concerning freeware in chemometrics. There have been some good comments, and it’s certainly nice to see some activity on the list! I’ll add my thoughts here.

On Feb 5, 2008, at 3:39 PM, David Lee Duewer wrote:
I and the others in Kowalski’s Koven built ARTHUR as part of our PhuDs; it was distributed for a time as freeware. It eventually became semi-commercial as a community of users developed who wanted/needed help and advice. Likewise, Barry’s first PLS_Toolbox was his thesis and was (maybe still is?) freeware.

No, it’s not freeware, but it is still open source. One of my pet peeves is that “freeware” and “open-source” are often used synonymously, but they aren’t the same thing.

PLS_Toolbox is open source, so you can see exactly what it’s doing (no secret, hidden meta-parameters), and you can modify it for your own uses. (Please don’t ask us to help you debug your modified code, though!) You can also compile PLS_Toolbox into other applications IFF (if and only if) you have a license from us for doing so. And of course PLS_Toolbox is supported, regularly updated, etc., etc. If something doesn’t work as advertised, you can complain to us and we’ll fix it, pronto.

I think we occupy a sweet spot between the free but unsupported (must rely on the good will of others) model and the commercial but closed-source (not always sure what it’s doing and can’t modify it) model.

OK, end of commercial!

But the problem with freeware projects is that there have to be enough people involved, in a quite coordinated way, to reach the critical mass required to make a very sophisticated product. Yes, it’s possible for a single person or a few people to make a bunch of useful routines (e.g. PLS_Toolbox 1.4, ca. 1994). But a fully GUI-fied tool that does more than a couple of things is another story. PLS_Toolbox takes several man-years per year to keep it supported, maintained and moving forward. And if it weren’t based on MATLAB, it would take considerably more.

On Feb 5, 2008, at 4:17 PM, Scott Ramos wrote:
… the vast majority of folk doing chemometrics fall into Dave’s category of tool-users. This is the audience that the commercial developers address. Participants in this discussion list fall mostly into the tool-builder category. Thus, the discussion around free or shareware packages and tools is focused more on this niche of chemometricians.

And that’s the problem. Like it or not, chemometrics is a bit of a niche market. So getting enough people together to make freeware that is commercial-worthy, that tool-users are willing to rely on, is going to be even tougher than in other, broader markets. The most successful open-source/freeware projects that I’m aware of are tools for software developers themselves: tools by software geeks for software geeks. Version control tools are a great example (like the copy of svnX that I use, and hey, WordPress, which I’m using to write this blog).

MATLAB is interesting in that it occupies a middle ground: it is both a development environment and an end-user tool. You can pretty much say the same for PLS_Toolbox.

On Feb 5, 2008, at 2:32 PM, Rick Dempster wrote:
I was taught not to reinvent the wheel many years ago and that point seems to have stuck with me.

That’s good practice. But it seems to me that a substantial fraction of the freeware effort out there really is just reinventing things that exist elsewhere. The most obvious example is Octave, which is a MATLAB clone. I notice that most of the freeware proponents out there have .edu and .org email addresses, and likely don’t have the same perspective as most of us .com folks on what it’s worth doing ourselves versus paying for. And they might get credit in the academic world for recreating a commercial product as freeware:

On Feb 5, 2008, at 2:50 PM, Thaden, John J wrote:
…but I can’t help dreaming of creating solutions to my problems that I can also share with communities facing similar problems — part of this is more than a dream, it’s the publish-or-perish dictum of academia…

Isn’t that what Octave is really all about? At this point it is just starting to get to the functionality of the MATLAB 5.x series (from ~10 years ago?). This is pretty obvious if you read Bjørn K. Alsberg and Ole Jacob Hagen, “How Octave can replace MATLAB in chemometrics,” ChemoLab, Vol. 84, pp. 195-200, 2006. I’d like Octave to succeed; heck, we could probably charge more for PLS_Toolbox if people didn’t have to pay for MATLAB too. But at this point using Octave would be like writing with charcoal from my fireplace because I didn’t want to pay for pencils. The decrease in productivity wouldn’t make up for the cost savings on software. I don’t know about some of the other freeware/open-source packages discussed, such as R, but one should think hard about cost/productivity trade-offs before launching into a project with them.

Thanks for stopping by!

BMW