Author Archives: Barry M. Wise

About Barry M. Wise

Co-founder, President and CEO of Eigenvector Research, Inc. Creator of PLS_Toolbox chemometrics software.

Grit, Resilience and Ski Racing

Apr 23, 2025

It may be that the only way to improve your grit is to do things that are hard. — Clare E. Wise, M.D.

The end of the ski season is upon us and with it the end of the ski racing season. It has been a great year and I’ve been fortunate to officiate at alpine ski racing events at (almost) all levels this year, with athletes from 6 to 86 (yes, really).

We’re big fans of ski racing and I’m pleased we’re able to sponsor two ski teams under the Eigenvector banner, the Schweitzer Alpine Racing School (SARS) and the Mission Ridge Ski Team (MRST). Given that ski racing has nothing to do with chemometrics, machine learning or, say, spectroscopy, you may be wondering why we choose to sponsor ski racing. The answer? Because it’s hard. And learning to do things that are extremely difficult under highly variable and largely non-ideal circumstances improves grit and resilience. This is true for young people and adults.

Grit made its way into the popular lexicon upon the publication of Angela Duckworth’s eponymous book in 2016. Grit is defined as “passion and perseverance in pursuit of long term goals,” or alternately, as “courage and resolve, strength of character.” (For a quick introduction to grit see Duckworth’s 2014 TED talk.) Grit is closely related to resilience, “the capacity to withstand or to recover quickly from difficulties and setbacks.” Though hard to measure accurately (grit is assessed using self-reported questionnaires), grit is one of the biggest factors in predicting people’s success in everything from math scores to recovery from surgery.

So how do you build grit? Duckworth’s TED talk from 11 years ago ends with a “we don’t know.” Since then, additional research has been done. Developing long-term habits, such as doing the New York Times crossword puzzle every day (or in my case, studying French on Duolingo), appears to contribute. But the most effective way, and maybe even the only way, to build grit and resilience is to repeatedly do things that are hard. Things you often fail at, but learn to recover from and push through.

Ski racing provides many opportunities to fail and recover. It is totally objective: time is the only thing that counts, and it is measured to the hundredth of a second. But course sets, surface and weather conditions vary widely, and races are seldom run under ideal conditions. It is inherently unfair because no two athletes ski exactly the same course: the surface degrades and the weather can change in an instant. Add to that frequent schedule changes and the resulting equipment and mental-preparation gymnastics. There are many, many variables that are out of the athletes’ control.

Athlete at Downhill Start during Eigenvector Western Region Speed Series at Schweitzer Mountain Resort, Sandpoint, Idaho.

Beyond grit and resilience, ski racing (and other sports and activities too, of course) improves the mental health and social development of young people. Research discussed in Greg Lukianoff and Jonathan Haidt’s “The Coddling of the American Mind: How Good Intentions and Bad Ideas are Setting a Generation up for Failure” and Haidt’s recent “The Anxious Generation: How the Great Rewiring of Childhood is Causing an Epidemic of Mental Illness” indicates that the rise of “safetyism” contributes to arrested development and “failure to launch.” Participation in skiing in general is the opposite of safetyism: skiers can and do get injured. Ski racers even more so. Having to pick yourself up from “face-plants” is a pretty regular thing. And while we wouldn’t wish serious injury on anyone and do our utmost to prevent it, ski racers and other athletes learn a lot about what they are capable of when they are recovering. Haidt’s (and others’) work also indicates that time spent outdoors with friends away from cell phones is clearly good for the mental health of young people.

For a nice summary of recent research on this topic, please see “Too Grit to Quit: Building Grit & Resilience in Orthopedic Patients & Providers,” a Grand Rounds talk by Clare Wise. Though focused on the world of orthopedic surgery, the research covered includes many generally applicable results, along with some interesting information about what the US Ski Team is doing to improve the grit of their athletes (a pretty gritty bunch to begin with).

Our daughters Clare and Mattie, who have both been ski racers and coaches, have many fellow ski racer friends and I am constantly amazed at what a fun, outgoing, and very successful group they are. Their ski racing experience surely contributed to this. Eigenvector is proud to support that experience for future and current generations.

Barry M. Wise
Technical Delegate
USSS #5940341

Overfit < 1??

Feb 25, 2025

I’ve learned a number of things using Diviner, our new semi-automated Machine Learning tool for creating regression models. When you make models in large numbers on each data set you are presented, you begin to see some trends. For me, one of these was discovering how often certain data preprocessing methods work surprisingly well. (I’m still flabbergasted that SNV followed by autoscaling works well for many spectroscopic data sets.) Another is that you begin to see “anomalies” much more often when you make models by the hundreds.

The anomaly of interest here is the occasion when the error of calibration (root-mean-square error of calibration, or RMSEC) exceeds the error of cross-validation (root-mean-square error of cross-validation, or RMSECV). This is generally not expected, as what it implies is rather counter-intuitive: the model predicts left-out data better than it fits the data it was built on. (!) Cross-validation curves for a Partial Least Squares (PLS) model generally look like Figure 1 below.
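To make the two error measures concrete, here is a minimal NumPy sketch (Python rather than the MATLAB of PLS_Toolbox, with ordinary least squares standing in for PLS; the function name and synthetic data below are illustrative assumptions, not Diviner's actual code). RMSEC is the fit error of a model built on all calibration samples; RMSECV pools the squared prediction errors of each left-out split, with the model rebuilt from scratch each time.

```python
import numpy as np

def rmsec_rmsecv(X, y, n_splits=5, seed=0):
    """RMSEC (fit error on the full calibration set) and RMSECV
    (pooled prediction error over left-out splits) for a least-squares model."""
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])          # add an intercept column
    b = np.linalg.lstsq(Xc, y, rcond=None)[0]
    rmsec = np.sqrt(np.mean((y - Xc @ b) ** 2))
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_splits)
    press = 0.0                                    # sum of squared CV errors
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        bt = np.linalg.lstsq(Xc[train], y[train], rcond=None)[0]
        press += np.sum((y[test] - Xc[test] @ bt) ** 2)
    rmsecv = np.sqrt(press / n)
    return rmsec, rmsecv
```

Ordinarily the RMSECV returned here is the larger of the two numbers, which is exactly the expected behavior the rest of this post contrasts with the anomaly.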

Figure 1: Typical Calibration Error (RMSEC) and Cross-validation Error (RMSECV) Curves for a PLS Model for Glucose from NIR Spectroscopy.

If you look closely at Figure 1 you’ll see that the RMSECV line (blue) is above the RMSEC line (red) in all cases. This is perhaps more obvious if you plot the ratio of RMSECV to RMSEC, as shown in Figure 2. In all instances this ratio is greater than 1. We refer to this as the “Overfit Ratio,” and it has become one of our favorite diagnostics for selecting model complexity. Ideally, a model’s fit error on the calibration data should not be much lower than its cross-validated prediction error. If it is, the model is fitting the data much better than it can predict it, which suggests it is fitting the noise in the data and is, therefore, overfit. The overfit ratio typically increases as model complexity increases.

Figure 2. Overfit Ratio RMSECV/RMSEC for Model in Figure 1.

But every once in a while we get cross-validation curves that look like Figure 3, where the ratio RMSECV/RMSEC is greater than 1 except at 6 PLS components. Note that this is for a 5-fold cross-validation split of the Casein/Glucose/Lactate (CGL) NIR data (compliments of Tormod Næs), where we are making a model for Casein and have pre-processed the data with a second derivative (21-point window, 2nd-order polynomial).

Figure 3. Calibration and Cross-Validation Error Curves for Casein (5-fold, 2nd derivative, 21-pt window).

When presented this way the anomaly is easy to ignore. In Diviner, however, we like to plot the model overfit ratio, RMSECV/RMSEC, versus the predictive ability, RMSECV. The idea is that the best models are the ones that have good predictive ability without being overfit, i.e. the models in the lower left corner. But when RMSECV < RMSEC, then RMSECV/RMSEC < 1, and these points stick out like a sore thumb on the plots, as shown in the lower left corner of Figure 4. This is rather hard to ignore.
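The “lower left corner” idea can be encoded as a small selection step. In this hypothetical sketch (NumPy; the function name and the `max_ratio` cutoff are my own illustrative choices, not Diviner’s actual logic), we take the lowest RMSECV among candidate models whose overfit ratio is at least 1 but not excessively large:

```python
import numpy as np

def pick_model(rmsec, rmsecv, max_ratio=1.5):
    """Return the index of the model with the best predictive ability
    (lowest RMSECV) among candidates that are neither badly overfit
    (ratio >> 1) nor anomalous (ratio < 1, i.e. RMSECV < RMSEC)."""
    rmsec, rmsecv = np.asarray(rmsec, float), np.asarray(rmsecv, float)
    ratio = rmsecv / rmsec
    ok = (ratio >= 1.0) & (ratio <= max_ratio)
    candidates = np.flatnonzero(ok)
    return int(candidates[np.argmin(rmsecv[candidates])])
```

Note that the `ratio >= 1.0` condition deliberately screens out the anomalous points in the lower left corner, rather than rewarding them.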

Figure 4. Overfit Ratio RMSECV/RMSEC versus Predictive Ability RMSECV for 80 Models with Different Combinations of Preprocessing and Variable Selection from Diviner. Number of PLS Latent Variables Shown.

My first question when I saw this phenomenon was, “is it reproducible, or is it some quirk in the software?” The answer is that it is definitely reproducible. I should note here that we are doing conventional/full cross-validation: the entire model is rebuilt from scratch using the full data set minus each left-out split. No shortcuts, such as skipping recalculation of the model mean or cross-validating only the current factor, are used.
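The “no shortcuts” point can be illustrated with mean-centering, the simplest preprocessing step. In this sketch (NumPy, with a mean-centered least-squares model standing in for PLS; the `full` flag and function name are my own naming), the full version recomputes the means inside every split, while the shortcut reuses the global means and therefore leaks information from the left-out samples into the model:

```python
import numpy as np

def rmsecv_centered(X, y, full=True, n_splits=5, seed=0):
    """RMSECV for a mean-centered least-squares model.
    full=True recomputes the means on each training split (no shortcuts);
    full=False reuses the global means (the shortcut to avoid)."""
    n = len(y)
    gxm, gym = X.mean(axis=0), y.mean()            # global means (shortcut)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_splits)
    press = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        xm, ym = (X[train].mean(axis=0), y[train].mean()) if full else (gxm, gym)
        b = np.linalg.lstsq(X[train] - xm, y[train] - ym, rcond=None)[0]
        press += np.sum((y[test] - ((X[test] - xm) @ b + ym)) ** 2)
    return np.sqrt(press / n)
```

The two variants give slightly different RMSECV values; the full version is the honest one, since the left-out samples contribute nothing to the centering step.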

The next question is “what conditions lead to this phenomenon?” On that I’m substantially less certain. First, it seems likely that this is NOT possible when using leave-one-out (LOO) cross-validation: when only one sample is left out, the model must surely swing towards that sample when it is included rather than excluded. I also don’t think it has anything to do with some quirk of preprocessing during cross-validation. We sometimes see it with very simple preprocessing methods, including row-wise methods (such as the derivative used here or sample normalization) that don’t change with the data split.

I suspect it has to do with the stability of the model at the given number of components. If two components in the model (with all the data) have nearly the same covariance values (variance captured in x times variance captured in y), then a small change in the data, such as leaving a small amount of it out, can cause the factors to rotate or come out in a different order. Thus the model is quite different in the last factor or factors. This instability can lead to a “lucky guess” on the predicted values for one or more of the left-out sample sets.
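The rotation argument can be demonstrated with a toy two-factor example (NumPy; a 2×2 symmetric "covariance" with an off-diagonal perturbation standing in for the effect of leaving data out, my own illustrative construction). When the two factor variances are well separated, a small perturbation barely moves the leading factor; when they are nearly equal, the same perturbation rotates it substantially:

```python
import numpy as np

def leading_direction(s1, s2, eps):
    """Leading singular direction of a 2x2 'covariance' whose diagonal
    holds the two factor variances; eps couples the factors."""
    A = np.array([[s1, eps], [eps, s2]])
    _, _, Vt = np.linalg.svd(A)
    return Vt[0]

def alignment(s1, s2, eps=0.05):
    """|cos of the angle| between the leading factor before and after the
    perturbation; 1.0 means the factor did not rotate at all."""
    v0 = leading_direction(s1, s2, 0.0)
    v1 = leading_direction(s1, s2, eps)
    return float(abs(v0 @ v1))
```

With variances 10 and 1 the alignment stays essentially at 1, but with variances 10 and 9.9 the same tiny coupling rotates the leading factor by 22.5 degrees (alignment about 0.92): the last factors of a near-degenerate model really can swing around when a little data is removed.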

As a test of this model-instability theory, consider the agreement between the calibration and cross-validation y-residuals. Figure 5 shows the cross-validation residuals versus the calibration residuals for the 6-component model, with samples colored by their T2 values. For comparison, the residuals from the 7-component model are shown in Figure 6. There is clearly much better agreement between the residuals in the 7-component model, which shows the usual overfit ratio greater than 1.

Figure 5. Cross-validation versus Calibration y-residuals for the 6 Component Casein Model Showing Substantial Differences Between Residuals, colored by T2 values.

Figure 6. Cross-validation versus Calibration y-residuals for the 7 Component Casein Model Showing Good Agreement Between Residuals, colored by T2 values.

It is interesting to note that the samples with the largest difference in residuals for the 6 LV model tend to have high T2 values, as shown in Figure 5. Many of these samples also have relatively high Q-residual values, as shown in Figure 7. The fact that the samples with the largest disagreement tend to be ones that could have a high impact on the model when included versus excluded suggests (to me at least) that model instability at the given number of factors is key to the overfit < 1 problem.

Figure 7. Cross-validation versus Calibration y-residuals for the 6 Component Casein Model Showing Substantial Differences Between Residuals, colored by Q-residual values.

There is much still to be done to fully investigate the cause of the RMSECV < RMSEC phenomenon. The practical question, of course, is what to do about it when it is observed. My suggestion is to simply not trust the model at the given number of factors, and therefore not consider it as a candidate for final model selection. It’s fun to think you could get a model that predicts better than it fits, but it is most likely the result of model instability producing some lucky guesses.

More to come!

BMW

PLS_Toolbox and MATLAB 2025a

Jan 23, 2025

Eigenvector Research has always worked to make PLS_Toolbox compatible with the most current version of MATLAB, plus an approximately five-year window of older versions. Compatibility with the upcoming MATLAB 2025a, however, presents unique challenges. In their transition to an entirely HTML-based interface, The MathWorks (TMW) has removed support for Java in 2025a. As a result, many elements in our graphical user interfaces will not function in 2025a. To make matters even more complicated, the Java functionality that we have relied on has not been completely replaced by HTML equivalents. Thus, PLS_Toolbox will not be compatible with MATLAB 2025a.

TMW is aware of these issues and we are working with them (and other MATLAB experts) to find solutions that will maintain the high level of user-friendliness in our current interfaces in future versions of MATLAB. We expect this process to take the better part of 2025 to sort out. We want to make sure that the solutions we come up with are user-friendly, stable, and have a long lifetime. This will be challenging and will take time and careful consideration. 

If you use PLS_Toolbox then we recommend that your organization not adopt MATLAB 2025a. If you choose to do so, we expect that our command line functions will still work as always, but most of our interfaces will not. In its current state, we cannot offer support for MATLAB 2025a. We will, of course, happily support PLS_Toolbox, MIA_Toolbox and Model_Exporter on MATLAB 2019b through 2024b. And of course we will continue to expand and refine the capabilities of our software as we work to become compatible with future MATLAB versions.  

Note that users of our stand-alone software Solo and its variants (Solo+MIA, Solo+Model_Exporter) will not be affected by this change, nor will users of our prediction engine Solo_Predictor.

Please write to bmw@eigenvector.com if you have any questions about this. We’re happy to give more detailed explanations of the technical issues. We would also be interested to know if your organization is affected by the removal of Java from MATLAB and how you are dealing with it.

BMW

On Turning 30

Jan 6, 2025

Eigenvector Research was founded on January 1, 1995, which means that we just turned 30. When I mentioned writing a piece for the occasion of our 30th anniversary, our Donal O’Sullivan replied “I don’t know if you want people to know we’re that old!” And I understand where he’s coming from. In the software business, especially in data science, maybe you don’t want to advertise that you’ve been around for 3+ decades. Perhaps better to look like the bright shiny new thing. 

But I think that 30 years of experience counts for something. We’ve seen quite a few shiny new things get misused and abused, mostly by people who don’t appreciate the basics that you just can’t get around. Things like: if your model is purely based on data (rather than physics and chemistry), you can’t expect it to work outside the range and subspace of the calibration data. The more effects contributing to the variance in your data, the smaller the unique part of the signal you are looking for will be, until it disappears altogether. And if you are fitting your data better than your ability to predict it, you’re fitting the noise or clutter. And, and….

The downside of turning 30 in the software business is what you might call technical debt, except most definitions of technical debt focus on the cost of implementing short term work-arounds instead of long term solutions. But if you’ve been in the business as long as we have, you know that, in spite of the advantages they initially offered, most software frameworks are eventually abandoned. (Remember ActiveX?) Learning to utilize new technologies to add new methods and features to software is fun, but replacing old architecture is hard work. Over the last couple years we’ve put a lot of effort into updating the infrastructure that supports our users behind the scenes. In the coming year we’ll be focused on updating some of the technologies behind our end user software PLS_Toolbox, Solo and their variants while still improving the existing methods. It is going to be challenging!

Looking back, 2024 was a great year. It was definitely the “year of training.” Between our in-person classes (Eigenvector University in Seattle and Rome), our online classes (both open and for specific companies) and our recorded courses, we reached more students than ever. We also had our biggest year of software sales. Associations with our instrument company partners and software resellers were a big part of that. We were also happy to help many new users transition from other software packages. And of course our students, users and consulting clients all benefited from those 30 years of experience.

Our biggest software development of 2024 was the release of Diviner, our semi-automated Machine Learning tool for accelerating the development of calibration models. Diviner automates the construction of regression models and keeps the analyst in the loop so that they can learn from the process. We have lots of plans for improving and expanding the use of Diviner: even as useful as it is now, there is still much to do!

Example output from Diviner. Each point represents a model with different preprocessing, variable selections and meta-parameters. The best models are the ones that have the best predictive ability but are not overfit.

When young people turn 30 they are often a bit depressed to be leaving their 20s behind. (A search on “turning 30” reveals lots of this, along with a lot of stuff that is really, really not useful!) But generally that feeling is soon replaced with the realization that they’ve entered a very productive period of their life, where they can make real progress on careers and relationships, both business and personal. As a company we feel the same way. Here’s to life beyond 30!

BMW

Year 30

Dec 27, 2023

I believe that days go slow and years go fast
And every breath’s a gift, the first one to the last
– Luke Bryan, ‘Most People are Good’

Eigenvector Research starts its 30th year this January. Reflecting on that made me think of the Luke Bryan quote above. Some days have been long, for sure. For me that’s particularly true when I’m doing administrative stuff. There are also some days that go incredibly fast. When you are “in the zone” writing code to investigate a new method you get to five o’clock and wonder where the day went. But all of the years go fast. This really hits me when I think of the things I started long ago but still haven’t finished (like that journal article on Gray Classical Least Squares). And of course when you realize your children are adults with job titles like “Senior Financial Analyst” and “Orthopedic Surgical Resident” you wonder where the years went.

Every breath is a gift, especially when you are doing something you love. In the field of chemometrics (chemical data science, machine learning) there’s just no end of things to think about and explore. And of course part of the fun is working with such a great crew. Here’s our technical staff (I guess I can still include myself in that) at Shuckers during the 17th Eigenvector University last May. We group up differently depending on the project (software, consulting, training, webinars, helpdesk, etc.) so I get to work with EVRIbody from time to time, as do we all. That really keeps it interesting, in part because we span a wide range of ages (almost 40 years) and hobbies (ski racing, rowing, trail running, biking, baseball, photography, playing the bagpipes) and other interests (cosmology, cooking, movie and music trivia, French). (And of course I still love sharing the back office with my wife Jill, the Managing Director!)

The Eigenvector Technical Staff: Lyle W. Lawrence, Sean Roginski, R. Scott Koch, Shamus Driver, Barry M. Wise, Neal B. Gallagher, Robert T. “Bob” Roginski, Manuel A. “Manny” Palacios.

It piles up. When I checked this morning there have been 24,532 commits to our software version control system. Version 9.3 is the 38th release of PLS_Toolbox. We’ll have our 18th EigenU this year and our 13th Eigenvector University Europe. I lost count long ago of how many short courses we’ve done but it’s way over 200. We’ve done over 30 webinars, and have had consulting projects with 50+ companies and national laboratories. And no end in sight!

Neal and I joke that we got into this because life is too short to drink bad coffee, bad beer, do boring work or live in a crappy place. That sounds flippant but it is actually true. So a big THANKS! to all our customers who have enabled us in this software odyssey and intellectual pursuit. We have a lot planned for 2024 that we think you will find useful! We hope to continue to serve you.

A Happy and Prosperous New Year to All!

BMW

Transparency

Dec 26, 2023

Jonathan Stratton of Optimal posted a nice summary on LinkedIn of the 2nd Annual PAT and Real Time Quality Summit which took place in Boston this month. In it he included the following bullet point:

“The age-old concept of data modeling with spectroscopy has been revitalized through the integration of Machine Learning and AI initiatives. The pharmaceutical industry, in particular, is embracing data science, exploring the potential of deep learning and AI tools in spectroscopy applications. The emergence of open-source tools adds transparency to the ‘black box’ of these advanced technologies, sparking discussions around regulatory concerns.”

I’ve been doing data modeling and spectroscopy for 35+ years now and I’ve never felt particularly un-vital (vital: adjective; full of energy, lively). However, there is certainly more vibrancy in chemical data science right now, largely due to the hype surrounding Artificial Intelligence/Deep Learning (AI/DL). Rasmus Bro wrote me with “I feel like AI/DL has sparked a new energy and suddenly we are forced to think more about what we are and how we are.” But the part of this bullet that really got my attention is the claim that open-source tools add transparency to the ‘black box.’ I wasn’t at the meeting so I don’t know exactly what was said, but I must take issue with the idea that open-source has anything to do with the transparency of ‘black-box’ models.

First off, there is a pervasive confusion between the software that generates models and the models themselves: these are two separate things. The software determines the path that is followed to arrive at a model. But in the end, how you get to a model doesn’t matter (except perhaps in terms of efficiency). What’s important is where you wound up. That’s why I’ve said many times: VALIDATE THE MODEL, NOT THE SOFTWARE THAT PRODUCED IT! I first heard Jim Tung of The MathWorks say this about 25 years ago and it is still just as true. The model is where the rubber meets the road. And so it is also true that, ultimately, transparency hinges on the model.

Why do we care?

Backing up a bit, why do we care if data models are transparent, i.e. explainable or interpretable? In some low-impact, low-risk cases (e.g. a recommender system for movies) it really doesn’t matter. But in a growing number of applications machine learning models are used to control systems that affect people’s health and safety. In order to trust these systems, we need to understand how they work.

So what must one do to produce a transparent model? In data modeling, ‘black-box’ means ‘having no access to or understanding of the logic which the model uses to produce results.’ Open-source has nothing to do with the transparency, or lack thereof, of ‘black-box’ models. For transparency you do, of course, need access to the numerical recipe that constitutes the model itself, i.e. the procedure by which the model takes an input and creates an output. This is a necessary condition of transparency. But it doesn’t matter if you have access to the source code that generated or implements, e.g., a deep Artificial Neural Network (ANN) model if you don’t actually understand how it is making its predictions. The model is still black as ink.

This is the crux.

Getting to Transparency

The first step to creating transparent models is to be clear about what data went into them (and what data didn’t). Data models are first and foremost a reflection of the data upon which they are based. Calibration data sets should be made available and the logic which was used to include or exclude data should be documented. Sometimes we’re fortunate enough to be able to use experimental design to create or augment calibration data sets. In these cases, what factors were considered? This gives a good indication of the domain where the model would be expected to work, what special cases or variations it may or may not work with, and what biases may have been built in.

The most obvious road to transparency is the use of interpretable models to begin with. We’ve been fans of linear factor-based methods like Partial Least Squares (PLS) since day one because of this. It is relatively easy to see how these models make their predictions. There is also a good understanding of the data preprocessing steps commonly used with them. Linear models can be successfully extended to non-linear problems by breaking up the domain. Locally Weighted Regression (LWR) is one example for quantitative problems while Automated Hierarchical Models (AHIMBU) is an example for qualitative (classification) models. In both cases interpretable PLS models are used locally and the logic by which they are constructed and applied is clear.

With complex non-linear models, e.g. multi-layer ANNs, Support Vector Machines (SVMs), and Boosted Regression Trees (XGBoost), it is much more difficult to create transparency, as it has to be done post hoc. For instance, the Local Interpretable Model-Agnostic Explanations (LIME) method perturbs samples around the data point of interest to create a locally weighted linear model. This local model is then interpreted (which raises the question, ‘why not use LWR to begin with?’). In a somewhat similar vein, Shapley values indicate the effect of including a variable on the prediction of a given sample. The sum of these estimated effects (plus the model offset) equals the prediction for the sample. Both LIME and SHAP explain the local behavior of the model around specific data points.
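As a sketch of the LIME idea (NumPy; the function name, the Gaussian proximity kernel, and the quadratic test model are my own illustrative assumptions, not the LIME package's API): perturb samples around the point of interest, weight them by proximity, fit a weighted linear surrogate, and read the surrogate's slopes as the local explanation of the black-box model f.

```python
import numpy as np

def local_explanation(f, x0, scale=0.1, n=500, seed=0):
    """LIME-style local surrogate: returns the slopes of a locally weighted
    linear fit to black-box model f in the neighborhood of point x0."""
    rng = np.random.default_rng(seed)
    Xp = x0 + scale * rng.normal(size=(n, len(x0)))     # perturbed neighbors
    w = np.exp(-np.sum((Xp - x0) ** 2, axis=1) / (2 * scale ** 2))
    sw = np.sqrt(w)                                     # weighted least squares
    Xd = np.column_stack([np.ones(n), Xp - x0])         # intercept + offsets
    coef = np.linalg.lstsq(Xd * sw[:, None], f(Xp) * sw, rcond=None)[0]
    return coef[1:]                                     # local slope per variable
```

For a model like f(x) = x1² + 3·x2 evaluated near x1 = 1, the surrogate recovers local slopes near (2, 3), i.e. the local gradient, which is exactly the kind of per-variable, per-point story LIME and SHAP provide.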

It is also possible to explore the global behavior of models using perturbation and sensitivity tests as a function of input values. Likewise, visualizations such as the one below can be created that give insight into the behavior of complex models, in this case an XGBoost model for classification.
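A global perturbation test can be as simple as sweeping one input variable across a grid while holding every sample's other variables fixed and recording the mean model response, essentially a partial-dependence profile. The sketch below is an illustrative NumPy version of that idea, not any specific package's API:

```python
import numpy as np

def sensitivity_profile(f, X, var, grid):
    """Sweep input variable `var` across `grid` for every sample in X,
    holding the other variables fixed, and return the mean response of
    black-box model f at each grid value."""
    profile = []
    for v in grid:
        Xp = X.copy()
        Xp[:, var] = v                 # force the swept variable to v
        profile.append(float(np.mean(f(Xp))))
    return np.array(profile)
```

A flat profile suggests the model ignores that variable; a steep or kinked one shows where (and how non-linearly) the variable drives the predictions.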

XGBoost Decision Surface Visualization

To summarize, transparency, aka “Explainable AI,” is all about understanding how the models that Machine Learning software produces behave. Transparency can either be built in or achieved through interrogation of the models.

Transparency at Eigenvector

Users of our PLS_Toolbox have always had access to its source code, including both the code used to identify models and the code used to apply models to new data, along with the model parameters themselves. It is not “open-source” in the sense that you can “freely copy and redistribute it” (you can’t), but it is in the sense that you can see how it works, which in the context of transparent models is the critical aspect. Our Solo users don’t have the source code, but we’re happy to tell you exactly how something is computed. In fact we have people on staff whose job it is to do this. And our Model_Exporter software can create numerical recipes of our models that make them fully open and transportable. So with regards to being able to look inside the computations involved with developing or applying models, we have you covered.

In terms of understanding the behavior of the black-box models we support (ANNs, SVMs, XGBoost) we now offer Shapley values and have expanded our model perturbation tests for elucidating the behavior of non-linear models. At EAS I presented “Understanding Nonlinear Model Behavior with Shapley Values and Variable Sensitivity Measures” with Sean Roginski, Manuel A. Palacios and Rasmus Bro. These methods are going to play a key role in the use of NL models going forward.

ANN Model for Fat in Meat showing Shapley Values and Model Sensitivity Test from PLS_Toolbox 9.3.

A Final Word

There are a lot of reasons why one might care about model transparency. We like transparency because it increases the level of trust we have in our models. In our world of chemical data science/chemometrics we generally want to assure that models are making their predictions based on the chemistry, not some spurious correlation. We might also want to know what happens outside the boundary of the calibration data. To that end we recommend in our courses and consulting projects that modeling always begin with linear models: they are much more informative, and you (the human on the other side of the screen) stand a good chance of actually learning something about the problem at hand. Our sense is that black-box models are currently way over-used. That’s the part of the AI/DL hype cycle we are in. I agree with the sentiments expressed in Why are we using Black-Box models in AI When we Don’t Need to?, which includes a very interesting example. Clearly, we are going to see continued work on explainable/interpretable machine learning because it will be demanded by those who are impacted by the model responses. And rightly so!

BMW

In-Person, Live Online or Recorded?

Sep 25, 2023

At Eigenvector Research we offer many courses in chemometrics and machine learning and several ways to take them. As such, I often get variations on the question “what’s the difference between your in-person classes and your online classes?”

I’ll address this by starting with what stays the same. We use the same notes, and have the same instructors. For the most part we use the same data sets and go over the same examples hands-on with our PLS_Toolbox/Solo software. And we spend approximately the same amount of time on each topic.

So what’s the difference? At the risk of sounding condescending, the main difference is that you attend our in-person classes in person and with the online options you don’t. But the in-person aspect brings many unique possibilities with it. The most important one is the student-teacher interaction. In a live class the instructors can answer questions in real time, and also observe the students and sense if they are “getting it” or not. For the hands-on parts, we also usually have one or more additional instructors walking around behind people to see what’s on their screen to check that they are following and assist if necessary.

With in-person classes you also have the time at breaks and after class to talk face to face, like our Manny Palacios at left below discussing aspects of regression with an EigenU attendee. Plus there are also opportunities to interact, network and socialize with fellow attendees. Beyond this, in-person classes force you to set aside time, focus on learning and take advantage of the immersive environment.

The downside of in-person classes? Time and expense. You have to block the time off for the class plus the travel time. Unless the courses are nearby there are travel and lodging expenses and the courses themselves cost more.

For remote learning, we offer classes both live online and recorded online. The live online classes are at scheduled times, generally early morning in North America and late afternoon in Europe. Students can ask questions that the instructor can answer as part of the lecture or his assistants can answer through online chat. Plus we record these so students can review them later, which is especially helpful with the hands-on exercises. But of course the student-teacher interaction doesn’t match what’s possible in person, and there is not much interaction between students. And like in-person classes, you might not find the class you want in the time frame you need it. Online classes are, however, much less expensive and can be done from the comfort of your home or office, like the guy on the right above (whose desk hasn’t been this clean since the picture was taken 3 years ago).

Finally, there are recorded online classes. The main advantage of these is that they can be done completely on your own schedule, are available on demand, and like live online classes, are less expensive and don’t require travel. They do, however, put more distance between instructors and students as they become separated in both space and time! Questions are answered via email but not in real time.

In-person vs. Live Online vs. Recorded Online Pros and Cons

In-person Classes
  Pros: best student-teacher interaction; real-time answers to questions; interaction with fellow students; forces focus on learning; after-class and social opportunities.
  Cons: most expensive; added travel time; time away from the office.

Live Online Classes
  Pros: real-time answers to questions; access to recordings for review; forces setting time aside; less expensive; no travel required.
  Cons: less student-teacher interaction; no interaction with fellow students; no after-class or social opportunities.

Recorded Online Classes
  Pros: learn at your own pace; available on demand; less expensive; no travel required.
  Cons: least student-teacher interaction; no interaction with fellow students; no after-class or social opportunities; easy to put off.

So what to choose? I’ve listed the pros and cons as I see them in the table above. Different people learn differently, so what’s a pro to one may be a con to another. For me, attending something live forces me to set aside the time and then pay attention (e.g., turn off my phone and email) until it’s over. Some people don’t consider this an advantage!

That said, IMHO, if you can afford the time and expense and can work them into your schedule, in-person short courses are the best way to get started with a new subject in the shortest amount of time. They are the Cadillac (or in my case the Lincoln) way to learn. Between live online and recorded online I’d choose live if you can find the right course at the right time. But if you need it and have to have it right now, you can’t beat recorded online for on demand convenience.

Happy learning!

BMW

For more thoughts about teaching chemometrics please see: Wise, Barry M., Teaching Chemometrics in Short Course Format, J. Chemometrics, 36(5), April 2022.

Why Solo?

Aug 30, 2023

I was asked the other day to provide a list of advantages that our Solo software has over its competitors for chemometrics and machine learning. Well, I don’t spend much time keeping track of what’s in other companies’ software. But I can tell you what is in Solo and why we think it’s a great value. (Apologies in advance for all the acronyms but I’ve included a guide below.)

Solo supports a very wide array of methods for data exploration, regression and classification. The standard PCA, PCR, PLS, MLR and CLS are included of course, but also MCR, SIMPLISMA/Purity, Robust PCA and PLS, O-PLS, PLS-DA, Gray CLS, PARAFAC, PARAFAC2, MPCA, batch data digester, batch maturity, LWR, ANN, Deep Learning ANNs, SVM, N-PLS, XGBoost, KNN, Logistic Regression, user specified Hierarchical Models, Automated Hierarchical Models, SIMCA, UMAP and t-SNE.

Solo has a sophisticated point and click user interface (see below) for graphical data editing and model building which users find very intuitive. Plots are easily customizable with class sets, color by properties, etc. 

Solo supports the most extensive set of data preprocessing methods: centering, scaling, smoothing, derivatives, automatic WLS baseline, selected points baseline, Whittaker filter, EEM filtering, MSC, EMSC, detrend, EPO, GLSW, despiking, Gap-Segment derivatives, OSC, normalization, PQN, SNV, block scaling, class centering, Pareto and Poisson scaling, plus build-your-own math functions. The preprocessing order is user selectable, and looping is supported for additional flexibility.

Solo does full cross-validation including all preprocessing steps, and includes a variety of cross validation methods as well as customizable specified data splits.  

Solo supports a host of methods for variable selection: i-PLS, VIP, SR, Genetic Algorithms, r-PLS and stepwise.

Solo includes a full suite of methods for calibration transfer: DS, PDS, DW-PDS, SST, OSC, plus they can be inserted before or between preprocessing steps as required with the Model Centric Calibration Transfer Tool (MCCT). 

Solo includes the Model Optimizer, which can be used to create, calculate, compare and rank linear and non-linear models across a variety of preprocessing options and selected variable sets.

Solo is available on all three major platforms: Windows, MacOS and Linux. Models and data files are completely compatible between platforms.

Solo includes the Report Writer which makes models easy to document by creating PowerPoint or web pages from models. Solo maintains data history and includes model caching for preserving traceability. 

Solo optional add-ons include Model_Exporter which allows users to export models as numerical recipes as well as Python and MATLAB code so they can be applied online and in handheld devices. Solo also works with Solo_Predictor, a stand-alone, configurable, prediction engine for online use. The MIA_Toolbox add-on allows users to seamlessly apply all the methods above to hyperspectral images. 

Solo is completely compatible with PLS_Toolbox for use with MATLAB. PLS_Toolbox has all the features of Solo, including the point and click interfaces and graphical data editing, but also allows users to access all the functionality from the command line and incorporate these methods in user specified scripts and functions for ultimate flexibility. This allows users to work the way they want (command line or point and click) and still work together. 

Solo has the widest array of training options available including our “EVRI-thing You Need to Know About” webinar series, Eigenvector University live classes, and EigenU Recorded courses as well as courses at conferences such as APACT, SCIX and EAS. Eigenvector teaches over 20 specific short course modules. 

Solo gives users access to the Eigenvector HELPDESK, user support that is prompt and actually helpful. HELPDESK is manned by our developer staff, the people that actually write the software. Need more help on specific applications? Eigenvector offers consulting services.  

Despite all of its advantages, Solo costs less than other major chemometric packages. We publish our price list so you can compare. We offer single-user and floating licenses that work great for small or large groups.

Finally, Solo is and always has been a product of Eigenvector Research, Inc., owned and operated by the same people for 29 years now.

Still have questions? You can try Solo yourself with our free demos; start by creating an account. Or you can always e-mail me, bmw@eigenvector.com.

Best regards,

BMW

  • PCA == Principal Components Analysis
  • PCR == Principal Components Regression
  • PLS == Partial Least Squares Regression
  • MLR == Multiple Linear Regression
  • CLS == Classical Least Squares
  • MCR == Multivariate Curve Resolution
  • SIMPLISMA == SIMPLe to use Interactive Self-modeling Mixture Analysis
  • Purity == Self-modeling mixture analysis via Pure Variables
  • O-PLS == Orthogonal PLS
  • PLS-DA == PLS Discriminant Analysis
  • Gray CLS == CLS incorporating EPO and GLS filters
  • PARAFAC == Parallel Factor Analysis
  • PARAFAC2 == PARAFAC for uneven and shifted arrays
  • MPCA == Multi-way PCA
  • LWR == Locally Weighted Regression
  • ANN == Artificial Neural Network
  • SVM == Support Vector Machine
  • n-PLS == PLS for n-way arrays
  • XGBoost == Boosted classification and regression trees
  • kNN == k-Nearest Neighbors classification
  • SIMCA == Soft Independent Modeling of Class Analogy
  • UMAP == Uniform Manifold Approximation and Projection
  • t-SNE == t-distributed Stochastic Neighbor Embedding
  • WLS == Weighted Least Squares
  • EEM == Excitation Emission Matrix
  • MSC == Multiplicative Scatter Correction
  • EMSC == Extended MSC
  • EPO == External Parameter Orthogonalization
  • GLSW == Generalized Least Squares Weighting
  • OSC == Orthogonal Signal Correction
  • PQN == Probabilistic Quotient Normalization
  • SNV == Standard Normal Variate
  • i-PLS == interval PLS
  • VIP == Variable Influence on Prediction
  • SR == Selectivity Ratio
  • r-PLS == recursive PLS
  • DS == Direct Standardization
  • PDS == Piecewise Direct Standardization
  • DW-PDS == Double Window PDS
  • SST == Subspace Standardization Transform

LCVSF Awards 22 Scholarships

Aug 9, 2023

The Lake Chelan Valley Scholarship Fund (LCVSF) received 48 applications for 2023 and has awarded 22 scholarships, each in the amount of $2500. Awards went to 7 previous award winners now in their second, third or fourth year of study, as well as 10 graduates from Chelan High School (CHS) and 5 graduates from Manson High School (MHS) class of 2023.

CHS 2023 graduate awardees are Ryan Allen, Macie Cowan, Melina Cruz-Magallon, Irene Hernandez, Arden Paglia, Kaylee Patino, Kimberly Pineda, Mariana Sanchez-Mendoza, Tate Sandoval and Lauren Ware. MHS 2023 graduate awardees are Lissett Hernandez, Mackenzie Marble, Briseida Mendez-Resendez, Jude Petersen and Alondra Serrato Bailon. Renewals include Quinn Stamps and Casey Simpson (4th award for each), Cody Fitzpatrick (3rd award), Savannah Gresham, Odaliz Ordaz, Titus Peterson and Zoee Stamps (2nd award for each).

LCVSF board president Betsy Kronschnabel observed “It’s great to see renewals from students that are succeeding at their chosen school and are receiving awards for multiple years. Unlike many other scholarships, the LCVSF awards help students throughout their undergraduate education.”

The LCVSF was made possible by Dr. Doug and Eva Dewar (shown below), who wished that their estates be used to help the children of the Chelan Valley. Though they had no children of their own, they loved kids and helped many young people throughout their lives. The Dewars wished to enable motivated, well-rounded students to further their education, and hoped that these students would return to the Chelan Valley. LCVSF was founded in 1991, and in that year five scholarships of $1000 each were awarded. The fund has grown substantially over the years through contributions from many people, with especially significant contributions from John Gladney, Ray Bumgardner, Don & Betty Schmitten, Marion McFadden, Virginia Husted, the Dick Slaugenhaupt Memorial and Irma Keeney. Now in its 33rd year, the fund awarded $55,000 this year, bringing the total given to Chelan Valley students since its inception to nearly $900,000.

LCVSF accepts applications from residents of the Chelan valley for undergraduate education. The awards are renewable for up to four years. LCVSF welcomes applications from graduating high school seniors as well as current college students and adults returning to school.

The LCVSF board includes Betsy Kronschnabel (President), Arthur Campbell, III, Linda Mayer (Secretary), Sue Clouse, Barry M. Wise, Ph.D. and John Pleyte, M.D. (Treasurer). For further information, please contact Barry Wise at bmw@eigenvector.com.

Eigenvector Software Explained

Jun 15, 2023

Eigenvector Research produces a variety of software products for chemometrics and machine learning and we often get asked how they work together. Here’s the roadmap!

We have two main packages for modeling: our MATLAB® based PLS_Toolbox, and our stand-alone Solo (with versions for Windows, macOS and Linux). Solo is the compiled version of PLS_Toolbox, so in practice they are nearly identical; the difference is that if you are using PLS_Toolbox under MATLAB you also have access to the command-line versions of all the functionality. PLS_Toolbox and Solo are built around point-and-click interfaces and include PCA, PLS, PCR, MCR, ANNs, SVMs, PARAFAC, MPCA, PLS-DA, SIMCA, kNN, etc., a very wide array of preprocessing methods, and tools for specific tasks such as calibration transfer/instrument standardization.

Eigenvector Software for Chemometrics and Machine Learning

General Modeling & Analysis
  MATLAB®-based: PLS_Toolbox. Our flagship product, 30 years in the making. Point-and-click or command-line access to the widest array of chemical data science tools and methods.
  Stand-alone: Solo. The stand-alone version of PLS_Toolbox, available for Windows, macOS and Linux. Point-and-click chemometrics and machine learning.

Hyperspectral & Multivariate Image Analysis
  MATLAB®-based: MIA_Toolbox. Add-on to PLS_Toolbox; allows seamless use of modeling methods on hyperspectral data plus additional image analysis tools.
  Stand-alone: Solo+MIA. The stand-alone version of PLS_Toolbox plus MIA_Toolbox. Point-and-click modeling of hyperspectral images.

Model Export for Online Predictions
  MATLAB®-based: Model_Exporter. Add-on to PLS_Toolbox; turns models into numerical recipes or code for application in other software or platforms.
  Stand-alone: Solo+Model_Exporter. Solo with Model_Exporter built in.

Online Prediction Engine
  Stand-alone: Solo_Predictor. Full-featured online prediction engine; applies any PLS_Toolbox or Solo model to new data. MIA_Toolbox compatible.

If you are doing hyperspectral imaging then you can add our MIA_Toolbox to PLS_Toolbox, or choose Solo+MIA. This allows use of all the above methods directly on hyperspectral images plus adds a few more image specific tools. 

If you want to automate model application (say you want to get a PLS model online and have it make predictions as new data comes in) there are two main routes. Solo_Predictor is a full featured stand alone prediction engine that can apply any model you make in PLS_Toolbox/Solo and there are a variety of ways to communicate with it, the most common being socket connections. Solo_Predictor is compatible with hardware from many of our Technology Partner spectrometer companies.

Model_Exporter, on the other hand, creates numerical recipes and code required to apply models to new data streams in a variety of languages including MATLAB and Python. These recipes can then be compiled into other programs or run on hand held devices (such as ThermoFisher’s TruScan or Si-Ware’s NeoSpectra). 

So what software should you buy? PLS_Toolbox or Solo?

Buy PLS_Toolbox if you …
— already have access to MATLAB, it’s less expensive than Solo
— know you want to automate pieces of your modeling process
— want to customize plots using MATLAB
— want to access additional functionality from other MATLAB toolboxes

Buy Solo if you …
— want to work only within visual interfaces
— don’t need to script or program
— prefer the lower cost of Solo compared to MATLAB + PLS_Toolbox

Still have questions? Write to sales@eigenvector.com. Happy modeling!

BMW

MATLAB is a registered trademark of The MathWorks, Inc., Natick, MA.

Chemometrics without Equations

Nov 29, 2022

In 1988 Donald Dahlberg, Professor of Chemistry at Lebanon Valley College (LVC), decided to take a sabbatical leave at the University of Washington (UW) Center for Process Analytical Chemistry (CPAC). At the time, his former student Mary Beth Seasholtz was a second year graduate student in Bruce Kowalski’s Laboratory for Chemometrics. Mary Beth asked Don if he’d be interested in seeing what she was doing. Before Don knew it, he was attending Kowalski’s chemometrics courses and group meetings. I met Don during this period as I was also at CPAC.

When he returned to LVC he started teaching chemometrics to undergraduate students, and involving them in research. This included collaborative research with a local confectionary company.

Meanwhile, at Eigenvector we were interested in developing chemometrics courses for a wider audience. So sometime in 2001 our Neal B. Gallagher contacted Don about the possibility of creating a chemometrics workshop that did not involve the parallel presentation of matrix algebra. They struggled over a title, but eventually settled on “Chemometrics without Equations (or hardly any).” We call it CWE for short. Don, having recently retired from teaching at LVC, wrote the workshop with Neal reviewing the content.

A slide from Chemometrics without Equations explaining PCA in everyday terms.

Don and Neal first presented CWE at the 16th International Forum on Process Analytical Chemistry (IFPAC) in San Diego on January 21-22, 2002. The course was taught hands-on using PLS_Toolbox. CWE was repeated at CPAC’s Summer Institute that July and again at the Federation of Analytical Chemistry and Spectroscopy Societies (FACSS, now SCIX) conference in October 2002 in Ft. Lauderdale, FL. This marked the beginning of CWE’s 20-year run at fall conferences. It was repeated in 2003 at FACSS and in 2004 moved to the Eastern Analytical Symposium (EAS), its home through this year. The workshop has been offered every year except 2020, when COVID-19 prevented a physical conference.

EAS 2022 marks Don’s final presentation of the course at EAS, making a total of 20 fall conference appearances. Each time Don has been assisted by either Neal or myself. Knowing that Don was an avid bourbon connoisseur we commemorated the occasion with a bottle of Blanton’s as he completed his final class.

Neal, Don and Barry celebrating Don’s final Chemometrics without Equations course at Eastern Analytical Symposium.

Looking back on 20 years of teaching CWE Don observed:

EAS has allowed me to meet many scientists who wish to learn and use chemometrics. They have included not only scientists in chemistry, but also those in related fields such as forensic science and cultural heritage. I have had the privilege to offer special versions of the course, tailored to the latter two fields. I have been able to present the course at John Jay College of Criminal Justice, the Forensic Science Department at the University of New Haven, the Museum of Modern Art, the Getty Museum and the Library of Congress. My goal has been to introduce the power of chemometrics to those inside and outside of analytical chemistry. Even though it is time to end my presentations at EAS, I intend to continue to help those who wish to explore the field of chemometrics.

Over 20+ years Professor Dahlberg has gently introduced hundreds to the field of chemometrics with CWE taught at conferences, at in-house classes and online. Thanks, Don, for your service to the field! Cheers and bottoms up!

If you’d like to have Chemometrics without Equations presented at your site, please write bmw@eigenvector.com and we’ll help you arrange it.

2022 Lake Chelan Valley Scholarships Announced

Aug 5, 2022

The Lake Chelan Valley Scholarship Fund (LCVSF) will award 20 scholarships this year to college bound seniors and previous awardees. Recipients are from Chelan Valley schools including Chelan High School (CHS) and Manson High School (MHS). Two graduates from MHS class of ’22 will receive awards: Anthony Martinez and Francisco Munoz; seven members of CHS class of ’22: Charlie Bordner, Savannah Gresham, Quinn McLaren, Beau Nordby, Liam Ross, Cassandra Sanchez and Reed Stamps; and eleven renewals: Colt Corrigan, Cody Fitzpatrick, Adrian Martinez, Emma McLaren, Odaliz Ordaz, Titus Petersen, Elise Rothlisberger, Sierra Rothlisberger, Casey Simpson, Quinn Stamps and Zoee Stamps.

Each recipient will receive $2500 for a total of $50,000 in awards this year. Checks will be presented to the recipients at the flagpole at Chelan Riverwalk Park at 10 am, Saturday, August 6. 

LCVSF board member Barry M. Wise noted “This year’s pool of renewals was especially strong, with almost every previous recipient qualifying. And the small number of awards given to this year’s graduates was less a reflection of their academic achievement than of their success in attracting other scholarships.”

The LCVSF was made possible by Doug and Eva Dewar, who wished that their estates be used to help the children of the Chelan Valley. LCVSF was founded in 1991, and in that year five scholarships in the amount of $1000 each were awarded. The fund has grown substantially over the years from contributions from many people, but especially significant contributions from John Gladney, Ray Bumgardner, Don & Betty Schmitten, Marion McFadden, Virginia Husted, the Dick Slaugenhaupt Memorial and Irma Keeney.  Now in its 32nd year, LCVSF has awarded over $725,000 to Chelan Valley students since its inception. 

LCVSF accepts applications from residents of the Chelan valley for undergraduate education. The awards are renewable for up to four years. LCVSF welcomes applications from graduating high school seniors as well as current college students and adults returning to school.

The LCVSF board includes Betsy Kronschnabel (President), Arthur Campbell, III, Linda Mayer (Secretary), Sue Clouse, Barry M. Wise, Ph.D. and John Pleyte, M.D. (Treasurer). For further information, please contact Barry Wise at bmw@eigenvector.com.

Evaluating Models: Hating on R-squared

Jun 16, 2022

There are a number of measures used to evaluate the performance of machine learning/chemometric models for calibration, i.e. for predicting a continuous value. Most of these come down to reducing the model error, the difference between the reference values and the values predicted (estimated) by the model, to some single measure of “goodness.” The most ubiquitous measure is the coefficient of determination R2. Many people have produced examples of how data sets can be very different and still produce the same R2 (see for example What is R2 All About?). Those arguments are important but well known and I’m not going to reproduce them here. My main beef with R2 is that it is not in the units that I care about; it is in fact dimensionless. Models used in the chemical domain typically predict values like concentration (moles per liter or weight percent) or some other property value like tensile strength (kilo-newtons per mm2), so it is convenient to have the model error expressed in these terms. That’s why the chemometrics community tends to rely on measures like the root mean square error of calibration (RMSEC). This measure is in the units of the property being predicted.

We also want to know about “goodness” in a number of different circumstances. The first is when the model is being applied to the same data that was used to derive it. This is the error of calibration and we often refer to the calibration R2 or alternately the RMSEC. The second situation is when we are performing cross validation (CV) where the majority of the calibration data set is used to develop the model and then the model is tested on the left out part, generally repeating this procedure until all samples (observations) have been left out once. The results are then aggregated to produce the cross-validation equivalent of R2, which is Q2, or the cross validation equivalent of RMSEC, which is RMSECV. Finally, we’d like to know how models perform on totally independent test sets. For that we use the prediction Q2 and the RMSEP.
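These measures are simple to compute from reference values and predictions. Here is a minimal NumPy sketch (illustrative only, not Eigenvector code); the names `y`, `yhat_cal` and `yhat_cv` are hypothetical placeholders for the reference values, the calibration predictions and the cross-validated predictions:

```python
import numpy as np

def rmse(y, yhat):
    """Root mean square error, in the units of y (RMSEC, RMSECV or RMSEP
    depending on where yhat comes from)."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def r_squared(y, yhat):
    """Coefficient of determination: R2 for calibration predictions,
    Q2 when yhat comes from cross-validation or an independent test set."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical example: reference values plus calibration and CV predictions
rng = np.random.default_rng(0)
y = rng.normal(10.0, 2.0, 50)            # reference values, e.g. weight percent
yhat_cal = y + rng.normal(0.0, 0.3, 50)  # calibration predictions
yhat_cv = y + rng.normal(0.0, 0.5, 50)   # cross-validated predictions

rmsec, rmsecv = rmse(y, yhat_cal), rmse(y, yhat_cv)
r2, q2 = r_squared(y, yhat_cal), r_squared(y, yhat_cv)
```

Note that RMSEC and RMSECV come out in the units of the predicted property, while R2 and Q2 are dimensionless, which is the whole point of the discussion here.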

Besides the fact that R2 is not in the same units as the value being predicted, it has a non-linear relationship with it. In the plot below the RMSEC is plotted versus R2 for a synthetic example. 

Figure 1: Example of Non-linear Relationship between RMSEC and R2 for a synthetic example.
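The nonlinearity follows directly from the definitions: since R2 = 1 − SSE/SStot and RMSEC = sqrt(SSE/n), we get RMSEC = s_y · sqrt(1 − R2), where s_y is the (population) standard deviation of the reference values. A small sketch with an assumed spread of s_y = 2 (synthetic numbers, not the data behind Figure 1):

```python
import numpy as np

# From R2 = 1 - SSE/SS_tot and RMSEC = sqrt(SSE/n) it follows that
# RMSEC = s_y * sqrt(1 - R2), with s_y the population std of the y values.
s_y = 2.0  # assumed spread of the reference values (synthetic)
for r2 in (0.90, 0.95, 0.99):
    rmsec = s_y * np.sqrt(1.0 - r2)
    print(f"R2 = {r2:.2f} -> RMSEC = {rmsec:.3f}")
```

Because of the square root, equal-looking steps in R2 do not correspond to equal changes in prediction error, which is exactly why R2 is hard to interpret on its own.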

Overfitting is another performance measure of interest. The amount of overfit is the difference between the error of calibration (R2 or RMSEC) and the error of prediction, typically in cross-validation (Q2 or RMSECV). Generally, model error is lower in calibration than in cross-validation (or prediction, though that is subject to corruption by test set selection). So a somewhat common plot to make is R2-Q2 versus Q2. (I first saw these in work by David Broadhurst.) An example of such a plot is shown in Figure 2 on the left for a simple PLS model, where each point corresponds to a different number of latent variables (LVs). The problem with this plot is that neither axis is physically meaningful. We know that the best models should somehow be in the lower right corner, but how much difference is significant? And how does it relate to the accuracy with which we need to predict?

Figure 2: Example R2-Q2 versus Q2 plot for a PLS model (left) and RMSECV/RMSEC versus RMSECV (right). Number of PLS latent variables indicated.

I propose here an alternative: plot the ratio RMSECV/RMSEC versus RMSECV, as shown in Figure 2 on the right. Now the best model would be in the lower left corner, where both the amount of overfit and the cross-validation prediction error are small.
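To illustrate how overfit models show up in this ratio, the sketch below uses polynomial fits of increasing degree as stand-in models with leave-one-out cross-validation; the data and the model family are invented for the example (PLS itself is beside the point here). Overfit models drive RMSECV/RMSEC well above 1:

```python
import numpy as np

def fit_rmsec(x, y, degree):
    """RMSEC: error of a polynomial model applied to its own training data."""
    coef = np.polyfit(x, y, degree)
    return float(np.sqrt(np.mean((y - np.polyval(coef, x)) ** 2)))

def loo_rmsecv(x, y, degree):
    """RMSECV via leave-one-out cross-validation: refit with each sample
    left out once, then pool the prediction errors."""
    errs = []
    for i in range(x.size):
        mask = np.arange(x.size) != i
        coef = np.polyfit(x[mask], y[mask], degree)
        errs.append(y[i] - np.polyval(coef, x[i]))
    return float(np.sqrt(np.mean(np.square(errs))))

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 25)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.1, x.size)  # synthetic "reference" data

for degree in (3, 6, 9):
    rmsec, rmsecv = fit_rmsec(x, y, degree), loo_rmsecv(x, y, degree)
    print(f"degree {degree}: RMSECV = {rmsecv:.3f}, RMSECV/RMSEC = {rmsecv / rmsec:.2f}")
```

Plotting the resulting (RMSECV/RMSEC, RMSECV) pairs reproduces the proposed plot; the preferred model sits near the lower left corner.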

Lately I’ve been working with Gray Classical Least Squares (CLS) models, where Generalized Least Squares (GLS) weighting filters are used with CLS to improve performance. Without going into details here (for more info see this presentation), the model can be tuned with a single parameter, g, which governs the aggressiveness of the GLS filter. A data set of NIR spectra of a styrene-butadiene co-polymer system is used as an example (data from DuPont, thanks to Chuck Miller). The goal of the model is to predict the weight percent of each of the 4 polymer blocks. The R2-Q2 versus Q2 plots for the four reference concentrations are shown in the left panel of Figure 3, while the corresponding RMSECV/RMSEC versus RMSECV curves are shown on the right. The g parameter is varied from 0.1 (least filtering) to 0.0001 (most filtering).

Figure 3: R2-Q2 versus Q2 plot for a GLS-CLS model (left) and RMSECV/RMSEC versus RMSECV (right). Series for each analyte is a function of the GLS tuning parameter g.

As in Figure 2, approximately the same information is presented in each plot, but the RMSECV/RMSEC plot is more easily interpreted. For instance, the R2-Q2 plot would lead one to believe that the predictions for 1-2-butadiene were quite a bit better than for styrene, as its cross-validation Q2 is substantially better. However, the RMSECV/RMSEC plot shows that the models perform similarly, with an RMSECV around 0.8 weight percent. The difference in Q2 for these models is a consequence of the distribution of the reference values and is not indicative of a difference in model quality. The RMSECV/RMSEC plot also shows that the models are somewhat prone to overfitting, as the ratio climbs to rather high values for aggressive GLS filters. This is less obvious in the R2-Q2 plot, where it is not clear what a given difference really means in terms of amount of overfitting. And a given change in R2-Q2 is more significant in models with high Q2 than with lower Q2. The R2-Q2 plot would lead one to believe that the 1-2-butadiene model was not overfit much even at g = 0.0001, whereas the styrene model is. In fact, the RMSECV/RMSEC ratio is over 5 for both of these models at g = 0.0001, which is terribly overfit.

The RMSECV/RMSEC plot can be further improved if the reference error in the property being predicted is known. In general it is not possible for the apparent model performance to be better than this: even if the model were predicting the true, only-God-knows answer exactly, the apparent error would still be limited by the error in the reference values. Occasionally models do appear to predict better than the reference error, but this is generally a matter of luck. (And yes, it is possible for models to truly predict better than the reference error, but that is a demonstration for another time.) So it would be useful to add the reference error (root mean square or standard deviation), if known, as a vertical line.
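This floor is easy to demonstrate with a synthetic experiment: let a hypothetical perfect model predict the true values exactly, then score it against reference values that carry measurement noise (all numbers here are invented for illustration):

```python
import numpy as np

# A "perfect" model still appears no better than the reference error,
# because it can only be scored against noisy reference measurements.
rng = np.random.default_rng(7)
n, ref_error = 10_000, 0.5          # assumed reference (lab) error, in y units
y_true = rng.normal(10.0, 2.0, n)   # unknowable true property values
y_ref = y_true + rng.normal(0.0, ref_error, n)  # what the lab reports

# Score the perfect predictions (y_true) against the reference values
apparent_rmse = float(np.sqrt(np.mean((y_ref - y_true) ** 2)))
print(apparent_rmse)  # hovers near ref_error, never near zero
```

The apparent RMSE converges to the reference error, not to zero, which is why a vertical line at the reference error is a sensible addition to the plot.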

Based upon the results I’ve seen to date, I highly recommend the RMSECV/RMSEC plot over the R2-Q2 plot for assessing model performance. Model fit and cross-validation metrics are in units of the properties being predicted, and the plots are more linear with respect to changes in these metrics. This plot is easily made for PLS models in PLS_Toolbox and Solo, of course!

BMW

Python is free

Apr 5, 2022

PLS_Toolbox is not free. 

But you don’t have to be a dedicated data scientist to use PLS_Toolbox (or its stand-alone equivalent Solo). Many of its users are, but the real expertise of most users is in something else such as analytical instrumentation (typically spectroscopy) or the specific problem they are working on (e.g. chemical process control, disease detection, art provenance etc.). We made PLS_Toolbox because we believe that the best data analysts are the people that generated the data and have the physics, chemistry or engineering background to understand it.

PLS_Toolbox includes a very wide array of tools for pattern recognition, data visualization, sample classification and regression, plus many data preprocessing methods. There are tools for problems specific to spectroscopy, like calibration transfer, curve resolution and variable selection, plus particle analysis, batch modeling tools, and readers for all sorts of data files from various analytical instruments. It is pretty much one-stop shopping for most people who work with analytical chemistry and related data.

Unlike Python, when you use PLS_Toolbox you don’t have to decide first which of the 40 most popular libraries you need. It doesn’t require a nine page cheat sheet. You can use it from the command line if you want, and script it too, but the vast majority of analyses can be done using the highly refined point-and-click interfaces. And when you are comparing model results from different methods, you can be sure that they are evaluated in precisely the same way, apples to apples. 

And if you can’t figure out how to use a tool or think you’ve found a bug? There’s one email address to write to: helpdesk@eigenvector.com. We have five full time equivalents working on it, and one of them will get right back to you with help that’s actually helpful. We’ve been supporting it for more than 30 years (and we have no intention of stopping). Plus we have other data scientists on staff who can help you with your application when you really get in over your head. 

Yeah, Python is free. We like that about it too. That’s why we search through Python libraries to find the tools that PLS_Toolbox users will find useful. We then incorporate them with our wrappers and interfaces around them so they behave the way our users have come to expect. It’s why we say “we learned Python so you don’t have to.” 

So what’s your time worth? If you are someone who proudly displays Dr. in front of your name or Ph.D. after it, you are worth at least a couple hundred bucks an hour. Those hours of command line bullshittery add up pretty fast. Not to mention the opportunity cost of not being focused on the problem you’re actually trying to solve. 

So yes, PLS_Toolbox is not free. But for many if not most analytical scientists it is a better value proposition than Python alone. 

BMW

We used to call it “Chemometrics”

Feb 23, 2022

The term chemometrics was coined by Svante Wold in a grant application he submitted in 1971 while at the University of Umeå. Supposedly, he thought that creating a new term, (in Swedish it is ‘kemometri’), would increase the likelihood of his application being funded. In 1974, while on a visit to the University of Washington, Svante and Bruce Kowalski founded the International Chemometrics Society over dinner at the Casa Lupita Mexican restaurant. I’d guess that margaritas were involved. (Fun fact: I lived just a block from Casa Lupita in the late 70s and 80s.)

Chemometrics is a good word. The “chemo” part of course refers to chemistry and “metrics” indicates that it is a measurement science: a metric is a meaningful measurement taken over a period of time that communicates vital information about a process or activity, leading to fact-based decisions. Chemometrics is therefore measurement science in the area of chemical applications. Many other fields have their metrics: econometrics, psychometrics, biometrics. Chemical data is also generated in many other fields including biology, biochemistry, medicine and chemical engineering.

So chemometrics is defined as the chemical discipline that uses mathematical, statistical, and other methods employing formal logic to design or select optimal measurement procedures and experiments, and to provide maximum relevant chemical information by analyzing chemical data.

In spite of being a nearly perfect word to capture what we do here at Eigenvector, there are two significant problems encountered when using the term Chemometrics: 1) Despite the field existing for nearly five decades and having two dedicated journals (Journal of Chemometrics and Chemometrics and Intelligent Laboratory Systems), the term is not widely known. I still run into graduates of chemistry programs who have never heard it, and of course it is even less well known in the related disciplines, and less yet in the general population. 2) Many who are familiar with the term think it refers to a collection of primarily projection methods, (e.g. Principal Components Analysis (PCA), Partial Least Squares Regression (PLS)), and that other Machine Learning (ML) methods (e.g. Artificial Neural Networks (ANN), Support Vector Machines (SVM)) are therefore not chemometrics regardless of where they are applied. Problem number 2 is exacerbated by the current Artificial Intelligence (AI) buzz and the proclivity of managers and executives towards things that are new and shiny: “We have to start using AI!”

Typical advertisement presented when searching on Artificial Intelligence

This wouldn’t matter much if choosing the right terms weren’t so critical to being found. Search engines pretty much deliver what was asked for, so you have to be sure you are using terms that are actually being searched on. So what to use?

A common definition of artificial intelligence is the theory and development of computer systems able to perform tasks that normally require human intelligence. This is a rather low bar. Many of the models we develop make better predictions than humans could to begin with. But AI is generally associated with problems such as visual perception and speech recognition, things that humans are particularly adept at. These AI applications generally require very complex deep neural networks. And so while you could say we do AI, this feels like too much hyperbole, and certainly there are other arguments against using this term loosely.

Machine learning is the use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data. Most researchers (apparently) view ML as a subset of AI. Do a search on “artificial intelligence machine learning images” and you’ll find many Venn diagrams illustrating this. I tend to see it as the other way around: AI is the subset of ML that uses complex models to address problems like visual perception. I’ve always had a problem with the term “learning” as it anthropomorphizes data models: they don’t learn, they are parameterized! (If these models really do learn I’m forced to conclude that I’m just a machine made out of meat.) In any case, models from Principal Components Regression (PCR) through XGBoost are commonly considered ML models, so certainly the term machine learning applies to our software.

Google Search on ‘artificial intelligence machine learning’ with ‘images’ selected.
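The point above that models are parameterized rather than “taught” can be made concrete in a few lines of NumPy: fitting an ordinary least squares model is just solving for a coefficient vector, and that vector is the entire model. (A sketch with made-up toy data; this is an illustration of the idea, not PLS_Toolbox code.)

```python
import numpy as np

# "Learning" here is nothing more than parameter estimation: ordinary least
# squares solves for the coefficients in closed form. Toy data chosen so that
# y = 1 + 2*x exactly.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])   # column of ones (intercept) plus one predictor
y = np.array([1.0, 3.0, 5.0])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # the whole "learned" model is just these two numbers (≈ [1, 2])
```

Once `beta` is in hand, prediction is a single matrix multiply — which is exactly why such models are so easy to deploy.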

Process analytics is a much less used term and particular to chemical process data modeling and analysis. There are however conferences and research centers that use this term in their name, e.g. IFPAC, APACT and CPACT. Cheminformatics sounds relevant to what we do but in fact the term refers to the use of physical chemistry theory with computer and information science techniques in order to predict the properties and interactions of chemicals.

Data science is defined as the field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data. Certainly this is what we do at Eigenvector, but of course primarily in chemistry/chemical engineering where we have a great deal of specific domain knowledge such as the fundamentals of spectroscopy, chemical processes, etc. Thus the term chemical data science describes us pretty well.

So you will find that we will use the terms Machine Learning and Chemical Data Science a lot in the future though we certainly will continue to do Chemometrics!

BMW

Remembering Svante

Jan 6, 2022

Svante Wold passed away on January 4, 2022, at the age of 81. In the fields of chemometrics and data science, there is no need to use his last name. If you say “Svante,” everybody knows who you are talking about.

I first met Svante almost 30 years ago at the Chemometrics in Analytical Chemistry conference in Montreal, CAC-92. I was a pretty new Ph.D. at that point, having graduated the year before. I presented a talk, “Monitoring the Health of Multi-Channel Analytical Instruments with Multivariate Statistical Process Control (MSPC).” Before the meeting, Bruce Kowalski told me to be sure to talk to Svante. So at some point, perhaps after his talk, I walked up and introduced myself. The first thing Svante said to me was something to the effect of “We’ve been following your work on MSPC and find it very interesting.”

I was shocked. I had no idea such an influential professor would even want to talk to me, let alone have an interest in the work I’d been doing. But that was Svante. Always interested, positive and easy to talk to. Very encouraging to young people in the field, especially those who were working with latent variable/projection methods.

Attending many of the same conferences, we ran into each other frequently after that. I remember especially fondly the Gordon Research Conference (GRC) on Statistics in Chemistry and Chemical Engineering. I attended my first GRC in 1993. Roger Hoerl was chair that year and Svante was vice-chair, set to become chair the following year. Much to my surprise, Svante organized a side meeting to discuss development of PLS_Toolbox, which I had started distributing freely a couple of years earlier. The next thing I knew, “young Barry” (Svante’s nickname for me) had been elected vice-chair of the GRC. This meant that I would work with Svante in ’94 and become chair in ’95 (with Age Smilde as vice-chair).

The GRCs were exceedingly interesting scientifically and great fun. There was ample time for discussion, technical and otherwise. It was there that I discovered what a hoot it was to have beers with Svante. Many jokes and funny stories. (I still tell the Digger joke.) Svante always had a crowd around him. This of course continued at many more conferences and other gatherings over the years.

Svante had some unique views on data modeling. For instance, he once told me that he had found it useful to be somewhat lazy when modeling, and that philosophy had always served him well. “Don’t try too hard,” he said. This is good advice: the harder you dig and the more tightly you fit a model, the more likely you are to get a spurious correlation that won’t hold up. If what you are looking for isn’t readily apparent at first, you should be very cautious!

Along with my wife, Jill, I was very lucky to get to spend time with Svante and Nouna at their homes in Umeå, Boston and Hollis. We enjoyed their hospitality greatly. Unfortunately, we were never successful in talking them into coming out West to see us.

Barry M. Wise, Svante Wold, and Jill Wise at EAS in 2013

I’ll close with a picture that for me captures a bit of what was Svante. It was taken at Eastern Analytical Symposium (EAS) in 2013. We were all just goofing and having a good time at the President’s Ball. You can do a literature search and easily determine Svante’s vast influence on chemometrics and data science. But when I think of Svante I remember the fun times like this.

Rest in peace Svante, you will be long remembered and sorely missed.

BMW

Under Same (Old) Management

Oct 21, 2021

That’s not a headline you see very often. Usually it’s “Under New Management.” But here at Eigenvector Research we’re proud of our stability. I wrote the first version of our MATLAB-based PLS_Toolbox while I was in graduate school thirty-one years ago. I still oversee its development along with our other software products.

In 1990 Partial Least Squares (PLS) regression was still fairly novel. PLS_Toolbox 1.0 included it, of course, along with a non-linear version of PLS and a number of tools for Multivariate Statistical Process Control (MSPC) including Principal Components Analysis (PCA). The goal then, as it is now, was to bring new multivariate modeling methods to users in a timely fashion and in a consistent and easy to use package.

PLS_Toolbox 1.0 Manual, 1990.

Neal B. Gallagher joined me in 1995 to form Eigenvector Research, Inc. He has been contributing to PLS_Toolbox development for almost 27 years now, along with consulting and teaching chemometrics, (i.e. chemical data science). Our senior software developers R. Scott Koch, Bob Roginski and Donal O’Sullivan have been with us for a combined 45 years (18, 15 and 12 respectively). That continuity is one reason why our helpdesk is actually so helpful. When you contact helpdesk with a question or problem we can generally get you in touch with the staff involved in writing the original code.

To assure that continuity going forward we’ve brought some younger developers on board including Lyle Lawrence and Sean Roginski. (Lyle was still sleeping in a crib and Sean wasn’t born yet when PLS_Toolbox first came out. Ha!) Both have taken deep dives into our code and have been instrumental in the recent evolution of our software. Primarily on the consulting side of EVRI, Manny Palacios brings his youthful energy and extensive experience to our clients’ data science challenges.

PLS_Toolbox/Solo Analysis Interface with Integrated Deep Learning ANN from scikit-learn and TensorFlow.

Over the years we have developed and refined PLS_Toolbox along with our standalone software Solo, adding many, many new routines while advancing usability. Currently we are completing the process of integrating new methods from the Python libraries scikit-learn and TensorFlow into the soon to be released PLS_Toolbox/Solo 9.0. So when we bring you new methods, like Deep Learning Artificial Neural Networks (ANNDL, shown above) or Uniform Manifold Approximation and Projection (UMAP, below) you can be sure that they are implemented, tested, supported and presented in the way that you’ve come to expect in our software. They have the same preprocessing, true cross-validation, graphical data editing, plotting features, etc. as our other methods.

PCA of Mid-IR Reflectance Image of Excedrin Tablet with Corresponding UMAP Embeddings

Now, 25+ years in, we’re moving forward with the same vision we’ve had from the beginning: bring new modeling methods to the people that own the data in a consistent straightforward package. This same old management is working to assure that far into the future!

BMW

Lake Chelan Valley Scholarships for 2021 Announced

Aug 3, 2021

For 2021 the Lake Chelan Valley Scholarship Fund (LCVSF) will award 18 scholarships to college bound seniors and previous awardees currently attending. Recipients are from Chelan Valley schools including Chelan High School (CHS) and Manson High School (MHS). Included this year are three graduates from MHS class of ’21: Jonathan Fernandez, Cody Fitzpatrick and Zoe Thomas; seven members of CHS class of ’21: Cash Corrigan, Adrian J. Martinez, Nancy Carmona Palestino, Nancy Perez, Aden Slade, Dayana Vega-Ramirez and Ruby Wier; and eight renewals: Alvaro Arteaga, Emma McLaren, Olivia Nygreen, Sierra Rothlisberger, Casey Simpson, Joe Strecker, Addi Torgesen and Tobin Wier. Ms. Rothlisberger and Mr. Simpson are designated as the Betty Schmitten Art Scholarship recipients.

Each recipient will receive $2750 for a total of $49,500 in awards this year. Checks will be presented to the recipients at the flagpole at Chelan Riverwalk Park at 10 am, Saturday, August 7. 

Board member Dr. John Pleyte reflected on this year’s recipients, “We had a very good group of applicants this year, and we were especially impressed with the renewals. It’s great to see these scholars transition successfully from high school to college, and also from community college to four year schools.”

The LCVSF was made possible by Doug and Eva Dewar, who wished that their estates be used to help the children of the Chelan Valley. LCVSF was founded in 1991, and in that year five scholarships of $1000 each were awarded. The fund has grown substantially over the years through contributions from many people, with especially significant gifts from John Gladney, Ray Bumgardner, Don & Betty Schmitten, Marion McFadden, Virginia Husted, the Dick Slaugenhaupt Memorial and Irma Keeney. Now in its 31st year, LCVSF has awarded over $675,000 to Chelan Valley students since its inception.

LCVSF accepts applications from residents of the Chelan valley for undergraduate education. The awards are renewable for up to four years. LCVSF welcomes applications from graduating high school seniors as well as current college students and adults returning to school.

The LCVSF board includes Betsy Kronschnabel (President), Arthur Campbell, III, Linda Mayer (Secretary), Sue Clouse, Barry M. Wise, Ph.D. and John Pleyte, M.D. (Treasurer). For further information, please contact Barry Wise at bmw@eigenvector.com.

The Model_Exporter Revolution

Jan 28, 2021

The development of a machine learning model is typically a fairly involved process, and the software for doing it is commensurately complex. Whether it be a Partial Least Squares (PLS) regression model, an Artificial Neural Network (ANN) or a Support Vector Machine (SVM), there are a lot of calculations required to parameterize the model. These include everything from computing projections, matrix inverses and decompositions to fit and cross-validation statistics and optimization. Lots of loops and logic and checking of convergence criteria, etc.

Models supported by Model_Exporter

On the other hand, the application of these models, once developed, is typically quite straightforward. Most models can be applied to new data using a fairly simple recipe involving matrix multiplications, scalings, projections, activation functions, etc. There are exceptions, such as preprocessing methods like iterative Weighted Least Squares (WLS) baselining and models like Locally Weighted Regression (LWR), where you really don’t have a model per se, you have a data set and a procedure. (More on WLS and LWR in a minute!) But in the vast majority of cases effective models can be developed using methods whose predictions can be reduced to simple formulas.

Enter Model_Exporter. When you create any of the models shown at right (key to acronyms below) in PLS_Toolbox or Solo, Model_Exporter can take that model and create a numerical recipe for applying it to new data, including the supported preprocess steps. This recipe can be output in a number of formats, including MATLAB .m, Python .py or XML. And using our freely available Model_Interpreter, the XML file can be incorporated into Java, Microsoft .NET, or generic C# environments.
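To make the idea of an exported “recipe” concrete, here is a minimal sketch of what applying a mean-centered linear model to a new sample amounts to. The parameter values and names below are made up for illustration; the files Model_Exporter actually writes are of course more elaborate, but they reduce to arithmetic of this kind.

```python
import numpy as np

# Hypothetical exported parameters: at prediction time a centered linear
# (e.g. PLS-type) model boils down to a mean vector, a regression vector,
# and an offset that restores the centered response.
x_mean = np.array([0.5, 1.2, 0.8])   # calibration means (preprocessing step)
b = np.array([2.0, -1.0, 0.5])       # regression vector from the fitted model
y_mean = 3.0                          # offset for the centered response

def apply_model(x):
    """Apply the exported recipe: center, project, un-center."""
    return (x - x_mean) @ b + y_mean

x_new = np.array([0.6, 1.0, 0.9])
y_hat = apply_model(x_new)  # one subtraction, one dot product, one addition
```

No loops, no optimization, no convergence checks — which is why a recipe like this can run on a process control system or a hand-held instrument.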

So what does all this mean?

  • Total model transportability. Models can be built into any framework you need them in, from process control systems to hand-held analytical instruments.
  • Minimal footprint. Exported models have a very small footprint and minimal computing overhead. This means that they can be made to run with minimal memory and computing power.
  • Order of magnitude faster execution. The lightweight recipe produces predictions much faster than the original model.
  • Complete transparency. There’s no guessing as to exactly how the model gets from measurements to predictions, it’s all there.
  • Simplified model validation. Don’t validate the code that makes the model, validate the model!

This is why our customers in many industries, from analytical instrument developers to the chemical process industries, are getting their models online using Model_Exporter. It is creating a revolution in how online models are generated and executed.

And what about those cases like WLS and LWR noted above? We’re working to create add-ons so exported models can utilize these functions too. Look for them, along with some additional model types, in the next release.

Is it for everybody? Well not quite. There are still times where you need a full featured prediction engine like our Solo_Predictor that has built in communication protocols (e.g. socket connections), scripting ability, and can run absolutely any model you can make in PLS_Toolbox or Solo (like hierarchical and even XGBoost). But we’re seeing more and more instances of companies utilizing the advantages of Model_Exporter.

Join the Model_Exporter revolution for the compact, efficient and seamless application of your machine learning models!

BMW

Models Supported: Principal Components Analysis (PCA), Multiple Linear Regression (MLR), Principal Components Regression (PCR), Partial Least Squares Regression (PLS), Classical Least Squares (CLS), Artificial Neural Networks (ANN), Support Vector Machine Regression (SVM), Partial Least Squares Discriminant Analysis (PLS-DA), Support Vector Machine Discriminant Analysis (SVM-DA), Artificial Neural Network Discriminant Analysis (ANN-DA). Coming Soon: Locally Weighted Regression (LWR), Logistic Regression Discriminant Analysis (LREG-DA).

Preprocessing Methods Supported: General scaling and centering options including Mean and Median Centering, Autoscaling, Pareto and Poisson Scaling, Multiplicative Scatter Correction (MSC), Savitzky-Golay Smoothing and Derivatives, External Parameter Orthogonalization (EPO), Generalized Least Squares Weighting (GLSW), Extended Least Squares, Standard Normal Variate (SNV), 1-, 2-, infinity and n-norm Normalization, Fixed point spectral baselining. Coming Soon: Iterative Weighted Least Squares Baselining (WLS) and Whittaker Baselining.
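For a flavor of what two of the listed preprocessing steps do, here is a short NumPy sketch of Standard Normal Variate (SNV) and mean centering. This is my own illustration (with the sample standard deviation assumed for SNV), not PLS_Toolbox code.

```python
import numpy as np

def snv(X):
    """SNV: scale each spectrum (row) to zero mean and unit std. deviation."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def mean_center(X, mean=None):
    """Subtract the column-wise calibration mean; reuse it for new data."""
    mean = X.mean(axis=0) if mean is None else mean
    return X - mean, mean

spectra = np.array([[1.0, 2.0, 3.0],
                    [2.0, 4.0, 6.0]])  # second row is the first scaled by 2
Xs = snv(spectra)                      # SNV removes that multiplicative effect
Xc, cal_mean = mean_center(Xs)         # cal_mean would be stored with the model
```

Note that both rows of `Xs` come out identical: SNV has removed the purely multiplicative difference between the two spectra, which is exactly the kind of scatter effect it is designed to correct.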

New Year, New Courses and Webinars

Jan 12, 2021

In response to the COVID-19 pandemic, in 2020 we offered many online courses, over 100 hours of lectures in total (not counting the classes we did for specific clients). We gave ourselves a break towards the end of the year but we’re now re-energized and ready to get going again.

Our first webinar of the year is scheduled for January 21, EVRIthing You Need to Know About Logistic Regression. Our Manuel Palacios will describe this important classification technique which is used in machine learning, medical applications, social sciences and many other fields. Logistic regression was added in the recently released PLS_Toolbox and Solo Version 8.9 and of course Manny will show you how to use it!

On February 16 Basic Chemometrics PLUS Online commences. This instructor-led live short course includes 7 classes drawn from our Eigenvector University series. The “Basic” part covers linear algebra, Principal Components Analysis (PCA) and common regression methods including Partial Least Squares (PLS). Together these classes provide an onramp to machine learning in the chemical data sciences.

The “PLUS” part includes 4 classes that are making their online debut: Calibration Model Maintenance, Classical Least Squares (CLS), Multivariate Curve Resolution (MCR) and Robust Methods.

Model Maintenance Road Map

I’d like to draw special attention to the Calibration Model Maintenance course. It’s an unfortunate fact that most multivariate data-derived models do not last forever. There are many things that can change (e.g. instrument behavior, feed stream compositions, etc.) and cause model outputs to be biased or noisy. Worse still, in many instances no plan has been developed to deal with this reality. In Calibration Model Maintenance we’ll go over our Roadmap (at right) and also cover methods and tools (such as instrument standardization/calibration transfer methods and methods to detect model output drift) that can be used as part of an overall model maintenance strategy.

Courses on CLS and MCR are new online offerings but have been mainstays in our live Eigenvector University syllabus. The Robust Methods course, offered less frequently, covers methods to automatically identify and reject outliers so that models can be built on the consensus of the data. These methods are especially useful in data sets with large numbers of outliers or “bad data,” typical of chemical process data.

Hope to see you at one of our webinars or courses soon. Stay healthy!

BMW