Dec 26, 2023
“The age-old concept of data modeling with spectroscopy has been revitalized through the integration of Machine Learning and AI initiatives. The pharmaceutical industry, in particular, is embracing data science, exploring the potential of deep learning and AI tools in spectroscopy applications. The emergence of open-source tools adds transparency to the ‘black box’ of these advanced technologies, sparking discussions around regulatory concerns.”
I’ve been doing data modeling and spectroscopy for 35+ years now and I’ve never felt particularly un-vital (vital–adjective; full of energy, lively). However, there is certainly more vibrancy in chemical data science right now largely due to the hype surrounding Artificial Intelligence/Deep Learning (AI/DL). Rasmus Bro wrote me with “I feel like AI/DL has sparked a new energy and suddenly we are forced to think more about what we are and how we are.” But the part of this bullet that really got my attention is the part I’ve underlined. I wasn’t at the meeting so I don’t know exactly what was said, but I must take issue with the idea that open-source has anything to do with the transparency of ‘black-box’ models.
First off, there is a pervasive confusion between the software that generates models and the models themselves–these are two separate things. The software determines the path that is followed to arrive at a model. But in the end, how you get to a model doesn’t matter (except perhaps in terms of efficiency). What’s important is where you wound up. That’s why I’ve said many times: VALIDATE THE MODEL, NOT THE SOFTWARE THAT PRODUCED IT! I first heard Jim Tung of The MathWorks say this about 25 years ago and it is still just as true. The model is where the rubber meets the road. And so it is also true that, ultimately, transparency hinges on the model.
Why do we care?
Backing up a bit, why do we care if data models are transparent, i.e. explainable or interpretable? In some low-impact low-risk cases (e.g. a recommender system for movies) it really doesn’t matter. But in a growing number of applications machine learning models are used to control systems that affect people’s heath and safety. In order to trust these systems, we need to understand how they work.
So what must one do to produce a transparent model? In data modeling ‘black-box’ is defined as ‘having no access to or understanding of the logic which the model uses to produce results.’ Open-source has nothing to do with the transparency or lack thereof in ‘black-box’ models. It is a requirement of course that for transparency you need to have access to the numerical recipe that constitues the model itself, i.e. the procedure by which a model takes an input and creates an output. This is a necessary condition of transparency. But it doesn’t matter if you have access to the source code that generated or implements e.g. a deep Artificial Neural Network (ANN) model if you don’t actually understand how it is making its predictions. The model is still black as ink.
This is the crux.
Getting to Transparency
The first step to creating transparent models is to be clear about what data went into them (and what data didn’t). Data models are first and foremost a reflection of the data upon which they are based. Calibration data sets should be made available and the logic which was used to include or exclude data should be documented. Sometimes we’re fortunate enough to be able use experimental design to create or augment calibration data sets. In these cases, what factors were considered? This gives a good indication of the domain where the model would be expected to work, what special cases or variations it may or may not work with, and what biases may have been built in.
The most obvious road to transparency includes the use interpretable models to begin with. We’ve been fans of linear factor-based methods like Partial Least Squares (PLS) since day one because of this. It is relatively easy to see how these models make their predictions. There is also a good understanding of the data preprocessing steps commonly used with them. Linear models can be successfully extended to non-linear problems by breaking up the domain. Locally Weighted Regression (LWR) is one example for quantitative problems while Automated Hierarchical Models (AHIMBU) is an example for qualitative (classification) models. In both cases interpretable PLS models are used locally and the logic by which they are constructed and applied is clear.
With complex non-linear models, e.g. multi-layer ANNs, Support Vector Machines (SVMs), and Boosted Regression Trees (XGBoost), it is much more difficult to create transparency as it has to be done post-facto. For instance, the Local Interpretable Model-Agnostic Explanations (LIME) method perturbs samples around the data point of interest to create a locally weighted linear model. This local model is then interpreted (which begs the question ‘why not use LWR to begin with?’). In a somewhat similar vein, Shapley values indicate the effect of inclusion of a variable on the prediction of a given sample. The sum of these estimated effects (plus the model offset) equal the prediction for the sample. Both LIME and SHAP explain the local behavior of the model around specific data points.
It is also possible to explore the global behavior of models using perturbation and sensitivity tests as a function of input values. Likewise, visualizations such as the one below can be created that give insight into the behavior of complex models, in this case an XGBoost model for classification.
To summarize, transparency, aka “Explainable AI” is all about understanding how models, that are the outputs of Machine Learning software, behave. Transparency can either be built-in or achieved through interrogation of the models.
Transparency at Eigenvector
Users of our PLS_Toolbox have always had access to its source code, including both the code used to identify models and the code used to apply models to new data, along with the model parameters themselves. It is not “open-source” in the sense that you can “freely copy and redistribute it,” (you can’t) but it is in the sense that you can see how it works, which in the context of transparent models is the critical aspect. Our Solo users don’t have the source code, but we’re happy to tell you exactly how something is computed. In fact we have people on staff who’s job it is to do this. And our Model_Exporter software can create numerical recipes of our models that make them fully open and transportable. So with regards to being able to look inside the computations involved with developing or applying models we have you covered.
In terms of understanding the behavior of the black-box models we support (ANNs, SVMs, XGBoost) we now offer Shapley values and have expanded our model perturbation tests for elucidating the behavior of non-linear models. At EAS I presented “Understanding Nonlinear Model Behavior with Shapley Values and Variable Sensitivity Measures” with Sean Roginski, Manuel A. Palacios and Rasmus Bro. These methods are going to play a key role in the use of NL models going forward.
A Final Word
There are a lot of reasons why one might care about model transparency. We like transparency because it increases the level of trust we have in our models. In our world of chemical data science/chemometrics we generally want to assure that models are making their predictions based on the chemistry, not some spurious correlation. We might also want to know what happens outside the boundary of the calibration data. To that end we recommend in our courses and consulting projects that modeling always begin with linear models as they are much more informative, you (the human on the other side of the screen) stand a good chance of actually learning something about the problem at hand. Our sense is that black-box models are currently way over-used. That’s the part of the AI/DL hype cycle we are in. I agree with the sentiments expressed in Why are we using Black-Box models in AI When we Don’t Need to? and it includes a very interesting example. Clearly, we are going to see continued work on explainable/interpretable machine learning because it will be demanded by those that are impacted by the model responses. And rightly so!