Model Transparency, Validation & Model_Exporter
Feb 20, 2008
With the advent of the US Food and Drug Administration’s (FDA) Process Analytical Technology (PAT) Initiative, the possibilities for putting multivariate models on-line in pharmaceutical applications increased dramatically. In fact, the Guidance for Industry on PAT explicitly lists multivariate tools for design, data acquisition and analysis as PAT tools. This opens the door for the use of analytical techniques which rely on multivariate calibration to produce estimates of product quality. An example would be using NIR with PLS regression to estimate the concentration of API in a blending operation.
That said, any multivariate model that is run in a regulated environment is going to have to be validated. I found a good definition of validate on the web: “To give evidence that a solution or process is correct.” So how do you show that a model is correct? It seems to me that the first step is to understand what it is doing. A multivariate model is really nothing more than a numerical recipe for turning a measurement into an answer. What’s the recipe?
Enter Model_Exporter. Model_Exporter is an add-on to our existing multivariate modeling packages PLS_Toolbox and Solo. Model_Exporter takes models generated by PLS_Toolbox and Solo and turns them into a numerical recipe in an XML format that can be implemented in almost any modern computer language. It also generates m-scripts that can be run in MATLAB or Octave, and Tcl for use with Symbion.
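To make the “numerical recipe” idea concrete, here is a minimal sketch in Python of what applying an exported PLS model amounts to. The numbers below are made up for illustration (they are not from an actual Model_Exporter file), but the structure is typical: the recipe stores the preprocessing parameters (here, calibration means for mean-centering) and the final regression vector, so prediction is just arithmetic.

```python
import numpy as np

# Illustrative "recipe" contents -- hypothetical values, not a real export.
x_mean = np.array([0.82, 0.75, 0.69, 0.64])  # mean spectrum from calibration
b = np.array([1.2, -0.4, 0.9, 0.3])          # regression vector
y_mean = 5.0                                 # mean of calibration y

def predict(x):
    """Apply the exported recipe to one measured spectrum x."""
    x_centered = x - x_mean          # preprocessing step: mean-centering
    return y_mean + x_centered @ b   # regression step

x_new = np.array([0.85, 0.70, 0.72, 0.66])
y_hat = predict(x_new)
```

Every coefficient in the recipe is visible, which is exactly the transparency point: a reviewer can trace each arithmetic step from raw measurement to predicted value.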
But the main point here is that Model_Exporter makes models transparent. All of the mathematical steps (and the coefficients used in them), including preprocessing, are right there for review. Is the model physically and chemically sensible? Look and see.
The next step in validation is to show that the model behaves as expected. This would include showing that, once implemented, the model produces the same results on the training data as the software that produced the model. One should also show that the model produces the same (acceptable) results on additional test sets that were not used in the model development.
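That comparison step can be sketched simply: the pass criterion is that the implemented recipe reproduces the originating software’s predictions to within numerical round-off. The values here are made up for illustration.

```python
import numpy as np

# y_ref: predictions from the originating software (e.g. PLS_Toolbox)
# y_impl: predictions from the implemented/exported recipe
# (hypothetical values, for illustration only)
y_ref = np.array([4.98, 5.12, 5.03])
y_impl = np.array([4.98, 5.12, 5.03])

# Maximum absolute disagreement over the training set
max_diff = np.max(np.abs(y_ref - y_impl))
```

A tolerance on `max_diff` near machine precision confirms the implementation; the separate test-set check then addresses whether the model itself performs acceptably.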
What about the software that produced the model in the first place? Should it be validated? Going back to the definition of validate, that would require showing that the modeling software produces answers that are correct. OK, well, for PLS regression, correct would have to mean that it calculates factors that maximize the covariance between the scores in X and the scores in Y. That’s great, but what does it have to do with whether the model actually performs as expected? Really, not much. Does that mean it’s not important? No, but assuring software accuracy won’t assure model performance.
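That covariance-maximization property is itself easy to check numerically. For a single y, the first PLS weight vector is proportional to X'y, and (by Cauchy–Schwarz) no other unit-norm weight vector gives scores with higher covariance with y. A quick sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
X = X - X.mean(axis=0)  # mean-center
y = y - y.mean()

# First PLS weight vector for a single y: w proportional to X'y
w = X.T @ y
w /= np.linalg.norm(w)
cov_pls = (X @ w) @ y  # covariance of scores t = Xw with y

# Any other unit-norm weight direction gives covariance <= cov_pls
for _ in range(100):
    v = rng.standard_normal(5)
    v /= np.linalg.norm(v)
    assert (X @ v) @ y <= cov_pls + 1e-12
```

This is the kind of property an algorithm validation would verify, and it says nothing about whether any particular calibration model is fit for purpose, which is exactly the point.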
Upon reading a draft of this post, Rasmus wrote:
Currently software validation focuses on whether the algorithm achieves what’s claimed, e.g. that a correlation is correctly calculated. This is naturally important and also a focus point for software developers anyhow. However, this sort of validation is not terribly important for validating that a specific process model is doing what it is supposed to. This is similar to thoroughly checking the production facility for guitars in order to check that Elvis is making good music. There are so many decisions and steps involved in producing a good prediction model and the quality of any correlation estimates in the numerical algorithms are of quite insignificant importance compared to all the other aspects. Even with a ‘lousy’ PLS algorithm excellent models could be made if there is a good understanding of the problem.
So when you start thinking about preprocessing options and how many ways there are to get to models with different recipes but similar performance, and also how it’s possible, by making bad modeling choices, to get a bad model with software that’s totally accurate, it’s clear that models should be validated, not the software that produces them. And that’s why Model_Exporter is so useful: it makes models transparent, which simplifies model validation.