A Recipe for Failure
Jun 6, 2007
Recently, I’ve seen numerous examples of what we call the “throw the data over the wall” model of data analysis, mostly in the form of “challenge problems.” Typically, participants are given data sets, without much background information, and asked to analyze them or develop models to make predictions on new data sets. Tests of this sort are of questionable value, as this mode of data analysis is really a recipe for failure. In some instances, I’ve wondered if the tests weren’t set up intentionally to produce failure. So it got me to thinking about how to set up a test to best meet this sinister objective.
If you want to assure failure in data analysis, just follow this recipe:
1) Don’t tell the analyst anything about the chemistry or physics of the system that generated the data. Just present it as a table of numbers. This way the analyst won’t be able to use any physical reasoning to help make choices in data pre-treatment or interpretation of results.
2) Be sure not to include any information about the noise levels in the measurements or the reference values. The analyst will have a much harder time determining when models are over-fit or under-fit (see the first sketch after this list). And for sure don’t tell the analyst if the noise has an unusual distribution.
3) Leave the data in a form where its relationship to the properties of interest is non-linear. Alternatively, preprocess it in a way that “adds to the rank” (second sketch below).
4) Don’t tell the analyst anything about how the data was obtained, e.g. whether it is “happenstance” or designed data. If it’s designed, make sure that the data is taken in such a way that system drift and/or other environmental effects are confounded with the factors of interest (third sketch below).
5) Make sure the data sets are “short and fat,” i.e. with many times more variables than samples. This will make it much harder for the analyst to recognize the issues related to points 2), 3) and 4). And it will make it especially fun for the analyst if the problem has anything to do with variable selection (fourth sketch below).
6) Compare the results obtained to those from an expert who knows everything about the pedigree of the data. Additionally, it’s useful if the expert has a lot of preformed “opinions” about what the results of the analysis should be.
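To see why point 2) hurts, here’s a minimal sketch (in Python; the noise level, data and models are all invented for illustration) of an analyst trying models of increasing complexity. Fit error always drops as complexity grows; only knowledge of the noise level tells you where to stop.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                            # true noise level -- hidden from the analyst
x = np.linspace(-1, 1, 30)
y = np.sin(np.pi * x) + rng.normal(0, sigma, x.size)

for degree in (1, 5, 15):
    coeffs = np.polyfit(x, y, degree)
    rmse = np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
    print(f"degree {degree:2d}: fit RMSE = {rmse:.3f}")
# RMSE falls with every added term. Only by comparing it to sigma can you
# tell that degree 5 is adequate and degree 15 is mostly fitting noise.
```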
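Point 3) is about the way non-linearity inflates the pseudo-rank of an otherwise simple bilinear (PCA/PLS-type) description of the data. A toy illustration, assuming a single analyte seen through a quadratic detector non-linearity:

```python
import numpy as np

rng = np.random.default_rng(1)
c = rng.uniform(0, 1, 50)              # concentrations of a single analyte
s = rng.normal(size=100)               # its pure-component "spectrum"

X = np.outer(c, s)                     # Beer's-law-like response: exactly rank 1
X_nl = X + 0.3 * X**2                  # element-wise quadratic detector non-linearity

for name, data in (("linear    ", X), ("non-linear", X_nl)):
    sv = np.linalg.svd(data, compute_uv=False)
    print(name, "top singular values:", np.round(sv[:3], 2))
# The linear data needs one component; the same one-analyte system needs
# two once the non-linearity is in play -- it "adds to the rank."
```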
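As for point 4), here’s what happens when the runs of a designed experiment are executed in order of factor level instead of being randomized (again, all numbers are invented): slow drift masquerades as a factor effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
factor = np.repeat([0.0, 1.0], n // 2)  # all low-level runs first, then all high
drift = 0.05 * np.arange(n)             # slow instrument drift over run order
y = drift + rng.normal(0, 0.1, n)       # the response depends ONLY on the drift

print(f"apparent factor effect: {np.polyfit(factor, y, 1)[0]:.2f}")  # ~1.0, spurious

rng.shuffle(factor)                      # a randomized run order instead
print(f"after randomization:    {np.polyfit(factor, y, 1)[0]:.2f}")  # near zero
```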
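Finally, point 5): when variables vastly outnumber samples, naive variable selection will happily “discover” strong predictors in pure noise. A quick demonstration:

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_vars = 15, 2000            # "short and fat"
X = rng.normal(size=(n_samples, n_vars))
y = rng.normal(size=n_samples)          # y is pure noise, unrelated to X

r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_vars)])
print(f"best |correlation| among {n_vars} random variables: {np.abs(r).max():.2f}")
# Typically well above 0.7: a naive search would "find" a predictor
# where none exists.
```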
If you stick to this recipe, you can certainly improve the odds of making the analyst, and/or the software they are using, look bad. Note, however, that it won’t guarantee failure, as some analysts are very knowledgeable about the limitations of their methods and are very good with their tools. Obviously, if your goal is to make software and/or chemometric methods unsuccessful, choose a less knowledgeable analyst in addition to following 1) – 6) above.
Of course, if you actually want to succeed at data analysis, don’t follow the recipe. At EVRI we’re firm believers in understanding where the data comes from, and in learning anything about it that will help us make intelligent modeling choices. Software can do a lot, and we’re making it better all the time, but you just can’t replace a well-educated and informed analyst.
BMW