Data Sets
The following data sets are available to download from the Eigenvector Archive. All of them are stored as ZIPPED MATLAB files. The archive includes:
- CGL_NIR, Prediction of grain protein from NIR
- IASIM16 Challenge, NIR Images of Melamine in Wheat Gluten
- IR Image of an Excedrin Tablet, useful for testing image analysis algorithms
- NIR spectra of corn samples, useful for standardization and preprocessing bench-marking
- Semiconductor metal etch, including known faults
- NIR spectra of diesel fuels, for testing variable selection and calibration algorithms
- Pharmaceutical tablets, NIR “Shootout” data set consisting of spectra from 654 pharmaceutical tablets from two spectrometers
- ArtImageDataA, NIR Images of Oil Paints
CGL_NIR: Prediction of grain protein from NIR
Full three-component mixture design DOE for prediction of casein, glucose, lactate and moisture (wt%) from NIR (117 wavelengths, 1104 to 2495 nm) from Tormod Naes and Tomas Isaksson.
The data is stored in a MAT file (CGL_nir.mat) containing DataSet objects as described below:
PropVals: Y-block of property values [231x4 dataset]
Spectra: X-block of NIR spectra [231x117 dataset]
Xcal: Calibration X-block [153x117 dataset]
Xtest: Test X-block [ 78x117 dataset]
Ycal: Calibration Y-block [153x4 dataset]
Ytest: Test Y-block [ 78x4 dataset]
IASIM16 Challenge: NIR Images of Melamine in Wheat Gluten
This data set includes NIR hyperspectral images presented as a challenge problem for the International Association for Spectral Imaging Meeting 2016. The measurements were made to evaluate algorithms for detection of melamine particles in wheat gluten. NIR reflectance images were measured by G. Israelson of Nestlé Purina on an Opotek HySpec(TM) imaging system on October 28, 2010.
The data is stored as DataSet Objects as described below:
IASIM_2016_Challenge.pdf: Information about the data
ForTheJudges: Folder with a solution
Wheat_Gluten_Pure.mat: [243x244 pixels x 229 dataset]
Melamine_Pure.mat: [243x244 pixels x 229 dataset]
M_200ppm.mat: [243x244 pixels x 229 dataset]
Test_1.mat: [243x244 pixels x 229 dataset]
Test_2.mat: [243x244 pixels x 229 dataset]
Test_3.mat: [243x244 pixels x 229 dataset]
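Many pixel-wise detection methods first unfold an image cube into a table with one row per pixel and one column per channel. The sketch below works through the bookkeeping for the sizes listed above. It assumes that 243 and 244 are the image rows and columns, that 229 is the number of spectral channels, and that pixels are unfolded in column-major order; none of these details is confirmed by the listing, and the function names are illustrative only.

```python
# Cube dimensions taken from the listing above (assumed rows x columns x channels).
N_ROWS, N_COLS, N_CHANNELS = 243, 244, 229

# Unfolding turns the cube into a (pixels x channels) table.
n_pixels = N_ROWS * N_COLS
print(n_pixels, N_CHANNELS)  # 59292 229


def pixel_to_row(r, c):
    """Map a 0-based (row, column) pixel to its 0-based row in the unfolded
    table, assuming column-major order (image columns stacked one after another)."""
    if not (0 <= r < N_ROWS and 0 <= c < N_COLS):
        raise IndexError((r, c))
    return c * N_ROWS + r


def row_to_pixel(i):
    """Inverse of pixel_to_row: recover (row, column) from an unfolded row index."""
    if not 0 <= i < n_pixels:
        raise IndexError(i)
    c, r = divmod(i, N_ROWS)
    return r, c
```

The inverse mapping is what lets a per-pixel detection result be folded back into an image for display.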
IR Image of an Excedrin Tablet
Infrared image of an Excedrin tablet. The tablet contains aspirin, acetaminophen, caffeine and microcrystalline cellulose. It was measured with a tunable laser from 1800 to 800 cm-1 over an approximately 2 mm square, and the image is stored as a DataSet object. The image was provided by Agilent (www.agilent.com).
NIR of Corn Samples for Standardization Bench-marking
This data set consists of 80 samples of corn measured on 3 different NIR spectrometers. The wavelength range is 1100-2498 nm at 2 nm intervals (700 channels). The moisture, oil, protein and starch values for each of the samples are also included. A number of NBS glass standards were also measured on each instrument. The data was originally taken at Cargill. Many thanks to Mike Blackburn for letting us distribute it.
The data is stored as DataSet Objects as described below:
information: [20x59 char]Information about the data
m5spec: [80x700 dataset] Spectra on instrument m5
mp5spec: [80x700 dataset] Spectra on instrument mp5
mp6spec: [80x700 dataset] Spectra on instrument mp6
propvals: [80x4 dataset] Property values for samples
m5nbs: [3x700 dataset] NBS glass stds on m5
mp5nbs: [4x700 dataset] NBS glass stds on mp5
mp6nbs: [4x700 dataset] NBS glass stds on mp6
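The stated wavelength range and spacing can be checked against the 700 columns of the spectra. The short sketch below rebuilds the axis from the numbers given in the description; the helper name `channel_index` is illustrative and is not part of the data set.

```python
# Rebuild the wavelength axis described in the text: 1100-2498 nm in 2 nm steps.
START_NM = 1100
STOP_NM = 2498
STEP_NM = 2

wavelengths = list(range(START_NM, STOP_NM + STEP_NM, STEP_NM))

# The channel count should match the 700 columns of m5spec, mp5spec and mp6spec.
n_channels = len(wavelengths)
print(n_channels, wavelengths[0], wavelengths[-1])  # 700 1100 2498


def channel_index(wavelength_nm):
    """Return the 0-based column index of a wavelength on this axis."""
    offset = wavelength_nm - START_NM
    if offset < 0 or wavelength_nm > STOP_NM or offset % STEP_NM != 0:
        raise ValueError(f"{wavelength_nm} nm is not on the axis")
    return offset // STEP_NM
```

Python indices start at 0, so add 1 to get the corresponding MATLAB column number.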
Metal Etch Data for Fault Detection Evaluation
This data set consists of the engineering variables from a LAM 9600 Metal Etcher over the course of etching 129 wafers. The data consists of 108 normal wafers taken during 3 experiments (numbers 29, 31 and 33) and 21 wafers with intentionally induced faults taken during the same experiments. Note that the experiments were run several weeks apart and data from different experiments has a different mean and somewhat different covariance structure. The experiment number is in the name of the wafer in the calib_names and test_names fields, respectively.
For more information about this data set, please see: B.M. Wise, N.B. Gallagher, S.W. Butler, D.D. White, Jr. and G.G. Barna, “A Comparison of Principal Components Analysis, Multi-way Principal Components Analysis, Tri-linear Decomposition and Parallel Factor Analysis for Fault Detection in a Semiconductor Etch Process”, J. Chemometrics, 13, 379-396 (1999).
The data is stored as a MATLAB structure array. The specific fields in this structure array are described below:
INFORMATION: [ 29x63 char] Information about the data
calibration: {108x1 cell} The normal or calibration wafers
calib_names: [108x9 char] Names of the calibration wafers
test: { 21x1 cell} The test or faulty wafers
test_names: [ 21x9 char] Names of the test wafers
fault_names: [ 21x9 char] Names of the specific faults
variables: [ 21x14 char] Names of the variables
Near Infrared Spectra of Diesel Fuels
These data consist of NIR spectra of diesel fuels along with various properties of those fuels including:
- bp50 – boiling point at 50% recovery, deg C (ASTM D 86)
- CN – cetane Number (like Octane number only for diesel, ASTM D 613)
- d4052 – density, g/mL, @ 15 deg C, (ASTM D 4052)
- freeze – freezing temperature of the fuel, deg C
- total – total aromatics, mass% (ASTM D 5186)
- visc – viscosity, cSt, @ 40 deg C
There are three formats of these data: Matlab DataSet objects, Standard Matlab variables, and CSV files. This data was obtained at Southwest Research Institute (SWRI) on a project sponsored by the U.S. Army. Many thanks to them for letting us post it here!
DataSet Object Format
The file “SWRI_Diesel_NIR.zip” contains a .mat file which can be loaded into MATLAB. This .mat file contains two DataSet objects: one holds all the raw, unpreprocessed spectra (diesel_spec) and the other holds all the properties (diesel_prop). Some of the properties were not measured on some of the samples, so diesel_prop has some missing values (NaNs) in it. The wavelength axis is included as axisscale in diesel_spec. If you don’t have PLS_Toolbox or our freeware for the DataSet Object, these two variables should turn into structures when you load them into MATLAB.
Standard Matlab Variable Format
The following are .zip files of separate .mat files, each with standard Matlab variables containing the same data as above. There are 6 workspace variables in each file, 3 for the spectra and 3 matching ones for the property value. In each case the data includes 20 high leverage samples (_hl) and the remaining samples are split into two random groups (_ll_a and _ll_b). These spectra can be used to test variable selection and calibration algorithms. For instance, you can use the high leverage samples and one of the other sets to make a calibration model (say the _hl and _ll_a), then test it on the third set (the _ll_b). In all cases the data have been pretty thoroughly weeded: outliers removed, and all samples belong to the same class (all summer fuels, no winter fuels).
All of the files end in GATEST because we’ve used the data to test genetic algorithms for variable selection.
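The calibration and test scheme suggested above can be sketched as follows. This is only an illustration of how the three groups combine: the data are synthetic placeholders, the sizes of the two random groups and the channel count are made up, and the variable names merely echo the _hl, _ll_a and _ll_b suffixes rather than the actual workspace variable names.

```python
import random

random.seed(0)

N_CHANNELS = 5  # hypothetical; the real spectra have many more channels


def fake_block(n_samples):
    """Return synthetic (spectra, property) lists standing in for one group."""
    spectra = [[random.random() for _ in range(N_CHANNELS)] for _ in range(n_samples)]
    prop = [random.random() for _ in range(n_samples)]
    return spectra, prop


# Placeholder groups: 20 high leverage samples plus two random groups.
x_hl, y_hl = fake_block(20)
x_ll_a, y_ll_a = fake_block(30)  # group size is made up
x_ll_b, y_ll_b = fake_block(30)  # group size is made up

# Calibration set: high leverage samples plus one random group.
x_cal = x_hl + x_ll_a
y_cal = y_hl + y_ll_a

# Test set: the remaining random group, kept out of calibration entirely.
x_test, y_test = x_ll_b, y_ll_b

print(len(x_cal), len(x_test))  # 50 30
```

Swapping the roles of the two random groups gives a second calibration and test pair from the same files.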
CSV Format
The file “SWRI_Diesel_NIR_CSV.zip” contains two .csv files: one holds all the raw, unpreprocessed spectra (diesel_spec) and the other holds all the properties (diesel_prop). Some of the properties were not measured on some of the samples, so diesel_prop has some missing values (NaNs) in it. The wavelength axis is included as axisscale in diesel_spec.
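Because the property file contains NaNs, each property usually has to be filtered before it is used for calibration. The sketch below shows one way to do that with the Python standard library. The CSV text is an invented stand-in: the presence of a header row, the column names used as headers, and all of the values are assumptions, so check the layout of the real files before adapting it.

```python
import csv
import io
import math

# Hypothetical stand-in for a properties CSV; the real column layout may differ.
sample_csv = io.StringIO(
    "bp50,CN,freeze\n"
    "250.1,48.2,NaN\n"
    "NaN,51.0,-12.5\n"
    "262.4,NaN,-10.0\n"
)

reader = csv.DictReader(sample_csv)
# float("NaN") parses to a floating-point NaN, so missing entries survive the conversion.
rows = [{name: float(value) for name, value in row.items()} for row in reader]


def column_without_missing(rows, name):
    """Return the values of one property, skipping samples where it is NaN."""
    return [row[name] for row in rows if not math.isnan(row[name])]


cn_values = column_without_missing(rows, "CN")
print(cn_values)  # [48.2, 51.0]
```

When building a model for one property, the same NaN test should be used to drop the matching rows of the spectra so that samples stay aligned.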
NIR Spectra of Pharmaceutical Tablets from “Shootout”
In 2002, the International Diffuse Reflectance Conference (IDRC) published a “Shootout” data set consisting of spectra from 654 pharmaceutical tablets from two spectrometers. The data is divided up into calibration, validation and test sets. We’ve converted it to our MATLAB DataSet Object format for your computing pleasure. It’s a great data set for chemometrics training and algorithm testing.
The original information on this data can be found here: http://www.idrc-chambersburg.org/shootout_2002.htm
NIR Images of Oil Paints
This dataset contains NIR hyperspectral images of 24 samples of artist oil paints applied to canvas. The paints contain known proportions of blue pigments: Prussian blue, heliogen blue, and ultramarine blue added to an oil binder. Three sets of individual HSI images of each of the samples were acquired in random sequence (sets A, B, then C). Mosaic images (240 × 240 pixels) were created, containing 24 image sub-regions (each 40 × 60 pixels) selected from the three image sets. Each image contains 207 wavelengths.
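The mosaic geometry is self-consistent: 24 sub-regions of 40 × 60 pixels cover exactly 240 × 240 pixels. The sketch below computes the pixel bounds of each sub-region under two assumptions that the description does not confirm: that each sub-region is 40 pixels tall and 60 wide (giving 6 rows of 4 tiles), and that tiles are numbered left to right, then top to bottom.

```python
MOSAIC_ROWS, MOSAIC_COLS = 240, 240
TILE_ROWS, TILE_COLS = 40, 60  # assumed orientation: 40 pixels tall, 60 wide
N_TILES = 24

tiles_down = MOSAIC_ROWS // TILE_ROWS    # 6 tiles per column of the mosaic
tiles_across = MOSAIC_COLS // TILE_COLS  # 4 tiles per row of the mosaic
assert tiles_down * tiles_across == N_TILES


def tile_bounds(tile_index):
    """Return (row_start, row_stop, col_start, col_stop) for a 0-based tile
    index, assuming tiles run left to right, then top to bottom.
    Stop values are exclusive, as in Python slices."""
    if not 0 <= tile_index < N_TILES:
        raise IndexError(tile_index)
    r, c = divmod(tile_index, tiles_across)
    return (r * TILE_ROWS, (r + 1) * TILE_ROWS, c * TILE_COLS, (c + 1) * TILE_COLS)


print(tile_bounds(0))   # (0, 40, 0, 60)
print(tile_bounds(23))  # (200, 240, 180, 240)
```

If the sub-regions turn out to be 60 tall and 40 wide, swapping TILE_ROWS and TILE_COLS gives the corresponding 4 by 6 layout.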