Data Sets

The following data sets are available to download from the Eigenvector Archive. All of them are stored as ZIPPED MATLAB files. The archive includes:

CGL_NIR: Prediction of casein, glucose and lactate from NIR

A full three-component mixture design (DOE) for predicting casein, glucose, lactate and moisture (wt%) from NIR spectra (117 wavelengths, 1104 to 2495 nm), from Tormod Naes and Tomas Isaksson.

The data is stored in a MAT file (CGL_nir.mat) containing DataSet objects as described below:

PropVals: Y-block of property values [231x4 dataset]
 Spectra: X-block of NIR spectra     [231x117 dataset] 
    Xcal: Calibration X-block        [153x117 dataset]
   Xtest: Test X-block               [ 78x117 dataset]
    Ycal: Calibration Y-block        [153x4 dataset]
   Ytest: Test Y-block               [ 78x4 dataset]

IASIM16 Challenge: NIR Images of Melamine in Wheat Gluten

This data set includes NIR hyperspectral images presented as a challenge problem for the International Association for Spectral Imaging Meeting 2016. The measurements were made to evaluate algorithms for detecting melamine particles in wheat gluten. The NIR reflectance images were measured by G. Israelson of Nestlé Purina on an Opotek HySpec(TM) imaging system on October 28, 2010.

The data is stored as DataSet Objects as described below:

IASIM_2016_Challenge.pdf: Information about the data
            ForTheJudges: Folder with a solution 
   Wheat_Gluten_Pure.mat: [243x244 pixels x 229 dataset]
       Melamine_Pure.mat: [243x244 pixels x 229 dataset]
            M_200ppm.mat: [243x244 pixels x 229 dataset]
              Test_1.mat: [243x244 pixels x 229 dataset]
              Test_2.mat: [243x244 pixels x 229 dataset]
              Test_3.mat: [243x244 pixels x 229 dataset]

IR Image of an Excedrin Tablet

Infrared image of an Excedrin tablet. The tablet contains aspirin, acetaminophen, caffeine and microcrystalline cellulose. It was measured with a tunable laser from 1800 to 800 cm-1 over an approximately 2 mm square area, and the image is stored as a DataSet object. The image was provided by Agilent (

NIR of Corn Samples for Standardization Bench-marking

This data set consists of 80 samples of corn measured on 3 different NIR spectrometers. The wavelength range is 1100-2498 nm at 2 nm intervals (700 channels). The moisture, oil, protein and starch values for each of the samples are also included. A number of NBS glass standards were also measured on each instrument. The data was originally taken at Cargill. Many thanks to Mike Blackburn for letting us distribute it.

The data is stored as DataSet Objects as described below:

information: [20x59 char]     Information about the data
     m5spec: [80x700 dataset] Spectra on instrument m5
    mp5spec: [80x700 dataset] Spectra on instrument mp5
    mp6spec: [80x700 dataset] Spectra on instrument mp6
   propvals: [80x4 dataset]   Property values for samples
      m5nbs: [3x700 dataset]  NBS glass stds on m5
     mp5nbs: [4x700 dataset]  NBS glass stds on mp5
     mp6nbs: [4x700 dataset]  NBS glass stds on mp6
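
As a quick check on the stated wavelength axis, 1100 to 2498 nm in 2 nm steps does work out to 700 channels. A minimal Python sketch of that arithmetic follows. The variable names are ours and are not part of the data files, which carry their own axis scale.

```python
# Rebuild the wavelength axis described above: 1100-2498 nm in 2 nm steps.
# Names are illustrative only; they do not come from the .mat file.
start_nm, stop_nm, step_nm = 1100, 2498, 2

# range() excludes its end point, so extend it by one step to include 2498.
wavelengths = list(range(start_nm, stop_nm + step_nm, step_nm))

print(len(wavelengths))                 # number of channels
print(wavelengths[0], wavelengths[-1])  # first and last wavelength in nm
```

The length of this list should match the second dimension of each spectra block (80x700).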

Metal Etch Data for Fault Detection Evaluation

This data set consists of the engineering variables from a LAM 9600 Metal Etcher recorded over the course of etching 129 wafers. It includes 108 normal wafers taken during 3 experiments (numbers 29, 31 and 33) and 21 wafers with intentionally induced faults taken during the same experiments. Note that the experiments were run several weeks apart, and data from different experiments have different means and somewhat different covariance structures. The experiment number is part of each wafer name, found in the calib_names field for the normal wafers and in the test_names field for the faulty wafers.

For more information about this data set, please see: B.M. Wise, N.B. Gallagher, S.W. Butler, D.D. White, Jr. and G.G. Barna, “A Comparison of Principal Components Analysis, Multi-way Principal Components Analysis, Tri-linear Decomposition and Parallel Factor Analysis for Fault Detection in a Semiconductor Etch Process”, J. Chemometrics, 13, 379-396 (1999).

The data is stored as a MATLAB structure array. The specific fields in this structure array are described below:

  INFORMATION: [ 29x63 char]  Information about the data
  calibration: {108x1  cell}  The normal or calibration wafers
  calib_names: [108x9  char]  Names of the calibration wafers
         test: { 21x1  cell}  The test or faulty wafers      
   test_names: [ 21x9  char]  Names of the test wafers       
  fault_names: [ 21x9  char]  Names of the specific faults   
    variables: [ 21x14 char]  Names of the variables

Near Infrared Spectra of Diesel Fuels

These data consist of NIR spectra of diesel fuels along with various properties of those fuels, including:

  • bp50 – boiling point at 50% recovery, deg C (ASTM D 86)
  • CN – cetane number (analogous to octane number, but for diesel; ASTM D 613)
  • d4052 – density, g/mL, @ 15 deg C, (ASTM D 4052)
  • freeze – freezing temperature of the fuel, deg C
  • total – total aromatics, mass% (ASTM D 5186)
  • visc – viscosity, cSt, @ 40 deg C

These data are available in three formats: MATLAB DataSet objects, standard MATLAB variables, and CSV files. The data was obtained at Southwest Research Institute (SWRI) on a project sponsored by the U.S. Army. Many thanks to them for letting us post it here!

DataSet Object Format

The file “” contains a .mat file which can be loaded into MATLAB. This .mat file contains two DataSet objects: one holds all the raw, unpreprocessed spectra (diesel_spec) and the other holds all the properties (diesel_prop). Some of the properties were not measured on some of the samples, so diesel_prop has some missing values (NaNs) in it. The wavelength axis is included as axisscale in diesel_spec. If you don’t have PLS_Toolbox or our freeware for the DataSet Object, these two variables should turn into structures when you load them into MATLAB.

Standard Matlab Variable Format

The following are .zip files of separate .mat files, each with standard MATLAB variables containing the same data as above. There are 6 workspace variables in each file: 3 for the spectra and 3 matching ones for the property values. In each case the data includes 20 high leverage samples (_hl), and the remaining samples are split into two random groups (_ll_a and _ll_b). These spectra can be used to test variable selection and calibration algorithms. For instance, you can use the high leverage samples and one of the other sets to make a calibration model (say the _hl and _ll_a), then test it on the third set (the _ll_b). In all cases the data have been pretty thoroughly weeded: outliers have been removed, and all samples belong to the same class (all summer fuels, no winter fuels).
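
The calibration/test split suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the numbers are made-up placeholders rather than real spectra, and the prefixes spec and prop are hypothetical stand-ins for whatever the variables are called in each file. Only the _hl, _ll_a and _ll_b suffixes come from the description.

```python
# Placeholder data only: each "spectrum" is a short list of made-up numbers.
# The spec/prop prefixes are hypothetical; only the suffixes are documented.
spec_hl = [[0.10, 0.20], [0.90, 0.80]]    # high leverage samples
prop_hl = [1.0, 9.0]
spec_ll_a = [[0.40, 0.50], [0.50, 0.40]]  # first random group
prop_ll_a = [4.0, 5.0]
spec_ll_b = [[0.45, 0.55]]                # second random group
prop_ll_b = [4.5]

# Calibration set: high leverage samples plus one of the remaining groups.
x_cal = spec_hl + spec_ll_a
y_cal = prop_hl + prop_ll_a

# Test set: the other group, held out while the model is built.
x_test, y_test = spec_ll_b, prop_ll_b

print(len(x_cal), len(x_test))
```

Swapping the roles of _ll_a and _ll_b gives a second calibration/test pair from the same file.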

All of the files end in GATEST because we’ve used the data to test genetic algorithms for variable selection.

CSV Format

The file “” contains two .csv files: one holds all the raw, unpreprocessed spectra (diesel_spec) and the other holds all the properties (diesel_prop). Some of the properties were not measured on some of the samples, so diesel_prop has some missing values (NaNs) in it. The wavelength axis is included as axisscale in diesel_spec.
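
Because diesel_prop contains NaNs, any code reading the properties CSV needs to skip samples where the property of interest is missing. Here is a minimal Python sketch of one way to do that. The column layout (a header row of property names, one row per sample) is an assumption, and the values are made-up placeholders rather than entries from the real file.

```python
import csv
import io
import math

# Hypothetical stand-in for a properties CSV. The layout and the numbers are
# assumptions for illustration; NaN marks a property that was not measured.
csv_text = """bp50,CN,d4052
250.0,NaN,0.84
NaN,48.5,0.83
255.0,50.1,0.85
"""

reader = csv.DictReader(io.StringIO(csv_text))
# float("NaN") parses to a floating-point NaN, so no special casing is needed.
rows = [{name: float(value) for name, value in row.items()} for row in reader]

# Keep only the samples where the property of interest was measured.
cn_rows = [row for row in rows if not math.isnan(row["CN"])]

print(len(rows), len(cn_rows))
```

To read a real file, replace io.StringIO(csv_text) with an open file handle, and apply the same row selection to the matching spectra so that the X and Y blocks stay aligned.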

NIR Spectra of Pharmaceutical Tablets from “Shootout”

In 2002, the International Diffuse Reflectance Conference (IDRC) published a “Shootout” data set consisting of spectra from 654 pharmaceutical tablets from two spectrometers. The data is divided up into calibration, validation and test sets. We’ve converted it to our MATLAB DataSet Object format for your computing pleasure. It’s a great data set for chemometrics training and algorithm testing.

The original information on this data can be found here: