
Advanced Preprocessing for Spectroscopic Applications

Course Description

Preprocessing is the term used for the transformations applied to data before the actual modeling step (regression, classification, etc.), and it is often the key to successful chemometric/machine learning model development. Spectroscopic data poses its own unique problems, and opportunities, due to its highly structured nature, and a large number of preprocessing methods have been developed for use with it. This course surveys these methods and shows how they can be applied, evaluated, and optimized.

The objective of data preprocessing is to remove extraneous variance so that the variance of interest can be more easily modeled. Extraneous variance includes instrument artifacts (nonlinearities, baseline drift, etc.), sample presentation issues (scatter, temperature, etc.), interferent species, and other non-idealities. The objective of spectroscopic data preprocessing is to maximize signal-to-clutter (S/C), where clutter is defined as extraneous variance and data anomalies that can 'distract' model development. Maximizing S/C is a different paradigm from maximizing signal-to-noise, and a firm understanding of the preprocessing algorithms and their objectives can lead to more efficient and effective model development.
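As a concrete illustration of removing extraneous variance, the sketch below (a hypothetical numpy example, not course material) applies the Standard Normal Variate (SNV) transform to two simulated measurements of the same sample that differ only by an additive baseline offset and a multiplicative scatter factor. SNV centers and scales each spectrum individually, so both sources of clutter vanish and the measurements coincide:

```python
import numpy as np

# Simulated "spectra": the same chemical signal measured twice, once with
# an added multiplicative scatter factor and a baseline offset (clutter).
signal = np.sin(np.linspace(0, np.pi, 50))   # underlying variance of interest
spec_a = signal                               # ideal measurement
spec_b = 2.5 * signal + 0.7                   # scatter (x2.5) + offset (+0.7)

def snv(x):
    """Standard Normal Variate: center and scale one spectrum."""
    return (x - x.mean()) / x.std()

# After SNV the clutter is gone and the two measurements agree.
assert np.allclose(snv(spec_a), snv(spec_b))
```

Because the additive offset shifts the spectrum's mean and the multiplicative factor scales its standard deviation, subtracting the mean and dividing by the standard deviation removes both exactly in this idealized case.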

Advanced Preprocessing for Spectroscopic Applications starts with a brief review of basic preprocessing methods to demonstrate how they work toward the objective of maximizing S/C and how they can be misused. The course then delves into more advanced topics such as multiplicative scatter correction (MSC), extended multiplicative scatter correction (EMSC), and generalized least squares-like weighting. Examples focus on spectroscopic applications, although many of the methods are directly extensible to other types of data. The mathematical principles behind the preprocessing methods will also be covered. The course includes hands-on computer time for participants to work example problems using PLS_Toolbox or Solo.
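To give a flavor of the scatter-correction methods mentioned above, here is a minimal numpy sketch of basic MSC (not the PLS_Toolbox/Solo implementation). Each spectrum is regressed against a reference, here the mean spectrum, and corrected using the fitted offset and slope:

```python
import numpy as np

wavelengths = np.linspace(0, np.pi, 60)
pure = np.sin(wavelengths)  # common spectral shape shared by all samples
# Three measurements with different additive/multiplicative scatter effects:
X = np.vstack([b * pure + a for a, b in [(0.1, 1.0), (0.5, 1.8), (-0.2, 0.6)]])

def msc(X):
    """Multiplicative Scatter Correction against the mean spectrum."""
    ref = X.mean(axis=0)
    out = np.empty_like(X)
    for i, row in enumerate(X):
        slope, offset = np.polyfit(ref, row, 1)  # fit row ~= slope*ref + offset
        out[i] = (row - offset) / slope          # undo the fitted scatter terms
    return out

Xc = msc(X)
# All corrected spectra collapse onto the common reference shape.
assert np.allclose(Xc[0], Xc[1]) and np.allclose(Xc[1], Xc[2])
```

In this idealized case each spectrum is an exact affine function of the reference, so the correction is perfect; with real data the fit is approximate, which motivates the extended and windowed variants covered in the course.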

Prerequisites

Linear Algebra for Machine Learning and Chemometrics, Chemometrics I: Principal Components Analysis, and Chemometrics II: Regression and PLS, or equivalent experience.

Course Outline

  • Software and Data Sets
  • Preprocessing Objectives
  • Centering and Scaling
    • Mean Centering
    • Autoscaling
    • Other scaling: autoscaling with offset, Poisson, log decay, other weighting 
  • Baseline Removal
    • Preselected points
    • Fitting polynomials
    • Automated methods
    • Detrend
  • Alignment
    • Aligning matrices (Alignmat)
    • Matching variables (Matchvars)
    • Correlation Optimized Warping (COW)
  • Normalization
    • Norms: 1, 2 and infinity
    • Standard Normal Variate
  • Linearization
    • Matrix rank and the bilinear model
    • Using arithmetic functions
    • Transmission versus absorbance 
  • Smoothing and Derivatives
    • Smoothing operators
    • Derivatives
    • Savitzky-Golay, gap-segment
  • Scatter Correction
    • Multiplicative Scatter Correction (MSC)
    • Extended Multiplicative Scatter Correction (EMSC)
    • Windowed MSC
  • Clutter
    • Definition
    • Sources
    • Ways to estimate clutter
  • Orthogonalization Filters
    • External Parameter Orthogonalization (EPO)
    • Generalized Least Squares weighting (GLS)
    • Orthogonal Signal Correction (OSC)
  • Using the Model Optimizer
    • Snapshots
    • Surveying preprocessing options 
  • Other topics
    • Scaling for Multi-block data
    • Preprocessing order
    • Undoing preprocessing
    • Missing data
    • Compression
  • Conclusions
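As a final illustration of one outlined topic, the sketch below uses scipy's `savgol_filter` (standing in for the PLS_Toolbox/Solo tools used in class, and chosen here only as an assumption for illustration) to show how a Savitzky-Golay first derivative removes an additive baseline offset:

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.linspace(0, np.pi, 101)
peak = np.exp(-((x - np.pi / 2) ** 2) / 0.1)  # a single spectral band
shifted = peak + 0.5                           # same band on a baseline offset

# Savitzky-Golay first derivative: local quadratic fit, differentiated once.
d_peak = savgol_filter(peak, window_length=11, polyorder=2, deriv=1)
d_shift = savgol_filter(shifted, window_length=11, polyorder=2, deriv=1)

# The constant baseline vanishes under differentiation.
assert np.allclose(d_peak, d_shift)
```

A constant offset only changes the intercept of each local polynomial fit, so the first-derivative coefficient is unaffected; sloped baselines would similarly vanish under a second derivative, at the cost of amplified noise, which is why the smoothing window and polynomial order matter.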