FAQ: Frequently Asked Questions
Issue:

How are the ROC curves calculated for PLSDA?
Possible Solutions:

The ROC curves are based on the predicted y-values for each of your samples. These values are not discrete zeros and ones but range continuously from around zero to around one (take a look at a plot of y predicted to see what I mean). Each point in an ROC curve (or pair of points at a given threshold value in the "threshold" plots on the right-hand side of the ROC figure) comes from calculating the sensitivity and specificity for a given threshold value. Specificity is calculated as the fraction of "not-in-class" samples which fall below the given threshold. Sensitivity is calculated as the fraction of "in-class" samples which fall above the given threshold.
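The sensitivity and specificity calculation described above can be sketched as follows. This is an illustrative example only (it is not the PLS_Toolbox code); the function name, the example data, and the use of NumPy are all assumptions made for the sketch.

```python
import numpy as np

def sens_spec(y_pred, in_class, threshold):
    """Sensitivity and specificity at one threshold.

    y_pred    : continuous predicted y-values (hypothetical example data)
    in_class  : boolean array, True for "in-class" samples
    threshold : the threshold value being evaluated
    """
    y_pred = np.asarray(y_pred, dtype=float)
    in_class = np.asarray(in_class, dtype=bool)
    # Sensitivity: fraction of in-class samples above the threshold
    sensitivity = np.mean(y_pred[in_class] > threshold)
    # Specificity: fraction of not-in-class samples below the threshold
    specificity = np.mean(y_pred[~in_class] < threshold)
    return sensitivity, specificity
```

Sweeping the threshold from below the smallest predicted y-value to above the largest traces out the full ROC curve, one (sensitivity, specificity) pair per threshold.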
These are empirical curves in that they are calculated directly from the data, not from a model of the distribution of the data, so there will be some "stepping". In fact, with smaller sample sizes, the curves may NEVER be smooth because sensitivity and specificity only change (up or down) when the threshold moves past a sample's predicted y-value. For example, if the number of "not-in-class" samples above a threshold of 0.46 is no different from the number above 0.45, these two thresholds technically give the same specificity. As of version 3.5.4 of PLS_Toolbox, we actually calculate only "critical" thresholds (those that actually make a difference in the sensitivity and specificity curves) and interpolate between them. Even then, a multimodal distribution of y-predictions for either in-class or out-of-class samples will lead to non-smooth curves.
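The "critical threshold" idea can be illustrated with a short sketch: since sensitivity and specificity only change when the threshold crosses a sample's predicted y-value, the sample values themselves are the only thresholds worth evaluating. Again, this is a hypothetical illustration and not the PLS_Toolbox implementation.

```python
import numpy as np

def empirical_roc(y_pred, in_class):
    """Empirical ROC evaluated only at "critical" thresholds.

    The candidate thresholds are the unique predicted y-values, because
    sensitivity and specificity cannot change between two thresholds that
    no sample's prediction falls between (hence the "stepping").
    """
    y_pred = np.asarray(y_pred, dtype=float)
    in_class = np.asarray(in_class, dtype=bool)
    thresholds = np.unique(y_pred)  # sorted critical thresholds
    # Sensitivity: fraction of in-class samples above each threshold
    sens = np.array([np.mean(y_pred[in_class] > t) for t in thresholds])
    # Specificity: fraction of not-in-class samples below each threshold
    spec = np.array([np.mean(y_pred[~in_class] < t) for t in thresholds])
    return thresholds, sens, spec
```

With only a handful of samples, these arrays contain only a handful of distinct points, which is exactly why small-sample ROC curves look stepped rather than smooth.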
The cross-validated versions of the curves are determined using the same procedure outlined above, except that we use the y-value predicted for each sample when it was left out of the calibration set (during cross-validation). One might assume that doing multiple replicate cross-validation subsets would lead to smoother cross-validation curves. Two things keep this from happening:
First, before version 4.0 of PLS_Toolbox, the software does not actually average the predicted y-values from multiple replicates. It only remembers the predicted y-value from the LAST time a given sample was left out.
Second, even if the above "issue" weren't there, the curves would only get smoother if the different sets of samples left out during each cross-validation replicate induced a significant change in the model, and thus in the predicted y-value for a sample. If the models calculated in each cycle are essentially the same, there will be little to no variation in the predicted y-values and the curves will appear very similar for all replicates. In fact, significant variation in the predicted y-value from one subset to the next is an indication that the cross-validation is unstable (e.g., outliers in the data, too little data, or "critical" good samples which, when left out, prevent a good model from being calculated).
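The cross-validated predictions described above can be sketched with a simple leave-one-out loop. This is a minimal illustration of the idea only: an ordinary least-squares fit stands in for the PLS model, and the function name and example data are assumptions, not PLS_Toolbox code.

```python
import numpy as np

def loo_predictions(X, y):
    """Leave-one-out predicted y-values, one per sample.

    Each sample's prediction comes from a model calibrated WITHOUT that
    sample, which is what the cross-validated ROC curves are built from.
    A least-squares fit stands in here for the PLS regression step.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    y_cv = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i  # leave sample i out of calibration
        coef, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        y_cv[i] = X[i] @ coef     # predict the left-out sample
    return y_cv
```

Feeding `y_cv` (instead of the calibration predictions) into the threshold sweep gives the cross-validated ROC curve. If the per-fold models barely differ, `y_cv` barely differs from the calibration predictions, which is why replicate subsets rarely smooth the curves.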
See also: prediction probability and threshold calculations
Still having problems? Check our documentation Wiki or try writing to our helpdesk at helpdesk@eigenvector.com.