Why Many Proteomic “Probabilities” Aren’t

Probability scores make search engine results easier to interpret. However, it is important to understand what they mean in order to avoid assigning more significance to the data than it warrants.

We continue to find researchers who mistakenly believe that there is only one correct way to compute a probability, and that the probability calculated by well-respected programs must be correct.

In fact, there can be as many different statistical models as there are modellers, and some of the best-known probability scores are simply scores and not true probabilities. The difference? Probabilities need to sum to 1 for mutually exclusive outcomes, while scores do not.

For instance, before a horse race, it helps greatly to know that your favorite horse has only a 2% probability of “not winning” (i.e. a 98% probability of winning). However, it would not help nearly as much to know that your horse has a 2% probability of “matching the characteristics of a winning horse by random chance” (i.e. falling within the same height and weight range as known winning horses), since several contenders may score similarly. The first is a true probability, while the second is simply a score expressed in probabilistic terms.
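The distinction can be checked mechanically. Below is a minimal sketch with made-up numbers for a four-horse race; the horse names, the values, and the helper `is_probability_distribution` are all hypothetical and only illustrate the sum-to-1 test.

```python
# Hypothetical numbers for a four-horse race (illustrative only).
# True probabilities of winning: exactly one horse wins, so they sum to 1.
win_probability = {"A": 0.70, "B": 0.15, "C": 0.10, "D": 0.05}

# "Probability of resembling a winner by random chance" for each horse.
# Each value looks like a probability, but nothing ties them together.
random_match_score = {"A": 0.02, "B": 0.02, "C": 0.03, "D": 0.20}


def is_probability_distribution(values, tol=1e-9):
    """True if all values lie in [0, 1] and sum to 1 (within tol)."""
    values = list(values)
    in_range = all(0.0 <= v <= 1.0 for v in values)
    return in_range and abs(sum(values) - 1.0) <= tol


print(is_probability_distribution(win_probability.values()))
print(is_probability_distribution(random_match_score.values()))
```

The first set passes the test and the second does not, even though every individual number in both sets lies between 0 and 1.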

Mowse/Mascot Ionscores are not Probabilities

The Mowse score, used in peptide mass fingerprinting, is a “similarity score” derived using a statistical model that calculates the “probability of matching N peaks by random chance”. It does so by assigning such a probability value to each matched m/z peak using a training set of protein sequences, and multiplying all such probability values to compute the composite probability P, which for convenience is expressed as -10 log(P).

It is a useful scoring method that provides a higher score when there are more matched peaks or when a peak is judged to be more rare.

However, the Mowse score is a score and not a true probability, since there is no requirement that a higher score for one protein sequence will reduce the score for other protein sequences.
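To make the arithmetic concrete, here is a minimal sketch of a Mowse-style composite score. It is not the actual Mowse or Mascot implementation: the per-peak probabilities are invented rather than taken from a training set, the logarithm is assumed to be base 10, and the name `mowse_style_score` is ours.

```python
import math


def mowse_style_score(peak_probabilities):
    """Return -10 * log10(P), where P is the product of the per-peak
    random-match probabilities (peaks are treated as independent)."""
    # Summing logs is equivalent to taking the log of the product,
    # and avoids underflow when many small probabilities are multiplied.
    log_p = sum(math.log10(p) for p in peak_probabilities)
    return -10.0 * log_p


# Hypothetical per-peak probabilities for two candidate proteins.
candidate_a = [0.1, 0.1, 0.01]
candidate_b = [0.1, 0.1, 0.01, 0.05]  # one additional matched peak

score_a = mowse_style_score(candidate_a)
score_b = mowse_style_score(candidate_b)
print(score_a, score_b)
```

Each candidate is scored on its own. Adding a matched peak to candidate B raises B's score but leaves A's score unchanged, which is why these values cannot be read as probabilities over competing candidates.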

The Mascot ionscore is directly derived from Mowse. It uses the Mowse scoring methodology on each ion series individually, and picks the highest score among all the ion series as the composite score. Like Mowse, Mascot assumes that all the m/z peaks in an ion series (say all the b+ ions) are independent, which is a mathematical simplification that is clearly inconsistent with tandem mass spec data.

The Mascot score has proven to be especially useful for scoring tandem mass spectra with high-accuracy (< 10 ppm) fragment mass data, where the significance of each matched peak is high. (The Mascot model does not vary the assigned peak probability with fragment mass accuracy, which may limit its theoretical applicability for ion trap spectra of poorer quality.)

While Mowse and Mascot ionscores have proven their usefulness where they apply, they are not true probabilities and should not be treated as such. For example, it is meaningless to talk about error bars when using these values.

This is true for other useful statistically based similarity scores as well, such as for phosphorylation site localization (Beausoleil et al, Nature Biotech 2006 doi:10.1038/nbt1240) and for combined MS2/MS3 scoring (Olsen & Mann, PNAS 2004 Sep 14).

Neither are P-Values

The P-Value (and its close cousin the E-Value) is a useful statistical construct adopted from the genomics world. Unlike a similarity score that measures how closely the top peptide candidate matches the measured spectrum, the P-Value is a “dissimilarity score” that measures how different the top peptide candidate is from the rest of the search space at large. (For those familiar with SEQUEST, the parameter “deltaCn” serves a similar function, albeit in a less sophisticated manner, and is not probabilistic in value.)

We believe P-Values and E-Values were first introduced into proteomics with the X! Tandem search engine (Fenyo and Beavis, Anal Chem 2003, 75).

The E-Value is an empirically derived “expected value” of how many peptides can achieve a particular score by random chance for a particular spectrum, which is computed by extrapolating the decaying exponential distribution of all the peptide scores for that spectrum. The P-Value is the probability analog computed by dividing the E-Value by the number of candidates.
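The extrapolation described above can be sketched as follows. This is only one plausible reading of the procedure, not the X! Tandem code: we assume the number of candidates scoring at or above a given score decays exponentially, fit a straight line to the logarithm of that count over all background candidates, and evaluate the line at the top score. The function names and the synthetic scores are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # fails if all scores are identical
    return slope, mean_y - slope * mean_x


def e_value(background_scores, top_score):
    """Expected number of candidates reaching top_score by chance,
    extrapolated from the decay of the background score counts."""
    ordered = sorted(background_scores)
    n = len(ordered)
    # For the i-th lowest score, n - i candidates score at least that high.
    log_counts = [math.log(n - i) for i in range(n)]
    slope, intercept = fit_line(ordered, log_counts)
    return math.exp(intercept + slope * top_score)


def p_value(e, n_candidates):
    """P-Value as described in the text: E-Value / number of candidates."""
    return e / n_candidates


# Synthetic background: 100 candidates with exponentially decaying counts.
n = 100
background = [-math.log((n - i) / n) for i in range(n)]
e = e_value(background, top_score=10.0)
print(e, p_value(e, n))
```

A real implementation would likely fit only the well-behaved part of the distribution and handle tied scores; both are ignored here for brevity.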

(It is interesting to note that the genomics field has a more rigorous approach to statistics than proteomics today, and would not mistakenly call similarity scores or P-Values “probabilities.” It helps that top genomics practitioners like Stephen Altschul and Eric Lander got their math PhDs long before they did much biology.)

Like the similarity score, the P-Value is also a score and not a probability.

PeptideProphet computes Probabilities

Unlike the similarity scores and “dissimilarity scores” above, the values computed by the PeptideProphet algorithm (Keller, Nesvizhskii, et al, Anal Chem 2002, 74), which rescores peptide search engine results from SEQUEST, Mascot, X! Tandem, and now Phenyx, are probabilities.

This by itself doesn’t mean that the computed values are necessarily correct (that depends on the data and underlying assumptions), or that there cannot be other equally valid ways to model the statistics. However, at least the definition matches what is expected of a probability.

Much like a teacher may put the test scores on a curve to convert numerical scores into a more meaningful measure, PeptideProphet assumes the score distribution arises from a large “false positive” distribution superimposed on a smaller “true positive” distribution, and uses curve-fitting to compute the resulting probabilities. Where the FP and TP curves intersect is the 50% probability point.
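The two-distribution idea can be sketched in a few lines. This is a simplified illustration, not PeptideProphet itself: we assume both the “true positive” and “false positive” score distributions are normal curves whose parameters have already been fitted, and the means, widths, and mixing weight below are invented.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_correct(score, tp_weight, tp_mean, tp_sd, fp_mean, fp_sd):
    """Probability that an assignment with this score is correct, given a
    two-component mixture. tp_weight is the fraction of all assignments
    assumed to be true positives."""
    tp = tp_weight * normal_pdf(score, tp_mean, tp_sd)
    fp = (1.0 - tp_weight) * normal_pdf(score, fp_mean, fp_sd)
    return tp / (tp + fp)


# Hypothetical fitted parameters: a large FP curve and a smaller TP curve.
params = dict(tp_weight=0.2, tp_mean=4.0, tp_sd=1.0, fp_mean=0.0, fp_sd=1.0)

# For these parameters the weighted curves cross at (8 + ln 4) / 4.
crossing = (8.0 + math.log(4.0)) / 4.0

for score in (0.0, crossing, 5.0):
    print(round(score, 3), round(posterior_correct(score, **params), 4))
```

At the crossing point the weighted TP and FP densities are equal, so the computed probability is 50%. For every score, the probabilities of “correct” and “incorrect” sum to 1, which is what makes the value a probability rather than a score.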

As with any other statistical tool, its results are only as valid as its assumptions. There is always a chance of “garbage-in, garbage-out,” and the results depend on clean, well-fitting data.

PeptideProphet was originally designed to work with SEQUEST, where a “discriminant score” is derived by combining the similarity score (XCorr) and the “dissimilarity score” (deltaCn). Ideally, the discriminant score should incorporate both elements (the case for rescoring SEQUEST results), so that the highest probability is assigned where (1) the top candidate very closely matches the spectrum and (2) the top candidate is very dissimilar from the others in the search space.

It has since been adapted for Mascot (albeit using only the similarity score) and other search engines. PeptideProphet is part of the Trans-Proteomic Pipeline, while the same algorithm has been re-implemented in commercial products like Scaffold and Elucidator.

It should be noted that a probability rescoring at the peptide assignment stage is not the only way to filter the search engine results. Other methods, notably using decoy (reversed) protein sequence databases, can be employed with parameter-based filtering, such as with DTASelect from the Yates Lab. These methods allow the final false positive rates to be computed without requiring individual peptide assignments to be probabilistically determined. In the future, we expect that many of these different methodologies can be integrated to achieve the highest level of results quality in advanced “Proteomics 2.0” analysis driving the upcoming “Biotech Industrial Revolution.”
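As a rough illustration of the decoy approach, the sketch below estimates a false positive rate at a given score threshold by counting decoy hits. It is not how DTASelect works internally: the hit list is invented, the function name is ours, and we assume the simple convention of dividing decoy hits by target hits (other conventions exist, for example when target and decoy sequences are searched as one concatenated database).

```python
def estimated_false_positive_rate(hits, threshold):
    """Estimate the false positive rate among target hits passing the
    threshold, using decoy hits as a stand-in for false matches.

    hits: list of (score, is_decoy) tuples.
    Returns None if no target hit passes the threshold.
    """
    targets = sum(1 for score, is_decoy in hits
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits
                 if score >= threshold and is_decoy)
    if targets == 0:
        return None
    return decoys / targets


# Hypothetical search results: (score, matched a reversed sequence?)
hits = [
    (5.0, False), (4.5, False), (4.0, True),
    (3.5, False), (3.0, False), (2.0, True),
]

print(estimated_false_positive_rate(hits, threshold=3.0))
print(estimated_false_positive_rate(hits, threshold=4.2))
```

Raising the threshold trades sensitivity for a lower estimated error rate, and at no point does any individual peptide assignment need its own probability.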
