“All models are wrong, but some are useful.” George E. P. Box
The Truth About Probability Scores
What is Probability Scoring?
Probability Scoring is a popular method of ranking candidate peptide sequences by how well they fit an observed tandem mass spectrum. It can be computed as the primary score in a search engine (e.g. Mascot), or as a second-stage re-scoring of, say, the top 10 results from another search engine.
Why is it important?
Of all the types and styles of similarity scores used in proteomics search engines, Probability Scoring is considered one of the conceptually simplest to understand. Other scores, notably SEQUEST’s cross-correlation score (XCorr), which is based on vectors and linear algebra, can be more mathematically rigorous but require more technical background to understand how they are calculated.
What does it derive from?
The Probability Scoring functions used by both the Matthias Mann Lab at Max Planck and the Steven Gygi Lab at Harvard are based on the coin-flip model with a biased coin (i.e. the binomial distribution).
For example, suppose a peptide sequence is predicted to yield N=18 fragment ions, exactly K=6 of which match observed peaks, and the success probability is modeled as p=0.05 (we will get to that later). Then the “random-chance probability” of that happening (i.e. the p-value) is computed as the probability of getting exactly 6 Heads out of 18 Tosses using a Biased Coin, where each toss is modeled to have a 5% chance of yielding Heads.
The p-value computed for this case is C(18,6) × 0.05^6 × 0.95^12 ≈ 1.6 × 10^(-4), which is deemed “statistically significant” because 6 Heads is far more than the 0.9 Heads (= 18 × 5%) expected on average. Just as people do occasionally win a million bucks at Vegas, any unusual occurrence (or erroneously identified peptide with a high Score) can in fact happen for no reason other than dumb luck.
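The coin-flip calculation above can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function names are ours, not taken from any search engine. It also prints the "6 or more Heads" tail probability, which is the more conventional definition of a p-value, purely for comparison with the "exactly 6" figure used in this article.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k Heads in n tosses of a coin with P(Heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_tail(k: int, n: int, p: float) -> float:
    """Probability of k or more Heads in n tosses (shown for comparison only)."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

N, K, P = 18, 6, 0.05  # predicted ions, matched peaks, per-ion hit probability

expected_heads = N * P
p_exact = binomial_pmf(K, N, P)
p_tail = binomial_tail(K, N, P)

print(f"expected Heads on average: {expected_heads:.1f}")
print(f"P(exactly {K} Heads)        : {p_exact:.3e}")
print(f"P({K} or more Heads)        : {p_tail:.3e}")
```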
Note that the p-value measures how unlikely the observed event is (i.e. seeing so many “Heads” when each is rare), and so the smaller the p-value, the more significant the event. To create a score that is more human-friendly (i.e. generally between 1 and 100, with higher being “rarer”), the p-value is generally expressed as 10^(-Score/10), so that Score = -10 log10(p-value). For example, an event with a p-value of 1/100 = 10^(-20/10) has a Score of 20.
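As a small sketch of this conversion (the function names are hypothetical, chosen for illustration), the mapping between p-value and Score and back looks like this:

```python
from math import log10

def score_from_p_value(p_value: float) -> float:
    """Convert a p-value into a Score via Score = -10 * log10(p-value)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must be in the interval (0, 1]")
    return -10.0 * log10(p_value)

def p_value_from_score(score: float) -> float:
    """Inverse mapping: p-value = 10^(-Score/10)."""
    return 10.0 ** (-score / 10.0)

print(score_from_p_value(0.01))   # the 1/100 example from the text
print(p_value_from_score(20.0))
```

Each factor of 10 in the p-value moves the Score by 10 points, so small Score differences correspond to small multiplicative differences in probability.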
To see how this model applies to proteomics, note that doing 18 tosses with a 5%-biased coin is analogous to throwing 18 random, independent darts at a special dart board with 5% of the area colored red.
To score a peptide sequence against an observed spectrum, imagine the m/z axis as a one-dimensional dart board divided into sections of 100 m/z. Within each 100-m/z section, pick the 5 tallest fragment peaks (i.e. a filter step) and color a 1-amu-wide zone around each peak’s m/z red. Approximately 5% of the area is therefore colored red (5 peaks × 1 amu out of every 100). That is, for this example, the 5% estimate arises from the way we filter the observed peaks.
For each peptide, calculate its predicted m/z peaks, say all the +1 and +2 b- and y-ions (or whatever the fragmentation model predicts). In our example, there may be 18 predicted peaks within the observed range. These 18 m/z values become the 18 “random darts”, each with a 5% chance of hitting a red zone. That is, a dart counts as a success if it “lands” within the mass tolerance of a filtered observed peak. For example, a predicted b-ion of 857.6 m/z matched to an observed 857.8 m/z peak is analogous to a random dart hitting the 857.6 location within the red zone of 857.8 +/- 0.5.
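The dart-board procedure described above can be sketched as follows. This is an illustrative sketch only: the function names, the peak list, and the predicted m/z values are made up for the example (apart from the 857.6/857.8 pair taken from the text), and real implementations will differ in how they bin and match peaks.

```python
from collections import defaultdict

def hit_probability(keep=5, zone_width=1.0, window=100.0):
    """Fraction of the m/z axis colored red by the filter step."""
    return keep * zone_width / window

def top_peaks_per_window(peaks, window=100.0, keep=5):
    """Keep the `keep` most intense (mz, intensity) peaks in each m/z window."""
    by_window = defaultdict(list)
    for mz, intensity in peaks:
        by_window[int(mz // window)].append((mz, intensity))
    kept = []
    for window_peaks in by_window.values():
        window_peaks.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(window_peaks[:keep])
    return sorted(kept)

def count_matches(predicted_mz, filtered_peaks, tolerance=0.5):
    """Count predicted ions that land within `tolerance` of a kept peak."""
    observed = [mz for mz, _ in filtered_peaks]
    return sum(
        1 for pred in predicted_mz
        if any(abs(pred - obs) <= tolerance for obs in observed)
    )

# Toy spectrum (hypothetical values): seven peaks in the 800-900 m/z window.
observed_peaks = [
    (801.2, 30.0), (810.5, 80.0), (820.1, 50.0), (835.4, 60.0),
    (857.8, 120.0), (870.9, 90.0), (880.3, 10.0),
]
predicted = [857.6, 801.0, 500.0]  # hypothetical predicted fragment m/z values

filtered = top_peaks_per_window(observed_peaks)
matches = count_matches(predicted, filtered)
print(f"p = {hit_probability():.2f}, kept {len(filtered)} peaks, {matches} match(es)")
```

In this toy case the two weakest peaks are filtered out, so the predicted 801.0 ion no longer has a red zone to land in, while 857.6 still falls within 857.8 +/- 0.5. The resulting match count, together with the number of predicted ions and the hit probability, is what feeds the binomial formula.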
The general approach I described is the Gygi Lab’s “Bino 5-Score”, used to re-score SEQUEST top 10 search results by keeping the top 5 peaks and using a 5% bias. Olsen and Mann first published the binomial model specifically for MS3 spectra, although the Mann Lab uses the methodology with a 6% bias for re-scoring Mascot results.
Other probability distributions are also used, including the hypergeometric model, used by the John Yates Lab (Scripps) and in Thermo’s ZCore, and the Poisson model, used by the Neil Kelleher Lab (U. Illinois) and later by Lewis Geer’s OMSSA (NCBI).
In our opinion, all three probability distributions are conceptually equivalent for real-world spectral data, with generally insignificant differences in computed p-values in the range where it matters: in between the “obvious yes” and “obvious no” peptide identifications, with maybe 3 to 10 matching peaks.
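For readers who want to compare the three distributions on a specific case, here is a rough sketch using the N=18, p=0.05 example from earlier. The hypergeometric set-up is our own assumption for illustration (an m/z axis discretized into 1000 bins, 50 of them red, with darts landing in distinct bins); it is not a description of how any of the labs or products named above parameterize their models. How closely the Scores agree depends on N, p, and the number of matches.

```python
from math import comb, exp, factorial, log10

N, P = 18, 0.05        # predicted peaks and hit probability from the example
BINS, RED = 1000, 50   # hypothetical discretized m/z axis: 5% of bins are red

def binomial_pmf(k, n=N, p=P):
    """Exactly k hits in n independent darts."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam=N * P):
    """Exactly k hits when hits are modeled as a Poisson count with mean n*p."""
    return exp(-lam) * lam**k / factorial(k)

def hypergeometric_pmf(k, n=N, total=BINS, red=RED):
    """Exactly k red bins among n distinct bins drawn without replacement."""
    return comb(red, k) * comb(total - red, n - k) / comb(total, n)

def score(p_value):
    """Score = -10 * log10(p-value)."""
    return -10.0 * log10(p_value)

print("K   binomial  hypergeom  Poisson   (Scores)")
for k in range(3, 11):
    print(
        f"{k:<3} {score(binomial_pmf(k)):8.1f} "
        f"{score(hypergeometric_pmf(k)):9.1f} {score(poisson_pmf(k)):8.1f}"
    )
```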
Notably, Mascot was the first MS2 search engine to use some kind of probability scoring, but the details of its model remain unpublished. Results from ideal synthetic spectra suggest the base model is a version of the Mowse PMF model, but using modeled matched-peak probabilities and operating independently on each ion series, with the max score from any series taken as the peptide score. However, it seems to do a good job of considering “partial matches” (i.e. the number of matched peaks need not be a whole number), particularly for spectra without many extra noise peaks.