Starting with v4.0 software, the Sage-N Research SORCERER platform will provide the 2-stage scoring (i.e. different from 2-stage searching) architecture that generally mimics the current Gygi Lab workflow, which does a first-stage SEQUEST (i.e. SEQUEST 3G starting with v4.0) followed by our open-source MUSE scripting version of the Gygi Lab’s “Bino 5-score”. (This is analogous to the Mann Lab workflow, which generally uses a Mascot first-stage followed by a “6-score” re-score stage, according to private communication with Matthias.)
Users can also modify these re-score modules to incorporate their own scoring functions, such as to accommodate water and/or ammonia losses, incorporate special cleavage rules, or otherwise tune coefficients and parameters.
Anything to keep in mind about Probability Scoring?
Some researchers mistakenly believe that a “probability” is some kind of absolute “word of God”, but probabilities are very much a creation of man. Indeed, in science, the probability of an event has more to do with you (or rather your lack of all the relevant information) than with the event itself, and is best considered a “degree of confidence” measure based on incomplete information. After all, mass spectrometers do not measure peptide sequence per se, but only a collection of mass/charge ratios from which you infer sequence information.
Probability Scores are simply tools based on underlying (sometimes hidden) models, which as George Box observed are always “wrong” because they necessarily involve simplifications and assumptions. Probability Scoring, by its nature, tends to have increased specificity but reduced sensitivity. Its Achilles heel is the filtering step: how does one decide which peaks are “real” and which are “noise”, particularly for the noisy spectra common for phospho and low-abundance peptides? With only a handful of matching peaks to determine the Score, their accurate selection becomes critical.
Therefore, they are best used as a second re-scoring stage of results from a search engine like SEQUEST 3G designed to find specific patterns with significant noise. In addition, it is important to note that p-values are NOT true probabilities, since there is no requirement for such values of competing hypotheses to sum to 1. (See this Proteomics 2.0 blog entry for further discussion: http://proteomics2.com/?p=65 )
More on p-value
The examples here talk about p-values for exact numbers of successes to simplify the explanation. For proteomics applications, it is more appropriate to compute “6 or more matched peaks” out of 18 predicted, rather than “exactly 6 matched peaks”, since the filtering step could have eliminated otherwise matching peaks. The resulting calculation is done by adding the p-values of 6, 7, …, 17, and 18 matched peaks. This results in a higher overall “random probability” and a lower, more conservative Score.
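The “6 or more” sum can be sketched in a few lines of Python. This is only an illustration, assuming the binomial model and the 5% per-peak random-match probability used in the dart example later in this post; the function names are made up for the example.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(k, n, p):
    """Probability of k or more successes: sum of the exact terms k..n."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

# Assumed values: 18 predicted peaks, 5% chance of a random match per peak.
p_exact = binom_pmf(6, 18, 0.05)
p_at_least = binom_tail(6, 18, 0.05)

print(f"exactly 6 of 18:   {p_exact:.7f}")
print(f"6 or more of 18:   {p_at_least:.7f}")
```

The tail value is always at least as large as the exact value, which is why the resulting Score is more conservative.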
To understand the p-value distinction between exactly some number vs. at least some number, imagine you are asked to flip a perfectly balanced coin 1,000,000 times. (I’m glad you’re doing that and not me.) If you told me that you got exactly 500,000 Heads, I would strongly suspect that you didn’t really run the experiment, since that number is too convenient and the p-value for exactly 500,000 Heads is very small. However, the p-value for at least 500,000 Heads would be about 0.5.
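For the curious, the coin-flip numbers can be checked without flipping anything. The sketch below works in log space (via `math.lgamma`) to avoid enormous intermediate integers, and uses the symmetry of a fair coin to get the “at least” value; the variable names are illustrative only.

```python
from math import lgamma, log, exp

def log_binom_pmf(k, n, p):
    """Natural log of the binomial probability of exactly k successes in n trials."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_comb + k * log(p) + (n - k) * log(1 - p)

n = 1_000_000
p_exact = exp(log_binom_pmf(n // 2, n, 0.5))

# For a fair coin with even n, P(X > n/2) == P(X < n/2) by symmetry, so
# P(X >= n/2) = 0.5 + 0.5 * P(X == n/2).
p_at_least = 0.5 + 0.5 * p_exact

print(f"exactly 500,000 Heads:  {p_exact:.6f}")
print(f"at least 500,000 Heads: {p_at_least:.6f}")
```

The exact-count probability comes out a little under one in a thousand, while the at-least probability sits just above one half.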
Why hypergeometric, Poisson, and binomial models are generally equivalent for proteomics Probability Scoring
The binomial model for exactly 6 darts out of 18 throws hitting red assumes that all 18 throws are independent. That is, each dart has the same probability (5%) of hitting the red zone, regardless of where the others land. The binomial p-value for exactly 6 successes is 0.0001567, or a Score of 38.0.
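The binomial figure can be reproduced as follows. The Score conversion used here, -10 × log10(p), is an assumption inferred from the numbers quoted in this post (0.0001567 maps to 38.0); the function names are illustrative.

```python
from math import comb, log10

def binom_pmf(k, n, p):
    """Probability of exactly k hits in n independent throws, each with hit probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def score(p_value):
    """Assumed Score convention: -10 * log10(p-value)."""
    return -10 * log10(p_value)

p_binom = binom_pmf(6, 18, 0.05)
print(f"binomial p-value: {p_binom:.7f}")
print(f"Score:            {score(p_binom):.1f}")
```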
The hypergeometric model is conceptually similar to the binomial model, except that each time a red zone is hit by a dart, it is removed from the target so that a later dart cannot register the same hit. That is, the success probability of each subsequent dart is slightly less.
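A rough sketch of the hypergeometric version is below. The post does not say how many zones the target has, so the sizes here (2,000 equally likely zones, 100 of them red, which keeps the red fraction at 5%) are purely hypothetical values chosen for illustration.

```python
from math import comb

def hypergeom_pmf(k, total, red, throws):
    """Probability of exactly k red hits when each zone can be hit at most once."""
    return comb(red, k) * comb(total - red, throws - k) / comb(total, throws)

def binom_pmf(k, n, p):
    """Binomial counterpart, where a zone may be hit repeatedly."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical target: 2000 zones, 100 red (5%), 18 throws.
TOTAL_ZONES, RED_ZONES, THROWS = 2000, 100, 18

p_hyper = hypergeom_pmf(6, TOTAL_ZONES, RED_ZONES, THROWS)
p_binom = binom_pmf(6, THROWS, RED_ZONES / TOTAL_ZONES)

print(f"hypergeometric, exactly 6: {p_hyper:.7f}")
print(f"binomial,       exactly 6: {p_binom:.7f}")
```

With these assumed sizes the hypergeometric value comes out somewhat below the binomial one, and the gap shrinks as the target gets larger relative to the number of throws.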
The Poisson model may be viewed as a limit case of the binomial model, where it predicts how unlikely it is to see 6 successes for a set of trials, given that only 0.9 successes are expected on average. The Poisson p-value of exactly 6 successes is 0.0003000, or a Score of 35.2.
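The Poisson figure can be checked the same way. The expected count of 0.9 is simply 18 throws × 5%; the Score conversion, -10 × log10(p), is again an assumption inferred from the numbers quoted in this post.

```python
from math import exp, factorial, log10

def poisson_pmf(k, mean):
    """Probability of exactly k events when `mean` events are expected on average."""
    return exp(-mean) * mean**k / factorial(k)

expected_hits = 18 * 0.05          # 0.9 expected successes
p_poisson = poisson_pmf(6, expected_hits)
poisson_score = -10 * log10(p_poisson)   # assumed Score convention

print(f"Poisson p-value: {p_poisson:.7f}")
print(f"Score:           {poisson_score:.1f}")
```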
(As another example of using Poisson, if your house only sees 0.9 police cars per day on average, but one day you see at least 6 police cars, you may infer something unusual is happening. The p-value concept provides a mathematical measure of “something’s up”.)
For Probability Scoring in proteomics, you don’t need a fancy model to tell you about >10 or <2 matched peaks. For the sweet-spot of 2 to 10, all of these models will give similar p-values.
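To see how closely the three models track each other, the sketch below prints their Scores side by side for 0 to 12 matched peaks out of 18. It rests on several assumptions: the 5% per-peak match probability from the dart example, a hypothetical population of 2,000 zones (100 red) for the hypergeometric model, and a Score of -10 × log10(p).

```python
from math import comb, exp, factorial, log10

N_TRIALS, P_HIT = 18, 0.05
TOTAL_ZONES, RED_ZONES = 2000, 100   # hypothetical sizes, 5% red

def binom_pmf(k):
    return comb(N_TRIALS, k) * P_HIT**k * (1 - P_HIT)**(N_TRIALS - k)

def hypergeom_pmf(k):
    return (comb(RED_ZONES, k) * comb(TOTAL_ZONES - RED_ZONES, N_TRIALS - k)
            / comb(TOTAL_ZONES, N_TRIALS))

def poisson_pmf(k):
    mean = N_TRIALS * P_HIT
    return exp(-mean) * mean**k / factorial(k)

def score(p_value):
    return -10 * log10(p_value)      # assumed Score convention

print(f"{'k':>2} {'binomial':>9} {'hypergeo':>9} {'Poisson':>9}")
for k in range(0, 13):
    print(f"{k:>2} {score(binom_pmf(k)):>9.1f} "
          f"{score(hypergeom_pmf(k)):>9.1f} {score(poisson_pmf(k)):>9.1f}")
```

Reading down the table shows where the models agree and where they start to drift apart as the number of matched peaks grows.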
As with modeling of any poorly understood, real-world phenomena, there is no one “right” model, only ones that better model a specific situation. However, with experience, one will develop the sense of what is right and can be trusted.
References
Perkins, DN, Pappin, DJ, Creasy, DM and Cottrell, JS, Electrophoresis, 20(18) 3551-67 (1999).
Meng F, Cargile BJ, Miller LM, Forbes AJ, Johnson JR, Kelleher NL. Nature Biotech. 2001; 19: 952-7.
Sadygov RG, Yates, JR, Anal. Chem., 2003, 75 (15), pp 3792-3798 DOI: 10.1021/ac034157w
Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH (2004). J. Proteome Res. 3 (5): 958-64. doi:10.1021/pr0499491
Olsen JV, Mann M., Proc Natl Acad Sci U S A. 2004 Sep 14;101(37):13417-22
Beausoleil S.A., Villén J., Gerber S.A., Rush J., Gygi S.P. Nat Biotechnol 24(10):1285-92 (2006).
Sadygov RG, Good DM, Swaney DL, Coon JJ, J. Proteome Res., 2009;8:3198-205.