“All models are wrong, but some are useful.” George E. P. Box
The Truth About Probability Scores
What is Probability Scoring?
Probability Scoring is a popular method of ranking candidate peptide sequences by how well they fit an observed tandem mass spectrum. It can be computed as the primary score in a search engine (e.g. Mascot), or as a second-stage re-scoring of, say, the top 10 results from another search engine.
Why is it important?
Of all the different types and styles of similarity scores used in proteomics search engines, Probability Scoring is considered conceptually simple and easy to understand. Other scores, notably SEQUEST’s cross-correlation score (XCorr), which is based on vectors and linear algebra, can be more mathematically rigorous, but more technical background is needed to understand how they are calculated.
What does it derive from?
The Probability Scoring functions used by both the Matthias Mann lab at Max Planck and the Steven Gygi lab at Harvard use the coin-flip model with a biased coin (i.e. the binomial distribution).
For example, suppose a peptide sequence is predicted to yield N=18 fragment ions, exactly K=6 of which are matched by observed peaks, and the success probability is modeled as p=0.05 (we will get to that later). The “random-chance probability” of that happening (i.e. the p-value) is then computed as the probability of getting exactly 6 Heads out of 18 Tosses of a Biased Coin, where each toss is modeled to have a 5% chance of yielding Heads.
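To make the coin-flip arithmetic concrete, here is a minimal Python sketch of the example above. It uses the textbook binomial formula P(K) = C(N, K) · p^K · (1 − p)^(N − K); the function names are hypothetical and are not taken from any search engine's code. The "at least K" tail probability is included only as a commonly used companion quantity, which is an assumption on my part and not something stated above.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k Heads in n tosses of a coin with P(Heads) = p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binomial_tail(k: int, n: int, p: float) -> float:
    """Probability of k or more Heads in n tosses (upper tail of the binomial)."""
    return sum(binomial_pmf(j, n, p) for j in range(k, n + 1))


# Numbers from the example: N=18 predicted fragment ions,
# K=6 matched peaks, success probability p=0.05.
N, K, P = 18, 6, 0.05

exact = binomial_pmf(K, N, P)
at_least = binomial_tail(K, N, P)

print(f"P(exactly {K} of {N} match) = {exact:.3e}")
print(f"P(at least {K} of {N} match) = {at_least:.3e}")
```

With these inputs the exact-match probability comes out on the order of 1e-4, which is why 6 matches out of 18 would look unlikely under pure chance in this model.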