tips and hints


“All models are wrong, but some are useful.” George E. P. Box

The Truth About Probability Scores

What is Probability Scoring?
Probability Scoring is a popular method of ranking the possible peptide sequences that best fit an observed tandem mass spectrum. This can be computed as the primary score in a search engine (e.g. Mascot), or as a second-stage re-scoring of, say, the top 10 results from another search engine.

Why is it important?

Of all the different types and styles of similarity scores used in proteomics search engines, Probability Scoring is considered conceptually simple and easy to understand. Other scores, notably SEQUEST’s cross-correlation score (XCorr) based on vectors and linear algebra, can be more mathematically rigorous, but require more technical background to understand how they are calculated.

What does it derive from?
The Probability Scoring functions used by both the Matthias Mann Lab from Max Planck and Steven Gygi Lab from Harvard use the coin-flip model with a biased coin (i.e. the binomial distribution).

For example, suppose a peptide sequence is predicted to yield N=18 fragment ions, exactly K=6 of which are matched by observed peaks, and the success probability is modeled as p=0.05 (we will get to that later). Then the “random-chance probability” of that happening (i.e. the p-value) is computed as the probability of getting exactly 6 Heads out of 18 Tosses using a Biased Coin, where each toss is modeled to have a 5% chance of yielding Heads.

We have developed a new flow for processing Thermo RAW files that works both with the most recent XCalibur V2.1 and with earlier versions. This flow has been giving good results in internal testing, and we are now releasing it for beta testing to any interested, actively supported Sorcerer customer.

Thermo LTQ Velos users will have noticed the major changes to the XCalibur software that were introduced at version 2.1. The installation process is different and requires a new component called Thermo Foundation, and some of the file names and locations have changed. As a result of these changes, XCalibur 2.1 is no longer compatible with the ReAdW program that is used within the CrossOver environment by Sorcerer. One workaround that has been commonly suggested in the Thermo field is to down-rev the XCalibur used on the instrument to V2.0 and to continue using the old software for analysis. This remains a viable option, but with our newly developed solution, it is now also possible to use 2.1 RAW files on Sorcerer.

We are moving to a new spectrum extractor called msconvert (part of the ProteoWizard suite), which works with a different version of the Thermo libraries, and for which we have developed a new integration in the CrossOver environment. We are offering this as a beta release to our in-warranty customers. This solution entails a few Linux operations: reinstalling CrossOver with the latest release, configuring the required libraries and installing a new Sorcerer workflow script. It is fairly straightforward for people comfortable with the Linux environment; alternatively, we can do it for you if you give us remote access to your system. Please contact us at support@sagenresearch.com for more information.


Prof. Josh Elias (left) of Stanford University receives a thank-you gift from David Chiang after his talk.

Ever wondered about target-decoy searching? Want to gain a better understanding and realistic expectation of this effective tool? SageNResearch’s video “Addressing Peptide Identification Signal-to-noise With Target-Decoy Searching”, given by Professor Josh Elias of Stanford University at our “Translational Proteomics 2.0” meeting, can help. Dr. Elias is an Assistant Professor in Chemical and Systems Biology at Stanford University, and was part of the Steven Gygi Lab at Harvard Medical School before that. His lab is keenly interested in developing and applying methods to meet the current challenges facing scientists engaged in large scale proteome characterization.

Josh kicked off his talk with a stunning and very powerful visual to hit home the concept of what target-decoy database searching can do — you’ll never look at coffee beans in quite the same way. With this talk, you’ll know how to better find a happy medium for thresholds, smarter ways of designing your filtering criteria, when not to even consider using the method, how to get the most out of (really easy) decoy searching in SORCERER, and what’s so good about partial tryptic searches.

The 30-minute presentation is available at: http://www.scivee.tv/node/15544
To view the slides, I recommend using the “full screen” mode. The slide set can also be downloaded as a PowerPoint file.


Prof. Alexey Nesvizhskii (left) of University of Michigan receives a thank-you gift from David Chiang after his talk.

If you really want to understand how peptide and protein identification is done, this video talk is a must-see!

Professor Alexey Nesvizhskii of the University of Michigan is one of the co-inventors (with Dr. Andy Keller) of the popular PeptideProphet/ProteinProphet algorithm for turning search engine results into statistically consistent peptide and protein identifications. (This algorithm is also the basis for the popular Scaffold software.)

At the “Translational Proteomics 2.0” meeting, we were privileged to have Alexey give his insightful talk that reviews the various steps involved in inferring peptide and protein identifications from large spectra datasets.

In this talk, you will learn why False Discovery Rates are preferred over P-values, why you probably should not run more than 4 replicates of a MudPIT experiment, how FDR estimations from decoy differ from Peptide/ProteinProphet, how “The Two Prophets” compute probabilities by curve-fitting the score distributions, how sensitivity and FDR are computed, and the what and why of some advanced TPP options.

The talk is available at: http://www.scivee.tv/node/12671 (45 minutes).

I recommend using the “full screen” mode so you can view the slides, which are also available as a download from the site. (Please be aware that the slideset order is different from that in the presentation.)

(Note: Both Trans-Proteomic Pipeline and Scaffold Batch software are integrated into the SORCERER platforms.)

We’ve developed a new Muse workflow for target-decoy analysis and false discovery rate estimation, based on our integration of DTASelect from the Yates lab. DTASelect can now use target-decoy FASTA files that are installed on Sorcerer to support its statistical analysis. It provides an easy-to-interpret results report complete with match statistics and estimated false discovery rates.
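To make the idea of a decoy-based false discovery rate concrete, here is a minimal sketch of one common convention: count the decoy hits that pass a score threshold and divide by the number of target hits that pass it. This is an assumption chosen for illustration and is not necessarily the formula DTASelect uses. The function name and the scores in the example are hypothetical.

```python
def estimate_fdr(hits, threshold):
    """Estimate the FDR among target hits scoring at or above `threshold`.

    hits: list of (score, is_decoy) pairs, one per top-ranked peptide hit.
    Assumes a random (false) match is equally likely to land on a target
    or a decoy sequence, so passing decoys approximate false targets.
    """
    targets = sum(1 for score, is_decoy in hits if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)

# Hypothetical scores, for illustration only
hits = [
    (3.9, False), (3.5, False), (3.1, False), (2.8, True),
    (2.6, False), (2.2, True), (2.0, False), (1.8, True),
]
for cutoff in (3.0, 2.5, 1.5):
    print(f"score >= {cutoff}: estimated FDR = {estimate_fdr(hits, cutoff):.2f}")
```

Lowering the cutoff admits more hits but also more decoys, which is the trade-off a threshold has to balance.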

Our DTASelect on Sorcerer page on this blog has been updated to describe the target-decoy workflow, in addition to the existing material on installing, configuring and running DTASelect and the Muse script. Please visit it to get links to the latest scripts and for a detailed How-To.

Three of the world’s leading experts on MS-MS protein identification came together recently at Sage-N Research’s annual user group meeting, and presented methods and results for the techniques and tools with which they are associated:

  • Jimmy Eng, co-inventor of Sequest and developer of many proteomics tools, presented tips for Sequest analysis
  • Josh Elias, who pioneered the systematic use of decoy databases for FDR estimation, gave a talk on how to use that technique to address Peptide ID signal-to-noise.
  • Alexey Nesvizhskii spoke about the tools he co-authored, in “Peptide identification and protein inference using PeptideProphet and ProteinProphet”

Their talks were very wide-ranging and full of practical insights for the proteomics user community, and they explored the different research interests, data sets, analysis methods and workflows in the individual labs.  However, they all had this in common: they had kept a careful eye on their search settings, monitored sensitivity and error rates, and come to a common, if perhaps not entirely intuitive, conclusion: the most sensitive search and the lowest error rates for shotgun proteomics are achieved when using semi-enzymatic searches — that is, when one end, but not both, of the peptide is allowed to diverge from the expected cleavage site.



Jimmy Eng (left) of University of Washington receives a thank-you gift from David Chiang after his talk.

During our Translational Proteomics 2.0 Meeting, we were privileged to have Jimmy Eng (University of Washington) give us his uncommon insights into using SEQUEST with the Trans-Proteomic Pipeline (TPP).

This talk will be invaluable for advanced users of the SEQUEST search engine for sensitive translational proteomics analysis. All active SEQUEST users should listen to this talk!

Researchers will benefit by increasing their sensitivity and decreasing their false discovery rates when identifying proteins and post-translational modifications using proteomics mass spectrometers like the Orbitrap.

Jimmy has been one of the most prolific proteomics developers for almost two decades, as the co-inventor (with John Yates) of the SEQUEST search engine, as well as the developer of a number of TPP tools.

Conclusions from slides:
- Semi-tryptic searches are better
- Use monoisotopic masses for fragment ions (and for precursor ions if the data come from a high-res instrument)
- Narrow mass tolerance searches are better if the search considers precursor mass isotope assignment error

The talk is available at:  http://www.scivee.tv/node/11920 (31 minutes).

I recommend using the “full screen” mode so you can view the slides, which are also available as a download from the site.

Many of our customers have found DTASelect to be a very useful postprocessing tool for Sequest results, and have reported success using it with Sorcerer output. Up until now, however, these customers have generally run the tool manually on a separate desktop computer. Now we have developed a Muse script to make it easy to do this automatically, on Sorcerer itself.

See our DTASelect on Sorcerer page on this blog for a detailed How-to on installing, configuring and running DTASelect and the Muse script.

If you are interested in using Ascore as described in the application note on the blog, please contact us for new Muse scripts for your Sorcerer. We’ve just updated them, and they are needed to work with the recent v4.0 release of TPP, which is what’s in the current Sorcerer release.

Here’s a how-to for technically advanced users who need to update the Java platform on Sorcerer. It’s not required for the base Sorcerer software, including ScaffoldBatch, but it may be necessary for Phenyx installation. Please consult our technical support staff before deciding to do the update.

These instructions assume that you have a recent 64-bit Sorcerer operating platform (either RHEL 5.2 or CentOS 5-based), and that your Sorcerer software is at V3.5.

Here are the steps:

  1. Get the latest Java Development Kit (JDK)  (currently v6 update 18) from http://java.sun.com/javase/downloads/index.jsp. Click on the ‘Download JDK’ button. Get the Linux x64 platform, and download the non-rpm file which has a name like jdk-6u18-linux-x64.bin
  2. Log in as root in a terminal window and type: cd /opt
  3. Copy the file you downloaded to /opt, and unpack it:  /bin/sh jdk-6u18-linux-x64.bin
  4. Note the pathname to java in the unpacked directory for use in the next step, e.g. /opt/jdk1.6.0_18/bin/java
  5. Type:  /usr/sbin/alternatives --install /usr/bin/java java /opt/jdk1.6.0_18/bin/java 2
    • This sets up a system of links from /usr/bin/java to the new installation
  6. Type: /usr/sbin/alternatives --config java
    • Enter ‘2’ at the prompt to select the newly installed alternative
  7. Check you have the latest java by typing:  java -version

(Optional) Update Firefox Java plugin:

  1. Create a plugins directory in the Firefox installation directory if the plugins directory does not exist. Please check your version of Firefox to determine the correct path to use: mkdir /usr/lib64/firefox-3.x.x/plugins
  2. Create a symbolic link to the new Java plugin. Again please check your Firefox and JRE version for the correct paths: ln -s /opt/jdk1.6.0_18/jre/lib/amd64/libnpjp2.so /usr/lib64/firefox-3.0.5/plugins/

Hear Dr. Laurence Brill, senior research scientist at the Burnham Institute (La Jolla, CA) describe his advanced proteomics setup with the SORCERER 2 system:

Click here to hear Dr. Laurence Brill

Reference: Laurence M. Brill, Khatereh Motamedchabokia, Shuangding Wu, and Dieter A. Wolf, “Comprehensive proteomic analysis of Schizosaccharomyces pombe by two-dimensional HPLC-tandem mass spectrometry”, Methods (2009), doi:10.1016/j.ymeth.2009.02.023.

Click here for another Success Profile

High accuracy mass specs (e.g. Orbitrap) need to be calibrated every 3 days or so to maintain their mass accuracy, reported to be routinely around 2 ppm. But a common calibration solution can accumulate in the instrument and clog up the tubes, causing some labs to prolong the time between needed calibrations. What to do?

One Orbitrap user reports excellent results from the Acetonitrile solution marketed by Agilent for calibrating their instruments. The “ES Tuning Mix” (part number G2421A) fragments well, leaves not a trace, and is said to cost about $100 for a 100 ml bottle.

Email techteam@SageNResearch.com if you need more information. And thanks to B.G. for the tip!

The built-in mechanism for uploading and downloading files through the Sorcerer GUI is very convenient and right at one’s fingertips, but it is not recommended for very large files, as it is a relatively inefficient method. In the worst case, if you are sitting at your desktop and using the Web GUI to transfer a file from one directory to another on a server (either Sorcerer or some other machine), then that file has to travel over the network to your desktop computer and all the way back again. Not very efficient if you have a large file, a slow network and a creaky PC! So consider using Windows Explorer and Samba running on the server to make a move like this.

Using semi-enzyme or no-enzyme digests for database searches can be a powerful tool for some analyses where non-specific protease activity is being modelled, but those searches come at the expense of considerably larger databases and longer run times, not to mention more noise hits.

To keep the benefits while containing the downside, here are some pointers to optimize search conditions:

  • Work with a compact, non-redundant, preferably species-specific protein database (like the IPI series) that really represents the proteins you might expect to see. (This is a good rule in general.)
  • Make sure the precursor mass range is no wider than needed. Masses below a few hundred Da represent 2- or 3-mer peptides, and contribute little to the identifications, while increasing the noise and database size. Masses over a few kDa or 15-20 AA may be unexpected in an enzyme digest and will not be seen in your instrument anyway.
  • Use semi-enzyme in preference to no-enzyme if you can, for example if you are using trypsin, but you want to find the occasional non-enzymatic cleavage, or if you are generating background hits (to be filtered later) for statistical analysis. Using semi-tryptic conditions rather than full tryptic may increase the search space by an order of magnitude, but then going to full no-enzyme incurs a further two orders of magnitude increase, typically.
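The growth in search space described above can be explored with a small counting sketch. It enumerates every substring of a protein within a length window and counts how many have both ends, at least one end, or any ends at all at a tryptic cleavage site. The sequence, the length window and the function names are hypothetical. The simple cleavage rule (cut after K or R unless followed by P) and the unlimited missed cleavages are simplifying assumptions, so the ratios it prints are only indicative. Real ratios depend on the database, the mass range and the missed-cleavage settings.

```python
def cleavage_boundaries(seq: str) -> set:
    """Positions where trypsin may cut: after K or R unless followed by P,
    plus both ends of the protein."""
    cuts = {0, len(seq)}
    for i, aa in enumerate(seq[:-1]):
        if aa in "KR" and seq[i + 1] != "P":
            cuts.add(i + 1)
    return cuts

def count_candidates(seq: str, min_len: int = 6, max_len: int = 40):
    """Return (full, semi, none): the number of candidate peptides considered
    by a fully tryptic, semi-tryptic and no-enzyme search respectively.
    The semi count includes the fully tryptic peptides."""
    cuts = cleavage_boundaries(seq)
    full = semi = none = 0
    for start in range(len(seq)):
        last_end = min(start + max_len, len(seq))
        for end in range(start + min_len, last_end + 1):
            tryptic_ends = (start in cuts) + (end in cuts)
            none += 1
            if tryptic_ends >= 1:
                semi += 1
            if tryptic_ends == 2:
                full += 1
    return full, semi, none

# Hypothetical protein sequence, for illustration only
protein = "MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVK"
full, semi, none = count_candidates(protein)
print(f"full tryptic: {full}, semi-tryptic: {semi}, no enzyme: {none}")
```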

Steve Gygi (Harvard Medical School) writes…

We don’t find this to be a problem anymore. There used to be two problems:

1) The precursor mass in the header was sometimes from the pre-scan (15K resolution) in the Orbi and not the one in the actual MS scan.

2) Sometimes (for large peptides) the second isotope (1st carbon-13 isotope peak) was chosen for MS/MS (because it’s larger for peptides with masses above 1800).

Both of these problems are fixed by Xcalibur software when the radio checkbox is checked under the “exclude charge states” tab that says “Undetermined charge states.”

The mass spectrometer doesn’t collect MS/MS information if it doesn’t know the charge state. If it knows the charge state, then the right information gets put into the headers almost all the time. We always check that box now.

Finally, one can always check how well this works by just doing a search at 1.1 Da tolerance (instead of 50 ppm) and then examining the PPM values for the best-scoring peptides. Usually only one out of a hundred or so will be right (very high Xcorr) and have a PPM value that corresponds to exactly a 1.003355 Da shift.
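The check described above can be sketched as follows: compute the ppm error of each best-scoring hit and see whether it is explained by a whole number of 1.003355 Da shifts. The function names, the 10 ppm acceptance window and the example masses are assumptions made for this sketch. Only the 1.003355 Da value comes from the text.

```python
C13_SHIFT = 1.003355  # Da, the isotope shift quoted above

def ppm_error(observed: float, theoretical: float) -> float:
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

def isotope_offset(observed, theoretical, tol_ppm=10.0, max_offset=1):
    """Return the number of C13_SHIFT steps (negative, zero or positive) that
    brings the observed mass within tol_ppm of the theoretical mass,
    or None if no offset in the allowed range does."""
    for n in range(-max_offset, max_offset + 1):
        if abs(ppm_error(observed - n * C13_SHIFT, theoretical)) <= tol_ppm:
            return n
    return None

# Hypothetical peptide of 2000 Da: one accurate hit, one picked at the 13C peak
print(ppm_error(2000.004, 2000.0), isotope_offset(2000.004, 2000.0))
print(ppm_error(2001.0034, 2000.0), isotope_offset(2001.0034, 2000.0))
```

A hit that returns an offset of 1 has a raw ppm error of several hundred ppm at this mass, which is why it only shows up when the search tolerance is opened to about 1.1 Da.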

Jimmy Eng (University of Washington) writes …

Regarding the question of ReAdW vs. extract_msn and the observation that ReAdW masses are almost always +10 to +15 ppm higher than extract_msn:

The current version of ReAdW is being distributed with the Trans-Proteomic Pipeline (TPP) and can also be individually downloaded directly from here:
http://sourceforge.net/project/showfiles.php?group_id=69281&package_id=68160

However, there is an imminent TPP release (3.5.1) which will include a new ReAdW which is the first update for this tool in well over a year. The new ReAdW incorporates a fix to profile scans that get centroided. Previously, centroided peaks were off by both m/z and intensity and we recently became aware of a proper Thermo function call to extract correct centroided data for Orbi/FT data using their Xcalibur Developer’s Kit (XDK) API. This new ReAdW also supports zlib peak compression which will generate in the range of 20-40% smaller files.

There is no change to the precursor mass determination though. For scans with a more precise precursor mass available, ReAdW reports what it gets out of the Thermo API (Monoisotopic M/Z Trailer Extra Value; I presume it’s the same accurate mass also visible in the scan header when viewing a spectrum in Qual Browser). There appears to be newer Thermo function calls to extract (more accurate?) precursor masses but the latest ReAdW continues to use the “Trailer Extra Value” masses for now. On one dataset I tested, the newer precursor mass function generated more accurate precursor masses within the high quality identifications. It had precursor mass accuracies in the range of 0-6PPM with distribution centered around 2PPM versus an error range of 0-12PPM with distribution centered around 4PPM for the “Trailer Extra Value” masses. But testing by others showed that there were enough scans where the new mass function failed so the precursor mass determination is left as-is for the time being.


According to Mike Senko of Thermo, there are two complexities with determining the precursor mass using the Extract-MSn program within Bioworks. The first is the occasional +1 Da that is included in the reported precursor monoisotopic mass. The second is the random mass error that is estimated to be between 5 and 10 PPM.

In an Orbitrap or FT acquisition, the instrument tries to determine the monoisotopic m/z while it picks precursors based on abundance. If the m/z can be confidently inferred, it will be listed in the scan header (also called the trailer) as “Monoisotopic M/Z”. If it cannot, a ‘0’ is listed.

For the first complexity, the occasional extra +1 Da in the precursor mass arises when Extract-MSn is used to generate the peaklist and a ‘0’ is encountered. In those cases, the mass of the most abundant isotopic peak is chosen, which is the M+1 peak for precursor masses higher than about 1700 Da. Therefore, the extra +1 Da arises not from the instrument itself, but from the way Extract-MSn extracts the information. The potential for this error still exists today.
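Why the M+1 peak overtakes the monoisotopic peak at higher masses can be illustrated with a rough estimate. The sketch below assumes an "averagine" average amino acid composition and standard natural isotope abundances. Neither comes from the information above, and the function names are made up. For each element, the ratio of the M+1 to the M peak is the atom count times the ratio of the +1 isotope abundance to the light isotope abundance, and the contributions of the elements add up.

```python
AVERAGINE_MASS = 111.1254  # Da, average mass of one averagine unit (assumed)

# element: (atoms per averagine unit, light isotope abundance, +1 isotope abundance)
# All values are assumed standard figures, not taken from the text above.
ELEMENTS = {
    "C": (4.9384, 0.9893, 0.0107),
    "H": (7.7583, 0.999885, 0.000115),
    "N": (1.3577, 0.99636, 0.00364),
    "O": (1.4773, 0.99757, 0.00038),
    "S": (0.0417, 0.9499, 0.0075),
}

def ratio_per_unit() -> float:
    """M+1 / M intensity ratio contributed by a single averagine unit."""
    return sum(n * plus1 / light for n, light, plus1 in ELEMENTS.values())

def m1_to_m0_ratio(mass: float) -> float:
    """Approximate M+1 / M intensity ratio for a peptide of the given mass."""
    return mass / AVERAGINE_MASS * ratio_per_unit()

def crossover_mass() -> float:
    """Mass at which the M+1 peak becomes as intense as the monoisotopic peak."""
    return AVERAGINE_MASS / ratio_per_unit()

for mass in (1000.0, 1700.0, 2500.0):
    print(f"{mass:.0f} Da: M+1/M = {m1_to_m0_ratio(mass):.2f}")
print(f"estimated crossover near {crossover_mass():.0f} Da")
```

Under these assumptions the crossover lands somewhat above 1800 Da, in the same general range as the figure quoted above. The exact value depends on the actual composition of each peptide.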

For the second complexity involving the random mass error, the issue is that the analytical scan is not done when the data dependent scan decisions are made, which is during the first 25% of the time domain signal (called the Preview Scan). For this reason, the listed masses for the precursor and isotopic M/Z will not match those values in the analytical scan. In that case, Extract-MSn will go back to the analytical scan to extract the more accurate values. This capability is reportedly in Extract-MSn version 3.2 or later.

[Editor’s Note: Mike is a research scientist at Thermo responsible for optimizing the interface between the LTQ and Orbitrap/FT sections of hybrid instruments. We thank Mike for providing the above information, which we summarized for this newsletter.]

A single large file can be transferred far faster – sometimes hours faster – than hundreds of small files.

For example, you can use a single click in the web interface to both zip up a directory of DTA files and to submit it to Sorcerer for searching. Sorcerer’s Web GUI can automatically unpack the zip archive ready for searching. Simply bring up the web user interface. Where you submit the raw or spectral files for search, simply find the parent directory, then right-click to bring up a menu, and select the compression function.

Go wider than experimental conditions: e.g.  5x the precursor mass tolerance of the mass spec, semi-trypsin, and reversed decoy databases.

For example, if your instrument has an expected precursor mass tolerance of 10 ppm or less, search using 50 ppm. If you expect 95%+ of the peptides to be full tryptic, search using semi-trypsin. And if your workflow handles reversed decoy databases, include reversed peptides in your search.

The 2 reasons for this are: (1) it provides auxiliary information that the post-search validation software can use for filtering, and (2) many validation software tools rely on a “population distribution” for curve-fitting, so it is important to provide enough noise for the algorithm to work properly.

For example, you can gain increased confidence in a particular peptide hit if you searched under widened conditions but the top scoring hit has a mass error of 1 ppm, is full-tryptic, and is non-reversed.

Keep 2000 instead of 500 top preliminary scores (Sp), especially for phosphorylated peptides.

SEQUEST uses a 2-pass approach, whereby the 1st pass keeps only the top 500 preliminary scores (Sp) by default. These are in turn analyzed using the cross-correlation score (XCorr). However, we have found that 500 is too low for either poorer quality spectra, large protein sequence databases, or multiple variable modifications, because the distribution of random Sp values becomes so large that the true hit can be ranked beyond 500 or 1000.

This is especially true for phosphorylated peptides, whose MSMS spectra are typically dominated by one or two precursor derivatives. The resulting Sp distribution becomes more concentrated, causing true hits to be crowded out.

Increasing the parameter to 2000 (or even higher) will increase search engine sensitivity for general searches, and especially phosphopeptide searches.

For sensitive SEQUEST searching, use ExtractMSN or ReAdW in preference to Mascot Distiller, which generates only 1/2 to 1/5 as many peaks.

For peak-rich MS-MS spectra typical from ion traps, SEQUEST uses 2 to 5 times more peaks in its score than Mascot. Extract-MSn (in Bioworks) and ReAdW (in Trans-Proteomic Pipeline) faithfully reproduce the MS-MS peaks from the raw file in the peaklist (the SRF, DTA or mzXML file). In contrast, Mascot Distiller, which is optimized for Mascot, tries to distill the measured MS-MS spectra down to only the relevant peaks to the extent possible. For poor quality spectra, this can result in lost information and sensitivity.

Set ‘print_duplicate_references’ to 10 or higher in SEQUEST for increased protein coverage in most workflows.

Do this when using Scaffold, DTASelect, or Bioworks for your post-SEQUEST analysis. Incorrectly leaving this parameter at its default value is one of the most common reasons why researchers mistakenly obtain lower than expected protein coverage from SEQUEST searches.

The parameter causes SEQUEST to print all the different protein references (up to the specified limit) containing the top peptide hit in its output file, which is then used by the protein inference program to re-assemble the protein from the peptide assignments.

Both Trans-Proteomic Pipeline (PeptideProphet and ProteinProphet) and Rosetta Elucidator re-compute the protein inferences directly from the peptides and do not depend on the reporting of multiple protein references.

Orbitrap or LTQ-FT MS-MS spectra may have precursor mass error of ~1 amu, which must be taken into account by the search engine.

High mass accuracy instruments can resolve a precursor peak mass to < 10 ppm, but they may occasionally mis-assign a carbon-13 isotope peak as a carbon-12, and possibly vice versa. This is said to affect up to 10% of the spectra, resulting in the mass error of a neutron. In SORCERER-SEQUEST, be sure to set the “isotope check” option which allows a small precursor mass tolerance (e.g. 50 ppm) to be applied to M, M-1, and M+1 masses, where M is the reported precursor mass. In Bioworks SEQUEST (including SEQUEST Cluster) which lacks this option, it may be necessary to search without the benefit of the mass accuracy (i.e. +/- 1.5 amu) to cover the isotope mass error.

For highly charged species, the maximum mass error can be 2 amu. Please contact our Tech Team for a Sorcerer scripting solution to help address this.
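The idea behind an isotope check can be sketched as a simple matching rule: accept a candidate peptide mass if it falls within a narrow ppm window around the reported precursor mass M, or around M shifted by one isotope spacing in either direction. This is an illustration of the concept only, not the SORCERER-SEQUEST implementation. The function name, the 1.003355 Da spacing and the default values are assumptions.

```python
ISOTOPE_SPACING = 1.003355  # Da, 13C minus 12C mass difference (assumed spacing)

def precursor_matches(candidate_mass: float, reported_mass: float,
                      tol_ppm: float = 50.0, max_isotope_error: int = 1) -> bool:
    """True if candidate_mass lies within tol_ppm of the reported precursor
    mass or of that mass shifted by up to max_isotope_error isotope spacings."""
    for n in range(-max_isotope_error, max_isotope_error + 1):
        target = reported_mass + n * ISOTOPE_SPACING
        if abs(candidate_mass - target) / target * 1e6 <= tol_ppm:
            return True
    return False

# Hypothetical reported precursor mass of 2000 Da
print(precursor_matches(2000.05, 2000.0))     # within 50 ppm of M
print(precursor_matches(2001.0034, 2000.0))   # within 50 ppm of M+1
print(precursor_matches(2000.5, 2000.0))      # between the windows
print(precursor_matches(2002.0067, 2000.0, max_isotope_error=2))  # M+2 allowed
```

At 2000 Da, three 50 ppm windows cover about 0.6 Da in total, compared with the 3 Da covered by a +/- 1.5 amu search, so far fewer random candidates are scored. Setting max_isotope_error to 2 mirrors the highly charged case mentioned above.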