by David.Chiang@SageNResearch.com
Orbitraps and other fast ion trap mass spectrometers (e.g. FT, LTQ) are popular instruments for discovery proteomics research.
The SEQUEST cross-correlation score is almost tailor-made for the spectral characteristics of ion trap data, whose information-rich spectra are challenging to interpret because they contain multiply-charged fragment ions reported with relatively low mass accuracy. This fit is especially important for analyzing noisy spectra that arise from low-abundance peptides and phosphorylated peptides, where the information content is embedded in the abundant small peaks.
However, you may be unaware of how the basic SEQUEST functionality has evolved from the first ‘sequest27’ prototype program to the latest SORCERER-SEQUEST implementation.
Software continues to evolve to adapt to new requirements. Like a home remodeling job that never ends, at some point it becomes more practical to start over from scratch. After all, maintenance costs are several times higher than the initial development costs over the life of a software product.
The recommended architecture for high-throughput analysis is client-server, which separates the interactive user client computer from the heavy-duty number-crunching server. This simplifies the sharing, updating, and backup of the central server, and isolates it from viruses and other sources of system instability on the user-accessible client PCs.
Sequest27
Proteomic search engines were invented by John Yates and Jimmy Eng at the University of Washington in the early 1990s, based on the novel idea that a peptide sequence can be inferred not just from the tandem mass spectrum alone (i.e. de novo sequencing), but by using known protein sequences as a reference.
The prototype search engine software was a standalone program named ‘sequest27’ comprising approximately 3000 lines of C code. The source code has since been separately maintained by the Yates Lab and by Thermo, with PTM searches and other modifications added later.
The ‘sequest27’ program processes one mass spectrum at a time, and searches a protein sequence database from the beginning to end each time it is run. For example, to analyze a MudPIT experiment with 8,000 spectra, the ‘sequest27’ program is run exactly 8,000 times to generate 8,000 output files, with no attempt to use information from one ‘sequest27’ run to another.
SEQUEST Cluster
The simplest way to scale up the throughput is to run the same program on many computers at once, such as in a Beowulf cluster architecture (http://www.beowulf.org/).
The SEQUEST Cluster (“SC”) product once marketed by ThermoFinnigan uses this approach, with typically 4 to 32 Linux slave node computers running ‘sequest27’ under the control of a Windows master node computer running Bioworks.
The SC architecture partitions the set of input spectra into smaller sets for each node, and uses the master node to aggregate the results. While this approach is simpler to implement than partitioning the protein sequences, it requires each local disk to contain the same protein files, resulting in inefficient disk usage (e.g. a 16-node cluster searching the NCBI nr file must store 16 identical copies). It also makes the indexed search capability impractical. If the local files are large, then manually copying the files across the network to each node will take a lot of time.
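The spectrum-partitioning scheme described above can be illustrated with a minimal Python sketch. The round-robin split and the function name partition_spectra are assumptions made for illustration; the article does not say how the SC product actually divides the spectra.

```python
def partition_spectra(spectrum_ids, num_nodes):
    """Split spectrum IDs into one list per node (round-robin; an assumed scheme)."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return [spectrum_ids[i::num_nodes] for i in range(num_nodes)]


# Example: the 8,000-spectrum MudPIT run spread over a 16-node cluster.
spectra = list(range(8000))
per_node = partition_spectra(spectra, 16)

# Every node searches its own spectra against the whole database,
# so every node needs its own full copy of the protein files.
database_copies = len(per_node)
print(len(per_node[0]), "spectra per node;", database_copies, "database copies")
```

Splitting the spectra keeps each node's work independent, which is why the scheme is simple, but the database copy count grows with the node count.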
To proteomics researchers new to clusters, the SC architecture seems to offer two benefits: (1) higher throughput than a single computer, and (2) ability to expand throughput in the future by adding nodes.
However, the devil is in the details. In practice, the cluster may not offer higher throughput than an optimized, non-cluster architecture. As well, future expansion for this software architecture is impractical in light of Moore’s Law.
Depending on the search conditions, one high-end server (say with 8 GB RAM and a 1.6 terabyte disk) with an optimized software architecture can outrun a 16-node cluster in which each slave node has 1/16th the resources (i.e. 512 MB RAM and 100 GB disk). And it will be simpler to maintain, easier to program, and approximately 16x more reliable. The partitioned RAM and disk resources make system-wide optimization difficult.
Future expansion is also impractical beyond the first year for the SC architecture, since all the slave nodes are assumed to have identical specs. With Moore’s Law predicting a 2x performance increase every 18 months at the same price, it is more effective to replace the computing hardware every 2 to 3 years with a brand-new system than to try to buy older nodes to add to an old cluster.
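The arithmetic behind this argument can be checked with a few lines of Python. This is only a sketch that treats the 18-month doubling as an exact rule, which is a simplification; the function name relative_performance is made up for the example.

```python
def relative_performance(months_elapsed, doubling_period_months=18.0):
    """Performance at the same price relative to today, assuming a fixed doubling period."""
    return 2.0 ** (months_elapsed / doubling_period_months)


for years in (1, 2, 3):
    factor = relative_performance(12 * years)
    print(f"after {years} year(s): {factor:.2f}x performance at the same price")
```

Under this assumption, hardware bought three years later delivers 4x the performance for the same money, so matching an old cluster's identical-spec nodes means paying for a quarter of what the same budget could buy new.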
Server vs. PC
Servers are not just big Personal Computers (PCs). Quality server hardware is designed for reliable 24/7 multi-processing and continuous disk access, unlike PC hardware designed for the cost-sensitive consumer market.
Robust server operating systems like Enterprise Linux are designed to simultaneously run dozens of independent programs in multi-user environments and to prevent a crashed program from affecting other programs.
Server programs have fewer restrictions than PC programs designed for easy installation and use by non-experts. Therefore, they can incorporate powerful server modules like Perl, PHP, Ruby on Rails, Apache, and MySQL, but require IT expertise for installation and configuration.
One important benefit of the server platform is ease of integration, which is increasingly important as the workflow evolves from just the search engine to a full proteomic workflow.
In contrast, integration can be very complex on the standard Windows operating system. For example, some mass spec software from different vendors cannot co-exist on the same Windows PC. In general, PC software is easy to install but difficult to integrate, while server software tends to be the opposite.
SORCERER-SEQUEST
The SORCERER software architecture was developed from the ground up as a server platform for high-throughput search engines and workflows, with focus on robustness, scripting flexibility, and scalable performance.
The SORCERER platform is not hard-coded for SEQUEST, but instead is a general-purpose proteomics search platform that uses the scoring subsystem for algorithm customization. (It was initially prototyped with X!Tandem, and later introduced with SEQUEST.)
At the heart of the SORCERER software architecture is the micro-partitioning of a search job into self-contained “micro-jobs” that are distributed and managed by a relational database.
To further reduce search time, the protein sequences are re-arranged into a peptide-centric data structure when they are first loaded into the SORCERER and “prepared” for peptide searches. Specifically, protein sequences are pre-digested in silico into unmodified peptides, which are sorted by mass and partitioned into 0.5 GB chunks called ‘seqblobs’.
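The preparation step can be sketched in Python as follows. This is an illustration under stated assumptions, not the SORCERER implementation: it assumes tryptic cleavage (after K or R, but not before P) with no missed cleavages, uses standard monoisotopic residue masses, and chunks by peptide count rather than by the 0.5 GB size used in the real system. All function names are hypothetical.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (assumed enzyme rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def build_seqblobs(proteins, peptides_per_blob):
    """Digest every protein, sort peptides by mass, and cut into fixed-size chunks.

    proteins maps a protein ID to its sequence. Each chunk is a list of
    (mass, peptide, protein_id) tuples covering a contiguous mass range.
    """
    records = [
        (peptide_mass(pep), pep, protein_id)
        for protein_id, sequence in proteins.items()
        for pep in tryptic_peptides(sequence)
    ]
    records.sort()
    return [records[i:i + peptides_per_blob]
            for i in range(0, len(records), peptides_per_blob)]


blobs = build_seqblobs({"P1": "AAKGGR", "P2": "GK"}, peptides_per_blob=2)
```

Because each chunk covers a contiguous mass range, a spectrum only needs to be compared against the chunks whose range includes its candidate peptide masses.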
When a large search job is submitted to the SORCERER, it is added to the queue by the queuing subsystem. The Sorcerer PE Application Layer subsystem partitions each search job into possibly thousands of self-contained micro-jobs, each containing 300 spectra with associated seqblobs. With PTM searches, the same unit of spectra may be searched against different seqblobs with different mass ranges. (For example, a spectrum with a 1000 amu precursor mass may correspond to an unmodified peptide of 1000 amu with no mods, or of 920 amu with a single phospho-site.)
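A rough Python sketch of this partitioning is shown below. Several details are assumptions not stated in the article: one micro-job per pairing of a spectra unit with a seqblob, a fixed mass tolerance, phosphorylation as the only PTM, and all of the names used. The phospho mass shift of about 79.966 Da matches the roughly 80 amu difference in the example above.

```python
PHOSPHO_DELTA = 79.966  # approximate mass added by one phosphorylation (HPO3)


def candidate_unmodified_masses(precursor_mass, max_phospho_sites):
    """Unmodified peptide masses that could explain the precursor mass."""
    return [precursor_mass - k * PHOSPHO_DELTA
            for k in range(max_phospho_sites + 1)]


def make_micro_jobs(spectra, seqblob_ranges, spectra_per_job=300,
                    max_phospho_sites=1, tolerance=3.0):
    """Pair units of spectra with every seqblob whose mass range they need.

    spectra is a list of (spectrum_id, precursor_mass) tuples.
    seqblob_ranges maps a seqblob ID to its (min_mass, max_mass) range.
    """
    jobs = []
    ordered = sorted(spectra, key=lambda s: s[1])  # group similar masses together
    for start in range(0, len(ordered), spectra_per_job):
        unit = ordered[start:start + spectra_per_job]
        needed = set()
        for _, precursor in unit:
            for mass in candidate_unmodified_masses(precursor, max_phospho_sites):
                for blob_id, (low, high) in seqblob_ranges.items():
                    if low - tolerance <= mass <= high + tolerance:
                        needed.add(blob_id)
        for blob_id in sorted(needed):
            jobs.append({"spectra": [sid for sid, _ in unit], "seqblob": blob_id})
    return jobs


ranges = {"blob_900": (900.0, 950.0), "blob_1000": (980.0, 1020.0)}
jobs = make_micro_jobs([("s1", 1000.0)], ranges)
```

In this example the single 1000 amu spectrum produces two micro-jobs, one against the seqblob covering 1000 amu and one against the seqblob covering the singly phosphorylated candidate near 920 amu.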
All the micro-jobs are recorded in a MySQL relational database. Available CPU cores from either the master or slave nodes will query the database for the next micro-job, and submit the results when completed.
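The database-as-work-queue pattern described here can be sketched with Python's built-in sqlite3 module, used purely as a stand-in for MySQL so the example is self-contained. The table layout, column names, and function names are all assumptions; the actual SORCERER schema is not given in the article, and a MySQL deployment would need its own locking strategy.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Create the (hypothetical) micro-job table and return the connection."""
    # isolation_level=None lets us control transactions explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS micro_jobs ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT,"
        " result TEXT)"
    )
    return conn


def add_job(conn, payload):
    conn.execute("INSERT INTO micro_jobs (payload) VALUES (?)", (payload,))


def claim_next(conn, worker):
    """Atomically mark the oldest pending micro-job as running and return it."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, payload FROM micro_jobs"
            " WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE micro_jobs SET status = 'running', worker = ? WHERE id = ?",
                (worker, row[0]),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row  # None when no pending work is left


def submit_result(conn, job_id, result):
    conn.execute(
        "UPDATE micro_jobs SET status = 'done', result = ? WHERE id = ?",
        (result, job_id),
    )


def all_done(conn):
    (remaining,) = conn.execute(
        "SELECT COUNT(*) FROM micro_jobs WHERE status != 'done'"
    ).fetchone()
    return remaining == 0
```

Each free CPU core would loop on claim_next and submit_result until no work remains. Because the workers pull jobs rather than having jobs pushed to them, faster cores naturally take on more micro-jobs.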
Since each seqblob contains pre-searched peptide information, each micro-job performs only the scoring function, which is the only part customized to SEQUEST or other search engines. (Before the advent of multi-core CPUs, FPGA subsystems were also used to execute search micro-jobs. Other exotic architectures, such as Nvidia GPUs and the upcoming Intel Larrabee, are also compatible and may be implemented depending on market needs.)
When all the micro-jobs associated with one queued search job are done, the results are aggregated and written out to the file subsystem. An optional MUSE script is also run at this time on the output directory. For example, Ascore phospho-site localization can be done with the search results, or additional re-scoring using different user-defined search engines.
This powerful mechanism also allows algorithm developers to use the SORCERER search as a pre-search function to enrich the peptide candidates to perhaps the top 50 or 500, and then use MUSE scripting to rapidly develop scoring functions to increase accuracy. In particular, algorithm developers can optimize the important scoring functions without needing to first develop the base software to read FASTA files, compute PTM combinations, or perform other necessary but low-value operations.
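The pre-search-then-rescore idea can be shown with a short Python sketch. It does not use or represent the MUSE scripting interface, which is not described in this article; the function name, the candidate format, and the toy scoring function are all hypothetical.

```python
def rescore_top_candidates(candidates, score_fn, top_n=50):
    """Keep the top_n pre-search hits, then re-rank them with a custom score.

    candidates is a list of (peptide, presearch_score) tuples.
    score_fn takes a peptide string and returns a number (higher is better).
    """
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_n]
    rescored = [(peptide, score_fn(peptide)) for peptide, _ in shortlist]
    return sorted(rescored, key=lambda c: c[1], reverse=True)


# Toy example: the "custom score" is just peptide length.
hits = [("AAAK", 1.0), ("GGSTPEPTIDEK", 3.0), ("LLR", 2.0)]
ranked = rescore_top_candidates(hits, score_fn=len, top_n=2)
```

The expensive custom score only runs on the shortlist, which is what makes it practical to experiment with slower but more accurate scoring functions.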
Applications include the analysis of CID+ETD spectra, whereby the top CID search results are used to drive the ETD search, and MS2/MS3 phosphorylation analysis, whereby associated MS3 spectra may be separately searched in MUSE and re-combined with the MS2 results.
The SORCERER architecture includes a ‘custom’ directory, which has a higher priority than the application directory, to allow knowledgeable developers to substitute and overwrite almost any part of the SORCERER platform. (By confining all customization to this directory, it is simple to revert to the original factory state.) Therefore, researchers can start with a powerful, functional workflow using a standard SORCERER product, then customize it as needed, from simple MUSE scripts to a full re-architecting of major subsystems.
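The override mechanism amounts to a two-level search path, which can be sketched in Python as follows. The directory names, file names, and the resolve_component function are made up for illustration; the article does not describe how SORCERER actually performs the lookup.

```python
import tempfile
from pathlib import Path


def resolve_component(name, custom_dir, app_dir):
    """Return the file to use for a component, preferring the custom directory."""
    for base in (Path(custom_dir), Path(app_dir)):
        candidate = base / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{name} not found in {custom_dir} or {app_dir}")


def demo():
    """Build a throwaway directory layout and show which copy wins."""
    with tempfile.TemporaryDirectory() as root:
        app = Path(root) / "app"
        custom = Path(root) / "custom"
        app.mkdir()
        custom.mkdir()
        (app / "score.py").write_text("factory version")
        (app / "report.py").write_text("factory version")
        (custom / "score.py").write_text("site override")

        overridden = resolve_component("score.py", custom, app).read_text()
        untouched = resolve_component("report.py", custom, app).read_text()

        # Deleting the override restores the factory behaviour.
        (custom / "score.py").unlink()
        reverted = resolve_component("score.py", custom, app).read_text()
    return overridden, untouched, reverted
```

Since the application directory is never modified, emptying the custom directory is enough to return to the factory state.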