How to use SORCERER to search ETD MS/MS spectra

Electron transfer dissociation (ETD) is a promising dissociation technology for analyzing labile post-translational modifications (PTMs) such as phosphorylation. Unlike CID, ETD generates positively charged c and z* (z-radical) ions instead of b and y ions. There are two caveats in using standard SEQUEST for ETD tandem mass spectra:

  1. Standard c/z option doesn’t compute z* ions correctly.
  2. Standard SEQUEST allows only low charge states, and would not work for highly charged, long peptides.

It is important to note that z* ions are not the same as z ions: they carry an extra hydrogen (about 1.008 Da monoisotopic mass). This means that the standard SEQUEST option of searching c/z ions will not search ETD spectra correctly, since the computed z ions will have the wrong mass. On SORCERER, correct c/z* ions can be obtained by applying user-defined static peptide terminus modifications to standard b/y searches, as described below. In addition, SORCERER* allows very high precursor charge states (up to +255) to accommodate highly charged species. Here is how to search ETD spectra using SORCERER …

1. Define peptide terminus mods that shift b/y ions to c/z* ions, and use these for ETD searches.

Define the following static peptide terminus modifications using the web interface (click “Add/edit modifications…” on the Search page, then click “New/edit modifications” on top):

  • Name: “BtoC” with Mono Mass: “17.02655” and Type: “N-Terminus”
  • Name: “YtoZrad” with Mono Mass: “-16.01872407” and Type: “C-Terminus”

In both cases, Residue is left blank.
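
As a sanity check on the two numbers above, the shifts can be re-derived from monoisotopic atomic masses. The short Python sketch below is illustrative only (it is not part of SORCERER or MUSE); the constant names are our own, and the atomic masses are standard values rounded to 8 decimals.

```python
# Monoisotopic atomic masses in Da (standard values, rounded).
H = 1.00782503
N = 14.00307401

# b -> c: the c ion carries an extra NH3 relative to the b ion.
B_TO_C = N + 3 * H

# y -> z*: the z ion has lost NH3 relative to the y ion, and the
# z-radical (z*) carries one extra hydrogen compared with the plain z ion.
Y_TO_Z_RADICAL = -(N + 3 * H) + H

print(f"BtoC    = {B_TO_C:+.5f} Da")
print(f"YtoZrad = {Y_TO_Z_RADICAL:+.8f} Da")
```

The difference between the two magnitudes is exactly one hydrogen, which is why the plain c/z option gives z ions with the wrong mass.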

2. Define a new search profile that incorporates the above peptide terminus mods.

In the Search page under “(2) Choose a Search profile”, select the most similar existing search profile, then click “Edit this profile…”. Be sure to name it something different and memorable, then select the above 2 mods under “Terminus modifications” and “Static”. Select other applicable options.

Note that many common post-SEQUEST probability re-scoring algorithms, such as PeptideProphet or Scaffold, are not tuned for ETD scores. From first principles, we believe that the resulting probabilities may not be wrong per se, but may lack specificity.

*The Yates Lab’s version of SEQUEST has 2 code modifications for ETD. The first is the increased charge state limit (same as in SORCERER). The second is exclusion of the proline cleavage, which is not implemented in the standard SORCERER search engine. However, this can be done with a MUSE post-processing step in the future if it is found to have a large effect.

As always, in-warranty clients can contact our TechTeam for help on this and other advanced capabilities.

Video: “Peptide ID with Target-Decoy Searching” by Prof. Josh Elias (Stanford)

Prof. Josh Elias (left) of Stanford University receives a thank-you gift from David Chiang after his talk.

Ever wondered about target-decoy searching? Want to gain a better understanding and realistic expectation of this effective tool? SageNResearch’s video “Addressing Peptide Identification Signal-to-noise With Target-Decoy Searching”, given by Professor Josh Elias of Stanford University at our “Translational Proteomics 2.0” meeting, can help. Dr. Elias is an Assistant Professor in Chemical and Systems Biology at Stanford University, and was part of the Steven Gygi Lab at Harvard Medical School before that. His lab is keenly interested in developing and applying methods to meet the current challenges facing scientists engaged in large scale proteome characterization.

Josh kicked off his talk with a stunning and very powerful visual to hit home the concept of what target-decoy database searching can do — you’ll never look at coffee beans in quite the same way. With this talk, you’ll know how to better find a happy medium for thresholds, smarter ways of designing your filtering criteria, when not to even consider using the method, how to get the most out of (really easy) decoy searching in SORCERER, and what’s so good about partial tryptic searches.

The 30-minute presentation is available at:
To view slides, I recommend using the “full screen” mode. The slide set can also be downloaded as a Powerpoint file.

Secret Insights to Translational Proteomics Success


Proteomics mass spectrometry is finally sensitive and specific enough for robust translational medicine (at least in capable hands), and holds tremendous promise to revolutionize biology and medicine. For some, it holds the key to incredible research power for decades to come.

However, there is a chasm that continues to grow between the productive and unproductive labs, because too many proteomics practitioners focus too early on low-level issues (e.g. cost, automation, ease-of-use) without first resolving high-level ones (e.g. sensitivity in presence of noise, quality of results, algorithmic suitability).

For many researchers experimenting with a new high-resolution instrument, the most common scenario is to select a workflow based on running a simple protein solution, usually a purified BSA solution or a commercial protein mixture.

Since different workflows will give basically identical protein ID results for these simple test cases, they may conclude that all search engines are equivalent. While this is true when there is almost no signal noise, it is largely irrelevant in translational research. In fact, the exact same test will likely show that low-resolution and high-resolution mass specs are equivalent, that the lowest quality reagents will suffice, or maybe that you don’t have to clean your glassware as often. These are also true when there is little or no signal noise, but again, that is irrelevant for real-world research.

Seeing that there is little difference in protein IDs, some focus on using protein coverage as the sole metric for evaluating search engines. However, this is actually the opposite of what is needed for sensitive discovery proteomics. For example, if you are hunting for new protein biomarkers (especially a “one-hit wonder”), you do not want the protein inference engine tuned to assign any ambiguous peptides to already found proteins, thereby hiding them from further study.

Not surprisingly, a workflow selected based on low-noise experiments and focused on protein coverage will excel for simple mixtures, but is not sensitive enough to analyze complex mixtures with wide dynamic range, such as in translational research. Scientists will be able to see the abundant peptides and proteins, but probably little else. That is roughly what most proteomics researchers find today: nothing meaningful, but enough of the obvious that they do not change their methodologies.

The result is that most labs are not getting the value commensurate with their investments in proteomics mass spectrometry. Under the current economic environment, this is both wasteful and dangerous.

Within the academic world, while many proteomics researchers have trouble getting any interest, a select few are swamped and have to turn away collaborators. Within drug discovery firms, while many are staring at their mostly idle mass spectrometers, a select few are running multiple mass spectrometers 24/7 sieving productively through millions of peptides.

So why is the majority of proteomics research not producing high-value results?

With our access into the world’s top academic and drug discovery proteomics labs, we have a unique bird’s eye view into the answer. (However, like attorneys, we never give out client-specific information.)

Please allow me to share some secrets to your future success.


Announcing Sorcerer PE v4.0 with Enhanced ETD and Quantitation

Our R&D team is busy working on the next major version of the Sorcerer-PE software, and expects to release it to then-in-warranty customers in the next few weeks.  Early previews and beta tests of some of the components will be made available by arrangement to qualified customer sites.

Highlights of the upcoming release include:

  • ETD fragmentation support and analysis
  • MUSE scripting modules for rescoring peptide matches with Olsen-Mann and Sadygov-Coon scores
  • Interoperation with major components of the Yates lab Sequest suite, including the DTASelect filtering and statistical analysis tool, and the Census quantitation application
  • Enhancements to the SEQUEST engine which provide first-pass cross-correlation scoring and E-values for greater accuracy and sensitivity


New Target-Decoy capabilities with DTASelect and Muse

We’ve developed a new Muse workflow for target-decoy analysis and false discovery rate estimation, based on our integration of DTASelect from the Yates lab. DTASelect can now use target-decoy FASTA files that are installed on Sorcerer to support its statistical analysis. It provides an easy-to-interpret results report complete with match statistics and estimated false discovery rates.
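
For readers new to the idea, here is a minimal Python sketch of one widely used estimator for a concatenated target-decoy search: the number of decoy hits above a score threshold is doubled and divided by the total number of hits above that threshold. This is an illustration only; it is not the DTASelect code, DTASelect may compute its rates differently, and the function name and input layout are assumptions.

```python
def estimate_fdr(scored_matches, threshold):
    """Estimate the false discovery rate above a score threshold.

    scored_matches: iterable of (score, is_decoy) pairs from a search
    against a concatenated target+decoy database (assumed layout).
    """
    passing = [is_decoy for score, is_decoy in scored_matches if score >= threshold]
    total = len(passing)
    if total == 0:
        return 0.0
    decoys = sum(passing)
    # Each decoy hit is assumed to stand for one unseen false target hit.
    return min(1.0, 2.0 * decoys / total)


matches = [(3.5, False), (3.1, False), (2.9, True), (2.8, False), (1.2, True)]
for cutoff in (3.0, 2.5):
    print(cutoff, estimate_fdr(matches, cutoff))
```

Raising the threshold trades sensitivity for a lower estimated error rate, which is the tuning decision the workflow's report is meant to support.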

Our DTASelect on Sorcerer page on this blog has been updated to describe the target-decoy workflow, in addition to the existing material on installing, configuring and running DTASelect and the Muse script. Please visit it to get links to the latest scripts and for a detailed How-To.

Experts agree: use semi-enzymatic search for best sensitivity and specificity

Three of the world’s leading experts on MS-MS protein identification came together recently at Sage-N Research’s annual user group meeting, and presented methods and results for the techniques and tools with which they are associated:

  • Jimmy Eng, co-inventor of Sequest and developer of many proteomics tools, presented tips for Sequest analysis
  • Josh Elias, who pioneered the systematic use of decoy databases for FDR estimation, gave a talk on how to use that technique to address Peptide ID signal-to-noise.
  • Alexey Nesvizhskii spoke about the tools he co-authored, in “Peptide identification and protein inference using PeptideProphet and ProteinProphet”

Their talks were very wide-ranging and full of practical insights for the proteomics user community, and they explored the different research interests, data sets, analysis methods and workflows in the individual labs.  However, they all had this in common: they had kept a careful eye on their search settings, monitored sensitivity and error rates, and come to a common, if perhaps not entirely intuitive, conclusion: the most sensitive search and the lowest error rates for shotgun proteomics are achieved when using semi-enzymatic searches — that is, when one end, but not both, of the peptide is allowed to diverge from the expected cleavage site.
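
To make the definition concrete, the Python sketch below enumerates candidate peptides from a toy sequence under fully tryptic and semi-tryptic rules. It is a simplification under stated assumptions: trypsin is modeled as cutting after K or R unless the next residue is P, missed cleavages are unlimited, and the function names are ours rather than anything in SEQUEST or SORCERER.

```python
def tryptic_sites(protein):
    """Return allowed cut positions: both ends, plus after K/R unless followed by P."""
    sites = {0, len(protein)}
    for i, aa in enumerate(protein[:-1]):
        if aa in "KR" and protein[i + 1] != "P":
            sites.add(i + 1)
    return sites


def peptides(protein, mode="full", min_len=2):
    """Enumerate candidate peptides.

    mode="full": both termini must sit on a tryptic site.
    mode="semi": at least one terminus must sit on a tryptic site.
    """
    sites = tryptic_sites(protein)
    found = set()
    for start in range(len(protein)):
        for end in range(start + min_len, len(protein) + 1):
            n_ok, c_ok = start in sites, end in sites
            if (mode == "full" and n_ok and c_ok) or (mode == "semi" and (n_ok or c_ok)):
                found.add(protein[start:end])
    return found


full = peptides("AKCDRE", mode="full")
semi = peptides("AKCDRE", mode="semi")
print(sorted(full))
print(sorted(semi - full))  # candidates only a semi-enzymatic search considers
```

The semi-tryptic candidate set is a superset of the fully tryptic one, so the search space grows, but peptides with one non-specific terminus are no longer invisible to the search.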


Video: “SEQUEST and TPP Tips” by Jimmy Eng (U. Washington)

Jimmy Eng (left) of University of Washington receives a thank-you gift from David Chiang after his talk.

During our Translational Proteomics 2.0 Meeting, we were privileged to have Jimmy Eng (University of Washington) give us his uncommon insights into using SEQUEST with the Trans-Proteomic Pipeline (TPP).

This talk will be invaluable for advanced users of the SEQUEST search engine for sensitive translational proteomics analysis. All active SEQUEST users should listen to this talk!

Researchers will benefit by increasing their sensitivity and decreasing their false discovery rates when identifying proteins and post-translational modifications using proteomics mass spectrometers like the Orbitrap.

Jimmy has been one of the most prolific proteomics developers for almost two decades, as the co-inventor (with John Yates) of proteomic search engines and of SEQUEST, as well as the developer of a number of TPP tools.

Conclusions from slides:
- Semi-tryptic searches are better
- Use monoisotopic masses for fragment ions
(Use monoisotopic masses for precursor ions if data from a high-res instrument)
- Narrow mass tolerance searches better if search considers precursor mass isotope assignment error

The talk is available at: (31 minutes).

I recommend using the “full screen” mode so you can view the slides, which are also available as a download from the site.

Secrets to successful workflows for advanced ‘Proteomics 2.0’ analyses

Common PC proteomics software is designed primarily for ease of use at low throughput, with small datasets of up to a few thousand spectra. PC programs such as Mascot generally work fine at this scale.

However, high-throughput and large-scale analysis (e.g. experiments with 100K+ spectra), a foundation capability for biomarker discovery, molecular profiling and advanced post-translational modification research, requires a different methodology because of the increased need for sensitivity, noise reduction, and automation.

Horses for Courses

This British maxim states that what may be suitable for one situation may not be suitable for another, as no one race horse is ideal for all course conditions.

When you need to go somewhere, you would walk, drive, or take a plane depending on whether the distance is 1, 100, or 10000 miles/kilometers, respectively.

If your annual income is USD $1K, $100K, or $10M, you would prepare your tax forms manually, use the TurboTax software, or hire a very expensive accountant, respectively.

However, I still occasionally meet scientists who mistakenly believe they can evaluate a large-scale workflow by using a simple BSA or other standard commercial mixture.

Advanced, large-scale analysis is highly specialized, and requires a lot of messy statistics tested against big datasets for true validation. Unless you enjoy that sort of thing, it’s easier to find someone else you respect who has done the heavy statistical lifting for you, so you can focus on what’s really important for you.

Two common large-scale workflows, both use SEQUEST


Summarize SEQUEST outputs in CSV for Excel with ‘’

The MUSE script ‘’ can be used to summarize the top peptide scores from SORCERER-SEQUEST into a CSV format for importing into Excel.

This is useful for performing non-standard analyses (e.g. separate from PeptideProphet or Scaffold), or for further manipulation of the data using scripting languages like Perl or MUSE.

Simply type “” in the MUSE box (under Advanced Options in the Search page).

It can also be run interactively after the search, by running it inside the output directory for the search job (e.g. “/home/sorcerer/output/45/”), just above the ‘original’ directory.

It will search all subdirectories for *.out files, and turn the top peptide from each *.out file into a single CSV line.

As well, the MUSE script can be copied and modified as needed to customize to a specific format.

Note: is available in Sorcerer PE v3.5+ revisions.

Sequence-based search for N-linked glycopeptides

N-linked protein glycosylation is a common post-translational modification (PTM) in many cellular processes. Atwood et al. (RCMS 2005) describe a tandem mass spec-based methodology to analyze N-linked glycopeptides.

Enriched glycopeptides are treated with peptide N-glycosidase F, which removes the carbohydrate moieties from the peptide backbone. Deglycosylated peptides are analyzed with a tandem mass spec. The resulting MS/MS spectra are searched against a modified protein sequence database that allows only PTMs on N’s within the consensus sequence N-x-y, where x is any residue other than proline, and y is either serine or threonine.

To analyze this PTM on the deglycosylated peptides on SORCERER, we need to search for a monoisotopic mass shift of 0.9840 Da on N’s only in the {N[^P][ST]} consensus sequence.

To search this PTM on the SORCERER, we do the following 3 steps:

1) Create a new protein sequence database that replaces ‘N’ with ‘J’ in the consensus sequence.

2) Prepare this new sequence database for searching by defining ‘J’ to have the same mass as ‘N’ using a static modification setting on ‘J’.

3) Submit a search on SORCERER with a variable modification search on ‘J’ with a mass shift of +0.9840 Da.

Create New Protein Database

Use the MUSE script ‘’ (part of SORCERER PE v3.5) to create a new protein sequence database that replaces each N in the consensus sequence with J.

Simply log onto SORCERER, go to the directory ‘/home/sorcerer/fasta/’ where the protein sequences are, and create a new fasta file from an existing one (for example, create ‘ipi.human_n2j.fasta’ from ‘ipi.HUMAN.fasta’). Then prepare this new fasta file for searching as you would any other protein sequence file.

Once you have logged onto the SORCERER, type the following 2 commands (do not type the ‘sorc$’, which is the SORCERER prompt):

   sorc$ cd /home/sorcerer/fasta/

   sorc$ < ipi.HUMAN.fasta > ipi.human_n2j.fasta

The latter command literally means to run the MUSE script using “standard input” from file ipi.HUMAN.fasta (after the ‘<’ symbol) and sending the “standard output” to the new file ipi.human_n2j.fasta (after the ‘>’ symbol).

(The script may be easily copied and modified for another consensus sequence. Contact TechTeam for details.)
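
For readers who want to see the substitution logic itself, here is a minimal Python sketch of the same idea. It is not the MUSE script shipped with SORCERER; the function name and the 60-column re-wrapping are our assumptions. Sequence lines of each record are joined before matching so that a consensus site split across a line break is still found, and a lookahead is used so that overlapping sites (such as N-N-S-T) are all replaced.

```python
import re

# N followed by any residue except P, then S or T. The lookahead leaves the
# two trailing residues unconsumed, so overlapping sites are each matched.
SEQUON = re.compile(r"N(?=[^P][ST])")


def mark_sequons(lines, width=60):
    """Yield FASTA lines with each consensus-site N replaced by J.

    Header lines (starting with '>') pass through unchanged.
    """
    seq = []

    def flush():
        joined = SEQUON.sub("J", "".join(seq))
        seq.clear()
        for i in range(0, len(joined), width):
            yield joined[i:i + width]

    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            yield from flush()
            yield line
        elif line:
            seq.append(line)
    yield from flush()


example = [">p1 N-test", "AANGSN", "PTNNST"]
print("\n".join(mark_sequons(example)))
```

To adapt the sketch to another consensus sequence, only the regular expression needs to change.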

Prepare Database for Searching

When the new protein sequence database is prepared for searching, assign a static modification ‘MakeN’ of -9885.95707256 Da. This will cause the final ‘J’ mass to be the monoisotopic mass of 114.04292744 Da. (The normally unused codes ‘J’ and ‘U’ are set at 10,000 Da to flag any inadvertent usage.) The resulting peptide database will be used for subsequent searching.
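
The arithmetic behind these numbers can be checked in a few lines of Python. This is only a worked example: the variable names are ours, the atomic masses are standard monoisotopic values rounded to 8 decimals, and the asparagine residue composition (C4H6N2O2) is standard chemistry rather than anything specific to SORCERER.

```python
# Monoisotopic atomic masses in Da (standard values, rounded).
C, H, N, O = 12.0, 1.00782503, 14.00307401, 15.99491462

PLACEHOLDER_MASS = 10000.0      # mass assigned to the unused code 'J' (from the text)
MAKE_N_SHIFT = -9885.95707256   # static modification 'MakeN' (from the text)
NLINK_SHIFT = 0.9840            # variable modification searched on 'J' (from the text)

# Asparagine residue: C4 H6 N2 O2.
asn_residue = 4 * C + 6 * H + 2 * N + 2 * O

j_static = PLACEHOLDER_MASS + MAKE_N_SHIFT   # 'J' after the static mod
j_modified = j_static + NLINK_SHIFT          # 'J' carrying the variable mod

print(f"Asn residue      : {asn_residue:.8f} Da")
print(f"J after MakeN    : {j_static:.8f} Da")
print(f"J with +0.9840 Da: {j_modified:.8f} Da")
```

After the static modification, ‘J’ is indistinguishable in mass from ‘N’, so only the variable modification separates the two residue codes in the search.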


The search can now be submitted by creating a user-defined variable modification ‘Nlinkglyco’ with mass of 0.9840 Da on the residue ‘J’ against the new peptide database.


We thank Dr. Rebekah Gundry from the Van Eyk Lab at Johns Hopkins for bringing this SORCERER application to our attention!

Reference: Atwood et al (Rapid Comm Mass Spec 2005; 19: 3002-3006 DOI: 10.1002/rcm.2162)

Download John Yates talk on Quantitative Mass Spec

Dr. John Yates from the Scripps Research Institute gave the talk “Driving Biological Discovery using Quantitative Mass Spectrometry” at the 2008 Proteomics 2.0 Meeting hosted by Sage-N Research.


The audio MP3 file is available by download here (click to play, right click to download):


The complete slideset is available by download in 5 parts here (click to view, right click to download):






The meeting was held on June 1, 2008 in Denver, just before the ASMS conference.

SORCERER: Evolution of the SEQUEST Architecture


Orbitraps and other fast ion trap mass spectrometers (e.g. FT, LTQ) are popular instruments for discovery proteomics research.

The SEQUEST cross-correlation score is almost tailor-made for the spectral characteristics of ion trap data, whose information-rich spectra are challenging due to multiply-charged ions reported with relatively low fragment mass accuracy. This is especially important for analyzing noisy spectra that arise from low-abundance peptides and phosphorylated peptides, where the information content is embedded in the abundant small peaks.

However, you may be unaware how the basic SEQUEST functionality has evolved from the first ‘sequest27’ prototype program to the latest SORCERER-SEQUEST implementation.

Software continues to evolve to adapt to new requirements. Like a home remodeling job that never ends, at some point it becomes more practical to start over from scratch. After all, maintenance costs are several times higher than the initial development costs over the life of a software product.

The recommended architecture for high-throughput analysis is a client-server system architecture, which separates the interactive user client computer from the heavy-duty number-crunching server. This simplifies the sharing, updating, and backup of the central server, and isolates it from viruses and other sources of system instability from the user accessible client PCs.


Proteomic search engines were first invented by John Yates and Jimmy Eng at the University of Washington in the early 1990s, based on the novel idea that a peptide sequence can be inferred not just from the tandem mass spectrum alone (i.e. de novo sequencing), but using known protein sequences as a reference.

The prototype search engine software was a standalone program named ‘sequest27’ comprising approximately 3000 lines of C code. The source code has since been separately maintained by the Yates Lab and by Thermo, with PTM searches and other modifications added later.

The ‘sequest27’ program processes one mass spectrum at a time, and searches a protein sequence database from beginning to end each time it is run. For example, to analyze a MudPIT experiment with 8,000 spectra, the ‘sequest27’ program is run exactly 8,000 times to generate 8,000 output files, with no attempt to use information from one ‘sequest27’ run to another.


The simplest way to scale up the throughput is to run the same program on many computers at once, such as in a Beowulf cluster architecture.

The SEQUEST Cluster (“SC”) product once marketed by ThermoFinnigan uses this approach, with typically 4 to 32 Linux slave node computers running ‘sequest27’ under the control of the Windows master node computer running Bioworks.

The SC architecture partitions the set of input spectra into smaller sets for each node, and uses the master node to aggregate the results. While this approach is simpler to implement than partitioning the protein sequences, it requires each local disk to contain the same protein files, resulting in inefficient disk usage (e.g. a 16-node cluster searching the NCBI nr file must store 16 identical copies). It also makes the indexed search capability impractical. If the local files are large, then manually copying the files across the network to each node will take a lot of time.

To proteomics researchers new to clusters, the SC architecture seems to offer two benefits: (1) higher throughput than a single computer, and (2) ability to expand throughput in the future by adding nodes. 

However, the devil is in the details. In practice, the cluster may not offer higher throughput than an optimized, non-cluster architecture. As well, future expansion for this software architecture is impractical in light of Moore’s Law.

Depending on the search conditions, one high-end server (say with 8 GB RAM, 1.6 terabyte disk) with an optimized software architecture can outrun a 16-node cluster, whereby each slave node has 1/16th the resources (i.e. 512 MB RAM and 100 GB disk). And it will be simpler to maintain, easier to program, and approximately 16x more reliable. The partitioned RAM and disk resources make system-wide optimization difficult.

Future expansion is also impractical beyond the first year for the SC architecture, since all the slave nodes are assumed to have identical specs. With Moore’s Law predicting 2x performance increase every 18 months at the same price, it is more effective to replace the computing hardware every 2 to 3 years with a brand-new system rather than to try to buy older nodes to add to an old cluster.

Server vs. PC

Servers are not just big Personal Computers (PCs). Quality server hardware is designed for reliable 24/7 multi-processing and continuous disk access, unlike PC hardware designed for the cost-sensitive consumer market.

Robust server operating systems like Enterprise Linux are designed to simultaneously run dozens of independent programs in multi-user environments and to keep crashed programs from affecting other programs.

Server programs have fewer restrictions than PC programs designed for easy installation and use by non-experts. Therefore, they can incorporate powerful server modules like Perl, PHP, Ruby on Rails, Apache, and MySQL, but require IT expertise for installation and configuration. 

One important benefit of the server platform is ease of integration, which is increasingly important as the workflow evolves from just the search engine to a full proteomic workflow. 

In contrast, integration can be very complex on the standard Windows operating system. For example, some mass spec software from different vendors cannot co-exist on the same Windows PC. In general, PC software is easy to install but difficult to integrate, while server software tends to be the opposite.


The SORCERER software architecture was developed from the ground up as a server platform for high-throughput search engines and workflows, with focus on robustness, scripting flexibility, and scalable performance. 

The SORCERER platform is not hard-coded for SEQUEST, but instead is a general-purpose proteomics search platform that uses the scoring subsystem for algorithm customization. (It was initially prototyped with X!Tandem, and later introduced with SEQUEST.)

At the heart of the SORCERER software architecture is the micro-partitioning of a search job into self-contained “micro-jobs” that are distributed and managed by a relational database.

In order to further reduce search time, the protein sequences are re-arranged into a peptide-centric data structure when they are first loaded into the SORCERER and “prepared” for peptide searches. Specifically, protein sequences are pre-digested in silico into unmodified peptides, which are sorted by mass, and partitioned into 0.5 GB chunks called ‘seqblobs’.
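
The idea of a peptide-centric, mass-sorted structure can be sketched in a few lines of Python. This is a toy illustration under our own assumptions, not the SORCERER implementation: the digest is fully tryptic (cut after K or R unless followed by P), blocks are sized by peptide count rather than by bytes, and all names are hypothetical. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def digest(protein):
    """Fully tryptic digest: cut after K or R unless the next residue is P."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if last or (aa in "KR" and protein[i + 1] != "P"):
            pieces.append(protein[start:i + 1])
            start = i + 1
    return pieces


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def build_blocks(proteins, block_size):
    """Digest all proteins, sort unique peptides by mass, split into blocks."""
    unique = {pep for prot in proteins for pep in digest(prot)}
    ordered = sorted(unique, key=lambda p: (peptide_mass(p), p))
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]


blocks = build_blocks(["GKAKE", "AK"], block_size=2)
print(blocks)
```

Because each block covers a contiguous mass range, a spectrum only needs to be scored against the blocks whose range overlaps its precursor mass window.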

When a large search job is submitted to the SORCERER, it is added to the queue by the queuing subsystem. The Sorcerer PE Application Layer subsystem partitions each search job into possibly thousands of self-contained micro-jobs, each containing 300 spectra with associated seqblobs. With PTM searches, the same spectra unit may be searched against different seqblobs with different mass ranges. (For example, a spectrum with 1000 amu precursor mass may have its unmodified peptide sequence be 1000 amu with no mods, or 920 amu with a single phospho-site.)
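
The parenthetical example can be written out as a small Python sketch. It assumes a phosphorylation shift of 79.96633 Da (monoisotopic HPO3) and a symmetric mass tolerance; the function name and return layout are hypothetical and only illustrate why one spectrum may be paired with several mass ranges.

```python
PHOSPHO = 79.96633  # monoisotopic mass of HPO3 in Da


def unmodified_mass_windows(precursor_mass, max_sites, tolerance):
    """List the unmodified-peptide mass windows that could explain a precursor.

    Returns one (sites, low, high) tuple per possible number of phospho-sites.
    """
    windows = []
    for sites in range(max_sites + 1):
        center = precursor_mass - sites * PHOSPHO
        windows.append((sites, center - tolerance, center + tolerance))
    return windows


for sites, low, high in unmodified_mass_windows(1000.0, max_sites=2, tolerance=1.0):
    print(f"{sites} phospho-site(s): unmodified mass {low:.2f} to {high:.2f}")
```

Each window may fall into a different mass-sorted block, which is how the same set of spectra ends up in more than one micro-job.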

All the micro-jobs are recorded in a MySQL relational database. Available CPU cores from either the master or slave nodes will query the database for the next micro-job, and submit the results when completed. 
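
A database-backed work queue of this kind can be sketched as follows. This is a hedged illustration, not the SORCERER schema: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name, column names, and function names are all assumptions. The claim step uses a conditional UPDATE so that two workers cannot both take the same micro-job.

```python
import sqlite3


def create_queue(conn, n_jobs):
    """Create the (hypothetical) micro_jobs table and enqueue n_jobs pending jobs."""
    conn.execute(
        "CREATE TABLE micro_jobs ("
        " id INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT,"
        " result TEXT)"
    )
    conn.executemany(
        "INSERT INTO micro_jobs (id) VALUES (?)",
        [(i,) for i in range(1, n_jobs + 1)],
    )
    conn.commit()


def claim_next(conn, worker):
    """Claim the lowest-numbered pending micro-job; return its id, or None."""
    while True:
        row = conn.execute(
            "SELECT id FROM micro_jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with conn:  # commit on success, roll back on error
            cur = conn.execute(
                "UPDATE micro_jobs SET status = 'running', worker = ?"
                " WHERE id = ? AND status = 'pending'",
                (worker, row[0]),
            )
        if cur.rowcount == 1:  # we won the race for this job
            return row[0]


def submit_result(conn, job_id, result):
    """Record the result of a finished micro-job."""
    with conn:
        conn.execute(
            "UPDATE micro_jobs SET status = 'done', result = ? WHERE id = ?",
            (result, job_id),
        )


conn = sqlite3.connect(":memory:")
create_queue(conn, 3)
job = claim_next(conn, "core-0")
submit_result(conn, job, "scored")
```

Because workers pull work rather than having it pushed to them, faster cores simply claim more micro-jobs, and a node can be added or removed without re-partitioning the search.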

Since each seqblob contains pre-searched peptide information, each micro-job performs only the scoring function, which is the only part customized to SEQUEST or other search engines. (Before the advent of multi-core CPUs, FPGA subsystems were also used to execute search micro-jobs. Other exotic architectures, such as Nvidia GPUs and the upcoming Intel Larrabee, are also compatible and may be implemented depending on market needs.)

When all the micro-jobs associated with one queued search job are done, the results are aggregated and written out to the file subsystem. In addition, an optional MUSE script is run at this time on the output directory. For example, Ascore phospho-site localization can be done with the search results, or additional re-scoring using different user-defined search engines.

This powerful mechanism also allows algorithm developers to use the SORCERER search as a pre-search function to enrich the peptide candidates to perhaps the top 50 or 500, and then use MUSE scripting to rapidly develop scoring functions to increase accuracy. In particular, algorithm developers can optimize the important scoring functions without needing to first develop the base software to read FASTA files, compute PTM combinations, or perform other necessary but low-value operations.

Applications include the analysis of CID+ETD spectra, whereby the top CID search results are used to drive the ETD search, and MS2/MS3 phosphorylation analysis, whereby associated MS3 spectra may be separately searched in MUSE and re-combined with the MS2 results.

The SORCERER architecture includes a ‘custom’ directory, which has a higher priority than the application directory, to allow knowledgeable developers to substitute and overwrite almost any part of the SORCERER platform. (By confining all customization to this directory, it is simple to revert back to the original factory state.) Therefore, researchers can start with a powerful, functional workflow using a standard SORCERER product, then customize it as needed from simple MUSE scripts to a full re-architecting of major subsystems.
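
The override behavior described here amounts to a simple search-path rule, which the Python sketch below illustrates. It is an assumption-laden model, not SORCERER code: the directory arguments and the function name are hypothetical, and the real platform may resolve components differently.

```python
from pathlib import Path


def resolve(relative_path, custom_dir, app_dir):
    """Return the customized copy of a file if one exists, else the stock copy."""
    for base in (Path(custom_dir), Path(app_dir)):
        candidate = base / relative_path
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(relative_path)
```

Under this rule, deleting a file from the custom directory is enough to restore the stock behavior for that component, since the lookup then falls through to the application directory.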

SORCERER Takes Over From SEQUEST Cluster

Discovery proteomics research, such as for biomarker discovery, requires advanced “Proteomics 2.0” analyses for PTMs like phosphorylation, ETD, and quantitation in addition to high-throughput.

With the transfer of the high-throughput SEQUEST Cluster business, the choice for high-throughput data analysis is simplified to one of two SORCERER products, both of which bring powerful “Proteomics 2.0” capabilities with the integrated MUSE scripting environment.

Many advanced proteomics analyses require some level of customization, so the MUSE scripting can be invaluable. For example, some PTMs of interest occur only on certain residues at a peptide terminus, which can be implemented as a post-search filtering step. Workflow automation, such as the compression and copying of results after search completion, can be easily scripted in MUSE. Indeed, the Ascore phospho-site localization algorithm is scripted entirely within MUSE.

Algorithm developers can quickly experiment with new scoring functions, such as for ETD, PTMs, quantitation, or even replicating other common peptide search engines, by simply re-scoring, say, the top 50 candidate peptides from a Sorcerer search. 

SEQUEST Cluster users who have developed custom interface modules to their workflow can most likely adapt their infrastructure to SORCERER with little or no change.

The SORCERER 2 system will be the product of choice for most high-throughput users. It is a plug-and-play, pre-configured Enterprise Linux server. Users can install it in minutes, and immediately use a web browser interface (with a password) from any network PC for uploading and downloading data and submitting search jobs. They will also appreciate the reliability, as many Sorcerer systems in the field have been continuously running for more than a year without downtime.

The SORCERER Enterprise software will be a better fit for high-throughput users who must run software on approved servers within a data center, such as in biopharmaceutical companies or large centralized labs. It can be viewed as an “a la carte” version of the software architecture within the SORCERER 2 IDA, and allows other software to co-exist on the same server. 

The SORCERER Enterprise software can be purchased pre-installed and tested on customer-specified servers. Otherwise, it and its dependent components must be installed and configured by qualified IT staff on qualified powerful servers. As well, the semi-custom nature of the installation and maintenance will result in higher support costs.

Like the SEQUEST Cluster, the SORCERER Enterprise product allows throughput to be increased with additional slave nodes running the SORCERER Enterprise Plus software. Note, however, that each high-performance slave node may be worth 16 nodes of SEQUEST Cluster under common search conditions, so you won’t need as many.

Furthermore, the combination of Thermo Discoverer and Sage-N Research SORCERER provides a powerful, customizable, client-server data analysis platform. Discoverer provides a Windows user interface customizable using the Windows .NET environment, while the SORCERER provides the back-end Enterprise Linux server with MUSE customizability.

See the joint press release at:

If you plan to buy a new Orbitrap or other fast mass spectrometer for discovery proteomics, we would strongly recommend that you include a SORCERER 2 system (or SORCERER Enterprise software if you must run in a data center) in your budget. PC software will not be able to keep up with a frequently used Orbitrap. 

If you have a SEQUEST Cluster that is over 2 years old, we recommend that you upgrade to SORCERER within one year to replace the outdated hardware. And please inquire about the special time-limited upgrade offer to make this transition easier.

Use the 80/20 Rule to Guarantee Research Success

If you have attended a conference lately, such as the ASMS in Denver, you would have found a bewildering array of exciting new products and ideas for your advanced proteomic mass spectrometry research.

More than ever, there is a need to remain focused with both your resources and efforts. Now is a good time to use the “80-20 Rule” to cut through the clutter and sharpen your focus. 

The 80-20 Rule (also called the “Pareto Principle” or the “law of the vital few”) states that in many situations 80% of the effects come from only 20% of the causes. It readily applies to many aspects of mass spectrometry and proteomic research.

Focus on the Vital Few in Proteomics

The key to success is to remain focused on what is truly significant, by strategically investing your resources on the “vital few” products, technologies and people with the highest impact to you. 

Here are some of the ways that the 80-20 Rule may apply to your proteomic research, and how you can sharpen your focus.

Invest in Quality

Less than 20% of today’s products account for 80% of overall sales. The rest (80%!) will fade away. Invest only in high-quality products that thrive, while avoiding products that will disappear.

Ideally, you want to get word-of-mouth references from trusted colleagues before purchasing critical tools for your focus areas. This is especially true for software products, some of which are laden with incomplete features designed to be demo’ed and sold rather than used.

With today’s limited budgets, it is important to invest in solid tools rather than to buy the latest widgets. Avoid the temptation to buy inexpensive rather than high-quality, well-supported products. (Most people find that the costliest products are the cheap ones they buy but do not use.)

To optimize your tools purchase, start with a prioritized “Top 10” capabilities list, then focus on a workflow that can solidly deliver the top 2 must-haves as well as some number of the remaining 8. Given today’s tool market, you will need to integrate tools from different sources to address most of your key requirements.

For the top 2 must-haves, the question is never whether a certain capability (e.g. ETD) is supported, but how well. Do not choose tools on the basis of the length of the features list. And you may find that technical support can be at least as important as the product itself in a highly technical, evolving field like proteomics.

Pamper your Workhorse

Perhaps 20% (i.e. 1 of 5) of your mass spectrometers may impact 80% of your research by generating 80% of the spectral data. For example, many labs with several different mass spectrometers tend to have one workhorse instrument, commonly a fast-scan ion trap (e.g. Orbitrap or LTQ) or TOF/TOF (e.g. 4700) mass spectrometer.

If that is the case in your lab, allocate your resources according to impact, and focus at least half your analysis budget on the one key instrument. Avoid the mistake of having the majority of the mass specs dictate the workflow of your most important one.

With tools like PeptideProphet and Scaffold that can accommodate different search engines, you can tailor the search engine to each mass spec, while maintaining a consistent high-level workflow optimized for different instruments.

Associate with Leaders

Perhaps 20% of the researchers seem to publish 80% of the papers and get 80% of the funding. You need to be part of this elite class. How? By doing top-notch work and focusing on your unique value-added capability to the existing network, while minimizing distractions from low-value activities (particularly IT issues).

Whether in elite athletics like the Olympics or in science, winners associate with winners. To be part of the winner’s circle, you need to maintain the winner’s mindset, use professional-quality tools and methodologies, and bring yourself to that top level in terms of knowledge, expertise, and professionalism.

The best way to get started is to first replicate the workflow used by leaders in the field, which is more efficient than starting from scratch. The workflow will serve as a good foundation for adding your own unique capability while leveraging an existing high-performance infrastructure.

An Integrated Workflow System for World-Class Proteomics

For more than six years, Sage-N Research has worked with the world’s top technologists to develop a professional-quality, server-class, integrated proteomics analysis platform targeted at the top 20% of proteomics laboratories, those expected to make the greatest impact.

The result is the SORCERER 2 system, which delivers a robust workflow incorporating the best technologies from the laboratories of our Scientific Advisory Board members: Drs. John Yates, Steven Gygi and Ruedi Aebersold. In the near future, the SORCERER 2 platform will incorporate ETD/ECD analysis technologies from Dr. Roman Zubarev, our newest scientific advisor.

The SORCERER 2 system is designed to fill an important gap for advanced “Proteomics 2.0” analysis, by delivering a standardized workflow platform that can accommodate 80% of standard proteomics analyses right out of the box, while providing an integrated “MUSE” scripting environment to allow user-customizable post-search analyses and workflows for the remaining 20%. This allows labs to focus on scripting their own unique analysis IP while leveraging an existing powerful workflow system.

Why Many Proteomic “Probabilities” Aren’t

Probability scores make search engine results easier to interpret. However, it is important to understand what they mean in order to avoid assigning more significance to the data than there is.

We continue to find researchers who mistakenly believe that there is only one correct way to compute a probability, and that the probability calculated by well-respected programs must be correct.

In fact, there can be as many different statistical models as there are modellers, and some of the best-known probability scores are simply scores and not true probabilities. The difference? Probabilities need to sum to 1 for mutually exclusive outcomes, while scores do not.

For instance, before a horse race, it helps greatly to know that your favorite horse has less than 2% probability of “not winning” (i.e. more than 98% probability of winning). However, it would not help nearly as much to know that your horse has a 2% probability of “matching the characteristics of a winning horse by random chance” (i.e. falling within the same height and weight range as known winning horses), since several contenders may score similarly. The first is a true probability, while the second is simply a score expressed in probabilistic terms.

Mowse/Mascot Ionscores are not Probabilities

The Mowse score, used in peptide mass fingerprinting, is a “similarity score” derived using a statistical model that calculates the “probability of matching N peaks by random chance”. It does so by assigning such a probability value to each matched m/z peak using a training set of protein sequences, and multiplying all such probability values to compute the composite probability P, which for convenience is expressed as -10 log P.

It is a useful scoring method that provides a higher score when there are more matched peaks or when a peak is judged to be more rare.

However, the Mowse score is a score and not a true probability, since there is no requirement that a higher score for one protein sequence will reduce the score for other protein sequences.
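
As a rough illustration of the scoring arithmetic described above, here is a minimal Python sketch. It is not the actual Mowse or Mascot implementation: the per-peak probabilities are made-up values, the function name is hypothetical, and a base-10 logarithm is assumed for the -10 log P conversion.

```python
import math

def mowse_style_score(peak_probabilities):
    """Return -10*log10(P), where P is the product of the per-peak
    random-match probabilities (assumed independent, as in the text)."""
    log_p = sum(math.log10(p) for p in peak_probabilities)
    return -10.0 * log_p

# Hypothetical per-peak random-match probabilities for two candidates.
candidate_a = [0.10, 0.10, 0.05]        # three matched peaks
candidate_b = [0.10, 0.10, 0.05, 0.01]  # the same peaks plus one rarer peak

score_a = mowse_style_score(candidate_a)
score_b = mowse_style_score(candidate_b)

print(f"candidate A: {score_a:.1f}")
print(f"candidate B: {score_b:.1f}")
```

Each candidate is scored in isolation, so adding the rarer matched peak raises the score for candidate B without changing the score for candidate A. Nothing in the calculation forces the values across candidates to sum to 1, which is why the result is a score rather than a probability.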

The Mascot ionscore is directly derived from Mowse. It uses the Mowse scoring methodology on each ion series individually, and picks the highest score among all the ion series as the composite score. Like Mowse, Mascot assumes that all the m/z peaks in an ion series (say all the b+ ions) are independent, which is a mathematical simplification that is clearly inconsistent with tandem mass spec data.

The Mascot score has proven to be especially useful for scoring tandem mass spectra with high-accuracy (< 10 ppm) fragment mass data, where the significance of each matched peak is high. (The Mascot model does not vary the assigned peak probability with fragment mass accuracy, which may limit its theoretical applicability for ion trap spectra of poorer quality.)

While Mowse and Mascot ionscores have proven their usefulness where they apply, they are not true probabilities and should not be treated as such. For example, it is meaningless to talk about error bars when using these values.

This is true for other useful statistically based similarity scores as well, such as for phosphorylation site localization (Beausoleil et al, Nature Biotech 2006 doi:10.1038/nbt1240) and for combined MS2/MS3 scoring (Olsen & Mann, PNAS 2004 Sep 14).

Neither are P-Values

The P-Value (and its close cousin the E-Value) is a useful statistical construct adopted from the genomics world. Unlike a similarity score that measures how closely the top peptide candidate matches the measured spectrum, the P-Value is a “dissimilarity score” that measures how different the top peptide candidate is from the rest of the search space at large. (For those familiar with SEQUEST, the parameter “deltaCn” serves a similar function, albeit in a less sophisticated manner, and is not probabilistic in value.)

We believe P-Values and E-Values were first introduced into proteomics with the X! Tandem search engine (Fenyo and Beavis, Anal Chem 2003, 75).

The E-Value is an empirically derived “expected value” of how many peptides can achieve a particular score by random chance for a particular spectrum, which is computed by extrapolating the decaying exponential distribution of all the peptide scores for that spectrum. The P-Value is the probability analog computed by dividing the E-Value by the number of candidates.
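
The extrapolation described above can be sketched in a few lines of Python. This is a simplified illustration and not the X! Tandem code: the background scores are synthetic, the straight-line fit uses every candidate rather than only the tail of the distribution, and all names (fit_line, expectation_value, NUM_CANDIDATES, DECAY_RATE) are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares straight-line fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def expectation_value(background_scores, top_score):
    """Extrapolate the exponentially decaying survival counts of the
    background scores to top_score."""
    ordered = sorted(background_scores, reverse=True)
    # The k-th highest score is reached or exceeded by k candidates.
    log_counts = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    slope, intercept = fit_line(ordered, log_counts)
    return math.exp(intercept + slope * top_score)

# Synthetic background: evenly spaced quantiles of an exponential distribution.
NUM_CANDIDATES = 1000
DECAY_RATE = 0.5
background = [
    -math.log(1.0 - (i + 0.5) / NUM_CANDIDATES) / DECAY_RATE
    for i in range(NUM_CANDIDATES)
]

top_score = 30.0  # hypothetical score of the top peptide candidate
e_value = expectation_value(background, top_score)
p_value = e_value / NUM_CANDIDATES  # as described above: E-Value / candidates

print(f"E-Value: {e_value:.2e}  P-Value: {p_value:.2e}")
```

Because the top score lies far beyond the bulk of the background scores, the extrapolated count of random candidates reaching it is much smaller than 1. Note that the result depends only on the spectrum's own score distribution, not on how closely the top candidate matches the spectrum.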

(It is interesting to note that the genomics field has a more rigorous approach to statistics than proteomics today, and would not mistakenly call similarity scores or P-Values “probabilities.” It helps that top genomics practitioners like Stephen Altschul and Eric Lander got their math PhDs long before they did much biology.)

Like the similarity score, the P-Value is also a score and not a probability.

PeptideProphet computes Probabilities

Unlike the similarity scores and “dissimilarity scores” above, the values that the PeptideProphet algorithm (Keller, Nesvizhskii, et al, Anal Chem 2002, 74) computes by rescoring peptide search engine results from SEQUEST, Mascot, X! Tandem, and now Phenyx are probabilities.

This by itself doesn’t mean that the computed values are necessarily correct (that depends on the data and the underlying assumptions), or that there cannot be other equally valid ways to model the statistics. However, at least the definition matches what is expected of a probability.

Much like a teacher may put the test scores on a curve to convert numerical scores into a more meaningful measure, PeptideProphet assumes the score distribution arises from a large “false positive” distribution superimposed on a smaller “true positive” distribution, and uses curve-fitting to compute the resulting probabilities. Where the FP and TP curves intersect is the 50% probability point.
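
The idea can be sketched in Python as follows. This is a simplified illustration and not the PeptideProphet implementation: the curve-fitting step is skipped, both distributions are assumed to be Gaussian, and the mixture parameters are hypothetical values standing in for an already completed fit.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical, already-fitted mixture parameters (weight, mean, std dev).
FP_WEIGHT, FP_MEAN, FP_SD = 0.8, 0.0, 1.0  # large false positive distribution
TP_WEIGHT, TP_MEAN, TP_SD = 0.2, 4.0, 1.0  # smaller true positive distribution

def probability_correct(score):
    """Share of the mixture density at this score owed to true positives."""
    tp = TP_WEIGHT * normal_pdf(score, TP_MEAN, TP_SD)
    fp = FP_WEIGHT * normal_pdf(score, FP_MEAN, FP_SD)
    return tp / (tp + fp)

def crossover_score(lo, hi, tol=1e-9):
    """Bisect for the score where the weighted FP and TP curves intersect.
    Assumes probability_correct increases monotonically between lo and hi."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if probability_correct(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

cross = crossover_score(FP_MEAN, TP_MEAN)
for score in (0.0, cross, 4.0):
    print(f"score {score:.3f}: P(correct) = {probability_correct(score):.4f}")
```

For any score, the probability of a correct assignment and the probability of an incorrect one (1 minus the value returned) sum to 1, which is the property that distinguishes these values from the scores discussed earlier.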

As with any other statistical tool, its results are only as valid as its assumptions. There is always a chance of “garbage-in, garbage-out,” and the results depend on clean, well-fitting data.

PeptideProphet was originally designed to work with SEQUEST, where a “discriminant score” is derived by combining the similarity score (XCorr) and the “dissimilarity score” (deltaCn). Ideally, the discriminant score should incorporate both elements (the case for rescoring SEQUEST results), so that the highest probability is assigned where (1) the top candidate very closely matches the spectrum and (2) the top candidate is very dissimilar from the others in the search space.

It has since been adapted for Mascot (albeit using only the similarity score) and other search engines. PeptideProphet is part of the Trans-Proteomic Pipeline, while the same algorithm has been re-implemented in commercial products like Scaffold and Elucidator.

It should be noted that a probability rescoring at the peptide assignment stage is not the only way to filter the search engine results. Other methods, notably using decoy (reversed) protein sequence databases, can be employed with parameter-based filtering, such as with DTASelect from the Yates Lab. These methods allow the final false positive rates to be computed without requiring individual peptide assignments to be probabilistically determined. In the future, we expect that many of these different methodologies can be integrated to achieve the highest level of results quality in advanced “Proteomics 2.0” analysis driving the upcoming “Biotech Industrial Revolution.”
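
As an illustration of decoy-based filtering, here is a minimal Python sketch. It is not the DTASelect algorithm: the peptide matches are made-up (score, is_decoy) pairs, the function names are hypothetical, and the false positive rate is estimated with one common convention (decoy hits divided by target hits above the threshold).

```python
def estimated_false_positive_rate(matches, threshold):
    """Estimate the false positive rate among target hits scoring at or
    above threshold, as (decoy hits) / (target hits)."""
    targets = sum(1 for score, is_decoy in matches
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets

def lowest_threshold_for_rate(matches, max_rate):
    """Return the lowest observed score whose estimated rate is within max_rate."""
    for threshold in sorted({score for score, _ in matches}):
        if estimated_false_positive_rate(matches, threshold) <= max_rate:
            return threshold
    return None

# Hypothetical peptide matches: (search score, hit came from the decoy database).
matches = [
    (4.5, False), (4.1, False), (3.8, False), (3.5, False),
    (3.2, True), (3.0, False), (2.7, False), (2.4, True),
    (2.2, False), (2.0, True), (1.8, True), (1.5, False),
]

print(estimated_false_positive_rate(matches, 3.0))
print(lowest_threshold_for_rate(matches, 0.2))
```

The filter yields a rate for the accepted set as a whole. No individual peptide assignment receives a probability, which matches the distinction drawn above.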