Sorcerer


Release 4.1 is an update to V4.0, which was released only as beta software to a limited number of users, so this will be the first general release in the Sorcerer PE version 4 series. The release is currently entering a beta-testing period, after which (probably in late summer) it will be made available to Sorcerer customers with active support arrangements and installed on newly purchased Sorcerer systems.

This release contains enhancements in many different areas of the Sorcerer software:

  • The SEQUEST 3G scoring module has new features to improve the sensitivity and thoroughness of peptide searches.
  • The data flows for Sorcerer processing have been rearchitected to use MS2 and SQT data formats instead of the legacy SEQUEST DTA and OUT file formats.
  • To address the problem of extracting data from recent RAW files, an interface has been developed within the Sorcerer software to connect to a separate Windows system and remotely run ProteoWizard’s new MSConvert extractor with instrument-specific libraries.
  • The bundled version of Trans-Proteomic Pipeline software is updated to V4.4.1, which offers multiple enhancements.
  • The new Sorcerer software now supports Scaffold V3.1.2, with new features in TIC quantitation and batch file merging.
  • The Scaffold flow has also been reworked on the Sorcerer side, enabling users to identify multiple biosamples for Scaffold in a single search.
  • A new Web API for submitting and getting results from Sorcerer searches over the network has been implemented to help developers use Sorcerer as a search engine within their programs and scripts.
  • This software release has been designed as a component of the Sorcerer-as-a-platform architecture, co-existing with other life science analysis software.
  • The MUSE scripting framework has been enhanced to allow more powerful scripts to customize Sorcerer searching.


by David.Chiang@SageNResearch.com

Proteomics technology is now a robust discovery tool, at least in capable hands with the right tools, for characterizing post-translational modifications such as phosphorylation, right alongside gene expression and cellular imaging for tumor and stem cell research.

However, the complexity, scale, and criticality of the data from a modern mass spectrometer such as an Orbitrap Velos are well beyond the capability of desktop PCs and require specialized infrastructure IT solutions.

When losing data becomes catastrophic rather than merely annoying, it is time to move beyond PCs into robust infrastructure solutions, such as Sage-N Research’s SORCERER Enterprise system. Unlike traditional business-oriented IT systems, the SORCERER Enterprise system is optimized for the large multi-gigabyte data files of proteomics research.

Robust servers and storage systems provide the needed capacity, reliability, and throughput for storing and analyzing proteomics mass spec data that inexpensive PCs cannot provide. For example, a typical throughput of 300GB of raw data per week for a single mass spec will fill up a PC in less than a month. As well, the lower grade disk drives used in cost-sensitive, consumer-oriented PCs and external USB drives can lead to costly data loss and system downtime.

In addition, the nature of the data analysis needed for proteomics is changing, as it becomes more akin to hedge fund data mining than an administrative assistant running an Excel spreadsheet. This is especially true for quantitation and ETD data analyses where the field has not settled onto a de facto one-size-fits-all methodology, and where some semi-customization of the analysis to query and adapt to a particular data-set will be necessary. This is why the large-scale SILAC papers are always done by research groups with their own bioinformatics resource, and why just about any off-the-shelf software you can download or buy will probably not work well for your needs without some customization.

Why does quantitation or ETD software need to be semi-customized?

The meeting is open to in-warranty Sorcerer customers and by invitation only. Pre-registration is required. A light buffet and refreshments are being provided, and there will be a drawing for customer door prizes. We will have the brand-new, ultra-cool Apple iPad as our door prize! (But make sure you come on time for the best chance to win!)

We are privileged to have Profs John Yates (Scripps) and Steven Gygi (Harvard) confirmed to give talks. We will also have training talks on the new SEQUEST 3G and the new VersaSearch technology on the SORCERER platform.

If you wish to receive a meeting invitation, please contact: tnowak@sagenresearch.com.
Seating will be limited, so reserve your spot today!

Date: Sunday 23rd May 2010
Time: 1:30 PM to 5:00 PM
Address: Hotel Monaco, 15 West 200 South, Salt Lake City, UT 84101 (801) 595-0000
Room: Suite Paris A

Hope to see many of you there!



Prof. Josh Elias (left) of Stanford University receives a thank-you gift from David Chiang after his talk.

Ever wondered about target-decoy searching? Want to gain a better understanding and realistic expectation of this effective tool? SageNResearch’s video “Addressing Peptide Identification Signal-to-noise With Target-Decoy Searching”, given by Professor Josh Elias of Stanford University at our “Translational Proteomics 2.0” meeting, can help. Dr. Elias is an Assistant Professor in Chemical and Systems Biology at Stanford University, and was part of the Steven Gygi Lab at Harvard Medical School before that. His lab is keenly interested in developing and applying methods to meet the current challenges facing scientists engaged in large-scale proteome characterization.

Josh kicked off his talk with a stunning and very powerful visual to hit home the concept of what target-decoy database searching can do — you’ll never look at coffee beans in quite the same way. With this talk, you’ll know how to better find a happy medium for thresholds, smarter ways of designing your filtering criteria, when not to even consider using the method, how to get the most out of (really easy) decoy searching in SORCERER, and what’s so good about partial tryptic searches.

The 30-minute presentation is available at: http://www.scivee.tv/node/15544
To view slides, I recommend using the “full screen” mode. The slide set can also be downloaded as a Powerpoint file.



Prof. Alexey Nesvizhskii (left) of University of Michigan receives a thank-you gift from David Chiang after his talk.

If you really want to understand how peptide and protein identification is done, this video talk is a must-see!

Professor Alexey Nesvizhskii of the University of Michigan is one of the co-inventors (with Dr. Andy Keller) of the popular PeptideProphet/ProteinProphet algorithm for turning search engine results into statistically consistent peptide and protein identifications. (This algorithm is also the basis for the popular Scaffold software.)

At the “Translational Proteomics 2.0” meeting, we were privileged to have Alexey give his insightful talk that reviews the various steps involved in inferring peptide and protein identifications from large spectra datasets.

In this talk, you will learn why False Discovery Rates are preferred over P-values, why you probably should not run more than 4 replicates of a MudPIT experiment, how FDR estimations from decoy differ from Peptide/ProteinProphet, how “The Two Prophets” compute probabilities by curve-fitting the score distributions, how sensitivity and FDR are computed, and the what and why of some advanced TPP options.

The talk is available at: http://www.scivee.tv/node/12671 (45 minutes).

I recommend using the “full screen” mode so you can view the slides, which are also available as a download from the site. (Please be aware that the slideset order is different from that in the presentation.)

(Note: Both Trans-Proteomic Pipeline and Scaffold Batch software are integrated into the SORCERER platforms.)


by David.Chiang@SageNResearch.com

Proteomics mass spectrometry is finally sensitive and specific enough for robust translational medicine (at least in capable hands), and holds tremendous promise to revolutionize biology and medicine. For some, it holds the key to incredible research power for decades to come.

However, there is a chasm that continues to grow between the productive and unproductive labs, because too many proteomics practitioners focus too early on low-level issues (e.g. cost, automation, ease-of-use) without first resolving high-level ones (e.g. sensitivity in the presence of noise, quality of results, algorithmic suitability).

For many researchers experimenting with a new high-resolution instrument, the most common scenario is to select a workflow based on running a simple protein solution, usually a purified BSA solution or a commercial protein mixture.

Since different workflows will give basically identical protein ID results for these simple test cases, they may conclude that all search engines are equivalent. While that is true when there is almost no signal noise, it is largely irrelevant in translational research. In fact, the exact same test will likely show that low-resolution and high-resolution mass specs are equivalent, that the lowest quality reagents will suffice, or that you don’t have to clean your glassware as often. These are also true when there is little or no signal noise, but again, that is irrelevant for real-world research.

Seeing that there is little difference in protein IDs, some focus on using protein coverage as the sole metric for evaluating search engines. However, this is actually the opposite of what is needed for sensitive discovery proteomics. For example, if you are hunting for new protein biomarkers (especially a “one-hit wonder”), you do not want the protein inference engine tuned to assigning any ambiguous peptides to already found proteins, thereby hiding them from further study.

Not surprisingly, a workflow selected based on low-noise experiments and focused on protein coverage will excel for simple mixtures, but is not sensitive enough to analyze complex mixtures with wide dynamic range, such as in translational research. Scientists will be able to see the abundant peptides and proteins, but probably little else. That is roughly what most proteomics researchers find today: nothing meaningful, but enough of the obvious that they see no reason to change their methodologies.

The result is that most labs are not getting the value commensurate with their investments in proteomics mass spectrometry. Under the current economic environment, this is both wasteful and dangerous.

Within the academic world, while many proteomics researchers have trouble getting any interest, a select few are swamped and have to turn away collaborators. Within drug discovery firms, while many are staring at their mostly idle mass spectrometers, a select few are running multiple mass spectrometers 24/7 sieving productively through millions of peptides.

So why is the majority of proteomics research not producing high-value results?

With our access into the world’s top academic and drug discovery proteomics labs, we have a unique bird’s eye view into the answer. (However, like attorneys, we never give out client-specific information.)

Please allow me to share some secrets to your future success.



“Translational Proteomics 2.0” 2009 Users Meeting in Philadelphia.
Guest speakers Jimmy Eng (UWashington), Alexey Nesvizhskii (UMichigan), Josh Elias (Stanford), along with SAB member John Yates (Scripps) are in the middle row.


Stanford’s Dr. Chris Adams (left) must be feeling pretty lucky!
He gets to use a SORCERER 2 for his research (as part of Allis Chien’s mass spec core facility), AND wins an Acer One netbook door prize from David Chiang!

Translational proteomics — aka Proteomics 2.0 — is high-sensitivity proteomics for translational research, whose mastery is your key to unimaginable fame and fortune in biology and medicine!

Whether you need to catch up or to keep up, you need to hear the leading proteomics technologists reveal their secrets!

We were fortunate to have three of the most accomplished technologists (Mr. Jimmy Eng, Prof Josh Elias, and Prof Alexey Nesvizhskii) at our “Translational Proteomics 2.0 Meeting” give their insider insights on high-sensitivity data analysis.

In addition, we were privileged to have Sage-N Research SAB advisor Prof John Yates, one of the fathers of proteomics, attend our meeting and join in our lively panel discussions regarding the present and future of translational proteomics.

From the talks, these are tips for best sensitivity and specificity:

* There are several equivalent ways to calculate precursor mass, all of which can result in several AMUs of mass error due to incorrect isotope assignment.
* Semi-tryptic settings for database searching give the best performance
* Use a wider mass tolerance than your experiments will yield
* However, you don’t need a wide mass tolerance for searching if (a) you use isotope shift check and (b) you have a decent source of noisy peptide, e.g. with semi-enzyme search
* Post-process peptide IDs with proper statistical tools (e.g. PeptideProphet, DTASelect or target-decoy analysis)
* Key is to monitor the false discovery rates (FDR) with different filtering criteria
* Use monoisotopic mass for fragment ions, and for precursor ions if using high-resolution instrument
* P-values or E-values are not good for large-scale proteomics, because they don’t give you estimated error rates for a given score cut-off, and they ignore other relevant factors (e.g. retention time, mass accuracy, etc.)
* The target-decoy method is a simple and effective means of FDR estimation. It gives scores more discriminatory power by improving signal-to-noise ratio.
* Can use search scores in combination with other characteristics to get more good IDs at a particular FDR than by using score alone
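
The target-decoy tips above can be illustrated with a minimal Python sketch. It assumes a search against a combined target and decoy database in which every peptide-spectrum match (PSM) carries a score and a decoy flag, and it uses the simple decoys/targets estimator of FDR (other conventions exist, such as doubling the decoy count over all accepted matches). The function names and the example data are hypothetical and are not SORCERER output.

```python
from typing import List, Tuple

# Each PSM is (score, is_decoy). The values below are invented for illustration.
PSM = Tuple[float, bool]

example_psms: List[PSM] = [
    (4.5, False), (4.1, False), (3.8, False), (3.5, True), (3.2, False),
    (2.9, False), (2.5, True), (2.2, False), (2.0, True), (1.8, True),
]


def estimate_fdr(psms: List[PSM], cutoff: float) -> float:
    """Estimate FDR among PSMs scoring at or above the cutoff.

    Uses the simple decoys/targets estimator; returns 0.0 if no target passes.
    """
    targets = sum(1 for score, is_decoy in psms if score >= cutoff and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= cutoff and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


def lowest_cutoff_for_fdr(psms: List[PSM], max_fdr: float) -> float:
    """Return the lowest observed score whose estimated FDR is within max_fdr.

    Returns infinity if no observed score satisfies the limit.
    """
    best = float("inf")
    for cutoff in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, cutoff) <= max_fdr:
            best = cutoff
    return best
```

Calling `estimate_fdr` at several cut-offs is one way to monitor how the FDR changes with different filtering criteria, as the tips recommend.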

We will be publishing the meeting talks online. Watch this space for details!


Hear Khatereh discuss her work and her success with the SORCERER 2 system!

Dr. Khatereh Motamedchaboki is currently the Manager of the Proteomics Facility at the Burnham Institute for Medical Research.

She is one of our increasing number of two-time SORCERER success stories, as a previous user at the Ebrahim Zandi Lab at the University of Southern California.

Reference: Laurence M. Brill, Khatereh Motamedchaboki, Shuangding Wu, and Dieter A. Wolf, “Comprehensive proteomic analysis of Schizosaccharomyces pombe by two-dimensional HPLC-tandem mass spectrometry”, Methods (2009), doi:10.1016/j.ymeth.2009.02.023.

Click Here to See Video


Our R&D team is busy working on the next major version of the Sorcerer-PE software, and expects to release it to then-in-warranty customers in the next few weeks. Early previews and beta tests of some of the components will be made available by arrangement to qualified customer sites.

Highlights of the upcoming release include:

  • ETD fragmentation support and analysis
  • MUSE scripting modules for rescoring peptide matches with Olsen-Mann and Sadygov-Coon scores
  • Interoperation with major components of the Yates lab Sequest suite, including the DTASelect filtering and statistical analysis tool, and the Census quantitation application
  • Enhancements to the SEQUEST engine which provide first-pass cross-correlation scoring and E-values for greater accuracy and sensitivity


If you are interested in using Ascore as described in the application note on the blog, please contact us for new Muse scripts for your Sorcerer. We’ve just updated them, and they are needed to work with the recent v4.0 release of TPP, which is what’s in the current Sorcerer release.


Sage-N Research is hosting its annual users’ meeting on the afternoon of Sunday, 31st May, immediately before the ASMS meeting in Philadelphia. We are proud to announce a compelling agenda with talks from the principal developers of several of the key proteomics data analysis methods that are used as standard in the community, including SEQUEST, target-decoy search strategies, and Peptide/Protein Prophet. The insights our clients will take away from this meeting will be very relevant to their use of Sorcerer, and promise to enhance their proteomics analysis productivity greatly.

2009 Users’ Meeting Arrangements

The meeting is open to in-warranty Sorcerer customers and by invitation only. Pre-registration is required. A light buffet and refreshments are being provided, and there will be a drawing for customer door prizes. Attending this meeting is your chance to win one of three Acer Aspire One netbook computers that we are giving away to our customers (must pre-register and be present to win)!

Date: Sunday 31st May 2009
Time: 1:30 PM to 5:00 PM
Address: Courtyard Marriott Hotel, 21 N. Juniper St, Philadelphia.
Room: Ballroom Level 1

Agenda

1:30 PM   Welcome and introductory remarks

1:45 PM    “What’s new in Sorcerer”
James Candlin, Sage-N Research, Inc.

2:05 PM    “Sequest analysis tips”
Jimmy Eng, University of Washington

2:45 PM    “Using target-decoy searching to visualize peptide identification signal-to-noise”
Dr. Joshua Elias, Stanford University School of Medicine

3:15 PM    Break and refreshments

3:50 PM   “Peptide identification and protein inference using PeptideProphet and ProteinProphet”
Dr. Alexey Nesvizhskii, University of Michigan

4:30 PM   Panel Discussion: “Putting it together: strategies for a productive proteomics analysis workflow”

4:50 PM   Concluding remarks

5:00 PM   End of meeting


Hear Dr. Laurence Brill, senior research scientist at the Burnham Institute (La Jolla, CA) describe his advanced proteomics setup with the SORCERER 2 system:

Click here to hear Dr. Laurence Brill

Reference: Laurence M. Brill, Khatereh Motamedchaboki, Shuangding Wu, and Dieter A. Wolf, “Comprehensive proteomic analysis of Schizosaccharomyces pombe by two-dimensional HPLC-tandem mass spectrometry”, Methods (2009), doi:10.1016/j.ymeth.2009.02.023.

Click here for another Success Profile


We are pleased to announce the availability of the ISIS (Integrated Storage and Information System), which is configured and integrated to work directly with the SORCERER Enterprise bladecenter system to provide 4 to 100+ terabytes of integrated, protected storage for proteomics, genomics, imaging, and other repository needs. A second backup ISIS system can be configured offsite to provide additional backup and disaster recovery needs. To simplify maintenance and warranty for our clients, it will be covered under the same warranty plan as the SORCERER system for 3 years or 5 years.

The base ISIS system will provide approximately 4.1 terabytes of secure storage in a “2U” height, rack-mount system, consisting of twelve 450 GB SAS disks with 2 disk redundancy in RAID6.
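
As a rough check of the capacity figure above, here is a small Python sketch of the RAID6 arithmetic for twelve 450 GB disks. It assumes that the drive sizes are quoted in decimal gigabytes and that the "approximately 4.1 terabytes" figure is in binary terabytes; both are assumptions, and file-system overhead is ignored. The function name is hypothetical.

```python
def raid6_usable_gb(disk_count: int, disk_size_gb: float) -> float:
    """Usable capacity of one RAID6 group.

    RAID6 reserves the equivalent of two disks for parity, so the usable
    space is (disk_count - 2) * disk_size.
    """
    if disk_count < 4:
        raise ValueError("RAID6 needs at least 4 disks")
    return (disk_count - 2) * disk_size_gb


usable_gb = raid6_usable_gb(12, 450)      # 10 data disks x 450 GB
usable_tib = usable_gb * 1e9 / 2**40      # decimal GB converted to binary TB
print(f"{usable_gb:.0f} GB usable, about {usable_tib:.2f} binary TB")
```

Under these assumptions the ten data disks provide 4500 GB, which is a little over 4 binary terabytes.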

In most countries, the ISIS system consists of the following:

- ISIS storage integration software interface running on SORCERER platform
- Fujitsu ETERNUS DX80 with single controller
- Approximately 4.1TB usable (12 x 1TB SATA disks in RAID6) per 2U rack, with up to 20 racks
- Min 3 year warranty is included (subject to the TSP coverage of the SORCERER)

Note that future expansion to 100+ TB will require additional ISIS expansion units or higher density SAS drives.

New clients can order the SORCERER Enterprise blade system with the ISIS system together as two rack-mount units. Clients with newer SORCERER 2 integrated data appliances with at least 8 CPU cores can simply add the ISIS to their existing system. (Older SORCERER systems will require a hardware upgrade.)

Please contact sales@SageNResearch.com for more information.


Here are some notes from the TPP support group on using Tandem Mass Tags (i.e. similar to iTRAQ):

http://groups.google.com/group/spctools-discuss/browse_thread/thread/98dcb28f8dfa2349?hl=en

Here is Thermo’s TMT information:

http://www.thermo.com/com/cda/article/general/1,,20815,00.html

Note that TMT pre-dates iTRAQ, and is a significantly larger molecular tag. At present, iTRAQ has a larger marketshare than TMT.


Common PC proteomic software is designed primarily to be easy to use with low throughput and small datasets of up to a few thousand spectra. PC programs such as Mascot generally work fine at this scale.

However, high-throughput and large-scale analysis (e.g. 100K+ spectra experiments), a foundation capability for biomarker discovery, molecular profiling, and advanced post-translational modification research, requires a different methodology because of the increased need for sensitivity, noise reduction, and automation.

Horses for Courses

This British maxim states that what may be suitable for one situation may not be suitable for another, as no one race horse is ideal for all course conditions.

When you need to go somewhere, you would walk, drive, or take a plane depending on whether the distance is 1, 100, or 10000 miles/kilometers, respectively.

If your annual income is USD $1K, $100K, or $10M, you would prepare your tax forms manually, use the TurboTax software, or hire a very expensive accountant, respectively.

However, I still occasionally meet scientists who mistakenly believe they can evaluate a large-scale workflow by using a simple BSA or other standard commercial mixture.

Advanced, large-scale analysis is highly specialized, and requires a lot of messy statistics tested against big datasets for true validation. Unless you enjoy that sort of thing, it’s easier to find someone else you respect who has done the heavy statistical lifting for you, so you can focus on what’s really important for you.

Two common large-scale workflows, both use SEQUEST


A ‘Muse’ is a Greek goddess with inspirational and creative power — perhaps someone you might expect to hang out with Sorcerers!

Indeed, MUSE(R) is a recursive acronym for “MUSE Utilities for Search Engines”. The MUSE platform is developed to allow rapid prototyping of new scoring algorithms, such as for “Proteomics 2.0” analyses of PTMs, ETD, and quantitation.

There are currently three big challenges in proteomic data analysis:

  1. Data scale and throughput
  2. Workflow integration
  3. Analysis flexibility

The few proteomic labs with the compute servers to handle large-scale data-sets, the know-how to integrate robust workflows, and the programming capability to develop semi-custom analyses and algorithms can do big science. Today, much more than instrumentation, data analysis capability separates the ‘haves’ from the ‘have-nots’ in proteomics research.

The SORCERER 2 appliance already addresses throughput and workflow integration. With the new MUSE integrated scripting platform, the SORCERER 2 appliance can now address all three to provide the most advanced platform for advanced proteomic data analysis.

The MUSE platform is specifically designed to allow trained researchers to quickly interrogate, filter, and manipulate their large-scale data-sets interactively, along with easy-to-use scripting libraries for developing new scoring functions that compare spectra against a peptide sequence with PTMs.

Technically, the MUSE platform consists of two components: the MUSE scripting language and the MUSE scripting environment.

The MUSE scripting language is a proteomics extension of the Lua language, which was popularized by online video games due to its speed and extensibility. (It is considered one of the fastest scripting languages, is easier to read than Perl, and is syntactically similar to Java.)

The MUSE scripting environment is based on the Bash shell, and includes Perl, PHP, sed, awk, and other popular tools on a 64-bit Enterprise Linux platform, with three decades of robust history.

Even with the very first MUSE platform, it is possible to write single lines to make regular expression substitutions, sort search results by score or delta-mass, write new scoring functions, re-arrange or combine fields, or change formats.

In one test case, we were able to write the search results into a virtual spreadsheet with 6000 rows and 6000 columns that can be filtered and sorted at will. With adequate training and tech support, researchers can, for example, rapidly sort results by XCorr or mass difference, search for phosphorylated sites, and convert PTM symbols to actual masses without programming.
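
MUSE scripts themselves are not reproduced here. Purely as an illustration of the kind of manipulation described, the following Python sketch performs the same three operations on tab-delimited search results: sorting by XCorr, selecting phosphorylated peptides, and replacing a PTM symbol with its mass shift. The column names (`peptide`, `xcorr`, `delta_mass`), the use of `*` for phosphorylation, and the example rows are all assumptions made for this sketch and do not describe an actual Sorcerer output format.

```python
import csv
import io
import re
from typing import Dict, List

# Hypothetical mapping from a modification symbol to its mass shift in Da.
# Which symbol stands for which modification depends on the search settings.
SYMBOL_MASS: Dict[str, float] = {"*": 79.96633}  # phosphorylation (HPO3)

# Invented example data with a header row.
EXAMPLE_TDT = (
    "peptide\txcorr\tdelta_mass\n"
    "KLYNK\t2.10\t0.02\n"
    "AS*PEK\t3.45\t-0.01\n"
    "GT*LLR\t1.75\t0.50\n"
)


def load_rows(tdt_text: str) -> List[Dict[str, str]]:
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(tdt_text), delimiter="\t"))


def sort_by_xcorr(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return rows ordered from highest to lowest XCorr."""
    return sorted(rows, key=lambda row: float(row["xcorr"]), reverse=True)


def phosphorylated(rows: List[Dict[str, str]], symbol: str = "*") -> List[Dict[str, str]]:
    """Keep only rows whose peptide string carries the given symbol."""
    return [row for row in rows if symbol in row["peptide"]]


def symbols_to_masses(peptide: str) -> str:
    """Replace each known symbol with a bracketed mass shift."""
    pattern = re.compile("|".join(re.escape(s) for s in SYMBOL_MASS))
    return pattern.sub(lambda m: f"[+{SYMBOL_MASS[m.group(0)]:.4f}]", peptide)
```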

You can see MUSE examples at the Proteomics 2.0 blog by searching for “MUSE”:   http://www.proteomics2.com/ .


Electron transfer dissociation (ETD) is a promising dissociation technology for analyzing labile post-translational modifications (PTMs) such as phosphorylation. Unlike CID, ETD generates positively charged c and z* (z-radical) ions instead of b and y ions. There are two caveats in using standard SEQUEST for ETD tandem mass spectra:

  1. Standard c/z option doesn’t compute z* ions correctly.
  2. Standard SEQUEST allows only low charge states, and would not work for highly charged, long peptides.

It is important to note that z* ions are not the same as z ions: they carry an extra hydrogen (about 1.008 Da monoisotopic mass). This means that the standard SEQUEST option of searching c/z ions will not search ETD spectra correctly, since the computed z ions will have the wrong mass. On SORCERER, correct c/z* ions can be obtained using user-defined static peptide terminus modifications on standard b/y searches, as described below. As well, SORCERER-SEQUEST* allows very high precursor charge states (up to +255) in order to accommodate highly charged species. Here is how to search ETD spectra using SORCERER …

1. Define peptide terminus mods that shift b/y ions to c/z* ions, and use these for ETD searches.

Define the following static peptide terminus modifications using the web interface (click “Add/edit modifications…” on the Search page, then click “New/edit modifications” on top):

  • Name: “BtoC” with Mono Mass: “17.02655” and Type=“N-Terminus”
  • Name: “YtoZrad” with Mono Mass: “-16.01872407” and Type=“C-Terminus”

In both cases, Residue is left blank.
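
The two mass values above can be checked from standard monoisotopic atomic masses. The short Python sketch below is only a worked calculation, not part of the SORCERER setup: a c ion is a b ion plus NH3, a conventional z ion is a y ion minus NH3, and the z* (z-radical) ion carries one additional hydrogen. The constant and function names are made up for this example.

```python
# Standard monoisotopic atomic masses in Da.
HYDROGEN = 1.00782503
NITROGEN = 14.00307401

AMMONIA = NITROGEN + 3 * HYDROGEN        # NH3, about 17.02655 Da

B_TO_C = AMMONIA                          # c ion = b ion + NH3
Y_TO_Z = -AMMONIA                         # conventional z ion = y ion - NH3
Y_TO_Z_RADICAL = Y_TO_Z + HYDROGEN        # z* = z + one extra hydrogen


def c_from_b(b_mass: float) -> float:
    """Shift a singly charged b-ion mass to the corresponding c-ion mass."""
    return b_mass + B_TO_C


def z_radical_from_y(y_mass: float) -> float:
    """Shift a singly charged y-ion mass to the corresponding z*-ion mass."""
    return y_mass + Y_TO_Z_RADICAL


print(f"BtoC    = {B_TO_C:+.5f} Da")
print(f"YtoZrad = {Y_TO_Z_RADICAL:+.8f} Da")
```

The difference between `Y_TO_Z` and `Y_TO_Z_RADICAL` is exactly the one-hydrogen offset that a plain c/z search would miss.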

2. Define a new search profile that incorporates the above peptide terminus mods.

In the Search page under “(2) Choose a Search profile”, select the most similar existing search profile, then click “Edit this profile…”. Be sure to name it something different and memorable, then select the above 2 mods under “Terminus modifications” and “Static”. Select other applicable options.

3. Include a MUSE script to generate an Excel-readable tab-delimited text (TDT) summary file of the SEQUEST top peptides.

In many cases, it can be useful to have a TDT file of the SEQUEST outputs for your Excel analysis, especially for ETD analysis of purified proteins or very simple mixtures. (See note below.) Simply include the MUSE script “sorcout.mu” (part of Sorcerer PE v3.5) as follows: Click Advanced Options “Expand”, and type “sorcout.mu” into the MUSE custom script box. (From now on, any submitted search will have a “sorcout.tdt” file automatically created in the appropriate ‘output’ directory.) Save the search profile. It is now ready for SEQUEST searches on ETD spectra.

4. Try the search using this test DTA file.

Download the following ETD test DTA file and search against SwissProt.

Right Click to Download Sample ETD DTA file

If using built-in TPP’s Spectrum Viewer, simply set the display options to “c” and “z” ions (here, “z” really means “z*”). The z* ions should match pretty well against peptide “KLYNKEPSEIVELK”.

 

Note that many common post-SEQUEST probability re-scoring algorithms, such as PeptideProphet or Scaffold, are not tuned for ETD scores. From first principles, we believe that the resulting probabilities may not be wrong per se, but may lack specificity. Therefore, particularly for ETD analysis of PTMs in purified proteins or other simple mixtures, we recommend downloading the SEQUEST scores to an Excel spreadsheet for manual interpretation rather than using CID-tuned tools.

*The Yates Lab’s version of SEQUEST has 2 code modifications for ETD. The first is the increased charge state (same as in SORCERER-SEQUEST). The second is exclusion of the Proline cleavage, which is not implemented in standard SORCERER-SEQUEST. However, this can be done with a MUSE post-processing step in the future if it is found to have a large effect.

As always, in-warranty clients can contact our TechTeam for help on this and other advanced capabilities.


Article on Sage-N Research and Thermo Fisher Scientific collaboration:

http://www.drugdiscoverynews.com/index.php?newsarticle=2475


by David.Chiang@SageNResearch.com

Orbitraps and other fast ion trap mass spectrometers (e.g. FT, LTQ) are popular instruments for discovery proteomics research.

The SEQUEST cross-correlation score is almost tailor-made for the spectral characteristics of ion trap data, whose information-rich spectra are challenging due to multiply-charged ions reported with relatively low fragment mass accuracy. This is especially important for analyzing noisy spectra that arise from low-abundance peptides and phosphorylated peptides, where the information content is embedded in the abundant small peaks.

However, you may be unaware of how the basic SEQUEST functionality has evolved from the first ‘sequest27’ prototype program to the latest SORCERER-SEQUEST implementation.

Software continues to evolve to adapt to new requirements. Like a home remodeling job that never ends, at some point it becomes more practical to start over from scratch. After all, maintenance costs are several times higher than the initial development costs over the life of a software product.

The recommended architecture for high-throughput analysis is a client-server system architecture, which separates the interactive user client computer from the heavy-duty number-crunching server. This simplifies the sharing, updating, and backup of the central server, and isolates it from viruses and other sources of system instability from the user accessible client PCs.

Sequest27

Proteomic search engines were first invented by John Yates and Jimmy Eng at the University of Washington in the early 1990s, based on the novel idea that a peptide sequence can be inferred not just from the tandem mass spectrum alone (i.e. de novo sequencing), but also by using known protein sequences as a reference.

The prototype search engine software was a standalone program named ‘sequest27’ comprising approximately 3000 lines of C code. The source code has since been separately maintained by the Yates Lab and by Thermo, with PTM searches and other modifications added later.

The ‘sequest27’ program processes one mass spectrum at a time, and searches a protein sequence database from beginning to end each time it is run. For example, to analyze a MudPIT experiment with 8,000 spectra, the ‘sequest27’ program is run exactly 8,000 times to generate 8,000 output files, with no attempt to use information from one ‘sequest27’ run to another.

SEQUEST Cluster

The simplest way to scale up the throughput is to run the same program on many computers at once, such as in a Beowulf cluster architecture (http://www.beowulf.org/). 

The SEQUEST Cluster (“SC”) product once marketed by ThermoFinnigan uses this approach, with typically 4 to 32 Linux slave node computers running ‘sequest27’ under the control of the Windows master node computer running Bioworks.

The SC architecture partitions the set of input spectra into smaller sets for each node, and uses the master node to aggregate the results. While this approach is simpler to implement than partitioning the protein sequences, it requires each local disk to contain the same protein files, resulting in inefficient disk usage (i.e. a 16-node cluster searching the NCBI nr file must store 16 identical copies). As well, it makes the indexed search capability impractical. If the local files are large, then manually copying the files across the network to each node will take a lot of time.
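
The spectrum-partitioning scheme described above can be sketched in a few lines of Python. This is only an illustration of the general idea: the round-robin split, the function names, and the result format are assumptions, not a description of how the SEQUEST Cluster product actually distributes work.

```python
from typing import Dict, List, Sequence


def partition_spectra(spectrum_ids: Sequence[str], node_count: int) -> List[List[str]]:
    """Deal spectra round-robin into one work list per node."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [list(spectrum_ids[i::node_count]) for i in range(node_count)]


def aggregate(per_node_results: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge the per-node result maps (spectrum id -> top hit) on the master."""
    merged: Dict[str, str] = {}
    for node_results in per_node_results:
        merged.update(node_results)
    return merged


def replicated_database_gb(database_gb: float, node_count: int) -> float:
    """Total disk used when every node keeps its own full copy of the database."""
    return database_gb * node_count
```

Because each node searches its own spectra against the whole database, the work lists shrink as nodes are added, but the total disk space consumed by database copies grows in proportion to the node count.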

To proteomics researchers new to clusters, the SC architecture seems to offer two benefits: (1) higher throughput than a single computer, and (2) ability to expand throughput in the future by adding nodes. 

However, the devil is in the details. In practice, the cluster may not offer higher throughput than an optimized, non-cluster architecture. As well, future expansion for this software architecture is impractical in light of Moore’s Law.

Depending on the search conditions, one high-end server (say with 8 GB RAM and a 1.6 terabyte disk) with an optimized software architecture can outrun a 16-node cluster in which each slave node has 1/16th of the resources (i.e. 512 MB RAM and 100 GB disk). And it will be simpler to maintain, easier to program, and approximately 16x more reliable. The partitioned RAM and disk resources make system-wide optimization difficult.

Future expansion is also impractical beyond the first year for the SC architecture, since all the slave nodes are assumed to have identical specs. With Moore’s Law predicting 2x performance increase every 18 months at the same price, it is more effective to replace the computing hardware every 2 to 3 years with a brand-new system rather than to try to buy older nodes to add to an old cluster.
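The arithmetic behind this recommendation can be sketched in a few lines. This assumes the idealized "2x every 18 months" form of Moore’s Law quoted above; the function name is hypothetical.

```python
def moore_speedup(months, doubling_months=18):
    """Relative performance at the same price after `months`,
    assuming performance doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)


for years in (1, 2, 3):
    print(years, "years:", round(moore_speedup(12 * years), 2), "x")
```

Under this assumption a node bought three years later is about 4x faster for the same price (and roughly 2.5x faster after two years), so adding nodes that match the old specs buys far less throughput per dollar than replacing the system.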

Server vs. PC

Servers are not just big Personal Computers (PCs). Quality server hardware is designed for reliable 24/7 multi-processing and continuous disk access, unlike PC hardware designed for the cost-sensitive consumer market.

Robust server operating systems like Enterprise Linux are designed to simultaneously run dozens of independent programs in multi-user environments and to prevent crashed programs from affecting other programs.

Server programs have fewer restrictions than PC programs designed for easy installation and use by non-experts. Therefore, they can incorporate powerful server modules like Perl, PHP, Ruby on Rails, Apache, and MySQL, but require IT expertise for installation and configuration. 

One important benefit of the server platform is ease of integration, which is increasingly important as the workflow evolves from just the search engine to a full proteomic workflow. 

In contrast, integration can be very complex on the standard Windows operating system. For example, some mass spec software from different vendors cannot co-exist on the same Windows PC. In general, PC software is easy to install but difficult to integrate, while server software tends to be the opposite.

SORCERER-SEQUEST

The SORCERER software architecture was developed from the ground up as a server platform for high-throughput search engines and workflows, with focus on robustness, scripting flexibility, and scalable performance. 

The SORCERER platform is not hard-coded for SEQUEST, but instead is a general-purpose proteomics search platform that uses the scoring subsystem for algorithm customization. (It was initially prototyped with X!Tandem, and later introduced with SEQUEST.)

At the heart of the SORCERER software architecture is the micro-partitioning of a search job into self-contained “micro-jobs” that are distributed and managed by a relational database.

In order to further reduce search time, the protein sequences are re-arranged into a peptide-centric data structure when they are first loaded into the SORCERER and “prepared” for peptide searches. Specifically, protein sequences are pre-digested in silico into unmodified peptides, which are sorted by mass and partitioned into 0.5 GB chunks called ‘seqblobs’.
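The preparation step can be sketched as follows. This is an illustration only, not the SORCERER implementation: it assumes a simple tryptic rule (cleave after K or R unless followed by P) with no missed cleavages, it removes duplicate peptides, and it chunks by peptide count rather than by the 0.5 GB size used in the product. All function names are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide for the free termini


def digest(protein):
    """Assumed tryptic rule: cut after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def build_seqblobs(proteins, peptides_per_blob=2):
    """Digest all proteins, sort the peptides by mass, and cut into chunks."""
    unique = {pep for prot in proteins for pep in digest(prot)}
    by_mass = sorted((peptide_mass(p), p) for p in unique)
    return [by_mass[i:i + peptides_per_blob]
            for i in range(0, len(by_mass), peptides_per_blob)]


for blob in build_seqblobs(["AAKGGRPLLKMM"]):
    print(blob)
```

Because each chunk covers a contiguous mass range, a spectrum only needs to be compared against the chunks whose range overlaps its candidate peptide masses, rather than against the whole database.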

When a large search job is submitted to the SORCERER, it is added to the queue by the queuing subsystem. The Sorcerer PE Application Layer subsystem partitions each search job into possibly thousands of self-contained micro-jobs, each containing 300 spectra with associated seqblobs. With PTM searches, the same unit of spectra may be searched against different seqblobs with different mass ranges. (For example, a spectrum with a 1000 amu precursor mass may correspond to an unmodified peptide sequence of 1000 amu with no mods, or of 920 amu with a single phospho-site.)
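One way to picture the partitioning is the sketch below. It is an assumption-laden illustration, not the Application Layer code: it creates one micro-job per (batch of spectra, seqblob) pair, uses a 1 amu matching tolerance, and all names and mass ranges are hypothetical.

```python
PHOSPHO = 79.96633  # monoisotopic mass added by one phosphorylation (HPO3)


def candidate_masses(precursor_mass, max_mods=1, mod_mass=PHOSPHO):
    """Possible unmodified peptide masses for 0..max_mods modifications."""
    return [precursor_mass - k * mod_mass for k in range(max_mods + 1)]


def blobs_for_spectrum(precursor_mass, blob_ranges, tolerance=1.0, max_mods=1):
    """Seqblobs whose mass range covers any candidate unmodified mass."""
    hits = set()
    for mass in candidate_masses(precursor_mass, max_mods):
        for blob_id, (low, high) in blob_ranges.items():
            if low - tolerance <= mass <= high + tolerance:
                hits.add(blob_id)
    return hits


def make_micro_jobs(spectra, blob_ranges, batch_size=300, max_mods=1):
    """spectra is a list of (spectrum_id, precursor_mass) pairs."""
    jobs = []
    for start in range(0, len(spectra), batch_size):
        batch = spectra[start:start + batch_size]
        blob_ids = set()
        for _, precursor in batch:
            blob_ids |= blobs_for_spectrum(precursor, blob_ranges,
                                           max_mods=max_mods)
        # Assumed layout: one self-contained job per (batch, seqblob) pair.
        for blob_id in sorted(blob_ids):
            jobs.append({"spectra": [sid for sid, _ in batch],
                         "seqblob": blob_id})
    return jobs


ranges = {"blob_a": (900.0, 950.0), "blob_b": (990.0, 1010.0),
          "blob_c": (1500.0, 1600.0)}
print(sorted(blobs_for_spectrum(1000.0, ranges)))
```

For the 1000 amu example from the text, the spectrum is paired both with the chunk covering about 1000 amu (no mods) and with the chunk covering about 920 amu (one phospho-site), but not with the unrelated high-mass chunk.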

All the micro-jobs are recorded in a MySQL relational database. Available CPU cores from either the master or slave nodes will query the database for the next micro-job, and submit the results when completed. 
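The pull-based queue can be sketched with a small table. This sketch uses Python’s built-in sqlite3 as a stand-in for MySQL so that it runs anywhere; the table layout, column names, and function names are assumptions, not the SORCERER schema.

```python
import sqlite3


def create_queue(conn, num_jobs):
    """Record num_jobs pending micro-jobs in the database."""
    conn.execute(
        "CREATE TABLE micro_jobs ("
        " id INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT,"
        " result TEXT)"
    )
    conn.executemany("INSERT INTO micro_jobs (id) VALUES (?)",
                     [(i,) for i in range(1, num_jobs + 1)])
    conn.commit()


def claim_next(conn, worker):
    """A free CPU core asks for the next pending micro-job, or gets None."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id FROM micro_jobs WHERE status = 'pending'"
            " ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE micro_jobs SET status = 'running', worker = ?"
            " WHERE id = ? AND status = 'pending'",
            (worker, row[0]),
        )
        return row[0]


def submit_result(conn, job_id, result):
    """Store the result and mark the micro-job as done."""
    with conn:
        conn.execute(
            "UPDATE micro_jobs SET status = 'done', result = ? WHERE id = ?",
            (result, job_id),
        )


conn = sqlite3.connect(":memory:")
create_queue(conn, 3)
job = claim_next(conn, "core-0")
submit_result(conn, job, "scored")
print(job, claim_next(conn, "core-1"))
```

Because workers pull work instead of having it pushed to them, a faster core simply claims more micro-jobs, and master and slave cores are treated alike. With many concurrent workers on a real database server, the claim step would also need row locking so that two workers cannot take the same job.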

Since each seqblob contains pre-searched peptide information, each micro-job performs only the scoring function, which is the only part customized to SEQUEST or other search engines. (Before the advent of multi-core CPUs, FPGA subsystems were also used to execute search micro-jobs. Other exotic architectures, such as Nvidia GPUs and the upcoming Intel Larrabee, are also compatible and may be implemented depending on market needs.)

When all the micro-jobs associated with one queued search job are done, the results are aggregated and written out to the file subsystem. As well, an optional MUSE script is run at this time on the output directory. For example, Ascore phospho-site localization can be done with the search results, or additional re-scoring using different user-defined search engines.

This powerful mechanism also allows algorithm developers to use the SORCERER search as a pre-search function to enrich the peptide candidates to perhaps the top 50 or 500, and then use MUSE scripting to rapidly develop scoring functions to increase accuracy. In particular, algorithm developers can optimize the important scoring functions without needing to first develop the base software to read FASTA files, compute PTM combinations, or perform other necessary but low-value operations.
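The re-scoring idea can be sketched generically as below. This is plain Python and does not use the MUSE API, which is not described here; the candidate record layout and the toy scoring function are assumptions for illustration only.

```python
def rescore_top_candidates(candidates, scoring_fn, top_n=50):
    """Keep the top_n candidates by search score, then re-rank them
    with a user-supplied scoring function."""
    shortlist = sorted(candidates, key=lambda c: c["score"],
                       reverse=True)[:top_n]
    rescored = [dict(c, rescore=scoring_fn(c)) for c in shortlist]
    return sorted(rescored, key=lambda c: c["rescore"], reverse=True)


def toy_score(candidate):
    """Placeholder score: number of S, T, or Y residues in the peptide."""
    return sum(candidate["peptide"].count(aa) for aa in "STY")


candidates = [
    {"peptide": "AAGK", "score": 3.1},
    {"peptide": "SSTYK", "score": 2.4},
    {"peptide": "LLSR", "score": 2.0},
    {"peptide": "YYYYK", "score": 0.5},
]
for c in rescore_top_candidates(candidates, toy_score, top_n=3):
    print(c["peptide"], c["rescore"])
```

The expensive work (reading FASTA files, enumerating PTM combinations, the first-pass search) has already been done when the shortlist is produced, so only the scoring function needs to be written and iterated on.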

Applications include the analysis of CID+ETD spectra, whereby the top CID search results are used to drive the ETD search, and MS2/MS3 phosphorylation analysis, whereby associated MS3 spectra may be separately searched in MUSE and re-combined with the MS2 results.

The SORCERER architecture includes a ‘custom’ directory, which has a higher priority than the application directory, to allow knowledgeable developers to substitute and overwrite almost any part of the SORCERER platform. (By confining all customization to this directory, it is simple to revert back to the original factory state.) Therefore, researchers can start with a powerful, functional workflow using a standard SORCERER product, then customize it as needed from simple MUSE scripts to a full re-architecting of major subsystems.
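The priority rule can be illustrated with a short lookup sketch. The directory and file names are hypothetical, and this is not the SORCERER loader, just the general "custom first, factory second" pattern.

```python
import tempfile
from pathlib import Path


def resolve(relative_path, custom_dir, app_dir):
    """Return the custom copy of a file if one exists, else the factory copy."""
    for root in (Path(custom_dir), Path(app_dir)):
        candidate = root / relative_path
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(relative_path)


with tempfile.TemporaryDirectory() as tmp:
    app = Path(tmp, "app")
    custom = Path(tmp, "custom")
    app.mkdir()
    custom.mkdir()
    (app / "score.py").write_text("# factory version\n")
    print(resolve("score.py", custom, app).parent.name)  # app

    # Dropping a file into the custom directory overrides the factory copy.
    (custom / "score.py").write_text("# site override\n")
    print(resolve("score.py", custom, app).parent.name)  # custom

    # Deleting the override reverts to the factory state.
    (custom / "score.py").unlink()
    print(resolve("score.py", custom, app).parent.name)  # app
```

Since the factory files are never modified, emptying the custom directory is all it takes to undo every customization.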


Discovery proteomics research, such as for biomarker discovery, requires advanced “Proteomics 2.0” analyses, such as PTM analysis (e.g. phosphorylation), ETD, and quantitation, in addition to high throughput.

With the transfer of the high-throughput SEQUEST Cluster business, the choice for high-throughput data analysis is simplified to one of two SORCERER products, both of which bring powerful “Proteomics 2.0” capabilities with the integrated MUSE scripting environment.

Many advanced proteomics analyses require some level of customization, so the MUSE scripting can be invaluable. For example, some PTMs of interest occur only on certain residues at a peptide terminus, which can be implemented as a post-search filtering step. Workflow automation, such as the compression and copying of results after search completion, can be easily scripted in MUSE. Indeed, the Ascore phospho-site localization algorithm is scripted entirely within MUSE.

Algorithm developers can quickly experiment with new scoring functions, such as for ETD, PTMs, quantitation, or even replicating other common peptide search engines, by simply re-scoring, say, the top 50 candidate peptides from a Sorcerer search. 

SEQUEST Cluster users who have developed custom interface modules to their workflow can most likely adapt their infrastructure to SORCERER with little or no change.

The SORCERER 2 system will be the product of choice for most high-throughput users. It is a plug-and-play, pre-configured Enterprise Linux server. Users can install it in minutes, and immediately use a web browser interface (with a password) from any network PC for uploading and downloading data and submitting search jobs. They will also appreciate the reliability, as many Sorcerer systems in the field have been continuously running for more than a year without downtime.

The SORCERER Enterprise software will be a better fit for high-throughput users who must run software on approved servers within a data center, such as in biopharmaceutical companies or large centralized labs. It can be viewed as an “a la carte” version of the software architecture within the SORCERER 2 IDA, and allows other software to co-exist on the same server. 

The SORCERER Enterprise software can be purchased pre-installed and tested on customer-specified servers. Otherwise, it and its dependent components must be installed and configured by qualified IT staff on qualified powerful servers. As well, the semi-custom nature of the installation and maintenance will result in higher support costs.

Like the SEQUEST Cluster, the SORCERER Enterprise product allows throughput to be increased with additional slave nodes running the SORCERER Enterprise Plus software. Note, however, that each high-performance slave node may be worth 16 nodes of SEQUEST Cluster under common search conditions, so you won’t need as many.

Furthermore, the combination of Thermo Discoverer and Sage-N Research SORCERER provides a powerful, customizable, client-server data analysis platform. Discoverer provides a Windows user interface customizable using the Windows .NET environment, while the SORCERER provides the back-end Enterprise Linux server with MUSE customizability.

See the joint press release at: http://www.sagenresearch.com/news_10.html

If you plan to buy a new Orbitrap or other fast mass spectrometer for discovery proteomics, we would strongly recommend that you include a SORCERER 2 system (or SORCERER Enterprise software if you must run in a data center) in your budget. PC software will not be able to keep up with a frequently used Orbitrap. 

If you have a SEQUEST Cluster that is over 2 years old, we recommend that you upgrade to SORCERER within one year to replace the outdated hardware. And please inquire about the special time-limited upgrade offer to make this transition easier.


If you have attended a conference lately, such as the ASMS in Denver, you would have found a bewildering array of exciting new products and ideas for your advanced proteomic mass spectrometry research.

More than ever, there is a need to remain focused with both your resources and efforts. Now is a good time to use the “80-20 Rule” to cut through the clutter and sharpen your focus. 

The 80-20 Rule (also called the “Pareto Principle” or the “law of the vital few”) states that in many situations 80% of the effects come from only 20% of the causes. It readily applies to many aspects of mass spectrometry and proteomic research.

Focus on the Vital Few in Proteomics

The key to success is to remain focused on what is truly significant, by strategically investing your resources on the “vital few” products, technologies and people with the highest impact to you. 

Here are some of the ways that the 80-20 Rule may apply to your proteomic research, and how you can sharpen your focus.

Invest in Quality

Less than 20% of today’s products account for 80% of overall sales. The rest (80%!) will fade away. Invest only in high-quality products that thrive, while avoiding products that will disappear.

Ideally, you want to get word-of-mouth references from trusted colleagues before purchasing critical tools for your focus areas. This is especially true for software products, some of which are laden with incomplete features designed to be demo’ed and sold rather than used.

With today’s limited budgets, it is important to invest in solid tools rather than to buy the latest widgets. Avoid the temptation to buy inexpensive rather than high-quality, well-supported products. (Most people find that the costliest products are the cheap ones they buy but do not use.)

To optimize your tools purchase, start with a prioritized “Top 10” capabilities list, then focus on a workflow that can solidly deliver the top 2 must-haves as well as some number of the remaining 8. Given today’s tool market, you will need to integrate tools from different sources to address most of your key requirements.

For the top 2 must-haves, the question is never a yes/no question of whether a certain capability (e.g. ETD) is supported, but how well. Do not choose tools on the basis of the length of the features list. And you may find that technical support can be at least as important as the product itself in a highly technical, evolving field like proteomics. 

Pamper your Workhorse

Perhaps 20% (i.e. 1 of 5) of your mass spectrometers may impact 80% of your research by generating 80% of the spectral data. For example, many labs with several different mass spectrometers tend to have one workhorse instrument, commonly a fast-scan ion trap (e.g. Orbitrap or LTQ) or Tof/Tof (e.g. 4700) mass spectrometer.

If that is the case in your lab, allocate your resources according to impact, and focus at least half your analysis budget on the one key instrument. Avoid the mistake of having the majority of the mass specs dictate the workflow of your most important one.

With tools like PeptideProphet and Scaffold that can accommodate different search engines, you can tailor the search engine to each mass spec, while maintaining a consistent high-level workflow optimized for different instruments.

Associate with Leaders

Perhaps 20% of the researchers seem to publish 80% of the papers and get 80% of the funding. You need to be part of this elite class. How? By doing top-notch work and focusing on your unique value-added capability to the existing network, while minimizing distractions from low-value activities (particularly IT issues).

Whether in elite athletics like the Olympics or science, winners associate with winners. To be part of the winner’s circle, you need to maintain the winner’s mindset, use the professional quality tools and methodologies, and bring yourself to that top level in terms of knowledge, expertise, and professionalism.

The best way to get started is to first replicate the workflow used by leaders in the field, which is more efficient than starting from scratch. The workflow will serve as a good foundation for adding your own unique capability while leveraging an existing high-performance infrastructure.

An Integrated Workflow System for World-Class Proteomics

For more than six years, Sage-N Research has worked with the world’s top technologists to develop a professional quality, server-class integrated proteomics analysis platform targeted for the top 20% proteomics laboratories expected to make the greatest impact.

The result is the Sorcerer 2 system, which delivers a robust workflow incorporating the best technologies from the laboratories of our Scientific Advisory Board members — Drs. John Yates, Steven Gygi and Ruedi Aebersold. In the near future, the Sorcerer 2 platform will incorporate ETD/ECD analysis technologies from Dr. Roman Zubarev, our newest scientific advisor. 

The Sorcerer 2 system is designed to fill in an important gap for advanced “Proteomics 2.0” analysis, by delivering a standardized workflow platform that can accommodate 80% of standard proteomics analyses right out of the box, while providing an integrated “MUSE” scripting environment to allow user-customizable post-search analyses and workflows for the remaining 20%. This allows labs to focus on scripting their own unique analysis IP while leveraging an existing powerful workflow system.


The Ascore algorithm was developed by the Gygi Lab at Harvard for phosphorylation site localization. It re-analyzes phosphopeptide search engine results to assign a confidence value to each phosphorylated site.

http://ascore.med.harvard.edu/

Sage-N Research offers a commercially developed and supported version, called SORCERER-ASCORE, that is integrated into the Sorcerer systems.

Reference

Beausoleil et al, “A probability-based approach for high-throughput protein phosphorylation analysis and site localization”, Nature Biotech, 10/06, doi:10.1038/nbt1240.


Times are changing fast

In 2003, a Nature Biotechnology article title declared, “Data analysis – the Achilles heel of proteomics.”

In 2008, this is still true for most researchers, but it need not be any longer. Today’s analysis workflows are more powerful, more accurate, and better integrated than those of 5 or even 2 years ago.

It’s the software that will make the difference

If you’re working in an important field like stem cell or cancer pathways, then you may be just a few key proteins away from a fundamental and groundbreaking discovery. If you have a sensitive mass spec like an LTQ-Orbitrap® or an LTQ®, you already have the basic “hardware” to make those earth-shattering discoveries.

But if you are like most mass spec researchers today, you may lack the right “software,” including both the tools and the know-how, to sensitively and accurately characterize those subtle proteins in your samples.
