Proteomics mass spectrometry is finally sensitive and specific enough for robust translational medicine (at least in capable hands), and holds tremendous promise to revolutionize biology and medicine. For some, it holds the key to incredible research power for decades to come.
However, a chasm continues to grow between productive and unproductive labs, because too many proteomics practitioners focus too early on low-level issues (e.g. cost, automation, ease of use) without first resolving high-level ones (e.g. sensitivity in the presence of noise, quality of results, algorithmic suitability).
For many researchers experimenting with a new high-resolution instrument, the most common scenario is to select a workflow by running a simple protein solution, usually purified BSA or a commercial protein mixture.
Since different workflows will give essentially identical protein ID results for these simple test cases, they may conclude that all search engines are equivalent. This is true when there is almost no signal noise, but it is largely irrelevant in translational research. In fact, the exact same test will likely show that low-resolution and high-resolution mass specs are equivalent, that the lowest-quality reagents will suffice, or that maybe you don’t have to clean your glassware as often. These are also true when there is little or no signal noise, but again, that is irrelevant for real-world research.
Seeing that there is little difference in protein IDs, some focus on protein coverage as the sole metric for evaluating search engines. However, this is actually the opposite of what is needed for sensitive discovery proteomics. For example, if you are hunting for new protein biomarkers (especially a “one-hit wonder”), you do not want the protein inference engine tuned to assign every ambiguous peptide to already-found proteins, thereby hiding it from further study.
Not surprisingly, a workflow selected based on low-noise experiments and focused on protein coverage will excel for simple mixtures, but it is not sensitive enough to analyze complex mixtures with wide dynamic range, such as those in translational research. Scientists will be able to see the abundant peptides and proteins, but probably little else. That is roughly what most proteomics researchers find today: nothing meaningful, but enough of the obvious not to change their methodologies.
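To make the protein-inference concern above concrete, here is a minimal sketch of a greedy, coverage-driven inference step. The peptide sequences and the hypothetical protein NOVEL_1 are made up for illustration, and real inference engines are far more sophisticated, but the failure mode is the same: the shared peptide is absorbed into the already well-covered protein, and the potential one-hit wonder never appears in the report.

```python
# Hypothetical peptide-to-protein mappings, for illustration only.
peptide_hits = {
    "LVNELTEFAK":   ["ALBU_BOVIN"],             # unique to the abundant protein
    "AEFVEVTK":     ["ALBU_BOVIN"],             # unique to the abundant protein
    "SLHTLFGDELCK": ["ALBU_BOVIN", "NOVEL_1"],  # ambiguous (shared) peptide
}

def greedy_inference(hits):
    """Assign each shared peptide to the protein with the most unique evidence."""
    protein_support = {}
    assignments = {}
    # First pass: count unique peptides per protein.
    for pep, proteins in hits.items():
        if len(proteins) == 1:
            protein_support[proteins[0]] = protein_support.get(proteins[0], 0) + 1
            assignments[pep] = proteins[0]
    # Second pass: shared peptides go to the best-supported protein.
    for pep, proteins in hits.items():
        if len(proteins) > 1:
            assignments[pep] = max(proteins, key=lambda p: protein_support.get(p, 0))
    return assignments

print(greedy_inference(peptide_hits))
# SLHTLFGDELCK is folded into ALBU_BOVIN, so NOVEL_1 never surfaces for follow-up.
```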
The result is that most labs are not getting the value commensurate with their investments in proteomics mass spectrometry. Under the current economic environment, this is both wasteful and dangerous.
Within the academic world, while many proteomics researchers have trouble getting any interest, a select few are swamped and have to turn away collaborators. Within drug discovery firms, while many are staring at their mostly idle mass spectrometers, a select few are running multiple mass spectrometers 24/7, productively sieving through millions of peptides.
So why is the majority of proteomics research not producing high-value results?
With our access to the world’s top academic and drug discovery proteomics labs, we have a unique bird’s-eye view of the answer. (However, like attorneys, we never give out client-specific information.)
Please allow me to share some secrets to your future success.
Clog in the flow
After more than 15 years, proteomics is still only producing quality results for maybe the top 20% of the labs. Why?
Most people believe the main problem is in instrumentation, possibly because that is the view of the mass spec companies. That is, if you can simply get faster, more accurate mass spectrometers, the research productivity problem will be solved. So hundreds of labs rushed to buy the latest mass specs.
With more than 1000 labs worldwide already using high-throughput, high-resolution mass specs, you would expect tremendous research productivity across the board if mass spectrometry were really the problem. But you already know that is not the case.
Survey papers confirm what you suspected but were afraid to ask, that you can use the fastest, most accurate mass specs and still get the wrong protein identifications, even from simple commercial protein mixtures! (For example, see Bell et al. 2009.) In many cases, the problem is faulty data analysis due to human error that can be corrected by training.
The inability of many labs to get quality protein identifications from relatively simple synthetic mixtures is a big problem for the proteomics field, because protein ID forms the foundation for more advanced analyses.
At its essence, translational proteomics is about answering 3 questions within a biological system: (1) which proteins? (2) how modified? and (3) how much?
Unfortunately, if you cannot reliably identify what the peptide is, it becomes even less reliable to analyze post-translational modifications (PTMs) and quantitation. In other words, PTM localization (including ETD) and protein/peptide quantitation are really modules built on top of a robust protein ID workflow, not independent tools.
Data analysis — both computing tools and the know-how — has emerged as the main bottleneck for discovery translational proteomics. To the credit of instrument manufacturers, many mass specs are now fast and accurate enough (a good baseline is >3 spectra per second with <50 ppm precursor mass accuracy) that they have become commoditized and are no longer the main limitation.
To see what is blocking proteomics progress as a whole, consider this analogy of where proteomics is today.
You just learned about a wonderful new technology called an airplane (proteomics mass spectrometry) that can get you to remote islands full of treasures (biomarkers). So you try to get funding for a plane and learn to fly, with the intent of finding the treasures first. The high-stakes race has begun!
At the first level of competence, you learn the basic controls involved in flight (i.e. shotgun proteomics basics).
At the second level, you learn to fly visually in clear weather (i.e. peptide and protein ID from simple mixtures). For this, you don’t need a sophisticated cockpit (i.e. a basic software workflow will do), because you can get a direct reading by looking out the window, so there is little reason to infer information indirectly via gauge numbers.
Finally, at the third level, to be a truly effective pilot, you learn to fly entirely on instruments, using only indirect information from your cockpit (i.e. bioinformatics workflow) — think of a 747 with its many gauges — because this allows you to fly into clouds (i.e. in the presence of significant noise) with confidence. This is when you don’t have the luxury of more direct evidence, and have to rely on the gauges alone.
Using this analogy, the proteomics field as a whole is stuck at the second level, where the lack of both a sophisticated bioinformatics workflow and the know-how in 80% of the labs prevents them from going beyond analyzing simple mixtures, despite having capable instrumentation.
In fact, when it is said that biology is turning into an information science, it implies the need to progress to this third level, where you need information technology to help transform low-value data to high-value information.
For simpler biological experiments, the data leads directly to the information. For example, there is or there isn’t fluorescence in the treated cells. No sophisticated statistical data analysis is necessary to interpret the data into information.
At this more advanced level of experiment, there is a significant separation between data and information. Data, due to its sheer quantity and its variability in quality and relevance, requires significant bioinformatics and human expertise to convert into high-value information.
To intuitively understand why data is different from information, consider this analogy: There is more free financial and economic data than you can ever use. That data costs almost nothing to obtain and is essentially worthless by itself. However, if you can correctly deduce the information of when the stock market will turn around, that information is worth a billion dollars. This also explains why the value of software can vary widely from next to nothing up to millions of dollars, because the value of its associated information varies just as much.
Similarly, deep within your millions of spectra of varying quality there may be one or two peptides that will transform your research and career forever. How valuable is it to you to get this information out? Do you have the tools, the technical support, and the know-how to do so? (This is why skimping on data analysis tools is probably the most expensive mistake.)
Unfortunately, unlike the flight instrumentation analogy, the optimal workflows for low-noise and high-noise datasets are different, because the digital signal processing (DSP) essential to suppress noise in the latter will necessarily distort the former. (Think of the visual artifacts from using night-vision goggles in clear daylight.) This spooks many traditional researchers used only to direct evidence, and generally hinders further progress in the field.
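A minimal sketch of this trade-off, using a simple moving-average filter as a stand-in for more sophisticated DSP: the same smoothing that lets a peak stand out from a noisy trace also blunts a clean, sharp peak. The signal shapes, noise level, and filter width below are made up for illustration.

```python
import numpy as np

def smooth(signal, width=9):
    """Simple moving-average filter (a crude stand-in for real DSP)."""
    kernel = np.ones(width) / width
    return np.convolve(signal, kernel, mode="same")

x = np.arange(200)
clean = np.exp(-0.5 * ((x - 100) / 2.0) ** 2)          # sharp, noise-free peak
noisy = clean + np.random.normal(0, 0.3, size=x.size)  # same peak buried in noise

print("clean peak height:    ", clean.max())         # ~1.0
print("smoothed clean peak:  ", smooth(clean).max())  # noticeably lower and broader
print("smoothed noisy peak:  ", smooth(noisy).max())  # now stands out above the smoothed noise
```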
Fortunately, the big reward of success (and the high cost of failure) incentivizes more and more proteomics researchers to jump the gap. This gap is the difference between productive and unproductive today, and probably between funded and unfunded in about two years.
I would also suggest reading the “News and Views” piece in the June 2009 issue of Nature Methods by proteomics pioneer (and Sage-N Research SAB advisor) Ruedi Aebersold, which provides an insightful explanation of the bioinformatics challenge in proteomics reproducibility (Aebersold, 2009).
Mine the gap
We have discussed where the main bottleneck is. Now let’s look at how valuable it is. After all, the ability to judge value (i.e. determine “how valuable?”) is the true secret to sustained success.
Consider the workflow as comprising 3 components:
1. Sample preparation
2. Measurement (mass spectrometry)
3. Data analysis
Of these, the most valuable component, by far, is data analysis, because it is closest to the end result. Within it, the know-how is more important than the software workflow itself.
In contrast, the measurement step — particularly mass spectrometry — is the least valuable of the three. High-performance mass specs are now readily available from many core facilities and are relatively simple to use. This has ironically reduced its strategic value to you relative to your peers.
Data analysis is far more valuable than instrumentation?!?
This may sound like heresy, but that is how society has attributed value for at least 50 years.
First example: consider the discovery of the DNA structure in the 1950s, succinctly summarized here:
1. Key DNA sample was prepared by Rudolf Signer.
2. X-ray measurements were made by Rosalind Franklin and Maurice Wilkins.
3. Data analysis, in the form of a 3D model, was done by James Watson and Francis Crick.
Who got the 1962 Nobel Prize in Medicine?
Most would guess Watson and Crick (data analysis people!). Actually, Wilkins (measurement person) also shared the prize with them, but got relatively little public recognition, so he wrote an autobiography called “The Third Man of the Double Helix”.
The discovery of the DNA structure offers an interesting parallel to proteomics, in terms of how an expensive instrument that produced cryptic patterns (x-ray diffraction vs. mass spectra) drove an important discovery based on the correct analysis of the data. At the time, noise was also a big problem (like translational proteomics), as there was concern the double helix was a noise artifact rather than the structure sought.
In any case, the data analysis people got the bulk of the credit. Why? The value is in the final information, which comes from the last step of analyzing the data.
Another example: For those who are MDs, let’s say a patient comes to see you, so you can perform “disease identification”. Here is what happens:
1. Blood sample is prepared by the medical technician.
2. Measurements, such as blood panels, are run by a contract lab.
3. Data is analyzed by you to search for the closest match among known diseases, including interpreting possible ambiguous or contradictory data.
Of the three, who delivers the highest value and “makes the big bucks”? You do, of course!
Again, the data analysis component captures the bulk of the value, because it is closest to the valuable information.
Today, it’s no secret that data analysis is the neglected stepsister of the proteomics world. As funds are lavished on her privileged sibling (instrumentation), data analysis is relegated to second-class status. But she will turn out to be Cinderella, whose virtues will soon be recognized.
The greatest opportunities lie in discovering those rare value gaps, where the perceived value is inconsistent with the intrinsic value. (In the stock market, the value gap is the essence of wealth creation.) Now you know that data analysis is one of those rare opportunities.
Cover your bases
You now know that data analysis is both the critical bottleneck and the highest value component. Let’s look at how you can increase your value in this critical component over time.
To make the most of your translational proteomics experiments, it is important to maintain a database, so you can build research “equity” in the form of a growing amount of information from previous experiments.
Translational proteomics is detective work: solving a puzzle from incomplete clues. At its simplest, it is like that game show where contestants try to guess a phrase when only a few of its letters are visible. For example, a particularly good word detective can see that “_IN_S_ IN_IBI_OR” spells “kinase inhibitor”. Clearly, you want to accumulate all previous clues to help you solve the puzzle.
Or maybe it is like the detective TV show CSI where the killer leaves behind subtle but unambiguous clues that the detectives can use to ID the killer before the hour is out. In real life, of course, noise and ambiguity are constant challenges, much like for translational proteomics.
The point is that the real power of discovery proteomics is in collecting an increasing amount of data that gives successively more clues to the puzzle, allowing the most insightful to solve it first and then validate the answer with targeted experiments.
So perhaps a better mental model is that of an impressionist Monet painting. If you only see a small piece, it looks like a bunch of random color dots without any pattern. But when you collect enough pieces in aggregate, a complete and beautiful picture emerges with astonishing details.
The key is to keep collecting data and information, so that you can continue to build up the picture one experiment at a time. This accumulated resource can be a spectral library or a growing set of identified peptides associated with similar samples.
As proteomics increasingly becomes an information science, biomarker discovery workflows, for example, can be modeled on business intelligence workflows, such as Visa identifying high-value targets for marketing campaigns or Walmart developing models that predict flashlight shortages during winter storms. Drawing parallels from analogous workflows in a different field helps us gain insight into the what and how of sophisticated workflow development.
It is interesting to note that business workflows tend to be semi-customized using relational database technologies, which allow business analysts to slice and dice data to discover valuable business patterns. I believe this is how successful proteomics workflows will evolve as well, since standard workflow products like the Rosetta Elucidator (now discontinued) have not proven to be a sustainable business model, at least in today’s environment.
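As a hedged sketch of what this “research equity” database might look like in practice, here is a small relational store of peptide IDs accumulated across experiments, queried with SQL to slice and dice the results. The file name, table layout, score cutoff, and example values are illustrative only, not a real schema or product.

```python
import sqlite3

con = sqlite3.connect("proteomics_results.db")  # illustrative file name
con.executescript("""
CREATE TABLE IF NOT EXISTS experiment (
    id INTEGER PRIMARY KEY, sample TEXT, run_date TEXT);
CREATE TABLE IF NOT EXISTS peptide_id (
    experiment_id INTEGER REFERENCES experiment(id),
    peptide TEXT, protein TEXT, score REAL);
""")

# Record one experiment's identifications (normally parsed from search-engine output).
con.execute("INSERT INTO experiment (sample, run_date) VALUES (?, ?)",
            ("plasma_patient_01", "2010-03-15"))
exp_id = con.execute("SELECT last_insert_rowid()").fetchone()[0]
con.execute("INSERT INTO peptide_id VALUES (?, ?, ?, ?)",
            (exp_id, "LVNEVTEFAK", "ALBU_HUMAN", 0.99))
con.commit()

# Example "slice and dice": peptides identified confidently in at least three
# different experiments -- recurring clues worth a targeted follow-up.
query = """
SELECT peptide, protein, COUNT(DISTINCT experiment_id) AS n_runs
FROM peptide_id
WHERE score > 0.95
GROUP BY peptide, protein
HAVING n_runs >= 3
ORDER BY n_runs DESC;
"""
for peptide, protein, n_runs in con.execute(query):
    print(peptide, protein, n_runs)
```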
Note that an astonishing $15B of IBM’s $100B annual sales comes from business services — i.e. workflow consulting and development. This is approximately the annual sales of Amgen and Eli Lilly! So there is tremendous value in semi-custom workflows for business. Not surprisingly, these workflows are designed for robust servers and storage with built-in redundancy, not for simple off-the-shelf PC programs.
As the proteomics field comes to understand the strategic value of bioinformatics and database workflows, more labs will invest in semi-custom bioinformatics workflow development. (Not surprisingly, Sage-N Research has also moved in this direction, and has started offering semi-custom workflow development on our SORCERER platform, based on our MUSE scripting environment, to serious translational proteomics clients.)
Putting it all together, quickly
The proteomics field has rightly, in my opinion, moved from 2D gels toward mass spectrometry-based analysis, but it has confused using mass spectrometry with owning and operating a mass spectrometer. The result is that mass spec companies have had unrivaled market power to drive instrument-centric purchasing decisions, with little emphasis on downstream data analysis, and scientists are not getting the research value out of their investment. (Unfortunately, mass specs are like PCs in terms of getting exponentially faster and better with time, so buying too early means paying more and getting less, especially as different manufacturers can now deliver similar specs.)
The proteomics hardware-software imbalance is hurting the field, because proteomics as a whole continues to be viewed as an experimental technology, even though knowledgeable technologists know it is more powerful and robust than is believed by the funding authorities.
The good news is that this is just another value gap, and hence a unique opportunity for those who understand the bottleneck and how to break through with powerful bioinformatics and IT tools.
In fact, the IT industry is evolving at such an explosive rate that it is difficult for even high-tech professionals to keep up. The reason is that high-tech is probably the most brutally competitive field in terms of leapfrog technologies, where market-share battles are fought in timeframes of months with billion-dollar effects. (Remember when LCD TVs were new? Now it’s all about LEDs. And don’t even mention ancient plasmas.)
If you still think IT simply means taking a programming class in C or Perl and running a single program on a Windows PC, you are about two decades behind.
FPGA computing is already passé (our first-generation products used this), and multi-core CPUs are ho-hum (our current products are optimized for them). Now high-performance IT is about high-density computing, low-latency networking, virtualization, and, most importantly, storage technologies with tens of terabytes of disk.
All of this makes today a very exciting time for putting together the right technology pieces for fast, simple, reliable, and comprehensive data analysis for translational proteomics. Unfortunately, much existing IT infrastructure is stuck in legacy mode, acting as a boat anchor rather than a propeller.
The good news is, since many labs are only starting to ramp up, proteomics IT infrastructure can start afresh with the latest IT technologies (high-density blade computers, integrated storage/backup, virtualized PCs, etc.). This is the direction we are taking, to continue to add value to our proteomics clients.
Sage-N Research is continuing to evolve our SORCERER platform to include protein-centric integrated storage, with the capability to run multiple search/scoring engines in a scriptable, unified workflow, to help you break through your data analysis bottleneck in translational proteomics.
There is not much time, though, as the window of opportunity to show meaningful research results is only about two years. (My prediction is that average funding will drop by at least a factor of two after that, with almost all the money going to proven, productive scientists.)
As you decide how to spend your precious funds, ask yourself in 5 years which is more valuable: a 5-year-old database of results, or a 5-year-old mass spectrometer?
As you decide where to invest your precious time, ask yourself in 5 years which is more valuable: 5 years experience in analyzing complex data, or 5 years experience running a mass spectrometer?
I hope this helps your understanding of the real bottlenecks in translational proteomics, and provokes thinking and insight into your own workflow needs.
If you have any questions or comments, or if I may be able to help you, please feel free to contact me at david@SageNResearch.com.
Aebersold, R. Nat. Methods 6, 411–412 (2009). doi:10.1038/nmeth.f.255
Bell, A.W. et al. Nat. Methods 6, 423–430 (2009). doi:10.1038/nmeth.1333