COMBREX Seeks To Bridge Genomics-Protein Function Data Gap

 

Reducing the gap between sequence data and information about protein functions will help in characterizing microbial diversity

Merry R. Buckley

Merry R. Buckley is a freelance science writer and the Commentaries and Social Media Editor for ASM's open-access journal, mBio.

Summary
● The Computational Bridge to Experiments program, called COMBREX, aims at closing the gap between accumulated sequence data and its fuller comprehension.
● COMBREX is funding small-scale protein function investigations, building a database of all predicted gene functions, and compiling another specialized database devoted to proteins whose features are validated.
● In addition to building two databases, the founders of COMBREX are developing a machine-readable structured vocabulary for describing gene functions.
● COMBREX is evaluating other analytic technologies, including reactome arrays, as possible means for accelerating annotation.


How is genomics like an old episode of the long-running television series, I Love Lucy? Picture Lucy and her neighbor Ethel in a candy factory doing their best to wrap chocolates as the machinery churns out an endless stream of bonbons on a conveyor belt. They struggle to keep up with the candy output, but soon fall further and further behind, stuffing candies into their mouths and clothes at an increasingly frantic pace.

Microbiologists now face a similar bottleneck, except with genomics data instead of chocolate candies. The speed with which researchers can determine the sequences of microbial genomes currently outruns the pace at which they can annotate that information and determine the functions of the proteins those sequenced genes encode. Sequence data, the "chocolate candies," are spilling from sequencing facilities all over the world, and the technology that provides that information is running faster and faster. About 25% of sequenced microbial genes have no known or proposed function.

This informational mismatch is stunting progress toward characterizing the microbial world in all its splendid diversity. However, the new Computational Bridge to Experiments program, called COMBREX, seeks to correct that mismatch. The brainchild of Nobel Laureate Rich Roberts of New England Biolabs and funded in part by the National Institutes of Health (NIH), COMBREX aims to reduce the gap between accumulated sequence data and its fuller comprehension via a three-pronged approach: funding small-scale protein function investigations, building a database of all predicted gene functions, and compiling another, more specialized database devoted to only those proteins whose functions and other features are experimentally validated, called the gold standard proteins.

By setting priorities for studying proteins and funding those studies, COMBREX seeks to coordinate annotation resources and efforts, and thereby accelerate the overall pace of genomics-generated discovery. While some critics express reservations about the COMBREX strategy, many of them agree that knowledge linking gene sequence data to protein functions has fallen behind and that efforts like those COMBREX recommends are sorely needed.

COMBREX Seeks To Coordinate Protein Analyses with Faster-Paced Genomics

COMBREX was established specifically to address the problem of slow, uncoordinated progress in protein functional validation, according to its three principal developers, Roberts along with Simon Kasif and Martin Steffen, both of Boston University (BU). "How do you keep up with this fast DNA sequencing?" asks Steffen, who directs the BU Proteomics Core Facility. "We have all these great genome sequences, but we don't know what so many of the genes do. We want to help coordinate the community towards identifying gene functions, especially since, with the new sequencing technologies, the problem might get much worse."

Their original concept called for forging a community of biochemists and computational biologists to organize their activities and to coordinate funding for experimental work on specified proteins. With early support from NIH, the COMBREX group began developing databases and computational tools to annotate genes. COMBREX also provides context and tools for selecting proteins to study, including a database of predicted functions for 3.4 million genes in completed bacterial and archaeal genomes. Experimentalists can review the COMBREX database to find predictions to test in their laboratories.

That the scientific community is inundated with DNA sequence data without comparable functional validation studies comes as no surprise, according to Steffen. For 20 years, federal agencies and the scientific community focused on improving sequencing technology while lowering its cost, he says. "It's not surprising that we're great at doing DNA sequencing right now. Lots of great scientists from lots of different fields have been working on this problem for a long time. But there hasn't been any comparable or systematic funding effort or even recognized importance of the annotation problem."

COMBREX Provides Modest Funds for Protein Analysis and Annotation Projects

Producing the shotgun sequence of a microbial genome costs roughly as much as rigorously analyzing the function of a single gene product, a disparity that makes annotation the rate-limiting step in genomics discovery. To help in relieving that bottleneck, COMBREX is providing funds through competitive grants to experimentalists to characterize protein functions. The grants, ranging from $5,000 to $10,000, are meant to cover costs associated with performing molecular function assays, including reagent costs and partial salary support.

COMBREX grants are, admittedly, limited in size, but they are putting funds into the hands of biochemists and other experimentalists who study protein functions, according to Roberts. "If someone predicts that [a given] protein is a helicase, then a lab that works with helicases would be able to verify or negate that hypothesis very easily," he says. "It's easy for them to do the biochemistry because they have the reagents and expertise already; this is their bread and butter."

Proposals submitted to COMBREX are presented to external reviewers, who judge the rationale and impact of validating particular proteins or related sets of proteins, and evaluate the expertise of the scientists who propose each project. Because COMBREX grants are available to worthy experimentalists, regardless of seniority, Roberts sees an educational and training opportunity here as well as a means of supporting research. "Especially the younger scientists . . . can get funding for a small project," he says. "It would be very good for undergraduates who could do, maybe, an internship in a lab. I think the educational opportunities in this are quite high: a good starter project for students."

COMBREX and "Gold Standard" Databases Plus a New Vocabulary

The primary COMBREX database was constructed to guide the choice of gene products for functional validations. All genes from completely sequenced microbial genomes contained in RefSeq are incorporated into the database, which organizes them into functional clusters according to designations provided by Protein Clusters from NCBI (the successor to the COGs database, Clusters of Orthologous Groups). The database also houses predictions of protein functions submitted to COMBREX by computational biologists, including COMBREX co-principal investigators Steven Salzberg of the University of Maryland, Dennis Vitkup of Columbia University, Charles DeLisi of Boston University, and several other computational research groups. The database includes information pertaining to approximately 2.9 million proteins from 1,200 microbial strains organized into 500,000 clusters; another 0.5 million proteins are classified as singletons, according to Kasif.

Deciding what to analyze from 3.4 million uncharacterized gene products is largely up to the scientists who apply for funding, but grants will be selected with an eye to applicability, according to Steffen. "We have a basic philosophy that we want to get the biggest bang for the buck, so we favor genes that come from big [relatedness] clusters," he says. Kasif is developing a searchable priority system for COMBREX to help identify sets of related proteins. Users also may search the database by protein phenotype, such as proteins that confer antibiotic resistance, for instance, to point experimentalists to products that best match their interests.
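The "biggest bang for the buck" preference Steffen describes can be pictured as a simple ranking by cluster size. The Python sketch below is only an illustration of that idea: the record fields, the identifiers, and the rank_candidates function are hypothetical and are not drawn from the actual COMBREX schema or from Kasif's priority system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical record for an uncharacterized gene product."""
    protein_id: str    # made-up identifier, for illustration only
    cluster_id: str    # made-up cluster label
    cluster_size: int  # number of proteins in the relatedness cluster


def rank_candidates(candidates):
    """Order candidates so members of the largest clusters come first.

    Ties are broken by protein_id so the ordering is deterministic.
    """
    return sorted(candidates, key=lambda c: (-c.cluster_size, c.protein_id))


if __name__ == "__main__":
    pool = [
        Candidate("prot_A", "cluster_1", 12),
        Candidate("prot_B", "cluster_2", 850),
        Candidate("prot_C", "singleton", 1),
    ]
    for candidate in rank_candidates(pool):
        print(candidate.protein_id, candidate.cluster_size)
```

Under this assumption, a validated function for a member of a large cluster can inform the annotation of many related proteins at once, whereas a singleton informs only itself.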

A database that accesses RefSeq is not unique, but COMBREX adds traceability, Kasif says. "COMBREX is the only database in the game that actually tries to insist on traceable evidence of function. All the other databases will give you the prediction, but they don't give you where it came from."

Further, COMBREX is collaborating with NCBI and UniProt to develop a "Gold Standard" database to contain data on only those genes that are experimentally validated and accompanied by published results, Roberts says. This resource "developed as an offshoot of COMBREX. It did not occur to me that such a database didn't exist."

This database with its high-quality annotated functions can also be used for developing algorithms with which to predict protein functions from gene sequences, Kasif says. Between 50,000 and 100,000 genes are characterized experimentally with results published in the literature, but tracking down these results can be slow going. "The gold standard database is very small right now," he says. "There are tens of thousands of proteins that have experimental evidence that we don't yet have in the system."

Yet another purpose of this database is to prevent further propagation of misannotations, according to Steffen. A 2009 study by Patricia Babbitt of the University of California, San Francisco and her collaborators found that among 37 protein families and 7,000 sequences, the overall rate of misannotation was 40%, and the authors note that misannotation rates appear to be increasing. Because public databases often lack documentation as well as a clear trail of evidence, they also can amplify errors. "It becomes like a game of 'telephone'," he says. "When people repeat things, slightly different meanings and messages can get propagated [in the annotation] and at the end it could be very different than it started."

In addition to building two databases, the founders of COMBREX are developing a machine-readable structured vocabulary for describing gene functions. "When you get two predictions for the same protein, if they're coded in the form of text, we can't even compare them [in an automated fashion]," Kasif says. "So we're trying to assign structured annotations to genes with as much accuracy as we can." It is essential to develop a comprehensive systematization of function across all known genes, he adds. Formalizing gene function descriptions would be useful for developing algorithms to work on a genome-wide scale, identifying gaps in metabolic pathways or essential functions.
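Kasif's point about free text versus structured annotations can be made concrete with a small sketch. The Python below assumes, purely for illustration, that functions are encoded as dotted hierarchical codes patterned loosely on Enzyme Commission-style numbering. The source does not say which vocabulary COMBREX uses, and the codes, labels, and function names here are hypothetical.

```python
def parse_code(code):
    """Split a dotted hierarchical code such as '3.6.4.12' into its levels."""
    return tuple(code.strip().split("."))


def shared_depth(code_a, code_b):
    """Count how many leading levels two structured codes have in common."""
    depth = 0
    for level_a, level_b in zip(parse_code(code_a), parse_code(code_b)):
        if level_a != level_b:
            break
        depth += 1
    return depth


def text_match(label_a, label_b):
    """Naive free-text comparison: exact match after trimming and lowercasing."""
    return label_a.strip().lower() == label_b.strip().lower()


if __name__ == "__main__":
    # Two free-text predictions that may describe the same activity
    # cannot be reconciled by simple string comparison.
    print(text_match("DNA helicase", "ATP-dependent DNA unwinding enzyme"))

    # Two illustrative structured codes can be compared level by level.
    print(shared_depth("3.6.4.12", "3.6.4.13"))
```

With structured codes, software can report that two predictions agree down to a given level of the hierarchy, which is the kind of automated comparison that free-text labels do not support.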

Anticipating, and Debating, the Role COMBREX Can Play

Because COMBREX ties in with the larger bioinformatics community, its role is being debated. In its support, its founders point to knowledge gaps, disorganization in terms of research goals, widespread errors, the lack of data traceability, and large numbers of uninformative annotations as key reasons for establishing COMBREX. Providing support for projects to validate protein functions, for example, appears necessary in light of the huge number of genes with no known molecular functions. These genes, described as "hypothetical proteins," account for roughly 25% of the 3.4 million genes in the COMBREX database.

The problem of unidentified gene functions is "significant" and, as genomics explores a broader swath of microbial diversity, the situation is worsening, says Folker Meyer, Associate Division Director for the Institute of Genomics and Systems Biology at Argonne National Laboratory in Argonne, Ill., who is not affiliated with COMBREX. A decade ago, the functions of about 65% of the proteins encoded in a microbial genome could be deciphered, he says. "If you did that same study with some of the genomes being studied today, the ratio would be even worse. The community has, in many ways, stopped making progress on improving the percent of proteins we can identify in microbial genomes."

The COMBREX initiative, if it works out, would be very useful to the community, according to Owen White, Director of Bioinformatics at the School of Medicine at the University of Maryland, Baltimore. "There are a lot of unknown proteins out there, and if we knew more that would be a very good thing," he says. However, an approach to validate protein functions that relies on one-protein-at-a-time studies does not seem practical, he says. Meyer expresses similar concerns about throughput. "From what I can tell about the project, the pipelines being built [in COMBREX] can only test for certain functions," he says. "To me, it seems the question is how many functional annotations do we get per dollar?" COMBREX plans to adopt high-throughput methods for validating protein functions as soon as they become available, according to Roberts. "The problem is [that] there is no large-scale test of all metabolites," he says. "If someone who knows of good high-throughput methods would get in touch with us, we would love to work with them."

One recently developed technology, the reactome array, appears to offer a promising alternative approach for validating protein functions, Roberts continues. Developed by Manuel Ferrer of the Institute of Catalysis in Madrid, Spain, and his collaborators, this technology is useful for describing metabolic patterns in cells, but operates independently of genomic analyses. Promising though it may be, the report describing this technology, published in Science in October 2009, was retracted in November 2010 on the basis of technical inconsistencies. Nonetheless, Roberts defends the underlying techniques. "We are collaborating with Manuel Ferrer on a Helicobacter pylori project, so we will get a much better feel for what his technique is capable of after that," he says.

Another question about COMBREX is what it has to offer in terms of metagenomics, Meyer says. He and his collaborators are developing a service called MG-RAST to automate annotations using metagenomics data, which are vast. "The problem is there isn't enough money to do this properly," he says. "There are just not enough resources to go mine that data."

"There will never be the resources to do the biochemistry on everything," Roberts says. "It's a general problem that the amount of sequence being generated is too high to be able to look at everything. Our perspective with COMBREX is to look at genes we know are members of big families because they're present in lots of sequenced genomes, and many of these will show up in the metagenomics datasets. You have to start somewhere. As far as we're concerned, COMBREX is a pilot project and it is successful if we can get a lot of genes annotated. Then, hopefully that success will convince funding agencies to put a lot more money into this project."

How to Use COMBREX

The COMBREX website enables users to participate in a number of ways (http://www.combrex.org):

Apply for a Grant. COMBREX awards small grants in the range of $5,000 to $10,000 to defray reagent costs and to support salaries. Because the grants are modest, applications are simplified, and a short (2-3 pages) template is available from the website. Reviews are completed within a few weeks, and proposals are accepted on a rolling basis without deadlines. COMBREX encourages graduate students and postdoctoral fellows to submit proposals.

Submit Predictions of Gene Functions. User-submitted predictions become candidates for experimental validation by the entire community, without excluding submitters.

Notify COMBREX of Experimental Validations. Users can recommend classifying gene products as "Gold Standard" proteins by providing two numbers: a UniProt accession number and a PubMed ID. By notifying COMBREX of relevant results, they can help to publicize their own findings.
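Because a "Gold Standard" recommendation consists of just two identifiers, a submitter could sanity-check their format before sending them. The Python sketch below is a hedged illustration: the accession pattern follows the format described in the UniProt help pages as best understood here, the PubMed ID check assumes only that the ID is a positive integer, and is_plausible_submission is a hypothetical helper, not part of any COMBREX interface. It checks format only and does not confirm that either record exists.

```python
import re

# Assumed UniProt accession format (6 or 10 characters), per UniProt help pages.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# Assumption: a PubMed ID is a positive integer with no leading zero.
PUBMED_ID = re.compile(r"[1-9][0-9]*")


def is_plausible_submission(accession, pubmed_id):
    """Return True if both identifiers have a plausible format.

    This is a format check only; it does not query UniProt or PubMed.
    """
    accession_ok = UNIPROT_ACCESSION.fullmatch(accession.strip()) is not None
    pubmed_ok = PUBMED_ID.fullmatch(str(pubmed_id).strip()) is not None
    return accession_ok and pubmed_ok


if __name__ == "__main__":
    # Placeholder identifiers chosen only to exercise the format check.
    print(is_plausible_submission("P12345", "19876543"))
    print(is_plausible_submission("12345", "not-a-number"))
```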


SUGGESTED READING

Beloqui, A., M.-E. Guazzaroni, F. Pazos, J. M. Vieites, M. Godoy, O. V. Golyshina, T. N. Chernikova, A. Waliczek, R. Silva-Rocha, Y. Al-ramahi, V. La Cono, C. Mendez, J. A. Salas, R. Solano, M. M. Yakimov, K. N. Timmis, P. N. Golyshin, and M. Ferrer. 2009. Reactome array: forging a link between metabolome and genome. Science 326:252-257. (Retraction Science 330:912, 2010.)

Schnoes, A. M., S. D. Brown, I. Dodevski, and P. C. Babbitt. 2009. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput. Biol. 5:e1000605. doi:10.1371/journal.pcbi.1000605.