Peter D. Karp is Director, Ingrid M. Keseler is a Database Curator, Tomer Altman is a Scientific Programmer, Ron Caspi is a Database Curator, Carol A. Fulcher is a Database Curator, Pallavi Subhraveti is a Scientific Programmer, Anamika Kothari is a Database Curator, Markus Krummenacker is a Scientific Programmer, Mario Latendresse is a Computer Scientist, Tom Lee is a Computer Scientist, Suzanne M. Paley is a Scientific Programmer, Alexander G. Shearer is a Database Curator, and Miles Trupp is a Database Curator in the Bioinformatics Research Group, SRI International, Menlo Park, Calif.
Summary
● The BioCyc system offers search and visualization tools for genes, metabolic pathways, genomes, and regulatory and metabolic networks.
● Databases within BioCyc are organized into tiers according to the amount of manual review and updating they receive.
● BioCyc uses the PathoLogic program to infer metabolic pathways from genome data, and to generate new Pathway/Genome Databases.
● Users can use BioCyc over the Internet or may choose to install the program on their personal computers to run as a desktop application.
Genome sequencing capabilities and capacity are so advanced that the genomes of virtually all microbial species of interest are likely to be deciphered during the next decade. Improved means for interpreting and applying this sequence information enhance its value. The BioCyc database collection and website include data for more than 1,000 microbial genomes as well as tools for analyzing that information. These resources help users to address key questions such as: What are the functions of the upstream and downstream neighbors of a gene? How far apart are the genes, and are they likely to form an operon? What is the sequence of the gene and DNA segments near it? What metabolic pathway does a gene play a role in?
The BioCyc.org website offers user-friendly search and visualization tools that include information on individual genes and pathways as well as browsers for genomes and regulatory and metabolic networks. With BioCyc, users can navigate a map of metabolic reactions, much as if using Google Maps, or follow connections in regulatory networks, as if exploring social networks. Furthermore, BioCyc provides a wealth of comparative analysis capabilities at both the gene and system levels.
BioCyc contains databases (DBs) mainly describing prokaryotes but also several eukaryotes, including humans, mice, fruit flies, and yeast. We call the BioCyc databases Pathway/Genome Databases (PGDBs) because these DBs couple genome data with information about metabolic pathways. Software that predicts the metabolic pathways of organisms from annotated genome sequences was used extensively in assembling the BioCyc PGDBs. Many PGDBs also include results from other computational procedures applied to genomes, including predictions of which genes code for missing enzymes in metabolic pathways and the locations of predicted operons. BioCyc also includes comparative genomics tools and software for analyzing gene expression, proteomics, metabolomics, and ChIP-chip data.
Many microbial genome sequencing projects are making use of the Pathway Tools software that underlies BioCyc to create PGDBs for organisms of interest. Moreover, this software enables researchers to share genome data through Pathway Tools-powered Web sites that are accessed using the same interface as BioCyc. A BioCyc workshop/tutorial will be presented at the 2011 ASM General Meeting.
BioCyc Data Content
DBs within BioCyc are organized into tiers according to the amount of manual review and updating (curation) they receive. Although researchers at SRI created most BioCyc PGDBs, other research groups also produce PGDBs for BioCyc to host, facilitating comparative analyses.
Tier 1 PGDBs are created through intensive manual curation efforts and are updated on an ongoing basis. The EcoCyc PGDB for Escherichia coli K-12 reflects efforts to enter information from 20,300 published articles about the metabolism, transport, and regulatory processes of this organism; it is updated frequently to reflect newly characterized E. coli genes.
The multiorganism MetaCyc Tier 1 PGDB, which differs from all other BioCyc PGDBs, contains information describing experimentally elucidated enzymes and metabolic pathways from more than 1,800 organisms. Although broad-based, it does not model the genome, proteome, reactome, or pathway complement of any one organism. Instead, scientists use MetaCyc to answer metabolic questions that span multiple domains of life, such as "what are all the pathways for arginine degradation in microbes?" or "what cofactor biosynthesis pathways are known in bacteria?"
For questions about the genome, proteome, or metabolic network of a particular organism, researchers typically consult the organism-specific PGDB. For example, MetaCyc contains experimental data from studies of 36 enzymes in 14 metabolic pathways of Staphylococcus aureus. In contrast, the BioCyc Staphylococcus aureus RF122 PGDB describes 189 metabolic pathways, most of which are computationally predicted, plus genome sequence and proteome data.
We use Pathway Tools to generate Tier 2 PGDBs, which have usually received less than one year of curation to remove erroneous pathway predictions, add information about metabolic pathways from the literature, and define multimeric protein complexes. Recent additions to Tier 2 PGDBs include BsubCyc and SynelCyc. One of us (I.M.K.) at SRI generated BsubCyc for Bacillus subtilis as part of a larger resource-development project for researchers who focus on this microorganism. BsubCyc is lightly curated on an ongoing basis, with references to new literature, updates to gene functions and gene and protein names, and data describing insights into its metabolic and regulatory networks. In addition, publicly available data sets as well as data and links to resources are being imported.
The SynelCyc database is devoted to the freshwater cyanobacterium Synechococcus elongatus PCC 7942, a model organism for studying photosynthesis, circadian clock systems, and nitrogen metabolism. We refined this PGDB manually, curating data from more than 600 research papers from many different fields. The S. elongatus genome contains 2,720 genes, which we assigned computationally to 1,964 transcription units (operons). These genes encode 2,669 polypeptides, including 723 enzymes, 575 of which catalyze 913 reactions involving small molecules. SynelCyc describes 988 enzymatic reactions, 693 of which participate in 194 pathways.
SRI created several Tier 2 PGDBs that other groups adopted to curate; for example, the Tuberculosis Database (TBDB) project took over the BioCyc PGDB for Mycobacterium tuberculosis. Indeed, we encourage scientists to adopt these PGDBs as a way of distributing the task of curating genomes over a wide group of scientists who can contribute their diverse expertise. (Please contact us if you would like to adopt a DB.)
PathoLogic also generated 970 Tier 3 PGDBs that were not reviewed and are not being updated. Therefore, their predictions should be treated with special caution, particularly because PathoLogic is tuned to err on the side of over-predicting pathways as a way of bringing attention to their potential presence.
Many Sources for BioCyc Databases
We obtain genome data to include in BioCyc from a number of sources. Most genome data were obtained from the Comprehensive Microbial Resource, whose main source is GenBank. BioCyc contains additional genomes obtained directly from GenBank (usually RefSeq). Users can determine the source of a BioCyc genome from the PGDB summary page that can be viewed by invoking the Web command Tools → Reports → Summary Statistics.
When creating a new PGDB, the PathoLogic program goes through the following series of operations:
● It creates DB objects for each replicon and/or contig, gene, protein, and RNA within the annotated genome.
● It predicts the reactome of the organism by matching enzyme names, EC numbers, and gene ontology terms within the annotated genome to reactions in the MetaCyc DB.
● It predicts metabolic pathways by matching reactions to MetaCyc pathways.
● The pathway hole-filler component of PathoLogic predicts which genes likely code for enzymes that appear to be missing from metabolic pathways.
● The operon component predicts likely operons.
● The transport inference parser infers transport reactions.
● An algorithm generates an organism-specific metabolic diagram (Fig. 1).
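The reactome and pathway steps in the list above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not PathoLogic's actual algorithm: the reference tables, every identifier in them, and the `min_fraction` cutoff are invented for the example, and matching is reduced to exact lookup of an EC-like string.

```python
from typing import Dict, List, Set

# Hypothetical stand-ins for reference data such as MetaCyc.
# All identifiers below are invented placeholders, not real entries.
REACTION_BY_EC: Dict[str, str] = {
    "ec:0.0.0.1": "RXN-A",
    "ec:0.0.0.2": "RXN-B",
    "ec:0.0.0.3": "RXN-C",
    "ec:0.0.0.4": "RXN-D",
}
PATHWAYS: Dict[str, List[str]] = {
    "PWY-1": ["RXN-A", "RXN-B", "RXN-C"],
    "PWY-2": ["RXN-C", "RXN-D"],
}


def predict_reactome(gene_ecs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each reference reaction to the genes whose annotation matches it."""
    reactome: Dict[str, Set[str]] = {}
    for gene, ec_numbers in gene_ecs.items():
        for ec in ec_numbers:
            reaction = REACTION_BY_EC.get(ec)
            if reaction is not None:
                reactome.setdefault(reaction, set()).add(gene)
    return reactome


def predict_pathways(reactome: Dict[str, Set[str]],
                     min_fraction: float = 0.5) -> Dict[str, List[str]]:
    """Return each pathway judged present, mapped to its unmatched reactions.

    A pathway is called present when at least `min_fraction` of its reactions
    have a matching gene (an assumed cutoff chosen for this sketch). The
    unmatched reactions are the "holes" a hole-filler would then examine.
    """
    predicted: Dict[str, List[str]] = {}
    for pathway, reactions in PATHWAYS.items():
        matched = [r for r in reactions if r in reactome]
        if len(matched) / len(reactions) >= min_fraction:
            predicted[pathway] = [r for r in reactions if r not in reactome]
    return predicted


# Toy annotated genome: gene name -> EC-like annotations (all hypothetical).
annotation = {
    "geneX": ["ec:0.0.0.1"],
    "geneY": ["ec:0.0.0.2"],
    "geneZ": ["ec:0.0.0.9"],  # no matching reference reaction
}
reactome = predict_reactome(annotation)
pathways = predict_pathways(reactome)
```

With this toy input, PWY-1 is predicted with RXN-C reported as a hole, while PWY-2 is not predicted because none of its reactions has a matching gene. A permissive cutoff such as the one assumed here leans toward over-prediction, which is why uncurated predictions deserve caution.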
Using BioCyc
The BioCyc data and software tools are being used in many different ways, including as a reference for operons predicted from analyses of gene regulation, as a resource for identifying metabolic pathways shared by related species, and for analyzing meta-proteomic data from the human salivary microbiome. In metabolic engineering, they are used as a source of reaction and enzyme information to target genetic modifications, to create modeling tools using the reactions and pathways of BioCyc DBs, as the basis for predictive tools to identify missing enzymes in metabolic models, and as a reference tool to test metabolite network prediction algorithms. In drug development, they are used for mapping of metabolites quantified during pharmacokinetics studies, for construction of genome-scale metabolic networks, for analyzing essential metabolites for pathogen viability, for mapping of infective phase transcriptomes to metabolic pathways, and for mapping drug interactions with metabolic reactions.
Searches using the BioCyc Web site typically begin with a user selecting a database, with EcoCyc being the default. To search a different DB, a user can choose the link "change organism database." In the next dialog, the user then can find a PGDB by genus and species name of the organism via one of several routes. Users who create (free) BioCyc accounts may select a preferred PGDB to replace EcoCyc as a default for future website visits. The quick search option on each BioCyc page is useful for users who know the name of a sought object, such as genes, proteins, compounds, RNAs, reactions, pathways, operons, and GO terms.
When a query string matches a single object, the page displays immediately. If there are multiple matches, the full list of matches displays, organized by object type. Users then can move easily to corresponding BioCyc gene or pathway pages, and beyond. A number of more sophisticated search tools are available under the Search menu. One example is the Search → Genes/Proteins/RNAs command, which supports single- or multiple-criteria searches based on properties of genes, proteins, and RNAs (analogous tools are available for compounds, reactions, and pathways). This search page contains many alternative search fields, including searching for genes whose nucleotide sequence length falls within a specified range; for gene products that interact with a specified small molecule as a regulator, substrate, cofactor, or ligand; for gene products with a specified cellular location; and for gene products that are annotated with a given Gene Ontology term. BLAST searches against individual BioCyc genomes are also available.
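To make the idea of a multiple-criteria search concrete, the following Python sketch filters a small in-memory list of gene records by sequence length, cellular location, and an ontology term. It is only an analogy for what the search page does: the `GeneRecord` type, its field names, the `search_genes` function, and all sample values are hypothetical and do not reflect the BioCyc schema or any BioCyc API.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set, Tuple


@dataclass
class GeneRecord:
    """Hypothetical minimal gene record used only for this illustration."""
    name: str
    length_bp: int
    location: str
    go_terms: Set[str] = field(default_factory=set)


def search_genes(records: Iterable[GeneRecord],
                 length_range: Optional[Tuple[int, int]] = None,
                 location: Optional[str] = None,
                 go_term: Optional[str] = None) -> List[GeneRecord]:
    """Return records satisfying every criterion that was supplied.

    A criterion left as None is ignored, so the same function covers
    single-criterion and multiple-criteria searches.
    """
    hits: List[GeneRecord] = []
    for rec in records:
        if length_range is not None:
            low, high = length_range
            if not (low <= rec.length_bp <= high):
                continue
        if location is not None and rec.location != location:
            continue
        if go_term is not None and go_term not in rec.go_terms:
            continue
        hits.append(rec)
    return hits


# Invented sample records; names, lengths, and term labels are placeholders.
genes = [
    GeneRecord("geneA", 900, "cytosol", {"TERM-1"}),
    GeneRecord("geneB", 1500, "inner membrane", {"TERM-1", "TERM-2"}),
    GeneRecord("geneC", 1200, "inner membrane", {"TERM-3"}),
]
membrane_term1 = search_genes(genes, length_range=(1000, 2000),
                              location="inner membrane", go_term="TERM-1")
```

Here only geneB satisfies all three criteria; dropping the `go_term` argument would return geneB and geneC.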
The genome browser can start either at a gene-level view or a one-screen view of an entire chromosome (see Tools menu). The cellular overview, a tool for navigating metabolic maps, can be dragged and zoomed just as with Google Maps, and the regulatory network viewer can be dragged and zoomed in the same way. The regulatory network viewer first shows all genes within a genome and enables users to add arrows that denote regulatory relationships. After a set of regulatory relationships is added, the user can examine the network in great detail.
The comparative analysis tools allow users to generate statistical reports within or across a selected set of BioCyc PGDBs. The report options include comparisons of pathways, reactions, metabolites, and transporters. For example, the pathway report compares the pathway complements of selected organisms using the MetaCyc pathway ontology, which includes groups such as biosynthetic pathways, degradative pathways, and subclasses thereof. For instance, to compare pathway complements between E. coli and B. subtilis, see http://biocyc.org/comp-genomics?tables_pathway&orgid_BSUB&orgid_ECOLI&orgids_%28BSUB_ECOLI_%29.
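At its core, comparing pathway complements is a set comparison. The Python sketch below shows that idea on invented data; the organism labels, pathway identifiers, and the `compare_pathway_complements` function are hypothetical, and the real report additionally groups results by the MetaCyc pathway ontology, which is not modeled here.

```python
from typing import Dict, Set


def compare_pathway_complements(
        complements: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split pathways into those shared by all organisms and those unique to one."""
    if not complements:
        return {"shared": set()}
    organisms = list(complements)
    report: Dict[str, Set[str]] = {
        "shared": set.intersection(*(complements[o] for o in organisms))
    }
    for org in organisms:
        # Union of every other organism's pathways (empty if there are none).
        others: Set[str] = set()
        for other in organisms:
            if other != org:
                others |= complements[other]
        report[f"only_{org}"] = complements[org] - others
    return report


# Invented pathway complements for two placeholder organisms.
toy_complements = {
    "ORG1": {"PWY-1", "PWY-2", "PWY-3"},
    "ORG2": {"PWY-2", "PWY-3", "PWY-4"},
}
pathway_report = compare_pathway_complements(toy_complements)
```

For the toy input, PWY-2 and PWY-3 are shared, PWY-1 appears only in ORG1, and PWY-4 appears only in ORG2.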
The transporter report compares the set of substrates that the organisms can import and export, compares the transporter gene complements, and compares the transported substrates with the substrates of metabolic pathways to determine where mismatches may exist, potentially indicating missing pathways or transporters, or annotation errors.
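The mismatch check described for the transporter report can be illustrated with simple set differences. The sketch below is an assumption-laden simplification rather than the report's actual logic: the function name, its three inputs, and the compound labels are all invented, and export is ignored.

```python
from typing import Dict, Set


def transport_mismatches(imported: Set[str],
                         pathway_inputs: Set[str],
                         pathway_products: Set[str]) -> Dict[str, Set[str]]:
    """Flag compounds whose transport and metabolism do not line up.

    - "imported_but_unused": imported compounds that no pathway consumes,
      hinting at a missing pathway or an annotation error.
    - "needed_but_unsupplied": pathway inputs that are neither imported nor
      produced by any pathway, hinting at a missing transporter or pathway.
    """
    return {
        "imported_but_unused": imported - pathway_inputs,
        "needed_but_unsupplied": pathway_inputs - imported - pathway_products,
    }


# Invented compound sets for one placeholder organism.
mismatches = transport_mismatches(
    imported={"cpd-a", "cpd-b"},
    pathway_inputs={"cpd-a", "cpd-c", "cpd-d"},
    pathway_products={"cpd-d"},
)
```

In this toy case cpd-b is imported but never consumed, and cpd-c is consumed but has no known source; cpd-d raises no flag because a pathway produces it.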
The comparative genome browser looks at chromosomal regions around orthologous sets of genes in different organisms (Fig. 3). Such comparisons are invoked using the gene-page button, "Align in Multi-Genome Browser," and BioCyc subscribers can store organism lists to speed future comparisons. Moreover, from a given gene, compound, pathway, or reaction page, users can view pages for that same entity in other PGDBs, e.g., users can ask the program to show a particular gene from all databases. In addition, the reaction and pathway pages contain a species comparison command that compares reaction or pathway information across sets of organisms.
Omics Data Analysis
BioCyc provides visual tools for analyzing large-scale datasets that, for example, can convert tab-delimited data into a color scale (Fig. 4). Moreover, omics data can be painted onto cellular overviews or onto single-pathway diagrams. These omics analysis tools provide different perspectives for the same data. All three tools can produce animations when multiple data columns are provided. Further, transcriptomics, proteomics, and metabolomics data can be painted onto such diagrams. These tools are available for any PGDB except MetaCyc.
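The conversion of tab-delimited data into a color scale can be pictured with a short Python sketch. It assumes a two-column input (object name, numeric value) and a linear blue-to-red gradient; both the input layout and the gradient are choices made for this illustration and are not a description of the file format or color scheme that BioCyc uses.

```python
import csv
import io
from typing import Dict


def value_to_hex(value: float, low: float, high: float) -> str:
    """Map a value linearly onto a blue (low) to red (high) hex color."""
    if high == low:
        fraction = 0.5  # all values equal: use the midpoint color
    else:
        fraction = min(max((value - low) / (high - low), 0.0), 1.0)
    red = round(255 * fraction)
    blue = round(255 * (1.0 - fraction))
    return f"#{red:02x}00{blue:02x}"


def colors_from_tsv(text: str) -> Dict[str, str]:
    """Parse 'name<TAB>value' lines and assign each name a color."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    values = {row[0]: float(row[1]) for row in reader if row}
    if not values:
        return {}
    low, high = min(values.values()), max(values.values())
    return {name: value_to_hex(v, low, high) for name, v in values.items()}


# Invented expression-like values for three placeholder genes.
sample = "geneA\t-2.0\ngeneB\t0.0\ngeneC\t2.0\n"
colors = colors_from_tsv(sample)
```

With the sample input, the lowest value maps to pure blue, the highest to pure red, and the midpoint to purple. Supplying several data columns and repeating the mapping per column is the kind of extension that would underlie an animation.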
Users can also install BioCyc on their personal computers to run as a desktop application, which runs faster and offers many more operations than the Web version. In addition, a locally installed version can run as a Web server on an intranet, eliminating Internet delays. Additional information about BioCyc is available online, including a guided tour, examples of different types of data in BioCyc, a guide to BioCyc, and a set of interactive online seminars. We value suggestions regarding BioCyc, and they can be submitted using the "Report Errors or Provide Feedback" link at the bottom of each BioCyc data page.
ACKNOWLEDGMENTS
We thank the many groups that have contributed PGDBs to BioCyc. The projects described were supported by award numbers GM080746, GM75742, GM077678, GM088849, and GM092616 from the National Institute of General Medical Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health.
SUGGESTED READING
Caspi, R., T. Altman, J. M. Dale, K. Dreher, C. A. Fulcher, F. Gilham, P. Kaipa, A. S. Karthikeyan, A. Kothari, M. Krummenacker, M. Latendresse, L. A. Mueller, S. Paley, L. Popescu, A. Pujar, A. G. Shearer, P. Zhang, and P. D. Karp. 2010. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 38:D473-479, doi: 10.1093/nar/gkp875.
Dale, J. M., L. Popescu, and P. D. Karp. 2010. Machine learning methods for metabolic pathway prediction. BMC Bioinformatics 11:15.
Green, M. L., and P. D. Karp. 2004. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 5:76.
Karp, P. D. 2001. Pathway databases: a case study in computational symbolic theories. Science 293:2040-2044.
Karp, P. D., S. M. Paley, M. Krummenacker, M. Latendresse, J. M. Dale, T. Lee, P. Kaipa, F. Gilham, A. Spaulding, L. Popescu, T. Altman, I. Paulsen, I. M. Keseler, and R. Caspi. 2010. Pathway Tools version 13.0: Integrated software for pathway/genome informatics and systems biology. Briefings Bioinformatics 11:40-79, doi: 10.1093/bib/bbp043.
Keseler, I. M., C. Bonavides-Martinez, J. Collado-Vides, S. Gama-Castro, R. P. Gunsalus, D. Aaron Johnson, M. Krummenacker, L. M. Nolan, S. M. Paley, I. T. Paulsen, M. Peralta-Gil, A. Santos-Zavaleta, A. G. Shearer, and P. D. Karp. 2009. EcoCyc: A comprehensive view of E. coli biology. Nucleic Acids Res. 37:D464-470.
Lee, T. J., I. Paulsen, and P. D. Karp. 2008. Annotation-based inference of transporter function. Bioinformatics 24:i259-267.
Suthers, P. F., M. S. Dasika, V. S. Kumar, G. Denisov, J. I. Glass, and C. D. Maranas. 2009. A genome-scale metabolic reconstruction of Mycoplasma genitalium, iPS189. PLoS Comput. Biol. 5:e1000285.