The First Paper in Bioinformatics?
In the early years of molecular biology, once it became known that DNA was the hereditary material and that DNA encoded information exactly specifying protein sequences, a number of coding schemes were proposed to explain how various combinations of the four unique bases of DNA might specify particular amino acids. It is obvious that at least three bases would be required to form "words" (codons in contemporary parlance) corresponding to the 20 unique amino acids. A doublet code could specify at most 16 (4²) amino acids while a triplet code, the most parsimonious option, could easily specify 20 amino acids, with some degeneracy, because it allows for 64 (4³) unique codons. A more vexing and less tractable problem was related to the actual reading of the code: was the code an overlapping one? That is, once the (unknown) reading machinery had sensed a triplet, did it advance by three bases to the next triplet, or did it advance by only one base, sensing a triplet that had two letters in common with the preceding one?
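
The counting argument and the two reading schemes can be sketched in a few lines of Python. This is only an illustration: the function names and the example sequence are assumptions introduced here, not anything taken from the papers discussed.

```python
BASES = "ACGT"          # the four DNA bases
N_AMINO_ACIDS = 20      # amino acids the code must specify


def words_available(word_length: int) -> int:
    """Number of distinct words of a given length over four bases."""
    return len(BASES) ** word_length


def shortest_sufficient_word_length(n_targets: int = N_AMINO_ACIDS) -> int:
    """Smallest word length whose vocabulary covers n_targets items."""
    length = 1
    while words_available(length) < n_targets:
        length += 1
    return length


def read_triplets(dna: str, step: int) -> list[str]:
    """Split a sequence into complete triplets, advancing `step` bases each time.

    step=3 models a nonoverlapping reading; step=1 models an overlapping
    reading in which adjacent triplets share two bases.
    """
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, step)]
```

For the arbitrary sequence "ACGTAC", `read_triplets("ACGTAC", 3)` yields two triplets (ACG, TAC), whereas `read_triplets("ACGTAC", 1)` yields four (ACG, CGT, GTA, TAC), each sharing two bases with its predecessor.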

In his 1957 paper addressing this problem (S. Brenner, Proc. Natl. Acad. Sci. USA 43:687-694, 1957), Sydney Brenner noted that, if the code were an overlapping one, the sharing of two bases between adjacent codons would automatically impose constraints on the identity of neighboring amino acids. The identity of each successive amino acid residue in the polypeptide chain would be constrained by the identity of the preceding one, and four bases would suffice to specify any given dipeptide. This would imply that only certain amino acid pairs could occur as neighbors, and other pairs would be forbidden. Therefore, there could not be more than 256 (i.e., 4⁴) unique dipeptides, however large the set of protein sequences. In 1957, however, fewer than 256 dipeptides were known, so a verdict could not be reached from the number of unique dipeptides observed in nature. On the other hand, if the code were nonoverlapping, all 400 (20 x 20) unique dipeptides would be found to occur if a sufficiently large set of protein sequences were available.
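
The 256-dipeptide ceiling can be checked by brute force. The sketch below assumes a purely hypothetical codon assignment (`TOY_CODE`, an arbitrary degenerate mapping that is not the real genetic code and not any scheme considered by Brenner) and enumerates every four-base window to see which dipeptides an overlapping reading could ever produce.

```python
from itertools import product

BASES = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter symbols for the 20 amino acids

# Hypothetical assignment: NOT the real genetic code, just an arbitrary
# degenerate mapping of all 64 triplets onto the 20 amino acids.
TRIPLETS = ["".join(t) for t in product(BASES, repeat=3)]
TOY_CODE = {t: AMINO_ACIDS[i % len(AMINO_ACIDS)] for i, t in enumerate(TRIPLETS)}


def dipeptides_under_overlapping_code(code: dict[str, str]) -> set[str]:
    """Every dipeptide that an overlapping triplet code can produce.

    Adjacent codons share two bases, so a dipeptide is fixed by four
    consecutive bases: codon 1 is bases 1-3 and codon 2 is bases 2-4.
    """
    found = set()
    for quad in product(BASES, repeat=4):
        window = "".join(quad)
        found.add(code[window[:3]] + code[window[1:]])
    return found


max_overlapping = len(BASES) ** 4            # 256 four-base windows
max_nonoverlapping = len(AMINO_ACIDS) ** 2   # 400 ordered amino acid pairs
reachable = dipeptides_under_overlapping_code(TOY_CODE)
print(len(reachable) <= max_overlapping < max_nonoverlapping)
```

Whatever assignment is substituted for `TOY_CODE`, the set of reachable dipeptides cannot exceed the number of four-base windows, which is the point of the argument.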

Brenner's elegant solution to this problem was to use the limited protein data to draw inferences about the genetic code by counting the observed number of amino acid neighbors (N- or C-terminal) for each of the 20 amino acids. As noted earlier, an overlapping code implies the sharing of two bases between successive triplets. Therefore, any triplet can be followed by (or preceded by) only four unique triplets, i.e., the two shared bases plus any one of the four bases. Thus, for any given amino acid x, one unique base triplet must be assigned for every four unique N-terminal (or C-terminal) neighbors observed. Now, the number of unique N- and C-terminal neighbors observed for each of the 20 amino acids can be compiled from the list of known dipeptides. The greater of the numbers of unique N-terminal and C-terminal neighbors for x, divided by four and rounded up, gives the minimum number of unique triplets required to encode x. Taking the specific case of serine, 17 unique N-terminal neighbors and 13 unique C-terminal neighbors are observed from the list of dipeptides. The greater of these numbers is 17, which corresponds to four sets of four unique amino acid neighbors plus one lone amino acid (forming the fifth set). Serine therefore requires at least five unique triplets. This number can be determined for each of the 20 amino acids in a like manner from the list of dipeptides. The total of this quantity for all 20 amino acids is the minimum number of triplets required to encode all amino acids, assuming an overlapping code.
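
The neighbor-counting procedure described above can be sketched as follows. The function names are assumptions introduced for illustration, and the seven-dipeptide list in the usage note is a toy input, not Brenner's data; only the serine figures (17 and 13) come from the text.

```python
from collections import defaultdict
from math import ceil

# Under an overlapping code, a triplet can be followed (or preceded)
# by only four distinct triplets.
NEIGHBORS_PER_TRIPLET = 4


def neighbor_sets(dipeptides):
    """Collect the distinct N- and C-terminal neighbors of each residue.

    Each dipeptide is a two-letter string written N-terminus first, so in
    "AB" residue A is the N-terminal neighbor of B, and B is the
    C-terminal neighbor of A.
    """
    n_side = defaultdict(set)
    c_side = defaultdict(set)
    for first, second in dipeptides:
        c_side[first].add(second)
        n_side[second].add(first)
    return n_side, c_side


def min_triplets(n_count, c_count):
    """Fewest triplets one residue needs, given its neighbor counts."""
    return ceil(max(n_count, c_count) / NEIGHBORS_PER_TRIPLET)


def min_triplets_total(dipeptides):
    """Sum the per-residue minimum over every residue in the data."""
    n_side, c_side = neighbor_sets(dipeptides)
    residues = set(n_side) | set(c_side)
    return sum(
        min_triplets(len(n_side.get(r, ())), len(c_side.get(r, ())))
        for r in residues
    )
```

With the serine counts quoted above, `min_triplets(17, 13)` evaluates to 5. On the toy list `["SG", "SA", "GS", "AS", "LS", "VS", "KS"]`, serine has five N-terminal neighbors and so needs two triplets, the other five residues need one each, and `min_triplets_total` returns 7.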

However, upon actual enumeration, Brenner found that the minimum number of unique triplets required to encode the 20 amino acids came to 70, six more than the 64 triplets that are theoretically possible. This summarily ruled out the possibility of an overlapping triplet code.

Why is this bioinformatics?

Can we justifiably classify this as "bioinformatics"? The National Center for Biotechnology Information defines "bioinformatics" as the "merging of biology, computer science and information technology into a single science" (http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html). Even when this definition is broadened to cover the use of computational approaches to analyze biological data, with or without automation, most publications consider either the inference of evolutionary history and phylogeny from gene and protein sequences by Zuckerkandl and Pauling [E. Zuckerkandl and L. Pauling, p. 97-166, in V. Bryson and H. J. Vogel (ed.), Evolving genes and proteins, Academic Press, New York, 1965; E. Zuckerkandl and L. Pauling, J. Theoret. Biol. 8:357-366, 1965] or Margaret Dayhoff's Atlas of Protein Sequence and Structure (M. O. Dayhoff, National Biomedical Research Foundation, Silver Spring, Md., 1965) to be the earliest milestone in the field. In the detailed and extensive review of the history of bioinformatics by Ouzounis and Valencia (C. A. Ouzounis and A. Valencia, Bioinformatics 19:2176-2190, 2003), we find mention of no paper before Vernon Ingram's 1961 paper on Gene Evolution and the Haemoglobins (Nature 189:704-708, 1961) as an instance of "early bioinformatics." However, a reappraisal of Brenner's 1957 paper indicates that, in essence, it involves the (manual) computational analysis of a database of dipeptide sequences to draw inferences regarding a biological process, and may well merit the appellation of bioinformatics.

Ramakrishnan Sitaraman
TERI University
New Delhi, India

