Title: Comparing function and structure between entire proteomes
Author: Jinfeng Liu & Burkhard Rost
Quote: Protein Science(2001), 10, 1970-1979

Comparing function and structure between entire proteomes

Jinfeng Liu1,2 & Burkhard Rost1,3

1 CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street, New York, NY 10032, USA

2 Graduate Program in Pharmacology, Columbia University, 630 West 168th Street, New York, NY 10032, USA

3 Corresponding author: rost@columbia.edu, http://cubic.bioc.columbia.edu/



Abstract

More than 30 organisms have been entirely sequenced. Here, we applied a variety of simple bioinformatics tools to analyse 29 proteomes for representatives from all three kingdoms: eukaryotes, prokaryotes and archaebacteria. We confirmed that eukaryotes have relatively more long proteins than prokaryotes and archaes, and that the overall amino acid composition is similar between the three. We predicted that about 15-30% of all proteins contained transmembrane helices. We could not find a correlation between the content of membrane proteins and the complexity of the organism. In particular, we did not find significantly higher percentages of helical membrane proteins in eukaryotes than in prokaryotes or archaes. However, we found more proteins with 7 transmembrane helices in eukaryotes and more with 6 and 12 in prokaryotes. We found twice as many coiled-coil proteins in eukaryotes (10%) as in prokaryotes and archaes (4-5%), and we predicted about 15-25% of all proteins to be secreted by most eukaryotes and prokaryotes. Every tenth protein had no known homologue in current databases, and 30-40% of the proteins fall into structural families with more than 100 members. A classification by cellular function verified that eukaryotes had a higher proportion of proteins for communication with the environment. Finally, we found at least one homologue of experimentally known structure for about 20%-45% of all proteins; the regions with structural homology covered 20%-30% of all residues. These numbers may or may not suggest that there are 1200-2600 folds in the universe of protein structures. All predictions are available at http://cubic.bioc.columbia.edu/genomes.

Key words: protein sequence analysis; analysing entire genomes; helical membrane proteins; coiled coil proteins; signal peptides; comparative modelling.


Introduction

Comparative genomics begins with collecting and describing. Sequencing the entire genome of the first free-living organism, Haemophilus influenzae, opened the new era of flooding data in molecular biology [1]. Since then, more than 40 genomes have been sequenced, mostly for pathogens and model organisms. These include the first eukaryotic genome, Saccharomyces cerevisiae [2], and the first animal genomes Caenorhabditis elegans [3], Drosophila melanogaster [4], and Homo sapiens [5, 6]. What can we learn from all the data? Like zoology and botany a century ago, we are just commencing to catalogue the components of life, trying to find common features and systematic schemes. Thus, comparative genomics in its infancy is confined to describing similarities and differences. So far, synopses of 'the whole genome' have focused on archiving the functional and the structural content of organisms [7, 8, 9, 10, 11, 12, 13]. Classifying proteins according to functional criteria is difficult since function is a complex phenomenon associated with many mutually overlapping levels: chemical, biochemical, cellular, physiological, organism mediated, and developmental. These levels are related in complex ways; for example, protein kinases can be related to different cellular functions (such as cell cycle), and to a chemical function (transferase) plus a complex control mechanism by interaction with other proteins.

Bioinformatics identifies most helical membrane proteins. Membrane proteins are crucial for survival; one reason is that they mediate communication across the cell membrane. Despite the great biological and medical importance, we still have very little experimental information about the 3D structures: less than 1% of the proteins of known structure are membrane proteins. In contrast, helical membrane proteins are relatively easy to identify by bioinformatics. How many helical membrane proteins are in a genome [14, 15, 12] ? Is the fraction of helical membrane proteins constant, or does the fraction of helical membrane proteins correlate with the complexity of the organism [12] ? Are there preferences for particular numbers and topologies of transmembrane helices (TM) in some organisms [16] ? Some analyses concluded that there was no preference for proteins with a certain number of membrane helices [17, 18] , while others reported some preferences [16, 12] .

Many extra-cellular proteins identified through signal peptides. Signal peptides at the N-terminal end target many prokaryotic and eukaryotic proteins to the secretory pathway [19, 20, 21, 22] . Signal peptides are predicted accurately [20, 23, 24] . However, although secreted proteins were studied in various bacteria [23] , few groups analysed entirely sequenced eukaryotes.

Most heterogeneous identifiable class: coiled-coil proteins. Coiled-coils are typically formed as bundles of several right-handed alpha helices twisted around each other forming a left-handed super helix [25, 26] . Coiled-coil structures are often used to mediate protein-protein interaction or to build filaments and other macroscopic structures. Most known coiled-coil proteins can be detected based on particular sequence signals [27] . Thus, we can identify most coiled-coil proteins in a proteome. Six percent of all proteins in GenBank [28] appear to contain coiled-coil regions, and the percentages appear to vary between organisms [8] . Does this old finding hold up for entire proteomes?

Here, we analysed predictions for transmembrane helices, coiled-coil regions, and signal peptides for 28 entire proteomes and for 24,000 human proteins. Due to the comprehensive size of the data, we also re-analysed the distribution of protein length explored by others [29, 18] .


Results and Discussion

Simple 'bio-physical' criteria: protein length and amino acid composition

Protein lengths differ between kingdoms. Generally, prokaryotes and archaes appeared to have an asymmetric bell-shape distribution of ORF lengths with the peak around 100-300 residues, while the ORFs of eukaryotes were distributed much more evenly within the range of 100-600 residues (Fig. 1A). This was also reflected in the cumulative distributions (Fig. 1A Inset): the distributions for prokaryotes and archaes were steepest around 300 residues, while those for eukaryotes were relatively constant between 100 and 600 residues. Furthermore, every tenth eukaryotic protein had more than 1000 residues, while less than five percent of the ORFs in prokaryotes and archaes were as long (Fig. 1A Inset). The length of ORFs has been reported to follow an extreme value distribution [30]. Although our data agreed with this (Fig. 1A, bold lines), a detailed statistical analysis did not support the hypothesis for all three kingdoms. In particular, eukaryotes deviated significantly from an extreme value distribution.
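The quantities discussed above (length histogram in bins of 10 residues, cumulative fraction, and fraction of ORFs longer than 1000 residues) can be sketched as follows. This is a minimal illustration, not the code used for Fig. 1A; the function names and the toy lengths are assumptions introduced here.

```python
from collections import Counter

def length_histogram(lengths, bin_width=10):
    """Count ORFs per length bin; each key is the lower edge of its bin."""
    return Counter((n // bin_width) * bin_width for n in lengths)

def cumulative_fraction(lengths, cutoff):
    """Fraction of ORFs with at most `cutoff` residues."""
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)

def fraction_longer_than(lengths, cutoff=1000):
    """Fraction of ORFs with more than `cutoff` residues."""
    return sum(1 for n in lengths if n > cutoff) / len(lengths)

# Toy ORF lengths (hypothetical, not data from the paper).
toy_lengths = [95, 104, 250, 257, 310, 480, 620, 1005, 1500, 88]
histogram = length_histogram(toy_lengths)
print(histogram[250])                          # ORFs with 250-259 residues
print(cumulative_fraction(toy_lengths, 300))   # share with <= 300 residues
print(fraction_longer_than(toy_lengths))       # share with > 1000 residues
```

Applied per proteome, the last function gives the kind of number quoted above for eukaryotes (about one in ten ORFs longer than 1000 residues).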



Fig. 1. (A) The distribution of length of ORFs (in bins of 10 residues) in 29 genomes (cumulative values in the Insets). The extreme value distribution fit is shown in bold. The abbreviations for the organisms are given in Table 1. (B) Amino acid composition for six representative genomes: the letter height is proportional to the observed composition of the respective amino acid (one-letter code). (C) Percentage of membrane proteins, coiled coil proteins, and proteins with signal peptides in 29 genomes. (D) Less than half of the predicted membrane proteins could have been identified through homology to known membrane proteins.


Amino acid compositions did not differ between proteomes. The codon usage differs between the genomes we analysed. In contrast, the amino acid compositions were rather similar (Fig. 1B). Leucine (L), Valine (V), Serine (S), and Alanine (A) were the most abundant amino acids, and Cysteine (C), Tryptophan (W), Histidine (H), and Methionine (M) the most underrepresented ones. The only differences were that eukaryotes tended to have fewer Alanines (A) and more Asparagines (N), and that archaes had fewer Glutamines (Q, Fig. 1B).
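A proteome-wide amino acid composition of the kind shown in Fig. 1B can be computed as in the following sketch. It assumes sequences in one-letter code and simply skips any character outside the 20 standard amino acids; the function name and toy sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequences):
    """Percentage of each standard amino acid over all given sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore ambiguity codes such as X or B.
        counts.update(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}

# Toy sequences (hypothetical); the X in the second one is skipped.
toy_proteome = ["MKLLVA", "MSLAXV"]
percentages = composition(toy_proteome)
for aa in sorted(percentages, key=percentages.get, reverse=True)[:3]:
    print(aa, round(percentages[aa], 1))
```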

Prediction-based classifications of proteomes

Errors of prediction methods affected averages marginally. The following results are based on predictions that may be wrong. In particular, PHDhtm [31] may have missed about 5% of the helical membrane proteins, and may have falsely predicted membrane helices in about 3% of all globular proteins. We adjusted the number of predicted membrane proteins accordingly (Eq. 1). The accuracy of SignalP was estimated to be around 90% [20, 24], implying that the predicted compositions under-estimated the actual percentage of secreted proteins. For COILS [32] we used a threshold assuring high accuracy (probability > 0.9). At this level, the prediction would have 95% coverage for parallel coiled coils and 80% for anti-parallel ones (A Lupas, personal communication). When transferring the functional assignment of SWISS-PROT, we used a threshold estimated to yield about 70% accuracy [33]. Since this error does not seem to be specific to particular classes, we expect the relative proportions to be relatively accurate.

Multi-cellular organisms appeared to have a membrane content similar to that of uni-cellular ones. For most organisms less than 25% of all ORFs encoded helical membrane proteins (Fig. 1C). While we predicted the highest content for C. elegans (30%), Drosophila appeared to have less than 18%; three prokaryotes had over 25%: Rickettsia prowazekii, Borrelia burgdorferi, Chlamydia trachomatis, and two archaes had more than 20%: Pyrococcus horikoshii, Pyrococcus abyssi. 40-80% of all the membrane proteins identified by PHDhtm could have been detected by homology to known membrane proteins [34]. Excluding proteins annotated as PUTATIVE, POSSIBLE, PROBABLE, or BY SIMILARITY in SWISS-PROT, the number of previously known membrane proteins dropped to 1-7% (exception: human with 14%, Fig. 1D). Thus, we could not verify that more complex organisms need larger fractions of membrane proteins [14, 15, 12]. The discrepancy probably resulted from the insufficient amount of human and fly data available earlier.

Number of transmembrane helices differed between kingdoms. We confirmed that most membrane proteins have fewer than four helices (Fig. S-1A in Electronic Supplementary Material). In contrast to previous results [17] , we found that 7-TM proteins were significantly over-represented in C. elegans and H. sapiens, as were proteins with 6 and 12-TM in most prokaryotes (Fig. 2A). These three classes were also the only ones with an imbalance in the distribution between the two possible orientations of the transmembrane helices: Eukaryotic 7-TM proteins were dominated by topology 'in' (G protein-coupled receptors), 6 and 12-TM proteins in prokaryotes by the topology 'out' (transporters). Surprisingly, we found relatively few 7 TM proteins in fly [35, 36] and many in worm (Fig. 2A). The worm appears to contain 1000 smell receptors, whereas the fly has fewer than 100 [35, 36] . Thus, the difference in this single class of proteins might explain the observed differences.



Fig. 2. (A) Fraction of membrane proteins with different numbers of predicted transmembrane segments. White bars: proteins with topology "in". Black bars: proteins with topology "out". (B) Contour plot showing the relation between ORF length (in bins of 10 residues) and the number of predicted membrane helices for two representative organisms.


Many membrane proteins had almost no globular regions. Relating protein length to the number of transmembrane helices, we observed two clusters (Fig. 2B, and Fig. S-1B in Electronic Supplementary Material). One contained proteins of varying length with one helix, the other was populated by proteins with about 35 residues per helix. Most transmembrane helices spanned over 17-33 residues. Thus, the second cluster contained almost no globular regions. These data confirmed earlier findings [12] . Long non-transmembrane regions in membrane proteins are likely to form structurally compact globular domains; most of these were longer than 100 residues and were as often inside as outside of the membrane (Fig. S-2 in Electronic Supplementary Material).
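The relation between protein length and helix count described above can be illustrated with a small sketch. The cut-off of 50 residues per helix used to flag proteins with almost no globular regions is an assumption made here for illustration; the text only reports a cluster at about 35 residues per helix.

```python
def residues_per_helix(length, n_helices):
    """Average number of residues available per predicted membrane helix."""
    if n_helices < 1:
        raise ValueError("protein must have at least one predicted helix")
    return length / n_helices

def mostly_membrane(length, n_helices, max_per_helix=50):
    """True for multi-helix proteins with little room for globular regions.

    max_per_helix is a hypothetical cut-off, not a value from the paper.
    """
    return n_helices > 1 and residues_per_helix(length, n_helices) <= max_per_helix

# Toy examples (hypothetical proteins).
print(mostly_membrane(420, 12))  # 35 residues per helix
print(mostly_membrane(900, 1))   # single helix, long globular part
print(mostly_membrane(800, 7))   # about 114 residues per helix
```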

Almost every tenth eukaryotic protein contained coiled-coil regions. We found coiled-coil regions for about 8-11% of all eukaryotic and 2-9% of all prokaryotic and archaeal proteins (Fig. 1C). Most eukaryotes had more coiled-coil proteins than prokaryotes, and most prokaryotes more than archaes. Exceptions were Mycoplasma pneumoniae, Helicobacter pylori, and Aquifex aeolicus, which all had higher percentages of coiled-coil proteins than C. elegans. Surprisingly low contents of coiled-coil proteins appeared in Deinococcus radiodurans, Mycobacterium tuberculosis, and Aeropyrum pernix K1. A previous survey of GenBank suggested that 4.96% of bacterial, 9.47% of invertebrate, and 6.80% of vertebrate proteins have coiled-coil regions [8]. We could not find significant differences between vertebrates (human) and invertebrates (C. elegans and Drosophila). The vast majority of coiled-coil proteins contained only one long helix and most coiled-coil regions extended over 28 residues (Fig. S-3 in Electronic Supplementary Material). Protein length did not correlate with the number of coiled-coil regions (Fig. S-3C). As expected, the amino acid composition of coiled-coil regions (Fig. S-3D) differed from that of all other regions (Fig. 1B).

Prokaryotes and eukaryotes had similar fractions of secreted proteins. We predicted 15%-25% of the ORFs in prokaryotes and eukaryotes to have signal peptides (Fig. 1C). The exception was yeast for which we found only 7% secreted proteins. Supposedly, our estimates constitute lower bounds due to the fact that the prediction program SignalP misses secreted proteins.

Too many functionally unclassified proteins hampered comparing function. Using EUCLID [9, 37], we could classify about 45-65% of all proteins into one of 13 functional classes at a level reported to yield 70% correct classifications (>30% pairwise sequence identity, [33]). When grouping the 13 classes into three super-classes (energy, information and communication [9]), we found similar compositions within the archaeal and eukaryotic kingdoms (Fig. 3A). In contrast, the composition varied significantly between prokaryotes: for Escherichia coli and Synechocystis PCC6803 the composition resembled that of eukaryotes, for Aquifex aeolicus and Thermotoga maritima that of archaes (Fig. 3A). Finally, we found the following differences in the 13 classes. "Amino acid biosynthesis", "Biosynthesis of cofactors, prosthetic groups, and carriers", and "Energy metabolism" were abundant in prokaryotes, and human seemed to have a larger portion of the classes "Transport and Binding" and "Regulatory Functions" (Fig. 3B). Previously, bacteria were reported to have smaller fractions of proteins responsible for "communication" (5%) than plants (20%) and animals (45%), possibly because the complex organisation of multi-cellular organisms required more proteins communicating with the environment [9]. Our data differed in that we found bacteria to contain more proteins associated with "communication" (15-32%) than reported previously (Fig. 3A). The inherent bias of SWISS-PROT may have contributed most to these differences. The significant variation of the class composition between the various prokaryotic genomes may reflect the very different environments in which these organisms dwell. However, the most important result was that, even when accepting classification errors of 30% or more, we still could classify only about half of all proteins. Thus, conclusions about the meaning of the relative proportions remained highly speculative.



Fig. 3. Functional classification of genomes. (A) Super-class distribution for the genomes: we grouped the 14 EUCLID classes into "Energy", "Communication" and "Information" super-classes. (B) Distribution of 13-category classification for selected genomes, including those without functional classification.


About 10% of the proteins were orphans. We examined the size of all families with proteins of similar structure (structural families) found in SWISS-PROT / TrEMBL [34] . We found most proteomes to contain 5-10% orphan families; a few organisms deviated significantly from this level (Fig. 4A). Another 10% of the proteins had only one homologue (Fig. 4A). Overall, 30% of the proteins were in families with fewer than ten proteins, 30-40% in families with more than 100 members (Fig. 4B).



Fig. 4. For each protein in all proteomes, we counted the number of proteins found in the respective family at a PSI-BLAST E value < 10^-3. The graphs show the cumulative percentages of proteins found in families of particular sizes. For example, about 5-10% of all ORFs were orphans, i.e. had no homologue in current databases, 30-40% were in families with more than 100 members.
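The family-size statistics summarised in Fig. 4 amount to simple cumulative counts. The sketch below assumes that, for every protein, the number of homologues found below the E-value threshold is already available (0 meaning an orphan); the function names and toy counts are hypothetical.

```python
def fraction_at_most(homologue_counts, k):
    """Cumulative fraction of proteins with at most k homologues."""
    return sum(1 for c in homologue_counts if c <= k) / len(homologue_counts)

def fraction_more_than(homologue_counts, k):
    """Fraction of proteins with more than k homologues."""
    return sum(1 for c in homologue_counts if c > k) / len(homologue_counts)

# Toy homologue counts per protein (hypothetical, not data from the paper).
toy_counts = [0, 1, 3, 8, 15, 40, 120, 300, 0, 250]
print(fraction_at_most(toy_counts, 0))      # orphans
print(fraction_at_most(toy_counts, 1))      # orphans plus single-homologue proteins
print(fraction_more_than(toy_counts, 100))  # proteins in large families
```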


For 60% of all residues we could not predict protein structure through comparative modelling. Comparative modelling could predict 3D structure at very low resolution for about 25-40% of all proteins (Fig. 5A). Exceptions were the biased subset of human sequences found in SWISS-PROT and TrEMBL (>50%) and the proteome of Aeropyrum pernix K1 (<19%). These numbers comprised the most optimistic values in that we applied an iterated PSI-BLAST search (Methods), which may yield a considerable number of false positives [38]. Using a more conservative cut-off for pairwise comparisons, we still found structural similarities in about 20-35% of all proteins. The similarity to regions of experimentally known structure covered about 17-36% of the entire residue mass of all proteomes (Fig. 5B). Large-scale initiatives in structural genomics aim at experimentally determining all protein structures [39, 40, 41, 42]. Obviously, membrane proteins, as well as other non-globular proteins, will be left out in this search to cover protein structure space. Today, we have experimental information about structure for less than 1% of the proteomes we analysed. However, through comparative modelling we could obtain good structure predictions (< 3 Å rmsd for main chain; [43, 44]) for about 6% of all proteins [45]. When we relaxed the model accuracy to a level at which most models will be better than 6 Å rmsd [44], we found that about 25-40% of all ORFs were similar to a protein of known structure (Fig. 5A). At this low level of accuracy about 26% of all eukaryotic residues could be modelled. We found about 4% of the residues in transmembrane regions, about 2% in coiled-coil regions, and 11% in long regions that lack regular secondary structure (J Liu & B Rost, unpublished). Thus, we estimated that structural genomics would have to cover about 60% of all the proteomes. Obviously, many of these proteins/fragments belong to the same structural families (Fig. 4). In fact, we found the 40,000 proteins from fly, worm, and yeast to cluster into about 17,000 structural families (data not shown). Assuming a similar reduction, structural genomics will have to experimentally determine structures for about one fourth of all the proteomes we analysed.
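The arithmetic behind the "about 60%" and "about one fourth" estimates can be retraced as follows. This is one reading of the numbers quoted above, not a calculation taken from the authors; in particular, it assumes that the four percentages can simply be added.

```python
# Percentages of residues quoted in the text.
modelled = 26       # covered by low-accuracy comparative modelling
transmembrane = 4   # in transmembrane regions
coiled_coil = 2     # in coiled-coil regions
unstructured = 11   # in long regions without regular secondary structure

# Residues left for structural genomics (text rounds this to about 60%).
remaining = 100 - (modelled + transmembrane + coiled_coil + unstructured)

# Reduction from proteins to structural families for fly, worm, and yeast.
family_ratio = 17000 / 40000

# Share of the proteomes that would need an experimental structure.
targets_exact = remaining * family_ratio   # using the summed remainder
targets_rounded = 60 * family_ratio        # using the rounded 60%

print(remaining, family_ratio, targets_exact, targets_rounded)
```

Both variants land near 25%, consistent with the "about one fourth" stated above.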



Fig. 5. Structural annotation of genomes. (A) 25-40% of all ORFs were sequence similar to at least one PDB protein. (B) The total percentage of residues that could thus be homology modelled amounted to about 20-30% of all residues.


How many folds are there? Cyrus Chothia estimated that the universe of proteins contains about 1000 different folds [46] . Depending on what we consider to be different folds, we currently know about 600-800 folds [47, 48, 49] . We found that these folds span about 25-45% of the proteomes we analysed. A more detailed analysis of regions with missing structural information [45] suggested that about 15-20% of the entire residue mass will not correspond to folds. Thus, we need folds for 80-85% of all proteins. Of these we currently cover 30-50% with 600-800 folds. Hence, if we assume that today's folds and proteomes are representative for the universe of life, we estimate about 1200-2600 folds in the universe. However, we have many good reasons to assume that today's databases are biased and that our perspective of the world is too narrow to thoroughly support such conclusions.
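The fold estimate follows from dividing the number of known folds by the fraction of the fold-requiring proteins they cover. The sketch below retraces this with the ranges quoted above; pairing 600 folds with 50% coverage and 800 folds with 30% coverage is an assumption about how the bounds were combined.

```python
def fold_estimate(known_folds, coverage_fraction):
    """Extrapolated number of folds if known folds cover the given fraction."""
    if not 0 < coverage_fraction <= 1:
        raise ValueError("coverage_fraction must be in (0, 1]")
    return known_folds / coverage_fraction

low = fold_estimate(600, 0.50)   # fewest known folds, highest coverage
high = fold_estimate(800, 0.30)  # most known folds, lowest coverage
print(round(low), round(high))
```

The lower bound reproduces the 1200 quoted in the text; the upper bound comes out slightly above the quoted 2600, so the authors presumably rounded or used marginally different inputs.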


Materials and methods

Source of sequences

We obtained the sequences for all 28 organisms we analysed from the public domain [1, 50, 51, 52, 53, 2, 54, 55, 56, 57, 58, 59, 3, 60, 61, 62, 63, 64, 65, 66, 67, 68, 4, 69, 70] . We downloaded most ORFs from ftp://ncbi.nlm.nih.gov/genbank/genomes/. The exceptions were Homo sapiens (from SWISS-PROT release 39 and TrEMBL database release 15) and Drosophila melanogaster (from http://www.fruitfly.org/).



Table 1: Genomes analysed

Latin Name                              Abbreviation (a)   No ORF (b)

Archael bacteria
Aeropyrum pernix K1                     aerpe              2694
Archaeoglobus fulgidus                  arcfu              2383
Methanococcus jannaschii                metja              1735
Methanobacterium thermoautotrophicum    metth              1871
Pyrococcus abyssi                       pyrab              1765
Pyrococcus horikoshii                   pyrho              2064

Prokaryotes
Aquifex aeolicus                        aquae              1522
Bacillus subtilis                       bacsu              4099
Borrelia burgdorferi                    borbu              850
Campylobacter jejuni                    camje              1731
Chlamydia pneumoniae                    chlpn              1052
Chlamydia trachomatis                   chltr              894
Deinococcus radiodurans                 deira              3103
Escherichia coli                        ecoli              4285
Haemophilus influenzae                  haein              1716
Helicobacter pylori                     helpy              1788
Mycoplasma genitalium                   mycge              470
Mycoplasma pneumoniae                   mycpn              677
Mycobacterium tuberculosis              myctu              3918
Neisseria meningitidis                  neime              2081
Rickettsia prowazekii                   ricpr              834
Synechocystis PCC6803                   syny3              3169
Thermotoga maritima                     thema              1846
Treponema pallidum                      trepa              1031
Ureaplasma urealyticum                  ureur              613

Eukaryotes
Caenorhabditis elegans                  caeel              18944
Drosophila melanogaster                 drome              14218
Subset of Homo sapiens (c)              human              24235
Homo sapiens (chromosome 22)            hs22               887
Saccharomyces cerevisiae                yeast              6307


a Abbreviations: taken from SWISS-PROT. b No ORFs: the number of Open Reading Frames (predicted proteins) is taken from the respective original publication. c Human sequences: the only non-complete data were the human sequences, taken from SWISS-PROT release 39 [34] and from TrEMBL release 15 [34]. Note: all sequences used are available on our web site [76].




Prediction methods

Membrane proteins. We obtained multiple sequence alignments through a search with MaxHom [71] against SWISS-PROT. We filtered the resulting alignments [72] and used them as input for PHDhtm [31] using the default threshold of 0.8. The total number of membrane proteins was adjusted according to the false positive rate and false negative rate published in the original paper:

Eq 1 (Eq. 1)

where n was the final number of membrane proteins we reported, FP and FN were the false positive and false negative rates respectively, npred was the number of predicted membrane proteins in the genome, and ntotal was the total number of proteins in the genome.
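Eq. 1 itself is not reproduced in this version of the text. One standard correction that is consistent with the definitions above assumes that the predicted count is the sum of correctly detected membrane proteins and falsely predicted globular proteins, npred = (1 - FN) n + FP (ntotal - n), which solves to n = (npred - FP ntotal) / (1 - FN - FP). The sketch below implements this assumed form; it may differ from the formula the authors actually used.

```python
def corrected_count(n_pred, n_total, fp_rate, fn_rate):
    """Estimate the true number of membrane proteins from the predicted one.

    Assumes n_pred = (1 - fn_rate) * n + fp_rate * (n_total - n).
    This form is an assumption, not the published Eq. 1.
    """
    denominator = 1.0 - fn_rate - fp_rate
    if denominator <= 0:
        raise ValueError("error rates too large to invert the correction")
    return (n_pred - fp_rate * n_total) / denominator

# Toy proteome (hypothetical): 1000 proteins, 214 predicted membrane proteins,
# with the error rates quoted in the text (FP about 3%, FN about 5%).
print(corrected_count(n_pred=214, n_total=1000, fp_rate=0.03, fn_rate=0.05))
```

In this toy case, 200 true membrane proteins would yield 0.95 x 200 + 0.03 x 800 = 214 predictions, and the correction recovers the 200.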

Secreted proteins and coiled-coil regions. We predicted signal peptides with the program SignalP [20] considering a protein to contain a signal peptide if the "mean S" value was above the default threshold. We predicted coiled-coil regions with COILS [32] using a window-size of 28 and a probability threshold of 0.9.

Structural homologues. We ran three iterations of PSI-BLAST [73] searching against our local filtered databases [74] to detect homologues of experimentally known structure in PDB [75]. We included hits with E-values < 10^-3. We obtained more conservative estimates by searching with a pairwise BLAST against PDB, reporting hits with E-values < 10^-3.
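Given the hits of a query against PDB, the number of residues covered by structural homology can be obtained by filtering on the E-value and merging overlapping aligned regions. The following sketch assumes hits are already parsed into (start, end, E-value) tuples with 1-based inclusive query coordinates; the tuple layout and function name are assumptions, not part of any BLAST output format.

```python
def covered_residues(hits, e_cutoff=1e-3):
    """Number of query residues covered by at least one hit below e_cutoff.

    hits: iterable of (start, end, e_value), 1-based inclusive coordinates.
    """
    intervals = sorted((s, e) for s, e, ev in hits if ev < e_cutoff)
    covered = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged region and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent region: extend the current one.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered

# Toy hits for a 400-residue query (hypothetical); the third hit fails the cut-off.
toy_hits = [(10, 100, 1e-20), (80, 150, 1e-5), (300, 350, 0.5), (200, 220, 1e-4)]
print(covered_residues(toy_hits) / 400)
```

Summing the covered residues over all proteins and dividing by the total residue count gives a proteome-wide coverage of the kind shown in Fig. 5B.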

Functional classification. We classified cellular function using EUCLID [37]. As input we used the SWISS-PROT homologues identified by MaxHom. EUCLID assigned the following 13+1 categories of cellular function [50]: "Amino acid biosynthesis", "Biosynthesis of cofactors, prosthetic groups, and carriers", "Cell envelope", "Cellular processes", "Central intermediary metabolism", "Energy metabolism", "Fatty acid and phospholipid metabolism", "Purines, pyrimidines, nucleosides, and nucleotides", "Regulatory functions", "Replication", "Transcription", "Translation", "Transport and binding proteins", and "Other categories". Proteins described as "Unclassified" either had no SWISS-PROT homologue or could not be classified by EUCLID.

Statistical analysis

To examine whether or not the ORF lengths followed an extreme value distribution:

Eq 2 (Eq. 2)

we calculated the mean and standard deviation s of the ORF length from archaes, prokaryotes, and eukaryotes, and estimated the two parameters of the distribution using the method of moments.

We then performed a goodness-of-fit test to determine whether the observed distribution was compatible with the extreme value distribution.
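The fitting procedure can be sketched as follows, assuming that Eq. 2 is the type I (Gumbel) extreme value distribution with location mu and scale beta. For this form the method of moments gives beta = s * sqrt(6) / pi and mu = mean - gamma * beta, where gamma is the Euler-Mascheroni constant. The goodness-of-fit test is not named in the text; the Kolmogorov-Smirnov statistic below is an illustrative stand-in, and all function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329

def gumbel_moments(lengths):
    """Method-of-moments estimates (mu, beta) for a Gumbel distribution."""
    n = len(lengths)
    mean = sum(lengths) / n
    variance = sum((x - mean) ** 2 for x in lengths) / (n - 1)
    beta = math.sqrt(6.0 * variance) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta

def gumbel_cdf(x, mu, beta):
    """Cumulative distribution function of the Gumbel distribution."""
    return math.exp(-math.exp(-(x - mu) / beta))

def ks_statistic(lengths, mu, beta):
    """Largest gap between the empirical and the fitted Gumbel CDF."""
    xs = sorted(lengths)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = gumbel_cdf(x, mu, beta)
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

# Toy ORF lengths (hypothetical, not data from the paper).
toy_lengths = [120, 180, 210, 250, 260, 300, 340, 410, 520, 760]
mu_hat, beta_hat = gumbel_moments(toy_lengths)
print(mu_hat, beta_hat, ks_statistic(toy_lengths, mu_hat, beta_hat))
```

A large statistic relative to the critical value for the sample size would indicate, as reported for eukaryotes, that the observed lengths are not compatible with the fitted distribution.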


Electronic supplementary material

The supplementary material includes three figures showing the following: (1) the cumulative percentage of membrane proteins with different numbers of predicted transmembrane segments, and the relation between the ORF length and the frequency of proteins with a given number of predicted transmembrane segments; (2) the distribution of length of globular regions in membrane proteins; (3) the distribution of coiled coil segments in the proteins, and the amino acid composition of coiled coil segments.

All the figures and figure legends are included in a PDF file called "supplement.pdf" (HTML version).


Acknowledgements

Thanks to Henrik Nielsen (CBS, Denmark) for the source code of SignalP and for his generous help in using this program. Thanks to Andrei Lupas (SmithKline Beecham Pharmaceuticals) for helpful suggestions about running the COILS program. Thanks to Florencio Pazos, Damien Devos and Alfonso Valencia (all CNB Madrid) for supplying and helping with the program EUCLID. The work of JL and BR was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health. Last, but not least, thanks to all those who enabled this analysis by depositing experimental information into public databases and to all those who maintain these.


References