| Title: | Comparing function and structure between entire proteomes |
| Author: | Jinfeng Liu & Burkhard Rost |
| Quote: | Protein Science(2001), 10, 1970-1979 |
Jinfeng Liu1,2 & Burkhard Rost1,3
1 CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street, New York, NY 10032, USA
2 Graduate Program in Pharmacology, Columbia University, 630 West 168th Street, New York, NY 10032, USA
3 Corresponding author: rost@columbia.edu, http://cubic.bioc.columbia.edu/
More than 30 organisms have been entirely sequenced. Here, we
applied a variety of simple bioinformatics tools to analyse 29
proteomes for representatives from all three kingdoms: eukaryotes,
prokaryotes and archaebacteria. We confirmed that eukaryotes have
relatively more long proteins than prokaryotes and archaes, and
that the overall amino acid composition is similar between the
three. We predicted that about 15-30% of all proteins contained
transmembrane helices. We could not find a correlation between
the content of membrane proteins and the complexity of the organism.
In particular, we did not find significantly higher percentages
of helical membrane proteins in eukaryotes than in prokaryotes
or archae. However, we found more proteins with 7 transmembrane
helices in eukaryotes and more with 6 and 12 in prokaryotes. We
found twice as many coiled-coil proteins in eukaryotes (10%) as
in prokaryotes and archaes (4-5%), and we predicted about 15-25%
of all proteins to be secreted by most eukaryotes and prokaryotes.
Every tenth protein had no known homologue in current databases,
and 30-40% of the proteins fall into structural families with
more than 100 members. A classification by cellular function verified
that eukaryotes had a higher proportion of proteins for communication
with the environment. Finally, we found at least one homologue
of experimentally known structure for about 20%-45% of all proteins;
the regions with structural homology covered 20%-30% of all residues.
These numbers may or may not suggest that there are 1200-2600
folds in the universe of protein structures. All predictions are
available at http://cubic.bioc.columbia.edu/genomes.
Key words: protein sequence analysis; analysing entire
genomes; helical membrane proteins; coiled coil proteins; signal
peptides; comparative modelling.
Comparative genomics begins with collecting and describing. Sequencing the entire genome of the first free-living organism, Haemophilus influenzae, opened the new era of flooding data in molecular biology [1] . Since then, more than 40 genomes have been sequenced, mostly for pathogens and model organisms. These include the first eukaryotic genome, Saccharomyces cerevisiae [2] , and the first animal genomes Caenorhabditis elegans [3] , Drosophila melanogaster [4] , and Homo sapiens [5, 6] . What can we learn from all the data? Like zoology and botany a century ago, we are just beginning to catalogue the components of life, trying to find common features and systematic schemes. Thus, comparative genomics in its infancy is confined to describing similarities and differences. So far, synopses of 'the whole genome' have focused on archiving the functional and the structural content of organisms [7, 8, 9, 10, 11, 12, 13] . Classifying proteins according to functional criteria is difficult since function is a complex phenomenon associated with many mutually overlapping levels: chemical, biochemical, cellular, physiological, organism mediated, and developmental. These levels are related in complex ways; for example, protein kinases can be related to different cellular functions (such as cell cycle), and to a chemical function (transferase) plus a complex control mechanism by interaction with other proteins.
Bioinformatics identifies most helical membrane proteins. Membrane proteins are crucial for survival; one reason is that they mediate communication across the cell membrane. Despite the great biological and medical importance, we still have very little experimental information about the 3D structures: less than 1% of the proteins of known structure are membrane proteins. In contrast, helical membrane proteins are relatively easy to identify by bioinformatics. How many helical membrane proteins are in a genome [14, 15, 12] ? Is the fraction of helical membrane proteins constant, or does the fraction of helical membrane proteins correlate with the complexity of the organism [12] ? Are there preferences for particular numbers and topologies of transmembrane helices (TM) in some organisms [16] ? Some analyses concluded that there was no preference for proteins with a certain number of membrane helices [17, 18] , while others reported some preferences [16, 12] .
Many extra-cellular proteins identified through signal peptides. Signal peptides at the N-terminal end target many prokaryotic and eukaryotic proteins to the secretory pathway [19, 20, 21, 22] . Signal peptides are predicted accurately [20, 23, 24] . However, although secreted proteins were studied in various bacteria [23] , few groups analysed entirely sequenced eukaryotes.
Most heterogeneous identifiable class: coiled-coil proteins. Coiled-coils are typically formed as bundles of several right-handed alpha helices twisted around each other forming a left-handed super helix [25, 26] . Coiled-coil structures are often used to mediate protein-protein interaction or to build filaments and other macroscopic structures. Most known coiled-coil proteins can be detected based on particular sequence signals [27] . Thus, we can identify most coiled-coil proteins in a proteome. Six percent of all proteins in GenBank [28] appear to contain coiled-coil regions, and the percentages appear to vary between organisms [8] . Does this old finding hold up for entire proteomes?
Here, we analysed predictions for transmembrane helices, coiled-coil regions, and signal peptides for 28 entire proteomes and for 24,000 human proteins. Due to the comprehensive size of the data, we also re-analysed the distribution of protein length explored by others [29, 18] .
Protein lengths differ between kingdoms. Generally, prokaryotes
and archaes appeared to have an asymmetric bell-shape distribution
of ORF lengths with the peak around 100-300 residues, while the
ORFs of eukaryotes were distributed much more evenly within the
range of 100-600 residues (Fig. 1A). This was also reflected in
the cumulative distributions (Fig. 1A Inset): The distributions
for prokaryotes and archaes were steepest around 300 residues;
while those for eukaryotes were relatively constant between 100-600
residues. Furthermore, every tenth eukaryotic protein had more
than 1000 residues, while less than five percent of the ORFs in
prokaryotes and archaes were as long (Fig. 1A Inset). The length
of ORFs has been reported to follow an extreme value distribution
[30] . Although our data agreed with this (Fig. 1A, bold lines),
a detailed statistical analysis did not support the hypothesis
for all three kingdoms. In particular, eukaryotes deviated significantly
from an extreme value distribution.
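The two summaries used above (binned length distribution and the fraction of ORFs above 1000 residues) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `length_summary`, the 100-residue bin width, and the toy lengths are all assumptions made for this example.

```python
from collections import Counter

def length_summary(lengths, bin_width=100, long_cutoff=1000):
    """Return (histogram, fraction_long) for a list of ORF lengths.

    histogram maps the lower edge of each bin to the fraction of ORFs in it;
    fraction_long is the share of ORFs longer than long_cutoff residues.
    """
    if not lengths:
        raise ValueError("need at least one ORF length")
    total = len(lengths)
    # Assign every length to the bin starting at a multiple of bin_width.
    bins = Counter((n // bin_width) * bin_width for n in lengths)
    histogram = {edge: count / total for edge, count in sorted(bins.items())}
    fraction_long = sum(1 for n in lengths if n > long_cutoff) / total
    return histogram, fraction_long

# Toy input (made-up lengths, not data from the paper).
toy_lengths = [120, 250, 280, 310, 450, 620, 980, 1500, 2300, 90]
hist, frac_long = length_summary(toy_lengths)
print(hist[200], frac_long)
```

Summing the histogram values from the shortest bin upwards gives the cumulative distribution of the kind shown in the inset of Fig. 1A.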
Amino acid compositions did not differ between proteomes.
The codon usage differs between the genomes we analysed. In contrast,
the amino acid compositions were rather similar (Fig. 1B).
Leucine (L), Valine (V), Serine (S), and Alanine (A) were the
most abundant amino acids, and Cysteine (C), Tryptophan (W), Histidine
(H), and Methionine (M) the most underrepresented ones. The only
differences were that eukaryotes tended to have fewer Alanines
(A) and more Asparagines (N), and that archaes had fewer Glutamines
(Q, Fig. 1B).
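A per-proteome amino acid composition of the kind compared in Fig. 1B could be computed roughly as below. This is a sketch under stated assumptions: the name `composition`, the choice to ignore non-standard letters such as X, and the two toy sequences are illustrative and not taken from the paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def composition(sequences):
    """Return the percentage of each standard amino acid over all sequences."""
    counts = Counter()
    for seq in sequences:
        # Non-standard letters (e.g. X, B, Z) are skipped in this sketch.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}

# Toy proteome with two made-up sequences.
comp = composition(["MKLLVA", "MALSS"])
print(round(comp["L"], 1), comp["W"])
```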
Errors of prediction methods affected averages marginally. The following results are based on predictions that may be wrong. In particular, PHDhtm [31] may have missed about 5% of the helical membrane proteins, and may have falsely predicted membrane helices in about 3% of all globular proteins. We adjusted the number of predicted membrane proteins accordingly (Eq. 1). The accuracy of SignalP was estimated to be around 90% [20, 24] implying that the predicted compositions under-estimated the actual percentage of secreted proteins. For COILS [32] we used a threshold assuring high accuracy (probability > 0.9). At this level, the prediction would have 95% coverage for parallel coiled coils and 80% for anti-parallel ones (A Lupas, personal communication). When transferring the functional assignment of SWISS-PROT, we used a threshold estimated to yield about 70% accuracy [33] . Since this error does not seem to be specific to particular classes, we expect the relative proportions to be relatively accurate.
Multi-cellular organisms appeared to have membrane content similar to that of uni-cellular ones. For most organisms less than 25% of all ORFs encoded helical membrane proteins (Fig. 1C). While we predicted the highest content for C. elegans (30%), Drosophila appeared to have less than 18%; three prokaryotes had over 25%: Rickettsia prowazekii, Borrelia burgdorferi, Chlamydia trachomatis, and two archaes had more than 20%: Pyrococcus horikoshii, Pyrococcus abyssi. 40-80% of all the membrane proteins identified by PHDhtm could have been detected by homology to known membrane proteins [34] . Excluding proteins annotated as PUTATIVE, POSSIBLE, PROBABLE, or BY SIMILARITY in SWISS-PROT, the number of previously known membrane proteins dropped to 1-7% (exception: human with 14%, Fig. 1D). Thus, we could not verify that more complex organisms need larger fractions of membrane proteins [14, 15, 12] . The discrepancy probably resulted from the insufficient amount of human and fly data available earlier.
Number of transmembrane helices differed between kingdoms.
We confirmed that most membrane proteins have fewer than four
helices (Fig. S-1A in Electronic Supplementary Material). In contrast
to previous results [17] , we found that 7-TM proteins were
significantly over-represented in C. elegans and H.
sapiens, as were proteins with 6 and 12-TM in most prokaryotes
(Fig. 2A). These three classes were also the only ones with an
imbalance in the distribution between the two possible orientations
of the transmembrane helices: Eukaryotic 7-TM proteins were dominated
by topology 'in' (G protein-coupled receptors), 6 and 12-TM proteins
in prokaryotes by the topology 'out' (transporters). Surprisingly,
we found relatively few 7 TM proteins in fly [35, 36] and
many in worm (Fig. 2A). The worm appears to contain 1000 smell
receptors, whereas the fly has fewer than 100 [35, 36] . Thus,
the difference in this single class of proteins might explain
the observed differences.
Many membrane proteins had almost no globular regions. Relating protein length to the number of transmembrane helices, we observed two clusters (Fig. 2B, and Fig. S-1B in Electronic Supplementary Material). One contained proteins of varying length with one helix; the other was populated by proteins with about 35 residues per helix. Most transmembrane helices spanned 17-33 residues. Thus, the second cluster contained almost no globular regions. These data confirmed earlier findings [12] . Long non-transmembrane regions in membrane proteins are likely to form structurally compact globular domains; most of these were longer than 100 residues and were as often inside as outside of the membrane (Fig. S-2 in Electronic Supplementary Material).
Almost every tenth eukaryotic protein contained coiled-coil regions. We found coiled-coil regions for about 8-11% of all eukaryotic and 2-9% of all prokaryotic and archae proteins (Fig. 1C). Most eukaryotes had more coiled-coil proteins than prokaryotes, and most prokaryotes more than archaes. Exceptions were Mycoplasma pneumoniae, Helicobacter pylori, and Aquifex aeolicus, which all had higher percentages of coiled-coil proteins than C. elegans. Surprisingly low contents of coiled-coil proteins appeared in Deinococcus radiodurans, Mycobacterium tuberculosis, and Aeropyrum pernix K1. A previous survey of GenBank suggested that 4.96% of bacterial, 9.47% of invertebrate, and 6.80% of vertebrate proteins have coiled-coil regions [8] . We could not find significant differences between vertebrates (human) and invertebrates (C. elegans and Drosophila). The vast majority of coiled-coil proteins contained only one long helix and most coiled-coil regions extended over 28 residues (Fig. S-3 in Electronic Supplementary Material). Protein length did not correlate with the number of coiled-coil regions (Fig. S-3C). As expected, the amino acid composition of coiled-coil regions (Fig. S-3D) differed from that of all other regions (Fig. 1B).
Prokaryotes and eukaryotes had similar fractions of secreted proteins. We predicted 15%-25% of the ORFs in prokaryotes and eukaryotes to have signal peptides (Fig. 1C). The exception was yeast, for which we found only 7% secreted proteins. Presumably, our estimates constitute lower bounds, since the prediction program SignalP misses some secreted proteins.
Too many functionally unclassified proteins hampered comparing
function. Using EUCLID [9, 37] , we could classify about
45-65% of all proteins into one of 13 functional classes at a
level reported to yield 70% correct classifications (>30% pairwise
sequence identity, [33] ). When grouping the 13 classes into
three super-classes: energy, information and communication [9]
we found similar compositions within the archaeal and eukaryotic
kingdoms (Fig. 3A). In contrast, the composition varied significantly
between prokaryotes: for Escherichia coli and Synechocystis
PCC6803 the composition resembled that of eukaryotes, for
Aquifex aeolicus and Thermotoga maritima that of
archaes (Fig. 3A). Finally, we found the following differences
in the 13 classes. "Amino acid biosynthesis", "Biosynthesis
of cofactors, prosthetic groups, and carriers", and "Energy
metabolism" were abundant in prokaryotes, whereas human seemed
to have a larger portion of the classes "Transport and Binding"
and "Regulatory Functions" (Fig. 3B). Previously, bacteria
were reported to have smaller fractions of proteins responsible
for "communication" (5%) than plants (20%) and animals
(45%); possibly, since the complex organisation of multi-cellular
organisms required more proteins communicating with the environment
[9] . Our data differed in that we found bacteria to contain
more proteins associated with "communication" (15-32%)
than reported previously (Fig. 3A). The inherent bias of SWISS-PROT
may have contributed most to these differences. The significant
variation of the class composition between the various prokaryotic
genomes may reflect the very different environments in which these
organisms dwell. However, the most important result was that even
when accepting classification errors of 30% or more, we still could
classify only about half of all proteins. Thus, conclusions about
the meaning of the relative proportions remained highly speculative.
About 10% of the proteins were orphans. We examined the
size of all families with proteins of similar structure (structural
families) found in SWISS-PROT / TrEMBL [34] . We found most
proteomes to contain 5-10% orphan families; a few organisms deviated
significantly from this level (Fig. 4A). Another 10% of the proteins
had only one homologue (Fig. 4A). Overall, 30% of the proteins
were in families with fewer than ten proteins, 30-40% in families
with more than 100 members (Fig. 4B).
For 60% of all residues we could not predict protein structure
through comparative modelling. Comparative modelling could
predict 3D structure at very low resolution for about 25-40% of
all proteins (Fig. 5A). Exceptions were the biased subset of human
sequences found in SWISS-PROT and TrEMBL (>50%) and the proteome
of Aeropyrum pernix K1 (<19%). These numbers comprised
the most optimistic values in that we applied an iterated PSI-BLAST
search (Methods), which may yield a considerable number of false
positives [38] . Using a more conservative cut-off for pairwise
comparisons, we still found structural similarities in about 20-35%
of all proteins. The similarity to regions of experimentally known
structure covered about 17-36% of the entire residue mass of all
proteomes (Fig. 5B). Large-scale initiatives in structural genomics
aim at experimentally determining all protein structures [39, 40, 41, 42] .
Obviously, membrane proteins, as well as other non-globular proteins
will be left out in this search to cover protein structure space.
Today, we have experimental information about structure for less
than 1% of the proteomes we analysed. However, through comparative
modelling we could obtain good structure predictions (< 3 Å
rmsd for main chain; [43, 44] ) for about 6% of all proteins
[45] . When we relaxed the model accuracy to a level at which
most models will be better than 6 Å rmsd [44] , we found
that about 25-40% of all ORFs were similar to a protein of known
structure (Fig. 5A). At this low level of accuracy about 26% of
all eukaryotic residues could be modelled. We found about 4% of
the residues in transmembrane regions, about 2% in coiled-coil
regions, and 11% in long regions that lack regular secondary structure
(J Liu & B Rost, unpublished). Thus, we estimated that structural
genomics would have to cover about 60% of all the proteomes. Obviously,
many of these proteins/fragments belong to the same structural
families (Fig. 4). In fact, we found the 40,000 proteins from
fly, worm, and yeast to cluster into about 17,000 structural families
(data not shown). Assuming a similar reduction, structural genomics
will have to experimentally determine structures for about one
fourth of all the proteomes we analysed.
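The arithmetic behind these two estimates can be retraced as follows. This is a reading of the numbers quoted above, assuming that the four residue categories (modelled, transmembrane, coiled-coil, long regions without regular secondary structure) do not overlap; the symbols are introduced here for illustration only.

```latex
\begin{align*}
f_{\text{remaining}} &\approx 1 - (0.26 + 0.04 + 0.02 + 0.11) = 0.57 \approx 0.6, \\
r_{\text{families}}  &\approx \frac{17{,}000}{40{,}000} \approx 0.43, \\
f_{\text{targets}}   &\approx 0.6 \times 0.43 \approx 0.26 \approx \tfrac{1}{4}.
\end{align*}
```

Here $f_{\text{remaining}}$ is the fraction of residues not yet accounted for, $r_{\text{families}}$ the reduction obtained by clustering proteins into structural families, and $f_{\text{targets}}$ the resulting share of the proteomes that would need experimental structures.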
How many folds are there? Cyrus Chothia estimated that the universe of proteins contains about 1000 different folds [46] . Depending on what we consider to be different folds, we currently know about 600-800 folds [47, 48, 49] . We found that these folds span about 25-45% of the proteomes we analysed. A more detailed analysis of regions with missing structural information [45] suggested that about 15-20% of the entire residue mass will not correspond to folds. Thus, we need folds for 80-85% of all proteins. Of these we currently cover 30-50% with 600-800 folds. Hence, if we assume that today's folds and proteomes are representative for the universe of life, we estimate about 1200-2600 folds in the universe. However, we have many good reasons to assume that today's databases are biased and that our perspective of the world is too narrow to thoroughly support such conclusions.
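One way to retrace the range quoted above is a simple proportional extrapolation: if $N_{\text{known}}$ folds cover a fraction $c$ of the proteins that need a fold, the total is roughly $N_{\text{known}} / c$. The pairing of the bounds below is an assumption made for this illustration; it reproduces the quoted range only approximately.

```latex
\begin{align*}
N_{\text{total}} &\approx \frac{N_{\text{known}}}{c}, \\
N_{\min} &\approx \frac{600}{0.50} = 1200, \qquad
N_{\max} \approx \frac{800}{0.30} \approx 2700.
\end{align*}
```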
We obtained the sequences for all 28 organisms we analysed from
the public domain [1, 50, 51, 52, 53, 2, 54, 55, 56, 57, 58, 59, 3, 60, 61, 62, 63, 64, 65, 66, 67, 68, 4, 69, 70] .
We downloaded most ORFs from ftp://ncbi.nlm.nih.gov/genbank/genomes/.
The exceptions were Homo sapiens (from SWISS-PROT release
39 and TrEMBL database release 15) and Drosophila melanogaster
(from http://www.fruitfly.org/).
| Latin Name | Abbreviation a | No ORF b |
|---|---|---|
| Archael bacteria | ||
| Aeropyrum pernix K1 | aerpe | 2694 |
| Archaeoglobus fulgidus | arcfu | 2383 |
| Methanococcus jannaschii | metja | 1735 |
| Methanobacterium thermoautotrophicum | metth | 1871 |
| Pyrococcus abyssi | pyrab | 1765 |
| Pyrococcus horikoshii | pyrho | 2064 |
| Prokaryotes | ||
| Aquifex aeolicus | aquae | 1522 |
| Bacillus subtilis | bacsu | 4099 |
| Borrelia burgdorferi | borbu | 850 |
| Campylobacter jejuni | camje | 1731 |
| Chlamydia pneumoniae | chlpn | 1052 |
| Chlamydia trachomatis | chltr | 894 |
| Deinococcus radiodurans | deira | 3103 |
| Escherichia coli | ecoli | 4285 |
| Haemophilus influenzae | haein | 1716 |
| Helicobacter pylori | helpy | 1788 |
| Mycoplasma genitalium | mycge | 470 |
| Mycoplasma pneumoniae | mycpn | 677 |
| Mycobacterium tuberculosis | myctu | 3918 |
| Neisseria meningitidis | neime | 2081 |
| Rickettsia prowazekii | ricpr | 834 |
| Synechocystis PCC6803 | syny3 | 3169 |
| Thermotoga maritima | thema | 1846 |
| Treponema pallidum | trepa | 1031 |
| Ureaplasma urealyticum | ureur | 613 |
| Eukaryotes | ||
| Caenorhabditis elegans | caeel | 18944 |
| Drosophila melanogaster | drome | 14218 |
| Subset of Homo sapiens c | human | 24235 |
| Homo sapiens (chromosome 22) | hs22 | 887 |
| Saccharomyces cerevisiae | yeast | 6307 |
Membrane proteins. We obtained multiple sequence alignments through a search with MaxHom [71] against SWISS-PROT. We filtered the resulting alignments [72] and used them as input for PHDhtm [31] using the default threshold of 0.8. The total number of membrane proteins was adjusted according to the false positive rate and false negative rate published in the original paper:
(Eq. 1)
where n was the final number of membrane proteins we reported, FP and FN were the false positive and false negative rates respectively, npred was the number of predicted membrane proteins in the genome, and ntotal was the total number of proteins in the genome.
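The equation itself did not survive in this version of the text. One form consistent with the definitions above follows from assuming that FN is the fraction of true membrane proteins that are missed and FP the fraction of non-membrane proteins that are falsely predicted, so that npred = (1 - FN) n + FP (ntotal - n). Solving for n gives n = (npred - FP ntotal) / (1 - FP - FN). The sketch below implements this assumed form; the function name and the default rates (taken from the 3% and 5% quoted earlier) are illustrative choices, not the authors' code.

```python
def adjust_membrane_count(n_pred, n_total, fp=0.03, fn=0.05):
    """Estimate the true number of membrane proteins from the predicted count.

    Assumes n_pred = (1 - fn) * n + fp * (n_total - n) and solves for n.
    This is a reconstruction under stated assumptions, not the published Eq. 1.
    """
    denom = 1.0 - fp - fn
    if denom <= 0:
        raise ValueError("fp + fn must be below 1")
    n = (n_pred - fp * n_total) / denom
    # Clamp to the physically meaningful range [0, n_total].
    return min(max(n, 0.0), float(n_total))

# Round trip on made-up numbers: 200 true membrane proteins among 1000 ORFs
# would yield 0.95 * 200 + 0.03 * 800 = 214 predictions.
estimate = adjust_membrane_count(214, 1000)
print(round(estimate, 6))
```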
Secreted proteins and coiled-coil regions. We predicted signal peptides with the program SignalP [20] considering a protein to contain a signal peptide if the "mean S" value was above the default threshold. We predicted coiled-coil regions with COILS [32] using a window-size of 28 and a probability threshold of 0.9.
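Applying the probability threshold to per-residue output can be sketched as below. The input format (a plain list of per-residue probabilities) and the function name are assumptions for this example; the actual COILS output format is not described in the text.

```python
def coiled_coil_regions(probabilities, threshold=0.9, min_length=1):
    """Return half-open (start, end) index pairs of runs with p > threshold."""
    regions = []
    start = None
    for i, p in enumerate(probabilities):
        if p > threshold:
            if start is None:
                start = i          # a new run begins here
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None           # the run ended at the previous residue
    # Close a run that extends to the end of the sequence.
    if start is not None and len(probabilities) - start >= min_length:
        regions.append((start, len(probabilities)))
    return regions

# Toy per-residue probabilities (made up).
toy = [0.1, 0.95, 0.97, 0.2, 0.99, 0.99, 0.99]
regions = coiled_coil_regions(toy)
print(regions)
```

A protein would then count as a coiled-coil protein if the returned list is non-empty; raising `min_length` is one hypothetical way to discard very short runs.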
Structural homologues. We ran three iterations of PSI-BLAST [73] searching against our local filtered databases [74] to detect homologues of experimentally known structure in PDB [75] . We included hits with E-values < 10⁻³. We obtained more conservative estimates by searching with a pairwise BLAST against PDB, reporting hits with E-values < 10⁻³.
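Given a list of hits, the two quantities reported in the Results (fraction of proteins with a structural homologue, fraction of residues covered) could be derived roughly as follows. The hit tuple layout, the function name, and the toy data are assumptions for this sketch; overlapping aligned regions are merged so that no residue is counted twice.

```python
def structural_coverage(protein_lengths, hits, max_evalue=1e-3):
    """Return (fraction of proteins hit, fraction of residues covered).

    protein_lengths: dict mapping protein id to sequence length.
    hits: iterable of (protein_id, start, end, evalue), 1-based inclusive.
    """
    by_protein = {}
    for pid, start, end, evalue in hits:
        if evalue < max_evalue:                 # keep only significant hits
            by_protein.setdefault(pid, []).append((start, end))
    covered = 0
    for intervals in by_protein.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:            # overlapping or adjacent
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
    total_residues = sum(protein_lengths.values())
    return len(by_protein) / len(protein_lengths), covered / total_residues

# Toy proteome and hits (made up).
lengths = {"p1": 100, "p2": 200, "p3": 100}
toy_hits = [("p1", 1, 50, 1e-10), ("p1", 40, 80, 1e-5),
            ("p2", 101, 150, 1e-4), ("p3", 1, 100, 0.5)]
protein_fraction, residue_fraction = structural_coverage(lengths, toy_hits)
print(protein_fraction, residue_fraction)
```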
Functional classification. We classified cellular function using EUCLID [37] . As input we used the SWISS-PROT homologues identified by MaxHom. EUCLID assigned the following 13+1 categories of cellular function [50] : "Amino acid biosynthesis", "Biosynthesis of cofactors, prosthetic groups, and carriers", "Cell envelope", "Cellular processes", "Central intermediary metabolism", "Energy metabolism", "Fatty acid and phospholipid metabolism", "Purines, pyrimidines, nucleosides, and nucleotides", "Regulatory functions", "Replication", "Transcription", "Translation", "Transport and binding proteins", and "Other categories". Proteins described as "Unclassified" either had no SWISS-PROT homologue or could not be classified by EUCLID.
To examine whether or not the ORF lengths followed an extreme value distribution:
(Eq. 2)
we calculated the mean and standard deviation s of the ORF length from archaes, prokaryotes, and eukaryotes, and estimated the two parameters of the distribution using the method of moments.
We then performed a goodness-of-fit test to determine whether
the observed distribution was compatible with the extreme value
distribution.
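The fitting procedure can be sketched as follows, assuming the type I extreme value (Gumbel) form with location mu and scale beta. The method of moments then gives beta = s * sqrt(6) / pi and mu = mean - gamma * beta, where gamma is the Euler-Mascheroni constant. The parametrization, the symbol names, and the use of a chi-square statistic over length bins are assumptions for this example; the text does not state which goodness-of-fit test was applied.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329

def gumbel_moments(lengths):
    """Method-of-moments estimates (mu, beta) for a Gumbel distribution."""
    mean = statistics.fmean(lengths)
    s = statistics.stdev(lengths)
    beta = s * math.sqrt(6) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta

def gumbel_cdf(x, mu, beta):
    """Cumulative Gumbel distribution exp(-exp(-(x - mu) / beta))."""
    z = -(x - mu) / beta
    if z > 700:                     # avoid overflow far in the lower tail
        return 0.0
    return math.exp(-math.exp(z))

def chi_square_statistic(lengths, mu, beta, edges):
    """Chi-square statistic of observed vs expected counts in length bins.

    Bins are (-inf, e1], (e1, e2], ..., (ek, inf); expected counts must be
    non-zero, so edges should not lie far outside the fitted range.
    """
    n = len(lengths)
    bounds = [-math.inf] + list(edges) + [math.inf]
    stat = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        observed = sum(1 for x in lengths if lo < x <= hi)
        p_hi = 1.0 if hi == math.inf else gumbel_cdf(hi, mu, beta)
        p_lo = 0.0 if lo == -math.inf else gumbel_cdf(lo, mu, beta)
        expected = n * (p_hi - p_lo)
        stat += (observed - expected) ** 2 / expected
    return stat

# Toy ORF lengths (made up).
toy_lengths = [100, 200, 300, 400, 500]
mu, beta = gumbel_moments(toy_lengths)
stat = chi_square_statistic(toy_lengths, mu, beta, edges=[200, 400])
print(round(mu, 2), round(beta, 2))
```

The statistic would then be compared against a chi-square distribution with the appropriate degrees of freedom (number of bins minus one, minus the two fitted parameters).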
The supplementary material includes three figures showing the following: (1) cumulative percentage of membrane proteins with different numbers of predicted transmembrane segments, and the relation between the ORF length and the frequency of proteins with a given number of predicted transmembrane segments, (2) distribution of length of globular regions in membrane proteins, (3) distribution of coiled-coil segments in the proteins, and amino acid composition of coiled-coil segments.
All the figures and figure legends are included in a PDF file called
"supplement.pdf" (HTML version).
Thanks to Henrik Nielsen (CBS, Denmark) for the source code of
SignalP and for his generous help in using this program. Thanks
to Andrei Lupas (SmithKline Beecham Pharmaceuticals) for helpful
suggestions about running the COILS program. Thanks to Florencio
Pazos, Damien Devos and Alfonso Valencia (all CNB Madrid) for
supplying and helping with the program EUCLID. The work of JL
and BR was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01
from the National Institutes of Health. Last, not least, thanks
to all those who enabled this analysis by depositing experimental
information into public databases and to all those who maintain
these.