Folding & Design, 1997, 2, 519-524.
Burkhard Rost
Address: EMBL, 69 012 Heidelberg, Germany
Correspondence: Burkhard Rost
e-mail: rost@embl-heidelberg.de
contact e-mail:rost@embl-heidelberg.de
Background: A protein sequence folds into a unique three-dimensional protein structure. However, different sequences can fold into similar structures. How stable is a protein structure with respect to sequence changes? What percentage of the sequence are 'anchor' residues, i.e., are crucial for protein structure and function? Here, these questions were pursued by analysing large numbers of structurally homologous protein pairs.
Results: The distribution of pairwise sequence identity between structural homologues showed one prominent and unexpected feature: most pairs clustered in an approximately Gaussian peak centred at 8-9% sequence identity. The distribution was surprisingly similar to that expected for 'random' pairs of completely unrelated sequences. Qualitatively, the distributions were similar for different criteria to consider two structures homologous. An analysis of four entire genomes suggested that most structurally homologous pairs have less than 45% pairwise sequence identity.
Conclusions: (1) Most pairs of similar structures have
sequence identity as low as expected from randomly related sequences.
(2) On average only three to four percent of all residues are
'anchor' residues (residues crucial for maintaining the structure).
(3) The symmetric shape of the distribution at low sequence identity
suggests that for most structures, four billion years of evolution
was sufficient to reach an equilibrium. (4) The mean identities
for convergent (different ancestor) and divergent evolution (same
ancestor) of proteins to similar structures are quite close, and
hence, in most cases it is difficult to distinguish between the
two effects. In particular, low levels of sequence identity appear
not to be indicative of convergent evolution.
Large scale studies of protein structure evolution can now begin. We have a detailed and ever-widening knowledge of the evolution of DNA sequences. But what do we really know about the evolution of protein structure? Until recently, the answer was: not much. The first detailed structures were determined 27 years ago [1, 2], and 14 years ago, the database of atomic-resolution protein structures contained just 312 structures (PDB [3]). Since then, due to advances in determination methods [4], the PDB has grown exponentially; presently it holds 5000 entries. A parallel development has occurred in methods for aligning of protein structures [5-16]. Applying these automatic methods to the current PDB, we can now begin to analyse the evolution of protein structure.
Stability of structures with respect to sequence changes enable evolutionary drift. It has long been accepted that each protein sequence folds into a unique 3D structure, and that proteins with similar sequences have similar structures [17]. But exactly how similar do two sequences have to be to have similar structures? In other words, how large is the sequence attractor of a protein structure (i.e., the region in sequence space which maps onto the same fold)? The answer is: surprisingly large: essentially all protein pairs of known structure with more than 30 out of 100 residues pairwise identical have similar structures[18]. This high robustness of structures with respect to residue exchanges explains partly the robustness of organisms with respect to gene-replication errors, and it allows much scope for variety in evolution. In recent years, many examples of protein pairs have been uncovered which have similar structures at even lower levels of pairwise sequence identity. At first, this was a surprise [19, 20]. However, we now start to realise that low levels of sequence identity between similar structures is not the exception.
Here, a [21]
of protein structure evolution
is detailed, and extended. This analysis is based on all pairs
of proteins in the PDB with similar three-dimensional structures.
For each pair, the structures were aligned, and the sequence
identity (pairwise identical residues) in the aligned regions
was measured. To minimise bias in regions of higher identity,
distributions of pairwise sequence identity were compiled for
four entire genomes representing all terrestrial kingdoms: haemophilus
influenzae (prokaryote), yeast (eukaryote), mycoplasma genitalium
(prokaryote), and methanococcus jannaschii (archea).
Expected distributions for convergent and divergent evolution.
A priori, we might suppose that divergent evolution of sequences
from the same ancestor would give rise to a distribution of sequence
identity scores with a peak value, D, at some probably
low value (e.g. D < 30%), and a smooth, relatively flat
tail for high values (Fig. 1). In the case of convergent evolution,
where two unrelated sequences evolve to the same structure, we
would expect a sharp Gaussian distribution with a peak value,
C, at very low identity, e.g. C < 10% (Fig. 1).
Hypothetical distribution of pairwise sequence identity for two evolutionary events: (1) protein pairs that converged from a different ancestor to similar structures (grey line; peak at C), and (2) proteins that diverged from a common ancestor maintaining a similar structure (black line; peak at D). The dashed line indicates that it is a priori not clear which relation to expect between the divergent peak and its tail at high levels of sequence identity.
Observed distribution for structural homologues: one peak around
8-9%. The distribution of sequence identity scores for structural
homologues (Fig. 2) has three distinct regions: (1) a large, approximately
Gaussian, peak centred at 8-9%; (2) many smaller, sharp peaks
between 15-95%; and (3) a large peak near 100%. The last peak
may arise from engineered mutants to facilitate structure determination.
The second region can be explained by 'incoherent noise' peaks
arising from uneven sampling and the still relatively small size
of the current PDB (see below). The peak in the first region
seemed to be incompatible with the hypothesis that convergent
and divergent evolution yielded two different Gaussian distributions
(around C and D ); the observed peak occurs at
very low average identity (8-9%), and is remarkably symmetric
(Fig. 2, left panel). The peak is also very similar to the distribution
of random sequence identities with a peak value, R , at
5.6% (Fig. 2, left panel). Qualitatively similar results were
obtained when using sequence similarity instead of sequence identity
[21].
Distribution of pairwise sequence identity for structural alignments (open circles, black line) and random alignments (left panel, only; crosses, grey line; see Materials). The average sequence identity of all remote structural homologues (< 25% pairwise sequence identity, left panel) was about 8.5% (standard deviation 5%). The dashed line is a Gaussian envelope (left panel) fitted to the observed distribution. The average sequence identity of random alignments was about 5.6% (standard deviation 3%).
Have divergent and convergent evolution reached a similar equilibrium? Three scenarios could have generated the observed distribution for similar structures with vanishing pairwise sequence identity (<15%) as a superposition of two separate events (Fig. 2). (1) Divergent evolution has not reached down to very low levels of sequence identity; the observed distribution for remote homologues is entirely dominated by pairs that converged from different ancestors to adopt similar structures. (2) Convergent evolution is negligible; all observed pairs have originated from divergence to very low levels of sequence identity. (3) Divergent and convergent evolution have reached similar equilibrium distributions.
Divergent evolution not under-represented for remote structural
homologues. The under-representation of pairs that have
diverged from a common ancestor may have been caused by the particular
definition of structural similarity (the more 'relaxed' the more
likely even functionally unrelated proteins could be deemed 'structurally
similar'). However, various different criteria yielded qualitatively
the same result: a single Gaussian distribution peaking around
8-9% sequence identity explained the observed data best (Fig. 3).
Distribution of pairwise sequence identity for remote structural homologues using
different thresholds for structural similarity. Different thresholds were generated
by excluding all pairs in the FSSP database [23] below a given threshold in the
DALI z-score (data given for zDali < 2, 3, and 6) [24]; and for which the
alignment covered a too small percentage of the aligned preoteins (data
given for R < 0.2, 0.6). Note: the logarithmic scale of the
vertical axis, in lagarithmic scale a Gaussian becomes a parabola.
By tuning the threshold, rather different data sets
were generated. At the most stringent cut-off value (Z<6),
too few examples for statistical analysis were found in the PDB.
However, the plots suggest that the details of the shape of the
function displayed in Fig. 2 were independent of the particular
choice of the cut-off for structural similarity.
Similar results for alignment inter- and intra-kingdom.
The observation of one peak at around 8-10% sequence identity
(Fig. 2) may have arose from alignments between proteins from
the same terrestrial kingdom (e.g. prokaryotes with prokaryotes,
or eukaryotes with eukaryotes) whereas the several peaks above
30% sequence identity may have resulted from alignments between
proteins from the different terrestrial kingdom (e.g. prokaryotes
with eukaryotes). Compiling the distributions separately for
all prokaryote-prokaryote, for all eukaryote-eukaryote, and for
all eukaryote-prokaryote alignments did not confirm this suspicion.
Instead, for all inter- and intra-kingdom alignments the distributions
appeared qualitatively similar to the one for all protein pairs
(Fig. 4).
Distribution of pairwise sequence identity for structural homologues
belonging to the same kingdom; shown are the results for all prokaryotes
against all prokaryotes (solid black line), for all eukaryotes
against all eukaryotes (dashed black line), and for all eukaryotes
against all prokaryotes (solid grey line). The lower left and
the upper right chart differ in the subsets of PDB considered
(lower left chart: starting from the sequence unique subset of
PDB, Fig. 6; upper right chart: starting from the structure unique
subset of PDB, Fig. 6). The bars indicate that the absolute values
of the distribution below and above 25% sequence identity are
not comparable! To minimise the effect of database bias below
25% the start set was aligned against a subset of PDB (in which
no two proteins had more than 25% pairwise sequence identity),
whereas above 25% the start set was aligned against the entire
PDB (Fig. 6).
Most close structural homologues have less than 45% pairwise
sequence identity. Sequence alignments of all proteins from
the four entire genomes against SWISS-PROT, and TREMBL [22] yielded
several clear results. (1) The coherent peak near 100% (Fig. 2,
right panel) is not present (Fig. 5) for any of the genomes.
(2) The various smaller peaks between 40 and 80% in the distribution
of structural alignments (Fig. 2, right panel) are not coherently
observed in the four genomes (Fig. 5). (3) The majority of close
structural homologues (>30% pairwise sequence identity) for
all four genomes had 30-42% pairwise sequence identity (data not
shown).
Distribution of pairwise sequence identity for all close structural
homologues between the four entire genomes and the SWISS-PROT
database. The results from the analysis of structural pairs are
also plotted. Counts are normalised to percentages by the sum
over all pairs for each genome (resp. the structural pairs).
Values between the structural alignments (PDB) and the sequence
alignments (genomes) were not strictly comparable. However, the
main trend was clear: the peaks in the PDB distribution around
80 and 100% sequence identity arise from bias in the selection
of structures in the PDB.
How stable is protein structure with respect to sequence changes? Most pairs of similar structures have sequence identity as low as expected from randomly related sequences (Fig. 2, left panel). This does not imply that sequence changes were random but that to us - as observers of the record of evolutionary history - the sequence variations look random.
How many 'anchor' residue define a structure? The average percentage 'anchor' residues, i.e., residues that are crucial for protein structure and function, is not given by the average of the observed distribution. Instead, the relevant number is the difference between the peak observed for structural homologues (O = 8-%, Fig. 2), and the peak for random alignments (R). Thus, on average only three to four percent of the residues anchor a protein structure (O - R , Fig. 2).
Did evolution have enough time to reach an equilibrium? Most remote structural homologues have less than 15% pairwise sequence identity (Fig. 2), and most close structural homologues have less than 45% pairwise sequence identity (Fig. 3). This implies that the rate of creation of new structures is much slower than the drift towards the mean sequence identity (D ). Furthermore, the symmetric shape of the distribution at low sequence identity suggests that four billion years of evolution was sufficient to reach an equilibrium between these two processes.
Can we distinguish between convergent and divergent evolution? Naïvely, we may have supposed that the level of pairwise sequence identity for remote homologues can be used to distinguish between convergent and divergent evolution. However, the results presented here suggest that convergent, and divergent evolution may have quite similar equilibrium states (difference between divergent and convergent average, D - C quite small, Fig. 2), and hence, in the remote homology region (<15%) it is difficult to distinguish between the two effects.
How will the distribution look for all globular proteins? Assuming that the three terrestrial kingdoms (eukaryotes, prokaryotes, archaeans) would result in separated clusters, the observed distributions could also be attributed to such clusters. The relation between the major peak around 8-10% sequence identity, and the minor peaks around 60, and 80% (Fig. 2) would then reflect merely the distribution of organisms that are the sources for the protein sequences in the PDB. Furthermore, the conclusion suggested by Fig. 2 that most structural homologues have less than 15% pairwise sequence identity may also be a result of the particular distribution of organisms in PDB. However, restricting the distributions to alignments the same kingdom, and to alignments between two different kingdoms did qualitatively not alter the distributions (Fig. 4). The fact that very different data sets produced similar results may suggest that the distribution for all globular proteins would look similar to the one observed today. However, the data sets are still too small to allow drawing firm conclusions.
Trivial or surprise? In presenting this analysis at the
Copenhagen meeting, and elsewhere, most experts have expressed
surprise at the low value of the average pairwise sequence identity.
Clearly, then, the distributions shown in Fig. 3 contain
an important lesson in advancing our understanding of the evolution
of proteins.
Compiling pairwise sequence identity. The basic score compiled was the level of pairwise sequence identity, i.e., the percentage of residues identical between two aligned proteins. Structure alignments were taken from the FSSP database of structure alignments [23]. The method that produced these alignments (DALI) attempts to superimpose two structures according to their similarity in the pattern of inter-residue contacts [24]. Thus, the feature analysed here (pairwise sequence identity) has NOT been used by the structural alignment algorithm.
Two regions of pairwise sequence identity: close and remote structural homologues. The level of pairwise sequence identity for which two naturally evolved proteins are guaranteed to have similar structures [17], depends on the alignment length [18]. A 'twilight zone' [25] distinguishes the region in which sequence identity implies structure similarity, and the region for which many proteins have different structures. Schneider and Sander [18] defined a length-dependent cut-off line which was used in this analysis to separate: (1) the region of close structural homologues (pairwise sequence identity > 25%), and (2) the region of remote structural homologues (pairwise sequence identity < 25%). In the first region sequence alignment methods produce accurate alignments; in the second region, reliable alignments have to be based on knowledge about structure.
Avoiding bias by populated folds through selection of data
set. First, the largest subset of unique structures was
selected (272 proteins for which 148 had at least one remote homologue;
Fig. 6; compiled from [23]). Second, the largest subset of sequence
unique structures was selected (849 proteins; Fig. 6; compiled
from [23]). To further reduce possible bias, the unique structures
were aligned against the set of unique sequences, only (instead
of against the entire PDB; Fig. 6). The distributions of levels
of pairwise sequence identity above 25% were generated by aligning
all proteins in the 'structure unique' set against all proteins
in PDB (note that by definition of the set of sequence unique
structures there is no pair in that set with more than 25% pairwise
sequence identity; Fig. 6). To explore the effect of comparisons
inter- and intra the two major terrestrial kingdoms the alignments
were additionally restricted to homologues between: (1) prokaryotes
and prokaryotes; (2) eukaryotes and eukaryotes; and (3) mixed
pairs, i.e. one protein from prokaryotes, the other from eukaryotes.
For the later, starting from the structure unique set yielded
too low counts (upper right chart in Fig. 4). Therefore, the
inter- and intra kingdom data was also compiled when starting
from the sequence unique subset of PDB (lower left chart in Fig. 4).
Avoiding bias by selection of data set. The analysis presented
in this paper was compiled based on the largest possible subsets
of PDB that fulfilled the following criteria: (A) sequence unique:
no pair of structures in that set had more than 25% pairwise sequence
identity (according to structural alignment [24]); (B) structure
unique: starting from the sequence unique set, a set was selected
in which no pair of structures had a significant structural similarity
(defined by the DALI cut-off [24]). This procedure implied the
separation into two regions of pairwise sequence identity: (1)
remote structural homologues (< 25% pairwise sequence identity):
all proteins in the structure unique set (grey inner circle) were
aligned against all proteins in the sequence unique set (white
outer circle); (2) close structural homologues (> 25% pairwise
sequence identity): all protein in the structure unique set (grey
inner circle) were aligned against all proteins in PDB.
Exploring four entire genomes. The structure alignments yielded a rather 'noisy' distribution in the region above 40% sequence identity (Fig. 2). A way to less biased distributions has been opened by sequencing the entire genomes of (1) haemophilus influenzae (HI) [26]; (2) yeast (saccaromyces cerevisiae, YE) [27]; (3) mycoplasma genitalium (MG) [28]; and (4) methanococcus jannaschii (MJ) [29]. For each genome, distributions were generated by aligning all protein sequences against the SWISS-PROT and the TREMBL [22] database (using the multiple sequence alignment program MAXHOM [18, 30]).
Generating random background distributions. The random
background distribution was generated by the following procedure.
All proteins in the 'structure unique' set were randomly superimposed
onto all proteins in the 'sequence unique' set. 'Randomly superimposed'
refers to selecting 'alignment' begin and end in both 'aligned'
proteins by generating random numbers, i.e. irrespective of the
amino acids that were superimposed. A constraint was imposed
while randomly selecting pairs: the pairs had to mirror the distribution
of alignment length observed by the structurally aligned pairs.
This particular construction of random pairs guaranteed that
the random background was representative for the set of structurally
aligned proteins: as well singlet frequencies (amino acid composition),
as higher order correlations (di-, tri-, multi-peptide frequencies)
were similar. (Therefore, the average value of 5.6% was lower,
than what would have been expected from superimposition of randomly
shuffled sequences.)
I am most grateful to Sean O'Donoghue (EMBL, Heidelberg) for crucial
encouragement, many discussions, extremely valuable remarks, and
last not least, for proof-reading. Thanks also to Chris Sander
(EBI, Hinxton) for his intellectual, and financial support. Furthermore,
thanks to the extremely constructive comments from one of the
anonymous referees. In general thanks to all those who deposit
information in public databases, and to those who carry the burden
of maintaining these valuable evolutionary record.