EMBL, 69 012 Heidelberg, Germany
contact e-mail:rost@embl-heidelberg.de
The problem of predicting protein structure from sequence remains
fundamentally unsolved despite more than three decades of intensive
research effort. However, new and promising methods in 3D, 2D,
and 1D prediction have reopened the field. Mean-force-potentials
derived from the protein databases can distinguish between correct
and incorrect models (3D). Inter-residue contacts (2D) can be
detected by analysis of correlated mutations, albeit with low
accuracy. Secondary structure, solvent accessibility, and transmembrane
helices (1D) can be predicted with significantly improved accuracy
using multiple sequence alignments. Multiple sequence alignments
tailored to particular protein families intrude into the twilight
zone of sequence similarity. Combination of advanced sequence
alignment, and threading techniques with expertise permits to
intrude into the midnight zone of extremely low sequence identity,
even. Some of these new prediction methods have proven accurate
and reliable enough to be useful in genome analysis, and in experimental
structure determination. Moreover, the new generation of theoretical
methods is increasingly influencing experiments in molecular biology.
In Greek mythology, Sisyphus is condemned to an eternity of hard labour; his labour is a frustrating and fruitless, for just as he is about to achieve his goal, his work is undone and he must start again from the beginning. Those who work in protein structure prediction seem to share the same fate...
In the first part of the tutorial, some principles about protein structure will be briefly reviewed. Due to advances in sequencing methods, the number of proteins for which the amino acid sequence is know is currently over 200,000 and rapidly increasing. In principle, the three-dimensional (3D) structure of proteins is determined by the amino acid sequence. Currently, the relationship between sequence and structure is unknown: we cannot in general predict structure from sequence. However, from the growing database of experimentally-determined protein structures, some rules are emerging. Firstly, the number of unique protein folds is quite limited. Secondly, there are many proteins with the same fold, but no similarity of sequence. Thirdly, 'neutral' mutations not altering the protein structure are relatively unlikely. Hence naturally evolved proteins are a record of the unlikely, since most neutral mutations are probably realised. These rules suggest that a key to understanding protein structure lies in the patterns of neutral amino acid exchanges.
How far can theory bridge the growing gap between the data bases of sequence and structure? For a sequence with significant similarity to a protein of known structure, homology modelling can be used to construct a 3D model with correct fold, but inaccurate loop regions. Homology modelling effectively raises the number of 'known' 3D structures to over 30,000. In absence of significant sequence identity, threading techniques can potentially detect remote homologies. In general, after more than 30 years of ardent research on protein structure prediction, technical improvements and data base growth have made alignments, threading, and homology modelling increasingly powerful.
For most proteins neither homology modelling nor threading is
applicable: the prediction problem has to be simplified. I shall
discuss generic methods for prediction at three different levels
of simplification, namely one, two, and three dimensions. Prediction
in 1D (secondary structure, solvent accessibility and transmembrane
helices) can be improved significantly through the use of evolutionary
information. Prediction in 2D (inter-residue contacts, inter-strand
contacts, disulphide bonds) can also, to a certain extent, profit
from evolutionary information, but so far, is of only limited
accuracy. Some progress in 3D prediction has been made: (1) particular
protein structures (coiled-coil proteins) can be predicted from
sequence, (2) incorrect structures can now be detected with remarkable
accuracy (mean-force potentials).
| section | title | duration | time |
| 1 | Cosmos of protein structures and sequences | 30' | 15:00 |
| Evaluating prediction methods - a chaos? | 15' | 15:30 | |
| State-of-the-art in protein structure prediction | 15' | 15:45 | |
| Break | 15' | 16:00 | |
| 2 | Structure prediction for known folds (presentation of methods) | ||
| Sequence alignments | 15' | 16:15 | |
| Homology modelling | 10' | 16:30 | |
| Remote homology modelling (threading) | 20' | 16:40 | |
| Break | 30' | 17:00 | |
| 3 | Structure prediction for unknown folds | ||
| Prediction in 1D | 45' | 17:30 | |
| Break | 15' | 18:15 | |
| Prediction in 2D | 15' | 18:30 | |
| Prediction in 3D | 15' | 18:45 | |
| Apotheosis: why continue structure prediction in the wake of the genome data flow? | 15' | 19:00 | |
| Tutorial ends | 19:30 |
Colours in the cosmos
The basic principles of protein structures are shortly introduced.
Protein structure is determined by sequence. However, there
are many proteins which have strong structural similarity, but
no similarity of sequence. In other words, structure is more
conserved than is sequence. Indeed, the majority of similar structure
pairs populates levels of sequence identity as low as expected
by random (8-10%). Thus, naturally evolved proteins are a record
of the unlikely in that all mutations not altering the structure
are probably realised, although the likelihood to find a neutral
mutation is small. The patterns of amino acid exchanges not changing
structure are highly informative about a given structure.
Evaluating prediction methods - a chaos?
Useful computational are urgently demanded by molecular biologists.
However, to make a prediction method useful three points have
to be met. Firstly, the predicted feature of protein structure
or function has to be suitable for an experiment (or post-processing
prediction methods). Secondly, the method has to be made available.
And thirdly, most importantly to keep theory in the game, prediction
accuracy has to be estimated at a sustained level. In the wake
of today's flood in literature, experimental biologists and even
theoreticians from slightly different fields have no chance to
critically assess the claims of predictors. Thus, the application
of prediction methods requires quite a level of trust. This demands
a significant level of modesty on the side of the predictors.
Levels of expected accuracy should be conservative and tend to
under-estimate rather than to over-estimate the abilities of tools.
Given the ease of distributing software and services over current
internet resources, the issue of appropriate evaluation becomes
increasingly sensitive. For the predictor appropriate evaluation
implies to spend much more time on testing than on developing
a tool.
State of the art in protein structure prediction
The problem of predicting protein structure from sequence remains
fundamentally unsolved despite more than three decades of intensive
research effort. However, new and promising methods in 3D, 2D,
and 1D prediction have reopened the field. For 10-30% of all
known protein sequences, homology modelling can be applied to
predict 3D structure. Mean-force-potentials derived from the
protein databases can distinguish between correct and incorrect
models (3D). Inter-residue contacts (2D) can be detected by analysis
of correlated mutations, albeit with low accuracy. Secondary
structure, solvent accessibility, and transmembrane helices (1D)
can be predicted with significantly improved accuracy using multiple
sequence alignments. Some of these new prediction methods have
proven accurate and reliable enough to be useful in genome analysis,
and in experimental structure determination. Moreover, the new
generation of theoretical methods is increasingly influencing
experiments in molecular biology.
What is a protein?
Building blocks: amino acids. Proteins are built up from 20 different types of amino acids that are joined by peptide bonds to form a linear chain. The information is coded in the DNA and translated into protein sequences. The basic information about life is coded in a sequence of four different nucleotide bases in the genes. There are two types of nucleic acids: the permanent storage system of the more stable deoxyribonucleic acid (DNA), and the intermediate blueprint tool of the less stable ribonucleic acid (RNA). The information is translated from the genes into the sequences of macromolecules which are involved in every process that keeps life going in an organism, the proteins. Proteins are build up by sequences of amino acids. Known protein sequences contain from some 30 to 10,000 amino acids.
Formation of peptide bonds. In general, there are some one hundred different natural amino acids, but only 20 are usually found in proteins. They all have in common the same basic tetragonal structure (Fig. 1.1). The amino acids differ in their side chains (Fig. 1.2). Transcribing the four base alphabet of the RNA on the ribosomes into the 20 letter alphabet of amino acids, the protein sequences are build up residue by residue by joining the amino acids with peptide bonds (Fig. 1.3). The atoms along the line connecting the Ca atoms are referred to as the main chain of the protein or as its backbone.
What determines protein structure?
Hierarchy of protein structure terminology. The following hierarchy is often used: primary structure = amino acid sequence; secondary structure = regular patterns of the main chain atoms, like a-helices or b-strands; tertiary structure = the arrangement of all atoms in a protein chain in three dimensions; quaternary structure = the arrangement of all atoms of the whole protein possibly consisting of multiple chains.
Sequence determines structure. A fully unfolded amino acid sequence diluted in the appropriate solvent (under proper conditions in terms of pH value and temperature) folds into a unique tertiary (3D) structure (Anfinsen et al., 1961; Anfinsen et al., 1963; Epstein et al., 1963; Anfinsen, 1973; Anfinsen & Scheraga, 1975). The process is reversible (Creighton, 1984; Creighton, 1991). Consequently, it is assumed that folding is determined exclusively by the information contained in the amino acid sequence (Ewbank & Creighton, 1992). Some experiments suggest that the formation of some secondary structure precedes tertiary organisation (Ewbank, 1992; Ewbank et al., 1995). A possible exception to the Anfinsen-hypothesis constitute molecular chaperones, i.e., proteins which assist or hinder folding (Hubbard & Sander, 1991; Agard, 1993; Craig, 1993; Martin & Hartl, 1993; Hartl et al., 1994; Ewbank et al., 1995; Clarke & Fersht, 1996; Corrales & Fersht, 1996).
Secondary structure facilitates dense packing. The main driving force for folding water-soluble globular protein molecules is the need to pack hydrophobic side chains into the interior of the molecule, thus creating a hydrophobic core and a hydrophilic surface. But how can that be realised with the main chain being highly polar (with NH as hydrogen donor and C'=O as hydrogen acceptor, Fig. 1.1)? The simple trick is to neutralise the NH and C'O groups by a formation of hydrogen bonds (Ptitsyn, 1992). These bonds effect the formation of the regular patterns of secondary structure like a-helix and b-strand (Schulz & Schirmer, 1979). Any region of the protein that is not in either helix or strand will be termed 'loop' in this work (some authors use the term 'random coil' based on the helix-coil model (Zimm & Bragg, 1959). Helices and strands form dipoles (Hol et al., 1981). The existence of secondary structure elements was first proposed by Pauling and Corey on theoretical grounds prior to their discovery in protein structures (Pauling & Corey, 1951; Pauling et al., 1951; Pauling & Corey, 1953b; Pauling & Corey, 1953a).
Function-specific motifs of secondary structure. Combinations of a few secondary structure segments with a specific geometric arrangement occur frequently in protein structures (3D structure). Such combinations are termed super-secondary structure or motifs. Some of these motifs are associated with particular functions. Examples are the helix-loop-helix DNA binding motif (Gibson et al., 1993), the calcium binding motif (Fig. 1.9), or the Greek key or b-meander motif (Fig. 1.10) (Hutchinson & Thornton, 1993).
Classification of proteins into structural classes. Motifs can be used to classify proteins (Richardson, 1981; Richardson, 1985b; Richardson & Richardson, 1989; Murzin & Chothia, 1992; Orengo et al., 1993a; Orengo et al., 1993b; Wodak & Rooman, 1993; Murzin, 1994; Orengo, 1994; Murzin et al., 1995; Orengo & Taylor, 1996). Three widely used databases of structure classifications are SCOP (Murzin et al., 1995; Brenner et al., 1996; Hubbard et al., 1997), FSSP (Holm & Sander, 1995; Holm & Sander, 1996a; Holm & Sander, 1996b), and CATH (Orengo, 1994; Orengo & Taylor, 1996; Orengo et al., 1997). A more simple classification is based purely on the content of secondary structure (Chothia, 1975; Levitt, 1976; Levitt & Chothia, 1976; Richardson, 1981; Zhang & Chou, 1992). A protein can be classified as, e.g., all-a, if it contains almost no strand structure and a high content of helix (Fig. 1.11) (Rost WWW, 1996).
Protein folding: a problem solved only by nature?
Variety of protein structures. Protein structures show a fascinating variety. Structure is more conserved by evolution than sequence. This is mainly explained by the fact that the 3D structure is closely related to the function of the protein. Although the mutation of a few residues in a protein are likely to destabilise the fold (Dao-pin et al., 1990; Dao-pin et al., 1991a; Dao-pin et al., 1991b), evolution has created a record of sequence variation not changing the 3D structure. Two natural protein sequences can differ by 75% of their residues and, yet, have the same 3D structure (Chothia & Lesk, 1986; Sander & Schneider, 1991; Rost, 1997a).
"When the first structures of proteins were solved by X-ray crystallography biochemists were struck by the beautiful topologies of their backbone folds and soon researchers in the field became eager to collect structures, and much like zoologists and botanists in past centuries they developed systematic schemes and looked for common features among the various families of folds hoping to unravel the underlying theme responsible for their bizarre structures." (Sippl et al., 1994a)
Cracking the code. Solving the protein folding problem means deciphering the code according to which the 3D structure is encrypted in the amino acid sequence. Can we crack the code, i.e., can we unboil the egg (Perutz, 1940)? Many researchers successfully fail in doing the neat trick (which is why the issue of predicting protein structure is so interesting...). Prediction methods can be distinguished according to the principle they start from: physics or statistics. The prediction success of methods based on physical principles is still very limited.
Marginal entropy differences determine protein stability. What determines protein stability? The hypothesis of Anfinsen is that the folded state of a globular protein is characterised by a minimum in free energy (Anfinsen, 1973; Anfinsen & Scheraga, 1975). The folding transition is largely a two-state process: unfolded non-native chain (U) -> folded native structure (N). As a first approximation, intermediate states can be neglected (though recent exceptions have been found (Ewbank & Creighton, 1991; Ewbank et al., 1995), and the difference in free energy between unfolded and native state (Æ G) can be approximated by (Lattman & Rose, 1993):
Æ Gµ
- RT ln K
with R being the gas constant, T the absolute temperature, and
the equilibrium constant K = number of chains in U / number of
chains in N. Typical values for ÆG are -5 to -15 kcal/mol
(Lattman & Rose, 1993).
Hydrophobic forces drive folding stability. Why do proteins fold? The driving force for folding has been established to be the reduction of solvent accessible surface (Kauzmann, 1959). Folding is driven by the attempt for dense packing (Schulz & Schirmer, 1979; Jaennicke, 1987; Lesk, 1991; Stigter et al., 1991; Pickett & Sternberg, 1993). Globular proteins are known to have mean packing densities reminiscent of solids (Lattman & Rose, 1993). This density can possibly be explained by the complementarity between interior side chains, fitting together like pieces of a jigsaw puzzle (Taylor, 1992).
Dense packing determines the conformational specificity? What determines the specific conformation of a fold? One explanation could again be the density of packing, i.e. only very specific conformations allow the residues to pack into the jigsaw puzzle. However, there is evidence that such a grouping can readily be done, i.e. does not require one particular conformation (Lattman & Rose, 1993). This suggests that dense packing is not the primary source of conformational specificity. What then determines the fold? One attractive candidate is the stereo-chemical code:
"It is plausible that conformational specificity is imposed through a redundant stereo-chemical code that arises from the interplay between the shape and polarity of residue side chains and secondary structure conformation." (Lattman & Rose, 1993).
Evolution creates a record of the unlikely!
A single mutation can destabilise a protein. The mutation of a single residue typically causes an approximate reduction of the free energy difference between native and unfolded state of about 1 kcal/mol (Lattman & Rose, 1993). Thus, the exchange of a few residues can already destabilise a protein of more than 100 residues (Dao-pin et al., 1990; Dao-pin et al., 1991a; Dao-pin et al., 1991b; Zabin et al., 1991). Does this imply that two proteins with some different residues have a different 3D structure? And if, are all potential 3D structures realised in nature, i.e. are there some 20N different folds for proteins with N residues realised in nature? The fact that a single mutation can destabilise a protein implies only that the majority of the 20N possible sequences adopt different structures. But, has evolution created such an immense variety?
Only mutations not altering the structure survive. Random errors in the DNA lead to the wrong translation of the information coded in the genes into sequences of amino acids. These errors are the basis for evolution (Darwin, 1859; Monod et al., 1965; Zuckerkandl & Pauling, 1965b; Zuckerkandl & Pauling, 1965a; Monod, 1970). Are all such errors carved into fossils, or do only the fittest survive? The function is determined by the structure and the environment of the protein. Mutations resulting in a structural change are not likely, since the protein cannot perform its task. Thus, only those errors are likely to be accepted which do not alter the structure. Of course, this is only one side of the coin, would it not be possible to accept changes of the structure and consequently of the function, there would not be much room for evolution. Indeed, one of the evolving pictures is that proteins consists of functional modules, which are combined in various ways to yield different properties for the proteins (Doolittle, 1979; Bork, 1992; Doolittle & Bork, 1993; Green et al., 1993).
How much variation in sequence is possible? Mutations of amino acids survive if they do not change the 3D structure of the folded protein. The known proteins are a record of exploration for variation of sequence with no effect to structure. A protein sequence folds into a unique three-dimensional protein structure. Different sequences, though, can fold into similar structures. Structure is more conserved than sequence (Chothia & Lesk, 1986). But, how much variation of the sequence can exactly be accepted without changing the structure? How stable is a protein structure with respect to sequence changes? What percentage of the sequence are 'anchor' residues, i.e., are crucial for protein structure and function? The surprising result is quite some variation is possible (Fig. 1.13). Evolution has realised pairs of proteins which have the same 3D structure, although they have only 25 of their 100 residues alike (Sander & Schneider, 1991; Rost, 1997a). Of course not any two residues can be exchanged anywhere in the sequence. Instead, the possible exchanges depend on the details of the structure and on the physico-chemical properties of the amino acids involved. Thus, the pattern of residue substitution - the record of the unlikely - carries information rather specific for a particular protein structure (Zuckerkandl & Pauling, 1965b; Zuckerkandl & Pauling, 1965a).
Evolution into the midnight zone. Structure can be maintained despite changing 75% of all residues. However, recent analyses of a large number of similar structures (Perez-Iratxeta et al., 1997; Rost, 1997b; Rost et al., 1997a) have unravelled that this is only a the tip of the iceberg of possible variation. Indeed, opposite to what one may have expected (Fig. 1.14), most pairs of similar structures have sequence identity as low as expected from randomly related sequences (Fig. 1.15). On average only three to four percent of all residues are 'anchor' residues (residues crucial for maintaining the structure). The symmetric shape of the distribution at low sequence identity suggests that for most structures, four billion years of evolution was sufficient to reach an equilibrium. The mean identities for convergent (different ancestor) and divergent evolution (same ancestor) of proteins to similar structures are quite close, and hence, in most cases it is difficult to distinguish between the two effects. In particular, low levels of sequence identity appear not to be indicative of convergent evolution.
How many different protein folds exist?
Speculations are that the number of different protein folds realised by nature is fairly limited (Chothia, 1992; Finkelstein & Reva, 1992; Finkelstein et al., 1993). However, the concept of 'similarity' between folds is not clear-cut (Sippl, 1982; Zu-Kang & Sippl, 1996). Today, the number of unique structures is about 300 (Holm & Sander, 1996b; Hubbard et al., 1997; Orengo et al., 1997). Based on this number and recent analyses of entire chromosomes the estimate for the number of folds appears to confirm the notion of 1,000 folds (factor of 3 possible).
Publishing optimistic results? A sustained testing of the performance is a precondition for any prediction to become useful. For example, the history of secondary structure prediction has gone through a head-hunting for highest accuracy scores. Over-optimistic claims by predictors on the one hand, nourished scepticism of potential users on the other hand. Two points became clear in the first meeting for the 'State of the art in structure prediction' in Asilomar, C.A., Dec., 1994 (Defay & Cohen, 1995). First, an inaccurate prediction is not as bad, as is an over-estimated one. Second, even a prediction method of limited accuracy can be useful if the user knows what to expect. In the following, some criteria will be summarised which help reducing the likelihood to fall into the trap of overestimation. As an example the prediction of secondary structure will be chosen.
What is the goal and which limits are to be expected? Say the goal is to predict secondary structure in three states. Which is the best current method for the prediction of secondary structure? If applicable, homology modelling. Which is the worst method? Random prediction. How accurate are existing methods for prediction?
How to choose the data set? Proteins used for deriving a method and for evaluating should have a pairwise sequence identity < 25%, else homology modelling can be applied (Sander & Schneider, 1991; Rost, 1997a). For all prediction methods the data set has to be split into a set used to set free parameters (training set) and another to estimate the expected performance on unknown proteins (test set). The criterion for separating training and test set is provided by the best alternative method (homology modelling). An excellent example for how to choose data sets is given for the goal of signal peptide prediction (Nielsen et al., 1996)
How many proteins to use for the test set?
All available unique proteins should be used for testing (currently
more than 900 (Hobohm et al., 1992; Hobohm & Sander, 1994; Holm & Sander, 1996b)).
Furthermore, results should be compared to standard sets used
for the evaluation of alternative methods (Rost & Sander, 1993b; Rost et al., 1993).
The reason for taking as many proteins as possible is simply
that proteins have a wide spread facet of features: some are easy
to predict, others harder. A criterion for a sufficient size
of a test set could be the following. N proteins are enough,
if (and only if) the standard deviation of a certain measure for
accuracy fulfils:
sÅ
s,
in other words, if doubling the test set would not alter the results.
Optimising free parameter with respect to the test set? The cross-validation described so far is still not enough. A seemingly trivial - and often violated - rule is that methods should never be optimised with respect to the test set. Instead, parameters should be optimised (if necessary) based on yet a different set (screening or optimisation set), and should be kept fixed BEFORE the final cross-validation experiment is performed.
How many cross-validation experiments have to be performed? Say the test set consists of 300 proteins. One extreme a two-fold cross-validation would mean to split the set into two set with 150 proteins each (A and B) and perform two experiments: first train on A, test on B, then train on B, test on A, and finally report the test results for A+B. The other extreme a 300-fold cross-validation (jack-knife) implies 300 splits into pairs of a training sets A with 299 proteins and test sets B with one protein each such that each protein is in one of the test sets. After 300 experiments the results are averaged over all 300 test sets. In practice the choice is often somewhere between two and 300. Does the number of splits have an influence on the correctness of the evaluation? The simple answer is: no! More splits tend to be better for the methods, as more proteins can be used for training. But with respect to the generality of the result there is no difference between two- and 300-fold cross-validation (as long as the previous points had been taken into account).
Enough of testing? Even if all those points had been taken into account, the sceptical molecular biologist should still not be satisfied. A further necessary step is to test the method on a new set of proteins, ideally after the paper had been written. With the rapid increase in the number of known structures, it should never be difficult to find say some 50 proteins which have no significant sequence identity to any of the 300 proteins used so far. This final test helps the reader and the predictor to assess whether or not the estimated performance is likely to be correct for new proteins.
How to measure performance accuracy?
Another rather obvious demand is that to define an appropriate
measure. The measure should reflect the goal of the method.
For example, if the goal is to predict 3D structure by remote
homology modelling (threading), the results have to be given in
e.g. root-mean square deviation. This example may appear particularly
trivial, nevertheless, the current practice is the opposite.
Another negative example are alignments, after more than 25 years
of dynamic programming, there is still no measure for the quality
of an alignment published that was tested on a large enough data
set. No matter what the measure is, the predictor should always
provide an estimate for the standard deviation of the expected
accuracy!
Goal of structure prediction. For over 30 years, there has been an ardent search for methods to the predict 3D structure from the sequence. Many methods were found which looked initially very promising - but always the hope has been dashed. The search has been driven by the belief that the 3D structure of a protein is determined by its amino acid sequence (Anfinsen, 1973). While it is now known that chaperones often play a rôle in the folding pathway, and in correcting misfolds (Hartl et al., 1994; Corrales & Fersht, 1996), it is believed that the final structure is at the free-energy minimum. Thus, all information needed to predict the native structure of a protein is contained in the amino acid sequence, plus a knowledge of its native solution environment.
The sequence-structure gap is rapidly increasing. Currently, databases for protein sequences (e.g. SWISS-PROT (Bairoch & Apweiler, 1996)) are expanding rapidly, largely due to large-scale genome sequencing projects. The first four entire genome sequences have been published; they represent all three terrestrial kingdoms: (1) prokaryotes: haemophilus influenzae (Fleischmann et al., 1995), and mycoplasma genitalium (Fraser et al., 1995); (2) eucaryotes: yeast (Goffeau et al., 1996); and (3) archeans: methanococcus jannaschii (Bult et al., 1996). At least, another dozen of genomes will be completely sequenced before the end of 1997 (Terry Gaasterland: Magpie genome sequencing project list) ; the entire human genome is likely to be known in the year 2003. This implies that the explosion of genome, and hence, protein sequences is supposedly the only field outgrowing the speed in development of computer hardware. It also implies, that despite significant improvements of structure determination techniques the gap between the number of proteins for which structure is deposited in public databases (PDB (Bernstein et al., 1977)), and the number of proteins for which sequences are known is increasing.
Limitations of ab initio prediction of protein structure from sequence. Given only the amino acid sequence, it should be possible in principle to directly predict protein structure from physico-chemical principles using, for example, molecular dynamics methods (Levitt & Warshel, 1975). In practice, however, such approaches are frustrated by the enormous complexity of the calculation (requiring many orders of magnitude more computing time than is currently feasible) and by inaccuracies in the experimental determination of basic parameters (van Gunsteren, 1993; Shortle et al., 1996). Thus, the most successful structure prediction tools are knowledge-based, using a combination of statistical theory and empirical rules.
CASP-Asilomar prediction contests. An important experiment has been initiated by John Moult (CARB, Washington): those who determine protein structures submitted the sequences of proteins for which they were about to solve the structure to a 'to-be-predicted' database; for each entry in that database predictors could send in their predictions before a given deadline (the public release of the structure); finally, the results were compared, and discussed during a workshop (in Asilomar, California). Two such experiments have been completed: in December 1994 (Proteins special issue, Vol. 23, 1995), and in December 1996 (to be published in Proteins, 1997). The results of both experiments demonstrated clearly that the goal to predict structure from sequence has not been reached, yet. So, no improvement despite ardent attempts, and the explosion of knowledge deposited in databases?
Bridging the sequence-structure gap for more than 30% of all sequences. The gap between the number of known sequences (> 100,000 (Bairoch and Apweiler, 1997)) and the number of known structures (> 5,000 (Bernstein et al., 1977)) is widening rapidly. The most successful theoretical approach to bridging this gap is homology modelling. Given a sequence of unknown fold (denote U), if U has significant sequence similarity to a protein of known structure (i.e., if the pairwise sequence identity is >25%), it is possible to construct an approximate 3D model which has a correct fold but inaccurate loop regions. Homology modelling effectively raises the number of 'known' 3D structures from 4,000 to over 15,000 (Schneider et al., 1997; Rost and Schneider, 1997). However, most pairs of proteins with similar structure are remote sequences homologues with less than 25% pairwise sequence identity (Rost, 1997b; Rost et al., 1997a). These remote homologues cannot usually be recognised by conventional sequence alignments, but may sometimes be recognised by threading methods. Once a remote homology is detected, remote homology modelling may be used to construct a 3D model. This could potentially reduce the sequence-structure gap by an additional 5,000 - 10,000 proteins. Now suppose we randomly choose a sequence U from one of the complete genome sequences which have recently become available; what is the likelihood that we could predict the 3D structure by homology modelling or remote homology modelling? A conservative answer would be 10%, based on the success of sequence alignment-based homology modelling. A very optimistic estimate would be over 50%, assuming all remote homologues could be recognised.
Accurate prediction for 1D aspects of 3D structure.
If no remote homologue can be detected for U, we are forced
to simplify the prediction problem. There is a pay-off from making
this simplification: using the rich diversity of information in
current databases, it is possible to make very accurate 1D predictions
from the sequence alone. Automatic prediction services are readily
available for predicting many aspects of 1D structure (Rost WWW & Schneider, 1996; Rost & Schneider, 1997):
secondary structure (Rost & Sander, 1994b; Rost et al., 1994b; Geourjon & Deléage, 1995; Salamov & Solovyev, 1995; Frishman & Argos, 1996; Rost, 1996b),
solvent accessibility (Rost & Sander, 1994c; Rost et al., 1994b; Rost, 1996b),
location and topology for transmembrane helices (von Heijne, 1991; Jones et al., 1994; Persson & Argos, 1994; Rost et al., 1994b; von Heijne, 1994; Rost et al., 1995; Persson & Argos, 1996; Rost, 1996b; Rost et al., 1996b; Rost et al., 1996a; von Heijne, 1996),
coiled-coil regions (Lupas, 1996b), signal peptides (Nielsen et al., 1997),
glycosilation sites (Hansen et al., 1995).
Selected literature for section
Introductions to protein structure and folding (books): my favourite: (Brändén & Tooze, 1991), others: (Schulz & Schirmer, 1979; Fasman, 1989b; Lesk, 1991; Rees et al., 1992; Sternberg, 1996; Creighton, 19893).
Introductions to protein structure and folding (reviews): my favourite: (Lattman & Rose, 1993), others: (Richardson, 1985a; van Gunsteren, 1988; Fasman, 1989a; Richardson & Richardson, 1989; Brünger & Nilges, 1993; Chothia & Murzin, 1993; Dill, 1993; van Gunsteren, 1993; Finkelstein, 1994; Murzin, 1994; Murzin et al., 1994a; Murzin et al., 1994b; Sippl, 1995; Holm & Sander, 1996c; Honig & Cohen, 1996; Lupas, 1996a; Nilges, 1996; Rost & Sander, 1996).
Evaluation of prediction methods: (Schulz & Schirmer, 1979; Cohen et al., 1983; Kabsch & Sander, 1983b; Taylor, 1984; Taylor & Thornton, 1984; Cohen et al., 1989; Crippen & Snow, 1990; Benner, 1992; Presnell et al., 1992; Sternberg, 1992; Benner et al., 1993; Colloc'h et al., 1993; Rao et al., 1993; Rost & Sander, 1993b; Rost et al., 1993; Russell & Barton, 1993; Wodak & Rooman, 1993; Rost & Sander, 1994c; Rost et al., 1994c; Wang, 1994; Defay & Cohen, 1995; Lemer et al., 1995; Rost, 1995b; Rost & Sander, 1995a; Russell & Sternberg, 1995; Zhu, 1995; Fischer et al., 1996a; Rost et al., 1997b).
Protein structure prediction (reviews):
(Fasman, 1989b; Karlin et al., 1991; King, 1992; Rost et al., 1993; Wodak & Rooman, 1993; Rost & Sander, 1994d; von Heijne, 1994; Barton, 1995; Bryant & Altschul, 1995; Eisenhaber et al., 1995; Rost & Sander, 1995b; Sippl, 1995; Barton, 1996; Cohen & Presnell, 1996; Honig & Cohen, 1996; Rost, 1996a; Rost & Sander, 1996; Rost & O'Donoghue, 1997; Rost & Schneider, 1997).
Sequence alignments
Any sequence analysis starts with database searches for homologous
proteins by sequence alignment procedures. When pairwise sequence
identity is over 40%, alignment procedures are usually straightforward.
For less similar protein sequences, alignments may fail. Multiple
alignments improve as data banks grow. The most advanced sequence
alignment tools base the alignment on profiles derived from databases
or particular sequence families. New methods (HMM, GA) may be
more successful in the twilight zone of sequence alignments, however
this remains to be proven.
Homology modelling
The basic assumption of homology modelling is that U and the homologous
template protein of known structure (T) have nearly identical
backbone structure in the aligned regions. The task is to correctly
place the side chains of U into the backbone of T. For levels
of above 70-90% sequence identity, the resulting models are quite
accurate, the limiting factor is the computation time required.
For sequence identities down to about 30% sequence identity,
U and T will still have the same fold, but the number of loops
inserted grows and the divergence between U and T becomes considerable.
For lower levels of pairwise sequence identity, the accuracy
of the sequence alignment becomes an additional problem. However,
even down to levels of 25-30% sequence identity, homology modelling
produces coarse-grained models for the overall fold of proteins
of unknown structure.
Remote homology modelling (threading)
Remote homology modelling (<25% pairwise sequence identity
between the unknown structure, U, and template, T) has three obstacles
to overcome: (1) the remote homology between U and T has to be
detected; (2) U and T have to be aligned correctly; and (3)
the homology modelling procedure has to be tailored to the harder
problem of extremely low sequence identity. In the early 1990s,
there was a great deal of optimism that the first obstacle, the
detection of similar folds, would be solved by threading methods.
The second Asilomar meeting may have re-boosted the hype. But,
automatic threading methods still are fairly inaccurate, both
in recognising the correct fold and in getting the alignment correct.
However, threading is clearly superior to traditional sequence
alignments at this low level (<25%) of sequence identity.
Database searches in practice
Initial search through protein databases. The most common tools for starting a database search are BLAST (Altschul et al., 1990; Altschul, 1993; Altschul & Gish, 1996), and FASTA (Pearson, 1990; Pearson & Miller , 1992; Pearson, 1995; Pearson, 1996). All are available via various WWW sites (EBI, BCM, NCSA Workbench; links in (Rost WWW & Schneider, 1996)). If you have a long sequence or want to submit a large set of sequences to database servers, you may prefer to use email interfaces. Some of the refined alignment programs will run fast searches automatically as a pre-filter (e.g. MAXHOM, links in (Rost WWW & Schneider, 1996)).
Composition bias. Some of the search tools automatically check regions that are composition biased (e.g., Gly-Arg-Ala in DNA binding proteins) by running the SEG program ((Wootton & Federhen, 1993; Wootton & Federhen, 1996); e.g., BLAST (Altschul & Gish, 1996)). To be sure you may submit your sequence additionally to an analysis of composition bias (SAPS, links in (Rost WWW & Schneider, 1996)).
Selecting the putative homologues from the hit list. Given a list of proteins aligned (hit list) to U, which ones are likely to be homologues? Where to set the threshold in terms of the score reported by the alignment program (e.g. the BLAST score)? Few programs help you with that decision (e.g. MAXHOM, links in (Rost WWW & Schneider, 1996)); or the consistent alignment parser CAP in BLAST (Altschul & Gish, 1996)). The best strategy may be to select a large number of proteins from the hit list (>100) and repeat a more accurate full dynamic programming alignment of U against this list (rather than against the entire database).
Finding sequence motifs. A complement to alignment programs are tools that search for motifs, blocks or patterns. Examples on the WWW are PROSITE (Bairoch et al., 1997), PFSCAN (links in (Rost WWW & Schneider, 1996))), or the tools associated with databases of block alignments (BLOCKS (Henikoff, 1991; Henikoff & Henikoff, 1996a; Pietrokovski et al., 1996; Henikoff et al., 1997), PRINTS, PRODOM; links in (Rost WWW & Schneider, 1996))). Motifs can be used to detect more distantly related homologues (<30% sequence identity) and to refine the alignment (e.g. by defining a family specific profile that can be given as input to, e.g., CLUSTALW (Thompson et al., 1994; Higgins et al., 1996)).
Refining the alignment. There are various ways to automatically refine alignments. One is simply running a more accurate dynamic programming algorithm (BIOACCELERATOR or SSEARCH, or MAXHOM). Another is to build a profile and to thus generate a more family-specific alignment (e.g. by ClustalW (Higgins & Sharp, 1988; Higgins et al., 1992; Higgins et al., 1996). An approach that will work better the better the alignment you start with, are hidden Markov models (SAM, (Hughey & Krogh, 1996)). An expert tool that enables you to investigate the final alignment even more is TopAlign (links in (Rost WWW & Schneider, 1996)).
Principle concept of alignments
Evolution distinguishes signal from noise. At the level of protein molecules, selective pressure results from the need to maintain function, which in turn requires maintenance of the specific 3D structure (Chothia & Lesk, 1982; Doolittle, 1986; Lesk, 1988a; Pastore & Lesk, 1990; Doolittle, 1992; Carrell & Lesk, 1994; Doolittle, 1994; Murzin et al., 1994a) This evolutionary history is the basis for attempts to align protein sequences, i.e., to optimally detect equivalent positions in strings of amino-acid letters. Aligning protein sequences may appear to be purely a problem of matching letters. However, sequence alignments unravel information about structural and functional relations between residues in different proteins. Obviously it is not trivial to map the complexity of factors determining protein structure and function onto 1D relations between letters.
Finding the best match for two strings. Goal is to find the best match between two strings of letters (amino acids or nucleotide acids). The solution of this problem is in principle trivial. The main constraint is the computer time required when more than 150,000 pair alignments have to be explored (run of U against TREMBL and SWISS-PROT). This led to the development of a variety of fast alignment programs. Once the time demanding task of scanning the entire database is accomplished, more refined dynamic programming-based (or hidden Markov model-based) alignment programs can be applied to refine the alignment for the hits found. For more sensitive searches, biological knowledge has to be included by basing the alignment on profiles for residue exchange probabilities.
Task trivial for high levels of sequence identity. Any sequence analysis starts with database searches: all known databases are scanned by sequence alignment procedures for proteins homologous to the search sequence U . When the pairwise sequence identity between U and a putative homologue H is over 25-30% (for more than 80 residues), alignment procedures are usually straightforward (Bryant & Amzel, 1987; Altschul & Gish, 1996; Higgins et al., 1996; Pearson, 1996; Taylor, 1996). For less similar protein pairs, alignments may fail (Henikoff & Henikoff, 1993; Bordo et al., 1994; Vingron & Waterman, 1994; Henikoff & Henikoff, 1996b; Henikoff et al., 1997).
Drawback: lack of sufficiently tested cut-off criteria. One of the difficulties in comparing different alignment procedures is the lack of well-defined criteria for measuring the quality of an alignment. Very few papers have attempted to define such measures for the comparison of various methods (Henikoff & Henikoff, 1993; Pearson, 1995). The second problem is that most methods do not supply a cut-off criterion for distinguishing between homologous and non-homologous sequences (i.e., false positives). For some large sequence families remote homologues can be aligned correctly, but for most cases sequences with less than 25% sequence identity will be false positives, i.e., will have no structural or functional similarity to the guide sequence.
Length-dependent cut-off for significant sequence identity. The level of pairwise sequence identity (percentage of identical residues between two aligned proteins) is not sufficient to define the twilight zone. Instead, Schneider & Sander have noticed that the twilight zone is defined by sequence identity and alignment length (Sander & Schneider, 1991). Analysing the relatively small number of protein pairs that were available in 1990, they defined a length-dependent threshold for significant sequence identity (Sander & Schneider, 1991). The threshold curve defined (dubbed HSSP-curve, here) was an exponential that saturated at 25% sequence identity over more than 80 residues. 1990, there were no two proteins in PDB (Bernstein et al., 1977) with more than 30% sequence identity over more than 80 residues that did not have similar structures. Is this still true for the five times larger PDB of 1997? Furthermore, most protein pairs above the threshold evolved by maintaining structure. Did most pairs below the threshold have different structures? A re-evaluation of this analysis based on a 1000 times larger database slightly shifts the level of when sequence alignments appear trivial (Rost, 1997a).
Fast, non-optimal alignments
The two most widely used fast alignment tools are FASTA (Pearson, 1996) and BLAST (Altschul & Gish, 1996). The principle assumption is that most similarities between two sequences can be detected by local rather than global alignments (Sankoff & Kruskal, 1983; Sankoff & Goldstein, 1989; Sankoff, 1990; Altschul & Gish, 1996). FASTA and BLAST base on slightly different concepts. FASTA first searches 'words' of residues with a minimal length of two identical between the two sequences; then the aligned regions are widened based on profiles. BLAST first lists words of residues in one sequence typically of length four with high scoring information content; then the database is scanned for words identical to the words in the list of highly informative words; finally the words are expanded to segments (profile-based). An additional detail is that statistics are based on regions without composition bias (Wootton & Federhen, 1993; Pearson, 1996; Wootton & Federhen, 1996).
Knowledge-based exchange matrices (profiles)
Some amino acid exchanges (e.g. L -> I) are more or less neutral in terms of maintaining structure and/or function. This suggests basing the alignment of two sequences on exchange matrices (or profiles) that capture physico-chemical properties either directly or indirectly by database extraction (Dayhoff, 1978; Dayhoff et al., 1979; Dayhoff et al., 1983; Dayhoff, 1984). The latter was initially proposed by Dayhoff and her co-workers who measured evolutionary distance by the PAM (Percentage of Acceptable point Mutations per 106 years) matrix. For example, PAM 256 corresponds to an exchange of 80% of all amino acids (with different probabilities for different amino acids: S has a high mutability, W a low variability). Recently other exchange matrices have been proposed (Henikoff & Henikoff, 1993; Henikoff et al., 1997). Which exchange matrix to use? Jorja and Steven Henikoff have systematically compared performance for various exchange matrices (Henikoff & Henikoff, 1993). The results favoured on average the BLOSUM62 matrix (on average 62% of the residues exchanged. However, none of the matrices investigated performed worse than BLOSUM62 on all test examples. Thus, given a particular search protein U, a priori there is no 'best' exchange matrix. A useful way around this problem is to re-align based on different exchange metrices (Feng et al., 1985; Risler et al., 1988; Gonnet et al., 1992; Miyazawa & Jernigan, 1993; Wako & Blundell, 1994b; Altschul & Gish, 1996; Barton, 1996; Feng & Doolittle, 1996; Gribskov & Veretnik, 1996; Higgins et al., 1996; Thompson & Goldstein, 1996a).
Slow, optimal alignments (dynamic program-ming)
What is the optimal alignment? Alignments are intended to unravel evolutionary pathways and/or structural homology between two proteins. These two objectives (functional/structural) may be mutually contradictory, i.e., the 'optimal' alignment may differ according to the objective. Yet another perspective is the 'mathematical' optimal alignment. This is the alignment that optimises a given objective function, e.g., to find the alignment with the highest number of pairwise identical residues. FASTA and BLAST are not guaranteed to find such a mathematically optimal alignment. The most simple objective is to optimise the percentage of residues that are identical between the two sequences. A dynamic programming algorithm is guaranteed to find the optimal solution for the problem in an algorithmic time quadratic in the length of both sequences (Needlman & Wunsch, 1970) (Fig. 2.1). For the alignment of protein sequences this simple approach is not sufficient as finding the best alignments, usually, requires to introduce gaps in one sequence, or insertions in the other (Smith & Waterman, 1981b) (Fig. 2.1: rather than placing two dots into sequence T , residues A and E could be deleted in sequence U ; note: the gap increases the score from 2 to 4 identical residues). The introduction of such gaps is mathematically treated by adding a constant (gap open penalty) to the final score (here number of identical residues). Usually, gap penalties (cost of inserting and extending gaps) are chosen to be length dependent. Typically, the cost of extending a gap (gap elongation) is 5-10 times lower than is the cost for introducing a gap (gap open). The optimal choice of gap penalties depends on the particular method and, in detail, on the particular sequence family (Altschul, 1989; Vingron & Waterman, 1994; Altschul & Gish, 1996; Bork & Gibson, 1996; Higgins et al., 1996; Taylor, 1996). However, to sensitively align protein sequences, this still is not sufficient. The major addition to the simple approach described so far is to evaluate scores not based on residue identities but based on bio-chemical properties of amino acid. For example, aligning two hydrophobic residues (I and L) is more beneficial than aligning a hydrophobic and a charged residue (L and K; note: when treating hydrophobic residues as identical, the score for the best gapped alignment in Fig. 2.1 increases from 4 to 6). Dynamic programming methods assure the optimal global or local alignment by simply exploring all possible alignments and choosing the best. An important problem is the treatment of gaps, i.e., residue inserted (or deleted) to optimise the objective function.
Multiple alignments
Merging pairwise alignments into a multiple alignment. The concept of dynamic programming cannot be extended to align more than three sequences optimally. A way around this problem is to first find optimal pairwise alignments and to then merge the pairs (Sankoff & Kruskal, 1983; Doolittle, 1986; Doolittle et al., 1986; Sankoff & Goldstein, 1989). This procedure is illustrated by the program MAXHOM implemented for the PredictProtein prediction service (Rost et al., 1994b; Rost, 1996b; Rost WWW, 1997) or the generation of the HSSP database (Sander & Schneider, 1991; Schneider et al., 1997). (1) A fast algorithm (BLAST) is used to scan the database for possible homologues. (2) The list of putative homologues is filtered by a length-dependent cut-off threshold for structural homology (Sander & Schneider, 1991). (3) All sequences that fall above the threshold are aligned consecutively to the guide sequence (U) by a standard dynamic programming algorithm (Needlman & Wunsch, 1970; Sankoff, 1972; Smith & Waterman, 1981a; Smith et al., 1981; Sankoff & Kruskal, 1983; Waterman & Byers, 1985; Vingron & Argos, 1989; Vingron & Waterman, 1994). (4) After each sequence has been added to the alignment an alignment profile is compiled and used to align the next sequence. (5) After all the sequences have been aligned the profile is recompiled and the dynamic programming algorithm starts once again to align consecutively the sequences, this time using the conservation profile as derived after completion of the first sweep. A slightly different concept is to additionally sort the aligned sequences according to evolutionary trees (Feng & Doolittle, 1987; Feng & Doolittle, 1996; Higgins et al., 1996; Taylor, 1996). In general the power of multiple alignment methods grows with the databases (Depiereux & Feytmans, 1991; Deperieux & Feytmans, 1992; De Filippis et al., 1994; Bryant & Altschul, 1995; Mizuguchi & Go, 1995; Neuwald et al., 1995; Pearson, 1995; Altschul & Gish, 1996; Barton, 1996; Gribskov & Veretnik, 1996; Henikoff & Henikoff, 1996b; Higgins et al., 1996; Krakauer et al., 1996; McClure et al., 1996; Orengo & Taylor, 1996; Pearson, 1996; Poch & Delarue, 1996; Taylor, 1996; Thompson & Goldstein, 1996a).
Hidden Markov models and genetic algorithms. Recently, a method different from dynamic programming is gaining ground in the field of generating high quality multiple alignments: the hidden Markov models (Brown et al., 1993; Haussler et al., 1993; Baldi et al., 1994; Krogh et al., 1994; Sakakibara et al., 1994; Eddy, 1995; Bucher & Hofmann, 1996; Hughey & Krogh, 1996; McClure et al., 1996). The principle idea is to deduce a family specific model reflecting evolutionary processes and to align new sequences according to the model derived. The idea is the same as for traditional family specific profiles used with dynamic programming, the details of the mathematics involved are, however, different (Gribskov & Veretnik, 1996; Hughey & Krogh, 1996). Hidden Markov-based alignments appear to be particularly sensitive to detecting less obvious homologues. Another interesting method is the application of genetic algorithms to the multiple alignment problem; the particular advantage being that any objective function can be optimised by the genetic algorithm (Notredame & Higgins, 1996). A large-scale analysis of the quality of different methods has yet to be accomplished.
Sequence space hopping.
If we could plot the space of protein sequences, would we observe
the protein families as islands? Unfortunately, we cannot tell.
Nevertheless, useful information has been extracted from sequence
(Casari et al., 1995), or structure (Maiorov & Crippen, 1995)
space. In everyday database searches, protein families are widened
by hopping in sequence space: (1) a query sequence U is aligned
against a database, say SWISS-PROT, (2) all sequences aligned
at levels of significant similarity are used as new seeds Ui,
and for each Ui SWISS-PROT is searched again, (3) this procedure
may be repeated until now new sequences are found. Sequence space
hopping may be used in combination with knowledge from structures
to further widen families (Holm & Sander, 1997), or to increase
the information contained in multiple sequence alignments input
to prediction methods (Rost, 1996b; Rost WWW, 1997). A genial
novel idea has been proposed recently by Park, Teichmann, Hubbard
and Chothia (Park et al., 1997): sequence space hopping can be
used to reduce the number of false positives detected in the twilight
zone of sequence alignments. Indeed, sequence space hopping appears
to be rather powerful - insensitive to parameter choices (Park et al., 1997; Rost, 1997a).
Homology modelling in practice
If your database search with U picked a homologue
that has known 3D structure (i.e. is deposited in PDB, you can
predict 3D structure for U by homology modelling. The problem
is that homology modelling is still not trivial (Moult et al., 1995).
There is currently only one WWW server for homology modelling
(SWISS-MODEL, links in (Rost WWW & Schneider, 1996)). If
you venture along this path, be careful not to over-interpret
results (Rost & Valencia, 1996)!
Principles of homology modelling
Basic concept. An analysis of sequence alignments for proteins of known structure reveals that all protein pairs with more than 30% pairwise sequence identity (for alignment length > 80; (Eisenberg et al., 1984) have homologous 3D structures, i.e., the essential fold of the two proteins is identical, details such as additional loop regions may vary. This is the pillar for the success of homology modelling. The principal idea is to model the structure of U (protein of unknown structure) based on the template of a sequence homologue of known structure (T). The accuracy of homology modelling depends on the level of similarity between U and T (Correa, 1990; Greer, 1990; Overington et al., 1990; Sali et al., 1990; Fields et al., 1991; Greer, 1991; Lesk & Boswell, 1992; Levitt, 1992; Overington et al., 1992; Overington, 1992; Johnson et al., 1993; Koymans et al., 1993; Sali & Blundell, 1993; Srinivasan & Blundell, 1993; Topham et al., 1993; Blundell, 1994; De Filippis et al., 1994; May & Blundell, 1994; Sali & Blundell, 1994; Vriend et al., 1994; Cardozo et al., 1995; Chinea et al., 1995; Mosimann et al., 1995; Sali et al., 1995; Vinals et al., 1995; Johnson et al., 1996).
High level of sequence identity: atomic resolution. The basic assumption of homology modelling is that U and T have identical backbones (main chain C). The task is to correctly place the side chains of U into the backbone of T . For very high levels of sequence identity between U and T (ideally differing by one residue only), side chains can be 'grown' during molecular dynamics simulations (Karplus & Petsko, 1990). For slightly lower levels (still of high sequence similarity), side chains are built based on similar environments in known structures (Summers & Karplus, 1990; May & Blundell, 1994). Rotamer libraries (libraries containing all side-chain orientations observed in known structures) are used in the following way. (1) Rotamer distributions are extracted from a database of non-redundant sequences. (2) Fragments of seven (helix, strand) or five residues (other) are compiled. (3) Fragments of the same length are successively shifted through the backbone of U . (4) For modelling the side chains of U only those fragments from the rotamer library are accepted which have the same amino acid in the centre as U , and for which the local backbone is similar to that around the evaluated position). Over the whole range of sequence identity between U and T for which homology modelling is applicable, the accuracy of the model drops with decreasing similarity. For levels of at least 60% sequence identity, the resulting models are quite accurate (May & Blundell, 1994; Moult et al., 1995), for even higher values, the models are as accurate as is experimental structure determination. The limiting factor is the computation time required. How accurate is homology modelling for lower levels of sequence identity?
Low level of sequence identity: loop regions sometimes
correct. With decreasing sequence identity
between the known structure H and the query protein U
the number of loops that have to be inserted to align the two
grows. An accurate modelling of loop regions, however, implies
solving the structure prediction problem. The problem is simplified
in two ways. First, loop regions are often relatively short and
can thus be simulated by molecular dynamics (note the CPU time
required for molecular dynamics simulations grows exponentially
with the number of residues of the polypeptide to be modelled).
Second, the ends of the loop regions are fixed by the backbone
of the template structure. Various methods are employed to model
loop regions. The best have the orientation of the loop regions
correct in some cases (Moult et al., 1995). This illustrates
the current limitations of molecular dynamics: not even short
loop regions can be predicted from sequence. Furthermore, for
experimental structure refinement (use of molecular dynamics to
improve consistency, and accuracy of experimental data) molecular
dynamics is successfully applied to find a better solution when
starting from an almost correct structure. However, for homology
modelling, molecular dynamics refinement usually reduces prediction
accuracy (Moult et al., 1995). Below about 40% sequence identity
the accuracy of the sequence alignment used as basis for homology
modelling becomes an additional problem. A pessimistic view is
that the accuracy of resulting 3D predictions is typically at
the level of ribbon plots, i.e. the mutual orientation of elements
such as helices and sheets can be identified. The optimistic
version is that even down to levels of 30% sequence identity homology
modelling occasionally yields correct predictions at atomic resolution.
Whatever the correct view, even down to levels of 25-30% sequence
identity, homology modelling produces coarse-grained models for
the overall fold of proteins of unknown structure.
Threading in practice
Despite a large number of threading programs, very few are made available via the internet (Rost, 1995b; Fischer & Eisenberg, 1996; Rost WWW & Schneider, 1996; Rost et al., 1997b). The major arguments may be financial interests, and the fact that interpreting threading results usually requires expertise.
Principles of threading
Basic concept. As noted in the previous section, naturally evolved sequences with more than 30% pairwise sequence identity have homologous 3D structures (Sander & Schneider, 1991). Are all others non-homologous? Not at all. In the current PDB database there are thousands of pairs of structurally homologous pairs of proteins with less than 25% pairwise sequence identity (remote homologues). Actually, most similar protein structures are such remote homologues (Rost, 1997b; Rost et al., 1997a). If a correct alignment between U (sequence of unknown structure) and a remote homologue T (pairwise sequence identity to U < 25%) is given, one could build the 3D structure of U by (remote) homology modelling based on the template of T . A successful remote homology modelling must solve three different tasks. (1) The remote homologue (T) has to be detected. (2) U and T have to be correctly aligned. (3) The homology modelling procedure has to be tailored to the harder problem of extremely low sequence identity (with many loop regions to be modelled). Most methods developed so far have been primarily addressed to solve the first, and the second problem. The basic idea is to thread the sequence of U into the known structure of T and to evaluate the fitness of sequence for structure by some kind of environment-based or knowledge-based potential (Ouzounis et al., 1993; Fischer et al., 1996b). Threading is in some respects a harder problem than is the prediction of 3D structure (NP-complete: (Lathrop, 1994); no physical connection between remote homologues, as many remotely homologous protein pairs may have originated from different ancestors, Rost, in preparation; (Sippl, 1995)). However, the stakes are high: solving the threading problem could enable the prediction of thousands of protein structures. Indeed, threading has evolved to become one of the most active fields in the arena of protein structure prediction (with well over 100 annual publications).
Variety of threading techniques. The optimism generated by one of the first papers on threading published in the 90s (Bowie et al., 1990; Bowie et al., 1991) has boosted attempts to develop threading methods (Bowie et al., 1990; Hendlich et al., 1990; Sippl, 1990; Eisenberg et al., 1991; Lüthy et al., 1991; Godzik & Skolnick, 1992; Lüthy et al., 1992; Maiorov & Crippen, 1992; Sippl et al., 1992; Sippl & Lackner, 1992; Sippl & Weitckus, 1992; Blundell & Johnson, 1993; Bryant & Lawrence, 1993; Godzik et al., 1993; Jones & Thornton, 1993; Miyazawa & Jernigan, 1993; Nishikawa & Matsuo, 1993; Ouzounis et al., 1993; Sippl, 1993b; Skolnick et al., 1993; Stultz et al., 1993; Taylor, 1993; Topham et al., 1993; Wilmanns & Eisenberg, 1993; Wodak & Rooman, 1993; Abagyan et al., 1994; Bauer & Beyer, 1994; Crippen & Maiorov, 1994; Kocher et al., 1994; Lathrop, 1994; Sippl & Jaritz, 1994; Zhang & Eisenberg, 1994; Braxenthaler & Sippl, 1995; Bryant & Altschul, 1995; Flöckner et al., 1995; Gerloff et al., 1995; Hubbard & Park, 1995; Jones et al., 1995; Koehl & Delarue, 1995; Lemer et al., 1995; Madej et al., 1995a; Madej et al., 1995b; Matsuo & Nishikawa, 1995; Rost, 1995b; Rost, 1995a; Sippl, 1995; Wang et al., 1995; Wilmanns & Eisenberg, 1995; Finkelstein & Reva, 1996; Fischer & Eisenberg, 1996; Fischer et al., 1996a; Fischer et al., 1996b; Johnson et al., 1996; Miller et al., 1996; Reva & Finkelstein, 1996; Russell et al., 1996; Rost et al., 1997b). The principle idea has been to use structural propensities of amino acids (such as preferences for secondary structure formation, hydrophobicity, and polarity), and to then assess whether or not a given sequence with its structural preferences fits into the structural environment of a given structure (Fischer et al., 1996b). A principally different approach has been pushed by Manfred Sippl (Sippl & Weitckus, 1992; Sippl, 1995). The idea is to use the rich knowledge deposited in the database of protein structures (PDB) by extracting mean-force potentials. Such potentials monitor the observed distances between residue pairs of particular amino acids, with a particular sequence separation (number of residues between the two). Until 1995, most threading methods have used mean-force-potentials (Wodak & Rooman, 1993; Bryant & Altschul, 1995; Sippl, 1995). A more recent generation of threading methods is based on 1D predictions (Rost, 1995a; Rost, 1995b; Fischer & Eisenberg, 1996; Russell et al., 1996; Rost et al., 1997b): first 1D structure (secondary structure and solvent accessibility) is predicted for a sequence of unknown structure, then the 1D structure is extracted for a library of known structures, and finally the observed and the predicted 1D structure strings are aligned by typical dynamic programming algorithms (Smith & Waterman, 1981b). Has all this effort achieved to crack the hard nut threading?
Remote homologues can often be detected. First the good news: since the different mean-force-potentials which have been proposed capture different aspects of protein structure, the correct remote homologue is likely to be found by at least one of them (Lemer et al., 1995). Now the bad news: so far, no single method has been able to detect the correct remote homologue for more than half of all test cases (Lemer et al., 1995). For the methods which have been rigorously evaluated using large test sets, the correct remote homologue is detected in less than 40% of all cases (Rost et al., 1997b). However, this performance is clearly superior to that of traditional sequence alignments at this low level (<25%) of sequence identity. Furthermore, the success of the last Asilomar experiment on structure prediction (forthcoming issue of Proteins) suggests that the likelihood to detect the correct remote homologue is reasonably high when the choice is refined by experts.
3D prediction by threading is still not reliable.
Detecting the remote homology is only the first of the three
obstacles. It appears that the second obstacle (correct alignment
between U and T ) is much more difficult and, unfortunately,
there is no general solution so far. Thus the final step, building
a 3D model, usually fails since the modelling procedures available
today cannot correct the mistakes in the alignments. Although
the last Asilomar experiment on structure prediction (forthcoming
issue of Proteins) suggested that major improvements have been
accomplished over the last two years, there are still very few
publications to date which report accurate 3D predictions from
threading methods. Currently, the successful use of threading
methods requires sceptical, expert user intervention to spot wrong
hits and false alignments. It is still possible that threading
method will become the most successful structure prediction method,
but a lot of detailed work lies ahead.
Prediction in 1D
Secondary structure can usually be predicted more accurately and
reliably than other features of protein structure. Most of the
recent methods for secondary structure prediction rely heavily
upon evolutionary information. Use of evolutionary information
triggered a leap in prediction accuracy, as well, as in the usefulness
of prediction methods. Various methods for predicting accessibility
have been developed recently. Although residue solvent accessibility
is not as well conserved within structural families as is secondary
structure, prediction accuracy is much improved by including evolutionary
information. Predictions of solvent accessibility have also been
used successfully for prediction-based threading, and as basis
for predicting functional sites. Predicting the locations of
the transmembrane helices is a task comparable to secondary structure
prediction. The use of multiple alignment information has been
shown to improve prediction accuracy. Currently, the best methods
predict all transmembrane helices correctly for about 85-90% of
all test proteins. Further methods have been developed for predicting
transmembrane-helix topology, i.e. the orientation of the helices
with respect to the membrane.
Prediction in 2D
Given only a small fraction of the inter-residue distances, it
is possible to calculate the 3D structure using either metric
matrix distance geometry or simulated annealing by molecular dynamics.
Two attempts have been made to use evolutionary information for
prediction of inter-residue contacts. The first was a method
for predicting contacts between beta-strands from multiple sequence
alignments and alignment-based predictions of secondary structure.
The second attempt has been in the development of a group of
methods for prediction of inter-residue contacts from correlated
mutations. In general, the prediction accuracy is rather poor,
with a direct trade-off between the Scylla of predicting enough
contacts, and the Charibdis of predicting only correct ones, e.g.,
taking 5% of the best-predicted long-range contacts (sequence
separation above 10 residues) the accuracy prediction is about
50%. Although this level of accuracy is not high, it is sufficient
to distinguish between correct and incorrect alignments in threading
experiments.
Prediction in 3D
Simplified force-fields in combination with dynamic optimisation
strategies have yielded promising, but still relatively inaccurate
results. Srinivasan and Rose have reported very encouraging results
with their hierarchical search method (1995), however they have
not repeated the initial claims, so it may be that the initial
report was too optimistic. I consider that the single most important
theoretical advance in 3D prediction in the recent years has been
the development of mean-force-potentials (Sippl, 1990, 1992, 1995).
Before these potentials, structure prediction was normally done
with 'physical' potentials, i.e., bonds, angles, torsion angles,
and van der Waals as well as electrostatic non-bonded terms which
describe the internal energy of the molecule. In contrast, the
mean-force-potentials, derived from databases of protein structure,
attempt to describe the free-energy of the molecule. Consequently,
mean-force-potentials of pairwise residue distances are quite
successful in fold recognition, as well as remote homology modelling.
It remains to be seen how best to combine these two different
potentials. In one pilot study on the use of mean-force-potentials
for 3D structure prediction, best results where obtained by combining
both potentials.
1D prediction in practice
Prediction of 1D aspects of 3D structure (e.g. secondary structure, solvent accessibility, transmembrane helices, coiled-coils) is a much simpler task than homology modelling. This becomes obvious by the large number of services that have mushroomed since the first prediction service PredictProtein went on line in 1992 (Tim Hubbard and Jong Park at MRC, Cambridge, England collected some services; links in (Rost WWW & Schneider, 1996)). Some of the services are also available via email. Unfortunately, not all services are sufficiently tested. In general, prediction accuracy is significantly superior if predictions are based on multiple alignments (Barton, 1995; Di Francesco et al., 1996; Rost, 1996b; Rost & Sander, 1996). (1D prediction services listed in section I.)
Principles of secondary structure prediction
Basic concept. The principal idea underlying most secondary structure prediction methods is the fact that segments of consecutive residues have preferences for certain secondary structure states (Szent-Györgyi & Cohen, 1957; Blout et al., 1960; Scheraga, 1960; Blout, 1962; Davies, 1964; Guzzo, 1965; Prothero, 1966; Cook, 1967; Kotelchuck & Scheraga, 1968; Kotelchuck & Scheraga, 1969; Ptitsyn, 1969; Finkelstein & Ptitsyn, 1971; Robson & Pain, 1971; Tinoco et al., 1971; Kabat & Wu, 1973b; Kabat & Wu, 1973a; Nagano, 1973; Burgess et al., 1974; Chou & Fasman, 1974; Fitch, 1974; Lim, 1974; Schulz et al., 1974; Pipas & McMahon, 1975; Dickerson et al., 1976; Levitt & Chothia, 1976; Maxfield & Scheraga, 1976; Robson, 1976; Levitt & Greer, 1977; Argos et al., 1978; Garnier et al., 1978; Sternberg & Thornton, 1978; Maxfield & Scheraga, 1979). Thus, the prediction problem becomes a pattern-classification problem tractable by pattern recognition algorithms. The goal is to predict whether the residue at the centre of a segment of typically 13-21 adjacent residues is in a helix, strand or in none of the two (no regular secondary structure, often referred to as the 'coil' or 'loop' state). Many different algorithms have been applied to tackle this simplest version of the protein structure prediction problem: physico-chemical principles, rule-based devices, expert systems, graph theory, linear and multi-linear statistics, nearest-neighbour algorithms, molecular dynamics, and neural networks (Cohen et al., 1980; Nussinov & Jacobson, 1980; Cohen et al., 1981; Noller & Woese, 1981; Ooi & Takanami, 1981; Argos & Palau, 1982; Cohen et al., 1982; Lapalme et al., 1982; Nishikawa & Ooi, 1982; Palau et al., 1982; Shapiro et al., 1982; Sternberg & Cohen, 1982; Cohen et al., 1983; Fitch, 1983; Nishikawa et al., 1983; Ptitsyn & Finkelstein, 1983; Taylor & Thornton, 1983; Eisenberg et al., 1984; Kabsch & Sander, 1984; Taylor, 1984; Taylor & Thornton, 1984; Garratt et al., 1985; Mironov et al., 1985; Sheridan et al., 1985; Burridge & Todd, 1986; Cohen et al., 1986; Klein, 1986; Klein & DeLisi, 1986; Klein et al., 1986; Leszczynski & Rose, 1986; Levin et al., 1986; Nishikawa, 1986; Nishikawa & Ooi, 1986; Crawford et al., 1987; Deleage & Roux, 1987b; Deleage & Roux, 1987a; Gibrat et al., 1987; Kypr, 1987; Zvelebil et al., 1987; Biou et al., 1988; Bohr et al., 1988; Cedergren et al., 1988; Gascuel & Golmard, 1988; Kanehisa, 1988; Lesk, 1988b; Levin & Garnier, 1988; Qian & Sejnowski, 1988; Richardson & Richardson, 1988; Garnier & Robson, 1989; Holley & Karplus, 1989; McGregor et al., 1989; Staden, 1989; Benner & Gerloff, 1990; Bossa & Pascarella, 1990; Fasman, 1990; Fogelman-Soulié & Mejía, 1990; Johnson, 1990; King & Sternberg, 1990; Kneller et al., 1990; Niermann & Kirschner, 1990; Oas et al., 1990; Sternberg & King, 1990; Wilmot & Thornton, 1990; Benner & Gerloff, 1991; Garratt et al., 1991; Lupas et al., 1991; Niermann & Kirschner, 1991; Nishikawa & Noguchi, 1991; Taylor, 1991; Viswanadhan et al., 1991; Brown, 1992; Hayward & Collins, 1992; Muggleton et al., 1992; Musacchio et al., 1992; Muskal & Kim, 1992; Presnell et al., 1992; Rost & Sander, 1992; Salzberg & Cost, 1992; Sternberg, 1992; Stolorz et al., 1992; Wu et al., 1992; Zhang & Chou, 1992; Zhang et al., 1992; Asai et al., 1993; Barton & Russell, 1993; Boscott et al., 1993; Chakrabartty et al., 1993; Cherkauer & Shavlik, 1993; Edelman, 1993; Fariselli et al., 1993; Geourjon & Deléage, 1993; Gerloff et al., 1993; Juretic et al., 1993a; Juretic et al., 1993b; Levin et al., 1993; Maclin & Shavlik, 1993; Metfessel et al., 1993; Presnell & Cohen, 1993; Reczko, 1993; Rost & Sander, 1993a; Rost & Sander, 1993b; Rost & Sander, 1993c; Rost et al., 1993; Sasagawa & Tajima, 1993; Tchoumatchenko et al., 1993; Yi & Lander, 1993; Aszódi & Taylor, 1994; Brunak et al., 1994; Donnelly et al., 1994; Geourjon & Deléage, 1994; Leng et al., 1994; Livingstone & Barton, 1994; Lupas et al., 1994; Munson et al., 1994; Perkins et al., 1994; Rost & Sander, 1994a; Rost & Sander, 1994b; Rost et al., 1994b; Solovyev & Salamov, 1994; Zimmermann, 1994; Barlow, 1995; Chandonia & Karplus, 1995; Chou, 1995; Frishman & Argos, 1995; Geourjon & Deléage, 1995; Jenny et al., 1995; Mehta et al., 1995; Muñoz & Serrano, 1995; Salamov & Solovyev, 1995; Zhu, 1995; Brunak & Engelbrecht, 1996; Chandonia & Karplus, 1996; Cohen & Presnell, 1996; Di Francesco et al., 1996; Eisenhaber et al., 1996b; Frishman & Argos, 1996; Garnier et al., 1996; Gerloff & Cohen, 1996; Hansen et al., 1996; Lupas, 1996a; Minor & Kim, 1996; Muñoz & Serrano, 1996; Riis & Krogh, 1996; Rost, 1996b; Rychlewski & Godzik, 1996). However, until recently performance accuracy seemed to have been limited to about 60% (percentage of residues correctly predicted in either helix, strand, or other). The limited accuracy was argued to result from the fact that all methods used only information local in sequence (window of less than 20 adjacent residues). Local information was estimated to account for roughly 65% of the secondary structure formation. Two additional problems were common to all methods developed from 1957 to 1993: (1) strands were predicted at levels of accuracy only slightly superior to random predictions, and (2) predicted secondary structure segments were, on average, only half as long as observed segments. The later two shortcomings could be surmounted by using a particular combination of neural networks (Rost, 1996b).
Evolutionary information key to significantly improved predictions. On the one hand, about 75 out of 100 residues can be exchanged in a protein without changing structure. On the other hand, exchanges of 1-5 residues can already destabilise a protein structure. These statements may appear contradictory. However, the explanation is simple: evolution has explored exactly the unlikely exchanges of particular amino acids at particular positions that do not change structure, as a change of structure usually results in a loss of function (and thus would not survive). Thus, the residue exchange patterns extracted from a protein family (i.e. alignments of similar sequences) are highly indicative of the specific structural details for that family. The first method that reached a sustained level of a three-state prediction accuracy above 70% was the profile-based neural network system PHD which uses exactly such evolutionary information derived from multiple sequence alignments as input (Rost, 1996b). By stepwise incorporation of particular evolutionary information, prediction accuracy has been pushed above 72% accuracy (Rost, 1996b). An interesting, technical detail of this network system is that the use of a global 'descriptor', namely the overall amino acid composition (percentage of occurrence of each of the 20 amino acids) does not affect the local score for accuracy as measured by the percentage of correctly predicted single residues. Using amino acid composition, however, improves the accuracy in terms of a more global score, such as the difference between the percentage of observed and predicted secondary structure (Rost, 1996b). Was the neural network an essential tool for the most accurate secondary structure prediction? A nearest-neighbour algorithm can be used to incorporate evolutionary information in a similar manner as the neural network system; the result is a similar performance (Salamov & Solovyev, 1995). Methods combining statistics, and multiple alignment information have been clearly less successful, so far. In comparison to methods using single sequence information only, methods making use of the growing databases are 6-14 percentage points more accurate. Thus, using evolutionary information secondary structure can now be predicted more accurately and reliably than other features of protein structure.
Secondary structure predictions now extremely useful, in practice. How good is a prediction accuracy of 72% in practice? It is certainly reasonably good compared with the prediction of secondary structure by homology modelling (Rost et al., 1994c). However, prediction accuracy varies between different proteins, i.e., prediction accuracy is 72%±9% (one standard deviation) (Rost, 1996b). For applications this implies that predictions can be as good as > 95%, but also as bad as <54%. Can users distinguish one from the other? A few methods successfully use reliability indices allowing to label residues for which predictions are, on average, likely to be more accurate. Indeed, for the neural network system PHD the correlation between such a reliability index and accuracy is linear (Rost, 1996b). Thus, the reliability index effectively becomes a means to predict prediction accuracy, and hence to assess to which class a protein of unknown structure (U) belongs: to the well predicted, or to the badly predicted ones. Various methods successfully use secondary structure predictions as a first step, e.g., prediction-based threading (indeed one of the problems of the Asilomar 1996 prediction contest was that many developers of threading algorithms used the same PHD secondary structure predictions as a first step), inter-strand, and inter-residue distance predictions. However, the use of secondary structure predictions is not limited to structure prediction. Instead, the results of, for instance, the public prediction service (PredictProtein (Rost, 1996b)) have been used to assist the determination of protein structures (chain tracing in X-ray crystallography), as well as to formulate hypothesis about protein structure, and function that guided experiments in molecular biology, in general (in particular: prediction of binding sites, homologous proteins, design of residue mutations).
Separate prediction of secondary structure content not very useful. Proteins have been partitioned into various structural classes, e.g., based on the percentage of residues assigned to helix, strand, and other (Levitt & Chothia, 1976). However, such a coarse-grained classification is not well-defined (Rost & Sander, 1994b). Consequently, given a protein sequence U of unknown structure, attempts to first predict the secondary structure content for U and then to use the result to predict the secondary structural class (i.e., all-a, all-b or intermediates) is of limited practical use (Nishikawa & Ooi, 1982; Muskal & Kim, 1992; Metfessel et al., 1993; Chou, 1995; Eisenhaber et al., 1996a; Eisenhaber et al., 1996b). How do alignment-based predictions compare to experimental means of determining the content in secondary structure? For example, PHD is, on average surprisingly, about as accurate as circular dichroism spectroscopy (Perczel et al., 1991; Böhm et al., 1992; Park et al., 1992; Perczel et al., 1992; Rost & Sander, 1994b). Of course, this does not imply that predictions can replace experiments. In particular, variation of secondary structure as a result of changes in environmental conditions (e.g. solvent) is generally only accessible experimentally.
Principles of solvent accessibility prediction
Basic concept. It has long been argued that if the segments of secondary structure could be accurately predicted, the 3D structure could be predicted by simply trying different arrangements of the segments in space (Cohen & Presnell, 1996). One criterion for assessing each arrangement could be to use predictions of residue solvent accessibility (Lee & Richards, 1971; Chothia, 1976). The principal goal is to predict the extent to which a residue embedded in a protein structure is accessible to solvent. Solvent accessibility can be described in several ways (Lee & Richards, 1971; Chothia, 1976). The simplest is a two-state description distinguishing between residues that are buried (relative solvent accessibility < 16%) and exposed (relative solvent accessibility ³ 16%). The classical method to predict accessibility is to assign either of the two states, buried or exposed, according to residue hydrophobicity. However, a neural network prediction of accessibility has been shown to be superior to simple hydrophobicity analyses (Holbrook et al., 1990).
Evolutionary information improves prediction accuracy. Solvent accessibility at each position of the protein structure is evolutionarily conserved within sequence families. This fact has been used to develop methods for predicting accessibility using multiple alignment information (Rost & O'Donoghue, 1997). Prediction accuracy is about 75±7%, four percentage points higher than for methods not using alignment information (Benner et al., 1994; Rost & Sander, 1994c; Wako & Blundell, 1994a; Rost, 1996b; Thompson & Goldstein, 1996b). Predictions of solvent accessibility have also been used successfully for prediction-based threading, as a second criterion towards 3D prediction by packing secondary structure segments according to upper and lower bounds provided by accessibility predictions, and as basis for predicting functional sites (Rost & O'Donoghue, 1997). More recently, predictions of accessibility were also used successfully to predict sub-cellular location (Andrade et al., 1997).
Principles of predicting transmembrane helices
Basic concept. Even in the optimistic scenario that in the near future most protein structures will be either experimentally determined, one class of proteins will still represent a challenge for experimental determination of 3D structure: transmembrane proteins. The major obstacle with these proteins is that they do not crystallise, and are hardly tractable by NMR spectroscopy. Consequently, for this class of proteins structure prediction methods are even more needed than for globular water-soluble proteins. Fortunately, the prediction task is simplified by strong environmental constraints on transmembrane proteins: the lipid bilayer of the membrane reduces the degrees of freedom to such an extent that 3D structure formation becomes almost a 2D problem. Two major classes of membrane proteins are known: proteins which insert helices into the lipid bilayer, and proteins that form pores by a barrel of 16 strands (the only known cases of this type are porins (von Heijne, 1996)). Since there is not much experimental information available on different porin-like membrane proteins, we can hardly estimate prediction accuracy for this class. The situation is quite different for helical membrane proteins. Once the location of transmembrane segments is known for helical transmembrane proteins, 3D structure can be predicted by exploring all possible conformations (Taylor et al., 1994). Additionally, predicting the locations of these transmembrane helices is a much simpler problem than is the prediction of secondary structure for soluble proteins. Elaborated combinations of expert-rules, hydrophobicity analyses and statistics yields a two-state per-residue accuracy of about 90% (residues predicted correctly as either transmembrane helix, or other) (Degli Esposti et al., 1990; Fasman & Wilberg, 1990; Nilsson & von Heijne, 1990; von Heijne & Manoil, 1990; McGovern et al., 1991; Lemmon et al., 1992; Nakashima & Nishikawa, 1992; Park et al., 1992; von Heijne, 1992; Donnelly et al., 1993; Edelman, 1993; Fariselli et al., 1993; Juretic et al., 1993a; Oliveira et al., 1993; Sipos & von Heijne, 1993; Traxler et al., 1993; Wang et al., 1993; Casadio et al., 1994; Jones et al., 1994; Persson & Argos, 1994; Donnelly & Findlay, 1995; Neuwald et al., 1995; Efremov & Vergoten, 1996; Fariselli & Casadio, 1996; Huang et al., 1996).
Evolutionary information improves prediction accuracy.
For two methods the use of multiple alignment information is
reported to clearly improve the accuracy of predicting transmembrane
helices (Persson & Argos, 1996; Rost, 1996b). The best current
prediction methods have a similar high accuracy around 95%. One
such method uses a system of neural networks similar to the one
used to predict secondary structure (PHDhtm). In order to predict
the orientation of the helices (i.e. the topology) a simple rule
is applied: positively charged residues occur more often in intra-cytoplasmic
than in extra-cytoplasmic regions. The advanced neural network
system has been improved significantly by adding a dynamic programming
algorithm to the neural network output. The principle idea is
to use the neural network output as an energy landscape and to
find the optimal path through this landscape (Rost et al., 1996b).
As reliable data for the locations of transmembrane helices exists
only for a few proteins, data used for deriving these methods
originate predominantly from experiments in cell biology and gene-fusion
techniques. Different experimental groups often report different
locations for transmembrane regions. Thus, the level of 95% accuracy
is not verifiable. Despite this uncertainty in detail, the prediction
of transmembrane helices is a valuable tool to quickly scan entire
chromosomes (Rost et al., 1996b). The classification into membrane/not-membrane
proteins has an expected error rate of less than two percent,
i.e., about two percent of the proteins predicted to contain transmembrane
regions will probably be false positives. The predictions of
transmembrane helices has provided a lower bound to approach the
question of how many proteins organisms need for, e.g., communication:
the percentage of proteins with transmembrane helices has been
estimated to be about 25% for yeast and haemophilus influenzae,
and around 10-15% for mycoplasma genitalium and methanococcus
jannaschii (Rost T, 1996).
2D prediction in practice
No method currently available via the internet.
Principles of predicting inter-residue contacts
Prediction problem is a hard one, but the stakes are high. Given all inter-residue contacts or distances, 3D structure can be reconstructed by distance geometry or molecular dynamics. This is used for the determination of 3D structures by nuclear magnetic resonance (NMR) spectroscopy which produces experimental data of distances between protons (Nilges, 1996). Can inter-residue contacts be predicted? Obviously, some fraction of these contacts can be: helices and strands can be assigned based on hydrogen-bonding pattern between residues. Thus, a successful prediction of secondary structure implies a successful prediction of some fraction of all the contacts. However, contacts predicted from secondary structure assignment are short-ranged, i.e., between residues nearby in sequence. For a successful application of distance geometry, long-range contacts have to be predicted, i.e., contacts between residues far apart in sequence. A few methods have been proposed for the prediction of long-range inter-residue contacts (Nilges & Brünger, 1993; Saitoh et al., 1993; Galaktionov & Marshall, 1994; Goebel et al., 1994; Hubbard, 1994; Neher, 1994; Rodionov & Johnson, 1994; Shindyalov et al., 1994; Taylor & Hatrick, 1994; Aszodi et al., 1995; Galaktionov & Marshall, 1996; Reese et al., 1996). The most promising one may be a rather novel neural network based approach (Lund et al., 1997). Two questions surround such methods: first, how accurate are these prediction methods on average; and second, are all important contacts predicted?
Correlated mutations can imply spatial proximity. In sequence alignments, some pairs of positions appear to co-vary in a physico-chemically plausible manner, i.e., a 'loss of function' point mutation is often rescued by an additional mutation that compensates for the change (Altschuh et al., 1987; Altschuh et al., 1988). One hypothesis is that compensations would be most effective in maintaining a structural motif if the mutated residues were spatial neighbours. Attempts have been made to quantify such a hypothesis and to use it for contact predictions (Goebel et al., 1994; Neher, 1994; Shindyalov et al., 1994; Taylor & Hatrick, 1994). In general, prediction accuracy is rather poor, with a direct trade-off between predicting enough contacts, and predicting only correct ones, e.g., taking 5% of the best-predicted long-range contacts (sequence separation above 10 residues) the accuracy prediction is about 50% (A. Valencia, priv. communication).
Distinction between different models, no prediction of 3D, yet. Analysing correlated mutations is only one way to predict long-range inter-residue contacts. Other methods use statistics, mean-force potentials, or neural networks. So far none of the methods appears to find a path between the Scylla of missing too many true contacts and the Charibdis of predicting too many false contacts. However, some of the methods provide sufficient information to distinguish between alternative models of 3D structure (Valencia, manuscript in preparation). The ambitious goal to predict long-range inter-residue contacts sufficiently accurately will hopefully continue to attract intellectual resources.
Principles of predicting inter-strand contacts
Simplifying the contact prediction problem. One simplification of the problem to predict inter-residue contacts focuses on predicting the contacts between residues in adjacent strands. Such an attempt is motivated by the hope that such interactions are more specific than are sequence-distant (long-range) contacts in general, and hence are easier to predict.
Identifying the correct b-strand alignment. The only method published for predicting inter-strand contacts is based on potentials of mean-force (Hubbard, 1994) similar to those used in the evaluation of strand-strand threading (Lifson & Sander, 1980). Propensities are compiled by database counts for 2 ¥ 2 ¥ 2 classes (parallel/anti-parallel, H-bonded/not H-bonded, N-/C-terminal). Each of the eight classes is divided further into five sub-classes in the following way. Suppose the two strand residues at positions i and j are in close in space. Then the following five residue pairs are counted in separate tables: i/j-2, i/j-1, i/j, i/j+1, i/j+2 . Such pseudo-potentials identify the correct b-strand alignment in 35-45% of the cases.
Using evolutionary information to predict inter-strand
contacts. Even if the locations of strands
in the sequence are known exactly, the pseudo-potentials cannot
predict the correct inter-strand contacts in most cases (Hubbard, 1994).
However, when using multiple alignment information, the signal-to-noise
ratio increases such that inter-strand contacts have been predicted
correctly for most of the strands inspected in some test cases
(Hubbard, 1994). For the purpose of reliable contact prediction,
this result is inadequate, especially as the locations of the
strands are not known precisely. Can the pseudo-potentials handle
errors resulting from incorrect prediction of strands? Various
test examples using predictions by PHDsec (Rost, 1996b) as input
to the strand pseudo-potentials indicate that the accuracy in
predicting inter-strand contacts drops (T Hubbard, unpublished),
but in some cases is still high enough to be useful for approximate
modelling of 3D structure (Hubbard & Park, 1995).
3D prediction in practice
No method currently available via the internet.
Principles of 3D prediction
Recent breakthrough in structure prediction? In the 1994 Asilomar meeting, none of the 3D ab initio methods were able to predict the correct protein structure (Moult et al., 1995). Since that time, new methods have been proposed which indicate possible directions for the future. Several groups have obtained promising results using distance geometry methods (Rost & O'Donoghue, 1997). Simplified force-fields in combination with dynamic optimisation strategies have yielded promising, but still relatively inaccurate results (Elofsson et al., 1995; Pedersen & Moult, 1996). Srinivasan and Rose have reported very encouraging results with their hierarchical search method (Srinivasan & Rose, 1995). However, the second Asilomar experiment on structure prediction (forthcoming issue of Proteins) concluded similarly to the first: no prediction of 3D structure from sequence, yet.
Accurate prediction of 3D structure for coiled-coil proteins. A particular class of proteins are coiled-coils. These are proteins can be defined by a rather simple geometry of long helices, of which two or more wind around one another (Lupas, 1996a). Nilges and Brünger (Nilges & Brünger, 1993) have achieved atomic-accuracy in an ab initio prediction of the GCN4 leucine zipper using a hybrid molecular dynamics/simulated annealing search strategy. Recently, equally accurate models for three leucine zippers were obtained with faster calculations based on mean-force-potentials (O'Donoghue, in press).
Recognising incorrect structures. The single most important theoretical advance in 3D prediction in recent years may have been the development of mean-force-potentials. Before these potentials, structure prediction was normally done with 'physical' potentials, i.e., bonds, angles, torsion angles, and van der Waals as well as electrostatic non-bonded terms which describe the internal energy of the molecule (van Gunsteren, 1993). In contrast, the mean-force-potentials, derived from databases of protein structure (Sippl, 1990), attempt to describe the free-energy of the molecule. The physical potentials have been used very successfully to refine experimentally determined structures (Nilges, 1996). However, these terms cannot distinguish between a native fold and a grossly misfolded structure (Sippl, 1990). In contrast, mean-force-potentials of pairwise residue distances are quite successful in fold recognition, as well as remote homology modelling (Hendlich et al., 1990; Sippl, 1990; Casari & Sippl, 1992; Sippl et al., 1992; Sippl & Lackner, 1992; Sippl & Weitckus, 1992; Sippl, 1993a; Sippl, 1993b; Sippl & Jaritz, 1994; Sippl et al., 1994a; Sippl et al., 1994b; Braxenthaler & Sippl, 1995; Flöckner et al., 1995; Sippl, 1995). It remains to be seen how best to combine these two different potentials. In one pilot study on the use of mean-force-potentials for 3D structure prediction, best results where obtained by combining both potentials (O'Donoghue and Nilges, private communication).
Extracting principles about structure formation
from structures? The mean-force-potential
approach has recently been extended to study protein folding.
Both the physical basis and the general characteristics of protein
folding remain controversial (Honig & Cohen, 1996). Simulations
and other studies indicate that the free energy balance of hydrogen
bond formation is close to zero, or slightly unfavourable (Yang & Honig, 1995a; Yang & Honig, 1995b),
and that a specific fold is selected primarily by side-chain interactions
(Honig & Cohen, 1996). Recently, Sippl et al. have extended
the concept of deriving mean-force potentials to a formalism of
describing Helmholtz free energies of atom-pair interactions (Sippl, 1996; Sippl et al., 1996).
The formalism starts with the following two assumptions: (1)
that protein structures can be described by Helmholtz free energies
(or mean-force-potentials), and (2) that the distribution of intra-molecular
distances in experimentally determined protein structures does,
on average, not deviate substantially from the corresponding distribution
in native proteins. To normalise the absolute free energy contributions,
the ideal gas is chosen (no internal interactions). Without any
further assumptions or approximations, atom-atom mean-force-potentials
are derived from a data set of known protein structures. The
resulting Helmholtz mean-force-potentials unravel interesting
principles about protein structure formation. (1) Backbone H-bonds
(except for the a-helix
interaction Oi ... Ni+4)
do not contribute to the thermodynamic stability of native folds.
(2) H-bond formation (except for Oi
... Ni+4) requires energy
input that is regained when H-bonds are formed. Once formed,
H-bonds are locked in a deep, narrow minimum. (3) The energy
gain of forming one ionic or two hydrophobic contacts can provide
roughly the activation energy required for forming a H-bond.
Both the eloquence and the conclusions of the approach have prompted
strong criticism, even unanimous rejection of these findings.
Do we witness an error in a method laid out to spot errors, or
the begin of a new era of force fields? Further applications
of these mean-force-potentials will be needed to answer this question.
Challenging times for bio-informatics. In this year more protein sequences will be added to public databases by a couple of experts than had been deposited by all researchers between 1970 and 1990. Six years from now, we shall witness the deposition of the last unknown human sequence. What does this imply for bio-informatics? First of all, it means hurry up, be prepared to, at least, manage coping with the data explosion. But what does 'cope with it' mean? Well, store it, be ready to do something with it. What are the questions of interest? Eo ipso, genome data is of no value. However, the value is added by attaching tags to it. We need tools to extract the information contained in this data to, e.g., understand the function of the individual proteins encoded by the open reading frames (ORFs). We wish to automatically predict for each ORF the structure, and function of the resulting protein. Currently, this is impossible although much research is directed towards achieving these goals (Casari et al., 1996; Rost & O'Donoghue, 1997).
What is function? Protein sequence determines protein structure determines protein function. To simplify the cycle we sought to first predict protein structure, and to then use what we learned, both on the way to structure prediction, and from the predicted structure itself to predict function. Today, prediction of protein function may appear more urgent than prediction of protein structure. However, predicting protein function from sequence adds two additional problems in comparison to the unsolved task of structure prediction. (1) Function is not entirely determined by sequence; the environment is crucially important. (2) 'Protein function' is a rather intuitive but ill defined term. Function is a complex phenomenon associated with many mutually overlapping levels: chemical, biochemical, cellular, physiological, organism mediated, and developmental (Ouzounis et al., 1996; Tamames et al., 1996a; Tamames et al., 1996b). These levels are related in complex ways, e.g., protein kinases can be related to different cellular functions (such as cell cycle), and to a chemical function (transferase) plus a complex control mechanism by interaction with other proteins.
Focus on prediction of protein function? Have we learned better to predict protein function over the last 40 years? The good news first: we manage to some labels for most protein sequences in, e.g., the yeast genome (Casari et al., 1996). Now the bad news: this is far from enough. First, we manage to label only proteins for which we find similarities in other proteins of experimentally known function. Second, even for those we label, our knowledge is mostly quite restricted to rough details. Predictions of active sites, binding sites, specificity of binding asf. are usually not successful. Third, labelling does not imply understanding the detailed mechanism. Often we need knowledge about protein structure to manipulate functions. Fourth, labelling a number of proteins does not imply to manage the following task: find all with a particular function, as the labelling may accidentally not contain the appropriate keyword (slightly different function than the one wanted). There are many questions to be answered, many problems to be solved in predicting function. Sleeves up and off to work? Forget protein structure prediction - which proved to be difficult, anyway - and jump to predicting protein function? My intuitive answer is: yes, but jump with one leg, and only when you can keep the other where it is! Here are three reasons for this answer:
We need to develop tools in a controllable environment!
Protein structure prediction is a well defined problem. All tools
we apply today for predicting, e.g., functional classes, are successful
because they were developed, tested, and improved by studying
protein structure prediction.
We need knowledge about protein structure for
the details of function.
Suppose we know that a protein binds a substrate. What we do
not know is why, how, and where. Suppose, we could predict the
binding site to be around residue 132, and we changed some residues.
We may find that residues 38, 88, and 122 appear crucial, for
function. If we knew the structure, we may why exactly those
residues stabilise the structure. And we may direct and thus
speed up the exploration of the details of function.
We are almost there.
We still cannot predict protein structure from sequence in general.
But, evolutionary information has opened the door to come much
closer to the goal. Some of us who work on this issue feel that
we are very close to success. However, protein structure prediction,
supposedly, won't be achieved by a single genial idea. Instead,
the road ahead is steep. We have to roll the heavy stone uphill
together. And in contrast to the Greek mythology, the stone did
never roll backwards over the last 40 years: over-optimistic claims
skewed the image, but what we witnessed was a gradual ascent towards
the top of the hill!
1D one-dimensional
1D structure
one-dimensional (e.g. sequence or string of secondary structure)
2D two-dimensional
2D structure
two-dimensional (e.g. inter-residue distances)
3D three-dimensional
3D structure
three-dimensional (co-ordinates of protein structure)
ADP/ATP
Adenosine Di-Phospate/Adenosine Tri-Phospate, the reaction:
ATP->ADP releases Å 7kcal/mol.
CATH database of classifications of protein structures (Orengo, 1994; Orengo & Taylor, 1996; Orengo et al., 1997)
DALI inter-residue based method for structural alignment (Holm & Sander, 1993; Holm & Sander, 1996a)
DSSP data base containing the secondary structure and solvent accessibility for proteins of known 3D structure (Kabsch & Sander, 1983a)
FSSP data base of remote homologues of known 3D structure (Holm & Sander, 1996b)
GOR prediction of secondary structure based on statistics (Garnier et al., 1978; Biou et al., 1988; Levin & Garnier, 1988; Garnier et al., 1996)
HM Homology Modelling: modelling the 3D structure of a protein based on a significant level of pairwise sequence identity to a protein of known 3D structure
HSSP data base containing for each PDB protein of known 3D structure the alignments of all SWISSPROT sequences homologue to the known structure (Schneider et al., 1997).
HTM Trans-Membrane-helix, helix crossing the lipid bilayer of integral transmembrane proteins
MaxHom dynamic programming algorithm for conservation weight based multiple sequence alignment (Sander & Schneider, 1991; Schneider, 1994)
ORF open reading frame, region of gene that codes for a protein
PDB Protein Data Bank of experimentally determined 3D structures of proteins (Bernstein et al., 1977; Abola et al., 1988)
PHD Profile based neural network prediction:
PHDacc
solvent accessibility (PHDacc; (Rost & Sander, 1994c; Rost, 1996b)),
and
PHDhtm t
ransmembrane helices (PHDhtm; (Rost et al., 1995; Rost, 1996b; Rost et al., 1996a; Rost et al., 1996b)).
PHDsec
secondary structure (PHDsec; (Rost & Sander, 1993b; Rost & Sander, 1993a; Rost & Sander, 1994b; Rost, 1996b)
PredictProtein
internet prediction service (Rost et al., 1994a; Rost, 1996b)
RMSD Root-mean-square deviation, typically difference between backbone of two proteins, values of >5Å typically indicate that two structures do not superpose at all
SCOP database of classifications of protein structures (Murzin et al., 1995; Brenner et al., 1996; Hubbard et al., 1997)
SWISSPROT
data base of protein sequences (Bairoch & Apweiler, 1997)
T target used for homology modelling (protein of known 3D structure)
TM Trans-Membrane, region bound to lipid bilayer of integral trans-membrane proteins
TREMBL
translation of the EMBL-nucleotide database coding DNA to protein
sequences (Bairoch & Apweiler, 1997)
U sequence of unknown structure
sequence identity
percentage of residues identical (D->E = 0; D->D = 1) between
two sequences aligned (insertions excluded)
sequence similarity
percentage of residues similar (D->E = 1; D->D = 1) between
two sequences aligned (insertions excluded); there are two ways
to convert similarity into percentage values: (i) by normalising
the similarity score by the maximal possible score (percentage
residue similarity), and (ii) by setting an arbitrary threshold
of the similarity score to distinguish similar-not similar and
counting the percentage of residues that are similar according
to this threshold (percentage of similar residues); note that
different similarity metrices are used to account for physico-chemical
properties of amino acids, consequently, levels of similarity
are usually not directly comparable between different alignment
methods) miscellaneous
close structural homologues
proteins with similar structures and > 25% pairwise sequence
identity
remote homologues
proteins with similar structures and < 25% pairwise sequence
identity
convergent evolution
evolution from a different ancestor
divergent evolution
evolution from a common ancestor