BioInformatics, 1997 13, 345-356

Sisyphus and protein structure prediction

Burkhard Rost & Sean O'Donoghue

EMBL, D-69012 Heidelberg, Germany
rost@embl-heidelberg.de (http://dodo.bioc.columbia.edu/~rost/), odonoghue@embl-heidelberg.de

contact e-mail:rost@embl-heidelberg.de


CABIOS, 1997, 13, 345-356.


Abstract

The problem of predicting protein structure from sequence remains fundamentally unsolved despite more than three decades of intensive research effort. However, new and promising methods in 3D, 2D, and 1D prediction have reopened the field. Mean-force-potentials derived from the protein databases can distinguish between correct and incorrect models (3D). Inter-residue contacts (2D) can be detected by analysis of correlated mutations, albeit with low accuracy. Secondary structure, solvent accessibility, and transmembrane helices (1D) can be predicted with significantly improved accuracy using multiple sequence alignments. Some of these new prediction methods have proven accurate and reliable enough to be useful in genome analysis, and in experimental structure determination. Moreover, the new generation of theoretical methods is increasingly influencing experiments in molecular biology.


Introduction

In Greek mythology, Sisyphus is condemned to an eternity of hard labor; his labor is frustrating and fruitless, for just as he is about to achieve his goal, his work is undone and he must start again from the beginning. Those who work in protein structure prediction seem to share the same fate.

For over 30 years, there has been an ardent search for methods to predict the three-dimensional (3D) structure from the sequence. Many methods were found which initially looked very promising, but the hope has always been dashed. The results of the first and second Asilomar meetings on structure prediction (Proteins special issue, Vol. 23, 1995) demonstrate clearly that the goal has not yet been reached.

The search has been driven by the belief that the 3D structure of a protein is determined by its amino acid sequence (Anfinsen, 1973). While it is now known that chaperones often play a rôle in the folding pathway, and in correcting misfolds (Corrales and Fersht, 1996, Hartl et al., 1994), it is believed that the final structure is at the free-energy minimum. Thus, all information needed to predict the native structure of a protein is contained in the amino acid sequence, plus a knowledge of its native solution environment.

Currently, databases for protein sequence (Bairoch and Apweiler, 1997) and protein structure (Bernstein et al., 1977) are expanding rapidly due to large scale sequencing projects (Oliver et al., 1992; Fleischmann et al., 1995; Fraser et al., 1995; Goffeau et al., 1996; Johnston et al., 1996) and improvements in experimental determination of 3D structures (Lattman, 1994). Can structure prediction profit from this flood of information?

Indeed, there is a flood of literature on protein structure prediction attempting to keep pace with the expanding databases. Recent comprehensive reviews include (Rost and Sander, 1994b, Rost and Sander, 1996); recent books on structure prediction include (Doolittle, 1996, Sternberg, 1996); for a user-oriented, practical approach to structure prediction and sequence analysis see (Bork and Gibson, 1996, Rost and Valencia, 1996, Rost and Schneider, 1997). In this review we will focus mainly on recent prediction methods designed to exploit the growing databases. We show that, unlike Sisyphus, the predictor of protein structure has actually moved closer to his goal over the years, although he is still far from it; moreover, the labor has not been fruitless. We will discuss some of the prediction methods which have proven to be accurate and reliable enough to be useful in genome analysis and in experimental structure determination.


Table 1: Abbreviations and notations

1D->3D: the classification into 1D, 2D, and 3D structure is based on physical properties grouped with respect to protein structure prediction (Rost and Sander, 1994b)
3D: three-dimensional; signifies the co-ordinates of atoms and/or residues that define a protein structure
2D: two-dimensional; describes inter-residue distances, or contacts (note: a correct prediction of all inter-residue distances is equivalent to a 3D prediction)
1D: one-dimensional; summarises properties of single residues that can be written in a 1D string, e.g., sequence, secondary structure, residue solvent accessibility, or hydrophobicity (note: prediction of, e.g., helices implies prediction of some local inter-residue contacts; prediction of residue solvent accessibility implies providing an upper limit to the number of possible contact partners; thus, these properties are occasionally referred to as 2D; however, even a perfect prediction of 1D properties is, in general, not equivalent to a prediction in 3D)
NMR: nuclear magnetic resonance
U: sequence of unknown structure
T: target sequence with determined 3D structure predicted to be similar to that of U
homology modelling: prediction of 3D structure for a protein U based on a significant pairwise sequence identity (>25%) to a protein of known structure T
remote homology modelling: prediction of 3D structure for a protein U based on low levels of pairwise sequence identity (<25%) to a protein of known structure T
fold recognition: prediction that two proteins with no significant pairwise sequence identity have similar folds (note: in principle the fold has to be known explicitly for one of the two proteins)
significant pairwise sequence similarity: percentage of residues identical between two naturally evolved proteins A and B guaranteeing that A and B have similar structures; the exact number depends on the alignment length: for more than 80 aligned residues, 25% pairwise sequence identity mostly suffices to guarantee structural similarity (Schneider and Sander, 1991); there are very few exceptions of pairs with more than 30% sequence identity and dissimilar structures (Steven Brenner, priv. communication)



State-of-the-art in protein structure prediction

Ab initio prediction of protein structure from sequence: not yet. Given only the amino acid sequence, it should be possible in principle to directly predict protein structure from physico-chemical principles using, for example, molecular dynamics methods (Levitt and Warshel, 1975). In practice, however, such approaches are frustrated by the enormous complexity of the calculation (requiring many orders of magnitude more computing time than is currently feasible) and by inaccuracies in the experimental determination of basic parameters (van Gunsteren, 1993, Shortle et al., 1996). Thus, the most successful structure prediction tools are knowledge-based, using a combination of statistical theory and empirical rules.

Bridging the sequence-structure gap for more than 30% of all sequences. The gap between the number of known sequences (>100,000 (Bairoch and Apweiler, 1997)) and the number of known structures (about 4,000 (Bernstein et al., 1977)) is widening rapidly. The most successful theoretical approach to bridging this gap is homology modelling. Given a sequence of unknown fold (denoted U), if U has significant sequence similarity to a protein of known structure (i.e., if the pairwise sequence identity is >25%), it is possible to construct an approximate 3D model which has a correct fold but inaccurate loop regions. Homology modelling effectively raises the number of 'known' 3D structures from 4,000 to over 15,000 (Schneider et al., 1997) (Fig. 1, and Figs. 2-3 in Rost and Schneider, 1997). However, most pairs of proteins with similar structure are remote sequence homologues with less than 25% pairwise sequence identity (Rost et al., 1996b). These remote homologues cannot usually be recognised by conventional sequence alignments, but may sometimes be recognised by threading methods. Once a remote homology is detected, remote homology modelling may be used to construct a 3D model. This could potentially reduce the sequence-structure gap by an additional 5,000 - 10,000 proteins (Fig. 1). Now suppose we randomly choose a sequence U from one of the complete genome sequences which have recently become available; what is the likelihood that we could predict the 3D structure by homology modelling or remote homology modelling? A conservative answer would be 10%, based on the success of sequence alignment-based homology modelling (Fig. 1). A very optimistic estimate would be over 50%, assuming all remote homologues could be recognised (Fig. 1 & Fig. 3).

Accurate prediction for 1D aspects of 3D structure. If no remote homologue can be detected for U, we are forced to simplify the prediction problem. There is a pay-off from making this simplification: using the rich diversity of information in current databases, it is possible to make very accurate 1D predictions from the sequence alone. Automatic prediction services are readily available for secondary structure, solvent accessibility, location and topology for transmembrane helices (Rost, 1996a), and coiled-coils (Lupas, 1996).


Fig. 1. Scope of structure prediction. Given any expressed protein, how likely is it that theory can predict its 3D structure? For example, for 30% of the proteins in the current SWISS-PROT database we can find regions for which homology modelling (HoMo) is applicable (Schneider et al., 1997), but for the first four entirely sequenced genomes (yeast, Haemophilus influenzae, Mycoplasma genitalium, Methanococcus jannaschii) this is true for less than 10% of all proteins (Casari et al., 1996). Thus, SWISS-PROT contains a bias introduced, e.g., by limitations of previous sequencing techniques (note: PDB is biased to an even higher extent: for 58% of the proteins homology modelling is applicable). Estimating the contribution of fold recognition (FoRc) techniques is rather tricky. Here, we used the following procedure: 35% of all proteins in PDB could be subject to fold recognition techniques. This number allows two estimates: (1) the number of fold recognition targets is about 60% (35/58) of the homology modelling targets (and would thus be about 5% for the four genomes), and (2) fold recognition covers about 80% (35/42) of what is not covered by homology modelling (which would suggest that about 70% of the four genomes could be recognised by threading). The truth, supposedly, lies in between these extremes. (Note: today threading techniques are not accurate enough for any large-scale prediction of 3D structure!) The remaining region (supposedly larger than 50%) is occupied by unknown folds (UFo).
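
The two estimates in the caption of Fig. 1 can be retraced with a few lines of arithmetic. The sketch below is only an illustration: the function and variable names are ours, and the value of 10% for the genomes is the upper bound quoted in the caption, so the results (about 6% and 75%) are of the same order as the rounded 'about 5%' and 'about 70%' given there rather than identical to them.

```python
# Percentages quoted in the caption of Fig. 1.
pdb_homology = 58.0          # % of PDB proteins amenable to homology modelling
pdb_fold_recognition = 35.0  # % of PDB proteins amenable to fold recognition
genome_homology = 10.0       # upper bound (%) quoted for the four complete genomes


def fold_recognition_estimates(pdb_homo, pdb_forc, genome_homo):
    """Return (low, high) estimates, in % of genome proteins, for fold recognition."""
    # Estimate 1: fold recognition targets as a fraction of homology modelling targets.
    ratio_to_homology = pdb_forc / pdb_homo
    low = ratio_to_homology * genome_homo
    # Estimate 2: fold recognition as a fraction of what homology modelling does not cover.
    ratio_to_remainder = pdb_forc / (100.0 - pdb_homo)
    high = ratio_to_remainder * (100.0 - genome_homo)
    return low, high


low, high = fold_recognition_estimates(pdb_homology, pdb_fold_recognition, genome_homology)
print(f"estimate 1: {low:.1f}% of genome proteins")
print(f"estimate 2: {high:.1f}% of genome proteins")
```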



Structure prediction for known folds

Sequence alignments

Trivial for high levels of sequence identity. Any sequence analysis starts with database searches for homologous proteins by sequence alignment procedures. When pairwise sequence identity is over 25-30% (for more than 80 residues), alignment procedures are usually straightforward (Bryant and Altschul, 1995, Barton, 1996, Taylor, 1996, Schneider et al., 1997). For less similar protein sequences, alignments may fail (Henikoff and Henikoff, 1993, Bordo et al., 1994, Vingron and Waterman, 1994).
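
To make the threshold concrete, the following minimal sketch computes the pairwise sequence identity of two already aligned sequences and checks it against the 25% / 80-residue rule of thumb quoted above. The function names, the gap character and the choice of denominator (positions where both sequences have a residue) are assumptions made for illustration; other definitions of percentage identity are in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-"):
    """Return (identity in %, number of aligned residue pairs) for two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # positions with a gap in either sequence are not counted
        aligned += 1
        if a == b:
            identical += 1
    identity = 100.0 * identical / aligned if aligned else 0.0
    return identity, aligned


def alignment_is_straightforward(aligned_a, aligned_b, min_identity=25.0, min_length=80):
    """Apply the rule of thumb: more than 80 aligned residues at more than 25% identity."""
    identity, aligned = percent_identity(aligned_a, aligned_b)
    return aligned > min_length and identity > min_identity


print(percent_identity("ACD-F", "ACEGF"))
```

Short alignments fail the test even at high identity, which reflects the dependence of the threshold on alignment length.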

Multiple alignments improve as data banks grow. The goal of sequence alignment procedures is to accurately align related sequence stretches and to avoid aligning unrelated stretches. The most advanced sequence alignment tools base the alignment on profiles derived from databases or particular sequence families (Altschul and Gish, 1996, Barton, 1996, Deperieux and Feytmans, 1992, Feng and Doolittle, 1996, Gribskov and Veretnik, 1996, Henikoff and Henikoff, 1996a, Henikoff and Henikoff, 1996b, Higgins et al., 1996, Neuwald et al., 1995, Pearson, 1996, Thompson and Goldstein, 1996a, Tomii and Kanehisa, 1996, Vingron and Waterman, 1994). One new generation of alignment methods is based on Hidden Markov Models (Eddy, 1995, Hubbard and Park, 1995, Krogh and Mitchison, 1995, Bucher and Hofmann, 1996, Hughey and Krogh, 1996, McClure et al., 1996) and another on genetic algorithms (Notredame and Higgins, 1996). These new methods may be more successful in the twilight zone of sequence alignments (currently 20-30% sequence identity; (Doolittle, 1986)) than advanced profile-based methods (Higgins et al., 1996, Taylor, 1996, Thompson and Goldstein, 1996a); however, this remains to be proven.

Homology modelling

Prediction at atomic accuracy for high levels of sequence identity. The basic assumption of homology modelling is that U and the homologous template protein of known structure (T) have nearly identical backbone structure in the aligned regions. The task is to correctly place the side chains of U into the backbone of T. For levels above 70-90% sequence identity, the resulting models are quite accurate (De Filippis et al., 1994, May and Blundell, 1994, Sali and Blundell, 1994, Johnson et al., 1996). The limiting factor is the computation time required (Fig. 2).

Prediction at intermediate accuracy for lower levels of sequence identity. For sequence identities down to about 30%, U and T will still have the same fold (Sander and Schneider, 1991), but the number of loops inserted grows and the divergence between U and T becomes considerable (De Filippis et al., 1994, May and Blundell, 1994, Chinea et al., 1995, Mosimann et al., 1995, Moult et al., 1995, Samudrala et al., 1995, Vinals et al., 1995). Modelling of loop regions is still a difficult problem (Cardozo et al., 1995, Mosimann et al., 1995, Sali et al., 1995); even the best methods only rarely achieve atomic accuracy, and the modelled loops are often completely different from the correct structure. For lower levels of pairwise sequence identity, the accuracy of the sequence alignment becomes an additional problem. A pessimistic view is that the accuracy of resulting 3D predictions is typically at the level of ribbon plots, i.e. the mutual orientation of elements such as helices and sheets can be identified. The optimistic version is that even down to levels of 30% sequence identity homology modelling occasionally yields correct predictions at atomic resolution.


Fig. 2. Limiting steps of homology modelling. Accuracy of homology modelling is proportional to the level of pairwise sequence identity between the protein of unknown structure and its target of known structure. For high levels of identity, CPU time is the major constraint, for lower levels, loop regions become a problem (and thus the quality of the model). Below 40-50% sequence identity errors in the sequence alignment become fatal. Below 25-30% sequence identity, fold recognition (threading) techniques have to replace (or complement) the sequence alignment procedure. (Note: figure partly taken from: Holm et al., 1994.)


Remote homology modelling

Three difficult problems. Remote homology modelling (<25% pairwise sequence identity between the unknown structure, U, and template, T) has three obstacles to overcome: (1) the remote homology between U and T has to be detected; (2) U and T have to be aligned correctly; and (3) the homology modelling procedure has to be tailored to the harder problem of extremely low sequence identity. In the early 1990s, there was a great deal of optimism that the first obstacle, the detection of similar folds, would be solved by threading methods. The basic idea is to thread the sequence of U into the backbone 3D structure of T, at each step evaluating the 'fitness of sequence for structure' using environment-based (Bowie et al., 1990a, Bowie et al., 1990b, Bowie et al., 1991, Bowie et al., 1996, Eisenberg et al., 1991, Ouzounis et al., 1993, Wilmanns and Eisenberg, 1995, Fischer and Eisenberg, 1996) or knowledge-based mean-force-potentials (Bryant and Altschul, 1995, Sippl, 1995). Most threading methods use mean-force-potentials derived from the PDB (Kocher et al., 1994, Lemer et al., 1995, Sippl, 1995, Wodak and Rooman, 1993). An alternative method, originally proposed a decade ago (Sheridan et al., 1985), is to thread using 1D predictions. The first application of 1D threading was reported two years ago (Rost, 1995a); since then, several groups have investigated similar concepts, and refinements to the method (Fischer and Eisenberg, 1996, Fischer et al., 1996b, Rost, 1995b, Rost et al., 1996c, Russell et al., 1996).
The general threading problem has been shown to be NP-complete (Lathrop and Smith, 1994); consequently, heuristic algorithms must be employed, and many have been proposed (Bowie et al., 1990a, Hendlich et al., 1990, Sippl, 1990, Bowie et al., 1991, Eisenberg et al., 1991, Casari and Sippl, 1992, Sippl and Weitckus, 1992, Godzik et al., 1993, Lathrop and Smith, 1994, Collura et al., 1995, Flöckner et al., 1995, Hubbard and Park, 1995, Jones et al., 1995, LeGrand et al., 1995, Madej et al., 1995b, Matsuo and Nishikawa, 1995, Wang et al., 1995, Wilmanns and Eisenberg, 1995, Rost, 1995b, Finkelstein and Reva, 1996, Fischer and Eisenberg, 1996, Johnson et al., 1996, Jones et al., 1996, Miller et al., 1996, Reva and Finkelstein, 1996, Rost et al., 1996c, Russell et al., 1996). Has all this effort achieved any success for remote homology modelling?
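
The idea of threading with 1D predictions can be illustrated by a toy example: the predicted secondary structure string of U is aligned against the observed secondary structure strings of a library of folds, and the folds are ranked by alignment score. The sketch below uses a plain Needleman-Wunsch global alignment; the scoring values, the three-letter alphabet (H, E, L) and the tiny library are hypothetical, and the code does not reproduce any of the published methods cited above, which use richer scoring terms.

```python
def align_score(pred: str, obs: str, match=1, mismatch=-1, gap=-2) -> int:
    """Needleman-Wunsch global alignment score between two 1D strings."""
    # prev[j] holds the best score for aligning pred[:i-1] with obs[:j].
    prev = [j * gap for j in range(len(obs) + 1)]
    for i in range(1, len(pred) + 1):
        curr = [i * gap]
        for j in range(1, len(obs) + 1):
            diag = prev[j - 1] + (match if pred[i - 1] == obs[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def rank_templates(pred, library):
    """Return (score, name) pairs for all templates, best score first."""
    scored = [(align_score(pred, ss), name) for name, ss in library.items()]
    return sorted(scored, reverse=True)


# Hypothetical fold library: observed secondary structure of three templates T.
library = {
    "all_alpha": "HHHHHHHHLLLHHHHHHHH",
    "all_beta": "EEEEELLLEEEEELLEEEE",
    "alpha_beta": "HHHHHHLLLEEEEELLLHH",
}
predicted = "HHHHHHHLLLHHHHHHHH"  # predicted secondary structure of U

for score, name in rank_templates(predicted, library):
    print(name, score)
```

The first entry of the ranked list corresponds to the 'first hit' whose correctness is at issue in the evaluation of threading methods; the alignment itself, the second obstacle, would be recovered by tracing back through the same dynamic programming matrix.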

Remote homologues can often be detected. First the good news: since the different mean-force-potentials which have been proposed capture different aspects of protein structure, the correct remote homologue is likely to be found by at least one of them (Lemer et al., 1995). Now the bad news: so far, no single method has been able to detect the correct remote homologue for more than half of all test cases (Lemer et al., 1995). For the methods which have been rigorously evaluated using large test sets, the correct remote homologue is detected in less than 40% of all cases (Rost, 1995b, Fischer et al., 1996a, Fischer et al., 1996b, Rost et al., 1996c, Russell et al., 1996). However, this performance is clearly superior to that of traditional sequence alignments at this low level (<25%) of sequence identity (Madej et al., 1995a, Rost, 1995b, Fischer and Eisenberg, 1996, Rost et al., 1996c).

3D prediction by threading is still not reliable. Detecting the remote homology is only the first of the three obstacles. It appears that the second obstacle (correct alignment between U and T) is much more difficult and, unfortunately, there is no general solution so far. Thus the final step, building a 3D model, usually fails since the modelling procedures available today cannot correct the mistakes in the alignments. As a result, there are very few publications to date which report accurate 3D predictions from threading methods (Flöckner et al., 1995, Lemer et al., 1995, Sippl, 1995, Rost et al., 1996c). Currently, the successful use of threading methods (Hubbard and Park, 1995, Valencia et al., 1995, Hubbard et al., 1996) has required skeptical, expert user intervention to spot wrong hits and false alignments. It is still possible that threading methods will become the most successful structure prediction methods; however, three large obstacles remain to be dealt with.


Fig. 3. Expected accuracy of automatic threading. The estimates given are controversial, but tend to be on the conservative side for an automatic use of fold recognition methods. The numbers refer to the following experiment. Suppose we run a threading program on all 6000 yeast proteins. For each of them we thread the sequence into a library of (say 800) folds. The program will rank the hits according to the predicted similarity to the search protein (U). How often is the first hit in the resulting list a correctly recognised remote homologue (similar fold with < 25% sequence identity)? And how often is the alignment of the first hit correct? (Note: given a correct alignment, the accuracy of the final 3D prediction obtained by remote homology modelling will be higher for higher levels of sequence identity. However, for most threading programs the detection error is independent of the level of sequence identity.)



Structure prediction for unknown folds

Exploiting the protein databases. For over three decades, researchers have pursued the goal of predicting the structures of unknown folds. However, the major break-through has come about due to the expansion of the protein databases (Bernstein et al., 1977, Bairoch and Apweiler, 1997, Benson et al., 1997). Two main strategies have been developed for exploiting the information in these databases for structure prediction: (1) studying the evolution of protein families from both the sequence and structure databases; and (2) studying the physical principles of protein structure and folding from the structure databases.

Odyssey of evolution teaches us structure prediction. It appears that for most proteins, almost all residues can be changed without affecting the structure (Rost et al., 1996b); however, a single, randomly chosen mutation is more likely to destabilise than to maintain a particular structure. Thus, the precise pattern of amino acid exchanges observed in a multiple sequence alignment of a protein family is highly indicative of the particular structure. These patterns constitute a fossil record of mutations preserving protein structure and function. The importance of such evolutionary information for structure prediction was realised very early (Zuckerkandl and Pauling, 1965), and has long been exploited in exceptional cases by experts (Dickerson et al., 1976, Benner, 1989, Frampton et al., 1989, Benner and Gerloff, 1990, Nardelli et al., 1991, Musacchio et al., 1992, Livingstone and Barton, 1994), as well as in automatic and systematic ways (Maxfield and Scheraga, 1979, Zvelebil et al., 1987). More recently, the use of evolutionary information has grown in importance. This became particularly clear when it was shown that the accuracy of secondary structure prediction was improved to over 70% by the use of evolutionary information (Rost and Sander, 1993).

Prediction in 1D

Secondary structure predictions - the most accurate view of an unknown fold. Secondary structure can usually be predicted more accurately and reliably than other features of protein structure (Fig. 4). Most of the recent methods for secondary structure prediction rely heavily upon evolutionary information (Rost and Sander, 1993, Livingstone and Barton, 1994, Zimmermann, 1994, Barton, 1995, Geourjon and Deléage, 1995, Mehta et al., 1995, Salamov and Solovyev, 1995, Tuckwell et al., 1995, Di Francesco et al., 1996, Garnier et al., 1996, Riis and Krogh, 1996, Rost, 1996a, Rychlewski and Godzik, 1996). A promising new concept is the use of long-range contact potentials (Frishman and Argos, 1996). From the perspective of a user of a prediction method, it is important to know the accuracy of these different methods. Unfortunately, however, in only relatively few cases has the prediction accuracy been rigorously evaluated with sufficiently large test sets (Rost and Sander, 1993, Salamov and Solovyev, 1995, Di Francesco et al., 1996, Riis and Krogh, 1996, Rost, 1996a); in these cases, the sustained average accuracy of three-state predictions is just over 70%. The typical standard deviation in the prediction accuracy is about 10% (one standard deviation over more than 500-700 unique protein chains (Rost, 1996a, Rost WWW, 1996a)). For some of the prediction methods, the strength of the prediction has been shown to correlate with prediction accuracy (Rost and Sander, 1993, Munson et al., 1994, Di Francesco et al., 1996, Garnier et al., 1996, Rost, 1996a). In practice, this enables users to focus on regions for which the prediction is more likely to be correct, e.g., about 45% of all residues are predicted at levels of accuracy comparable to homology modelling (Rost, 1996a).
By comparison, earlier methods for secondary structure prediction which did not use evolutionary information, such as GOR (Garnier et al., 1996), had an average accuracy of about 60%, and less than 10% of residues were predicted at the same accuracy as homology modelling. Thus, for methods that need highly accurate secondary structure predictions as input to predict other properties of protein structure and/or function, the best prediction methods of today are six times more useful than the methods of five years ago. A general feature of methods based on information from multiple sequence alignments is that errors in the alignment greatly decrease prediction accuracy (Di Francesco et al., 1996, Rost, 1996a, Rost and Valencia, 1996).
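
The two quantities discussed above, the three-state per-residue accuracy and the accuracy restricted to strongly predicted residues, can be written down in a few lines. This is a minimal sketch: the function names, the state alphabet (H, E, L) and the 0-9 reliability scale in the example are assumptions for illustration, not the definition used by any particular program.

```python
def three_state_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state (H, E or L) equals the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def accuracy_above_reliability(predicted, observed, reliability, threshold):
    """Return (accuracy in %, coverage in %) for residues with reliability >= threshold."""
    kept = [(p, o) for p, o, r in zip(predicted, observed, reliability) if r >= threshold]
    if not kept:
        return 0.0, 0.0
    correct = sum(p == o for p, o in kept)
    return 100.0 * correct / len(kept), 100.0 * len(kept) / len(observed)


observed = "HHHHLLEEEE"
predicted = "HHHLLLEEHE"
reliability = [9, 9, 8, 3, 7, 8, 9, 6, 2, 7]  # hypothetical per-residue prediction strength

print(three_state_accuracy(predicted, observed))
print(accuracy_above_reliability(predicted, observed, reliability, threshold=5))
```

Raising the threshold trades coverage for accuracy, which is the sense in which a fraction of all residues can be predicted at a much higher accuracy than the overall average.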


Fig. 4. Expected accuracy of predictions in 1D. Numbers were taken from the major prediction programs that make use of information contained in multiple alignments. All estimates are averages over distributions with standard deviations on the order of ±10%.


Predicting residue solvent accessibility - the second step towards 3D structure? It has long been argued that if the segments of secondary structure could be accurately predicted, the 3D structure could be predicted by simply trying different arrangements of the segments in space (Cohen et al., 1982, Cohen and Presnell, 1996). One criterion for assessing each arrangement could be to use predictions of residue solvent accessibility (Esposito et al., 1994, Monge et al., 1994, Mumenthaler and Braun, 1995, Nilges, 1995, Galaktionov and Marshall, 1996). Various methods for predicting accessibility have been developed recently (Benner et al., 1994, Rost and Sander, 1994a, Wako and Blundell, 1994, Thompson and Goldstein, 1996b). Although residue solvent accessibility is not as well conserved within structural families as is secondary structure (Russell and Barton, 1994, Rost and Sander, 1994a, Rost WWW, 1996b, Rost, 1996a), prediction accuracy is much improved by including evolutionary information (Rost, 1996a, Thompson and Goldstein, 1996b). Predictions of solvent accessibility have also been used successfully for prediction-based threading (Rost, 1995b, Rost et al., 1996c, Russell et al., 1996), and as a basis for predicting functional sites (Cornette et al., 1995, Hansen et al., 1995, Hansen et al., 1996).

Membrane proteins - successful prediction in absence of experimental information. Integral membrane proteins are an important class of proteins for which it is very difficult to obtain atomic-resolution information about 3D structure. Two main classes of membrane proteins are recognised (von Heijne, 1996): proteins with long (17-27 residues) transmembrane helices spanning the membrane; and porins, 16-fold beta-barrel proteins which form a pore through the membrane. Developing prediction methods for the porins is problematic, as there is very little experimental information currently available; some attempts have been made using sequence profiles (von Heijne, 1996). Predicting the locations of the transmembrane helices is a task comparable to secondary structure prediction. Very accurate predictions have been achieved by combining expert rules, hydrophobicity analyses, and statistics (von Heijne, 1994, Jones et al., 1994, Persson and Argos, 1994, Neuwald et al., 1995, Efremov and Vergoten, 1996, Fariselli and Casadio, 1996, Persson and Argos, 1996, Rost et al., 1996a, von Heijne, 1996). A separate task is the prediction of signal peptides (Nielsen et al., 1996). The use of multiple alignment information has been shown to improve prediction accuracy (Persson and Argos, 1994, Rost et al., 1995, Persson and Argos, 1996, Rost, 1996a, Rost et al., 1996a). Currently, the best methods predict all transmembrane helices correctly for about 85-90% of all test proteins (Rost et al., 1996a, von Heijne, 1996). Further methods have been developed for predicting transmembrane-helix topology, i.e. the orientation of the helices with respect to the membrane (Jones et al., 1994, Persson and Argos, 1996, Rost et al., 1996a, von Heijne, 1996).
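
The simplest ingredient named above, a hydrophobicity analysis, can be sketched as a sliding-window average over a hydropathy scale. The example below uses the Kyte-Doolittle scale with a window of 19 residues and a cut-off of 1.6; these choices, the function names and the toy sequence are assumptions for illustration, and the accurate methods cited above combine such profiles with expert rules, statistics and alignment information that this sketch omits.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window; entry i covers seq[i:i + window]."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]


def candidate_segments(seq, window=19, threshold=1.6):
    """Merge overlapping windows at or above threshold into (start, end) ranges.

    Positions are 0-based and end is exclusive.
    """
    segments = []
    for i, score in enumerate(hydropathy_profile(seq, window)):
        if score < threshold:
            continue
        if segments and i <= segments[-1][1]:
            segments[-1] = (segments[-1][0], i + window)  # extend the current segment
        else:
            segments.append((i, i + window))
    return segments


# Toy sequence: a 20-residue hydrophobic stretch flanked by charged residues.
toy = "DKDKDKDKDK" + "L" * 20 + "DKDKDKDKDK"
print(candidate_segments(toy))
```

The reported segment is wider than the hydrophobic stretch itself, because every window that overlaps it sufficiently passes the cut-off; locating helix ends precisely is one of the places where the more elaborate methods improve on a raw profile.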

Prediction in 2D

A hard problem, but the stakes are high. Given only a small fraction of the inter-residue distances, it is possible to calculate the 3D structure using either metric matrix distance geometry or simulated annealing by molecular dynamics (Bohr et al., 1993, Brünger and Nilges, 1993, Aszodi et al., 1995, Galaktionov and Marshall, 1996, Nilges, 1996). Can inter-residue contacts be predicted accurately from the sequence alone? And can evolutionary information help out once again?
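
For orientation, the core step of metric matrix distance geometry can be written compactly. The derivation below is a sketch under the simplifying assumption that all N(N-1)/2 inter-residue distances d_ij are known exactly; the symbols J, G, V and Lambda are our notation. In practice only a fraction of the distances is available, and the missing ones must first be estimated, for example from distance bounds.

```latex
% Assumption: every pairwise distance d_{ij} between the N points is known.
\begin{align}
  D^{(2)}_{ij} &= d_{ij}^{2}, &
  J &= I - \tfrac{1}{N}\,\mathbf{1}\mathbf{1}^{\mathsf{T}}, &
  G &= -\tfrac{1}{2}\, J\, D^{(2)} J, \\
  G &= V \Lambda V^{\mathsf{T}}, &
  X &= V_{3}\, \Lambda_{3}^{1/2}.
\end{align}
% G is the metric (Gram) matrix of the centred coordinates; V_3 and Lambda_3
% contain the eigenvectors and eigenvalues of the three largest eigenvalues.
% The rows of X are 3D coordinates, defined up to rotation, translation and
% mirror image, since distances alone do not fix the handedness.
```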

Distinction between different models possible. Two attempts have been made to use evolutionary information for prediction of inter-residue contacts. The first was a method for predicting contacts between beta-strands from multiple sequence alignments and alignment-based predictions of secondary structure (Hubbard, 1994, Hubbard and Park, 1995). Thus the method is limited in its applicability, but would be helpful as part of a bouquet of other prediction methods (Hubbard et al., 1996). The second attempt has been in the development of a group of methods for prediction of inter-residue contacts from correlated mutations (Goebel et al., 1994, Shindyalov et al., 1994, Taylor and Hatrick, 1994, Kreisberg et al., 1995). In general, the prediction accuracy is rather poor, with a direct trade-off between the Scylla of predicting enough contacts, and the Charybdis of predicting only correct ones, e.g., taking the 5% best-predicted long-range contacts (sequence separation above 10 residues), the prediction accuracy is about 50% (A. Valencia, priv. communication). Although this level of accuracy is not high, it is sufficient to distinguish between correct and incorrect alignments in threading experiments (A. Valencia, in preparation).
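
The principle behind correlated-mutation analysis can be illustrated by scoring every pair of alignment columns for co-variation and ranking the pairs. The sketch below uses mutual information as a simple stand-in measure; the cited methods define their own correlation measures, which are not reproduced here. The function names and the toy alignment are hypothetical, and the default minimum separation of 10 follows the definition of long-range contacts given above.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts.
        mi += p_ab * math.log2(p_ab * n * n / (freq_a[a] * freq_b[b]))
    return mi


def rank_column_pairs(alignment, min_separation=10):
    """Return (score, i, j) for column pairs with j - i > min_separation, best first."""
    columns = list(zip(*alignment))
    scored = []
    for i, j in combinations(range(len(columns)), 2):
        if j - i > min_separation:
            scored.append((mutual_information(columns[i], columns[j]), i, j))
    return sorted(scored, reverse=True)


# Toy alignment: columns 1 and 4 mutate together (K with D, R with E).
alignment = ["AKLGDE", "AKLGDE", "ARLGEE", "ARLGEE"]
print(rank_column_pairs(alignment, min_separation=2)[:3])
```

Keeping only the top-ranked pairs corresponds to the trade-off described above: a short list contains fewer wrong contacts but also fewer contacts overall.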

Prediction in 3D

Sisyphus again? In the 1994 Asilomar meeting, none of the 3D ab initio methods were able to predict the correct protein structure. Since that time, new methods have been proposed which indicate possible directions for the future. Several groups have obtained promising results using distance geometry methods (Aszodi et al., 1995; Mumenthaler and Braun, 1995; Nilges, 1995); these methods may be particularly powerful in combination with 2D contact predictions. Nilges and Brünger (1991, 1993) have achieved atomic accuracy in an ab initio prediction of the GCN4 leucine zipper using a hybrid molecular dynamics/simulated annealing search strategy. Recently, equally accurate models for three leucine zippers were obtained with faster calculations based on mean-force-potentials (O'Donoghue and Nilges, 1997). Simplified force-fields in combination with dynamic optimisation strategies have yielded promising, but still relatively inaccurate results (Elofsson et al., 1995, Pedersen and Moult, 1996a, Pedersen and Moult, 1996b). Srinivasan and Rose have reported very encouraging results with their hierarchical search method (1995); however, they have not repeated the initial claims, so it may be that the initial report was too optimistic. In addition to these methods, many research groups have been working at improving their more established methods since 1994, so we can only wait for the outcome of the next Asilomar meeting.

Recognising incorrect structures. We consider that the single most important theoretical advance in 3D prediction in recent years has been the development of mean-force-potentials. Before these potentials, structure prediction was normally done with 'physical' potentials, i.e., bonds, angles, torsion angles, and van der Waals as well as electrostatic non-bonded terms which describe the internal energy of the molecule (van Gunsteren, 1993). In contrast, the mean-force-potentials, derived from databases of protein structure (Sippl, 1990), attempt to describe the free-energy of the molecule. The physical potentials have been used very successfully to refine experimentally determined structures (Brünger and Nilges, 1993, Nilges, 1996). However, these terms cannot distinguish between a native fold and a grossly misfolded structure (Novotny et al., 1988, Sippl, 1990). In contrast, mean-force-potentials of pairwise residue distances are quite successful in fold recognition, as well as remote homology modelling (Sippl, 1995). It remains to be seen how best to combine these two different potentials. In one pilot study on the use of mean-force-potentials for 3D structure prediction, best results were obtained by combining both potentials (O'Donoghue and Nilges, 1997).
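
How a potential can be 'derived from a database' is perhaps easiest to see from the inverse Boltzmann relation, E(r) = -kT ln(f_obs(r) / f_ref(r)), which converts the observed frequency of a distance into an energy relative to a reference distribution. The sketch below applies this relation to binned counts; the function name, the pseudocount used to avoid empty bins, the energy unit (kT = 1) and the example counts are all assumptions for illustration, and the published potentials use more careful reference states and sparse-data corrections.

```python
import math


def mean_force_potential(observed_counts, reference_counts, kT=1.0, pseudocount=1.0):
    """Return E(r) = -kT * ln(f_obs(r) / f_ref(r)) for each distance bin."""
    if len(observed_counts) != len(reference_counts):
        raise ValueError("both histograms must have the same number of bins")
    n_bins = len(observed_counts)
    obs_total = sum(observed_counts) + pseudocount * n_bins
    ref_total = sum(reference_counts) + pseudocount * n_bins
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        f_obs = (obs + pseudocount) / obs_total  # smoothed observed frequency
        f_ref = (ref + pseudocount) / ref_total  # smoothed reference frequency
        energies.append(-kT * math.log(f_obs / f_ref))
    return energies


# Hypothetical counts of one residue-pair type in four distance bins,
# compared with a reference distribution over all pair types.
observed = [5, 40, 30, 25]
reference = [25, 25, 25, 25]
for bin_index, energy in enumerate(mean_force_potential(observed, reference)):
    print(bin_index, round(energy, 2))
```

Distances seen more often than in the reference receive a negative (favourable) energy, and rarely seen distances a positive one; summing such terms over all residue pairs of a model gives the kind of score used to tell native-like from misfolded structures.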

Extracting principles about structure formation from structures? The mean-force-potential approach has recently been extended to study protein folding. Both the physical basis and the general characteristics of protein folding remain controversial (Honig and Cohen, 1996, Israelachvili and Wennerström, 1996). Simulations and other studies indicate that the free energy balance of hydrogen bond formation is close to zero, or slightly unfavourable (Yang and Honig, 1995a, Yang and Honig, 1995b), and that a specific fold is selected primarily by side-chain interactions (Honig and Cohen, 1996). Recently, Sippl et al. have extended the concept of deriving mean-force-potentials to a formalism describing Helmholtz free energies of atom-pair interactions (Sippl, 1996, Sippl et al., 1996). The formalism starts with the following two assumptions: (1) that protein structures can be described by Helmholtz free energies (or mean-force-potentials), and (2) that the distribution of intra-molecular distances in experimentally determined protein structures does not, on average, deviate substantially from the corresponding distribution in native proteins. To normalise the absolute free energy contributions, the ideal gas (no internal interactions) is chosen as the reference state. Without any further assumptions or approximations, atom-atom mean-force-potentials are derived from a data set of known protein structures. The resulting Helmholtz mean-force-potentials reveal interesting principles about protein structure formation. (1) Backbone H-bonds (except for the α-helix interaction O(i) ... N(i+4)) do not contribute to the thermodynamic stability of native folds. (2) H-bond formation (except for O(i) ... N(i+4)) requires energy input that is regained when H-bonds are formed. Once formed, H-bonds are locked in a deep, narrow minimum. (3) The energy gain of forming one ionic or two hydrophobic contacts can provide roughly the activation energy required for forming an H-bond.
Both the eloquence and the conclusions of the approach have prompted strong criticism, even unanimous rejection of these findings. Are we witnessing an error in a method designed to spot errors, or the beginning of a new era of force fields? Further applications of these mean-force-potentials will be needed to answer this question.
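
For orientation, the textbook relation between a pair distribution and a potential of mean force, normalised against an ideal-gas reference, can be written as below. This is the standard statistical-mechanics expression and is given only as a sketch; whether the formalism of Sippl et al. uses exactly this normalisation is an assumption. The symbols are introduced here for illustration: N_ab(r) is the number of atom pairs of types a and b observed in the database at separations between r and r + Δr, and ρ_ab is the uniform pair density expected for non-interacting particles.

```latex
\[
  F_{ab}(r) \;=\; -\,k_B T \,\ln g_{ab}(r),
  \qquad
  g_{ab}(r) \;=\; \frac{N_{ab}(r)}{N^{\mathrm{ideal}}_{ab}(r)},
  \qquad
  N^{\mathrm{ideal}}_{ab}(r) \;=\; \rho_{ab}\, 4\pi r^{2}\, \Delta r .
\]
```

Distances that occur more often in the database than in the ideal gas (g > 1) give F < 0, i.e., a favourable contribution; distances that occur less often (g < 1) give F > 0.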


Uses of prediction methods

Homology modelling (including remote homology modelling, when it works) has proven to be the most useful prediction tool. Homology models are used to suggest mutation experiments to investigate function, or to facilitate crystallisation (for X-ray diffraction studies), or to reduce aggregation (for NMR spectroscopy). Furthermore, homology-based models are often used to provide starting structures for molecular replacement in X-ray crystallography.

In the absence of known structures with significant sequence identity to a protein of interest, predictions of residue solvent accessibility can be used to investigate function, and to suggest which residues to mutate to facilitate crystallisation or to reduce aggregation. Accurate prediction of secondary structure can help with X-ray diffraction (e.g., the GroEL crystal structure was derived making use of secondary structure predictions for the molecular replacement search). In principle, the early stages of NMR frequency assignment could also be aided by knowledge of the secondary structure, although this has not been attempted.

1D predictions and predictions of topology (transmembrane, coiled-coil) have proven to be quick and accurate enough for the analysis of entire genomes (Koonin et al., 1996, Odgren et al., 1996, Rost, 1996b, Rost et al., 1996a).

Mean-force-potentials can assist experimental structure determination by spotting stresses or anomalies (often errors) in protein structures (Sippl, 1993). A set of methods based on particular statistics has recently been tailored to manage exactly this task (Laskowski et al., 1993, Gray et al., 1996, Hooft et al., 1996).


Conclusions

Native 3D structures of proteins are encoded by a linear sequence of amino acid residues. To predict 3D structure from sequence is a task challenging enough to have occupied a generation of researchers. Have they finally succeeded in their goal? The bad news is: no, we still cannot predict structure for an arbitrary sequence. The good news is: we have come closer, and growing databases facilitate the task.

(1) Evolutionary information is successfully used for predictions of secondary structure, solvent accessibility, and transmembrane helices. These predictions of protein structure in 1D are significantly more accurate, and more useful than five years ago. (2) Databases of protein structure can be used to derive mean-force-potentials. Residue-pair mean-force-potentials are extremely valuable for the detection of remote homologues, and for the distinction between alternative models (generated by theory or experiment). Moreover, the database of protein structures contains a record of structure formation that has recently been unraveled by the derivation of atom-atom mean-force-potentials (Sippl, 1996, Sippl et al., 1996). (3) Homology modelling allows predictions of 3D structure for about one tenth of all expressed proteins (Fig. 1). (4) Recent improvements in fold recognition (threading), and alignment techniques enable remote homology modelling for another considerable fraction of the expressed proteins (Fig. 1).

All of the above breakthroughs were achieved in the last six years. Thus, although we still cannot solve the general prediction problem, progress has been made. In general, however, we could ask the question - is it worth persevering with structure prediction, given that it is clearly such a difficult task? The answer is: yes. The methods which have spun off from structure prediction have already given us considerable insight into the first four complete genomes. Perseverance with structure prediction will bear fruit in about five years' time, when the human genome is known.

Note added in proof

John Moult (CARB Washington, DC) has initiated a meeting for the assessment of prediction methods: theoreticians made predictions for the structure of proteins before the structure was experimentally determined. After experimental determination of some of the target structures, the accuracy of the predictions was assessed in a meeting in Asilomar, California (Dec. 1994). This experiment is one of the most important events to separate the wheat from the chaff in the field of structure prediction. We wrote our review a few weeks before the second Asilomar meeting (CASP2). Has CASP2 falsified our optimism or our skepticism? We venture the following conclusions: (1) structure prediction remains an unsolved problem; (2) new methods for predicting unknown folds may still be regarded as promising, but no major breakthrough was witnessed; (3) threading methods looked more promising in the optimistic light of the few CASP2 threading targets (than they appear in the light of this review); (4) sufficiently skilled and motivated experts (such as Alexei Murzin, LMB, Cambridge) can use today's prediction methods to find and relatively accurately align even remotely homologous proteins.


References