Columbia University, Department of Biochemistry and Molecular
Biophysics, 630 West 168th Street, New York, NY 10032,
USA
rost@columbia.edu, http://cubic.bioc.columbia.edu/
Tel: +1-212-305-3773, fax: +1-212-305-7932
| Title: | Protein secondary structure prediction continues to rise |
| Author: | Burkhard Rost |
| Quote: | J Structural Biology, 2001, 134:204-218 |
Methods predicting protein secondary structure have improved substantially
in the 90's through using evolutionary information taken from
the divergence of proteins in the same structural family. Recently,
the evolutionary information resulting from improved searches
and larger databases has again boosted prediction accuracy by
more than four percentage points to its current height around
76% of all residues predicted correctly in one of the three states
helix, strand, other. The last year also brought successful new
concepts to the field. These new methods may be particularly interesting
in light of the improvements achieved through simply combining
existing methods. Divergent evolutionary profiles not only contain
enough information to substantially improve prediction accuracy,
but even to correctly predict long stretches of identical residues
observed in alternative secondary structure states depending on
non local conditions. An example is a method automatically identifying
structural switches, and thus finding a remarkable connection
between predicted secondary structure and aspects of function.
Secondary structure predictions are increasingly becoming the
workhorse for numerous methods aiming at predicting protein
structure and function. Is the recent increase in accuracy significant
enough to make predictions even more useful? Since the recent
improvement yields a better prediction of segments, and in particular
of beta-strands, I believe the answer is affirmative. What is
the limit of prediction accuracy? We shall see.
| 3D structure | three-dimensional (co-ordinates of protein structure) |
| 1D structure | one-dimensional (e.g. sequence or string of secondary structure) |
| ASP | method identifying regions of structure ambivalent in response to global changes [1] |
| DSSP | data base and method converting 3D co-ordinates into secondary structure [2] |
| HMMSTR | Hidden Markov model-based prediction of secondary structure [3] |
| JPred | method combining other prediction methods [4, 5] |
| JPred2 | divergent profile (PSI-BLAST) based neural network prediction [6] |
| PHD | simple profile-based neural network prediction [7] |
| PHDpsi | divergent profile (PSI-BLAST) based neural network prediction [7, 8] |
| PROF | divergent profile-based neural network prediction trained and tested with PSI-BLAST [9] |
| PSI-BLAST | gapped and iterative specific profile-based, fast and accurate alignment method [10] |
| PSIPRED | divergent profile (PSI-BLAST) based neural network prediction [11] |
| SAM-T99sec | neural network prediction, using Hidden Markov models as input [12] |
| SSpro | profile-based advanced neural network prediction method [13] |
History. Linus Pauling correctly guessed the formation of helices and strands [14, 15] (and falsely hypothesised other structures). Three years before Pauling was verified by the publications of the first X-ray structures [16, 17], one group already ventured to predict secondary structure from sequence [18]. The first-generation prediction methods that followed in the 60's and 70's were all based on single amino acid propensities [19]. The second-generation methods dominating the scene until the early 90's utilised propensities for segments of 3-51 adjacent residues [19]. Basically any imaginable theoretical algorithm had been applied to the problem of predicting secondary structure from sequence. However, it seemed that prediction accuracy stalled at levels slightly above 60% (percentage of residues predicted correctly in one of the three states: helix, strand, other). The reason for this limit was the restriction to local information. Can we introduce some global information into local stretches of residues?
Secondary structure prediction profits from divergence. Early on Dickerson [20] realised that information contained in multiple alignments can improve predictions. Zvelebil et al. [21] incorporated this concept into an automatic prediction method. However, the breakthrough of the third generation methods to levels above 70% accuracy required a combination of larger databases with more advanced algorithms [22, 19]. The major component of these new methods was the use of evolutionary information. All naturally evolved proteins with more than 35% pairwise identical residues over more than 100 aligned residues have similar structures [23]. This seemingly implies an amazing stability of structure with respect to sequence divergence. However, this average figure hides the fact that neutral mutations are extremely unlikely. Supposedly most mutations result in proteins that will not adopt any globular structure at all. In other words, only a tiny fraction of all possible proteins exist. Hence, position-specific profiles describing which residues can be exchanged against which others at which positions contain crucial information about protein structure. One consequence is that stretches of, say, 17 adjacent residues implicitly contain some information about long-range interactions and environment, since the profile reflects evolutionary constraints. Using evolutionary divergence was the key that started the third generation of prediction methods. Knowing 3D structure, we can identify very distant relationships between proteins that would improve accuracy even further [24]. Can we build larger and more diverged families without knowing structure?
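To make the notion of a position-specific profile concrete, the following is a minimal sketch, not the procedure of any method reviewed here. It assumes a toy alignment, ignores gaps, and applies no sequence weighting or pseudo-counts; the names `column_profile`, `window` and `toy_alignment` are hypothetical. The window width of 17 simply mirrors the example figure used in the text.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(alignment):
    """Relative frequency of each residue type in every alignment column.

    `alignment` is a list of equal-length aligned sequences; '-' marks a gap.
    Gaps (and any non-standard letters) are ignored in the counts.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment if seq[i] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append({aa: (counts[aa] / total if total else 0.0)
                        for aa in AMINO_ACIDS})
    return profile

def window(profile, centre, width=17):
    """Profile columns in a window centred on `centre`; None pads the termini."""
    half = width // 2
    return [profile[i] if 0 <= i < len(profile) else None
            for i in range(centre - half, centre + half + 1)]

# Toy alignment (an assumption for illustration only).
toy_alignment = ["MKVLA", "MRVLA", "MKILG", "MK-LA"]
profile = column_profile(toy_alignment)
```

Each window of profile columns, rather than a window of single residues, would then serve as the local input from which the state of the central residue is predicted.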
New database searches extend family divergence found. It was also recognised very early on that information from the position-specific evolutionary exchange profile of a particular protein family facilitates discovering more distant members of that family [20] . Automatic database search methods successfully used position-specific profiles for searching [25] . However, the breakthrough to large-scale routine searches has been achieved by the development of PSI-BLAST [10] and Hidden Markov models [26, 12] . In particular, the gapped, profile-based, and iterated search tool PSI-BLAST continues to revolutionise the field of protein sequence analysis through its unique combination of speed and accuracy. More distant relationships are found through iteration starting from the safe zone of comparisons and intruding deeply and reliably into the twilight zone ( Fig. 1 ).
Fig. 1. Profile-based searches extend evolutionary information.
The cloud signifies a protein structural family for the query
protein U, i.e. all proteins that have a similar 3D structure.
A simple pairwise comparison of U with all other proteins covers
the 'safe zone' of sequence alignment (grey circle around U).
This zone can be defined, e.g., by BLAST scores below 10^-10,
or by more than 35% pairwise identical residues over long alignments.
Assume that there are only five other proteins (small white circles)
in the safe zone falling on one side of U. For example, PSI-BLAST
starts the next iteration with the family-specific profile given by the
proteins found in the safe zone. Searching the database again
with this profile, reaches safely into the twilight zone (zone
reached marked by double-lined egg indicated in figure). However,
no current method generally reaches all members of family U. Furthermore,
in particular for PSI-BLAST the new region may fall outside of
the initial safe zone (black sub-region of safe zone). Finally,
the regions that could have been reached by sequence-space hopping
or intermediate sequence searches (dashed circles around five
initial hits; [120, 121]) are not entirely covered by the
profile-based search. The tricky bit is to avoid that the profile
will pick unrelated proteins (transparent egg), and thus connect
two separate structural families (U and X). Conclusions:
(1) Iterated PSI-BLAST searches can safely identify fairly divergent
family members. (2) Close homologues may be lost during the extension
of the family. (3) The advanced search can lead astray.
Topics left out here. This review focuses on methods predicting secondary structure for globular proteins, in general. At the infancy of analysing the proteome of entirely sequenced organisms, the most useful structure prediction methods are those that focus on particular classes of proteins, such as proteins containing membrane helices and coiled-coil regions [27, 28, 29, 30]. For predicting the topology of helical membrane proteins, a number of new methods add interesting new facets [31, 32, 33, 34, 35, 36]. However, no method has really utilised the flood of recent experimental information about membrane proteins [37]. Overall, membrane helices can be predicted much more accurately than globular helices. Current state-of-the-art is to correctly predict all membrane helix topology for more than 80% of the proteins, and to falsely predict membrane helices for less than four percent of all globular proteins. We have recently come across evidence suggesting that this figure over-estimates performance (Rost, unpublished). Clearly, methods developed to predict helices in globular proteins go completely wrong for membrane helices! In contrast, porins appear to be predicted relatively accurately by methods developed for globular proteins [38, 39]. Few methods specifically predicting coiled-coil regions have recently been published (older review in: [40]). Two interesting developments are the prediction of the dimeric state of coiled-coils [41], and a method predicting 3D structure for coiled-coil regions [42]. In fact, the latter is the only existing method predicting 3D structure below 2 Ångstrøm main chain deviation over more than 30 residues. Another example of successful specialised secondary structure prediction methods is the focus on beta-turns [43, 44]. The method from the Thornton group appears to be the most accurate current means of predicting turns.
Successful methods specialised to predicting alpha-helix propensities have resulted from the experimental studies of short peptides in solution [45, 46] . Neither the turn, nor the helix-in-solution methods have yet been combined with other secondary structure prediction methods.
Jones broke through by using PSI-BLAST searches of large databases. David Jones pioneered using iterated PSI-BLAST searches automatically [11] . The most important step climbed by the resulting method PSIPRED has been the detailed strategy to avoid polluting the profile through unrelated proteins ( Fig. 1 ). To avoid this trap, the database searched has to be filtered first [11] . At the CASP meeting at which David Jones introduced PSIPRED, Kevin Karplus and colleagues presented their prediction method (SAM-T99sec) finding more diverged profiles through Hidden Markov models [47, 48] . Recently, Cuff & Barton also successfully used PSI-BLAST alignments for JPred2 [49] . Jennings et al. [50] explore an alternative to increasing divergence: they started with a safe zone alignment through ClustalW [51] and HMMer [26] , and iteratively refined the alignment using the secondary structure prediction from DSC [52] . The resulting alignment is reported to be more accurate and to yield higher prediction accuracy than the initial ClustalW / HMMer alignments [50] . How accurate is secondary structure prediction in 2000?
Prediction accuracy peaks at 76% accuracy. The currently best methods reach a level of 76% three-state per-residue accuracy ( Table 1 ). This constitutes a sustained level more than four percentage points above last century's best method not using diverged profiles (PHD in Table 1 ). Fortunately, the improvement is valid for helix, strand and non-regular regions (information and correlation indices in Table 1 ). Furthermore, significantly fewer residues are confused between the states helix and strand (BAD score, Table 1 ). Finally, some new methods also improve in a more global sense by improving the accuracy of assigning the secondary structural class (all-alpha, all-beta, alpha/beta, other) based on the predicted content of regular secondary structure (Class score, Table 1 ).
| Method (B) | Q3 (C) | Q3 Claim (D) | SOV (E) | Info (F) | CorrH (G) | CorrE (H) | CorrL (I) | Class (K) | BAD (L) |
| PROF | 77.2 | | 73 | 0.37 | 0.67 | 0.65 | 0.56 | 82 | 2.2 |
| PSIPRED | 76.6 | 76.5-78.3 (M) | 73 | 0.37 | 0.66 | 0.64 | 0.56 | 81 | 2.5 |
| SSpro | 76.3 | 76 | 71 | 0.36 | 0.67 | 0.64 | 0.56 | 83 | 2.5 |
| JPred2 | 75.2 | 76.4 | 70 | 0.34 | 0.65 | 0.63 | 0.54 | 77 | 2.4 |
| PHDpsi | 75.1 | | 70 | 0.29 | 0.64 | 0.62 | 0.53 | 80 | 2.9 |
| PHD | 71.9 | 71.6 | 68 | 0.25 | 0.59 | 0.59 | 0.49 | 77 | 4.1 |
| Copenhagen | 78 (N) | 77.8 | | | | | | | |
| Wang/Yuan | | | | | | | | 53 (O) | |
A: Data set and sorting: the results
are compiled by EVA [58] .
All methods for which details are listed have been tested on 195 different new
protein structures (EVA version Feb 2001). None of these proteins was similar
to any protein used to develop the respective method. This set comprised the
largest such set by Feb 1, 2001 for which we had results. Sorting and grouping
reflects the following concept: if the data set is too small to distinguish between
two methods, these two are grouped. For the given set of 195 proteins this yielded
three groups. Inside of each group, results are sorted alphabetically.
Due to a lack of data, I could not add the performance of SAM-T99sec [47] ;
on a set of 105 proteins SAM-T99sec appears comparable to the best 3: PSIPRED, SSpro, and PROF.
The results from the 'Copenhagen' method are set apart,
since they were not collected continuously by EVA (since the method
is not publicly available), rather they were provided by the group
in Denmark for this review and thus may have been based on marginally
differing sequence databases. B: Method:
see abbreviations on top of article; Copenhagen refers to the
method from the group in Denmark [63]; Wang/Yuan refers to a
method predicting secondary structural class from the amino acid
composition which may be the most accurate such method [59] ;
C Q3: three-state per-residue accuracy,
i.e., number of residues predicted correctly in either of the
three states helix, strand, other (conversion of DSSP states [HG]->
helix, [EB] -> strand; note that the per-residue accuracy tends
to favour methods over-predicting non-regular structure); D
Q3 Claim: three-state per-residue accuracy
published in original publication of method: PSIPRED [11] ,
SSpro [13] , JPred2 [6] , PHD [122] , E SOV:
three-state per-segment score measuring the overlap between predicted
and observed segments [75, 123] ; F Info:
per-residue information content [22] ; G CorrH:
Matthews correlation coefficient for state helix [124] ; H
CorrE: Matthews correlation coefficient for state strand [124] ;
I CorrL: Matthews correlation coefficient for
state other [124] ; K Class: percentage of
proteins correctly sorted into one of the four classes: all-alpha (length
> 60, helix > 45%, strand < 5%), all-beta (length > 60, helix
< 5%, strand > 45%), alpha/beta (length > 60, helix > 30%,
strand > 20%), other (thresholds for classification from:
[125, 126, 122] ); L
BAD: percentage of helical residues predicted as strand,
and of strand residues predicted as helix [127] ; M
accuracy range: PSIPRED results were published for different
conversions of the eight DSSP states to three states; M
accuracy range: P; O different set of
proteins: the class accuracy for the method based on amino
acid composition is taken from the original publication [59] ,
i.e. based on a different data set than all other methods.
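As a reading aid for the per-residue scores in Table 1, here is a minimal sketch of Q3, the per-state Matthews correlation coefficient, and a BAD-like score. It is not EVA's implementation. In particular, normalising BAD by the total number of residues is an assumption, since the legend does not state the denominator, and the two example strings are invented.

```python
from math import sqrt

def q3(observed, predicted):
    """Percentage of residues predicted in the correct one of three states."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

def matthews(observed, predicted, state):
    """Matthews correlation coefficient for `state` versus the other two states."""
    pairs = list(zip(observed, predicted))
    tp = sum(o == state and p == state for o, p in pairs)
    tn = sum(o != state and p != state for o, p in pairs)
    fp = sum(o != state and p == state for o, p in pairs)
    fn = sum(o == state and p != state for o, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def bad(observed, predicted):
    """Percentage of residues for which helix and strand are confused.

    Assumption: normalised by the total number of residues.
    """
    confused = sum({o, p} == {"H", "E"} for o, p in zip(observed, predicted))
    return 100.0 * confused / len(observed)

# Invented example: H = helix, E = strand, L = other.
observed = "HHHHLLEEEE"
predicted = "HHHLLLEEHE"
print(q3(observed, predicted), bad(observed, predicted))
print(round(matthews(observed, predicted, "H"), 3))
```

In this toy case eight of ten residues are correct, one strand residue is predicted as helix, and the helix correlation is 14/24.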
Sources of improvement: 4 parts database growth, 3 extended search, 2 other. Jones named two causes for the improved accuracy: (1) training and (2) testing the method on PSI-BLAST profiles. Cuff & Barton examined in detail how different alignment methods improve prediction accuracy [6]. However, which fraction of the improvement results from the mere growth of the database, which from using more diverged profiles, and which from training on larger profiles? Using PHD from 1994 to separate the effects [8], we first compared a non-iterative standard BLAST [53] search against SWISS-PROT [54] with one against SWISS-PROT + TrEMBL [54] + PDB [55]. The larger database improves performance by about two percentage points [8]. Secondly, we compared the standard BLAST against the big database with an iterative PSI-BLAST search. This yielded less than two percentage points additional improvement [8]. Thus, overall, the more divergent profile search against today's databases supposedly improves any method using alignment information by almost four percentage points (PHDpsi in Table 1). The improvement from using PSI-BLAST profiles to develop the method is relatively small: PHDpsi was trained on a small database of not very divergent profiles in 1994, whereas PROF was trained on PSI-BLAST profiles of a 20 times larger database in 2000. The two differ by only one percentage point (Table 1), and part of this difference resulted from implementing new concepts into PROF (Rost, unpublished; [9]).
Seemingly improve accuracy by ignoring short segments. There are many ways to publish higher levels of accuracy. Amongst the simplest for secondary structure prediction is to convert 3₁₀ helices and beta-bulges assigned by DSSP [2] to non-regular structure. This yields higher levels of accuracy since all methods - on average - are better at predicting the middle of helices and strands than their caps, and hence are more accurate for longer regular secondary structure segments [56, 57]. However, when using predicted secondary structure to predict 3D structure, short helices are important. Thus, I suggest staying with the more conservative conversion strategy.
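The effect of the conversion choice can be illustrated with a small sketch. The conservative mapping follows the Table 1 legend ([HG] to helix, [EB] to strand); the lenient mapping, which sends DSSP states G and B to 'other', is one plausible reading of the shortcut described above. The DSSP string and the prediction are invented, and `to_three_states` is a hypothetical helper.

```python
# Conservative conversion from the Table 1 legend: [HG] -> helix, [EB] -> strand.
CONSERVATIVE = {"H": "H", "G": "H", "E": "E", "B": "E"}
# Lenient variant (assumption): only H and E count as regular structure.
LENIENT = {"H": "H", "E": "E"}

def to_three_states(dssp, mapping):
    """Reduce a DSSP string to H/E/L; every unmapped state becomes 'L' (other)."""
    return "".join(mapping.get(state, "L") for state in dssp)

def q3(observed, predicted):
    """Three-state per-residue accuracy in percent."""
    return 100.0 * sum(o == p for o, p in zip(observed, predicted)) / len(observed)

# Invented DSSP assignment with a short 3-10 helix (GGG) and one B residue.
dssp = "HHHHTTGGG-EEEESB"
# Invented prediction that finds the long segments but misses the short ones.
predicted = "HHHHLLLLLLEEEELL"

conservative = to_three_states(dssp, CONSERVATIVE)
lenient = to_three_states(dssp, LENIENT)
print(q3(conservative, predicted), q3(lenient, predicted))
```

The same prediction scores 75% under the conservative conversion and 100% under the lenient one, although nothing about the prediction itself has changed.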
Comparing apples and oranges, or too few apples with one another. To overstate the point: there is NO value in comparing methods evaluated on different data sets. Most secondary structure prediction methods are available. Thus, developers may want to compare their results to public methods based on the same data set (not previously used for either of the two). Many methods predicting aspects of protein structure and function have to fight with limited data availability. This is not at all the case for secondary structure prediction. Hundreds of new protein structures are added every year [55]. If, for some reason or other, small data sets have to be used, developers should painstakingly try to estimate what 'significant difference' means for their data set. For example, 16 new protein structures are clearly too few! We currently have results from many prediction methods for 16 proteins. For that set, JPred2, PHD, PROF, PSIPRED, SAM-T99sec and SSpro are indistinguishable [58]!
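A back-of-the-envelope way to see why set size matters is the standard error of the mean per-protein accuracy. The sketch below assumes independent proteins and a per-protein standard deviation of about ten percentage points (the order of magnitude quoted in the caption of Fig. 2). It is a crude rule of thumb, not the grouping criterion used by EVA, and `standard_error` is a hypothetical helper.

```python
from math import sqrt

def standard_error(per_protein_sd, n_proteins):
    """Standard error of the mean accuracy over n independent proteins."""
    return per_protein_sd / sqrt(n_proteins)

# Assumed per-protein standard deviation of Q3, in percentage points.
SD = 10.0

for n in (16, 195):
    print(n, round(standard_error(SD, n), 2))
```

Under these assumptions the mean accuracy on 16 proteins is uncertain by about 2.5 percentage points, which is as large as many of the differences between methods in Table 1, whereas on 195 proteins the uncertainty shrinks to well below one percentage point.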
Seemingly achieve 100% accuracy by using correlated sets. Many publications on predicting secondary structural class from amino acid composition allowed correlations between 'training' and testing sets. Consequently, the published levels of prediction accuracy by far exceeded the theoretically possible margins [59]. A very simple operational definition for 'independent sets' is the following: two proteins A and B are correlated if the sequence similarity between A and B suffices to predict the structure of B knowing A's structure. Assume we have two un-correlated sets of proteins S1 and S2. Can we train the method on set S1 and develop it on set S2 without further ado? While developing PROF, I realised that the answer is negative. In fact, I trained neural networks on about 2000 structures that had no significant level of sequence similarity to our original set of 126 proteins [22]. I used the 126 only after I had completed developing the method and found a prediction accuracy exceeding 80% (unpublished). When testing PROF on a set of about 200 new structures that had been added to PDB in the meantime (different to that given in Table 1), prediction accuracy dropped. Do the 126 differ from the set used for Table 1? I failed to answer this question. Conclusion: test as test can, i.e., use as many independent sets of new structures as possible!
EVA: automatic evaluation of automatic prediction servers. In collaboration with Volker Eyrich (Columbia), Marc Marti-Renom, Andrej Sali (both Rockefeller), Florencio Pazos, and Alfonso Valencia (both CNB Madrid), we have started to address the above problems through the automatic server EVA [58]. Leszek Rychlewski (IIMCB Warsaw) and Dani Fischer (Ben-Gurion Univ.) are implementing similar ideas in LiveBench [60]. The simple concept is the following: take the N newest experimental structures added to PDB, send the sequences to all prediction servers, collect the results, and accumulate a continuous evaluation of prediction accuracy every week. EVA has been evaluating secondary structure prediction methods for more than six months now. I found it instructive to see how the 'ranking' of methods initially changed from week to week due to too small sets. Currently, EVA also provides results for evaluating comparative modelling (Sali group), and residue-residue contacts (Valencia group). We hope that EVA will eventually simplify life for developers, referees, editors and users.
SSpro: advanced recursive neural network system. The only method published recently that appears to improve prediction accuracy significantly not through more divergent profiles but through the particular algorithm is SSpro [13] . The major idea of the method aims at solving the following problem. When, e.g., training neural networks it is important to avoid correlations between training samples presented successively to the system. A neural network may be presented with the window around residue 11 in protein X at time step T and residue 7 in protein Y at step T+1. Thus, the system never learns that secondary structure correlates between adjacent residues. The result is that regular secondary structure segments are predicted - on average - at a length half that observed [19] . PHD addressed this problem by a second level structure-to-structure network that was trained on the predicted secondary structure from the first level sequence-to-structure network [22] . Most authors have since implemented this idea (in particular PSIPRED and JPred2). Pierre Baldi and colleagues deviated substantially from this concept. Instead of using an additional network, they embedded the correlation into one single recursive neural network. In principle, the idea of a recursive network had been implemented before [61] . However, the particular details of the algorithm implemented in SSpro are novel and - as Table 1 illustrates - prove highly successful.
HMMSTR: hidden Markov models for connecting library of structure fragments. Can we predict secondary structure for protein U by local sequence similarity to segments of known structures {S} even when overall U differs from any of the known structures {S}? Yes, as shown by many nearest-neighbour-based prediction methods, the most successful of which seems to be NSSP [62]. A conceptually quite different realisation of the same concept has been implemented in HMMSTR by Chris Bystroff, David Baker and colleagues [3]. Firstly, build a library of local stretches (3-19) of residues with 'basic structural motifs' (I-sites). Secondly, assemble these local motifs through Hidden Markov models introducing structural context on the level of super-secondary structure. Thus, the goal is to predict protein structure through identification of 'grammatical units of protein structure formation'. Although HMMSTR intrinsically aims at predicting higher order aspects of 3D structure, a side-result is the prediction of 1D secondary structure. I find two results surprising. (1) The authors do not find any significant effect of 'over-optimising' their method, i.e. HMMSTR appears as accurate in predicting secondary structure for proteins known today as it will be for those known next year. (2) Three-state per-residue accuracy is reported to be about 74% [3]. If this estimate is correct, HMMSTR is more accurate in predicting secondary structure than most existing methods, and almost as accurate as the state-of-the-art methods (Table 1).
And the winner is? The reason for the particular focus of this review on a small number of methods is largely that I could compare the selected methods to one another based on new proteins. A particular method that was not available to me may turn out to mark the most substantial breakthrough in the field. A Danish group developed a neural network-based method that is most amazing in many respects [63]. (1) The authors estimate the method to yield levels above 77% prediction accuracy (the title is slightly misleading). If true, this is the best current method. Like PSIPRED, JPred2, and PROF, the method uses PSI-BLAST profiles as input, and like most methods since PHD a two-level approach addressing the problem of predicting short segments. (2) A concept that had not been published before is to replace the standard three output units (for helix, strand, other) by nine output units that additionally code for the secondary structure states of the residues before and after the central one (dubbed 'output expansion'). (3) Also new is the particular way of weighting the average over different networks by the overall reliability of the prediction for that network, and the mere number of different networks considered (up to 800!). This impressive number of networks may prevent large-scale genome analyses based on this method. However, the major point is: did the authors over-estimate performance? The authors tested their method in a way that most developers would assume to be error-proof. However, their testing protocol is very similar to the one that I applied when significantly over-estimating the accuracy of PROF (> 81%). Obviously, the similarity of these two situations may very well be purely coincidental!
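The idea of 'output expansion' can be pictured as a change in the training target. The sketch below shows one way to encode nine output units for a residue, namely three one-hot triplets for the residue before, the central residue, and the residue after. It is only an illustration of the concept as described above, not the encoding of [63]; the all-zero triplet for missing neighbours at the termini and the name `expanded_target` are assumptions.

```python
STATES = "HEL"  # helix, strand, other

def expanded_target(structure, i):
    """Nine-unit target for residue i: one-hot codes for residues i-1, i, i+1.

    Assumption: a neighbour beyond the chain terminus is coded as three zeros.
    """
    target = []
    for j in (i - 1, i, i + 1):
        state = structure[j] if 0 <= j < len(structure) else None
        target.extend(1 if state == s else 0 for s in STATES)
    return target

# Invented observed secondary structure string.
structure = "HHEL"
print(expanded_target(structure, 1))
print(expanded_target(structure, 0))
```

The middle three units correspond to the standard three-state output, so a conventional prediction can still be read off the expanded output.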
Plethora of new concepts for secondary structure prediction. The following five methods are a small subset of new ideas explored to improve secondary structure prediction. (1) Ouali & King [64] combine neural networks and rule-based statistics in a cascade of classifiers. Based on a similar data set they estimate a level of prediction accuracy comparable to that of JPred2 (see Table 1). (2) Chandonia & Karplus [57] combined simplified output schemes (two output states) with networks trained on different tasks and a particular variant of early stopping; input are non-divergent alignments picked from the safe zone (Fig. 1). Based on a protocol similar to the one applied by the Danish group [63], the authors estimate a level of > 76% accuracy, i.e. a level that, if it holds up, is similar to SSpro (Table 1). (3) Supposedly the simplest new method that claims to almost approach the performance of PHD combines the information for secondary structure formation contained in amino acid singlets, doublets, and triplets. (4) Schmidler et al. [65] use a simple statistical model; the novel aspect is to replace compiling statistics over fixed stretches of N residues by segments signifying regular secondary structure (helix, strand). The underlying formalism resembles a hidden semi-Markov model, which allows particular propensities such as helix caps to be incorporated explicitly [66]. Based on non-comparable data sets the authors estimated prediction accuracy to be around 69%, which, if correct, is impressive for a method not using alignment information. (5) Without claims to surprising levels of accuracy, Figureau et al. [67] combine cleverly chosen pentapeptides from the database to obtain the final prediction.
Secondary structural class predicted almost as accurately as by experiment. Grouping proteins into secondary structure classes (all-alpha, all-beta, alpha/beta, other) appears a useful initial approach toward classifying proteins [27, 68]. Surprisingly, such classes can be predicted successfully based merely on the overall amino acid composition of a protein [69, 70, 59]. Increasingly complex and ingenious methods address this reduced goal; reported levels of prediction accuracy approach 100%. Recently, Wang & Yuan explained these high values by insufficient testing schemes, and argued that a four-state accuracy of 60% comprises the maximum for methods based solely on composition [59]. Obviously, it is much easier to predict class starting from the detailed information about evolutionary profiles for the entire sequence than by restricting the input to composition. In fact, the best current methods also improve the accuracy in predicting secondary structure class considerably (Table 1). The differences between observed and predicted composition of secondary structure are now below 6% for helix and strand. This is fairly close to what experimental low-resolution methods (circular dichroism, Fourier transform infrared spectroscopy) achieve at their best [57].
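Assigning a class from predicted secondary structure content is simple once the thresholds are fixed. The sketch below uses the thresholds quoted in the legend of Table 1 (Class score); the function name `structural_class` and the test strings are hypothetical, and the order in which the rules are checked is an assumption (the three regular classes do not overlap, so the order does not change the result).

```python
def structural_class(three_state):
    """Sort a three-state string (H/E/L) into one of four structural classes.

    Thresholds follow the Table 1 legend: all regular classes require
    more than 60 residues.
    """
    n = len(three_state)
    if n <= 60:
        return "other"
    helix = 100.0 * three_state.count("H") / n
    strand = 100.0 * three_state.count("E") / n
    if helix > 45 and strand < 5:
        return "all-alpha"
    if helix < 5 and strand > 45:
        return "all-beta"
    if helix > 30 and strand > 20:
        return "alpha/beta"
    return "other"

# Invented content strings; only the composition matters here.
print(structural_class("H" * 50 + "L" * 30))             # 62.5% helix
print(structural_class("E" * 40 + "L" * 40))             # 50% strand
print(structural_class("H" * 30 + "E" * 20 + "L" * 30))  # mixed
```

Applying the same function to observed and to predicted strings, and counting how often the two classes agree, gives a score of the kind reported in the Class column.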
Combination improves on non-systematic errors. Any prediction method has two sources of errors: (1) systematic errors, e.g., through non-local effects, and (2) white noise errors caused by, e.g., the succession of the examples during training neural networks.
Theoretically, combining any number of methods improves accuracy as long as the errors of the individual methods are mutually
independent and are not only systematic [71]. PHD - and more recently others [57, 6, 63] - utilised this fact by combining different neural networks. The idea of combining different prediction methods has been around in secondary structure prediction for a long time [19]; Cuff & Barton [4, 5] implemented it in JPred for different third generation methods. In particular, JPred uses a simple expert-rule for compiling the final average. Ross King et al. [72] have tested a variety of different combination strategies. Selbig et al. [73] have compiled the jury through an elaborate decision-tree based system. Guermeur et al. [74] have used a more refined variant of the JPred idea of weighting methods.
Overall, combinations of independent prediction methods seem to yield levels of accuracy higher than that of the single best method.
However, for every protein one method tends to be clearly superior to the combined prediction (Fig. 2B).
Is it really wise to include significantly inferior methods into combined prediction? No: averaging over all methods used for EVA
decreased accuracy over the best individual methods, although averaging over the better ones was better than averaging the best
ones (Rost, unpublished). Is there any criterion for when to include a method and when not? Concepts weighting the individual
methods based on their accuracy and 'entropy' [63] appear successful only for large numbers of methods ([63],
Rost, unpublished). Nevertheless, methods that are significantly over-trained can improve when combined (Krogh, unpublished).
More rigorous studies for the optimal combination may provide a better picture.
The technical problem of utilising many methods in a public server is that the field is advancing too fast: today's methods are more
accurate than averages over yesterday's methods (hence, the JPred server now returns JPred2 results by default).
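The simplest form of combination, a weighted average over the per-state outputs of several methods followed by a winner-takes-all decision, can be sketched as follows. This is a generic illustration and not the expert rule of JPred or the weighting scheme of [63]; the three 'methods' and their output values are invented, and `combine` is a hypothetical helper.

```python
STATES = "HEL"  # helix, strand, other

def combine(predictions, weights=None):
    """Average per-residue state scores over methods and pick the best state.

    predictions[m][i] is a (helix, strand, other) score triple from method m
    for residue i. `weights` defaults to equal weights for all methods.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    combined = []
    for per_residue in zip(*predictions):
        avg = [sum(w * scores[k] for w, scores in zip(weights, per_residue)) / total
               for k in range(len(STATES))]
        combined.append(STATES[avg.index(max(avg))])
    return "".join(combined)

# Invented outputs of three methods for a two-residue stretch.
m1 = [(0.6, 0.1, 0.3), (0.2, 0.5, 0.3)]
m2 = [(0.5, 0.2, 0.3), (0.4, 0.3, 0.3)]
m3 = [(0.1, 0.2, 0.7), (0.1, 0.6, 0.3)]

print(combine([m1, m2, m3]))             # equal weights
print(combine([m1, m2, m3], [1, 1, 0]))  # third method left out
```

In this toy case a single confident but deviating method changes the jury decision for the first residue, which is the kind of effect that makes the choice of which methods to include matter.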
Your protein may be predicted worse or better than average. A few problems for estimating expected prediction accuracy are described
above. However, another problem is relevant for users of prediction methods: A sustained level of 76% accuracy does NOT mean that 76% of
the residues in your protein of unknown structure U are correctly predicted. Rather, prediction accuracy varies substantially between
proteins ( Fig. 2 A). It seems that such variations are intrinsic to any method predicting aspects of protein structure
and function. What can you then expect as accuracy for your protein when using a state-of-the-art method? Given a divergent family
( Table 2 ), the answer is 66-86%. Do you learn from comparing different methods?
Fig. 2. Prediction accuracy varies substantially for different proteins. All results are based on 150 novel protein structures not used to develop any of the methods shown [58]. The considerable difference in the three-state accuracy between different proteins is valid for all methods (A: percentage of all 150 proteins predicted at a given level of accuracy; one standard deviation is in the order of ten percentage points). On average, different methods predict different proteins at higher levels (B: for each protein and each method, the difference between the per-protein average over all six methods and the accuracy of that method is shown; negative values imply that the respective method is better than the average). Conclusions: (1) If you predict secondary structure for your protein with a method of 76% accuracy, the actual accuracy for that protein may be anywhere between 50% and 90%. (2) As expected, most often some methods are more accurate than the average over many methods.
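To make the per-protein variation concrete, the three-state per-residue accuracy can be computed for each protein separately and then summarised by its mean and spread. The sketch below assumes predictions and observations are given as equal-length strings over a three-letter alphabet; the function names are hypothetical and the code is not taken from EVA.

```python
from statistics import mean, pstdev


def q3(predicted, observed):
    """Percentage of residues predicted in the correct one of three states."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def per_protein_summary(pairs):
    """Return (mean, standard deviation, min, max) of per-protein accuracy.

    pairs is a list of (predicted, observed) strings, one entry per protein.
    """
    scores = [q3(p, o) for p, o in pairs]
    return mean(scores), pstdev(scores), min(scores), max(scores)
```

A method with a mean of 76% and a standard deviation of about ten percentage points would, under this summary, leave a single protein well inside the 66-86% band only about two times out of three.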
Combining methods improves on average but you may also lose. Averaging over many methods helps, on average. However, most often some methods are more accurate than the average (Fig. 2B). Furthermore, there are examples of proteins predicted poorly by all methods (Fig. 2B), i.e. for which all methods agree by mistake (data not shown). Thus, trying to use many methods may not provide the answer to the question whether the prediction for your protein is more likely to be below or above average. Are there alternative ways to spot more reliably predicted regions?
More reliable predictions are more accurate. Reliability indices as provided by most methods correlate very well with prediction accuracy (Fig. 3). This implies that you can easily identify regions that are more likely to be predicted accurately than others. Furthermore, if your protein has many residues predicted at low levels of reliability, you may correctly suspect that your protein is predicted at a level below average. Plotting coverage versus accuracy (Fig. 3) also illustrates how beneficial more divergent profiles are for making predictions more useful. For example, PSIPRED has more than half of all residues predicted at levels that would be reached on average when comparing two known structures [75] (Fig. 3, dotted line).
Fig. 3. Prediction accuracy correlates with reliability.
The conclusion from Fig. 2A is that you have a poor idea how
well a method performs when applied to your protein of unknown
structure. Fortunately, there is a way out of this dilemma: Most
methods now provide an index measuring the reliability of the
prediction for each residue. Shown is the accuracy versus the
cumulative percentages of residues predicted at a given level
of reliability (coverage vs. accuracy). For example, PSIPRED and
PROF reach a level above 88% for about 60% of all residues (dashed
line). This particular line is chosen since secondary structure
assignments by DSSP agree to about 88% for proteins of similar
structure. Although JPred2 is only marginally less accurate than
PSIPRED and PROF (Table 1), it reaches this level of accuracy
for less than half of all residues. Conclusions:
(1) Reliability indices are extremely valuable to spot regions
of more-likely-to-be-correct predictions. (2) These indices also
address the problem of variation: if many residues are predicted
with high reliability, your protein is more likely to be predicted
more accurately than average (Fig. 2A).
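A coverage versus accuracy curve of the kind shown in Fig. 3 can be computed from per-residue reliability indices as sketched below. The input format (one reliability value and one correct/incorrect flag per residue) and the function name are assumptions for illustration; real methods differ in how their indices are scaled.

```python
def coverage_accuracy(residues):
    """Cumulative coverage and accuracy at each reliability threshold.

    residues is an iterable of (reliability_index, is_correct) pairs.
    Returns (threshold, coverage %, accuracy %) tuples, highest threshold first,
    where coverage is the share of residues with reliability >= threshold and
    accuracy is the share of those residues predicted correctly.
    """
    residues = list(residues)
    if not residues:
        raise ValueError("no residues given")
    total = len(residues)
    curve = []
    for threshold in sorted({r for r, _ in residues}, reverse=True):
        selected = [ok for r, ok in residues if r >= threshold]
        coverage = 100.0 * len(selected) / total
        accuracy = 100.0 * sum(selected) / len(selected)
        curve.append((threshold, coverage, accuracy))
    return curve
```

Reading the curve from the top, the coverage at which accuracy first drops below a chosen level (for instance 88%) gives the fraction of residues predicted at least that accurately.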
Regions likely to undergo structural change predicted successfully. Young, Kirshenbaum, Dill & Highsmith [1] have unravelled an impressive correlation between local secondary structure predictions and global conditions. The authors monitor regions for which secondary structure prediction methods give equally strong preferences for two different states. Such regions are processed combining simple statistics and expert-rules. The final method is tested on 16 proteins known to undergo structural rearrangements, and on a number of other proteins. The authors report no false positives, and identify most known structural switches. Subsequently, the group applied the method to the myosin family, identifying putative switching regions that were not known before but appeared to be reasonable candidates [76]. I find this method most remarkable in two ways: (1) it is the most general method using predictions of protein structure to predict some aspects of function, and (2) it illustrates that predictions may be useful even when structures are known (as in the case of the myosin family).
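The core idea of monitoring regions with equally strong preferences for two states can be illustrated with a simple scan over per-residue outputs. This is a hedged sketch of that idea alone, not the published method [1], which additionally applies statistics and expert-rules; the function name and all three thresholds are hypothetical.

```python
def ambiguous_regions(probs, max_gap=0.1, min_strength=0.3, min_len=5):
    """Find stretches where the two top-scoring states are nearly tied.

    probs is a list of (pH, pE, pL) tuples, one per residue. A residue is
    flagged when its two highest values differ by at most max_gap and the
    lower of the two is at least min_strength. Returns half-open, 0-based
    (start, end) intervals of at least min_len consecutive flagged residues.
    """
    flags = []
    for p in probs:
        second, first = sorted(p)[-2:]
        flags.append(first - second <= max_gap and second >= min_strength)
    regions, start = [], None
    # the trailing False closes a region that runs to the end of the chain
    for i, flagged in enumerate(flags + [False]):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions
```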
Classifying proteins based on secondary structure predictions in the context of genome analysis. Proteins can be classified into families based on predicted and observed secondary structure [27, 68]. However, such procedures have been limited to a very coarse-grained grouping that is only exceptionally useful for inferring function (Table 2). Nevertheless, predictions of membrane helices and coiled-coil regions in particular are crucial for genome analysis. Recently, we came across an observation that may have important implications, for structural genomics in particular: more than one fifth of all eukaryotic proteins appeared to have regions longer than 60 residues apparently lacking any regular secondary structure [77]. Most of these regions were not of low complexity, i.e. not composition-biased. Surprisingly, these regions appeared evolutionarily as conserved as all other regions in the respective proteins. This application of secondary structure prediction may aid in classifying proteins, and in separating domains, possibly even in identifying particular functional motifs.
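Scanning a predicted three-state string for long stretches without regular secondary structure takes only a few lines. The sketch below assumes that 'L' marks the non-regular state and uses the 60-residue cut-off quoted above; it does not reproduce any further criteria that the original analysis [77] may have applied, and the function name is hypothetical.

```python
import re


def long_nonregular_regions(prediction, longer_than=60, other="L"):
    """Return half-open, 0-based (start, end) spans of uninterrupted `other`.

    Only runs strictly longer than `longer_than` residues are reported.
    """
    pattern = re.compile(re.escape(other) + "{" + str(longer_than + 1) + ",}")
    return [(m.start(), m.end()) for m in pattern.finditer(prediction)]
```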
Aspects of protein function predicted based on expert-analysis of secondary structure. The typical scenario in which secondary structure predictions help to learn about function is that of experts combining predictions with their intuition, most often to find similarities to proteins of known function but insignificant sequence similarity [78, 79, 39, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89]. Usually, such applications are based on very specific details of the predicted secondary structure (some examples in Table 2). Thus, these successful correlations of secondary structure and function appear difficult to incorporate into automatic methods.
Exploring secondary structure predictions to improve database searches. Initially, three groups independently applied secondary structure predictions for fold recognition, i.e., the detection of structural similarities between proteins of unrelated sequences [90, 91, 92]. A few years later, almost every other fold recognition/threading method has adopted this concept [93, 94, 95, 96, 97, 98, 99, 100, 101, 102]. Two recent methods extended the concept by not only refining the database search, but by actually refining the quality of the alignment through an iterative procedure [103, 50]. A related strategy has been employed by Ng and the Henikoffs to improve predictions and alignments for membrane proteins [104].
From 1D predictions to 2D, and 3D structure. Are secondary structure predictions accurate enough to help predicting higher order aspects
of protein structure automatically? 2D (inter-residue contacts) predictions: Baldi, Pollastri, Andersen & Brunak
[105] have recently improved the level of accuracy in predicting beta-strand pairings over earlier work
[106] through using another elaborate neural network system. 3D predictions: the following list of five
groups exemplifies that secondary structure predictions have now become a popular first step toward predicting 3D structure. (1) Ortiz et al.
[107] successfully use secondary structure predictions as one component of their 3D structure prediction method.
(2) Eyrich et al. [108, 109] minimise the energy of arranging predicted rigid secondary
structure segments. (3) Lomize et al. [110] also start from secondary structure segments. (4) Chen et al.
[111] suggest using secondary structure predictions to reduce the complexity of molecular dynamics simulations.
(5) Levitt et al. [112, 113] combine secondary structure-based simplified presentations
with a particular lattice simulation attempting to enumerate all possible folds.
Particularly sensitive to divergence are the reliability indices, i.e., less divergence yields over-estimated reliability indices.
Detection of membrane proteins has less than 3% error rate for the best methods. Most helices are correctly predicted, yet the number of helices may nevertheless vary. Helix caps are clearly predicted inaccurately. Note that general methods predicting three-state secondary structure for globular proteins also predict caps less accurately.
How to get the best results? The major source of improvement is the divergence of the multiple sequence alignment used for prediction. Thus, if you have a small family the expected prediction accuracy is lower.
The most successful strategy to find the most reliably predicted regions may be to use the reliability index provided by a method rather than the agreement between different methods.
If you know non-globular or structural domains in your protein, chop it up before you build the alignment.
If you can improve the alignment, try to do so before the prediction.
Identify membrane proteins? Predicted membrane helices indicate that your protein is not globular. Membrane helix predictions are usually more reliable than those for globular proteins. Thus, membrane helix predictions should be given preference. Globular methods often do not predict globular helices at positions of membrane helices; rather, membrane helices are often predicted as strand by mistakenly applied globular methods. In contrast, globular methods appear relatively more accurate for porin-like beta-strand membrane regions.
Classify through coiled-coil regions? Predictions of long coiled-coil regions clearly indicate that your protein is locally non-globular. Long coiled-coil proteins are likely to be structural proteins. Longer regions are predicted more accurately.
Classify through secondary structure content?
Classifying proteins according to their secondary structure composition is helpful, but arbitrary. One hope may be to infer from the predicted secondary structure content that a particular protein is not typical. However, this attempt fails, since known protein structures vary widely, containing between 10% and 90% regular secondary structure (helix, strand). Thus, secondary structure composition does not help to predict globularity.
Identify domains or structural regions?
If you see two separate secondary structure patterns you may suspect that the protein has two structural domains. An extreme example is an N-terminal all-alpha region and a C-terminal all-beta region.
If you have to cut your protein, stay away more than two residues from predicted helices and strands.
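The rule of thumb for cutting can be turned into a small filter over a predicted three-state string. In this sketch a position counts as a candidate cut site if no predicted helix or strand residue lies within two residues on either side; the function name, the H/E alphabet for regular states, and the treatment of chain termini are assumptions.

```python
def safe_cut_positions(prediction, margin=2, regular="HE"):
    """Return 0-based positions more than `margin` residues from any helix or strand.

    A position qualifies if the window from position - margin to
    position + margin (clipped at the chain ends) contains no residue whose
    predicted state is listed in `regular`.
    """
    safe = []
    for i in range(len(prediction)):
        window = prediction[max(0, i - margin): i + margin + 1]
        if not any(state in regular for state in window):
            safe.append(i)
    return safe
```

Loops shorter than five residues never yield a candidate under this rule, so for tightly packed proteins the margin may have to be relaxed by hand.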
Monitor influences of point mutations? Secondary structure prediction methods are - on average - as accurate in predicting the overall content of secondary structure as are careful CD and FTIR methods. However, such experimental methods allow you to monitor in detail structural responses to mutations. Such changes are less likely to be reflected as accurately by prediction methods.
Find binding sites or motifs? Most often, binding sites lie in non-regular secondary structure elements. For example, we have not predicted regular secondary structure for any of the known nuclear localisation signals [128] .
Secondary structure predictions do not suffice to identify binding motifs, such as the zinc-finger II motif. However, the combination of sequence motif and predicted secondary structure may be very helpful.
Infer functional / structural similarity?
If you know the function/structure of protein A and want to infer whether B shares this function/structure, a similarity in the local secondary structure may help you substantially.
88% is the limit, but shall we ever reach close to there? Protein secondary structure formation is influenced by long-range interactions [114, 45, 46] and by the environment [1, 115]. Consequently, stretches of up to 11 adjacent residues (dubbed chameleon after [114]) can be found in different secondary structure states [116, 117, 118]. Implicitly, such non-local effects are contained in the exchange patterns of protein families. This is reflected by the fact that strand is predicted almost as accurately as helix (Table 1), although sheets are stabilised by more non-local interactions than helices. Local profiles can even suffice to identify structural switches [76, 1]. Surprisingly, we can find some traces of folding events in secondary structure predictions [119]. Even more amazing is a study suggesting that alignment-based methods achieve similar levels of accuracy for chameleon regions as for all other regions [117]. Secondary structure assignments may vary for two versions of the same structure. One reason is that protein structures are not rocks but dynamic objects with some regions more mobile than others. Another reason is that any assignment method has to choose particular thresholds (e.g. DSSP chooses a cut-off in the Coulomb energy of a hydrogen bond). Consequently, assignments differ by about 5-15 percentage points between different X-ray versions or different NMR models for the same protein (Andersen & Rost, unpublished), and by about 12 percentage points between structural homologues [75]. The latter number provides the upper limit for secondary structure prediction of error-free comparative modelling. I doubt that ab initio predictions of secondary structure will ever become more accurate than that. Hence, I believe a value around 88% constitutes an operational upper limit for prediction accuracy. After the advances over the last two years we reached above 76%. Thus, we need to climb another twelve percentage points (or even less).
What is the major obstacle to climbing another six percentage points? The size of the experimental database, as suggested [116]? I doubt this, since PHDpsi trained on only 200 proteins using PSI-BLAST input is almost as accurate as PSIPRED trained on 2000 proteins (Table 1). Will the current explosion of sequences boost accuracy? In fact, current databases have fewer than 10 homologues for more than one third of the 150 tested proteins (Table 1), and more than 100 for only 20% of the proteins. Although the set is too small for firm conclusions, for these 20% highly populated families the accuracy of PROF was four percentage points above average (data not shown). Thus, larger databases may get us six percentage points higher, or they may not. The answer remains nebulous.
Methods improved significantly over the last two years. Growing databases and improved search techniques (Fig. 1) - predominantly through the iterated PSI-BLAST tool - yielded a substantial improvement in secondary structure prediction accuracy over the last two years. State-of-the-art methods now reach sustained levels of 76% prediction accuracy (Table 1). Even more impressively, about 60% of all residues are predicted at levels reaching the level of agreement between X-ray and NMR structures (Fig. 3). However, novel ideas have also been shown to improve prediction accuracy. A standard way to increase the confidence in a particular prediction is to look at the results from many different prediction methods. This strategy is frequently successful, and has been brought to perfection over the last years. However, often the best method is better than the average over many methods (Fig. 2B). While structure prediction is coming of age, developers and users slowly learn to reduce over-estimations. However, the correlations between proteins at times of database explosions are becoming more difficult to control. It seems that only continuous, automatic evaluation servers will be able to handle this challenge in the future [58, 60].
Secondary structure predictions are at the base of structure-based sequence analysis. Almost a decade after their original breakthrough, prediction methods are now increasingly explored by wet-lab biologists to analyse their protein of interest. Secondary structure predictions are used automatically by methods aiming at higher dimensional aspects of protein structure, and at improving database searches and alignment accuracy. One method has successfully related secondary structure predictions automatically to functional aspects [76, 1]. However, secondary structure based identification of binding sites or other functional aspects is still restricted to single-case expert analyses.
And now we run human? The field has advanced considerably over the last two years, and more improvement appears to lie ahead. Prediction methods are fast enough to analyse entire genomes, and for particular examples the resulting classifications are relevant to structural and functional genomics [68, 28]. Nevertheless, to play the devil's advocate: the field is not up to the challenge of the human sequences to be dumped into the databases very soon. We are missing a variety of approaches relating secondary structure predictions explicitly to function, such as given by ASP [1]. Obviously, this remark may apply to bioinformatics in general: the year 2001 will commence with the publication of the entire human genome; we must rush to get ready for the data flood.
Thanks to Jinfeng Liu (Columbia) for computer assistance and collection
of the genome data sets; to Jinfeng Liu and Dariusz Przybylski
(Columbia) for providing preliminary information; to Claus Andersen
and Søren Brunak (CBS Copenhagen) for helpful comments.
Particular thanks to Volker Eyrich (Columbia) for having set up
most of the EVA software.
I find many of the publications listed here outstanding. However, I marked only recent publications more directly related to secondary structure prediction (except [9]). I preferentially marked methods introducing new concepts, and having convinced me - at least partially - that their claims hold. If the claims are true, [63] is clearly the most outstanding new development.