Columbia University, Department of Biochemistry and Molecular Biophysics, 650 West 168th Street, New York, NY 10032, USA
rost@columbia.edu, http://cubic.bioc.columbia.edu/ Tel: +1-212-305-3773, fax: +1-212-305-7932
| Title: | Rising accuracy of protein secondary structure prediction |
| Author: | Burkhard Rost |
| Quote: | in: 'Protein structure determination, analysis, and modeling for drug discovery' (ed. D Chasman), New York: Dekker, pp. 207-249 |
We still cannot predict protein 3D structure from sequence, in general. But bioinformatics continuously improves methods predicting simplified aspects of structure. In particular, the field of secondary structure prediction has achieved a breakthrough by combining algorithms from artificial intelligence with evolutionary information. PHD, the first third-generation method, surmounted the 'magic' line of predicting more than 70% of all residues correctly in one of three states (helix, strand, other). Furthermore, β-strands were predicted correctly almost twice as often as by methods of the first and second generation. Finally, predicted segments look like those observed. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than four percentage points to its current height around 77%. Divergent evolutionary profiles not only contain enough information to substantially improve prediction accuracy, but even to correctly predict long stretches of identical residues observed in alternative secondary structure states depending on non-local conditions. An example is a method automatically identifying structural switches, and thus finding a remarkable connection between predicted secondary structure and aspects of function. Due to their remarkable success, secondary structure predictions have become the workhorse for numerous methods aiming at predicting protein structure and function. Moreover, performance can be improved even further by using these methods in an 'expert' rather than in an 'automatic' mode. Have we now reached the limit of prediction accuracy? Time will tell.
Key words: genome sequence analysis, predicting globularity,
protein domains, protein structure prediction, solvent accessibility,
multiple alignments, trans-membrane helices.
The sequence-structure gap is rapidly increasing. Currently, databases for protein sequences (e.g. SWISS-PROT/TrEMBL [14] ) are expanding rapidly, largely due to large-scale genome sequencing projects: at the beginning of 2001, we know all sequences for more than 40 entire genomes [15, 16, 17] . This implies that the gap between known structures and known sequences is rapidly increasing, despite significant improvements of structure determination techniques (PDB [18, 6] ). The most successful theoretical approach to bridging this gap is comparative modelling. It effectively raises the number of 'known' 3D structures from 10,000 to over 100,000 [19, 20] . In fact, our ability to find the appropriate template so that we can apply comparative modelling has risen continuously over the last decade. Now, we can predict 3D structure through comparative modelling for more than twice as many proteins as in 1993. However, after four decades of ardent research, we still cannot predict structure from sequence [21, 22, 23] . Nevertheless, the field has had its success: the best methods now frequently get some features of the fold right [24] .
Simplifying the structure prediction problem. The rapidly growing sequence-structure gap has enticed theoreticians to solve simplified prediction problems [25] . An extreme simplification is the prediction of protein structure in one dimension (1D), as represented by strings of, e.g., secondary structure, or residue solvent accessibility. Theoreticians are lucky in that simplified predictions in 1D (e.g. secondary structure, or solvent accessibility [26, 25, 27] ) - even when only partially correct - are often useful, e.g., for predicting protein function, or functional sites.
Topics left out here. This review focuses on methods predicting
secondary structure for globular proteins, in general. At the
infancy of analysing the proteome of entirely sequenced organisms,
the most useful structure prediction methods are those that focus
on particular classes of proteins, such as proteins containing
membrane helices and coiled-coil regions [28, 29, 30, 31] .
For predicting the topology of helical membrane proteins, a number
of new methods add interesting new facets [32, 33, 34, 35, 36, 37] .
However, no method has really utilised the flood of recent experimental
information about membrane proteins [38] . Overall, membrane
helices can be predicted much more accurately than globular helices.
Current state-of-the-art is to correctly predict all membrane
helix topology for more than 80% of the proteins, and to falsely
predict membrane helices for less than four percent of all globular
proteins. We have recently come across evidence suggesting that
this figure over-estimates performance (Rost, unpublished). Clearly,
methods developed to predict helices in globular proteins go completely
wrong for membrane helices! In contrast, porins appear to be predicted
relatively accurately by methods developed for globular proteins
[39, 40] . Few methods specifically predicting coiled-coil
regions have recently been published (older review in: [41] ).
Two interesting developments are the prediction of the dimeric
state of coiled-coils [42] , and a method predicting 3D structure
for coiled-coil regions [43] . In fact, the latter is the only
existing method predicting 3D structure below 2 Ångstrøm
main chain deviation over more than 30 residues. Another example
for successful specialised secondary structure prediction methods
is the focus on beta-turns [44, 45] . The method from the Thornton
group appears to be the most accurate current means of predicting
turns. Successful methods specialised to predicting alpha-helix
propensities have resulted from the experimental studies of short
peptides in solution [46, 47] . Neither the turn, nor the helix-in-solution
methods have yet been combined with other secondary structure
prediction methods.
Secondary structure assigned by DSSP. Secondary structure is most often assigned automatically based on the hydrogen bonding pattern between the backbone carbonyl and NH groups (e.g. by DSSP [48] ). DSSP distinguishes eight secondary structure states that are often grouped into three classes: H = helix, E = strand, and L = non-regular structure. Typically the grouping is as follows: 'H' (α-helix) -> H, 'G' (3₁₀-helix) -> H, 'I' (π-helix) -> H, 'E' (extended strand) -> E, and 'B' (residue in isolated β-bridge) -> E, 'T' (turn) -> L, 'S' (bend) -> L, ' ' (blank = other) -> L, with the 'corrections': 'B ' -> EE, but 'B_B' -> LLL. Note that some developers use different projections of the eight DSSP classes onto three predicted classes; most of these yield seemingly higher levels of prediction accuracy. For example, short helices are more difficult to predict ( [49] , see also Fig 5). Hence, converting 'GGG' to 'LLL' lets authors report higher numbers.
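The grouping of the eight DSSP states into three classes can be sketched in a few lines. This is only an illustration of the basic projection listed above; the special 'corrections' for isolated bridges are deliberately left out, and the names `DSSP_TO_THREE`, `LENIENT`, and `reduce_dssp` are hypothetical.

```python
# Conservative projection of the eight DSSP states onto three classes:
# H, G, I -> H; E, B -> E; everything else -> L.
# The bridge 'corrections' mentioned in the text are NOT implemented here.
DSSP_TO_THREE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "L", "S": "L", " ": "L",
}

# Hypothetical alternative projection that turns 3-10 helices into 'L';
# projections of this kind tend to give seemingly higher accuracy.
LENIENT = dict(DSSP_TO_THREE, G="L")


def reduce_dssp(dssp_string, mapping=DSSP_TO_THREE):
    """Map a DSSP string to three states; unknown symbols become 'L'."""
    return "".join(mapping.get(state, "L") for state in dssp_string)


example = "  HHHHGGG TT EEEE B "
print(reduce_dssp(example))           # conservative projection
print(reduce_dssp(example, LENIENT))  # 'GGG' becomes 'LLL'
```

Evaluating the same prediction against both reduced strings shows how much a reported accuracy depends on the projection chosen.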
Per-residue prediction accuracy. The simplest and most widely used score for secondary structure prediction is the three-state per-residue accuracy, giving the percentage of residues predicted correctly in any of the three states: helix, strand, other:
$$Q_3 = 100 \cdot \frac{\sum_{i \in \{H, E, L\}} c_i}{N}$$ ( eqn. 1 )
where c_i is the number of residues predicted correctly in state i (H, E, L), and N is the number of residues in the protein (or in a given data set). As typical data sets contain about 32% H (helix), 21% E (strand), and 47% L (other), correct prediction of the non-regular class (L) tends to dominate the three-state accuracy. More fine-grained methods that avoid this shortcoming are defined in detail elsewhere [50, 51] .
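As a minimal sketch, the three-state per-residue accuracy (correctly predicted residues summed over H, E, and L, divided by N, times 100) can be computed as follows. The function name `three_state_accuracy` and the toy strings are made up for illustration.

```python
def three_state_accuracy(observed, predicted, states="HEL"):
    """Percentage of residues predicted in the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    if not observed:
        raise ValueError("empty input")
    # c_i: number of residues predicted correctly in state i
    correct = {state: 0 for state in states}
    for obs, pred in zip(observed, predicted):
        if obs == pred and obs in correct:
            correct[obs] += 1
    return 100.0 * sum(correct.values()) / len(observed)


# Toy example (not real data)
observed = "LLHHHHHHLLEEEELL"
predicted = "LLHHHHLLLLEEELLL"
print(three_state_accuracy(observed, predicted))

# Predicting 'L' everywhere scores exactly the fraction of L residues,
# which is why the non-regular class tends to dominate the score.
print(three_state_accuracy(observed, "L" * len(observed)))
```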
Per-segment prediction accuracy. Measures for single-residue accuracy do not completely reflect the quality of a prediction [52, 53, 54, 55, 51, 56] . Three simple measures assess the quality of predicting segments: (1) the number of correctly predicted segments, (2) the predicted vs. observed average segment length, and (3) the predicted vs. observed distributions of segments with length L [57] . All these measures can, e.g., identify methods with fairly high per-residue accuracy, yet an unrealistic distribution of segments. More elaborate scores are based on the overlap between predicted and observed segments (SOV: [51, 58] ).
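The simpler segment measures, i.e. segment counts and average segment lengths, are easy to sketch. The following is an illustration only (it does not implement SOV); `segments` and `average_segment_length` are hypothetical helper names and the strings are toy data.

```python
from itertools import groupby


def segments(structure, state):
    """Lengths of all consecutive runs of `state` in a three-state string."""
    return [len(list(run)) for symbol, run in groupby(structure)
            if symbol == state]


def average_segment_length(structure, state):
    """Mean run length for `state`, or 0.0 if the state never occurs."""
    lengths = segments(structure, state)
    return sum(lengths) / len(lengths) if lengths else 0.0


# Toy example: the prediction fragments one helix and one strand
observed = "LLHHHHHHHHLLEEEEELL"
predicted = "LLHHHLHHHLLLEELEELL"
for state in "HE":
    print(state,
          segments(observed, state), segments(predicted, state),
          average_segment_length(observed, state),
          average_segment_length(predicted, state))
```

Here most residues are predicted correctly, yet the predicted helix and strand are each split in two, which is the kind of unrealistic segment distribution that per-residue scores alone do not reveal.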
Conditions for evaluating sustained performance. Systematic testing of performance is a pre-condition for any prediction to become reliably useful. For example, the history of secondary structure prediction has partly been a hunt for highest accuracy scores, with over-optimistic claims by predictors seeding the scepticism of potential users. Given a separation of a data set into a training set (used to derive the method) and a test set (or cross-validation set, used to evaluate performance), a proper evaluation (or cross-validation) of prediction methods needs to meet four requirements. (1) No significant pairwise sequence identity between proteins used for training and test set, i.e., < 25% (length-dependent cut-off [59] ). (2) All available unique proteins should be used for testing, since proteins vary considerably in structural complexity; certain features are easier to predict, others harder. (3) No matter which data sets are used for a particular evaluation, a standard set should be used for which results are also always reported. (4) Methods should never be optimised with respect to the data set chosen for final evaluation. In other words, the test set should never be used before the method is set up.
Number of cross-validation experiments of NO meaning.
Most methods are evaluated in n-fold cross-validation experiments
(splitting the data set into n different training and test
sets). How many separations should be used, i.e., which number
of n yields the best evaluation? A misunderstanding is
often spread in the literature: the more separations (the larger
n ) the better. However, the exact number of n is
not important provided the test set is representative, comprehensive
and the cross-validation results are NOT misused to again change
parameters. In other words, the choice of n is of no meaning
for the user.
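For concreteness, an n-fold split over whole proteins can be sketched as below. This is a bare illustration with a hypothetical function name and made-up identifiers; it does not check the sequence identity between training and test proteins, which would have to be ensured separately.

```python
def n_fold_splits(protein_ids, n):
    """Yield (training, test) lists; each protein is tested exactly once."""
    if not 1 < n <= len(protein_ids):
        raise ValueError("n must be between 2 and the number of proteins")
    # Deal proteins into n folds in round-robin fashion
    folds = [protein_ids[i::n] for i in range(n)]
    for i, test in enumerate(folds):
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield training, test


# Made-up identifiers, for illustration only
proteins = ["prot_a", "prot_b", "prot_c", "prot_d",
            "prot_e", "prot_f", "prot_g"]
for training, test in n_fold_splits(proteins, 3):
    print(len(training), "training /", len(test), "test:", test)
```

Whatever n is chosen, every protein ends up in exactly one test set, so the choice of n changes the bookkeeping rather than which proteins are evaluated.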
1st generation: single residue statistics. The first experimentally
determined 3D structures of haemoglobin and myoglobin were published
in 1960 [60, 61] . Almost a decade earlier, Pauling and Corey
suggested an explanation for the formation of certain local conformational
patterns like α-helices and β-strands [62, 63] . Shortly
afterwards (and still prior to the first experimental structure), the
first attempt was made to correlate the content of certain amino
acids (e.g. Proline) with the content of α-helix
[64] . The idea was expanded by correlating the content for
all amino acids with that of α-helix and β-strand [65, 66] . The
field of secondary structure prediction had been opened. Most
early methods were first generation methods in that they were based
on single residue statistics. Preferences of particular amino
acids for particular secondary structure states were extracted
from the given small databases [67, 68, 69, 70, 71, 72, 73, 74, 75] .
By 1983, it became clear that the accuracy of these methods had
been over-estimated [76] ( Fig. 1 ).
Fig. 1. Three-state per-residue accuracy of various prediction methods. I included only methods for which I could run independent tests. Unfortunately, for most old methods this was not possible. However, for each method I had independent results from PHD [50, 78, 7] available. I normalised the differences between data sets by simply compiling levels of accuracy with respect to PHD. For comparison, I added the worst possible prediction (random), and the best possible one (through comparative modelling of a close homologue). The methods were: C+F Chou & Fasman (1st generation) [73, 242] ; Lim (1st) [74] ; GORI (1st) [83] ; Schneider (2nd) [117] ; ALB (2nd) [92] ; GORIII (2nd) [84] ; COMBINE (2nd) [243] ; S83 (2nd) [116] ; LPAG (3rd) [152] ; NSSP (3rd) [114] ; PHDpsi (3rd) [8] ; JPred2 (3rd) [5] ; SSpro (3rd) [13] ; PSIPRED (3rd) [11] ; PROF (3rd) [244] .
2nd generation: segment statistics. The principal improvement of the 2nd generation profited from the growth of experimental information about protein structure. These data made it possible to parameterise the information contained in consecutive segments of residues. Typically 11-21 adjacent residues are taken from a protein and statistics are compiled to evaluate how likely the residue central in that segment is to be in a particular secondary structure state. Similar segments of adjacent residues were also used to base predictions on more elaborate algorithms, some of which were spun off from artificial intelligence. Since then, almost every algorithm has been applied to the problem of predicting secondary structure; all were limited to accuracy levels around 60% ( Fig. 1 ). Reports of higher levels of accuracy were usually based on too small, or non-representative data sets [50, 77, 54, 78] . The main algorithms were based on: (i) statistical information [79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91] ; (ii) physico-chemical properties [92] ; (iii) sequence patterns [93, 94, 95] ; (iv) multi-layered (or neural) networks [96, 97, 98, 99, 100, 101, 102, 103] ; (v) graph-theory [104, 105] ; (vi) multivariate statistics [106, 107] ; (vii) expert rules [108, 109, 110, 105, 111, 112] ; and (viii) nearest-neighbour algorithms [113, 114, 115] .
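The segment-based input common to these methods can be illustrated by cutting a sequence into overlapping windows, one per residue, with the residue to be predicted at the centre. This is a sketch only: the function name `windows`, the default width of 17 (an arbitrary value inside the 11-21 range quoted above), and the padding symbol are assumptions for illustration.

```python
def windows(sequence, width=17, pad="-"):
    """Return one window per residue, centred on that residue.

    Positions beyond the termini are filled with `pad`.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]


# Toy sequence, short window for readability
for position, window in enumerate(windows("ACDEFG", width=5)):
    print(position, window, "central residue:", window[2])
```

Any of the algorithm families listed above would then map each window to a state for its central residue; no information from outside the window is visible to the method.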
Problems with 1st and 2nd generation methods. All methods from the first and second generation shared at least two of the following problems (most shared all three):
(1) three-state per-residue accuracy was below 70%,
(2) β-strands were predicted at levels of 28-48%, i.e., only slightly better than random,
(3) predicted helices and strands were too short.
The first problem (<100% accuracy) is commonly linked to two
features. (A) Secondary structure formation is partially determined
by long-range interactions, i.e., by contacts between residues
that are not visible by any method based on segments of 11-21
adjacent residues. (B) Secondary structure assignments vary by
5-12% even between different crystals of the same protein. Hence,
100% identical assignments are an unrealistic and unreasonable
aim. The second problem (β-strands
< 50% accuracy) has been explained by the fact that β-sheet
formation is determined by more non-local contacts than is α-helix
formation. The third problem (too short segments predicted) was
basically overlooked by most developers (exceptions: [116, 117] ).
This problem makes predictions very difficult to use, in practice
( Fig. 2 ). Many of the recent third generation prediction methods
address all three problems simultaneously, and are clearly superior
to the old methods ( Fig. 1 ). Nevertheless, many of the secondary
structure prediction methods available today (e.g. in GCG [118] ,
or from internet services [119] ) are unfortunately still using
the dinosaurs of secondary structure prediction.
Fig. 2. Example for typical secondary structure prediction of the 2nd generation. The protein sequence (SEQ ) given was the SH3 structure [184] . The observed secondary structure (OBS ) was assigned by DSSP [48] (H = helix; E = strand; blank = non-regular structure; the dashes indicate the continuation of the 2nd strand that was missed by DSSP). The typical prediction of too short segments (TYP ) poses the following problems in practice. (i) Are the residues predicted to be strand in segments 1, 5, and 6 errors, or should the helices be elongated? (ii) Should the 2nd and 3rd strand be joined, or should one of them be ignored, or does the prediction indicate two strands, here? Note: the three-state per-residue accuracy is 60% for the prediction given.
Variation in sequence space. The exchange of a few residues can already destabilise a protein [120] . This implies that the majority of the 20^N possible sequences of length N form different structures. Has evolution really created such an immense variety? Random errors in the DNA sequence lead to a different translation of protein sequences. These 'errors' are the basis for evolution. Mutations resulting in a structural change are not likely to survive, since the protein can no longer function appropriately. Furthermore, the universe of stable structures is not continuous: minor changes on the level of the 3D structure may destabilise the structure. Thus, residue exchanges conserving structure are statistically extremely unlikely. However, the evolutionary pressure to conserve structure and function has led to a record of this unlikely event: structure is more conserved than sequence [121, 122, 123] . Indeed, all naturally evolved protein pairs that have 35 of 100 pairwise identical residues have similar structures [124, 59] . However, the attractors of protein structures are even larger: the majority of protein pairs of similar structures have levels below 15% pairwise sequence identity [125, 59, 126] .
Long-range information in multiple sequence alignments. The residue substitution patterns observed between proteins of a particular structural family, i.e., changes that conserved structure, are highly specific for the structure of that family. Furthermore, multiple alignments of sequence families implicitly also contain information about long-range interactions. Suppose residues i and i + 100 are close in 3D; then the types of amino acids that can be exchanged (without changing structure) at position i are constrained in that their physico-chemical characteristics have to fit the amino acid types at position i + 100 [127, 128] .
Expert predictions: visual use of alignment information. The first method that used information from family alignments was proposed in the 1970s [129] . Since then, experts have based single-case predictions successfully on multiple alignments [130, 129, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146] . In fact, analysing the conservation patterns in sequence families is the first step taken by any expert wanting to learn anything about a particular protein. Conversely, proteins without homologues constitute the dead-end road of sequence analysis.
Automatic use of pairwise alignment information. The simplest
way to use alignment information automatically has been proposed
first by Maxfield & Scheraga and by Zvelebil et al. [147, 148] :
predictions were compiled for each protein in an alignment, and
then averaged over all proteins. A slightly more elaborate way
of automatically using evolutionary information is to directly
base prediction on a profile compiled from the multiple sequence
alignment [50, 78, 7] . The following steps are applied in particular for the PHD methods [149, 7] ( Fig. 3 ).
(1) A sequence of unknown structure (U) is quickly (typically
by Blast [150] ) aligned against the data base of known
sequences (i.e. no information of structure required!). (2) Proteins
with sufficient sequence identity to U to assure structural
similarity are extracted and re-aligned by a multiple alignment
algorithm MaxHom [151] . (3) For each position the profile
of residue exchanges in the final multiple alignment is compiled,
and used as input to a neural network.
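Step (3), compiling a per-position profile of residue exchanges from the final multiple alignment, might be sketched as follows. This is a deliberately simple frequency count and not necessarily the encoding used by PHD; the function name `alignment_profile` and the toy alignment are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def alignment_profile(aligned_sequences):
    """Relative amino acid frequencies for each alignment column.

    Gaps and non-standard symbols are ignored when counting.
    """
    if not aligned_sequences:
        raise ValueError("alignment is empty")
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have equal length")
    profile = []
    for column in zip(*aligned_sequences):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append({aa: (counts[aa] / total if total else 0.0)
                        for aa in AMINO_ACIDS})
    return profile


# Toy alignment: first row plays the role of the query U
alignment = ["ACDG", "ACEG", "A-DG", "SCDG"]
for position, frequencies in enumerate(alignment_profile(alignment)):
    observed = {aa: f for aa, f in frequencies.items() if f > 0}
    print(position, observed)
```

Each position then contributes 20 numbers to the network input instead of a single residue identity, so conserved and variable positions become distinguishable.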
Fig. 3. Using evolutionary information to predict secondary structure. Starting from a sequence of unknown structure (SEQUENCE ) the following steps are required to finally feed evolutionary information into the PHD neural networks (upper right): (1) a data base search for homologues through iterated PSI-Blast [150] (protocol from [8] ), (2) a decision for which proteins will be considered as homologues (BLAST score or length-dependent cut-off for pairwise sequence identity [124, 59] ), (3) a reduction of redundancy (purging proteins that are too similar to one another), and (4) a final refinement, and extraction of the resulting multiple alignment. Numbers 1-5 illustrate where users of the PredictProtein server [7, 119] can intervene to improve prediction accuracy without changes made to the actual prediction method PHD.
Example chosen: PHD. In the following, I illustrate the principal concepts of 3rd generation methods based on the particular neural network-based method PHD because it has been the most accurate method for many years, and because most of these concepts were introduced by PHD [50, 78] . Meanwhile, several other methods have reported and/or achieved similar levels of performance [152, 50, 144, 78, 114, 153, 154, 155, 156, 157, 158, 7, 159, 160] . More recent methods will be discussed in more detail below.
Multiple levels of computations. PHD processes the input information on multiple levels (neural network in Fig. 3 ). The first level is a feed-forward neural network with three layers of units (input, hidden, and output). Input to this first level sequence-to-structure network consists of two contributions: one from the local sequence, i.e., taken from a window of 13 adjacent residues, and another from the global sequence. Output of the first level network is the 1D structural state of the residue at the centre of the input window. The second level is a structure-to-structure network. The next level consists of an arithmetic average over independently trained networks (jury decision). The final level is a simple filter.
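The jury decision, i.e. the arithmetic average over independently trained networks, can be illustrated for a single residue. The sketch below assumes each network returns one output value per state in the order H, E, L; the function name `jury_decision` and the numbers are invented for illustration, and the final filter is not shown.

```python
def jury_decision(network_outputs, states="HEL"):
    """Average per-state outputs over networks and pick the largest.

    network_outputs: one list of per-state values for each network.
    Returns the winning state and the averaged outputs.
    """
    if not network_outputs:
        raise ValueError("need at least one network output")
    n = len(network_outputs)
    averaged = [sum(output[k] for output in network_outputs) / n
                for k in range(len(states))]
    best = max(range(len(states)), key=lambda k: averaged[k])
    return states[best], averaged


# Invented outputs of three networks for one residue (order: H, E, L)
outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.1, 0.4],
]
state, averaged = jury_decision(outputs)
print(state, averaged)
```

In this toy case one network favours strand, but the average still favours helix; averaging dampens the errors of individual networks.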
Balanced predictions by balanced training. The distribution
of the training examples (known structures) is rather uneven:
about 32% of the residues are observed in helix, 21% in strand,
and 47% in loop. Choosing the training examples proportional to
the occurrence in the data set (unbalanced training), results
in a prediction accuracy that mirrors this distribution, e.g.,
strands are predicted less accurately than helix or loop [50, 78, 49] .
A simple way around the data base bias is a balanced training:
at each time step one example is chosen from each class, i.e.,
one window with the central residue in a helix, one with the central
residue in a strand and one representing the loop class. This
training results in a performance well balanced between the output
states ( Fig. 4 ).
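Balanced training as described above amounts to drawing one example per class at each time step, regardless of how often each class occurs in the data. A minimal sketch, with a hypothetical function name and placeholder windows, assuming training examples are (window, state) pairs:

```python
import random


def balanced_batches(examples, steps, seed=0):
    """Yield, at each step, one randomly chosen example per class.

    examples: list of (window, state) pairs with state in H, E, L.
    """
    rng = random.Random(seed)
    by_class = {}
    for window, state in examples:
        by_class.setdefault(state, []).append((window, state))
    for _ in range(steps):
        yield [rng.choice(by_class[state]) for state in sorted(by_class)]


# Placeholder data mirroring the 32% / 21% / 47% distribution in the text
examples = ([("helix_window_%d" % i, "H") for i in range(32)]
            + [("strand_window_%d" % i, "E") for i in range(21)]
            + [("loop_window_%d" % i, "L") for i in range(47)])
batches = list(balanced_batches(examples, steps=10))
print(batches[0])
```

Each state is presented equally often during training although strand windows make up only about a fifth of the data.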
Fig. 4. Prediction balanced between three secondary structure states. The pies were valid for a simple neural network prediction not using evolutionary information (2nd generation). The entire pies represented 100% of (A + D ) all correctly predicted residues, (B ) all residues in a representative subset of PDB, and (C ) all residues presented during balanced training. The basic message is that the prediction of strand is not inferior to the one for helix for 2nd generation methods (A ) because strand formation is more dominated by long-range interactions (as previously argued), but because the data base distributions differ between the three states (B ). Simply skewing the distribution (C ) resulted in an equally accurate prediction for all three states (D ).
Better segment prediction by structure-to-structure networks. The first level sequence-to-structure networks use as input the following information from 13 adjacent residues: (1) profile of amino acid substitutions for all 13 residues; (2) conservation
Fig. 5. Distribution of segment lengths. The number of secondary structure segments observed (thick black line; according to DSSP [48] ) and predicted is plotted against their length. All methods miss short helices and strands. However, short regions lacking regular secondary structure were also under-predicted by all methods. Overall, most methods predict segments around the lengths of those observed. All results are based on a data set of 201 proteins taken from the EVA server [180, 181] that contained no protein used for training of any of the methods (also used for results in Table 6).
Automatically aligning protein families based on profiles:
the PSI-BLAST wonder. Just as experts have been using alignment
information to predict aspects of structure and function, they
have intruded into the twilight zone of sequence alignments [122]
using profile-based alignment techniques. The idea of profile-based
searches is simply to use the fact that profiles of evolutionary
conservation are highly specific for every protein family. For
example, Glycines can often be mutated without major changes.
However, in particular families, the conservation of some Glycines
may be crucial to maintain mobility. Many groups have successfully
implemented semi-automatic profile-based database searches [150, 161, 162, 163, 164, 165, 166] .
However, the breakthrough to large-scale routine searches has
been achieved by the development of PSI-BLAST [10] and Hidden
Markov models [167, 12] . In particular, the gapped, profile-based,
and iterated search tool PSI-BLAST continues to revolutionise
the field of protein sequence analysis through its unique combination
of speed and accuracy. More distant relationships are found through
iteration starting from the safe zone of comparisons and intruding
deeply and reliably into the twilight zone ( Fig. 6 ).
Fig. 6. Profile-based searches extend evolutionary information. The cloud signifies a protein structural family for the query protein U, i.e. all proteins that have a similar 3D structure. A simple pairwise comparison of U with all other proteins covers the 'safe zone' of sequence alignment (blue circle around U). This zone can be defined, e.g., by BLAST scores below 10^-10, or by more than 35% pairwise identical residues over long alignments. Assume that there are only five other proteins (small white circles) in the safe zone falling all on the same side of U. Now, PSI-BLAST starts the next iteration with the family-specific profile given by the proteins found in the safe zone. Searching the database again with this profile reaches safely into the twilight zone (zone reached marked by double-lined egg indicated in figure). However, no current method generally reaches all members of family U. Furthermore, in particular for PSI-BLAST the new region may fall outside of the initial safe zone (black/yellow moon left of safe zone). Finally, the regions that could have been reached by sequence-space hopping or intermediate sequence searches (light blue circles around five initial hits; [245, 246, 59] ) are not entirely covered by the profile-based search. The tricky bit is to avoid that the profile will pick unrelated proteins (transparent egg), and thus connect two separate structural families (U and X). The three circles on the right hand side signify the three zones: safe (no error), twilight (some to many errors), midnight (hardly any correct hit) of database searches. Their sizes are proportional to the 'real' size of the regions they occupy. The graph shows that the number of false positives explodes within a narrow region of the twilight zone (0 error at 33% sequence identity, > 90% error at 25%). At the same time, more than half the true homologues are only found in the midnight zone. Conclusions: (1) Iterated PSI-BLAST searches can safely identify fairly divergent family members.
(2) Close homologues may be lost during the extension of the family. (3) The advanced search can lead astray.
Jones broke through by using PSI-BLAST searches of large databases. David Jones pioneered using iterated PSI-BLAST searches automatically [11] . The most important step taken by the resulting method PSIPRED has been the detailed strategy to avoid polluting the profile through unrelated proteins ( Fig. 6 ). To avoid this trap, the database searched has to be filtered first [11] . At the CASP meeting at which David Jones introduced PSIPRED, Kevin Karplus and colleagues presented their prediction method (SAM-T99sec) finding more diverged profiles through Hidden Markov models [168, 169] . Recently, Cuff & Barton also successfully used PSI-BLAST alignments for JPred2 [170] . Jennings et al. [171] explored an alternative to increasing divergence: they started with a safe zone alignment through ClustalW [163] and HMMer [167] , and iteratively refined the alignment using the secondary structure prediction from DSC [157] . The resulting alignment is reported to be more accurate and to yield higher prediction accuracy than the initial ClustalW / HMMer alignments [171] .
SSpro: advanced recursive neural network system. The only method published recently that appears to improve prediction accuracy significantly not through more divergent profiles but through the particular algorithm is SSpro [13] . The major idea of the method aims at solving the problem of predicting too short segments. PHD addressed this problem by a second level structure-to-structure network [50] . Most authors have since implemented this idea (in particular PSIPRED and JPred2). Pierre Baldi and colleagues deviated substantially from this concept. Instead of using an additional network, they embedded the correlation into one single recursive neural network. In principle, the idea of a recursive network had been implemented before [172] . However, the particular details of the algorithm implemented in SSpro are novel and - as Table 1 illustrates - prove highly successful. Interestingly, SSpro is less successful at improving the prediction of segment lengths than at improving overall accuracy ( Fig. 5 ).
HMMSTR: hidden Markov models for connecting library of structure fragments. Can we predict secondary structure for protein U by local sequence similarity to segments of known structures {S} even when overall U differs from any of the known structures {S}? Yes, as shown by many nearest-neighbour-based prediction methods, the most successful of which seems to be NSSP [160] . A conceptually quite different realisation of the same concept has been implemented in HMMSTR by Chris Bystroff, David Baker and colleagues [2] . Firstly, build a library of local stretches (3-19) of residues with 'basic structural motifs' (I-sites). Secondly, assemble these local motifs through Hidden Markov models introducing structural context on the level of super-secondary structure. Thus, the goal is to predict protein structure through identification of 'grammatical units of protein structure formation'. Although HMMSTR intrinsically aims at predicting higher order aspects of 3D structure, a side-result is the prediction of 1D secondary structure. I find two results surprising. (1) The authors do not find any significant effect of 'over-optimising' their method, i.e. HMMSTR appears as accurate in predicting secondary structure for proteins known today as it will be for those known next year. (2) Three-state per-residue accuracy is reported to be about 74% [2] . This value may be over-estimated. Nevertheless, HMMSTR is clearly one of the better prediction methods.
Plethora of new concepts for secondary structure prediction
explored recently. The following five methods are a small
subset of new ideas explored to improve secondary structure prediction.
(1) Ouali & King [173] combine neural networks and rule-based
statistics in a cascade of classifiers. Based on a similar data
set they estimate a level of prediction accuracy comparable to
that of JPred2 (see Table 1 ). (2) Chandonia & Karplus [174]
combined simplified output schemes (two output states) with networks
trained on different tasks and a particular variant of early stopping;
input are non-divergent alignments picked from the safe zone (Fig.
1). Based on a protocol similar to the one applied by the Danish
group [175] , the authors estimate a level of > 76% accuracy,
i.e. a level that if holding up is similar to SSpro ( Table 1 ).
(3) Supposedly the simplest new method that claims to almost approach
the performance of PHD combines the information for secondary
structure formation contained in amino acid singlets, doublets,
and triplets. (4) Schmidler et al. [176] use a simple statistical
model; the novel aspect is to replace compiling statistics over
fixed stretches of N residues by segments signifying regular secondary
structure (helix, strand). The underlying formalism resembles
a hidden semi-Markov model that allows particular propensities
such as helix caps to be incorporated explicitly [177] . Based on
non-comparable data sets the authors estimated prediction accuracy
to be around 69%; if correct, this value is extremely impressive
for a 2nd generation method. (5) Without claims to
surprising levels of accuracy, Figureau et al. [178] combine
cleverly chosen pentapeptides from the database to obtain the
final prediction.
Seemingly improve accuracy by ignoring short segments. There are many ways to publish higher levels of accuracy. Amongst the simplest for secondary structure prediction is to convert 3₁₀ helices and beta-bulges assigned by DSSP [48] to non-regular structure. This yields higher levels of accuracy since all methods are, on average, better at predicting the middle of helices and strands than their caps, and hence are more accurate for longer regular secondary structure segments [49, 174] . When using predicted secondary structure to predict 3D structure, short helices are important. Thus, I suggest staying with the more conservative conversion strategy.
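To make the effect of the conversion concrete, the following minimal sketch reduces a DSSP string to three states in two ways. The one-letter codes are the standard DSSP ones; the names CONSERVATIVE, LENIENT and to_three_states are hypothetical, and treating DSSP code B (isolated bridge) as the 'beta-bulge' case of the text is an assumption.

```python
# Two illustrative reductions of DSSP's eight states to three (H, E, L).
# DSSP codes: H alpha-helix, G 3-10 helix, I pi-helix, E extended strand,
# B isolated beta-bridge, T turn, S bend, '-' other.
CONSERVATIVE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}
LENIENT = {"H": "H", "E": "E"}  # G, I and B fall through to 'L'


def to_three_states(dssp, mapping):
    """Map a DSSP string to three states; codes not in `mapping` become 'L'."""
    return "".join(mapping.get(code, "L") for code in dssp)


dssp = "-GGG-EEEE-B-HHHHHH-"
print(to_three_states(dssp, CONSERVATIVE))
print(to_three_states(dssp, LENIENT))
```

Under the lenient mapping the short 3₁₀ helix and the isolated bridge disappear from the observed structure, so a method is no longer penalised for missing them; this is why accuracy values obtained with different conversions are not directly comparable.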
Comparing apples and oranges, or too few apples with one another. To overstate the point: there is NO value in comparing methods evaluated on different data sets. Most secondary structure prediction methods are available. Thus, developers may want to compare their results to public methods based on the same data set (not previously used for either of the two). Many methods predicting aspects of protein structure and function have to fight with limited data availability. This is not at all the case for secondary structure prediction. Hundreds of new protein structures are added every year [6] . If, for some reason or other, small data sets have to be used, developers should painstakingly try to estimate what 'significant difference' means for their data set. For example, about 20 new protein structures are clearly too few! This is the number of proteins that were available for the CASP4 meeting. Based on that set all 3rd generation methods were equal!
Seemingly achieve 100% accuracy by using correlated sets. Many publications on predicting secondary structural class from amino acid composition allowed correlations between 'training' and testing sets. Consequently, the published levels of prediction accuracy exceeded by far the theoretically possible margins [179] . A very simple operational definition for 'independent sets' is the following: two proteins A and B are correlated if the sequence similarity between A and B suffices to predict the structure of B knowing A's structure. Assume we have two un-correlated sets of proteins S1 and S2. Can we train the method on set S1 and develop it on set S2 without further ado? While developing PROF, I realised that the answer is negative. In fact, I trained neural networks on about 2000 structures that had no significant level of sequence similarity to our original set of 126 proteins [50] . I used the 126 only after I had completed developing the method and found a prediction accuracy exceeding 80% (unpublished). When testing PROF on a set of about 200 new structures that had been added to PDB in the meantime (different to that given in Table 1 ), prediction accuracy dropped. Do the 126 differ from the set used for Table 1? I failed to answer this question.
EVA: automatic evaluation of automatic prediction servers.
In collaboration with Volker Eyrich (Columbia), Marc Marti-Renom,
Andrej Sali (both Rockefeller), Florencio Pazos, and Alfonso Valencia
(both CNB Madrid), we have started to address the above problems
through the automatic server EVA [180, 181] . Leszek Rychlewski
(IIMCB Warsaw) and Dani Fischer (Ben-Gurion Univ.) are implementing
similar ideas in LiveBench [182] . The simple concept is the
following: take the N newest experimental structures added to
PDB, send the sequences to all prediction servers, collect the
results, and accumulate a continuous evaluation of prediction
accuracy every week. EVA has been evaluating secondary structure
prediction methods for more than six months now. I found it instructive
to see how the 'ranking' of methods initially changed from week
to week due to too small sets. Currently, EVA also provides results
for evaluating comparative modelling (Sali group), and residue-residue
contacts (Valencia group). We hope that EVA will eventually simplify
life for developers, referees, editors and users.
Prediction accuracy peaks at 77%. The currently
best methods reach a level around 77% three-state per-residue
accuracy ( Table 1 ). This constitutes a sustained level about five
percentage points above last century's best method not using diverged
profiles (PHD in Table 1 ). Fortunately, the improvement is valid
for helix, strand and non-regular regions (information and correlation
indices in Table 1 ). Furthermore, significantly fewer residues
are confused between the states helix and strand (BAD score, Table
1). Finally, some new methods also improve in a more global sense
by improving the accuracy of assigning the secondary structural
class (all-alpha, all-beta, alpha/beta, other) based on the predicted
content of regular secondary structure (Class score, Table 1 ).
Difference between 60% and 70% accuracy may matter a lot! Some
of the 3rd generation methods for secondary structure prediction
are clearly superior to previous methods: β-strands are predicted more accurately; predicted segments look like those observed; and the overall accuracy is about ten percentage points higher. The advantage in practice is illustrated in Fig. 7 . Not only does the 3rd generation method (here PHD) get most segments right, but it also enables users to focus on the more reliably predicted residues. The reliability index (Rel in Fig. 7) is compiled as the difference between the output unit with the highest value (winner unit) and the output unit with the next highest value (normalised to a scale from 0 (low) to 9 (high)). All strongly predicted residues (* in Fig. 7) are predicted correctly.
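The definition of the reliability index can be written down in a few lines. This is a minimal sketch, not the PHD code: it assumes output values in the range 0 to 1 and a linear scaling into ten bins, and the function name reliability_index is hypothetical.

```python
def reliability_index(outputs):
    """Return an integer 0..9 from the gap between the two strongest outputs.

    `outputs` holds the three output-unit values (helix, strand, other),
    assumed here to lie in [0, 1].
    """
    ranked = sorted(outputs, reverse=True)
    gap = ranked[0] - ranked[1]  # winner unit minus runner-up
    # Assumed scaling: ten equal bins, with the top bin capped at 9.
    return min(9, int(10 * gap))


print(reliability_index([0.90, 0.05, 0.05]))  # strong helix signal
print(reliability_index([0.40, 0.35, 0.25]))  # nearly tied outputs
```

A residue whose two strongest outputs are almost equal receives an index near 0, whatever the absolute height of the winner unit.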
Fig. 7. Example for secondary structure prediction of 1st-3rd generation. TOP panel: SH3 structure [184] . The dashes indicate the continuation of the 2nd strand that was missed by DSSP. The methods are 1st generation: C+F [73] ; 2nd generation: GOR [243] (= GORIII), and 3rd generation: PHD [7] . The levels of three-state accuracy were: C+F = 59%; GOR = 65%; and PHD = 72%. Whereas the 1st and 2nd generation methods performed above their average accuracy (Fig. 1) for this protein, the PHD prediction was average (Fig. 1; Fig. 7). The strength of the PHD prediction was reflected in the one-digit reliability index (Rel, 0 = low, 9 = high), which correlates with prediction accuracy. All residues predicted at values of Rel > 4 (marked by *) were predicted correctly. LOWER panel: translation elongation factor beta-1 [247] : shown are examples for methods exploring extended profile searches (Table 1 for abbreviations). An N-terminal strand and helix (not shown) were correctly predicted by all methods. Although the combination of various methods (EVA-4) is better on average (Table 1), it is debatable which prediction is most useful here.
Values for expected prediction accuracy are distributions.
Statements such as 'secondary structure is about 90% conserved
within sequence families' [51] refer to averages over distributions.
The same holds for the expected prediction accuracy ( Fig. 8 ).
Such distributions explain why some developers have over-estimated
the performance of their tools using data sets of only tens of
proteins (or even fewer). In general, single sequences yield accuracy
values about ten percentage points lower than multiple alignments
[50, 54, 78] . Note that for most proteins some helix and
strand residues are confused (BAD predictions in Fig. 8 ).
Fig. 8. Expected variation of prediction accuracy with protein chain for PHD. (A) Three-state per-residue accuracy (eq. 1; PDB identifier given for the proteins predicted worst); (B) percentage of BAD predictions, i.e., residues either predicted in helix and observed in strand, or predicted in strand and observed in helix (introduced by [56] ); (B inlet) cumulative percentage of proteins with BADly predicted residues (e.g. for 80% of the proteins the percentage of confused helix and strand residues is < 7%; however, for only 30% of all proteins did such a confusion never happen). Given: distributions (over 721 unique protein chains), averages, and one standard deviation. Distributions of all other third generation methods given in Table 1 are qualitatively similar.
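The two per-protein scores shown in Fig. 8 can be computed as follows. This is a minimal sketch for three-state strings (H, E, L); the function names q3 and bad are hypothetical, and taking all residues of the chain as the denominator for BAD is an assumption.

```python
def q3(observed, predicted):
    """Three-state per-residue accuracy in percent."""
    if len(observed) != len(predicted):
        raise ValueError("strings must have equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def bad(observed, predicted):
    """Percentage of residues confused between helix and strand."""
    if len(observed) != len(predicted):
        raise ValueError("strings must have equal length")
    confusions = {("H", "E"), ("E", "H")}
    confused = sum((o, p) in confusions for o, p in zip(observed, predicted))
    return 100.0 * confused / len(observed)


observed = "HHHHLLEEEE"
predicted = "HHHELLEEHL"
print(q3(observed, predicted), bad(observed, predicted))
```

In this toy example three residues are wrong, but only two of them count as BAD: a strand predicted as non-regular lowers Q3 without being a helix-strand confusion.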
Reliability of prediction correlates with accuracy. For
the user interested in a particular protein U, the fact that prediction
accuracy varies with the protein ( Fig. 8 ) implies a rather unfortunate
message: the accuracy for U could be lower than 40%, or it could
be higher than 90% ( Fig. 8 ). Is there any way to provide an estimate
at which end of the distribution the accuracy for U is likely
to be? Indeed, the reliability index correlates with accuracy.
In other words, residues with higher reliability index are predicted
with higher accuracy [50, 78, 7] . Thus, the reliability index
offers an excellent tool to focus on some key regions predicted
at high levels of expected accuracy. Furthermore, the reliability
index averaged over an entire protein correlates with the overall
prediction accuracy for this protein ( Fig. 9 ). (Note however,
that the reliability indices tend to be unusually high for alignments
of sequence families without very divergent sequences.) Plotting
the reliability of the prediction against accuracy ( Fig. 9 ) also
reveals that minor differences in overall accuracy may matter.
For example, JPred2 and PROF differ by only two percentage points
( Table 1 ), however, JPred2 reaches 88% accuracy for 'only' 45%
of all residues whereas PROF reaches that level for more than
60% of all residues ( Fig. 9 ).
Fig. 9. Correlation between reliability and accuracy. Residues predicted at higher reliability are predicted more accurately [50, 78, 7] . In fact, proteins with higher average reliability index are predicted above average (A, method: PROF). For example, no protein predicted at an average reliability ≥ 6 has less than 76% accuracy, and only 3 out of 201 are below 70% accuracy for an average index ≥ 5. PROF predictions were ≥ 5 on average for one fourth of all proteins; for these the prediction accuracy was 83%. Reliability indices are now being used by most methods (B). They also enable users to spot particular regions predicted more accurately than others. For example, PROF and PSIPRED reach a level of accuracy similar to comparative modelling (around 88%, dotted line) for about 60% of all residues, and more than 93% of the quarter of the residues predicted at highest indices are correctly predicted. Note: the values in B are cumulative; e.g. 100% of all residues for PROF are predicted at 77.4% accuracy (Table 1).
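A cumulative curve of the kind described for panel B can be compiled from per-residue results as sketched below. The function name cumulative_accuracy and the toy data are hypothetical; the sketch only assumes that each residue carries a flag (predicted correctly or not) and a reliability index from 0 to 9.

```python
def cumulative_accuracy(correct, reliability):
    """Return rows (threshold, % of residues with Rel >= threshold,
    % of those residues predicted correctly), from threshold 9 down to 0."""
    total = len(correct)
    rows = []
    for threshold in range(9, -1, -1):
        picked = [c for c, r in zip(correct, reliability) if r >= threshold]
        if picked:  # skip thresholds that no residue reaches
            coverage = 100.0 * len(picked) / total
            accuracy = 100.0 * sum(picked) / len(picked)
            rows.append((threshold, coverage, accuracy))
    return rows


correct = [True, True, True, False, True, False, True, False]
reliability = [9, 8, 7, 7, 5, 3, 2, 1]
for row in cumulative_accuracy(correct, reliability):
    print(row)
```

The last row (threshold 0) always covers 100% of the residues and therefore reproduces the overall per-residue accuracy of the toy data.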
| Method (B) | Q3 (C) | Q3 Claim (D) | BAD (E) | Info (F) | CorrH (G) | CorrE (H) | SOV (I) | Class (K) |
| PHD | 71.7 | 71.6 | 4.1 | 0.25 | 0.59 | 0.60 | 68 | 78 |
| JPred2 | 75.0 | 76.4 | 2.4 | 0.34 | 0.64 | 0.63 | 70 | 77 |
| PHDpsi | 75.0 | | 2.9 | 0.29 | 0.64 | 0.62 | 70 | 81 |
| PROF | 77.0 | | 2.1 | 0.37 | 0.67 | 0.65 | 73 | 83 |
| PSIPRED | 76.7 | 76.5-78.3 (M) | 2.4 | 0.37 | 0.66 | 0.64 | 73 | 81 |
| SSpro | 76.2 | 76 | 2.6 | 0.36 | 0.67 | 0.65 | 71 | 83 |
| EVA-4 | 77.8 | | 2.0 | 0.38 | 0.69 | 0.67 | | 83 |
A: Data set and sorting: the results are compiled by EVA [248] . All methods for which details are listed have been tested on 201 different new protein structures (EVA version Feb 2001). None of these proteins was similar to any protein used to develop the respective method. This set comprised the largest such set by Feb 7, 2001 for which we had results. Sorting and grouping reflects the following concept: if the data set is too small to distinguish between two methods, these two are grouped. For the given set of 195 proteins this yielded four groups. Inside each group, results are sorted alphabetically. Due to a lack of data, I could not add the performance of SAM-T99sec [168] ; on a set of 105 proteins SAM-T99sec appears comparable to the best three: PSIPRED, SSpro, and PROF. Another method that appeared at least as accurate when tested on an earlier EVA set is missing since it is not publicly available [175] .
B: Method: see abbreviations on top of article; EVA-4 refers to a simple average over the binary prediction output from PHDpsi, PSIPRED, SSpro, and PROF.
C: Q3: three-state per-residue accuracy (eq. 1).
D: Q3 Claim: three-state per-residue accuracy published in the original publication of the method: PSIPRED [11] , SSpro [13] , JPred2 [5] , PHD [78] .
E: BAD: percentage of helical residues predicted as strand, and of strand residues predicted as helix [56] .
F: Info: per-residue information content [50] .
G: CorrH: Matthews correlation coefficient for state helix [249] .
H: CorrE: Matthews correlation coefficient for state strand [249] .
I: SOV: three-state per-segment score averaged over the three-state segment overlap between predicted and observed segments [51, 58] .
K: Class: percentage of proteins correctly sorted into one of the four classes: all-alpha (length > 60, helix > 45%, strand < 5%), all-beta (length > 60, helix < 5%, strand > 45%), alpha/beta (length > 60, helix > 30%, strand > 20%), other (thresholds for classification from: [99, 250, 78] ).
M: accuracy range: PSIPRED results were published for different conversions of the eight DSSP states to three states.
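The thresholds given for the Class score translate directly into a small decision rule. The sketch below applies them to a three-state string (H, E, L); the function name structural_class is hypothetical, and testing the classes in the order listed is an assumption.

```python
def structural_class(ss):
    """Assign a secondary structural class from a three-state string."""
    length = len(ss)
    if length <= 60:  # all three regular classes require length > 60
        return "other"
    helix = 100.0 * ss.count("H") / length
    strand = 100.0 * ss.count("E") / length
    if helix > 45 and strand < 5:
        return "all-alpha"
    if helix < 5 and strand > 45:
        return "all-beta"
    if helix > 30 and strand > 20:
        return "alpha/beta"
    return "other"


print(structural_class("H" * 50 + "L" * 50))
print(structural_class("H" * 40 + "E" * 30 + "L" * 30))
```

The Class score then is the percentage of proteins for which the class derived from the prediction equals the class derived from the observed structure.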
Understandable why certain proteins predicted poorly? For some of the worst predicted proteins, the low level of accuracy could be anticipated from their unusual features, e.g., for crambin, or the antifreeze glycoprotein type III. However, this procedure turned out to be rather arbitrary. First, some proteins with the same 'unusual features' are predicted at high levels of accuracy. Second, occasionally similar proteins are predicted at very different levels of accuracy, e.g. both the phosphatidylinositol 3-kinase [183] and the Src-homology domain of cytoskeletal spectrin have homologous structures [184] but prediction accuracy varies between less than 40% (pik) and more than 70% (spectrin). None of the conclusions from studying poor predictions has yielded a way to better predictions, yet. Nevertheless, two observations may be added. First, bad alignments (i.e. non-informative and/or falsely aligned residues) result in bad predictions. Second, the BAD predictions (Fig. 8, Table 1), i.e., the confusions of helix and strand, are frequently observed in regions that are stabilised by long-range interactions. For example, the peptide around the fourth strand of SH3 ( Fig. 7 ) forms a helix in solution (Luis Serrano, personal communication). Furthermore, helices and strands that are confused despite a high reliability index often have functional properties, or are correlated to disease states (Rost, unpublished data). Regions predicted with equal propensity in two different states often correlate with 'structural switches' (see ASP below).
Sources of improvement: 4 parts database growth, 3 extended search, 2 other. Jones proposed two causes for the improved accuracy: (1) training and (2) testing the method on PSI-BLAST profiles. Cuff & Barton examined in detail how different alignment methods improve prediction [5] . However, which fraction of the improvement results from the mere growth of the database, which from using more diverged profiles, and which from training on larger profiles? Using PHD from 1994 to separate the effects [8] , we first compared a non-iterative standard BLAST [150] search against SWISS-PROT [14] with one against SWISS-PROT + TrEMBL [14] + PDB [6] . The larger database improves performance by about two percentage points [8] . Secondly, we compared the standard BLAST against the big database with an iterative PSI-BLAST search. This yielded less than two percentage points additional improvement [8] . Thus, overall, the more divergent profile search against today's databases supposedly improves any method using alignment information by almost four percentage points (PHDpsi in Table 1). The improvement through using PSI-BLAST profiles to develop the method is relatively small: PHDpsi was trained on a small database of not very divergent profiles in 1994, whereas PROF was trained on PSI-BLAST profiles of a 20 times larger database in 2000. The two differ by only one percentage point ( Table 1 ), and part of this difference resulted from implementing new concepts into PROF (Rost, unpublished; [9] ).
Combination improves on non-systematic errors. Any prediction method has two sources of errors: (1) systematic errors, e.g., through non-local effects, and (2) white noise errors caused by, e.g., the succession of the examples during the training of neural networks. Theoretically, combining any number of methods improves accuracy as long as the errors of the individual methods are mutually independent and are not only systematic [185] . PHD, and more recently others [174, 5, 175] , utilised this fact by combining different neural networks. The idea of combining different prediction methods has been around in secondary structure prediction for a long time [186] ; Cuff & Barton [3, 4] implemented it in JPred for different third generation methods. In particular, JPred uses a simple expert-rule for compiling the final average. Ross King et al. [187] have tested a variety of different combination strategies. Selbig et al. [188] have compiled the jury through an elaborate decision-tree based system. Guermeur et al. [189] have used a more refined variant of the JPred idea of weighting methods. Overall, combinations of independent prediction methods seem to yield levels of accuracy higher than that of the single best method. In particular, combining the four current best methods (PROF, PSIPRED, SSpro, and PHDpsi) improved prediction accuracy to 77.8% (Table 1, EVA-4). However, for every protein one method tends to be clearly superior to the combined prediction. Is it really wise to include significantly inferior methods in a combined prediction? No: averaging over all methods used for EVA decreased accuracy over the best individual methods, although averaging over the better ones was better than averaging the best ones (data not shown). Is there any criterion for when to include a method and when not? Concepts weighting the individual methods based on their accuracy and 'entropy' [175] appear successful only for large numbers of methods ( [175] , Rost, unpublished). 
Nevertheless, methods that are significantly over-trained can improve when combined (Krogh, unpublished). More rigorous studies for the optimal combination may provide a better picture. The technical problem of utilising many methods in a public server is that the field is advancing too fast: today's methods are more accurate than averages over yesterday's methods (hence the JPred server now returns JPred2 results by default).
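The simplest combination strategy, an unweighted average over the outputs of several methods, can be sketched as follows. This is an illustration only and not the code used for EVA-4 or JPred: the function name combine is hypothetical, and it assumes that every method returns three comparable scores (helix, strand, other) per residue.

```python
STATES = "HEL"  # helix, strand, other


def combine(predictions):
    """Average per-residue scores over methods and pick the winning state.

    `predictions` is a list with one entry per method; each entry is a list
    of (helix, strand, other) score tuples, one tuple per residue.
    """
    n_methods = len(predictions)
    combined = []
    for residue_scores in zip(*predictions):  # same residue, all methods
        mean = [sum(s[k] for s in residue_scores) / n_methods for k in range(3)]
        combined.append(STATES[mean.index(max(mean))])
    return "".join(combined)


method_a = [(0.8, 0.1, 0.1), (0.4, 0.5, 0.1), (0.1, 0.2, 0.7)]
method_b = [(0.7, 0.2, 0.1), (0.6, 0.2, 0.2), (0.2, 0.3, 0.5)]
method_c = [(0.6, 0.3, 0.1), (0.5, 0.4, 0.1), (0.1, 0.6, 0.3)]
print(combine([method_a, method_b, method_c]))
```

In the toy data, method_a alone predicts strand for the second residue and method_c alone predicts strand for the third; the average overrules both, which is the intended effect when errors are not shared between methods.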
Internet prediction services for secondary structure, in general. Programs for the prediction of secondary structure available as internet services have mushroomed since the first prediction service PredictProtein went on line in 1992 [149, 119] (a list of links in [190] ). Our META-PredictProtein server [191] enables users to access a number of the best prediction methods through one single interface. Unfortunately, not all methods available have been sufficiently tested, and some are not very accurate. We try to address this problem by maintaining EVA for the automatic evaluation of prediction servers [180, 181] . In general, prediction accuracy is significantly superior if predictions are based on multiple alignments [192, 155, 25] .
Completely vs. almost automatic. The PHD/PROF prediction
methods are automatically available via the internet service PredictProtein
[119] (use the web interface at http://cubic.bioc.columbia.edu/predictprotein
or send the word help to PredictProtein@columbia.edu).
Users have the choice between the fully automatic procedure taking
the query sequence through the entire cycle, or expert intervention
into the generation of the alignment. Indeed, without spending
much time users typically can improve prediction accuracy easily
by choosing 'good' alignments.
Regions likely to undergo structural change predicted successfully. Young, Kirshenbaum, Dill & Highsmith [1] have unravelled an impressive correlation between local secondary structure predictions and global conditions. The authors monitor regions for which secondary structure prediction methods give equally strong preferences for two different states. Such regions are processed combining simple statistics and expert-rules. The final method is tested on 16 proteins known to undergo structural rearrangements, and on a number of other proteins. The authors report no false positives, and identify most known structural switches. Subsequently, the group applied the method to the myosin family, identifying putative switching regions that were not known before but appeared to be reasonable candidates [193] . I find this method most remarkable in two ways: (1) it is the most general method using predictions of protein structure to predict some aspects of function, and (2) it illustrates that predictions may be useful even when structures are known (as in the case of the myosin family).
Classifying proteins based on secondary structure predictions in the context of genome analysis. Proteins can be classified into families based on predicted and observed secondary structure [28, 194] . However, such procedures have been limited to a very coarse-grained grouping only exceptionally useful to infer function. Nevertheless, in particular predictions of membrane helices and coiled-coil regions are crucial for genome analysis. Recently, we came across an observation that may have important implications for structural genomics, in particular: More than one fifth of all eukaryotic proteins appeared to have regions longer than 60 residues apparently lacking any regular secondary structure [195] . Most of these regions were not of low-complexity, i.e. not composition-biased. Surprisingly, these regions appeared evolutionarily as conserved as all other regions in the respective proteins. This application of secondary structure prediction may aid in classifying proteins, and in separating domains, possibly even in identifying particular functional motifs.
Aspects of protein function predicted based on expert-analysis of secondary structure. The typical scenario in which secondary structure predictions help to learn about function is that of experts combining predictions with their intuition, most often to find similarities to proteins of known function but insignificant sequence similarity [196, 197, 40, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207] . Usually, such applications are based on very specific details about predicted secondary structure. Thus, these successful correlations of secondary structure and function appear difficult to incorporate into automatic methods.
Exploring secondary structure predictions to improve database searches. Initially, three groups independently applied secondary structure predictions for fold recognition, i.e., the detection of structural similarities between proteins of unrelated sequences [208, 209, 210] . A few years later, almost every other fold recognition/threading method has adopted this concept [211, 212, 213, 214, 215, 216, 217, 218, 219, 220] . Two recent methods extended the concept by not only refining the database search, but by actually refining the quality of the alignment through an iterative procedure [221, 171] . A related strategy has been employed by Ng and the Henikoffs to improve predictions and alignments for membrane proteins [222] .
From 1D predictions to 2D, and 3D structure. Are secondary structure predictions accurate enough to help predicting higher order aspects of protein structure automatically? 2D (inter-residue contacts) predictions: Baldi, Pollastri, Andersen & Brunak [223] have recently improved the level of accuracy in predicting beta-strand pairings over earlier work [153] through using another elaborate neural network system. 3D predictions: the following list of five groups exemplifies that secondary structure predictions have now become a popular first step toward predicting 3D structure. (1) Ortiz et al. [224] successfully use secondary structure predictions as one component of their 3D structure prediction method. (2) Eyrich et al. [225, 226] minimise the energy
88% is a limit, but shall we ever get close to it?
Protein secondary structure formation is influenced by long-range
interactions [231, 46, 47] and by the environment [1, 232] .
Consequently, stretches of up to 11 adjacent residues (dubbed
chameleon after [231] ) can be found in different secondary
structure states [233, 234, 235] . Implicitly, such non-local
effects are contained in the exchange patterns of protein families.
This is reflected by the fact that strand is predicted almost
as accurately as helix ( Table 1 ), although sheets are stabilised
by more non-local interactions than helices. Local profiles can
even suffice to identify structural switches [193, 1] . Surprisingly,
we can find some traces of folding events in secondary structure
predictions [236] . Even more amazing is a study suggesting
that alignment-based methods achieve similar levels of accuracy
for chameleon regions as for all other regions [234] . Secondary
structure assignments may vary for two versions of the same structure.
One reason is that protein structures are not rocks but dynamic
objects with some regions more mobile than others. Another reason
is that any assignment method has to choose particular thresholds
(e.g. DSSP chooses a cut-off in the Coulomb energy of a hydrogen
bond). Consequently, assignments differ by about 5-15 percentage
points between different X-ray versions or different NMR models
for the same protein (Andersen & Rost, unpublished), and by
about 12 percentage points between structural homologues [51] .
The latter number provides the upper limit for secondary structure
prediction of error-free comparative modelling. I doubt that ab
initio predictions of secondary structure will ever become more
accurate than that. Hence, I believe a value around 88% constitutes
an operational upper limit for prediction accuracy. After the
advances of the last two years, we have reached above 76%. Thus, we
still need to gain another twelve percentage points (or even fewer).
What is the major obstacle to climbing another six percentage
points? The size of the experimental database, as suggested
[233] ? I doubt this, since PHDpsi trained on only 200 proteins
using PSI-BLAST input is almost as accurate as PSIPRED trained
on 2000 proteins ( Table 1 ). Will the current explosion of sequences
boost accuracy? In fact, current databases have less than 10 homologues
for more than one third of the 150 tested proteins ( Table 1 ),
and more than 100 for only 20% of the proteins. Although based
on a too small set for conclusions, for these 20% highly populated
families the accuracy of PROF was four percentage points above
average (data not shown). Thus, larger databases may get us six
percentage points higher, or they may not. The answer remains nebulous.
The following notes have resulted from nine years of experience with running the PredictProtein server [119] and from various structure prediction workshops [237] . Some comments apply in particular to the PHD/PROF methods [7, 238] . However, most hold also for using other secondary structure prediction methods (a detailed list of 'Hints for users' is given on our WWW pages [119] ).
How accurate are the predictions? The expected levels of accuracy (PROF Q3 = 77±10%) are valid for typical globular, water-soluble proteins when the multiple alignment contains many and diverse sequences. High values for the reliability indices indicate more accurate predictions (Fig. 9). However, for alignments with little variation in the sequences, the reliability indices adopt misleadingly high values. PHD/PROF predictions tend to be relatively accurate for porins [7] ; however, for helical membrane proteins other programs ought to be used [7, 238, 39] .
Confusion between strand and helix? PHD (as well as other methods) focuses on predicting hydrogen bonds. Consequently, occasionally strongly predicted (high reliability index) helices are observed as strands and vice versa (Fig. 8, Table 1). In fact, some of these BAD predictions correspond to structural switching regions.
Strong signal from secondary structure caps? The ends of helices and strands contain a strong signal. However, on average PHD predicts the core of helices and strands more accurately than the caps [49] . This also holds for the other methods listed in Table 1 (data not shown).
Internal helices predicted poorly? Steven Benner has indicated that internal helices are difficult to predict [137, 53] . On average, this is not the case for PHD predictions [239] .
What about protein design and synthesised peptides? The PHD networks are trained on naturally evolved proteins. However, the predictions have been useful in some cases to investigate the influence of single mutations (e.g. for Chameleon [231, 240] , or for Janus [241] , Rost, unpublished). For short poly-peptides, users should bear in mind that the network input consists of 17 adjacent residues. Thus, shorter sequences may be dominated by the ends (which are treated as solvent by the current version of PHD).
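The effect of the 17-residue input window on short peptides can be seen by enumerating the windows. The sketch below is illustrative only: the function name windows and the padding symbol '-' are hypothetical stand-ins for whatever end-of-chain encoding a network actually uses.

```python
def windows(sequence, width=17, pad="-"):
    """Return one window of `width` residues centred on each residue.

    Positions beyond the chain ends are filled with `pad`.
    """
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]


peptide = "ACDEFGHIKL"  # a hypothetical 10-residue peptide
for window in windows(peptide):
    print(window, window.count("-"))
```

For a peptide of ten residues every window contains at least seven padding positions, so each prediction is based in large part on end-of-chain input rather than on sequence.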
70% correct implies 30% incorrect. The most accurate methods for predicting secondary structure reach sustained levels of about 70% accuracy. When interpreting predictions for a particular protein it is often instructive to mark the 30% of the residues you suspect to be falsely predicted.
Special classes of proteins. Prediction methods are usually derived from knowledge contained in proteins from subsets of current databases. Consequently, they should not be applied to classes of proteins not included in these subsets, e.g., methods for predicting helices in globular proteins are likely to fail when applied to predict transmembrane helices. In general, results should be taken with caution for proteins with unusual features, such as proline-rich regions, unusually many cysteine bonds, or for domain interfaces.
Better alignments yield better predictions. Multiple alignment-based predictions are substantially more accurate than single sequence-based predictions. How many sequences do you need in your alignment for an improvement; and how sensitive are prediction methods to errors in the alignment? The more divergent sequences contained in the alignment, the better (two distantly related sequences often improve secondary structure predictions by several percentage points). Regions with few aligned sequences yield less reliable predictions. The sensitivity to alignment errors depends on the methods, e.g., secondary structure prediction is less sensitive to alignment errors than accessibility prediction.
1D structure may or may not be sufficient to infer 3D structure.
Say you obtain as prediction for regular secondary structure:
helix-strand-strand-helix-strand-strand (H-E-E-H-E-E). Assume,
you find a protein of known structure with the same motif (H-E-E-H-E-E).
Can you conclude that the two proteins have the same fold? Yes
and no: your guess may be correct, but there are various ways
to realise the given motif by completely different structures.
For example, at least 16 structurally unrelated proteins contain
the secondary structure motif 'H-E-E-H-E-E'.
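A motif string of this kind is obtained by collapsing a per-residue prediction into its succession of helix and strand segments, as in the minimal sketch below. The function name segment_motif and the optional min_length filter are hypothetical.

```python
from itertools import groupby


def segment_motif(ss, min_length=1):
    """Collapse a three-state string into a motif such as 'H-E-E-H-E-E'.

    Only helix (H) and strand (E) segments of at least `min_length`
    residues are listed; everything else is skipped.
    """
    segments = [
        state
        for state, run in groupby(ss)
        if state in ("H", "E") and len(list(run)) >= min_length
    ]
    return "-".join(segments)


prediction = "LLHHHHHHLLEEEELLEEEELLHHHHHLLEEELLEEEEL"
print(segment_motif(prediction))
```

The collapsed string discards segment lengths, loop lengths and all spatial arrangement, which is one way to see why identical motifs can belong to unrelated folds.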
Particular thanks to Volker Eyrich for his crucial help with setting
up the META-PP and EVA servers without which most of the results
presented here would not exist. Last but not least, thanks to all
those who deposit their experimental data in public databases,
and to those who maintain these databases.