| Title: | Simple jury predicts protein secondary structure best |
| Author: | Burkhard Rost, Pierre Baldi, Geoff Barton, James Cuff, Volker A Eyrich, David Jones, Kevin Karplus, Ross King, Gianluca Pollastri & Dariusz Przybylski |
| Quote: | Preprint CUBIC_2001_10 |
Simple jury predicts protein secondary structure best
1 CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2 Univ. of California, Irvine, Dept. of Information and Computer Science, Institute of Genomics and Bioinformatics, Irvine, CA-92697, USA
3 European Bioinformatics Institute, Genome Campus, Hinxton, Cambs CB10 1SD, England
4 The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA, England
5 Dept. of Biological Sciences, Brunel Univ., 274348, Uxbridge, Middlesex UB8 3PH, England
6 Computer Engineering, Univ. of California, Santa Cruz, Santa Cruz, CA 95064, USA
7 The Univ. of Wales Aberystwyth, Dept. of Computer Science, Penglais, Ceredigion, SY23 3DB, Wales, UK
* Corresponding author: rost@cubic.bioc.columbia.edu, http://cubic.bioc.columbia.edu/
Journal: Proteins
Rejected: 2001-10
Preprint: CUBIC, Columbia University, New York; PRE_2001-10-01
The field of secondary structure prediction has advanced again. The best methods now predict 74-76% of the residues correctly in one of the three states helix, strand, or other. In the context of EVA/CASP, we experimented with averaging over the best current methods. The resulting jury decision proved significantly more accurate than the best method. Although the 'jury' seemed the best choice on average, for 60% of all proteins one method was better than the jury. Furthermore, the best individual methods tended to be superior to the jury in estimating the reliability of a prediction. Hence, averaging over predictions may be the method of choice for a quick scan of large data sets, while experts may profit from studying the respective methods in detail.
Key words: automatic evaluation, large-scale assessment, protein structure prediction.
Prediction accuracy increased substantially in the 1990s through the use of evolutionary information from proteins in the same structural family. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than four percentage points to its current level of around 76% of all residues predicted correctly in one of the three states helix, strand, other. The best current methods solved most of the problems raised at earlier CASP meetings: all good methods now get segments right and perform well on strands. Could we improve prediction accuracy further by averaging over the best prediction methods?
JPred2 [1]: JPred2 provides multiple sequence alignments constructed automatically by both PSI-BLAST searches and ClustalW alignments derived from a carefully filtered non-redundant sequence database. The default in JPred2 is to execute only the JNet [1] algorithm. For the results in this paper (Table 1), we used JPred2 with the option to combine JNet with PHD [2, 3], NNSSP [4], Predator [5], Mulpred (Geoff Barton, unpublished), DSC [6] and Zpred [7]. We estimated that the combination of JNet with these methods improved prediction accuracy slightly.
PHDpsi [Przybylski, 2001 #4523]: PHD [8] is a system of neural networks using evolutionary information as input. It was developed in 1994. The only difference between PHD and PHDpsi is that the latter uses divergent profiles as provided by PSI-BLAST [9] .
Prof_king [10, 11]: The input to Prof_king is a multiple sequence alignment produced using PSI-BLAST [9] and ClustalW [12]. The alignment is "poly-transformed" into attributes using the approaches of GOR propensities [13], PHD profiles [2], and PSI-PRED profiles [14]. The machine learning in Prof_king is a complicated combination of linear and quadratic discrimination, back-propagation neural networks, the use of different priors, and cascaded prediction. Prof_king was designed to predict beta-strands especially well, as their prediction is essential for fold recognition. The method reaches its high level of accuracy through the use of an ensemble of predictors, which improves prediction because the ensemble is closer to the "Bayes-optimal" predictor.
PROFsec [Rost, 2001 #4990]: This program is based on the principal concept implemented in PHD [2, 3, 8]; it extends this concept in several ways. (1) The network system was trained on a large data set of divergent PSI-BLAST profiles. (2) PROF also uses more information for the input than PHD. (3) Finally, PROF uses a third layer of networks.
PSIPRED [14, 15]: The current version of PSIPRED (Version 2), implemented on the respective Web server and also available in source code form at the address http://www.psipred.net, is based on the method previously published [14, 15]. This method, which is similar in concept to the approach previously described by Rost and Sander [2], employs two standard feed-forward neural networks to analyse the Position Specific Scoring Matrices (PSSMs) generated by PSI-BLAST [9] in order to predict protein secondary structure. Three iterations of PSI-BLAST are used, searching a large non-redundant data bank of protein sequences from which coiled-coil, transmembrane and low-complexity regions have been masked out. The first network employs 75 hidden units and 15x21 inputs (window of 15 residue positions), and is trained using an early-stopped training strategy that uses 10% of the training data to detect convergence. The second level network, which is used to "filter" the predictions from the first network, comprises 15x4 inputs and 55 hidden units. Version 2 of PSIPRED uses 4 identical first level networks, each trained with a differently chosen training set, and it is the average of these 4 sets of outputs which is now presented to the second level network. This averaging has been found to improve the prediction accuracy on average by 1.3 percentage points.
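The input layout and the averaging step described for PSIPRED can be sketched as follows. This is a minimal illustration and not PSIPRED's own code: the text only gives the 15x21 input size, so the sketch assumes that the 21st value per window position flags positions that fall beyond the chain termini, and the names `window_inputs` and `average_outputs` are hypothetical.

```python
from typing import List

WINDOW = 15      # residue positions per window (from the text)
N_PROFILE = 20   # PSSM columns, one per amino acid
N_PER_POS = 21   # 20 profile values + 1 extra value (ASSUMED: off-chain flag)


def window_inputs(pssm: List[List[float]], center: int) -> List[float]:
    """Flatten a 15-residue window around `center` into 15 x 21 = 315 inputs."""
    half = WINDOW // 2
    inputs: List[float] = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            inputs.extend(pssm[pos])            # 20 profile values
            inputs.append(0.0)                  # position lies inside the chain
        else:
            inputs.extend([0.0] * N_PROFILE)    # no profile beyond the termini
            inputs.append(1.0)                  # assumed off-chain flag
    return inputs


def average_outputs(per_network: List[List[float]]) -> List[float]:
    """Average the (H, E, L) outputs of several first-level networks."""
    n = len(per_network)
    return [sum(values) / n for values in zip(*per_network)]


if __name__ == "__main__":
    toy_pssm = [[0.5] * N_PROFILE for _ in range(30)]
    print(len(window_inputs(toy_pssm, 0)))      # number of inputs for residue 0
    print(average_outputs([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                           [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]))
```

In this reading, the averaged three-state output for each residue is what the second level network would receive in its own 15-residue window.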
SAM-T99sec [16]: The secondary structure predictor on the SAM-T99 web site averages three different methods. All three methods use a multiple alignment of probable homologues found by the SAM-T99 method for iterative search and alignment. The methods were trained on STRIDE classification of secondary structure, not DSSP. Although STRIDE and DSSP differ in 3-state assignment for about 5% of residues, prediction accuracy is only changed by about 0.5%, as the places where they differ are hard to predict correctly anyway. The neural net method is the most accurate of the three methods, and is the only one that provides accurate estimation of probability of correctness for the prediction. The net has four layers of neurons (three hidden layers and the output layer). The input has probability vectors for the 20 amino acids (computed from the multiple alignment using sequence weighting and a Dirichlet mixture prior) and probabilities for insertion and deletion at each position. The three hidden layers have 9, 10, and 9 units respectively, and the output layer has 3 units (corresponding to E, H and L). The inputs to each layer are a window centered on the current residue of the outputs of the previous layer. The window sizes are 5, 7, 5, and 11. Each layer uses soft-max to produce a probability vector as its output. The second method uses a general hidden Markov model which emits pairs of amino acids and secondary structure codes in each state. There are 37 "fat states" in the HMM and 162 "fat edges" (each fat state has a pair-emitting state and a deletion state, and each fat edge contains an insertion state and 9 edges). The target sequence is aligned to the HMM and the posterior probability of each label is computed by a weighted sum over all alignments. The homologues are likewise aligned, and the final prediction is the weighted average of the predictions for the individual sequences.
The third method uses the target HMM used by SAM-T99 for fold recognition (one match state per residue in the target sequence), and aligns a subset of PDB sequences to it. The secondary structure codes for the PDB sequences are given weights according to how likely the given residue is to align to that position in the target HMM.
SSpro [17]: SSpro consists of 11 bi-directional recurrent neural networks (BRNNs) derived from the general theory of graphical models [17, 18]. BRNNs are adaptive graphical models that can learn how to translate input sequences into output sequences with variable lengths. They extend typical Markov or hidden Markov model approaches by including both a left-right and a right-left chain of hidden states that can carry information in both directions along the chain, and between the input and output sequences. The left-right and right-left state updates are parameterised by two recurrent networks, which can be viewed as two "wheels" that are rolled along the protein chain, starting from each terminus. A third neural network produces the secondary structure classification output using the local left and right contexts carried by the wheels, the input, and the hidden units. BRNNs can be trained in a supervised fashion using a generalised form of gradient descent. The ensemble prediction is obtained via a plain average of the outputs of the single models. In our tests, BRNNs are responsible for gains in the 0.5-1% range over structured feed-forward neural networks, and are capable of capturing context information over distances of about 30 amino acids. The results of the ensemble represent gains of 1.3-3.3% over single models. The profile weighting scheme provides a gain of about 0.6% with respect to plain profiles. Further improvements have recently been obtained using PSI-BLAST profiles [9] and a new improved version (SSpro 2.0) will soon be on-line. The SSpro 1.0 server and related references are currently available at http://promoter.ics.uci.edu/BRNN-PRED/.
J-EVA: alternatives for compiling the jury decision. The automatic evaluation server EVA [19] received secondary structure predictions in CASP format. In particular, the predictions did not contain detailed values for the propensities for helix (H), strand (E), other (L). Rather, the outputs were restricted to the actual state predicted (H, E, or L) and, for all methods except SSpro, an estimate of the reliability of the prediction for each residue. We tried three alternative ways to combine all prediction methods. (1) Winner average (dubbed 'J-EVA winner'): count how many methods predicted state s and predict the state with the highest count. (2) Simple average (dubbed 'J-EVA simple'): weight predictions by the provided reliability index. (3) Weighted average (dubbed 'J-EVA weighted'): this average normalised the weighted average by the method and protein averages:
(1)
where the indicator term = 1 if method µ predicted residue i in state s, and 0 otherwise; the reliability term was the reliability (0-1) of method µ for residue i; and the two normalising terms were the per-protein average of the reliability index for method µ, and the average over the reliability indices of method µ for protein P.
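The three combination schemes can be sketched in a few lines. This is an illustrative sketch, not the code used for J-EVA: the function name `jury`, the tie-breaking order, and the treatment of methods without reliability indices (unit weight) are assumptions. In particular, the 'weighted' branch divides each reliability by the method's mean reliability on the protein; that is only one plausible reading of the normalisation described above, not necessarily the formula actually used.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

STATES = "HEL"
# One entry per method: (predicted state string, per-residue reliabilities in 0-1 or None)
Prediction = Tuple[str, Optional[List[float]]]


def jury(predictions: Dict[str, Prediction], scheme: str = "winner") -> str:
    """Combine per-residue three-state predictions from several methods."""
    if scheme not in ("winner", "simple", "weighted"):
        raise ValueError(f"unknown scheme: {scheme}")
    length = len(next(iter(predictions.values()))[0])
    combined = []
    for i in range(length):
        votes = defaultdict(float)
        for states, rel in predictions.values():
            if scheme == "winner" or rel is None:
                weight = 1.0                     # plain vote (ASSUMED if no reliability)
            elif scheme == "simple":
                weight = rel[i]                  # reliability index as weight
            else:
                # ASSUMED normalisation: reliability relative to the
                # method's mean reliability on this protein.
                protein_mean = sum(rel) / len(rel)
                weight = rel[i] / protein_mean if protein_mean > 0 else 0.0
            votes[states[i]] += weight
        # Ties are broken in the fixed order H, E, L (an arbitrary choice).
        combined.append(max(STATES, key=lambda s: votes[s]))
    return "".join(combined)


if __name__ == "__main__":
    toy = {
        "method_a": ("HHEL", [0.9, 0.2, 0.8, 0.5]),
        "method_b": ("HEEL", [0.8, 0.9, 0.7, 0.5]),
        "method_c": ("LHEL", [0.1, 0.1, 0.9, 0.9]),
    }
    for scheme in ("winner", "simple", "weighted"):
        print(scheme, jury(toy, scheme))
```

In the toy input, two methods vote for helix at the second residue with low reliability while one votes for strand with high reliability, so the vote count and the reliability-weighted schemes disagree at that position.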
Jury over best three methods significantly more accurate than best method. All alternative strategies to compile the jury decision were successful in that the resulting predictions were on average not worse than those of the best individual method. The weighted average and the 'winner-take-all' decision compiled over the three methods PROFsec, PSIPRED, and SSpro predicted more than 78% of all residues correctly (Table 1). For the data set of 218 proteins used for evaluation this difference was significant. Interestingly, the improvement was also visible in terms of per-segment scores (SOV), and in that the number of residues confused between helix and strand was reduced. Furthermore, the simple averaging procedure was clearly inferior to the 'winner-take-all' and to the 'weighted average'. When compiling the jury decision over all five methods for which we had results on 218 proteins, prediction accuracy decreased slightly. Hence, averaging over the best three was better than averaging over the best five.
| Method | Q3 | SOV3 | BAD | CH | CE |
| J-EVA3 weight | 78.0 | 73.7 | 2.0 | 0.692 | 0.653 |
| J-EVA3 winner | 78.2 | 73.8 | 1.7 | 0.696 | 0.661 |
| | | | | | |
| J-EVA3 simple | 77.2 | 72.8 | 2.0 | 0.686 | 0.652 |
| PROFsec | 76.8 | 72.8 | 2.2 | 0.664 | 0.646 |
| PSIPRED | 76.4 | 72.1 | 2.4 | 0.656 | 0.633 |
| SSpro | 76.1 | 71.2 | 2.5 | 0.669 | 0.638 |
| | | | | | |
| JPred2 | 74.8 | 69.3 | 2.4 | 0.638 | 0.618 |
| PHDpsi | 74.7 | 69.7 | 2.9 | 0.638 | 0.613 |
A: Data set and sorting: All methods have been tested on the same set of 218 different new protein structures (EVA version Feb 2001). None of these proteins was similar to any protein used to develop the respective method. Sorting and grouping reflects the following concept: if the data set is too small to distinguish between two methods, these two are grouped. For the given set of 218 proteins, differences < 0.6 percentage points were not significant. B: Method: see abbreviations in Methods; Scores [Rost WWW, 2001 #3739]: Q3: three-state per-residue accuracy, i.e., percentage of residues predicted correctly in one of the three states helix, strand, other; SOV3: three-state per-segment score measuring the overlap between predicted and observed segments [21, 22]; BAD: percentage of helical residues predicted as strand, and of strand residues predicted as helix [23]; CH: Matthews correlation coefficient for state helix [24]; CE: Matthews correlation coefficient for state strand [24].
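The per-residue scores in the table can be computed along the following lines. This is a hedged sketch rather than the EVA implementation: the function names are hypothetical, and BAD is assumed to be normalised by the total number of residues, which the definition above does not state explicitly. SOV3 is omitted because its segment-overlap definition [21, 22] is considerably longer.

```python
import math


def q3(observed: str, predicted: str) -> float:
    """Percentage of residues predicted in the correct state (H, E, or L)."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def bad(observed: str, predicted: str) -> float:
    """Percentage of residues confused between helix and strand.

    ASSUMPTION: normalised by all residues of the protein.
    """
    confused = sum({o, p} == {"H", "E"} for o, p in zip(observed, predicted))
    return 100.0 * confused / len(observed)


def matthews(observed: str, predicted: str, state: str) -> float:
    """Matthews correlation coefficient for one state (e.g. 'H' or 'E')."""
    tp = fp = fn = tn = 0
    for o, p in zip(observed, predicted):
        if p == state and o == state:
            tp += 1
        elif p == state:
            fp += 1
        elif o == state:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    obs = "HHHHEEEELL"
    pred = "HHHEEEEHLL"
    print(q3(obs, pred), bad(obs, pred), matthews(obs, pred, "H"))
```

Averaging such per-protein values over a test set gives summary figures of the kind listed in Table 1.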
Best prediction was most often better than jury. Prediction accuracy varies strongly between proteins. Furthermore, proteins predicted at low levels of accuracy are occasionally predicted more accurately by some method [20]. How did the jury predictions compare to the 'field' and how to the best prediction? The jury prediction was most often more accurate than the average over all methods (Fig. 1 A). However, the jury prediction was better than the best prediction for fewer than 40% of the proteins (Fig. 1 B). Did this imply that our way of compiling the jury decision was not optimal? Indeed, the average over the best predictions on 218 proteins reached 80.8%, whereas the average over the jury reached 78.2% (Table 1). Given that the best single method on that set reached 76.8%, the jury achieved only 35% ((78.2-76.8)/(80.8-76.8)) of the improvement we could have reached had we combined the methods optimally.
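The comparison with the best per-protein prediction is simple arithmetic, sketched here for concreteness. The function names and the toy per-protein values are hypothetical; only the three averages 78.2, 76.8 and 80.8 come from the text.

```python
from typing import Dict, List


def oracle_accuracy(per_protein: Dict[str, List[float]]) -> float:
    """Mean accuracy when the best method is picked for every protein.

    This requires knowing the structure, so it is an upper bound for
    any jury that combines the same methods.
    """
    best = [max(accuracies) for accuracies in per_protein.values()]
    return sum(best) / len(best)


def fraction_of_possible_gain(jury: float, best_single: float,
                              oracle: float) -> float:
    """Share of the gap between best single method and oracle closed by the jury."""
    return (jury - best_single) / (oracle - best_single)


if __name__ == "__main__":
    # Averages quoted in the text for the set of 218 proteins.
    share = fraction_of_possible_gain(jury=78.2, best_single=76.8, oracle=80.8)
    print(f"{100 * share:.0f}% of the possible improvement")

    # Toy per-protein Q3 values for two methods (hypothetical numbers).
    print(oracle_accuracy({"protein_1": [70.0, 80.0], "protein_2": [90.0, 60.0]}))
```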
Fig. 1. Jury prediction vs. best prediction. For each of 99 sequence-unique proteins we averaged over the three-state per-residue accuracy of all 7 methods described in Methods. Then we compiled the jury prediction according to eq. 2 over all 7 methods. A relates the accuracy of the jury to the average accuracy (crosses) and to the most accurate prediction for each protein (triangles). The jury prediction was more accurate than the average for almost all proteins. How often was the best prediction more accurate than the jury? B shows the cumulative difference between jury and best prediction (negative values imply that the best prediction was better than the jury). For more than 60% of all proteins, the jury decision was less accurate than the best prediction.
Individual methods defined reliability more accurately. All 7 methods accurately estimated the reliability of the prediction for each residue [20]. The reliability estimated for the jury prediction correlated less impressively with accuracy (Fig. 1 C). In particular, the jury predictions never reached levels above 85% accuracy for any residue. Hence, experts may still profit substantially from studying the explicit output from good methods.
Best three methods might suffice to reach peak jury performance. Suppose we have NB best methods, and NM methods of medium accuracy. Is it better to compile averages over only the NB best ones, or over all NB+NM methods? Contrary to the naïve expectation, including the NM methods improves the average as long as these differ from the NB methods (Anders Krogh, private communication). In our experiment, we did not see a significant difference between compiling the jury over the best three and over all 7 methods. For a set of 99 proteins for which we had results from all methods, the 'winner' jury over 7 methods reached 78.3%, that over 3 methods reached 78.1%. The data in Table 1 suggested that the best strategy may be the simplest one, i.e. to average over the final prediction of each method, discarding the individual reliability indices. However, this result hid the following problem: we did not have reliability indices for SSpro. Consequently, weighting the jury according to eqn. 2 was problematic. This problem was less important when compiling the jury over 7 methods. The resulting weighted jury reached 78.5% accuracy compared to 78.1% for the simple 'winner' jury. However, the set of 99 proteins was too small to conclude that the two strategies yielded significant differences.
Is it a good idea to combine the best methods for secondary structure prediction? We found that a simple jury prediction using the best three methods improved accuracy on average (Table 1). However, for most proteins, at least one of the seven best methods was more accurate than the jury. Furthermore, the best individual methods were more successful in identifying reliably predicted regions than was the jury prediction. Hence, is it a good idea to consider regions as more reliably predicted if all methods agree? Supposedly not. If we know the structure of a protein, we can simply always select the best prediction. This would yield an accuracy above 80%. Can we reach this level without knowing the structure? Not yet.
| 1. | Cuff, J. A. & Barton, G. J. (2000). Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins, 40, 502-511. |
| 2. | Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-599. |
| 3. | Rost, B. & Sander, C. (1994). Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19, 55-72. |
| 4. | Salamov, A. A. & Solovyev, V. V. (1995). Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignment. J. Mol. Biol., 247, 11-15. |
| 5. | Frishman, D. & Argos, P. (1996). Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Prot. Engin., 9, 133-142. |
| 6. | King, R. D. & Sternberg, M. J. (1996). Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Prot. Sci., 5, 2298-2310. |
| 7. | Zvelebil, M. J., Barton, G. J., Taylor, W. R. & Sternberg, M. J. E. (1987). Prediction of protein secondary structure and active sites using alignment of homologous sequences. J. Mol. Biol., 195, 957-961. |
| 8. | Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol., 266, 525-539. |
| 9. | Altschul, S., Madden, T., Shaffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389-3402. |
| 10. | King, R. D., Ouali, M., Strong, A. T., Aly, A., Elmaghraby, A. et al. (2000). Is it better to combine predictions? Prot. Engin., 13, 15-19. |
| 11. | Ouali, M. & King, R. D. (2000). Cascaded multiple classifiers for secondary structure prediction. Prot. Sci., 9, 1162-1176. |
| 12. | Higgins, D. G., Thompson, J. D. & Gibson, T. J. (1996). Using CLUSTAL for multiple sequence alignments. Meth. Enzymol., 266, 383-402. |
| 13. | Garnier, J., Osguthorpe, D. J. & Robson, B. (1978). Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol., 120, 97-120. |
| 14. | Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195-202. |
| 15. | McGuffin, L. J., Bryson, K. & Jones, D. T. (2000). The PSIPRED protein structure prediction server. Bioinformatics, 16, 404-405. |
| 16. | Karplus, K., Barrett, C., Cline, M., Diekhans, M., Grate, L. et al. (1999). Predicting protein structure using only sequence information. Proteins, S3, 121-125. |
| 17. | Baldi, P., Brunak, S., Frasconi, P., Soda, G. & Pollastri, G. (1999). Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15, 937-946. |
| 18. | Baldi, P. & Brunak, S. (2001). Bioinformatics: the machine learning approach. MIT Press, Cambridge. |
| 19. | Eyrich, V., Martí-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242-1243. |
| 20. | Rost, B. & Eyrich, V. (2001). EVA: large-scale analysis of secondary structure prediction. Proteins, 45 Suppl 5, S192-S199. |
| 21. | Rost, B., Sander, C. & Schneider, R. (1994). Redefining the goals of protein secondary structure prediction. J. Mol. Biol., 235, 13-26. |
| 22. | Zemla, A., Venclovas, C., Fidelis, K. & Rost, B. (1999). A modified definition of SOV, a segment-based measure for protein secondary structure prediction assessment. Proteins, 34, 220-223. |
| 23. | Defay, T. & Cohen, F. E. (1995). Evaluation of current techniques for ab initio protein structure prediction. Proteins, 23, 431-445. |
| 24. | Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442-451. |
| Contact: rost@columbia.edu | Version: Oct, 2001 |