
Title: Improving fold recognition without folds
Author: Dariusz Przybylski, Burkhard Rost

Improving fold recognition without folds

Dariusz Przybylski 1,4 & Burkhard Rost 1,2,3

1 CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
4 Dept. of Physics, Columbia Univ., 538 West 120th Street, New York, NY 10027, USA
* Corresponding author: cubic@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/  Tel: +1-212-305-4018, fax: +1-212-305-7932

* Corresponding author: dsp23@columbia.edu, http://cubic.bioc.columbia.edu/ 
Tel: +1-212-305-4018, fax: +1-212-305-7932

 

Running Title: Improving fold recognition without folds

Document statistics: Abstract = 193, Text = 6606 words, 87 references, 7 figures; 1 tables; 35 pages

Journal: Journal of Molecular Biology

Submitted: Feb 2, 2004

This article is published in (journal name, issue, date and pages) © copyright Journal of Molecular Biology, Academic Press (2004). Academic Press is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.



Abstract

The most reliable way to align two proteins of unknown structure is through sequence-profile and profile-profile alignment methods. If the structure for one of the two is known, fold recognition methods outperform purely sequence-based alignments. Here, we introduced a novel method, AGAPE, that aligns generalised sequence and predicted structure profiles. Using predicted 1D structure (secondary structure and solvent accessibility) significantly improved over sequence-only methods, both in terms of correctly recognising pairs of proteins with different sequences and similar structures and in terms of correctly aligning the pairs. The scores obtained by our generalised scoring matrix followed an Extreme Value Distribution; this yielded accurate estimates of the statistical significance of our alignments. We found that mistakes in 1D structure predictions correlated between proteins from different sequence-structure families. The impact of this surprising result was that our method succeeded in significantly out-performing sequence-only methods even without explicitly using structural information from either of the two proteins. Since AGAPE also outperformed established methods that rely on 3D information, we made it available through http://www.predictprotein.org. If we solved the problem of CPU-time required to apply AGAPE to millions of proteins, our results could also impact everyday database searches.

 

Key words: protein structure prediction, fold recognition, sequence alignment, database search, secondary structure, solvent accessibility. 

 

Abbreviations used

1D structure: one-dimensional (e.g. sequence or string of secondary structure, or solvent accessibility)
3D structure: three-dimensional co-ordinates of protein structure
3D-PSSM: established fold recognition method [1, 2]
AGAPE: Aligning Generalised Profiles, i.e. method introduced here
DSSP: program and database assigning secondary structure and solvent accessibility for proteins of known 3D structure [3]
EVD: Extreme Value Distribution
FUGUE: profile-profile based fold recognition method [4, 5]
GPSSM: generalised position-specific scoring matrix
PDB: Protein Data Bank of experimentally determined 3D structures of proteins [6]
SCOP: expert-curated database of structural similarities [7]
SWISS-PROT: database of protein sequences [8]
TrEMBL: translation of the EMBL-nucleotide database coding DNA to protein sequences [8]


 

alignment of observed (DSSP) to observed 1D structure; P2O, predicted 1D structure for query vs. observed (DSSP) 1D structure for template; P2P, alignment of predicted to predicted 1D structure.


 

NOTE to the printer:
Please maintain italic typesetting for the first sentence of each paragraph.

 

Introduction

Mastering the twilight zone of sequence alignment through profiles. Proteins with similar sequence have similar 3D structure [9, 10, 11, 12]. Sequence-structure families constitute groups of proteins that have similar structural domains and for which this similarity can reliably be recognized through standard sequence comparison methods such as PSI-BLAST [13], ClustalX [14] or the HMM-based methods HMMer [15] and SAM [16]. These advanced sequence comparison methods intrude safely into the twilight zone of sequence alignment [17, 11, 12, 18] by replacing the standard comparison of sequence-to-sequence with that of sequence-to-profile. This extension is extremely powerful, as demonstrated by the tremendous improvement of PSI-BLAST over the pairwise BLAST [19, 11]. Although the further extension to profile-profile alignments is conceptually straightforward, getting the details right is not. Yona & Levitt have shown how an innovative implementation of profile-to-profile alignments can improve as much over PSI-BLAST as PSI-BLAST improved over simple pairwise BLAST [20]; other groups have also developed accurate profile-profile methods [21, 4, 22, 23, 24, 25, 26]. Although such profile-profile methods are still not the standard procedure for everyday sequence analysis, and although even PSI-BLAST has more pitfalls than its ease-of-use may suggest [27, 28], improved alignment algorithms have clearly pushed the twilight zone below what could be achieved by simple pairwise comparisons.

Fold recognition reaches into the midnight zone of sequence alignments. Most pairs of proteins with similar 3D structure have less than 15% pairwise sequence identity, i.e. span across sequence-structure families [29, 11, 30]. These pairs populate the midnight zone of sequence alignment in the sense that their structural similarity cannot be identified by methods based only on sequence information [31, 32, 33]. Fold recognition/threading methods intrude into this midnight zone of sequence comparisons by exploiting additional information available if an experimental 3D structure is available for one of the two proteins or families. While most early fold-recognition methods explored propensities and energy functions [34, 35], many state-of-the-art methods now also explicitly utilise predicted 1D structure [36, 37, 38, 39, 40, 41, 42], i.e. use generalised alphabets for representing sequence information [43]. The power of using structural information is illustrated by the fact that even the first method that used predicted 1D structure without profiles (TOPITS [44, 45]) still outperforms PSI-BLAST if there is a fold to be recognised (Results). Generalised profiles were designed to recognise fold similarities in proteins of known structures; however, very early on it was pointed out that these methods could, in principle, also be used to align two proteins of unknown structure, albeit at slightly [44, 45, 46] or much lower [47, 48] levels of accuracy than for the standard application to proteins of known structure; the most recent implementation of this idea is realised in ORFeus [22]. Despite continued active improvements over the last decade, fold recognition methods remain an expert tool: most of the similarities identified are incorrect. Many methods identify the correct fold as one of their top hits; they fail, however, to correctly score the best solution as best.
This niche is explored by 'meta' methods that successfully re-score original methods [49, 50, 51, 52, 53]. Although META-methods succeed in scoring the best hit higher than original methods, the re-scoring does not appear consistently better than the scores of the best original methods [54].

Shrinking the sequence-structure gap. Large-scale sequencing projects continue to enrich databases of protein sequences at breathtaking pace: over one million protein sequences are archived in SWISS-PROT and TrEMBL, alone [8] . Although techniques in structural biology are also advancing rapidly [55, 56] , experimental structures are deposited in the PDB [6] for less than 3% of all known proteins [57, 58] . At least 15 structural genomics consortia worldwide address this sequence-structure gap attempting to experimentally determine the 3D structure for at least one representative for each sequence-structure family [59] . During its first three years of existence the field of structural genomics has already increased the rate of determining structures for sequence-structure families [60] . It has also shattered many assumptions, most relevant in context of this manuscript: we have to determine structures for over 30,000 sequence-structure families in order to significantly shrink the sequence-structure gap for the sequences known today [61] . One important task for computational biology in this context is to improve comparative modelling/fold recognition methods in order to expand sequence-structure families and to thus reduce the number of families for which structures have to be determined in the first round. 

Here, we described a novel alignment method based on the ideas developed in the fold recognition field that performs consistently and significantly better than the 'gold standard' PSI-BLAST even in the absence of known structures. One important novelty of this method, dubbed AGAPE, was the high reliability when using predicted 1D structure (secondary structure and solvent accessibility) for query and template. Another important source of improvement was the proper statistical estimate for the significance of alignment scores. This yielded improved performance with respect to other methods tested for pairs with similarities in the twilight zone, as well as for pairs in the midnight zone.

 

 

Results

Mistakes in 1D predictions correlate between proteins with similar folds. Proteins with similar 3D structures usually also have similar 1D strings of secondary structure (helix, strand, other) and solvent accessibility (buried, exposed) [62, 45]. This observation and the improved predictions of 1D structure [63] have been explored by fold recognition methods that explicitly match predicted and observed 1D strings to align a protein of unknown structure into a fold library [44, 43, 47, 48, 45, 46, 36, 64, 2, 4, 22, 38, 39, 40, 42]. To realise such comparisons, we first need to define a new substitution matrix that assesses the importance of matching, say, a Leucine in an observed buried helix with an Isoleucine in a predicted exposed helix [44, 65, 45, 66]. Obviously, we could simply compare the frequencies observed for proteins of known structure. However, in order to adequately evaluate such a match, we also have to consider that 1D predictions make particular mistakes. For example, while we scarcely observe structural alignments between strands and helices, prediction methods often confuse strands and helices [27, 67]. Which of those two contributions (observed matches or prediction mistakes) is more important? We aligned a sequence-unique subset of proteins with known 3D structure that are considered as similar folds by SCOP [68, 7] with the 3D alignment method MAMMOTH [69]. Then, we used the structure alignments to compare the similarity between the following three scenarios: (i) observed-observed (1D assignments taken from DSSP [3]), (ii) observed-predicted (prediction from PROFphd [63, 67]), and (iii) predicted-predicted. We were surprised by a finding that has not been observed before: predicted-predicted 1D strings were most consistent for protein pairs with more than 15% sequence identity, and were more consistent than observed-predicted throughout the spectrum of sequence similarity (Fig. 1).
In other words, prediction mistakes correlated between proteins from different sequence-structure families. However, this finding required accurate predictions of 1D structure. In particular, when we generated the same data for PROFphd predictions based on single sequences rather than on alignment (prediction accuracy about eight percentage points lower [67] ), predictions were no longer consistent (data not shown).
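The consistency comparison described above can be illustrated with a minimal sketch. It assumes that each protein is represented by a string of secondary structure states (H, E, L for other) and a string of accessibility states (b, e), and that the structural alignment is given as a list of aligned residue index pairs. The function names, the one-letter codes and the toy strings are hypothetical; the actual data came from DSSP, PROFphd and MAMMOTH.

```python
from typing import List, Sequence, Tuple


def six_state(sec: str, acc: str) -> List[str]:
    """Combine per-residue secondary structure (H/E/L) and accessibility (b/e)
    into six-state labels such as 'Hb' (buried helix) or 'Le' (exposed other)."""
    if len(sec) != len(acc):
        raise ValueError("secondary structure and accessibility differ in length")
    return [s + a for s, a in zip(sec, acc)]


def agreement(states_a: Sequence[str], states_b: Sequence[str],
              aligned_pairs: Sequence[Tuple[int, int]]) -> float:
    """Percentage of structurally aligned residue pairs with identical 1D state."""
    if not aligned_pairs:
        return 0.0
    same = sum(1 for i, j in aligned_pairs if states_a[i] == states_b[j])
    return 100.0 * same / len(aligned_pairs)


if __name__ == "__main__":
    # Toy data (made up): observed and predicted strings for two proteins.
    obs_a = six_state("HHHHLLEE", "bbeebebb")
    obs_b = six_state("HHHHLEEE", "bbeebbbb")
    pred_a = six_state("HHHLLLEE", "bbeeeebb")
    pred_b = six_state("HHHLLLEE", "bbeeeebb")
    pairs = [(i, i) for i in range(8)]  # trivial residue-to-residue alignment
    print("observed-observed  :", agreement(obs_a, obs_b, pairs))
    print("observed-predicted :", agreement(obs_a, pred_b, pairs))
    print("predicted-predicted:", agreement(pred_a, pred_b, pairs))
```

In this toy case the two predictions share the same mistakes and therefore agree perfectly, which mirrors the kind of correlation reported here; the numbers themselves carry no meaning.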



Fig. 1: 1D structure predictions correlated between folds.
We used MAMMOTH [69] to structurally align all-against-all a sequence-unique [12] subset from the PDB. Given these structural alignments, we measured the similarity in six-state 1D structure strings (buried-helix, exposed-helix, buried-strand, exposed-strand, buried-other, exposed-other). We "assigned" 1D structure by DSSP and through PROFphd [67] predictions, and compared the similarity between observed-observed (DSSP for both proteins in the alignment), observed-predicted (one from DSSP, the other from PROFphd) and predicted-predicted (both PROFphd). The surprising finding was that for levels above 15% pairwise sequence identity, the predicted secondary structure strings appeared more similar between proteins of similar folds than the observed strings. This result suggested that prediction mistakes correlate between proteins with dissimilar sequence and similar structure.



Improved ranking. When analysing the performance of a database search method, we can distinguish three related aspects. The first task for any database search method is to unravel that the query protein Q is similar to some protein P, or is similar to P1 and P2 but more similar to P1 (correct ranking). The second task is to provide a score that reflects that e.g. the pair Q1/P1 is more similar to each other than is the pair Q2/P2 (correct scoring). The third task is to provide an estimate for how much a certain alignment stands out from the background (estimate random background through expectation values). This is particularly important in the realm of the twilight [17, 11, 12, 18, 20] and midnight [29, 32, 30, 33] zones where the signal-to-noise ratio is rather small. We measured the ranking by monitoring how many pairs of proteins with similar structure were identified above a given rank (Fig. 2). We compared our methods to four others, namely simple pairwise BLAST searches, profile-based PSI-BLAST searches (five iterations; note: when restricting PSI-BLAST to fewer than five iterations, performance dropped, data not shown), profile-based dynamic programming searches (PSSM-SW, Methods), and a simple pairwise fold recognition method (TOPITS). All methods using only sequence information were outperformed by methods that used observed or predicted aspects of structure. For example, while 23% of the structural similarities were identified by BLAST and 35% by PSI-BLAST as one of the 10 best hits in each search, at the same rank, our methods identified more than 50% of the structural similarities (Fig. 2). The difference between the method introduced here and PSI-BLAST increased when considering hits even further down the list (~46% vs. ~70% at rank 50). Since we tested our method for proteins of known structure, we could also compare the performance between aligning predicted with observed and between aligning predicted with predicted 1D strings.
Incidentally, the performance of PSI-BLAST was easily improved by re-aligning the hits identified by PSI-BLAST through a sequence-profile Smith-Waterman algorithm ( Fig. 2 PSSM-SW). This provided an important baseline to separate the contributions to the improvement from the particular alignment algorithm (our method used dynamic programming) and from adding 1D information.
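The ranking measure used here (how many structurally similar pairs appear among the top hits of each search) can be sketched as follows. This is a minimal illustration, assuming that each search returns a ranked list of template identifiers and that the true pairs are known from SCOP; the function name and the data layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def coverage_at_rank(hits: Dict[str, List[str]],
                     true_pairs: Set[Tuple[str, str]],
                     max_rank: int) -> float:
    """Percentage of true (query, template) pairs that appear among the
    `max_rank` best hits of the corresponding query's search."""
    if not true_pairs:
        return 0.0
    found = set()
    for query, ranked_templates in hits.items():
        for template in ranked_templates[:max_rank]:
            if (query, template) in true_pairs:
                found.add((query, template))
    return 100.0 * len(found) / len(true_pairs)


if __name__ == "__main__":
    # Toy data (made up): two searches and three true pairs, one never found.
    hits = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
    true_pairs = {("q1", "b"), ("q2", "z"), ("q2", "w")}
    for rank in (1, 2, 3):
        print(rank, round(coverage_at_rank(hits, true_pairs, rank), 1))
```

Plotting this percentage against the rank cut-off gives a curve of the kind shown in Fig. 2.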



Fig. 2: Most fold similarities recognised, but incorrectly ranked.
Our fold-unbiased SCOP data set (Methods) contained 1292 "true pairs", i.e. pairs of proteins belonging to the same SCOP fold [7]. Even if we considered the 50 highest-ranking hits in pairwise BLAST searches, we only found about 36% of these relations. PSI-BLAST pushed the coverage to 46% at rank 50, and the methods introduced here reached over 68% at rank 50. One interesting result was that our dynamic programming implementation that used the profile from PSI-BLAST (PSSM-SW) clearly improved over PSI-BLAST for all ranks. Another was that even our old pairwise fold-recognition method TOPITS [44, 29] appeared to improve more over PSI-BLAST than PSI-BLAST improved over BLAST when considering the coverage at very low ranks.



Improved scoring. We evaluated the second aspect of database searches (correct scoring) through ROC-like curves (Fig. 3). For each method we combined all predictions for all query proteins and ordered them according to the score assigned by each method. Two results stood out: First, 1D structure information significantly improved performance in regions of both low and high numbers of false positives (false positives are proteins with different structures identified at a given threshold; Fig. 3). Second, aligning predicted with predicted 1D structure performed at least as well as aligning predicted with observed 1D structure (Fig. 3). The poor performance of our outdated method TOPITS for low-error regions indicated the importance of correctly estimating the background (TOPITS used z-scores with respect to all other proteins in the fold library [44, 45]). When we improved the TOPITS scoring by simply filtering out non-random scores from the score distribution, the modified TOPITS out-performed PSI-BLAST throughout the ROC-like curve (data not shown). On the other hand, the method that used our improved scoring procedure but no 1D structure information (PSSM-SW) out-performed PSI-BLAST, but was significantly worse than the methods using 1D structure. Finally, we tested 'bi-directional', or 'forward/backward' searches [64, 2, 70], i.e. one search with the query against the library (forward) and the other with the library against the query (backward). Simply taking the best score for each direction already improved the scoring significantly; however, combining both (by multiplication of P-values) improved twice as much as did the simple 'best-wins' scenario (for example, the number of true superfamily relations at first rank was 153 for P2P, 166 when taking the best of both, and 180 when multiplying P-values).
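The two ways of combining forward and backward searches can be sketched as follows. This is a minimal illustration, assuming that each direction yields a P-value for the same protein pair; the function names and the toy P-values are hypothetical. Note that multiplying P-values implicitly treats the two searches as statistically independent.

```python
def _check(p: float) -> float:
    """Validate that a P-value lies in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError(f"P-value out of range: {p}")
    return p


def best_wins(p_forward: float, p_backward: float) -> float:
    """'Best-wins' scenario: keep the more significant of the two directions."""
    return min(_check(p_forward), _check(p_backward))


def multiplied(p_forward: float, p_backward: float) -> float:
    """Combine both directions by multiplying the P-values
    (assumes the two searches are independent)."""
    return _check(p_forward) * _check(p_backward)


if __name__ == "__main__":
    # Toy P-values (made up) for two candidate pairs: (forward, backward).
    candidates = {"pair A": (1e-3, 0.5), "pair B": (1e-2, 1e-2)}
    for name, combine in (("best-wins", best_wins), ("multiplied", multiplied)):
        ranking = sorted(candidates, key=lambda k: combine(*candidates[k]))
        print(name, ranking)
```

In the toy data, 'best-wins' ranks pair A first because of its single strong direction, whereas multiplication favours pair B, which is moderately significant in both directions.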



Fig. 3: AGAPE improved clearly over sequence-only methods at all levels of accuracy.
We ordered all pairs from all searches according to the significance score of each method and compiled the number of true (similar fold) and false positives (different fold) at all thresholds in these scores. In total, there were 320 true pairs for SCOP-families (left), 522 for SCOP-superfamilies (middle), and 450 for SCOP-folds (right). The dotted thin lines mark 50% accuracy; the point at which any method crosses this line corresponds to a much lower coverage for the difficult-to-detect fold-only pairs than for the easy SCOP-family pairs. The improvement of the methods that use explicit experimental information about structure (AGAPE-P2O, TOPITS) or use predicted structure (AGAPE-P2P-BiS, AGAPE-P2P) over sequence-only methods (PSSM-SW, PSI-BLAST, BLAST) was higher for pairs less related in sequence. Nevertheless, even for SCOP-families it helped to use generalised sequences (1D structure + sequence). Note that AGAPE-P2P and AGAPE-P2P-BiS (predicted-vs-predicted 1D structure) used only predicted structural information, i.e. could be applied to any database search, while AGAPE-P2O (predicted-vs-observed 1D structure) - like all other fold recognition methods - requires that the structure is known for one of the proteins aligned.
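The construction of such ROC-like curves can be sketched as follows. This is a minimal illustration, assuming that every prediction carries a significance score for which lower values are more significant (as for E-values) and a flag marking whether the pair shares a fold. The function names are hypothetical, and taking the largest number of true positives with accuracy of at least 50% is only one possible reading of "crossing the 50% line".

```python
from typing import List, Tuple


def roc_like(predictions: List[Tuple[float, bool]]) -> List[Tuple[int, int]]:
    """Cumulative (false positives, true positives) as the threshold is relaxed.
    Each prediction is (score, is_true_pair); lower score = more significant."""
    ordered = sorted(predictions, key=lambda p: p[0])
    curve, tp, fp = [], 0, 0
    for _, is_true in ordered:
        if is_true:
            tp += 1
        else:
            fp += 1
        curve.append((fp, tp))
    return curve


def true_positives_at_half_accuracy(curve: List[Tuple[int, int]]) -> int:
    """Largest number of true positives reached while accuracy tp/(tp+fp) >= 50%."""
    best = 0
    for fp, tp in curve:
        if tp >= fp:
            best = max(best, tp)
    return best


if __name__ == "__main__":
    # Toy predictions (made up), pooled over all queries.
    preds = [(1e-9, True), (1e-8, True), (1e-5, False), (1e-4, True),
             (1e-3, False), (1e-2, False), (0.1, False)]
    curve = roc_like(preds)
    print(curve)
    print(true_positives_at_half_accuracy(curve))
```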



Improved estimate of random background. We evaluated the third aspect of database searches (estimate random background) by comparing estimates for expectation values (E-values) generated by the methods with the actual number of false positives found at a given E-value threshold. For our methods, we estimated the E-values through the distributions of scores obtained during the database search. Since the scores of our method approximated an Extreme Value Distribution (EVD, Fig. 1S, Appendix), we based our scoring on a fit to such an extreme value distribution (Methods). We found that our E-values reflected the random background better than did PSI-BLAST E-values (Table 1, AGAPE-P2P). However, estimates for the background of 'bi-directional scoring' were not as good; in particular, the 'bi-directional' scores tended to be too 'optimistic' for low E-values (more random hits than expected for more related pairs). This problem presumably arose from the mistaken assumption that both searches were statistically independent (and hence was more severe for more related pairs). Unfortunately, we did not find a sound fix for this problem.
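How E-values can be derived from an EVD fit may be illustrated with a minimal sketch. It assumes a Gumbel (type I extreme value) distribution fitted by the method of moments, and an E-value defined as database size times the P-value; both are assumptions made for illustration, since the fitting procedure actually used is described in Methods. All function names are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores: Sequence[float]) -> Tuple[float, float]:
    """Method-of-moments estimates (location mu, scale beta) of a Gumbel
    distribution: beta = s * sqrt(6) / pi, mu = mean - gamma * beta."""
    beta = stdev(scores) * math.sqrt(6.0) / math.pi
    mu = mean(scores) - EULER_GAMMA * beta
    return mu, beta


def p_value(score: float, mu: float, beta: float) -> float:
    """P(S >= score) for a random score S under the fitted Gumbel distribution.
    Uses expm1 to stay accurate for very small probabilities."""
    return -math.expm1(-math.exp(-(score - mu) / beta))


def e_value(score: float, mu: float, beta: float, n_database: int) -> float:
    """Expected number of random hits scoring at least `score` in a search
    against `n_database` sequences (assumed here: E = N * P)."""
    return n_database * p_value(score, mu, beta)


if __name__ == "__main__":
    # Toy background scores (made up) from one database search.
    background = [21.0, 23.5, 22.1, 25.0, 24.2, 22.8, 30.5, 23.9, 26.4, 22.3]
    mu, beta = fit_gumbel_moments(background)
    for s in (25.0, 35.0, 45.0):
        print(s, e_value(s, mu, beta, n_database=len(background)))
```

As noted in the footnote to Table 1, fitting only the high-scoring tail instead of all data points often produces even better estimates; the sketch above performs a global fit.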



Table 1: Accuracy of estimating the random background. *

Estimated number of random      Observed number of random similarities (± one standard deviation)
similarities (E-value)
reported by methods             AGAPE-P2P         PSI-BLAST         AGAPE-P2P-BiS
0.1                             0.07 (±0.30)      0.17 (±0.41)      0.91 (±2.18)
1                               0.68 (±1.07)      1.61 (±1.51)      5.24 (±6.35)
5                               4.03 (±3.15)      8.33 (±4.73)      18.5 (±16.2)
10                              8.76 (±4.8)       16.4 (±7.8)       32.4 (±25.6)
50                              52.9 (±10.5)      74.6 (±29.9)      116 (±70.9)
100                             113 (±13.8)       141 (±53.7)       196 (±105)

* Given are expected numbers of random similarities (as estimated by methods) and corresponding observed numbers. Numbers are averages over 494 searches against a database of 3690 random sequences (Methods). Note: for AGAPE-P2P we performed a global fit to all of the data points of the distribution; fitting only the high-scoring tail often produces even better E-value estimates.
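One simple way to read Table 1 is to divide the observed number of random similarities by the estimated one: a ratio of 1 indicates a perfectly calibrated E-value, and ratios above 1 indicate 'optimistic' estimates. The sketch below uses the averages transcribed from Table 1; the ratio itself and the function name are illustrative choices, not a measure used in the original analysis.

```python
from typing import Dict

# Estimated E-value -> observed mean number of random similarities (Table 1).
TABLE_1: Dict[float, Dict[str, float]] = {
    0.1: {"AGAPE-P2P": 0.07, "PSI-BLAST": 0.17, "AGAPE-P2P-BiS": 0.91},
    1: {"AGAPE-P2P": 0.68, "PSI-BLAST": 1.61, "AGAPE-P2P-BiS": 5.24},
    5: {"AGAPE-P2P": 4.03, "PSI-BLAST": 8.33, "AGAPE-P2P-BiS": 18.5},
    10: {"AGAPE-P2P": 8.76, "PSI-BLAST": 16.4, "AGAPE-P2P-BiS": 32.4},
    50: {"AGAPE-P2P": 52.9, "PSI-BLAST": 74.6, "AGAPE-P2P-BiS": 116.0},
    100: {"AGAPE-P2P": 113.0, "PSI-BLAST": 141.0, "AGAPE-P2P-BiS": 196.0},
}


def calibration_ratios(table: Dict[float, Dict[str, float]]
                       ) -> Dict[float, Dict[str, float]]:
    """Observed / estimated number of random similarities; 1.0 is ideal."""
    return {
        estimated: {method: observed / estimated
                    for method, observed in row.items()}
        for estimated, row in table.items()
    }


if __name__ == "__main__":
    for estimated, row in calibration_ratios(TABLE_1).items():
        print(estimated, {m: round(r, 2) for m, r in row.items()})
```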



Detailed inspection of the gain over sequence-only methods. Extreme examples for the benefit from adding predicted 1D structure information are proteins that were identified at low error rates by our method but were not at all found by PSI-BLAST. For the 494 SCOP entries in our test data set, our method not using any information from the known structures of either of the two proteins (predicted-vs-predicted 'bi-directional') found 92 pairs of proteins with similar structure at an average error rate of less than one in 100 that were not found by PSI-BLAST at the same error rate. For most of these relationships (57) SCOP labelled the structural similarity as 'superfamily', 34 fell into the SCOP 'family' category and one into the SCOP 'fold' category. Of those 92, 32 were identified by PSI-BLAST run in the reversed direction, leaving 60 pairs (corresponding to 30 relationships) not accounted for by PSI-BLAST (Table 1S, Appendix). To balance this figure: at the corresponding error rate, PSI-BLAST identified only one protein pair that was missed by our method (Table 2S, Appendix). In order to also identify this one by our method, we had to increase the error rate to about 4 in 100. Even when increasing the error rate to ten errors for each query, PSI-BLAST still did not pick up 19 of the 92 pairs identified by AGAPE (data not shown).

Improvement sensitive to 1D structure prediction accuracy. To study the influence of the accuracy in predicting 1D structure, we plugged various prediction methods into our generalised alignment procedure (Fig. 4). Using the DSSP assignments of 1D structure for both proteins aligned (observed-vs-observed) provided an upper limit for this analysis. More accurate 1D strings performed better for any number of false positives (higher prediction/assignment accuracy -> higher curve in Fig. 4). However, at levels of few false positives, 1D performance mattered slightly less. Furthermore, the performance was not fully proportional to 1D prediction accuracy; instead, the prediction errors of our best method (PROFphd) were relatively less relevant than the errors of our worst method (PROFphd without alignments [71]): 23% error of PROFphd [71] yielded 90-96% of the true positives identified by DSSP-vs-DSSP, while an additional error of 8 percentage points (31% error) for PROFphd without alignments [71] resulted in 76-84% reduction. Apparently, secondary structure prediction methods lose their value rather dramatically when falling below ~73-74% accuracy.



Fig. 4: Better 1D predictions yield better fold recognition.
How much does 1D prediction accuracy impact the performance of fold recognition? We compared four different secondary structure prediction/assignment methods, namely DSSP (assignment from structure with a three-state secondary structure and two-state solvent accessibility "accuracy" around 90% [62, 93]; note that this value constitutes an approximation to differences in 1D assignments for homologous proteins), PROFphd (secondary structure ~77%/accessibility ~78% [67]), PHD (~74%/75% [76]), and PROFphd without alignments (~69%/70% [67]). The trivial finding was: better 1D predictions gave better performance. However, the decrease in performance between DSSP-DSSP and PROFphd-PROFphd was much less significant than that between PROFphd-PROFphd with and without alignments although the decrease in actual 1D structure prediction accuracy was much higher for the first (from DSSP 90/90 to PROFphd 77/78 with alignments and to 69/70 without alignments).



1D information improved alignment quality. The first question was to which extent the reference alignment between query and template (structural super-position by MAMMOTH [69]) overlapped with the alignment proposed by a certain method. For all levels of similarity between query and template (highest for family, lowest for fold), methods using predicted 1D structure overlapped significantly more with the reference alignments than methods only using sequence information (the bars labelled 'reference-overlap' in Fig. 5). The improvement of the methods introduced here over PSI-BLAST was highest for the fold level (Fig. 5C; PSI-BLAST only 28% of AGAPE) and lowest for the family level (Fig. 5A; PSI-BLAST 86% of AGAPE). Predicted-vs-predicted performed once again better than predicted-vs-observed, and the Smith-Waterman-based sequence-profile method outperformed PSI-BLAST. The measure 'reference-overlap' did not penalise overly long alignments; therefore, we also used the 'inverse' measure, namely, the percentage overlap between the method and the reference alignment ('model-overlap' in Fig. 5). Note that this measure has the opposite limitation: short alignments could give high scores. The model-overlap was similar for all methods. Thus, the longer AGAPE alignments (higher 'reference-overlap') were - on average - not achieved by simply aligning too many residues. Both 'reference-overlap' and 'model-overlap' only evaluated whether the corresponding regions had been aligned, not whether the alignments were identical to the reference alignment (Fig. 7). We measured the residue-by-residue agreement between the reference and the method alignment in terms of percentage accuracy (percentage of residues aligned by a method that were identical to the reference alignment) and coverage (percentage of residues aligned by the reference identical to residues aligned by the method).
Longer AGAPE alignments were also more accurate and reached higher coverage than alignments from methods not using 1D structure (Fig. 5). Finally, we examined residues in regions for which both AGAPE (predicted-vs-predicted) and PSI-BLAST overlapped with the reference alignments. Of the residues in these regions, 86% were identical between AGAPE and PSI-BLAST at the family level, 56% at the superfamily level and 33% at the fold level. Even in this subset of 'most PSI-BLAST-like regions', AGAPE reached a higher accuracy than PSI-BLAST (family: 72 vs. 69; superfamily: 39 vs. 30; fold: 17 vs. 9). Note that the performance did not differ in these regions between AGAPE in predicted-vs-predicted and predicted-vs-observed mode (data not shown).
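The residue-by-residue accuracy and coverage defined above can be sketched as follows. This is a minimal illustration, assuming that an alignment is represented as a set of (query position, template position) pairs; the function name and the toy alignments are hypothetical. The region-based measures 'reference-overlap' and 'model-overlap' are not covered by this sketch.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]  # (query residue index, template residue index)


def accuracy_and_coverage(method_pairs: Iterable[Pair],
                          reference_pairs: Iterable[Pair]) -> Tuple[float, float]:
    """Accuracy: percentage of pairs aligned by the method that are identical
    to the reference. Coverage: percentage of reference pairs that the method
    reproduces identically."""
    method, reference = set(method_pairs), set(reference_pairs)
    identical = len(method & reference)
    accuracy = 100.0 * identical / len(method) if method else 0.0
    coverage = 100.0 * identical / len(reference) if reference else 0.0
    return accuracy, coverage


if __name__ == "__main__":
    # Toy alignments (made up): the method gets two of four reference pairs
    # right and shifts a third one by a single template residue.
    reference = [(0, 0), (1, 1), (2, 2), (3, 3)]
    method = [(1, 1), (2, 2), (3, 4)]
    print(accuracy_and_coverage(method, reference))
```

A pair that is shifted by even one residue counts as wrong under both measures, which is why they are stricter than the overlap-based scores.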

Tremendous improvement on the scale between best and worst. We used PSI-BLAST and pairwise BLAST in our comparisons as baselines, i.e. implicitly assumed that these two constitute a lower limit. While the reference alignments obviously constituted the upper limit (100% right), we also explored to which extent this upper limit was specific to the particular reference alignment method used by testing another method for structural superposition, namely CE [72]. Manfred Sippl has continuously tried to raise awareness for the fact that there often are alternative structural alignments that are 'equally good' by many measures [73]. If so, different structural superposition methods are likely to differ significantly. This was exactly what we found: while the CE alignments were better than those from AGAPE, the differences in accuracy were surprisingly small (Fig. 5). This comparison between 'best' and 'worst' also illustrated the amount of improvement for AGAPE over PSI-BLAST/BLAST. We also compiled all data considering CE as standard-of-truth. The results did not differ qualitatively from the ones shown in Fig. 5, i.e. now AGAPE appeared to be almost as accurate as MAMMOTH.



Fig. 5: Longer and better alignments.
We measured the correctness of alignments by four simple scores each of which captures some aspects of the agreement between the alignment generated by the methods tested and by the reference structural alignment from MAMMOTH [69] (see Fig. 7 for a detailed sketch). For example, for the most sequence-related pairs (SCOP-family level, left), PSI-BLAST aligned about 57% of the reference alignments at almost 65% accuracy; our method not using 3D information (AGAPE-P2P) covered about 63% of the reference alignments at more than 65% accuracy. The difference was higher for SCOP-superfamily relations: PSI-BLAST 25%/13% (accuracy/coverage) and AGAPE-P2P 28%/23%. How significant this increase was became apparent when considering the performance of another structural alignment method evaluated against the same reference: CE [72] reached 30% accuracy at 26% coverage. Thus, AGAPE-P2P was closer to the performance of the structural alignment method than to that of PSI-BLAST.



AGAPE better than established fold recognition methods without using fold. Finally, we compared the fold recognition performance of AGAPE with the two sustained fold recognition methods 3D-PSSM [1, 2] and FUGUE [5]. This comparison was based on a set of 440 queries taken from the EVA server that automatically evaluates structure prediction servers [74, 75]. The standard-of-truth for this comparison was that two proteins shared the same SCOP fold. While all methods were tested on the same data set, 3D-PSSM and FUGUE predictions were collected at the time when the respective proteins were published on the PDB web site (between 2002 and 2003), while the predictions of AGAPE and PSI-BLAST were collected at the end of 2003. While for each query we did remove all proteins that were not in the PDB at the time of testing 3D-PSSM and FUGUE, we could not apply the same caution to the sequence databases used for AGAPE and PSI-BLAST (simply because we do not keep weekly freezes of all databases). Thus, the comparison is likely to be unfairly favourable to AGAPE and PSI-BLAST. However, this 'unfairness' appeared to be limited, since both AGAPE and PSI-BLAST yielded similar performance when compiled on a reduced sequence database for which we randomly chose only 40% of all sequences from the current SWISS-PROT and TrEMBL versions (data not shown). Given all these cautions, it appeared that AGAPE performed considerably better than FUGUE and 3D-PSSM (Fig. 6). Another interesting side-result of this comparison was that our attempts at estimating performance on fold-unbiased data sets appeared to have provided estimates that were more realistic than comparisons based on fold-biased sets would have. The EVA set was more realistic in the sense that - by construction of the data set - at the time of depositing the query proteins into the PDB there was no other fold that could have been reliably identified as similar by PSI-BLAST.
When comparing the difference between the points of 50% accuracy for AGAPE and PSI-BLAST (dotted lines in Fig. 3 and Fig. 6 ), the PSI-BLAST results on the EVA set were rather similar to those obtained for our fold-unbiased data sets. One of the best non-META methods at CASP5 was ORFeus [22, 51, 40] . This method uses 1D predictions and explores the full potential of profile-profile alignments. We could not compare ORFeus and AGAPE on identical data sets. The only direct comparison was possible by using the comparison to PSI-BLAST published by the authors [22] : ORFeus found 47-68% more correct hits at rank 1 than PSI-BLAST (68% for three and 47% for six iterations), AGAPE found 65% more than PSI-BLAST at five iterations. At the only well-defined point in their publication, i.e. for 50 false predictions, PSI-BLAST reached 64% accuracy, ORFeus 70%. At the same level of accuracy for PSI-BLAST in our analysis ( Fig. 3 >



Fig. 6 : AGAPE without folds competitive with fold recognition methods.
In order to compare our methods, based on the same data set, to well-established, accurate fold recognition methods, we used a data set with 440 proteins provided by the EVA server. Note: while all methods were tested on identical sets, 3D-PSSM [1, 2] and FUGUE [5] were tested at an earlier point in time when sequence databases were smaller. Since all methods implicitly use sequence databases, this difference implied that the methods tested later (AGAPE and PSI-BLAST) had some advantage. The dashed thin grey line marks the points at which each method has 50% accuracy; for example, PSI-BLAST crossed this line for 72 proteins, FUGUE for 104, 3D-PSSM for 107, AGAPE-P2P for 112 and 'bi-directional' AGAPE (P2P-BiS) for 123. Thus, at that accuracy the other four methods found about 45-70% more true positives than PSI-BLAST (44%, 49%, 55% and 70%, respectively). Note that our method AGAPE did not use any information from the folds of either protein of the aligned pair (mode: predicted-vs-predicted). Also note that the relative performance of AGAPE-P2P indicated that 'fold recognition-like' performance with accurate estimates of E-values could be achieved on proteins with unknown 3D-structures.



 

Discussion

Redefining the goal of secondary structure prediction, again? Better predictions of 1D structure yielded better fold recognition in the AGAPE framework (Fig. 4). Interestingly, the improvement from PROFphd without alignments (three-state per-residue accuracy ~69%) to PROFphd with alignments (about 77%) covered at least half of what could be gained by optimal prediction methods. Although secondary structure prediction methods continue to become more accurate [27, 54], the latest improvement between today's PROFphd and PHDpsi from a decade ago is relatively marginal [76], suggesting that further improvements might be increasingly difficult. David Jones and co-workers proposed to improve secondary structure prediction methods specifically with respect to their value for fold recognition [77]. Our observation that mistakes in 1D structure predictions correlated between proteins with similar folds might indicate another direction for specific improvement (Fig. 1). In particular, at higher levels of sequence similarity (SCOP family level) two predicted strings of secondary structure were more similar to each other than the two assignments from DSSP, as if prediction errors were more often meaningful than not. Maybe we should improve secondary structure prediction methods in exactly those regions for which this unexpected error-correlation breaks down. Clearly, such an approach has never been attempted.

Relative improvement similar for sequence- and fold-unique data sets. To evaluate the performance of our methods across many fold types, we removed the bias from the PDB/SCOP by under-sampling over-represented folds (Methods). Our results consequently reflected the 'average per fold' performance. Incidentally, our particular bias-reduction explained why, for example, the ROC-like curves (Fig. 3) were considerably lower than curves published by many colleagues. On the other hand, there appear to be sound reasons suggesting that certain fold types are over-represented in nature [78, 79]. If so, a method A that performs well on more populated folds may have a lower per-fold average than another method B that performs more equally for all folds, and may still be more often right than B. In order to ascertain that the AGAPE methods were not less accurate for populated folds, we also re-compiled all our analyses based on the entire SCOP40 (data not shown). Apart from suggesting numerically higher values for all methods (popular folds are easier), these different data sets did not add any information. In particular, the AGAPE methods were as much better than PSI-BLAST/BLAST for these biased data sets as they were for our fold-unbiased set.

Correct estimate of statistical significance crucial. A method that correctly identifies similar folds but fails to correctly capture the statistical significance of a particular alignment would appear accurate by the test shown in Fig. 2 (how many correct folds amongst first N hits for each query) and inaccurate by the test shown in Fig. 4 (how many correct folds with better scores than the N highest scoring incorrect ones). TOPITS, our first method explicitly using predicted 1D structure for fold recognition [44], already improved more over PSI-BLAST than PSI-BLAST over BLAST in terms of the rank-test (Fig. 2), i.e. TOPITS was good for what it was developed for, namely for getting the top hits right. However, TOPITS did not perform well in the high-accuracy/low-coverage regime when correct significance estimates were required (Fig. 3). One of the major sources of success for the AGAPE methods introduced here was the valid estimate of statistical significance. These precise E-values were also the reason for an impressive improvement through 'bi-directional' scoring. The combined 'bi-directional' score succeeds impressively in the ROC-like test (Fig. 3), suggesting that the shapes of the background distributions for 'bi-directional' searches are similar. On the other hand, the bi-directional scores no longer correctly estimated the background for all ranges of similarity: for closely related proteins, the E-values were over-optimistic because the assumption that forward (query against database) and backward searches (database against query) are statistically independent was incorrect. We currently do not have any thorough solution for this problem.

Fold recognition without folds competitive with methods using 3D information. Given all the caution about comparing our method to FUGUE and 3D-PSSM at a different point in time, our results still demonstrated that our method performed better than these methods. However, AGAPE in this comparison did not even use any experimental structural information while 3D-PSSM and FUGUE both are generic fold recognition methods that can only identify similarities to proteins of known structure. CASP5 suggested that 3D-PSSM and FUGUE might no longer be the most successful original fold recognition methods [54, 40] . Hence, it remains to be shown that fold recognition without fold can really compete with the best methods that explicitly use 3D information.

 

Conclusions

We introduced AGAPE, a novel method that considerably improved fold recognition and alignment accuracy for pairs of proteins with different sequences and similar structures without ever using the information of known structures. Due to the surprising finding that mistakes in 1D structure predictions (secondary structure and solvent accessibility) correlate, AGAPE was in fact often more reliable in identifying structural similarities when explicitly ignoring known structural information. Although our results were strictly valid only for the framework of the AGAPE methods introduced here, we have no reason to doubt that other fold recognition methods could also be improved by aligning predicted with predicted 1D structure. According to most measures for performance that we analysed, our sequence-only methods improved more over PSI-BLAST than PSI-BLAST over BLAST. As for other fold-based recognition methods, the improvement was higher for structural pairs with less sequence similarity. AGAPE found half of all pairs of proteins that belong to the same SCOP superfamily and different SCOP families at 60% accuracy compared to 5% accuracy for PSI-BLAST. For proteins only related on the SCOP fold level, our method largely failed: at 20% accuracy AGAPE found only 6% of the similarities (compared to 1% by PSI-BLAST). Nevertheless, in a realistic comparison AGAPE outperformed the well-established, sustained fold recognition methods 3D-PSSM and FUGUE that both are only applicable in comparisons in which an experimental structure is available for at least one of the proteins aligned. AGAPE is available for searches against known structures through PredictProtein (www.predictprotein.org). However, in order to fully exploit the potential of AGAPE, we would have to execute full dynamic programming alignments on very large databases.
While the CPU-time required for such a search is prohibitive for a large-scale service such as the one in PredictProtein (http://www.predictprotein.org/, [80]), it is feasible for individual important searches.

 

Methods and Materials



Model

1D structure predictions. The aspects of protein 1D structure that we used were secondary structure and solvent accessibility. In particular, we mapped secondary structure onto three states (helix, strand and other) and relative solvent accessibility onto two states (buried, i.e. <15% relative accessibility, and exposed). These were represented by six-dimensional 1D state-vectors (buried-helix, buried-strand, buried-other, exposed-helix, exposed-strand, exposed-other). Assignments of observed 1D states were taken from DSSP [3] using the standard mapping for secondary structure [81] and solvent accessibility [82] . Predictions were taken from PROFphd [71] , and for comparison from PHD [63] . The input alignments were generated using an optimised protocol [28, 76] for running PSI-BLAST [13] .

Generalised protein profiles. Alignment methods usually represent proteins by their amino acid sequences that are coded by 20-component-per-residue vectors: either binaries for pairwise sequence-sequence comparisons, or real-valued components for sequence-profile and profile-profile comparison. For our generalised profiles, we simply expand these vectors to 20*6=120 real-valued components per residue. This conceptually simple extension raises the issue of which scoring matrix to use. We built our generalised position specific scoring matrix (GPSSM) as a weighted sum over (i) a position specific scoring matrix for the sequence part (PSSM), and over (ii) a 1D structure state scoring matrix (SM). Given positions i and j in two generalised protein sequences the score was given by:

 

where i labelled the residue position in the query/target, a_j gave the amino acid type at position j in the database/template protein (20 types), and φ_i and ψ_j represented the 1D structure states of query and template, respectively (six states). The weight factor r was set to 0.4 (optimised on a separate data set). The 1D substitution matrix (SM) was largely taken from our earlier work [44, 45], and slightly modified by subtracting 0.1 from all elements in order to obtain slightly negative random scores. The random background for 1D structure matches was computed using the 1D state distribution of PROFphd predictions, compiled on a sequence-unique (HSSP structural similarity threshold equal to zero [12]) subset of the PDB.
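As a rough illustration of the weighted sum described above, the following Python sketch scores a single query/template position pair. The convex form (1 - r) * PSSM + r * SM is an assumption: the text only states that the score was a weighted sum with r = 0.4. The function name, the state labels and the toy matrix values are hypothetical and serve illustration only.

```python
# Hypothetical labels for the six 1D states (burial x secondary structure).
STATES = ("bH", "bE", "bL", "eH", "eE", "eL")


def gpssm_score(pssm_column, template_residue, sm, query_state, template_state, r=0.4):
    """Score query position i against template position j.

    pssm_column: dict mapping amino acid -> PSSM score at query position i
    sm:          dict of dicts with 1D structure state substitution scores
    ASSUMPTION:  the two parts are combined as a convex combination weighted by r.
    """
    sequence_part = pssm_column[template_residue]
    structure_part = sm[query_state][template_state]
    return (1.0 - r) * sequence_part + r * structure_part


# Toy values, invented for this example.
toy_pssm_column = {"A": 4.0, "G": -1.0}
toy_sm = {s: {t: (1.0 if s == t else -0.5) for t in STATES} for s in STATES}
score = gpssm_score(toy_pssm_column, "A", toy_sm, "bH", "bH")
```

With r = 0.4 the sequence part still dominates, so a mismatch in predicted 1D states lowers the score of a good sequence match rather than cancelling it.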

Parameters for dynamic programming. We embedded our methods into the formalism of Smith-Waterman-like dynamic programming [83] as implemented by the program MaxHom [10]. We used affine gap penalties, with the specific choices of gap_open=11 and gap_elongation=1. (The results presented were, on average, not sensitive to marginal changes in these values (±20%).) We obtained our position specific scoring matrix (PSSM) from PSI-BLAST alignments [13]. In particular, we first searched against a redundancy-filtered [84] and SEG-filtered [85] version of SWISS-PROT+TrEMBL [8]. We stored the resulting PSI-BLAST PSSM after 5 iterations ('j'=5, 'h'=0.001). Finally, we searched with these profiles against the unfiltered version of SWISS-PROT+TrEMBL.

Fold recognition methods. We used the PSI-BLAST and BLAST programs as reference fold recognition methods. To study the influence of 1D structure information on fold recognition we used several in-house methods dubbed AGAPE (Aligning Generalized Profiles) that compared different scenarios for obtaining 1D structure. In particular, we used DSSP assignments for both query and template (O2O), PROFphd predictions for query and DSSP assignments for template (P2O), and PROFphd predictions for both query and template (P2P). Furthermore, we tested two less accurate 1D prediction methods, namely PHDpsi [76] and PROFphd without alignments, and compared with our previous fold recognition method TOPITS. We also introduced another novel sequence-profile method (dubbed PSSM-SW) that re-aligned proteins according to the position-specific scoring matrix identified by PSI-BLAST through dynamic programming [83] and estimated statistical significance in the same way as AGAPE.

Estimating statistical significance. Simple protein models often assume that amino acids occur randomly at all positions. Such an assumption appears overly naive for secondary structure states in generalised sequences; such states appear to be much more 'Markovian'. It has been shown that scores of un-gapped optimal local alignments in such a situation approach extreme value distributions [86, 87]. We applied an extreme value distribution to approximate gapped alignments of generalised protein sequences. We followed the approach used for regular sequences [88] and assumed the probability that the optimal alignment attains a score S ≥ x to be:

$$P(S \geq x) = 1 - \exp\left(-K\,m\,n\,e^{-\lambda x}\right) \qquad (2)$$

where m and n were the sequence lengths, and λ and K were the parameters that we estimated by fitting eqn. 2 to the scores generated during the search of the protein database. When parameters are estimated in this way, true positive outliers skew the distribution. We coped with these outliers through a simple filtering procedure. First, we searched against the full database and removed the highest scoring protein and proteins related to it. Next, we fitted eqn. 2 to the remaining scores. Then, we compiled the E-values for the removed scores based on the fitted parameters. Finally, we added all scores with E-values ≥ 1 back into the random background and refitted eqn. 2. We performed maximum likelihood fitting to estimate the parameters.

Bi-directional scoring. Assume that proteins A and B are related. If we align the pair using their sequences only, aligning A against B usually yields similar results as aligning B against A. This symmetry disappears if we use sequence-profile alignments because the profiles resulting from A against the database and B against the database may not contain similar amounts of information. Obviously, the power of profile-based methods depends crucially on this information. Hence, A-profile vs. B may give very different results than A vs. B-profile. In the context of a database method, this translates to searches with the query against the profiles of all database proteins ('backward') and with all database proteins against the profile of the query ('forward'). Such 'bi-directional' searches have previously been shown to improve detection of similarities [36, 1]. Usually only the best score of both searches is used, while information contained in the score from the opposite direction is discarded. We decided to try combining the evidence contained in both scores. Given the scores s1 and s2 from 'bi-directional' searches, we computed probabilities (p-values) p1 and p2 of obtaining these or better scores through random alignments. Computing a p-value for the joint distribution of 'bi-directional' scores requires estimating conditional probabilities (p(s2|s1)). We tested the assumption that the two scores are statistically independent, i.e. that the probability p12 of obtaining pairs of scores equal to or better than s1, s2 is the product of the p-values, p12 = p1 * p2. Other combinations of scores s1', s2' with a corresponding product p12' may equal p12. Thus, p12 itself is not equal to the probability of obtaining this value. However, under the null hypothesis, the distribution of p-values of a continuous test statistic is uniform on the interval [0,1]. Therefore, the distribution of the statistic p12 is, under our statistical independence assumption, equal to the distribution of the product of independent uniform random variables. The related statistic (-2 ln p12) can be shown to have a χ² distribution with four degrees of freedom [89]. In this work, we decided to use another more direct and faster way of computing the p-value of p12. It was shown that the probability Fn(p) for the product of n independent, uniform [0,1] random variables to have an observed value less than or equal to p is given by:

$$F_n(p) = p \sum_{k=0}^{n-1} \frac{(-\ln p)^k}{k!}$$

for 0 < p ≤ 1, and is zero when p is zero [90, 91]. In our case, for n=2, we have:

$$F_2(p_{12}) = p_{12}\,(1 - \ln p_{12})$$

We found that combining evidence from both searches (forward/backward) in that way resulted in better separation of true and false positives, but that the random background estimates were not as accurate as those of either the forward or the backward search alone.
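The combination of forward and backward p-values can be sketched in a few lines of Python. The function below evaluates the distribution of a product of n independent uniform [0,1] variables, F_n(p) = p * sum_{k<n} (-ln p)^k / k!, which for n = 2 reduces to p12 * (1 - ln p12); this is also the upper tail of a χ² distribution with four degrees of freedom evaluated at -2 ln p12. The function name is hypothetical, and the sketch inherits the independence assumption discussed above.

```python
import math


def product_pvalue(pvalues):
    """P-value of the product of independent p-values (uniform under the null).

    Evaluates F_n(p) = p * sum_{k=0}^{n-1} (-ln p)^k / k!  with p = prod(pvalues).
    """
    product = 1.0
    for value in pvalues:
        if not 0.0 <= value <= 1.0:
            raise ValueError("p-values must lie in [0, 1]")
        product *= value
    if product == 0.0:
        return 0.0
    neg_log = -math.log(product)
    term, total = 1.0, 0.0
    for k in range(len(pvalues)):
        total += term                 # adds (-ln p)^k / k!
        term *= neg_log / (k + 1)     # prepares the next term
    return min(1.0, product * total)


# Forward and backward p-values of 0.01 each.
combined = product_pvalue([0.01, 0.01])
```

Note that the combined value is larger than the raw product p1 * p2 (1e-4 in this example), because many different pairs of p-values yield a product at least as small.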

Database of generalised profiles. The implementation of bi-directional alignments and scoring required setting up a database of generalised profiles for template proteins. We built the generalised profiles (GPSSM) for the templates in the same way as for the queries. To assign statistical significance to alignment scores of generalised query sequences and template GPSSMs we also needed to estimate parameters for the random score distribution for each template GPSSM. This was also done in much the same way as for the queries: we scored each generalised template profile against all other 'unrelated' generalised template sequences and then fitted eqn. 2 to the set of scores obtained. The fitted parameters for each template GPSSM were stored together with the GPSSMs in the database. This procedure needed to be carried out only during initial setup.
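The fitting step can be illustrated with a minimal maximum likelihood sketch in Python. It assumes that eqn. 2 has the standard extreme value form P(S ≥ x) = 1 - exp(-K m n e^{-λx}), which is a Gumbel distribution with location μ = ln(K m n)/λ. The function names and the bisection solver are choices made for this sketch, not the authors' implementation, and the outlier filtering described in the text is omitted.

```python
import math
import random


def fit_gumbel_ml(scores):
    """Maximum likelihood estimates (lam, mu) for P(S >= x) = 1 - exp(-exp(-lam*(x - mu)))."""
    n = len(scores)
    mean = sum(scores) / n
    xmin = min(scores)
    if max(scores) == xmin:
        raise ValueError("scores must not all be identical")

    def g(lam):
        # Stationarity condition of the log-likelihood in lam (zero at the optimum).
        w = [math.exp(-lam * (x - xmin)) for x in scores]  # shifted for stability
        weighted_mean = sum(x * wi for x, wi in zip(scores, w)) / sum(w)
        return 1.0 / lam - mean + weighted_mean

    lo, hi = 1e-9, 1.0
    while g(hi) > 0.0:          # expand the bracket until the sign changes
        hi *= 2.0
    for _ in range(100):        # plain bisection
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w_mean = sum(math.exp(-lam * (x - xmin)) for x in scores) / n
    mu = xmin - math.log(w_mean) / lam
    return lam, mu


def k_from_location(lam, mu, m, n):
    """K such that K * m * n * exp(-lam * x) equals exp(-lam * (x - mu))."""
    return math.exp(lam * mu) / (m * n)


def p_value(score, lam, mu):
    """Probability of a random score at least as high as `score`."""
    return -math.expm1(-math.exp(-lam * (score - mu)))
```

In a setup like the one described, `fit_gumbel_ml` would be called once per template profile on its background scores, and the resulting parameters stored alongside the profile.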

 



Evaluation

Data sets. We based most of our evaluation on SCOP [68, 7] (release 1.61), in particular on the 40% sequence-identity filtered version from ASTRAL [92] (http://astral.stanford.edu/). This set, SCOP40, was not fold-unique; in particular, it contained many versions of immunoglobulin-like, ferredoxin-like, DNA/RNA-binding 3-helical bundle, knottins, TIM barrel and OB folds. In order to assess performance on the basis of less biased folds, we generated a fold-unbiased subset of SCOP40 by randomly removing over-represented folds. We also removed membrane proteins, coiled-coil proteins, designed proteins and small (<30 residues) proteins. In order to avoid over-fitting, we then split this set into two: a training set (495 proteins: 151 folds, 222 superfamilies, 339 families) and a test set (494 proteins: 152 folds, 219 superfamilies, 334 families). The two sets were separated in the sense that they had no overlap in their SCOP families, and they had similar numbers of relations. In particular, the test set had 320 pairs related by SCOP-family, 522 pairs related by SCOP-superfamily, and 450 related by SCOP-fold only. All methods were asymmetric (A against B gave different results than B against A). Thus, we used both incidents independently (AB one alignment, BA another). Note that although BLAST alignments [19, 13] are intrinsically symmetric, the BLAST scores are not. The pairs used to study SCOP-family relations had on average 21% pairwise sequence identity in their 3D alignments, SCOP-superfamily pairs about 10% and SCOP-fold pairs about 8% sequence identity. That meant that even the relatively simple set of SCOP-family pairs could on average be classified as hard comparative homology cases (below 30% sequence identity, alignment quality deteriorates rapidly [75]).

Randomised data set for studying score distributions. To study the distribution of random scores produced by different methods we randomised sequences derived from SCOP40 (3690 sequences after removing membrane, coiled-coil and other biased proteins). We assigned predicted secondary structure (PROFsec) and solvent accessibility (PROFacc) to each residue, ending up with a collection of generalised sequences. Then we randomly shuffled secondary structure elements of each sequence (helix with helix, strand with strand and other with other). Next we randomly shuffled residues within each secondary structure element.
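The two-step shuffling can be sketched as follows in Python. The representation of a generalised sequence as a list of (amino acid, secondary structure state, accessibility state) tuples and the function name are assumptions made for this illustration; the sketch shuffles elements within one sequence, which is one reading of the description above.

```python
import random
from itertools import groupby


def shuffle_generalised(sequence, rng=random):
    """Shuffle a generalised sequence while keeping the order of element types.

    sequence: list of (amino_acid, ss_state, accessibility) tuples.
    Step 1: elements of the same type swap places (helix with helix, etc.).
    Step 2: residues are shuffled within each element.
    """
    # Consecutive residues with the same ss_state form one element.
    elements = [list(group) for _, group in groupby(sequence, key=lambda res: res[1])]

    # Pool the elements by type and shuffle each pool.
    pools = {}
    for element in elements:
        pools.setdefault(element[0][1], []).append(element)
    for pool in pools.values():
        rng.shuffle(pool)

    shuffled = []
    for element in elements:
        replacement = list(pools[element[0][1]].pop())  # same type, random element
        rng.shuffle(replacement)                        # shuffle residues inside it
        shuffled.extend(replacement)
    return shuffled


# Toy example: two helices, two loops and one strand.
toy = list(zip("ACDEFGHIKLMN", "HHHLLEEELLHH", "bbeebbeebbee"))
randomised = shuffle_generalised(toy, random.Random(1))
```

The result keeps the amino acid and 1D state composition and the order of element types, while destroying the original residue order.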

Additional data set for comparison with other methods. The SCOP data sets did not allow the direct comparison between our method and other fold recognition methods. The EVA server has been collecting results for structure prediction methods for over four years now [74, 54, 75]. In particular, from EVA we chose predictions from the two well-established methods 3D-PSSM [1, 2] and FUGUE [4, 5] for 440 proteins. For none of these was there a clear sequence homologue (HSSP-value> 0 <<<0.01) in the PDB at the time of deposition. No pair of proteins in the set of 440 had significant sequence similarity to each other either. When computing AGAPE results on this set, we were careful to use the same PDB data sets that were available to other servers, i.e. at the time of deposition. Due to the growth in sequence data, AGAPE could, however, access much larger sequence databases than 3D-PSSM and FUGUE.

Reference 3D alignments. Generally, when using structural alignments as a benchmark, the problem is that these are not uniquely defined; in fact, there often are alternative structural alignments that by many standards could be considered equal [73]. We slightly reduced the impact of this reality by using SCOP domains as the basic unit. Our results showed that this step did not entirely resolve the problem. The results presented in all figures used structural alignments from MAMMOTH [69] as the standard of truth. However, we also used structural alignments from CE [72] for comparison. We found that CE-based results corroborated the results obtained with MAMMOTH.

Alignment quality. We used a number of measures to capture various aspects of alignment quality ( Fig. 7 ). (1) The reference-overlap is the percentage of residues for which a reference 3D alignment overlapped with a fold recognition alignment (note: not necessarily the same residues, simply the region). (2) The model-overlap measured the flip-side, i.e., the percentage of fold recognition alignment overlapped by the reference alignment. (3) Coverage measured residues identically aligned between reference and method as a percentage of the reference alignment length, and (4) accuracy measured the identically aligned residues as percentage of the alignment length of the method.
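The four measures can be written down compactly if each alignment is represented as a set of (query position, template position) pairs. The Python sketch below uses that representation, which is an assumption, and interprets 'overlap' as query positions aligned by both methods regardless of their partner residue; the function name is hypothetical.

```python
def alignment_quality(reference, model):
    """Return the four alignment quality measures in percent.

    reference, model: non-empty sets of (query_position, template_position) pairs.
    """
    identical = reference & model                      # residue pairs aligned identically
    ref_region = {i for i, _ in reference}             # query positions in the reference
    model_region = {i for i, _ in model}               # query positions in the model
    shared_region = ref_region & model_region          # same region, any partner
    return {
        "reference_overlap": 100.0 * len(shared_region) / len(ref_region),
        "model_overlap": 100.0 * len(shared_region) / len(model_region),
        "coverage": 100.0 * len(identical) / len(reference),
        "accuracy": 100.0 * len(identical) / len(model),
    }


# Toy example: the model shifts one residue and extends past the reference.
reference = {(1, 1), (2, 2), (3, 3), (4, 4)}
model = {(2, 2), (3, 4), (4, 4), (5, 5), (6, 6)}
quality = alignment_quality(reference, model)
```

In the toy example three of the four reference positions fall into the region aligned by the model, but only two residue pairs are aligned identically, so the overlap measures exceed coverage and accuracy.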




Fig. 7 : Measuring the correctness of alignments.
All scores were compiled with respect to the reference structural alignments from MAMMOTH [69] . The numbers in the second column label the residues in the target protein, the small-cap letters the residues aligned in the target. The overlap measures evaluate to which extent the regions aligned by a certain method overlap with the regions aligned in the reference alignment; reference-overlap is the percentage with respect to all residues aligned by the reference, model-overlap the percentage with respect to the model generated by the method. Coverage and accuracy evaluate the equivalence between reference alignment and method alignment; coverage captures identically aligned residues as percentage of the reference alignment length, accuracy the identically aligned residues as percentage of the method alignment length.

 

 



 



Acknowledgements

Thanks to Jinfeng Liu and Megan Restuccia (Columbia) for computer assistance; to Guy Yachdav (Columbia) for help in setting up the server; to the EVA team for the support of a crucial server that enables comparisons to many structure prediction methods, namely to Ingrid Koh (Columbia), Volker Eyrich (Schroedinger, Inc.), Osvaldo Grana (CNB Madrid), Alfonso Valencia (CNB Madrid), Marc Marti-Renom (UCSF), and Andrej Sali (UCSF). Special thanks also to the developers of publicly available fold recognition and structure prediction servers that helped us in assessing our method. In particular, thanks to the developers of 3D-PSSM and FUGUE. Thanks also to the anonymous reviewers for crucial comments. The work of DP and BR was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health (NIH). Last but not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.

 

 

References

1.Kelley, L. A., MacCallum, R. M. &Sternberg, M. J. (2000). Enhanced genome annotation using structural profilesin the program 3D-PSSM. J. Mol. Biol., 299, 499-520.
2.Bates, P. A., Kelley, L. A., MacCallum,R. M. & Sternberg, M. J. (2001). Enhancement of protein modeling by humanintervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM. Proteins, Suppl, 39-46.
3.Kabsch, W. & Sander, C. (1983).Dictionary of protein secondary structure: pattern recognition ofhydrogen-bonded and geometrical features. Biopolymers, 22, 2577-637.
4.Shi, J., Blundell, T. L. &Mizuguchi, K. (2001). FUGUE: sequence-structure homology recognition usingenvironment-specific substitution tables and structure-dependent gap penalties.J. Mol. Biol., 310, 243-57.
5.Williams, M. G., Shirai, H., Shi, J.,Nagendra, H. G., Mueller, J. et al. (2001). Sequence-structure homologyrecognition by iterative alignment refinement and comparative modeling. Proteins, Suppl, 92-97.
6.Berman, H. M., Westbrook, J., Feng, Z.,Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucl. AcidsRes., 28, 235-42.
7.Lo Conte, L., Brenner, S. E., Hubbard,T. J., Chothia, C. & Murzin, A. G. (2002). SCOP database in 2002:refinements accommodate structural genomics. Nucl. Acids Res., 30, 264-267.
8.Bairoch, A. & Apweiler, R. (2000).The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl.Acids Res., 28,45-8.
9.Chothia, C. & Lesk, A. M. (1986).The relation between the divergence of sequence and structure in proteins. EMBOJ., 5, 823-826.
10.Sander, C. & Schneider, R. (1991).Database of homology-derived structures and the structural meaning of sequencealignment. Proteins, 9, 56-68.
11.Brenner, S. E., Chothia, C. &Hubbard, T. J. P. (1998). Assessing sequence comparison methods with reliablestructurally identified distant evolutionary relationships. Proc. Natl.Acad. Sci. U.S.A., 95, 6073-6078.
12.Rost, B. (1999). Twilight zone ofprotein sequence alignments. Prot. Engin., 12, 85-94.
13.Altschul, S. F., Madden, T. L.,Schaffer, A. A., Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST andPSI-BLAST: a new generation of protein database search programs. Nucl. AcidsRes., 25, 3389-402.
14.Chenna, R., Sugawara, H., Koike, T.,Lopez, R., Gibson, T. J. et al. (2003). Multiple sequence alignment with theClustal series of programs. Nucl. Acids Res.,31, 3497-3500.
15.Eddy, S. R. (1998). Profile hiddenMarkov models. Bioinformatics, 14, 755-763.
16.Karplus, K., Barrett, C., Cline, M.,Diekhans, M., Grate, L. et al. (1999). Predicting protein structure using onlysequence information. Proteins, S3, 121-125.
17.Doolittle, R. F. (1986). Of URFs andORFs: a primer on how to analyze derived amino acid sequences. UniversityScience Books, Mill Valley California.
18.Pawlowski, K., Jaroszewski, L.,Rychlewski, L. & Godzik, A. (2000). Sensitive sequence comparison asprotein function predictor. Pac Symp Biocomput,8, 42-53.
19.Altschul, S. F., Gish, W., Miller, W.,Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J.Mol. Biol., 215,403-10.
20.Yona, G. & Levitt, M. (2002).Within the twilight zone: a sensitive profile-profile comparison tool based oninformation theory. J. Mol. Biol., 315, 1257-75.
21.Rychlewski, L., Jaroszewski, L., Li,W. & Godzik, A. (2000). Comparison of sequence profiles. Strategies forstructural predictions using sequence information. Prot. Sci., 9, 232-41.
22.Ginalski, K., Pas, J., Wyrwicz, L. S.,von Grotthuss, M., Bujnicki, J. M. et al. (2003). ORFeus: Detection of distanthomology using sequence profiles and predicted secondary structure. Nucl.Acids Res., 31,3804-7.
23.Sadreyev, R. & Grishin, N. (2003).COMPASS: a tool for comparison of multiple protein alignments with assessmentof statistical significance. J. Mol. Biol.,326, 317-36.
24.Edgar, R. C. & Sjolander, K.(2004). COACH: profile-profile alignment of protein families using hiddenMarkov models. Bioinformatics,.
25.Marti-Renom, M. A., Madhusudhan, M. S.& Sali, A. (2004). Alignment of protein sequences by their profiles. Prot.Sci., 13, 1071-1087.
26.Von Ohsen, N., Sommer, I., Zimmer, R.& Lengauer, T. (2004). Arby: automatic protein structure prediction usingprofile-profile alignment and confidence measures. Bioinformatics,.
27.Rost, B. (2001). Protein secondarystructure prediction continues to rise. J. Struct. Biol., 134, 204-218.
28.Jones, D. T. & Swindells, M. B.(2002). Getting the most from PSI-BLAST. TIBS,27, 161-164.
29.Rost, B. (1997). Protein structuressustain evolutionary drift. Folding & Design,2, S19-S24.
30.Yang, A. S. & Honig, B. (2000). Anintegrated approach to the analysis and modeling of protein sequences andstructures. II. On the relationship between sequence and structural similarityfor proteins that are not obviously related in sequence. J. Mol. Biol., 301, 679-689.
31.Rost, B. (1998). Marrying structureand genomics. Structure, 6, 259-263.
32.Friedberg, I., Kaplan, T. &Margalit, H. (2000). Glimmers in the midnight zone: characterization of alignedidentical residues in sequence-dissimilar proteins sharing a common fold. ProcInt Conf Intell Syst Mol Biol, 8, 162-70.
33.Bujnicki, J. M. (2003).Crystallographic and bioinformatic studies on restriction endonucleases:inference of evolutionary relationships in the "midnight zone" ofhomology. Curr Protein Pept Sci, 4, 327-37.
34.Wodak, S. J. & Rooman, M. J.(1993). Generating and testing protein folds. Curr. Opin. Str. Biol., 3, 247-259.
35.Sippl, M. J. (1995). Knowledge-basedpotentials for proteins. Curr. Opin. Str. Biol.,5, 229-235.
36.Jones, D. T. (1999). GenTHREADER: anefficient and reliable protein fold recognition method for genomic sequences. J.Mol. Biol., 287,797-815.
37.Jones, D. T. (2000). Protein structureprediction in the postgenomic era. Curr. Opin. Str. Biol., 10, 371-379.
38.Godzik, A. (2003). Fold recognitionmethods. Methods Biochem Anal, 44, 525-46.
39. Karplus, K., Karchin, R., Draper, J., Casper, J., Mandel-Gutfreund, Y. et al. (2003). Combining local-structure, fold-recognition, and new fold methods for protein structure prediction. Proteins, 53, 491-496.
40. Kinch, L. N., Wrabl, J. O., Krishna, S. S., Majumdar, I., Sadreyev, R. I. et al. (2003). CASP5 assessment of fold recognition target predictions. Proteins, 53, 395-409.
41. Petrey, D., Xiang, Z., Tang, C. L., Xie, L., Gimpelev, M. et al. (2003). Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins, 53, 430-435.
42. Tang, C. L., Xie, L., Koh, I. Y., Posy, S., Alexov, E. et al. (2003). On the role of structural information in remote homology detection and sequence alignment: new methods using hybrid sequence profiles. J. Mol. Biol., 334, 1043-62.
43. Bucher, P., Karplus, K., Moeri, N. & Hofmann, K. (1996). A flexible motif search technique based on generalized profiles. Comput. Chem., 20, 3-23.
44. Rost, B. (1995). TOPITS: threading one-dimensional predictions into three-dimensional structures. Proc Int Conf Intell Syst Mol Biol, 3, 314-21.
45. Rost, B., Schneider, R. & Sander, C. (1997). Protein fold recognition by prediction-based threading. J. Mol. Biol., 270, 471-480.
46. Russell, R. B., Saqi, M. A., Bates, P. A., Sayle, R. A. & Sternberg, M. J. (1998). Recognition of analogous and homologous protein folds--assessment of prediction success and associated alignment accuracy using empirical substitution matrices. Prot. Engin., 11, 1-9.
47. Fischer, D. & Eisenberg, D. (1996). Fold recognition using sequence-derived properties. Prot. Sci., 5, 947-955.
48. Russell, R. B., Copley, R. R. & Barton, G. J. (1996). Protein fold recognition by mapping predicted secondary structures. J. Mol. Biol., 259, 349-365.
49. Fischer, D. (2003). 3DS3 and 3DS5 3D-SHOTGUN meta-predictors in CAFASP3. Proteins, 53, 517-523.
50. Fischer, D., Rychlewski, L., Dunbrack, R. L., Jr., Ortiz, A. R. & Elofsson, A. (2003). CAFASP3: the third critical assessment of fully automated structure prediction methods. Proteins, 53, 503-516.
51. Ginalski, K. & Rychlewski, L. (2003). Protein structure prediction of CASP5 comparative modeling and fold recognition targets using consensus alignment approach and 3D assessment. Proteins, 53, 410-417.
52. von Grotthuss, M., Pas, J., Wyrwicz, L., Ginalski, K. & Rychlewski, L. (2003). Application of 3D-Jury, GRDB, and Verify3D in fold recognition. Proteins, 53, 418-423.
53. Wallner, B., Fang, H. & Elofsson, A. (2003). Automatic consensus-based fold recognition using Pcons, ProQ, and Pmodeller. Proteins, 53, 534-541.
54. Eyrich, V. A., Koh, I. Y. Y., Przybylski, D., Graña, O., Pazos, F. et al. (2003). CAFASP3 in the spotlight of EVA. Proteins, 53 Suppl 6, 548-560.
55. Hendrickson, W. A. (2000). Synchrotron crystallography. TIBS, 25, 637-643.
56. Montelione, G. T., Zheng, D., Huang, Y. J., Gunsalus, K. C. & Szyperski, T. (2000). Protein NMR spectroscopy in structural genomics. Nat. Struct. Biol., 7, 982-985.
57. Liu, J. & Rost, B. (2001). Comparing function and structure between entire proteomes. Prot. Sci., 10, 1970-1979.
58. Liu, J. & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics, 18, 922-933.
59. Westbrook, J., Feng, Z., Chen, L., Yang, H. & Berman, H. M. (2003). The Protein Data Bank and structural genomics. Nucl. Acids Res., 31, 489-491.
60. Liu, J., Hegyi, H., Acton, T. B., Montelione, G. T. & Rost, B. (2004). Automatic target selection for structural genomics on eukaryotes. Proteins, in press.
61. Liu, J. & Rost, B. (2004). CHOP proteins into structural domain-like fragments. Proteins, 55, 678-686.
62. Rost, B., Sander, C. & Schneider, R. (1994). Redefining the goals of protein secondary structure prediction. J. Mol. Biol., 235, 13-26.
63. Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol., 266, 525-539.
64. Koretke, K. K., Russell, R. B., Copley, R. R. & Lupas, A. N. (1999). Fold recognition using sequence and secondary structure information. Proteins, 37, 141-148.
65. Rice, D. W. & Eisenberg, D. (1997). A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J. Mol. Biol., 267, 1026-38.
66. Wallqvist, A., Fukunishi, Y., Murphy, L. R., Fadel, A. & Levy, R. M. (2000). Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases. Bioinformatics, 16, 988-1002.
67. Rost, B. (2003). Prediction in 1D: secondary structure, membrane helices, and accessibility. Methods Biochem Anal, 44, 559-587.
68. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536-40.
69. Ortiz, A. R., Strauss, C. E. & Olmea, O. (2002). MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Prot. Sci., 11, 2606-21.
70. Koretke, K. K., Russell, R. B. & Lupas, A. N. (2001). Fold recognition from sequence comparisons. Proteins, 45, 68-75.
71. Rost, B. (2004). How to use protein 1D structure predicted by PROFphd. Meth. Mol. Biol., submitted.
72. Shindyalov, I. N. & Bourne, P. E. (1998). Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Prot. Engin., 11, 739-47.
73. Zu-Kang, F. & Sippl, M. J. (1996). Optimum superimposition of protein structures: ambiguities and implications. Folding & Design, 1, 123-132.
74. Eyrich, V. A., Marti-Renom, M. A., Przybylski, D., Madhusudhan, M. S., Fiser, A. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242-3.
75. Koh, I. Y. Y., Eyrich, V. A., Marti-Renom, M. A., Przybylski, D., Madhusudhan, M. S. et al. (2003). EVA: evaluation of protein structure prediction servers. Nucl. Acids Res., 31, 3311-3315.
76. Przybylski, D. & Rost, B. (2002). Alignments grow, secondary structure prediction improves. Proteins, 46, 197-205.
77. McGuffin, L. J. & Jones, D. T. (2003). Benchmarking secondary structure prediction for fold recognition. Proteins, 52, 166-175.
78. Finkelstein, A. V., Gutin, A. M. & Badretdinov, A. Y. (1993). Why are the same protein folds used to perform different functions? FEBS Lett., 325, 23-28.
79. Finkelstein, A. V., Badretdinov, A. Y. & Gutin, A. M. (1995). Why do protein architectures have Boltzmann-like statistics? Proteins, 23, 142-150.
80. Rost, B., Yachdav, G. & Liu, J. (2004). The PredictProtein server. Nucl. Acids Res., in press.
81. Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-99.
82. Rost, B. & Sander, C. (1994). Conservation and prediction of solvent accessibility in protein families. Proteins, 20, 216-226.
83. Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol., 147, 195-7.
84. Li, W., Jaroszewski, L. & Godzik, A. (2001). Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282-3.
85. Wootton, J. C. & Federhen, S. (1996). Analysis of compositionally biased regions in sequence databases. Meth. Enzymol., 266, 554-571.
86. Gumbel, E. J. (1958). Statistics of Extremes. Columbia University Press, New York.
87. Dembo, A. & Karlin, S. (1991). Strong limit theorems of empirical distributions for large segmental exceedences of partial sums of Markov variables. Annals of Probability, 19, 1756-1767.
88. Karlin, S. & Altschul, S. F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A, 87, 2264-8.
89. Oosterhoff, J. (1969). Combination of One-Sided Statistical Tests.
90. Feller, W. (1957). An Introduction to Probability Theory and its Applications. Vol. 2, 2nd edn. John Wiley & Sons, New York.
91. Bailey, T. L. & Gribskov, M. (1998). Combining evidence using p-values: application to sequence homology searches. Bioinformatics, 14, 48-54.
92. Brenner, S. E., Koehl, P. & Levitt, M. (2000). The ASTRAL compendium for protein structure and sequence analysis. Nucl. Acids Res., 28, 254-6.
93. Andersen, C. A. F., Palmer, A. G., Brunak, S. & Rost, B. (2002). Continuum secondary structure captures protein flexibility. Structure, 10, 175-184.

Contact:    rost@columbia.edu Version:    May 21, 2004