Alignments grow, secondary structure prediction improves

Dariusz Przybylski & Burkhard Rost $

CUBIC, Columbia University

Columbia University, Department of Biochemistry and Molecular Biophysics, 650 West 168th Street BB217, New York, NY 10032, USA

$ Corresponding author: rost@columbia.edu, http://cubic.bioc.columbia.edu/


COPYRIGHT:
Title: Alignments grow, secondary structure prediction improves
Author: Dariusz Przybylski & Burkhard Rost
Quote: D Przybylski & B Rost (2002) Proteins, 46, 197-205
Copyright © 2002 Wiley[Imprint], Inc.



Abstract

Using information from sequence alignments significantly improves protein secondary structure prediction. Typically, more divergent profiles yield better predictions. Lately, various groups have shown that accuracy can be improved markedly by using PSI-BLAST profiles to develop new prediction methods. Here, we focused on the influences of various alignment strategies on two 8-year old PHD methods. The following results stood out. (1) PHD using pairwise alignments predicts about 72% of all residues correctly in one of the three states helix, strand, other. Using larger databases and PSI-BLAST raised accuracy to 75%. (2) More than 60% of the improvement originated from the growth of current sequence databases; about 20% resulted from detailed changes in the alignment procedure (substitution matrix, thresholds, gap penalties). Another 20% of the improvement resulted from carefully using iterated PSI-BLAST searches. (3) Interestingly, we failed to improve prediction accuracy further when attempting to refine the alignment by dynamic programming (MaxHom and ClustalW). (4) Improvement through family growth appears to saturate at some point. However, most families have not reached this saturation. Hence, we anticipate that prediction accuracy will continue to rise with database growth.

Availability: PHD and our protocol for automatic iterated PSI-BLAST searches are available through PredictProtein http://cubic.bioc.columbia.edu/predictprotein; the filtered PSI-BLAST database and additional scripts are available upon request from the authors.

Key words: protein structure prediction; solvent accessibility; evolutionary information; profiles-based multiple alignments; dynamic programming; neural networks; PSI-BLAST


Introduction

Evolutionary information improves structure prediction. Proteins with similar sequences adopt similar structures [8, 3]. In fact, proteins can exchange more than 70% of all their residues without altering the basic fold [9, 10, 11, 12]. However, the vast majority of possible sequences supposedly do not adopt globular structures at all. Rather, the exact pattern of which residues can be exchanged for which others is indicative of particular structural details. Consequently, the evolutionary information contained in sequence alignments can aid structure prediction. This has long been realised [13, 14, 15, 16, 17, 9]. The breakthrough in automatically using this information was achieved by applying neural networks to the problem of secondary structure prediction [18, 19]. Replacing single sequences by family profiles improved prediction accuracy by about five percentage points [19, 20]. The success in using evolutionary information for secondary structure prediction was not restricted to neural networks [21, 22, 23, 24, 25, 26, 27]. Furthermore, evolutionary information also proved beneficial for predicting other aspects of protein structure [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 5, 38, 39, 40, 41, 42].

More divergence yields better predictions. How much divergence in a family is needed to improve prediction accuracy? The more, the better! In the extreme: if we could use structural alignments to identify remote homologues and to build profiles, we would obtain even larger improvements [43]. The trouble with this promising concept is, of course, that we cannot structurally align proteins of unknown structure. However, the iterated, profile-based PSI-BLAST program [6] achieved the practical breakthrough of another old idea: use profiles to refine database searches. PSI-BLAST identifies more distant relations than pairwise alignment methods do [11]. This increased detection of very diverged family members has been used successfully to improve prediction accuracy by training neural networks on the PSI-BLAST profiles [44, 42]. The impressive improvement pioneered by David Jones [44] was based upon developing a new prediction method. Here, we tried to isolate the causes for the recent improvement. While Cuff & Barton investigated how a new method could benefit from particular alignment strategies [45, 42], we wanted to estimate how grown databases and better search techniques would improve existing methods. In particular, we chose the seven-year-old method PHDsec as an example for 'any method'. Incidentally, Barton & Cuff used PHDsec to underline their conclusion that using more informative alignments alone did not significantly improve prediction accuracy [42].


Methods

Alignment and prediction methods. We obtained the following alignment methods from the respective public web sites: PSI-BLAST [1, 6] : ftp://ncbi.nlm.nih.gov/blast/; ClustalW [46, 2] : ftp://ftp.ebi.ac.uk/pub/software/. The source code for MaxHom [3, 47] was obtained from Reinhard Schneider (LION Biosciences, Boston). All predictions were obtained from the publicly available PHD programs: PHDsec for secondary structure [19, 20, 5] , and PHDacc for solvent accessibility [32, 5] .

Pre-filtering the database for PSI-BLAST. David Jones reported the importance of pre-filtering the database used for running PSI-BLAST [44] . We followed that concept. First, we combined SWISS-PROT [7] , TrEMBL [7] , and PDB [4] into one big database (referred to as BIG). Second, we removed low-complexity regions using SEG [48, 49] . Third, we marked coiled-coil regions with COILS [50, 51] .

Building families with PSI-BLAST and filtering the output. First, we searched with PSI-BLAST against our pre-filtered database restricting the number of iterations to three (if not stated otherwise). Second, we froze the PSI-BLAST profile and searched without iteration against the unfiltered database. Third, we included all proteins into the family that were below a certain BLAST E-value. Prior to using the final alignment for prediction, we reduced the redundancy by omitting all proteins more than 80% identical to any protein previously taken. Finally, we re-aligned all proteins found using ClustalW and MaxHom.
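The redundancy-reduction step of this protocol can be illustrated with a short sketch. It is a minimal greedy filter, not the authors' actual script: the helper `percent_identity`, the function `reduce_redundancy`, and the choice to visit rows in input order (for example, sorted by E-value) are assumptions made only for illustration.

```python
from typing import Callable, List, Sequence


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two aligned rows.

    Hypothetical helper: columns in which both rows carry a gap ('-')
    are ignored; identity is taken over the remaining columns.
    """
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    same = sum(1 for x, y in pairs if x == y)
    return 100.0 * same / len(pairs)


def reduce_redundancy(
    rows: Sequence[str],
    max_identity: float = 80.0,
    identity: Callable[[str, str], float] = percent_identity,
) -> List[str]:
    """Greedy filter: keep a row unless it is more than `max_identity`
    percent identical to any row already kept (rows visited in input order)."""
    kept: List[str] = []
    for row in rows:
        if all(identity(row, k) <= max_identity for k in kept):
            kept.append(row)
    return kept
```

With the 80% cut-off, a hit that is 90% identical to an earlier family member is dropped, while a hit at exactly 80% identity is retained.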

Monitoring changes by simple per-residue scores. The most established measure for secondary structure prediction accuracy is the three-state per-residue accuracy, often referred to as Q3 [19, 52] . Q3 reports the percentage of residues correctly predicted in one of the three states helix (H), strand (E), or other (L):

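$$ Q_3 = \frac{100}{N_{prot}} \sum_{i=1}^{N_{prot}} \frac{\text{number of residues predicted correctly in protein } i}{\text{number of residues in protein } i} \qquad \text{(Eq. 1)} $$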

where Nprot was the number of proteins in the data set. We assigned secondary structure from PDB files by the default DSSP [53], with the following conversions of the 8 DSSP states: [HGI] -> H (helix), [EB] -> E (strand), [.ST] -> L (other, non-regular). More elaborate scores [19, 21, 54] yielded qualitatively similar results (data not shown). Solvent accessibility was also taken from DSSP, with the standard conversion of accessibility to relative values [32]. To monitor prediction accuracy, we used a two-state per-residue score giving the percentage of residues correctly predicted as either buried (< 16% accessible) or exposed (≥ 16% accessible). Many other measures for prediction accuracy are beneficial to evaluate methods [32, 52, 21, 55, 54]. However, here we were only concerned about separating various sources of improvement.
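The state conversion and the per-residue score can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and averaging the per-protein scores over the data set is an assumption based on the mention of Nprot in the text.

```python
from typing import Iterable, Tuple

HELIX = set("HGI")   # DSSP states mapped to H (helix)
STRAND = set("EB")   # DSSP states mapped to E (strand)


def dssp_to_three_state(dssp: str) -> str:
    """Map a DSSP 8-state string to H/E/L; anything that is not
    [HGI] or [EB] (S, T, blank, '.') becomes L (other)."""
    return "".join(
        "H" if s in HELIX else "E" if s in STRAND else "L" for s in dssp
    )


def q3_protein(observed: str, predicted: str) -> float:
    """Percentage of residues of one protein predicted in the correct state."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def q3_dataset(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average of the per-protein Q3 over (observed, predicted) pairs
    (assumed per-protein averaging, see lead-in)."""
    scores = [q3_protein(o, p) for o, p in pairs]
    if not scores:
        raise ValueError("data set is empty")
    return sum(scores) / len(scores)
```

Averaging per protein rather than pooling all residues gives every protein the same weight regardless of its length.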

Data sets for evaluation. Mostly, we provided relative values, i.e. 'improvement relative to using standard PHD alignments'. One reason for this was that relative values did not vary between different data sets chosen to estimate prediction accuracy. Nevertheless, we attempted to base all numbers on data sets 'as clean as possible'. In particular, none of the proteins used for evaluation had significant levels of sequence identity to any protein used for developing PHD. As a threshold for 'significant identity', we used a mark of '-5' percentage points below our previously established line implying structural similarity [56] (this roughly translates to 'less than 25% sequence identity over more than 100 residues'). This particular cut-off implies that pairwise sequence searches have more than 90% wrong hits. Thus, our data sets were fairly conservative in terms of 'distance to known structures'. We tested three different data sets. First, 199 proteins added to PDB between June and December 1999 (dubbed 'set_199'). Second, 264 proteins added from April 2000 to January 2001 and used by the EVA server (dubbed 'set_EVA264', [57]). Note that for this particular set, we also had 'blind' results from other methods available fulfilling the criterion that 'no protein was similar to any protein used to develop that method'. Third, 1136 proteins added to PDB between 1994 - when PHD was developed - and 1999 (dubbed 'set_1136'). All three sets were non-redundant in the sense that no pair in the respective set had significant sequence identity.

Significant differences. Plotting Q3 for many proteins yields a Gaussian distribution. Thus, whether or not a difference in prediction accuracy is significant depends on the standard deviation of that distribution and on the size of the data set. Here, we used the following rule-of-thumb to refer to a difference in accuracy (ΔQ) as 'significant':

$$ \Delta Q > \frac{\sigma}{\sqrt{N}} \qquad \text{(Eq. 2)} $$

where N was the number of proteins in the data set and σ the standard deviation of Q over that set. Typically, standard deviations for Q3 are in the order of 10. Note that the right-hand side of Eqn. 2 is typically known as the 'standard error'. Hence, differences of 2 percentage points for 10 proteins are not significant (2 < 10/√10 ≈ 3.2), while differences of one percentage point are when based on 225 proteins (1 > 10/15 ≈ 0.67). Thus, for our largest data set 'set_1136', differences above 0.3 were significant.
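The worked numbers in this paragraph can be reproduced with a few lines of code. The sketch reads the rule-of-thumb as 'a difference is significant if it exceeds the standard deviation divided by the square root of the number of proteins', which is what the examples (10 proteins, 225 proteins, set_1136) imply; the function names are hypothetical.

```python
import math


def significance_threshold(std_dev: float, n_proteins: int) -> float:
    """Standard error of the mean: standard deviation / sqrt(N)."""
    if n_proteins <= 0:
        raise ValueError("n_proteins must be positive")
    return std_dev / math.sqrt(n_proteins)


def is_significant(delta_q: float, std_dev: float, n_proteins: int) -> bool:
    """True if the accuracy difference exceeds the standard error."""
    return abs(delta_q) > significance_threshold(std_dev, n_proteins)


if __name__ == "__main__":
    # Examples from the text, with a standard deviation of 10.
    for n in (10, 225, 1136):
        print(n, round(significance_threshold(10.0, n), 2))
```

Because the threshold shrinks only with the square root of N, a data set about five times larger than 225 proteins lowers the threshold from roughly 0.67 to roughly 0.3.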


Results

Better secondary structure prediction by using BLAST. First, we searched with pairwise BLAST and the BLOSUM62 matrix against SWISS-PROT [7]. This improved accuracy over our previous strategy (MaxHom with McLachlan) significantly by about one percentage point (Q3, Eq. 1; Table 1). Encouraged by the immediate success, we tried to improve further by using a variety of other alignment methods. We failed (Table 1). Next, we applied the full Smith-Waterman alignment algorithm [58] implemented in MaxHom [3] to sequences identified by BLAST. The gain was insignificant (Table 1). Surprisingly, when we generated multiple alignments with ClustalW [2], prediction accuracy decreased significantly (simple ClustalW in Table 1). A similar tendency was previously reported by Cuff & Barton [42]. This may be due to the sensitivity of ClustalW to including proteins of unrelated structure in the alignment (false positives). In fact, many of the proteins found at high E-values were likely false positives [11, 12]. Such errors may affect the quality of the multiple alignment, especially when close to a root of the family-dendrogram used by ClustalW to align a family. The effect decreased when we built the multiple alignment gradually, starting from a query sequence and proceeding towards more distant homologues (profile ClustalW in Table 1). Finally, we tried to filter out possible false positives using our extension of the HSSP-curve [3, 56]. This step decreased prediction accuracy, albeit insignificantly (BLAST-filter in Table 1). Hence, it was beneficial to include more distant homologues in the alignment even if some of them were false positives. More precisely, the increase in divergence was more beneficial than the inclusion of false positives was detrimental.



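Table 1: Improvement by using different methods to realign pairwise BLAST hits a

| Method | SWISS-PROT b, E<1 | SWISS-PROT b, E<10^-3 | BIG c, E<1 | BIG c, E<10^-3 |
| --- | --- | --- | --- | --- |
| BLAST | 8.2 | 7.6 | 9.7 | 9.2 |
| simple ClustalW d | 4.4 | 5.4 | | |
| profile ClustalW e | 5.4 | 7.1 | | |
| MaxHom with McLachlan f | 7.2 | 7.5 | 9.0 | 8.9 |
| MaxHom with BLOSUM62 g | 8.3 | 7.9 | 9.5 | 9.1 |
| BLAST-filter h | 7.9 | 7.6 | 9.5 | 9.2 |
| profile-based BLAST i | 8.2 | 7.8 | 9.6 | 9.1 |
| significant difference | > 0.44 | > 0.44 | > 0.44 | > 0.44 |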

a Given are percentage points by which PHD improved over single-sequence based predictions by using the respective alignment methods (three-state per-residue accuracy, Q3). The baseline reflected the performance of PHD on single sequences (PHDsec = 66.3%, PHDacc = xx). In all cases, we used pairwise BLAST [1] searches to select the proteins to be aligned with a BLAST E-value threshold of 1 (left columns: more homologues, more false positives) and of 10^-3 (right columns: fewer homologues, fewer false positives). All results were based on a set of 199 proteins.
b Using only homologues found in SWISS-PROT [7] .
c Using homologues found in a 'non-redundant' database merging SWISS-PROT + TrEMBL + PDB [7, 4] .
d Dynamic programming with default parameters [2] (note: we could not test ClustalW on BIG since our CPU time was too limited).
e Dynamic programming with gradual profile alignment by default parameters, sequences are brought into alignment one by one [2] (note: we could not test ClustalW on BIG since our CPU time was too limited).
f Dynamic programming with McLachlan [63] matrix and gap penalties the same as those used in generating the HSSP database [3].
g Dynamic programming with BLOSUM62 [60] and gap penalties of 10+k, where k is the length of the gap.
h Homology reduction of BLAST alignments using a modified HSSP curve [56] .
i Sequences found by BLAST are realigned using a position specific scoring matrix produced by PSI-BLAST [6] on BIG (note: no iteration was used, here).



Significant improvement through larger database. The PHD methods were developed, analysed, and distributed based on alignments generated from the SWISS-PROT database (currently containing about 90,000 sequences). When we switched from SWISS-PROT to a large 'non-identical' database (BIG = SWISS-PROT + TrEMBL + PDB) of about 500,000 sequences, prediction accuracy increased significantly by an additional 1.5 percentage points (Table 1). On average, we aligned about 2.7 times more proteins to each query sequence when using BIG. We observed a strong dependence of the performance on the threshold chosen to include sequences into the family: alignments containing sequences with E-values ≤ 1 (corresponding to a P-value of about 0.63) yielded the highest prediction accuracy (Table 2). While adding sequences to the families improved accuracy for most proteins, occasionally accuracy dropped (Fig. 1 A). In fact, for some proteins predictions based on single sequences were more accurate than those based on alignments (negative values in Fig. 1). This effect persisted even when including only proteins with E-values < 10^-20 (Fig. 1 B), suggesting that the drop in accuracy was not only caused by false positives.
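The correspondence between an E-value of 1 and a P-value of about 0.63 follows from the usual Poisson relation between BLAST E-values and P-values, P = 1 - exp(-E). A minimal sketch (the function name is hypothetical):

```python
import math


def e_value_to_p_value(e_value: float) -> float:
    """Probability of at least one chance hit at this score: P = 1 - exp(-E).

    math.expm1 keeps precision for very small E-values, where P is
    practically equal to E.
    """
    if e_value < 0:
        raise ValueError("E-values cannot be negative")
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e in (1.0, 1e-3, 1e-20):
        print(e, e_value_to_p_value(e))
```

For the stricter thresholds used in Table 2 (10^-3 and below), E-value and P-value are numerically almost identical, so the distinction matters only for the permissive thresholds.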



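Table 2: Improvement by using different pairwise BLAST thresholds a

| E-value b | PHDsec c | PHDacc d |
| --- | --- | --- |
| 100 | 8.7 | 4.4 |
| 20 | 9.1 | 4.9 |
| 10 | 9.5 | 5.0 |
| 1 | 9.7 | 5.2 |
| 10^-1 | 9.5 | 5.3 |
| 10^-2 | 9.2 | 5.3 |
| 10^-3 | 9.1 | 5.2 |
| 10^-4 | 8.9 | 5.2 |
| 10^-7 | 8.5 | 5.0 |
| 10^-20 | 6.9 | 4.5 |
| significant difference | > 0.44 | > 0.39 |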

a Given are percentage points by which PHD improved over single-sequence based predictions by using the pairwise BLAST [1]  searches on BIG. Data set as in Table 1.
b Maximal E-value of sequences included in final alignment.
c Secondary structure prediction accuracy (PHDsec) improvements as given by an increase in the three-state per-residue accuracy Q3.
d Solvent accessibility prediction accuracy (PHDacc) improvements as given by an increase in the two-state per-residue accuracy Q2.






Fig. 1: Influence of database size on prediction accuracy. The improvement in the three-state per-residue accuracy is given as differences between predictions based on alignments vs. predictions based on single sequences. Thus, negative numbers imply that evolutionary information was detrimental. All alignments were generated by simple pairwise BLAST searches using E-value thresholds for including sequences of 1 (A) and 10^-20 (B). The diagonal lines mark proteins predicted equally well when searching through SWISS-PROT and BIG. Points well above the diagonal mark proteins for which the larger database was highly beneficial. The observation of points below the diagonal for the conservative cut-off threshold (B) suggested that the decrease from using the larger database was not caused by accumulating false positives.



PSI-BLAST improved secondary structure prediction accuracy slightly. PSI-BLAST finds more distantly related homologues than pairwise search methods [11] . These extended profiles have been reported to improve prediction accuracy significantly [45, 44] . In contrast, we noticed only a marginal improvement of about 0.4 percentage points through using PSI-BLAST (Table 3). In fact, this was below the mark of 0.44 for significant differences. However, we observed consistently positive effects by using PSI-BLAST on other data sets. In particular, for the set of new proteins used by EVA [59, 57] , PSI-BLAST improved by about 0.6 percentage points. Given a standard error on that set of around 0.6 (Eqn. 2) this was at the edge of being statistically significant, albeit rather small in comparison to the difference between using SWISS-PROT and BIG with a simple BLAST ( Table 1, Table 3 .8 percentage points). Similarly, we observed an increase of 0.4 percentage points for 'set_1136' for which differences above 0.3 were statistically significant.



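Table 3: Improvement by using PSI-BLAST a

| Iteration E-value b | PHDsec c | PHDacc d |
| --- | --- | --- |
| 10 | 7.3 | 3.0 |
| 1 | 9.3 | 4.2 |
| 10^-1 | 10.1 | 4.8 |
| 10^-2 | 10.1 | 5.0 |
| 10^-3 | 10.0 | 5.0 |
| 10^-4 | 10.1 | 5.1 |
| 10^-7 | 9.9 | 5.1 |
| 10^-20 | 9.6 | 5.2 |
| 10^-60 | 9.4 | 5.0 |
| significant difference | > 0.44 | > 0.3 |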

a Given are percentage points by which PHD improved when using iterated PSI-BLAST over single sequence-based predictions. The iteration parameter determines which proteins to include when building the profile used for the next iteration step. Data set as in Table 1.
b Maximum E-value of sequences used in refinement of position specific scoring matrix for PSI-BLAST (final alignment maximum E-value is set to 1).
c Secondary structure prediction accuracy improvements given by an increase in the three-state per-residue accuracy Q3.
d Solvent accessibility prediction accuracy improvements as given by an increase in the two-state per-residue accuracy Q2.

Accuracy was stable over a wide range of iteration parameters. Surprisingly, prediction accuracy on PSI-BLAST alignments was not very sensitive to the choice of the E-value limiting inclusion of sequences into the position-specific scoring matrix during iteration. Only rather extreme values were clearly worse (Table 3). In contrast, the number of iterations appeared more crucial: when iterating more than three times, accuracy dropped significantly when using a permissive h-parameter (proteins included when E-value < 10^-4) and decreased slightly when using a restrictive h-parameter (10^-10; Table 4). Finally, we investigated the influence of gap parameters and substitution matrices. In particular, we found that gap-open values between 10 and 12 did not change accuracy and that predictions were similar when replacing the default BLOSUM62 [60] matrix by BLOSUM80 or BLOSUM45 matrices (data not shown).

Pre-filtering database was not vital for PSI-BLAST. Surprisingly, we could not establish that filtering the database for PSI-BLAST - as proposed by David Jones [44] - was crucial for secondary structure prediction. Although the tendency was that filtering the database improved accuracy, the improvement was not significant (Table 4). Nevertheless, the numbers for the unfiltered database were consistently lower for all the E and h parameters (Table 4).

Table 4: Improvement by using PSI-BLAST with different numbers of iterations a

| Number of iterations b | h = 10^-4 c, filtered d | h = 10^-4 c, non-filtered d | h = 10^-10 c, filtered d | h = 10^-10 c, non-filtered d |
| --- | --- | --- | --- | --- |
| 1 | 9.5 | 9.7 | 9.5 | 9.7 |
| 2 | 9.9 | 10.0 | 9.8 | 10.0 |
| 3 | 10.1 | 9.8 | 10.1 | 10.0 |
| 4 | 9.6 | 9.3 | 10.1 | 9.8 |
| 6 | 9.3 | 8.8 | 9.9 | 9.7 |
| 10 | 8.1 | 7.4 | 9.7 | 9.5 |
| significant difference | > 0.44 | > 0.44 | > 0.44 | > 0.44 |

a Given are percentage points by which PHDsec improved when using iterated PSI-BLAST over single sequence-based predictions (three-state per-residue accuracy, Q3). For all runs, we included all proteins found below E-values of 1. Data set used as in Table 1.
b Number of PSI-BLAST iterations.
c PSI-BLAST iteration parameter (h-value) set to 10^-4 (10^-10), i.e. only sequences with E-values < 10^-4 (< 10^-10) were considered when compiling the profile.
d 'Filtered' refers to filtering the database used for the search (Methods), 'non-filtered' to not filtering the database.



PSI-BLAST vs. BLAST: combination could be best. We found on average 2.4 times more proteins by PSI-BLAST than by BLAST. This family growth was similar to that obtained by switching from SWISS-PROT to BIG; the improvement was not (Table 1, Table 3). Furthermore, for many proteins BLAST yielded better predictions than PSI-BLAST (Fig. 2). If we could decide from looking at the BLAST and PSI-BLAST alignments which one is 'better', accuracy would increase by an additional percentage point (data not shown). Possibly, we could approach this improvement by a more elaborate protocol for running PSI-BLAST.
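The 'if we could decide' estimate is an oracle upper bound: for each protein, the better of the two predictions is taken with hindsight. A minimal sketch of how such a bound could be computed from per-protein Q3 values is given below; the function name is hypothetical and the numbers in the usage example are synthetic placeholders, not results from this study.

```python
from typing import Sequence, Tuple


def oracle_upper_bound(
    q3_blast: Sequence[float], q3_psiblast: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (mean BLAST Q3, mean PSI-BLAST Q3, mean of per-protein best).

    The third value is what a perfect selector between the two
    alignments would reach; it can never be below the other two.
    """
    if not q3_blast or len(q3_blast) != len(q3_psiblast):
        raise ValueError("need two non-empty lists of equal length")
    n = len(q3_blast)
    best = [max(b, p) for b, p in zip(q3_blast, q3_psiblast)]
    return sum(q3_blast) / n, sum(q3_psiblast) / n, sum(best) / n


if __name__ == "__main__":
    # Synthetic per-protein scores, for illustration only.
    print(oracle_upper_bound([70.0, 80.0, 60.0], [75.0, 70.0, 65.0]))
```

The gap between the oracle value and the better of the two averages indicates how much a reliable per-protein selection criterion could contribute at most.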




Fig. 2: Iterated PSI-BLAST vs. pairwise BLAST. All searches against BIG with an E-value threshold of 10^-4 for the iteration and a threshold of 1 for including the proteins into the final family. Proteins for which PSI-BLAST yielded better predictions than BLAST fall above the diagonal. Obviously, predictions based on iterated PSI-BLAST searches were not consistently more accurate than those based on pairwise BLAST searches.



Prediction accuracy related to number of sequences in alignment. The 25% of proteins with highest prediction accuracy had on average twice as many proteins in their alignments as the lowest scoring 25%. For small families (< 10 proteins) prediction accuracy improved about five percentage points over single sequences, for large families (> 100 proteins) more than 11 percentage points. Qualitatively, this dependence was apparent when plotting the improvement versus the number of proteins aligned ('set_1136', Fig. 3 A). Noticeably, the most significant gain resulted from adding a few proteins to the alignment (steep slope for small numbers in Fig. 3 A). For large families the improvement appeared to saturate. However, the number of proteins was not a perfect indicator of prediction accuracy since the slopes differed between different methods and databases (Fig. 3 A). In particular, the data plotted in Fig. 3 A appeared to suggest that BLAST searches against SWISS-PROT yielded more improvement than BLAST or PSI-BLAST searches against BIG. However, the average was not compiled over the same families because BLAST found fewer homologues in SWISS-PROT than in BIG. We corrected for this by labelling the families according to the number of SWISS-PROT proteins in each alignment (Fig. 3 B). This revealed that (1) searches against BIG consistently yielded better performance than searches against SWISS-PROT, and that (2) PSI-BLAST performed - on average - better than BLAST. Despite the saturation for large families, we observed improvements from adding sequences even for family sizes between 200 and 500 (largest family found in SWISS-PROT).




Fig. 3: Prediction accuracy vs. family size. The improvement given as differences between predictions based on alignments and predictions based on single sequences. Curves represented an average weighted fit. Different methods produced somewhat different shapes for the fit (A). SWISS-PROT based alignments were on average most efficient. On the other hand, alignments produced on BIG by both BLAST and PSI-BLAST improved over SWISS-PROT for all sizes (B: families labelled by number of SWISS-PROT homologues found). Prediction accuracy improved most significantly for small families and saturated for very large families. The average variation of the prediction accuracy was considerable (standard deviation of about 8). Thus, the number of aligned sequences was not a very good indicator of prediction accuracy for any given sequence. Nevertheless, the average trend was obvious (C: predictions based on BLAST alignments of BIG; note the straight line indicates a logarithmic fit; error bars correspond to 95 % confidence intervals).
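A logarithmic dependence of the improvement on family size, as indicated by the straight line in panel C, can be fitted by ordinary least squares on log10 of the family size. The sketch below is an unweighted fit, which is a simplification; the function name and the model form 'gain = a + b * log10(size)' are assumptions made for illustration, and no data from this study are used.

```python
import math
from typing import Sequence, Tuple


def fit_log_dependence(
    family_sizes: Sequence[int], gains: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of gain = a + b * log10(family size).

    Returns (a, b): a is the fitted gain for a family of size 1,
    b is the gain added per ten-fold growth of the family.
    """
    if len(family_sizes) != len(gains) or len(gains) < 2:
        raise ValueError("need at least two (size, gain) pairs")
    xs = [math.log10(n) for n in family_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(gains) / len(gains)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("family sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, gains))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope
```

Under such a model each ten-fold growth of a family adds the same number of percentage points, which is one way to picture why the gain per added sequence shrinks for large families.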



Larger alignments improve accessibility prediction, marginally. Solvent accessibility predictions did not improve as much as secondary structure predictions by the various protocols we investigated. The two-state accuracy (Methods) increased by less than one percentage point when using a larger database. Interestingly, this improvement was more sensitive to the particular threshold used to include sequences ( Table 2 ). Hence, predictions of accessibility appeared more sensitive to false positives than did predictions of secondary structure. Not surprisingly then, using PSI-BLAST did not improve accessibility prediction either ( Table 3 ).


Discussion

PSI-BLAST better than BLAST? The average growth of the family size identified by PSI-BLAST was comparable to the growth achieved by searching through a larger database with a simple pairwise BLAST (factor 2.4 vs. 2.7). However, the resulting gain in performance was significantly smaller for PSI-BLAST. Proteins added to the families by PSI-BLAST were more distantly related to the query sequence than those found by BLAST. Thus, we anticipated a larger number of false positives for the PSI-BLAST families. When we based the prediction only on the remote homologues exclusively identified by PSI-BLAST, accuracy improved over single sequence-based predictions overall by about six percentage points (data not shown). This value was about three percentage points lower than the gain through using pairwise BLAST on BIG (Table 1). Partially, this might be explained by the different distributions of family sizes between the different methods and databases (Fig. 4). For example, about 40% of the PSI-BLAST searches on BIG added fewer than 10 proteins to the family identified by pairwise BLAST, while only 20% of the families had fewer than 10 proteins using pairwise BLAST searches. The most important increase in prediction accuracy resulted from the first ten proteins included in an alignment (Fig. 3 A). However, PSI-BLAST found only 4% more families with over 10 members than BLAST (Fig. 4). This suggested the following explanation for the relatively small improvement through PSI-BLAST. PSI-BLAST needs a reasonable number of first iteration hits (hits from pairwise BLAST) to find additional family members. However, when this number was sufficiently large to unravel the full 'power' of the PSI-BLAST search, we had already reached the saturation of family sizes that improved prediction accuracy. In contrast, when we relaxed the stringent criteria for iterating PSI-BLAST (more iterations, lower h-parameter), prediction accuracy dropped. This could be explained by the effects of 'drift' and 'pollution' associated with PSI-BLAST [61]: the final profile is not centred around the original search sequence (drift) for which secondary structure is predicted, and many profiles contain proteins of different structure (pollution) than the one predicted.




Fig. 4: Distributions of family size. Here, we focused on families with fewer than 100 homologues; for these prediction accuracy increased most markedly (Fig. 3). Understandably, PSI-BLAST did not identify many more homologues than pairwise BLAST for small families. Supposedly, this caused the observation that prediction accuracy differed only marginally between PSI-BLAST and BLAST searches. However, around family sizes of 50, PSI-BLAST added as many family members as BLAST identified. Furthermore, whereas about 40% of the PSI-BLAST families had more than 100 homologues, only 5% of the pairwise BLAST families found as many homologues when searching against SWISS-PROT.



Is waiting for databases to grow more successful than developing new methods? Overall, our combination of PSI-BLAST and larger databases improved our decade-old prediction method PHD by about three percentage points over the previous protocol of using dynamic programming (MaxHom on SWISS-PROT). Since we were not aware of making use of any 'particular feature' of PHD to reach this value, we expect that similar improvements could be obtained for any old prediction method. Surprisingly, most of this improvement resulted from the growth of the databases (BIG vs. SWISS-PROT) and not from better search methods (PSI-BLAST vs. BLAST). Another surprise to us was the success of the popular BLAST algorithms in comparison with more CPU intensive dynamic programming searches ( Table 1 ). Could we gain further by re-training on larger databases and on PSI-BLAST profiles? Contrary to earlier claims [42] , PHD did profit substantially from extended profiles and did even perform on par with JNet / JPred2 developed on large data sets of extended PSI-BLAST profiles [42, 59, 57] . In contrast, PSIPRED [44] and PROFsec (B Rost, unpublished) reached a sustained level about 1.5 percentage points above this value on a data set used by EVA (data not shown) [59, 57] . The better performance of PROFsec resulted entirely from improving the method. However, for the case of PSIPRED we suspect that part of the better performance of the public PSIPRED server [62] resulted from a better protocol in running PSI-BLAST. In fact, when we used our PSI-BLAST alignments for our local version of PSIPRED, accuracy dropped significantly with respect to the server results.

How fatal are errors in the database? We assume that SWISS-PROT contains fewer errors than TrEMBL. Could we see this effect in the accuracy of secondary structure prediction? Too many overlapping effects prevented a conclusive answer to this question. Fig. 3 A appeared to suggest that SWISS-PROT improved prediction accuracy more than TrEMBL. However, this observation might have been caused mainly by the overlying saturation effect. Hence, we shall have to wait for SWISS-PROT to double before we can answer more precisely.

Will accuracy rise with future database growth? A roughly six-fold growth of the database (SWISS-PROT vs. BIG) improved prediction accuracy by about 1.5 percentage points. However, the increase saturated for families with more than 100-200 proteins (Fig. 3). Hence, will future growth improve performance considerably? We anticipate an affirmative answer, since fewer than 30% of all families contain more than 100 proteins through pairwise alignments (Fig. 4). Our assumption here is that newly sequenced proteins will not differ considerably from the proteins we already know from projects sequencing entire organisms.


Conclusions

Recent improvements in secondary structure prediction seem to be due to various sources. Our results indicated the following over-simplified formula. More than half of the recent improvement of secondary structure prediction resulted from the growth of sequence databases (from 90K in SWISS-PROT to 500K in BIG). Less than one fifth of the improvement was achieved through better database search methods (PSI-BLAST over BLAST). The remaining one-third of the improvement was due to better methods. Hence, the most crucial tool to improve secondary structure prediction proved to be the new BLAST/PSI-BLAST searching tools. The contribution of PSI-BLAST was probably smaller than usually assumed due to the saturating nature of the dependence of prediction accuracy on alignment size (Fig. 3, Fig. 4). The majority of proteins have not yet reached this saturation point. Hence, we anticipate that prediction accuracy will rise continuously with every protein added to the databases.


Acknowledgements

Thanks to Jinfeng Liu (Columbia) for computer assistance, and to the BLAST team at NCBI for assistance. DP and BR gladly acknowledge support by the grant RO1-GM63029-01 from the National Institutes of Health. Last but not least, thanks to Amos Bairoch (Geneva), Rolf Apweiler (EBI, Hinxton), Phil Bourne (San Diego) and their teams for maintaining vital public databases and to all structural biologists who make the fruits of their efforts publicly available.


References