Dariusz Przybylski & Burkhard Rost $
CUBIC, Columbia University
Columbia University, Department of Biochemistry and Molecular Biophysics, 650 West 168th Street BB217, New York, NY 10032, USA
$ Corresponding author: rost@columbia.edu, http://cubic.bioc.columbia.edu/
| Title: | Alignments grow, secondary structure prediction improves |
| Author: | Dariusz Przybylski & Burkhard Rost |
| Quote: | D Przybylski & B Rost (2002) Proteins, 46, 197-205 |
| Copyright © 2002 Wiley[Imprint], Inc. |
Using information from sequence alignments significantly improves protein secondary structure prediction. Typically, more divergent profiles yield better predictions. Lately, various groups have shown that accuracy can be improved markedly by using PSI-BLAST profiles to develop new prediction methods. Here, we focused on the influences of various alignment strategies on two 8-year old PHD methods. The following results stood out. (1) PHD using pairwise alignments predicts about 72% of all residues correctly in one of the three states helix, strand, other. Using larger databases and PSI-BLAST raised accuracy to 75%. (2) More than 60% of the improvement originated from the growth of current sequence databases; about 20% resulted from detailed changes in the alignment procedure (substitution matrix, thresholds, gap penalties). Another 20% of the improvement resulted from carefully using iterated PSI-BLAST searches. (3) Interestingly, we failed to improve prediction accuracy further when attempting to refine the alignment by dynamic programming (MaxHom and ClustalW). (4) Improvement through family growth appears to saturate at some point. However, most families have not reached this saturation. Hence, we anticipate that prediction accuracy will continue to rise with database growth.
Availability: PHD and our protocol for automatic iterated
PSI-BLAST searches are available through PredictProtein http://cubic.bioc.columbia.edu/predictprotein;
the filtered PSI-BLAST database and additional scripts are available
upon request from the authors.
Key words: protein structure prediction; solvent accessibility;
evolutionary information; profiles-based multiple alignments;
dynamic programming; neural networks; PSI-BLAST
Evolutionary information improves structure prediction. Proteins with similar sequences adopt similar structures [8, 3]. In fact, proteins can exchange more than 70% of all their residues without altering the basic fold [9, 10, 11, 12]. However, the vast majority of possible sequences supposedly do not adopt globular structures at all. Rather, the exact substitution pattern, i.e. which residues can be exchanged against which others, is indicative of particular structural details. Consequently, the evolutionary information contained in sequence alignments can aid structure prediction. This has long been realised [13, 14, 15, 16, 17, 9]. The breakthrough in automatically using this information was achieved by applying neural networks to the problem of secondary structure prediction [18, 19]. Replacing single sequences by family profiles improved prediction accuracy by about five percentage points [19, 20]. The success in using evolutionary information for secondary structure prediction was not restricted to neural networks [21, 22, 23, 24, 25, 26, 27]. Furthermore, evolutionary information also proved beneficial for predicting other aspects of protein structure [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 5, 38, 39, 40, 41, 42].
More divergence yields better predictions. How much divergence
in a family is needed to improve prediction accuracy? The more,
the better! In the extreme: if we could use structural alignments
to identify remote homologues and to build profiles, we would
get even larger improvements [43]. The trouble with this promising
concept is, of course, that we cannot structurally align proteins of
unknown structure. However, the iterated, profile-based PSI-BLAST
program [6] achieved the practical breakthrough of another
old idea: use profiles to refine database searches. PSI-BLAST
identifies more distant relations than pairwise alignment methods
do [11]. This increased detection of very diverged family
members has been used successfully to improve prediction accuracy
by training neural networks on the PSI-BLAST profiles [44, 42].
The impressive improvement pioneered by David Jones [44] was based
upon developing a new prediction method. Here, we tried to isolate
the causes for the recent improvement. While Cuff & Barton
investigated how a new method could benefit from particular alignment
strategies [45, 42], we wanted to estimate how grown databases
and better search techniques would improve existing methods. In
particular, we chose the seven-year old method PHDsec as an example
for 'any method'. Incidentally, Barton & Cuff used PHDsec
to underline their conclusion that using more informative alignments
alone did not significantly improve prediction accuracy [42].
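To make the notion of a family profile concrete, the following minimal sketch computes per-column residue frequencies from a set of aligned sequences. It is an illustration only: the function name `column_profile`, the gap handling, and the toy sequences are assumptions, and the sketch does not reproduce the input encoding actually used by PHD or the scoring matrix built by PSI-BLAST.

```python
from collections import Counter


def column_profile(alignment):
    """Return per-column residue frequencies for equally long aligned sequences.

    Gap characters ('-' and '.') are ignored when counting (an assumption
    made for this sketch).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment if seq[pos] not in "-.")
        total = sum(counts.values())
        # An all-gap column yields an empty dictionary.
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Toy alignment (hypothetical sequences, not taken from any database).
toy_alignment = ["MKV-L", "MRVAL", "MKIAL"]
for position, frequencies in enumerate(column_profile(toy_alignment)):
    print(position, frequencies)
```

A single sequence corresponds to the special case in which every column has exactly one residue with frequency 1; each additional family member spreads the frequencies according to the observed substitutions.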
Alignment and prediction methods. We obtained the following alignment methods from the respective public web sites: PSI-BLAST [1, 6] : ftp://ncbi.nlm.nih.gov/blast/; ClustalW [46, 2] : ftp://ftp.ebi.ac.uk/pub/software/. The source code for MaxHom [3, 47] was obtained from Reinhard Schneider (LION Biosciences, Boston). All predictions were obtained from the publicly available PHD programs: PHDsec for secondary structure [19, 20, 5] , and PHDacc for solvent accessibility [32, 5] .
Pre-filtering the database for PSI-BLAST. David Jones reported the importance of pre-filtering the database used for running PSI-BLAST [44] . We followed that concept. First, we combined SWISS-PROT [7] , TrEMBL [7] , and PDB [4] into one big database (referred to as BIG). Second, we removed low-complexity regions using SEG [48, 49] . Third, we marked coiled-coil regions with COILS [50, 51] .
Building families with PSI-BLAST and filtering the output. First, we searched with PSI-BLAST against our pre-filtered database restricting the number of iterations to three (if not stated otherwise). Second, we froze the PSI-BLAST profile and searched without iteration against the unfiltered database. Third, we included all proteins into the family that were below a certain BLAST E-value. Prior to using the final alignment for prediction, we reduced the redundancy by omitting all proteins more than 80% identical to any protein previously taken. Finally, we re-aligned all proteins found using ClustalW and MaxHom.
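The redundancy reduction described above (omitting proteins more than 80% identical to any protein previously taken) can be sketched as a greedy filter. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the sequences are assumed to be already aligned to equal length, and percentage identity is computed over columns in which both sequences are ungapped, which may differ from the definition used in our scripts.

```python
def percent_identity(a, b):
    """Percentage of identical residues over columns where both sequences are ungapped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def reduce_redundancy(aligned, max_identity=80.0):
    """Greedily keep each sequence that is at most max_identity % identical
    to every sequence kept before it (input order decides who is kept)."""
    kept = []
    for seq in aligned:
        if all(percent_identity(seq, k) <= max_identity for k in kept):
            kept.append(seq)
    return kept


# Toy example (hypothetical sequences): the second one is 90% identical
# to the first and is dropped, the third one is 60% identical and stays.
family = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEYYYYKL"]
print(reduce_redundancy(family))
```

Because the filter is greedy, the result depends on the order of the input; placing the query sequence first guarantees that it is always retained.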
Monitoring changes by simple per-residue scores. The most established measure for secondary structure prediction accuracy is the three-state per-residue accuracy, often referred to as Q3 [19, 52] . Q3 reports the percentage of residues correctly predicted in one of the three states helix (H), strand (E), or other (L):
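$$Q_3 = \frac{100}{N_{prot}} \sum_{i=1}^{N_{prot}} \frac{\text{number of residues predicted correctly in protein } i}{\text{number of residues in protein } i}$$ (Eq. 1)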
where Nprot was the number of proteins in the data set. We assigned secondary structure from PDB files by the default DSSP [53], with the following conversions of the 8 DSSP states: [HGI] -> H (helix), [EB] -> E (strand), [.ST] -> L (other, non-regular). More elaborate scores [19, 21, 54] yielded qualitatively similar results (data not shown). Solvent accessibility was also taken from DSSP, with the standard conversion of accessibility to relative values [32]. To monitor prediction accuracy, we used a two-state per-residue score giving the percentage of residues correctly predicted as either buried (< 16% accessible) or exposed (≥ 16% accessible). Many other measures for prediction accuracy are beneficial to evaluate methods [32, 52, 21, 55, 54]. However, here we were only concerned about separating various sources of improvement.
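The three-state score and the DSSP conversion described above can be sketched as follows. This is a minimal sketch, not the evaluation code we used: the function names are hypothetical, and averaging the per-protein percentages over the data set is an assumption based on the role of Nprot in the definition of Q3.

```python
# Reduction of the 8 DSSP states to 3 states, as given in the text:
# [HGI] -> H (helix), [EB] -> E (strand), everything else -> L (other).
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def to_three_states(dssp):
    """Map a DSSP string to the three states H, E, L."""
    return "".join(DSSP_TO_3.get(state, "L") for state in dssp)


def q3_protein(predicted, observed_dssp):
    """Percentage of residues of one protein predicted in the correct state."""
    observed = to_three_states(observed_dssp)
    if not observed or len(predicted) != len(observed):
        raise ValueError("prediction and observation must be non-empty and equally long")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def q3_dataset(pairs):
    """Average of the per-protein Q3 over (predicted, observed_dssp) pairs."""
    scores = [q3_protein(predicted, observed) for predicted, observed in pairs]
    return sum(scores) / len(scores)


# Toy example (hypothetical strings): one perfect and one half-correct protein.
print(q3_dataset([("HHHLEE", "HGITEB"), ("HHLL", "HHHH")]))
```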
Data sets for evaluation. Mostly, we provided relative values, i.e. 'improvement relative to using standard PHD alignments'. One reason for this was that relative values did not vary between different data sets chosen to estimate prediction accuracy. Nevertheless, we attempted to base all numbers on data sets 'as clean as possible'. In particular, none of the proteins used for evaluation had significant levels of sequence identity to any protein used for developing PHD. As a threshold for 'significant identity', we used a mark of '-5' percentage points below our previously established line implying structural similarity [56] (this roughly translates to 'less than 25% sequence identity over more than 100 residues'). This particular cut-off implies that pairwise sequence searches have more than 90% wrong hits. Thus, our data sets were fairly conservative in terms of 'distance to known structures'. We tested three different data sets. First, 199 proteins added to PDB between June and December 1999 (dubbed 'set_199'). Second, 264 proteins added from April 2000 to January 2001 and used by the EVA server (dubbed 'set_EVA264', [57]). Note that for this particular set we also had 'blind' results from other methods available, fulfilling the criterion that 'no protein was similar to any protein used to develop that method'. Third, 1136 proteins added to PDB between 1994 - when PHD was developed - and 1999 (dubbed 'set_1136'). All three sets were non-redundant in the sense that no pair in the respective set had significant sequence identity.
Significant differences. Plotting Q3 for many proteins yields a Gaussian distribution. Thus, whether or not a difference in prediction accuracy is significant depends on the standard deviation of that distribution and on the size of the data set. Here, we used the following rule-of-thumb to refer to a difference in accuracy (ΔQ) as 'significant':
$$\Delta Q > \frac{s}{\sqrt{N}}$$ (Eq. 2)
where N was the number of proteins in the data set and s the standard deviation of Q over that set. Typically, standard deviations for Q3 are of the order of 10 percentage points. Note that the threshold in Eqn. 2 is commonly known as the 'standard error'. Hence, differences of 2 percentage points for 10 proteins are not significant (2 < 10/√10 ≈ 3.2), while differences of one percentage point are significant when based on 225 proteins (1 > 10/√225 ≈ 0.67). Thus, for our largest data set 'set_1136', differences above 0.3 were significant.
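The rule of thumb above amounts to comparing a difference in accuracy with the standard error of the mean. A minimal sketch follows; the function names are hypothetical and the standard deviation of 10 is the typical value quoted in the text, not a value measured on a particular data set.

```python
import math


def significance_threshold(std_dev, n_proteins):
    """Standard error of the mean: standard deviation divided by sqrt(N)."""
    return std_dev / math.sqrt(n_proteins)


def is_significant(delta_q, std_dev, n_proteins):
    """Rule of thumb: a difference counts as significant if it exceeds the standard error."""
    return abs(delta_q) > significance_threshold(std_dev, n_proteins)


# Worked examples from the text, assuming a standard deviation of 10.
print(is_significant(2.0, 10.0, 10))     # 2 points on 10 proteins
print(is_significant(1.0, 10.0, 225))    # 1 point on 225 proteins
print(round(significance_threshold(10.0, 1136), 2))  # threshold for set_1136
```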
Better secondary structure prediction by using BLAST.
First, we searched with pairwise BLAST and the Blosum62 matrix
against SWISS-PROT [7] . This improved accuracy over our previous
strategy (MaxHom with McLachlan) significantly by about one percentage
point (Q3, Eqn. 1; Table 1). Encouraged by the immediate
success, we tried to further improve by using a variety of other
alignment methods. We failed ( Table 1 ). Next, we applied the full
Smith-Waterman alignment algorithm [58] implemented in MaxHom
[3] to sequences identified by BLAST. The gain was insignificant
( Table 1 ). Surprisingly, when we generated multiple alignments
with ClustalW [2] , prediction accuracy decreased significantly
(simple ClustalW in Table 1 ). A similar tendency was previously
reported by Cuff & Barton [42] . This may be due to the
sensitivity of ClustalW to including proteins of unrelated structure
in the alignment (false positives). In fact, many of the proteins
found at high E-values were likely false positives [11, 12] .
Such errors may affect the quality of the multiple alignment,
especially when close to a root of the family-dendrogram used
by ClustalW to align a family. The effect decreased when we built
the multiple alignment gradually, starting from a query sequence
and proceeding towards more distant homologues (profile ClustalW
in Table 1 ). Finally, we tried to filter out possible false positives
using our extension of the HSSP-curve [3, 56] . This step decreased
prediction accuracy albeit insignificantly (BLAST-filter in Table 1). Hence, it was beneficial to include more distant homologues in the alignment even if some of them were false positives. More precisely, the increase in divergence was more beneficial than the inclusion of false positives was detrimental.
| Method | SWISS-PROT b, E<1 | SWISS-PROT b, E<10^-3 | BIG c, E<1 | BIG c, E<10^-3 |
| BLAST | 8.2 | 7.6 | 9.7 | 9.2 |
| simple ClustalW d | 4.4 | 5.4 | | |
| profile ClustalW e | 5.4 | 7.1 | | |
| MaxHom with McLachlan f | 7.2 | 7.5 | 9.0 | 8.9 |
| MaxHom with BLOSUM62 g | 8.3 | 7.9 | 9.5 | 9.1 |
| BLAST-filter h | 7.9 | 7.6 | 9.5 | 9.2 |
| profile-based BLAST i | 8.2 | 7.8 | 9.6 | 9.1 |
| significant difference | > 0.44 | > 0.44 | > 0.44 | > 0.44 |
a Given are percentage points by which PHD
improved over single-sequence based predictions by using the respective
alignment methods (three-state per-residue accuracy Q3).
The baseline reflected the performance of PHD on single sequences
(PHDsec = 66.3%, PHDacc = xx). In all cases, we used pairwise
BLAST [1] searches to select the proteins to be aligned
with a BLAST E-value threshold of 1 (left columns: more homologues,
more false positives) and of 10^-3 (right columns: fewer
homologues, fewer false positives). All results are based on a set
of 199 proteins.
b Using only homologues found in SWISS-PROT [7] .
c Using homologues found in a 'non-redundant' database merging SWISS-PROT + TrEMBL + PDB [7, 4] .
d Dynamic programming with default parameters
[2] (note: we could not test ClustalW on BIG since our CPU
time was too limited).
e Dynamic programming with gradual profile
alignment by default parameters, sequences are brought into alignment
one by one [2] (note: we could not test ClustalW on BIG since
our CPU time was too limited).
f Dynamic programming with McLachlan [63]
matrix and gap penalties same as those used in generation of HSSP
database [3] .
g Dynamic programming with BLOSUM62 [60]
and gap penalties of 10+k, where k is the length of the gap.
h Homology reduction of BLAST alignments using
a modified HSSP curve [56] .
i Sequences found by BLAST are realigned using
a position specific scoring matrix produced by PSI-BLAST [6]
on BIG (note: no iteration was used, here).
Significant improvement through larger database. The PHD
methods were developed, analysed, and distributed based on alignments
generated from the SWISS-PROT database (currently containing about
90,000 sequences). When we switched from SWISS-PROT to a large
'non-identical' database (BIG = SWISS-PROT + TrEMBL + PDB) of
about 500,000 sequences, prediction accuracy increased significantly by
an additional 1.5 percentage points (Table 1). On average, we
aligned about 2.7 times more proteins to each query sequence when
using BIG. We observed a strong dependence of the performance
on the threshold chosen to include sequences into the family:
alignments containing sequences with E-values ≤ 1 (corresponding
to a P-value of about 0.63) yielded the highest prediction accuracy
(Table 2). While adding sequences to the families improved accuracy
for most proteins, occasionally accuracy dropped (Fig. 1 A). In
fact, for some proteins predictions based on single sequences
were more accurate than those based on alignments (negative values
in Fig. 1). This effect persisted even when including only proteins
with E-values < 10^-20 (Fig. 1 B), suggesting that
the drop in accuracy was not only caused by false positives.
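The correspondence between an E-value of 1 and a P-value of about 0.63 follows from the usual Poisson assumption for the number of chance hits, P = 1 - exp(-E). A minimal sketch, with a hypothetical function name:

```python
import math


def evalue_to_pvalue(evalue):
    """Probability of at least one chance hit, assuming Poisson-distributed
    chance hits with mean `evalue`: P = 1 - exp(-E)."""
    # expm1 keeps precision for very small E-values, where P is close to E.
    return -math.expm1(-evalue)


for e in (10.0, 1.0, 1e-3, 1e-20):
    print(e, evalue_to_pvalue(e))
```

For small E-values the two quantities are practically identical, so the distinction only matters for the permissive thresholds (E-values around 1 and above) explored here.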
| E-value b | PHDsec c | PHDacc d |
| 100 | 8.7 | 4.4 |
| 20 | 9.1 | 4.9 |
| 10 | 9.5 | 5.0 |
| 1 | 9.7 | 5.2 |
| 10^-1 | 9.5 | 5.3 |
| 10^-2 | 9.2 | 5.3 |
| 10^-3 | 9.1 | 5.2 |
| 10^-4 | 8.9 | 5.2 |
| 10^-7 | 8.5 | 5.0 |
| 10^-20 | 6.9 | 4.5 |
| significant difference | >0.44 | >0.39 |
a Given are percentage points by which PHD
improved over single-sequence based predictions by using the pairwise
BLAST [1] searches on BIG. Data set as in Table 1.
b Maximal E-value of sequences included in
final alignment;
c Secondary structure prediction accuracy (PHDsec)
improvements as given by an increase in the three-state per-residue
accuracy Q3.
d Solvent accessibility prediction accuracy
(PHDacc) improvements as given by an increase in the two-state
per-residue accuracy Q2.
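The two-state accessibility score Q2 can be sketched in the same way as Q3. In this minimal sketch the function names are hypothetical, and both prediction and observation are assumed to be given as relative accessibilities in percent; the conversion from absolute DSSP accessibility to relative values [32] is not reproduced here.

```python
def accessibility_state(relative_accessibility, threshold=16.0):
    """'b' (buried) below the threshold, 'e' (exposed) at or above it."""
    return "b" if relative_accessibility < threshold else "e"


def q2_protein(predicted, observed, threshold=16.0):
    """Percentage of residues assigned to the correct one of two states.

    Both arguments are sequences of relative accessibilities in percent.
    """
    if not observed or len(predicted) != len(observed):
        raise ValueError("prediction and observation must be non-empty and equally long")
    correct = sum(
        accessibility_state(p, threshold) == accessibility_state(o, threshold)
        for p, o in zip(predicted, observed)
    )
    return 100.0 * correct / len(observed)


# Toy example (hypothetical values): two of four residues fall into the same state.
print(q2_protein([5, 20, 40, 10], [10, 15, 50, 16]))
```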
Fig. 1: Influence of database size on prediction accuracy. The improvement in the three-state per-residue accuracy is given as the difference between predictions based on alignments and predictions based on single sequences. Thus, negative numbers imply that evolutionary information was detrimental. All alignments were generated by simple pairwise BLAST searches using E-value thresholds for including sequences of 1 (A) and 10^-20 (B). The diagonal lines mark proteins predicted equally well when searching through SWISS-PROT and BIG. Points well above the diagonal mark proteins for which the larger database was highly beneficial. The observation of points below the diagonal for the conservative cut-off threshold (B) suggested that the decrease from using the larger database was not caused by accumulating false positives.
PSI-BLAST improved secondary structure prediction accuracy slightly. PSI-BLAST finds more distantly related homologues than pairwise search methods [11] . These extended profiles have been reported to improve prediction accuracy significantly [45, 44] . In contrast, we noticed only a marginal improvement of about 0.4 percentage points through using PSI-BLAST (Table 3). In fact, this was below the mark of 0.44 for significant differences. However, we observed consistently positive effects by using PSI-BLAST on other data sets. In particular, for the set of new proteins used by EVA [59, 57] , PSI-BLAST improved by about 0.6 percentage points. Given a standard error on that set of around 0.6 (Eqn. 2) this was at the edge of being statistically significant, albeit rather small in comparison to the difference between using SWISS-PROT and BIG with a simple BLAST ( Table 1, Table 3 .8 percentage points). Similarly, we observed an increase of 0.4 percentage points for 'set_1136' for which differences above 0.3 were statistically significant.
| Iteration E-value b | PHDsec c | PHDacc d |
| 10 | 7.3 | 3.0 |
| 1 | 9.3 | 4.2 |
| 10^-1 | 10.1 | 4.8 |
| 10^-2 | 10.1 | 5.0 |
| 10^-3 | 10.0 | 5.0 |
| 10^-4 | 10.1 | 5.1 |
| 10^-7 | 9.9 | 5.1 |
| 10^-20 | 9.6 | 5.2 |
| 10^-60 | 9.4 | 5.0 |
| significant difference | >0.44 | >0.3 |
a Given are percentage points by which PHD
improved when using iterated PSI-BLAST over single sequence-based
predictions. The iteration parameter determines which proteins
to include when building the profile used for the next iteration
step. Data set as in Table 1.
b Maximum E-value of sequences used in refinement
of position specific scoring matrix for PSI-BLAST (final alignment
maximum E-value is set to 1).
c Secondary structure prediction accuracy improvements
given by an increase in the three-state per-residue accuracy Q3.
d Solvent accessibility prediction accuracy improvements
as given by an increase in the two-state per-residue accuracy
Q2.
Table 4: Improvement by using PSI-BLAST with different numbers
of iterations a
Accuracy was stable over a wide range of iteration parameters.
Surprisingly, prediction accuracy on PSI-BLAST alignments was
not very sensitive to the choice of the E-value limiting inclusion
of sequences into the position-specific scoring matrix during
iteration. Only rather extreme values were clearly worse (Table
3). In contrast, the number of iterations appeared more crucial:
When iterating more than three times, accuracy dropped significantly
when using a permissive h-parameter (the E-value threshold for including
proteins in the profile; here E-value < 10^-4) and decreased slightly
when using a restrictive h-parameter (10^-10; Table 4). Finally, we investigated
the influence of gap parameters and substitution matrices. In
particular, we found that gap open values of 10-12 did not change
accuracy and that predictions were similar when replacing the
default BLOSUM62 [60] matrix by BLOSUM80 or BLOSUM45 matrices
(data not shown).
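The two thresholds discussed above play different roles: the h-parameter decides which hits refine the profile for the next iteration, whereas the final E-value threshold decides which hits enter the family used for prediction. A minimal sketch of this distinction, with a hypothetical `Hit` type and hypothetical hit names; it operates on a list of already computed hits and does not call PSI-BLAST itself:

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    name: str
    evalue: float


def split_hits(
    hits: List[Hit], h_value: float = 1e-4, final_evalue: float = 1.0
) -> Tuple[List[Hit], List[Hit]]:
    """Return (profile_members, family).

    profile_members: hits below the iteration threshold (h-parameter), used
        to build the profile for the next iteration.
    family: hits below the final E-value threshold, used for prediction.
    """
    profile_members = [hit for hit in hits if hit.evalue < h_value]
    family = [hit for hit in hits if hit.evalue < final_evalue]
    return profile_members, family


# Toy hit list (hypothetical names and E-values).
hits = [Hit("p1", 1e-30), Hit("p2", 1e-3), Hit("p3", 0.5), Hit("p4", 5.0)]
profile_members, family = split_hits(hits)
print([hit.name for hit in profile_members])
print([hit.name for hit in family])
```

With the defaults shown, only the most significant hit shapes the profile, while three of the four hits contribute to the final alignment.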
Pre-filtering database was not vital for PSI-BLAST. Surprisingly, we could not establish that filtering the database for PSI-BLAST - as proposed by David Jones [44] - was crucial for secondary structure prediction. Although filtering the database tended to improve accuracy, the improvement was not significant (Table 4). Nevertheless, the numbers for the unfiltered database were consistently lower for all the E and h parameters (Table 4).
| Number of iterations b | h=10^-4 c, filtered d | h=10^-4 c, non-filtered d | h=10^-10 c, filtered d | h=10^-10 c, non-filtered d |
| 1 | 9.5 | 9.7 | 9.5 | 9.7 |
| 2 | 9.9 | 10.0 | 9.8 | 10.0 |
| 3 | 10.1 | 9.8 | 10.1 | 10.0 |
| 4 | 9.6 | 9.3 | 10.1 | 9.8 |
| 6 | 9.3 | 8.8 | 9.9 | 9.7 |
| 10 | 8.1 | 7.4 | 9.7 | 9.5 |
| significant difference | >0.44 | >0.44 | >0.44 | >0.44 |
a Given are percentage points by which PHDsec
improved when using iterated PSI-BLAST over single sequence-based
predictions (three-state per-residue accuracy, Q3).
For all runs, we included all proteins found below E-values of
1. Data set used as in Table 1.
b Number of PSI-BLAST iterations.
c PSI-BLAST iteration parameter (h-value) set to 10^-4
(10^-10), i.e. only sequences with E-values < 10^-4
(< 10^-10) were considered when compiling the profile.
d 'Filtered' refers to filtering the database used
for the search (Methods), 'non-filtered' to not filtering the
database.
PSI-BLAST vs. BLAST: combination could be best. We found on average 2.4 times more proteins by PSI-BLAST than by BLAST. This family growth was similar to that obtained by switching from SWISS-PROT to BIG; the improvement was not (Table 1, Table 3). Furthermore, for many proteins BLAST yielded better predictions than PSI-BLAST (Fig. 2). If we could decide, from looking at the BLAST and PSI-BLAST alignments, which one is 'better', accuracy would increase by an additional percentage point (data not shown). Possibly, we could approach this improvement by a more elaborate protocol for running PSI-BLAST.
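The potential gain from choosing the better of the two alignments for each protein can be estimated as an upper bound with a simple 'oracle' that always picks the higher per-protein Q3. A minimal sketch with hypothetical function names and toy numbers; the values below are illustrative only and are not our results.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def oracle_gain(q3_blast, q3_psiblast):
    """Gain of a perfect per-protein choice over the better single protocol.

    Both arguments list per-protein Q3 values for the same proteins in the
    same order.
    """
    if len(q3_blast) != len(q3_psiblast) or not q3_blast:
        raise ValueError("need two non-empty lists of equal length")
    best_per_protein = [max(a, b) for a, b in zip(q3_blast, q3_psiblast)]
    return mean(best_per_protein) - max(mean(q3_blast), mean(q3_psiblast))


# Toy per-protein accuracies (hypothetical values).
print(oracle_gain([70.0, 80.0], [75.0, 72.0]))
```

Any realistic selection rule has to decide without knowing the true accuracy, so it can only approach this bound.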
Fig. 2: Iterated PSI-BLAST vs. pairwise BLAST. All searches were run against BIG with an E-value threshold of 10^-4 for the iteration and a threshold of 1 for including the proteins into the final family. Proteins for which PSI-BLAST yielded better predictions than BLAST fall above the diagonal. Obviously, predictions based on iterated PSI-BLAST searches were not consistently more accurate than those based on pairwise BLAST searches.
Prediction accuracy related to number of sequences in alignment.
The 25% of proteins with highest prediction accuracy had
on average twice as many proteins in their alignments as the lowest
scoring 25%. For small families (< 10 proteins) prediction
accuracy improved about five percentage points over single sequences,
for large families (> 100 proteins) more than 11 percentage
points. Qualitatively, this dependence was apparent when plotting
the improvement versus the number of proteins aligned ('set_1136',
Fig. 3 A). Noticeably, the most significant gain resulted from
adding a few proteins to the alignment (steep slope for small
numbers in Fig. 3 A). For large families the improvement appeared
to saturate. However, the number of proteins was not a perfect
indicator of prediction accuracy since the slopes differed between
different methods and databases ( Fig. 3 A). In particular, the
data plotted in Fig. 3 A appeared to suggest that BLAST searches
against SWISS-PROT yielded more improvement than BLAST or PSI-BLAST
searches against BIG. However, the average was not compiled over
the same families because BLAST found fewer homologues in SWISS-PROT
than in BIG. We corrected for this by labelling the families according
to the number of SWISS-PROT proteins in each alignment ( Fig. 3 B).
This revealed that (1) searches against BIG consistently yielded
better performances than searches against SWISS-PROT, and that
(2) PSI-BLAST performed - on average - better than BLAST. Despite
the saturation for large families, we observed improvements from
adding sequences even for family sizes between 200 and 500 (largest
family found in SWISS-PROT).
Fig. 3: Prediction accuracy vs. family size. The improvement given as differences between predictions based on alignments and predictions based on single sequences. Curves represented an average weighted fit. Different methods produced somewhat different shapes for the fit (A). SWISS-PROT based alignments were on average most efficient. On the other hand, alignments produced on BIG by both BLAST and PSI-BLAST improved over SWISS-PROT for all sizes (B: families labelled by number of SWISS-PROT homologues found). Prediction accuracy improved most significantly for small families and saturated for very large families. The average variation of the prediction accuracy was considerable (standard deviation of about 8). Thus, the number of aligned sequences was not a very good indicator of prediction accuracy for any given sequence. Nevertheless, the average trend was obvious (C: predictions based on BLAST alignments of BIG; note the straight line indicates a logarithmic fit; error bars correspond to 95 % confidence intervals).
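The logarithmic trend mentioned for panel C corresponds to a straight line when the improvement is plotted against the logarithm of the family size. The following minimal sketch fits such a line by ordinary least squares. It is an illustration under stated assumptions: the fit is unweighted (unlike the weighted fit in the figure), base-10 logarithms are used, and the data points are hypothetical.

```python
import math


def fit_log(sizes, improvements):
    """Least-squares fit of improvement = a + b * log10(family size).

    Returns the pair (a, b). Family sizes must be positive.
    """
    xs = [math.log10(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(improvements) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, improvements))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data (hypothetical): family sizes and improvements in percentage points.
a, b = fit_log([1, 10, 100], [2.0, 5.0, 8.0])
print(a, b)
```

In such a fit, the slope b gives the improvement expected from each ten-fold increase in family size, which is one way to express why the first few added proteins matter most.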
Larger alignments improve accessibility prediction, marginally.
Solvent accessibility predictions did not improve as much as secondary
structure predictions by the various protocols we investigated.
The two-state accuracy (Methods) increased by less than one percentage
point when using a larger database. Interestingly, this improvement
was more sensitive to the particular threshold used to include
sequences ( Table 2 ). Hence, predictions of accessibility appeared
more sensitive to false positives than did predictions of secondary
structure. Not surprisingly then, using PSI-BLAST did not improve
accessibility prediction either ( Table 3 ).
PSI-BLAST better than BLAST? The average growth of the
family size identified by PSI-BLAST was comparable to the growth
achieved by searching through a larger database with a simple
pairwise BLAST (factor 2.4 vs. 2.7). However, the resulting gain
in performance was significantly smaller for PSI-BLAST. Proteins
added to the families by PSI-BLAST were more distantly related
to the query sequence than those found by BLAST. Thus, we anticipated
a larger number of false positives for the PSI-BLAST families.
When we based the prediction only on the remote homologues exclusively
identified by PSI-BLAST, accuracy improved over single sequence-based
predictions overall by about six percentage points (data not shown).
This value was about three percentage points lower than the gain
through using pairwise BLAST on BIG (Table 1). Partially, this
might be explained by the different distributions of family sizes
between the different methods and databases (Fig. 4). For example,
about 40% of the PSI-BLAST searches on BIG added fewer than 10
proteins to the family identified by pairwise BLAST while only
20% of the families had fewer than 10 proteins using pairwise BLAST
searches. The most important increase in prediction accuracy resulted
from the first ten proteins included in an alignment ( Fig. 3 A).
However, PSI-BLAST found only 4% more families with over 10 members
than BLAST ( Fig. 4 ). This suggested the following explanation
for the relatively small improvement through PSI-BLAST. PSI-BLAST
needs a reasonable number of first iteration hits (hits from pairwise
BLAST) to find additional family members. However, when this number
was sufficiently large to unleash the full 'power' of the PSI-BLAST
search, we had already reached the saturation of family sizes that improved
prediction accuracy. In contrast, when we relaxed the stringent
criteria for iterating PSI-BLAST (more iterations, lower h-parameter),
prediction accuracy dropped. This could be explained by the effects
of 'drift' and 'pollution' associated with PSI-BLAST [61] :
the final profile is not centred around the original search sequence
(drift) for which secondary structure is predicted, and many profiles
contain proteins of different structure (pollution) than the one
predicted.
Fig. 4: Distributions of family size. Here, we focused on families with fewer than 100 homologues; for these prediction accuracy increased most markedly (Fig. 3). Understandably, PSI-BLAST did not identify many more homologues than pairwise BLAST for small families. Supposedly, this caused the observation that prediction accuracy differed only marginally between PSI-BLAST and BLAST searches. However, around family sizes of 50, PSI-BLAST added as many family members as BLAST identified. Furthermore, whereas about 40% of the PSI-BLAST families had more than 100 homologues, only 5% of the pairwise BLAST families found as many homologues when searching against SWISS-PROT.
Is waiting for databases to grow more successful than developing new methods? Overall, our combination of PSI-BLAST and larger databases improved our decade-old prediction method PHD by about three percentage points over the previous protocol of using dynamic programming (MaxHom on SWISS-PROT). Since we were not aware of making use of any 'particular feature' of PHD to reach this value, we expect that similar improvements could be obtained for any old prediction method. Surprisingly, most of this improvement resulted from the growth of the databases (BIG vs. SWISS-PROT) and not from better search methods (PSI-BLAST vs. BLAST). Another surprise to us was the success of the popular BLAST algorithms in comparison with more CPU intensive dynamic programming searches ( Table 1 ). Could we gain further by re-training on larger databases and on PSI-BLAST profiles? Contrary to earlier claims [42] , PHD did profit substantially from extended profiles and did even perform on par with JNet / JPred2 developed on large data sets of extended PSI-BLAST profiles [42, 59, 57] . In contrast, PSIPRED [44] and PROFsec (B Rost, unpublished) reached a sustained level about 1.5 percentage points above this value on a data set used by EVA (data not shown) [59, 57] . The better performance of PROFsec resulted entirely from improving the method. However, for the case of PSIPRED we suspect that part of the better performance of the public PSIPRED server [62] resulted from a better protocol in running PSI-BLAST. In fact, when we used our PSI-BLAST alignments for our local version of PSIPRED, accuracy dropped significantly with respect to the server results.
How fatal are errors in the database? We assume that SWISS-PROT contains fewer errors than TrEMBL. Could we see this effect in the accuracy of secondary structure prediction? Too many overlapping effects prevented a conclusive answer to this question. Fig. 3 A appeared to suggest that SWISS-PROT improved prediction accuracy more than TrEMBL. However, this observation might have been caused mainly by the overlying saturation effect. Hence, we shall have to wait for SWISS-PROT to double before we can answer more precisely.
Will accuracy rise with future database growth? A roughly
six-fold growth of the database (SWISS-PROT vs. BIG) improved
prediction accuracy by about 1.5 percentage points. However, the
increase saturated for families with more than 100-200 proteins
( Fig. 3 ). Hence, will future growth improve performance considerably?
We anticipate an affirmative answer, since less than 30% of all
families contain more than 100 proteins through pairwise alignments
( Fig. 4 ). Our assumption here is that newly sequenced proteins
will not differ considerably from the proteins we already know
from projects sequencing entire organisms.
Recent improvements in secondary structure prediction seem to
be due to various sources. Our results indicated the following
over-simplified formula. More than half of the recent improvement
of secondary structure prediction resulted from the growth of
sequence databases (from 90K in SWISS-PROT to 500K in BIG). Less
than one fifth of the improvement was achieved through better
database search methods (PSI-BLAST over BLAST). The remaining
one-third of the improvement was due to better methods. Hence,
the most crucial tool to improve secondary structure prediction
proved to be the new BLAST/PSI-BLAST searching tools. The contribution
of PSI-BLAST was probably smaller than usually assumed due to the
saturating nature of the dependence of prediction accuracy on
alignment size (Fig. 3, Fig. 4). The majority of proteins have
not yet reached this saturation point. Hence, we anticipate that
prediction accuracy will rise continuously with every protein
added to the databases.
Thanks to Jinfeng Liu (Columbia) for computer assistance and to the BLAST
team at NCBI for assistance. DP and BR gladly acknowledge support
by the grant RO1-GM63029-01 from the National Institutes of Health.
Last but not least, thanks to Amos Bairoch (Geneva), Rolf Apweiler
(EBI, Hinxton), Phil Bourne (San Diego) and their teams for maintaining
vital public databases and to all structural biologists who make
the fruits of their efforts publicly available.