| Title: | Powerful fusion: PSI-BLAST and consensus sequences |
| Author: | Dariusz Przybylski & Burkhard Rost |
| Quote: | Bioinformatics, 2008, Vol: pages |
Powerful fusion: PSI-BLAST and consensus sequences
| 1 | Dept. of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| 2 | Broad Institute of MIT and Harvard University, 320 Charles St., Cambridge, MA 02141, USA |
| 3 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA |
| 4 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| * | Corresponding author: dsp23@columbia.edu URL http://www.rostlab.org/ |
Motivation: A typical PSI-BLAST search consists of iterative scanning and alignment of a large sequence database during which a scoring profile is progressively built and refined. Such a profile can also be stored and used to search against a different database of sequences. Using it to search against a database of consensus rather than native sequences is a simple add-on that boosts performance surprisingly well. The improvement comes at a price: we hypothesized that random alignment score statistics would differ between native and consensus sequences. Thus PSI-BLAST-based profile searches against consensus sequences might incorrectly estimate statistical significance of alignment scores. In addition, iterative searches against consensus databases may fail. Here, we addressed these challenges in an attempt to harness the full power of the combination of PSI-BLAST and consensus sequences.
Results: We studied alignment score statistics for various types of consensus sequences. In general, the score distribution parameters of profile-based consensus sequence alignments differed significantly from those derived for the native sequences. PSI-BLAST partially compensated for the parameter variation. We have identified a protocol for building specialized consensus sequences that significantly improved search sensitivity and preserved score distribution parameters. As a result, PSI-BLAST profiles can be used to search specialized consensus sequences without sacrificing estimates of statistical significance. We also provided results indicating that iterative PSI-BLAST searches against consensus sequences could work very well. Overall, we showed how a widely popular and effective method could be used to identify significantly more relevant similarities among protein sequences.
Availability: http://www.rostlab.org/services/consensus/
Contact: dsp23@columbia.edu
Key words: evolution, sequence analysis, multiple alignments, consensus sequences, PSI-BLAST, statistics.
| 3D | three-dimensional; 3D structure, three-dimensional structure, i.e. co-ordinates of all residues/atoms in a protein |
| Ångstrøm (Å) | =0.1 nm |
| BLAST | Basic Local Alignment Search Tool, i.e. fast method for database searches |
| PDB | database of protein 3D structures |
| PSI-BLAST | Position Specific Iterative BLAST |
| PSSM | Position Specific Scoring Matrix (also referred to as profile) |
| RMSD | root mean square deviation |
| SCOP | Structural Classification Of Proteins based on their 3D structures |
| SWISS-PROT | expert curated database of protein sequences and annotations |
| UniProt | large database of protein sequences |
| consensus sequence | a sequence of amino acids representing a multiple alignment of related proteins |
| sequence-sequence | alignment of two native sequences |
| profile-sequence | we introduced this term deviating from the usual sequence-profile alternative in order to easily identify our different protocols: the alignment is between a profile as the query and simple sequences as templates (as done by PSI-BLAST); |
| sequence-consensus | alignment of native with consensus sequences |
| profile-profile | alignment of two evolutionary profiles |
| profile-consensus | alignment of profile with a consensus sequence, i.e. method introduced in this manuscript |
| consensus-consensus | alignment of two consensus sequences |
PSI-BLAST achieves a remarkable compromise between speed and quality. Ideally, an alignment method should accurately identify related sequences in today's rapidly growing databases within the shortest possible time. While we want to simultaneously optimize speed and reliability, in practice there is a tradeoff: very accurate alignment methods are relatively slow (e.g. profile-profile alignment algorithms), while very fast methods are far less sensitive than we might wish (e.g. BLAST [1] ). PSI-BLAST [2] strikes an excellent compromise between speed and sensitivity.
Consensus sequences improve PSI-BLAST performance. Consensus sequences were used early on to improve alignments [3] . Initially, the substitution of profiles with consensus sequences mimicked profile-sequence alignments [4, 5] . Many improvements followed [6, 7, 8, 9, 10, 11, 12, 13] . However, none of those methods approached the success of PSI-BLAST. We have recently proposed a simple add-on to PSI-BLAST that substantially improves its performance [14] . The add-on did not require any code change in PSI-BLAST. It consisted of adding a final step of 'freezing' the profile after the standard, iterative search against native sequences and then using it to search a database with the native sequences replaced by their consensus counterparts. This simple add-on improves the performance throughout the entire sensitivity curve. However, it is not clear how the underlying residue composition of database sequences affects the statistics of alignment scores. This is important because users rely on the estimates of statistical significance to judge retrieved alignments. In addition, incorrect scoring might invalidate iterative searches against consensus sequences; a single false alignment in one of the intermediate searches might pollute a scoring profile and thereby all subsequent searches.
This study was motivated by the following three assumptions: (1) For a given residue substitution scoring matrix, the statistical significance of alignment scores depends on the residue compositions of aligned sequences. Assume that a particular scoring matrix highly rewards the alignment of tryptophan. This implies that sequences rich in tryptophan will likely generate higher alignment scores than those with average tryptophan content. (2) The composition of consensus sequences differs from that of native sequences. Therefore, the distribution of alignment scores is likely different for consensus and native sequences, at least when using the same scoring matrix for both (such as BLOSUM62 [15] or the corresponding position-specific scoring matrices). (3) PSI-BLAST is very popular, well maintained, and has a great impact on the community of scientists that use sequence alignments. Therefore, it is desirable to improve PSI-BLAST performance without changing its alignment parameters (including scoring matrices and gap scores) with which the community is already familiar. In order to accomplish this, we have asked the following questions: what are the parameters of alignment score distribution for various types of consensus sequences? Can PSI-BLAST compensate for compositional variations through its internal composition-based adjustments [16] ? Or, can we build consensus sequences in a way that renders statistical significance reported by PSI-BLAST as valid? Finally, can we apply PSI-BLAST to iteratively search consensus sequence databases?
Generation of consensus sequences. We derived the consensus sequences from position-specific scoring matrices (PSSM, also known as scoring profiles) generated by iterative PSI-BLAST [17] searches of the redundancy-reduced UniProt [18] database containing about 1.5 million sequences. The sequence redundancy was reduced with CD-HIT [19] such that pairs of sequences had less than 80% identical residues (globally). We allowed up to five PSI-BLAST iterations, i.e. the frozen profile was computed based on the fourth iteration or, for early converging queries, on the next-to-last one. The E-value threshold for inclusion in PSSMs was set to 0.001 and we increased the maximum number of aligned sequences to 2000 (blastpgp options "-j 5 -h 0.001 -v 2000 -b 2000 -Q PSSM(ASCII)" ). Other options were left unchanged, including the default compositional adjustment of alignment score statistics and gap scores of -(11+k) for gaps of length k. The determination of consensus sequences was based on ASCII PSSMs. For a given sequence and a residue position we looked at the corresponding column of its PSSM and/or the frequency profile also present in the PSI-BLAST output. We explored three alternative ways for computing consensus residues at a given position i of a sequence: (i) MF - maximal frequency: the consensus residue j had the highest occurrence frequency f_ij in the profile column; (ii) MET - maximal relative entropy term: we chose the residue j with the highest relative entropy term f_ij ln(f_ij/b_j) with respect to the background frequency b_j; (iii) MR - maximal ratio of frequencies: we chose the residue with the highest frequency ratio f_ij/b_j. In addition we studied full (MF-full, MET-full, MR-full) and partial (MF-partial, MET-partial, MR-partial) versions of consensus sequences. For the (a) full consensus sequences we computed the consensus residue at each sequence position, and for the (b) partial consensus we computed the consensus in a constrained way, i.e. only for the more informative positions. The more informative positions were those having profile frequency columns with a relative entropy equal to or above 0.6 (as reported in the PSI-BLAST output).
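The three selection rules and the full/partial distinction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the study: the function names, the dictionary representation of a profile column, and the toy background frequencies are hypothetical, and the per-position relative entropy (info) is assumed to be read from the PSI-BLAST output rather than recomputed. Keeping the native residue for columns without any observed frequencies is also an assumption.

```python
import math


def consensus_residue(freqs, background, rule):
    """Return the consensus residue for one profile column.

    freqs and background map residue -> frequency (f_ij and b_j);
    rule is "MF", "MET" or "MR".
    """
    def score(residue):
        f = freqs.get(residue, 0.0)
        b = background[residue]
        if rule == "MF":    # maximal frequency
            return f
        if rule == "MR":    # maximal ratio of frequencies
            return f / b
        if rule == "MET":   # maximal relative entropy term; limit for f -> 0 is 0
            return f * math.log(f / b) if f > 0 else 0.0
        raise ValueError(f"unknown rule: {rule}")

    return max(background, key=score)


def build_consensus(native, columns, background, rule, info=None, threshold=0.6):
    """Full consensus if info is None; otherwise a partial consensus that
    replaces only positions whose reported relative entropy is >= threshold."""
    consensus = []
    for i, (residue, freqs) in enumerate(zip(native, columns)):
        informative = info is None or info[i] >= threshold
        has_counts = any(f > 0 for f in freqs.values())
        if informative and has_counts:
            consensus.append(consensus_residue(freqs, background, rule))
        else:
            consensus.append(residue)  # assumption: keep the native residue
    return "".join(consensus)


# Toy three-letter alphabet (hypothetical values, for illustration only).
background = {"A": 0.08, "L": 0.10, "W": 0.01}
column = {"A": 0.3, "L": 0.5, "W": 0.2}
for rule in ("MF", "MET", "MR"):
    print(rule, consensus_residue(column, background, rule))
```

In this toy column the MR rule selects the rare residue W although L is the most frequent one, which illustrates why MR-type consensus sequences are enriched in rare residues.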
Alignments. All of the alignments (except those used to estimate the alignment score distribution parameters) were generated using PSI-BLAST ("blastpgp") version 2.2.15. The scoring profiles (PSSMs) for the profile-sequence alignments were generated in the same way as those used for generation of consensus sequences, except that a file containing the binary version of a PSSM was also stored (blastpgp option "-C PSSM(binary)" ). This binary PSSM was used for a final (non-iterative) PSI-BLAST search against the appropriate consensus or native sequence databases (blastpgp options: "-j 1 -R PSSM(binary)" ). For the non-profile based alignments of sequences, the blastpgp program with default BLOSUM62 [15] scoring matrix was used (options: "-j 1"). When studying iterative searches of consensus sequence databases we compared the performance for various numbers of iterations. The consensus version of the redundancy-reduced UniProt database used in iterative consensus searches was computed over a period of a few months using spare CPUs of a large computing cluster.
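The two-step protocol (build and store a profile, then restart from it against another database) can be written down as command lines. The sketch below only assembles the argument lists and does not run anything. The options -j, -h, -v, -b, -Q, -C and -R are those quoted in the text; -i, -d and -o are assumed here to be the usual blastpgp query, database and output options, and all file and database names are hypothetical.

```python
def pssm_build_command(query, database, ascii_pssm, binary_pssm):
    """Iterative search that stores the ASCII (-Q) and binary (-C) PSSM."""
    return [
        "blastpgp", "-i", query, "-d", database,
        "-j", "5", "-h", "0.001", "-v", "2000", "-b", "2000",
        "-Q", ascii_pssm, "-C", binary_pssm,
    ]


def frozen_profile_command(query, target_db, binary_pssm, output):
    """Single-pass search restarted (-R) from the stored binary PSSM."""
    return [
        "blastpgp", "-i", query, "-d", target_db,
        "-j", "1", "-R", binary_pssm, "-o", output,
    ]


# Hypothetical file and database names.
build = pssm_build_command("query.fasta", "uniprot80", "query.pssm", "query.chk")
final = frozen_profile_command("query.fasta", "uniprot80_consensus",
                               "query.chk", "query_vs_consensus.out")
print(" ".join(build))
print(" ".join(final))
```

The same query file is passed to both steps because the stored profile belongs to that query; only the target database changes between the native and the consensus search.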
Evaluation of search capability. We evaluated the ability to identify remotely related proteins using SCOP [20] (release 1.69). We used the usual, descending hierarchy levels of "fold", "superfamily", and "family" to define true and false relationships. Our positives were pairs of protein domains from the same SCOP superfamily but from different SCOP families (i.e. the relatively easy pairs from the same family were not counted). When studying the iterative searches against consensus sequences, we also counted pairs from the same SCOP fold. The negatives belonged to different SCOP folds. We removed domains with discontinuous sequences or missing coordinates in their three-dimensional structures, domains from NMR and low-resolution structures (resolution worse than 2.5 Ångstrøms), and short domains (<50 residues). Next we reduced the redundancy of the sequence set so that no pair of sequences could be aligned by BLAST with E-values better than 10^-3 (when computed on a UniProt database of ~2,000,000 sequences) or at levels of sequence identity and alignment length that corresponded to HSSP-values above 0 [21, 22] (whichever of the two criteria applied). This yielded a data set of 2476 sequences to which we applied the all-against-all test.
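The definition of positives and negatives can be made concrete with a small labelling function. The sketch assumes that each domain is described by a SCOP classification string of the form class.fold.superfamily.family (e.g. "b.1.1.1"); the function name and the "ignored" label for pairs that count as neither true nor false are hypothetical.

```python
def label_pair(sccs_a, sccs_b, fold_level_positives=False):
    """Label a pair of domains as 'positive', 'negative' or 'ignored'.

    sccs_a, sccs_b: strings such as 'b.1.1.1' (class.fold.superfamily.family).
    fold_level_positives: also count same-fold pairs as positives, as done
    for the iterative searches against consensus sequences.
    """
    a, b = sccs_a.split("."), sccs_b.split(".")
    same_fold = a[:2] == b[:2]
    same_superfamily = a[:3] == b[:3]
    same_family = a[:4] == b[:4]

    if not same_fold:
        return "negative"   # different SCOP folds
    if same_family:
        return "ignored"    # relatively easy pairs are not counted
    if same_superfamily or fold_level_positives:
        return "positive"
    return "ignored"        # same fold, different superfamily


print(label_pair("b.1.1.1", "b.1.1.2"))
print(label_pair("b.1.1.1", "b.1.2.1", fold_level_positives=True))
print(label_pair("b.1.1.1", "c.1.1.1"))
```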
Score statistics. PSI-BLAST provides statistical significance of alignment scores in terms of expectation values (E-values) that are given by:
| $E = K\,m\,n\,e^{-\lambda \cdot \mathrm{score}}$ | (Eq. 1) |
where m and n are the effective lengths of aligned sequences (query and database), score is a raw alignment score (as given by the values in the scoring matrix and gap penalties), and K and λ (lambda) are the parameters of the score distribution that depend on the scoring system and the residue composition of aligned sequences. Note that the computation of the E-value primarily depends upon a proper estimate of λ and much less so on that for K, because λ enters through the exponent whereas K is only a linear prefactor.
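The different sensitivity of the E-value to the two parameters can be checked numerically. The sketch below evaluates the E-value formula for a 10% change in each parameter; the effective lengths and the raw score are hypothetical, while the baseline parameter values (K = 0.032, lambda = 0.255) are the estimates for native sequences reported in the Results.

```python
import math


def e_value(score, m, n, K, lam):
    """E = K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


# Hypothetical effective lengths and raw score.
m, n, score = 300, 5e8, 100

base = e_value(score, m, n, K=0.032, lam=0.255)
k_ratio = e_value(score, m, n, K=0.032 * 1.1, lam=0.255) / base
lam_ratio = e_value(score, m, n, K=0.032, lam=0.255 * 0.9) / base

print(f"10% larger K:       E changes by a factor of {k_ratio:.2f}")
print(f"10% smaller lambda: E changes by a factor of {lam_ratio:.2f}")
```

A 10% error in K changes the E-value by 10%, whereas a 10% error in lambda at a raw score of 100 changes it by a factor of exp(2.55), i.e. roughly 13-fold.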
Determining parameters of alignment score distributions. Estimating the statistical significance for alignment scores has been widely studied [23, 24, 25, 26] . We computed the λ and K parameters (eqn. 1) with our implementation of the island approach [27, 28]. This approach is appropriate as the primary methods studied in this manuscript rely on searching databases of consensus sequences with pre-computed PSSMs. Since we have also estimated statistical parameters for profile-based searches against native sequences we have established a link with earlier studies. First, we obtained the initial PSSMs for hundreds of thousands of randomly selected UniProt sequences. Most were too short to study the score distribution in the asymptotic limit of very long sequences. Therefore, we concatenated them in a random order and then cut them into final long PSSMs, each composed of 7,000 columns. We ended up with 75,000 such long PSSMs and computed consensus sequences for each one of them. Those sequences were then used to derive residue background frequencies. The background frequencies were then used to generate the random sequences used for studying alignment score distribution parameters. For partial consensus sequences, we computed two separate sets of backgrounds, inside and outside of consensus regions, and used them accordingly for the generation of random partial consensus sequences (as determined by the original PSSM column entropy values).
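The data preparation described above (not the island method itself) can be sketched as follows. This is an illustration under simplifying assumptions: a PSSM is represented as a plain list of columns, leftover columns that do not fill a complete long PSSM are dropped, a single background is used (the two-background treatment of partial consensus sequences is omitted), and all function names are hypothetical.

```python
import random
from collections import Counter


def cut_long_pssms(pssms, length=7000, seed=0):
    """Concatenate PSSMs in random order and cut them into pieces of
    exactly `length` columns; an incomplete final piece is dropped."""
    rng = random.Random(seed)
    order = list(pssms)
    rng.shuffle(order)
    columns = [col for pssm in order for col in pssm]
    return [columns[i:i + length]
            for i in range(0, len(columns) - length + 1, length)]


def background_frequencies(sequences):
    """Residue frequencies over a collection of (consensus) sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {residue: count / total for residue, count in counts.items()}


def random_sequence(background, length, rng):
    """Draw residues independently according to the background frequencies."""
    residues = sorted(background)
    weights = [background[r] for r in residues]
    return "".join(rng.choices(residues, weights=weights, k=length))


# Toy example: three short "PSSMs" cut into pieces of four columns.
pieces = cut_long_pssms([[1, 2, 3], [4, 5], [6, 7, 8, 9]], length=4)
bg = background_frequencies(["AALW", "LLAA"])
print(len(pieces), bg, random_sequence(bg, 10, random.Random(1)))
```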
Studying the compositional adjustment of alignment score statistic in PSI-BLAST. The newer versions of PSI-BLAST can adjust alignment score statistics based on varying residue compositions of query and database sequences ("-t" option in PSI-BLAST). In particular we looked at the performance of a default adjustment implemented in the 2.2.15 version of the software. We have generated random sequence databases based on the native and consensus background residue frequencies. Random sequence sizes were the same as those found in the non-redundant UniProt database. We queried those databases with about 20,000 randomly chosen native sequences and the corresponding PSI-BLAST profiles (PSSMs). We recorded the cumulative numbers of alignments per query that were found with E-values better than a given threshold value.
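The bookkeeping behind this test is simple: on a random database, the average number of alignments per query with an E-value at or below a threshold should be close to the threshold itself if the statistics are accurate. A minimal sketch, assuming the E-values of all reported alignments have already been parsed into one list per query (the function name and data layout are hypothetical):

```python
def hits_per_query(evalues_by_query, thresholds=(0.001, 0.01, 0.1, 1, 10)):
    """Average number of alignments per query with E-value <= threshold.

    evalues_by_query: one list of E-values per query (empty if no hits).
    """
    n_queries = len(evalues_by_query)
    return {
        t: sum(1 for evalues in evalues_by_query for e in evalues if e <= t) / n_queries
        for t in thresholds
    }


# Toy input: three queries searched against a random database.
observed = hits_per_query([[0.0005, 0.5, 5.0], [0.05], []])
for threshold, count in observed.items():
    print(f"expected {threshold}: observed {count:.3f}")
```

Queries without any reported alignment still count in the denominator; leaving them out would inflate the observed numbers.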
Alignment score parameters depend on consensus type. The relationship between the alignment score and the lambda (eqn. 1) has been described before [28] . Low-scoring alignments usually have fewer gaps. This results in score distribution parameters differing from those obtained for high-scoring alignments with gaps. Here, we have focused mostly on asymptotic values of lambda for high scores because they correspond to statistically significant alignments originating from searches of large sequence databases. In particular, we looked at lambda for position-specific scoring matrices (PSSM) generated with 5 iterations of PSI-BLAST. We observed that lambda depended on the sequence types (Fig. 1). Computing consensus residues for the full sequence produced the largest changes of lambda (open symbols in Fig. 1, i.e. MR-full, MET-full, and MF-full). For each one of them, the asymptotic value of lambda was less than 0.2 (more data points would be needed to establish a precise limit). The value of lambda for native sequences was about 0.255 (Fig. 1, green squares). This is rather close to the value of 0.267 previously established for the BLOSUM62 scoring matrix [28] . For the partial consensus sequences, lambda appeared to follow the value obtained for the native sequences (filled symbols in Fig. 1, i.e. MF-partial and MET-partial). To some extent this result is not surprising because partial consensus substitutions are more restricted than the full ones, i.e. they change fewer residues (Table 1). As a result we established that one could use PSI-BLAST without any modifications to search partial consensus sequence databases and maintain proper estimates of E-values.
Fig. 1: Estimating lambda. Score distribution parameter λ (eqn. 1; y-axis) varies with alignment scores (x-axis). In practice, we are interested in the asymptotic value of lambda for higher scores. Full consensus sequences affected lambda significantly (open symbols) when compared to native sequences (green squares). In contrast, partial consensus did not significantly affect lambda (filled black and blue symbols). Red error bars estimate the standard deviation (for simplicity only shown for native sequences). Note that high alignment scores were attained by few alignments.
We have also estimated the location parameter K used for computing E-values (eqn. 1). For example, we found it to be ~0.015 for the full consensus sequences (MF-full), 0.030 for partial consensus sequences (MF-partial) and 0.032 for the native sequences.
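To see what these parameter differences mean in practice, one can invert eqn. 1 and ask which raw score is needed to reach a given E-value. The sketch below does this for the native estimates (K = 0.032, lambda = 0.255) and for MF-full (K = 0.015). Since the text only states that lambda was below 0.2 for full consensus sequences, the value 0.2 is used as an upper bound, so the full-consensus score is a lower bound; the effective lengths are hypothetical.

```python
import math


def e_value(score, m, n, K, lam):
    """E = K * m * n * exp(-lam * score) (eqn. 1)."""
    return K * m * n * math.exp(-lam * score)


def score_for_e_value(e_target, m, n, K, lam):
    """Raw score at which eqn. 1 yields e_target."""
    return math.log(K * m * n / e_target) / lam


# Hypothetical effective lengths of query and database.
m, n = 300, 5e8

native_score = score_for_e_value(0.001, m, n, K=0.032, lam=0.255)
full_score = score_for_e_value(0.001, m, n, K=0.015, lam=0.2)

print(f"native:  raw score of about {native_score:.0f} needed for E = 0.001")
print(f"MF-full: raw score of at least {full_score:.0f} needed for E = 0.001")
```

Under these assumptions a clearly higher raw score is required against full consensus sequences, so applying native parameters to such a database would overstate significance.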
Table 1: Average pairwise residue identities (%) between sequence types*
| | | Native | Full consensus: MR | Full consensus: MF | Full consensus: MET | Partial consensus: MR | Partial consensus: MF | Partial consensus: MET |
| Native | native | 100 | | | | | | |
| Full consensus | MR | 65 | 100 | | | | | |
| | MF | 54 | 76 | 100 | | | | |
| | MET | 51 | 80 | 90 | 100 | | | |
| Partial consensus | MR | 86 | 79 | 64 | 63 | 100 | | |
| | MF | 83 | 72 | 71 | 67 | 93 | 100 | |
| | MET | 82 | 73 | 69 | 69 | 94 | 98 | 100 |
* Shown are the average percentages of pairwise residue identity between the different types of sequences of the test set.
Search performance similar for all consensus types. Do some types of consensus sequences retrieve related sequences from a database better than others? For each type of consensus, we ordered all query alignments by PSI-BLAST E-values. Next, we computed the cumulative numbers of true positive relations (same SCOP superfamily but different family) for increasing cumulative numbers of false positive pairs (different SCOP folds). At any false positive number (i.e. at any error rate), the profile-sequence searches against the databases of full consensus sequences yielded the most true positives (Fig. 2, top three curves: MET-full, MF-full, MR-full). Interestingly, it did not matter much how we compiled the full consensus (the three top lines with open symbols in Fig. 2 are almost indistinguishable). The profile-based searches against partial consensus sequences (only the most informative positions replaced by consensus) were somewhat less efficient, especially when more false hits were allowed (Fig. 2, MET-partial). Nevertheless, they were significantly better than standard profile-sequence searches of PSI-BLAST (Fig. 2, native). For comparison, we also included the performance of sequence-sequence searches with pairwise BLAST for the native and consensus sequences (Fig. 2, MET-full-1, MET-partial-1, native-1). As expected, pairwise searches fared much worse than profile-sequence searches. The relative performance difference between the full and partial consensus sequences appeared larger for the sequence-sequence (Fig. 2, MET-full-1, MET-partial-1) than for profile-sequence searches.
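The cumulative curve described above can be sketched as follows. The input is assumed to be a list of (E-value, label) pairs for all alignments, with labels "positive", "negative" or "ignored" (a hypothetical encoding of the SCOP-based definitions); the function records the number of true positives accumulated each time a new false positive is encountered.

```python
def tp_fp_curve(hits):
    """Cumulative true positives versus cumulative false positives.

    hits: iterable of (e_value, label) pairs.
    Returns a list of (false_positives, true_positives) points, one per
    false positive, after ordering all hits by increasing E-value.
    """
    true_positives = false_positives = 0
    curve = []
    for _, label in sorted(hits, key=lambda hit: hit[0]):
        if label == "positive":
            true_positives += 1
        elif label == "negative":
            false_positives += 1
            curve.append((false_positives, true_positives))
        # pairs labelled "ignored" count as neither true nor false
    return curve


hits = [(1e-3, "negative"), (1e-10, "positive"), (0.1, "positive"),
        (1e-5, "positive"), (0.01, "ignored"), (1.0, "negative")]
print(tp_fp_curve(hits))
```

Comparing two search protocols then amounts to comparing their curves at the same number of false positives.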
Fig. 2: Comparison of search performance. All-against-all alignments of the test set sequences were ordered by their PSI-BLAST E-values. The cumulative numbers of non-trivial true relations (same SCOP superfamily but different SCOP family) were plotted against the cumulative numbers of false positives (different SCOP folds). The profile-sequence searches against the full consensus sequences performed best (top three curves: MET-full, MF-full, MR-full). Profile-sequence searches against partial consensus sequences were slightly less efficient (MET-partial) but they were still significantly better than standard profile-sequence searches (native). Sequence-sequence searches (one cycle of PSI-BLAST with the BLOSUM62 matrix) were clearly inferior (MET-full-1, MET-partial-1, native-1).
Composition of consensus sequences varied. The search performance appeared not to differ between various types of full consensus sequences although their average residue compositions were quite different (Fig. 3 A). The consensus based on the maximum ratio of target and background frequencies (MR-full) weighed rare residues such as tryptophan (W) more heavily. The consensus based on the most frequent residue (MF-full) weighed the more ubiquitous ones such as leucine (L) more heavily. Finally, the consensus based on relative entropy (MET-full) produced the composition most similar to the native one (Fig. 3 A, blue bars). The average percentages of residue identity (and standard deviations) between native and full consensus sequences were: 65 (±14) for MR-full, 54 (±16) for MF-full, and 51 (±17) for MET-full consensus sequences. The partial consensus calculations resulted in average compositions that were much closer to the native ones (Fig. 3 B). The corresponding residue identities with respect to native sequences were: 86 (±7), 83 (±8), and 82 (±8)% for MR-partial, MF-partial, and MET-partial, respectively. Thus, among the full consensus types, the calculation that changed sequences the least in terms of the average residue identity (MR-full) changed the score parameters the most. Other pairwise residue identities are given in Table 1 . All calculations were performed on our non-redundant SCOP test set.
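Because a consensus sequence replaces residues position by position, it has the same length as its native counterpart and the residue identity can be computed without an alignment. A minimal sketch, assuming equal-length sequence pairs; the function names are hypothetical, and the use of the sample standard deviation is an assumption since the text does not specify the estimator.

```python
from statistics import mean, stdev


def percent_identity(a, b):
    """Percentage of identical residues between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def identity_summary(pairs):
    """Mean and sample standard deviation of the identity over sequence pairs."""
    values = [percent_identity(a, b) for a, b in pairs]
    return mean(values), stdev(values)


# Toy (native, consensus) pairs.
pairs = [("ACDE", "ACDF"), ("ACDE", "ACDE")]
print(identity_summary(pairs))
```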
Fig. 3: Comparison of residue compositions. We computed the background residue compositions for consensus and native sequences in our test set. Full consensus sequences (left panel) differed more from native than partial consensus sequences (right panel). Choosing the consensus residue corresponding to the highest relative entropy term (blue bars) resulted, on average, in smaller deviations from the native composition.
PSI-BLAST compositional adjustments were partially successful. When compositions of aligned sequences differ from the standard one, PSI-BLAST can attempt to correct estimates of statistical significance accordingly [16, 29] . We studied how well the default adjustments perform on consensus sequences (non-default adjustments are not available for profile-based searches). Using PSI-BLAST profiles we searched against the consensus and native sequence databases (Methods). For the comparison, we also searched with the BLOSUM62 substitution matrix (standard, non-profile BLAST search). In the latter case, the estimates of statistical significance were not very sensitive to compositional differences and the statistic adjustments worked well (Table 2 observed and expected counts similar; adjustments were conservative). However, for the profile-based searches the compositional differences were significant, particularly for full consensus sequences (especially pronounced for MR-full, Table 3 ). The compositional adjustment of scores attempted by PSI-BLAST (-t option set to 1) failed to satisfactorily correct for the differences. In contrast, the E-value estimates were good for partial consensus sequences. For both native and partial consensus sequences, the compositional score adjustment sometimes resulted in slightly increased numbers of random alignments with significant E-values.
Table 2: Expected and observed numbers of random alignments for sequence queries (BLOSUM62)*
| Expected \ observed | Native | | Full consensus | | | | | | Partial consensus | | | | | |
| | native | native-adj. | MR | MR-adj. | MF | MF-adj. | MET | MET-adj. | MR | MR-adj. | MF | MF-adj. | MET | MET-adj. |
| 0.001 | 0.0014 | 0.0010 | 0.0010 | 0.0002 | 0.0007 | 0.0006 | 0.0009 | 0.0003 | 0.0014 | 0.0008 | 0.0018 | 0.0008 | 0.0012 | 0.0009 |
| 0.01 | 0.010 | 0.007 | 0.006 | 0.004 | 0.006 | 0.005 | 0.008 | 0.004 | 0.011 | 0.006 | 0.012 | 0.005 | 0.011 | 0.006 |
| 0.1 | 0.09 | 0.07 | 0.07 | 0.04 | 0.03 | 0.06 | 0.08 | 0.06 | 0.09 | 0.07 | 0.11 | 0.07 | 0.10 | 0.07 |
| 1 | 0.9 | 0.7 | 0.7 | 0.5 | 0.2 | 0.7 | 0.9 | 0.6 | 0.9 | 0.7 | 1.1 | 0.7 | 1.0 | 0.7 |
| 10 | 9 | 7 | 7 | 6 | 18 | 7 | 9 | 7 | 9 | 8 | 11 | 8 | 10 | 8 |
* Shown are the expected and observed numbers of random alignment scores for ~20,000 sequence queries on randomly generated databases (of UniProt size) of native and consensus sequences. The suffix "-adj." indicates results obtained with the use of compositional adjustment of E-values with BLAST option "-t" set to 1.
Table 3: Expected and observed numbers of random alignments for profile (PSSM) queries*
| Expected \ observed | Native | | Full consensus | | | | | | Partial consensus | | | | | |
| | native | native-adj. | MR | MR-adj. | MF | MF-adj. | MET | MET-adj. | MR | MR-adj. | MF | MF-adj. | MET | MET-adj. |
| 0.001 | 0.0013 | 0.0033 | 66.4912 | 1.6637 | 0.0024 | 0.0008 | 0.0215 | 0.0093 | 0.0053 | 0.0040 | 0.0014 | 0.0028 | 0.0013 | 0.0038 |
| 0.01 | 0.008 | 0.020 | 98.880 | 4.3162 | 0.018 | 0.028 | 0.086 | 0.036 | 0.022 | 0.026 | 0.010 | 0.021 | 0.009 | 0.022 |
| 0.1 | 0.08 | 0.18 | 159.83 | 12.83 | 0.170 | 0.22 | 0.51 | 0.26 | 0.17 | 0.17 | 0.10 | 0.20 | 0.10 | 0.20 |
| 1 | 0.8 | 1.6 | 259.1 | 34.9 | 1.6 | 1.8 | 3.2 | 2.1 | 1.4 | 1.4 | 1.0 | 1.7 | 1.0 | 1.7 |
| 10 | 8 | 13 | 405 | 102 | 14 | 14 | 22 | 16 | 12 | 12 | 9 | 13 | 10 | 14 |
** PSI-BLAST searches were restarted from a stored profile.
* Shown are the expected and observed numbers of random alignment scores for a set of about 20,000 profile (PSSM) queries on randomly generated databases (of UniProt size) of native and consensus sequences. The suffix "-adj." indicates results obtained with the use of compositional adjustment of E-values with PSI-BLAST option "-t" set to 1.
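A quick way to read the profile-query table is to divide the observed counts by the expected count. The sketch below does this for the row with an expected count of 1, using the values listed in that row; the two-fold tolerance used to flag a column is an arbitrary choice made here for illustration, not a criterion from the study.

```python
# Observed counts from the row with an expected count of 1 (profile queries).
observed_at_expected_1 = {
    "native": 0.8, "native-adj.": 1.6,
    "full MR": 259.1, "full MR-adj.": 34.9,
    "full MF": 1.6, "full MF-adj.": 1.8,
    "full MET": 3.2, "full MET-adj.": 2.1,
    "partial MR": 1.4, "partial MR-adj.": 1.4,
    "partial MF": 1.0, "partial MF-adj.": 1.7,
    "partial MET": 1.0, "partial MET-adj.": 1.7,
}


def inflated(observed, expected, tolerance=2.0):
    """Names of columns whose observed/expected ratio exceeds the tolerance."""
    return [name for name, count in observed.items()
            if count / expected > tolerance]


print(inflated(observed_at_expected_1, expected=1.0))
```

With this tolerance only full consensus columns are flagged, and MR-full stands out by more than two orders of magnitude, while all partial consensus columns stay within a factor of two of the expectation.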
Little additional CPU needed for add-on. In this study we used separate databases for the iterative derivation of PSSMs (non-redundant UniProt) and for the final search and alignment against consensus sequences. On average, the entire iterative PSI-BLAST search took about 10 min per query (about 2 min per iteration on a single 3.2GHz CPU with 2GB of RAM using query sequences with an average length of 415 residues). The additional time consumed by the add-on to search a consensus sequence database of the same size depended on the sequence type. It took about 7 minutes to search the MR-full and 4.5 minutes to search the MF-full consensus sequence database. For the partial consensus case, it took about 2.5 minutes to search the MR-partial and about 2.2 minutes to search the MF-partial database (compared to about 2 minutes needed to search the native one with a PSI-BLAST profile).
Iterative searches against consensus sequences yielded further improvements. We made the first stab at analyzing iterative PSI-BLAST searches against consensus sequence databases. For this analysis we pushed the envelope by running up to 20 iterations. We counted hits belonging to the same SCOP fold but to different families as positives to reach deeper into remote protein domain relationships. The iterative PSI-BLAST searches against the native sequence database resulted in near saturation of performance at about 10 iterations. Only a small improvement was observed in subsequent 10 iterations (Fig. 4 top two green lines). The iterative searches against consensus sequences (MF-full) produced significantly more true hits with just three iterations. Five consensus iterations produced almost twice as many true hits as native PSI-BLAST search produced with twenty. For comparison, we showed the results of the profile-sequence search (profile obtained from ten iterations of PSI-BLAST on a native database) against a final database of consensus sequences (Fig. 4 blue line, mixed). These results remain to be compared to the performance of profile-profile methods [30, 31] .
Fig. 4: Iterating PSI-BLAST against native and consensus. Iterative PSI-BLAST searches and PSSM refinements on native sequence database (green lines) resulted in near saturation of performance at about 10 iterations (top two green lines). The corresponding searches on the database of consensus sequences (black lines) found significantly more true hits (same SCOP fold but different family) with just three iterations (black triangles) while five iterations (black circles) retrieved almost twice as many true hits as the maximum for the native PSI-BLAST. For comparison, a result of the profile search against a final database of consensus sequences (MF-full) is presented (blue line).
PSI-BLAST is an excellent, well-known, well-maintained, and trusted resource for searching and aligning sequence databases. A simple add-on, consisting of searching with a PSI-BLAST-generated scoring profile against a database of consensus sequences, significantly improved the performance in finding related sequences. Here, we specified in detail how different strategies of compiling consensus sequences affected the estimates of statistical significance and performance. Profile-based PSI-BLAST searches against full consensus sequences improved the most over searches against native sequences. However, they sometimes suffered from problems in the estimates of statistical significance. The partial consensus sequences improved significantly over native sequences without sacrificing estimates of statistical significance. Our initial results for iterative searches against consensus sequences were very promising: a lower number of iterations used less CPU overall and yielded about twice as many correct hits at the same error rates as standard PSI-BLAST searches did. Hence the fusion of PSI-BLAST and consensus sequences promises another leap in database searches.
This work was supported by the grants R01-LM07329-01 from the National Library of Medicine (NLM) and U54-GM074958-01 to the Northeast Structural Genomics consortium (NESG) from the Protein Structure Initiative (PSI) of the National Institutes of Health (NIH). Thanks to Johannes Soeding (Max-Planck-Institute in Tuebingen) for helpful comments. Thanks to all who deposit their experimental data in public databases and to those who maintain these databases. Thanks also to all who develop alignment tools and make them publicly available, in particular to those who develop and support PSI-BLAST!
| 1. | Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215, 403-10. |
| 2. | Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389-402. |
| 3. | Patthy, L. (1987). Detecting homology of distantly related proteins with consensus sequences. J Mol Biol, 198, 567-77. |
| 4. | Sonnhammer, E. L. & Kahn, D. (1994). Modular arrangement of proteins as inferred from analysis of homology. Protein Sci, 3, 482-92. |
| 5. | Henikoff, S. & Henikoff, J. G. (1997). Embedding strategies for effective use of information from multiple sequence alignments. Protein Sci, 6, 698-705. |
| 6. | Schultz, J., Milpetz, F., Bork, P. & Ponting, C. P. (1998). SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci U S A, 95, 5857-64. |
| 7. | Schaffer, A. A., Wolf, Y. I., Ponting, C. P., Koonin, E. V., Aravind, L. et al. (1999). IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics, 15, 1000-11. |
| 8. | Thelen, M. P., Venclovas, C. & Fidelis, K. (1999). A sliding clamp model for the Rad1 family of cell cycle checkpoint proteins. Cell, 96, 769-70. |
| 9. | Marchler-Bauer, A., Panchenko, A. R., Shoemaker, B. A., Thiessen, P. A., Geer, L. Y. et al. (2002). CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res, 30, 281-3. |
| 10. | Servant, F., Bru, C., Carrere, S., Courcelle, E., Gouzy, J. et al. (2002). ProDom: automated clustering of homologous domains. Brief Bioinform, 3, 246-51. |
| 11. | Finn, R. D., Mistry, J., Schuster-Bockler, B., Griffiths-Jones, S., Hollich, V. et al. (2006). Pfam: clans, web tools and services. Nucleic Acids Res, 34, D247-51. |
| 12. | Letunic, I., Copley, R. R., Pils, B., Pinkert, S., Schultz, J. et al. (2006). SMART 5: domains in the context of genomes and networks. Nucleic Acids Res, 34, D257-60. |
| 13. | Merkeev, I. V. & Mironov, A. A. (2006). PHOG-BLAST--a new generation tool for fast similarity search of protein families. BMC Evol Biol, 6, 51. |
| 14. | Przybylski, D. & Rost, B. (2007). Consensus sequences improve PSI-BLAST through mimicking profile-profile alignments. Nucleic Acids Res, 35, 2238-46. |
| 15. | Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A, 89, 10915-9. |
| 16. | Schaffer, A. A., Aravind, L., Madden, T. L., Shavirin, S., Spouge, J. L. et al. (2001). Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res, 29, 2994-3005. |
| 17. | Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389-402. |
| 18. | Apweiler, R., Bairoch, A., Wu, C. H., Barker, W. C., Boeckmann, B. et al. (2004). UniProt: the Universal Protein knowledgebase. Nucleic Acids Res, 32, D115-9. |
| 19. | Li, W., Jaroszewski, L. & Godzik, A. (2001). Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282-3. |
| 20. | Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 247, 536-40. |
| 21. | Sander, C. & Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56-68. |
| 22. | Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12, 85-94. |
| 23. | Karlin, S. & Altschul, S. F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A, 87, 2264-8. |
| 24. | Mott, R. (1992). Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull Math Biol, 54, 59-75. |
| 25. | Waterman, M. S. & Vingron, M. (1994). Rapid and accurate estimates of statistical significance for sequence data base searches. Proc Natl Acad Sci U S A, 91, 4625-8. |
| 26. | Altschul, S. F. & Gish, W. (1996). Local alignment statistics. Methods Enzymol, 266, 460-80. |
| 27. | Olsen, R., Bundschuh, R. & Hwa, T. (1999). Rapid assessment of extremal statistics for gapped local alignment. Proc Int Conf Intell Syst Mol Biol, 211-22. |
| 28. | Altschul, S. F., Bundschuh, R., Olsen, R. & Hwa, T. (2001). The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res, 29, 351-61. |
| 29. | Yu, Y. K. & Altschul, S. F. (2005). The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics, 21, 902-11. |
| 30. | Rychlewski, L., Jaroszewski, L., Li, W. & Godzik, A. (2000). Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci, 9, 232-41. |
| 31. | Sadreyev, R. & Grishin, N. (2003). COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol, 326, 317-36. |
| Contact: admin@rostlab.org | Version: Jun 30, 2008 |