Protein fold recognition by merging 1D structure prediction and sequence alignments

Burkhard Rost

EMBL; 69012 Heidelberg, Germany; rost@EMBL-Heidelberg.de

EBI; Cambridge CB10 1RQ; England


Contact: Burkhard Rost (rost@EMBL-Heidelberg.de)


Note: a short version of this manuscript is published in J. Mol. Biol. (Abstract)


Abstract

In fold recognition by threading, one takes the amino acid sequence of a protein and evaluates how well it fits into one of the known three-dimensional (3D) protein structures. The quality of sequence-structure fit is typically evaluated using inter-residue potentials of mean force or other statistical parameters. Here, we present a new approach to evaluating sequence-structure fitness. Starting from the amino acid sequence, we first predict secondary structure and solvent accessibility for each residue. We then thread the resulting one-dimensional (1D) profile of predicted structure assignments into each of the known 3D structures. The agreement between predicted and observed structure profiles is evaluated using statistical parameters. The optimal threading for each sequence-structure pair is obtained using dynamic programming. The overall best sequence-structure pair constitutes the predicted 3D structure for the input sequence. The method is fine-tuned by adding information from direct sequence-sequence comparison and applying a series of empirical filters. Although the method relies on reduction of 3D information into 1D structure profiles, its accuracy is, surprisingly, not clearly inferior to methods based on evaluation of residue interactions in 3D. We therefore hypothesise that existing 1D-3D threading methods essentially capture no more than the fitness of an amino acid sequence for a particular 1D succession of secondary structure segments and residue solvent accessibility. The prediction-based threading method on average finds any structurally homologous region at first rank in 29% of the cases (including sequence information). For the 22% of first hits detected at the highest scores, the expected accuracy rose to 75%. However, the task of detecting entire folds, rather than homologous fragments, was managed much better: 45-75% of the first hits correctly recognised the fold.
The quality of the resulting 3D models depends crucially on the details of the sequence-structure alignments, which can be inaccurate in detail even in cases in which the correct fold is detected.


Key words: protein structure prediction, threading, remote homology detection, fold recognition, homology modelling, secondary structure, relative solvent accessibility, multiple alignments, dynamic programming, neural networks


Abbreviations used:
3D, three-dimensional; 1D, one-dimensional; MaxHom, dynamic programming algorithm for conservation weight based multiple sequence alignment; PDB, Protein Data Bank of experimentally determined 3D structures of proteins; SWISS-PROT, data base of protein sequences; DSSP, data base containing the secondary structure and solvent accessibility for proteins of known 3D structure; FSSP, data base of remote homologues of known 3D structure; PHD, Profile based neural network prediction of secondary structure (PHDsec) and solvent accessibility (PHDacc); rmsd, root mean square deviation; U, protein sequence of unknown 3D structure (e.g. search sequence in alignment procedure).





Introduction

Reducing the sequence-structure gap by homology modelling. Large-scale gene-sequencing projects accumulate gene sequences, and thus protein sequences, at a breathtaking pace (Oliver et al., 1992; Johnston et al., 1994). The three-dimensional (3D) structure of a protein of unknown structure (U) can be modelled by homology if a protein of known 3D structure is found that has more than 25-30% pairwise sequence identity to U (Chothia & Lesk, 1986; Sander & Schneider, 1991). Currently, homology modelling increases the number of known structures by a factor of three to over 11,000 (Schneider & Sander, 1996), i.e., for about one fourth of all sequences in SWISS-PROT (Bairoch & Boeckmann, 1994), 3D structure is either known or can be built by homology. The quality of such models decreases with lower levels of pairwise sequence identity (Sali & Blundell, 1993; De Filippis et al., 1994; Holm et al., 1994; Sali & Blundell, 1994).

Possible scope of remote homology modelling. Protein structure is more conserved than is protein sequence (Chothia & Lesk, 1986; Lesk, 1991). Consequently, two naturally evolved proteins can have rather different sequences and still fold into homologous structures. But how much variation, precisely, is possible? A level of 25-30% pairwise sequence identity is sufficient to ensure that two naturally evolved sequences have homologous structures (Sander & Schneider, 1991). However, even less pairwise sequence identity is often sufficient to maintain the same 3D structure. Currently there are thousands of remote homologues, i.e., homologues with less than 25% pairwise sequence identity, stored in a database of structurally aligned remote homologues (Holm et al., 1993; Holm & Sander, 1994). To illustrate the possible scope of remote homology modelling by numbers: aligning a list of 150 unique folds against a subset of PDB in which no pair has more than 25% sequence identity (>600 structures) yields 90,000 alignments. About 1,000 of these alignments correspond to true remote homologues. In other words, for each fold there are roughly ten hits in the range of 0-25% pairwise sequence identity. Assuming this distribution could be generalised, remote homology modelling could be used to model 3D structure for about one fourth of the proteins for which it is applicable today, i.e., for another two to five thousand proteins. Furthermore, given the assumption of a limited number of folds realised in nature (Chothia, 1992; Finkelstein et al., 1993) and the increase of the database of known structures, the likelihood that the database contains a remote homologue of the search sequence U is increasing. But how can remote homologues be detected?

Errors in 3D structures can be detected by potential-based threading. Threading techniques have become increasingly popular as a means to detect and align remote homologues (Bowie et al., 1990; Hendlich et al., 1990; Bowie et al., 1991; Lüthy et al., 1991; Casari & Sippl, 1992; Godzik & Skolnick, 1992; Goldstein et al., 1992; Jones et al., 1992; Lüthy et al., 1992; Maiorov & Crippen, 1992; Sippl & Weitckus, 1992; Blundell & Johnson, 1993; Bryant & Lawrence, 1993; Godzik et al., 1993; Miyazawa & Jernigan, 1993; Nishikawa & Matsuo, 1993; Ouzounis et al., 1993; Sippl, 1993a; Stultz et al., 1993; Wilmanns & Eisenberg, 1993; Wodak & Rooman, 1993; Abagyan et al., 1994; Bauer & Beyer, 1994; Crippen & Maiorov, 1994; Lathrop & Smith, 1994; Sippl & Jaritz, 1994; Zhang & Eisenberg, 1994; Braxenthaler & Sippl, 1995; Flöckner et al., 1995; Wang et al., 1995; Fischer & Eisenberg, 1996; Russell et al., 1996). The concept used predominantly is to investigate the fitness of a given sequence for a given structure by a database-derived potential. Various methods differ mostly in the details of deriving and applying such potentials. In contrast to potentials used for molecular dynamics, environment-based potentials successfully detect the real structure among a set of grossly mis-folded decoys (Novotny et al., 1984; Novotny et al., 1988; Ouzounis et al., 1993; Sippl & Jaritz, 1994). Furthermore, potentials of mean-force are in some cases accurate enough to detect subtle errors and stresses in protein structures (Holm & Sander, 1992; Laskowski et al., 1993; Sippl, 1993b; Vriend & Sander, 1993; Rost & Sander, 1994c) or even to distinguish between different possible solutions resulting from refinement procedures (O'Donoghue, manuscript in preparation). Can potential-based threading be applied successfully for remote homology modelling?

Long way from fold recognition to remote homology modelling. The problem of detecting remote homologues is of the type 'needle in the haystack' (even worse, it is NP-complete (Lathrop, 1994)). To illustrate this point by numbers: aligning the unique folds (150) against the entire PDB (3,000) would yield 450,000 pairs, of which about 1,500 are remote homologues (Holm & Sander, 1994), i.e., the goal is to find the one true homologue among 100-300 decoys. Unfortunately, so far no analysis of threading methods has been based on a large data set. A test of threading methods at the first meeting to evaluate structure prediction accuracy (Moult et al., 1995) suggested levels of 10-40% accuracy in correctly detecting the homologous fold (Lemer et al., 1995; Shortle, 1995; Sippl, 1995). However, detection of the homologue is the simpler part of successful remote homology modelling. It is more difficult to align the homologous proteins correctly and to build the model correctly. Neither issue has been evaluated on larger data sets; only in some isolated cases has threading been shown to yield a correct 3D model (Flöckner et al., 1995).

Here, we extend our previously proposed method for threading predictions of 1D structure into 3D structures (Rost, 1995a; Rost, 1995b). First, 1D structure profiles were predicted from multiple sequence alignments. Then, the 1D predictions were aligned to 1D projections of known structures. The accuracy of the method in detecting remote homologues was evaluated on a data set of 89 unique protein folds. The ability to correctly build remote homologous models was investigated for all correctly detected remote homologues. Finally, we compared the performance of the method to that of other tools based on three different data sets.




Methods

Brief outline of the algorithm

The algorithm started from a protein sequence which was aligned by MaxHom (Sander & Schneider, 1991) against the SWISS-PROT (Bairoch & Boeckmann, 1994) sequence database (Fig. 1). The resulting multiple sequence alignment was used as input to neural network systems predicting secondary structure (PHDsec; Rost & Sander, 1994a) and solvent accessibility (PHDacc; Rost & Sander, 1994b). The predictions were converted into 1D structural profiles. Up to this point the method was constrained to a straight prediction in 1D, i.e., without any reference to 3D structure or the final goal of threading. The 1D structure profiles could be produced by any prediction method. Effectively, the amino acid sequence had now been translated into a 1D string of structure symbols ('predicted structure profile'), with some cooperativity taken into account. The idea was now to find the 3D fold that had the most similar structure profile (in terms of secondary structure and accessibility). The next step was to represent each of the known folds in the database as an observed structure profile (derived from the coordinates using DSSP; Kabsch & Sander, 1983). Finally, predicted and observed 1D structure profiles were optimally aligned by a dynamic programming algorithm (MaxHom). The best hit of the alignment procedure was recorded, and the final best hit was taken as the predicted fold. The predicted 3D structure was modelled based on the alignment of the input sequence into the predicted fold.

Prediction of 1D structure

Generating 1D structure predictions. Secondary structure and solvent accessibility are well conserved within sequence families (Flores et al., 1993; Russell & Barton, 1993; Rost & Sander, 1994b; Rost et al., 1994). Therefore, the evolutionary information contained in multiple sequence alignments can be used to significantly improve 1D structure prediction (Maxfield & Scheraga, 1979; Zvelebil et al., 1987; Benner & Gerloff, 1990; Niermann & Kirschner, 1991; Benner et al., 1992; Rost & Sander, 1992; Levin et al., 1993; Rost & Sander, 1993; Livingstone & Barton, 1994; Salamov & Solovyev, 1995; Di Francesco et al., 1996; Rost, 1996). We used the PHD predictions, for which the average expected accuracy is >72% for secondary structure (helix, strand, rest) and >75% for relative accessibility (buried, exposed) (Rost & Sander, 1993; Rost & Sander, 1994b; Rost & Sander, 1994a; Rost & Sander, 1995; Rost, 1996).

Three alternatives for the aligned strings. For a practical application of the method, predicted 1D structure profiles were aligned to observed 1D structure profiles (PHD vs. PDB). To investigate the influence of the accuracy of 1D structure prediction, we performed the following calibration experiment: observed 1D structure profiles were aligned against observed 1D structure profiles (PDB vs. PDB). Another possible extension of the concept was the alignment of predicted against predicted 1D structure profiles (PHD vs. PHD). Such a search could yield a prediction of a fold identity between two proteins both of unknown structure.

Alignment of 1D structure

Number of 1D structure states. We analysed various options for the number of states onto which 3D structure was projected: (i) accessibility only, i.e., two states for buried (relative solvent accessibility <= 9%) and exposed; (ii) secondary structure only, i.e., three states for helix, strand and rest; (iii) combined 1D structure, i.e., six states (3 x 2); (iv) 1D structure combined with sequence, i.e., 120 states (3 x 2 x 20). The first two (only secondary structure or only accessibility) were clearly inferior to the combined approaches (Rost, 1995a). An implicit assumption for the success of 1D structure threading is that 1D structure is conserved between remote homologues.
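The six-state projection (option iii) can be illustrated with a short sketch. The 9% burial threshold is taken from the text; the reduction of DSSP's eight secondary structure codes to three classes (H, G, I as helix; E, B as strand; everything else as rest) is an assumption made for illustration, as are the function names.

```python
# Sketch: project per-residue structure assignments onto six 1D structure states.
# Assumption (not stated in the text): DSSP codes H, G, I count as helix and
# E, B count as strand; all other codes count as non-regular structure ("L").
HELIX_CODES = {"H", "G", "I"}
STRAND_CODES = {"E", "B"}
BURIED_MAX_REL_ACC = 9.0  # percent; from the text: buried if accessibility <= 9%


def six_state(sec_struct: str, rel_acc: float) -> str:
    """Return one of Hb, He, Eb, Ee, Lb, Le for a single residue."""
    if sec_struct in HELIX_CODES:
        ss = "H"
    elif sec_struct in STRAND_CODES:
        ss = "E"
    else:
        ss = "L"
    burial = "b" if rel_acc <= BURIED_MAX_REL_ACC else "e"
    return ss + burial


def encode_profile(sec_structs, rel_accs):
    """Encode a whole chain as a list of six-state symbols."""
    if len(sec_structs) != len(rel_accs):
        raise ValueError("secondary structure and accessibility lengths differ")
    return [six_state(s, a) for s, a in zip(sec_structs, rel_accs)]
```

The same encoding would apply to predicted and to observed profiles, so that both strings share one alphabet before alignment.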

1D structure conserved between remote homologues. Both secondary structure and solvent accessibility are conserved within sequence families (Rost & Sander, 1994b; Rost et al., 1994). We found that both secondary structure and relative accessibility were also conserved between remote homologues (Fig. 2; note that the averages became less representative for lower counts at levels > 18% sequence identity). This conservation is the precondition for fold recognition by 1D structure alignments. Environment-based potentials rely on the same conservation (Bowie et al., 1990; Bowie et al., 1991; Lüthy et al., 1991; Lüthy et al., 1994; Zhang & Eisenberg, 1994), and they use a comparable description of the input states. What is the difference between potential-based threading and 1D structure alignment?

Environment-based potentials: same states, yet principally different method. The similarity between the 1D prediction-based alignment sketched here and environment-based potentials (Bowie et al., 1990; Bowie et al., 1991; Lüthy et al., 1991; Lüthy et al., 1992; Wilmanns & Eisenberg, 1993; Zhang & Eisenberg, 1994) may lead to confusion between the two principally different approaches. In contrast to threading based on local environment profiles, 1D prediction alignment operated in a global way: the entire 1D structure strings were first predicted and then globally aligned. In other words, the resulting prediction was the optimal alignment for the entire fractions of the folds aligned. Furthermore, prediction-based threading was, in principle, not limited by the knowledge about any structure: alignments of two predicted strings of 1D structure allow the detection of remote homology between two proteins of unknown 3D structure.

Free parameters for dynamic programming

Free parameters for dynamic programming. The predicted strings were aligned based on a Smith-Waterman type dynamic programming algorithm (Smith & Waterman, 1981). This algorithm was implemented in the program MaxHom (Sander & Schneider, 1991; Schneider, 1994). The following free parameters had to be adjusted: (i) the similarity matrix, and (ii) the penalties associated with the introduction of gaps in the alignment.

Similarity matrix for six states. In order to correctly align protein sequences, physico-chemical properties of amino acids have to be taken into account by weighting matches between residue pairs according to their physico-chemical similarity, i.e., by a 20 x 20 matrix M, the component M_ij of which determines the score for a match at a given position between state i in the first string and state j in the second string (McLachlan et al., 1984; Pearson & Lipman, 1988; Altschul et al., 1990; Henikoff & Henikoff, 1992; Lawrence et al., 1993). For sequence alignments, some matrices perform better than others, but none is clearly best in all cases (Henikoff & Henikoff, 1993). The same applied to 1D structure alignment. Various strategies were explored to find an optimal matrix (Rost, 1995a; Rost, 1995b). Here we used a matrix refined starting from database counts (Fig. 3):

for all i, j = Hb (buried helix), He (exposed helix), Eb (buried strand), Ee (exposed strand), Lb (buried non-regular structure), Le (exposed non-regular structure), i.e., all states for the first and second string in the alignment. <x> described the average of variable x (particular residue in state i in first structure and in state j in second structure) over all states i and j , and f were frequencies derived from structural alignments of remote homologues. Finally, we simplified the resulting matrix by making it symmetric and slightly more balanced (Fig. 3).

Similarity for 120 states. The combination of information from 1D structure and sequence was accomplished by simply combining the 1D structure similarity matrix described above with a McLachlan (McLachlan et al., 1984) or a Blosum62 (Henikoff & Henikoff, 1992) exchange matrix:

where u = 0 - 100 tuned the percentage of 1D structure contribution to the final alignment score E (note that u = 0 corresponded to a simple sequence alignment; u = 100 marked an alignment based on 1D structure only). Finally the matrix was scaled linearly such that:

with the following choices: smin = -1 ; smax = 1 for u = 50 and a McLachlan exchange matrix; smin = -1 ; smax = 2 for u = 50 and a Blosum62 exchange matrix; and smin = -1 ; smax = 3 for u = 0 , i.e., the pure sequence alignment with a McLachlan matrix. The effect of, e.g., increasing smax was to lengthen the alignments. The values of smin and smax were not optimised with respect to the test set. Values that were too small resulted in alignments that were obviously too short (e.g. Blosum62/1D structure with smax = 1 resulted in alignments of average length <20).
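The combination and rescaling steps can be sketched as follows. The sketch assumes that the combination is a plain weighted sum with weight u/100 on the 1D structure term, and that the linear scaling maps the smallest matrix element to smin and the largest to smax; both are plausible readings of the description, not confirmed forms of eqs. (2) and (3). The dictionary layout and function names are illustrative only.

```python
# Sketch: mix a six-state structure matrix with a 20-state sequence matrix
# (giving 120 x 120 combined states), then rescale linearly to [smin, smax].
# Assumption: weighted sum with weight u/100 on structure, (100-u)/100 on sequence.

def combine_matrices(struct_m, seq_m, u):
    """struct_m[(s1, s2)] and seq_m[(a1, a2)] are pair scores; u is in 0..100."""
    if not 0 <= u <= 100:
        raise ValueError("u must lie between 0 and 100")
    w = u / 100.0
    combined = {}
    for (s1, s2), m_struct in struct_m.items():
        for (a1, a2), m_seq in seq_m.items():
            combined[((s1, a1), (s2, a2))] = w * m_struct + (1.0 - w) * m_seq
    return combined


def rescale(matrix, smin, smax):
    """Map the smallest element to smin and the largest to smax, linearly."""
    lo, hi = min(matrix.values()), max(matrix.values())
    if hi == lo:
        raise ValueError("matrix is constant and cannot be rescaled")
    scale = (smax - smin) / (hi - lo)
    return {key: smin + scale * (value - lo) for key, value in matrix.items()}
```

With u = 0 the combined score reduces to the sequence term and with u = 100 to the structure term, matching the two limiting cases named in the text.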

Gap open and gap elongation penalty. The optimal choice of gap penalties depends on the context, i.e., the particular alignment pair (Vingron & Waterman, 1994). For an alignment of one guide string against a list of database strings, there is a trade-off between coverage (correct hits found vs. all possible correct hits) and accuracy (correct hits vs. all hits found) of detection for the choice of the gap parameters go (penalty for opening a gap) and ge (penalty for continuing an open gap). Here, results were compiled for various gap open penalties. The relative values of the two were found to be of marginal importance; all results were derived for: ge = 0.1 x go .
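To make the role of the two gap parameters concrete, the following is a minimal Smith-Waterman type local alignment with affine gaps. It is a generic textbook sketch, not the MaxHom implementation. It assumes that a gap of n residues costs go + n x ge, and it uses the ratio ge = 0.1 x go from the text as the default; the function name and the scoring callback are illustrative.

```python
# Sketch: local alignment score with affine gap penalties (Gotoh-style recursion).
# Assumption: a gap of n residues costs go + n * ge.

def local_align_score(s1, s2, score, go, ge=None):
    """Best local alignment score of two state strings.

    score(a, b) returns the similarity of states a and b;
    go is the gap open penalty, ge the gap elongation penalty.
    """
    if ge is None:
        ge = 0.1 * go  # ratio used for all results in the text
    neg_inf = float("-inf")
    n, m = len(s1), len(s2)
    best_here = [[0.0] * (m + 1) for _ in range(n + 1)]    # any alignment ending at (i, j)
    gap_in_s2 = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # s1[i-1] aligned to a gap
    gap_in_s1 = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # s2[j-1] aligned to a gap
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            gap_in_s2[i][j] = max(best_here[i - 1][j] - go - ge,
                                  gap_in_s2[i - 1][j] - ge)
            gap_in_s1[i][j] = max(best_here[i][j - 1] - go - ge,
                                  gap_in_s1[i][j - 1] - ge)
            match = best_here[i - 1][j - 1] + score(s1[i - 1], s2[j - 1])
            # The zero lets a local alignment restart anywhere.
            best_here[i][j] = max(0.0, match, gap_in_s2[i][j], gap_in_s1[i][j])
            best = max(best, best_here[i][j])
    return best
```

Raising go in this sketch makes the gapped path less attractive than a mismatch or a shorter ungapped alignment, which is the coverage-versus-accuracy trade-off described above.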

Final alignment score. The similarity between two strings was given by:

where M was the similarity matrix of eq. (2), evaluated for the pair of 1D structure states found at each aligned residue position k in string S1 (search sequence) and string S2 (aligned sequence); Lali was the length of the alignment; and the two gap terms used the number of gaps and the total length of all gaps, i.e., the total number of residues inserted or deleted. The alignment score thus defined depended strongly on characteristics of the two strings, e.g., the alignment length and the compositions of secondary structure, accessibility and amino acids (Karlin et al., 1991). To render a score comparable between different proteins, the simple alignment score was normalised to a z-score:

where <E> was the alignment score averaged over a background distribution of alignments, and σ the standard deviation of that distribution. The appropriate choice of a background distribution is crucial for fold recognition by potential-based threading (Sippl & Weitckus, 1992; Bryant & Lawrence, 1993; Sippl & Jaritz, 1994). We used the simplest model for the background distribution, given by all alignments resulting from a search of one protein against a database (we used 723 sequence-representative protein chains as search set (Rost WWW, 1996)). Note that a z-score compiled based on such a distribution did not change the ranking of alignment hits determined by the alignment score E .
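The score and its normalisation can be sketched as below. The form of the gap term (number of gaps times go plus total gap length times ge, subtracted from the summed matches) is inferred from the quantities the text names, and the use of the population standard deviation is an assumption; the function names are illustrative.

```python
# Sketch: alignment score E and z-score over a background of scores from one search.
# Assumptions: E = sum of matrix matches - n_gaps * go - gap_length * ge,
# and sigma is the population standard deviation of the background.
from statistics import mean, pstdev


def alignment_score(aligned_pairs, matrix, n_gaps, gap_length, go, ge):
    """aligned_pairs: list of (state in S1, state in S2) over the aligned positions."""
    match_sum = sum(matrix[(a, b)] for a, b in aligned_pairs)
    return match_sum - n_gaps * go - gap_length * ge


def z_scores(scores):
    """Normalise all scores from one database search against their own distribution."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        raise ValueError("background distribution has zero variance")
    return [(e - mu) / sigma for e in scores]
```

Because every score from one search is shifted and divided by the same two numbers, the order of the hits is unchanged, which is the property noted at the end of the paragraph above.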

Evaluation of prediction accuracy

Requirements for evaluating prediction accuracy. An appropriate evaluation of performance accuracy required the following steps. (1) A learning set was used to choose free parameters and to evaluate the method in a first round of preliminary tests (training neural networks for prediction of 1D structure; choosing values of similarity matrix and alignment parameters). (2) A generalisation set was used to finally evaluate the performance (only tested at end with existing 'final' method). (3) Scores were defined to measure the accuracy of detecting remote homologues (rank in list). (4) Performance was compared to published results from other threading methods, sequence alignments, and random predictions.

List of proteins to be threaded and list of resulting alignments. It is not sufficient to test the performance of threading on single cases, as results fluctuate strongly between different proteins. Thus, we based our analysis on a set of 89 unique protein folds (Table 1). Each protein was aligned against a fold library of 723 sequence-unique protein chains (Rost WWW, 1996) for which the 1D structure strings were assigned by DSSP (Kabsch & Sander, 1983) from experimental coordinates. (Note: a similar analysis of threading methods on a large data set has recently been accomplished by Daniel Fischer and David Eisenberg (Fischer & Eisenberg, 1996; Fischer et al., 1996); and by one of us (Rost, 1995b)).

Cross validation and parameter optimisation. The first step of prediction-based threading was to predict 1D structure. As the evaluation of the accuracy required the use of proteins of known 3D structure, we had to ensure that the knowledge about 3D structure was not used for the 1D prediction. This was achieved by using prediction networks that had been trained on proteins with less than 25% pairwise sequence identity to the predicted protein (cross validation). Furthermore, the free parameters for the dynamic programming algorithm were optimised before the final results were compiled. This was achieved by varying the free parameters based on a smaller data set of 46 non-unique protein structures (list in (Rost, 1995b)).

Worst-case scenario: random prediction and sequence alignment. The least accurate method would be to predict remote homologues at random. For the set of 89 test examples (Table 1) and the search set of 723 chains (Rost WWW, 1996) the chance to randomly pick a correct remote homologue was about 2%. The success of relatively straightforward sequence alignment methods in detecting remote homologues will be discussed (Results).

Measuring accuracy of fold recognition. Given a list of true remote homologues and another of predicted homologues (dubbed hit list, i.e., all alignments of the search sequence with the database of folds), the simplest way to define prediction accuracy is the cumulative percentage of correct predictions up to rank R :

Here, the count at each rank r was the number of correct first hits at that rank of the list of predicted homologues, and the total was normalised by the number of all proteins in the test set (here 89). For example, Q(1) marked the percentage of correct first hits at rank 1. The limitation to correct first hits resulted from the attempt not to bias the number by families with many remote homologues. The strength of the prediction varied between different threading experiments. To measure the accuracy obtained on subsets selected according to a fixed z-score (eq. (5)), the following score was defined:

This score, Cor(θ), compared the number of correct hits found at ranks 1-R satisfying the constraint z > θ with the number of all hits found at ranks 1-R with z > θ. The corresponding measure for the coverage was defined by:

This measure, Cov(θ), related the hits with z > θ to the number of test proteins. Cor(θ) and Cov(θ) determined the trade-off between accuracy and coverage. Results will be given for first ranks (R=1) only. The definitions of coverage and accuracy for a given cut-off address the following questions: what is the expected accuracy of finding correct homologues if the hit list is cut at rank R and at a z-score > θ; and for which proportion of the proteins are predictions made at the given cut-off? Note that an alternative definition refers to ALL hits predicted at a given z-score threshold, independent of the rank (for simplicity, we ignored this alternative here).
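The three measures can be sketched in a few lines. The sketch assumes that each test protein comes with a ranked hit list of (z-score, is_correct) pairs, that Q(R) counts proteins whose first correct hit lies at rank R or better, and that coverage divides the number of hits above the threshold by the number of test proteins; these readings follow the verbal definitions, and the names are illustrative.

```python
# Sketch: cumulative accuracy Q(R), thresholded accuracy Cor(theta) and
# coverage Cov(theta), all in percent.
# Assumption: hit_lists[p] is the ranked list of (z_score, is_correct) for protein p.

def q_cumulative(hit_lists, max_rank):
    """Percentage of proteins whose first correct hit is at rank <= max_rank."""
    found = sum(1 for hits in hit_lists
                if any(ok for _, ok in hits[:max_rank]))
    return 100.0 * found / len(hit_lists)


def cor_cov(hit_lists, theta, max_rank=1):
    """Return (Cor, Cov) for hits at ranks 1..max_rank with z > theta."""
    n_hits = 0
    n_correct = 0
    for hits in hit_lists:
        for z, ok in hits[:max_rank]:
            if z > theta:
                n_hits += 1
                n_correct += 1 if ok else 0
    cor = 100.0 * n_correct / n_hits if n_hits else float("nan")
    cov = 100.0 * n_hits / len(hit_lists)
    return cor, cov
```

For max_rank = 1, the case reported in the text, coverage is simply the share of test proteins whose first hit exceeds the threshold; for larger ranks this sketch could exceed 100% and would need a different normalisation.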

Measuring accuracy of remote homology modelling. Given a correctly detected remote homologue, how accurate was the resulting model for 3D structure? The accuracy of a 3D model resulting from the 1D structure alignment depends, in principle, on two factors: (1) the correctness of the alignment, and (2) the way in which the homology modelling procedure is performed.

Data sets used for validation

Set of 89 unique folds. As of early 1996, there were more than 200 unique protein folds in PDB (Holm & Sander, 1994). These were used as a starting point to compile a set of 89 proteins used to evaluate the accuracy in detecting remote homologues (Rost WWW, 1996). The reduction from more than 200 unique folds to 89 proteins was explained by exclusion of: (i) membrane proteins, (ii) proteins without a remote homologue in the database, (iii) proteins for which the homologues had low significance (exclusion of cases with a DALI z-score < 3; (Holm & Sander, 1995)), and (iv) pairs which could not be found in the current HSSP database (Schneider & Sander, 1996). The resulting list of remote homologues comprised a rather difficult test set, as it included many cases for which the structural alignment covered only fragments of the two aligned proteins rather than extended over the entire 'folds'. Consequently, the results provided conservative estimates for the accuracy of 1D structure threading. The 723 sequence-unique proteins used to search remote homologues are listed on the WWW (Rost WWW, 1996).

Standard-of-truth: 3D homologues. When evaluating the results, a hit was counted as a correctly identified remote homologue if the structure alignment program DALI (Holm & Sander, 1993) considered the two structures significantly similar. DALI alignments are stored for a set of representative proteins in the FSSP database (Holm & Sander, 1994). Pairs with significant pairwise sequence identity (>25%; higher for alignments shorter than 80 residues (Sander & Schneider, 1991)) were excluded from the evaluation. Those 'close homologues' were excluded because sequence alignment methods mostly find the correct homologue in the range above 25-30% pairwise sequence identity.

Data sets for comparison with other methods. Finally, we compiled the results of our method based on three tiny sets of proteins for which results were published in the literature: (1) a set of 11 proteins used by David Jones, Willy Taylor and Janet Thornton (Jones et al., 1992) to evaluate the performance of the program THREADER (Table 4); (2) a set of 11 representative protein families used by Rob Russell, Richard Copley and Geoff Barton (Russell et al., 1996) to evaluate the performance of the programs THREADER and MAP (note: we frequently used the respective representative of that family for which we found structural alignments in the current FSSP release; Table 5); (3) a set of 11 proteins used for the Asilomar 1994 prediction contest (Lemer et al., 1995; Moult et al., 1995) (Table 6).




Results

Fold recognition

Loss of information by projection onto 1D limiting factor. Did the 1D structure profiles capture enough information for successful threading? When threading 1D structure profiles taken from the DSSP (Kabsch & Sander, 1983) assignments based on coordinates of known 3D structures (in other words, completely correct 'predictions'), the first hit was correct in 35% of all test cases (PDB vs. PDB, u = 100, Table 1). When using real predictions from PHD (at an average accuracy of about 70%), the first hit was correct in 23% of the cases (PHD vs. PDB, u = 100, Table 1). Thus, the limited prediction accuracy of PHD (70%) reduced detection accuracy by 'only' 12 percentage points, whereas the loss of information by projecting 3D structure onto 1D accounted for 65 percentage points, i.e., the drop from 100% to 35% (by definition, full knowledge of 3D structure results in 100% accurate detection).

Significant improvement by including sequence information. When 1D structure and sequence information were combined (eq. (2)), detection accuracy increased markedly: for a 50:50 mixture of 1D structure and sequence (u = 50 in eq. (2)), 29% of the first hits were correct (Table 1); and in half of the test cases, the correct homologue was detected among the first five alignment hits (Fig. 4). The choice of a particular sequence matrix (McLachlan vs. Blosum62) yielded different alignments. However, the overall accuracy for the entire test set was similar (Fig. 4, Table 1). How did these results compare to pure sequence alignment or random predictions? For a random prediction, the first hit would be correct in 2% of the cases. For a sequence alignment method (MaxHom with McLachlan matrix), the first hit was correct in about 15% of all cases (Table 1). Thus, the combination of 1D structure and sequence information clearly improved detection accuracy with respect both to a simple sequence alignment and to an alignment based exclusively on 1D structure.

Stronger hits more likely to be correct. What was the expected accuracy if only first hits with a z-score above a certain cut-off value were regarded as correct? When the alignment list was cut off at a z-score > 4.5 (eq. (5)), the first hit was correct in 88% of the cases (Table 1). At this higher level of accuracy only 10 out of the 89 test proteins were detected (Fig. 5). The correlation between z-score and prediction accuracy illustrated, in particular, the strength of prediction-based threading, as opposed to simple sequence alignment. The sequence alignment used as reference resulted in relatively many correct first hits (15%), but it was very difficult to separate the wheat from the chaff: for 25% of the first hits the z-score was above 4.5, and of these only 30% were predicted correctly (Table 1). In other words, sequence alignments reached a similar level of accuracy as prediction-based threading for every fourth protein.

Successful detection of remote homology in absence of 3D information. One of the features of prediction-based threading is that the detection of remote homology is not restricted to knowing the structure of the target. Instead, a sequence of unknown structure can be threaded through a library of predicted 1D structure assignments. To evaluate the performance of prediction-based threading in absence of 3D knowledge we made cross-validated predictions for more than 700 unique proteins in our fold library (Rost WWW, 1996). When mixing sequence and structure information the result was surprisingly not much inferior to the case of using known 3D structures: 27% of the hits were correctly detected at first rank (PHD vs. PHD; Table 1).

Better recognition of entire folds than of shorter fragments. The test set of 89 proteins was deliberately chosen to answer the question: how accurately can the method detect any remote homologous fragment in a library of protein structures (remote homology detection)? An easier task is to detect similarities between entire folds (fold detection). We generated subsets of our full test set by excluding all cases for which the structural alignments covered only a small fraction of the aligned pair. We introduced a cut-off parameter that distinguished between 'alignments of entire folds' and 'alignments of fragments' (Fig. 6 caption). For example, if the goal was to detect similarities that cover at least 70% of the lengths of both proteins, the expected accuracy (correct first hit) rose to 50% (Fig. 6). Thus, prediction-based threading was clearly more successful in capturing homologies between entire folds than in detecting homologies between local regions.

No general optimum for alignment parameters. The detection accuracy depended marginally on the gap open penalty chosen. For aligning 1D structure only (u=0) gap open penalties > 3 performed better; and for mixing sequence and structure (u=50) penalties <= 2 yielded more accurate results. However, although the overall detection accuracy did not depend crucially on the choice of alignment parameters, the individual alignments did. Could we propose an optimal choice of the alignment parameters for a particular test case? Unfortunately, we did not succeed in deriving optimal parameters based on the specific case. This problem reflected the discrepancy between an expert-controlled (focusing on one prediction example and tuning parameters) and an automatic threading experiment (standard choice for free parameters). A similar problem is encountered by all threading methods. A general optimal choice for free parameters is difficult as threading is complicated. All estimates for performance accuracy given here provide a conservative worst-case estimate for the expected accuracy of automatic threading experiments.
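
The role of the two free parameters discussed above can be sketched with a small dynamic programming example. This is an illustration only and not the scoring scheme of the method: it assumes a linear mixture of a sequence term and a 1D structure term weighted by u (in percent), identity scores of +1/-1 instead of the comparison matrices, a global alignment, and a single gap penalty per gapped position instead of separate gap open and gap elongation penalties. All names are hypothetical.

```python
def pair_score(aa_a: str, ss_a: str, aa_b: str, ss_b: str, u: float) -> float:
    """Mix sequence and 1D structure agreement; u is the percentage of sequence."""
    w = u / 100.0
    seq_term = 1.0 if aa_a == aa_b else -1.0   # assumed toy sequence score
    str_term = 1.0 if ss_a == ss_b else -1.0   # assumed toy structure score
    return w * seq_term + (1.0 - w) * str_term


def align_score(seq_a: str, ss_a: str, seq_b: str, ss_b: str,
                u: float = 50.0, gap: float = 2.0) -> float:
    """Optimal global alignment score (linear gap penalty, two-row DP)."""
    if len(seq_a) != len(ss_a) or len(seq_b) != len(ss_b):
        raise ValueError("sequence and structure strings must have equal length")
    n, m = len(seq_a), len(seq_b)
    prev = [-gap * j for j in range(m + 1)]          # row for the empty prefix of A
    for i in range(1, n + 1):
        cur = [-gap * i] + [0.0] * m
        for j in range(1, m + 1):
            s = pair_score(seq_a[i - 1], ss_a[i - 1], seq_b[j - 1], ss_b[j - 1], u)
            cur[j] = max(prev[j - 1] + s,            # align residue i with residue j
                         prev[j] - gap,              # gap in B
                         cur[j - 1] - gap)           # gap in A
        prev = cur
    return prev[m]


# Two hypothetical proteins with unrelated sequences but identical 1D structure.
for u in (0.0, 50.0, 100.0):
    print(u, align_score("ACDE", "HHEE", "WYFK", "HHEE", u=u))
```

With u=0 only the 1D structure strings are compared, so the unrelated sequences do not lower the score; with u=100 the sketch reduces to a plain sequence alignment.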

Remote homology modelling

Few correct predictions of 3D structure. Given a correctly detected remote homologue, how accurate was the alignment? This question was addressed in two ways. First, the predicted alignments were compared to the structural alignments. For the hits correctly detected at ranks 1 and 2, the average shift score (eq. (9)) was 38%, the average identity of the residues between predicted and structural alignments was 33%, and the average shift was 11 residues (Table 3). More than half of the hits correctly detected at first rank reached an alignment shift score (eq. (9)) above 50% (15 out of 25); and one half (13 out of 25) had more than 50% of the residues identical to the structural alignment (Table 3). The three examples given for alignments (Fig. 7) were representative in that the average identity between structural and predicted alignments was 31%, i.e., slightly less than the average over all 35 test examples correctly detected among the first two hits. Second, to evaluate the alignment and consequently the accuracy in predicting 3D structure, we simply superimposed the backbone model resulting from the predicted alignment onto the known structure of the search protein. For only six of the test cases correctly detected at first rank (total of 25) did the final model for the 3D structure of the threaded sequence deviate by less than 2 Å (rmsd) from the optimal superposition of the two structures. Thus, in most cases for which the remote homologue was detected as first hit in the alignment list, the alignment and consequently the resulting model were not at all correct. Could correct models be distinguished from false ones? An advantage of the prediction-based threading could be that knowledge-based potentials capture different information. Thus, false final models could be spotted, in particular by potential-based methods designed to detect inconsistencies in 3D structures (Sippl, 1993b).
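
The comparison of a predicted with a structural alignment can be illustrated as follows. The sketch assumes that an alignment is stored as a mapping from residue numbers of the search protein to residue numbers of the aligned protein, that 'identity' is the percentage of structurally aligned pairs reproduced exactly, and that 'shift' is the mean absolute displacement over residues aligned in both alignments. These definitions are assumptions made for illustration; the alignment shift score of eq. (9) is not reproduced here.

```python
from typing import Dict, Tuple

Alignment = Dict[int, int]  # residue of search protein -> residue of aligned protein


def compare_alignments(predicted: Alignment, structural: Alignment) -> Tuple[float, float]:
    """Return (percentage of identically aligned residues, mean shift in residues)."""
    if not structural:
        raise ValueError("structural alignment must not be empty")
    # Pairs of the structural alignment that the prediction reproduces exactly.
    identical = sum(1 for q, t in structural.items() if predicted.get(q) == t)
    identity = 100.0 * identical / len(structural)
    # Displacement for residues that are aligned in both alignments.
    shifts = [abs(predicted[q] - t) for q, t in structural.items() if q in predicted]
    mean_shift = sum(shifts) / len(shifts) if shifts else float("nan")
    return identity, mean_shift


# Hypothetical toy alignments.
structural = {1: 11, 2: 12, 3: 13, 4: 14, 5: 15}
predicted = {1: 11, 2: 12, 3: 15, 4: 16, 6: 18}
identity, mean_shift = compare_alignments(predicted, structural)
print(f"identity = {identity:.0f}%, mean shift = {mean_shift:.1f} residues")
```

In this toy case two of the five structurally aligned residues are reproduced, and the residues aligned in both alignments are displaced by one position on average.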

Comparison to other threading methods

Find-self test of historical importance. Most publications on threading methods report results on a few test cases (Bowie et al., 1990; Finkelstein & Reva, 1991; Lüthy et al., 1991; Godzik & Skolnick, 1992; Lüthy et al., 1992; Sippl & Weitckus, 1992; Blundell & Johnson, 1993; Bryant & Lawrence, 1993; Nishikawa & Matsuo, 1993; Skolnick et al., 1993; Wilmanns & Eisenberg, 1993; Abagyan et al., 1994; Goldstein et al., 1994; Lathrop & Smith, 1994) or explore the ability of data-base derived potentials to recognise the native structure among a set of decoys (find-self test; Hendlich et al., 1990; Sippl & Weitckus, 1992; Ouzounis et al., 1993; Bauer & Beyer, 1994). The find-self test may seem to be trivial, but it is not successfully managed by force fields based on first principles (Novotny et al., 1984; Novotny et al., 1988).

A very favourable set of 11 proteins. One method that was initially tested on 12 examples is the potential-based threading version published by Jones, Taylor and Thornton (Jones et al., 1992). For all 12 test cases the method is reported to have found the correct homologue at the first rank of the alignment, i.e., the detection accuracy is suggested to be Q(1) = 100%. But is the comparison 100% (Jones et al.) vs. 29% (here) appropriate? The two numbers were derived from different test sets, and, as pointed out, the 89 proteins chosen by us to compile expected accuracy represented a difficult test set. When we applied our method to 11 of the proteins of Jones et al. (the 12th was not in our database), the correct homologue was always detected at the first rank (Table 4). In other words, on the same data set both methods yielded an accuracy of Q(1) = 100%.

Another favourable set of 11 proteins. More recently Russell et al. (1996) evaluated their own prediction-based threading method (MAP) and the THREADER program of Jones et al. (Jones et al., 1992) based on another small set of 11 proteins. For the first hit they reported an accuracy of 37-45% (depending on the threshold used for defining homologous structures) for their method MAP and of only 9-19% for THREADER (Jones et al., 1992). On the same 11 families, our prediction-based threading resulted in 78% correct first hits (Table 5). The reported quality of the alignments (measured by percentage identity of residues between the predicted and the structural alignment) was 15% for MAP and 11% for THREADER (Russell et al., 1996). For our prediction-based threading the average percentage of correctly aligned residues was 27% (Table 5). Thus, although the set used by Russell et al. (1996) was much more conservative than the one used by Jones et al. (1992), it still yielded very optimistic estimates for prediction accuracy when compared to the performance on our set of 89 proteins. Did we select a set that yielded too pessimistic estimates of performance accuracy?

The 11 Asilomar 1994 targets. A final test of our method on 11 proteins that were used as threading targets at the first Asilomar meeting for the evaluation of prediction methods (Lemer et al., 1995; Moult et al., 1995) suggested that the estimates derived on our initial set of 89 proteins might be closer to the 'reality' for using automated threading than those derived on favourable test sets. For the Asilomar 11 we correctly detected the remote homologues at first rank in four cases (i.e. 36%, Table 6). The average percentage of correctly aligned residues was 21%; the average shift 9 residues; and the alignment shift score on average AS = 26% (eq. (9)). Thus, the alignments were mostly wrong. How did the results compare to the blind predictions made for the meeting by others? The best methods performed better than our method: (1) the expert-driven usage of THREADER by David Jones and colleagues (Jones et al., 1995) detected 5 out of 9 proteins correctly at first rank; and (2) the best alignments of the potential-based threading method perfected by Manfred Sippl and colleagues (Flöckner et al., 1995) were clearly better than our best ones.

Remote homology modelling. How did prediction-based threading perform in terms of remote homology modelling compared to potential-based threading approaches? In the literature the correctness of the alignment and consequently the 3D model obtained by threading has been evaluated for very few cases. One common example is the homology between the heat shock protein 70 (PDB code: 2hsc) and the A chain of the muscle protein actin (PDB code: 2atnA). Searching with 2hsc, the 1D-profile threading brought up 2atnA at first rank. The predicted alignment agrees for 44% of the residues with the structural alignment taken from FSSP (Holm & Sander, 1994) (Fig. 8). For a threading method based on energy calculations, Abagyan, Frishman and Argos (1994) published the predicted alignment for the last 232 residues (Fig. 8) of the same pair. They report that the alignment was wrong for the C-terminal part of the molecules: for the 232 aligned residues, only 14% of the residues in their alignment are identical to the structural alignment. Interestingly, for the same region the prediction-based threading has 22% of the residues identical to the structural alignment, i.e., is clearly worse than the average for the entire protein.




Conclusion

Successful fold recognition by threading predicted 1D structure profiles. Fold motifs could be detected automatically by aligning predicted and known 1D structure profiles (secondary structure and solvent accessibility). However, even for an - in practice unrealistic - optimal prediction of 1D structure (assignment from known coordinates), the first hit was correct in only 35% of all test cases (Q(1), Table 1). A realistic prediction of 1D structure (obtained by cross-validated PHD predictions) yielded 23% detection accuracy. This result suggested two conclusions. (1) The loss of information by projecting 3D information onto 1D structure profiles was the bottle-neck of the method. To illustrate this problem: at least 16 unrelated structures contain the secondary structure motif 'H-E-E-H-E-E'. An additional incorporation of information about inter-residue distances may open that bottle-neck. (2) Further improvements of 1D structure predictions could improve the accuracy of prediction-based threading significantly.
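
The loss of information mentioned under (1) can be made concrete with a short sketch. It collapses a per-residue secondary structure string (H = helix, E = strand, L = other) into the succession of helix and strand segments, so that structures differing in segment lengths and loop lengths become indistinguishable. The two input strings are hypothetical and do not correspond to any of the structures referred to above.

```python
from itertools import groupby


def segment_motif(ss: str, min_len: int = 1) -> str:
    """Collapse a per-residue string such as 'LHHHHLEEEL' into a motif such as 'H-E'.

    Only helix (H) and strand (E) segments of at least min_len residues are kept;
    segment lengths and loop lengths are discarded.
    """
    segments = [state for state, run in groupby(ss)
                if state in "HE" and len(list(run)) >= min_len]
    return "-".join(segments)


# Two hypothetical proteins with different segment and loop lengths.
protein_a = "LLHHHHHHLLEEEELLLEEEELLHHHHHLLEEELLEEELL"
protein_b = "HHHHHHHHHHHHLEEEEEEELLLLLLEELHHHLLLLEEEEEEELEEL"

print(segment_motif(protein_a))
print(segment_motif(protein_b))
```

Both strings reduce to the same succession of segments although they differ at most positions, which is the kind of degeneracy that additional information about inter-residue distances could resolve.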

Better fold recognition by combining 1D structure profiles and sequence information. When 1D structure profiles were combined with sequence information, detection accuracy improved significantly: about 29% of all first hits were correct (Table 1), and in about 53% of the test cases the correct homologue was found among the first five hits (Fig. 4). Prediction accuracy for the threading was significantly higher than for the simple sequence alignments used as reference (15% correct first hits, Fig. 4). Furthermore, accuracy could be increased by focusing on the subset of those hits which were predicted with higher z-scores. For example, for the 10% of all proteins predicted at z > 4.5 (eq. (5)) the expected accuracy of correctly detecting the fold at first rank rose to 88% (Table 1, Fig. 5). The prediction-based threading method was significantly better at detecting homologous folds than at detecting homologous fragments. For example, for a test set with true homologues for which the alignment covered 70% of both aligned sequences, one half of the first hits were correct (Fig. 6). A feature of prediction-based threading that may become particularly interesting for applications in practice is that remote homology can successfully be detected between protein pairs without knowledge of 3D structure: when using 1D structure predictions as fold library, we correctly detected the remote homologue in 27% of the test cases at first rank (Table 1).

Prediction-based threading competitive with other threading techniques. A potential-based threading for which results were initially published for more than ten proteins was THREADER (Table 1; Jones et al., 1995). However, a recent analysis based on another small set of 11 structure families (Russell et al., 1996) suggested a significantly lower level of accuracy for THREADER than previously estimated by Jones et al. (below 20%). The prediction-based threading method of Russell et al. reached 37% to 45% accuracy (correct first hit). For the same 11 families our method had 75% correct first hits (one standard deviation > 15%; Table 5). When - in retrospect - evaluating our method on the 11 threading targets used for the Asilomar 1994 prediction contest (Moult et al., 1995) we had four first hits correct (36%). However, other methods performed better (Lemer et al., 1995): an expert-driven usage of THREADER had more correct first hits (Jones et al., 1995), and the best alignments of the potential-based threading by Manfred Sippl and colleagues were more accurate than ours (Flöckner et al., 1995). David Fischer and David Eisenberg have recently developed a method for prediction-based threading that is very similar to the one presented here (Fischer & Eisenberg, 1996; Fischer et al., 1996). They evaluated their own and previous potential-based threading methods based on a large set of 64 remote homologues and reported 31% correct hits for potential-based threading (Bowie et al., 1990; Bowie et al., 1991; Lüthy et al., 1991; Lüthy et al., 1992) and 48% correct hits for prediction-based threading (Fischer & Eisenberg, 1996). This confirms the conclusions suggested by the results presented here and previously (Rost, 1995a; Rost, 1995b): in correctly identifying the first hit, prediction-based threading is at least as accurate as potential-based threading.

Correct prediction of 3D structure by remote homology modelling for single cases. The correct detection of remote homology is the precondition for remote homology modelling. However, correct detection does not imply correct alignments. On the contrary, for most correctly detected remote homologues the alignment was, at least, partially wrong (for one half of the hits correctly predicted at first rank the identity between predicted and structural alignment was above 50%, Table 3). The same is true for most other threading techniques (Flöckner et al., 1995; Lemer et al., 1995; Shortle, 1995; Fischer & Eisenberg, 1996; Russell et al., 1996). How can a false alignment result in the detection of the true remote homologue among a huge set of decoys? The answer remains open. However, for some single cases, prediction-based model building was shown to be successful in predicting 3D structure.

Method available by automatic prediction service. The prediction-based threading of 1D structure profiles (PHDthreader) is available via an automatic prediction service (send the word help to the internet address PredictProtein@EMBL-Heidelberg.DE, or use the World Wide Web (WWW) site http://www.embl-heidelberg.de/predictprotein/). By default, input strings (1D structure profile) are generated by a PHD prediction; however, users can also opt to provide their own predictions of secondary structure and solvent accessibility.

Will threading replace structure determination? The number of different protein folds is probably limited, no matter whether the exact number is more likely to be 1,000 (Chothia, 1992) or 10,000. This suggests that one day, when the structures of all those folds have been determined by experiment, threading tools will eventually close the sequence-structure gap by remote homology modelling. There are three reasons why this appears to be overoptimistic science fiction. (1) Correct alignments are still the exception rather than the rule. (2) Even when the alignments are correct, remote homology modelling at levels of less than 30% pairwise sequence identity is yet another unsolved problem (note that even for the much simpler cases of close homologues, i.e., for levels of 30-60% pairwise sequence identity, modelling procedures are not always successful). (3) The more unique folds are contained in the database, the more difficult the detection task will be. This was illustrated by the following experiment. We aligned our 89 test proteins against three different 'fold libraries': (a) the largest set of sequence-unique proteins as of spring 1996 (723 chains; Rost WWW, 1996), (b) the largest set of 1995 (449 chains), and (c) a set of unique folds (plus the detectable homologues, 403 chains). The percentage of correctly detected first hits decreased as the size of the data set increased: 29% (a), 31% (b) and 33% (c). This result probably stems from the fact that the selection procedure is non-linear; thus, the likelihood of random errors increases with the size of the fold library. In other words, threading is not likely to close the sequence-structure gap in the future but it can contribute to bridging it today.

Acknowledgements

First of all, thanks to Manfred Sippl (Univ. Salzburg) for discussions and help. Furthermore, thanks to Michael Braxenthaler (CARB, Washington, DC), Séan O'Donoghue (EMBL, Heidelberg), and Daniel Fischer (UCLA, Los Angeles) for helpful dialogues; and to Rob Hooft (EMBL, Heidelberg) for software assistance. Last but not least, thanks to all those who deposit protein structures and protein sequences in public databases and those maintaining high quality databases - to mention, in particular, Amos Bairoch and his group (Basel) - thereby enabling the design of prediction methods.




References

Abagyan, R., Frishman, D. & Argos, P. (1994). Recognition of distantly related proteins through energy calculations. Proteins, 19, 132-140.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol., 215, 403-410.

Bairoch, A. & Boeckmann, B. (1994). The SWISS-PROT protein sequence data bank: current status. Nucl. Acids Res., 22, 3578-3580.

Bauer, A. & Beyer, A. (1994). An Improved Pair Potential to Recognize Native Protein Folds. Proteins, 18, 254-261.

Benner, S. A., Cohen, M. A. & Gerloff, D. (1992). Correct structure prediction? Nature, 359, 781.

Benner, S. A. & Gerloff, D. (1990). Patterns of Divergence in Homologous Proteins as Indicators of Secondary and Tertiary Structure of the Catalytic Domain of Protein Kinases. Adv. Enz. Reg., 31, 121-181.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D. et al. (1977). The Protein Data Bank: a computer based archival file for macromolecular structures. J. Mol. Biol., 112, 535-542.

Blundell, T. L. & Johnson, M. S. (1993). Catching a common fold. Prot. Sci., 2, 877-883.

Bowie, J. U., Clarke, N. D., Pabo, C. O. & Sauer, R. T. (1990). Identification of protein folds: matching hydrophobicity patterns of sequence sets with solvent accessibility patterns of known structures. Proteins, 7, 257-264.

Bowie, J. U., Lüthy, R. & Eisenberg, D. (1991). A Method to Identify Protein Sequences That Fold into a Known Three-Dimensional Structure. Science, 253, 164-169.

Braxenthaler, M. & Sippl, M. (1995). Screening genome sequences for known folds. In Protein folds. A distance-based approach (Bohr, H. & Brunak, S., eds.), pp. 80-84, CRC Press, Boca Raton, Florida.

Bryant, S. H. & Lawrence, C. E. (1993). An empirical energy function for threading protein sequence through the folding motif. Proteins, 16, 92-112.

Casari, G. & Sippl, M. J. (1992). Structure-derived Hydrophobic Potential. J. Mol. Biol., 224, 725-732.

Chothia, C. (1992). One thousand protein families for the molecular biologist. Nature, 357, 543-544.

Chothia, C. & Lesk, A. M. (1986). The relation between the divergence of sequence and structure in proteins. EMBO J., 5, 823-826.

Crippen, G. M. & Maiorov, V. N. (1994). A Potential Function that Identifies Correct Protein Folds. In Protein Structure by Distance Analysis (Bohr, H. & Brunak, S., eds.), pp. 158-174, IOS Press, Amsterdam, Oxford, Washington.

De Filippis, V., Sander, C. & Vriend, G. (1994). Predicting local structural changes that result from point mutations. Prot. Engin., 7, 1203-1208.

Di Francesco, V., Garnier, J. & Munson, P. J. (1996). Improving protein secondary structure prediction with aligned homologous sequences. Prot. Sci., 5, 106-113.

Finkelstein, A. V., Gutun, A. M. & Badretdinov, A. Y. (1993). Why are the same protein folds used to perform different functions? FEBS Lett., 325, 23-28.

Finkelstein, A. V. & Reva, B. A. (1991). A search for the most stable folds of protein chains. Nature, 351, 497-499.

Fischer, D. & Eisenberg, D. (1996). Protein fold recognition using sequence-derived predictions. Prot. Sci., 5, 947-955.

Fischer, D., Elofsson, A., Rice, D. & Eisenberg, D. (1996). Assessing the performance of fold recognition methods by means of a comprehensive benchmark. In Pacific Symposium on biocomputing, Hawaii, 1996, pp. 300-318.

Flöckner, H., Braxenthaler, M., Lackner, P., Jaritz, M., Ortner, M. et al. (1995). Progress in fold recognition. Proteins, 23, 376-386.

Flores, T. P., Orengo, C. A., Moss, D. S. & Thornton, J. M. (1993). Comparison of conformational characteristics in structurally similar protein pairs. Prot. Sci., 2, 1811-1826.

Godzik, A., Kolinski, A. & Skolnick, J. (1993). De novo and inverse folding predictions of protein structure and dynamics. J. Comput. Aided Mol. Design, 7, 397-438.

Godzik, A. & Skolnick, J. (1992). Sequence-structure matching in globular proteins: application to supersecondary and tertiary structure determination. Proc. Natl. Acad. Sc. U.S.A., 89, 12098-12102.

Goldstein, R. A., Luthey-Schulten, Z. A. & Wolynes, P. (1994). A Bayesian Approach to Sequence Alignment Algorithms for Protein Structure Recognition. In 27th Hawaii International Conference on System Sciences (Hunter, L., ed.), pp. 306-315, Los Alamitos, CA: IEEE Computer Society Press, Wailea, HI, U.S.A.

Goldstein, R. A., Luthey-Schulten, Z. A. & Wolynes, P. G. (1992). Optimal protein-folding codes from spin-glass theory. Proc. Natl. Acad. Sc. U.S.A., 89, 4918-4922.

Greer, J. (1991). Comparative modeling of homologous proteins. Meth. Enzymol., 202, 239-252.

Hendlich, M., Lackner, P., Weitckus, S., Flöckner, H., Froschauer, R. et al. (1990). Identification of Native Protein Folds Amongst a Large Number of Incorrect Models. The Calculation of Low Energy Conformations from Potentials of Mean Force. J. Mol. Biol., 216, 167-180.

Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sc. U.S.A., 89, 10915-10919.

Henikoff, S. & Henikoff, J. G. (1993). Performance evaluation of amino acid substitution matrices. Proteins, 17, 49-61.

Holm, L., Ouzounis, C., Sander, C., Tuparev, G. & Vriend, G. (1993). A database of protein structure families with common folding motifs. Prot. Sci., 1, 1691-1698.

Holm, L., Rost, B., Sander, C., Schneider, R. & Vriend, G. (1994). Data based modeling of proteins. In Statistical Mechanics, Protein Structure, and Protein Substrate Interactions (Doniach, S., eds.), pp. 277-296, Plenum Press, New York.

Holm, L. & Sander, C. (1992). Evaluation of Protein Models by Atomic Solvation Preference. J. Mol. Biol., 225, 93-105.

Holm, L. & Sander, C. (1993). Protein Structure Comparison by Alignment of Distance Matrices. J. Mol. Biol., 233, 123-138.

Holm, L. & Sander, C. (1994). The FSSP database of structurally aligned protein fold families. Nucl. Acids Res., 22, 3600-3609.

Holm, L. & Sander, C. (1995). Dali: a network tool for protein structure comparison. TIBS, 20, 478-480.

Johnston, M., Andrews, S., Brinkman, R., Cooper, J., Ding, H. et al. (1994). Complete nucleotide sequence of Saccharomyces cerevisiae chromosome VIII. Science, 265, 2077-2082.

Jones, D. T., Miller, R. T. & Thornton, J. M. (1995). Successful protein fold recognition by optimal sequence threading validated by rigorous blind testing. Proteins, 23, 387-397.

Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992). A new approach to protein fold recognition. Nature, 358, 86-89.

Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers, 22, 2577-2637.

Karlin, S., Bucher, P., Brendel, V. & Altschul, S. F. (1991). Statistical methods and insight for protein and DNA sequences. Annu. Rev. Biophys. Biophys. Chem., 20, 175-203.

Laskowski, R. A., Moss, D. S. & Thornton, J. M. (1993). Main-chain bond lengths and bond angles in protein structures. J. Mol. Biol., 231, 1049-1067.

Lathrop, R. H. (1994). The protein threading problem with sequence amino acid interaction preferences is NP-complete. Prot. Engin., 7, 1059-1068.

Lathrop, R. H. & Smith, T. F. (1994). A Branch-and-Bound Algorithm for Optimal Protein Threading with Pairwise (Contact Potential) Amino Acid Interactions. In 27th Hawaii International Conference on System Sciences (Hunter, L., ed.), pp. 365-374, Los Alamitos, CA: IEEE Computer Society Press, Wailea, HI, U.S.A.

Lattman, E. E. (1994). Protein crystallography for all. Proteins, 18, 103-106.

Lattman, E. E. ed. (1995). Protein structure prediction issue. Proteins, 23, 295-462.

Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F. et al. (1993). Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science, 262, 208-214.

Lemer, C. M.-R., Rooman, M. J. & Wodak, S. J. (1995). Protein structure prediction by threading methods: evaluation of current techniques. Proteins, 23, 337-355.

Lesk, A. M. (1991). Protein Architecture - A Practical Approach. Oxford University Press, Oxford, New York, Tokyo.

Lesk, A. M. & Boswell, R. D. (1992). Homology modelling: inferences from tables of aligned sequences. Curr. Opin. Str. Biol., 2, 242-247.

Levin, J. M., Pascarella, S., Argos, P. & Garnier, J. (1993). Quantification of Secondary Structure Prediction Improvement Using Multiple Alignment. Prot. Engin., 6, 849-854.

Livingstone, C. D. & Barton, G. J. (1994). Secondary Structure Prediction from Multiple Sequence Data: Blood Clotting Factor XIII and Yersinia Protein-Tyrosine Phosphatase. Int. J. Peptide Protein Res., 44, 239-244.

Lüthy, R., Bowie, J. U. & Eisenberg, D. (1992). Assessment of protein models with three-dimensional profiles. Nature, 356, 83-85.

Lüthy, R., McLachlan, A. D. & Eisenberg, D. (1991). Secondary structure-based profiles: use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins, 10, 229-239.

Lüthy, R., Xenarios, I. & Bucher, P. (1994). Improving the sensitivity of the sequence profile method. Prot. Sci., 3, 139-146.

Maiorov, V. N. & Crippen, G. M. (1992). A contact potential that recognises correct folding of globular proteins. J. Mol. Biol., 227, 876-888.

Maxfield, F. R. & Scheraga, H. A. (1979). Improvements in the Prediction of Protein Topography by Reduction of Statistical Errors. Biochem., 18, 697-704.

May, A. C. W. & Blundell, T. L. (1994). Automated comparative modelling of protein structures. Curr. Opin. Biotech., 5, 355-360.

McLachlan, A. D., Staden, R. & Boswell, D. R. (1984). A method for measuring the non-random bias of a codon usage table. Nucleic Acids Res., 12, 9567-9575.

Miyazawa, S. & Jernigan, R. L. (1993). A new substitution matrix for protein sequence searches based on contact frequencies in protein structures. Prot. Engin., 6, 267-278.

Moult, J., Pedersen, J. T., Judson, R. & Fidelis, K. (1995). A large-scale experiment to assess protein structure prediction methods. Proteins, 23, ii-iv.

Niermann, T. & Kirschner, K. (1991). Improving the prediction of secondary structure of 'TIM-barrel' enzymes (Corrigendum). Prot. Engin., 4, 359-370.

Nishikawa, K. & Matsuo, Y. (1993). Development of pseudo-energy potentials for assessing protein 3-D-1D compatibility and detecting weak homologies. Prot. Engin., 6, 811-820.

Novotny, J., Bruccoleri, R. E. & Karplus, M. (1984). An analysis of incorrectly folded models. Implications for structure prediction. J. Mol. Biol., 177, 787-818.

Novotny, J., Rashin, A. A. & Bruccoleri, R. E. (1988). Criteria that discriminate between native proteins and incorrectly folded models. Proteins, 4, 19-30.

Oliver, S., van der Aart, Q. J. M., Agostioni-Carbone, M. L., Aigle, M., Alberghina, L. et al. (1992). The complete DNA sequence of yeast chromosome III. Nature, 357, 38-46.

Ouzounis, C., Sander, C., Scharf, M. & Schneider, R. (1993). Prediction of protein structure by evaluation of sequence-structure fitness: Aligning sequences to contact profiles derived from 3D structures. J. Mol. Biol., 232, 805-825.

Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sc. U.S.A., 85, 2444-2448.

Rost, B. (1995a). Fitting 1-D predictions into 3-D structures. In Protein folds: a distance based approach (Bohr, H. & Brunak, S., eds.), pp. 132-151, CRC Press, Boca Raton, Florida.

Rost, B. (1995b). TOPITS: Threading One-dimensional Predictions Into Three-dimensional Structures. In Third International Conference on Intelligent Systems for Molecular Biology (Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T. et al., eds.), pp. 314-321, Menlo Park, CA: AAAI Press, Cambridge, England.

Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol., 266, 525-539.

Rost, B. & Sander, C. (1992). Jury returns on structure prediction. Nature, 360, 540.

Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-599.

Rost, B. & Sander, C. (1994a). Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19, 55-72.

Rost, B. & Sander, C. (1994b). Conservation and prediction of solvent accessibility in protein families. Proteins, 20, 216-226.

Rost, B. & Sander, C. (1994c). Structure prediction of proteins - where are we now? Curr. Opin. Biotech., 5, 372-380.

Rost, B. & Sander, C. (1995). Progress of 1D protein structure prediction at last. Proteins, 23, 295-300.

Rost, B., Sander, C. & Schneider, R. (1994). Redefining the goals of protein secondary structure prediction. J. Mol. Biol., 235, 13-26.

Rost WWW, B. (1996). Appendix to 'Protein fold recognition by prediction-based threading'. EMBL, WWW document (http://www.embl-heidelberg.de/~rost/Papers/Dfig/JMB96.html) .

Russell, R. B. & Barton, G. J. (1993). The limits of protein secondary structure prediction accuracy from multiple sequence alignment. J. Mol. Biol., 234, 951-957.

Russell, R. B., Copley, R. R. & Barton, G. J. (1996). Protein fold recognition by mapping predicted secondary structures. J. Mol. Biol., in press.

Salamov, A. A. & Solovyev, V. V. (1995). Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignment. J. Mol. Biol., 247, 11-15.

Sali, A. & Blundell, T. (1994). Comparative Protein Modelling by Satisfaction of Spatial Restraints. In Protein Structure by Distance Analysis (Bohr, H. & Brunak, S., eds.), pp. 64-87, IOS Press, Amsterdam, Oxford, Washington.

Sali, A. & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol., 234, 779-815.

Sander, C. & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins, 9, 56-68.

Sander, C. & Schneider, R. (1994). The HSSP database of protein structure-sequence alignment. Nucl. Acids Res., 22, 3597-3599.

Schneider, R. (1994). Sequenz und Sequenz-Struktur Vergleiche und deren Anwendung für die Struktur- und Funktionsvorhersage von Proteinen. Univ. of Heidelberg, PhD.

Schneider, R. & Sander, C. (1996). The HSSP database of protein structure-sequence alignments. Nucl. Acids Res., 24, 201-205.

Shortle, D. (1995). Protein fold recognition. Nature Struct. Biol., 2, 91-92.

Sippl, M. J. (1982). On the Problem of Comparing Protein Structures. J. Mol. Biol., 156, 359-388.

Sippl, M. J. (1993a). Boltzmann's principle, knowledge based mean fields and protein folding. An approach to the computational determination of protein structures. J. Comput. Aided Mol. Design, 7, 473-501.

Sippl, M. J. (1993b). Recognition of errors in three-dimensional structures of proteins. Proteins, 17, 355-362.

Sippl, M. J. (1995). Knowledge-based potentials for proteins. Curr. Opin. Str. Biol., 5, 229-235.

Sippl, M. J. & Jaritz, M. (1994). Predictive Power of Mean Force Pair Potentials. In Protein Structure by Distance Analysis (Bohr, H. & Brunak, S., eds.), pp. 113-134, IOS Press, Amsterdam, Oxford, Washington DC.

Sippl, M. J. & Weitckus, S. (1992). Detection of native-like models for amino acid sequences of unknown three-dimensional structure in a data base of known protein conformations. Proteins, 13, 258-271.

Skolnick, J., Kolinski, A., Brooks III, C. L., Godzik, A. & Rey, A. (1993). A method for predicting protein structure from sequence. Curr. Biol., 3, 414-423.

Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol., 147, 195-197.

Stultz, C. M., White, J. V. & Smith, T. F. (1993). Structural analysis based on state-space modeling. Prot. Sci., 2, 305-314.

Vingron, M. & Waterman, M. S. (1994). Sequence alignment and penalty choice. J. Mol. Biol., 235, 1-12.

Vriend, G. & Sander, C. (1993). Quality of Protein Models: Directional Atomic Contact Analysis. J. Appl. Cryst., 26, 47-60.

Wang, Y., Lai, L., Han, Y., Xu, X. & Tang, Y. (1995). A new protein folding recognition potential function. Proteins, 21, 127-129.

Wilmanns, M. & Eisenberg, D. (1993). Three-dimensional profiles from residue-pair preferences: Identification of sequences with β/α-barrel fold. Proc. Natl. Acad. Sc. U.S.A., 90, 1379-1383.

Wodak, S. J. & Rooman, M. J. (1993). Generating and testing protein folds. Curr. Opin. Str. Biol., 3, 247-259.

Zhang, K. Y. J. & Eisenberg, D. (1994). The three-dimensional profile method using residue preference as a continuous function of residue environment. Prot. Sci., 3, 687-695.

Zu-Kang, F. & Sippl, M. J. (1996). Optimum superimposition of protein structures: ambiguities and implications. Folding & Design, 1, 123-132.

Zvelebil, M. J., Barton, G. J., Taylor, W. R. & Sternberg, M. J. E. (1987). Prediction of protein secondary structure and active sites using alignment of homologous sequences. J. Mol. Biol., 195, 957-961.