| Title: | PROFcon: novel prediction of long-range contacts |
| Author: | Marco Punta & Burkhard Rost |
| Quote: | Bioinformatics, 2005, 21:2960-2968 |
PROFcon: novel prediction of long-range contacts
| 1 | CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| * | Corresponding author: punta@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/ Tel: +1-212-305-4018, fax: +1-212-305-7932 |
This article is published in (Bioinformatics, issue, 2005 and pages) copyright Oxford University Press (2005). OUP is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.
Motivation: Despite the continuing advance in the experimental determination of protein structures, the gap between the number of known protein sequences and structures continues to increase. Prediction methods can bridge this sequence-structure gap only partially. Better predictions of non-local contacts between residues could improve comparative modeling and fold recognition, and could assist experimental structure determination.
Results: Here, we introduced PROFcon, a novel contact prediction method that combines information from alignments, from predictions of secondary structure and solvent accessibility, from the region between two residues, and from the average properties of the entire protein. In contrast to some other methods, PROFcon predicted short and long proteins at similar levels of accuracy. As expected, PROFcon was clearly less accurate when tested on sparse evolutionary profiles, i.e. on families with few homologues. Prediction accuracy was highest for proteins belonging to the SCOP alpha/beta class. PROFcon compared favorably with state-of-the-art prediction methods at the CASP6 meeting. While the performance may still be perceived as low, our method clearly pushed the mark higher. Furthermore, predictions are already accurate enough to seed predictions of global features of protein structure.
Availability: http://www.predictprotein.org/submit_profcon.html
Contact: punta@cubic.bioc.columbia.edu
Key words: protein structure prediction, inter-residue contacts, evolutionary information, neural networks.
| 1D | one-dimensional (e.g. sequence or string of residue secondary structure or numbers for residue solvent accessibility) |
| 2D | two-dimensional (e.g. inter-residue contacts) |
| 3D structure | three-dimensional co-ordinates of protein structure |
| aka | also known as |
| CASP | Critical assessment of structure prediction [1] |
| CAFASP | evaluation of automated servers at CASP [2, 3, 4, 5] |
| HSSP | Homology derived structures of proteins [6, 7] |
| PDB | Protein data bank [8, 9] |
| PROFcon | contact prediction method introduced here |
| PROFphd | system for prediction of 1D structure [10, 11, 12]. |
Protein three-dimensional (3D) structure is one key to understanding biological function. Structures unravel details needed to engineer residue mutations and to design protein-specific ligands. Thus, the knowledge of protein structure can impact medical and clinical research. Structural genomics is a large-scale effort that has the experimental determination of most unknown folds as one aim [13, 14, 15, 16, 17]. Structural genomics consortia are progressing rapidly and already contribute every third (J Liu & B Rost, unpublished) of the sequence-unique structures added to the Protein Data Bank (PDB [9]) of 3D structures. Nevertheless, the sequence-structure gap, i.e. the difference between the number of proteins with known sequence and those with known structure is increasing much more rapidly. Methods that predict aspects of protein structure continue to be a crucial means of obtaining structural information that helps in the unraveling of protein function [18, 19, 20, 21].
Despite significant advances over the last years, computational biology still cannot reliably generate biologically meaningful 3D models for proteins that have no detectable homology to proteins of known structure [1]. In the absence of a reliable solution to the protein structure prediction problem, developers have addressed simplified problems, such as the prediction of protein secondary structure and solvent accessibility; such methods have evolved into successful, automatic tools that continue to significantly impact experimental and computational biology [10, 22]. One problem with these methods is that they only predict local features of 3D structure; these local features are further simplified by projecting 3D information onto a 1D representation of protein structure (e.g. strings of secondary structure) that neither captures the 2D information of contacts between residues, nor local "irregularities" such as bends of helices. In contrast, two-dimensional (2D) maps of distances between residues reduce the dimensionality of the problem in a way that, in principle, allows the reconstruction of the full 3D structure [23, 24, 25]. NMR spectroscopy exploits this fact in determining structures from distance constraints. Most of the available 2D prediction methods simplify the description from real-valued distances to binary-valued contacts (below an assigned distance threshold; here we defined residues as in contact when their C-beta atoms were closer than 8 Å, i.e. 0.8 nm, Methods). Contact prediction methods extract information from correlated mutations [26, 27, 28], use neural networks with [29] or without correlated mutations [30], Hidden Markov Models [31, 32], Support Vector Machines [33] and genetic programming [34]. The prediction of 2D maps continues to be a very difficult problem. Consequently, performance is rather limited. Nevertheless, automatic 2D predictions have been used successfully for the prediction of protein structure [28, 35, 36].
Furthermore, no matter how inaccurate 2D predictions, they are still better than constraints from the best de novo 3D prediction methods [4].
Here, we introduced PROFcon, a new method for predicting inter-residue contacts through a simple neural network. For the network input we mixed different sources of information most of which had been used separately in some way before (Methods). We considered information from two "windows" around two residues i and j for which the probability of a spatial contact was predicted. Each sequence position k within the two windows (k ∈ {i-n, …, i+n, j-n, …, j+n}) was characterized by evolutionary substitution profiles from multiple sequence alignments, conservation weights, predicted secondary structure, and predicted solvent accessibility. Additionally, we used the complexity of residues i and j and classified the pair ij into one of seven classes based on their physico-chemical properties. Information from the sequence segment that connects a pair ij has been shown to correlate with the probability of contact formation [37]. The segment length - usually referred to as sequence separation - has been used successfully to predict contacts [38, 33]. However, characterizing the connecting segment in more detail is likely to further improve predictions for residues that are not too far apart [37]. Therefore, we added a third window describing the region at the center of a segment between i and j (analogous to the windows around i and j). Finally, we introduced global features such as the overall compositions of amino acids, predicted secondary structure composition, and protein length. We evaluated the sustained level of performance of our method on a dataset of experimentally determined X-ray structures from the PDB [8, 9]. We benchmarked proteins of different lengths, different numbers of aligned homologous sequences, and different structural classes assigned according to SCOP [39]. PROFcon performed favorably compared to other state-of-the-art contact prediction methods in the recent CASP6 (Dec. 2004) assessment of blind predictions (O Grana & A Valencia, Madrid, manuscript in preparation). The method is available as an Internet prediction server at http://www.predictprotein.org/submit_profcon.html ; all datasets used for this work are at http://www.rostlab.org/results/2004/profcon/ .
Data sets and cross-validation. All proteins used for the development of our methods were taken from the PDB [9], i.e. have known structure. To avoid biasing methods according to the accidental composition of proteins in the PDB, we extracted a subset of proteins that are not clearly related in sequence. The EVA server evaluates structure prediction methods [40] and maintains a continuously updated subset of sequence-unique PDB chains (no pair of proteins in this set has HSSP-value above 0 [6, 7]). In particular, we used the December 2003 EVA release, a set of 3201 protein chains of known structure. We removed all non-X-ray structures, all membrane and coiled-coil proteins and proteins with physical chain breaks [37]. We divided the proteins into three data sets: (1) for training, we selected structures with high resolution (<2.0 Å), (2) for cross-training, i.e. the optimization of all free neural network parameters such as when to stop training, we selected structures with low resolution (2.5-3.0 Å), and (3) for testing, we selected structures with medium resolution (2.0-2.5 Å). Due to CPU limitations, we had to reduce the test set further by excluding all proteins longer than 400 residues. Training, cross-training, and test set contained 748, 466 and 633 proteins, respectively.
Definition of contact. Mainly to enable direct comparison with other methods, we chose the standard threshold for considering a pair of residues to be in contact [41, 42, 43, 26, 44, 45, 46], namely a maximal distance of 8 Å between their C-beta atoms (C-alpha for glycines).
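The contact definition above can be illustrated with a minimal sketch. The function names and the coordinate layout (one (x, y, z) tuple per residue, holding the C-beta position, or C-alpha for glycine) are assumptions made for this illustration and are not taken from the PROFcon implementation.

```python
from math import dist

CONTACT_CUTOFF = 8.0  # Angstrom; the threshold stated in the text


def contact_map(coords, cutoff=CONTACT_CUTOFF):
    """Return a symmetric boolean matrix of residue-residue contacts.

    coords[k] is the (x, y, z) position of the C-beta atom of residue k
    (C-alpha for glycine); this input layout is an assumption of the sketch.
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap


def non_local_contacts(cmap, min_sep=6):
    """List contacts (i, j), i < j, with sequence separation >= min_sep."""
    n = len(cmap)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_sep, n)
            if cmap[i][j]]
```

Only the upper triangle is listed by `non_local_contacts`, which matches the convention of counting each residue pair once.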
Neural network architecture. We trained standard feed-forward neural networks with back-propagation and momentum term [47]. We addressed the extremely unequal distribution of true (contact) and false (non-contact) samples by balanced training [47]. Symmetry between the contact probabilities for the prediction between ij and ji was enforced through a simple post-processing by averaging over both raw output values [30]. In total, we used 738 input, 100 hidden, and 2 output units (contact, non-contact). The input features corresponded to three different ways of describing each pair of residues (Table 1S in Supplementary Materials). We used: (1) information from the local environment of both residues, (2) information from the segment connecting i and j, and (3) global information from the entire protein.
(1) Local information from immediate residue environment. For each residue pair ij in a protein, the network incorporates information from all residues in two windows of size 9 centered around i and j (corresponding to the intervals {i-4;i+4} and {j-4;j+4}). Each residue position within the two windows was characterized by 29 input units: 20 for the evolutionary profile (i.e. frequency of occurrence of the 20 amino acid types at that position, as obtained from multiple sequence alignments [48, 49]), one additional unit that served as a spacer accounting for the N- and C-terminal residues [50, 51], 4 units coding for the predicted secondary structure (three for helix/strand/other and one for the reliability of the secondary structure prediction at that residue), 3 units for the predicted solvent accessibility (two units for buried/exposed and one value for prediction reliability) and, finally, 1 for the conservation weight [48]. Alignments were obtained through PSI-BLAST [52] using our standard protocol of three automatic iterations [49] and then filtering the aligned sequences at 80% sequence identity, i.e. any two sequences with >80% pairwise sequence identity were removed at the end. We used PROFphd [10, 11, 12] to predict secondary structure and solvent accessibility. Note that we trained and tested on predicted rather than observed values for 1D structure to account for the fact that secondary structure predictions are more correlated to each other than they are to observed secondary structure [53]. As a consequence, using predictions both in training and testing can result in a more coherent input to the networks and hence can help to ease the classification task (note that similar advantages hold for the prediction of secondary structure itself [47, 48, 12]). As the two local windows together accounted for 18 residue positions, we needed a total of 522 input units for their description (18*29).
We also introduced additional features to better characterize the central residues i and j, namely a coarse-grained bio-physical classification [54] (7 input units: hydrophobic-hydrophobic, polar-polar, charged-polar, opposite charges, same charges, aromatic-aromatic, other) and we specified whether or not i and j were in low-complexity regions (according to SEG program [55], 2 input units).
(2) Local information from connecting segment. Since the centre of segments that connect two residues i and j has been shown to be most informative for the contact formation between these two residues [37], we introduced another window of five consecutive residues that spanned the interval {int(|i-j|/2)-2; int(|i-j|/2)+2}. Each residue in this window was characterized by the same information as used for the windows around i and j, i.e. we used 29 input units for each residue. Also, it has been shown that the probability for contact formation decreases as sequence separation increases [56, 43, 44, 57]. Therefore, we also had to encode the length of the segment that connects i and j; for this, we used 11 input units that corresponded to sequence separations of 6, 7, 8, 9, 10-14, 15-19, 20-24, 25-29, 30-39, 40-49, and >49 (values chosen by intuition instead of by optimization). Finally, we added features that described the entire segment, namely its amino acid composition (20 units), its secondary structure composition (3 units) and the fraction of SEG-low-complexity [55] residues in that segment (1 unit). Overall, we used 180 input units to describe the connecting segment.
(3) Global information. The use of global information can help the network to decide whether or not two residues are in contact. For example, knowing that the protein is very short should increase the probability of having a contact between two cysteine residues (disulfide bridges); knowing that the protein is longer than the average domain length (~100 residues) should decrease the probability of long-range contacts (fewer inter-domain contacts). Here, we used only very coarse-grained global features, namely 20+3 units to describe the composition of amino acids and secondary structure of the entire protein, and 4 units to describe the protein length (intervals 1-60, 61-120, 121-240, >240; again, values were not optimized but chosen identically to our PHD methods [47, 58, 48]).
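The bookkeeping of the three input groups can be checked with a short sketch. It tallies the units listed in the text and illustrates the 11-bin encoding of sequence separation and the five-residue window at the centre of the connecting segment. All function names are hypothetical. Placing the window centre at min(i, j) + int(|i-j|/2) is an assumption about how the interval given in the text maps onto sequence positions.

```python
# Inclusive (low, high) bounds of the 11 separation bins listed in the text;
# None marks the open-ended last bin (>49).
SEPARATION_BINS = [(6, 6), (7, 7), (8, 8), (9, 9), (10, 14), (15, 19),
                   (20, 24), (25, 29), (30, 39), (40, 49), (50, None)]

# profile + spacer + secondary structure + accessibility + conservation
UNITS_PER_POSITION = 20 + 1 + 4 + 3 + 1


def one_hot_separation(i, j):
    """Encode |i - j| as 11 binary units, one per separation bin."""
    sep = abs(i - j)
    if sep < 6:
        raise ValueError("pairs closer than 6 residues are not considered")
    vec = [0] * len(SEPARATION_BINS)
    for k, (low, high) in enumerate(SEPARATION_BINS):
        if sep >= low and (high is None or sep <= high):
            vec[k] = 1
            break
    return vec


def segment_centre_window(i, j, half_width=2):
    """Positions of the window at the centre of the connecting segment.

    ASSUMPTION: the centre is measured from the lower of the two indices.
    """
    centre = min(i, j) + abs(i - j) // 2
    return list(range(centre - half_width, centre + half_width + 1))


def count_input_units():
    """Return the unit counts (local, segment, global) described in the text."""
    local = 2 * 9 * UNITS_PER_POSITION + 7 + 2           # windows + pair class + SEG
    segment = 5 * UNITS_PER_POSITION + 11 + 20 + 3 + 1   # window + sep + composition
    global_units = 20 + 3 + 4                            # composition + length bins
    return local, segment, global_units
```

The three counts are 531, 180 and 27, which add up to the 738 input units quoted for the network.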
Measuring performance. Many measures have been introduced to evaluate the performance of 2D predictions. We applied criteria that were basically identical to those used at CASP/CAFASP [2, 29, 3, 4]. Here, we only briefly sketched these scores that are described in detail elsewhere [26, 28, 29, 4]. Accuracy (also referred to as "specificity") was defined by:
| $\mathrm{Acc} = \dfrac{NC_{ok}}{NC_{prd}} = \dfrac{TP}{TP + FP}$ | (Eqn. 1) |
where NCok is the number of correctly predicted contacts, i.e. the true positives (TP), and NCprd is the number of predicted contacts, which corresponds to the sum of true positives (TP) and false positives (FP). We also followed the tradition of evaluating performance on a number of predicted contacts that is proportional to the protein length L. The rationale behind this choice is that the overall number of contacts in a protein is linearly correlated to the protein length. The advantage is that accuracy estimates are related to a quantity that can be evaluated from sequence alone (hence, it is known a priori). However, accuracy alone does not suffice; we need to contrast it with the coverage (also referred to as "sensitivity"), defined as:
| $\mathrm{Cov} = \dfrac{NC_{ok}}{NC_{obs}} = \dfrac{TP}{TP + FN}$ | (Eqn. 2) |
where NCok is the number of correctly predicted contacts, i.e. the true positives (TP), NCobs is the number of observed contacts, which corresponds to the sum of true positives (TP) and false negatives (FN). The following example illustrates the importance of considering coverage. If we fix the number of predictions to a fraction of the length of a protein (e.g. L/2), and if we assume a linear relation between the number of contacts (NCobs) and the protein length (L), i.e. NCobs=alpha+beta*L, we can write the coverage as:
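| $\mathrm{Cov} = \dfrac{NC_{ok}}{NC_{obs}} = \mathrm{Acc} \cdot \dfrac{NC_{prd}}{NC_{obs}} = \mathrm{Acc} \cdot \dfrac{L/2}{\alpha + \beta L}$ | (Eqn. 3) |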
where NCprd is the number of predicted contacts. A simple regression on our test set (633 proteins with L≤400 and considering only sequence separations s≥6) estimated the two free parameters to be: alpha~-220 and beta~5 (valid only for L≥45, see denominator in eqn. 3). For a protein of 100 residues, we would obtain Cov~0.18*Acc, while a protein with 400 residues would yield Cov~0.11*Acc. In other words, the same level of accuracy corresponds to different coverage for proteins of different length. Therefore, even if we predict a number of contacts proportional to the length of the protein L, we need to report Cov along with Acc in order to capture the performance of a method.
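The two ratios quoted above follow directly from the linear fit. The sketch below reproduces the arithmetic, assuming L/2 predictions per protein and the regression parameters given in the text; the function names are illustrative only.

```python
ALPHA = -220.0  # intercept of the regression NCobs = alpha + beta * L (from the text)
BETA = 5.0      # slope of the same regression (from the text)


def expected_observed_contacts(length):
    """Number of observed contacts (s >= 6) expected from the linear fit."""
    return ALPHA + BETA * length


def coverage_over_accuracy(length, fraction=0.5):
    """Ratio Cov/Acc when fraction * L contacts are predicted."""
    n_obs = expected_observed_contacts(length)
    if n_obs <= 0:
        raise ValueError("the linear fit is only meaningful for L >= 45")
    return fraction * length / n_obs


for protein_length in (100, 400):
    ratio = coverage_over_accuracy(protein_length)
    print(f"L={protein_length}: Cov ~ {ratio:.2f} * Acc")
```

For L=100 the fit gives 280 expected contacts and 50 predictions, hence a ratio of about 0.18; for L=400 it gives 1780 contacts and 200 predictions, hence about 0.11.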
For each score we reported the average over all proteins in the test set and the associated estimates for standard deviations of the averages obtained from bootstrapping [59]. In Table 1, Table 2, Table 3, and Table 4 we reported performance for two particular values of the minimal sequence separation (s≥6 and s≥24), i.e. two different definitions of non-local contacts, and for a number L/2 of predictions. In Fig. 1 we reported accuracies for different sequence separation values, and in Fig. 1S the accuracy for different numbers of predicted contacts (Fig. 1S, panels A and B) and for different values of the prediction reliability, i.e. the normalized network output (Fig. 1S). We also provided a "δ-evaluation" [35] of the performance of the method (Table 2S in Supplementary Material). In this case, a contact between two residues i and j is considered as correctly predicted if at least one contact is observed between residues in the interval {i-δ, i+δ} and residues in the interval {j-δ, j+δ}, i.e. between residues in two windows of size 2δ+1 around i and j. This gives an idea of how far off misplaced predictions are. Note that we never reported scores for data sets used for optimizing any free parameter; instead all scores given estimate the performance for proteins of unknown structure.
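The evaluation protocol described above can be sketched as follows. The data layout is an assumption made for illustration: predictions are a dictionary from residue pairs (i, j), i < j, to network output, and observed contacts are a set of such pairs. None of the names come from the original software.

```python
def top_predictions(scores, length, min_sep=6, fraction=0.5):
    """Return the fraction * L highest-scoring pairs with |i - j| >= min_sep."""
    eligible = [(pair, p) for pair, p in scores.items()
                if pair[1] - pair[0] >= min_sep]
    eligible.sort(key=lambda item: -item[1])  # strongest predictions first
    n_keep = int(length * fraction)
    return [pair for pair, _ in eligible[:n_keep]]


def accuracy_and_coverage(predicted, observed):
    """Acc = TP / (TP + FP) and Cov = TP / (TP + FN)."""
    observed = set(observed)
    n_ok = sum(1 for pair in predicted if pair in observed)
    acc = n_ok / len(predicted) if predicted else 0.0
    cov = n_ok / len(observed) if observed else 0.0
    return acc, cov


def delta_accuracy(predicted, observed, delta):
    """Fraction of predictions with an observed contact within +/- delta."""
    observed = set(observed)

    def has_neighbour(i, j):
        return any((a, b) in observed
                   for a in range(i - delta, i + delta + 1)
                   for b in range(j - delta, j + delta + 1))

    if not predicted:
        return 0.0
    return sum(1 for i, j in predicted if has_neighbour(i, j)) / len(predicted)
```

With delta set to 0 the δ-evaluation reduces to the ordinary accuracy, which gives a convenient consistency check.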
SCOP classes. A coarse-grained classification groups proteins according to their structural class [60]. Here, we used the SCOP [61] classification (release 1.65). Of the 633 proteins in our test set, we could assign 522 to one of the four major classes (131 all-alpha, 103 all-beta, 119 alpha+beta, 169 alpha/beta). The remaining proteins were either in other structural classes (peptides, small and multi domain proteins), or were not assigned by SCOP, yet.
Contact density. We defined contact density, i.e. the density of non-local contacts in a protein as:
| $D(s) = \dfrac{NC_{obs}(s)}{N(s)}, \qquad N(s) = \dfrac{(L - s)(L - s + 1)}{2}$ | (Eqn. 4) |
where L was the length of the protein, NCobs(s) was the number of experimentally observed contacts (C-beta ≤8 Angstrom) at sequence separation s or greater, and N(s) was the number of residue pairs in that protein separated by s or more sequence positions (counting only the upper diagonal of the symmetric matrix). We chose s=6 for all reports of contact density for simplification. D depends on at least two factors, namely the protein length and structural class of a protein. For instance, in our test set, D(6)=0.035 for proteins with L<150 (199 proteins) and D(6)=0.0155 for proteins with 250≤L≤400 (226 proteins). In other words, longer proteins have lower contact density as is well known. Contact densities for different structural classes were reported below (Results).
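As a small illustration of the contact density, the sketch below counts the residue pairs at separation s or greater in the upper triangle of the contact matrix and divides the number of observed contacts by that count. The function names are hypothetical, and the example contact count is made up for the illustration.

```python
def n_pairs(length, s):
    """Number of residue pairs (i, j), i < j, with j - i >= s."""
    m = length - s
    return m * (m + 1) // 2 if m > 0 else 0


def contact_density(n_contacts_obs, length, s=6):
    """Observed contacts at separation >= s divided by the number of such pairs."""
    pairs = n_pairs(length, s)
    if pairs == 0:
        raise ValueError("protein too short for the requested separation")
    return n_contacts_obs / pairs


# Illustrative numbers only: a 100-residue protein with 156 non-local contacts.
print(round(contact_density(156, 100), 3))
```

Because the number of pairs grows quadratically with L while the number of contacts grows roughly linearly, the density falls as proteins get longer.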
Biophysical features. We analyzed the performance separately for different biophysical contexts. Secondary structure was taken from DSSP [62], with DSSP states G, I and H treated as helix (H), B and E as strand (E) and all other DSSP states as "other" (L). Accessibility was also taken from DSSP. We considered nine amino acids as hydrophobic (alanine, leucine, isoleucine, valine, tryptophan, phenylalanine, proline, cysteine, and methionine).
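The state reduction and the hydrophobic set described above amount to two lookups, sketched below. One-letter amino acid codes and the function names are choices made for this illustration.

```python
HELIX_STATES = set("GHI")   # DSSP states treated as helix (H)
STRAND_STATES = set("EB")   # DSSP states treated as strand (E)

# alanine, leucine, isoleucine, valine, tryptophan, phenylalanine,
# proline, cysteine, methionine (one-letter codes)
HYDROPHOBIC = set("ALIVWFPCM")


def reduce_dssp(states):
    """Map a string of DSSP states onto the three classes H, E and L."""
    reduced = []
    for state in states:
        if state in HELIX_STATES:
            reduced.append("H")
        elif state in STRAND_STATES:
            reduced.append("E")
        else:
            reduced.append("L")  # every other DSSP state counts as "other"
    return "".join(reduced)


def is_hydrophobic_pair(res_i, res_j):
    """True if both residues belong to the nine hydrophobic amino acids."""
    return res_i in HYDROPHOBIC and res_j in HYDROPHOBIC
```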
Connecting segment very informative for contact formation. First, we confirmed [37] that the information from the segment connecting two residues i and j improves the prediction of contacts (Fig. 1, Table 1). For sequence separations >20, accuracy was always lower than 20% (Fig. 1), while it reached almost 40% (s=6 and L/20, Fig. 1) for "less global" residues. This difference could partially be explained by the background probability (more contacts at shorter separations as illustrated by the curve for random in Fig. 1). On the other hand, it may be easier for the neural networks to learn what determines the formation of more local contacts (more samples, stronger sequence signatures, e.g. for beta turns and beta hairpins). In order to take this strong dependency of performance on sequence separation into account, we always reported values for performance for two different values of minimal sequence separation (s≥6 and s≥24). Note that we provided more details about the explicit dependency of performance in the supplementary material (in particular Fig. 1S).
| Method a | Sequence separation b | Nprot c | Acc d | ErrAcc e | Cov d | ErrCov e |
| local only | 6 | 633 | 29.7 | 0.5 | 8.6 | 0.1 |
| PROFcon | 6 | 633 | 32.4 | 0.5 | 9.8 | 0.2 |
| local only | 24 | 621 | 19.5 | 0.5 | 8.8 | 0.3 |
| PROFcon | 24 | 621 | 20.0 | 0.5 | 9.4 | 0.3 |
a Two different methods are compared: networks that use only information from the local sequence environment of two contacting residues (local only) and our system that also uses information from the connecting segment (PROFcon).
b Minimal sequence separation: n (=6 or 24) means that only contacts between pairs i, j minimally n residues apart are considered, i.e. |i-j| ≥ n.
c Number of proteins in respective test set.
d Acc: Accuracy (aka Specificity, eqn. 1) and Cov: coverage (aka Selectivity, eqn. 2). Note that we report numbers valid for the L/2 strongest predictions.
e Standard error from bootstrapping.
|
Fig. 1: PROFcon accuracy vs. sequence separation. |
Evolutionary profiles were crucial for performance. Due to the large size of our datasets and our limited resources we could not systematically test the relevance of all the input features that we used through a leave-one-out type of test. One aspect that we investigated by training separate networks was the contribution of non-local input information, in general: networks using only local features were less accurate than our final system (Table 1). Detailed analyses of our results and preliminary work on smaller datasets suggested which of the remaining input features were most important for performance. Evolutionary information was clearly most relevant, as had been noted before [57]. Even the simplest measure for the information in a multiple sequence alignment, namely the number of proteins aligned, clearly correlated with performance, e.g. the accuracy dropped from 37% for alignments with over 200 proteins to 23% for alignments with fewer than 15 proteins at sequence separations ≥6 and from 24% to 13% at separations ≥24 (Column Acc in Table 2). Note that, although accuracy correlated dramatically with the number of aligned sequences, differences in coverage (Column Cov in Table 2) were not statistically significant. This was probably related to the different protein composition of the reported subsets, in terms of length and structural class.
| Nali a | Sequence separation b | Nprot b | Acc b | ErrAcc b | Cov b | ErrCov b |
| 0-14 | 6 | 138 | 23.0 | 1.0 | 9.5 | 0.8 |
| 15-49 | 6 | 123 | 31.0 | 1.0 | 9.8 | 0.5 |
| 50-199 | 6 | 187 | 35.6 | 0.9 | 9.7 | 0.3 |
| ≥200 | 6 | 185 | 37.0 | 0.9 | 10.2 | 0.4 |
| 0-14 | 24 | 132 | 13.2 | 0.7 | 10.0 | 1.0 |
| 15-49 | 24 | 120 | 19.0 | 1.0 | 9.1 | 0.7 |
| 50-199 | 24 | 185 | 21.5 | 0.9 | 9.0 | 0.4 |
| ≥200 | 24 | 184 | 24.2 | 0.9 | 9.5 | 0.4 |
a Number of proteins in multiple sequence alignment used to extract evolutionary profiles.
b As in Table 1; note all values compiled for the first L/2 predictions (Methods).
Contact density dependent on type of protein. It is well known that the contact density decreases with increasing protein length. Thus, contact predictions are more difficult for longer proteins [29, 30]. We observed that the contact density (eqn. 4) also depends on the structural class: all-beta proteins had the highest density (D(6)=0.040), while all-alpha proteins had the lowest contact density (D(6)=0.022). Since it is more difficult to predict low-density than high-density contacts, most existing methods for contact prediction strongly depend on protein length [38, 30] and structural class [33, 34].
Similar accuracy but better performance for short proteins. PROFcon reached surprisingly similar levels of accuracy for proteins of very different length (Table 3 column Acc). However, when also considering the coverage/selectivity of our predictions (Table 3 column Cov), we noted the length-dependency of our method: short proteins (especially of length < 100) had, by far, the highest coverage (1.5-2 times higher than for long proteins).
| L a | Sequence separation b | Nprot b | Acc b | ErrAcc b | Cov b | ErrCov b |
| ≤100 | 6 | 78 | 31.0 | 2.0 | 18.0 | 1.0 |
| 101-200 | 6 | 230 | 32.5 | 0.9 | 9.7 | 0.3 |
| 201-300 | 6 | 191 | 33.9 | 0.9 | 8.6 | 0.2 |
| 301-400 | 6 | 134 | 31.0 | 1.0 | 7.1 | 0.1 |
| All | 6 | 633 | 32.4 | 0.5 | 9.8 | 0.2 |
| ≤100 | 24 | 66 | 19.0 | 1.0 | 22.0 | 2.0 |
| 101-200 | 24 | 230 | 18.5 | 0.7 | 8.8 | 0.3 |
| 201-300 | 24 | 191 | 22.0 | 1.0 | 8.2 | 0.3 |
| 301-400 | 24 | 134 | 19.4 | 0.8 | 6.2 | 0.2 |
| All | 24 | 621 | 20.0 | 0.5 | 9.4 | 0.3 |
a Protein length.
b As in Table 1; note all values compiled for the first L/2 predictions (Methods).
All-alpha worst and alpha/beta best. Our test set was large enough to distinguish between the four major structural classes in SCOP [39], namely all-alpha, all-beta, alpha/beta, and alpha+beta (at least 100 proteins in each class). We found that PROFcon performed clearly worst on all-alpha, especially for shorter sequence separations (Table 4 ); we verified that this effect was true at levels of identical coverage (data not shown). Performance was largely similar for the other three classes with the exception of very long-range contacts (s≥24) that were predicted best in alpha/beta proteins (Table 4). A similar trend has previously been reported on a much smaller data set [34]. This trend might originate from particular strand-helix-strand modules that are abundant in alpha/beta proteins, namely those with two flanking strands that contact each other. This type of structural motif may be easy to predict, especially for a system that relies on information about the connecting segments.
| SCOP class | Sequence separation a | Acc hydrophob b | Cov hydrophob b | Acc other b | Cov other b | <SA> c | <sep> c |
| All-alpha | 6 | 24 (59) | 14 | 24 (41) | 5 | 19 (28) | 17 (57) |
| All-beta | 6 | 35 (48) | 11 | 17 (52) | 5 | 22 (30) | 14 (53) |
| Alpha+beta | 6 | 32 (47) | 12 | 7 (53) | 6 | 22 (23) | 16 (61) |
| Alpha/beta | 6 | 36 (56) | 13 | 16 (44) | 5 | 15 (28) | 23 (56) |
| All-alpha | 24 | 14 (72) | 13 | 11 (28) | 2 | 11 (27) | 39 (77) |
| All-beta | 24 | 17 (59) | 10 | 8 (41) | 3 | 14 (29) | 42 (76) |
| Alpha+beta | 24 | 16 (64) | 12 | 2 (36) | 3 | 14 (23) | 44 (79) |
| Alpha/beta | 24 | 27 (56) | 15 | 15 (44) | 6 | 12 (27) | 41 (79) |
a As in Table 1; note all values compiled for the first L/2 predictions (Methods).
50% of predicted contacts within two residues of an observed contact. "δ-analysis" (Methods) shows that many predicted contacts fall very close to observed contacts (Table 2S in the Supplementary Material). For example, for sequence separation s≥6, 50% of all predicted contacts (L/2 predictions) are within 2 residues of an experimental contact (Table 2S).
Correct for core, hydrophobic, and regular secondary structure. At least half of the contacts correctly predicted by PROFcon were between residues in identical regular secondary structures (helix-helix and strand-strand, Table 3S in the Supplementary Material); this was independent of the structural class and of sequence separation (Table 3S). Although most correctly predicted contacts were between regular secondary structures, contacts between helices were, on average, predicted the least accurately (slightly worse than mixed). Strand-strand contacts were by far the most accurately predicted (>40% for s≥6 and >20% for s≥24, in all classes, Table 3S). In alpha/beta proteins, long-range strand-strand contacts (s≥24) were predicted at levels of accuracy as high as 42%, compared to 20% and 24% in all-beta and alpha+beta, respectively. This indicated that PROFcon captured a strong signal from long-range preferences determining the formation of sheets in alpha/beta proteins that may be related to strand-helix-strand modules. In analogy to contacts between regular secondary structures, contacts between hydrophobic residues also constituted most of the correctly predicted contacts (Table 4S in the Supplementary Material). Unlike for secondary structure, the contribution of hydrophobic pairs increased slightly for higher sequence separations. As expected, predicted contacts were on average more buried (core residues) and less distant in sequence than observed contacts (averages between 11 and 22 Å² for correctly predicted contacts, Table 4S; L/2 predictions).
CASP6 and comparisons to other methods. Comparing the performance of PROFcon to that of other contact prediction methods is not an easy task; different groups use different datasets and, as shown (Table 2, Table 3, Table 4), the dataset composition (number of sequences in alignments, protein length, structural class composition) significantly alters average scores. Furthermore, different groups use different scores, often even different definitions for what is considered a long-range inter-residue contact. The only reasonable comparison of the performance of methods is based on the same scores and the same data set. Such a data set must be sufficiently large to contain a representative sequence-unique subset of proteins [47, 58, 63, 64, 22]. Furthermore, the set should not overlap with any of the proteins used for development. While cross-validation provides some clues, it does not suffice for rigorous comparisons. At the moment, there is no set available that meets all conditions for a comprehensive, meaningful comparison of our method to others. The best approximation might be the data from CASP6 (Dec. 2004), with the caveat that this set was far too small to draw definite conclusions. PROFcon appeared to be one of the top three contact prediction methods at CASP6, as judged by the assessor (A Valencia, CASP6 web site; note that the limitation of the data set did not allow any distinction in performance between the top three methods).
Contact predictions capture relevant information and are useful! More than many other structure prediction methods, contact predictions continue to suffer from the fact that performance appears to be so low. Are 2D predictions of any use? Our predictions clearly capture important globular information, as demonstrated by a method that succeeds in predicting folding rates exclusively based on PROFcon predictions [65]. Ortiz, Skolnick and colleagues have shown that even noisier contact predictions provide important constraints for the prediction of 3D structures [35], and, when used by experts, contact predictions have also been shown to aid in fold recognition [28, 66]. Two particular examples from CASP6 may help elucidate what exactly is captured by 2D predictions.
Example 1: CASP6 target T0230 (Fig. 2 A) is a small protein (~100 residues) characterized by the following topology: two helices, two interacting strands (labeled A and B in Fig. 2 A), one helix (labeled 1), another strand (labeled C, which interacts with helix 1) and one last helix. At CASP6, T0230 was classified as a fold recognition analogous target (FR/A), i.e. a protein for which a similar structure was known in PDB, although the similarity between the target and that structure could not be identified through sequence homology. PROFcon strongly predicted the interaction between the two anti-parallel strands A and B that are separated by a short loop (Fig. 2). It also correctly identified the sparse cluster of interactions between helix 1 and strand C. However, it wrongly predicted the main interaction of strand C: while predicting only a few interactions between the parallel strands C and B (which are in contact in the structure), it suggested a strong contact between C and A that is not observed.
Fig. 2: Predicted and observed contacts for two CASP6 targets.
Example 2: The substantially longer (~210 residues) domain labeled T0216_2 in CASP6 (Fig. 2 B) is part of a two-domain protein. CASP6 assessors classified this target as a novel fold (NF), i.e. a domain to which no other domain from PDB had structural similarity. We focused on some of the strands because this domain has a rather complicated architecture. Specifically, we considered two groups of strands labeled A, B, C and 1, 2, 3, 4 in Fig. 2 B. PROFcon correctly identified contacts between pairs of strands A-B, 2-3 and 3-4, while it completely missed the interactions between A-C and 1-2. In both examples (Fig. 2), incorrect contact predictions between regular secondary structure segments often occurred far from the main diagonal of the map (i.e. at large sequence separations, see lines marking s=24). The overall accuracy for 2*L predictions and s≥6 was close to average for both T0230 (26%) and T0216_2 (21%).
These two examples seem to suggest that even when the overall prediction accuracy is rather low, PROFcon still correctly identifies contacts between regular secondary structures that are separated by less than 20-30 residues. This ability could be exploited by integrating the predictions into fold recognition and/or de novo prediction methods.
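Accuracy values of the kind quoted for the two examples (a fixed number of top-ranked predictions, proportional to the protein length L, at a minimal sequence separation s) can be illustrated with a short calculation. The sketch below is an assumption about how such a score might be computed, not the evaluation code used for PROFcon or CASP6; the function name `top_k_accuracy`, the dictionary of pair scores, and the parameters `factor` and `min_sep` are hypothetical.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]  # residue indices (i, j) with i < j


def top_k_accuracy(
    scores: Dict[Pair, float],
    observed: Set[Pair],
    length: int,
    factor: float = 0.5,
    min_sep: int = 6,
) -> float:
    """Fraction of the top factor*L predicted pairs that are observed contacts.

    Only pairs with sequence separation |i - j| >= min_sep are considered.
    """
    eligible = [(pair, s) for pair, s in scores.items()
                if abs(pair[0] - pair[1]) >= min_sep]
    # Rank eligible pairs by predicted score, highest first.
    eligible.sort(key=lambda item: item[1], reverse=True)
    k = max(1, int(factor * length))
    top = [pair for pair, _ in eligible[:k]]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in observed)
    return hits / len(top)
```

With `factor=0.5` this corresponds to L/2 predictions and with `factor=2.0` to 2*L predictions; setting `min_sep=24` restricts the evaluation to the most long-range pairs.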
Unique combination of information makes the difference. Our approach to contact prediction was by no means "radically new". We simply combined sources of information slightly differently from what other groups did. In doing so, we ended up with a simple neural network the size of which is not unusual for the prediction of 1D structure [10, 22], but is slightly larger than what has previously been used to predict 2D structure. As a result, we needed a very large training set with almost 400,000 positive (contact) samples. Preliminary tests demonstrated that we needed at least this many samples to fully profit from all the input features that we combined. The downside was the increase in computational complexity: the development required several terabytes and many CPU years. These constraints also impacted our ability to separately test - and possibly optimize - the various input features that we considered. One outstanding feature of PROFcon is its consistent performance across a wide range of protein lengths and for both less and more long-range contacts. This consistency was clearly related to the introduction of information from the segment connecting two contacting residues (Table 1).
Future. Major improvements from here may have to introduce ways of post-processing raw predictions. Our method did not exploit any of the constraints imposed on a contact map of an entire protein, e.g. that residues can form only a limited number of contacts. In the past, an intricate post-processing has been proposed for beta-strand pairing [67]. Although the concepts embedded in HMMSTR also address this task [32], no solution exists that comprehensively refines inter-residue contact predictions.
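To make the kind of constraint mentioned above concrete, the following sketch shows one naive way in which a limit on the number of contacts per residue could be imposed on raw predictions. It is purely illustrative and not part of PROFcon: the greedy strategy, the function name `cap_contacts_per_residue`, and the default limit `max_contacts` are assumptions, and a real refinement procedure would likely need to be considerably more sophisticated.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # residue indices (i, j) with i < j


def cap_contacts_per_residue(
    scores: Dict[Pair, float], max_contacts: int = 10
) -> List[Pair]:
    """Greedily keep the highest-scoring pairs while no residue exceeds max_contacts."""
    degree: Dict[int, int] = defaultdict(int)  # contacts kept so far per residue
    kept: List[Pair] = []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (i, j), _score in ranked:
        # Accept the pair only if both residues still have capacity left.
        if degree[i] < max_contacts and degree[j] < max_contacts:
            kept.append((i, j))
            degree[i] += 1
            degree[j] += 1
    return kept
```

The value of `max_contacts` is arbitrary here; in practice it would have to be derived from the contact definition and from statistics over known structures.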
Better predictions of 2D information captured by inter-residue contacts from sequence could help predict important aspects of protein structures. However, the difficulty of the task and the perceived lack of acceptable performance have so far hampered progress. We presented a new method that exploits evolutionary information in the form of multiple sequence alignments and other sequence information relevant for predicting contacts through simple neural networks. While none of our ideas revolutionized the field, the particular combination of information chosen made a significant difference in sustained prediction performance; the major novelty was the particular way in which we successfully used information other than from the sequence environment of the two contacting residues (Table 1). Our method, PROFcon, was particularly successful in its consistent predictions of contacts across a wide range of protein lengths, both for residues closer in sequence (separated by at least 6 residues) and for residues very far apart in sequence (separated by at least 24 residues). Nevertheless, PROFcon performed better for short proteins (Table 3), for proteins for which sequence alignment methods detected many homologues (Table 2), and for proteins with beta-strands (Table 4); it was particularly successful for alpha/beta proteins (Table 4). Overall, visual inspection of individual contact maps suggests that the predictions contain more useful information than might be expected from the low levels of accuracy and coverage.
Thanks to Jinfeng Liu and Megan Restuccia (Columbia) for computer assistance; to the EVA team, in particular, to Dariusz Przybylski, Ingrid Koh, and Volker Eyrich (all Columbia), Osvaldo Graña and Alfonso Valencia (CNB Madrid). Many thanks for insightful discussions to Yanay Ofran (Columbia), Søren Brunak (CBS Copenhagen), Piero Fariselli and Rita Casadio (both Bologna Univ.), Alfonso Valencia (CNB Madrid), Reinhard Schneider (LION), and Chris Sander (Sloan Kettering, NYC). This work was supported by the grant R01-GM64633-01 from the NIH. Last, not least, thanks to Amos Bairoch (SIB, Geneva), Rolf Apweiler (EBI, Hinxton), Phil Bourne (San Diego Univ.), and their crews for maintaining excellent databases and to all experimentalists who enabled this analysis by making their data publicly available.
| 1. | Moult, J., Fidelis, K., Zemla, A. & Hubbard, T. (2003). Critical assessment of methods of protein structure prediction (CASP)-round V. Proteins: Structure, Function, and Genetics, 53, 334-339. |
| 2. | Fischer, D., Barret, C., Bryson, K., Elofsson, A., Godzik, A. et al. (1999). CAFASP-1: critical assessment of fully automated structure prediction methods. Proteins: Structure, Function, and Genetics, Suppl 3, 209-217. |
| 3. | Fischer, D., Elofsson, A., Rychlewski, L., Pazos, F., Valencia, A. et al. (2001). CAFASP2: the second critical assessment of fully automated structure prediction methods. Proteins: Structure, Function, and Genetics, 45 Suppl 5, S171-S183. |
| 4. | Eyrich, V. A., Koh, I. Y. Y., Przybylski, D., Graña, O., Pazos, F. et al. (2003). CAFASP3 in the spotlight of EVA. Proteins: Structure, Function, and Genetics, 53 Suppl 6, 548-560. |
| 5. | Fischer, D., Rychlewski, L., Dunbrack, R. L., Jr., Ortiz, A. R. & Elofsson, A. (2003). CAFASP3: the third critical assessment of fully automated structure prediction methods. Proteins: Structure, Function, and Genetics, 53, 503-516. |
| 6. | Sander, C. & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics, 9, 56-68. |
| 7. | Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12, 85-94. |
| 8. | Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D. et al. (1977). The Protein Data Bank: a computer based archival file for macromolecular structures. Journal of Molecular Biology, 112, 535-542. |
| 9. | Berman, H. M., Battistuz, T., Bhat, T. N., Bluhm, W. F., Bourne, P. E. et al. (2002). The Protein Data Bank. Acta Crystallogr D Biol Crystallogr, 58, 899-907. |
| 10. | Rost, B. (2001). Protein secondary structure prediction continues to rise. Journal of Structural Biology, 134, 204-218. |
| 11. | Rost, B. & Liu, J. (2003). The PredictProtein server. Nucleic Acids Research, 31, 3300-3304. |
| 12. | Rost, B. (2005). How to use protein 1D structure predicted by PROFphd. In The Proteomics Protocols Handbook (Walker, J. E., eds.), pp. 875-901, Humana, Totowa NJ. |
| 13. | Rost, B. (1998). Marrying structure and genomics. Structure, 6, 259-263. |
| 14. | Shapiro, L. & Lima, C. D. (1998). The Argonne Structural Genomics Workshop: Lamaze class for the birth of a new science. Structure, 6, 265-267. |
| 15. | Liu, J. & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics, 18, 922-933. |
| 16. | Portugaly, E., Kifer, I. & Linial, M. (2002). Selecting targets for structural determination by navigating in a graph of protein families. Bioinformatics, 18, 899-907. |
| 17. | Friedberg, I., Jaroszewski, L., Ye, Y. & Godzik, A. (2004). The interplay of fold recognition and experimental structure determination in structural genomics. Curr Opin Struct Biol, 14, 307-12. |
| 18. | Zhang, B., Rychlewski, L., Pawlowski, K., Fetrow, J. S., Skolnick, J. et al. (1999). From fold predictions to function predictions: automation of functional site conservation analysis for functional genome predictions. Protein Sci, 8, 1104-15. |
| 19. | Skolnick, J. & Fetrow, J. S. (2000). From genes to protein structure and function: novel applications of computational approaches in the genomic era. Trends Biotechnol, 18, 34-9. |
| 20. | Thornton, J. M. (2001). From genome to function. Science, 292, 2095-2097. |
| 21. | Goldsmith-Fischman, S. & Honig, B. (2003). Structural genomics: computational methods for structure analysis. Protein Sci, 12, 1813-21. |
| 22. | Rost, B., Liu, J., Przybylski, D., Nair, R., Bigelow, H. et al. (2003). Prediction of protein structure through evolution. In Handbook of Chemoinformatics - from data to knowledge (Gasteiger, J. & Engel, T., eds.), pp. 1789-1811, Wiley-VCH, Weinheim. |
| 23. | Galaktionov, S. G. & Rodionov, M. A. (1980). Calculation of the tertiary structure of proteins on the basis of analysis of the matrices of contacts between amino acid residues. Biophysics, 25, 395-403 (translation of Biofizika, 1980, 25:385-392). |
| 24. | Havel, T. F., Kuntz, I. D. & Crippen, G. M. (1983). The theory and practice of distance geometry. Bull. Math. Biol., 45, 665-720. |
| 25. | Nilges, M. (1995). Calculation of protein structures with ambiguous distance restraints. Automated assignment of ambiguous NOE crosspeaks and disulphide connectivities. Journal of Molecular Biology, 245, 645-660. |
| 26. | Goebel, U., Sander, C., Schneider, R. & Valencia, A. (1994). Correlated mutations and residue contacts in proteins. Proteins: Structure, Function, and Genetics, 18, 309-317. |
| 27. | Olmea, O. & Valencia, A. (1997). Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Folding & Design, 2, S25-S32. |
| 28. | Olmea, O., Rost, B. & Valencia, A. (1999). Effective use of sequence correlation and conservation in fold recognition. Journal of Molecular Biology, 293, 1221-1239. |
| 29. | Fariselli, P., Olmea, O., Valencia, A. & Casadio, R. (2001). Progress in predicting inter-residue contacts of proteins with neural networks and correlated mutations. Proteins: Structure, Function, and Genetics, Suppl, 157-162. |
| 30. | Pollastri, G. & Baldi, P. (2002). Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics, 18, S62-S70. |
| 31. | Bystroff, C. & Shao, Y. (2002). Fully automated ab initio protein structure prediction using I-SITES, HMMSTR and ROSETTA. Bioinformatics, 18, S54-S61. |
| 32. | Shao, Y. & Bystroff, C. (2003). Predicting interresidue contacts using templates and pathways. Proteins: Structure, Function, and Genetics, 53, 497-502. |
| 33. | Zhao, Y. & Karypis, G. (2003). Prediction of contact maps using support vector machines. In 3rd IEEE International Conference on Bioinformatics and Bioengineering (BIBE) (Society, I. C., eds.), pp. 26-33. |
| 34. | MacCallum, R. M. (2004). Striped sheets and protein contact prediction. Bioinformatics, 20 Suppl 1, I224-I231. |
| 35. | Ortiz, A. R., Kolinski, A., Rotkiewicz, P., Ilkowski, B. & Skolnick, J. (1999). Ab initio folding of proteins using restraints derived from evolutionary information. Proteins: Structure, Function, and Genetics, Suppl 3, 177-185. |
| 36. | Skolnick, J., Zhang, Y., Arakaki, A. K., Kolinski, A., Boniecki, M. et al. (2003). TOUCHSTONE: a unified approach to protein structure prediction. Proteins: Structure, Function, and Genetics, 53, 469-479. |
| 37. | Gorodkin, J., Lund, O., Andersen, C. A. & Brunak, S. (1999). Using sequence motifs for enhanced neural network prediction of protein distance constraints. Ismb, 95-105. |
| 38. | Fariselli, P., Olmea, O., Valencia, A. & Casadio, R. (2001). Prediction of contact maps with neural networks and correlated mutations. Protein Engineering, 14, 835-843. |
| 39. | Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 247, 536-40. |
| 40. | Koh, I. Y. Y., Eyrich, V. A., Marti-Renom, M. A., Przybylski, D., Madhusudhan, M. S. et al. (2003). EVA: evaluation of protein structure prediction servers. Nucleic Acids Research, 31, 3311-3315. |
| 41. | Miyazawa, S. & Jernigan, R. L. (1985). Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules, 18, 534-552. |
| 42. | Sippl, M. J. (1990). The calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures of globular proteins. Journal of Molecular Biology, 213, 859-883. |
| 43. | Galaktionov, S. G. & Marshall, G. R. (1994). Properties of Intraglobular Contacts in Proteins: An Approach to Prediction of Tertiary Structure. In 27th Hawaii International Conference on System Sciences (Hunter, L., eds.), pp. 326-335, Los Alamitos, CA: IEEE Computer Society Press, Wailea, HI, U.S.A. |
| 44. | Hubbard, T. J. P. (1994). Use of β-strand interaction pseudo-potential in protein structure prediction and modelling. In 27th Hawaii International Conference on System Sciences (Hunter, L., eds.), pp. 336-344, IEEE Society Press, Maui, Hawaii, USA. |
| 45. | Taylor, W. R. & Hatrick, K. (1994). Compensating changes in protein multiple sequence alignment. Protein Engineering, 7, 341-348. |
| 46. | Lund, O., Frimand, K., Gorodkin, J., Bohr, H., Bohr, J. et al. (1997). Protein distance constraints predicted by neural networks and probability density functions. Protein Engineering, 10, 1241-1248. |
| 47. | Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, 232, 584-599. |
| 48. | Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Methods in Enzymology, 266, 525-539. |
| 49. | Przybylski, D. & Rost, B. (2002). Alignments grow, secondary structure prediction improves. Proteins: Structure, Function, and Genetics, 46, 195-205. |
| 50. | Bohr, H., Bohr, J., Brunak, S., Cotterill, R. M. J., Lautrup, B. et al. (1988). Protein secondary structure and homology by neural networks. FEBS Letters, 241, 223-228. |
| 51. | Qian, N. & Sejnowski, T. J. (1988). Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202, 865-884. |
| 52. | Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389-402. |
| 53. | Przybylski, D. & Rost, B. (2004). Improving fold recognition without folds. Journal of Molecular Biology, 341, 255-269. |
| 54. | Creighton, T. (1992). Proteins: Structures and Molecular Properties. |
| 55. | Wootton, J. C. & Federhen, S. (1996). Analysis of compositionally biased regions in sequence databases. Methods Enzymol, 266, 554-71. |
| 56. | Lifson, S. & Sander, C. (1979). Antiparallel and parallel beta-strands differ in amino acid residue preferences. Nature, 282, 109-111. |
| 57. | Fariselli, P. & Casadio, R. (1999). A neural network based predictor of residue contacts in proteins. Protein Eng, 12, 15-21. |
| 58. | Rost, B. & Sander, C. (1994). Conservation and prediction of solvent accessibility in protein families. Proteins, 20, 216-26. |
| 59. | Efron, B. & Tibshirani, R. J. (1993). An introduction to the bootstrap. |
| 60. | Levitt, M. (1976). A simplified representation of protein conformations for rapid simulation of protein folding. J. Mol. Biol., 104, 59-107. |
| 61. | Andreeva, A., Howorth, D., Brenner, S. E., Hubbard, T. J., Chothia, C. et al. (2004). SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res, 32 Database issue, D226-9. |
| 62. | Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers, 22, 2577-2637. |
| 63. | Rost, B. & O'Donoghue, S. I. (1997). Sisyphus and prediction of protein structure. Computer Applications in Biological Science, 13, 345-356. |
| 64. | Eyrich, V., Martí-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242-1243. |
| 65. | Punta, M. & Rost, B. (2005). Protein folding rates estimated from contact predictions. Journal of Molecular Biology, in press. |
| 66. | Pazos, F., Rost, B. & Valencia, A. (1999). A platform for integrating threading results with protein family analyses. Bioinformatics, 15, 1062-1063. |
| 67. | Asogawa, M. (1997). Beta-sheet prediction using inter-strand residue pairs and refinement with Hopfield neural network. Proc Int Conf Intell Syst Mol Biol, 5, 48-51. |
| 68. | Humphrey, W., Dalke, A. & Schulten, K. (1996). VMD: visual molecular dynamics. J Mol Graph, 14, 33-8, 27-8. |
| Contact: cubic@cubic.bioc.columbia.edu | Version: Apr 11, 2005 |