Continuum secondary structure captures protein flexibility
1 Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA, Andersen: ca2@cubic.bioc.columbia.edu, Palmer: agp6@columbia.edu, Rost: rost@columbia.edu
2 Center for Biological Sequence Analysis, BioCentrum, The Technical University of Denmark, DK-2800 Lyngby, Denmark, Andersen: ca2@cbs.dtu.dk, Brunak: brunak@cbs.dtu.dk
3 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
* Corresponding authors: rost@columbia.edu, http://cubic.bioc.columbia.edu Tel: +1-212-305-3773, fax: +1-212-305-7932
Quote: submitted to: Structure (Jul 18, 2001; resubmitted Dec 7, 2001)
The DSSP program automates protein secondary structure assignment, using an algorithm that assigns every residue to one of eight states. However, any discrete assignment is incomplete, because the continuum of thermal fluctuations cannot be described. Hence, a continuous assignment of secondary structure that replaces 'static' by 'dynamic' states is proposed. Technically, the continuum results from calculating weighted averages over ten discrete DSSP assignments with different hydrogen bond thresholds. The final set of weights for the continuous assignment maximised the secondary structure similarity between different NMR models of the same protein. Individual NMR models obtained from high-quality data sets were more similar to one another than were X-ray structures of close homologues. More importantly, the final continuous assignment for a single NMR model successfully reflected the structural variations observed between all NMR models in the ensemble. The structural variations between NMR models were verified to correlate with thermal motion by comparison with generalised order parameters for backbone amide moieties. The final continuous DSSP assignments reflected the structural variations due to thermal fluctuations as detected by NMR spectroscopy. The continuous assignment reproduces the structural variation between many NMR models from one single model; therefore, functionally important variation can be extracted from a single X-ray structure using the continuous assignment procedure. Thus, continuous assignments of secondary structure may impact future protein structure analysis, comparison and prediction.
Key words: protein secondary structure assignment, evaluation, protein motion, protein structure prediction, protein function, NMR spectroscopy, structure comparison.
| DSSP | Dictionary of Secondary Structure of Proteins |
| DSSPcont | continuous DSSP assignment presented here |
| xstate | describes one of the eight DSSP assignments (G, H, I, E, B, T, S, L) |
| xclass | describes a group of states (GHI or GI) |
| xgroup | categorise all states into groups, e.g. {GHI, EB, LST} or {GI, H, E, B, T, SL} |
| NMR | Nuclear Magnetic Resonance |
| PDB | Protein Data Bank of 3D coordinates [1] |
| rmsd | root-mean-square deviation between protein coordinates |
| Adiff | summed difference between secondary structure assignments |
| Armsd | root-mean-square deviation between secondary structure assignments |
DSSP assigns secondary structure through hydrogen bonds. Pauling and colleagues correctly predicted the idealised protein secondary structures of α-helices [2], π-helices [2] and β-sheets [3] based on intra-backbone hydrogen bonds. Five decades later, we know that, on average, about half of the residues in proteins are located in helices and sheets [1]. Pauling and colleagues incorrectly predicted that 3₁₀-helices would not occur in proteins, due to unfavourable bond angles; however, approximately 4% of all residues are observed in this conformation [4]. The DSSP program developed by Kabsch and Sander [5] assigns secondary structure as described by Pauling and colleagues for three helix types (3₁₀ 'G', α 'H', π 'I') and for two extended sheet types (anti-parallel and parallel 'E'). DSSP restricts helices to segments with at least two consecutive hydrogen bonds flanking the helix, and strands to segments with at least three hydrogen bonds within the same extended sheet. Shorter segments are categorised as turn 'T' for helices and β-bridge 'B' for strands. The remaining two DSSP states are bends 'S' and other (dubbed 'L' in the following). DSSP allows for considerable deviations from the idealised hydrogen bond pattern in helices and strands, e.g., four hydrogen bonds suffice to assign an eight-residue α-helix ('>>44XX44<<' in DSSP) and β-bulges of up to four residues are allowed within extended strands. These deviations are captured in DSSP output files, but not in the final discrete assignment of secondary structure states.
Discrete secondary structure assignments differ. DSSP is the most widely used assignment method. However, other methods have been used to assign secondary structure: based on Cα coordinates (DEFINE [6]), protein curvature (P-Curve [7]), phi/psi-angles (Ramachandran [8]), expert knowledge (crystallographers' assignments in PDB), on phi/psi-angles and expert assignments (STRIDE [9]), and by visual inspection of Cα traces [10]. Assignments from DSSP, DEFINE, and P-Curve agree for 63% of all residues [11]; DSSP and STRIDE agree for 96% [4]. All these methods use non-physical thresholds in order to assign discrete secondary structure states.
Molecular motion of proteins in solution is captured by NMR. Proteins do not have unique, rigid structures in solution. The degree of flexibility varies significantly between structural segments and at least some conformational fluctuations are essential for function. Recently the correlation of local conformational variations with protein function has become an important part of experimental structural biology [12, 13]. In particular, NMR studies have emphasised the importance of structural changes over multiple length and time scales as observed for instance in calmodulin [14, 15, 16]. Protein structure determination by NMR spectroscopy finds many models, the ensemble, that are consistent with experimental constraints. The variations between these models result partially from experimental inconsistencies and incomplete data sets, but they are also believed to result partially from intrinsic fluctuations [17, 18]. NMR spin relaxation measurements are sensitive directly to conformational fluctuations [19]. In particular, the generalised order parameter S² describes the equilibrium distribution of bond vector orientations on pico- to nano-second time scales. For example, 1-S² is proportional to the variance of the angular distribution for small amplitude conformational fluctuations. Nonetheless, all currently successful secondary structure prediction methods implicitly assume the existence of one rigid protein structure. Typically, developers of structure prediction methods either do not use ensembles of NMR structures at all, or use only one representative model.
Assignment evaluation based on consistency. A fundamental question addressed here is how to evaluate and compare assignment schemes. A secondary structure assignment ought to neglect certain details of structures, while retaining others. We argue that a desirable feature of an assignment is consistency, i.e., the difference between proteins with the same tertiary structure should be minimised. This means that a "good" secondary structure assignment scheme should minimise the influence of small structural variations due to noise in the experimental determination process and thermal fluctuations. We can therefore evaluate an assignment scheme by comparing assignments within structural families of proteins (different sequences, similar structures) or between different NMR models of a protein (same sequences, similar structures).
Continuum of secondary structure assignment. We introduce a continuous assignment of secondary structure (DSSPcont). In our approach we chose to rely on NMR models to develop DSSPcont; however, we also investigated structural homologues determined by X-ray crystallography. Both sequence variations and thermally induced conformational fluctuations can result in structural differences between structural homologues. These two effects are indistinguishable for structural homologues. The fuzzy helix capping depicted in Fig. 1a illustrates the variability in secondary structure assigned by DSSP for different NMR models of the same protein. The second α-helix has a well-defined core (residues 24-28), while the N- and C-caps of that helix are not well defined (fuzzy). Although strong capping signals have been reported for α-helices [10, 20, 21] and β-strands [22, 4], such caps are harder to predict than the core [23, 24, 25, 26]. The fuzziness of the DSSP cap assignments (Fig. 1a) indicates why caps are difficult to predict. Here, we show that the DSSPcont assignment successfully distinguishes between sharp and fuzzy caps. We found that secondary structure assignments varied less between different NMR models for the same protein than between X-ray structures of close homologues. The continuous assignment of secondary structure increases the assignment similarity in both cases. We also show that the variation between NMR models correlates with thermal motion, and that DSSPcont reproduces the variation observed between all models of a protein from the assignment based on a single model. Thus, DSSPcont captures information about thermal motion. The continuous assignment is publicly available (Methods).
Choosing weights for the hydrogen bond thresholds. We assigned a continuum of secondary structure by running DSSP with various hydrogen bond thresholds. We weighted the individual DSSP assignments by $w_h$ for a given hydrogen bond threshold h. Thus, we calculated the DSSPcont values for the structural class c from the assigned state s ∈ {G, H, I, T, E, B, S, L} and residue i:
| $\mathrm{DSSPcont}_c(i) = \frac{\sum_h w_h \, \delta_c(s_h(i))}{\sum_h w_h}$ | (Eqn 1) |
where the discrete $\delta_c(s_h(i))$ can be either 1 or 0, depending on whether the state $s_h(i)$ assigned to residue i at threshold h belongs to class c, and $\mathrm{DSSPcont}_c(i)$ describes the probability that a given residue i is in class c. To score a given weighting scheme, we used the different models reported in NMR structure ensembles and calculated the average difference between single model assignments and the mean assignment (eqn. 2). The best weighting scheme consequently ensured that the assignment extracted as much information as possible from the single NMR model given.
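The weighted average described above can be sketched in a few lines of Python. This is a minimal illustration, not the DSSPcont program itself: the function name `dsspcont`, the toy assignment strings, and the three weights are hypothetical, and the discrete per-threshold DSSP strings are assumed to have been produced beforehand.

```python
def dsspcont(assignments, weights, classes):
    """Weighted average of discrete DSSP assignments (illustrative sketch).

    assignments: {threshold: string with one DSSP state per residue}
    weights:     {threshold: non-negative, not necessarily normalised weight}
    classes:     {class name: set of DSSP states belonging to that class}
    Returns one {class name: probability} dict per residue.
    """
    total_weight = sum(weights[h] for h in assignments)
    n_residues = len(next(iter(assignments.values())))
    profile = []
    for i in range(n_residues):
        probs = {c: 0.0 for c in classes}
        for h, states in assignments.items():
            for c, members in classes.items():
                # The discrete indicator is 1 if the state assigned at
                # threshold h belongs to class c, and 0 otherwise.
                if states[i] in members:
                    probs[c] += weights[h] / total_weight
        profile.append(probs)
    return profile


# Hypothetical toy input: three thresholds (kcal/mol), four residues.
toy_assignments = {-0.3: "HHHH", -0.5: "LHHH", -0.7: "LLHH"}
toy_weights = {-0.3: 1.0, -0.5: 2.0, -0.7: 1.0}
three_classes = {"helix": set("GHI"), "strand": set("EB"), "other": set("LST")}

profile = dsspcont(toy_assignments, toy_weights, three_classes)
for i, probs in enumerate(profile):
    print(i, probs)
```

In this toy case the first residue is helical only at the weakest threshold and therefore receives a low helix probability, while the last two residues are helical at every threshold and receive a helix probability of one.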
Coarse-grained optimum well-defined. The 100 best weighting schemes were all similar for helix {GHI}, strand {EB} and other {LST} ( Fig. 2 a). This similarity indicated that the weighting scheme had a well-defined stable global optimum. As expected, the most dominant weights were found close to the default DSSP hydrogen bond threshold of -0.5 kcal/mol. The weight for the -0.2 kcal/mol threshold was consistently low, while the adjacent threshold at -0.3 kcal/mol was consistently high ( Fig. 2 a). This prompted us to insert another threshold at -0.25 kcal/mol.
Fine-tuning the weights. We found the optimal set of weights using the entire NMR data set containing 211 proteins by a stepwise gradient descent (Fig. 2b). The final average difference over all states with respect to the mean assignment was 0.091. Hence, the DSSPcont assignment for a single model indeed reflected the structural variations between different NMR models of the same protein. The weights $w_h$ for the default threshold and all weaker thresholds (h ≥ -0.5 kcal/mol) summed to 74% of the total weight. Thus, a helix or strand assigned by the DSSP default accounted for at least 74% of the probability in the DSSPcont assignment. 53% of the DSSPcont weight mass originated from hydrogen bond thresholds weaker than the default (h > -0.5 kcal/mol). Thus, helices or strands ignored by the default DSSP can maximally obtain 53% of the DSSPcont probability. The default DSSP hydrogen bond threshold (-0.5 kcal/mol, carrying the remaining 21%) occurred near the centre of the weighting scheme, with 53% probability weight for the weaker thresholds and 26% for the stronger thresholds. Thus, the conventional DSSP tends to under- rather than to over-assign regular secondary structure.
DSSPcont correlates with variations between NMR models. The continuous assignment for 1cy3 appeared to correlate well with variations between the NMR models (Fig. 3b). Most strikingly, the transition from α-helix to mixed α-helical/turn states observed for residues 23 and 28 in the NMR ensemble was captured by DSSPcont from one model alone (Fig. 3b). To further define this correlation we analysed a complete database of homologous X-ray structures and NMR structural ensembles.
DSSP states largely maintained by DSSPcont. We found that all states were dominated by the original DSSP assignment when the DSSPcont assignments were mapped to the eight DSSP states (Fig. 3). Thus, the exchanges between classes mutually cancelled (Table 1), leaving the average occurrence of each class practically unaltered. β-bridges (B in Fig. 3) were affected most: 10% of the probability mass was assigned as β-strand. Consequently, the default DSSP β-bridge often constitutes an ignored β-strand that would have been assigned given a lower hydrogen bond threshold. Conversely, the B state received probability mass from the states L, S and T, resulting in a 15% net increase of the overall probability mass for state B (Table 1).
| Assignment method | G | H | I | T | E | B | S | L |
| Default DSSP | 3.9 | 33.5 | 0.0 | 11.6 | 21.0 | 1.2 | 8.8 | 19.9 |
| DSSPcont | 3.8 | 34.0 | 0.0 | 11.4 | 21.6 | 1.4 | 8.5 | 19.3 |
# The average propensities for the eight secondary structure states were compiled on 1534 non-homologous X-ray and NMR protein chains (values given in percentages). The state propensities remained nearly unchanged between the default DSSP and DSSPcont assignments. The small differences observed show a small flow from the loop and tight helix states (GSL) to the more regular helix and strand structures (HEB).
Consistency was higher for DSSPcont than for default DSSP. We compared the consistency of the default and the continuous assignments by measuring the average difference (Adiff, eqn. 2) and the assignment rmsd (Armsd, eqn. 3). Throughout all states, DSSP appeared less consistent than DSSPcont under the Armsd score ( Table 2 ). For the Adiff score both assignments appeared similar due to small differences in the overall occurrences ( Table 1 ). Large differences dominate the Armsd score (sum over squares), while many small differences dominate the Adiff score. Hence, the differences between two continuous assignments were common but small.
| Score | Set | Method | G | H | I | T | E | B | S | L |
| Adiff | X-ray | Default DSSP | 2.5 | 5.7 | 0.0 | 4.9 | 3.5 | 0.8 | 4.5 | 7.4 |
| Adiff | X-ray | DSSPcont | 2.5 | 5.7 | 0.0 | 4.8 | 3.7 | 0.8 | 4.3 | 7.3 |
| Adiff | NMR | Default DSSP | 1.4 | 2.7 | 0.0 | 5.7 | 2.0 | 0.6 | 6.3 | 5.8 |
| Adiff | NMR | DSSPcont | 1.3 | 2.6 | 0.0 | 5.7 | 2.1 | 0.7 | 6.2 | 5.9 |
| Armsd | X-ray | Default DSSP | 15.7 | 23.8 | 1.2 | 22.0 | 18.7 | 8.6 | 21.1 | 27.2 |
| Armsd | X-ray | DSSPcont | 13.7 | 22.4 | 0.8 | 19.4 | 17.4 | 7.5 | 19.7 | 25.7 |
| Armsd | NMR | Default DSSP | 11.8 | 16.3 | 1.8 | 24.0 | 14.0 | 7.8 | 25.2 | 24.2 |
| Armsd | NMR | DSSPcont | 9.2 | 13.1 | 1.4 | 19.4 | 11.3 | 6.2 | 22.5 | 22.1 |
# We compared the consistency of the default DSSP assignment with that of DSSPcont by the average difference (Adiff, eqn. 2) and the assignment rmsd (Armsd, eqn. 3). We used two data sets: (1) X-ray homologues (according to FSSP [35], ZDali ≥ 10) and (2) different NMR models for the same protein. The regular secondary structure assignments were significantly more consistent between the NMR models of the same proteins than between the X-ray homologues. Adiff is dominated by many small differences, whereas for Armsd minor differences are less important because the differences are squared. Thus, DSSPcont proved considerably more consistent when giving less importance to minor differences (Armsd). All values in percentages.
Flows between classes link secondary structure states. To compare two continuous assignments, i.e. two probability vectors, we introduce the "flow" measure that describes the transformation of one DSSPcont vector into another ( Fig. 4 ). The average flow (Aflow matrix, eqn. 6) links states that often are assigned differently for the same residue. The average probability flow between the eight states describes the web of links between the eight states ( Fig. 5 ). We observed important flows from the helix states (GHI) to turns (T), from turns to the bend state (S), and finally from bends to the loop/other state (L). Because the (GHITS) states all describe a spiral geometry of the backbone, a continuous transition appeared to exist from the helix conformation (GHI), through short helices with few hydrogen bonds (T), to spirals/bends without hydrogen bonds (S), and finally to non-regular conformations (L). This suggested a poor description of the energies involved, because we would expect a disfavoured intermediate value [27] . Two effects may be involved: (1) the backbone-backbone hydrogen bonds do not include all energies involved, (2) the Coulomb hydrogen bond expression used in DSSP is too simple [4] .
About 2% of NMR models agree to less than 80% in Q3. Assume that all NMR models for one protein are, on average, equally accurate, and that we know only one model. How well can we then predict the secondary structure of all other models? The average Qtot (eqn. 5) prediction performance between NMR models using the default DSSP ranged from 93% for 3 classes (helix [GHI], strand [EB] and other [TSL]), to 89% for 6 classes (H, [GI], E, B, T, [SL]), to 85% for all 8 DSSP states. How many of the inferred predictions were worse than that? One out of four NMR models achieved less than 80% prediction accuracy when comparing all eight DSSP states (Fig. 6); one out of nine NMR models achieved less than 80% in six classes, and one out of 50 achieved less than 80% for three classes (Fig. 6). We clearly do not expect to reach the accuracy of NMR experiments by methods predicting secondary structure from sequence. Current secondary structure prediction methods are gradually approaching a sustained mark around 80% [28, 29]. Furthermore, prediction accuracy has exceeded 80% for about 60% of all proteins, and reached 93% for about 4% of all proteins [29]. Hence, the best prediction methods appeared as accurate as the most extreme NMR model for most proteins, and reached the average of all NMR models for about 4% of all proteins.
Correlation increases through DSSPcont. The accuracy measured by the Pearson correlation coefficient (eqn. 4) showed the same trend as Qtot: 2% of all NMR models fall below 0.65 for default DSSP ( Fig. 6 d). This level is reached by today's best prediction methods [28, 29] . The disagreement decreased substantially when using DSSPcont, e.g. reducing the percentage of models correlated < 0.8 in three states from 11% for DSSP to 5% for DSSPcont ( Fig. 6 d). We concluded that secondary structure prediction methods have reached a level of accuracy at which assignment inconsistencies have become important.
Assignment evaluation minimises influence of thermal fluctuations. We have used the differences between good quality NMR models of the same protein as indicators of thermal fluctuations to evaluate secondary structure assignments. By comparing experimental backbone ¹⁵N order parameter data relating the conformational fluctuations of proteins in solution to the Cα rmsd between NMR models, we were able to validate the stipulated correlation (Fig. 7). As expected, not all of the variation between structural models reflected thermal fluctuations. For example, the high structural disorder in the C-terminus of 1d5v was not reflected in the measured order parameters.
DSSPcont reveals protein motion. The order parameter data enabled a comparison of the DSSPcont assignment with thermal fluctuations (Fig. 8). By measuring the average propensity of regular structure (helix [GHI] and strand [EB]) for segments of three consecutive residues, DSSPcont indicated regions with medium to high degrees of motion given a single NMR model. Thus, DSSPcont can suggest regions of the polypeptide subject to conformational disorder from the coordinates of one NMR model or one X-ray structure alone.
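The three-residue averaging of regular-structure propensity can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the per-residue probabilities are invented toy values, and the text does not specify how windows are positioned, so simple overlapping windows are assumed here.

```python
REGULAR_STATES = set("GHIEB")  # helix [GHI] and strand [EB]


def regular_propensity(profile):
    """Sum the probabilities of regular states for each residue.

    profile: list of {DSSP state: probability} dicts, one per residue.
    """
    return [sum(p for s, p in res.items() if s in REGULAR_STATES)
            for res in profile]


def window_average(values, size=3):
    """Average over every segment of `size` consecutive residues."""
    return [sum(values[i:i + size]) / size
            for i in range(len(values) - size + 1)]


# Hypothetical continuous assignment for four residues.
toy_profile = [
    {"H": 1.0},
    {"H": 0.9, "T": 0.1},
    {"H": 0.4, "T": 0.6},
    {"E": 0.1, "L": 0.9},
]
per_residue = regular_propensity(toy_profile)
segments = window_average(per_residue, size=3)
print(per_residue, segments)
```

Segments with a low averaged propensity would, under this reading, be candidates for regions with elevated motion; any cut-off separating "rigid" from "mobile" segments would be a further assumption.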
Continuous assignment captures functional variations. We have taken the diversity between different NMR models of the same protein at face value and shown how protein structure assignment can profit from this variety ( Fig. 1 ). The resulting DSSPcont assignment scheme extends Pauling's hydrogen-bond-energy based definition of secondary structure by minimising the influence of variations due to thermal fluctuations and noise ( Table 2 ). In fact, the DSSPcont assignment correlates with intra-molecular thermal fluctuations in solution ( Fig. 8 ).
Variation between NMR models correlates with flexibility. We argued that "good" assignments differ only between regions in protein structures that are not conserved between close homologues or different NMR models and that distinguish between regions of thermal motion and less flexible regions. We showed that these two objectives were closely related. The assignment consistency between NMR models reflected in part the thermal fluctuations in proteins because the differences between NMR models correlated with independent NMR measurements of intra-molecular flexibility (Fig. 7). On this premise, we proposed a novel assignment scheme (DSSPcont) that minimises the influence of thermal motion and noise. This assignment had to be continuous because discrete assignments fail to capture thermal fluctuations [30, 31] inherent to protein structure and function. Overall, DSSPcont captured the structural variations between different NMR models of the same protein, as well as between close homologues, based on one NMR model or one X-ray structure (Fig. 1, Table 2).
Variations between homologues tend to maintain assignment classes. Overall, we observed a 'continuous transition' from assignments describing various degrees of spiral backbone geometry (H->G->T->S) to non-regular conformations (Fig. 5). FSSP finds homologues by focusing on the overall fold rather than on structural details. This could explain the surprisingly large flows (Fig. 4) from helix to strand observed for homologues (Fig. 5, Fig. 6). In contrast, thermal fluctuations appeared to be the major contributor to DSSPcont assignments for different NMR models (Fig. 8), suggesting that the experimental noise tends to cancel when averaging over many good NMR structures.
Continuous assignment impacts structure analysis, comparison and prediction. Continuous assignment of secondary structure is likely to improve methods that employ secondary structure assignments for structure comparison [32] , threading [33] , and prediction of conformational switches [34] . One major advantage of continuous assignment is that ragged helix caps and weak strand segments are less likely to be overlooked. Another advantage is that experimentally indiscernible differences are de-emphasised. Secondary structure prediction methods are currently reaching within the realm of error inherent to the default DSSP assignment. This fact calls for a rethinking of the assignment scheme. Most importantly, thermal fluctuations are no longer ignored but have become an integral part of the DSSPcont assignments.
The automatic assignment of protein secondary structure from three-dimensional co-ordinates of protein structures is an important and, in principle, a simple bioinformatics tool. Assignments are used to visualise structures, to speed up computationally expensive structural comparisons, and to improve sequence searches. Hence, secondary structure assignments are important to assure the optimal yield of experimental structures and to cleverly select the targets for structural genomics. Although a conceptually simple task, the assignment of secondary structure is not always well defined. In fact, assignments vary between different NMR models of the same protein and between X-ray structures of homologues. Here, we argued that such differences are not a problem of the assignment scheme, rather that they carry important information if adequately processed. We showed that the variations between different NMR models correlate with thermal disorder. Because the novel continuous assignment of secondary structure (DSSPcont) reproduces the observed variation between high-quality NMR models, it also correlates with mobility related to protein function. Thus, continuous secondary structure assignments can predict conformational variations from a single X-ray structure and thereby may assist predictions of functionally important residues. More generally, it may help to pave the way to automatically generate valid hypotheses from protein structures. Finally, the continuous assignment appeared to describe ends of regular secondary structure segments (helices and strands) more accurately than discrete assignments. Often these caps carry important information about function and structure. Hence, the continuum may sharpen the tools that already profit from discrete assignments.
The DSSPcont assignment was constructed by applying nine hydrogen bond thresholds from -0.2 kcal/mol in steps of 0.1 down to -1 kcal/mol (eqn. 1). Testing three values per weight gives 3⁹ = 19683 test rounds and approximately seven CPU days testing for ten proteins. We sampled the following non-normalised values: $w_h$ = 1, 3, 5 ∀ h ≥ -0.5 kcal/mol and $w_h$ = 0.25, 1, 2 ∀ h < -0.5 kcal/mol on 10 NMR proteins with a total of 474 models (Fig. 2a). To fine-tune the weighting scheme, we performed a simple gradient descent optimisation for 50, 100, 150 and 211 proteins (Fig. 2b).
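The size of the coarse-grained search can be checked with a short enumeration. The sketch below only generates the candidate weighting schemes; scoring each scheme against the NMR ensembles is not shown. The helper names are hypothetical, and it is assumed that the lower candidate values (0.25, 1, 2) apply to the thresholds stronger than -0.5 kcal/mol.

```python
from itertools import product

# Nine thresholds from -0.2 down to -1.0 kcal/mol in steps of 0.1.
thresholds = [round(-0.2 - 0.1 * k, 1) for k in range(9)]


def candidate_values(h):
    """Three non-normalised candidate weights for threshold h (assumed split)."""
    return (1.0, 3.0, 5.0) if h >= -0.5 else (0.25, 1.0, 2.0)


def weighting_schemes():
    """Yield every combination of candidate weights, normalised to sum to one."""
    for combo in product(*(candidate_values(h) for h in thresholds)):
        total = sum(combo)
        yield {h: w / total for h, w in zip(thresholds, combo)}


n_schemes = sum(1 for _ in weighting_schemes())
first_scheme = next(weighting_schemes())
print(len(thresholds), n_schemes)
```

Three candidate values for each of nine thresholds give 3⁹ combinations, which matches the number of test rounds quoted above.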
Selecting representative protein structures. We used representative protein structures according to FSSP [35] (October 6th 2000) in order to maximise the coverage of protein space. The FSSP data set contained 2361 structurally non-homologous protein chains longer than 30 residues. All structures were downloaded from the Protein Data Bank [1]. We used three data sets to compare assignments: (1) a mixed data set (1534 representative high-resolution X-ray and NMR structures); (2) an X-ray data set with 145 representative high-resolution chains each having at least ten X-ray homologues (ZDali ≥ 10) yielding 3245 X-ray homologues; and (3) an NMR data set containing 211 chains from good NMR structures with at least 10 models giving a total of 4639 NMR models. The NMR chains selected were either representative chains or substitutes for a representative chain (Z-score ≥ 10). To ensure good quality X-ray structures, we discarded all structures with resolutions above 2.5 Å and chains having less than 70% of the residues in the most favoured region of Ramachandran angles [36].
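The two X-ray quality criteria stated above amount to a simple filter, sketched here for illustration. The `Chain` record, its field names, and the example entries are hypothetical; only the two cut-offs (2.5 Å and 70%) are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Chain:
    """Hypothetical record holding the two quality measures for one chain."""
    label: str
    resolution: float         # X-ray resolution in Angstrom
    favoured_fraction: float  # fraction of residues in the most favoured
                              # Ramachandran region, between 0 and 1


def passes_quality(chain, max_resolution=2.5, min_favoured=0.70):
    """Keep chains with resolution <= 2.5 A and >= 70% favoured residues."""
    return (chain.resolution <= max_resolution
            and chain.favoured_fraction >= min_favoured)


# Invented example entries, not real PDB chains.
candidates = [
    Chain("chain_a", resolution=1.8, favoured_fraction=0.91),
    Chain("chain_b", resolution=2.9, favoured_fraction=0.88),
    Chain("chain_c", resolution=2.1, favoured_fraction=0.64),
]
kept = [c.label for c in candidates if passes_quality(c)]
print(kept)
```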
Missing quality criteria for NMR structures. No well-established criteria exist for evaluating the quality of NMR structures, even with the experimental NMR data at hand. NMR quality assessment methods divide into two categories: self-consistency checks using coordinate data (PROCHECK-NMR [37], WHAT IF [38] and average pairwise rmsd between NMR models) and experimental validation using NMR data (cross-validation of NMR data [39, 40], Monte Carlo noise simulation [41], NOE restraint violations [37], completeness of NOEs [42] and the number of NOEs per residue). Joint evaluation schemes have been presented [43, 44]. Many of the evaluation methods are available; however, the various data formats used to store NMR data are not easily interconvertible. This has left large-scale quality assessment prohibitively time-consuming and limited to experts [42].
Performing large-scale quality evaluation of NMR structures. To select good-quality NMR structures, we constructed a ranking scheme based on the experimental methods described in the PDB header, the number of NOEs per residue (when available), the deposition date, and percentage of residues in the most favoured region of Ramachandran angles. The two latter measures have been shown to correlate to the completeness of NOEs [42] and more recently determined structures tend to use more powerful isotope-edited 3D and 4D spectroscopic methods. Our selection classified 39% of all NMR structures considered (≥ 10 models, ≥ 30 residues) to be of good quality. Good quality structures as defined by this protocol had an average of 79.3% residues with backbone dihedral angles in the most favoured Ramachandran area and 17.3 NOEs per residue.
Order parameter data. Fig. 7 displays results for seven NMR structures of good quality, for which we could obtain S² data: 1b2t [45], 1cdn [46], 1d5v [47], 1e41 [48], 1fsp [49], 1vre [50], and 1xoa [51] (variants of a single protein were not used to avoid over-representing a single protein fold). The rmsd was calculated between the heavy backbone atoms in all NMR model pairs, which were 3D aligned with respect to the helix/sheet core (HEcore) segments. The proteins were all selected to have physically reasonable order parameter data [52] satisfying [… 53], or the Indiana Dynamic database [54].
The DSSP version used was downloaded from CMBI, version as of April 2000 [55] . To calculate the rmsd between protein chains we used the CE program [56] . Protein structures were visualised with MOLMOL [57] .
Adiff: Throughout all calculations we have ensured equal weighting of each protein (CC, Qtot) or each residue (Adiff, Armsd, flow), irrespective of the number of NMR models or homologues found for a given structure. This is important because the number of pairs grows quadratically (M(M-1)/2) with the total number of models/homologues M. Some adaptations and extensions of the formulas given are therefore necessary. The average difference between two assignments A, B is defined as:
| $\mathrm{Adiff}_c(A,B) = \frac{1}{N} \sum_{i=1}^{N} \lvert A_{ic} - B_{ic} \rvert$ | (Eqn 2) |
where N is the number of residues, i counts over all the amino acids and c is a given class. Adiff is also used as a single number to score an assignment scheme by comparing all NMR models m to the average assignment over all models (applying Aic and Bic =
Armsd: The root-mean-square difference between two assignments A, B is given by:
| $\mathrm{Armsd}_c(A,B) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (A_{ic} - B_{ic})^2}$ | (Eqn 3) |
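The different behaviour of the two scores can be illustrated with a small sketch. It assumes that, for one class, Adiff is the mean absolute difference and Armsd the root of the mean squared difference over all residues; the function names and the toy probability vectors are hypothetical.

```python
import math


def adiff(a, b):
    """Mean absolute difference between two per-residue probability vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def armsd(a, b):
    """Root-mean-square difference between two per-residue probability vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Helix probabilities for four residues in a reference assignment.
reference = [1.0, 1.0, 1.0, 1.0]
many_small = [0.9, 0.9, 0.9, 0.9]  # every residue differs slightly
one_large = [1.0, 1.0, 1.0, 0.6]   # a single residue differs strongly

print(adiff(reference, many_small), armsd(reference, many_small))
print(adiff(reference, one_large), armsd(reference, one_large))
```

Both toy comparisons have the same mean absolute difference, but the single large deviation yields twice the root-mean-square difference, which is the sense in which large differences dominate Armsd.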
CC: The Pearson correlation coefficient is defined as:
| $CC_c = \frac{\sum_i (A_{ic} - \langle A_c \rangle)(B_{ic} - \langle B_c \rangle)}{\sqrt{\sum_i (A_{ic} - \langle A_c \rangle)^2 \, \sum_i (B_{ic} - \langle B_c \rangle)^2}}$ | (Eqn 4) |
where c is the secondary structure class, i counts over all residues in a given protein, $A_{ic}$ is the predicted value for residue i in class c and $B_{ic}$ the respective assigned value, and $\langle A_c \rangle$, $\langle B_c \rangle$ denote the average values over all residues in all proteins. We defined the average in this particular way since some classes are missing in some proteins (e.g. in all-α and all-β proteins), which would otherwise yield CC=0 for these classes although the assignments A and B may be identical. The average correlation coefficient is then trivially $CC = \frac{1}{C} \sum_c CC_c$, where C is the number of classes.
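The effect of taking the averages over all proteins rather than per protein can be seen in a toy example. The sketch assumes the standard Pearson form with the means supplied from outside; the function names and data are hypothetical, and returning 0.0 for a vanishing denominator is a convention chosen here.

```python
import math


def global_mean(vectors):
    """Average over all residues in all proteins for one class."""
    values = [v for vec in vectors for v in vec]
    return sum(values) / len(values)


def class_cc(a, b, mean_a, mean_b):
    """Pearson-type correlation for one protein, using externally supplied means."""
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


# Strand probabilities for two toy proteins; the first has no strand at all.
strand_a = [[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
strand_b = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

mean_a = global_mean(strand_a)
mean_b = global_mean(strand_b)

cc_global = class_cc(strand_a[0], strand_b[0], mean_a, mean_b)
cc_local = class_cc(strand_a[0], strand_b[0], 0.0, 0.0)  # per-protein means
print(cc_global, cc_local)
```

For the strand-free protein the two assignments are identical; with per-protein means the correlation degenerates to zero, whereas with the global means it comes out as one.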
Qtot: The total percentage of correct predictions Qtot is:
| $Q_{tot} = \frac{100}{N} \sum_{c=1}^{C} TP_c$ | (Eqn 5) |
where c counts over all C classes, TPc is the number of true positive predictions in class c and N is the total number of residues.
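As an illustration of how Qtot depends on the grouping of states, the following sketch compares two discrete assignments residue by residue. The function name, the mapping dictionary, and the two toy state strings are hypothetical; the three-class grouping (helix [GHI], strand [EB], other [TSL]) follows the text.

```python
# Three-class grouping: helix [GHI], strand [EB], other [TSL].
THREE_CLASSES = {"G": "H", "H": "H", "I": "H",
                 "E": "E", "B": "E",
                 "T": "L", "S": "L", "L": "L"}


def q_tot(predicted, observed, mapping=None):
    """Percentage of residues assigned to the same class in both strings.

    Summing the true positives over all classes equals counting the
    positions at which the two (optionally mapped) assignments agree.
    """
    if mapping is not None:
        predicted = [mapping[s] for s in predicted]
        observed = [mapping[s] for s in observed]
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Two invented eight-state assignments for the same ten residues.
model_a = "LHHHHTTEEL"
model_b = "LGHHHSLEBL"

q8 = q_tot(model_a, model_b)
q3 = q_tot(model_a, model_b, THREE_CLASSES)
print(q8, q3)
```

The two toy models disagree at four positions in eight states, yet all four disagreements fall within the same coarse class, so the three-class score is perfect.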
flow: The link between two continuous assignments A and B is measured by the flow $\mathrm{flow}_{i \to j}$ (Fig. 4). It describes the probability flow from assignment A in state i to B in state j, when A and B disagree about the probability assigned for the two states:
| | (Eqn 6) |
| | (Eqn 7) |
| ∆ABi = Ai - Bi | (Eqn 8) |
where Ai is the probability of state i according to assignment A. Summing the matrix values in flow yields the total flow (Tflow(AB)) that reaches one for non-overlapping assignments. The flow matrix describes the flows involved when turning the assignment found in A into the one found in B. Averaging the flow between all pairs AB for all residues, we finally get $\mathrm{Aflow}_{i \to j}$ (Fig. 5).
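One way to realise such a flow for a single residue is sketched below. The exact expressions of eqns 6 and 7 are not reproduced in this text, so the proportional split used here is an assumption: it was chosen only because it is driven by the differences Ai - Bi and gives a total flow of one for non-overlapping assignments. The function name and the toy vectors are hypothetical.

```python
def flow_matrix(a, b):
    """Distribute probability surplus in A onto the deficits relative to B.

    a, b: {state: probability} for one residue, each summing to one.
    Returns ({(i, j): flow from state i to state j}, total flow).
    Assumption: each surplus is split in proportion to the deficits.
    """
    states = sorted(set(a) | set(b))
    delta = {s: a.get(s, 0.0) - b.get(s, 0.0) for s in states}
    surplus = {s: d for s, d in delta.items() if d > 0}
    deficit = {s: -d for s, d in delta.items() if d < 0}
    total = sum(surplus.values())
    flows = {}
    for i, out in surplus.items():
        for j, need in deficit.items():
            flows[(i, j)] = out * need / total
    return flows, total


# Toy residue: A is mostly helix, B shifts some probability to turn and bend.
flows, total = flow_matrix({"H": 0.8, "T": 0.2},
                           {"H": 0.5, "T": 0.3, "S": 0.2})
print(flows, total)

# Non-overlapping assignments give a total flow of one.
disjoint_flows, disjoint_total = flow_matrix({"H": 1.0}, {"E": 1.0})
print(disjoint_flows, disjoint_total)
```

Averaging such per-residue matrices over all residues and all pairs of assignments would give an average flow matrix in the spirit of the one described above.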
DSSPcont assignments are provided given a PDB identifier or a PDB file at www.cbs.dtu.dk/services/DSSPcont and at cubic.bioc.columbia.edu/services/DSSPcont. The program package can be downloaded through the same sites.
The work was supported by a grant from The Technical University of Denmark (awarded to CAFA), by The Danish National Research Foundation (awarded to SB), by the National Science Foundation (MCB-9722392, awarded to AGP) and by the National Institutes of Health (P506M62413-01 and RO1-GM63029-01, awarded to BR). We thank Jinfeng Liu (Columbia, New York) for technical assistance and system maintenance. We thank the authors of the NMR structures 1b2t [45], 1cdn [46], 1d5v [47], 1e41 [48], 1fsp [49], 1vre [50] and 1xoa [51] for providing S2 data. We are also grateful to Jürgen F. Doreleijers (BioMagResBank, Univ. of Wisconsin) for comments regarding NMR quality assessment. Last but not least, thanks to all those who deposit experimental data in public databases and to those who maintain such databases.
| 1. | Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucl. Acids Res., 28, 235-242. |
| 2. | Pauling, L., Corey, R. B. & Branson, H. R. (1951). Two Hydrogen-Bonded Helical Configurations of the Polypeptide Chain. Proc. Natl. Acad. Sci. USA, 37, 205-211. |
| 3. | Pauling, L. & Corey, R. B. (1951). Configurations of Polypeptide Chains with Favored Orientations Around Single Bonds: Two New Pleated Sheets. Proc. Natl. Acad. Sci. USA, 37, 729-740. |
| 4. | Andersen, C. A. (1998). Neural Network Assignment of Protein Secondary Structure with Increased Predictability. Master thesis, The Technical University of Denmark. |
| 5. | Kabsch, W. & Sander, C. (1983). How good are predictions of protein secondary structure? FEBS Lett., 155, 179-182. |
| 6. | Richards, F. M. & Kundrot, C. E. (1988). Identification of structural motifs from protein coordinate data: secondary structure and first-level supersecondary structure. Proteins, 3, 71-84. |
| 7. | Sklenar, H., Etchebest, C. & Lavery, R. (1989). Describing protein structure: a general algorithm yielding complete helicoidal parameters and a unique overall axis. Proteins, 6, 46-60. |
| 8. | Ramachandran, G. N. & Sasisekharan, V. (1968). Conformation of polypeptides and proteins. Adv. Prot. Chem., 23, 284-438. |
| 9. | Frishman, D. & Argos, P. (1995). Knowledge-based protein secondary structure assignment. Proteins, 23, 566-579. |
| 10. | Richardson, J. S. & Richardson, D. C. (1988). Amino acid preference for specific locations at the end of α helices. Science, 240, 1648-1652. |
| 11. | Colloc'h, N., Etchebest, C., Thoreau, E., Henrissat, B. & Mornon, J.-P. (1993). Comparison of three algorithms for the assignment of secondary structure in proteins: the advantages of a consensus assignment. Prot. Engin., 6, 377-382. |
| 12. | Heel, M. v. (2000). Unveiling ribosomal structures: the final phases. Cur. Opin. Struct. Bio., 10, 259-264. |
| 13. | Brunger, A. T. & Laue, E. D. (2000). New approaches to study macromolecular structure and function. Cur. Opin. Struct. Bio., 10, 557. |
| 14. | Barbato, G., Ikura, M., Kay, L. E. & Pastor, R. W. (1992). Backbone dynamics of calmodulin studied by nitrogen-15 relaxation using inverse detected two-dimensional NMR spectroscopy: the central helix is flexible. Biochem., 31, 5269-5278. |
| 15. | Lee, A. L., Kinnear, S. A. & Wand, A. J. (2000). Redistribution and loss of side chain entropy upon formation of a calmodulin-peptide complex. Nat. Struct. Biol., 7, 72-77. |
| 16. | Evenas, J., Malmendal, A. & Akke, M. (2001). Dynamics of the transition between open and closed conformations in a calmodulin C-terminal domain mutant. Structure, 9, 185-195. |
| 17. | Bonvin, A. M. J. J. & Brunger, A. T. (1996). Do NOE distances contain enough information to assess the relative populations of multi-conformer structures? J. Biomol. NMR, 7, 72-76. |
| 18. | Chalaoux, F. R., O'Donoghue, S. I. & Nilges, M. (1999). Molecular dynamics and accuracy of NMR structures: effects of error bounds and data removal. Proteins, 34, 453-463. |
| 19. | Palmer, A. G. (2001). NMR probes of molecular dynamics: Overview and Comparison with other techniques. Annu. Rev. Biophys. Biomol. Struct., 30, 129-155. |
| 20. | Harper, E. T. & Rose, G. D. (1993). Helix stop signals in proteins and peptides: the capping box. Biochem., 32, 7605-7609. |
| 21. | Aurora, R. & Rose, G. D. (1998). Helix capping. Prot. Sci., 7, 21-38. |
| 22. | Colloc'h, N. & Cohen, F. E. (1991). β-breakers: An aperiodic secondary structure. J. Mol. Biol., 221, 603-613. |
| 23. | Brunak, S. (1991). Non-linearities in training sets identified by inspecting the order in which neural networks learn. In Neural Networks From Biology to High Energy Physics (Benhar, O., Bosio, C., Del Giudice, P. & Tabet, E., eds.), pp. 277-88, Elba, Italy. |
| 24. | Rost, B. & Sander, C. (1994). 1D secondary structure prediction through evolutionary profiles. In Protein Structure by Distance Analysis (Bohr, H. & Brunak, S., eds.), pp. 257-276, IOS Press, Amsterdam, Oxford, Washington. |
| 25. | Brunak, S. & Engelbrecht, J. (1996). Protein structure and the sequential structure of mRNA: α-helix and β-sheet signals at the nucleotide level. Proteins, 25, 237-252. |
| 26. | Riis, S. K. & Krogh, A. (1996). Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. J. Comp. Biol., 3, 163-183. |
| 27. | Sippl, M. J. (1996). Helmholtz free energy of peptide hydrogen bonds in proteins. J. Mol. Biol., 260, 644-648. |
| 28. | Petersen, T. N., Lundegaard, C., Nielsen, M., Bohr, H., Bohr, J. et al. (2000). Prediction of protein secondary structure at 80% accuracy. Proteins, 41, 17-20. |
| 29. | Rost, B. (2001). Protein secondary structure prediction continues to rise. J. Struct. Biol., 134, 204-218. |
| 30. | Bax, A. & Tjandra, N. (1997). Are proteins even floppier than we thought? Nat. Struct. Biol., 4, 254-256. |
| 31. | Feher, V. A. & Cavanagh, J. (1999). Millisecond-timescale motions contribute to the function of the bacterial response regulator protein SPO0F. Nature, 400, 289-293. |
| 32. | Przytycka, T., Aurora, R. & Rose, G. D. (1999). A protein taxonomy based on secondary structure. Nat. Struct. Biol., 6, 672-682. |
| 33. | Rost, B. (1995). TOPITS: Threading One-dimensional Predictions Into Three-dimensional Structures. In Third International Conference on Intelligent Systems for Molecular Biology (Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T. et al., eds.), pp. 314-321, Menlo Park, CA: AAAI Press, Cambridge, England. |
| 34. | Young, M., Kirshenbaum, K., Dill, K. A. & Highsmith, S. (1999). Predicting conformational switches in proteins. Prot. Sci., 8, 1752-1764. |
| 35. | Holm, L. & Sander, C. (1998). Touring protein fold space with Dali/FSSP. Nucl. Acids Res., 26, 318-321. |
| 36. | Morris, A. L., MacArthur, M. W., Hutchinson, E. G. & Thornton, J. M. (1992). Stereochemical Quality of Protein Structure Coordinates. Proteins, 12, 345-364. |
| 37. | Laskowski, R. A., Rullmann, J. A., MacArthur, M. W., Kaptein, R. & Thornton, J. M. (1996). AQUA and PROCHECK-NMR: programs for checking the quality of protein structures solved by NMR. J. Biomol. NMR, 8, 477-86. |
| 38. | Vriend, G. (1990). WHAT IF: A molecular modeling and drug design program. J. Mol. Graph., 8, 52-56. |
| 39. | Brünger, A. T., Clore, M. G., Gronenborn, A. M., Saffrich, R. & Nilges, M. (1993). Assessing the quality of solution nuclear magnetic resonance structures by complete cross-validation. Science, 261, 328-331. |
| 40. | Bonvin, A. M. J. J. & Brunger, A. T. (1995). Conformational Variability of Solution Nuclear Magnetic Resonance Structures. J. Mol. Biol., 250, 80-93. |
| 41. | Shriver, J. & Edmondson, S. (1993). Defining the precision with which a protein structure is determined by NMR. Application to motilin. Biochem., 32, 1610-1617. |
| 42. | Doreleijers, J. F., Raves, M. L., Rullmann, J. A. C. & Kaptein, R. (1999). Completeness of NOEs in protein structures: A statistical analysis of NMR data. J. Biomol. NMR, 14, 123-132. |
| 43. | Doreleijers, J. F., Rullmann, J. A. C. & Kaptein, R. (1998). Quality Assessment of NMR Structures: A Statistical Survey. J. Mol. Biol., 281, 149-164. |
| 44. | Doreleijers, J. F., Vriend, G., Raves, M. L. & Kaptein, R. (1999). Validation of Nuclear Magnetic Resonance Structures of Proteins and Nucleic Acids: Hydrogen Geometry and Nomenclature. Proteins, 37, 404-416. |
| 45. | Mizoue, L. S., Bazan, J. F., Johnson, E. C. & Handel, T. M. (1999). Solution structure and dynamics of the CX3C chemokine domain of fractalkine and its interaction with an N-terminal fragment of CX3CR1. Biochem., 38, 1402. |
| 46. | Akke, M., Forsen, S. & Chazin, W. J. (1995). Solution structure of Cd2+-calbindin D9k reveals details of the stepwise structural changes along the Apo-(Cd2+)II1 - (Cd2+)II,I2 binding pathway. J. Mol. Biol., 252, 102-121. |
| 47. | van Dongen, M. J. P., Cederberg, A., Carlsson, P., Enerback, S. & Wikstrom, M. (2000). Solution structure and dynamics of the DNA-binding domain of the adipocyte-transcription factor FREAC-11. J. Mol. Biol., 296, 351-359. |
| 48. | Berglund, H., Olerenshaw, D., Sankar, A., Federwisch, M., McDonald, H. Q. et al. (2000). The Three-dimensional solution structure and dynamic properties of the human FADD death Domain. J. Mol. Biol., 302, 171-188. |
| 49. | Feher, V. A., Zapf, J. W., Hoch, J. A., Whiteley, J. M., McIntosh, L. P. et al. (1997). High-resolution NMR structure and backbone dynamics of the Bacillus subtilis response regulator, SPO0F: Implications for phosphorylation and molecular recognition. Biochem., 36, 10015-10025. |
| 50. | Volkman, B. F., Alam, S. L., Satterlee, J. D. & Markley, J. L. (1998). Solution structure and backbone dynamics of component IV Glycera dibranchiata monomeric hemoglobin-co. Biochem., 37, 10906. |
| 51. | Jeng, M. F., Campbell, A. P., Begley, T., Holmgren, A., Case, D. A. et al. (1994). High-resolution solution structures of oxidized and reduced Escherichia coli thioredoxin. Structure, 2, 853-868. |
| 52. | Case, D. A. (1999). Calculations of NMR dipolar coupling strengths in model peptides. J. Biomol. NMR, 15, 95-102. |
| 53. | Goodman, J. L., Pagel, M. D. & Stone, M. J. (2000). Relationships between protein structure and dynamics from a database of NMR-derived backbone order parameters. J. Mol. Biol., 295, 963-978. |
| 54. | Seavey, B. R., Farr, E. A., Westler, W. M. & Markley, J. L. (1991). A relational database for sequence-specific protein NMR data. J. Biomol. NMR, 1, 217-236. |
| 55. | Vriend, G. & Krieger, E. (2000). Centre for Molecular and Biomolecular Informatics CMBI version of DSSP. http://www.cmbi.kun.nl/gv/dssp/. |
| 56. | Shindyalov, I. N. & Bourne, P. E. (1998). Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Prot. Engin., 11, 739-747. |
| 57. | Koradi, R., Billeter, M. & Wuthrich, K. (1996). MOLMOL: a program for display and analysis of macromolecular structures. J. Mol. Graphics, 14, 51-55. |
| 58. | Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure:pattern recognition of hydrogen bonded and geometrical features. Biopolymers, 22, 2577-2637. |
| 59. | Rothemund, S., Liou, Y. C., Krause, E. & Sonnichsen, F. D. (1999). A new class of hexahelical insect proteins revealed as putative carriers of small hydrophobic ligands. Struct. Fold Design, 7, 1325-1332. |
| Contact: rost@columbia.edu | Version: Dec 8, 2001 |