
Title: NORSp: predictions of long regions without regular secondary structure
Author: Jinfeng Liu & Burkhard Rost
Quote: NAR, 2003, 31(13):3833-3835

NORSp: predictions of long regions without regular secondary structure

Jinfeng Liu (1, 3, 4) & Burkhard Rost (1, 2, 3, *)

1 CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
4 Dept. of Pharmacology, Columbia Univ., 630 West 168th Street, New York, NY 10032, USA
* Corresponding author:  email = liu@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/  Tel: +1-212-305-3773, fax: +1-212-305-7932

This article is published in Nucleic Acids Research, 2003, 31(13):3833-3835, © copyright Oxford University Press (2003). OUP is the only authorised source. All copying of this article, including placing it on another website, requires the written permission of the copyright owner.

Abstract

Many structurally flexible regions play important roles in biological processes. It has been shown that extended loopy regions are very abundant in the protein universe and that they have been conserved through evolution. Here, we present NORSp, a publicly available predictor for disordered regions in proteins. Specifically, NORSp predicts long regions with no regular secondary structure. Upon user submission of a protein sequence, NORSp analyses the protein for its secondary structure and for the presence of transmembrane helices and coiled-coil regions. It then notifies the user by e-mail of the presence and position of disordered regions. NORSp can be accessed from http://cubic.bioc.columbia.edu/services/NORSp/.

 

Key words: no regular secondary structure; disordered regions; sequence analysis; secondary structure.

 

 

Abbreviations used

3D structure: three-dimensional co-ordinates of protein structure
COILS: prediction of coiled-coil regions from sequence based on statistics and expert rules [1]
NORS: segment of more than 70 consecutive residues of NO Regular Secondary structure, i.e. without helix or strand (more precisely, we required that less than 12% of the residues in the respective region were in helix or strand and that at least one region of more than 10 residues was exposed to solvent) [2]
PROFphd: profile-based neural network prediction of secondary structure and solvent accessibility [3]
PHDhtm: profile-based neural network prediction of transmembrane helices [4]


 

Introduction

Irregular structures mediate function. The three-dimensional (3D) structure of a protein is assumed to largely determine its biological function. The first decades of rapid progress in the experimental determination of 3D structures by X-ray crystallography [5] focused on determining "rigid" structures at high resolution. Recently, a new type of structure has emerged, with very long regions that appear to adopt regular structure only upon binding to substrates or other proteins [6]; such regions are referred to as floppy, natively disordered, natively unfolded, or loopy [7, 8, 9, 10, 2]. It seems that these irregular regions are important for function.

Predicting irregular structures. Structural irregularity can be studied from several angles. One class of 'natively disordered' regions was defined as the regions that are invisible in the electron density maps from X-ray diffraction, presumably because their flexibility keeps them from crystallising into well-ordered structures. These regions are sometimes associated with regions of 'compositional bias' or 'low sequence complexity' [11, 12, 13]. Another class is characterised by proteins that appear unfolded in CD measurements [7]. Previously, we investigated the problem of disordered proteins from a structure-oriented perspective and studied extended regions with a very low content of regular secondary structure (helix or strand), which we called NORS [2]. We showed that NORS regions are particularly abundant in eukaryotic proteomes, conserved during evolution, over-represented in regulatory functional categories, and important in protein-protein interactions. These results were in agreement with studies that predicted 'natively disordered regions' through neural networks [14].

Here, we introduce a web-based interface that makes our method for predicting NORS regions publicly accessible. The method can be useful for biologists in several ways. For example, crystallographers can check whether their proteins contain NORS regions before deciding whether to proceed with experiments, since NORS proteins may be difficult to crystallise, as demonstrated by their low occurrence in PDB [2]. Biologists interested in protein structure-function relationships may also want to verify whether protein-protein interaction sites coincide with NORS regions.

 

Design and implementation

Definition of NORS. We defined NORS regions as segments of more than 70 consecutive residues with less than 12% of the residues in helix, strand, or coiled-coil regions and with at least one segment of ten adjacent residues exposed to solvent. We identify such NORS regions by merging predictions of secondary structure, transmembrane helices, and coiled-coil regions. We pre-calculate this information, as well as the NORS regions, for each protein in more than 60 completely sequenced genomes (Fig. 1), and have included the results in our PEP database [15] through a searchable SRS [16] interface (http://cubic.bioc.columbia.edu/db/PEP/). NORS information has also been used in our target selection process for the North East Structural Genomics Consortium [17] to exclude proteins likely to pose problems for crystallisation.
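The definition above can be illustrated with a minimal Python sketch. It is not the NORSp implementation: the one-letter state labels (H, E, C, T for helix, strand, coiled-coil and transmembrane helix), the function names, and the choice to count transmembrane helices as structured are assumptions made only for this illustration.

```python
from itertools import groupby

# Hypothetical per-residue labels (not NORSp's own encoding):
# H = helix, E = strand, C = coiled-coil, T = transmembrane helix.
STRUCTURED = set("HECT")


def longest_exposed_run(exposed):
    """Length of the longest run of consecutive solvent-exposed residues."""
    return max(
        (sum(1 for _ in group) for flag, group in groupby(exposed) if flag),
        default=0,
    )


def is_nors_segment(states, exposed, min_length=70,
                    max_content=0.12, min_exposed_run=10):
    """Check one segment against the three criteria quoted in the text.

    states  -- string with one label per residue
    exposed -- sequence of booleans, True where the residue is exposed
    """
    if len(states) != len(exposed):
        raise ValueError("states and exposed must have the same length")
    if len(states) <= min_length:          # "more than 70 consecutive residues"
        return False
    content = sum(s in STRUCTURED for s in states) / len(states)
    return (content < max_content          # "less than 12%" structured
            and longest_exposed_run(exposed) >= min_exposed_run)
```

For example, an 80-residue segment with 9 helical residues (11.25%) and a 12-residue exposed stretch satisfies all three criteria, whereas the same segment with 10 helical residues (12.5%) does not.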




Fig. 1. NORS proteins are much more abundant in eukaryotes than in prokaryotes and archae-bacteria. Shown in the graph are the average percentages of NORS proteins in the three kingdoms. Error bars indicate the maximum and minimum values.




 

Prediction by NORSp. Protein sequences submitted to our web site are subjected to the following steps. (a) A sequence profile is built through a database search with an automated, iterated PSI-BLAST [18]. (b) Secondary structure and solvent accessibility are predicted by PROFphd [3], and membrane helices are predicted by PHDhtm [4], both using the PSI-BLAST profiles. (c) Coiled-coil regions are predicted by COILS [1]. (d) The secondary structure, membrane helix, and coiled-coil information is then combined to calculate the structural content for each sequence window of a given length; NORS regions are identified wherever the structural content is below the given threshold, and overlapping NORS regions are joined. Technically, to obtain most of these intermediate results, NORSp utilises the same engine that is behind the PredictProtein server [19] (http://cubic.bioc.columbia.edu/predictprotein/).
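Step (d) can be sketched as a sliding-window scan followed by a merge of overlapping hits. The Python sketch below is only an illustration under stated assumptions: the function name and the boolean per-residue input are hypothetical, the solvent-exposure criterion is left out for brevity, and directly adjacent windows are joined along with overlapping ones.

```python
def find_low_structure_regions(structured, window=70, max_content=0.12):
    """Scan all windows and join overlapping low-structure windows.

    structured -- sequence of booleans, True where the residue is predicted
                  to be in a helix, strand, membrane helix or coiled-coil
    Returns a list of (start, end) pairs, 0-based with end exclusive.
    """
    n = len(structured)
    if n < window:
        return []

    # Prefix sums give the structured-residue count of any window in O(1).
    prefix = [0]
    for flag in structured:
        prefix.append(prefix[-1] + bool(flag))

    regions = []
    for start in range(n - window + 1):
        end = start + window
        content = (prefix[end] - prefix[start]) / window
        if content < max_content:
            if regions and start <= regions[-1][1]:
                # Overlaps (or touches) the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

With 100 unstructured residues followed by 100 structured ones, the windows starting at positions 0 to 38 contain at most 8 structured residues out of 70 (about 11.4%), so the sketch reports the single joined region (0, 108).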

 

Input, output, and advanced options

Input. The input to NORSp is a protein sequence; proteins shorter than 70 residues are returned unprocessed. Currently, the valid input formats are a sequence in one-letter residue code or FASTA format. The sequence can be entered into the sequence text box or uploaded from the user's local disk.
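The input rules above might be handled as in the following Python sketch. It does not reproduce the server's own parser: the function name, the accepted residue alphabet (the 20 standard letters plus B, Z, X and U), and the decision to reject multi-record FASTA input are assumptions made for illustration.

```python
# Assumed alphabet: 20 standard amino acids plus the codes B, Z, X, U.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBZXU")


def parse_submission(text, min_length=70):
    """Parse a bare one-letter sequence or a single FASTA record.

    Returns (header, sequence, processable); header is None for bare
    sequences, and processable is False for sequences shorter than
    min_length, which the text says are returned unprocessed.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    header = None
    if lines and lines[0].startswith(">"):
        header = lines[0][1:].strip()
        lines = lines[1:]
    if any(line.startswith(">") for line in lines):
        raise ValueError("expected a single sequence")

    # Drop internal whitespace and normalise to upper case.
    sequence = "".join("".join(line.split()) for line in lines).upper()
    invalid = set(sequence) - VALID_RESIDUES
    if invalid:
        raise ValueError(f"invalid residue codes: {sorted(invalid)}")
    return header, sequence, len(sequence) >= min_length
```

A call such as `parse_submission(">my protein\nACDE\nFGHI")` returns the header, the joined sequence `ACDEFGHI`, and `False`, because eight residues fall below the 70-residue minimum.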

Output. Users have the option of receiving "succinct" output, which shows only the positions of NORS regions in the context of the submitted sequence, or "verbose" output, which also includes the intermediate data used by NORSp: secondary structure, solvent accessibility, transmembrane helices, and coiled-coil predictions. By default, the results are in plain text (ASCII) format. However, HTML-formatted results, which can be displayed in any web browser, can also be requested. Because of concerns about file size and mailbox overflow, the results are normally made available for download from our website and only the URLs are sent to users by e-mail, unless users request that the full results be sent directly.

Recommendation and advanced options. We chose the particular thresholds used to define NORS regions so as to minimise the false positive rate, as determined by manually inspecting PDB proteins [2]. This conservative choice implies that the vast majority of the NORS regions that we detect are likely to constitute structurally irregular, floppy, loopy, or natively disordered regions. However, we presumably miss many such regions in our predictions. Users who are aware of this may be interested in changing the thresholds to see which regions are good candidates for irregular regions although they are not detected with our defaults. We provide three options for advanced users: the size of the sequence window used to calculate secondary structure content (default=70), the maximum secondary structure content (default=12%), and the minimum length of consecutive exposed residues (default=10).
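The effect of relaxing the defaults can be shown with a small worked example in Python. Only the three default values come from the text; the class, the field names, and the helper function are hypothetical and do not correspond to the actual form fields of the server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NorsOptions:
    """The three advanced options, with the defaults quoted in the text."""
    window: int = 70                      # residues per sequence window
    max_structure_percent: float = 12.0   # upper limit on structure content
    min_exposed_run: int = 10             # consecutive exposed residues

    def __post_init__(self):
        if self.window < 1:
            raise ValueError("window must be at least 1")
        if not 0 <= self.max_structure_percent <= 100:
            raise ValueError("max_structure_percent must be within 0..100")
        if self.min_exposed_run < 0:
            raise ValueError("min_exposed_run must not be negative")


def window_passes(structured_count, options):
    """True if a window with this many structured residues is below the limit."""
    percent = 100.0 * structured_count / options.window
    return percent < options.max_structure_percent
```

A window with 10 structured residues out of 70 (about 14.3%) fails `window_passes(10, NorsOptions())` under the 12% default, but passes with `NorsOptions(max_structure_percent=15.0)`. This is the kind of borderline candidate that a relaxed threshold brings into view.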

Acknowledgements

Thanks to Hepan Tan (Columbia) for his help in developing the tool. This work was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health (NIH). Last but not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.

 

 

References

1. Lupas, A. (1996). Prediction and analysis of coiled-coil structures. Methods in Enzymology, 266, 513-525.
2. Liu, J., Tan, H. & Rost, B. (2002). Loopy proteins appear conserved in evolution. Journal of Molecular Biology, 322, 53-64.
3. Rost, B. (2001). Review: protein secondary structure prediction continues to rise. J Struct Biol, 134, 204-18.
4. Rost, B., Casadio, R. & Fariselli, P. (1996). Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Science, 5, 1704-1718.
5. Hendrickson, W. A. (1991). Determination of macromolecular structures from anomalous diffraction of synchrotron radiation. Science, 254, 51-58.
6. Wright, P. E. & Dyson, H. J. (1999). Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol, 293, 321-31.
7. Uversky, V. N., Gillespie, J. R. & Fink, A. L. (2000). Why are "natively unfolded" proteins unstructured under physiologic conditions? Proteins: Structure, Function, and Genetics, 41, 415-427.
8. Dunker, A. K. & Obradovic, Z. (2001). The protein trinity-linking function and disorder. Nature Biotechnology, 19, 805-806.
9. Namba, K. (2001). Roles of partly unfolded conformations in macromolecular self-assembly. Genes Cells, 6, 1-12.
10. Zetina, C. R. (2001). A conserved helix-unfolding motif in the naturally unfolded proteins. Proteins: Structure, Function, and Genetics, 44, 479-483.
11. Wootton, J. C. & Federhen, S. (1996). Analysis of compositionally biased regions in sequence databases. Methods in Enzymology, 266, 554-571.
12. Dunker, A. K., Garner, E., Guilliot, S., Romero, P., Albrecht, K. et al. (1998). Protein disorder and the evolution of molecular recognition: theory, predictions and observations. Pac Symp Biocomput, 473-84.
13. Dunker, A. K., Lawson, J. D., Brown, C. J., Williams, R. M., Romero, P. et al. (2001). Intrinsically disordered protein. J Mol Graph Model, 19, 26-59.
14. Romero, P., Obradovic, Z., Li, X., Garner, E. C., Brown, C. J. et al. (2001). Sequence complexity of disordered protein. Proteins: Structure, Function, and Genetics, 42, 38-48.
15. Carter, P., Liu, J. & Rost, B. (2003). PEP: Predictions for Entire Proteomes. Nucleic Acids Res, 31, 410-3.
16. Etzold, T. & Argos, P. (1993). SRS--an indexing and retrieval tool for flat file data libraries. Comput Appl Biosci, 9, 49-57.
17. Liu, J. & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics, 18, 922-933.
18. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389-402.
19. Rost, B. & Liu, J. (2003). The PredictProtein server. Nucleic Acids Research, submitted.

Contact:    rost@columbia.edu Version:    Mar 17, 2003