CUBIC papers 1999-now

 


Columbia University
Department of Biochemistry and Molecular Biophysics/C2B2
1130 St. Nicholas Ave. Rm 805
New York, NY 10032, USA 

Email:  rost@columbia.edu
WWW:     http://www.rostlab.org/       
Tel:       +1-212-851-4669     
Fax:     +1-212-305-7932

NOTE: This is not a full list of publications from all members of the CUBIC group.

Bibliography

1.   B Rost (1999) Twilight zone of protein sequence alignments. Protein Engineering 12:85-94.

2.   A Zemla, C Venclovas, K Fidelis and B Rost (1999) A modified definition of SOV, a segment-based measure for protein secondary structure prediction assessment. Proteins: Structure, Function, and Genetics 34:220-223.

3.   F Pazos, B Rost and A Valencia (1999) A platform for integrating threading results with protein family analyses. Bioinformatics 15:1062-1063.

4.   O Olmea, B Rost and A Valencia (1999) Effective use of sequence correlation and conservation in fold recognition. Journal of Molecular Biology 293:1221-1239.

5.   D Fischer, et al. (1999) CAFASP-1: critical assessment of fully automated structure prediction methods. Proteins: Structure, Function, and Genetics Suppl 3:209-217.

6.   M Cokol, R Nair and B Rost (2000) Finding nuclear localisation signals. EMBO Reports 1:411-415.

7.   B Rost and C Sander (2000) Third generation prediction of secondary structure. Methods in Molecular Biology 143:71-95.

8.   B Rost (2001) Protein secondary structure prediction continues to rise. Journal of Structural Biology 134:204-218.

9.   V Eyrich, MA Martí-Renom, D Przybylski, A Fiser, F Pazos, A Valencia, A Sali and B Rost (2001) EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics 17:1242-1243.

10.            J Liu and B Rost (2001) Comparing function and structure between entire proteomes. Protein Science 10:1970-1979.

11.            B Rost and V Eyrich (2001) EVA: large-scale analysis of secondary structure prediction. Proteins: Structure, Function, and Genetics 45 Suppl 5:S192-S199.

12.            R Nair and B Rost (2001) Surface profiles predict sub-cellular localisation. preprint: Columbia University.

13.            D Fischer, A Elofsson, L Rychlewski, F Pazos, A Valencia, B Rost, AR Ortiz and RLJ Dunbrack (2001) CAFASP2: the second critical assessment of fully automated structure prediction methods. Proteins: Structure, Function, and Genetics 45 Suppl 5:S171-S183.

14.            B Rost, P Baldi, G Barton, J Cuff, V Eyrich, D Jones, K Karplus, R King, M Ouali, G Pollastri and D Przybylski (2001) Simple jury predicts protein secondary structure best. Preprint: Columbia University.

15.            B Rost and P Baldi (2001) New improvements in protein secondary structure prediction. Preprint: Columbia University.

16.            D Przybylski and B Rost (2002) Alignments grow, secondary structure prediction improves. Proteins: Structure, Function, and Bioinformatics 46:195-205.

17.            CAF Andersen, AG Palmer, S Brunak and B Rost (2002) Continuum secondary structure captures protein flexibility. Structure 10:175-184.

18.            B Rost (2002) Enzyme function less conserved than anticipated. Journal of Molecular Biology 318:595-608.

19.            J Liu and B Rost (2002) Target space for structural genomics revisited. Bioinformatics 18:922-933.

20.            R Nair and B Rost (2002) Inferring sub-cellular localisation through automated lexical analysis. Bioinformatics 18:S78-S86.

21.            J Liu, H Tan and B Rost (2002) Loopy proteins appear conserved in evolution. Journal of Molecular Biology 322:53-64.

22.            CP Chen, A Kernytsky and B Rost (2002) Transmembrane helix predictions revisited. Protein Science 11:2774-2791.

23.            R Nair and B Rost (2002) Sequence conserved for sub-cellular localization. Protein Science 11:2836-2847.

24.            CP Chen and B Rost (2002) Long membrane helices and short loops predicted less accurately. Protein Science 11:2766-2773.

25.            CP Chen and B Rost (2002) State-of-the-art in membrane prediction. Applied Bioinformatics 1:21-35.

26.            B Rost (2002) Did evolution leap to create the protein universe? Current Opinion in Structural Biology 12:409-416.

27.            G Pollastri, D Przybylski, B Rost and P Baldi (2002) Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins: Structure, Function, and Bioinformatics 47:228-235.

28.            MA Marti-Renom, MS Madhusudhan, A Fiser, B Rost and A Sali (2002) Reliability of assessment of protein structure prediction methods. Structure 10:435-440.

29.            A Sali, MA Marti-Renom, MS Madhusudhan, A Fiser and B Rost (2002) Reply to Moult et al. Structure 10:292-293.

30.            Y Ofran and B Rost (2003) Analysing six types of protein-protein interfaces. Journal of Molecular Biology 325:377-387.

31.            R Nair, P Carter and B Rost (2003) NLSdb: database of nuclear localization signals. Nucleic Acids Research 31:397-399.

32.            P Carter, J Liu and B Rost (2003) PEP: Predictions for Entire Proteomes. Nucleic Acids Research 31:410-413.

33.            B Rost (2003) Neural networks predict protein structure: hype or hit? In: P Frasconi and R Shamir (eds.). Artificial intelligence and heuristic methods in bioinformatics. Amsterdam: IOS Press:34-50.

34.            Y Ofran and B Rost (2003) Predict protein-protein interaction sites from local sequence information. FEBS Letters 544:236-239.

35.            B Rost and J Liu (2003) The PredictProtein server. Nucleic Acids Research 31:3300-3304.

36.            J Liu and B Rost (2003) NORSp: predictions of long regions without regular secondary structure. Nucleic Acids Research 31:3833-3835.

37.            R Nair and B Rost (2003) LOC3D: annotate sub-cellular localization for protein structures. Nucleic Acids Research 31:3337-3340.

38.            S Mika and B Rost (2003) UniqueProt: creating representative protein sequence sets. Nucleic Acids Research 31:3789-3791.

39.            A Kernytsky and B Rost (2003) Static benchmarking of membrane helix predictions. Nucleic Acids Research 31:3642-3644.

40.            IYY Koh, VA Eyrich, MA Marti-Renom, D Przybylski, MS Madhusudhan, E Narayanan, O Graña, A Valencia, A Sali and B Rost (2003) EVA: evaluation of protein structure prediction servers. Nucleic Acids Research 31:3311-3315.

41.            P Carter, CAF Andersen and B Rost (2003) DSSPcont: continuous secondary structure assignments for proteins. Nucleic Acids Research 31:3293-3295.

42.            VA Eyrich and B Rost (2003) META-PP: single interface to crucial prediction servers. Nucleic Acids Research 31:3308-3310.

43.            R Nair and B Rost (2003) Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins: Structure, Function, and Bioinformatics 53:917-930.

44.            VA Eyrich, IYY Koh, D Przybylski, O Graña, F Pazos, A Valencia and B Rost (2003) CAFASP3 in the spotlight of EVA. Proteins: Structure, Function, and Bioinformatics 53 Suppl 6:548-560.

45.            B Rost (2002) Rising accuracy of protein secondary structure prediction. In: D Chasman (eds.). Protein structure determination, analysis, and modeling for drug discovery. New York: Dekker:207-249.

46.            CAF Andersen and B Rost (2003) Automatic secondary structure assignment. Methods Biochem Anal. 44:341-363.

47.            B Rost (2003) Prediction in 1D: secondary structure, membrane helices, and accessibility. Methods Biochem Anal. 44:559-587.

48.            J Liu and B Rost (2003) Domains, motifs, and clusters in the protein universe. Current Opinion in Chemical Biology 7:5-11.

49.            B Rost, J Liu, D Przybylski, R Nair, H Bigelow, KO Wrzeszczynski and Y Ofran (2003) Prediction of protein structure through evolution. In: J Gasteiger and T Engel (eds.). Handbook of Chemoinformatics - from data to knowledge. Weinheim: Wiley-VCH:1789-1811.

50.            KO Wrzeszczynski and B Rost (2003) Cataloguing proteins in cell cycle control. In: H Lieberman (eds.). Cell cycle checkpoint control protocols. Totowa, NJ: Humana Press:219-233.

51.            B Rost, J Liu, R Nair, KO Wrzeszczynski and Y Ofran (2003) Automatic prediction of protein function. Cellular and Molecular Life Sciences submitted Mar 25, 2003.

52.            R Zidovetzki, B Rost, DL Armstrong and I Pecht (2003) Role of transmembrane domains in the functions of Fc receptors. Journal of Biophysical Chemistry 15:555-575.

53.            JM Aramini, et al. (2003) Solution NMR structure of the 30S ribosomal protein S28E from Pyrococcus horikoshii. Protein Science 12:2823-2830.

54.            J Liu and B Rost (2004) CHOP proteins into structural domains. Proteins: Structure, Function, and Bioinformatics 55:678-688.

55.            H Bigelow, D Petrey, J Liu, D Przybylski and B Rost (2004) Prediction of transmembrane beta-barrels for entire proteomes. Nucleic Acids Research 32:2566-2577.

56.            KO Wrzeszczynski and B Rost (2004) Annotating proteins from Endoplasmic reticulum and Golgi apparatus in eukaryotic proteomes. Cellular and Molecular Life Sciences 61:1341-1353.

57.            B Rost, G Yachdav and J Liu (2004) The PredictProtein server. Nucleic Acids Research 32:W321-W326.

58.            R Nair and B Rost (2004) LOCnet and LOCtarget: Sub-cellular localization for structural genomics targets. Nucleic Acids Research 32:W517-W521.

59.            J Liu and B Rost (2004) CHOP: parsing proteins into structural domains. Nucleic Acids Research 32:W569-W571.

60.            S Mika and B Rost (2004) NLProt: extracting protein names and sequences from papers. Nucleic Acids Research 32:W634-W637.

61.            J Liu, H Hegyi, TB Acton, GT Montelione and B Rost (2004) Automatic target selection for structural genomics on eukaryotes. Proteins: Structure, Function, and Bioinformatics 56:188-200.

62.            J Liu and B Rost (2004) Sequence-based prediction of protein domains. Nucleic Acids Research 32:3522-3530.

63.            S Mika and B Rost (2004) Protein names peeled precisely off free text. Bioinformatics 20:I241-I247.

64.            D Przybylski and B Rost (2004) Improving fold recognition without folds. Journal of Molecular Biology 341:255-269.

65.            R Nair and B Rost (2004) Annotating protein function through lexical analysis. AI Magazine 25:45-56.

66.            J Glasgow, I Jurisica and B Rost (2004) AI and Bioinformatics. AI Magazine 25:7-8.

67.            Z Wunderlich, TB Acton, J Liu, G Kornhaber, J Everett, P Carter, N Lan, N Echols, M Gerstein, B Rost and GT Montelione (2004) The protein target list of the Northeast Structural Genomics Consortium. Proteins: Structure, Function, and Bioinformatics 56:181-187.

68.            R Powers, TB Acton, Y Chiang, PK Rajan, JR Cort, MA Kennedy, J Liu, L Ma, B Rost and GT Montelione (2004) 1H, 13C and 15N assignments for the Archaeglobus fulgidis protein AF2095. Journal of Biomolecular NMR 30:107-108.

69.            S Mika and B Rost (2005) NMPdb: database of nuclear matrix proteins. Nucleic Acids Research 33:D160-163.

70.            R Nair and B Rost (2005) Mimicking cellular sorting improves prediction of subcellular localization. Journal of Molecular Biology 348:85-100.

71.            M Punta and B Rost (2005) Protein folding rates estimated from contact predictions. Journal of Molecular Biology 348:507-512.

72.            M Punta and B Rost (2005) PROFcon: novel prediction of long-range contacts. Bioinformatics 21:2960-2968.

73.            A Schlessinger and B Rost (2005) Protein flexibility and rigidity predicted from sequence. Proteins: Structure, Function, and Bioinformatics in press.

74.            Y Ofran and B Rost (2005) Predictive methods using protein sequence. In: AD Baxevanis and BF Ouellette (eds.). Bioinformatics. New York: Wiley:197-222.

75.            B Rost (2005) How to use protein 1D structure predicted by PROFphd. In: JE Walker (eds.). The Proteomics Protocols Handbook. Totowa NJ: Humana:875-901.

76.            Y Ofran, M Punta, R Schneider and B Rost (2005) Beyond annotation transfer by homology: novel protein function prediction methods that can assist drug discovery. Drug Discovery Today 10:1475-1482.

77.            J Benach, WC Edstrom, I Lee, K Das, B Cooper, R Xiao, J Liu, B Rost, TB Acton, GT Montelione and JF Hunt (2005) The 2.35 A structure of the TenA homolog from Pyrococcus furiosus supports an enzymatic function in thiamine metabolism. Acta Crystallogr D Biol Crystallogr 61:589-598.

78.            O Grana, VA Eyrich, F Pazos, B Rost and A Valencia (2005) EVAcon: a protein contact prediction evaluation service. Nucleic Acids Res 33:W347-351.

79.            HV Jagadish, D States and B Rost (2005) ISMB 2005. Bioinformatics 21 Suppl 1:i1-i2.

80.            The FANTOM Consortium, et al. (2005) The Transcriptional Landscape of the Mammalian Genome. Science 309:1559-1563.

81.            R Powers, et al. (2005) Solution structure of Archaeglobus fulgidis peptidyl-tRNA hydrolase (Pth2) provides evidence for an extensive conserved family of Pth2 enzymes in archea, bacteria, and eukaryotes. Protein Science 14:2849-2861.

82.            DA Snyder, et al. (2005) Comparisons of NMR spectral quality and success in crystallization demonstrate that NMR and X-ray crystallography are complementary methods for small protein structure determination. J Am Chem Soc 127:16505-16511.

83.            J Moult, K Fidelis, B Rost, T Hubbard and A Tramontano (2005) Critical assessment of methods of protein structure prediction (CASP)-Round 6. Proteins 61:3-7.

84.            O Grana, D Baker, RM Maccallum, J Meiler, M Punta, B Rost, ML Tress and A Valencia (2005) CASP6 assessment of contact prediction. Proteins 61:214-224.

85.            A Schlessinger, Y Ofran, G Yachdav and B Rost (2006) Epitome: Database of structure-inferred antigenic epitopes. Nucleic Acids Research 34:D777-780.

86.            J Liu, J Gough and B Rost (2006) Distinguishing protein-coding from non-coding RNA through support vector machines. PLoS Genetics 2:e29; DOI: 10.1371/journal.pgen.0020029.

87.            A Schlessinger, G Yachdav and B Rost (2006) PROFbval: predict flexible and rigid residues in proteins. Bioinformatics 22:891-893.

88.            S Mika and B Rost (2006) Protein–protein interactions more conserved within species than across species. PLoS Computational Biology 2:e79.

89.            H Bigelow and B Rost (2006) PROFtmb: a web server for predicting bacterial transmembrane beta barrel proteins. Nucleic Acids Research 34:W186-188.

90.            Y Ofran, G Yachdav, E Mozes, T-t Soong, R Nair and B Rost (2006) Create and assess protein networks through molecular characteristics of individual proteins. Bioinformatics 22:e402-407.

91.            A Passerini, M Punta, A Ceroni, B Rost and P Frasconi (2006) Identifying cysteines and histidines in transition-metal-binding sites using support vector machines and neural networks. Proteins: Structure, Function, and Bioinformatics 65:305-316.

92.            HM Berman, et al. (2006) Outcome of a workshop on archiving structural models of biological macromolecules. Structure 14:1211-1217.

93.            Y Ofran and B Rost (2007) ISIS: Interaction Sites Identified from Sequence. Bioinformatics 23:e13-16.

94.            D Przybylski and B Rost (2007) Consensus sequences improve PSI-BLAST through mimicking profile-profile alignments. Nucleic Acids Research 35:2238-2246.

95.            J Liu, GT Montelione and B Rost (2007) Novel leverage of structural genomics. Nature Biotechnology in press.

96.            Y Ofran and B Rost (2007a) Protein-protein interaction hot spots carved into sequences. PLoS Comput Biol in press.

97.            Y Bromberg and B Rost (2007) SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Research in press.

98.            Y Ofran, V Mysore and B Rost (2007) Prediction of DNA binding residues from sequence. Bioinformatics in press.

99.            A Schlessinger, J Liu and B Rost (2007b) Natively unstructured loops differ from other loops. PLoS Comput Biol in press.

100.         M Punta, LR Forrest, H Bigelow, A Kernytsky, J Liu and B Rost (2007) Membrane protein prediction methods. Methods 41:460-474.

101.         D Przybylski and B Rost (2007) Predicting simplified features of protein structure. In: T Lengauer (eds.). Bioinformatics – From Genomes to Therapies. Weinheim: Wiley-VCH:in press.

102.         R Nair and B Rost (2007) Predicting protein subcellular localization using intelligent systems. In: D Leon and S Markel (eds.). In Silico Technology in Drug Target Identification and Validation. Marcel Dekker:.

103.         H Bigelow and B Rost (2007) Online tools for predicting integral membrane proteins. In: MJ Peirce and R Wait (eds.). Proteomic analysis of membrane proteins: methods and protocols. .

104.         R Nair and B Rost (2006) Predicting protein subcellular localization using intelligent systems. In: D Leon and S Markel (eds.). In silico technology in drug target identification and validation. Boca Raton, FL: CRC Press:.

105.         D Przybylski and B Rost (2006) In: T Lengauer (eds.). New York: Wiley-VCH:.

106.         Y Ofran and B Rost (2003) Rescue for statistical tests in high-throughput biology. Bioinformatics submitted.

Abstracts


1999 peer-reviewed

Twilight zone of protein sequence alignments

Burkhard Rost

Quote: 1999 Protein Engineering 12, 85-94

Sequence alignments unambiguously distinguish between protein pairs of similar and non-similar structure when the pairwise sequence identity is high (>40% for long alignments). The signal gets blurred in the twilight zone of 20-35% sequence identity. Here, I analysed more than a million sequence alignments between protein pairs of known structures to re-define a line distinguishing between true and false positives for low levels of similarity. Four results stood out. (1) The transition from the safe zone of sequence alignment into the twilight zone is described by an explosion of false negatives. More than 95% of all pairs detected in the twilight zone had different structures. More precisely, above a cut-off roughly corresponding to 30% sequence identity, 90% of the pairs were homologous; below 25% less than 10% were. (2) Whether or not sequence homology implied structural identity depended crucially on the alignment length. For example, if ten residues were similar in an alignment of length 16 (> 60%), structural similarity could not be inferred. (3) The 'more similar than identical' rule (discarding all pairs for which percentage similarity was lower than percentage identity) reduced false positives significantly. (4) Similarly successful was sequence space hopping: pairs were predicted to be homologous when the respective sequence families had proteins in common. All findings are applicable to automatic database searches. 
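The 'more similar than identical' rule in result (3) can be sketched as a simple filter. This is a minimal illustration, not the code used in the paper: the `AlignedPair` record, its field names, and the example values are hypothetical, and how percentage similarity is computed from a substitution matrix is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class AlignedPair:
    """One pairwise alignment (hypothetical record for illustration)."""
    name: str
    pct_identity: float    # percentage of identical aligned residues
    pct_similarity: float  # percentage of similar aligned residues


def more_similar_than_identical(pairs):
    """Discard pairs whose percentage similarity is lower than their identity."""
    return [p for p in pairs if p.pct_similarity >= p.pct_identity]


# Made-up example values, only to show the filter in action.
pairs = [
    AlignedPair("a", 30.0, 45.0),  # kept: similarity above identity
    AlignedPair("b", 28.0, 20.0),  # discarded: similarity below identity
    AlignedPair("c", 25.0, 25.0),  # kept: similarity equal to identity
]
kept = [p.name for p in more_similar_than_identical(pairs)]
```

Pairs with equal similarity and identity are kept here, since the abstract only speaks of discarding pairs for which similarity was lower.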

 

1999 collaborations

A modified definition of SOV, a segment-based measure for protein secondary structure prediction assessment

Adam Zemla, Ceslovas Venclovas, Krzysztof Fidelis & Burkhard Rost

Quote: 1999 Proteins: Structure, Function, and Genetics 34, 220-223

We present a measure for the evaluation of secondary structure prediction methods that is based on secondary structure segments rather than individual residues. The algorithm is an extension of the segment overlap measure Sov, originally defined by Rost et al. (J Mol Biol 1994;235:13-26). The new definition of Sov corrects the normalization procedure and improves Sov's ability to discriminate between similar and dissimilar segment distributions. The method has been comprehensively tested during the second Critical Assessment of Techniques for Protein Structure Prediction (CASP2). Here, we describe the underlying concepts, modifications to the original definition, and their significance.

A platform for integrating threading results with protein family analyses

Florencio Pazos, Burkhard Rost & Alfonso Valencia

Quote: 1999 Bioinformatics 15, 1062-1063

We have developed a package for the interactive visualization of results from different threading programs. Additionally, we have integrated relevant information about protein sequence, function, evolution, and structure into the interface.

Effective use of sequence correlation and conservation in fold recognition

Osvaldo Olmea, Burkhard Rost & Alfonso Valencia

Quote: 1999 Journal of Molecular Biology 293, 1221-1239

Protein families are a rich source of information; sequence conservation and sequence correlation are two of the main properties that can be derived from the analysis of multiple sequence alignments. Sequence conservation is related to the direct evolutionary pressure to retain the chemical characteristics of some positions in order to maintain a given function. Sequence correlation is attributed to the small sequence adjustments needed to maintain protein stability against constant mutational drift. Here, we showed that sequence conservation and correlation were each frequently informative enough to detect incorrectly folded proteins. Furthermore, combining conservation, correlation, and polarity, we achieved an almost perfect discrimination between native and incorrectly folded proteins. Thus, we made use of this information for threading by evaluating the models suggested by a threading method according to the degree of proximity of the corresponding correlated, conserved, and apolar residues. The results showed that the fold recognition capacity of a given threading approach could be improved almost fourfold by selecting the alignments that score best under the three different sequence-based approaches.

CAFASP-1: critical assessment of fully automated structure prediction methods

Daniel Fischer, C Barret, K Bryson, Arne Elofsson, Adam Godzik, David Jones, Kevin J. Karplus, L. A. Kelley, R. M. MacCallum, K. Pawłowski, Burkhard Rost, Leszek Rychlewski, Michael Sternberg

Quote: 1999 Proteins: Structure, Function, and Genetics Suppl 3, 209-217

The results of the first Critical Assessment of Fully Automated Structure Prediction (CAFASP-1) are presented. The objective was to evaluate the success rates of fully automatic web servers for fold recognition which are available to the community. This study was based on the targets used in the third meeting on the Critical Assessment of Techniques for Protein Structure Prediction (CASP-3). However, unlike CASP-3, the study was not a blind trial, as it was held after the structures of the targets were known. The aim was to assess the performance of methods without the user intervention that several groups used in their CASP-3 submissions. Although it is clear that "human plus machine" predictions are superior to automated ones, this CAFASP-1 experiment is extremely valuable for users of our methods; it provides an indication of the performance of the methods alone, and not of the "human plus machine" performance assessed in CASP. This information may aid users in choosing which programs they wish to use and in evaluating the reliability of the programs when applied to their specific prediction targets. In addition, evaluation of fully automated methods is particularly important to assess their applicability at genomic scales. For each target, groups submitted the top-ranking folds generated from their servers. In CAFASP-1 we concentrated on fold-recognition web servers only and evaluated only recognition of the correct fold, and not, as in CASP-3, alignment accuracy. Although some performance differences appeared within each of the four target categories used here, overall, no single server has proved markedly superior to the others. The results showed that current fully automated fold recognition servers can often identify remote similarities when pairwise sequence search methods fail. 
Nevertheless, in only a few cases outside the family-level targets has the score of the top-ranking fold been significant enough to allow for a confident fully automated prediction. Because the goals, rules, and procedures of CAFASP-1 were different from those used at CASP-3, the results reported here are not comparable with those reported in CASP-3. Nevertheless, it is clear that current automated fold recognition methods cannot yet compete with "human-expert plus machine" predictions. Finally, CAFASP-1 has been useful in identifying the requirements for a future blind trial of automated server-based protein structure prediction.

2000 peer-reviewed

Finding nuclear localization signals

Murat Cokol, Rajesh Nair & Burkhard Rost

Quote: 2000 EMBO Reports 1, 411-415

A variety of nuclear localisation signals (NLSs) are experimentally known; only one motif was available for database searches. We initially collected a set of 91 experimentally verified NLSs from the literature. Through iterated 'in silico mutagenesis' we then extended the set to 214 potential NLSs. This final set matched in 43% of all known nuclear proteins and in no known non-nuclear protein. We estimated >17% of all eukaryotic proteins may be imported into the nucleus. Finally, we found an overlap between NLS and DNA-binding region for 90% of the proteins for which both NLS and DNA-binding regions were known. Thus, evolution seemed to have used part of the existing DNA-binding mechanism when compartmentalising DNA-binding proteins into the nucleus. However, only 56 of our 214 NLS motifs overlapped with DNA-binding regions. These 56 NLSs enabled a de novo prediction of partial DNA-binding regions for about 800 proteins in human, fly, worm and yeast. 

 

2000 non-peer

Third generation prediction of secondary structure

Burkhard Rost & Chris Sander

Quote: 2000 Methods in Molecular Biology 143, 71-95

We still cannot predict protein structure from sequence, in general. But, we can do much better in predicting simplified aspects of structure. Particularly, the field of secondary structure has been revived by a break-through that has been achieved by a combination of elaborated algorithms and evolutionary information available in ever growing data bases. Some of the new, third generation methods for secondary structure prediction are clearly superior to previous methods: β-strands are predicted more accurately; predicted segments look like those observed; and the overall accuracy is about ten percentage points higher than for methods from previous generations. Performance can be improved even further by using these methods in an 'expert' rather than in an 'automatic' mode.

2001 peer-reviewed

Protein secondary structure prediction continues to rise

Burkhard Rost

Quote: 2001 Journal of Structural Biology 134, 204-218

Methods predicting protein secondary structure have improved substantially in the 90's through using evolutionary information taken from the divergence of proteins in the same structural family. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than four percentage points to its current height around 76% of all residues predicted correctly in one of the three states helix, strand, other. The last year also brought successful new concepts to the field. These new methods may be particularly interesting in light of the improvements achieved through simply combining existing methods. Divergent evolutionary profiles not only contain enough information to substantially improve prediction accuracy, but even to correctly predict long stretches of identical residues observed in alternative secondary structure states depending on non local conditions. An example is a method automatically identifying structural switches, and thus finding a remarkable connection between predicted secondary structure and aspects of function. Secondary structure predictions are increasingly becoming the working horse for numerous methods aiming at predicting protein structure and function. Is the recent increase in accuracy significant enough to make predictions even more useful? Since the recent improvement yields a better prediction of segments, and in particular of beta-strands, I believe the answer is affirmative. What is the limit of prediction accuracy? We shall see.

EVA: continuous automatic evaluation of protein structure prediction servers

Volker A. Eyrich, Marc A. Martí-Renom, Dariusz Przybylski, Mallur S. Madhusudhan, András Fiser, Florencio Pazos, Alfonso Valencia, Andrej Sali & Burkhard Rost

Quote: 2001 Bioinformatics 17, 1242-1243

Summary: Evaluation of protein structure prediction methods is difficult and time-consuming. Here, we describe EVA, a web server for assessing protein structure prediction methods, in an automated, continuous and large-scale fashion. Currently, EVA evaluates the performance of a variety of prediction methods available through the internet. Every week, the sequences of the latest experimentally determined protein structures are sent to prediction servers, results are collected, performance is evaluated, and a summary is published on the web. EVA has so far collected data for more than 3000 protein chains. These results may provide valuable insight to both developers and users of prediction methods.

Comparing function and structure between entire proteomes

Jinfeng Liu & Burkhard Rost

Quote: 2001 Protein Science 10, 1970-1979

More than 30 organisms have been entirely sequenced. Here, we applied a variety of simple bioinformatics tools to analyse 29 proteomes for representatives from all three kingdoms: eukaryotes, prokaryotes and archaebacteria. We confirmed that eukaryotes have relatively more long proteins than prokaryotes and archaes, and that the overall amino acid composition is similar between the three. We predicted that about 15-30% of all proteins contained transmembrane helices. We could not find a correlation between the content of membrane proteins and the complexity of the organism. In particular, we did not find significantly higher percentages of helical membrane proteins in eukaryotes than in prokaryotes or archae. However, we found more proteins with 7 transmembrane helices in eukaryotes and more with 6 and 12 in prokaryotes. We found twice as many coiled-coil proteins in eukaryotes (10%) as in prokaryotes and archaes (4-5%), and we predicted about 15-25% of all proteins to be secreted by most eukaryotes and prokaryotes. Every tenth protein had no known homologue in current databases, and 30-40% of the proteins fall into structural families with more than 100 members. A classification by cellular function verified that eukaryotes had a higher proportion of proteins for communication with the environment. Finally, we found at least one homologue of experimentally known structure for about 20%-45% of all proteins; the regions with structural homology covered 20%-30% of all residues. These numbers may or may not suggest that there are 1200-2600 folds in the universe of protein structures. All predictions are available at Protein Science 10, 1970-1979.

EVA: large-scale analysis of secondary structure prediction

Burkhard Rost & Volker A Eyrich

Quote: 2001 Proteins: Structure, Function, and Genetics 45 Suppl 5, S192-S199

EVA is a web-based server that evaluates automatic structure prediction servers continuously and objectively. Since June 2000, EVA collected more than 20,000 secondary structure predictions. The EVA sets sufficed to conclude that the field of secondary structure prediction has advanced again. Accuracy increased substantially in the 90's through using evolutionary information taken from the divergence of proteins in the same structural family. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than four percentage points to its current height around 76% of all residues predicted correctly in one of the three states helix, strand, other. The best current methods solved most of the problems raised at earlier CASP meetings: All good methods now get segments right and perform well on strands. Is the recent increase in accuracy significant enough to make predictions even more useful? We believe the answer is affirmative. What is the limit of prediction accuracy? We shall see.
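The accuracy figure quoted above (percentage of all residues predicted correctly in one of the three states helix, strand, other) can be sketched as a per-residue score. This is a minimal illustration under stated assumptions, not EVA's evaluation code: the function name and the one-letter state encoding (`H`, `E`, `L`) are choices made for this sketch.

```python
def three_state_accuracy(observed, predicted):
    """Percentage of residues whose predicted state matches the observed one.

    Both arguments are equal-length strings over the three states:
    'H' (helix), 'E' (strand), 'L' (other); this encoding is an assumption.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("cannot score an empty protein")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Toy example: 7 of 8 residues agree.
score = three_state_accuracy("HHHEELLL", "HHHEEELL")  # 87.5
```

A per-residue score of this kind does not say whether segments are predicted well, which is why segment-based measures are reported alongside it.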

 

2001 collaborations

CAFASP2: the second critical assessment of fully automated structure prediction methods

Daniel Fischer, Arne Elofsson, Leszek Rychlewski, Florencio Pazos, Alfonso Valencia, Burkhard Rost, Angel B Ortiz & R. L. Dunbrack

Quote: 2001 Proteins: Structure, Function, and Genetics 45 Suppl 5, S171-S183

The results of the second Critical Assessment of Fully Automated Structure Prediction (CAFASP2) are presented. The goals of CAFASP are to (i) assess the performance of fully automatic web servers for structure prediction, by using the same blind prediction targets as those used at CASP4, (ii) inform the community of users about the capabilities of the servers, (iii) allow human groups participating in CASP to use and analyze the results of the servers while preparing their nonautomated predictions for CASP, and (iv) compare the performance of the automated servers to that of the human-expert groups of CASP. More than 30 servers from around the world participated in CAFASP2, covering all categories of structure prediction. The category with the largest participation was fold recognition, where 24 CAFASP servers filed predictions along with 103 other CASP human groups. The CAFASP evaluation indicated that it is difficult to establish an exact ranking of the servers because the number of prediction targets was relatively small and the differences among many servers were also small. However, roughly a group of five "best" fold recognition servers could be identified. The CASP evaluation identified the same group of top servers albeit with a slightly different relative order. Both evaluations ranked a semiautomated method named CAFASP-CONSENSUS, that filed predictions using the CAFASP results of the servers, above any of the individual servers. Although the predictions of the CAFASP servers were available to human CASP predictors before the CASP submission deadline, the CASP assessment identified only 11 human groups that performed better than the best server. Furthermore, about one fourth of the top 30 performing groups corresponded to automated servers. At least half of the top 11 groups corresponded to human groups that also had a server in CAFASP or to human groups that used the CAFASP results to prepare their predictions. 
In particular, the CAFASP-CONSENSUS group was ranked 7. This shows that the automated predictions of the servers can be very helpful to human predictors. We conclude that as servers continue to improve, they will become increasingly important in any prediction process, especially when dealing with genome-scale prediction tasks. We expect that in the near future, the performance difference between humans and machines will continue to narrow and that fully automated structure prediction will become an effective companion and complement to experimental structural genomics.

 

2001 preprints

Surface profiles predict sub-cellular localisation

Rajesh Nair & Burkhard Rost

Quote: 2001 CUBIC Preprint

The gap between the number of known protein sequences and the knowledge about protein function is rapidly increasing. One important physical aspect of function is the sub-cellular localisation of a protein. Here, we trained two-layered feed-forward neural networks to predict the sub-cellular localisation for proteins of known structure. We introduced two novel key aspects: (1) using evolutionary information, and (2) using surface composition. We also trained networks only on the N-termini. Finally, we combined all our networks. We evaluated sustained levels of performance by four-fold cross-validation. The major single source of improvement was the use of evolutionary information. However, the combination of our various networks yielded the final, significant improvement over previous methods. The final system reached an accuracy above 80% (two-state). This level may suffice to make the method valuable for target selection in structural genomics.

Simple jury predicts protein secondary structure best

Burkhard Rost, Pierre Baldi, Geoff Barton, James Cuff, Volker A. Eyrich, David Jones, Kevin Karplus, Ross King, Gianluca Pollastri, Dariusz Przybylski

Quote: 2001 CUBIC Preprint 5

The field of secondary structure prediction methods has advanced again. The best methods now reach levels of 74-76% of the residues correctly predicted in one of the three states helix, strand, or other. In the context of EVA/CASP, we experimented with averaging over the best current methods. The resulting jury decision proved significantly more accurate than the best method. Although the 'jury' seemed the best choice on average, for 60% of all proteins one method was better than the jury. Furthermore, the best individual methods tended to be superior to the jury in estimating the reliability of a prediction. Hence, averaging over predictions may be the method of choice for a quick scan of large data sets, while experts may profit from studying the respective method in detail.
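The averaging idea can be illustrated with a small sketch. It assumes that each method reports, for every residue, a probability-like score for each of the three states; the data layout, the state codes and the function name are hypothetical and do not describe the interface of any of the servers involved.

```python
from typing import Dict, List

STATES = "HEL"  # assumed codes: helix, strand, other


def jury_predict(method_outputs: List[List[Dict[str, float]]]) -> str:
    """Average per-residue state scores over methods and pick the top state.

    method_outputs[m][i][s] is the score method m gives state s at residue i.
    """
    if not method_outputs:
        raise ValueError("need at least one method")
    n_residues = len(method_outputs[0])
    if any(len(m) != n_residues for m in method_outputs):
        raise ValueError("all methods must cover the same residues")

    n_methods = len(method_outputs)
    prediction = []
    for i in range(n_residues):
        # Unweighted mean over methods for each state at residue i.
        avg = {s: sum(m[i][s] for m in method_outputs) / n_methods for s in STATES}
        prediction.append(max(STATES, key=lambda s: avg[s]))
    return "".join(prediction)


# Three hypothetical methods, two residues.
m1 = [{"H": 0.6, "E": 0.1, "L": 0.3}, {"H": 0.2, "E": 0.5, "L": 0.3}]
m2 = [{"H": 0.5, "E": 0.2, "L": 0.3}, {"H": 0.1, "E": 0.3, "L": 0.6}]
m3 = [{"H": 0.3, "E": 0.1, "L": 0.6}, {"H": 0.1, "E": 0.6, "L": 0.3}]
print(jury_predict([m1, m2, m3]))
```

In this toy case the third method disagrees with the jury on the first residue and the second method on the second, which mirrors the observation that a single method can beat the jury on individual proteins.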

New improvements in protein secondary structure prediction

Burkhard Rost & Pierre Baldi

Quote: 2001 CUBIC preprint

We still cannot predict protein 3D structure from sequence, in general. But bioinformatics continues to improve methods available for predicting structural features. Particularly, the field of protein secondary structure prediction has advanced substantially in the 90's by combining algorithms from artificial intelligence with evolutionary information. Recently, growing databases and better search strategies have again boosted prediction accuracy by more than four percentage points. Today's most accurate methods predict more than 76% of all residues correctly in one of the three states helix, strand, or other. This high level has already been sustained by more than 300 new protein structures added since the methods were developed. However, the field is progressing rapidly: another unpublished algorithmic advance may already outperform the current state-of-the-art methods. The last two years also brought successful new concepts to the field. These new methods may be particularly interesting in light of the improvements achieved through simply combining existing methods. Divergent evolutionary profiles not only contain enough information to substantially improve prediction accuracy, but even to correctly predict long stretches of identical residues observed in alternative secondary structure states depending on non-local conditions. An example is a method automatically identifying structural switches, and thus linking predicted secondary structure to aspects of function. Secondary structure predictions are increasingly becoming the workhorse of numerous methods aiming at predicting protein structure and function. Since the recent improvements yield better predictions of segments, and in particular of beta-strands, we believe that the recent increase in accuracy is significant enough to make predictions even more useful.


 

 


 

2002 peer-reviewed

Alignments grow, secondary structure prediction improves

Dariusz Przybylski & Burkhard Rost

Quote: 2002 Proteins: Structure, Function, and Bioinformatics 46, 195-205

Using information from sequence alignments significantly improves protein secondary structure prediction. Typically, more divergent profiles yield better predictions. Lately, various groups have shown that accuracy can be improved markedly by using PSI-BLAST profiles to develop new prediction methods. Here, we focused on the influences of various alignment strategies on two 8-year old PHD methods. The following results stood out. (1) PHD using pairwise alignments predicts about 72% of all residues correctly in one of the three states helix, strand, other. Using larger databases and PSI-BLAST raised accuracy to 75%. (2) More than 60% of the improvement originated from the growth of current sequence databases; about 20% resulted from detailed changes in the alignment procedure (substitution matrix, thresholds, gap penalties). Another 20% of the improvement resulted from carefully using iterated PSI-BLAST searches. (3) Interestingly, we failed to improve prediction accuracy further when attempting to refine the alignment by dynamic programming (MaxHom and ClustalW). (4) Improvement through family growth appears to saturate at some point. However, most families have not reached this saturation. Hence, we anticipate that prediction accuracy will continue to rise with database growth.

Continuous assignment of secondary structure correlates with protein flexibility

Claus AF Andersen, Arthur G Palmer, Søren Brunak & Burkhard Rost

Quote: 2002 Structure 10, 175-184

The DSSP program automates protein secondary structure assignment, using an algorithm that assigns every residue to one of eight states. However, any discrete assignment is incomplete, because the continuum of thermal fluctuations cannot be described. Hence, a continuous assignment of secondary structure that replaces 'static' by 'dynamic' states is proposed. Technically, continuous DSSP assignments were calculated from a single structure as weighted averages over ten static DSSP assignments with different hydrogen bond thresholds. The continuous DSSP assignments reflect the structural variations due to thermal fluctuations as detected by NMR spectroscopy. Continuous secondary structure assignments may impact future protein structure comparison and prediction.
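The weighted averaging described above can be sketched as follows. The sketch assumes that the static assignments are already available as equal-length strings, one per hydrogen bond threshold; it does not run DSSP. The weights, the three-string example and the function name are purely illustrative (the abstract refers to ten assignments).

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def continuous_assignment(
    assignments: Sequence[str], weights: Sequence[float]
) -> List[Dict[str, float]]:
    """Per-residue weighted frequency of each state over static assignments."""
    if not assignments or len(assignments) != len(weights):
        raise ValueError("need one weight per assignment")
    n_residues = len(assignments[0])
    if any(len(a) != n_residues for a in assignments):
        raise ValueError("assignments must have equal length")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    profile = []
    for i in range(n_residues):
        column = defaultdict(float)
        for assignment, weight in zip(assignments, weights):
            column[assignment[i]] += weight / total  # normalised contribution
        profile.append(dict(column))
    return profile


# Three hypothetical static assignments from loose to strict thresholds.
static = ["HHHL", "HHLL", "HLLL"]
weights = [0.5, 0.3, 0.2]
for position, states in enumerate(continuous_assignment(static, weights), start=1):
    print(position, states)
```

Residues whose state changes with the threshold end up with mixed values, which is the sense in which the assignment becomes continuous rather than discrete.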

Enzyme function less conserved than anticipated

Burkhard Rost

Quote: 2002 Journal of Molecular Biology 318, 595-608

The level of sequence similarity that implies similarity in protein structure is well established. Recently, many groups proposed thresholds in sequence similarity that imply similarity in enzymatic function. All results suggest that enzyme function is conserved above levels of 50% pairwise sequence identity. Here, I argue that all groups substantially over-estimated the conservation of enzyme function due to bias in the data sets used. An unbiased analysis suggested that less than 30% of the pairs above 50% sequence identity have identical EC numbers. Another surprising finding was that even PSI-BLAST E-values below 10^-50 did not suffice to transfer enzyme function without errors. A score relating sequence identity to alignment length (distance from HSSP-threshold) outperformed statistical BLAST scores for high accuracy. In particular, the distance score allowed error-free transfer of enzyme function for the 10% most similar enzyme pairs. In practice, the revised detailed estimates for the sequence conservation of enzyme function may provide important benchmarks for everyday sequence analysis and for genome annotation.
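The distance from the HSSP-threshold can be sketched in a few lines. The abstract does not give the curve, so the sketch below assumes the length-dependent threshold published in entry 1 of the bibliography (Rost 1999, Protein Engineering 12:85-94). The constants are an assumption and should be checked against that paper before any real use; the function names are hypothetical.

```python
import math


def hssp_threshold(alignment_length: int) -> float:
    """Assumed identity threshold (in %) as a function of alignment length."""
    if alignment_length <= 0:
        raise ValueError("alignment length must be positive")
    if alignment_length <= 11:
        return 100.0
    if alignment_length > 450:
        return 19.5
    exponent = -0.32 * (1.0 + math.exp(-alignment_length / 1000.0))
    return 480.0 * alignment_length ** exponent


def hssp_distance(percent_identity: float, alignment_length: int) -> float:
    """Positive values lie above the threshold curve, negative values below."""
    return percent_identity - hssp_threshold(alignment_length)


# A 100-residue alignment at 50% identity versus one at 25% identity.
for identity in (50.0, 25.0):
    print(identity, round(hssp_distance(identity, 100), 1))
```

The point of such a score is that the same percentage identity means more over a long alignment than over a short one, so a single number can be compared across alignments of different length.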

Target space for structural genomics revisited

Jinfeng Liu & Burkhard Rost

Quote: 2002 Bioinformatics 18, 922-933

Motivation: Structural genomics eventually aims at determining structures for all proteins. However, in the beginning experimentalists are likely to focus on globular proteins to achieve a rapid basic coverage of protein sequence space. How many proteins will structural genomics have to target? How many proteins will be excluded since we already have structural information for these or since they are not globular? We have to answer these questions in context of our target selection for the North-East Structural Genomics Consortium (NESG).

Results: We estimated that structural information is available for about 6-38% of all proteins; 6% if we require high accuracy in comparative modelling, 38% if we are satisfied with having a rough idea about the fold. Excluding all regions that are not globular, we found that structural genomics may have to target about 48% of all proteins. This corresponded to a similar percentage of residues of the entire proteomes (52%). We explored a number of different strategies to cluster protein space in order to find the number of families representing these 48% of structurally unknown proteins. For the subset of all entirely sequenced eukaryotes, we found over 18000 fragment clusters each of which may be a suitable target for structural genomics.

Availability: All data are available from the authors, most results are summarised at: http://cubic.bioc.columbia.edu/genomes/RES/

Inferring sub-cellular localisation
through automated lexical analysis

Rajesh Nair & Burkhard Rost

Quote: 2002 Bioinformatics 18, S78-S86

Motivation: The SWISS-PROT sequence database contains keywords of functional annotations for many proteins. In contrast, information about the sub-cellular localization is only available for few proteins. Experts can often infer localization from keywords describing protein function. We developed LOCkey, a fully automated method for lexical analysis of SWISS-PROT keywords that assigns sub-cellular localization. With the rapid growth in sequence data, the biochemical characterisation of sequences has been falling behind. Our method may be a useful tool for supplementing functional information already automatically available.

Results: The method reached a level of more than 82% accuracy in a full cross-validation test. Due to a lack of functional annotations, we could infer localization for less than half of all proteins in SWISS-PROT. We applied LOCkey to annotate five entirely sequenced proteomes, namely Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Arabidopsis thaliana (plant) and a subset of all human proteins. LOCkey found about 8000 new annotations of sub-cellular localization for these eukaryotes.

Availability: Annotations of localization for eukaryotes at:  http://cubic.bioc.columbia.edu/services/LOCkey.

Loopy proteins appear conserved in evolution

Jinfeng Liu, Hepan Tan & Burkhard Rost

Quote: 2002 Journal of Molecular Biology 322, 53-64

Over the last decade, structural biologists have unravelled many proteins that appear natively disordered. Common assumptions are that many of these proteins adopt structure through binding and that the structural flexibility enables them to adopt different functions. Here, we investigated regions of more than 70 sequence-consecutive residues that have no regular secondary structure (NORS). Analysing 31 entirely sequenced organisms, we predicted five times as many proteins with NORS regions ('loopy' proteins) in eukaryotes (20%) as in prokaryotes and archaea (4%). Thousands of these NORS regions were over 150 residues long. The amino acid composition of NORS regions differed from that of loops in PDB. Although NORS regions had significantly more low-complexity residues than other proteins, simple cut-off thresholds for sequence bias missed most NORS regions. On average, NORS regions were evolutionarily at least as conserved as their flanking regions. Furthermore, yeast proteins with NORS regions had more protein-protein interaction partners than other proteins. Regulatory and transcription-related functions were over-represented in loopy proteins, while biosynthesis and energy metabolism were under-represented. Overall, our analysis confirmed that proteins with non-regular structures appear to play important functional roles, and they may adopt yet unknown types of protein structures.
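Read literally, the NORS definition above amounts to scanning a predicted secondary structure string for long runs without helix or strand. The sketch below does exactly that and nothing more. It assumes one-letter state codes (H, E, anything else counts as non-regular), and it is a simplification, since the published analysis may apply additional criteria that the abstract does not spell out.

```python
import re
from typing import List, Tuple


def find_nors_regions(secondary_structure: str, min_length: int = 71) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of runs without helix (H) or strand (E).

    Indices are 0-based and end-exclusive. The default of 71 encodes
    "more than 70 consecutive residues".
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    pattern = re.compile(r"[^HE]{%d,}" % min_length)
    return [(m.start(), m.end()) for m in pattern.finditer(secondary_structure)]


# Toy prediction: one 80-residue loop qualifies, the 30-residue loop does not.
prediction = "H" * 10 + "L" * 80 + "E" * 5 + "L" * 30
print(find_nors_regions(prediction))
```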

Transmembrane helix predictions revisited

Chien Peter Chen, Andrew Kernytsky & Burkhard Rost

Quote: 2002 Protein Science 11, 2774-2791

Methods that predict membrane helices have become increasingly useful in context of analysing entire proteomes, as well as in everyday sequence analysis. Part of the importance of bioinformatics tools for membrane proteins stems from the lack of high-resolution information. Here, we analysed 27 advanced and simple methods in detail. The recent X-ray structures of membrane proteins have challenged common assumptions in the field, namely, membrane helices are often longer than 27 residues, and they are not well conserved in evolution. These new experimental data were one reason why all prediction methods have been overestimated. In fact, the accuracy of low-resolution assignments for membrane helices did not differ significantly from the best prediction methods. Some of the advanced methods performed better than others; no advanced method performed consistently best. In contrast, methods using only simple hydrophobicity scales to predict membrane helices were clearly inferior to the advanced methods. All methods confused signal peptides with membrane helices, while most advanced methods correctly distinguished between membrane helical and other proteins. We contend that the problems unravelled here have reopened the field of predicting membrane helices and their orientation.

Sequence conserved for sub-cellular localization

Rajesh Nair & Burkhard Rost

Quote: 2002 Protein Science 11, 2836-2847

The more proteins diverged in sequence, the more difficult it becomes for bioinformatics to infer similarities of protein function and structure from sequence. The precise thresholds used in automated genome annotations depend on the particular aspect of protein function transferred by homology. Here, we presented the first large-scale analysis of the relation between sequence similarity and identity in sub-cellular localization. Three results stood out: (1) the sub-cellular compartment is generally more conserved than what might have been expected given that short sequence motifs like nuclear localization signals can alter the native compartment. (2) The sequence-conservation of localization is similar between different compartments, and (3) it is similar to the conservation of structure and enzymatic activity. In particular, we found the transition between the regions of conserved and non-conserved localization to be very sharp although the thresholds for conservation were less well defined than for structure and enzymatic activity. We found that a simple measure for sequence similarity accounting for pairwise sequence identity and alignment length, the HSSP-distance, distinguished accurately between protein pairs of identical and different localizations. In fact, BLAST expectation values outperformed the HSSP-distance only for alignments in the sub-twilight zone. We succeeded in slightly improving the accuracy of inferring localization through homology by fine-tuning the thresholds. Finally, we applied our results to the entire SWISS-PROT database and five entirely sequenced eukaryotes.

Long membrane helices and short loops
predicted less accurately

Chien Peter Chen & Burkhard Rost

Quote: 2002 Protein Science 11, 2766-2773

Low-resolution experiments suggest that most membrane helices span over 17-25 residues and that most loops between two helices are longer than 15 residues. Both constraints have been used explicitly in the development of prediction methods. Here, we compared the largest possible sequence-unique data sets from high- and low-resolution experiments. For the high-resolution data, we found that only half of the helices fall into the expected length interval and that half of the loops were shorter than ten residues. We compared the accuracy of detecting short loops and long helices for 28 advanced and simple prediction methods: All methods predicted short loops less accurately than longer ones. In particular, loops shorter than seven residues appeared to be very difficult to detect by current methods. Similarly, all methods tended to be more accurate for longer than for shorter helices. However, helices with more than 32 residues were predicted less accurately than all other helices. Our findings may suggest particular strategies for improving prediction of membrane helices.

 

2002 non-peer

State-of-the-art in membrane protein prediction

Chien Peter Chen & Burkhard Rost

Quote: 2002 Applied Bioinformatics 1, 21-35

Membrane proteins are crucial for many biological functions and have become attractive targets for pharmacological agents. The importance is reflected by the observation that about 10-30% of all proteins contain membrane spanning helices. Despite recent successes, high-resolution structures for membrane proteins remain exceptional. The gap between known sequences and known structures calls for finding solutions through bioinformatics. While many methods predict membrane helices, very few predict membrane strands. The good news is that most methods for helical membrane proteins are available and are more often right than wrong. The best current prediction methods appear to correctly predict all membrane helices for about 50-70% of all proteins and to falsely predict membrane helices for about 3-10% of all globular proteins. The bad news is that developers have seriously over-estimated the accuracy of their methods. In particular, while simple hydrophobicity scales identify many membrane helices, they frequently and incorrectly predict membrane helices in globular proteins. Additionally, all methods tend to confuse signal peptides with membrane helices. Nonetheless, wet-lab biologists can reach into an impressive toolbox for membrane protein predictions. Computational biologists, however, will have to improve their methods considerably before they reach the levels of accuracy they claimed.

Did evolution leap to create the protein universe?

Burkhard Rost

Quote: 2002 Current Opinion in Structural Biology 12, 409-416

Over 60 organisms from all three kingdoms of life are now entirely sequenced. In many respects the inventory of proteins used in different kingdoms appears surprisingly similar. However, eukaryotes differ from other kingdoms in that they use many long proteins and have more proteins with coiled-coil helices and with regions abundant in regular secondary structure. Particular structural domains are used in many pathways. Nevertheless, one domain tends to occur only once in one particular pathway. Many proteins may not have close homologues in different species (orphans) and there may even be folds that are specific to one species. This view implies that protein fold space is discrete. An alternative model suggests that structure space is continuous in that modern proteins evolved by aggregating fragments of ancient proteins. Either way, after having harvested proteomes by applying standard tools, the challenge now seems to be to develop better methods for comparative proteomics.

 

2002 collaborations

Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles

Gianluca Pollastri, Dariusz Przybylski, Burkhard Rost, Pierre Baldi

Quote: 2002 Proteins: Structure, Function, and Bioinformatics 47, 228-235

Secondary structure predictions are increasingly becoming the workhorse for several methods aiming at predicting protein structure and function. Here we use ensembles of bidirectional recurrent neural network architectures, PSI-BLAST-derived profiles, and a large nonredundant training set to derive two new predictors: (a) the second version of the SSpro program for secondary structure classification into three categories and (b) the first version of the SSpro8 program for secondary structure classification into the eight classes produced by the DSSP program. We describe the results of three different test sets on which SSpro achieved a sustained performance of about 78% correct prediction. We report confusion matrices, compare PSI-BLAST to BLAST-derived profiles, and assess the corresponding performance improvements. SSpro and SSpro8 are implemented as web servers, available together with other structural feature predictors at: http://promoter.ics.uci.edu/BRNN-PRED/.

Reliability of assessment of protein structure prediction methods

Marc A Marti-Renom, MS Madhusudhan, Andras Fiser, Burkhard Rost, Andrej Sali

Quote: 2002 Structure 10, 435-440

The reliability of ranking of protein structure modeling methods is assessed. The assessment is based on the parametric Student's t test and the nonparametric Wilcoxon signed rank test of statistical significance of the difference between paired samples. The approach is applied to the ranking of the comparative modeling methods tested at the fourth meeting on Critical Assessment of Techniques for Protein Structure Prediction (CASP). It is shown that the 14 CASP4 test sequences may not be sufficient to reliably distinguish between the top eight methods, given the model quality differences and their standard deviations. We suggest that CASP needs to be supplemented by an assessment of protein structure prediction methods that is automated, continuous in time, based on several criteria applied to a large number of models, and with quantitative statistical reliability assigned to each characterization.
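The paired comparison underlying such a ranking can be illustrated with the t statistic for paired samples. The sketch below only computes the statistic from per-target quality scores of two methods; converting it to a p-value requires the t distribution with n-1 degrees of freedom and is left out. The scores and the function name are invented for illustration and are not CASP data.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """t statistic for the mean per-target difference between two methods."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both methods must be scored on the same targets")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    if len(diffs) < 2:
        raise ValueError("need at least two targets")
    spread = statistics.stdev(diffs)  # sample standard deviation of differences
    if spread == 0:
        raise ValueError("all differences are identical; statistic undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Hypothetical model-quality scores for two methods on four shared targets.
method_a = [62.0, 55.0, 71.0, 48.0]
method_b = [61.0, 54.0, 69.0, 46.0]
print(round(paired_t_statistic(method_a, method_b), 3))
```

With few targets, small mean differences produce unstable statistics, which is the reason the abstract questions rankings built on 14 test sequences.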

Reply to Moult

Andrej Sali, Marc A. Marti-Renom, M.S. Madhusudhan, András Fiser, and Burkhard Rost

Quote: 2002 Structure 10, 292-293

 


 

 


 

2003 peer-reviewed

Analysing six types of protein-protein interfaces

Yanay Ofran & Burkhard Rost

Quote: 2003 Journal of Molecular Biology 325, 377-387

Non-covalent residue side-chain interactions occur in many different types of proteins and facilitate many biological functions. Are these differences manifested in different sequence compositions and/or different residue-residue contacts? Previous studies, analysing small data sets, gave contradicting answers. Here, we introduced a new data-mining method that yielded the largest high-resolution data set of interactions ever analysed. We also introduced an information theory-based analysis method. Our results suggested differentiating at least six types of protein interfaces, each corresponding to a different functional or structural association between chains and residues. Particularly, we found consistent and significant differences between interactions of residues within the same structural domain and between different domains, between permanent and transient interactions, and between interactions associating homo- and hetero-oligomers. The differences between the six types were so substantial that using amino acid composition alone, we could predict the correct interface type given a set of 1000 contacts. Moreover, residue-residue contact preferences also differed, implying that the underlying biochemical mechanisms are distinct in each of the six types. Our data might refine methods predicting protein structure from sequence, as well as methods inferring aspects of function from structure. Furthermore, our results suggested it might be possible to predict the interface type from sequence.

NLSdb: database of nuclear localization signals

Rajesh Nair, Phil Carter & Burkhard Rost

Quote: 2003 Nucleic Acids Research 31, 397-399

NLSdb is a database of nuclear localization signals (NLSs) and of nuclear proteins. NLSs are short stretches of residues mediating transport of nuclear proteins into the nucleus. The database contains 114 experimentally determined NLSs that were obtained through extensive literature search. Using 'in silico mutagenesis' this set was extended to 308 experimental and potential NLSs. This final set matched over 43% of all known nuclear proteins and matched no currently known non-nuclear protein. NLSdb contains over 6000 predicted nuclear proteins and their targeting signals from the PDB and SWISS-PROT databases. The database also contains over 12500 predicted nuclear proteins from six entirely sequenced eukaryotic proteomes (Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, and Saccharomyces cerevisiae). NLS motifs often co-localise with DNA-binding regions. This observation was used to also annotate over 1500 DNA-binding proteins. NLSdb can be accessed via the web site: http://cubic.bioc.columbia.edu/db/NLSdb/.

PEP: Predictions for Entire Proteomes

Phil Carter, Jinfeng Liu & Burkhard Rost

Quote: 2003 Nucleic Acids Research 31, 410-413

PEP is a database of Predictions for Entire Proteomes. The database contains summaries of analyses of protein sequences from a range of organisms representing all three major kingdoms of life: eukaryotes, prokaryotes, and archaea. All proteins publicly available for organisms were aligned against SWISS-PROT, TrEMBL and PDB. Additionally the following annotations are provided: secondary structure, transmembrane helices, coiled coils, regions of low complexity, signal peptides, PROSITE motifs and classes of cellular function. Proteins that contain long regions without regular secondary structure are also identified. We have produced a related database of structural domain-like fragments derived from PEP, and clusters based on homology between all fragments. The PEP database, fragments and clusters are distributed freely as a set of flat files, and have been integrated into SRS. The PEP group of databases can be accessed from: http://cubic.bioc.columbia.edu/pep.

Neural networks predict protein structure: hype or hit?

Burkhard Rost

Quote: 2003 Artificial intelligence and heuristic methods in bioinformatics 34-50

Neural networks have been applied to many pattern classification problems. Here, I review applications to the problem of predicting protein structure from protein sequence. Initially, many methods were apparently designed by researchers who just wanted a real-life application for their gadget. However, the competitiveness of the field separated the wheat from the chaff. Meanwhile, several neural network-based methods have contributed significantly to advancing the field of bio-informatics, and some are clearly influencing molecular biology. Today, a plethora of network methods is used in everyday sequence analysis, and an increasing number of applications explore very novel problems.

Predict protein-protein interaction sites from local sequence information

Yanay Ofran & Burkhard Rost

Quote: 2003 FEBS Letters 544, 236-239

Protein-protein interactions are facilitated by a myriad of residue-residue contacts on the interacting proteins. Identifying the site of interaction in the protein is a key for deciphering its functional mechanisms, and is crucial for drug development. Many studies indicate that the compositions of these contacting residues are unique. Here, we showed that this information sufficed to predict protein-protein interaction sites: A neural network identified protein-protein interfaces from sequence. These predictions were then clustered in sequence-local windows. For the most strongly predicted sites (in 34 of 333 proteins), 94% of the predictions were confirmed experimentally. At 70% accuracy, we correctly predicted at least one interaction site in 20% of the complexes (66/333). These results indicate that the prediction of some interaction sites from sequence alone is possible. Incorporating evolutionary and predicted structural information may improve our method. However, even at this early stage, our tool can already be beneficial to wet-lab biology. (Note: all data are available at: http://cubic.bioc.columbia.edu/results/2003/pp_febs/.)

The PredictProtein server

Burkhard Rost & Jinfeng Liu

Quote: 2003 Nucleic Acids Research 31, 3300-3304

PredictProtein (PP, http://cubic.bioc.columbia.edu/pp/) is an internet service for sequence analysis and the prediction of aspects of protein structure and function. Users submit protein sequence or alignment. The server returns a multiple sequence alignment, PROSITE sequence motifs, low-complexity regions (SEG), ProDom domain assignments, nuclear localisation signals, regions lacking regular structure and predictions of secondary structure, solvent accessibility, globular regions, transmembrane helices, coiled-coil regions, structural switch regions, and cysteine bonds. Upon request fold recognition by prediction-based threading is available. For all services, users can submit their query either by electronic mail, or interactively from World Wide Web.

NORSp: predictions of long regions without regular secondary structure

Jinfeng Liu & Burkhard Rost

Quote: 2003 Nucleic Acids Research 31, 3833-3835

Many structurally flexible regions play important roles in biological processes. It has been shown that extended loopy regions are very abundant in the protein universe and that they have been conserved through evolution. Here, we present NORSp, a publicly available predictor for disordered regions in protein. Specifically, NORSp predicts long regions with no regular secondary structure. Upon user submission of a protein sequence, NORSp will analyse the protein for its secondary structure, and presence of transmembrane helices and coiled-coil. It will then return e-mail to the user about the presence and position of disordered regions. NORSp can be accessed from http://cubic.bioc.columbia.edu/services/NORSp/.

LOC3D: annotate sub-cellular localization for protein structures

Rajesh Nair & Burkhard Rost

Quote: 2003 Nucleic Acids Research 31, 3337-3340

LOC3D (http://cubic.bioc.columbia.edu/db/LOC3D/) is both a weekly-updated database and a web server for predictions of sub-cellular localization for eukaryotic proteins of known 3D structure. Localization is predicted using four different methods: (1) PredictNLS: prediction of nuclear proteins through nuclear localization signals, (2) LOChom: inferring localization through sequence homology, (3) LOCkey: inferring localization through automatic text analysis of SWISS-PROT keywords, and (4) LOC3Dini: ab initio prediction through a system of neural networks and support vector machines. The final prediction is based on the method that predicts localization with the highest confidence. The LOC3D database currently contains predictions for over 8700 eukaryotic protein chains taken from PDB. The web server can be used to predict sub-cellular localization for proteins for which only a predicted structure is available from threading servers. This makes the resource of particular interest to structural genomics initiatives.

UniqueProt: creating representative protein sequence sets

Sven Mika & Burkhard Rost

Quote: 2003 Nucleic Acids Research 31, 3789-3791

UniqueProt is a practical and easy to use web-service designed to create representative, unbiased data sets of protein sequences. The largest possible representative sets are found through a simple greedy algorithm using the HSSP-value to establish sequence similarity. UniqueProt is not a real clustering program in the sense that the 'representatives' are not at the centres of well-defined clusters since the definition of such clusters is problem-specific. Overall, UniqueProt is a reasonably fast solution for bias in data sets. The service is accessible at http://cubic.bioc.columbia.edu/services/uniqueprot; a command-line version for Linux is downloadable from this website.
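The greedy idea mentioned above can be illustrated with a minimal sketch. It shows one common greedy redundancy-reduction variant, which is not necessarily the variant UniqueProt implements. The function name `greedy_representatives` and the `is_similar` callback are hypothetical; in a real setting `is_similar` would compare an HSSP-value against a chosen threshold, which is not reproduced here.

```python
from typing import Callable, Hashable, Iterable, List


def greedy_representatives(
    ids: Iterable[Hashable],
    is_similar: Callable[[Hashable, Hashable], bool],
) -> List[Hashable]:
    """Keep a sequence only if it is not similar to any sequence kept so far.

    `is_similar` is a placeholder for a real similarity test (for example an
    HSSP-value above a threshold); it is an assumption of this sketch.
    """
    kept: List[Hashable] = []
    for candidate in ids:
        if all(not is_similar(candidate, other) for other in kept):
            kept.append(candidate)
    return kept


# Toy data: A~B and B~C are "similar"; D is unrelated to everything.
similar_pairs = {frozenset(("A", "B")), frozenset(("B", "C"))}


def toy_similar(a: str, b: str) -> bool:
    return frozenset((a, b)) in similar_pairs


representatives = greedy_representatives(["A", "B", "C", "D"], toy_similar)
```

In this sketch the result depends on the input order, so a practical tool would need to choose that order (or a different greedy criterion) deliberately.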

Static benchmarking of membrane helix predictions

Andrew Kernytsky & Burkhard Rost

Quote: 2003 Nucleic Acids Research 31, 3642-3644

Prediction of trans-membrane helices continues to be a difficult task with a few prediction methods clearly taking the lead; none of these is clearly best on all accounts. Recently, we have carefully set up protocols for benchmarking the most relevant aspects of prediction accuracy and have applied them to over 30 prediction methods. Here, we present the extension of that analysis to an automatic web server evaluating new methods (cubic.bioc.columbia.edu/services/tmh_benchmark/). The most important achievements of the tool are: (1) Any new method is compared to the battery of well-established tools. (2) The battery of measures explored allows spotting strengths in methods that may not be 'best' overall. In particular, we report per-residue and per-segment scores for accuracy, and the error-rates for confusing membrane helices with globular proteins or signal peptides. An additional feature is that developers can directly investigate any hydrophobicity scale for its potential in predicting membrane helices.

EVA: evaluation of protein structure prediction servers

Ingrid YY Koh, Volker A Eyrich, Marc A Marti-Renom, Dariusz Przybylski, Mallur S Madhusudhan, Eswar Narayanan, Osvaldo Grana, Alfonso Valencia, Andrej Sali & Burkhard Rost

Quote: 2003 Nucleic Acids Research 31, 3311-3315

EVA (http://cubic.bioc.columbia.edu/eva/) is a web server evaluating the performance of automatic protein structure prediction methods. The evaluation is fully automated, objective, and updated once a week, in order to cope with the large number of existing prediction servers, whose underlying methods change constantly, and to help developers estimate the accuracy of their methods. EVA currently evaluates servers in the following four categories of protein structure prediction: in 1D, secondary structure predictions; in 2D, contact predictions; in 3D, comparative modelling and threading/fold recognition. Every day, sequences of newly available protein structures are sent to the servers to obtain their predictions. The results collected are then compared once a week to the experimental structures; the results are published through the web. EVA provides useful information to developers as well as users of prediction methods. Over time EVA has accumulated a very large number of proteins to test methods. This assures that the evaluation is continuous, objective and in particular that methods are compared and ranked on data sets of significant sizes. Another unique feature of EVA is that methods are only ranked if their performance has a sustained and significant difference.

DSSPcont: continuous secondary structure assignments for proteins

Phil Carter, Claus AF Andersen & Burkhard Rost

Quote: 2003 Nucleic Acids Research 31, 3293-3295

The DSSP program automatically assigns the secondary structure for each residue from the three-dimensional co-ordinates of a protein structure to one of eight states. However, discrete assignments are incomplete in that they cannot capture the continuum of thermal fluctuations. Therefore, DSSPcont (http://cubic.bioc.columbia.edu/services/DSSPcont) introduces a continuous assignment of secondary structure that replaces 'static' by 'dynamic' states. Technically, the continuum results from calculating weighted averages over ten discrete DSSP assignments with different hydrogen bond thresholds. A DSSPcont assignment for a particular residue is a percentage likelihood of eight secondary structure states, derived from a weighted average of the ten DSSP assignments. The continuous assignments have two important features: (1) They reflect the structural variations due to thermal fluctuations as detected by NMR spectroscopy. (2) They reproduce the structural variation between many NMR models from one single model. Therefore, functionally important variation can be extracted from a single X-ray structure using the continuous assignment procedure.
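The weighted averaging described above can be sketched as follows. This is only an illustration of the arithmetic, assuming each discrete run is given as one state letter per residue. The function name `continuous_assignment`, the toy state strings, and the weights are hypothetical; the actual DSSPcont thresholds and weights are not reproduced here, and the toy example uses three runs instead of ten for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def continuous_assignment(
    assignments: Sequence[str], weights: Sequence[float]
) -> List[Dict[str, float]]:
    """Per residue, turn several discrete assignments into state percentages.

    assignments: one string per discrete run, one state letter per residue.
    weights: one non-negative weight per run (hypothetical values here).
    """
    if len(assignments) != len(weights):
        raise ValueError("need exactly one weight per assignment")
    if len({len(a) for a in assignments}) != 1:
        raise ValueError("all assignments must cover the same residues")
    total = float(sum(weights))
    result: List[Dict[str, float]] = []
    for position in range(len(assignments[0])):
        scores: Dict[str, float] = defaultdict(float)
        for states, weight in zip(assignments, weights):
            # Each run votes for its state with its normalised weight.
            scores[states[position]] += 100.0 * weight / total
        result.append(dict(scores))
    return result


# Toy example: three runs over a three-residue fragment (H helix, T turn, E strand).
toy = continuous_assignment(["HHE", "HTE", "TTE"], [1.0, 2.0, 1.0])
```

A residue assigned to the same state in every run keeps 100% for that state, while a residue whose state depends on the threshold receives a mixed profile.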

META-PP: single interface to crucial prediction servers

Volker A Eyrich & Burkhard Rost

Quote: 2003 Nucleic Acids Research 31, 3308-3310

The Meta-PP server (http://cubic.bioc.columbia.edu/meta/) simplifies access to a battery of public protein structure and function prediction servers by providing a common and stable web-based interface. The goal is to make these powerful and increasingly essential methods more readily available to non-expert users and the bioinformatics community at large. At present META-PP provides access to a selected set of high-quality servers in the areas of comparative modelling, threading/fold recognition, secondary structure prediction and more specialised fields like contact and function prediction.

Better prediction of sub-cellular localization
by combining evolutionary and structural information

Rajesh Nair & Burkhard Rost

Quote: 2003 Proteins: Structure, Function, and Bioinformatics 53, 917-930

The native sub-cellular compartment of a protein is one aspect of its function. Thus, predicting localization is an important step toward predicting function. Short zip code-like sequence fragments regulate some of the shuttling between compartments. Cataloguing and predicting such motifs is the most accurate means of determining localization in silico. However, only a few motifs are currently known, and not all the trafficking appears regulated in this way. The amino acid composition of a protein correlates with its localization. All general prediction methods employed this observation. Here, we explored the evolutionary information contained in multiple alignments and aspects of protein structure to predict localization in the absence of homology and targeting motifs. Our final system combined statistical rules and a variety of neural networks to achieve an overall four-state accuracy above 65%, a significant improvement over systems using only composition. The system was at its best for extra-cellular and nuclear proteins; it was significantly less accurate than TargetP for mitochondrial proteins. Interestingly, all methods that were developed on SWISS-PROT sequences failed grossly when fed with sequences from proteins of known structures taken from PDB. We therefore developed two separate systems: one for proteins of known structure and one for proteins of unknown structure. Finally, we applied the PDB-based system along with homology-based inferences and automatic text analysis to annotate all eukaryotic proteins in the PDB (http://cubic.bioc.columbia.edu/db/LOC3D). We imagine that this pilot method - certainly in combination with similar tools - may be valuable for target selection in structural genomics.

CAFASP3 in the spotlight of EVA

Volker A Eyrich, Ingrid Koh, Dariusz Przybylski, Osvaldo Grana, Florencio Pazos, Alfonso Valencia & Burkhard Rost

Quote: 2003 Proteins: Structure, Function, and Bioinformatics 53 Suppl 6, 548-560

We have analysed fold recognition, secondary structure and contact prediction servers from CAFASP3. This assessment was carried out in the framework of the fully automated, web-based evaluation server EVA. Detailed results are available at http://cubic.bioc.columbia.edu/eva/cafasp3/. We observed that the sequence-unique targets from CAFASP3/CASP5 were not fully representative for evaluating performance. For all three categories, we showed how careless ranking might be misleading. We compared methods from all categories to experts in secondary structure and contact prediction and homology modellers to fold recognisers. While the secondary structure experts clearly outperformed all others, the contact experts appeared to outperform only novel fold methods. Automatic evaluation servers are good at getting statistics right and at using these to discard misleading ranking schemes. We challenge that to let machines rule where they are best might be the best way for the community to enjoy the tremendous benefit of CASP as a unique opportunity for brainstorming.

 

2003 non-peer

Rising accuracy of protein secondary structure prediction

Burkhard Rost

Quote: 2002 Protein structure determination, analysis, and modeling for drug discovery 207-249

We still cannot predict protein 3D structure from sequence, in general. But bioinformatics continuously improves methods predicting simplified aspects of structure. Particularly, the field of secondary structure has achieved a break-through by combining algorithms from artificial intelligence with evolutionary information. PHD, the first third generation method, surmounted the 'magic' line of predicting more than 70% of all residues correctly in one of three states (helix, strand, other). Furthermore, β-strands were predicted correctly almost twice as often as by methods of the first and second generation. Finally, predicted segments look like those observed. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than four percentage points to its current height around 77%. Divergent evolutionary profiles not only contain enough information to substantially improve prediction accuracy, but even to correctly predict long stretches of identical residues observed in alternative secondary structure states depending on non-local conditions. An example is a method automatically identifying structural switches, and thus finding a remarkable connection between predicted secondary structure and aspects of function. Due to their remarkable success, secondary structure predictions have become the workhorse for numerous methods aiming at predicting protein structure and function. Moreover, performance can be improved even further by using these methods in an 'expert' rather than in an 'automatic' mode. Have we now reached the limit of prediction accuracy? Time will tell.

Automatic secondary structure assignment

Claus AF Andersen & Burkhard Rost

Quote: 2003 Methods Biochem Anal. 44, 341-363

Automatically assigning protein secondary structure from 3D co-ordinates is an important and simple bioinformatics tool. The assignments are used in 3D structure visualisations to simplify the presentation of a protein in order to highlight functional aspects. Structural comparisons of proteins are performed faster when first comparing secondary structure. Secondary structure has also been used to improve sequence searches. Hence, secondary structure is important to assure the optimal yield of an experimental structure and to cleverly select the targets for structural genomics. Here, we review the principles of the most popular assignment methods DSSP, STRIDE, DEFINE and P-Curve. We also compare these methods and suggest evaluation criteria for 'good' assignments. Finally, we describe an extension from discrete to continuous assignment of secondary structure.

Prediction In 1D: Secondary structure,
membrane helices, and accessibility

Burkhard Rost

Quote: 2003 Methods Biochem Anal. 44, 559-587

Predictions of simplified aspects of protein structure are often the first step to gaining some insight into the function of a protein. Furthermore, proteome analysis and methods predicting 3D structure increasingly base upon 1D predictions. Developing 1D prediction methods may be one of the most active and most successful disciplines of bioinformatics. Here, I attempted to summarise some of the major ideas of available methods. Particular focus is on evaluating the performance of methods. Recent advances are reviewed and some hints for using methods for sequence analysis are given.

Domains, motifs, and clusters in the protein universe

Jinfeng Liu & Burkhard Rost

Quote: 2003 Current Opinion in Chemical Biology 7, 5-11

The rapid growth of bio-sequences results in an increasing demand for reliable methods that group proteins. A few databases with curated alignments of protein families have demonstrated that expert-driven repositories can keep up with the data deluge in the genome era. These original resources implicitly identify domain-like modules in proteins. An increasing number of automatic methods have sprouted over the last years that cluster the protein universe. Many of these implicitly dissect proteins into structural domain-like fragments. In a very coarse-grained evaluation some of the automatic methods appear on par with expert-driven approaches. However, neither automatic nor manual methods are currently entirely up to the challenges of tasks such as target selection in structural genomics. Thus, we urgently need refined and sustained automatic clustering tools.

Predicting protein structure through evolution

Burkhard Rost, Jinfeng Liu, Dariusz Przybylski, Rajesh Nair, Kazimierz O. Wrzeszczynski, Henry Bigelow & Yanay Ofran

Quote: 2003 Handbook of Chemoinformatics - from data to knowledge 1789-1811

The ultimate goal of protein structure prediction is to extend our knowledge and understanding of the structures and functions of proteins beyond that which is possible by experiment. Virtually all techniques, including 1D, 2D, and 3D structure prediction, and diverse kinds of function prediction use profiles rather than single sequences as the 'information object' for prediction. Database methods rely on structural information to evaluate the fitness of a protein sequence for a given structure according to a statistical model. Energetic methods derive predictions by calculating the fitness according to thermodynamic and kinetic principles. The two approaches have their limitations: database methods suffer from sparse statistics and therefore often over-fit the data, while energetic methods must vastly simplify theory to be tractable with the limited computational power available. The best predictors almost always use a combination of both in an intelligent way.

Cataloguing proteins in cell cycle control

Kazimierz O. Wrzeszczynski & Burkhard Rost

Quote: 2003 Cell cycle checkpoint control protocols 219-233

Bioinformatics makes a number of methods available that can also be used to identify cell cycle related proteins. Nevertheless, few tools are specifically designed to cope with cell cycle proteins. In fact, a vast amount of data is currently scattered among many databases. Here, we present a first detailed analysis of known cell cycle proteins. We combined database mining and literature searches with an evaluation of evolutionary conservation. The objective was to identify cell cycle control proteins in various proteomes. In total we found 595 experimentally annotated cell cycle control proteins; these clustered into 113 distinct structural families. We noticed that neither simple values for pairwise sequence identity nor expectation values taken from popular PSI-BLAST alignments allow an error-free inference of involvement in cell cycle control by sequence similarity. However, when we also considered alignment length we could find thresholds for reliable inference of cell cycle proteins. Applying these safe thresholds to the six entirely sequenced organisms (human, mouse, fly, worm, Arabidopsis and yeast), we could identify 463 un-annotated proteins likely to be involved in cell cycle control. Slightly lower levels of accuracy extended the count to approximately 500-1300 additional proteins, which may be candidates for involvement in the cell cycle control process.

Automatic prediction of protein function

Burkhard Rost, Jinfeng Liu, Rajesh Nair, Kazimierz O. Wrzeszczynski & Yanay Ofran

Quote: 2003 Cellular and Molecular Life Sciences submitted Mar 25, 2003

Most methods annotating protein function utilise sequence homology to proteins of experimentally known function. Such a homology-based annotation transfer is problematic and limited in scope. Therefore, computational biologists have begun to develop ab initio methods that predict aspects of function, including sub-cellular localization, post-translational modifications, functional type, and protein-protein interactions. For the first two cases, the most accurate approaches rely on identifying short signalling motifs, while the most general methods utilise tools of artificial intelligence. An example for this is an outstanding new method that predicts classes of cellular function directly from sequence. Similarly, promising methods have been developed that correctly predict protein-protein interaction partners at acceptable levels of accuracy, at least for some pairs in entire proteomes. No matter how difficult the task appears, successes over the last few years have clearly paved the way.

 

2003 collaborations

Role of transmembrane domains in the functions of Fc receptors

Raphael Zidovetzki, Burkhard Rost & Israel Pecht

Quote: 2003 Journal of Biophysical Chemistry 15, 555-575

In the present study we use a novel method, PHDhtm, to predict the exact locations and extents of the transmembrane (TM) domains of multisubunit immunoglobulin Fc-receptors. Whereas most previous studies have used single residue hydrophobicity plots for characterizing these domains, PHDhtm utilizes a system of neural networks and the evolutionary information contained in multiple alignments of related sequences to predict the above. The present PHDhtm application predicts TM domains of immunoglobulin Fc-receptors that in many cases differ significantly from those derived by using earlier methods. Comparisons of helical wheel projections of the presently derived TM domains from PHDhtm with those produced earlier reveal different hydrophobic moments as well as hydrophobic and hydrophilic surfaces. These differences probably alter the character of subunit association within the receptor complexes. This new algorithm can also be used for other membrane protein complexes and may advance both the understanding of the principles underlying the formation of such complexes, and the design of peptides that can interfere with such TM domain association so as to modulate specific cellular responses.

Solution NMR structure of the 30S ribosomal protein S28E from Pyrococcus horikoshii

JM Aramini, YJ Huang, JR Cort, S Goldsmith-Fischman, R Xiao, L Shih, CK Ho, J Liu, B Rost, B Honig, MA Kennedy, TB Acton & GT Montelione

Quote: 2003 Protein Science 12, 2823-2830

We report NMR assignments and solution structure of the 71-residue 30S ribosomal protein S28E from the archaean Pyrococcus horikoshii, target JR19 of the Northeast Structural Genomics Consortium. The structure, determined rapidly with the aid of automated backbone resonance assignment (AutoAssign) and automated structure determination (AutoStructure) software, is characterized by a four-stranded beta-sheet with a classic Greek-key topology and an oligonucleotide/oligosaccharide beta-barrel (OB) fold. The electrostatic surface of S28E exhibits positive and negative patches on opposite sides, the former constituting a putative binding site for RNA. The 13 C-terminal residues of the protein contain a consensus sequence motif constituting the signature of the S28E protein family. Surprisingly, this C-terminal segment is unstructured in solution.

 


 

 


 

2004 peer-reviewed

CHOP proteins into structural domain-like fragments

Jinfeng Liu & Burkhard Rost

Quote: 2004 Proteins: Structure, Function, and Bioinformatics 55, 678-688

We developed a method, CHOP, dissecting proteins into domain-like fragments. The basic idea was to cut proteins from entirely sequenced organisms beginning from very reliable experimental information (PDB), proceeding to expert annotations of domain-like regions (Pfam-A), and completing through cuts based on termini of known proteins. In this way, CHOP dissected over two thirds of all proteins from 62 proteomes. Analysis of our structural domain-like fragments revealed four surprising results. First, over 70% of all dissected proteins contained more than one fragment. Second, most domains spanned on average over about 100 residues. This average was similar for eukaryotic and prokaryotic proteins, and it is also valid - although previously not described - for all proteins in the PDB. Third, single domain proteins were significantly longer than most domains in multi-domain proteins. Fourth, three-fourths of all domains appeared shorter than 210 residues. We believe that our CHOP fragments constituted an important resource for functional and structural genomics. Nevertheless, our main motivation to develop CHOP was that single-linkage clustering methods failed to adequately group full-length proteins. In contrast, CLUP - the simple clustering scheme introduced here - largely succeeded in grouping the CHOP fragments from 62 proteomes such that all members of one cluster shared a basic structural core. CLUP found over 63,000 multi- and over 118,000 single-member clusters. Although most fragments were restricted to a particular cluster, about 24% of the fragments were duplicated in at least two clusters. Our thresholds for grouping two fragments into the same cluster were rather conservative. Nevertheless, our results suggested that structural genomics initiatives have to target over 30,000 fragments to at least cover the multi-member clusters in 62 proteomes.

Predicting transmembrane beta-barrels for entire proteomes

Henry Bigelow, Donald Petrey, Jinfeng Liu, Dariusz Przybylski & Burkhard Rost

Quote: 2004 Nucleic Acids Research 32, 2566-2577

Very few methods address the problem of predicting beta-barrel membrane proteins directly from sequence. One reason is that only very few high-resolution structures for transmembrane beta-barrel proteins (TMB) have been determined thus far. Here we introduced the design, statistics and results of a novel profile-based Hidden Markov Model for the prediction and discrimination of transmembrane beta-barrels. The method carefully attempts to avoid over-fitting the sparse experimental data. While our model training and scoring procedures were very similar to a recently published work, the architecture and structure-based labelling were significantly different. In particular, we introduced a new definition of beta-hairpin motifs, explicit state modelling of transmembrane strands, and a log-odds whole-protein discrimination score. The resulting method reached an overall four-state (up-, down-strand, periplasmic-, outer-loop) accuracy as high as 86%. Furthermore, it accurately discriminated TMB from non-TMB proteins (45% coverage at 100% accuracy). This high precision enabled the application to 72 entirely sequenced Gram-negative bacteria. At high confidence, we found over 164 previously uncharacterised TMB proteins. Database searches did not implicate any of these proteins with membranes. We challenge that the vast majority of our 164 predictions will eventually be verified experimentally.

Annotating proteins from Endoplasmic Reticulum and Golgi apparatus in eukaryotic proteomes

Kazimierz O. Wrzeszczynski & Burkhard Rost

Quote: 2004 Cellular and Molecular Life Sciences 61, 1341-1353

The sub-cellular localization of a native protein constitutes one coarse-grained aspect of its function. Transport between compartments is often regulated through short sequence motifs. Here, we analysed experimentally characterised ER/Golgi retrieval motifs and investigated the accuracy of homology-transfer. Only the C-terminal ER retrieval motifs KDEL, HDEL and AIAKE were sufficiently specific. However, even unspecific motifs may help, provided we know the probability for localization given this motif. We provided such estimates. We also rigorously estimated the accuracy and coverage for inferring ER and Golgi localization through homology-transfer by sequence similarity. In entire proteomes, we could thereby annotate 3304 ER (3182 membrane) and 1853 Golgi proteins (759 membrane). We identified another 5157 globular and 3941 membrane putative ER or Golgi proteins. Each experimental annotation yielded, on average, 1-3 high-accuracy and 5-6 low-accuracy homology-transfers in the six proteomes. These numbers will increase with each new experimental annotation.
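As a small illustration of the motif-based step, the following sketch checks whether a sequence ends in one of the three C-terminal retrieval motifs named above. It only demonstrates the string test. The function name `c_terminal_er_motif` and the example sequences are hypothetical, and a match is a hint rather than an annotation, since the abstract stresses that localization probabilities have to be estimated per motif.

```python
from typing import Optional

# C-terminal ER retrieval motifs reported as sufficiently specific in the abstract.
ER_RETRIEVAL_MOTIFS = ("KDEL", "HDEL", "AIAKE")


def c_terminal_er_motif(sequence: str) -> Optional[str]:
    """Return the retrieval motif found at the C-terminus, or None."""
    cleaned = sequence.strip().upper()
    for motif in ER_RETRIEVAL_MOTIFS:
        if cleaned.endswith(motif):
            return motif
    return None


# Made-up sequences, for illustration only.
hit = c_terminal_er_motif("MKTAYIAKQRKDEL")
miss = c_terminal_er_motif("MKDELAAAGG")  # motif present, but not at the C-terminus
```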

The PredictProtein server

Burkhard Rost, Guy Yachdav & Jinfeng Liu

Quote: 2004 Nucleic Acids Research 32, W321-W326

PredictProtein (PP, http://www.predictprotein.org) is an Internet service for sequence analysis and the prediction of protein structure and function. Users submit protein sequences or alignments; PredictProtein returns multiple sequence alignments, PROSITE sequence motifs, low-complexity regions (SEG), nuclear localisation signals, regions lacking regular structure (NORS) and predictions of secondary structure, solvent accessibility, globular regions, transmembrane helices, coiled-coil regions, structural switch regions, disulfide-bonds, sub-cellular localization, and functional annotations. Upon request fold recognition by prediction-based threading, CHOP domain assignments, predictions of transmembrane strands and inter-residue contacts are also available. For all services, users can submit their query either by electronic mail, or interactively from World Wide Web.

LOCnet and LOCtarget: Sub-cellular localization for structural genomics targets

Rajesh Nair & Burkhard Rost

Quote: 2004 Nucleic Acids Research 32, W517-W521

LOCtarget is a web server and database that predicts and annotates sub-cellular localization for structural genomics targets; LOCnet is one of the methods used in LOCtarget that can predict sub-cellular localization for all eukaryotic and prokaryotic proteins. Targets are taken from the central registration database for structural genomics, namely TargetDB. LOCtarget predicts localization through a combination of four different methods: known nuclear localization signals (PredictNLS), homology-based transfer of experimental annotations (LOChom), inference through automatic text analysis of SWISS-PROT keywords (LOCkey), and de novo prediction through a system of neural networks (LOCnet). Additionally, we report predictions from SignalP. The final prediction is based on the method with the highest confidence. The web server can be used to predict sub-cellular localization of proteins from their amino acid sequence. The LOCtarget database currently contains localization predictions for all eukaryotic proteins from TargetDB and is updated every week. The server is available at: http://www.rostlab.org/services/LOCtarget/.
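The rule that the final prediction comes from the method with the highest confidence can be sketched as below. This assumes that every method reports a confidence on a comparable scale, which is an assumption of the sketch and not something stated in the abstract. The function name `final_prediction`, the tuple layout, and the example values are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (localization, confidence); None means the method made no prediction.
Prediction = Optional[Tuple[str, float]]


def final_prediction(
    predictions: Dict[str, Prediction]
) -> Optional[Tuple[str, str, float]]:
    """Pick the prediction of the most confident method.

    Returns (method, localization, confidence), or None if no method fired.
    """
    available = {m: p for m, p in predictions.items() if p is not None}
    if not available:
        return None
    best_method = max(available, key=lambda m: available[m][1])
    localization, confidence = available[best_method]
    return best_method, localization, confidence


# Made-up confidences, for illustration only.
example = final_prediction(
    {
        "PredictNLS": None,
        "LOChom": ("cytoplasm", 0.62),
        "LOCkey": ("nucleus", 0.91),
        "LOCnet": ("nucleus", 0.55),
    }
)
```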

CHOP: parsing proteins into structural domains

Jinfeng Liu & Burkhard Rost

Quote: 2004 Nucleic Acids Research 32, W569-W571

Sequence-based domain assignment is one of the most important and challenging problems in structural biology. We have developed the method CHOP that chops proteins into domain-like fragments. The basic idea is to cut proteins from entirely sequenced organisms beginning from very reliable experimental information (PDB), proceeding to expert annotations of domain-like regions (Pfam-A), and completing through cuts based on termini of native protein ends. The CHOP server takes protein sequences as input and returns the dissections supported by homology transfer. CHOP results are precompiled for many entirely sequenced proteomes. The service is available at: http://www.rostlab.org/services/CHOP/.

NLProt: extracting protein names and sequences from papers

Sven Mika & Burkhard Rost

Quote: 2004 Nucleic Acids Research 32, W634-W637

Automatically extracting protein names from the literature and linking these names to the associated entries in sequence databases becomes increasingly important for annotating biological databases. NLProt is a novel system that combines dictionary- and rule-based filtering with several Support-Vector Machines (SVMs) to tag protein names in PubMed abstracts. When considering partially tagged names as errors, NLProt still reached a precision of 75% at a recall of 76%. By many criteria our system outperformed other tagging methods significantly; in particular, it proved very reliable even for novel names. Names encountered particularly frequently in Drosophila, such as white, wing, and bizarre, constitute an obvious limitation of NLProt. Our method is available both as an Internet server and as a program for download (http://cubic.bioc.columbia.edu/services/NLProt/). Input can be PubMed/MEDLINE identifiers, authors, titles, and journals, as well as collections of abstracts, or entire papers.
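For readers less familiar with the two measures quoted above, a minimal sketch of how precision and recall follow from tagging counts is given here. The counts are hypothetical and were chosen only so that the ratios match the reported 75% and 76%; they are not the counts from the NLProt evaluation. Under the strict scoring mentioned in the abstract, a partially tagged name would count as an error rather than as a true positive.

```python
from typing import Tuple


def precision_recall(
    true_pos: int, false_pos: int, false_neg: int
) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts: 57 names tagged correctly, 19 wrong tags, 18 names missed.
precision, recall = precision_recall(true_pos=57, false_pos=19, false_neg=18)
```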

Automatic target selection for structural genomics on eukaryotes

Jinfeng Liu, Hedi Hegyi, Tom Acton, Gaetano T Montelione & Burkhard Rost

Quote: 2004 Proteins: Structure, Function, and Bioinformatics 56, 188-200

A central goal of structural genomics is to experimentally determine representative structures for all protein families. At least 14 structural genomics pilot projects are currently investigating the feasibility of high-throughput structure determination; nine of these in the USA are NIH funded. Initiatives differ in the particular subset of 'all families' on which they focus. At the NorthEast Structural Genomics consortium (NESG), we target eukaryotic protein domain families. The automatic target selection procedure has three aims: (1) Identify all protein domain families from currently five entirely sequenced eukaryotic target organisms based on their sequence homology. (2) Discard those families that can be modelled based on structural information already present in the PDB. (3) Target representatives of the remaining families for structure determination. In order to guarantee that all members of one family share a common fold-like region, we had to begin by dissecting proteins into structural domain-like regions before clustering. Our hierarchical approach, CHOP, utilising homology to PrISM, Pfam-A, and SWISS-PROT chopped the 103,796 eukaryotic proteins/ORFs into 247,222 fragments. 122,999 of these fragments appeared suitable targets that were grouped into over 27,000 singleton and over 18,000 multi-fragment clusters. Thus, our results suggested that it might be necessary to determine over 40,000 structures to minimally cover the subset of five eukaryotic proteomes.

Sequence-based prediction of protein domains

Jinfeng Liu & Burkhard Rost

Quote: 2004 Nucleic Acids Research 32, 3522-3530

Guessing the boundaries of structural domains has been an important and challenging problem in experimental and computational structural biology. Predictions were based on intuition, biochemical properties, statistics, sequence homology, and other aspects of predicted protein structure. Here, we introduced CHOPnet, a de novo method that predicts structural domains in absence of homology to known domains. Our method was based on neural networks and relied exclusively on information available for all proteins. Evaluating sustained performance through rigorous cross-validation on proteins of known structure, we correctly predicted the number of domains in 69% of all proteins. For 50% of the two-domain proteins the centre of the predicted boundary was closer than 21 residues to the boundary assigned from 3D structures; this was about eight percentage points better than predictions by 'equal split'. Our results appeared to compare favourably with those from previously published methods. CHOPnet may be useful to restrict the experimental testing of different fragments for structure determination in the context of structural genomics.

 

Protein names peeled precisely off free text

Sven Mika & Burkhard Rost

Quote: 2004 Bioinformatics 20, I241-I247

Motivation: Automatically identifying protein names from the scientific literature is a prerequisite for the increasing demand for data-mining tools of this wealth of information. Existing approaches are based on dictionaries, rules, and machine-learning. Here, we introduced a novel system that combines a pre-processing dictionary- and rule-based filtering step with several separately trained Support-Vector Machines (SVMs) to identify protein names in MEDLINE abstracts.

Results: Our new tagging system NLProt is able to extract protein names with a precision (accuracy) of 75% at a recall (coverage) of 76% after training on a corpus of 200 annotated abstracts that had been used before by other groups. For our estimate of sustained performance, we considered partially identified names as false positives. One important issue frequently ignored in the literature is the redundancy in evaluation sets. We suggested some guidelines for removing undue overlap between training and testing sets. Applying these new guidelines, our program appeared to significantly out-perform other methods tagging protein names. NLProt owed its success to the SVM building blocks, which succeeded in utilising the local context of protein names in scientific literature. We suggest that our system may constitute the most general and precise method for tagging protein names.

Availability:  http://cubic.bioc.columbia.edu/services/nlprot/

Contact: mika@cubic.bioc.columbia.edu
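
The evaluation criterion quoted above (partially identified names count as false positives) can be illustrated with a short sketch. It is not the NLProt evaluation code: representing names as character-offset spans, and the toy spans themselves, are assumptions made for this example.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of a tagged name


def precision_recall(predicted: Set[Span], annotated: Set[Span]) -> Tuple[float, float]:
    """Strict scoring: only exact span matches count as true positives."""
    true_pos = len(predicted & annotated)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(annotated) if annotated else 0.0
    return precision, recall


# Toy data: (25, 28) only partially covers the annotated name (25, 30),
# so it is scored as a false positive and the annotated name as missed.
annotated = {(0, 4), (10, 18), (25, 30), (40, 47)}
predicted = {(0, 4), (10, 18), (25, 28), (50, 55)}
precision, recall = precision_recall(predicted, annotated)
print(precision, recall)
```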

Improving fold recognition without folds

Dariusz Przybylski & Burkhard Rost

Quote: 2004 Journal of Molecular Biology 341, 255-269

The most reliable way to align two proteins of unknown structure is through sequence-profile and profile-profile alignment methods. If the structure for one of the two is known, fold recognition methods outperform purely sequence-based alignments. Here, we introduced AGAPE, a novel method that aligns generalized sequence and predicted structure profiles. Using predicted 1D structure (secondary structure and solvent accessibility) significantly improved over sequence-only methods, both in terms of correctly recognising pairs of proteins with different sequences and similar structures and in terms of correctly aligning the pairs. The scores obtained by our generalised scoring matrix followed an Extreme Value Distribution; this yielded accurate estimates of the statistical significance of our alignments. We found that mistakes in 1D structure predictions correlated between proteins from different sequence-structure families. The impact of this surprising result was that our method succeeded in significantly out-performing sequence-only methods even without explicitly using structural information from either of the two proteins. Since AGAPE also outperformed established methods that rely on 3D information, we made it available through http://www.predictprotein.org. If we solved the problem of CPU-time required to apply AGAPE on millions of proteins, our results could also impact everyday database searches.

 

2004 non-peer

Annotating protein function through lexical analysis

Rajesh Nair & Burkhard Rost

Quote: 2004 AI Magazine 25, 45-56

We now know the entire genomes for over 100 organisms. The experimental characterisation of the newly sequenced proteins lags behind this explosion of raw sequences (sequence-function gap). The rate at which expert annotators add experimental information into more or less controlled vocabularies of databases snails along at an even slower pace. Most methods that annotate protein function exploit sequence similarity by transferring experimental information from homologues. A crucial development aiding such homology-based information transfer is the emergence of large-scale, work- and management-intensive projects venturing to develop a comprehensive ontology for protein function, like the Gene Ontology project. In parallel, fully- or semi-automatic methods have successfully begun to mine the existing data through lexical analysis. Some of these tools target parsing controlled vocabularies from databases; others dare to mine free text from MEDLINE abstracts or full scientific papers. Automated text analysis has become a rapidly expanding discipline in bioinformatics. A few of these text-based tools have already been embedded into research projects.

AI and Bioinformatics.

Janice Glasgow, Igor Jurisica & Burkhard Rost

Quote: 2004 AI Magazine 25, 7-8

Editorial

 

2004 collaborations

The protein target list of the Northeast Structural Genomics consortium

Zeba Wunderlich, Thomas B Acton, Jinfeng Liu, Gregory Kornhaber, John Everett, Phil Carter, Ning Lan, Nathaniel Ecols, Mark Gerstein, Burkhard Rost, & Gaetano T Montelione

Quote: 2004 Proteins: Structure, Function, and Bioinformatics 56, 181-187

Editorial

1H, 13C and 15N assignments for the Archaeglobus fulgidis protein AF2095

R Powers, TB Acton, Y Chiang, PK Rajan, JR Cort, MA Kennedy, J Liu, L Ma, B Rost & GT Montelione

Quote: 2004 Journal of Biomolecular NMR 30, 107-108


 


 

 


 

2005 peer-reviewed

NMPdb: database of nuclear matrix proteins

Sven Mika & Burkhard Rost

Quote: 2005 Nucleic Acids Research 33, D160-163

The nuclear matrix (NM) is a structure resulting from the aggregation of proteins and RNA in the nucleus of eukaryotic cells; it is the "sticky bit" that remains after aggressive DNase digestion and salt extraction protocols. Due to the important role of the NM in DNA-replication and -transcription and in RNA-splicing, the expression pattern of NM proteins has become an important early indicator for numerous cancers/tumors. Recent descriptions of the NM structure distinguish between a network-like "internal nuclear matrix" (INM) and a "nuclear shell" that connects the INM to the inner and outer nuclear membranes. A cautious nuclear-matrix preparation protocol reveals a coat of proteins on top of the INM; these proteins are usually referred to as the "nuclear matrix-associated proteins". Here, we describe a new database (NMPdb http://www.rostlab.org/db/NMPdb/) that currently contains 398 nuclear matrix proteins. We collected these data through a semi-automated analysis of over 3,000 scientific articles in PubMed. We could match these 398 proteins to 302 protein sequences in UniProt or GenBank. Our NMPdb repository annotates these links along with the following annotations: organism, cell-type, PubMed identifier, sequence-based predictions of structural and functional features and for some entries the explicit sequence segment that is responsible for localization (nuclear matrix targeting signal).

Mimicking cellular sorting improves prediction of subcellular localization

Rajesh Nair & Burkhard Rost

Quote: 2005 Journal of Molecular Biology 348, 85-100

Predicting the native subcellular compartment of a protein is an important step toward elucidating its function. Here we introduce LOCtree, a hierarchical system combining support vector machines (SVMs) and other prediction methods. LOCtree predicts the subcellular compartment of a protein by mimicking the mechanism of cellular sorting and exploiting a variety of sequence and predicted structural features in its input. Currently LOCtree does not predict localization for membrane proteins, since the compositional properties of membrane proteins significantly differ from those of non-membrane proteins. While any information about function can be used by the system, we presented estimates of performance that are valid when only the amino acid sequence of a protein is known. When evaluated on a non-redundant test set, LOCtree achieved sustained levels of 74% accuracy for non-plant eukaryotes, 70% for plants, and 84% for prokaryotes. We rigorously benchmarked LOCtree in comparison to the best alternative methods for localization prediction. LOCtree outperformed all other methods in nearly all benchmarks. Localization assignments using LOCtree agreed quite well with data from recent large-scale experiments. Our preliminary analysis of a few entirely sequenced organisms, namely human (Homo sapiens), yeast (Saccharomyces cerevisiae), and weed (Arabidopsis thaliana) suggested that over 35% of all non-membrane proteins are nuclear, about 20% are retained in the cytosol, and that every fifth protein in the weed resides in the chloroplast.

Protein folding rates estimated from contact predictions

Marco Punta & Burkhard Rost

Quote: 2005 Journal of Molecular Biology 348, 507-512

Folding rates of small single-domain proteins that fold through simple two-state kinetics can be estimated from details of the three-dimensional protein structure. Previously, predictions of secondary structure had been exploited to predict folding rates from sequence. Here, we estimate two-state folding rates from predictions of internal residue-residue contacts in proteins of unknown structure. Our estimate is based on the correlation between the folding rate and the number of predicted long-range contacts normalized by the square of the protein length. It is well known that long-range order derived from known structures correlates with folding rates. The surprise was that estimates based on very noisy contact predictions were almost as accurate as the estimates based on known contacts. On average, our estimates were similar to those previously published from secondary structure predictions. The combination of these methods that exploit different sources of information improved performance. It appeared that the combined method reliably distinguished fast from slow two-state folders.
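
The quantity at the centre of the abstract above, the number of predicted long-range contacts normalized by the square of the protein length, can be sketched as follows together with a plain Pearson correlation. This is an illustration under stated assumptions, not the published procedure: the sequence-separation threshold that defines "long-range" (min_separation) and the toy contact list are hypothetical choices made for this sketch.

```python
import math
from typing import List, Sequence, Tuple


def normalized_long_range_contacts(
    contacts: Sequence[Tuple[int, int]], length: int, min_separation: int = 12
) -> float:
    """Count contacts (i, j) with |i - j| >= min_separation, divided by length squared.

    min_separation is an assumed threshold; the abstract does not state one.
    """
    n_long = sum(1 for i, j in contacts if abs(i - j) >= min_separation)
    return n_long / (length ** 2)


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy example: three predicted contacts in a 50-residue protein,
# two of which are separated by at least 12 residues.
score = normalized_long_range_contacts([(1, 20), (2, 5), (3, 40)], length=50)
print(score)
```

Given such scores for a set of proteins with measured folding rates, pearson(scores, log_rates) would quantify the correlation on which the estimate is based.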

PROFcon: novel prediction of long-range contacts

Marco Punta & Burkhard Rost

Quote: 2005 Bioinformatics 21, 2960-2968

Motivation: Despite the continuing advance in the experimental determination of protein structures, the gap between the number of known protein sequences and structures continues to increase. Prediction methods can bridge this sequence-structure gap only partially. Better predictions of non-local contacts between residues could improve comparative modeling, fold recognition, and could assist the experimental structure determination.

Results: Here, we introduced PROFcon, a novel contact prediction method that combines information from alignments, from predictions of secondary structure and solvent accessibility, from the region between two residues, and from the average properties of the entire protein. In contrast to some other methods, PROFcon predicted short and long proteins at similar levels of accuracy. As expected, PROFcon was clearly less accurate when tested on sparse evolutionary profiles, i.e. on families with few homologues. Prediction accuracy was highest for proteins belonging to the SCOP alpha/beta class. PROFcon compared favorably with state-of-the-art prediction methods at the CASP6 meeting. While the performance may still be perceived as low, our method clearly pushed the mark higher. Furthermore, predictions are already accurate enough to seed predictions of global features of protein structure.

Availability:  www.predictprotein.org/submit_profcon.html

Contact: punta@cubic.bioc.columbia.edu

Protein flexibility and rigidity predicted from sequence

Avner Schlessinger & Burkhard Rost

Quote: 2005 Proteins: Structure, Function, and Bioinformatics in press

Structural flexibility has been associated with various biological processes such as molecular recognition and catalytic activity. In silico studies of protein flexibility have attempted to characterize and predict flexible regions based on simple principles. B-values derived from experimental data are widely used to measure residue flexibility. Here, we present the most comprehensive large-scale analysis of B-values. We used this analysis to develop a neural network-based method that predicts flexible/rigid residues from amino acid sequence. The system uses both global and local information, i.e. features from the entire protein such as secondary structure composition, protein length, and fraction of surface residues, and features from a local window of sequence-consecutive residues. The most important local feature was the evolutionary exchange profile reflecting sequence conservation in a family of related proteins. To illustrate its potential, we applied our method to four different case studies, each of which related our predictions to aspects of function. The first two were the prediction of regions that undergo conformational switches upon environmental changes (switch II region in Ras) and the prediction of surface regions the rigidity of which is crucial for their function (tunnel in propeller folds). Both were correctly captured by our method. The third study established that residues in active sites of enzymes are predicted by our method to have unexpectedly high B-values. The final study demonstrated how well our predictions correlated with NMR order parameters to reflect motion. Our method had not been set up to address any of the tasks in those four case studies. Therefore, we expect that this method will assist in many attempts at inferring aspects of function.

 

2005 non-peer

Predictive methods using protein sequence

Yanay Ofran & Burkhard Rost

Quote: 2005 Bioinformatics 197-222

The amino acid sequence of a protein dictates its 3D structure, which, in turn determines its function. It is rather simple to determine the sequence of a protein, but quite complicated and laborious to determine its structure and function. Consequently, the number of available sequences grows rapidly, while only a small fraction of them gets comprehensive annotation. Bioinformatics attempts to bridge this gap by devising computational methods for the prediction of structure and function from the protein sequences.

How to use protein 1D structure predicted by PROFphd

Burkhard Rost

Quote: 2005 The Proteomics Protocols Handbook 875-901

Predicting simplified aspects of protein structure is often the first step to gaining some insight into protein structure and function. Proteome analysis, therefore, increasingly relies on 1D predictions. Developing such methods may be one of the most active and most successful disciplines of computational biology. The key to this success has been the combination of algorithms from artificial intelligence with experimental high-resolution data and the wealth of evolutionary information contained in today's ever growing databases. Here, I discuss the nuts and bolts of the PROFphd programs that predict secondary structure, solvent accessibility and transmembrane segments. On the one hand, 1D predictions have become crucial for automatic methods that identify distant structural relations or predict aspects of function such as sub-cellular localization, protein-protein interaction interfaces, functional types, regions undergoing local conformational re-arrangements, and intrinsically unstructured regions. On the other hand, 1D predictions have assisted experimental biologists in single case studies for tasks such as chain tracing, designing antibodies, exploring the effects of point mutations, refining the quest for binding sites and interactions, and in unravelling functional and structural similarities.

Beyond annotation transfer by homology: novel protein function prediction methods that can assist drug discovery

Yanay Ofran, Marco Punta, Reinhard Schneider & Burkhard Rost

Quote: 2005 Drug Discovery Today 10, 1475-1482

Each entirely sequenced organism adds hundreds to thousands of protein sequences for which the only annotation is "hypothetical protein". Thousands of those may be drug targets that remain hidden without annotations. Even high-throughput experiments lag behind the speed of sequencing. Computational methods that generate hypotheses about protein or gene function contribute to narrowing this sequence-function gap. Here, we review the challenges, research approaches, and recently developed tools in the field of function prediction. The first promising methods can predict function de novo, i.e. for completely uncharacterized proteins. This considerably widens the usefulness of sequencing for drug discovery.

 

2005 collaborations

The 2.35 Å structure of the TenA homolog from Pyrococcus furiosus supports an enzymatic function in thiamine metabolism

J Benach, WC Edstrom, I Lee, K Das, B Cooper, R Xiao, J Liu, B Rost, TB Acton, GT Montelione & JF Hunt

Quote: 2005 Acta Crystallogr D Biol Crystallogr 61, 589-598

TenA (transcriptional enhancer A) has been proposed to function as a transcriptional regulator based on observed changes in gene-expression patterns when overexpressed in Bacillus subtilis. However, studies of the distribution of proteins involved in thiamine biosynthesis in different fully sequenced genomes have suggested that TenA may be an enzyme involved in thiamine biosynthesis, with a function related to that of the ThiC protein. The crystal structure of PF1337, the TenA homolog from Pyrococcus furiosus, is presented here. The protomer comprises a bundle of alpha-helices with a similar tertiary structure and topology to that of human heme oxygenase-1, even though there is no significant sequence homology. A solvent-sequestered cavity lined by phylogenetically conserved residues is found at the core of this bundle in PF1337 and this cavity is observed to contain electron density for 4-amino-5-hydroxymethyl-2-methylpyrimidine phosphate, the product of the ThiC enzyme. In contrast, the modestly acidic surface of PF1337 shows minimal levels of sequence conservation and a dearth of the basic residues that are typically involved in DNA binding in transcription factors. Without significant conservation of its surface properties, TenA is unlikely to mediate functionally important protein-protein or protein-DNA interactions. Therefore, the crystal structure of PF1337 supports the hypothesis that TenA homologs have an indirect effect in altering gene-expression patterns and function instead as enzymes involved in thiamine metabolism.

EVAcon: a protein contact prediction evaluation service

O Grana, VA Eyrich, F Pazos, B Rost & A Valencia

Quote: 2005 Nucleic Acids Res 33, W347-351

Here we introduce EVAcon, an automated web service that evaluates the performance of contact prediction servers. Currently, EVAcon is monitoring nine servers, four of which are specialized in contact prediction and five of which are general structure prediction servers. Results are compared for all newly determined experimental structures deposited into PDB (approximately 5-50 per week). EVAcon allows for a precise comparison of the results based on a system of common protein subsets and the commonly accepted evaluation criteria that are also used in the corresponding category of the CASP assessment. EVAcon is a new service added to the functionality of the EVA system for the continuous evaluation of protein structure prediction servers. The new service is accessible from any of the three EVA mirrors: PDG (CNB-CSIC, Madrid) (http://www.pdg.cnb.uam.es/eva/con/index.html); CUBIC (Columbia University, NYC) (http://cubic.bioc.columbia.edu/eva/con/index.html); and Sali Lab (UCSF, San Francisco) (http://eva.compbio.ucsf.edu/~eva/con/index.html).

ISMB 2005

HV Jagadish, D States & B Rost

Quote: 2005 Bioinformatics 21 Suppl 1, i1-i2

Editorial

The transcriptional landscape of the mammalian genome

The FANTOM Consortium, P Carninci, T Kasukawa, S Katayama, J Gough, MC Frith, N Maeda, R Oyama, T Ravasi, B Lenhard, C Wells, R Kodzius, K Shimokawa, VB Bajic, SE Brenner, S Batalov, ARR Forrest, M Zavolan, MJ Davis, LG Wilming, V Aidinis, JE Allen, A Ambesi-Impiombato, R Apweiler, RN Aturaliya, TL Bailey, M Bansal, L Baxter, KW Beisel, T Bersano, H Bono, AM Chalk, KP Chiu, V Choudhary, A Christoffels, DR Clutterbuck, ML Crowe, E Dalla, BP Dalrymple, B deBono, G DellaGatta, D diBernardo, T Down, P Engstrom, M Fagiolini, G Faulkner, CF Fletcher, T Fukushima, M Furuno, S Futaki, M Gariboldi, P Georgii-Hemming, TR Gingeras, T Gojobori, RE Green, S Gustincich, M Harbers, Y Hayashi, TK Hensch, N Hirokawa, D Hill, L Huminiecki, M Iacono, K Ikeo, A Iwama, T Ishikawa, M Jakt, A Kanapin, M Katoh, Y Kawasawa, J Kelso, H Kitamura, H Kitano, G Kollias, SPT Krishnan, A Kruger, SK Kummerfeld, IV Kurochkin, LF Lareau, D Lazarevic, L Lipovich, J Liu, S Liuni, S McWilliam, M MadanBabu, M Madera, L Marchionni, H Matsuda, S Matsuzawa, H Miki, F Mignone, S Miyake, K Morris, S Mottagui-Tabar, N Mulder, N Nakano, H Nakauchi, P Ng, R Nilsson, S Nishiguchi, S Nishikawa, F Nori, O Ohara, Y Okazaki, V Orlando, KC Pang, WJ Pavan, G Pavesi, G Pesole, N Petrovsky, S Piazza, J Reed, JF Reid, BZ Ring, M Ringwald, B Rost, Y Ruan, SL Salzberg, A Sandelin, C Schneider, C Schönbach, K Sekiguchi, CAM Semple, S Seno, L Sessa, Y Sheng, Y Shibata, H Shimada, K Shimada, D Silva, B Sinclair, S Sperling, E Stupka, K Sugiura, R Sultana, Y Takenaka, K Taki, K Tammoja, SL Tan, S Tang, MS Taylor, J Tegner, SA Teichmann, HR Ueda, E vanNimwegen, R Verardo, CL Wei, K Yagi, H Yamanishi, E Zabarovsky, S Zhu, A Zimmer, W Hide, C Bult, SM Grimmond, RD Teasdale, ET Liu, V Brusic, J Quackenbush, C Wahlestedt, JS Mattick, DA Hume, C Kai, D Sasaki, Y Tomaru, S Fukuda, M Kanamori-Katayama, M Suzuki, J Aoki, T Arakawa, J Iida, K Imamura, M Itoh, T Kato, H Kawaji, N Kawagashira, T Kawashima, M Kojima, S Kondo, H Konno, K Nakano, N Ninomiya, T Nishio, M Okada, C Plessy, K Shibata, T Shiraki, S Suzuki, M Tagami, K Waki, A Watahiki, Y Okamura-Oho, H Suzuki, J Kawai and Y Hayashizaki

Quote: 2005 Science 309, 1559-1563

This study describes comprehensive polling of transcription start and termination sites and analysis of previously unidentified full-length complementary DNAs derived from the mouse genome. We identify the 5' and 3' boundaries of 181,047 transcripts with extensive variation in transcripts arising from alternative promoter usage, splicing, and polyadenylation. There are 16,247 new mouse protein-coding transcripts, including 5154 encoding previously unidentified proteins. Genomic mapping of the transcriptome reveals transcriptional forests, with overlapping transcription on both strands, separated by deserts in which few transcripts are observed. The data provide a comprehensive platform for the comparative analysis of mammalian transcriptional regulation in differentiation and development.

Solution structure of Archaeglobus fulgidis Peptidyl-tRNA hydrolase (Pth2) provides evidence for an extensive conserved family of Pth2 enzymes in archea, bacteria and eukaryotes

R Powers, N Mirkovic,  D Murray, S Goldsmith-Fischman, TB Acton, Y Chiang, R Paranji, JR Cort, JY Huang, MA Kennedy, J Liu, L Ma, B Rost & GT Montelione

Quote: 2005 Protein Science 14, 2849-2861

The solution structure of protein AF2095 from the thermophilic archaea Archaeglobus fulgidis, a 123-residue (13.6-kDa) protein, has been determined by NMR methods. The structure of AF2095 is comprised of four alpha-helices and a mixed beta-sheet consisting of four parallel and anti-parallel beta-strands, where the alpha-helices sandwich the beta-sheet. Sequence and structural comparison of AF2095 with proteins from Homo sapiens, Methanocaldococcus jannaschii, and Sulfolobus solfataricus reveals that AF2095 is a peptidyl-tRNA hydrolase (Pth2). This structural comparison also identifies putative catalytic residues and a tRNA interaction region for AF2095. The structure of AF2095 is also similar to the structure of protein TA0108 from archaea Thermoplasma acidophilum, which is deposited in the Protein Data Bank but not functionally annotated. The NMR structure of AF2095 has been further leveraged to obtain good-quality structural models for 55 other proteins. Although earlier studies have proposed that the Pth2 protein family is restricted to archeal and eukaryotic organisms, the similarity of the AF2095 structure to human Pth2, the conservation of key active-site residues, and the good quality of the resulting homology models demonstrate a large family of homologous Pth2 proteins that are conserved in eukaryotic, archaeal, and bacterial organisms, providing novel insights in the evolution of the Pth and Pth2 enzyme families.

Comparisons of NMR spectral quality and success in crystallization demonstrate that NMR and X-ray crystallography are complementary methods for small protein structure determination

DA Snyder, Y Chen, NG Denissova, T Acton, JM Aramini, M Ciano, R Karlin, J Liu, P Manor, PA Rajan, P Rossi, GV Swapna, R Xiao, B Rost, J Hunt & GT Montelione

Quote: 2005 J Am Chem Soc 127, 16505-16511

X-ray crystallography and NMR spectroscopy provide the only sources of experimental data from which protein structures can be analyzed at high or even atomic resolution. The degree to which these methods complement each other as sources of structural knowledge is a matter of debate; it is often proposed that small proteins yielding high quality, readily analyzed NMR spectra are a subset of those that readily yield strongly diffracting crystals. We have examined the correlation between NMR spectral quality and success in structure determination by X-ray crystallography for 159 prokaryotic and eukaryotic proteins, prescreened to avoid proteins providing polydisperse and/or aggregated samples. This study demonstrates that, across this protein sample set, the quality of a protein's [15N-1H]-heteronuclear correlation (HSQC) spectrum recorded under conditions generally suitable for 3D structure determination by NMR, a key predictor of the ability to determine a structure by NMR, is not correlated with successful crystallization and structure determination by X-ray crystallography. These results, together with similar results of an independent study presented in the accompanying paper (Yee, et al., J. Am. Chem. Soc., accompanying paper), demonstrate that X-ray crystallography and NMR often provide complementary sources of structural data and that both methods are required in order to optimize success for as many targets as possible in large-scale structural proteomics efforts.

Critical assessment of methods of protein structure prediction (CASP) - round 6

J Moult, K Fidelis, T Hubbard, B Rost & A Tramontano

Quote: 2005 Proteins 61, 3-7

This article is an introduction to the special issue of the journal Proteins, dedicated to the sixth CASP experiment to assess the state of the art in protein structure prediction. The article describes the conduct of the experiment and the categories of prediction included, and outlines the evaluation and assessment procedures. A brief summary of progress over the decade of CASP experiments is also provided.

CASP6 assessment of contact prediction

O Grana, D Baker, RM MacCallum, J Meiler, M Punta, B Rost, ML Tress & A Valencia

Quote: 2005 Proteins 61, 214-224

Here we present the evaluation results of the Critical Assessment of Protein Structure Prediction (CASP6) contact prediction category. Contact prediction was assessed with standard measures well known in the field and the performance of specialist groups was evaluated alongside groups that submitted models with 3D coordinates. The evaluation was mainly focused on long range contact predictions for the set of new fold targets, although we analyzed predictions for all targets. Three groups with similar levels of accuracy and coverage performed a little better than the others. Comparisons of the predictions of the three best methods with those of CASP5/CAFASP3 suggested some improvement, although there were not enough targets in the comparisons to make this statistically significant.

 


 

 


 

2006 peer-reviewed

Epitome: Database of structure-inferred antigenic epitopes

Avner Schlessinger, Yanay Ofran, Guy Yachdav & Burkhard Rost

Quote: 2006 Nucleic Acids Research 34, D777-780

Immunoglobulin molecules specifically recognize particular areas on the surface of proteins. These areas are commonly dubbed B-cell epitopes. The identification of epitopes in proteins is important for the design of both experiments and vaccines. Additionally, the interactions between epitopes and antibodies have often served as a model for protein-protein interactions. One of the main obstacles in creating a database of antigen-antibody interactions is the difficulty in distinguishing between antigenic and non-antigenic interactions. Antigenic interactions involve specific recognition sites on the antibody's surface, while non-antigenic interactions are between a protein and any other site on the antibody. To solve this problem, we performed a comparative analysis of all protein-antibody complexes for which structures have been experimentally determined. Additionally, we developed a semi-automated tool that identified the antigenic interactions within the known antigen-antibody complex structures. We compiled those interactions into Epitome, a database of structure-inferred antigenic residues in proteins. Epitome consists of all known antigen/antibody complex structures, a detailed description of the residues that are involved in the interactions, and their sequence/structure environments. Interactions can be visualized using an interface to Jmol. The database is available at http://www.rostlab.org/services/epitome/.

Distinguish protein-coding from non-coding RNA using support vector machines

J Liu, J Gough & B Rost

Quote: 2006 PLoS Genetics 2, e29; DOI: 10.1371/journal.pgen.0020029

RIKEN's FANTOM project has revealed many previously unknown coding sequences, as well as an unexpected degree of variation in transcripts resulting from alternative promoter usage and splicing. Ever more transcripts that do not code for proteins have been identified by transcriptome studies, in general. Increasing evidence points to the important cellular roles of such non-coding RNAs (ncRNAs). The distinction of protein-coding RNA transcripts from ncRNA transcripts is therefore an important problem in understanding the transcriptome and carrying out its annotation. Very few in silico methods have specifically addressed this problem. Here, we introduce CONC (for "coding or non-coding"), a novel method based on support vector machines that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, predicted secondary structure content, predicted percentage of exposed residues, compositional entropy, number of homologs from database searches, and alignment entropy. Nucleotide frequencies are also incorporated into the method. Confirmed coding cDNAs for eukaryotic proteins from the Swiss-Prot database constituted the set of true positives, ncRNAs from RNAdb and NONCODE the true negatives. Ten-fold cross-validation suggested that CONC distinguished coding RNAs from ncRNAs at about 97% specificity and 98% sensitivity. Applied to 102,801 mouse cDNAs from the FANTOM3 dataset, our method reliably identified over 14,000 ncRNAs and estimated the total number of ncRNAs to be about 28,000.
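A few of the sequence-derived features named in the abstract above (peptide length, amino acid composition, compositional entropy) can be illustrated with a minimal sketch. This is not the CONC implementation: the function names are hypothetical, and treating "compositional entropy" as the Shannon entropy of the amino acid composition is an assumption made only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(peptide):
    """Fraction of each standard amino acid in the peptide."""
    counts = Counter(peptide)
    n = len(peptide)
    return {aa: counts.get(aa, 0) / n for aa in AMINO_ACIDS}

def compositional_entropy(peptide):
    """Shannon entropy (bits) of the composition; an assumed definition."""
    return -sum(p * math.log2(p) for p in composition(peptide).values() if p > 0)

def sequence_features(peptide):
    """Feature vector: length, 20 composition values, entropy (hypothetical layout)."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    comp = composition(peptide)
    return [len(peptide)] + [comp[aa] for aa in AMINO_ACIDS] + [compositional_entropy(peptide)]

features = sequence_features("MKVLAAGIVK")
print(len(features), features[0])
```

A vector of this kind could be passed to any support vector machine library; the remaining features listed in the abstract (predicted secondary structure, homolog counts, alignment entropy) need external prediction and search tools and are not sketched here.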

PROFbval: predict flexible and rigid residues in proteins

A Schlessinger, G Yachdav & B Rost

Quote: 2006 Bioinformatics 22, 891-893

SUMMARY: The mobility of a residue on the protein surface is closely linked to its function. The identification of extremely rigid or flexible surface residues can therefore contribute information crucial for solving the complex problem of identifying functionally important residues in proteins. Mobility is commonly measured by B-value data from high-resolution three-dimensional X-ray structures. Few methods predict B-values from sequence. Here, we present PROFbval, the first web server to predict normalized B-values from amino acid sequence. The server handles amino acid sequences (or alignments) as input and outputs normalized B-value and two-state (flexible/rigid) predictions. The server also assigns a reliability index for each prediction. For example, PROFbval correctly identifies residues in active sites on the surface of enzymes as particularly rigid. AVAILABILITY: http://www.rostlab.org/services/profbval CONTACT: profbval@rostlab.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Model organisms pose problems for unraveling protein-protein interactions

S Mika & B Rost

Quote: 2006 PLoS Computational Biology 2, e79

Experimental high-throughput studies of protein-protein interactions are beginning to provide enough data for comprehensive computational studies. Today, about ten large data sets, each with thousands of interacting pairs, coarsely sample the interactions in fly, human, worm, and yeast. About 55,000 additional pairs of interacting proteins have been identified by more careful, detailed biochemical experiments. Most interactions are experimentally observed in prokaryotes and simple eukaryotes; very few interactions are observed in higher eukaryotes such as mammals. It is commonly assumed that pathways in mammals can be inferred through homology to model organisms, e.g. the experimental observation that two yeast proteins interact is transferred to infer that the two corresponding proteins in human also interact. Two pairs for which the interaction is conserved are often described as interologs. The goal of this investigation was a large-scale comprehensive analysis of such inferences, i.e. of the evolutionary conservation of interologs. Here, we introduced a novel score for measuring the overlap between protein-protein interaction data sets. This measure appeared to reflect the overall quality of the data and was the basis for the two surprising results from our large-scale analysis. Firstly, homology-based inferences of physical protein-protein interactions appeared far less successful than expected. In fact, such inferences were accurate only for extremely high levels of sequence similarity. Secondly, and most surprisingly, the identification of interacting partners through sequence similarity was significantly more reliable for protein pairs within the same organism than for pairs between species. Our analysis underlined that the discrepancies between different datasets are large, even when using the same type of experiment on the same organism. This reality considerably constrains the power of homology-based transfer of interactions. In particular, the experimental probing of interactions in distant model organisms has to be undertaken with some caution. More comprehensive images of protein-protein networks will require the combination of many high-throughput methods, including in silico inferences and predictions. http://www.rostlab.org/results/2006/ppi_homology/

PROFtmb: A web server for predicting bacterial transmembrane beta barrel proteins

Henry Bigelow & Burkhard Rost

Quote: 2006 Nucleic Acids Research 34, W186-188

PROFtmb predicts transmembrane beta-barrel (TMB) proteins in Gram-negative bacteria. For each query protein, PROFtmb provides both a Z-value indicating that the protein actually contains a membrane barrel, and a four-state per-residue labeling of upward- and downward-facing strands, periplasmic hairpins and extracellular loops. While most users submit individual proteins known to contain TMBs, some groups submit entire proteomes to screen for potential TMBs. Response time is about 4 min for a 500-residue protein. PROFtmb is a profile-based Hidden Markov Model (HMM) with an architecture mirroring the structure of TMBs. The per-residue accuracy on the 8-fold cross-validated testing set is 86%, while whole-protein discrimination accuracy is 70% at 60% coverage. The PROFtmb web server includes all source code, training data and whole-proteome predictions from 78 Gram-negative bacterial genomes and is available freely and without registration at http://rostlab.org/services/proftmb.

Create and assess protein networks through molecular characteristics of individual proteins

Y Ofran, G Yachdav, E Mozes, T Soong & B Rost

Quote: 2006 Bioinformatics 22, e402-407

MOTIVATION: The study of biological systems, pathways and processes relies increasingly on analyses of networks. Most often, such analyses focus on network topology, thereby treating all proteins or genes as identical, featureless nodes. Integrating molecular data and insights about the qualities of individual proteins into the analysis may enhance our ability to decipher biological pathways and processes.

RESULTS: Here, we introduce a novel platform for data integration that generates networks on the macro system-level, analyzes the molecular characteristics of each protein on the micro level, and then combines the two levels by using the molecular characteristics to assess networks. It also annotates the function and subcellular localization of each protein and displays the process on an image of a cell, rendering each protein in its respective cellular compartment. By thus visualizing the network in a cellular context we are able to analyze pathways and processes in a novel way. As an example, we use the system to analyze proteins implicated in Alzheimer's disease and show how the integrated view corroborates previous observations and how it helps in the formulation of new hypotheses regarding the molecular underpinnings of the disease.

AVAILABILITY: http://www.rostlab.org/services/pinat.

 

2006 collaborations

Identifying cysteines and histidines in transition metal binding sites by a two-stage support vector machines - neural networks approach

A Passerini, M Punta, A Ceroni, B Rost & P Frasconi

Quote: 2006 Proteins: Structure, Function, and Bioinformatics 65, 305-316

Accurate predictions of metal-binding sites in proteins by using sequence as the only source of information can significantly help in the prediction of protein structure and function, genome annotation, and in the experimental determination of protein structure. Here, we introduce a method for identifying histidines and cysteines that participate in binding of several transition metals and iron complexes. The method predicts histidines as being in either of two states (free or metal bound) and cysteines in either of three states (free, metal bound, or in disulfide bridges). The method uses only sequence information by utilizing position-specific evolutionary profiles as well as more global descriptors such as protein length and amino acid composition. Our solution is based on a two-stage machine-learning approach. The first stage consists of a support vector machine trained to locally classify the binding state of single histidines and cysteines. The second stage consists of a bidirectional recurrent neural network trained to refine local predictions by taking into account dependencies among residues within the same protein. A simple finite state automaton is employed as a postprocessing step in the second stage in order to enforce an even number of disulfide-bonded cysteines. We predict histidines and cysteines in transition-metal-binding sites at 73% precision and 61% recall. We observe significant differences in performance depending on the ligand (histidine or cysteine) and on the metal bound. We also predict cysteines participating in disulfide bridges at 86% precision and 87% recall. Results are compared to those that would be obtained by using expert information as represented by PROSITE motifs and, for disulfide bonds, to state-of-the-art methods.
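The precision and recall figures quoted in the abstract above follow the standard per-residue definitions, which the sketch below illustrates. The function name and the example residue positions are hypothetical and are not taken from the paper.

```python
def precision_recall(predicted, observed):
    """Per-residue precision and recall from two collections of residue positions.

    precision = correct predictions / all predictions
    recall    = correct predictions / all observed binding residues
    """
    predicted, observed = set(predicted), set(observed)
    true_positives = len(predicted & observed)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(observed) if observed else 0.0
    return precision, recall

# Hypothetical example: positions predicted vs. observed as metal bound.
precision, recall = precision_recall({3, 7, 12, 20}, {3, 7, 15})
print(precision, recall)
```

In this invented example two of the four predictions are correct (precision 0.5) and two of the three observed residues are found (recall about 0.67).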

Outcome of a workshop on archiving structural models of biological macromolecules

HM Berman, SK Burley, W Chiu, A Sali, A Adzhubei, PE Bourne, SH Bryant, RL Dunbrack, Jr., K Fidelis, J Frank, A Godzik, K Henrick, A Joachimiak, B Heymann, D Jones, JL Markley, J Moult, GT Montelione, C Orengo, MG Rossmann, B Rost, H Saibil, T Schwede, DM Standley and JD Westbrook

Quote: 2006 Structure 14, 1211-1217

 

 


 

 

 


 

2007 peer-reviewed

ISIS: Interaction Sites Identified from Sequence

Yanay Ofran & Burkhard Rost

Quote: 2007 Bioinformatics 23, e13-16

MOTIVATION: Large-scale experiments reveal pairs of interacting proteins but leave the residues involved in the interactions unknown. These interface residues are essential for understanding the mechanism of interaction and are often desired drug targets. Reliable identification of residues that reside in protein-protein interfaces typically requires analysis of protein structure. Therefore, for the vast majority of proteins, for which there is no high-resolution structure, there is no effective way of identifying interface residues.

RESULTS: Here we present a machine learning-based method that identifies interacting residues from sequence alone. Although the method is developed using transient protein-protein interfaces from complexes of experimentally known 3D structures, it never explicitly uses 3D information. Instead, we combine predicted structural features with evolutionary information. The strongest predictions of the method reached over 90% accuracy in a cross-validation experiment. Our results suggest that despite the significant diversity in the nature of protein-protein interactions, they all share common basic principles and that these principles are identifiable from sequence alone.

Consensus sequences improve PSI-BLAST searches

Dariusz Przybylski & Burkhard Rost

Quote: 2007 Nucleic Acids Research 35, 2238-2246

Sequence alignments may be the most fundamental computational resource for molecular biology. The best methods that identify sequence relatedness through profile-profile comparisons are much slower and more complex than sequence-sequence and sequence-profile comparisons such as, respectively, BLAST and PSI-BLAST. Families of related genes and gene products (proteins) can be represented by consensus sequences that list the nucleic/amino acid most frequent at each sequence position in that family. Here, we propose a novel approach for consensus-sequence-based comparisons. This approach improved searches and alignments as a standard add-on to PSI-BLAST without any changes of code. Improvements were particularly significant for more difficult tasks such as the identification of distant structural relations between proteins and their corresponding alignments. Despite the fact that the improvements were higher for more divergent relations, they were consistent even at high accuracy/low error rates for non-trivially related proteins. The improvements were very easy to achieve; no parameter used by PSI-BLAST was altered and no single line of code changed. Furthermore, the consensus sequence add-on required relatively little additional CPU time. We discuss how advanced users of PSI-BLAST can immediately benefit from using consensus sequences on their local computers. We have also made the method available through the Internet (http://www.rostlab.org/services/consensus/).
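The notion of a consensus sequence used in the abstract above (the most frequent residue at each position of a family alignment) can be made concrete with a minimal sketch. The function name, the gap handling and the tie-breaking rule are illustrative assumptions, not the authors' implementation, and the toy alignment is invented.

```python
from collections import Counter

def consensus_sequence(aligned, gap="-"):
    """Most frequent non-gap residue in each column of an alignment."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned):
        counts = Counter(residue for residue in column if residue != gap)
        if not counts:
            continue  # column contains only gaps: skip it
        # Ties are broken alphabetically (an arbitrary, illustrative choice).
        consensus.append(min(counts, key=lambda r: (-counts[r], r)))
    return "".join(consensus)

family = ["MKV-LA", "MKVQLA", "MRV-LG", "MKI-LA"]  # toy alignment
print(consensus_sequence(family))  # prints MKVQLA
```

Note that in this sketch a column filled mostly with gaps still contributes its most common residue; a production version would likely apply an occupancy threshold before accepting such a column.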

Novel leverage of structural genomics

Jinfeng Liu, Gaetano T Montelione & Burkhard Rost

Quote: 2007 Nature Biotechnology in press

Genome sequencing has transformed biomedical research, but it is only the first step toward understanding biological systems. Subsequent steps include the determination of three-dimensional (3D) structures and functions for all proteins, and the study of networks. One long-term goal of Structural Genomics (SG) is to make 3D atomic-level structures easily obtainable for most proteins from their corresponding DNA sequences. In particular, the Protein Structure Initiative (PSI) from the National Institutes of Health (NIH) in the USA is expanding the impact of the Human Genome Project [1] by large-scale structure determination [2]. As part of the five-year pilot phase (PSI1), more than 1,200 protein structures were deposited into the PDB [3]. The second phase of PSI (PSI2), initiated July 1, 2005, supports four Large-scale Research Centers, along with six additional Specialized Research Centers responsible for development of new technologies for structural biology and structural genomics. Structural genomics projects have also been established in Europe [4], Japan [5] and Canada [2].

Protein-protein interaction hot spots carved into sequences

Yanay Ofran & Burkhard Rost

Quote: 2007a PLoS Comput Biol in press

Protein-protein interactions, a key to almost any biological process, are mediated by molecular mechanisms that are not entirely clear. The study of these mechanisms often focuses on all residues at protein-protein interfaces. However, only a small subset of all interface residues is actually essential for recognition or binding. Commonly referred to as "hot spots", these essential residues are defined as residues that impede protein-protein interactions if mutated. While no in silico tool identifies hot spots in unbound chains, numerous prediction methods were designed to identify all the residues in a protein that are likely to be a part of protein-protein interfaces. These methods typically identify only a small fraction of all interface residues. Here, we analyzed the hypothesis that the two subsets are correlated, i.e. that in silico methods predict few residues because they preferentially predict hot spots. We demonstrated that this is indeed the case and that we can therefore predict directly from the sequence of a single protein which residues are interaction hot spots (without knowledge of the interaction partner). Our results suggested that most protein complexes are stabilized by similar basic principles. The ability to accurately and efficiently identify hot spots from sequence enables the annotation and analysis of protein-protein interaction hot spots in entire organisms and thus may benefit function prediction and drug development.

SNAP: predict effect of non-synonymous polymorphisms on function

Yana Bromberg & Burkhard Rost

Quote: 2007 Nucleic Acids Research in press

Many genetic variations are single nucleotide polymorphisms (SNPs). Non-synonymous SNPs are 'neutral' if the resulting point-mutated protein is not functionally discernible from the wild type and 'non-neutral' otherwise. The ability to identify non-neutral substitutions could significantly aid targeting disease-causing detrimental mutations, as well as SNPs that increase the fitness of particular phenotypes. Here, we introduced comprehensive data sets to assess the performance of methods that predict SNP effects. Alongside, we introduced SNAP (screening for non-acceptable polymorphisms), a neural network-based method for the prediction of the functional effects of non-synonymous SNPs. SNAP needs only sequence information as input, but benefits from functional and structural annotations, if available. In a cross-validation test on over 80 000 mutants, SNAP identified 80% of the non-neutral substitutions at 77% accuracy and 76% of the neutral substitutions at 80% accuracy. This constituted an important improvement over other methods; the improvement rose to over ten percentage points for mutants for which existing methods disagreed. Possibly even more importantly, SNAP introduced a well-calibrated measure for the reliability of each prediction. This measure will allow users to focus on the most accurate predictions and/or the most severe effects. Available at http://www.rostlab.org/services/SNAP.

Prediction of DNA binding residues from sequence

Yanay Ofran & Burkhard Rost

Quote: 2007 Bioinformatics in press

Motivation: Thousands of proteins are known to bind to DNA; for most of them the mechanism of action and the residues that bind to DNA, i.e. the binding sites, are still unknown. Experimental identification of binding sites requires expensive and laborious methods such as mutagenesis and binding assays. Hence, such studies are not applicable on a large scale. If the three-dimensional structure of a protein is known it is often possible to predict DNA-binding sites in silico. However, for most proteins such knowledge is not available.

Results: It has been shown that DNA-binding residues have distinct biophysical characteristics. Here we demonstrate that these characteristics are so distinct that they enable accurate prediction of the residues that bind DNA directly from amino acid sequence without requiring any additional experimental or structural information. In a cross-validation based on the largest non-redundant dataset of high-resolution complexes available today, we found that 89% of our predictions are confirmed by experimental data. Thus, it is now possible to identify DNA binding sites on a proteomic scale even in the absence of any experimental data or 3D structural information.

Availability: http://cubic.bioc.columbia.edu/services/disis

Natively unstructured loops differ from other loops

Avner Schlessinger, Jinfeng Liu & Burkhard Rost

Quote: 2007b PLoS Comput Biol in press

Natively unstructured or disordered protein regions may increase the functional complexity of an organism; they are particularly abundant in eukaryotes and often evade structure determination. Many computational methods predict unstructured regions by training on outliers in otherwise well-ordered structures. Here, we introduce an approach that uses a neural network in a very different and novel way. We hypothesize that very long contiguous segments with non-regular secondary structure (NORS regions) differ significantly from regular, well-structured loops and that a method detecting such features could predict natively unstructured regions. Training our new method, NORSnet, on predicted information rather than on experimental data yielded three major advantages: it removed the overlap between testing and training, it systematically covered entire proteomes, and it explicitly focused on one particular aspect of unstructured regions with a simple structural interpretation, namely that they are loops. Our hypothesis was correct: well-structured and unstructured loops differ so substantially that NORSnet succeeded in their distinction. Benchmarks on previously used and new experimental data of unstructured regions revealed that NORSnet performed very well. Although it was not the best single prediction method, NORSnet was sufficiently accurate to flag unstructured regions in proteins that were previously not annotated. In one application, NORSnet revealed previously undetected unstructured regions in putative targets for structural genomics and may thereby contribute to increasing structural coverage of large eukaryotic families. NORSnet found unstructured regions more often in domain boundaries than expected at random. In another application, we estimated that 50-70% of all worm proteins observed to have over 7 protein-protein interaction partners have unstructured regions. The comparative analysis between NORSnet and DISOPRED2 suggested that long unstructured loops are a major part of unstructured regions in molecular networks.
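The idea of "very long contiguous segments with non-regular secondary structure" mentioned in the abstract above can be illustrated by scanning a per-residue secondary structure string for long uninterrupted loop runs. This is only a sketch: it is not NORSnet, the function name is hypothetical, and the length threshold and the H/E/L alphabet are assumptions chosen for illustration rather than the published criteria.

```python
import re

def long_loop_segments(sec_struct, min_length=70, loop_symbol="L"):
    """Return (start, end) pairs, end exclusive, of loop runs >= min_length.

    sec_struct is a per-residue string such as "HHHLLLEEE" (H = helix,
    E = strand, L = loop); min_length is an illustrative threshold.
    """
    pattern = re.compile(re.escape(loop_symbol) + "{%d,}" % min_length)
    return [(m.start(), m.end()) for m in pattern.finditer(sec_struct)]

# Invented prediction: a short helix, an 80-residue loop, a strand, a short loop.
prediction = "HHHH" + "L" * 80 + "EEEE" + "L" * 10
print(long_loop_segments(prediction))  # prints [(4, 84)]
```

Only the 80-residue run is reported; the 10-residue loop falls below the assumed threshold and is treated as an ordinary loop.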

Natively unstructured regions in proteins identified from contact predictions

Avner Schlessinger, Marco Punta & Burkhard Rost

Quote:

Motivation: Natively unstructured (also dubbed intrinsically disordered) regions in proteins often adopt regular structures under particular conditions. Proteins with such regions are overly abundant in eukaryotes; they may increase the functional complexity of an organism, and they usually evade structure determination in the unbound form. Low propensity for the formation of internal residue contacts has been previously used to predict natively unstructured regions.

Results: We combined PROFcon predictions for protein-specific contacts with a generic pairwise potential to predict unstructured regions. This novel method, Ucon, outperformed the best available methods in predicting proteins with long unstructured regions. Furthermore, Ucon correctly identified cases missed by other methods. By computing the difference between predictions based on specific contacts (approach introduced here) and those based on generic potentials (realized in other methods), we might identify unstructured regions that are involved in protein-protein binding. We discussed one example to illustrate this ambitious aim. Overall, Ucon added quality and an orthogonal aspect that may help in the experimental study of unstructured regions in network hubs.

Availability: http://www.predictprotein.org/submit_ucon.html

 

 

 

 

 

2007 non-peer

Membrane protein prediction methods

M Punta, LR Forrest, H Bigelow, A Kernytsky, J Liu & B Rost

Quote: 2007 Methods 41, 460-474

We survey computational approaches that tackle membrane protein structure and function prediction. While describing the main ideas that have led to the development of the most relevant and novel methods, we also discuss pitfalls, provide practical hints and highlight the challenges that remain. The methods covered include: sequence alignment, motif search, functional residue identification, transmembrane segment and protein topology predictions, homology and ab initio modeling. In general, predictions of functional and structural features of membrane proteins are improving, although progress is hampered by the limited amount of high-resolution experimental information available. While predictions of transmembrane segments and protein topology rank among the most accurate methods in computational biology, more attention and effort will be required in the future to ameliorate database search, homology and ab initio modeling.

Predicting protein subcellular localization using intelligent systems

Dariusz Przybylski & Burkhard Rost

Quote: