
Title: LOC3D: annotate sub-cellular localization for protein structures
Author:Rajesh Nair & Burkhard Rost
Quote: Nucl Acids Res, 2003, 31, 3337-3340

LOC3D: annotate sub-cellular localization for protein structures

Rajesh Nair 1, 4, * & Burkhard Rost 1, 2, 3, *

1 CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
4 Dept. of Physics, Columbia Univ., 538 West 120th Street, New York, NY 10027, USA
* Corresponding author:  email = nair@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/  Tel: +1-212-305-4018, fax: +1-212-305-7932

This article is published in (Nucleic Acids Research, issue, date and pages) © copyright Oxford University Press (2003). OUP is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.

Abstract

LOC3D ( http://cubic.bioc.columbia.edu/db/LOC3D/ ) is both a weekly-updated database and a web server for predictions of sub-cellular localization for eukaryotic proteins of known 3D structure. Localization is predicted using four different methods: (1) PredictNLS: prediction of nuclear proteins through nuclear localization signals, (2) LOChom: inferring localization through sequence homology, (3) LOCkey: inferring localization through automatic text analysis of SWISS-PROT keywords, and (4) LOC3Dini: ab initio prediction through a system of neural networks and support vector machines. The final prediction is taken from the method that predicts localization with the highest confidence. The LOC3D database currently contains predictions for over 8700 eukaryotic protein chains taken from PDB. The web server can also be used to predict sub-cellular localization for proteins for which only a predicted structure is available from threading servers. This makes the resource of particular interest to structural genomics initiatives.

 

Key words: protein sub-cellular localization, protein structure, protein function, homology, structural genomics.

 

Abbreviations used

3D structure: three-dimensional co-ordinates of protein structure
CGI: Common Gateway Interface
EMBL: European Molecular Biology Laboratory
LOChom: homology based inference of sub-cellular localization [1]
LOCkey: localization prediction based on SWISS-PROT keywords [2]
LOC3Dini: neural network based localization prediction
PDB: protein data bank [3]
PEP: predictions for entire proteomes [4]
predictNLS: prediction of nuclear localization signals [5, 6]
SWISS-PROT: expert-curated database of proteins [7]


 

 

Introduction

Sub-cellular localization is one aspect of protein function. Proteins must be localized in the same sub-cellular compartment to co-operate towards a common biological function. Thus, the native sub-cellular localization of a protein is important for understanding gene/protein function. Aberrant sub-cellular localization of proteins has been observed in the cells of several diseases, such as cancer and Alzheimer's disease. Therefore, experimentally unravelling the native compartment of a protein constitutes one step on the long way to determining its function. The explosion of sequence information through large-scale sequencing projects has widened the gap between the number of sequences deposited in public databases and the experimental characterisation of the corresponding proteins [8, 9]. Experimental annotations of sub-cellular localization are often based on operational, biochemical definitions (e.g. cell fractions or targeting signals of various sorts) that can be error prone. In contrast, computational tools can provide fast and accurate localization predictions for any organism [10, 11]. Predicting sub-cellular localization has become one of the central problems in bioinformatics [12, 13].

Predicting sub-cellular localization of proteins. The Protein Data Bank (PDB) [3] contains proteins of known 3D structures. Sub-cellular localization is annotated for very few of the proteins deposited in PDB. The LOC3Ddb database is the first comprehensive database of predicted and inferred sub-cellular localization for proteins of known structure. Four different methods are applied (Method); the method with the strongest signal is chosen to annotate sub-cellular localization for all proteins in PDB. The LOC3Ddb database can be useful in complementing functional information for proteins from domain databases like SMART [14] , PFAM [15] and functional site resources like ELM [16] , ProtFun [17] and Prosite [18] .

 

 

Method

LOC3D takes four different routes to annotate sub-cellular localization ( Fig. 1 ).



Fig. 1: The LOC3D system. From the query PDB structure, the amino acid sequence, three-state secondary structure and solvent accessible surface residues of the protein are extracted. LOC3D uses four different methods to annotate sub-cellular localization: a) predictNLS: the amino acid sequence is scanned for nuclear localization signals. b) LOChom: the sequence is first aligned using PSI-BLAST to a localization annotated database of proteins. If any sequence homologues are discovered, sub-cellular localization annotation is transferred from the homologue. c) LOCkey: the SWISS-PROT database contains functional information for proteins in the form of keywords. LOCkey infers sub-cellular localization based on keyword entries. The above three programs are based solely on the amino acid sequence of the protein and do not use any structural information. d) LOC3Dini: sub-cellular localization is predicted by a system of neural networks trained on a number of global features like amino acid composition, secondary structure composition and surface residue composition. The final localization annotation in the LOC3D database is taken from the most reliable prediction amongst the four individual methods.


 





 

(1) PredictNLS: identification of nuclear localization signals. The most accurate way to predict nuclear localization is to identify the nuclear localization signal: Active transport of proteins into the nucleus takes place by binding to specific molecules such as importins and karyopherins that recognise distinct targeting signals [19] . This targeting signal typically contains a short segment of consecutive residues and is commonly referred to as the nuclear localization signal (NLS). PredictNLS [5, 6] uses a set of expert-curated experimentally known NLSs to predict nuclear localization. At 100% accuracy this tool identifies about half of all known nuclear proteins.
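The scanning step can be illustrated with a minimal Python sketch. The motif list below is purely illustrative and is not the curated PredictNLS set: "PKKKRKV" is the classical SV40 large T-antigen NLS, while the second pattern is a toy motif invented for this example. The function name find_nls is likewise an assumption.

```python
import re

# Illustrative motifs only; NOT the expert-curated PredictNLS library.
# "PKKKRKV" is the classical SV40 large T-antigen NLS; "basic_run" is a
# toy pattern (five consecutive K/R residues) made up for this sketch.
EXAMPLE_MOTIFS = {
    "sv40_like": re.compile(r"PKKKRKV"),
    "basic_run": re.compile(r"[KR]{5}"),
}

def find_nls(sequence, motifs=EXAMPLE_MOTIFS):
    """Return (motif_name, start_index, matched_segment) for every motif hit."""
    hits = []
    seq = sequence.upper()
    for name, pattern in motifs.items():
        for match in pattern.finditer(seq):
            hits.append((name, match.start(), match.group()))
    return hits

hits = find_nls("MDPKKKRKVEDLLA")
for name, start, segment in hits:
    print(name, start, segment)
```

A protein with at least one hit would be annotated as nuclear; a protein without hits is simply left to the other three methods.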

(2) LOCkey: digest experimental data from SWISS-PROT keywords. Our second most accurate tool infers localization from experimental descriptions of localization as contained in the controlled vocabulary of SWISS-PROT [7]. LOCkey infers sub-cellular localization through an automated lexical analysis of SWISS-PROT keywords [2]. In contrast to dictionary-based approaches, LOCkey is fully automated and the rule libraries used to infer localization from keywords are generated dynamically. The method is based on a novel implementation of an M-ary (multiple category) classifier [20, 21]. For example, for a protein with the SWISS-PROT keywords 'NADP, Acetylation, NAD, and Oxidoreductase' (keywords for the protein and its sequence homologues are merged to obtain the final keyword list), the LOCkey algorithm first generates all possible combinations of the keywords. For a protein with four keywords this gives 2^4-1 = 15 possible combinations (examples of keyword combinations are 'NADP, Acetylation'; 'Acetylation, NAD, Oxidoreductase'; 'Acetylation, NAD'; 'NADP' and so on). For each of these keyword combinations the algorithm compiles localization statistics for matches against a pre-compiled database of keyword associations for proteins of known sub-cellular localization. Finally, localization is assigned to the protein by minimizing an entropy-based objective function (in this example the localization assigned is Cytoplasm). The method is extremely accurate in inferring sub-cellular localization when any functional information in the form of keywords is known (over 82% accuracy using full cross-validation).
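The combinatorial step and the entropy criterion can be sketched as follows. This is a simplified illustration, not the published LOCkey implementation: the objective here is the plain Shannon entropy of the localization counts, ties are broken in favour of the combination supported by more proteins, and the toy database (TOY_DB) and all function names are assumptions made for this example. The actual objective function is described in [2].

```python
from collections import Counter
from itertools import combinations
from math import log2

def keyword_combinations(keywords):
    """All non-empty subsets of the keyword list (2**n - 1 of them)."""
    kws = sorted(set(keywords))
    return [frozenset(c)
            for r in range(1, len(kws) + 1)
            for c in combinations(kws, r)]

def entropy(counts):
    """Shannon entropy (bits) of a Counter of localization labels."""
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def infer_localization(keywords, database):
    """database: list of (keyword_set, localization) pairs (toy format).

    Returns (localization, keyword_combination) for the combination whose
    matching proteins have the lowest-entropy localization distribution,
    or None if no combination matches anything.
    """
    best = None
    for combo in keyword_combinations(keywords):
        counts = Counter(loc for kws, loc in database if combo <= kws)
        if not counts:
            continue
        # Lower entropy first; among equals, more supporting proteins first.
        key = (entropy(counts), -sum(counts.values()))
        if best is None or key < best[0]:
            best = (key, counts.most_common(1)[0][0], combo)
    return None if best is None else (best[1], best[2])

# Hypothetical toy database of keyword sets with known localization.
TOY_DB = [
    ({"NADP", "Oxidoreductase"}, "Cytoplasm"),
    ({"NAD", "Oxidoreductase", "Acetylation"}, "Cytoplasm"),
    ({"Oxidoreductase", "Transit peptide"}, "Mitochondria"),
]
query = ["NADP", "Acetylation", "NAD", "Oxidoreductase"]
print(len(keyword_combinations(query)))  # 15 combinations for four keywords
print(infer_localization(query, TOY_DB))
```

Because the number of combinations grows as 2^n - 1, an exhaustive enumeration like this is only practical for the short keyword lists typical of database entries.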

(3) LOChom: inference through sequence homology. The next most reliable means of getting at sub-cellular localization is through annotation transfer through homology: If a protein of experimentally known localization L is significantly similar in sequence to a query protein U, U and L have identical localization [22, 1]. We have carried out the most exhaustive study of the sequence conservation of sub-cellular localization to establish the thresholds for annotation transfer based on homology [1]. Sequence homologues were identified using pairwise BLAST [23] and PSI-BLAST [24]. To assign sub-cellular localization three measures of sequence similarity were investigated: 1) sequence identity, 2) BLAST e-values [23] and 3) distance from 'HSSP-threshold' [25, 26]. Of the three measures, distance from 'HSSP-threshold' was the most successful in annotating sub-cellular localization. One of the results of this investigation was a refined version of the distance from 'HSSP-threshold' formula [27, 26, 1] which significantly improves the coverage of proteins that can be annotated using homology. The use of position specific scoring matrix (PSSM) in PSI-BLAST improved the coverage of proteins that could be annotated using homology by more than 5% over simple pairwise BLAST. Further improvements in homology-based annotation were obtained through the use of separate "conservation thresholds" and "accuracy versus sequence similarity" curves for each of the localization classes.
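Annotation transfer by distance from a length-dependent threshold can be sketched as below. As a stand-in, the sketch uses the identity-based threshold curve of [26] in the form 480 * L^(-0.32 * (1 + exp(-L/1000))), capped at 100 for alignment lengths up to 11 and fixed at 19.5 above 450; this is an assumption for illustration. The refined formula of [1] and the per-class conservation thresholds are not reproduced here, and the cut-off min_distance as well as all function names are hypothetical.

```python
from math import exp

def hssp_curve(length):
    """Percentage identity on the threshold curve for a given alignment length.

    Assumed form of the identity-based curve from [26]; the refined version
    used by LOChom [1] is NOT reproduced here.
    """
    if length <= 11:
        return 100.0
    if length > 450:
        return 19.5
    return 480.0 * length ** (-0.32 * (1.0 + exp(-length / 1000.0)))

def hssp_distance(pct_identity, length):
    """Distance (in percentage points) of an alignment above the curve."""
    return pct_identity - hssp_curve(length)

def transfer_localization(hits, min_distance=0.0):
    """hits: iterable of (localization, pct_identity, alignment_length).

    Returns (localization, distance) of the homologue furthest above the
    curve, or None if no hit reaches min_distance (a hypothetical cut-off).
    """
    best = None
    for localization, pct_identity, length in hits:
        distance = hssp_distance(pct_identity, length)
        if distance >= min_distance and (best is None or distance > best[1]):
            best = (localization, distance)
    return best

# Toy hits: 60% identity over 100 residues lies well above the curve,
# 25% identity over 100 residues lies below it.
print(transfer_localization([("Nucleus", 60.0, 100), ("Cytoplasm", 25.0, 100)]))
```

Measuring similarity relative to a length-dependent curve, rather than by raw percentage identity, reflects that a given identity level is far more significant for long alignments than for short ones.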

(4) LOC3Dini: ab initio prediction from sequence and structure. LOC3Dini is a prediction system that predicts sub-cellular localization from sequence and structure using neural networks (submitted). Sub-cellular localization is predicted using a number of global features of protein sequence and structure. The LOC3Dini system consists of three layers and sorts proteins into one of four localization classes (extracellular, cytoplasmic, nuclear and mitochondrial). (1) The first layer consists of four dedicated neural networks that use particular features from protein sequences, alignments, and secondary structure to pre-sort proteins into L/not-L (with L = cytoplasmic, nuclear, extra-cellular, mitochondrial). The features used include amino acid composition, composition of surface accessible residues and composition of amino acid residues in each of three secondary structure states (helix, beta strand and loop). Evolutionary information was incorporated by replacing the amino acid composition by profile-based amino acid composition. (2) The second layer consists of neural networks combining output from networks trained on different input features. (3) The third layer uses a simple jury decision to assign one of four localization states to each protein. Major sources of improvement over publicly available methods originated from using: (i) secondary structure information, (ii) solvent accessibility, and (iii) evolutionary information from sequence profiles as input to the neural networks. The final four-state classification accuracy of the system was over 65%. This is more than ten percentage points higher than systems using only amino acid composition.
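The global input features and the final jury step can be illustrated with a short sketch. It only shows how composition vectors might be assembled and how several network outputs might be combined; the networks themselves are not modelled. The input encoding (one H/E/L character and one exposed/buried flag per residue), the averaging rule used for the jury and all names are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = ("extracellular", "cytoplasmic", "nuclear", "mitochondrial")

def composition(residues):
    """Fraction of each of the 20 standard amino acids (others are ignored)."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in residues.upper():
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]

def structure_features(sequence, sec_struct, exposed):
    """Concatenate five 20-dimensional composition vectors (100 features).

    sec_struct: string with one of H (helix), E (strand), L (loop) per residue.
    exposed:    list of booleans, True where the residue is surface accessible.
    """
    features = composition(sequence)
    features += composition("".join(r for r, e in zip(sequence, exposed) if e))
    for state in "HEL":
        features += composition(
            "".join(r for r, s in zip(sequence, sec_struct) if s == state))
    return features

def jury(network_outputs):
    """Average the per-class outputs of several networks; return the top class.

    Averaging is an assumed stand-in for the 'simple jury decision'.
    """
    averaged = {c: sum(out[c] for out in network_outputs) / len(network_outputs)
                for c in CLASSES}
    return max(averaged, key=averaged.get)

features = structure_features("ACDK", "HHEL", [True, False, True, False])
print(len(features))
```

In the profile-based variant described above, the residue counts would be replaced by column frequencies from a sequence profile; the rest of the feature assembly would stay the same.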

Final annotation of sub-cellular localization through best single method. The final annotation of localization that is generated by LOC3D is taken from the most reliable prediction amongst the four individual methods. Using this four-step approach significantly improves prediction accuracy since different methods are most accurate in different regimes. For example, if an NLS is detected by PredictNLS, the protein has a high probability of being nuclear (our NLS motifs are exclusive to nuclear proteins). If functional information in the form of SWISS-PROT keywords is available, LOCkey can use this information to infer sub-cellular localization at a very high accuracy. In the absence of sufficient functional information, identification of sequence homologues using LOChom proves most accurate. Ab initio predictions using LOC3Dini are the least accurate; however, they are applicable when all the other methods fail. In terms of coverage of PDB, the most successful methods were homology-based assignments using LOChom, accounting for 44% of the final assignments, and keyword-based assignments using LOCkey, accounting for 37% of the assignments ( Table 1 ). Although extremely accurate, nuclear localization signals could be inferred for only about 1% of the eukaryotic chains using PredictNLS.
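The selection of the final annotation amounts to taking the prediction with the highest confidence among the methods that returned a result. A minimal sketch, assuming each method reports a localization together with a confidence score on a common 0 to 100 scale; the dictionary layout, the example scores and the function name are hypothetical:

```python
def final_annotation(predictions):
    """predictions: dict mapping method name -> (localization, confidence).

    Methods that made no prediction are simply absent from the dict.
    Returns (method, localization, confidence) of the most confident
    prediction, or None if no method produced one.
    """
    if not predictions:
        return None
    method = max(predictions, key=lambda m: predictions[m][1])
    localization, confidence = predictions[method]
    return method, localization, confidence

# Toy example: no NLS was found, so PredictNLS is absent.
example = {
    "LOChom": ("Cytoplasm", 78),
    "LOCkey": ("Cytoplasm", 91),
    "LOC3Dini": ("Nucleus", 40),
}
print(final_annotation(example))
```

This only works if the confidence scores of the four methods are comparable, i.e. calibrated against the same notion of expected accuracy.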



Table 1: Annotations by LOC3D by method *.

METHOD        Number of proteins    Percentage of proteins
LOChom        3880                   44
LOCkey        3222                   37
LOC3Dini      1561                   18
PredictNLS     130                    1
SUM           8793                  100

* Number of final localization assignments made by each method.



 

Comprehensive annotation for 3D structures. LOC3D is a comprehensive source of information regarding sub-cellular localization for eukaryotic proteins of known structure. Currently, there are no available databases cataloguing sub-cellular localization information for eukaryotic chains in PDB. The database contains sub-cellular localization information for over 8700 PDB chains ( Table 2 ). Proteins secreted to the extra-cellular space form the largest class of proteins in PDB, followed by cytoplasmic and nuclear proteins.



Table 2: Annotations by LOC3D by type of localization *.

Sub-cellular localization    Number of proteins
Extra-cellular space         3786
Cytoplasm                    2328
Nucleus                      1066
Mitochondria                 1024
Chloroplast                   348
Peroxisome                     88
Lysosome                       85
Endoplasmic reticulum          50
Vacuoles                       14
Golgi apparatus                 4
SUM                          8793

* Number of PDB chains in the LOC3D database assigned to the given localization.



 

 

LOC3D Interface

Database description. The LOC3D database has been formatted in an EMBL-like flat-file format. The database can be accessed on the web through a PERL CGI interface [28, 29, 30]. The database can be used in either query-mode or browse-mode. (1) User query: Any object in the database can be queried using a PERL regular-expression-like syntax. The query can be a name or a wildcard pattern (the search engine automatically appends the '*' wildcard pattern at the end of the query). If the query field is left blank, the search displays all objects of the selected type. Three types of objects can be queried: PDB chain identifiers, types of sub-cellular localization and type of prediction method. For example, querying the 'sub-cellular localization class' object with "nuclear" displays all proteins in the database that are predicted to have nuclear localization. (2) Browsing the database: In this mode, database entries are displayed in order of decreasing confidence of prediction.
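The behaviour of the two modes can be mimicked in a few lines. This sketch is not the server's PERL code: it uses shell-style wildcards from the Python standard library as a stand-in for the regular-expression-like syntax, treats queries as case-insensitive (an assumption), and operates on made-up entries whose chain identifiers and field names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical toy entries; identifiers and field names are illustrative.
ENTRIES = [
    {"chain": "1abc_A", "loci": "nuclear", "method": "PredictNLS", "confidence": 100},
    {"chain": "1abd_B", "loci": "cytoplasmic", "method": "LOChom", "confidence": 82},
    {"chain": "2xyz_A", "loci": "extra-cellular", "method": "LOC3Dini", "confidence": 55},
]

def query(entries, field, pattern=""):
    """Query mode: match `pattern` against one field of every entry.

    A '*' wildcard is appended automatically, so a blank pattern matches
    every entry and "1ab" matches all values starting with "1ab".
    """
    glob = pattern.strip().lower() + "*"
    return [e for e in entries if fnmatchcase(str(e[field]).lower(), glob)]

def browse(entries):
    """Browse mode: entries in order of decreasing prediction confidence."""
    return sorted(entries, key=lambda e: e["confidence"], reverse=True)

print([e["chain"] for e in query(ENTRIES, "loci", "nuclear")])
print([e["chain"] for e in query(ENTRIES, "chain", "1ab")])
print([e["chain"] for e in browse(ENTRIES)])
```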

Web server description. The LOC3D web server has been implemented using a PERL CGI interface. Protein structures in PDB format [3] can be submitted to the server. Sub-cellular localization is predicted using the method described above. Prediction results are returned via email.

Format and fields. Each protein can have up to four localization predictions associated with it, one from each method. The database uses four fields to represent predictions from each method: (1) Method: the type of prediction method used. (2) Loci: predicted sub-cellular localization from this method. The predicted sub-cellular localization can be one of nine classes ( Table 2 ). (3) Confidence: confidence score assigned by the prediction method. This is a number between 0 and 100. Larger confidence scores mark more accurate predictions. (4) Details: any reasons, if available, for the particular localization class inferred by the method. For example, for a LOCkey prediction, this field would give details of the keywords responsible for this localization prediction.
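The four fields map naturally onto a small record type. The sketch below is one possible in-memory representation, not the EMBL-like flat-file layout itself (whose line tags are not reproduced here); the class names, the chain identifier and the validation rules beyond the 0 to 100 confidence range are assumptions.

```python
from dataclasses import dataclass, field

METHODS = ("PredictNLS", "LOChom", "LOCkey", "LOC3Dini")

@dataclass(frozen=True)
class Prediction:
    method: str        # which of the four methods made the prediction
    loci: str          # predicted sub-cellular localization
    confidence: int    # score between 0 and 100
    details: str = ""  # e.g. the keywords behind a LOCkey prediction

    def __post_init__(self):
        if self.method not in METHODS:
            raise ValueError(f"unknown method: {self.method}")
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must lie between 0 and 100")

@dataclass
class Entry:
    chain: str  # PDB chain identifier (hypothetical format)
    predictions: list = field(default_factory=list)

    def add(self, prediction):
        """Store at most one prediction per method."""
        if any(p.method == prediction.method for p in self.predictions):
            raise ValueError(f"{prediction.method} already present")
        self.predictions.append(prediction)

entry = Entry("1abc_A")
entry.add(Prediction("LOCkey", "Cytoplasm", 91,
                     details="keywords: NADP, Oxidoreductase"))
entry.add(Prediction("LOC3Dini", "Cytoplasm", 48))
print(len(entry.predictions))
```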

 

Conclusions

LOC3D should be a useful resource for functional studies of proteins. In particular, large-scale efforts in structural genomics may profit from the tool. We plan to implement the system to predict sub-cellular localization based on predicted secondary structure. Another future goal is to also include all proteins in the SWISS-PROT database and the fully sequenced eukaryotic genomes. We also plan to incorporate the database into comprehensive proteome databases like the PEP [4] database.

LOC3D should be cited with the present publication as reference. The database can be accessed through the World Wide Web at: http://cubic.bioc.columbia.edu/db/LOC3D/.

 

 

Acknowledgements

Thanks to Jinfeng Liu and Megan Restuccia (Columbia) for computer assistance and Kaz (Columbia) for valuable discussions. The work of RN and BR was supported by the grant DBI-0131168 from the National Science Foundation (NSF). Last, not least, thanks to Amos Bairoch (SIB, Geneva), Rolf Apweiler (EBI, Hinxton), Phil Bourne (San Diego Univ.), and their crews for maintaining excellent databases and to all experimentalists who enabled this tool by making their data publicly available.


References

1. Nair, R. & Rost, B. (2002). Sequence conserved for subcellular localization. Protein Sci, 11, 2836-47.
2. Nair, R. & Rost, B. (2002). Inferring sub-cellular localization through automated lexical analysis. Bioinformatics, 18 Suppl 1, S78-S86.
3. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucleic Acids Research, 28, 235-42.
4. Carter, P., Liu, J. & Rost, B. (2003). PEP: Predictions of Entire Proteomes. NAR (submitted).
5. Cokol, M., Nair, R. & Rost, B. (2000). Finding nuclear localisation signals. EMBO Reports, 1, 411-415.
6. Nair, R., Carter, P. & Rost, B. (2003). NLSdb: database of nuclear localization signals. Nucleic Acids Res, 31, 397-9.
7. Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res, 28, 45-8.
8. Rost, B. & Sander, C. (1996). Bridging the protein sequence-structure gap by structure predictions. Annual Review of Biophysics and Biomolecular Structure, 25, 113-136.
9. Koonin, E. V. (2000). Bridging the gap between sequence and function. Trends Genet, 16, 16.
10. Eisenberg, D., Marcotte, E. M., Xenarios, I. & Yeates, T. O. (2000). Protein function in the post-genomic era. Nature, 405, 823-6.
11. Lewis, S., Ashburner, M. & Reese, M. G. (2000). Annotating eukaryote genomes. Curr Opin Struct Biol, 10, 349-54.
12. Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M. et al. (1998). Predicting function: from genes to genomes and back. J Mol Biol, 283, 707-25.
13. Nakai, K. (2000). Protein sorting signals and prediction of subcellular localization. Adv Protein Chem, 54, 277-344.
14. Ponting, C. P., Schultz, J., Milpetz, F. & Bork, P. (1999). SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Research, 27, 229-32.
15. Sonnhammer, E. L., Eddy, S. R. & Durbin, R. (1997). Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins: Structure, Function, and Genetics, 28, 405-420.
16. Puntervoll, P., Linding, R., Gemund, C., Chabanis-Davidson, S., Mattingsdal, M. et al. (2003). The ELM server: A new resource for revealing short functional sites in modular eukaryotic proteins. Nucleic Acids Res.
17. Jensen, L. J., Gupta, R., Blom, N., Devos, D., Tamames, J. et al. (2002). Prediction of human protein function from post-translational modifications and localization features. J Mol Biol, 319, 1257-65.
18. Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. (1999). The PROSITE database, its status in 1999. Nucleic Acids Research, 27, 215-219.
19. Tinland, B., Koukolikova-Nicola, Z., Hall, M. N. & Hohn, B. (1992). The T-DNA-linked VirD2 protein contains two distinct functional nuclear localization signals. Proc Natl Acad Sci U S A, 89, 7442-6.
20. Dasarathy, B. V. (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, California.
21. Yang, Y. & Liu, X. (1999). A re-examination of text categorisation methods. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 42-49.
22. Eisenhaber, F. & Bork, P. (1998). Wanted: subcellular localization of proteins based on sequence. Trends in Cell Biology, 8, 169-170.
23. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215, 403-10.
24. Altschul, S., Madden, T., Shaffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402.
25. Sander, C. & Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56-68.
26. Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Eng, 12, 85-94.
27. Sander, C. & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics, 9, 56-68.
28. Wall, L. & Schwartz, R. L. (1990). Programming perl. O'Reilly & Associates, Inc., Sebastopol, CA.
29. Stein, L. D. (2001). Using Perl to facilitate biological analysis. Methods Biochem Anal, 43, 413-49.
30. Stajich, J. E., Block, D., Boulez, K., Brenner, S. E., Chervitz, S. A. et al. (2002). The Bioperl toolkit: Perl modules for the life sciences. Genome Res, 12, 1611-8.

Contact:    rost@columbia.edu Version:    Mar 19, 2003