
Title: LOCnet and LOCtarget: Sub-cellular localization for structural genomics targets
Author: Rajesh Nair, & Burkhard Rost
Quote: Nucl Acids Res, 2004, 32:W517-W521

LOCnet and LOCtarget: Sub-cellular localization for structural genomics targets

Rajesh Nair 1,4 & Burkhard Rost ?

1 CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
4 Dept. of Physics, Columbia Univ., 538 West 120th Street, New York, NY 10027, USA
* Corresponding author: cubic@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/  Tel: +1-212-305-4018, fax: +1-212-305-7932

This article is published in (Nucleic Acids Research, issue, date and pages) copyright Oxford University Press (2004). OUP is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.

 



Abstract

LOCtarget is a web server and database that predicts and annotates sub-cellular localization for structural genomics targets; LOCnet is one of the methods used in LOCtarget that can predict sub-cellular localization for all eukaryotic and prokaryotic proteins. Targets are taken from the central registration database for structural genomics, namely TargetDB. LOCtarget predicts localization through a combination of four different methods: known nuclear localization signals (PredictNLS), homology-based transfer of experimental annotations (LOChom), inference through automatic text analysis of SWISS-PROT keywords (LOCkey), and de novo prediction through a system of neural networks (LOCnet). Additionally, we report predictions from SignalP. The final prediction is based on the method with the highest confidence. The web server can be used to predict sub-cellular localization of proteins from their amino acid sequence. The LOCtarget database currently contains localization predictions for all eukaryotic proteins from TargetDB and is updated every week. The server is available at: http://www.rostlab.org/services/LOCtarget/.

 

Key words: structural genomics, protein sub-cellular localization, protein structure, protein function, homology, signalling motifs. 

 

Abbreviations used

3D structure: three-dimensional protein structure
CGI: Common Gateway Interface
LOC3D: database with sub-cellular localization prediction for proteins of known structure [1]
LOChom: homology-transfer of experimental annotations for sub-cellular localization [2]
LOCkey: prediction of sub-cellular localization through automated text analysis [3]
LOCnet: neural network system predicting sub-cellular localization from sequence and structure [4]
NLS: nuclear localization signal
PDB: Protein Data Bank of experimentally determined 3D structures of proteins [5]
PredictNLS: detection of nuclear localization signals [6, 7]
SignalP: prediction of generic signal peptides for proteins of the secretory pathway [8]
SWISS-PROT: database of protein sequences [9, 10]
TargetDB: official registration database for structural genomics targets [11]


 

Overview

Structural genomics initiatives unravel protein structures. Over 15 structural genomics initiatives currently aim at determining a large number of protein structures in a high-throughput manner. These projects have already deposited almost 700 new protein three-dimensional (3D) structures into the PDB over the last four years [5]. The rate at which structures are experimentally determined for which no low-resolution models are available is over five times higher for structural genomics consortia than it is for the entire PDB [12]. One ultimate goal is to experimentally determine at least one representative 3D structure for all sequence-structure families. It is now clear that we need over ten times more structures to realise this concept than was originally anticipated [12, 13]. Nevertheless, it is currently believed that structural genomics consortia will be able to experimentally determine almost 10,000 new structures before 2010. If chosen optimally, these 10,000 would halve the number of residues in known proteins for which we do not have any structural annotations (J Liu & B Rost, unpublished). The PDB has created a centralised registration database for target sequences from structural genomics projects worldwide called TargetDB (http://TargetDB.pdb.org, [11]). TargetDB currently contains over 50,000 target sequences. The 3D structure for the majority (>98%) of these sequences is currently unknown and many lack any functional annotations. Functionally annotating structures from structural genomics is currently an important challenge for computational biology [14, 15, 16].

Sub-cellular localization: one key toward unravelling protein function. Proteins that co-operate toward a common biological function are often located in the same sub-cellular compartment. Aberrant sub-cellular localization of proteins has been observed in the cells of several diseases, such as cancer and Alzheimer's disease. Thus, the sub-cellular localization of a protein is an important coarse-grained aspect of its role. Publicly available predictors for sub-cellular localization include, for example: TargetP [17], a neural-network based method for predicting signal peptides, mitochondrial targeting peptides and chloroplast targeting peptides; NNPSL [18], a neural network based localization predictor using amino acid composition; SubLoc [19], a support vector based predictor using amino acid composition; and PSORT and PSORT II [20], which are based on identifying sequence motifs responsible for protein sorting, on sequence homology and on NNPSL. We have developed LOCtarget (http://www.rostlab.org/), a database and web server for the prediction of sub-cellular localization for all sequences in TargetDB. LOCtarget is a comprehensive system for localization prediction based on database annotations for sequence homologues (LOChom), functional information in the form of SWISS-PROT keywords (LOCkey), sequence motifs involved in targeting to the nucleus (PredictNLS), and a system of neural networks (LOCnet) for de novo prediction. LOCnet was found to be over 7% more accurate than the best publicly available system [4]. LOCtarget can also be used to predict sub-cellular localization for proteins in any other context. In particular, LOCnet and LOCtarget differ from LOC3D [1] in that they predict for proteins of unknown structure.
The LOCtarget database can be useful in complementing other predicted functional information regarding the target sequences in the SPAM database that provides annotations for TargetDB entries (http://span.sdsc.edu/perl/browser_beta.pl [21] ).

 

Methods and Results

LOCtarget combines the following four different paths to annotate and predict sub-cellular localization ( Fig. 1 ).

(i) PredictNLS: identification of nuclear localization signals. The most accurate way to predict nuclear localization is to identify the nuclear localization signal (NLS). Active transport of proteins into the nucleus is realised by specific molecules such as importins and karyopherins that bind to distinct targeting signals [22]. This targeting signal typically consists of a short segment of consecutive residues and is commonly referred to as the NLS. PredictNLS [6, 7] uses a set of expert-curated, experimentally known NLSs to predict nuclear localization. At 100% accuracy this tool identifies about half of all known nuclear proteins.

(ii) LOCkey: digest experimental data from SWISS-PROT keywords. Our second most accurate tool to infer localization uses experimental descriptions of protein function as contained in the controlled vocabulary of SWISS-PROT keywords [10]. First, we align the target sequence to sequences in SWISS-PROT using pairwise BLAST [23]. Second, we extract all SWISS-PROT keywords for all sequence homologues that meet specified thresholds in terms of sequence similarity and the content of these keywords. LOCkey [3] then infers sub-cellular localization through an automated lexical analysis of the extracted SWISS-PROT keywords. In contrast to dictionary-based approaches, LOCkey is fully automated and the rule libraries used to infer localization from keywords are generated dynamically. The method is extremely accurate when any functional information in the form of keywords is known (over 82% accuracy using full cross-validation).

(iii) LOChom: inference through sequence homology. The next most reliable means of finding out the sub-cellular localization is through homology-transfer: if a protein L of experimentally known localization is significantly sequence-similar to a query protein Q, then Q is assumed to have the same localization as L [24, 2]. We have carried out the most exhaustive study of the sequence conservation of sub-cellular localization to establish the thresholds for annotation transfer based on homology [2]. Sequence homologues were first identified using pairwise BLAST and PSI-BLAST. To assign sub-cellular localization, three measures of sequence similarity were investigated: pairwise sequence identity, BLAST/PSI-BLAST expectation values (EVAL), and distances from the Sander-Schneider curve that relates alignment length to sequence identity [25, 26] (referred to as HSSP-value or HVAL). Of the three measures, the HSSP-value was the most successful in annotating sub-cellular localization. One of the results of our original investigation was a problem-specific refinement for the HSSP-value [3]. The use of position-specific scoring matrices in PSI-BLAST also improved the reliability of the homology-transfer. Further improvements in homology-based annotation were obtained through the use of separate "conservation thresholds" and "accuracy versus sequence similarity" curves for each of the localization classes. Note that at the level at which we use LOChom, our decisions are, for all compartments, significantly more accurate than any de novo prediction, and even than predictions based on signal or target motifs.
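To make the HSSP-value concrete, the following is a minimal sketch of the distance from the generic length-dependent identity curve as given in [26]. It is an illustration only: it assumes that general-purpose curve, not the localization-specific refinement or the per-class thresholds used by LOChom, and the function names are hypothetical.

```python
import math


def hssp_curve(length: int) -> float:
    """Percent-identity threshold for an alignment of `length` residues.

    Assumption: generic curve as given in [26], NOT the problem-specific
    refinement used by LOChom.
    """
    if length <= 11:
        return 100.0
    if length > 450:
        return 19.5
    exponent = -0.32 * (1.0 + math.exp(-length / 1000.0))
    return 480.0 * length ** exponent


def hssp_distance(percent_identity: float, length: int) -> float:
    """HVAL-style distance: positive values lie above the curve."""
    return percent_identity - hssp_curve(length)
```

Under this sketch, an alignment of 100 residues at 50% identity lies well above the curve, whereas the same identity over very few residues does not.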

(iv) LOCnet: de novo prediction from sequence. LOCnet is a system that predicts sub-cellular localization from sequence using neural networks [4]. The LOCnet system consists of three layers that sort proteins into one of four classes (extra-cellular, cytoplasmic, nuclear and mitochondrial). Major sources of improvement over publicly available methods originated from using predicted secondary structure (from PROFsec [27, 28]), improved predictions of solvent accessibility (from PROFacc [29, 30]), and evolutionary information from sequence profiles. LOCnet has a module that implicitly predicts generic signal peptides (but not the cleavage sites) and target peptides [4]. The final four-state classification accuracy of the system was about 65%. This is nearly ten percentage points higher than systems using only amino acid composition. We also noted that we had to develop a system tailored specifically to predicting localization for proteins of known 3D structure [4] that is available through LOC3D [1]. Although LOCnet performs better for extra-cellular proteins with signal peptides, it can also identify proteins that are secreted through alternative pathways (such as fgf, IL-1), and, in combination with other methods, it can distinguish between proteins with signal peptides that are retained in the endoplasmic reticulum or Golgi apparatus and those that are actually secreted [4, 31]. Note, however, that LOCnet is significantly less accurate than TargetP for mitochondrial proteins.

(v) SignalP: prediction of generic signal peptides. SignalP (version 2) is a neural network based prediction of generic N-terminal signal peptides [8, 17] . Prediction accuracy for eukaryotic proteins is around 70-80% [4] . Note that - due to licensing issues - we do not return the detailed predictions from SignalP, rather we only indicate whether or not SignalP detected a signal peptide.




Fig. 1 : The LOCtarget system. From the query amino acid sequence, the three state secondary structure and solvent accessible surface residues of the protein are predicted using PROFphd [32, 29] . LOCtarget uses four different methods to annotate sub-cellular localization: (a) PredictNLS: the amino acid sequence is scanned for nuclear localization signals. (b) LOChom: the sequence is first aligned through PSI-BLAST profiles to a database with experimental annotations about localization. If any sequence homologues are discovered, sub-cellular localization annotation is transferred from the homologue. (c) LOCkey: the SWISS-PROT database contains functional information for proteins in the form of keywords. LOCkey infers sub-cellular localization based on keyword entries. The above three programs are based solely on the amino acid sequence of the protein and do not use any structural information. (d) LOCnet: sub-cellular localization is predicted by a system of neural networks trained on a number of global features such as amino acid composition, predicted secondary structure composition and composition of predicted surface accessible residues. The final localization annotation in the LOCtarget database is taken from the most reliable prediction amongst the four individual methods.



Best single method determines the final annotation of localization. The final annotation of localization by LOCtarget is taken from the most reliable prediction amongst the four individual methods. Using this four-step approach significantly improves prediction accuracy since different methods are most accurate in different regimes. For example, if an NLS is detected by PredictNLS, the protein has a high probability of being nuclear (our NLS motifs are exclusive to nuclear proteins). If functional information in the form of SWISS-PROT keywords is available, LOCkey can use this information to infer sub-cellular localization at a very high accuracy. In the absence of sufficient functional information, identification of sequence homologues using LOChom proves most accurate. De novo predictions using LOCnet are the least accurate means; however, they are applicable when all the other methods fail. In fact, most structural genomics targets could only be predicted by LOCnet (82.7%, Table 1). At the other extreme, the most accurate method (PredictNLS) contributed less than 3% of the final annotations (Table 1).
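The selection rule can be sketched as follows. This is a minimal illustration, not the LOCtarget implementation: it assumes each method reports a confidence on a common 0-100 scale, the function name and example values are hypothetical, and ties are resolved in favour of the first prediction listed.

```python
from typing import List, Optional, Tuple

# (method, predicted localization, confidence 0-100); layout is an assumption
Prediction = Tuple[str, str, int]


def final_annotation(predictions: List[Prediction]) -> Optional[Prediction]:
    """Return the prediction with the highest confidence, or None if empty."""
    if not predictions:
        return None
    # max() keeps the first of several equally confident predictions
    return max(predictions, key=lambda p: p[2])


# Hypothetical predictions for one target
example = [
    ("LOCnet", "cytoplasmic", 42),
    ("LOChom", "nuclear", 80),
]
best = final_annotation(example)
```

Methods that make no prediction for a target are simply absent from the list, which is how a target ends up annotated by LOCnet alone.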



Table 1: Annotations of LOCtarget by method.
Method        Percentage of proteins
LOCnet        82.7
LOChom        7.6
LOCkey        7.1
PredictNLS    2.6
SUM           100.0



Fewer than 10% eukaryotic proteins. The LOCtarget database currently contains sub-cellular localization information for nearly 50,000 targets (Table 2); most of these are from prokaryotes and archaea. Note that for the roughly 700 proteins for which we have 3D structures, the predictions in LOCtarget and LOC3D may differ, since LOCtarget predictions are based on sequences, not on structures. About 17% of the prokaryotic proteins are predicted as extra-cellular (Table 2). Of the eukaryotic proteins in TargetDB, nuclear proteins constitute the single largest group, accounting for around 39% of the sequences. Proteins secreted to the extra-cellular space account for 16% of the proteins, while proteins retained in the cytoplasm account for 24%.



Table 2: Annotations by LOCtarget by type of localization.
Sub-cellular localization (a)   Eukaryotic Sequences   Prokaryotic Sequences
Cytoplasm                       1142                   36195
Extra-cellular space            764                    7594
Nucleus                         1816                   0
Mitochondria                    753                    0
Periplasmic                     68                     1287
Chloroplast                     53                     0
Endoplasmic reticulum           31                     0
Golgi apparatus                 34                     0
Lysosome                        12                     0
Peroxysome                      14                     0
Vacuoles                        4                      0
Signal peptides (b)             525                    5745
SUM                             4691                   45076

a Number of target sequences in the LOCtarget database assigned to the given localization.

b Method: SignalP 2.0
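The percentages quoted in the text follow directly from the counts in Table 2. The short check below recomputes them; the counts are copied from the table, while the variable and function names are only illustrative.

```python
# Counts per localization class, copied from Table 2
eukaryotic = {
    "Cytoplasm": 1142, "Extra-cellular space": 764, "Nucleus": 1816,
    "Mitochondria": 753, "Periplasmic": 68, "Chloroplast": 53,
    "Endoplasmic reticulum": 31, "Golgi apparatus": 34,
    "Lysosome": 12, "Peroxysome": 14, "Vacuoles": 4,
}
prokaryotic = {"Cytoplasm": 36195, "Extra-cellular space": 7594, "Periplasmic": 1287}

total_euk = sum(eukaryotic.values())
total_pro = sum(prokaryotic.values())


def percent(count: int, total: int) -> float:
    """Share of `count` in `total`, in percent."""
    return 100.0 * count / total


nuclear_pct = percent(eukaryotic["Nucleus"], total_euk)
secreted_pct = percent(eukaryotic["Extra-cellular space"], total_euk)
cytoplasm_pct = percent(eukaryotic["Cytoplasm"], total_euk)
pro_secreted_pct = percent(prokaryotic["Extra-cellular space"], total_pro)
euk_share = percent(total_euk, total_euk + total_pro)
```

The class totals reproduce the SUM row of the table, and the eukaryotic share of all targets comes out below 10%.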



 

Input, output, and options

Database description. The LOCtarget database has been formatted in an EMBL-like flat-file format. The database can be accessed on the web through a PERL CGI interface. The database can be used in either query-mode or browse-mode. (1) User query: Any object in the database can be queried using a PERL regular expression-like syntax. The query can be a name or a wildcard pattern (the search engine automatically appends the '*' wildcard pattern at the end of the query). If the query field is left blank, the search displays all objects of the selected type. Three types of objects can be queried: TargetDB protein identifiers, types of sub-cellular localization and type of prediction method. For example, querying the 'sub-cellular localization class' object with "nuclear" displays all proteins in the database that are predicted to have nuclear localization. (2) Browsing the database: In this mode, database entries are displayed in order of decreasing confidence of prediction.
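The query behaviour described above could be sketched roughly as follows. This is not the server's PERL code: it assumes shell-style wildcards (Python's fnmatch) as a stand-in for the regular expression-like syntax, assumes case-insensitive matching, and uses a hypothetical function name.

```python
from fnmatch import fnmatchcase
from typing import List


def match_objects(query: str, names: List[str]) -> List[str]:
    """Return the names matching `query` with a trailing '*' appended.

    A blank query becomes '*' and therefore matches every name.
    """
    pattern = (query.strip() + "*").lower()
    return [name for name in names if fnmatchcase(name.lower(), pattern)]


# Localization classes used as example objects
classes = ["nuclear", "cytoplasmic", "extra-cellular", "mitochondrial"]
hits = match_objects("nuc", classes)
everything = match_objects("", classes)
```

Because the wildcard is appended automatically, a prefix such as "nuc" is enough to retrieve the "nuclear" class, and an empty query lists all objects of the selected type.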

Web server description. The LOCtarget web server has been implemented using a PERL CGI interface. Currently, sequences can be submitted only in FASTA format. However, we anticipate that in the very near future the server will accept any standard sequence format (FASTA, PIR, MSF, SWISS-PROT) or a list of protein identifier codes from SWISS-PROT, TrEMBL, or PDB. We will also enable uploading sequences from your local machine. Users have the option of receiving plain text (ASCII) output or HTML formatted results that can be displayed in any web browser. Results are returned as e-mail attachments.

Format and fields. Each protein can have up to four localization predictions associated with it, one from each method. The database uses four fields to represent predictions from each method: (1) Method: the type of prediction method used. (2) Loci: predicted sub-cellular localization from this method. The predicted sub-cellular localization can be one of nine classes ( Table 2 ). (3) Confidence: confidence score assigned by the prediction method. This is a number between 0 and 100. Larger confidence scores mark more accurate predictions. (4) Details: any reasons, if available, for the particular localization class inferred by the method. For example, for a LOCkey prediction, this field would give details of the keywords responsible for this localization prediction.
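As an illustration of the four fields, one prediction could be modelled as the small record below. The class name, attribute names, and example values are assumptions made for this sketch; the actual database stores these fields in its EMBL-like flat file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocPrediction:
    """One method's prediction for one protein (hypothetical record layout)."""
    method: str        # e.g. "LOCnet", "LOChom", "LOCkey", "PredictNLS"
    loci: str          # predicted sub-cellular localization class
    confidence: int    # 0-100, larger means more accurate
    details: str = ""  # optional reasons for the inferred class

    def __post_init__(self) -> None:
        # Enforce the documented 0-100 range for the confidence score
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")


# Hypothetical entry
entry = LocPrediction("LOCkey", "nuclear", 90, "keyword-based inference")
```

A protein would then carry up to four such records, one per method.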

 

Future

We are currently extending our system to also predict localization for prokaryotic and archaeal targets. Next, we will incorporate our annotations of sub-cellular localization into the SGTDB/SPAM database (http://spam.sdsc.edu/perl/browser_beta.pl), which provides annotations for TargetDB. Further extensions will include entirely sequenced organisms and improved prediction methods (Nair & Rost, in preparation).

 

Acknowledgements

Thanks to Jinfeng Liu and Megan Restuccia (Columbia) for computer assistance and to Kaz Wrzeszczynski (Columbia) for valuable discussions. The work of RN and BR was supported by the grants DBI-0131168 from the National Science Foundation (NSF), and the grant R01-LM07329-01 from the National Library of Medicine (NLM). Last, not least, thanks to Amos Bairoch (SIB, Geneva), Rolf Apweiler (EBI, Hinxton), Phil Bourne (San Diego Univ.), John Westbrook (Rutgers) and their crews for maintaining excellent databases and to all experimentalists who enabled this tool by making their data publicly available.

 

References

1. Nair, R. & Rost, B. (2003). LOC3D: annotate sub-cellular localization for protein structures. Nucl. Acids Res., 31, 3337-3340.
2. Nair, R. & Rost, B. (2002). Sequence conserved for sub-cellular localization. Prot. Sci., 11, 2836-2847.
3. Nair, R. & Rost, B. (2002). Inferring sub-cellular localisation through automated lexical analysis. Bioinformatics, 18, S78-S86.
4. Nair, R. & Rost, B. (2003). Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins, 53, 917-930.
5. Berman, H. M., Battistuz, T., Bhat, T. N., Bluhm, W. F., Bourne, P. E. et al. (2002). The Protein Data Bank. Acta Crystallogr D Biol Crystallogr, 58, 899-907.
6. Cokol, M., Nair, R. & Rost, B. (2000). Finding nuclear localisation signals. EMBO Rep., 1, 411-415.
7. Nair, R., Carter, P. & Rost, B. (2003). NLSdb: database of nuclear localization signals. Nucl. Acids Res., 31, 397-399.
8. Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Prot. Engin., 10, 1-6.
9. Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res., 28, 45-48.
10. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M. C., Estreicher, A. et al. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucl. Acids Res., 31, 365-370.
11. Westbrook, J., Feng, Z., Chen, L., Yang, H. & Berman, H. M. (2003). The Protein Data Bank and structural genomics. Nucl. Acids Res., 31, 489-491.
12. Liu, J., Hegyi, H., Acton, T. B., Montelione, G. T. & Rost, B. (2004). Automatic target selection for structural genomics on eukaryotes. Proteins, in press.
13. Liu, J. & Rost, B. (2004). CHOP: parsing proteins into structural domains. Nucl. Acids Res., submitted 2004-02-14.
14. Goldsmith-Fischman, S. & Honig, B. (2003). Structural genomics: computational methods for structure analysis. Prot. Sci., 12, 1813-21.
15. Laskowski, R. A., Watson, J. D. & Thornton, J. M. (2003). From protein structure to biochemical function? J Struct Funct Genomics, 4, 167-77.
16. Stark, A. & Russell, R. B. (2003). Annotation in three dimensions. PINTS: Patterns in Non-homologous Tertiary Structures. Nucl. Acids Res., 31, 3341-3344.
17. Emanuelsson, O., Nielsen, H., Brunak, S. & von Heijne, G. (2000). Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005-1016.
18. Reinhardt, A. & Hubbard, T. (1998). Using neural networks for prediction of the subcellular location of proteins. Nucl. Acids Res., 26, 2230-2235.
19. Hua, S. & Sun, Z. (2001). Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721-728.
20. Nakai, K. & Horton, P. (1999). PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. TIBS, 24, 34-6.
21. Bourne, P. E., Addess, K. J., Bluhm, W. F., Chen, L., Deshpande, N. et al. (2004). The distribution and query systems of the RCSB Protein Data Bank. Nucl. Acids Res., 32, D223-5.
22. Tinland, B., Koukolikova-Nicola, Z., Hall, M. N. & Hohn, B. (1992). The T-DNA-linked VirD2 protein contains two distinct functional nuclear localization signals. Proc. Natl. Acad. Sci. U.S.A., 89, 7442-6.
23. Altschul, S., Madden, T., Shaffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389-3402.
24. Eisenhaber, F. & Bork, P. (1998). Wanted: subcellular localization of proteins based on sequence. TICB, 8, 169-170.
25. Sander, C. & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins, 9, 56-68.
26. Rost, B. (1999). Twilight zone of protein sequence alignments. Prot. Engin., 12, 85-94.
27. Rost, B. (2001). Protein secondary structure prediction continues to rise. J. Struct. Biol., 134, 204-218.
28. Rost, B. & Liu, J. (2003). The PredictProtein server. Nucl. Acids Res., 31, 3300-3304.
29. Rost, B. (2004). How to use protein 1D structure predicted by PROFphd. Meth. Mol. Biol., submitted.
30. Rost, B., Yachdav, G. & Liu, J. (2004). The PredictProtein server. Nucl. Acids Res., in press.
31. Wrzeszczynski, K. O. & Rost, B. (2004). Annotating proteins from Endoplasmic reticulum and Golgi apparatus in eukaryotic proteomes. CMLS, submitted.
32. Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol., 266, 525-539.

Contact:    rost@columbia.edu Version:    Apr 18, 2004