
Title: NLSdb: database of nuclear localization signals
Author: Rajesh Nair, Phil Carter & Burkhard Rost
Quote: Nucl Acids Res, 2003, 31:397-399

NLSdb: database of nuclear localization signals

Rajesh Nair 1,2,*, Phil Carter 1 & Burkhard Rost 1, 3, 4, *

1 CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2 Dept. of Physics, Columbia Univ., 538 West 120th Street, New York, NY 10027, USA
3 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
4 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
* Corresponding author: email: rost@columbia.edu; URL: http://cubic.bioc.columbia.edu/; Tel: +1-212-305-3773; Fax: +1-212-305-7932

This article is published in (Nucleic Acids Research, issue, 2002 and pages) © copyright Oxford University Press (2002). OUP is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.

Abstract

NLSdb is a database of nuclear localization signals (NLSs) and of nuclear proteins. NLSs are short stretches of residues that mediate the transport of nuclear proteins into the nucleus. The database contains 114 experimentally determined NLSs that were obtained through an extensive literature search. Using ‘in silico mutagenesis’ this set was extended to 308 experimental and potential NLSs. This final set matched over 43% of all known nuclear proteins and matched no currently known non-nuclear protein. NLSdb contains over 6000 predicted nuclear proteins and their targeting signals from the PDB and SWISS-PROT databases. The database also contains over 12500 predicted nuclear proteins from six entirely sequenced eukaryotic proteomes (Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, and Saccharomyces cerevisiae). NLS motifs often co-localise with DNA-binding regions. This observation was also used to annotate over 1500 DNA-binding proteins. NLSdb can be accessed via the web site: http://cubic.bioc.columbia.edu/db/NLSdb/.

Key words: nuclear localization signal, database, protein targeting, DNA-binding, SRS.

Introduction

Extraction and testing of NLS motifs. Proteins are actively transported into the nucleus by binding to specific molecules such as importins and karyopherins that recognise distinct targeting signals [1]. The targeting signal is usually a short stretch of consecutive residues and is commonly referred to as the nuclear localization signal (NLS). Experimentally best characterised are mono-partite and bipartite motifs. Most mono-partite motifs are characterised by a cluster of positively charged residues preceded by a helix-breaking residue. Most bipartite motifs consist of two clusters of basic residues separated by 9-12 residues. Over the last few years a large number of distinct NLSs have been experimentally implicated in nuclear transport [2, 3]. However, NLSs have been experimentally determined for fewer than 10% of known nuclear proteins. To remedy this situation we devised a procedure of ‘in silico mutagenesis’ to discover new NLSs [4]. Briefly, this procedure works as follows: (I) Change or remove some residues from the experimentally characterised NLS motifs and monitor the resulting true (nuclear) and false (non-nuclear) matches. Allowing alternative residues at particular positions increased the number of nuclear proteins found, but often also increased the number of matching non-nuclear proteins. (II) Discard any potential NLSs that are found in known non-nuclear proteins (false matches). (III) Require that potential NLSs be found in at least two distinct nuclear families. The 194 potential NLSs discovered using this procedure increased the coverage of known nuclear proteins to 43%. All proteins in the PDB and SWISS-PROT [5] databases were annotated using the full list of experimental and potential NLSs. We also annotated all sequences in the yeast, worm, fruit fly and human proteomes. Approximately 20% of the NLS motifs were observed to co-localise with experimentally determined DNA-binding regions of proteins [4, 6]. These motifs were used to annotate DNA-binding proteins.
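
The three steps above can be illustrated with a minimal sketch. It is not the procedure as implemented for NLSdb: it only substitutes residues (it does not remove them), the substitution table, the function names `relax_motif` and `accept`, and all sequences and family labels are hypothetical toy values chosen for illustration.

```python
import re
from typing import Dict, List, Set, Tuple

# Toy protein record: (sequence, family label, annotated as nuclear?)
Protein = Tuple[str, str, bool]


def relax_motif(motif: str, alternatives: Dict[str, str]) -> List[str]:
    """Step I (simplified): allow alternative residues at one position at a time.

    `alternatives` maps a residue to the residues that may replace it
    (a hypothetical substitution table, not taken from the paper).
    """
    variants = []
    for i, residue in enumerate(motif):
        if residue in alternatives:
            # Turn the single residue into a regex character class.
            relaxed = f"{motif[:i]}[{residue}{alternatives[residue]}]{motif[i + 1:]}"
            variants.append(relaxed)
    return variants


def accept(pattern: str, proteins: List[Protein], min_families: int = 2) -> bool:
    """Steps II and III: no non-nuclear match, and at least `min_families`
    distinct nuclear families among the matching proteins."""
    regex = re.compile(pattern)
    families: Set[str] = set()
    for sequence, family, is_nuclear in proteins:
        if regex.search(sequence):
            if not is_nuclear:
                return False  # Step II: discard on any false match
            families.add(family)
    return len(families) >= min_families  # Step III


# Toy data set (assumed values, for illustration only).
proteins: List[Protein] = [
    ("MAPKKKRKVED", "famA", True),
    ("GGPRKKRKVLL", "famB", True),
    ("SSPKKKKKVQQ", "famD", False),
]

variants = relax_motif("PKKKRKV", {"K": "R", "R": "K"})
accepted = [v for v in variants if accept(v, proteins)]
print(accepted)
```

In this toy set, a variant survives only if it gains a second nuclear family without touching the non-nuclear sequence, which mirrors the trade-off between coverage and false matches described in step (I).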

General interest. NLSdb is a comprehensive source of information regarding NLSs and proteins translocated into the nucleus by signal sequences. Targeting signal recognition is a key control point in the regulation of nuclear transport. A database of NLS motifs is therefore a useful resource for biologists in identifying targeting signals in their sequences. The database describes all experimentally determined NLS motifs with links to the original references in PubMed [7]. The information provided by our tool has already been useful for experimental studies of nuclear targeting.

 

Database Description

Interface. The data is stored and managed using the portal of the Sequence Retrieval System (SRS) [8]. SRS provides a convenient and robust framework for managing molecular databases. It gives users quick, efficient search, retrieval and display methods that work in any web browser. Using SRS, the information in NLSdb can be easily integrated with other public and proprietary databases. The database is continuously updated and refined from the primary literature.

Format and fields. NLSdb has been formatted in an EMBL-like flat-file format, thus allowing indexing of the database in SRS [8]. Each NLSdb entry describes a nuclear localization signal. Each entry is organised into six major fields: (I) Origin, (II) Annotation, (III) Reference, (IV) Confidence, (V) Proteins and (VI) DNAbinding. The ‘Origin’ field describes whether the NLS has been found by direct experiments, or whether it is a potential NLS discovered through our ‘in silico mutagenesis’. For experimentally determined NLSs, further information is provided in the fields ‘Annotation’ and ‘Reference’. The ‘Annotation’ field describes the protein family in which the experimental NLS was first established, and the ‘Reference’ field gives the primary literature citation. The ‘Reference’ field also contains a link to PubMed for each citation. The ‘Confidence’ field is an indicator of our confidence in the NLS; it consists of two sub-fields: ‘Total confidence’ and ‘% Nuclear’. ‘Total confidence’ is the number of localization-annotated proteins from SWISS-PROT in which this NLS is found, and ‘% Nuclear’ is the percentage of these that are annotated as nuclear in SWISS-PROT. The ‘Proteins’ field lists proteins from various databases that are likely to be targeted to the nucleus since they match the given NLS motif. Currently the ‘Proteins’ field contains proteins from the SWISS-PROT, PDB and PEP [9] databases. All protein entries are linked to the original entries in the respective databases. The ‘DNAbinding’ field describes whether the NLS overlaps with known DNA-binding regions of proteins. NLSdb can be browsed starting either with the NLS entries or with any of the data fields defined above.
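
As a rough illustration of the six fields and of how the two ‘Confidence’ sub-fields follow from their definitions, the sketch below models one entry as an in-memory record. The class name `NLSEntry`, the attribute names and the helper `confidence` are assumptions made for this example; they do not reproduce the actual flat-file syntax or any SRS interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class NLSEntry:
    """Hypothetical in-memory view of one NLSdb entry (six major fields)."""
    motif: str
    origin: str                        # 'experimental' or 'potential'
    annotation: Optional[str] = None   # protein family (experimental NLSs only)
    reference: Optional[str] = None    # literature citation (experimental NLSs only)
    total_confidence: int = 0          # Confidence sub-field 'Total confidence'
    percent_nuclear: float = 0.0       # Confidence sub-field '% Nuclear'
    proteins: List[str] = field(default_factory=list)
    dna_binding: bool = False


def confidence(matches: List[Tuple[str, bool]]) -> Tuple[int, float]:
    """Compute ('Total confidence', '% Nuclear') from the localization-annotated
    proteins that contain the NLS.

    `matches` holds (protein identifier, annotated as nuclear?) pairs.
    """
    total = len(matches)
    if total == 0:
        return 0, 0.0
    nuclear = sum(1 for _, is_nuclear in matches if is_nuclear)
    return total, 100.0 * nuclear / total


# Toy example with made-up identifiers.
matches = [("P1", True), ("P2", True), ("P3", False), ("P4", True)]
total, pct = confidence(matches)
entry = NLSEntry(motif="PKKKRKV", origin="experimental",
                 total_confidence=total, percent_nuclear=pct,
                 proteins=[pid for pid, _ in matches])
print(entry.total_confidence, entry.percent_nuclear)
```

Keeping the two sub-fields separate makes it possible to distinguish a motif that is rarely observed from one that is frequently observed but not specific to nuclear proteins.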

Annotations for entirely sequenced eukaryotic proteomes. Using the full set of experimental plus potential (discovered through ‘in silico mutagenesis’) NLS motifs in NLSdb, we found over 12500 proteins with NLSs in six entirely sequenced eukaryotic proteomes (Table 1).
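
Annotating a proteome with a motif list amounts to scanning every sequence for every motif. The sketch below shows one simple way to do this with regular expressions. The function name `annotate`, the protein identifiers and sequences, and in particular the bipartite pattern (two basic residues, a spacer of 9-12 residues, three basic residues) are illustrative assumptions and not the motifs stored in NLSdb.

```python
import re
from typing import Dict, List


def annotate(proteome: Dict[str, str], motifs: List[str]) -> Dict[str, List[str]]:
    """Return, for each protein with at least one hit, the motifs it matches."""
    compiled = [(motif, re.compile(motif)) for motif in motifs]
    hits: Dict[str, List[str]] = {}
    for protein_id, sequence in proteome.items():
        found = [motif for motif, regex in compiled if regex.search(sequence)]
        if found:
            hits[protein_id] = found
    return hits


# Assumed example motifs: one mono-partite, one loosely defined bipartite.
MONO = "PKKKRKV"
BIPARTITE = r"[KR]{2}.{9,12}[KR]{3}"

# Toy proteome with made-up identifiers and sequences.
proteome = {
    "protA": "MSPKKKRKVEDP",
    "protB": "AVKRPAATKKAGQAKKKKLD",
    "protC": "MGSSHLLEAGTVDE",
}

hits = annotate(proteome, [MONO, BIPARTITE])
for protein_id, found in hits.items():
    print(protein_id, found)
```

Proteins without any match (here `protC`) are simply absent from the result, so the number of keys gives the count of predicted nuclear proteins for that proteome.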


Conclusions

NLSdb can greatly help in better understanding signal-dependent nuclear transport of proteins. The potential NLS motifs discovered through ‘in silico mutagenesis’ can aid in discovering new signal sequences involved in nuclear targeting. A future goal is to integrate NLSdb with all sequences in the TrEMBL database and all proteomes in the PEP database.

NLSdb should be cited with the present publication as reference. The database can be accessed through the World Wide Web at: http://cubic.bioc.columbia.edu/db/NLSdb/.

 

References

1. Tinland, B., Koukolikova-Nicola, Z., Hall, M. N. & Hohn, B. (1992). The T-DNA-linked VirD2 protein contains two distinct functional nuclear localization signals. Proc Natl Acad Sci U S A, 89, 7442-6.
2. Mattaj, I. W. & Englmeier, L. (1998). Nucleocytoplasmic transport: the soluble phase. Annu Rev Biochem, 67, 265-306.
3. Jans, D. A., Xiao, C. Y. & Lam, M. H. (2000). Nuclear targeting signal recognition: a key control point in nuclear transport? Bioessays, 22, 532-44.
4. Cokol, M., Nair, R. & Rost, B. (2000). Finding nuclear localisation signals. EMBO Reports, 1, 411-415.
5. Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res., 28, 45-48.
6. LaCasse, E. C. & Lefebvre, Y. A. (1995). Nuclear localization signals overlap DNA- or RNA-binding domains in nucleic acid-binding proteins. Nucleic Acids Res, 23, 1647-56.
7. Airozo, D., Allard, R., Brylawski, B., Canese, K., Kenton, D. et al. (1999). MEDLINE. 1999.
8. Etzold, T., Ulyanov, A. & Argos, P. (1996). SRS: Information retrieval system for molecular biology data banks. Meth. Enzymol., 266, 114-128.
9. Carter, P., Liu, J. & Rost, B. (2003). PEP: Predictions of Entire Proteomes. NAR (submitted).

Contact:    rost@columbia.edu Version:    Sep 13, 2002