Finding nuclear localization signals

Murat Cokol, Raj Nair & Burkhard Rost $

CUBIC, Columbia University

Columbia University, Department of Biochemistry and Molecular Biophysics, 630 West 168th Str., New York, NY 10032, USA

$ Corresponding author: rost@columbia.edu, http://cubic.bioc.columbia.edu/
Tel: +1-212-305-3773, fax: +1-212-305-7932





Abstract

A variety of nuclear localisation signals (NLSs) are experimentally known; only one motif was available for database searches. We initially collected a set of 91 experimentally verified NLSs from the literature. Through iterated 'in silico mutagenesis' we then extended the set to 214 potential NLSs. This final set matched in 43% of all known nuclear proteins and in no known non-nuclear protein. We estimated >17% of all eukaryotic proteins may be imported into the nucleus. Finally, we found an overlap between NLS and DNA-binding region for 90% of the proteins for which both NLS and DNA-binding regions were known. Thus, evolution seemed to have used part of the existing DNA-binding mechanism when compartmentalising DNA-binding proteins into the nucleus. However, only 56 of our 214 NLS motifs overlapped with DNA-binding regions. These 56 NLSs enabled a de novo prediction of partial DNA-binding regions for about 800 proteins in human, fly, worm and yeast.

Key words: nuclear localisation signal (NLS); protein sequence analysis; genome analysis; predict cellular localization; predict DNA-binding regions; Drosophila melanogaster; Caenorhabditis elegans; Saccharomyces cerevisiae.


Introduction

Simplification of nuclear import. A nuclear localisation signal (NLS) is a short stretch of amino acids that mediates the transport of nuclear proteins into the nucleus ( Fig. 1 ). NLS motifs play a key rôle in this mechanism. (1) Typically, deletion of the NLS disrupts nuclear import. (2) Frequently, a non-nuclear protein will be imported into the nucleus if fused to an NLS. Both facts have been used routinely to experimentally unravel NLS motifs [1, 2] .




Fig. 1. Simplified scheme for nuclear import. After nuclear proteins have been synthesised in the cytoplasm, transport proteins such as the importins or transportins bind to the NLS. The complex importin/NLS-protein (or transportin/protein) is then actively transported into the nucleus through nuclear pores involving the Ran GTPase cycle. Currently, this is the only known mechanism for nuclear import [19, 23] .



Variety of NLS motifs. Do experimentally known NLS motifs have a consensus? Positively charged residues are abundant in NLSs, in general, since some of these positive residues bind to e.g. importins [3] . Mutating positive charges is often the simplest way to disrupt nuclear import. However, there are Glycine-rich NLS motifs with few positive charges [4] . Experimentally best described are monopartite and bipartite motifs [5] . Typically, the monopartite motif is characterised by a cluster of basic residues preceded by a helix-breaking residue. Similarly, the bipartite motif consists of two clusters of basic residues separated by 9-12 residues. However, not all experimentally known NLSs comply with the above 'rules' [6, 7, 8] . Furthermore, many non-nuclear proteins match such simplified 'consensus rules'.
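The simplified 'consensus rules' described above can be written as regular expressions. The following is a minimal sketch only: the cluster sizes (at least three basic residues, two residues in the first bipartite cluster) and the choice of Proline/Glycine as helix-breaking residues are assumptions made for illustration, not the motifs used in this work. The two test sequences are the textbook NLSs of SV40 large T antigen and nucleoplasmin.

```python
import re

# Assumed simplifications for illustration, not the motifs from this paper.
# Monopartite: a helix-breaking residue (P or G) followed by >= 3 basic residues (K/R).
MONOPARTITE = re.compile(r"[PG][KR]{3,}")
# Bipartite: two basic residues, a spacer of 9-12 residues, then >= 3 basic residues.
BIPARTITE = re.compile(r"[KR]{2}.{9,12}[KR]{3,}")


def classify(sequence):
    """Return the names of the simplified consensus rules matching the sequence."""
    hits = []
    if MONOPARTITE.search(sequence):
        hits.append("monopartite")
    if BIPARTITE.search(sequence):
        hits.append("bipartite")
    return hits


print(classify("PKKKRKV"))           # SV40 large T antigen NLS
print(classify("KRPAATKKAGQAKKKK"))  # nucleoplasmin NLS
```

Rules this loose also match many non-nuclear proteins, which is why each candidate motif has to be tested against known non-nuclear sequences.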

Finding an NLS in silico? A wealth of experimental data about NLSs has been accumulated. How can you find a known NLS in your protein? If a standard database search reveals a 'significant similarity' between your protein and a protein with an experimentally known and annotated NLS, you can infer the NLS from the homologue. If not, can you find most experimental motifs in PROSITE [9]? The negative answer was the starting point for this work: build an 'expert database' of experimentally known NLSs. Another motivation was the observation that NLSs defined by experiments often appeared too specific. Theoretical generalisations for NLSs have been suggested: 'NLS cores are hexapeptides with at least four basic residues and neither acidic nor bulky residues' [10]. However, this motif matches only a few nuclear and many non-nuclear proteins.

Do homologues have similar NLSs? Two naturally evolved proteins with more than 30% identical residues have similar 3D structures [11] . Sequence similarity required to infer function is much higher [12] . Structural thresholds depend on alignment length, e.g., two identical 11-residue-peptides can adopt different structures [13] . NLSs are short stretches of residues. Thus, at which levels of sequence similarity can we infer that two proteins will have a similar NLS? A lack of data prevented us from thoroughly answering this question. However, we found some upper boundaries.

Here, we presented an extended expert database of experimentally known and potential NLS motifs. We evaluated the validity of the set by a rigorous test against known nuclear and non-nuclear proteins. Our method comprised two steps: (1) data collection: collect experimental NLS motifs from the literature and extend motifs through close homologues; (2) generalisation: refine the motifs found by shortening (too specific) or lengthening (not specific enough), and test new motifs conceptually similar to known motifs found in many families of nuclear proteins. The crucial component of both steps was to accept motifs only if NOT found in non-nuclear proteins.


Results and Discussion

Improved accuracy and coverage of NLS database

Inferring NLSs from sequence similarity very limited. We found about 30 protein pairs with more than 80% sequence identity and different annotations (nuclear and cytoplasmic) in our subset of SWISS-PROT (Methods; e.g. the nuclear elongation factor 1-Alpha-2 in mouse and the cytoplasmic transcription elongation factor 1-Alpha in zebrafish had 91% identity over 460 residues). At 50-65% sequence identity, we found many pairs aligned over a substantial length and annotated in different localisations (e.g. 60% nuclear and extracellular: fbrl_rat/ndl_drome; 63% nuclear and mitochondrial: hmgt_mouse/mtt1_human; 51% nuclear and chloroplast: grp1_sinal/ro30_nicpl). Thus, we can infer that a protein is nuclear only if it is almost identical to a known nuclear protein. However, for all the experimental NLSs we extracted, we succeeded in correctly inferring the nuclear localisation from the NLS. Note that this failed for all NLSs from previously published theoretical generalisations [10] .

Raising coverage from 9% to 43%. Before we started, we had three ways to find an NLS in protein A. (1) We could memorise published NLSs and visually detect one (or several) of these in A. Obviously, this requires time and ample expertise. Furthermore, all experimental NLSs covered only 10% of the known nuclear proteins (too specific, Table 1 ). (2) We could automatically detect the NLS in PROSITE [9] . However, this covered only about 3% of all known nuclear proteins, and was not always correct ( Table 1 ).



 

Table 1: Accuracy and coverage of NLS motifs

Set (A)  N NLS (B)  Nprot nuc (C)  Nfam nuc (D)  Accuracy (E)  Coverage (F)
PROSITE  1  96  31  90%  3%
SWISS-PROT  322  290  n.a.  9%
NLS-lit cleaned  91  309  35  100%  10%
NLS-lit consensus  91  537  35  100%  17%
PredictNLS_DB  214  1354  186  100%  43%

A : Set: PROSITE: motifs annotated in the PROSITE database of functional motifs [9] ; SWISS-PROT: subset of SWISS-PROT database [14] annotating nuclear localisation signals (note that a few proteins had more than one NLS annotated); NLS-lit cleaned: subset of motifs from literature with 100% accuracy; NLS-lit consensus: motifs refined by consensus of close homologues; PredictNLS_DB: final data set after in silico mutagenesis; B N NLS: number of NLS motifs in set; C Nprot nuclear: number of proteins matching any of the NLS and known to be nuclear; D Nfam: number of unique protein families matching any of the NLS and known to be nuclear (Methods: dataset); E Accuracy: percentage of nuclear proteins in set of proteins matching any of the NLS; F Coverage: percentage of known nuclear proteins (Methods: dataset) matching any of the motifs in the set (total number of known nuclear proteins 3142).
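The accuracy and coverage definitions in footnotes E and F can be checked with a few lines of arithmetic. This is a sketch under the assumption that coverage is simply the number of matched nuclear proteins divided by the 3142 known nuclear proteins; the function names are hypothetical, and the counts are read from Table 1.

```python
# Total number of known nuclear proteins in the test set (Table 1, footnote F).
N_KNOWN_NUCLEAR = 3142

# Nuclear proteins matched by each set (column "Nprot nuc" of Table 1).
NUCLEAR_MATCHES = {
    "SWISS-PROT": 290,
    "NLS-lit consensus": 537,
    "PredictNLS_DB": 1354,
}


def coverage(n_nuclear_matched, n_known_nuclear=N_KNOWN_NUCLEAR):
    """Percentage of known nuclear proteins matched by a motif set."""
    return 100.0 * n_nuclear_matched / n_known_nuclear


def accuracy(n_nuclear_matched, n_non_nuclear_matched):
    """Percentage of matched proteins that are known to be nuclear."""
    total = n_nuclear_matched + n_non_nuclear_matched
    return 100.0 * n_nuclear_matched / total


for name, n_matched in NUCLEAR_MATCHES.items():
    print(f"{name}: coverage {coverage(n_matched):.1f}%")
```

With these counts the computed coverages round to the 9%, 17% and 43% listed in the table.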



 


(3) We could find a significant level of sequence similarity to a protein for which the NLS was annotated in SWISS-PROT [14] . This covered about 9% of all known nuclear proteins ( Table 1 ). Furthermore, standard database searches starting with the proteins known to be nuclear yielded less than 25% of the known nuclear families at a generous BLAST cut-off of 10^-3 . In contrast, our final expert set of potential NLSs matched 43% of all nuclear proteins without any false positive ( Table 1 ).

Limitations and error margin of method. Proteins often contain more than one NLS. Thus, our method might fail to propose the functional NLS. Furthermore, a few of our potential NLSs might just be motifs common to nuclear proteins such as DNA-binding motifs. Examples of motifs common to nuclear proteins that we found with the motif-detection programs PRATT [15] and the Gibbs-sampler [16] were long repeats of Glycines, Glutamic Acids and Glutamine, and zinc-finger type II motifs. Most importantly, we found possible NLSs in 54 E. coli proteins, only 26 of which could be explained by DNA-binding motifs. Assuming that the remaining 28 comprised errors, we estimated the error margin of our method to be below 1% (28/4286).

Lessons learned from 'in silico mutagenesis'. (1) As expected, amino acids with similar physico-chemical properties could often be exchanged (e.g. Leucine/Isoleucine). (2) Unexpectedly, positive amino acids (Arginine and Lysine) often could NOT be inter-changed. (3) None of the NLSs previously proposed by theory passed our criterion of 100% accuracy. (4) We found that proteins may have similar structure and function and yet may utilise different NLSs. (5) Very peculiar motifs we added to our final list were (A) GGGxGGGxxSSS, e.g. found by generalisation of the M9 domain motif (human RNP A1 protein), and (B) SGxxG{3,}?xG{3,}?xG{3,}?S (G{3,} denoting three or more consecutive Gs), e.g. found in the transcriptional activator protein of mouse.

More than 17% of eukaryotic proteins nuclear. Extrapolating from the SWISS-PROT coverage, we could estimate a lower-limit (SWISS-PROT biased towards known NLSs) for the fraction of nuclear proteins in eukaryotes. We detected potential NLSs in 4187 proteins from human, fly, yeast and the worm ( Table 2 ). Thus, more than 17% of all eukaryotic proteins appeared to be imported into the nucleus. All entire genomes investigated had a similar percentage of nuclear proteins, although they clearly differed in the content of extra-cellular, helical membrane, and coiled-coil proteins [17] .




 

Table 2: Nuclear proteins in genomes

Genome (A)  No of ORFs (B)  No of proteins with NLS (C)  Estimated content nuclear (D)
Human  13933  1311  >22% (E)
Drosophila  14219  1256  >21%
C. elegans  16232  1141  >17%
Yeast  6307  479  >18%
E. coli  4286  54  0%

A : Genome: we obtained the incomplete set of human sequences from the latest releases of SWISS-PROT and TrEMBL [14] , and the complete lists of proteins for the genomes of Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, and Escherichia coli from the respective Web sites [26] ; B No of ORFs: number of open-reading frames (proteins) in entire genome; C No of proteins with NLS: number of proteins for which the set PredictNLS_DB found an NLS in that genome; D Estimated content nuclear: given that our data set of NLS covers about 43% of all known nuclear proteins (Table 1), we estimated the content of nuclear proteins in the entire genome based on the number of proteins for which we found NLS; supposedly, these estimates provided a lower boundary (Results); E Estimated coverage for human: since our current data set for human contains only about 10% of all the proteins expected in the human genome, and since most of these are strongly biased by 'experimental focus', we could not estimate whether or not the coverage for human will be similar for the remaining 90% of all human proteins.
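The extrapolation described in footnote D can be sketched as follows. The formula is an assumption: it reads the footnote as dividing the fraction of proteins with a detected NLS by the 43% coverage from Table 1. The function name is hypothetical and the counts are taken from Table 2.

```python
# Coverage of known nuclear proteins by PredictNLS_DB (Table 1).
COVERAGE = 0.43

# (number of ORFs, number of proteins with a detected NLS), taken from Table 2.
GENOMES = {
    "Human": (13933, 1311),
    "Drosophila": (14219, 1256),
    "C. elegans": (16232, 1141),
    "Yeast": (6307, 479),
}


def estimated_nuclear_fraction(n_orfs, n_with_nls, coverage=COVERAGE):
    """Assumed extrapolation: observed fraction with NLS, scaled up by the coverage."""
    return (n_with_nls / n_orfs) / coverage


for genome, (n_orfs, n_with_nls) in GENOMES.items():
    estimate = estimated_nuclear_fraction(n_orfs, n_with_nls)
    print(f"{genome}: about {100 * estimate:.1f}% nuclear")

# The four eukaryotic counts add up to the 4187 proteins quoted in the text.
print(sum(n_with_nls for _, n_with_nls in GENOMES.values()))
```

Under this reading the estimates fall within about one percentage point of the values in Table 2, which appear to have been rounded up.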



 


Specific NLS motifs used to bind DNA

About one fourth of NLS motifs co-localised with DNA-binding regions. Too few complexes of DNA/protein were solved by X-ray crystallography to conclude that NLS and DNA-binding motifs were co-localised. Instead, we used 1115 proteins with SWISS-PROT annotations about DNA-binding regions; 736 of these had a known NLS (66%), and for 664 the NLS overlapped with the DNA-binding region. Thus, for 90% of all proteins for which we knew both the NLS and the DNA-binding region, both motifs overlapped. For 10% of the proteins, we could establish that the NLS and the DNA-binding region did NOT overlap. Furthermore, the NLS motifs co-localising with DNA-binding constituted about one fourth (56 of 214) of our final NLS set. The very observation that DNA-binding and NLS overlap frequently was not novel. In fact, based on a 20 times larger data set, we verified the original results from [18] . We also corrected their estimate upwards: where they found that 67% of the DNA-binding regions co-localised with the NLS, we found this number to be 90%. In contrast, our results suggested that MOST NLS motifs were NOT used to bind DNA.
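The overlap test behind these numbers can be illustrated with residue ranges. This is a minimal sketch assuming that 'overlap' means the NLS and the annotated DNA-binding region share at least one residue position; the exact criterion is not stated here, and the toy ranges below are hypothetical.

```python
def overlaps(region_a, region_b):
    """True if two closed residue ranges (start, end) share at least one position."""
    return region_a[0] <= region_b[1] and region_b[0] <= region_a[1]


def fraction_overlapping(annotations):
    """Fraction of proteins with both annotations whose NLS overlaps the DNA-binding region.

    annotations: list of (nls_range, dna_range) pairs; a range is None if not annotated.
    """
    both = [(nls, dna) for nls, dna in annotations if nls is not None and dna is not None]
    if not both:
        return 0.0
    hits = sum(overlaps(nls, dna) for nls, dna in both)
    return hits / len(both)


# Hypothetical toy annotations: one overlap, one non-overlap, one protein without NLS.
toy = [((10, 16), (5, 40)), ((100, 110), (20, 60)), (None, (1, 30))]
print(fraction_overlapping(toy))

# Counts quoted in the text: 1115 proteins, 736 with a known NLS, 664 overlapping.
print(f"with NLS: {736 / 1115:.0%}, overlapping: {664 / 736:.0%}")
```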

RNA-binding regions typically NOT overlapping with NLS. Contrary to LaCasse and Lefebvre (1995), we found that only 33 of the 99 regions annotated in SWISS-PROT as RNA-binding in nuclear proteins overlapped with an NLS. The difference largely resulted from their definition of 'RNA-binding region' as the entire region between two consecutive RNA-binding sites. In contrast, SWISS-PROT - correctly - annotated only regions experimentally shown to bind RNA.

Structures for DNA-binding and NLS. For 20 of the investigated 22 proteins of known structure, we found the known NLS to overlap with the DNA-binding region ( Fig. 2 ). The only exceptions were rap1 from yeast and the segmentation protein fushi tarazu from fly (PDB codes: 1ign and 1ftz) for which we did not find the respective NLS in the known DNA-binding regions. However, these two exceptions did not have any of the 56 NLSs found to co-localise with DNA-binding. As expected, we found all NLSs on the protein surface.




Fig. 2. NLS motif also used for DNA-binding. Zoom into the interface between DNA and P55-C-fos proto-oncogene protein (note: the other parts of the amazing crystal structure of the complex with PDB id 1a02 [24] are not shown). The coloured region corresponds to the residues RRERNKMAAAKSRNRRR. In fact, this motif is also contained in our data set of potential NLS motifs. Colouring scheme: basic residues shown in red; others in yellow. Graph created with RASMOL [25] .



Speculation about evolution. The co-localisation of NLSs and DNA-binding regions suggested that DNA and shuttle proteins like importins and transportins utilised similar binding residues. Protein-DNA interactions may have preceded the 'invention' of a nucleus used by eukaryotes to compartmentalise all processes involving DNA. How to recognise proteins to import into this compartment? Common to many nuclear proteins are DNA-binding regions. Thus, it seems plausible that fragments of these regions were utilised to manage nuclear import. Consequently, we expect to find importin-like proteins and NLS-like sequences in prokaryotic organisms. In fact, we did find such motifs in E. coli proteins ( Table 2 ); many of these appeared to be involved in DNA-binding. Obviously, evolution invented other NLS motifs (only 56 of 214 of the NLSs co-localised with DNA-binding) over time. NLSs are often also used to target nuclear export [19] . Could we thus perceive the co-localisation of DNA-binding and NLS as an elegant mechanism to also prevent export for some of the proteins? And did evolution in fact have to invent novel NLS motifs to manage export rather than import? Our data did not falsify such speculations.

De novo prediction of DNA-binding regions. Searching with the NLS/DNA motifs, we predicted a relatively small number of DNA-binding proteins in eukaryotes, ranging from 419 in human to 67 in yeast ( Table 3 ). However, this was 2-9 times higher than the number of proteins in the respective organism for which SWISS-PROT annotated DNA-binding or for which we could infer DNA-binding through homology ( Table 3 ). Thus, we predicted new potential DNA-binding regions for more than 800 proteins in all four eukaryotes.




 

Table 3: DNA-binding regions in genomes

Genome (A)  Nprot (B)  Nprot bind-DNA predicted (C)  Nprot bind-DNA known (D)
Human  13933  419  141
Drosophila  14219  300  37
C. elegans  16232  251  10
Yeast  6307  67  10
E. coli  4286  13  3

A : Genome: see Table 2; B Nprot: total number of proteins in entire genome; C Nprot bind-DNA predicted: number of proteins for which we predict DNA-binding using NLS motifs; D Nprot bind-DNA known: number of proteins for which DNA-binding is annotated, or can be inferred by homology to a protein for which binding is annotated (note: family relations taken from [26] ).



 


Availability of data set and program

Our data set and method are available at: http://cubic.bioc.columbia.edu/predictNLS. The program also allows experimentalists to test accuracy and coverage for new NLS motifs they may find or suspect. This feature has already helped to experimentally unravel a novel NLS in the hairless protein [20] . Finally, we added a form enabling experimentalists to add new NLSs. Every NLS added may help to speed up the next experiment!


Methods

Collecting initial set of NLS from literature. We searched about 250 papers and reviews for experimentally determined nuclear localisation signals. Our main criteria for 'accepting' NLS were that the signal was proven sufficient to mediate the nuclear transport of a non-nuclear protein to the nucleus and that deleting the NLS prevented the nuclear import. Technically, some motifs taken at this step comprised simple protein sequences, others regular expressions.

Sets of nuclear and non-nuclear proteins. We retrieved all proteins in SWISS-PROT release 38.0 [14] with annotations of sub-cellular localisation (ignoring PUTATIVE, POTENTIAL, BY SIMILARITY). Finally, we sorted all remaining proteins into two sets: (1) nuclear proteins (true positives, 3142 proteins) and (2) non-nuclear proteins (true negatives, 5910 proteins). Note: the set of nuclear proteins corresponded to 618 structural families [11] .

Extending experimental NLSs through homology. For each experimental NLS-protein, we found homologues in SWISS-PROT with PredictProtein [21] . For pairs with more than 80% identical residues, we extended the initial set of experimental NLSs by adding the sequence corresponding to the experimental NLS in the homologues.
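The homology extension can be sketched on a pairwise alignment given as two gapped strings. This is an illustration only: how identity was computed in PredictProtein is not specified here, so the sketch assumes identity over aligned (non-gap) columns, and the helper names and the toy sequences are hypothetical. The motif GKKRSKA is the experimental example quoted in this paper.

```python
def percent_identity(aln_a, aln_b):
    """Percentage of identical residues over columns where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def transfer_motif(aln_query, aln_homologue, motif):
    """Return the homologue residues aligned to the motif in the query, or None."""
    start = aln_query.replace("-", "").find(motif)
    if start < 0:
        return None
    # Alignment column of every ungapped query residue.
    columns = [i for i, residue in enumerate(aln_query) if residue != "-"]
    first, last = columns[start], columns[start + len(motif) - 1]
    return aln_homologue[first:last + 1].replace("-", "")


# Hypothetical toy alignment around the experimental motif GKKRSKA.
query = "MSGKKRSKAL"
homologue = "MSGKKRTKAL"
if percent_identity(query, homologue) > 80.0:
    print(transfer_motif(query, homologue, "GKKRSKA"))
```

Each sequence added in this way is still only a candidate; it has to pass the test against non-nuclear proteins before it is kept.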

Testing experimental NLSs. We tested the validity of all motifs found in the literature and their homologues by monitoring the matches of any motif in the sets of nuclear and non-nuclear proteins ( Fig. 3 ). The rationale was to find all NLS that matched exclusively in nuclear proteins.




Fig. 3. Scheme for the concept of 'in silico mutagenesis'. We started the search with the hypothetical motif GNKAKRQRST. We searched the data sets of proteins known to be nuclear and proteins known to be non-nuclear for presence of this motif. In this particular example, two nuclear and one non-nuclear protein matched. Requiring 100% accuracy for all motifs, we did not include GNKAKRQRST into our data set of potential motifs. Note: the particular example was one of many failed attempts to generalise an experimental NLS.

 
 



 

In silico mutagenesis. Given the list of sustained NLS motifs (experimental and homologues), we increased the number of potential NLSs by 'in silico mutagenesis': we changed or removed some residues in the given motifs and monitored the resulting true (nuclear) and false (non-nuclear) matches. Obviously, allowing alternative residues at particular positions increased the number of nuclear proteins found. However, often this also increased the number of matching non-nuclear proteins. For example, the experimentally determined motif GKKRSKA was present in two nuclear proteins. We could infer that the amino acid types at the positions of Serine (S) and Alanine (A) were not crucial for the NLS motif since GKKRxK found 11 nuclear proteins. In contrast, the further shortened motif KKRxK matched 105 proteins, only 69% of which were nuclear. Thus, we rejected this generalisation. In general, while trying to increase our coverage by our extended NLS list, we dropped any NLS present in ANY non-nuclear protein, i.e. we required 100% accuracy. Furthermore, we required the motif to be present in at least two distinct protein families. We tried all possible generalisations for the NLS motifs in our initial set through 'educated-guess trial-and-error'. Finally, we compiled the coverage, i.e. the fraction of the known nuclear proteins correctly detected by our final expert database of NLS motifs.
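The acceptance test for a candidate motif can be sketched in a few lines. This is a minimal illustration, not the program behind PredictNLS: it assumes that 'x' in a motif stands for any residue, and the function names, the toy sequences and the family labels are all hypothetical.

```python
import re


def motif_to_regex(motif):
    """Translate motif notation into a regular expression ('x' = any residue)."""
    return re.compile(motif.replace("x", "."))


def matches(motif, proteins):
    """Names of the proteins (name -> sequence) whose sequence contains the motif."""
    pattern = motif_to_regex(motif)
    return [name for name, sequence in proteins.items() if pattern.search(sequence)]


def accept(motif, nuclear, non_nuclear, family_of, min_families=2):
    """Accept a motif only if it hits no non-nuclear protein (100% accuracy)
    and is found in at least min_families distinct nuclear protein families."""
    if matches(motif, non_nuclear):
        return False
    families = {family_of[name] for name in matches(motif, nuclear)}
    return len(families) >= min_families


# Hypothetical toy data sets.
nuclear = {"nucA": "MAGKKRSKAQW", "nucB": "TTGKKRDKLLP"}
non_nuclear = {"cytC": "MPKKRLKEEV"}
family_of = {"nucA": "fam1", "nucB": "fam2"}

for candidate in ("GKKRSKA", "GKKRxK", "KKRxK"):
    print(candidate, accept(candidate, nuclear, non_nuclear, family_of))
```

In this toy example GKKRSKA is too specific (one family), GKKRxK passes, and KKRxK is rejected because it also matches a non-nuclear sequence, mirroring the reasoning in the paragraph above.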

NLS and DNA-binding regions. We explored two ways of testing whether or not NLS motifs overlapped with known DNA-binding sites. Firstly, we looked at proteins for which the NLS and the three-dimensional structure are experimentally known. Towards this end, we investigated 22 examples of proteins of known structure (PDB codes: 1a02, 1an2, 1an4, 1akh, 1au7, 1b8i, 1cdw, 1fos, 1hlo, 1hry, 1hwt, 1lat, 2lef, 1mdy, 1nk2, 1nk3, 1oct, 1pdn, 1pue, 1tgh; 1ftz, 1ign) [22] . Secondly, we compared the DNA-binding regions annotated in SWISS-PROT with the NLS matching in our extended data set (1115 proteins in total).


Acknowledgements

Thanks to Jinfeng Liu (Columbia Univ.) for computer assistance and collection of the genome data sets; to Barry Honig for his valuable comments on DNA-binding, to Amos Bairoch (SIB, Geneva), Rolf Apweiler (EBI, Hinxton) and their crews for maintaining the excellent databases SWISS-PROT and TrEMBL. Last, not least, thanks to all those who enabled this analysis by depositing experimental information about nuclear localisation signals.

References