Chapter 2
The transport of proteins into the nucleus of a cell is mediated by short stretches of residues called nuclear localization signals (NLSs). A variety of NLSs have been discovered experimentally. However, these experimentally identified NLSs can explain the nuclear transport of fewer than 10% of the known nuclear proteins. Our goal in this work was twofold: (1) to catalogue the experimentally determined NLSs and (2) to discover recurring themes in these NLSs with a view to discovering new NLSs. We initially collected a set of 114 experimentally verified NLSs from the literature. Through iterated 'in silico mutagenesis' we extended this set to 308 experimental and potential NLSs. This final set could account for the nuclear transport of 43% of all known nuclear proteins. Using the newly discovered NLSs we could predict the nuclear localization and targeting mechanism for over 6000 proteins of previously unknown localization from the SWISS-PROT and PDB protein databases. We also predicted over 12500 nuclear proteins from six entirely sequenced eukaryotic proteomes (Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, and Saccharomyces cerevisiae). The NLSs and the predicted nuclear proteins have been incorporated into a database, NLSdb (http://cubic.bioc.columbia.edu/db/NLSdb/), using the sequence retrieval system (SRS) for managing molecular biology databases. We estimated that >23% of all eukaryotic proteins may be imported into the nucleus. We observed that for nearly all nuclear proteins that bind DNA, the nuclear localization signal overlapped with the DNA-binding region. We showed that for 90% of the proteins for which both NLS and DNA-binding regions were known, the NLS and the DNA-binding region showed significant overlap. Thus, evolution seems to have used the existing DNA-binding mechanism when compartmentalising DNA-binding proteins into the nucleus.
Of the 308 NLS motifs, 56 were found to overlap with DNA-binding regions. These 56 NLSs enabled a de novo prediction of partial DNA-binding regions for about 1500 proteins in the six eukaryotic proteomes. We have developed a publicly accessible web server, PredictNLS (http://cubic.bioc.columbia.edu/predictNLS/), to aid discovery and analysis of NLSs in protein sequences.
Simplification of nuclear import. A nuclear localization signal (NLS) is a short stretch of amino acids that mediates the transport of nuclear proteins into the nucleus (Fig. 2-1). NLS motifs play a key rôle in this mechanism. (1) Typically, deletion of the NLS disrupts nuclear import. (2) Frequently, a non-nuclear protein will be imported into the nucleus if fused to an NLS. Both facts have been used routinely to experimentally unravel NLS motifs (Tinland, Koukolikova-Nicola et al. 1992; Moede, Leibiger et al. 1999).
Variety of NLS motifs. Do experimentally known NLS motifs have a consensus? Positively charged residues are abundant in NLSs, in general, since some of these positive residues bind to e.g. importins (Conti, Uy et al. 1998). Mutating positive charges is often the simplest way to disrupt nuclear import. However, there are Glycine-rich NLS motifs with few positive charges (Bonifaci, Moroianu et al. 1997). Experimentally best described are monopartite and bipartite motifs (Boulikas 1993). Typically, the monopartite motif is characterised by a cluster of basic residues preceded by a helix-breaking residue. Similarly, the bipartite motif consists of two clusters of basic residues separated by 9-12 residues. However, not all experimentally known NLSs comply with the above 'rules' (Hsieh, Shimizu et al. 1998; Truant and Cullen 1999; Irie, Yamagata et al. 2000). Furthermore, many non-nuclear proteins match such simplified 'consensus rules'.
Finding an NLS in silico? A wealth of experimental data about NLSs has been accumulated. How can you find a known NLS in your protein? If a standard database search reveals a 'significant similarity' between your protein and a protein of experimentally known and annotated NLS, you can infer the NLS from the homologue. If not, can you find most experimental motifs in PROSITE (Hofmann, Bucher et al. 1999)? The negative answer was the starting point for this work: build an 'expert database' of experimentally known NLSs. Another motivation was the observation that NLSs defined by experiments often appeared too specific. Theoretical generalisations for NLSs have been suggested: 'NLS cores are hexapeptides with at least four basic residues and neither acidic nor bulky residues' (Boulikas 1994). However, this motif matches only a few nuclear and many non-nuclear proteins.
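To make the quoted hexapeptide rule concrete, the following minimal sketch scans a sequence with a six-residue window. It is an illustration only, not the implementation used in this work: the function name is hypothetical, 'basic' is assumed to mean Lys/Arg, 'bulky' is assumed to mean the aromatic residues Phe/Trp/Tyr, and the test sequence is a toy example built around the GKKRSKA motif.

```python
# Assumptions (not specified in the text): which residues count as
# "basic" and "bulky". Adjust these sets to the definition you prefer.
BASIC = set("KR")    # assumed: lysine and arginine
ACIDIC = set("DE")   # aspartate and glutamate
BULKY = set("FWY")   # assumed: aromatic residues


def hexapeptide_cores(sequence, window=6, min_basic=4):
    """Return (start, peptide) for every window that satisfies the rule:
    at least `min_basic` basic residues, no acidic and no bulky residue."""
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        peptide = seq[start:start + window]
        n_basic = sum(residue in BASIC for residue in peptide)
        has_forbidden = bool(set(peptide) & (ACIDIC | BULKY))
        if n_basic >= min_basic and not has_forbidden:
            hits.append((start, peptide))
    return hits


# Toy sequence (hypothetical), zero-based start positions.
print(hexapeptide_cores("MGKKRSKAL"))
```

Because such a rule only counts residue classes in a short window, it is easy to see why it can match many unrelated proteins.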
Fig. 2-1: Simplified scheme for nuclear import. Upon synthesis of nuclear proteins in the cytoplasm, e.g. the family of importins or transportins bind to the NLS. The complex importin/NLS-protein (or transportin/protein) is then actively transported into the nucleus through nuclear pores involving the Ran GTPase cycle. Currently, this is the only known mechanism for nuclear import (Mattaj and Englmeier 1998; Weis 1998).

Do homologues have similar NLSs? Two naturally evolved proteins with more than 30% identical residues have similar 3D structures (Rost 1999). Sequence similarity required to infer function is much higher (Devos and Valencia 2000). Structural thresholds depend on alignment length, e.g., two identical 11-residue-peptides can adopt different structures (Minor and Kim 1996). NLSs are short stretches of residues. Thus, at which levels of sequence similarity can we infer that two proteins will have a similar NLS? A lack of data prevented us from thoroughly answering this question. However, we found some upper boundaries.
Here, we presented an extended expert database of experimentally known and potential NLS motifs. We evaluated the validity of the set by a rigorous test against known nuclear and non-nuclear proteins. Our method comprised two steps: (1) data collection: collect experimental NLS motifs from literature, extend motifs through close homologues, (2) generalisation: refine motifs found by shortening (too specific) or lengthening (not specific enough), and test new motifs conceptually similar to known motifs found in many families of nuclear proteins. The crucial component of both steps was to accept motifs only if NOT found in non-nuclear proteins.
Collecting initial set of NLS from literature. We searched about 250 papers and reviews for experimentally determined nuclear localization signals. Our main criteria for 'accepting' NLS were that the signal was proven sufficient to mediate the nuclear transport of a non-nuclear protein to the nucleus and that deleting the NLS prevented the nuclear import. Technically, some motifs taken at this step comprised simple protein sequences, others regular expressions.
Sets of nuclear and non-nuclear proteins. We retrieved all proteins in SWISS-PROT release 38.0 (Bairoch and Apweiler 1999) with annotations of subcellular localization (ignoring PUTATIVE, POTENTIAL, BY SIMILARITY). Finally, we sorted all remaining proteins into two sets: (1) nuclear proteins (true positives, 3142 proteins) and (2) non-nuclear proteins (true negatives, 5910 proteins). Note: the set of nuclear proteins corresponded to 618 structural families (Rost 1999).
Extending experimental NLSs through homology. For each experimental NLS-protein, we found homologues in SWISS-PROT with PredictProtein (Rost 1996). For pairs with more than 80% identical residues, we extended the initial set of experimental NLSs by adding the sequence corresponding to the experimental NLS in the homologues.
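The homology extension can be sketched as follows, assuming a pairwise alignment is already available as two gapped strings of equal length. The function names are hypothetical, and the identity measure (identical residues over aligned, non-gap columns) is an assumption; the text does not state how identity was computed.

```python
def percent_identity(aln_a, aln_b):
    """Percentage of identical residues over columns without a gap."""
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def transfer_motif(aln_query, aln_homologue, motif, min_identity=80.0):
    """Return the homologue segment aligned to `motif` in the query,
    or None if the pair is not similar enough or the motif is absent."""
    if len(aln_query) != len(aln_homologue):
        raise ValueError("aligned sequences must have equal length")
    if percent_identity(aln_query, aln_homologue) <= min_identity:
        return None
    ungapped = aln_query.replace("-", "")
    start = ungapped.find(motif)
    if start < 0:
        return None
    # Map residue indices of the ungapped query to alignment columns.
    columns = [i for i, c in enumerate(aln_query) if c != "-"]
    first, last = columns[start], columns[start + len(motif) - 1]
    segment = aln_homologue[first:last + 1].replace("-", "")
    return segment or None


# Toy alignment (hypothetical sequences): 12 of 13 residues identical.
print(transfer_motif("MAGKKRSKAPLTT", "MAGKKRTKAPLTT", "GKKRSKA"))
```

The transferred segment is then treated as an additional candidate motif and has to pass the same test against nuclear and non-nuclear proteins as the experimental motifs.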
Testing experimental NLSs. We tested the validity of all motifs found in the literature and their homologues by monitoring the matches of any motif in the sets of nuclear and non-nuclear proteins (Fig. 2-2). The rationale was to find all NLS that matched exclusively in nuclear proteins.
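A minimal sketch of this test is given below. It assumes that motifs are written with a lowercase 'x' for 'any residue' (as in GKKRxK) and that the two protein sets are simple lists of sequences; the function names and the toy sequences are hypothetical.

```python
import re


def motif_to_regex(motif):
    """Treat lowercase 'x' as a wildcard for any single residue."""
    return re.compile(motif.replace("x", "."))


def match_counts(motif, nuclear, non_nuclear):
    """Return (true matches, false matches) for one motif."""
    rx = motif_to_regex(motif)
    n_true = sum(1 for seq in nuclear if rx.search(seq))
    n_false = sum(1 for seq in non_nuclear if rx.search(seq))
    return n_true, n_false


def sustained(motifs, nuclear, non_nuclear):
    """Keep motifs that match nuclear proteins exclusively."""
    kept = []
    for motif in motifs:
        n_true, n_false = match_counts(motif, nuclear, non_nuclear)
        if n_true > 0 and n_false == 0:
            kept.append(motif)
    return kept


# Toy data (hypothetical sequences, not taken from SWISS-PROT).
nuclear = ["MAGKKRSKAPL", "TTGKKRTKLLS"]
non_nuclear = ["MLSKKRGKAVV", "AAAGGGSSSLL"]
print(sustained(["GKKRSKA", "KKRxK"], nuclear, non_nuclear))
```

In this toy example the short pattern KKRxK is dropped because it also occurs in a non-nuclear sequence, which mirrors the acceptance criterion described above.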
In silico mutagenesis. Given the list of sustained NLS motifs (experimental and homologues), we increased the number of potential NLSs by 'in silico mutagenesis': we changed or removed some residues in the given motifs and monitored the resulting true (nuclear) and false (non-nuclear) matches. Obviously, allowing alternative residues at particular positions increased the number of nuclear proteins found. However, often this also increased the number of matching non-nuclear proteins. For example, the experimentally determined motif GKKRSKA was present in two nuclear proteins. We could infer that the amino acid types at the positions of Serine (S) and Alanine (A) were not crucial for the NLS motif since GKKRxK found 11 nuclear proteins. In contrast, the shorter KKRxK matched 105 proteins, only 69% of which were nuclear. Thus, we rejected this generalisation. In general, while trying to increase our coverage by our extended NLS list, we dropped any NLS present in ANY non-nuclear protein, i.e. we required 100% accuracy. Furthermore, we required the motif to be present in at least two distinct protein families. We tried all possible generalisations for the NLS motifs in our initial set through 'educated-guess trial-and-error'. Finally, we compiled the coverage, i.e. the fraction of the known nuclear proteins correctly detected by our final expert database of NLS motifs.
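The generalisation loop can be sketched as below. This is a simplified stand-in for the 'educated-guess trial-and-error' described above, not the procedure actually used: it only replaces up to two positions by a wildcard 'x' and trims wildcards at the ends. All function names, family labels and sequences are hypothetical.

```python
import re
from itertools import combinations


def to_regex(motif):
    """Lowercase 'x' stands for any single residue."""
    return re.compile(motif.replace("x", "."))


def wildcard_variants(motif, max_wild=2):
    """Variants with up to `max_wild` positions replaced by 'x';
    wildcards at either end are removed (i.e. the motif is shortened)."""
    fixed = [i for i, c in enumerate(motif) if c != "x"]
    variants = set()
    for k in range(1, max_wild + 1):
        for combo in combinations(fixed, k):
            chars = list(motif)
            for i in combo:
                chars[i] = "x"
            variant = "".join(chars).strip("x")
            if variant:
                variants.add(variant)
    return variants


def accept(motif, nuclear, non_nuclear, min_families=2):
    """Accept only if no non-nuclear protein matches (100% accuracy)
    and the nuclear matches span at least `min_families` families."""
    rx = to_regex(motif)
    if any(rx.search(seq) for seq in non_nuclear.values()):
        return False
    families = {fam for fam, seq in nuclear.values() if rx.search(seq)}
    return len(families) >= min_families


def coverage(motifs, nuclear):
    """Fraction of nuclear proteins matched by at least one motif."""
    regexes = [to_regex(m) for m in motifs]
    hit = sum(1 for _, seq in nuclear.values()
              if any(rx.search(seq) for rx in regexes))
    return hit / len(nuclear)


# Toy data: nuclear maps id -> (family, sequence); all values hypothetical.
nuclear = {
    "N1": ("famA", "MAGKKRSKAPL"),
    "N2": ("famB", "TTGKKRTKLLS"),
    "N3": ("famC", "QQGKKRAKDDV"),
    "N4": ("famD", "MSSTTLLPPAAV"),
}
non_nuclear = {"C1": "MLSKKRGKAVV"}

accepted = sorted(m for m in wildcard_variants("GKKRSKA")
                  if accept(m, nuclear, non_nuclear))
print(accepted, coverage(accepted, nuclear))
```

In this toy setting GKKRxK is accepted, while KKRxK is rejected because it also matches the non-nuclear sequence, which parallels the example in the text.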
NLS and DNA-binding regions. We explored two ways of testing whether or not NLS motifs overlapped with known DNA-binding sites. Firstly, we looked at proteins for which the NLS and the three-dimensional structures are experimentally known. Towards this end, we investigated 22 examples of proteins of known structure (PDB codes: 1a02, 1an2, 1an4, 1akh, 1au7, 1b8i, 1cdw, 1fos, 1hlo, 1hry, 1hwt, 1lat, 2lef, 1mdy, 1nk2, 1nk3, 1oct, 1pdn, 1pue, 1tgh; 1ftz, 1ign) (Berman, Westbrook et al. 2000). Secondly, we compared the DNA-binding regions annotated in SWISS-PROT with the NLS matching in our extended data set (1115 proteins in total).
SRS interface of NLSdb. The NLSs, the predicted nuclear proteins and the DNA-binding proteins are stored in a flat-file database for easy data access. NLSdb stores and manages data using the portal of the Sequence Retrieval System (SRS) (Etzold, Ulyanov et al. 1996). SRS provides a convenient and robust framework for managing molecular databases. This provides users with quick, efficient search, retrieval and display methods that work for any web browser. Using SRS, the information in NLSdb can be easily integrated with other public and proprietary databases. The database is continuously updated and refined from the primary literature.
Format and fields. NLSdb has been formatted in an EMBL-like flat-file format, thus allowing indexing of the database in SRS (Etzold, Ulyanov et al. 1996). Each NLSdb entry describes a nuclear localization signal. Each entry is organised into six major fields: (I) Origin, (II) Annotation, (III) Reference, (IV) Confidence, (V) Proteins and (VI) DNAbinding. The ‘Origin’ field describes whether the NLS has been found by direct experiments, or if it is a potential NLS discovered through our ‘in silico mutagenesis’. For experimentally determined NLSs, further information is provided in the fields ‘Annotation’ and ‘Reference’. The ‘Annotation’ field describes the protein family in which the experimental NLS was first established, and the ‘Reference’ field gives the primary literature citation. The ‘Reference’ field also contains a link to PubMed for each citation. The ‘Confidence’ field is an indicator of our confidence in the NLS; it consists of two sub-fields: ‘Total confidence’ and ‘% Nuclear’. ‘Total confidence’ is the number of localization-annotated proteins from SWISS-PROT in which this NLS is found and ‘% Nuclear’ is the percentage of these that are annotated as nuclear in SWISS-PROT. The ‘Proteins’ field lists proteins from various databases that are likely to be targeted to the nucleus since they match the given NLS motif. Currently the ‘Proteins’ field contains proteins from the SWISS-PROT, PDB and the PEP (Carter, Liu et al. 2003) databases. All protein entries are linked to the original entries in the respective databases. The ‘DNAbinding’ field describes whether the NLS overlaps with known DNA-binding regions of proteins. NLSdb can be browsed either starting with the NLS entries or with any of the data-fields defined above.
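As a rough illustration of how the six fields relate to each other, an entry could be modelled as in the sketch below. This is not the NLSdb flat-file format itself (the line codes of the EMBL-like format are not given here); the class and attribute names are hypothetical, and the example values are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class NLSEntry:
    """Hypothetical in-memory model of one NLSdb entry."""
    motif: str
    origin: str                      # "experimental" or "potential"
    annotation: str = ""             # protein family (experimental NLSs)
    reference: str = ""              # literature citation (experimental NLSs)
    n_nuclear: int = 0               # matched proteins annotated nuclear
    n_non_nuclear: int = 0           # matched proteins annotated elsewhere
    proteins: list = field(default_factory=list)
    dna_binding: bool = False

    @property
    def total_confidence(self):
        """Number of localization-annotated proteins matching the motif."""
        return self.n_nuclear + self.n_non_nuclear

    @property
    def percent_nuclear(self):
        """Percentage of the matched proteins annotated as nuclear."""
        total = self.total_confidence
        return 100.0 * self.n_nuclear / total if total else 0.0


# Illustrative values: a motif matching 11 nuclear and no other proteins.
entry = NLSEntry(motif="GKKRxK", origin="potential", n_nuclear=11)
print(entry.total_confidence, entry.percent_nuclear)
```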
Searching the NLS database. All data-fields in NLSdb can be searched using standard Boolean queries. Proteins in NLSdb can be identified through their SWISS-PROT, PDB or PEP identifiers. NLS motifs can be queried by providing a string of one-letter amino acid codes. Database entries can be downloaded using the save ‘Complete entries’ functionality in SRS.
2.3 Results and discussions
Inferring NLSs based on sequence very limited. We found about 30 protein pairs with more than 80% sequence identity and different annotations (nuclear and cytoplasmic) in our subset of SWISS-PROT (Methods; e.g. the nuclear elongation factor 1-Alpha-2 in mouse and the cytoplasmic transcription elongation factor 1-Alpha in Zebra fish had 91% identity over 460 residues). At 50-65% sequence identity, we found many pairs aligned over a substantial length, and annotated in different localizations (e.g. 60% nuclear and extracellular: fbrl_rat/ndl_drome; 63% nuclear and mitochondrial: hmgt_mouse/mtt1_human; 51% nuclear and chloroplast: grp1_sinal/ro30_nicpl). Thus, we can infer that a protein is nuclear only if it is almost identical to a known nuclear protein. However, for all the experimental NLSs we extracted, we succeeded in correctly inferring the nuclear localization knowing the NLS. Note, this failed for all NLSs from previously published theoretical generalisations (Boulikas 1994).
Raising coverage from 9% to 43%. Before we started, we had three ways to find an NLS in protein A. (1) We could memorise NLSs published and visually detect one (or several) of these in A. Obviously, this requires time and ample expertise. Furthermore, all experimental NLSs covered only 10% of the known nuclear proteins (too specific, Table 2-1). (2) We could automatically detect the NLS in PROSITE (Hofmann, Bucher et al. 1999). However, this covered only about 3% of all known proteins, and was not always correct (Table 2-1). (3) We could find a significant level of sequence similarity to a protein for which the NLS was annotated in SWISS-PROT (Bairoch and Apweiler 1999). This covered about 9% of all known nuclear proteins (Table 2-1). Furthermore, standard database searches starting with the proteins known to be nuclear yielded less than 25% of the known nuclear families at a generous BLAST cut-off of 10^-3. In contrast, our final expert set of potential NLSs matched 43% of all nuclear proteins without any false positive (Table 2-1).

Limitations and error margin of method. Proteins often contain more than one NLS. Thus, our method might fail to propose the functional NLS. Furthermore, a few of our potential NLSs might just be motifs common to nuclear proteins such as DNA-binding motifs. Examples of motifs common to nuclear proteins that we found with the motif-detection programs PRATT (Jonassen 1997) and the Gibbs-sampler (Hertz and Stormo 1999) were long repeats of Glycines, Glutamic Acids and Glutamine, and zinc-finger type II motifs. Most importantly, we found possible NLSs in 54 E. coli proteins, only 26 of which could be explained by DNA-binding motifs. Assuming that the remaining 28 comprised errors, we estimated the error margin of our method to be < 1% (28/4286).
Lessons learned from 'in silico mutagenesis'. (1) As expected, amino acids with similar physico-chemical properties could often be exchanged (Leucine/Isoleucine). (2) Unexpectedly, positive amino acids (Arginine and Lysine) often could NOT be inter-changed. (3) None of the NLSs previously proposed by theory passed our criterion of 100% accuracy. (4) We found that proteins may have similar structure and function and yet may utilise different NLSs. (5) Very peculiar motifs we added to our final list were (A) GGGxGGGxxSSS, e.g. found by generalisation of the M9 domain motif (human RNP A1 protein), and (B) SGxxG
More than 23% of eukaryotic proteins nuclear. Extrapolating from the SWISS-PROT coverage, we could estimate a lower-limit (SWISS-PROT biased towards known NLSs) for the fraction of nuclear proteins in eukaryotes. We detected potential NLSs in 12500 proteins from human, mouse, plant, fly, yeast and the worm (Table 2-2). Thus, more than 23% of all eukaryotic proteins appeared to be imported into the nucleus. All entire genomes investigated had a similar percentage of nuclear proteins, although they clearly differed in the content of extra-cellular, helical membrane, and coiled-coil proteins (Liu WWW and Rost 2000).
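The extrapolation can be reproduced approximately from the counts listed in Table 2-2. The sketch below assumes, as stated in the table notes, that the NLS set covers about 43% of known nuclear proteins, so that the observed fraction of NLS-containing proteins is divided by 0.43; the function and variable names are hypothetical.

```python
COVERAGE = 0.43  # fraction of known nuclear proteins matched by the NLS set

# Genome: (number of ORFs, number of proteins with an NLS), from Table 2-2.
GENOMES = {
    "Human": (31073, 4122),
    "Mouse": (28096, 2538),
    "A. thaliana": (25456, 2390),
    "Drosophila": (14184, 1274),
    "C. elegans": (18898, 1915),
    "Yeast": (6306, 482),
}


def estimated_nuclear_fraction(n_orfs, n_with_nls, coverage=COVERAGE):
    """Scale the observed NLS fraction by the coverage of the NLS set."""
    return (n_with_nls / n_orfs) / coverage


for name, (n_orfs, n_with_nls) in GENOMES.items():
    estimate = estimated_nuclear_fraction(n_orfs, n_with_nls)
    print(f"{name:12s} {100 * estimate:5.1f}%")

# Pooled estimate over all six eukaryotic proteomes.
total_orfs = sum(n for n, _ in GENOMES.values())
total_nls = sum(k for _, k in GENOMES.values())
pooled = estimated_nuclear_fraction(total_orfs, total_nls)
print(f"{'Pooled':12s} {100 * pooled:5.1f}%")
```

Under these assumptions the pooled value comes out just under 24%, in line with the estimate of more than 23% given above; since the coverage on SWISS-PROT is likely to be an overestimate for whole genomes, the figures are best read as lower limits.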
20% of NLS motifs co-localized with DNA-binding region. Too few complexes of DNA/protein were solved by X-ray crystallography to conclude that NLS and DNA-binding motifs were co-localised. Instead, we used 1115 proteins with SWISS-PROT annotations about DNA-binding regions; 736 of these had a known NLS (66%), and for 664 the NLS overlapped with the DNA-binding region. Thus, for 90% of all proteins, for which we knew both the NLS and the DNA-binding region, both motifs overlapped.
For 10% of the proteins, we could establish that the NLS and the DNA-binding region did NOT overlap. Furthermore, the NLS motifs co-localising with DNA-binding constituted about one fourth (56 of 214) of our final NLS set. The very observation that DNA-binding and NLS overlap frequently was not novel. In fact, based on a 20 times larger data set, we verified the original results of LaCasse and Lefebvre (1995). We also corrected their estimate upwards: where they found that 67% of the DNA-binding regions co-localised with the NLS, we found this number to be 90%. In contrast, our results suggested that MOST NLS motifs were NOT used to bind DNA.
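The overlap statistics can be sketched as follows. The criterion for 'overlap' is an assumption here (at least one shared residue between two inclusive residue ranges), and the function names and the toy ranges are hypothetical; the counts 1115, 736 and 664 are those given in the text.

```python
def ranges_overlap(a_start, a_end, b_start, b_end):
    """True if two inclusive residue ranges share at least one position."""
    return a_start <= b_end and b_start <= a_end


def fraction_overlapping(proteins):
    """`proteins` is a list of (nls_range, dna_range) tuples, each range
    given as (start, end); returns the fraction with overlapping ranges."""
    if not proteins:
        return 0.0
    n_overlap = sum(1 for nls, dna in proteins
                    if ranges_overlap(nls[0], nls[1], dna[0], dna[1]))
    return n_overlap / len(proteins)


# Toy ranges (hypothetical): two overlapping pairs, one disjoint pair.
toy = [((10, 26), (5, 60)), ((100, 106), (104, 150)), ((3, 9), (40, 90))]
print(fraction_overlapping(toy))

# Percentages quoted in the text, recomputed from the raw counts.
print(round(100 * 736 / 1115))  # proteins with a known NLS
print(round(100 * 664 / 736))   # of these, NLS overlapping DNA-binding
```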
RNA-binding regions typically NOT overlapping with NLS. Contrary to LaCasse and Lefebvre (1995), we found that only 33 of the 99 regions annotated in SWISS-PROT as RNA-binding in nuclear proteins overlapped with an NLS. The difference largely resulted from their definition of 'RNA-binding region' as the entire region between two consecutive RNA-binding sites. In contrast, SWISS-PROT correctly annotated only regions experimentally shown to bind RNA.
Table 2-2: Nuclear proteins in seven entirely sequenced proteomes.

Genome (A)     No of ORFs (B)   No of proteins with NLS (C)   Estimated content nuclear (D)
Human          31073            4122                          > 30% (F)
Mouse          28096            2538                          > 21%
A. thaliana    25456            2390                          > 22%
Drosophila     14184            1274                          > 21%
C. elegans     18898            1915                          > 23%
Yeast          6306             482                           > 18%
E. coli        4286             54                            0%

(A) Genome: we obtained the incomplete set of human and mouse sequences from the latest releases of SWISS-PROT and TrEMBL (Bairoch and Apweiler 1999), and the complete lists of proteins for the genomes of Arabidopsis thaliana, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, and Escherichia coli from the respective Web sites (Liu WWW and Rost 2000); (B) No of ORFs: number of open-reading frames (proteins) in entire genome; (C) No of proteins with NLS: number of proteins for which the set PredictNLS_DB found an NLS in that genome; (D) Estimated content nuclear: given that our data set of NLS covers about 43% of all known nuclear proteins (Table 2-1), we estimated the content of nuclear proteins in the entire genome based on the number of proteins for which we found NLS; supposedly, these estimates provided a lower boundary (Results).

Structures for DNA-binding and NLS. For 20 of the investigated 22 proteins of known structure, we found the known NLS to overlap with the DNA-binding region (Fig. 2-3). The only exceptions were rap1 from yeast and the segmentation protein fushi tarazu from fly (PDB codes: 1ign and 1ftz) for which we did not find the respective NLS in the known DNA-binding regions. However, these two exceptions did not have any of the 56 NLSs found to co-localise with DNA-binding. As expected, we found all NLSs on the protein surface.
Speculation about evolution. The co-localization of NLSs and DNA-binding regions suggested that DNA and shuttle proteins like importins and transportins utilised similar binding residues. Protein-DNA interactions may have preceded the 'invention' of a nucleus used by eukaryotes to compartmentalise all processes involving DNA. How to recognise proteins to import into this compartment? Common to many nuclear proteins are DNA-binding regions. Thus, it seems likely that fragments of these regions were utilised to manage nuclear import. Consequently, we expect to find importin-like proteins and NLS-like sequences in prokaryotic organisms. In fact, we did find such motifs in E. coli proteins (Table 2-2); many of these appeared involved in DNA-binding. Obviously, evolution invented other NLS motifs (only 56 of 214 of the NLSs co-localized with DNA-binding) over time. NLSs are often also used to target nuclear export (Mattaj and Englmeier 1998). Could we thus perceive the co-localization of DNA-binding and NLS as an elegant mechanism to also prevent export for some of the proteins? And did evolution in fact have to invent novel NLS motifs to manage export rather than import? Our data did not falsify such speculations.

Fig. 2-3: NLS motif also used for DNA-binding. Zoom into the interface between DNA and P55-C-fos proto-oncogene protein (note: the other parts of the amazing crystal structure of the complex with PDB id 1a02 (Chen, Glover et al. 1998) are not shown). The coloured region corresponds to the residues RRERNKMAAAKSRNRRR. In fact, this motif is also contained in our data set of potential NLS motifs. Colouring scheme used: basic residues shown in red; others in yellow. Graph created with RASMOL (Sayle and Milner-White 1995).

De novo prediction of DNA-binding regions. Searching with the NLS/DNA motifs, we predicted a relatively small number of DNA-binding proteins in eukaryotes, ranging from 419 in human to 67 in yeast (Table 2-3). However, this was 2-9 times higher than the number of proteins in the respective organism for which SWISS-PROT annotated DNA-binding or for which we could infer DNA-binding through homology (Table 2-3). Thus, we predicted new potential DNA-binding regions for more than 800 proteins in all four eukaryotes.
2.3.3 Availability of data set and program
Our data set and method are available at: http://cubic.bioc.columbia.edu/predictNLS. The program also allows experimentalists to test accuracy and coverage for new NLS motifs they may find or suspect. This feature has already helped to experimentally unravel a novel NLS in the hairless protein (Djabali, Aita et al. 2001). Finally, we added a form enabling experimentalists to add new NLSs. Every NLS added may help to speed up the next experiment!
† This chapter is based on:
1. Cokol, M., R. Nair, et al. (2000). "Finding nuclear localization signals." EMBO Rep 1(5): 411-5.
2. Nair, R., P. Carter, et al. (2003). "NLSdb: database of nuclear localization signals." Nucleic Acids Res 31(1): 397-9.