Columbia University, Department of Biochemistry and Molecular Biophysics, 630 West 168th Str., New York, NY 10032, USA
Corresponding author: rost@columbia.edu,
http://cubic.bioc.columbia.edu/
Tel: +1-212-305-3773, fax: +1-212-305-7932
A variety of nuclear localisation signals (NLSs) are experimentally known; however, only one motif was available for database searches. We initially collected a set of 91 experimentally verified NLSs from the literature. Through iterated 'in silico mutagenesis' we then extended the set to 214 potential NLSs. This final set matched 43% of all known nuclear proteins and no known non-nuclear protein. We estimated that more than 17% of all eukaryotic proteins may be imported into the nucleus. Finally, we found an overlap between NLS and DNA-binding region for 90% of the proteins for which both NLS and DNA-binding regions were known. Thus, evolution seemed to have used part of the existing DNA-binding mechanism when compartmentalising DNA-binding proteins into the nucleus. However, only 56 of our 214 NLS motifs overlapped with DNA-binding regions. These 56 NLSs enabled a de novo prediction of partial DNA-binding regions for about 800 proteins in human, fly, worm and yeast.
Key words: nuclear localisation signal (NLS); protein sequence analysis; genome analysis; predict cellular localization; predict DNA-binding regions; Drosophila melanogaster; Caenorhabditis elegans; Saccharomyces cerevisiae.
Simplification of nuclear import. A nuclear localisation
signal (NLS) is a short stretch of amino acids that mediates the
transport of nuclear proteins into the nucleus ( Fig. 1 ). NLS motifs
play a key rôle in this mechanism. (1) Typically, deletion
of the NLS disrupts nuclear import. (2) Frequently, a non-nuclear
protein will be imported into the nucleus if fused to an NLS.
Both facts have been used routinely to experimentally unravel
NLS motifs [1, 2] .
Fig. 1. Simplified scheme for nuclear import.
Upon synthesis
of nuclear proteins in the cytoplasm, transport proteins such as the importins
or transportins bind to the NLS. The complex importin/NLS-protein
(or transportin/protein) is then actively transported into the
nucleus through nuclear pores involving the Ran GTPase cycle.
Currently, this is the only known mechanism for nuclear import
[19, 23] .
Variety of NLS motifs. Do experimentally known NLS motifs have a consensus? Positively charged residues are abundant in NLSs, in general, since some of these positive residues bind to e.g. importins [3] . Mutating positive charges is often the simplest way to disrupt nuclear import. However, there are Glycine-rich NLS motifs with few positive charges [4] . Experimentally best described are monopartite and bipartite motifs [5] . Typically, the monopartite motif is characterised by a cluster of basic residues preceded by a helix-breaking residue. Similarly, the bipartite motif consists of two clusters of basic residues separated by 9-12 residues. However, not all experimentally known NLSs comply with the above 'rules' [6, 7, 8] . Furthermore, many non-nuclear proteins match such simplified 'consensus rules'.
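The two simplified 'rules' above can be written as regular expressions. The following is a minimal sketch and not the motif set used in this work: the cluster sizes (four or more basic residues for the monopartite rule, two and three or more for the bipartite rule) and the choice of proline or glycine as helix-breaking residues are illustrative assumptions; only the 9-12 residue spacer is taken from the text. The test sequences are toy examples.

```python
import re

# Illustrative assumptions: cluster sizes and the helix-breaker set are
# hypothetical; only the 9-12 residue spacer comes from the text.
MONOPARTITE = re.compile(r"[PG][KR]{4,}")          # helix breaker + basic cluster
BIPARTITE = re.compile(r"[KR]{2}.{9,12}[KR]{3,}")  # two basic clusters, 9-12 spacer

def consensus_hits(sequence):
    """Return (rule name, start index, matched text) for each consensus match."""
    hits = []
    for name, pattern in (("monopartite", MONOPARTITE), ("bipartite", BIPARTITE)):
        for match in pattern.finditer(sequence):
            hits.append((name, match.start(), match.group()))
    return hits

# Toy sequences invented for illustration.
print(consensus_hits("MSTPKKKRKVEDA"))
print(consensus_hits("AAKRPAATKKAGQAKKKKLD"))
```

Because such patterns are so permissive, a scan of this kind is expected to match non-nuclear proteins as well, which is the limitation noted above.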
Finding an NLS in silico? A wealth of experimental data about NLSs has been accumulated. How can you find a known NLS in your protein? If a standard database search reveals a 'significant similarity' between your protein and a protein of experimentally known and annotated NLS, you can infer the NLS from the homologue. If not, can you find most experimental motifs in PROSITE [9]? The negative answer was the starting point for this work: build an 'expert database' of experimentally known NLSs. Another motivation was the observation that NLSs defined by experiments often appeared too specific. Theoretical generalisations for NLSs have been suggested: 'NLS cores are hexapeptides with at least four basic residues and neither acidic nor bulky residues' [10]. However, this motif matches only a few nuclear and many non-nuclear proteins.
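The quoted hexapeptide rule can be illustrated with a sliding-window scan. This is a sketch under stated assumptions: the text does not define which residues count as 'basic' or 'bulky', so the sets below (K and R as basic; F, W and Y as bulky) are hypothetical choices, and the sequence is a toy example.

```python
BASIC = set("KR")    # assumption: histidine is not counted as basic
ACIDIC = set("DE")
BULKY = set("FWY")   # assumption: the 'bulky' set is not specified in the text

def hexapeptide_cores(sequence):
    """Yield (start, hexapeptide) for every window satisfying the quoted rule."""
    for i in range(len(sequence) - 5):
        window = sequence[i:i + 6]
        n_basic = sum(residue in BASIC for residue in window)
        if n_basic >= 4 and not (set(window) & (ACIDIC | BULKY)):
            yield i, window

# Toy sequence invented for illustration.
print(list(hexapeptide_cores("MSTPKKKRKVEDA")))
```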
Do homologues have similar NLSs? Two naturally evolved proteins with more than 30% identical residues have similar 3D structures [11] . Sequence similarity required to infer function is much higher [12] . Structural thresholds depend on alignment length, e.g., two identical 11-residue-peptides can adopt different structures [13] . NLSs are short stretches of residues. Thus, at which levels of sequence similarity can we infer that two proteins will have a similar NLS? A lack of data prevented us from thoroughly answering this question. However, we found some upper boundaries.
Here, we presented an extended expert database of experimentally
known and potential NLS motifs. We evaluated the validity of the
set by a rigorous test against known nuclear and non-nuclear proteins.
Our method comprised two steps: (1) data collection: collect
experimental NLS motifs from literature, extend motifs through
close homologues, (2) generalisation: refine motifs found by shortening
(too specific) or lengthening (not specific enough), and test
new motifs conceptually similar to known motifs found in many
families of nuclear proteins. The crucial component of both steps
was to accept motifs if NOT found in non-nuclear proteins.
Inferring NLSs based on sequence very limited. We found about 30 protein pairs with more than 80% sequence identity and different annotations (nuclear and cytoplasmic) in our subset of SWISS-PROT (Methods; e.g. the nuclear elongation factor 1-Alpha-2 in mouse and the cytoplasmic transcription elongation factor 1-Alpha in zebrafish had 91% identity over 460 residues). At 50-65% sequence identity, we found many pairs aligned over a substantial length and annotated in different localisations (e.g. 60% nuclear and extracellular: fbrl_rat/ndl_drome; 63% nuclear and mitochondrial: hmgt_mouse/mtt1_human; 51% nuclear and chloroplast: grp1_sinal/ro30_nicpl). Thus, we can infer that a protein is nuclear only if it is almost identical to a known nuclear protein. However, for all the experimental NLSs we extracted, we succeeded in correctly inferring the nuclear localisation from the NLS. Note: this failed for all NLSs from previously published theoretical generalisations [10] .
Raising coverage from 9% to 43%. Before we started, we had three ways to find an NLS in protein A. (1) We could memorise NLSs published and visually detect one (or several) of these in A. Obviously, this requires time and ample expertise. Furthermore, all experimental NLSs covered only 10% of the known nuclear proteins (too specific, Table 1 ). (2) We could automatically detect the NLS in PROSITE [9] . However, this covered only about 3% of all known nuclear proteins, and was not always correct ( Table 1 ).
| Set A | N NLS B | Nprot nuc C | Nfam nuc D | Accuracy E | Coverage F |
| PROSITE | 1 | 96 | 31 | 90% | 3% |
| SWISS-PROT | 322 | 290 | n.a. | | 9% |
| NLS-lit cleaned | 91 | 309 | 35 | 100% | 10% |
| NLS-lit consensus | 91 | 537 | 35 | 100% | 17% |
| PredictNLS_DB | 214 | 1354 | 186 | 100% | 43% |
A: Set: PROSITE: motifs annotated in the PROSITE database of functional motifs [9] ; SWISS-PROT: subset of SWISS-PROT database [14] annotating nuclear localisation signals (note that a few proteins had more than one NLS annotated); NLS-lit cleaned: subset of motifs from literature with 100% accuracy; NLS-lit consensus: motifs refined by consensus of close homologues; PredictNLS_DB: final data set after in silico mutagenesis.
B: N NLS: number of NLS motifs in set.
C: Nprot nuc: number of proteins matching any of the NLS and known to be nuclear.
D: Nfam nuc: number of unique protein families matching any of the NLS and known to be nuclear (Methods: dataset).
E: Accuracy: percentage of nuclear proteins in set of proteins matching any of the NLS.
F: Coverage: percentage of known nuclear proteins (Methods: dataset) matching any of the motifs in the set (total number of known nuclear proteins 3142).
Limitations and error margin of method. Proteins often contain more than one NLS. Thus, our method might fail to propose the functional NLS. Furthermore, a few of our potential NLSs might just be motifs common to nuclear proteins such as DNA-binding motifs. Examples for motifs common to nuclear proteins we found with the motif-detection programs PRATT [15] and the Gibbs-sampler [16] were long repeats of Glycines, Glutamic Acids and Glutamine, and zinc-finger type II motifs. Most importantly, we found possible NLSs in 54 E. coli proteins, only 26 of which could be explained by DNA-binding motifs. Assuming that the remaining 28 comprised errors, we estimated the error margin of our method < 1% (28/4286).
Lessons learned from 'in silico mutagenesis'. (1) As expected, amino acids with similar physico-chemical properties could often be exchanged (Leucine/Isoleucine). (2) Unexpectedly, positive amino acids (Arginine and Lysine) often could NOT be inter-changed. (3) None of the NLSs previously proposed by theory passed our criterion of 100% accuracy. (4) We found that proteins may have similar structure and function and yet may utilise different NLSs. (5) Very peculiar motifs we added to our final list were (A) GGGxGGGxxSSS, e.g. found by generalisation of the M9 domain motif (human RNP A1 protein), and (B) SGxxG{3,}?xG{3,}?xG{3,}?S (any number of more than three consecutive Gs), e.g. found in the transcriptional activator protein of mouse.
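The two glycine-rich motifs can be tried out directly with a regular-expression engine. The sketch below assumes that a lower-case x in the motif notation stands for any residue and that the quantifiers follow standard regular-expression syntax; the helper name `motif_to_regex` and the test sequences are hypothetical. Note that in Python's syntax `G{3,}` matches three or more consecutive glycines.

```python
import re

def motif_to_regex(motif):
    """Translate the motif notation (lower-case x = any residue) into a regex."""
    return re.compile(motif.replace("x", "."))

m9_like = motif_to_regex("GGGxGGGxxSSS")
gly_runs = motif_to_regex("SGxxG{3,}?xG{3,}?xG{3,}?S")

# Toy sequences invented for illustration.
print(bool(m9_like.search("AAGGGYGGGNQSSSAA")))
print(bool(gly_runs.search("SGAAGGGGAGGGAGGGGGS")))
print(bool(m9_like.search("AAGGGYGGNQSSS")))
```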
More than 17% of eukaryotic proteins nuclear. Extrapolating
from the SWISS-PROT coverage, we could estimate a lower-limit
(SWISS-PROT biased towards known NLSs) for the fraction of nuclear
proteins in eukaryotes. We detected potential NLSs in 4187 proteins
from human, fly, yeast and the worm ( Table 2 ). Thus, more than
17% of all eukaryotic proteins appeared to be imported into the
nucleus. All entire genomes investigated had a similar percentage
of nuclear proteins, although they clearly differed in the content
of extra-cellular, helical membrane, and coiled-coil proteins
[17] .
| Genome A | No of ORFs B | No of proteins with NLS C | Estimated content nuclear D |
| Human | 13933 | 1311 | >22% E |
| Drosophila | 14219 | 1256 | >21% |
| C. elegans | 16232 | 1141 | >17% |
| Yeast | 6307 | 479 | >18% |
| E. coli | 4286 | 54 | 0% |
A: Genome: we obtained the incomplete set of human sequences from the latest releases of SWISS-PROT and TrEMBL [14] , and the complete lists of proteins for the genomes of Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, and Escherichia coli from the respective Web sites [26] .
B: No of ORFs: number of open-reading frames (proteins) in entire genome.
C: No of proteins with NLS: number of proteins for which the set PredictNLS_DB found an NLS in that genome.
D: Estimated content nuclear: given that our data set of NLS covers about 43% of all known nuclear proteins (Table 1), we estimated the content of nuclear proteins in the entire genome based on the number of proteins for which we found NLS; supposedly, these estimates provided a lower boundary (Results).
E: Estimated coverage for human: since our current data set for human contains only about 10% of all the proteins expected in the human genome, and since most of these are strongly biased by 'experimental focus', we could not estimate whether or not the coverage for human will be similar for the remaining 90% of all human proteins.
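The estimate of nuclear content can be reproduced with simple arithmetic. The sketch below assumes that the estimate was obtained by dividing the observed fraction of proteins with an NLS by the coverage of the motif set (43%, Table 1); this reading of the legend is an assumption, and the function name is hypothetical. The counts are those given in Table 2.

```python
COVERAGE = 0.43  # fraction of known nuclear proteins matched by PredictNLS_DB (Table 1)

# (number of ORFs, number of proteins with an NLS match), taken from Table 2
GENOMES = {
    "Human": (13933, 1311),
    "Drosophila": (14219, 1256),
    "C. elegans": (16232, 1141),
    "Yeast": (6307, 479),
}

def estimated_nuclear_percent(n_orfs, n_with_nls, coverage=COVERAGE):
    """Scale the observed NLS fraction by the coverage of the motif set."""
    return 100.0 * n_with_nls / n_orfs / coverage

for name, (n_orfs, n_with_nls) in GENOMES.items():
    print(f"{name}: {estimated_nuclear_percent(n_orfs, n_with_nls):.1f}%")

# Total number of eukaryotic proteins with a potential NLS
print(sum(n_with_nls for _, n_with_nls in GENOMES.values()))
```

Rounded up to whole percentages, the values computed this way agree with the entries of Table 2, and the total matches the 4187 proteins quoted in the text.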
About one quarter of NLS motifs co-localised with DNA-binding region. Too few complexes of DNA/protein were solved by X-ray crystallography to conclude that NLS and DNA-binding motifs were co-localised. Instead, we used 1115 proteins with SWISS-PROT annotations about DNA-binding regions; 736 of these had a known NLS (66%), and for 664 the NLS overlapped with the DNA-binding region. Thus, for 90% of all proteins for which we knew both the NLS and the DNA-binding region, both motifs overlapped. For 10% of the proteins, we could establish that the NLS and the DNA-binding region did NOT overlap. Furthermore, the NLS motifs co-localising with DNA-binding constituted about one fourth (56 of 214) of our final NLS set. The very observation that DNA-binding and NLS overlap frequently was not novel. In fact, based on a 20 times larger data set, we verified the original results from [18] . We also corrected their estimate upwards: where they found that 67% of the DNA-binding regions co-localised with the NLS, we found this number to be 90%. In contrast, our results suggested that MOST NLS motifs were NOT used to bind DNA.
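The overlap test and the percentages quoted above can be sketched as follows. The overlap criterion used here (two residue ranges sharing at least one position) is an assumption, since the text does not state how overlap was defined; the function name and the example coordinates are hypothetical. The counts are those given in the text.

```python
def regions_overlap(nls, dna_binding):
    """True if two inclusive (start, end) residue ranges share a residue."""
    return nls[0] <= dna_binding[1] and dna_binding[0] <= nls[1]

# Hypothetical coordinates, for illustration only
print(regions_overlap((140, 156), (135, 160)))
print(regions_overlap((10, 20), (21, 40)))

# Counts quoted in the text
n_dna_annotated = 1115   # proteins with annotated DNA-binding regions
n_with_nls = 736         # of these, proteins with a known NLS
n_overlapping = 664      # of these, NLS overlaps the DNA-binding region

print(f"with NLS:           {100 * n_with_nls / n_dna_annotated:.0f}%")
print(f"overlapping:        {100 * n_overlapping / n_with_nls:.0f}%")
print(f"co-localised motifs: {100 * 56 / 214:.0f}%")
```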
RNA-binding regions typically NOT overlapping with NLS. Contrary to LaCasse and Lefebvre (1995), we found that only 33 of the 99 regions annotated in SWISS-PROT as RNA-binding in nuclear proteins overlapped with an NLS. The difference largely resulted from their definition of 'RNA-binding region' as the entire region between two consecutive RNA-binding sites. In contrast, SWISS-PROT - correctly - annotated only regions experimentally shown to bind RNA.
Structures for DNA-binding and NLS. For 20 of the investigated
22 proteins of known structure, we found the known NLS to overlap
with the DNA-binding region ( Fig. 2 ). The only exceptions were
rap1 from yeast and the segmentation protein fushi tarazu
from fly (PDB codes: 1ign and 1ftz) for which we did not find
the respective NLS in the known DNA-binding regions. However,
these two exceptions did not have any of the 56 NLSs found to
co-localise with DNA-binding. As expected, we found all NLSs on
the protein surface.
Fig. 2. NLS motif also used for DNA-binding.
Zoom into
the interface between DNA and P55-C-fos proto-oncogene protein
(note: the other parts of the amazing crystal structure of the
complex with PDB id 1a02 [24] are not shown). The coloured
region corresponds to the residues RRERNKMAAAKSRNRRR. In fact,
this motif is also contained in our data set of potential NLS
motifs. Colouring scheme: basic residues shown in red; others
in yellow. Graph created with RASMOL [25] .
Speculation about evolution. The co-localisation of NLSs and DNA-binding regions suggested that DNA and shuttle proteins like importins and transportins utilised similar binding residues. Protein-DNA interactions may have preceded the 'invention' of a nucleus used by eukaryotes to compartmentalise all processes involving DNA. How to recognise proteins to import into this compartment? Common to many nuclear proteins are DNA-binding regions. Thus, it seems plausible that fragments of these regions were utilised to manage nuclear import. Consequently, we expect to find importin-like proteins and NLS-like sequences in prokaryotic organisms. In fact, we did find such motifs in E. coli proteins ( Table 2 ); many of these appeared involved in DNA-binding. Obviously, evolution invented other NLS motifs (only 56 of 214 of the NLSs co-localised with DNA-binding) over time. NLSs are often also used to target nuclear export [19] . Could we thus perceive the co-localisation of DNA-binding and NLS as an elegant mechanism to also prevent export for some of the proteins? And did evolution in fact have to invent novel NLS motifs to manage export rather than import? Our data did not falsify such speculations.
De novo prediction of DNA-binding regions. Searching with
the NLS/DNA motifs, we predicted a relatively small number of
DNA-binding proteins in eukaryotes, ranging from 419 in human
to 67 in yeast ( Table 3 ). However, this was 2-9 times higher than
the number of proteins in the respective organism for which SWISS-PROT
annotated DNA-binding or for which we could infer DNA-binding
through homology ( Table 3 ). Thus, we predicted new potential DNA-binding
regions for more than 800 proteins in all four eukaryotes.
| Genome A | Nprot B | Nprot bind-DNA predicted C | Nprot bind-DNA known D |
| Human | 13933 | 419 | 141 |
| Drosophila | 14219 | 300 | 37 |
| C. elegans | 16232 | 251 | 10 |
| Yeast | 6307 | 67 | 10 |
| E. coli | 4286 | 13 | 3 |
A: Genome: see Table 2.
B: Nprot: total number of proteins in entire genome.
C: Nprot bind-DNA predicted: number of proteins for which we predict DNA-binding using NLS motifs.
D: Nprot bind-DNA known: number of proteins for which DNA-binding is annotated, or can be inferred by homology to a protein for which binding is annotated (note: family relations taken from [26] ).
Our data set and method are available at: http://cubic.bioc.columbia.edu/predictNLS.
The program also allows experimentalists to test accuracy and
coverage for new NLS motifs they may find or suspect. This feature
has already helped to experimentally unravel a novel NLS in the
hairless protein [20] . Finally, we added a form enabling experimentalists
to add new NLSs. Every NLS added may help to speed up the next
experiment!
Collecting initial set of NLS from literature. We searched about 250 papers and reviews for experimentally determined nuclear localisation signals. Our main criteria for 'accepting' NLS were that the signal was proven sufficient to mediate the nuclear transport of a non-nuclear protein to the nucleus and that deleting the NLS prevented the nuclear import. Technically, some motifs taken at this step comprised simple protein sequences, others regular expressions.
Sets of nuclear and non-nuclear proteins. We retrieved all proteins in SWISS-PROT release 38.0 [14] with annotations of sub-cellular localisation (ignoring PUTATIVE, POTENTIAL, BY SIMILARITY). Finally, we sorted all remaining proteins into two sets: (1) nuclear proteins (true positives, 3142 proteins) and (2) non-nuclear proteins (true negatives, 5910 proteins). Note: the set of nuclear proteins corresponded to 618 structural families [11] .
Extending experimental NLSs through homology. For each experimental NLS-protein, we found homologues in SWISS-PROT with PredictProtein [21] . For pairs with more than 80% identical residues, we extended the initial set of experimental NLSs by adding the sequence corresponding to the experimental NLS in the homologues.
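The transfer of an NLS to a close homologue can be sketched on a pre-computed pairwise alignment. This does not reproduce the PredictProtein search; it only illustrates the 80% identity cut-off. The definition of identity (identical residues over aligned, non-gap columns), the function names and the toy alignment are all assumptions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over columns without gaps."""
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def transfer_nls(aligned_query, aligned_homologue, nls, threshold=80.0):
    """Return the homologue segment aligned to `nls`, or None if not transferable."""
    if percent_identity(aligned_query, aligned_homologue) <= threshold:
        return None
    start = aligned_query.replace("-", "").find(nls)
    if start < 0:
        return None
    # Map ungapped query positions onto alignment columns
    columns = [i for i, residue in enumerate(aligned_query) if residue != "-"]
    first, last = columns[start], columns[start + len(nls) - 1]
    return aligned_homologue[first:last + 1].replace("-", "")

# Toy alignment invented for illustration (21 of 22 residues identical)
query = "MSTPKKKRKVEDAAAGLLSTQW"
close = "MSTPKKRRKVEDAAAGLLSTQW"
print(transfer_nls(query, close, "PKKKRKV"))
```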
Testing experimental NLSs. We tested the validity of all
motifs found in the literature and their homologues by monitoring
the matches of any motif in the sets of nuclear and non-nuclear
proteins ( Fig. 3 ). The rationale was to find all NLS that matched
exclusively in nuclear proteins.
Fig. 3. Scheme for the concept of 'in silico mutagenesis'.
We started the search with the hypothetical motif GNKAKRQRST.
We searched the data sets of proteins known to be nuclear and
proteins known to be non-nuclear for presence of this motif. In
this particular example, two nuclear and one non-nuclear protein
matched. Requiring 100% accuracy for all motifs, we did not include
GNKAKRQRST into our data set of potential motifs. Note: the particular
example was one of many failed attempts to generalise an experimental
NLS.
In silico mutagenesis. Given the list of sustained NLS motifs (experimental and homologues), we increased the number of potential NLS by 'in silico mutagenesis': we changed or removed some residues in the given motifs and monitored the resulting true (nuclear) and false (non-nuclear) matches. Obviously, allowing alternative residues at particular positions increased the number of nuclear proteins found. However, often this also increased the number of matching non-nuclear proteins. For example, the experimentally determined motif GKKRSKA was present in two nuclear proteins. We could infer that the amino acid types at the positions of Serine (S) and Alanine (A) were not crucial for the NLS motif since GKKRxK found 11 nuclear proteins. In contrast, KKRxK matched 105 proteins, only 69% of which were nuclear. Thus, we rejected this generalisation. In general, while trying to increase our coverage by our extended NLS list, we dropped any NLS present in ANY non-nuclear protein, i.e. we required 100% accuracy. Furthermore, we required the motif to be present in at least two distinct protein families. We tried all possible generalisations for the NLS motifs in our initial set through 'educated-guess trial-and-error'. Finally, we compiled the coverage, i.e. the fraction of the known nuclear proteins correctly detected by our final expert database of NLS motifs.
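The accept/reject test for a candidate motif can be sketched as below. This is a minimal illustration rather than the program behind the database: the function names are hypothetical, a lower-case x is assumed to stand for any residue, the sequences are toy examples, and the requirement of at least two distinct protein families is not modelled.

```python
import re

def evaluate_motif(motif, nuclear, non_nuclear):
    """Count proteins matching a motif (x = any residue) in both sets."""
    pattern = re.compile(motif.replace("x", "."))
    true_hits = sum(bool(pattern.search(seq)) for seq in nuclear)
    false_hits = sum(bool(pattern.search(seq)) for seq in non_nuclear)
    total = true_hits + false_hits
    accuracy = true_hits / total if total else 0.0
    coverage = true_hits / len(nuclear) if nuclear else 0.0
    return true_hits, false_hits, accuracy, coverage

def accept(motif, nuclear, non_nuclear):
    """Accept only motifs matching nuclear proteins and no non-nuclear protein."""
    true_hits, false_hits, _, _ = evaluate_motif(motif, nuclear, non_nuclear)
    return true_hits > 0 and false_hits == 0

# Toy sequences invented for illustration; not real proteins.
nuclear = ["AAGKKRSKAAA", "LLGKKRTKVVV", "MMKKRAKDDD"]
non_nuclear = ["TTKKRQKEEE", "GGGSSSAAA"]

for candidate in ("GKKRSKA", "GKKRxK", "KKRxK"):
    print(candidate, evaluate_motif(candidate, nuclear, non_nuclear),
          accept(candidate, nuclear, non_nuclear))
```

In this toy data the generalisation GKKRxK gains a nuclear match without any false match and is kept, whereas KKRxK also matches a non-nuclear sequence and is rejected, mirroring the reasoning described above.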
NLS and DNA-binding regions. We explored two ways of testing
whether or not NLS motifs overlapped with known DNA-binding sites.
Firstly, we looked at proteins for which the NLS and the three-dimensional
structures are experimentally known. Towards this end, we investigated
22 examples of proteins of known structure (PDB codes: 1a02, 1an2,
1an4, 1akh, 1au7, 1b8i, 1cdw, 1fos, 1hlo, 1hry, 1hwt, 1lat, 2lef,
1mdy, 1nk2, 1nk3, 1oct, 1pdn, 1pue, 1tgh; 1ftz, 1ign) [22] .
Secondly, we compared the DNA-binding regions annotated in SWISS-PROT
with the NLS matching in our extended data set (1115 proteins
in total).
Thanks to Jinfeng Liu (Columbia Univ.) for computer assistance
and collection of the genome data sets; to Barry Honig for his
valuable comments on DNA-binding, to Amos Bairoch (SIB, Geneva),
Rolf Apweiler (EBI, Hinxton) and their crews for maintaining the
excellent databases SWISS-PROT and TrEMBL. Last but not least, thanks
to all those who enabled this analysis by depositing experimental
information about nuclear localisation signals.