Chapter 4

Inferring subcellular localization through automated lexical analysis

The SWISS-PROT sequence database contains keywords of functional annotations for many proteins. In contrast, information about the subcellular localization is only available for few proteins. Experts can often infer localization from keywords describing protein function. We developed LOCkey, a fully automated method for lexical analysis of SWISS-PROT keywords that assigns subcellular localization. With the rapid growth in sequence data, the biochemical characterization of sequences has been falling behind. Our method may be a useful tool for supplementing functional information already automatically available. LOCkey reached a level of more than 82% accuracy in a full cross-validation test. Due to a lack of functional annotations, we could infer localization for less than half of all proteins in SWISS-PROT. We applied LOCkey to annotate five entirely sequenced proteomes, namely Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Arabidopsis thaliana (plant) and a subset of all human proteins. LOCkey found about 8,000 new annotations of subcellular localization for these eukaryotes. Localization annotations for eukaryotes can be accessed at: http://cubic.bioc.columbia.edu/services/LOCkey.

4.1 Introduction

Protein sequence-function gap. The number of completely sequenced genomes has been rapidly increasing. Currently, we know the full genomes for over 100 organisms (Adams, Celniker et al. 2000; Frishman 2000; Liu and Rost 2000). This sequence explosion has widened the gap between the number of sequences deposited in public databases and the experimental characterization of the corresponding proteins (Koonin 2000). To bridge this gap, faster and more effective means of creating annotation are required (Baker and Brass 1998; Fleischmann, Moller et al. 1999; Eisenberg, Marcotte et al. 2000). One promising approach is automatic annotation (Gaasterland and Sensen 1996; Apweiler, Gateau et al. 1997). One step towards understanding the function of a protein is elucidating its subcellular localization (Eisenhaber and Bork 1998).

Database annotations of function often very detailed. Protein function may be described best in the context of molecular interactions. The SWISS-PROT database (Bairoch and Apweiler 2000) contains functional annotations predominantly at a very detailed level of biochemical function, e.g. a protein may be annotated as a cdc2 kinase, but not as being involved in intra-cellular communication (Tamames, Ouzounis et al. 1998; Apweiler 2001). We would like to complement these detailed annotations with descriptions in the context of higher-order processes such as the regulation of gene expression, pathways, or signaling cascades (Riley 1993; Riley and Labedan 1997; Bork, Dandekar et al. 1998; Tamames, Ouzounis et al. 1998). Descriptions at this level are available for only a few proteins. To remedy this situation, a number of automatic and semi-automatic tools have been developed for functional annotation of proteins.

Classification of proteins into families of homologues. Many automatic annotations are derived from sequence similarity to proteins of known function (Fleischmann, Adams et al. 1995; Bork and Gibson 1996; Koonin 2000; Devos and Valencia 2001). A typical similarity search starts by aligning the unknown protein U against databases with functional annotations through search tools such as BLAST (Altschul, Gish et al. 1990), FASTA (Pearson and Lipman 1988), or PSI-BLAST (Altschul, Madden et al. 1997). If a homologue H is found that has an annotation and significant sequence similarity to U, the annotation of H is transferred to U. Such inference of function is reliable only if the sequences of U and H are very similar (Devos and Valencia 2001; Rost 2002). Several pitfalls of such transfers of function have been reported, e.g. inadequate knowledge of thresholds for 'significant sequence similarity', using only the best database hit, or ignoring the domain organisation of proteins (Bork and Koonin 1998; Doerks, Bairoch et al. 1998; Galperin and Koonin 2000; Devos and Valencia 2001).

Automatic annotation through functional descriptors. Annotation systems have been based on SWISS-PROT keywords (Tamames, Ouzounis et al. 1998; Eisenhaber and Bork 1999; Fleischmann, Moller et al. 1999; Hofmann, Bucher et al. 1999; Bairoch and Apweiler 2000). The common approach to classifying function is to first extract characteristic keywords for each of the functional classes from a set of proteins classified by experts. Using these keywords, a library of rules is created that associates a certain pattern of keywords to a functional class. Creating the 'rules library' is a difficult task for which a variety of solutions have been proposed. EUCLID (Tamames, Ouzounis et al. 1998) uses SWISS-PROT keywords to classify proteins into 14 classes of cellular function (according to the scheme proposed by Monica Riley (Riley and Labedan 1997; Karp, Riley et al. 1999)). Using a simple voting scheme, the system assigns the unknown sequence to the functional class to which the majority of its keywords belong (Tamames, Ouzounis et al. 1996). A disadvantage of using such dictionaries is that they can only 'discover' simple correlations between the known functional keywords. The method of the Apweiler group (Fleischmann, Moller et al. 1999) annotates function for the TrEMBL (Bairoch and Apweiler 2000) database based on SWISS-PROT keywords and PROSITE motifs (Hofmann, Bucher et al. 1999). The system generates a 'RuleBase' by extracting functional annotations from all SWISS-PROT proteins that contain the same PROSITE motif. If a PROSITE motif is discovered in an un-annotated TrEMBL sequence, the functional annotation is transferred from the 'RuleBase'. Recently (Kretschmann, Fleischmann et al. 2001), the group has implemented the C4.5 data-mining algorithm to automatically generate rules for keyword annotations found in SWISS-PROT.
The rules are based on taxonomy, PROSITE motifs, and PFAM patterns in SWISS-PROT proteins belonging to different InterPro (Apweiler, Attwood et al. 2000) families. The Meta_A annotator (Eisenhaber and Bork 1999) is a partly automatic annotation evaluation system based on a combination of lexical analysis and libraries of expert rules. The rule libraries are derived from scanning the protein names, taxonomy information, commentaries and feature tables in SWISS-PROT. The system assigns one of twelve final subcellular localizations to each protein. Meta_A combines primary attributes with AND, OR and NOT logical operators to create a library of 'biological rules'. The rules relating lexical patterns with functional attributes are created by expert intervention, i.e., resemble a dictionary (Eisenhaber and Bork 1999). Thus, the creation of the rule library is time-consuming and has to be repeated for each new application.
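The majority-vote idea behind dictionary-based classifiers, as described above, can be illustrated with a minimal sketch. The keyword-to-class dictionary and the class names below are hypothetical placeholders chosen for illustration; they are not taken from EUCLID or from any published rule library, and tie-breaking is left to the order in which keywords are seen.

```python
from collections import Counter

# Hypothetical keyword-to-class dictionary (illustration only).
KEYWORD_CLASS = {
    "Kinase": "signalling",
    "Phosphorylation": "signalling",
    "Transcription regulation": "gene expression",
}


def vote(keywords, keyword_class=KEYWORD_CLASS):
    """Return the class supported by most keywords, or None if no keyword is known."""
    votes = Counter(keyword_class[k] for k in keywords if k in keyword_class)
    if not votes:
        return None
    # most_common(1) returns a list with the single (class, count) pair of highest count
    return votes.most_common(1)[0][0]
```

Such a scheme treats every keyword independently, which is why it can only capture simple correlations between keywords and classes.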

Algorithms for text categorization. The problem of automatically extracting rules from SWISS-PROT keywords has parallels to the problem of 'Text Categorization'. Text categorization (TC) is the problem of assigning predefined categories to text documents such as journal articles or abstracts. Many statistical learning methods have been applied to this problem. These include nearest neighbour classifiers (Yang and Pederson 1997), multivariate regression models (Yang and Chute 1992; Schutze, Hull et al. 1995), probabilistic Bayesian models (Lewis and Ringuette 1994) and symbolic rule learning (Apte, Damerau et al. 1994). Examples of M-ary (multiple category) classifiers are the k-Nearest Neighbour (Dasarathy 1991) and the Linear Least-squares Fit (LLSF) (Yang and Liu 1999). Here we describe LOCkey, a novel method that automatically assigns proteins to classes of subcellular localization based on a lexical analysis of SWISS-PROT keywords. LOCkey is based on M-ary classifiers that solve the classification problem accurately when the number of data points (proteins) and the dimensionality of the feature space (number of keywords) are not too large. In contrast to dictionary-based approaches, LOCkey is fully automated and the rule libraries are generated dynamically. Our method may be applied to any database with keywords of functional information and to any task involving higher-level classifications.

4.2 Materials and methods

Implementation of algorithm. Instead of creating an a priori 'rule library', we generated all possible sets of rule libraries for a protein of unknown 'class' from a set of SWISS-PROT keywords. The protein was assigned based on the 'rule library' that solved the classification problem best. We applied the algorithm to infer one of ten classes of subcellular localization (Table 4-1). The algorithm involved two separate steps: (1) build a data set of trusted vectors from proteins of known localization, and (2) classify unknown proteins (Fig. 4-1).

Step 1: Building data set of trusted vectors. First, we compiled a data set with proteins of experimentally known localization. Then, we extracted a list of keywords from SWISS-PROT for this set from the 'keyindex' file. Since only a partial functional annotation was available for a large number of proteins, we merged keywords from homologous sequences to provide as complete an annotation as possible. In particular, we identified all SWISS-PROT sequences with HSSP distances (Eq. 4-1) >15 to sequences in the sequence-unique subset (below) using BLAST (Altschul and Gish 1996; Altschul, Madden et al. 1997) and extracted their keywords. Finally, we built a data set of binary vectors (Salton 1989) for these keywords that represented the presence of a certain keyword by 1 and the absence by 0. To reduce the dimensionality of feature space, we retained only keywords with 'above random' classifying ability based on an entropy (Eq. 4-2) and normalised entropy (Eq. 4-3) cut-off (Schutze, Hull et al. 1995). The accuracy vs. coverage plots were not very sensitive to the particular cut-off chosen (plots given at cubic.columbia.edu/services/LOCkey). We merged these keywords with the keywords found for the corresponding protein from the sequence-unique set. Most proteins had 2-5 keywords (Fig. 4-3A).
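The construction of binary keyword vectors described in Step 1 can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary-based input format, and the example keywords are assumptions, and parsing of the SWISS-PROT 'keyindex' file and the BLAST search are not shown.

```python
def merge_keywords(own_keywords, homologue_keyword_sets):
    """Union of a protein's own keywords with those of its homologues."""
    merged = set(own_keywords)
    for keywords in homologue_keyword_sets:
        merged |= set(keywords)
    return merged


def build_binary_vectors(protein_keywords, vocabulary):
    """Map each protein to a tuple of 0/1 flags, one per keyword in `vocabulary`.

    protein_keywords: dict mapping a protein identifier to its set of keywords.
    vocabulary: ordered list of retained (informative) keywords.
    """
    index = {kw: i for i, kw in enumerate(vocabulary)}
    vectors = {}
    for protein, keywords in protein_keywords.items():
        vec = [0] * len(vocabulary)
        for kw in keywords:
            if kw in index:  # keywords outside the vocabulary are ignored
                vec[index[kw]] = 1
        vectors[protein] = tuple(vec)
    return vectors
```

For example, with the vocabulary `["Kinase", "Transport", "Zinc-finger"]`, a protein carrying the keywords Kinase and Zinc-finger would be represented as `(1, 0, 1)`.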

Fig. 4-1: LOCkey algorithm. First, we compiled a sequence-unique data set of proteins of experimentally known subcellular localization. For these proteins, we then extracted keywords from SWISS-PROT. Next, we merged keywords found in homologues. We represented the keywords found in the proteins of known localization as binary vectors in the 'Trusted Vector Set'. For proteins of unknown localization U, the goal now became to compare these to the ‘Trusted Vector Set’. Toward this end, we first identified all keywords in U and in homologues of U. Then we constructed all possible keyword combinations (SUB vectors), and compared these to the ‘Trusted’ vectors. We found the best matching vector based on entropy criteria (Methods). Finally, we used this ‘best matching vector’ to infer localization for the query.

 

Removing 'trivial' keywords. To evaluate the ability of the system to discover non-trivial correlations between variables, we excluded all keywords from the vector set that were biologically correlated with localization (e.g. 'DNA-binding'). We also excluded keywords that occurred more than 90% of the time in proteins within a single subcellular localization. In total, we excluded 81 keywords from our trusted vector set. After removing these keywords, 176 test proteins were left without any keyword. To minimize the effects of annotation errors, we retained only keywords that occurred in at least 10 protein families.

Step 2: Classifying proteins of unknown localization. To infer the localization of a protein U of unknown localization, we first retrieved all keywords for U from SWISS-PROT that matched in our 'trusted vector set' of informative keywords. Thus, we retrieved a vector V(U) that had the same dimension as the vectors in the 'trusted set'. Next, we generated all possible alternatives to V(U) for which one or more 1's were flipped to 0's. For example, for a protein with 3 keywords, we generated 2^3 - 1 = 7 sub-vectors V'(U): 111, 110, 101, 011, 100, 010 and 001. These sub-vectors constituted all possible keyword combinations for protein U. The final task was to find the keyword combination that yielded the best classification of U into one of ten classes of subcellular localizations, i.e. was most similar to one of the 'trusted vectors'. To achieve this, we retrieved all exact matches of any of the sub-vectors V'(U) to any of the proteins in the trusted vectors, i.e. found all proteins in the trusted set containing one of the keywords found in U. By construction of the sub-vectors, the proteins retrieved in this way may also contain keywords not found in U. Next, we simply counted how often the proteins retrieved belonged to a particular class C(i), i=1 … 10. We repeated this for each of the sub-vectors V'(U), and selected the finally assigned localization by minimizing an entropy-based objective function ('prediction mode').
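The enumeration of keyword combinations and the class counting in Step 2 might look as follows. This sketch represents vectors as keyword sets rather than bit strings, and it assumes that a trusted protein 'matches' a combination when it carries every keyword of that combination (possibly alongside others); both choices, and all function names, are illustrative assumptions rather than the original PERL implementation.

```python
from collections import Counter
from itertools import combinations


def sub_vectors(keywords):
    """Yield every non-empty subset of the query keywords (2**k - 1 subsets)."""
    ordered = sorted(keywords)
    # Largest combinations first, so the full keyword set is tried before single keywords
    for size in range(len(ordered), 0, -1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


def class_counts(combo, trusted):
    """Count localization classes among trusted proteins that carry all keywords in `combo`.

    trusted: list of (frozenset_of_keywords, localization) pairs.
    """
    return Counter(loc for keywords, loc in trusted if combo <= keywords)
```

The number of combinations doubles with every additional keyword (a protein with 25 keywords yields 33,554,431 combinations), so a practical implementation would need to prune the enumeration, for instance by skipping combinations that never occur in the trusted set.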

Data sets. We selected all proteins with unambiguously annotated subcellular localization in SWISS-PROT release 40 (Bairoch and Apweiler 2000). We excluded sequences annotated as "POSSIBLE", "PROBABLE", "SPECIFIC PERIODS" or "BY SIMILARITY", and proteins with multiple annotations of localization. This left 13,589 proteins in the 'Experimental data set' (Table 4-1).

To reduce bias, we built a representative subset of sequence-unique proteins by using a simple greedy algorithm (Hobohm, Scharf et al. 1992). In particular, we accepted only pairs with an HSSP distance below 15 (Sander and Schneider 1994; Rost 1999):

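$$\mathrm{HSSP\ DISTANCE} = \mathrm{PIDE} - \mathrm{HSSP\_PIDE} \qquad \text{(Eq. 4-1)}$$

$$\mathrm{HSSP\_PIDE} = \begin{cases} 100 & \text{for } L \le 11 \\ 480 \cdot L^{-0.32\,\left(1 + e^{-L/1000}\right)} & \text{for } 11 < L \le 450 \\ 19.5 & \text{for } L > 450 \end{cases}$$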

where PIDE is the percentage of pairwise identical residues and L is the alignment length. The final sequence-unique subset of known localization contained 3146 proteins. For entire-proteome predictions, we obtained the sequences with their alignments to SWISS-PROT proteins from http://cubic.bioc.columbia.edu/genomes/ (Liu and Rost 2000).
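The HSSP distance of Eq. 4-1 can be computed with a few lines of code. The sketch below assumes the length-dependent threshold curve in the form published by Rost (1999); the function names are illustrative.

```python
import math


def hssp_pide(length):
    """Length-dependent identity threshold (assumed form of the curve from Rost 1999)."""
    if length <= 11:
        return 100.0
    if length <= 450:
        return 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    return 19.5


def hssp_distance(pide, length):
    """Distance of an alignment from the threshold curve (Eq. 4-1).

    pide: percentage of pairwise identical residues.
    length: alignment length in residues.
    """
    return pide - hssp_pide(length)
```

For an alignment of 100 residues the threshold evaluates to roughly 29% identity, so an HSSP distance of 5 corresponds to about 34% pairwise sequence identity.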

Sorting assignments by keyword entropy. For each of the remaining keywords, we calculated the Shannon Information SI (Shannon 1951) according to:

$$SI = -\sum_{i=1}^{N} P_i \log P_i \qquad \text{(Eq. 4-2)}$$

where N is the number of localization classes (10) and P_i are the probabilities of finding the keyword in one of the 10 classes of localization. Since the Shannon Information does not take into account the background distribution of proteins among the various localizations, we calculated a normalised Shannon Information normSI for each keyword.
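As an illustration of the Shannon Information of a keyword, the sketch below computes the entropy of the distribution of a keyword over localization classes. It assumes the standard Shannon form and base-2 logarithms; the base and the count-based input format are assumptions made for this example.

```python
import math


def shannon_information(class_counts):
    """Entropy (in bits) of a keyword's distribution over localization classes.

    class_counts: number of proteins carrying the keyword in each class.
    """
    total = sum(class_counts)
    if total == 0:
        raise ValueError("keyword does not occur in any class")
    si = 0.0
    for count in class_counts:
        if count > 0:  # classes with zero probability contribute nothing
            p = count / total
            si -= p * math.log2(p)
    return si
```

A keyword found in only one class has SI = 0 and is maximally informative, whereas a keyword spread evenly over all ten classes reaches the maximum of log2(10), about 3.32 bits.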

                                                                                                (Eq. 4-3)

where X_i was the fraction of proteins belonging to a given localization identified by the keyword and M the number of localizations in which the keyword was found. Finally, we defined the percent of fractional change in SI (and normSI) as:

                                                                      (Eq. 4-4)

where maxSI is the maximum possible Shannon Information and maxNormSI the maximum possible normalised Shannon Information. We included only those keywords in our set of trusted vectors that satisfied the criteria: fracSI > 25 and maxNormSI > 25.
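The keyword filter can be sketched in code under one explicit assumption: that the percent fractional change measures how far an entropy value lies below its maximum, i.e. 100 x (max - value) / max. This form is a guess at Eq. 4-4 made for illustration, not a reproduction of it, and the function names are hypothetical.

```python
def percent_fractional_change(value, max_value):
    """Assumed form of Eq. 4-4: percentage by which `value` lies below `max_value`."""
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    return 100.0 * (max_value - value) / max_value


def passes_cutoff(si, norm_si, max_si, max_norm_si, cutoff=25.0):
    """True if both entropy measures lie more than `cutoff` percent below their maxima."""
    return (
        percent_fractional_change(si, max_si) > cutoff
        and percent_fractional_change(norm_si, max_norm_si) > cutoff
    )
```

Under this reading, the cut-off of 25 used for building the trusted vectors retains keywords whose entropies lie more than 25% below the respective maxima, and the stricter cut-off of 70 used in prediction mode would be obtained by calling `passes_cutoff(..., cutoff=70.0)`.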

Inferring localization from keywords (prediction mode). To infer localization for test proteins we identified the keyword combinations that maximised fracSI and maxNormSI (Eq. 4-4). Predictions were made only if at least one keyword combination could be found such that fracSI > 70 and maxNormSI > 70. Additionally, we required that the keyword combination was present in at least five families in the training set.

Evaluating performance accuracy. We evaluated performance by a five-fold cross-validation experiment, i.e. we partitioned the sequence-unique subset into five sets. Then we used four of the five sets to generate the data set of trusted keyword vectors (training set), and inferred (predicted) localization for the remaining fifth set (test set). We repeated this procedure five times such that each of the sets was used for testing once. The final levels of accuracy and coverage were averages over all five tests. The partitioning was performed using a greedy clustering algorithm starting with the largest and longest families (Hobohm, Scharf et al. 1992). An HSSP distance of 5 (Eq. 4-1) was chosen for the clustering. This ensured that any two 100-residue sequences, one from the training set and one from the test set, had fewer than 35% pairwise identical residues.
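The family-aware partitioning into five sets can be sketched as follows. The sketch assumes that the families have already been obtained by clustering at the chosen HSSP distance; it only illustrates how whole families might be distributed so that no family is split between training and test sets. The function names and the 'smallest set first' balancing rule are assumptions.

```python
def assign_folds(families, n_folds=5):
    """Greedily place whole families into folds, largest family first.

    families: list of lists of protein identifiers (one list per family).
    """
    folds = [[] for _ in range(n_folds)]
    for family in sorted(families, key=len, reverse=True):
        smallest = min(folds, key=len)  # keeps fold sizes roughly balanced
        smallest.extend(family)
    return folds


def cross_validation_splits(folds):
    """Yield (training set, test set) pairs; each fold serves as test set once."""
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test
```

Accuracy and coverage would then be computed on each test set in turn and averaged over the five splits.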

4.3 Results

LOCkey yielded high accuracy but low coverage. We tested LOCkey in a five-fold cross-validation experiment (Methods). Some SWISS-PROT keywords are such that all biologists would immediately know the localization of the respective protein (e.g. DNA-binding), others need more expertise. To assess the ability of the system to discover non-trivial correlations between variables, we excluded all keywords predominantly associated with a single localization. This 'filtering-out of the most obvious' required removing 81 keywords from the 'trusted vectors'. For 176 of the proteins in the test set, we found no keywords. When choosing the entropy cut-off such that we could classify one fourth of all proteins (coverage in Fig. 4-2A), our system correctly inferred one of ten classes of subcellular localization for 87% of all proteins (accuracy in Fig. 4-2A).

We noticed that the top two hits contained the observed localization for a similar level of accuracy (87%) at an entropy cut-off at which we assigned localization for about 35% of all proteins (dashed line in Fig. 4-2A). Interestingly, this increased performance when considering the top two hits was mostly due to proteins from the chloroplast that were often confused with mitochondrial proteins. The reasons for the low coverage of the system were manifold. First, some proteins had no keyword (176 of 3146). Second, we required that the keyword pattern was present in at least five proteins in the vector set. Third, the vector set was too small (3146 proteins) to provide a good sample for all proteins. In other words, many keywords found in the testing set were not present in the 'training' set.
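The trade-off between accuracy and coverage at a given entropy cut-off can be made concrete with a short sketch. It assumes that each prediction carries a single reliability score (for instance the smaller of its two fractional entropy changes), that accuracy is the percentage of correct predictions among those made, and that coverage is the percentage of all proteins for which a prediction is made; these definitions and names are assumptions for illustration.

```python
def accuracy_and_coverage(predictions, n_total, cutoff):
    """Return (accuracy, coverage) in percent for predictions scoring above `cutoff`.

    predictions: list of (score, predicted_class, observed_class) triples.
    n_total: number of all test proteins, including those without any prediction.
    """
    kept = [(pred, obs) for score, pred, obs in predictions if score > cutoff]
    coverage = 100.0 * len(kept) / n_total
    if not kept:
        return None, coverage  # accuracy is undefined when nothing is predicted
    correct = sum(1 for pred, obs in kept if pred == obs)
    return 100.0 * correct / len(kept), coverage
```

Scanning the cut-off from high to low values then traces out an accuracy versus coverage curve of the kind shown in Fig. 4-2.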

Fig. 4-2: Results for five-fold cross-validation. (A) Average over 10 classes. Results were obtained for a sequence-unique set. Keywords thought to be biologically correlated with subcellular localization and those observed to occur with high specificity in a single localization were excluded (Methods). The bold line represents accuracy versus coverage for the predicted localization. For example, at 25% coverage the system approaches a prediction accuracy of 87%. Prediction accuracy appeared higher when we considered our findings correct if the correct localization was one of the top two predicted localizations (grey line). (B) Major classes. Nuclear, extra-cellular and mitochondrial classes showed similar accuracy versus coverage statistics. Cytoplasmic and chloroplast proteins were predicted with a much lower accuracy.

 

Performance varied substantially between classes. LOCkey was more successful for some classes than for others. In particular, extra-cellular, nuclear and mitochondrial proteins could be inferred more reliably than the other classes (Fig. 4-2B). On the other hand, performance was worst for proteins from the cytoplasm and chloroplasts (Fig. 4-2B). When we considered our findings correct if any of the top two hits was predicted in the observed localization, we noticed a considerable improvement for most of the major classes (data not shown). The only exceptions were cytoplasmic proteins for which the keywords yielded levels above 75% accuracy at entropy thresholds corresponding to levels of coverage around 10%. The detailed 'confusion matrix' (Table 4-2) revealed that cytoplasmic proteins were most often confused with nuclear proteins, and proteins from the chloroplast were most often assigned incorrectly to mitochondria. Although the minor classes (Golgi, Endoplasmic reticulum, peroxisome, vacuoles, and lysosome) contained too few proteins to allow statistically significant conclusions, we noted that proteins from vacuoles and the lysosome were mostly confused with extra-cellular proteins (Table 4-2).
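A confusion matrix of the kind referred to above can be tallied from pairs of observed and predicted classes. The sketch below is illustrative; the function names are hypothetical and no values from Table 4-2 are reproduced.

```python
from collections import Counter


def confusion_matrix(pairs):
    """Count how often each (observed, predicted) combination occurs."""
    return Counter(pairs)


def most_confused_with(matrix, observed):
    """Return the wrong class most often predicted for `observed`, or None if there are no errors."""
    errors = {
        pred: n
        for (obs, pred), n in matrix.items()
        if obs == observed and pred != observed
    }
    if not errors:
        return None
    return max(errors, key=errors.get)
```

Reading the matrix row by row in this way shows which pairs of compartments account for most of the errors.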

Performance improves with number of keywords. LOCkey inferred the correct localization for almost all proteins for which we had many keywords. In fact, both accuracy and coverage approached 100% for proteins with more than 25 keywords (Fig. 4-3B). For proteins with few keywords, the accuracy decreased with increasing coverage (Fig. 4-3B). The improved coverage with increasing number of keywords was the result of discovering new keyword combinations that meet the entropy criteria (Methods).

Since the keywords were non-specific to any localization class, the initial drop in accuracy was due to a larger fraction of keyword combinations that closely met the entropy criteria. These combinations were more often incorrectly predicted.

Annotating entire proteomes. Using LOCkey, we could provide subcellular localization annotations to 38-55% more proteins than by simple annotation transfer using homology (Table 4-3). To provide independent confirmation for our annotations, we checked the annotations inferred for nuclear proteins for which we found nuclear localization signals (NLS; Cokol, Nair et al. 2000) and for extra-cellular proteins with predicted signal peptides (Nielsen, Engelbrecht et al. 1997). More than 20% of all nuclear proteins identified by LOCkey (~9600) contained known NLSs (Table 4-3). All 3130 extra-cellular proteins identified by LOCkey in human and Arabidopsis and over 60% of those in fly and worm contained signal peptides (Table 4-3).

In contrast, only 2 of the predicted 76 extra-cellular proteins in yeast contained predicted signal peptides. Note that the numbers of proteins with signal peptides or NLSs that were not identified by LOCkey are not relevant, since neither SignalP nor our NLS database find all extra-cellular or nuclear proteins at 100% accuracy. The relevant numbers in this context were the values for accuracy vs. coverage for the cross-validation experiment (Fig. 4-2). Unfortunately, the only way to assess whether or not LOCkey correctly identified nuclear and extra-cellular proteins without motifs is to await the respective experiments.

If we can generalise the levels of accuracy found in the cross-validation experiment, we conclude that LOCkey indeed identified many proteins that could not have been classified reliably by motif-based methods.

LOCkey assignments were often not trivial. Experts can often assign localization from SWISS-PROT keywords. This may be impractical in the context of assigning localization for entire proteomes. However, when analyzing the assignments from LOCkey in more detail (http://cubic.bioc.columbia.edu/services/LOCkey/), we found that often the biologists in our lab could not clearly infer localization from the SWISS-PROT keywords. In other words, LOCkey ‘discovered’ relations that require more specific expertise than we can expect from a typical expert annotator.

4.4 Discussions and conclusions

Functional information absent for majority of proteins. LOCkey automatically provided high quality annotations of subcellular localization from SWISS-PROT keywords. However, for most test proteins, we could not infer localization at all. This low coverage originated from a lack of relevant functional information for many proteins. One solution to this problem could be to extract keywords from bibliography databases such as MEDLINE (Andrade and Valencia 1998).

Extensive shuttling of proteins within the cell. Of all the major localization classes, cytoplasmic proteins were predicted worst (Table 4-2); these were also the major source of error in predicting nuclear and extra-cellular proteins. One reason could be that experimental annotations are less accurate for cytoplasmic proteins. Another reason could be that proteins do in fact shuttle between the cytoplasm and other localizations and that our 'errors' really captured proteins that could also occur in the predicted class. This interpretation was somewhat supported by the finding that LOCkey often found the correct class in the first two hits. In other words, when replacing the binary classification accuracy (a protein can only be in one single localization) by a probabilistic measure (one protein can be in many compartments), LOCkey appeared more accurate.

LOCkey significantly improves coverage for genomes. We applied LOCkey to five (yeast, worm, fly, human, and arabidopsis) entirely sequenced eukaryotic proteomes. We could infer localization for over 8,300 proteins for which localization could not have been detected by any other automatic system. Three types of methods can infer or predict localization in the context of entire proteomes: (1) homology to proteins of known localization, (2) detection of sequence motifs, and (3) prediction from sequence and structure. In our group, we simultaneously work on all these types of methods. LOCkey is most relevant for the coverage achieved by homology-based methods, since it makes it possible to automatically increase the data set of proteins of known localization for which we can apply homology thresholds (Nair & Rost, unpublished).

LOCkey algorithm can help improve fold recognition. The PERL-code (Wall and Schwartz 1990) of LOCkey was optimised to provide fast annotations. Annotating the entire C. elegans proteome took less than four hours on a PIII 900 MHz machine. The algorithm is limited to problems with few data points in the vector set (n<<1000000) and with few keywords (n<<10000). Since the algorithm was not tailored to inferring subcellular localization, we are currently implementing the same idea to recognize distant similarities to proteins of known structure, i.e. to the problem of fold recognition. Our preliminary results are encouraging.

 



This chapter is based on:

1. Nair, R. and B. Rost (2002). "Inferring sub-cellular localization through automated lexical analysis." Bioinformatics 18 Suppl 1: S78-S86.

2. Nair, R. and B. Rost (2004). "Annotating protein function through lexical analysis." AI Magazine 25: 45-56.