Chapter 6

Predicting subcellular localization based on functional hierarchies

Predicting the native subcellular compartment of a protein is an important step towards elucidating its function. Here, we describe a novel support vector machine (SVM) based localization predictor, LOCtree, incorporating a hierarchical ontology of subcellular localization classes which uses as input a number of observed and predicted features of the amino acid sequence. By mimicking the biological trafficking mechanism as closely as possible using a machine learning architecture, LOCtree very accurately predicts the subcellular localization of proteins. LOCtree achieved an accuracy of 74% for eukaryotic non-plant, 70% for plant and 84% for prokaryotic sequences on a non-redundant test set. We rigorously benchmarked the performance of LOCtree vis-à-vis a number of publicly available methods for localization prediction. LOCtree outperformed all publicly available servers in nearly all benchmarks. Localization assignments using LOCtree agreed quite well with data from recent large-scale experiments. From our analysis of the Homo sapiens, Saccharomyces cerevisiae and Arabidopsis thaliana genomes using LOCtree we concluded that over 35% of all non-membrane proteins are nuclear, around 20% retained in the cytosol, over 10% mitochondrial and around 20% of plant proteins are chloroplastic. The prediction server and the results of our genome predictions are available at http://cubic.bioc.columbia.edu/services/LOCtree/.

6.1 Introduction

Computational prediction of subcellular localization indispensable. The genome (DNA) sequences of more than 100 organisms, including a draft sequence of the human genome (Venter, Adams et al. 2001; Istrail, Sutton et al. 2004), have now been completed. For over 60 of these, the data is publicly available and contributes about 250K protein sequences, i.e. about one fourth of all currently known protein sequences (Liu and Rost 2001; Carter, Liu et al. 2003; Pruess, Fleischmann et al. 2003). With this explosion of genome sequences, a major challenge in modern biology is to understand the expression, function, and regulation of the entire set of proteins encoded by an organism. This information will be invaluable for understanding how complex biological processes occur at a molecular level, how they differ in various cell types, and how they are altered in disease states (Zhu, Bilgin et al. 2003). Proteins must be localized in the same subcellular compartment to cooperate towards a common function. Therefore, experimentally unravelling the native compartment of a protein constitutes one step on the long way to determining its role (Bork, Ouzounis et al. 1994; Bork, Dandekar et al. 1998). Using experimental high-throughput methods for epitope and green fusion protein (GFP) tagging, two groups have recently reported localization data for most proteins in Saccharomyces cerevisiae (bakers yeast) (Kumar, Agarwal et al. 2002; Huh, Falvo et al. 2003). So far, the majority of large-scale experiments suggesting localization have been restricted to yeast, or to particular compartments, such as a recent analysis of chloroplast proteins in Arabidopsis thaliana (weed) (Kleffmann, Russenberger et al. 2004). As of now, these large-scale experiments cannot be repeated for mammalian or other higher eukaryotic proteomes. One significant obstacle is that large-scale production of a collection of cell lines each with a defined gene chromosomally tagged at the 3′-end is not yet possible (Davis 2004). In contrast, computational tools can provide fast and accurate localization predictions for any organism (Bork and Koonin 1998; Koonin 2000). In fact, attempts to predict subcellular localization has become one of the central problems in bioinformatics (Eisenhaber and Bork 1998; Nakai 2000; Schneider and Fechner 2004).

Bioinformatics partially successful in predicting localization from sequence. A number of methods have tried to predict localization by identifying local sequence motifs, such as signal peptides(von Heijne 1995; Nielsen, Engelbrecht et al. 1997; Nakai and Horton 1999; Emanuelsson and von Heijne 2001) or nuclear localization signals (NLS) (Cokol, Nair et al. 2000; Nakai 2000) that are responsible for protein targeting. Proteins destined for the secretory pathway, the mitochondria and the chloroplast contain N-terminal targeting peptides that are recognised by the translocation machinery (Rusch and Kendall 1995; Schatz and Dobberstein 1996). Thus, prediction methods use only the N-terminal residues (Nielsen, Engelbrecht et al. 1997; Emanuelsson, Nielsen et al. 2000). Many proteins destined for the nucleus contain NLS motifs that may occur anywhere in the sequence. Recently, we have collected a data set of experimental and potential NLS motifs as an aid to predicting nuclear localization (Nair, Carter et al. 2003). However, the vast majority of proteins have no known motif. For mitochondrial and chloroplast proteins, a number of alternative targeting pathways have recently been discovered (Folsch, Guiard et al. 1996; Herrmann and Neupert 2000). Furthermore, a particular problem for methods detecting N-terminal signals is that start-codons are predicted with less than 70% accuracy by genome projects (Reinhardt and Hubbard 1998; Lander, Linton et al. 2001; Venter, Adams et al. 2001). Overall, known and predicted sequence motifs enable annotating about 30% of the proteins in six entirely sequenced eukaryotic proteomes (Liu and Rost 2001; Mott, Schultz et al. 2002). Other methods that can be reliably used to annotate localization but are not always applicable are annotation transfer from sequence homologues (Nair and Rost 2002) and text analysis (Fleischmann, Moller et al. 1999; Kretschmann, Fleischmann et al. 2001; Nair and Rost 2002; Lu, Szafron et al. 2004).

Another promising approach which has the advantage of high coverage has been suggested by the observation that the overall amino acid composition of a protein correlates with its native compartment (Nishikawa, Kubota et al. 1983; Nakashima and Nishikawa 1994). This observation has led to the development of a variety of ab initio prediction methods based solely on composition (Reinhardt and Hubbard 1998; Chou and Elrod 1999; Hua and Sun 2001; Nair and Rost 2003). Higher order correlations (residues i and (i+n), n=2,3,4) have been accounted for by using pseudo-amino acid composition (Cai, Liu et al. 2002; Chou and Cai 2003; Pan, Zhang et al. 2003). Recently, we showed that incorporating structural and evolutionary information significantly improves prediction accuracy (Nair and Rost 2003). Using a domain projection method based on the presence of SMART domains, Mott et al. (Mott, Schultz et al. 2002) were able to accurately predict localization for approximately 25% of eukaryotic proteins. With the availability of many completely sequenced genomes, phylogenetic profiles have been employed to identify subcellular localization (Marcotte, Xenarios et al. 2000). So far, this approach has been much less accurate than methods based solely on composition. Drawid & Gerstein have proposed a Bayesian system, based on a diverse range of 30 different features, to predict the localization of yeast proteins (Drawid and Gerstein 2000).

Significant shortcomings in traditional approach to ab initio prediction of subcellular localization. The traditional approach to ab initio prediction of subcellular localization, or any other functional class, has been to first define the localization classes of interest and then train machine learning algorithms like neural networks or SVM’s to discriminate proteins belonging to each of the classes from the rest of the proteins (Reinhardt and Hubbard 1998; Emanuelsson, Nielsen et al. 2000; Hua and Sun 2001; Nair and Rost 2003). Finally, these dedicated networks for the different classes are combined to predict the localization for an unknown protein. However, in such a prediction scheme no consideration is given to the fact that some classes of proteins are more similar to each other than to other classes due to evolutionary factors. We propose that incorporating a hierarchical ontology of evolutionary similarities into a machine learning framework should lead to better utilization of the underlying biological mechanism of targeting and improved prediction accuracy of localization classes.

LOCtree: Ab initio prediction of subcellular localization based on hierarchical ontologies. Here, we describe a novel SVM based localization predictor, LOCtree, incorporating a hierarchical ontology of subcellular localization classes which uses as input a number of observed and predicted features of the amino acid sequence. The most important factor for the development of such a predictor is an appropriate definition of the hierarchical ontology, one that is most suitable for the problem being addressed. Biological similarities among the various subcellular classes have been incorporated into a standardized framework for the description of cellular components by the gene ontology (GO) consortium (Ashburner, Ball et al. 2000; Lewis, Ashburner et al. 2000). We used a modified version of the GO definition for cellular classes, one that was more suitable for addressing the problem of protein sorting. For example, in the GO classification scheme endoplasmic reticulum and golgi apparatus are subcategories of the cytoplasm since they are parts of the latter. However, proteins destined for the extra-cellular space, endoplasmic reticulum (ER), golgi apparatus, endosomes and lysosomes are targeted via the ER pathway and are more similar to each other than they are to other intra-cellular proteins (Rusch and Kendall 1995). This is so since, due to evolutionary reasons, these organelles are topologically equivalent to the exterior of the cell. Hence, in our classification scheme these compartments are grouped together and are designated as belonging to the secretory pathway. To predict localization, LOCtree uses a decision tree like architecture derived from gene ontology and from evolutionary similarities in the protein sorting mechanism (Fig.6-1). The decision nodes in the tree are implemented using SVM’s. We introduced methodologies for training and testing such architecture. A decision tree like architecture for combining the SVM’s was observed to give the best accuracy. LOCtree was extremely successful at learning evolutionary similarities among subcellular localization classes and was significantly more accurate than other traditional networks at predicting subcellular localization. This architecture has the added advantage that at each node in the hierarchical tree only the input features that are important for the classification task at that node need to be used as input to the SVM. It has been noticed by other groups that using input features selectively in this fashion remarkably improves prediction accuracy (Jensen, Gupta et al. 2002). Another major problem with traditional localization prediction systems is that to add a new prediction module for a subcellular class; the module has to be integrated with all the other dedicated networks predicting the other subcellular classes.

This integration can be fairly cumbersome and affects the prediction results for all subcellular classes. The advantage of implementing a hierarchical ontology is the ease with which independently developed specialized classifiers can be added to this framework. Any new classifier added to a specific node in the architecture only affects prediction accuracy for the children of that node. To predict subcellular localization, LOCtree uses a number of observed and predicted features of the amino acid sequence. Observed features used included overall amino acid composition and composition of the 50 N-terminal residues. Evolutionary information was incorporated in the form of sequence profiles (Nair and Rost 2003). Predicted features used included, the amino acid composition separated into three secondary structure states (helix, strand, other) and output from the SignalP server (Nielsen, Engelbrecht et al. 1997). The three state secondary structures of proteins were predicted using PROFphd (Rost 1996; Rost and Liu 2003). Only the SVM at the top of the hierarchy uses the raw SignalP output as input while all other SVM’s use only the remaining compositional features as input. The SVM was implemented using the SVM-light package.

The output of the top level SVM determines if a protein is classified as belonging to the secretory pathway or not (Fig. 6-1). Proteins targeted via the secretory pathway are further sorted into extra-cellular proteins and other organelles while intra-cellular proteins are further sorted into nuclear or cytoplasmic proteins. The cytoplasmic proteins are further classified as cytosolic or mitochondrial for eukaryotic non-plant proteins (Fig. 6-1a). For plant proteins a further distinction is made to chloroplast proteins (Fig. 6-1b).

For prokaryotes a simpler architecture is used achieving a three fold distinction between extra-cellular, periplasmic and cytosolic proteins (Fig. 6-1c). We have applied LOCtree to analyze the subcellular localization of complete genomes of a number of eukaryotic and prokaryotic organisms. The LOCtree subcellular localization prediction server and the results of our localization annotations for entire genomes can be accessed at http://cubic.bioc.columbia.edu/services/LOCtree/.

6.2 Materials and methods

Data sets used for development and evaluation. We selected all eukaryotic and prokaryotic proteins with explicit annotations about subcellular localization in SWISS-PROT release 40 (Bairoch and Apweiler 2000). We excluded proteins annotated as MEMBRANE, POSSIBLE, PROBABLE, SPECIFIC PERIODS or BY SIMILARITY. We also excluded proteins annotated with multiple localizations. This left about 9,000 eukaryotic proteins and 13,000 prokaryotic proteins in our SWISS-PROT data set of experimentally annotated localization (‘SWISS-PROT annotated’ set, Table 6-1). Training, and test sets were constructed from this set such that no pair of proteins from any two sets had levels of sequence similarity above HSSP-distances of 5 (Eq. 6-1). We picked this value, since below this threshold assigning subcellular localization based solely on homology leads to significant errors (Nair and Rost 2002). Furthermore, the test set was redundancy reduced at HSSP-distances <10 using a simple greedy search (Hobohm, Scharf et al. 1992). This ensured that no two proteins in the test set had greater than 40% sequence identity over more than 100 residues (number of sequence unique proteins given in Table 6-1). The reason for this reduction was to find a balance between biased data known to yield over-estimates (Nielsen, Engelbrecht et al. 1996; Rost 2002) and between data sets that were too small. Note that we did not have to define thresholds for significant sequence similarity between motifs, such as signal peptides (Nielsen, Engelbrecht et al. 1996), since we used the entire protein information. Note: all data sets are available at: http://cubic.bioc.columbia.edu/results/2004/LOCtree/.

SWISS-PROT-new set used only for testing. After we completed the development of all our methods, we used an additional data set to re-examine performance, namely, we collected all proteins that had been added to SWISS-PROT between release 40 and 41 (labelled 'SWISS-PROT-new'). We filtered out all of these new proteins that had HSSP-distances >5 to any previously used protein and found the sequence-unique subset of the new proteins (Table 6-1). We never used any of these proteins for development, and it is rather unlikely that the other methods tested used any of these (Table 6-6).

HSSP-distance to measure pairwise sequence similarity. The HSSP-distance is defined as the distance from the HSSP threshold (Rost 1999); it is given by:

            HSSP-DISTANCE = PIDE - HSSP_PIDE(q)

            HSSP_PIDE(q)= q +                        (Eq. 6-1)

where, L is the length of the alignment between two proteins, PIDE the percentage of pairwise identical residues, and HSSP_PIDE(q) the revised HSSP-threshold for the level q.

Increasing size of training set. Preliminary results suggested that using a sequence redundant set of proteins to train the SVM improved prediction of localization. Two strategies were used to increase the size of the training set: 1) Homology based annotations: using LOChom (Nair and Rost 2002) we annotated localization for all sequence homologues in the SWISS-PROT database of proteins in the training set, and 2) SWISS-PROT keyword based annotations: using LOCkey (Nair and Rost 2002) we first annotated localization for all proteins in the SWISS-PROT database for which adequate keyword functional information was present in the database. Next, any protein that had a sequence similarity of HSSP-distance greater than 5 to proteins in the test set was excluded. The remaining proteins were then added to the training set. Using the two above mentioned procedures increased the size of the training set by nearly four times.

Building profiles. We showed previously (Nair and Rost 2003) that using evolutionary information in the form of sequence profiles can significantly improve prediction accuracy. Profile-based composition was calculated by aligning the sequences against the SWISS-PROT + TrEMBL database using MaxHom dynamic programming algorithm (Sander and Schneider 1991). The aligned sequences were filtered for redundancy at 95% pairwise sequence identity, i.e. pairs exceeding this limit were removed. Finally, we included only those proteins into the alignment that were above an HSSP-distance of 5 and had a pairwise sequence identity above 50% with respect to the guide protein of known localization. These thresholds were found to be optimal on a limited subset of the training data. Finally, the profile composition was calculated by replacing each amino acid residue in the protein by the residue frequencies in the profile. All composition information was input to the SVM in the form of profile-based composition.

Hierarchical architecture and support vector machine training. Each decision node in the hierarchical architecture of LOCtree (Fig. 6-1) was implemented using a support vector machine (SVM). The SVM’s were implemented using the SVM-light package (Joachims 2000). The inputs to the SVM’s were the amino acid composition, composition of the 50 N-terminal residues (both 20 input units) and amino acid composition in the three secondary structure states (sixty input units). For the eukaryotic plant and non-plant architectures, raw output from the SignalP server (Nielsen, Engelbrecht et al. 1997) was used as additional input to the top node SVM’s which determine if a protein is sorted via the secretory pathway or not. Altogether 100 variables (104 for the top-level SVM in the eukaryotic architecture) were used as input to the SVM’s. Each SVM was trained using the radial basis function (rbf) kernel. The parameter for the rbf-kernel and the C parameter for trade-off between training error and margin were determined by optimizing on a limited subset of the training data. We used a constant value of =15 and C=500 for all SVM’s. The predictions from the SVM were observed to be quite resilient to changes in kernel parameters.

Final decision through simple winner-take-it-all. We experimented with two different methods for determining the localization of a protein, 1) Decision tree: at each node in the hierarchical architecture (Fig. 6-1), a simple yes/no decision was made based only on the SVM output at that node to determine which branch of the localization tree the protein belongs to, and 2) Summing over prediction strengths: the branch of the localization tree the protein is sorted through at each node was determined by summing the prediction strength’s over all previous nodes. The simple decision tree architecture was finally used since it outperformed the summing architecture by over 1 percentage points.

Cross-validation. The ‘SWISS-PROT annotated’ data was partitioned into six equal-sized sets using the HSSP-distance criteria described above: five of these sets were combined to give the training set and the sixth one was used for testing. Finally, we rotated through the test set such that each protein was used for testing exactly once. We never used any information from the test set to optimize parameters.

Evaluating performance.  Six-fold cross-validation was applied to test the hierarchical architecture. As a simple measure for performance we used the percentage accuracy (Q, number of correctly predicted test proteins as percentage of total number of test proteins). The accuracy/specificity and coverage/sensitivity of the two-state networks were measured using four ratios derived from TP (number of proteins predicted to be in localization i and observed to be in localization i, the true positives), FP (number of proteins predicted to be in localization i and observed not to be in i, the false positives) and FN (number of proteins predicted not to be in localization i and observed to be in i, the false negatives). We used:

                                                                                         (Eq. 6-2)

                                                                                       (Eq. 6-3)

In other words, Acc is the accuracy of prediction, and Cov is the coverage. We combined these two numbers (Acc and Cov) through the geometric average:

                                                                            (Eq. 6-4)

The overall accuracy was measured by the accuracy Q:

                                  Q =              (Eq. 6-5)

 

For assessing the accuracy of two class predictions we used the Mathews correlation coefficient (MCC) (Mathews 1985) and the normalized mutual information (MI) (Rost and Sander 1993):

            (Eq. 6-6)

                                       (Eq. 6-7)

The Mathews correlation coefficient and the mutual information as defined above is only applicable for two classes. For more than two classes, the Mathews correlation coefficient was modified to the generalized correlation coefficient (Baldi, Brunak et al. 2000):

                                                                                           (Eq. 6-8)

where, N is the number of proteins, K the number of localization classes, the confusion matrix and  is given by:

                                                                                                     (Eq. 6-9)

For many classes the mutual information was modified to the information coefficient (Baldi, Brunak et al. 2000):

                                                         (Eq. 6-10)

Prediction methods. The prediction accuracy of four publicly available methods was evaluated using the ‘Non-plant unique’, ‘Plant unique’ and the ‘Non-plant new unique’ sets (Table 6-8). The four methods were: (1) TargetP: neural network based tool predicting localization based on N-terminal sequence information