
Title: Mimicking cellular sorting improves prediction of subcellular localization
Author: Rajesh Nair & Burkhard Rost

Mimicking cellular sorting improves prediction of subcellular localization

Rajesh Nair 1,4 & Burkhard Rost 1,2,3

1 CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
4 Dept. of Physics, Columbia Univ., 538 West 120th Street, New York, NY 10027, USA
* Corresponding author: nair@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/  Tel: +1-212-305-4018, fax: +1-212-305-7932

This article is published in (J Mol Biol, issue, 2005 and pages) © copyright Journal of Molecular Biology, Academic Press (2005). Academic Press is the only authorized source. All copying of this article including placing on another website requires the written permission of the copyright owner.



Abstract

Predicting the native subcellular compartment of a protein is an important step toward elucidating its function. Here we introduce LOCtree, a hierarchical system combining support vector machines (SVMs) and other prediction methods. LOCtree predicts the subcellular compartment of a protein by mimicking the mechanism of cellular sorting and exploiting a variety of sequence and predicted structural features in its input. Currently LOCtree does not predict localization for membrane proteins, since the compositional properties of membrane proteins significantly differ from those of non-membrane proteins. While any information about function can be used by the system, we presented estimates of performance that are valid when only the amino acid sequence of a protein is known. When evaluated on a non-redundant test set, LOCtree achieved sustained levels of 74% accuracy for non-plant eukaryotes, 70% for plants, and 84% for prokaryotes. We rigorously benchmarked LOCtree in comparison to the best alternative methods for localization prediction. LOCtree outperformed all other methods in nearly all benchmarks. Localization assignments using LOCtree agreed quite well with data from recent large-scale experiments. Our preliminary analysis of a few entirely sequenced organisms, namely human (Homo sapiens), yeast (Saccharomyces cerevisiae), and weed (Arabidopsis thaliana) suggested that over 35% of all non-membrane proteins are nuclear, about 20% are retained in the cytosol, and that every fifth protein in the weed resides in the chloroplast.

Key words: protein subcellular localization prediction, support vector machines, hierarchical ontology, sequence alignment, database search, secondary structure, solvent accessibility.

 

Abbreviations used

1D structure: one-dimensional structure (e.g. sequence or string of secondary structure, or solvent accessibility)
ER: endoplasmic reticulum
GFP: green fluorescent protein [1]
GFP-tagging: here used to refer to the large-scale experimental determination of localization through GFP tagging [2]
GO: gene ontology [3]
HSSP: database of protein structure-sequence alignments [4]
IL: large-scale localization of yeast proteins from high-throughput immuno-localization of epitope-tagged gene products [5]
LOChom: predicting localization using annotation transfer through sequence homology [6, 7]
LOCkey: using SWISS-PROT keywords to predict subcellular localization [8]
LOCtree: hierarchical system of SVMs introduced here
NLS: nuclear localization signal
NNPSL: neural networks predicting localization [9]
PHD: profile-based neural network prediction of secondary structure, solvent accessibility and transmembrane helices [10, 11]
PredictNLS: prediction of nuclear proteins through nuclear localization signals [12, 13]
PROFphd: advanced profile-based neural network prediction of secondary structure and solvent accessibility [14]
PSORT: knowledge-based expert system using amino acid composition and sequence motifs [15, 16, 17, 18]
SGD: Saccharomyces cerevisiae genome database [19]
SignalP: neural network system predicting signal peptides [20]
SMART: Simple Modular Architecture Research Tool [21]
SVM: support vector machine
SWISS-PROT: database of protein sequences [22]
SubLoc: support vector machine-based prediction of localization [23]
TargetP: combined method predicting chloroplast (ChloroP), extra-cellular (SignalP), and mitochondrial proteins [20, 24, 25]
TMS: identification of chloroplast proteins in A. thaliana using tandem mass spectroscopy [26]
TrEMBL: translation of the EMBL-nucleotide database coding DNA to protein sequences [22]


 

Introduction

Assignment and prediction of subcellular localization indispensable. The genomes, i.e. all DNA-sequences, of over 260 organisms (Feb. 2005), including the human genome [27, 28], have been completed. For over 200 of the entirely sequenced organisms, the protein sequences are publicly available; 105 have been analyzed in the PEP database (www.rostgroup.org/db/PEP/, cubic.bioc.columbia.edu/db/PEP/), and contribute about 413,000 protein sequences, i.e. about one fourth of all currently known protein sequences [29, 30, 31]. With this explosion of genome sequences, the major challenge in modern biology is to follow suit in advancing the knowledge of the expression, regulation, and function of the entire set of proteins encoded by an organism, i.e. its proteome. This information will be invaluable for understanding how complex biological processes occur at a molecular level, how they differ in various cell types, and how they are altered in disease states. Proteins must be localized in the same subcellular compartment to cooperate towards a common function. Therefore, experimentally unraveling the native compartment of a protein constitutes one step on the long way to determining its role. Using experimental high-throughput methods for epitope and green fluorescent protein (GFP) tagging, two groups have recently reported localization data for most proteins in Saccharomyces cerevisiae (baker's yeast) [5, 2]. So far, the majority of large-scale experimental efforts to determine localization have been restricted to yeast, or to particular compartments, such as a recent analysis of chloroplast proteins in Arabidopsis thaliana (weed) [26]. As of now, these large-scale experiments cannot be repeated for mammalian or other higher eukaryotic proteomes. One major obstacle is that large-scale production of a collection of cell lines each with a defined gene chromosomally tagged at the 3'-end is not yet possible [32].
In contrast, computational tools can provide fast and accurate localization predictions for any organism [33, 34, 35, 36]. Attempts to predict subcellular localization have increasingly become one of the central problems in bioinformatics/computational biology [37, 38, 39, 40, 41, 42, 43].

Most reliable predictions cover less than 50% of all proteins. A number of methods predict localization by identifying short sequence motifs, such as signal peptides [44, 45, 20, 24, 18, 46] or nuclear localization signals (NLS) [12, 38, 47, 13] that are responsible for protein targeting. Most proteins destined for the secretory pathway, the mitochondria and the chloroplast contain N-terminal peptides that are recognized by the translocation machinery [48, 49]. The term signal peptide is used to describe the peptides in secreted proteins that are cleaved in the endoplasmic reticulum (ER) by signal peptidases; the peptides from mitochondria and chloroplast are referred to as transit peptides. Signal and transit peptides can be recognized by generic prediction methods, that by the detection of these peptides also predict subcellular localization [50, 24, 25, 41]. Many proteins destined for the nucleus contain NLS motifs that may occur anywhere in the sequence. Recently, we have collected a data set of experimental and potential NLS motifs as an aid to predicting nuclear localization [13]. However, the vast majority of nuclear proteins have no known motif. For mitochondria and chloroplast, a number of alternative targeting pathways have also been discovered recently [51, 52, 53, 54]. Additionally, proteins such as fibroblast growth factors are targeted to the extra-cellular space via non-classical secretory pathways, i.e. they do not possess N-terminal signal peptides [55, 56]. Furthermore, a particular problem for methods detecting N-terminal signals is that start-codons are predicted with less than 70% accuracy by genome projects [9, 27, 28]. Overall, known and predicted sequence motifs enable annotating about 30% of the proteins in six entirely sequenced eukaryotic proteomes [57, 29, 58]. 
Other methods that can be reliably used to annotate localization but are not always applicable are annotation transfer from sequence homologues [6] and text analysis [59, 60, 8, 61]. A particular variant of homology-based predictions is the domain projection method that is based on similarity to SMART domains of known subcellular localization [58]. Despite recent high-throughput experiments, the most reliable prediction methods together cover less than 50% of entirely sequenced multi-cellular proteomes.

De novo predictions of localization restricted by limited biophysical reality. In the near future, the only hope of assigning compartments to the remaining half of all multi-cellular proteins is using methods that predict localization from features other than known import/export motifs. The most promising approach is to exploit the correlation between localization and amino acid composition of a protein [62, 63], which is mostly due to the altering of the protein surface in response to changing environmental conditions [64]. Methods using only amino acid composition to predict localization are de novo methods; they predict localization without any explicit experimental knowledge of the protein under investigation. In particular, they are as accurate if any information about function is available for a target as when the target is merely a 'hypothetical protein'. Higher order residue correlations (between residues i and i+n, for n=2,3,4) have been accounted for by using pseudo-amino acid composition [65, 66, 67]. Recently, we showed that incorporating structural and evolutionary information significantly improves prediction accuracy [68]. With the availability of many completely sequenced genomes, phylogenetic profiles have been employed to identify subcellular localization [69]. So far, this approach has been much less accurate than methods based solely on composition. Drawid & Gerstein have proposed a Bayesian system, based on a diverse range of 30 different features, to predict the localization of yeast proteins [70]. The problem with all these methods is that they are based on sequence features that may reveal localization but, unlike signal and transit peptides or nuclear localization signals, are not the reason why proteins are transported. Furthermore, all general methods - with the exception of PSORT [15, 16] - implicitly assume that all localizations are equidistant, i.e. if a method predicts a nuclear protein to be cytoplasmic it makes the same mistake as another method which predicts this protein to be extra-cellular. In reality, however, some compartments are more similar to each other than others, e.g. ER is closer to extra-cellular than to nuclear due to the proximity in the space of the biological sorting machinery.

Here, we describe a novel system of support vector machines (SVMs) that predict subcellular localization by incorporating a hierarchical ontology of localization classes modeled onto biological processing pathways. By construction, the system penalizes confusions of classes along the same pathway (e.g. ER instead of extra-cellular) less than confusions between classes from different pathways (e.g. ER instead of nuclear). The biological similarities are incorporated from the description of cellular components in the gene ontology (GO) [3, 71]. We simplified and tailored the GO definitions to the problem of protein sorting. For example, in GO both the ER and the Golgi apparatus are subcategories of the cytoplasm. However, proteins destined for the extra-cellular space, the ER, the Golgi, endosomes and lysosomes are targeted via the same secretory pathway. By this criterion, proteins from the secretory pathway are more similar to each other than they are to other intra-cellular proteins [48]. Hence, in our classification scheme these compartments are grouped together and are designated as belonging to the secretory pathway. Technically, we incorporated the ontology through a decision tree with SVMs as the nodes (Fig. 1). We favored SVMs over neural networks due to their improved performance (data not shown). The final system, LOCtree, was extremely successful at learning evolutionary similarities among subcellular localization classes and was significantly more accurate than other traditional networks at predicting subcellular localization.
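The decision tree described above can be illustrated with a minimal Python sketch of the non-plant architecture (secretory pathway versus intra-cellular, then extra-cellular versus organelles and nuclear versus cytoplasm, then cytosol versus mitochondria). The class `Node`, the helper `score_rule`, and the feature keys are hypothetical names introduced only for this illustration; each threshold rule merely stands in for a trained SVM and is not the classifier used in LOCtree.

```python
from typing import Callable, Dict, List, Optional

Features = Dict[str, float]


class Node:
    """One binary decision point of the tree (one SVM in the real system)."""

    def __init__(self, decide: Callable[[Features], bool],
                 yes_label: str, no_label: str,
                 yes_child: Optional["Node"] = None,
                 no_child: Optional["Node"] = None):
        self.decide = decide
        self.yes_label = yes_label
        self.no_label = no_label
        self.yes_child = yes_child  # None means the branch ends in a leaf
        self.no_child = no_child


def score_rule(key: str) -> Callable[[Features], bool]:
    # Hypothetical stand-in for an SVM: a positive decision value means "yes".
    return lambda x: x.get(key, 0.0) > 0.0


# Assumed encoding of the non-plant tree, Level 0 to Level 2.
NON_PLANT_TREE = Node(
    score_rule("secretory"), "secretory pathway", "intra-cellular",
    yes_child=Node(score_rule("extracellular"), "extra-cellular", "organelles"),
    no_child=Node(score_rule("nuclear"), "nuclear", "cytoplasm",
                  no_child=Node(score_rule("cytosol"), "cytosol", "mitochondria")),
)


def predict(tree: Node, x: Features) -> List[str]:
    """Walk from the root to a leaf and return every class on the way."""
    path: List[str] = []
    node: Optional[Node] = tree
    while node is not None:
        if node.decide(x):
            path.append(node.yes_label)
            node = node.yes_child
        else:
            path.append(node.no_label)
            node = node.no_child
    return path


print(predict(NON_PLANT_TREE, {"secretory": -1.2, "nuclear": -0.3, "cytosol": -0.8}))
# prints ['intra-cellular', 'cytoplasm', 'mitochondria']
```

Returning the whole path rather than only the leaf mirrors the idea that intermediate classes (e.g. intra-cellular, cytoplasm) are predictions in their own right.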




Fig. 1: Hierarchical architecture of LOCtree. LOCtree uses specialized architecture to predict localization of proteins from different organisms: (A) architecture used to predict localization of eukaryotic non-plant proteins, (B) architecture for plant proteins, and (C) the architecture for prokaryotic proteins. At each branch point a support vector machine (SVM) is used to accomplish a binary classification (either protein belongs to localization class L or does not belong to L). The hierarchical architecture has been designed to mimic the biological protein sorting mechanism as closely as possible. The branches of the tree represent intermediate stages in the sorting machinery while the nodes represent the decision points in the sorting machinery. The different levels of SVMs in the hierarchical tree are labeled Level 0, Level 1, etc. For example, Level 0 represents the top node SVM which discriminates between secretory pathway proteins and other intra-cellular proteins (A and B) or proteins which remain in the cytoplasm from the rest (C). The intermediate node SVMs in the next level are represented as Level 1, and are responsible for separating extra-cellular proteins from proteins sorted to the organelles and nuclear proteins from cytoplasmic proteins (A and B). For the prokaryotic architecture (C), Level 1 is the terminal level for Gram-negative bacteria and separates extra-cellular proteins from periplasmic proteins. In addition, Level 1 also contains the cytoplasmic leaf that is propagated without branching from Level 0. For Gram-positive bacteria, Level 0 is the terminal level and separates cytoplasmic proteins from extra-cellular proteins (non-cytoplasmic branch). The leaves of the tree, represented by rectangular boxes represent the final localization classes for which prediction is made. If a leaf has a depth smaller than the overall depth of the tree it is propagated without branching for the remainder of the tree. 
Level 2 is the terminal level for the eukaryotic non-plant architecture (A) and is responsible for sorting proteins into one of five subcellular classes (mitochondria and cytosol plus the three leaves from Level 1), while Level 3 is the terminal level for the plant architecture (B) and separates proteins into one of six classes (mitochondria and chloroplast plus the four leaves from Level 2). The prediction accuracy of the parent nodes is higher than that of the child nodes, leading to significantly improved prediction accuracies for the intermediate localization states. Abbreviations: EXT, extra-cellular; NUC, nucleus; CYT, cytosol; MIT, mitochondria; CHLORO, chloroplast; RIP, periplasm; and ORG, organelle. Organelles are the endoplasmic reticulum, Golgi apparatus, peroxisomes, lysosomes, and vacuolar compartments.



We have applied LOCtree to analyze the subcellular localization of complete genomes of a number of eukaryotic and prokaryotic organisms. The LOCtree subcellular localization prediction server and the results of our localization annotations for entire proteomes are available through http://www.rostlab.org/services/LOCtree or http://cubic.bioc.columbia.edu/services/LOCtree.

Results

Data sets and cross-validation results

More data with noise better than less data with less noise. Proteins with experimentally annotated subcellular localization were extracted from SWISS-PROT [22] (Methods). For this study, we excluded membrane proteins, i.e. all our results are valid for a subset of 75-80% of all proteins [72, 73, 74, 57, 75]. In total, we had 8,980 eukaryotic and 13,186 prokaryotic non-membrane proteins with explicit experimental annotations (Methods). Training and test sets were constructed by partitioning the data such that test sequences had less than 25% sequence identity to any sequence in the training set over an alignment length of 250 residues (HVAL=5, eqn. 1, Methods). To avoid overestimating performance, we reduced redundancy such that our final sequence-unique test set contained 1,505 non-redundant eukaryotic non-plant sequences, 304 plant and 672 prokaryotic test sequences. All results of the methods described here were based on six-fold cross-validation experiments, i.e. we cycled six times through the entire sequence-unique data such that each protein was used for testing once. To increase the size of the training set, we included homology-based (LOChom [6]) and keyword-based (LOCkey [8]) predictions in the training data. While adding these noisy predictions, we ascertained that no homologues to any of the test proteins were included. This procedure almost quadrupled the training data; it increased prediction accuracy by nearly seven percentage points (data not shown). The major improvement resulted from the addition of keyword-based annotations using LOCkey [8]. Plants were treated separately since their compositional features differed significantly from non-plant eukaryotes (not shown). Using the SVM-light package [76], we found the radial basis function (RBF) kernel to perform better than linear and polynomial kernels. This result was obtained on a small subset of all proteins without cross-validation. In particular, we did not optimize this solution for the final test set.
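The six-fold cycling described above, in which every protein of the sequence-unique set is used for testing exactly once, can be sketched as follows. This is only an illustration under simplifying assumptions: the function name `six_fold_splits` and the identifiers are hypothetical, the input is assumed to be already sequence-unique, and the additional step of removing homologues of test proteins from the (augmented) training data is not modeled.

```python
from typing import Iterator, List, Sequence, Tuple


def six_fold_splits(protein_ids: Sequence[str],
                    k: int = 6) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, test) lists so that each protein is tested exactly once."""
    # Deal the proteins into k folds in round-robin fashion.
    folds = [list(protein_ids[i::k]) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


# Hypothetical identifiers; 1,505 is the size of the non-plant test set.
ids = [f"P{i:04d}" for i in range(1505)]
for train, test in six_fold_splits(ids):
    assert not set(train) & set(test)  # no protein is in both sets
```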

Very accurate distinction between secretory pathway proteins and all others. To predict the localization of an unknown eukaryotic protein, LOCtree first determines if it is sorted using the secretory pathway. The SVM that makes this distinction achieved an overall prediction accuracy around 90% for both eukaryotic non-plant (Table 1, Fig. 2A) and plant proteins (Table 2). Using the signal peptide prediction of SignalP [20] as one input to the SVM improved accuracy by over one percentage point (data not shown). We also confirmed our previous observation that using overall composition in conjunction with N-terminal composition improved performance [68]. Our methods also distinguished intra-cellular proteins very accurately from those entering the secretory pathway (>90% accuracy, Table 1 and Table 2).



Table 1: LOCtree on non-redundant test set of eukaryotic non-plant proteins. *
Hierarchy level | Class             | Nprot | Acc | Cov | gAv | Q (StdDev) | MCC  | MI   | Nstates
Level 0         | Secretory Pathway | 415   | 81  | 80  | 81  | 89 (2)     | 0.73 | 0.44 | 2
                | Intra-cellular    | 1090  | 92  | 93  | 93  |            |      |      |
Level 1         | Extra-cellular    | 363   | 83  | 81  | 82  | 78 (4)     | 0.55 | 0.40 | 4
                | Organelles        | 52    | 51  | 52  | 52  |            |      |      |
                | Nuclear           | 562   | 78  | 78  | 78  |            |      |      |
                | Cytoplasm         | 528   | 76  | 78  | 77  |            |      |      |
Level 2         | Cytosol           | 330   | 63  | 66  | 64  | 74 (6)     | 0.55 | 0.39 | 5
                | Mitochondria      | 198   | 70  | 67  | 68  |            |      |      |

* Abbreviations used:
Hierarchy Level and Class as illustrated in Fig. 1; Nprot: number of proteins in sequence-unique test set with a given localization; Nstates: number of effective states predicted at given level (note that Level 1 contains four states while Level 2 contains five states, namely the Level 1 leaves (extra-cellular, organelles and nuclear) + cytosol + mitochondria).
Performance measures: Acc: accuracy or specificity (eqn. 2); Cov: coverage or selectivity (eqn. 3); gAv: geometric average between Acc and Cov (eqn. 4); Q: overall prediction accuracy for a given level in the hierarchy (eqn. 5; note that depending on the level this is a 2-state, 4-state, or 5-state value); MCC: Matthews correlation coefficient (eqn. 6 and eqn. 8); MI: mutual information (eqn. 7 and eqn. 10). Note 1: Q=74% at Level 2 is the overall accuracy for classification into one of five localization classes (extra-cellular, organelles, nuclear, cytosol or mitochondria).
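As an illustration of how the per-class and overall measures relate, the following Python sketch computes Acc, Cov, gAv and Q from a confusion matrix. Equations 2-5 are given in Methods and are not reproduced here, so the definitions below are assumptions based on the usual reading of the table legend: Acc as the percentage of predictions for a class that are correct, Cov as the percentage of observed members of a class that are predicted correctly, gAv as their geometric average, and Q as the percentage of all proteins predicted correctly. The counts are invented for the example and are not data from this work.

```python
import math
from typing import Dict, Tuple

Confusion = Dict[str, Dict[str, int]]  # confusion[observed][predicted] = count


def class_measures(confusion: Confusion, label: str) -> Tuple[float, float, float]:
    """Return (Acc, Cov, gAv) in percent for one class (assumed definitions)."""
    correct = confusion[label].get(label, 0)
    n_predicted = sum(row.get(label, 0) for row in confusion.values())
    n_observed = sum(confusion[label].values())
    acc = 100.0 * correct / n_predicted if n_predicted else 0.0
    cov = 100.0 * correct / n_observed if n_observed else 0.0
    return acc, cov, math.sqrt(acc * cov)


def overall_q(confusion: Confusion) -> float:
    """Percentage of all proteins assigned to their observed class."""
    correct = sum(row.get(c, 0) for c, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return 100.0 * correct / total


# Invented two-state example (observed class -> predicted class -> count).
toy = {
    "secretory": {"secretory": 80, "intra-cellular": 20},
    "intra-cellular": {"secretory": 10, "intra-cellular": 190},
}
acc, cov, gav = class_measures(toy, "secretory")
print(round(acc, 1), round(cov, 1), round(gav, 1), round(overall_q(toy), 1))
```

Under these assumed definitions, the rounded gAv values in Table 1 are consistent with the rounded Acc and Cov columns, e.g. sqrt(81 x 80) is approximately 80.5 for the secretory pathway.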



Table 2: LOCtree on non-redundant test set of plant proteins. *
Hierarchy level | Class             | Nprot | Acc | Cov | gAv | Q (StdDev) | MCC  | MI   | Nstates
Level 0         | Secretory Pathway | 42    | 77  | 79  | 78  | 94 (4)     | 0.74 | 0.49 | 2
                | Intra-cellular    | 262   | 97  | 96  | 97  |            |      |      |
Level 1         | Extra-cellular    | 22    | 68  | 68  | 68  | 88 (5)     | 0.58 | 0.49 | 4
                | Organelles        | 20    | 57  | 60  | 58  |            |      |      |
                | Nuclear           | 32    | 70  | 81  | 75  |            |      |      |
                | Cytoplasm         | 230   | 95  | 93  | 94  |            |      |      |
Level 2         | Cytosol           | 77    | 73  | 74  | 74  | 77 (7)     | 0.59 | 0.44 | 5
                | Non-Cytosol       | 153   | 84  | 80  | 82  |            |      |      |
Level 3         | Mitochondria      | 50    | 61  | 74  | 67  | 70 (3)     | 0.58 | 0.42 | 6
                | Chloroplast       | 103   | 77  | 63  | 70  |            |      |      |

* Abbreviations used as in Table 1.



Overall accuracy of 74% for non-plants. If a protein is predicted as belonging to the secretory pathway it is further sub-classified into extra-cellular or not (Fig. 1). The non extra-cellular proteins belong to either of the following organelles: endoplasmic reticulum (ER), Golgi apparatus, peroxisome, lysosome, or vacuole. Proteins native in one of these organelles were predicted at levels around 50% accuracy and 52% coverage (values were higher for plant proteins, Table 2); these values were much lower than the averages for all other classes. The sub-classification of intra-cellular proteins into nucleus and cytoplasm was less accurate; however, levels of accuracy and coverage were still above 76% (Table 1 and Table 2). For the final localization classes (the leaves in Fig. 1), Q5=74% (Table 1) is the overall accuracy for classification into one of five localization classes (extra-cellular, organelles, nuclear, cytosol or mitochondria). This was over seven percentage points more accurate than another system that used the same input with the traditional pairwise SVMs (data not shown). The unique feature of our method is that it predicts 'intermediate' localizations such as intra-cellular and cytoplasm (Table 1 and Table 2). These 'intermediate' localizations are predicted with much higher accuracies, as is evident from the progressive decrease in prediction accuracy as we descend the hierarchical tree (Fig. 2A).




Fig. 2: Reliability of LOCtree. The curves show prediction accuracy of LOCtree for eukaryotic animal sequences. (A) Overall performance: The prediction accuracy decreases as we descend the hierarchical tree (Fig. 1A). The Level 2 accuracy shown includes the accuracy of all Level 1 leaves, i.e. the extra-cellular, organelle and nuclear classes (Fig. 1A), and represents the accuracy of classifying the protein into one of five subcellular classes. At 75% coverage the prediction accuracy is around 94% for Level 0, dropping to 84% for Level 1 and 77% for Level 2. The ability of the hierarchical system to predict intermediate localization states at significantly higher accuracy is evident from the difference of 17 percentage points in prediction accuracy between Level 0 and Level 2. Level 1, which separates proteins into one of four subcellular classes, is over 7 percentage points more accurate than Level 2, which separates proteins into one of five classes. (B) Class-wise performance: LOCtree is best at discriminating secretory pathway proteins from all other proteins (91% accuracy at 50% coverage). Prediction of nuclear and extra-cellular proteins was only slightly less accurate (84% accuracy at 50% coverage) while performance was significantly worse for cytosolic proteins with only 64% correctly predicted. The standard deviation in the prediction accuracy for each of the localization classes was roughly 7%.



Accurate distinction of three prokaryotic classes. For prokaryotic proteins, LOCtree first determines if the protein is cytoplasmic or not. The SVM discriminating between these localizations reached an overall accuracy of 90% (Table 3). Prediction accuracy did not differ significantly between Gram-positive and Gram-negative bacteria. For Gram-negative bacteria, the non-cytoplasmic proteins are further classified into periplasmic and extra-cellular. The overall three-class (cytoplasmic, periplasmic or extra-cellular) prediction accuracy for Gram-negative bacteria was Q3=83%; the two-class (cytoplasmic or extra-cellular) accuracy for Gram-positive bacteria was Q2=90%. For Gram-negative bacteria, the distinction between periplasmic and extra-cellular proteins was much less accurate than the identification of cytoplasmic proteins.



Table 3: LOCtree on non-redundant test set of prokaryotic proteins. *
Hierarchy level | Class          | Nprot | Acc | Cov | gAv | Q (StdDev) | MCC  | MI   | Nstates
Level 0         | Cytoplasm      | 426   | 89  | 97  | 93  | 90 (4)     | 0.79 | 0.52 | 2
                | Non-Cytoplasm  | 246   | 93  | 80  | 86  |            |      |      |
Level 1         | Periplasmic    | 125   | 86  | 62  | 73  | 83 (2)     | 0.55 | 0.45 | 3
                | Extra-cellular | 42    | 59  | 74  | 66  |            |      |      |

* Abbreviations used as in Table 1. Note 1: Level 1 is applicable for Gram-negative bacteria, only. For Gram-positive bacteria, the system performs a two-state classification with non-cytoplasmic proteins being classified as extra-cellular. For Gram-negative bacteria, non-cytoplasmic proteins are further separated into periplasmic and extra-cellular proteins. Note 2: The Level 0 prediction accuracy did not differ significantly between Gram-positive and Gram-negative bacteria. The prediction accuracy reported above is the combined prediction accuracy for Gram-positive and Gram-negative bacteria. The overall two class prediction accuracy for Gram-positive bacteria was Q2 =90% while the three class prediction accuracy for Gram-negative bacteria was Q3=83%.



 

Comparison with other methods using additional test sets

Other methods tested on new data set. We compared our method to the following publicly available methods: TargetP [25], SubLoc [23], NNPSL [9], and PSORT II [18]. In contrast to all other methods, TargetP focuses exclusively on sorting signals (secreted, chloroplast, mitochondria); it does not predict proteins in any other compartment, such as in the nucleus or cytoplasm. Of the other servers that predict at least four classes, SubLoc [23] is also based on SVMs; NNPSL [9] is based on neural networks. SubLoc and NNPSL rely solely on amino acid composition while PSORT [18] combines information from local sequence motifs and a neural network based method. All publicly available methods were tested on smaller data sets than LOCtree, and on data sets with little mutual overlap. We could run all servers on our non-redundant test set from the cross-validation experiments (Table 1, Table 2, Table 3). However, most of the proteins in our data set had been used to develop those servers and we could not cross-validate any method other than our own. A benchmark with our data set would, therefore, have very limited value. For completeness, we reported the results of this test which, as expected, over-estimated the performance of the public servers significantly (Table 1 Appendix, Supporting Online Material). The most meaningful comparison of prediction methods is based on a significantly sized, sequence-unique data set of proteins that have neither been used for the development of any of the methods tested, nor have significant sequence similarity to any protein used for that development. Unfortunately, such sets are often difficult to get. If we ignored the most recent improvement of one of the components of TargetP, namely SignalP_3 [41], we could find such a data set in proteins added between SWISS-PROT version 40 and 41 (Methods). While SignalP_3 has been developed after release 41, all the methods that we compared have been developed before release 41. Note that we deliberately restricted our development of LOCtree to proteins available in version 40 so that we could carry out this comparison. In many ways, our evaluation was also informative of the sustained performance of the methods tested, some of which had fallen prey to severe over-estimates of performance in their original publications. Note that PSORT, TargetP and the different versions of SignalP stood out in that their authors had correctly estimated the sustained performance all along.

LOCtree over 20 percentage points more accurate than other general servers. In the benchmark of proteins that had not been used for the development of any method, LOCtree outperformed all other servers (Table 4). TargetP was more accurate at predicting proteins targeted via the secretory pathway but its coverage was lower than that of LOCtree. The reason was that TargetP slightly under-predicted the secretory pathway (imbalance in gAv (eqn. 4), i.e. the geometric average over Acc and Cov, Table 4). On our data set, we found the accuracy of SignalP 3.0 [41] to be slightly lower than that of TargetP; since the difference was not significant, SignalP 3.0 is not shown separately for simplicity. PSORT II was the most accurate server for predicting extra-cellular proteins; however, this was achieved at the cost of an extremely low level of coverage: in the geometric average between accuracy and coverage, PSORT II was more than 30 percentage points lower than LOCtree. As shown in our cross-validation experiments (Tables 1-3), LOCtree was very balanced in its compromise between accuracy and coverage, i.e. between under- and over-prediction, for all classes, and it was much more balanced than any other server. In terms of the overall 4-state accuracy (Q4, eqn. 5), LOCtree scored 21 percentage points higher than its best competitor SubLoc (Table 4).



Table 4: Comparison on identical sequence-unique set of new SWISS-PROT non-plant eukaryotic proteins. *
Class             | Measure | LOCtree (here) | TargetP [20, 24, 25] | SubLoc [23] | PSORT [15, 16, 17, 18] | NNPSL [9]
Secretory Pathway | Acc     | 87             | 93                   |             |                        |
                  | Cov     | 90             | 73                   |             |                        |
                  | gAv     | 88             | 82                   |             |                        |
Ext               | Acc     | 86             |                      | 73          | 91                     | 62
                  | Cov     | 93             |                      | 53          | 32                     | 63
                  | gAv     | 89             |                      | 62          | 54                     | 63
Nuc               | Acc     | 77             |                      | 64          | 56                     | 67
                  | Cov     | 85             |                      | 71          | 75                     | 59
                  | gAv     | 81             |                      | 67          | 65                     | 63
Cyt               | Acc     | 82             |                      | 43          | 47                     | 42
                  | Cov     | 64             |                      | 56          | 47                     | 38
                  | gAv     | 72             |                      | 49          | 47                     | 40
Mit               | Acc     | 73             | 54                   | 48          | 46                     | 30
                  | Cov     | 78             | 75                   | 59          | 59                     | 67
                  | gAv     | 75             | 64                   | 53          | 52                     | 45
Overall accuracy  | Q4      | 78             |                      | 57          | 51                     | 52

* Abbreviations used as in Table 1, with the following exceptions:
Data set: all sequence-unique eukaryotic non-plant proteins added between releases 40 and 41 of SWISS-PROT (Non-plant new unique in Table 4 Appendix, Supporting Online Material). Note that none of the proteins in this set had significant sequence similarity to any of the proteins that had annotations about localization in SWISS-PROT at the time of development of the prediction methods for which results are shown. In this sense, our test set could also provide an independent and likely more accurate estimate for the sustained performance than some of the original publications for some of the methods.
Localization: Ext, extra-cellular; Nuc, nuclear; Cyt, cytosolic; Mit, mitochondria; Chloro, chloroplast.
Methods: Predictions from methods other than LOCtree - introduced here - were taken from their public Internet servers (Methods), except for PSORT II that was run locally; numbers in square brackets under methods refer to original publication (References).
Numbers in bold: in each row, the best method(s) is (are) marked in bold letters; methods are grouped according to significant differences (below), i.e. all values that are statistically indistinguishable are marked as one best group.
Significant differences: For LOCtree, the standard deviation in the five-state accuracy was roughly six percentage points. The following estimates for standard deviations were published: TargetP [25], about one percentage point; NNPSL [9], about 2.5 percentage points; PSORT II [18], about 3.5 percentage points. Since no error estimates were published for SubLoc [23], we used 2.5 percentage points as the mean over the other three.
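The percentage-point margins quoted in the text can be recomputed directly from the values in Table 4. The short Python sketch below does this for the overall four-state accuracy and for the geometric average of the extra-cellular class; the helper name `margin_over_best_other` is hypothetical and introduced only for this illustration, and the numbers are copied from the table.

```python
from typing import Dict, Tuple

# Values copied from Table 4 (percentages).
Q4 = {"LOCtree": 78, "SubLoc": 57, "PSORT": 51, "NNPSL": 52}
EXT_GAV = {"LOCtree": 89, "SubLoc": 62, "PSORT": 54, "NNPSL": 63}


def margin_over_best_other(scores: Dict[str, int],
                           method: str = "LOCtree") -> Tuple[str, int]:
    """Return the best competing method and the margin in percentage points."""
    others = {m: s for m, s in scores.items() if m != method}
    best = max(others, key=others.get)
    return best, scores[method] - others[best]


print(margin_over_best_other(Q4))            # prints ('SubLoc', 21)
print(EXT_GAV["LOCtree"] - EXT_GAV["PSORT"])  # prints 35
```

The differences are absolute differences between percentages (percentage points), not relative changes.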



Performance better than existing methods even for incorrect sequences. LOCtree explicitly used information from the first 50 residues (N-termini) and compositions from the entire protein. Both are likely to be wrong for many proteins taken from large-scale sequencing projects [77, 78, 79, 80]. We tried to estimate the effect of such mistakes through two different 'models': (1) we cleaved off 30 N-terminal residues for all proteins, and (2) we randomly picked positions to remove one third of the sequence for each protein. These tests constituted worst-case scenarios in the sense that both substantially over-estimated sequencing errors. We found that the overall prediction accuracy of LOCtree on the randomly cleaved fragments was 68% (Table 2 Appendix, Supporting Online Material), 10 percentage points less than what was obtained using the full protein sequence. For the N-terminally cleaved sequences, the accuracy dropped further to 55% due to the explicit dependence of LOCtree on N-terminal sequence information. This is still accurate enough to provide reliable first estimates of localization for genomic sequences.
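
The two error 'models' can be sketched in a few lines of Python. The text does not say whether the removed third was one contiguous stretch or scattered positions. The sketch below assumes a single contiguous stretch starting at a random position, and the function names are hypothetical.

```python
import random

def cleave_n_terminus(seq: str, n: int = 30) -> str:
    """Model 1: drop the first n residues of the sequence."""
    return seq[n:]

def remove_random_third(seq: str, rng: random.Random) -> str:
    """Model 2 (assumed reading): delete one contiguous stretch covering
    a third of the sequence, starting at a randomly picked position."""
    cut = len(seq) // 3
    start = rng.randrange(len(seq) - cut + 1)
    return seq[:start] + seq[start + cut:]

rng = random.Random(0)          # fixed seed so the example is reproducible
protein = "ACDEFGHIKL" * 9      # toy 90-residue sequence, not a real protein
print(len(cleave_n_terminus(protein)), len(remove_random_third(protein, rng)))
```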

About 80% agreement between predictions and large-scale experiments in yeast. Over the last years, the large-scale experimental determination of subcellular localization for a substantial fraction of all yeast proteins has become increasingly accurate. Using high-throughput immuno-localization (IL) of epitope-tagged gene products, the Snyder group [5] determined the localization for about 60% of all yeast proteins, while the O'Shea group [2] exploited high-throughput green fluorescent protein (GFP) tagging to cover about 66%. Neither study distinguished between membrane and non-membrane proteins, and neither captured secreted proteins. Many proteins were experimentally associated with more than one compartment: 35% for Snyder et al. and 31% for O'Shea et al. We compared the LOCtree predictions for all proteins that were predicted not to contain membrane helices and that were observed to be nuclear or mitochondrial in the two large-scale experiments. Proteins observed to be in the cytoplasm in the two large-scale studies were excluded from our analysis since a large fraction (43% for Snyder et al. and 55% for O'Shea et al.) of cytosolic proteins were also observed in alternative compartments. Similarly, we also excluded all other proteins experimentally associated with more than one compartment. This filtering left over 1,000 proteins from the GFP data and about 200 proteins from the IL data. For both these data sets, about 80% of the predictions from LOCtree were identical to the experimental results (Table 5). This is comparable to the 80% agreement between the GFP data and traditional non-high-throughput results previously annotated in SGD [81, 19]. The agreement between GFP and IL is about 75% for the subset of 146 proteins found to be nuclear or mitochondrial by GFP that were also found in the IL data set. The agreement between GFP annotations and yeast proteins annotated using homology to SWISS-PROT proteins was 79%. This value is for five subcellular classes and uses HVAL>10 (eqn. 1) for homology annotations. For the IL data the agreement with SWISS-PROT was 72% (note that, due to the small data set, this number is a very inaccurate estimate).
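
The filtering and the per-compartment comparison described above can be sketched as follows. This is an illustration on toy data, not the original analysis script. The accuracy and coverage definitions (correct/predicted and correct/observed) are the usual ones and are assumed here, and all identifiers are hypothetical.

```python
def filter_observations(observed: dict[str, set[str]],
                        keep=("nuclear", "mitochondrial")) -> dict[str, str]:
    """Keep proteins observed in exactly one compartment, and only if
    that compartment is one of those in `keep`."""
    return {pid: next(iter(locs)) for pid, locs in observed.items()
            if len(locs) == 1 and next(iter(locs)) in keep}

def accuracy_and_coverage(observed: dict[str, str],
                          predicted: dict[str, str],
                          compartment: str) -> tuple[float, float]:
    """Accuracy = correct / predicted, coverage = correct / observed (percent)."""
    correct = sum(1 for pid, loc in observed.items()
                  if loc == compartment and predicted.get(pid) == compartment)
    n_pred = sum(1 for pid in observed if predicted.get(pid) == compartment)
    n_obs = sum(1 for loc in observed.values() if loc == compartment)
    acc = 100.0 * correct / n_pred if n_pred else 0.0
    cov = 100.0 * correct / n_obs if n_obs else 0.0
    return acc, cov

# Toy data: P2 is multi-compartment and P4 is cytoplasmic, so both are dropped
observed_raw = {"P1": {"nuclear"}, "P2": {"nuclear", "cytoplasm"},
                "P3": {"mitochondrial"}, "P4": {"cytoplasm"}, "P5": {"nuclear"}}
predicted = {"P1": "nuclear", "P3": "nuclear", "P5": "cytoplasm"}

filtered = filter_observations(observed_raw)
print(accuracy_and_coverage(filtered, predicted, "nuclear"))
```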

Chloroplasts: experimental data supported by predictions. Using tandem mass spectroscopy (TMS), Kleffmann et al. [26] have recently identified 690 proteins localized in the chloroplast of Arabidopsis thaliana (weed). Of these we predicted 190 to contain membrane helices. We compared LOCtree and TargetP [25] for the remaining 500 proteins (Fig. 3). The following results stood out: (1) less than half of these proteins were identified by all three methods, (2) a considerable fraction (29%) of the 500 proteins was only identified by TMS, and (3) when comparing the chloroplast predictions for all weed proteins, we found that LOCtree and TargetP agreed in about 87% of their predictions. TargetP however predicts more chloroplast proteins than LOCtree. This could be due to the over-prediction of chloroplast proteins by TargetP which has been previously reported by an independent group [82].
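
The three-way comparison between the TMS set and the two predictors is essentially set arithmetic. A minimal sketch, with made-up protein identifiers and a hypothetical function name, of how the fractions quoted above could be tabulated:

```python
def overlap_summary(tms: set[str], loctree: set[str],
                    targetp: set[str]) -> dict[str, float]:
    """Percentages of the experimentally identified (TMS) proteins that are
    confirmed by both predictors, by at least one, or by neither."""
    n = len(tms)
    all_three = tms & loctree & targetp
    at_least_one = tms & (loctree | targetp)
    tms_only = tms - loctree - targetp
    return {
        "all_three": 100.0 * len(all_three) / n,
        "at_least_one": 100.0 * len(at_least_one) / n,
        "tms_only": 100.0 * len(tms_only) / n,
    }

# Toy identifiers; x and y are chloroplast predictions not seen by TMS
tms = {"a", "b", "c", "d"}
loctree = {"a", "b", "x"}
targetp = {"a", "c", "y"}
print(overlap_summary(tms, loctree, targetp))
```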



Table 5
Table 5: Performance of LOCtree based on large scale Yeast localization data.*
Method     Nuclear              Mitochondrial
           Obs    Acc    Cov    Obs    Acc    Cov
GFP [2]    586    82     68     418    83     51
IL [5]     124    88     64     60     76     62

* Abbreviations used: Methods: Experimental subcellular localization data for proteins in yeast were obtained from two methods: GFP, large-scale localization using green fluorescent protein tagging [2]; IL, large-scale localization using high-throughput immunolocalization of epitope-tagged proteins [5]. Data: Obs, number of non-membrane proteins assigned to this compartment by the respective large-scale experiment. All proteins observed to be in multiple compartments by the large-scale methods were excluded from our analysis. LOCtree was used to predict localization of the remaining proteins. The prediction accuracy (Acc) and coverage (Cov) of LOCtree were calculated by assuming that the localization observed in the large-scale experiment represents the true localization of the protein. Note: Cytoplasmic proteins were excluded from our analysis since a large fraction (45%-70%) of all proteins was observed to be in the cytoplasm in the two large-scale experiments. Nearly half of all cytoplasmic proteins were observed to be associated with more than one compartment and many are likely to be further sorted to other compartments.


Fig. 3
fig3.gif

Fig. 3: Benchmarking LOCtree using large-scale experimental data. Both LOCtree and TargetP [25] were used to predict the localization of nearly 500 chloroplast proteins from Arabidopsis thaliana which were identified using tandem mass spectroscopy (TMS) by Kleffmann et al. [26]. LOCtree and TargetP both predicted over 45% of these proteins to be localized in the chloroplast, lending strong support to the large-scale experimental data using TMS. Over 70% of the proteins were predicted to be in the chloroplast by at least one server. TargetP showed a high degree of agreement with LOCtree, agreeing with over 87% of the LOCtree predictions.



 

Application to representative proteomes

We used LOCtree to annotate the subcellular localization for all non-membrane proteins in the entire proteomes of Homo sapiens (human) [27, 28], Arabidopsis thaliana (weed) [83], and Saccharomyces cerevisiae (yeast) [19] (Fig. 4). The results of our proteome annotations can be queried and downloaded from the LOCtree website ( http://cubic.bioc.columbia.edu/cgi-bin/var/nair/LOCtree/query-genome.pl or http://www.rostgroup.org/cgi-bin/var/nair/LOCtree/query-genome.pl ). We estimated that over 60% of all non-plant and over 50% of all plant proteins are nuclear or remain in the cytosol (Fig. 4, Table 3 Appendix, Supporting Online Material). While over 75% of the non-membrane proteins in all genomes appeared intra-cellular, the fraction of secreted proteins varied substantially between 8% and 20%, with plants having fewer than 10% extra-cellular proteins and the number exceeding 20% in human. Nuclear proteins were overabundant in yeast. In general, the unicellular yeast was somewhere in between human and weed in its composition of compartments.



Fig. 4
fig4.gif

Fig. 4: Composition of compartments in 3 representative proteomes. Note that 100% of each pie chart represents the number of proteins without transmembrane helices (predicted by PHDhtm [11, 105, 14] and taken from PEP [30]). The final estimates were corrected in order to account for our compartment-specific estimates of accuracy and coverage (eqn. 11). For all three proteomes the nucleus appeared to take the lion's share of all proteins; only the chloroplast came near this value for the plant representative weed (Arabidopsis thaliana). Human (Homo sapiens) has significantly more secreted proteins than do weed and yeast (Saccharomyces cerevisiae). Yeast appeared to have the highest fraction of mitochondrial proteins. For the proteomes, the percent fractional errors for the estimates of the different compartments are: Extra-cellular (9%), Organelle (34%), Nuclear (10%), Cytosol (20%), Mitochondria (25%) and Chloroplast (17%).
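
Eqn. 11 is not reproduced in this part of the text. One correction that follows directly from the usual definitions (accuracy = correct/predicted, coverage = correct/observed) is to scale the predicted count by accuracy over coverage. Whether eqn. 11 has exactly this form is an assumption, and the function name and numbers below are purely illustrative.

```python
def corrected_count(n_predicted: float, accuracy: float, coverage: float) -> float:
    """Estimate the true number of proteins in a compartment.

    With accuracy = correct / predicted and coverage = correct / true,
    it follows that true = predicted * accuracy / coverage.
    Accuracy and coverage are given in percent.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return n_predicted * accuracy / coverage

# Illustrative numbers only: 100 predictions at 80% accuracy and 50% coverage
print(corrected_count(100, 80, 50))
```

When accuracy and coverage are balanced, the corrected count equals the raw count, which is one practical reason to prefer balanced predictors for proteome-wide estimates.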



Discussion

Tree-based system provided advantages beyond boosting performance. Our results demonstrated how the prediction of subcellular localization can be substantially improved by mimicking the biological protein trafficking mechanism as closely as possible through LOCtree, a hierarchical tree of SVMs (Fig. 1, Table 1, Table 2, Table 3). PSORT [15, 16, 84, 18] is based on an implementation of a reasoning tree that is conceptually most similar to LOCtree. However, unlike LOCtree, PSORT is not based on an explicit 'ontology' of subcellular localization. Instead, the nodes of the PSORT reasoning tree assign a probabilistic value to the presence/absence of a single feature and have no intrinsic meaning. In contrast, the nodes of LOCtree separate proteins belonging to different cellular sorting pathways. In addition to being more accurate, our machine learning system provided two added benefits. The first was the prediction of 'intermediate stages'. The prediction of 'intermediate stages' - such as secretory pathway - was achieved at much higher levels of accuracy than that of the native compartments. This is not too surprising given that large-scale experiments using IL and GFP-tagging [5, 2] suggest that 30% of all proteins are ambiguous, i.e. spend a considerable portion of their life-time in more than one native compartment. These experimental data might strike many biologists as 'expected' since the vast majority of proteins travel through the cell, e.g. most extra-cellular proteins 'visit' at least three other compartments (ER, Golgi, vesicles) before they eventually are secreted. However, the very fact that our system reaches levels above 74% accuracy for the distinction of proteins into one of five compartments (extra-cellular, cytoplasmic, mitochondrial, nuclear, and organellar) suggested that most proteins have very strong preferences for one single compartment imprinted onto their sequences. 
The importance of post-translational modifications in altering these sequence signals as a means to increase the 'fitness' for other compartments might explain the difference between these two extreme opposite perceptions of proteins as native to 'one native compartment' and as 'frequent travelers between compartments' [64, 85, 35]. The second advantage of our hierarchical system might appear to be of a more technical nature, namely that our modular architecture allows the addition of more fine-grained modules at later stages (e.g. the split of nuclear into nuclear-matrix and other [86]). However, this seemingly technical detail once again was born out of the advantage of mimicking the actual sorting system. We observed that as one descends the hierarchical tree the prediction accuracy progressively decreases, since the classification task becomes increasingly complex and the SVM has to discriminate between increasingly similar proteins. One problem with our decision tree-like implementation was that a prediction mistake at a top node could not be corrected at nodes lower in the hierarchy. The appropriate choice of the evolutionary hierarchy was, therefore, crucial. The fact that we cannot correct mistakes from higher levels was by no means a feature of the design: we tried to recover from higher-level sorting mistakes by predicting localization of a protein at all nodes and averaging over the prediction strengths over all higher level nodes. Sometimes this worked; most of the time, however, such an alteration introduced new mistakes.
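
A minimal Python sketch of the idea of a hierarchical tree of binary decisions follows. It is not the LOCtree implementation: the node layout is an assumed simplification (Fig. 1 is not reproduced here), the decision functions are dummy string rules standing in for trained SVMs, and the per-node accuracies are invented. It shows two points made above: intermediate stages are predicted on the way to a leaf, and a wrong turn at an upper node cannot be undone further down.

```python
import math
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Node:
    name: str
    decide: Callable[[str], bool]       # True -> left branch, False -> right branch
    left: Union["Node", str]            # sub-tree or final compartment label
    right: Union["Node", str]

def classify(node: Union[Node, str], seq: str) -> tuple[str, list[str]]:
    """Walk from the root to a leaf and return (compartment, stages visited)."""
    path = []
    while isinstance(node, Node):
        branch = node.left if node.decide(seq) else node.right
        path.append(branch.name if isinstance(branch, Node) else branch)
        node = branch
    return node, path

# Dummy decision rules; a real system would call a trained classifier here
tree = Node("root", lambda s: s.startswith("S"),
            left=Node("secretory pathway", lambda s: s.endswith("E"),
                      "extra-cellular", "organelle"),
            right=Node("intra-cellular", lambda s: "K" in s,
                       "nuclear",
                       Node("cytoplasm", lambda s: "C" in s,
                            "cytosol", "mitochondria")))

print(classify(tree, "SAAE"))
print(classify(tree, "AAA"))

# If node decisions were independent, the accuracy along a path would be the
# product of the node accuracies (invented values), so deeper leaves do worse
print(math.prod([0.90, 0.85, 0.80]))
```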

Over 20 percentage points improvement over existing generalized methods. We also showed that the increase in the size of the training set through the addition of noisy predictions was more relevant than the noise added from the mistakes in these annotations obtained through text-analysis [8] and homology-transfer [6]. This surprising finding suggested that prediction methods might improve even more through the continued addition of data from large-scale experiments on localization. Finally, we confirmed our previous findings [68] that predicted structure and evolutionary profiles contain information relevant for the prediction of localization. All these data combined with our hierarchical tree-based system improved the overall accuracy by over 20 percentage points over the best competitor that generically predicted localization in four states (extra-cellular, nuclear, cytoplasmic, mitochondrial; Table 4). The only method that performed at a similarly high level as LOCtree was TargetP [25] (Table 4), which focuses on particular classes (secreted, mitochondria, chloroplast). TargetP [25] was also the only method that appeared significantly better at predicting one particular compartment, namely, chloroplast proteins (Table 1 Appendix, Supporting Online Material; note however that in this test we did compare our method in cross-validation mode to TargetP in not-cross-validation mode, i.e. were likely to have over-estimated the performance of TargetP). Predictions using LOCtree had the added advantage of being extremely balanced between accuracy and coverage (Table 1, Table 2, Table 3, Table 4). In contrast, methods such as SignalP [20] - the secreted/not-secreted component of TargetP - are either very prone to over- (high coverage, low accuracy) or under-prediction (low coverage, high accuracy; e.g. TargetP, PSORT II). The only other general method that was as well balanced as LOCtree was NNPSL [9], which had an overall performance 27 percentage points below LOCtree (Table 4).

Many estimates for performance had a rather short life-span. Another problem that we noticed was that only TargetP and PSORT had published estimates for performance that were close to our results on a 'never-seen-before' set of sequence-unique proteins. In particular, SubLoc [23] was estimated to achieve an overall accuracy of Q4=79%, while it reached only 57% on our data (Table 4). The differences may be explained by the fact that up to 90% pairwise sequence identity was allowed between testing and training set for the original publication of SubLoc and NNPSL [9]. Cai et al. [87] also claim a very high level of accuracy (73%). That value was more difficult to compare because their methods are not available as servers and because their publications did not rigorously describe protocols for removing redundancy. In fact, it appears that only proteins identical between training and testing set were excluded. Furthermore, the accuracy was compiled on a different partition of the prediction goal. More recently this group published even higher estimates using similar data sets with unspecified sequence similarity between testing and training [65, 66]. In general, the problem of correctly estimating performance is a very difficult one, as illustrated by the biennial meetings for the Critical Assessment of Structure Prediction (CASP [88, 89, 90, 91, 92]) and by servers that evaluate the performance of servers such as EVA (http://www.rostlab.org/EVA [93, 94]). The task is particularly difficult in a field in which we have too few experimental data and no continuous resource providing them.

As accurate as large-scale experiments? Numerically, our predictions from LOCtree agreed as much with traditional 'small-scale' biochemical determinations of subcellular localization as did the recent large-scale experiments [5, 2, 26] (Table 5, Table 4). Interestingly, our predictions reached a similar level of performance as large-scale experiments (GFP: 79%, IL: 72%, and LOCtree: 74-78%) if analyzed against more careful traditional approaches as the standard-of-truth. This by no means implies that we aimed at the replacement of experiments. Rather, we see predictions from LOCtree as a reasonable, cheap starting point for careful experiments and as a complement for the interpretation of large-scale results. Furthermore, our prediction method had slightly different potential than the large-scale experiments, e.g. while we could identify secreted proteins, we currently could not clearly distinguish between Golgi and vesicles, nor did we include membrane proteins.

Open tasks. Our current system marked in some ways the end of a very long series of methods aimed at predicting localization. While methods using homology-transfer (LOChom [6]) and text-analysis (LOCkey [8]) were crucial for achieving our new state-of-the-art level of performance, we will have to tie up some loose ends. In particular, we currently exclude membrane proteins, treat each protein as one without any experimental annotations (LOChom and LOCkey are used for training and for our prediction server; however, they are not generically integrated into LOCtree), and do not distinguish between proteins that are generically native to more than one compartment and those which are not. Furthermore, the task of annotating more than a few representative proteomes remains.

 

Conclusion

Previous attempts at predicting subcellular localization have implemented machine-learning algorithms using the standard parallel architecture as is common practice in computer science and have focused on improving prediction by incorporating additional sequence features that are correlated with localization. Here we have shown that prediction accuracy can be significantly improved by using a hierarchical architecture of support vector machines to mimic the protein sorting mechanism. This result is likely to hold for other aspects of protein function and can significantly aid the development of more accurate predictors of protein function. The ability of many proteins to function in more than one native subcellular compartment makes the prediction task especially difficult. In fact, over 30% of the more than 4000 yeast proteins for which localization has been determined using high-throughput experiments [2] are associated with more than one compartment. The hierarchical architecture of LOCtree can better incorporate proteins which spend time in more than one native compartment by predicting 'intermediate' localization states, which span multiple subcellular classes, at much higher accuracy. The fact that the system achieved an overall five-state prediction accuracy of 74% seems to indicate that the native subcellular localization is imprinted somehow onto the protein sequence and that a majority of proteins carry only one strong sequence signal for one particular compartment. In future, it should be possible to further extend the abilities of LOCtree by adding modules that can make fine-grained distinctions such as discriminating among the different organelles and various substructures like the nucleolus.

 

Methods and Materials

Data sets used for development and evaluation. We selected all eukaryotic and prokaryotic proteins with explicit annotations about subcellular localization in SWISS-PROT release 40 [95]. We excluded proteins annotated as MEMBRANE, POSSIBLE, PROBABLE, SPECIFIC PERIODS or BY SIMILARITY. We also excluded proteins annotated with multiple localizations. This left about 9,000 eukaryotic proteins and 13,000 prokaryotic proteins in our trusted set of experimentally annotated localization (SWISS-PROT annotated set, Table 4 Appendix, Supporting Online Material). Training and test sets were constructed from this set such that no pair of proteins from any two sets had sequence similarity levels corresponding to HVAL>5 (eqn. 1). We picked this value, since below this threshold assigning subcellular localization based solely on homology leads to significant errors [6]. Furthermore, the test set was redundancy reduced at HVAL<10 using a simple greedy search [96]. This ensured that no two proteins in the test set had greater than 25% sequence identity over more than 250 residues (number of sequence unique proteins given in Table 4 Appendix, Supporting Online Material). The reason for this reduction was to find a balance between biased data, known to yield over-estimates [97, 98], and data sets that are too small and therefore likely to yield incorrect estimates. We did not have to define thresholds for significant sequence similarity between motifs such as signal peptides [97], since we never explicitly used this information; rather, we used the entire protein information. All data available at: www.rostlab.org/results/2005/LOCtree/.
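
A greedy redundancy reduction of the kind mentioned above can be sketched as below. This is a generic greedy pass, not necessarily the procedure of [96], which may for instance order proteins differently. The similarity function and the toy HVAL values are placeholders.

```python
from typing import Callable, Sequence

def greedy_unique_subset(ids: Sequence[str],
                         similarity: Callable[[str, str], float],
                         threshold: float) -> list[str]:
    """Keep a protein only if its similarity to every protein kept so far
    stays below the threshold."""
    kept: list[str] = []
    for pid in ids:
        if all(similarity(pid, other) < threshold for other in kept):
            kept.append(pid)
    return kept

# Toy pairwise HVALs (hypothetical); a real run would take them from alignments
toy_hval = {("A", "B"): 12.0, ("A", "C"): 3.0, ("B", "C"): 2.0}

def lookup(a: str, b: str) -> float:
    """Symmetric lookup into the toy table."""
    return toy_hval.get((a, b), toy_hval.get((b, a), 0.0))

print(greedy_unique_subset(["A", "B", "C"], lookup, threshold=10.0))
```

The result depends on the input order, so a fixed ordering (for example by sequence length or by identifier) is needed for reproducibility.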

SWISS-PROT-new set used for testing only. After we completed the development of all our methods, we used an additional data set to re-examine performance, namely, we collected all proteins that had been added to SWISS-PROT between release 40 and 41 (results presented in Table 4). We excluded all new proteins that had HVAL>5 (eqn. 1) to any previously used protein and found the sequence-unique subset of the new proteins (Table 4 Appendix, Supporting Online Material). We never used any of these proteins for development, and it is rather unlikely that any of the other methods tested (Table 4) used any of these since all methods were developed based on SWISS-PROT releases <41.

HSSP-value to measure pairwise sequence similarity. The simplest way to measure sequence similarity is percentage pairwise sequence identity (PIDE), i.e. the percentage of residues identical between two proteins (not counting gaps). Another measure is the statistical expectation value as reported by BLAST. Here, we used a third measure, namely the HSSP-value (HVAL), because it allowed a more accurate separation of protein pairs for which similarity in localization is recognizable from sequence than the other two [6]. The HVAL [99, 100] is given by:

      (eqn. 1)

where L is the number of residues aligned between two proteins and PID the percentage of pairwise identical residues.
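
The equation itself did not survive in this version of the text. The sketch below uses the length-dependent HSSP curve as it is commonly stated in the literature cited for it [99, 100]: HVAL is PID minus a threshold of 100 for L ≤ 11, 480·L^(-0.32·(1+e^(-L/1000))) for 11 < L ≤ 450, and 19.5 for L > 450. That this is exactly the form of eqn. 1 is an assumption, and the function names are hypothetical.

```python
import math

def hssp_curve(length: int) -> float:
    """Length-dependent identity threshold (percent) of the assumed HSSP curve."""
    if length <= 11:
        return 100.0
    if length <= 450:
        return 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    return 19.5

def hval(pid: float, length: int) -> float:
    """Distance of an alignment (PID in percent, L aligned residues) from the curve."""
    return pid - hssp_curve(length)

# Example: 25% identity over 250 aligned residues lies a few points above the curve
print(round(hval(25.0, 250), 1))
```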

Increasing size of training set.  Preliminary results suggested that a larger training set improved SVM performance through increased coverage of the sequence space. Another source of improvement was using a sequence redundant set of proteins to train the SVM. Two strategies were used to increase the size of the training set: (1) SWISS-PROT keyword-based annotations: using LOCkey [8] , we first annotated localization for all proteins in the SWISS-PROT database for which adequate keyword functional information was present in the database. Next, proteins with HVAL>5 to proteins in the test set were excluded. The remaining proteins were added to the training set; and (2) Homology based annotations: using LOChom [6] , we annotated localization for all sequence homologues in the SWISS-PROT database of proteins in the training set. Using both procedures increased the size of the training set by almost a factor of four.
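
The exclusion step used when enlarging the training set can be sketched as a simple filter. This is a hedged illustration only: `hval_between` stands for whatever pairwise HVAL computation is available, and all names and values are hypothetical.

```python
from typing import Callable, Iterable

def enlarge_training_set(training: Iterable[str],
                         candidates: Iterable[str],
                         test: Iterable[str],
                         hval_between: Callable[[str, str], float],
                         max_hval: float = 5.0) -> list[str]:
    """Add candidate proteins (e.g. annotated by keywords or homology) to the
    training set unless they exceed max_hval to any protein in the test set."""
    test = list(test)
    result = list(training)
    seen = set(result)
    for cand in candidates:
        if cand in seen:
            continue
        if all(hval_between(cand, t) <= max_hval for t in test):
            result.append(cand)
            seen.add(cand)
    return result

# Toy HVALs between candidates and test proteins (hypothetical values)
toy = {("C1", "T1"): 8.0, ("C2", "T1"): 1.0}
pair_hval = lambda a, b: toy.get((a, b), 0.0)

print(enlarge_training_set(["X1"], ["C1", "C2", "X1"], ["T1"], pair_hval))
```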

Building evolutionary profiles. We showed previously [68] that using evolutionary information in the form of sequence profiles significantly improves prediction accuracy. Profiles were built by aligning the sequences against the SWISS-PROT + TrEMBL database using the MaxHom dynamic programming algorithm [101]. The aligned sequences were filtered for redundancy at 95% pairwise sequence identity, i.e. pairs exceeding this limit were removed. Finally, we included only those proteins that had HVAL>5 and PID>50% with respect to the guide sequence. These thresholds were previously found to be optimal for a rather different prediction method [68]. Fi