Chapter 6
Predicting subcellular localization based on functional hierarchies†
Predicting the native subcellular compartment of a protein is an important step towards elucidating its function. Here, we describe a novel support vector machine (SVM) based localization predictor, LOCtree, incorporating a hierarchical ontology of subcellular localization classes which uses as input a number of observed and predicted features of the amino acid sequence. By mimicking the biological trafficking mechanism as closely as possible using a machine learning architecture, LOCtree very accurately predicts the subcellular localization of proteins. LOCtree achieved an accuracy of 74% for eukaryotic non-plant, 70% for plant and 84% for prokaryotic sequences on a non-redundant test set. We rigorously benchmarked the performance of LOCtree vis-à-vis a number of publicly available methods for localization prediction. LOCtree outperformed all publicly available servers in nearly all benchmarks. Localization assignments using LOCtree agreed quite well with data from recent large-scale experiments. From our analysis of the Homo sapiens, Saccharomyces cerevisiae and Arabidopsis thaliana genomes using LOCtree we concluded that over 35% of all non-membrane proteins are nuclear, around 20% retained in the cytosol, over 10% mitochondrial and around 20% of plant proteins are chloroplastic. The prediction server and the results of our genome predictions are available at http://cubic.bioc.columbia.edu/services/LOCtree/.
Computational prediction of subcellular localization indispensable. The genome (DNA) sequences of more than 100
organisms, including a draft sequence of the human genome (Venter, Adams et al.
2001; Istrail, Sutton et al. 2004), have
now been completed. For over 60 of these, the data is publicly available and
contributes about 250K protein sequences, i.e. about one fourth of all
currently known protein sequences (Liu and Rost 2001;
Carter, Liu et al. 2003; Pruess, Fleischmann et al. 2003). With
this explosion of genome sequences, a major challenge in modern biology is to
understand the expression, function, and regulation of the entire set of
proteins encoded by an organism. This information will be invaluable for understanding how complex
biological processes occur at a molecular level, how they differ in various
cell types, and how they are altered in disease states (Zhu, Bilgin et al. 2003). Proteins must be localized in the
same subcellular compartment to cooperate towards a common function. Therefore,
experimentally unravelling the native compartment of a protein constitutes one
step on the long way to determining its role (Bork, Ouzounis et al. 1994; Bork, Dandekar et
al. 1998). Using experimental
high-throughput methods for epitope and green fusion protein (GFP) tagging, two
groups have recently reported localization data for most proteins in Saccharomyces cerevisiae (bakers yeast) (Kumar, Agarwal et al. 2002; Huh, Falvo et al.
2003). So far, the majority of
large-scale experiments suggesting localization have been restricted to yeast,
or to particular compartments, such as a recent analysis of chloroplast
proteins in Arabidopsis thaliana
(weed) (Kleffmann, Russenberger et al. 2004). As of now, these large-scale
experiments cannot be repeated for mammalian or other higher eukaryotic
proteomes. One significant obstacle is that large-scale production of a
collection of cell lines each with a defined gene chromosomally tagged at the
3′-end is not yet possible (Davis 2004). In contrast, computational tools
can provide fast and accurate localization predictions for any organism (Bork and Koonin 1998; Koonin 2000). In fact, attempts to predict subcellular
localization has become one of the central problems in bioinformatics (Eisenhaber and Bork 1998; Nakai 2000;
Schneider and Fechner 2004).
Bioinformatics partially successful in predicting localization from
sequence. A number
of methods have tried to predict localization by identifying local sequence
motifs, such as signal peptides(von Heijne 1995; Nielsen, Engelbrecht et al.
1997; Nakai and Horton 1999; Emanuelsson and von Heijne 2001) or nuclear localization signals
(NLS) (Cokol, Nair et al. 2000; Nakai 2000) that are responsible for protein
targeting. Proteins destined for the secretory pathway, the mitochondria and
the chloroplast contain N-terminal targeting peptides that are recognised by
the translocation machinery (Rusch and Kendall 1995; Schatz and Dobberstein
1996). Thus, prediction methods use only
the N-terminal residues (Nielsen, Engelbrecht et al. 1997; Emanuelsson,
Nielsen et al. 2000). Many proteins destined for the
nucleus contain NLS motifs that may occur anywhere in the sequence. Recently,
we have collected a data set of experimental and potential NLS motifs as an aid
to predicting nuclear localization (Nair, Carter et al. 2003). However, the vast majority of
proteins have no known motif. For mitochondrial and chloroplast proteins, a
number of alternative targeting pathways have recently been discovered (Folsch, Guiard et al. 1996; Herrmann and
Neupert 2000). Furthermore, a particular problem
for methods detecting N-terminal signals is that start-codons are predicted
with less than 70% accuracy by genome projects (Reinhardt and Hubbard 1998; Lander, Linton et
al. 2001; Venter, Adams et al. 2001). Overall, known and predicted sequence
motifs enable annotating about 30% of the proteins in six entirely sequenced
eukaryotic proteomes (Liu and Rost 2001; Mott, Schultz et al. 2002). Other methods that can be reliably
used to annotate localization but are not always applicable are annotation
transfer from sequence homologues (Nair and Rost 2002) and text analysis (Fleischmann, Moller et
al. 1999; Kretschmann, Fleischmann et al. 2001; Nair and Rost 2002; Lu, Szafron
et al. 2004).
Another promising approach which has
the advantage of high coverage has been suggested by the observation that the
overall amino acid composition of a protein correlates with its native
compartment (Nishikawa, Kubota et al. 1983; Nakashima and
Nishikawa 1994). This observation has led to the
development of a variety of ab initio
prediction methods based solely on composition (Reinhardt and Hubbard 1998; Chou and Elrod
1999; Hua and Sun 2001; Nair and Rost 2003). Higher order correlations
(residues i and (i+n), n=2,3,4) have been accounted for by using pseudo-amino
acid composition (Cai, Liu et al. 2002; Chou and Cai 2003; Pan,
Zhang et al. 2003). Recently, we showed that
incorporating structural and evolutionary information significantly improves
prediction accuracy (Nair and Rost 2003). Using a domain projection method
based on the presence of SMART domains, Mott et al. (Mott, Schultz et al. 2002) were able to accurately predict
localization for approximately 25% of eukaryotic proteins. With the
availability of many completely sequenced genomes, phylogenetic profiles have
been employed to identify subcellular localization (Marcotte, Xenarios et al. 2000). So far, this approach has been
much less accurate than methods based solely on composition. Drawid &
Gerstein have proposed a Bayesian system, based on a diverse range of 30
different features, to predict the localization of yeast proteins (Drawid and Gerstein 2000).
Significant shortcomings in traditional approach to ab initio prediction
of subcellular localization. The traditional approach to ab
initio prediction of subcellular localization, or any other functional
class, has been to first define the localization classes of interest and then
train machine learning algorithms like neural networks or SVM’s to discriminate
proteins belonging to each of the classes from the rest of the proteins (Reinhardt and Hubbard 1998; Emanuelsson,
Nielsen et al. 2000; Hua and Sun 2001; Nair and Rost 2003). Finally, these dedicated networks
for the different classes are combined to predict the localization for an
unknown protein. However, in such a prediction scheme no consideration is given
to the fact that some classes of proteins are more similar to each other than
to other classes due to evolutionary factors. We propose that incorporating a
hierarchical ontology of evolutionary similarities into a machine learning
framework should lead to better utilization of the underlying biological
mechanism of targeting and improved prediction accuracy of localization
classes.
LOCtree: Ab initio prediction of subcellular localization based on
hierarchical ontologies. Here, we describe a novel SVM based localization predictor, LOCtree,
incorporating a hierarchical ontology of subcellular localization classes which
uses as input a number of observed and predicted features of the amino acid
sequence. The most important factor for the development of such a predictor is
an appropriate definition of the hierarchical ontology, one that is most
suitable for the problem being addressed. Biological similarities among the
various subcellular classes have been incorporated into a standardized
framework for the description of cellular components by the gene ontology (GO)
consortium (Ashburner, Ball et al. 2000; Lewis, Ashburner
et al. 2000). We used a modified version of the
GO definition for cellular classes, one that was more suitable for addressing
the problem of protein sorting. For example, in the GO classification scheme
endoplasmic reticulum and golgi apparatus are subcategories of the cytoplasm
since they are parts of the latter. However, proteins destined for the
extra-cellular space, endoplasmic reticulum (ER), golgi apparatus, endosomes
and lysosomes are targeted via the ER pathway and are more similar to each
other than they are to other intra-cellular proteins (Rusch and Kendall 1995). This is so since, due to
evolutionary reasons, these organelles are topologically equivalent to the
exterior of the cell. Hence, in our classification scheme these compartments
are grouped together and are designated as belonging to the secretory pathway.
To predict localization, LOCtree uses a decision tree like architecture derived
from gene ontology and from evolutionary similarities in the protein sorting
mechanism (Fig.6-1). The decision nodes in the tree are implemented using
SVM’s. We introduced methodologies for training and testing such architecture.
A decision tree like architecture for combining the SVM’s was observed to give
the best accuracy. LOCtree was extremely successful at learning evolutionary similarities
among subcellular localization classes and was significantly more accurate than
other traditional networks at predicting subcellular localization. This
architecture has the added advantage that at each node in the hierarchical tree
only the input features that are important for the classification task at that
node need to be used as input to the SVM. It has been noticed by other groups
that using input features selectively in this fashion remarkably improves
prediction accuracy (Jensen, Gupta et al. 2002). Another major problem with
traditional localization prediction systems is that to add a new prediction
module for a subcellular class; the module has to be integrated with all the
other dedicated networks predicting the other subcellular classes.
This integration can be fairly
cumbersome and affects the prediction results for all subcellular classes. The
advantage of implementing a hierarchical ontology is the ease with which
independently developed specialized classifiers can be added to this framework.
Any new classifier added to a specific node in the architecture only affects
prediction accuracy for the children of that node. To predict
subcellular localization, LOCtree uses a number of observed and predicted
features of the amino acid sequence. Observed features used included overall
amino acid composition and composition
of the 50 N-terminal residues. Evolutionary information was incorporated
in the form of sequence profiles (Nair and Rost 2003). Predicted features used
included, the amino acid composition separated into three secondary structure states (helix, strand, other) and
output from the SignalP server (Nielsen, Engelbrecht et
al. 1997).
The three state secondary structures of proteins were predicted using PROFphd (Rost 1996; Rost and Liu
2003).
Only the SVM at the top of the
hierarchy uses the raw SignalP output as input while all other SVM’s use only
the remaining compositional features as input. The SVM was implemented using the
SVM-light package.
The output of
the top level SVM determines if a protein is classified as belonging to the
secretory pathway or not (Fig. 6-1). Proteins targeted via the secretory pathway are further sorted into
extra-cellular proteins and other organelles while intra-cellular proteins are
further sorted into nuclear or cytoplasmic proteins. The cytoplasmic proteins
are further classified as cytosolic or mitochondrial for eukaryotic non-plant
proteins (Fig. 6-1a). For plant proteins a further distinction is made to
chloroplast proteins (Fig. 6-1b).

For prokaryotes a simpler
architecture is used achieving a three fold distinction between extra-cellular,
periplasmic and cytosolic proteins (Fig. 6-1c). We have applied LOCtree to
analyze the subcellular localization of complete genomes of a number of
eukaryotic and prokaryotic organisms. The LOCtree subcellular localization
prediction server and the results of our localization annotations for entire
genomes can be accessed at http://cubic.bioc.columbia.edu/services/LOCtree/.
Data sets used for development and evaluation. We selected all eukaryotic and
prokaryotic proteins with explicit annotations about subcellular localization
in SWISS-PROT release 40 (Bairoch and Apweiler 2000). We excluded proteins annotated as
MEMBRANE, POSSIBLE, PROBABLE, SPECIFIC PERIODS or BY SIMILARITY. We also
excluded proteins annotated with multiple localizations. This left about 9,000
eukaryotic proteins and 13,000 prokaryotic proteins in our SWISS-PROT data set
of experimentally annotated localization (‘SWISS-PROT
annotated’ set, Table 6-1). Training, and test sets were constructed from
this set such that no pair of proteins from any two sets had levels of sequence
similarity above HSSP-distances of 5 (Eq. 6-1). We picked this value, since
below this threshold assigning subcellular localization based solely on
homology leads to significant errors (Nair and Rost 2002). Furthermore, the test set was
redundancy reduced at HSSP-distances <10 using a simple greedy search (Hobohm, Scharf et al. 1992). This ensured that no two proteins
in the test set had greater than 40% sequence identity over more than 100
residues (number of sequence unique proteins given in Table 6-1). The reason
for this reduction was to find a balance between biased data known to yield
over-estimates (Nielsen, Engelbrecht et al. 1996; Rost 2002) and between data sets that were too
small. Note that we did not have to define thresholds for significant sequence
similarity between motifs, such as signal peptides (Nielsen, Engelbrecht et al. 1996), since we used the entire protein
information. Note: all data sets are available at: http://cubic.bioc.columbia.edu/results/2004/LOCtree/.

SWISS-PROT-new set used only for testing. After we completed the development of all our
methods, we used an additional data set to re-examine performance, namely, we
collected all proteins that had been added to SWISS-PROT between release 40 and
41 (labelled 'SWISS-PROT-new'). We filtered out all of these new proteins that
had HSSP-distances >5 to any previously used protein and found the
sequence-unique subset of the new proteins (Table 6-1). We never used any of
these proteins for development, and it is rather unlikely that the other
methods tested used any of these (Table 6-6).
HSSP-distance to measure pairwise sequence similarity. The HSSP-distance is defined as the
distance from the HSSP threshold (Rost 1999); it is given by:
HSSP-DISTANCE
= PIDE - HSSP_PIDE(q)
HSSP_PIDE(q)= q +
(Eq. 6-1)
where, L is the length of the
alignment between two proteins, PIDE the percentage of pairwise identical
residues, and HSSP_PIDE(q) the revised HSSP-threshold for the
level q.
Increasing size of training set. Preliminary results suggested that using a
sequence redundant set of proteins to train the SVM improved prediction of
localization. Two strategies were used to increase the size of the training
set: 1) Homology based annotations: using LOChom (Nair and Rost 2002) we annotated localization for all
sequence homologues in the SWISS-PROT database of proteins in the training set,
and 2) SWISS-PROT keyword based annotations: using LOCkey (Nair and Rost 2002) we first annotated localization for
all proteins in the SWISS-PROT database for which adequate keyword functional
information was present in the database. Next, any protein that had a sequence
similarity of HSSP-distance greater than 5 to proteins in the test set was
excluded. The remaining proteins were then added to the training set. Using the
two above mentioned procedures increased the size of the training set by nearly
four times.
Building profiles. We showed previously (Nair and Rost 2003) that using evolutionary information
in the form of sequence profiles can significantly improve prediction accuracy.
Profile-based composition was calculated by aligning the sequences against the
SWISS-PROT + TrEMBL database using MaxHom dynamic programming algorithm (Sander and Schneider 1991). The aligned sequences were
filtered for redundancy at 95% pairwise sequence identity, i.e. pairs exceeding
this limit were removed. Finally, we included only those proteins into the
alignment that were above an HSSP-distance of 5 and had a pairwise sequence
identity above 50% with respect to the guide protein of known localization.
These thresholds were found to be optimal on a limited subset of the training
data. Finally, the profile composition was calculated by replacing each amino
acid residue in the protein by the residue frequencies in the profile. All
composition information was input to the SVM in the form of profile-based
composition.
Hierarchical architecture and support vector machine training. Each decision node in the
hierarchical architecture of LOCtree (Fig. 6-1) was implemented using a support
vector machine (SVM). The SVM’s were implemented using the SVM-light package (Joachims 2000). The inputs to the SVM’s were the
amino acid composition, composition of the 50 N-terminal residues (both 20
input units) and amino acid composition in the three secondary structure states
(sixty input units). For the eukaryotic plant and non-plant architectures, raw
output from the SignalP server (Nielsen, Engelbrecht et al. 1997) was used as additional input to the
top node SVM’s which determine if a protein is sorted via the secretory pathway
or not. Altogether 100 variables (104 for the top-level SVM in the eukaryotic
architecture) were used as input to the SVM’s. Each SVM was trained using the
radial basis function (rbf) kernel. The
parameter for the rbf-kernel and the C parameter for
trade-off between training error and margin were determined by optimizing on a
limited subset of the training data. We used a constant value of
=15 and C=500 for all SVM’s. The predictions from the SVM
were observed to be quite resilient to changes in kernel parameters.
Final decision through simple winner-take-it-all. We experimented with two different methods
for determining the localization of a protein, 1) Decision tree: at each node
in the hierarchical architecture (Fig. 6-1), a simple yes/no decision was made
based only on the SVM output at that node to determine which branch of the
localization tree the protein belongs to, and 2) Summing over prediction
strengths: the branch of the localization tree the protein is sorted through at
each node was determined by summing the prediction strength’s over all previous
nodes. The simple decision tree architecture was finally used since it
outperformed the summing architecture by over 1 percentage points.
Cross-validation. The ‘SWISS-PROT annotated’
data was partitioned into six equal-sized sets using the HSSP-distance criteria
described above: five of these sets were combined to give the training set and
the sixth one was used for testing. Finally, we rotated through the test set
such that each protein was used for testing exactly once. We never used any
information from the test set to optimize parameters.
Evaluating performance. Six-fold cross-validation was
applied to test the hierarchical architecture. As a simple measure for
performance we used the percentage accuracy (Q, number of correctly predicted
test proteins as percentage of total number of test proteins). The
accuracy/specificity and coverage/sensitivity of the two-state networks were
measured using four ratios derived from TP
(number of proteins predicted to be in localization i and observed to be in localization i, the true positives), FP
(number of proteins predicted to be in localization i and observed not to be in i,
the false positives) and FN (number
of proteins predicted not to be in localization i and observed to be in i,
the false negatives). We used:
(Eq.
6-2)
(Eq.
6-3)
In other words, Acc is the accuracy of prediction, and Cov is the coverage. We combined these two numbers (Acc and Cov) through the geometric average:
(Eq. 6-4)
The overall accuracy was measured by
the accuracy Q:
Q =
(Eq. 6-5)
For assessing the accuracy of two
class predictions we used the Mathews correlation coefficient (MCC) (Mathews 1985) and the normalized mutual
information (MI) (Rost and Sander 1993):
(Eq. 6-6)
(Eq. 6-7)
The Mathews correlation coefficient
and the mutual information as defined above is only applicable for two classes.
For more than two classes, the Mathews correlation coefficient was modified to
the generalized correlation coefficient (Baldi, Brunak et al. 2000):
(Eq.
6-8)
where, N is the number of proteins,
K the number of localization classes,
the
confusion matrix and
is given by:
(Eq.
6-9)
For many classes the mutual
information was modified to the information coefficient (Baldi, Brunak et al. 2000):
(Eq. 6-10)
Prediction methods. The prediction accuracy of four publicly available methods was evaluated using the ‘Non-plant unique’, ‘Plant unique’ and the ‘Non-plant new unique’ sets (Table 6-8). The four methods were: (1) TargetP: neural network based tool predicting localization based on N-terminal sequence information