Chapter 5
Better prediction of subcellular localization by combining evolutionary and structural information†
The native subcellular compartment of a protein is one aspect of its function. Thus, predicting localization is an important step toward predicting function. Short zip code-like sequence fragments regulate some of the shuttling between compartments. Cataloguing and predicting such motifs is the most accurate means of determining localization in silico. However, only few motifs are currently known, and not all the trafficking appears regulated in this way. The amino acid composition of a protein correlates with its localization. All general prediction methods employed this observation. Here, we explored the evolutionary information contained in multiple alignments and aspects of protein structure to predict localization in absence of homology and targeting motifs. Our final system combined statistical rules and a variety of neural networks to achieve an overall four-state accuracy above 65%, a significant improvement over systems using only composition. The system was at its best for extra-cellular and nuclear proteins; it was significantly less accurate than TargetP for mitochondrial proteins. Interestingly, all methods that were developed on SWISS-PROT sequences failed grossly when fed with sequences from proteins of known structures taken from PDB. We therefore developed two separate systems: one for proteins of known structure and one for proteins of unknown structure. Finally, we applied the PDB-based system along with homology-based inferences and automatic text analysis to annotate all eukaryotic proteins in the PDB (http://cubic.bioc.columbia.edu/db/LOC3D). The LOC3D database currently contains predictions for over 8,700 eukaryotic protein chains taken from PDB. The web server can be used to predict subcellular localization for proteins for which only a predicted structure is available from threading servers. We imagine that this pilot method - certainly in combination with similar tools - may be valuable for target selection in structural genomics.
Subcellular localization important to elucidate protein function. Proteins must be localized in the same subcellular compartment to cooperate towards a common function. Therefore, experimentally unravelling the native compartment of a protein constitutes one step on the long way to determining its role. The explosion of sequence information through large-scale sequencing projects has widened the gap between the number of sequences deposited in public databases and the experimental characterisation of the corresponding proteins (Rost 1996; Koonin 2000; Rost 2002). Using high-throughput methods of epitope-tagging and immuno-fluorescence analysis Snyder et al (Kumar, Agarwal et al. 2002) have recently reported localization data for the entire proteome of Saccharomyces cerevisiae (yeast). So far, the majority of large-scale experiments suggesting localization have been restricted to yeast. This is primarily due to the fidelity of homologous recombination in yeast and the concomitant ease with which integrated reporter gene fusions can be generated. In contrast, computational tools can provide fast and accurate localization predictions for any organism (Eisenberg, Marcotte et al. 2000; Lewis, Ashburner et al. 2000). In fact, attempts to predict subcellular localization has become one of the central problems in bioinformatics (Eisenhaber and Bork 1998; Nakai 2000; Emanuelsson and von Heijne 2001).
Inferring localization by homology relatively accurate but not always applicable. A variety of approaches have been used to classify proteins with respect to subcellular localization. One of the most reliable means is annotation transfer from homologues (Andrade, O'Donoghue et al. 1998; Bork and Koonin 1998; Nair and Rost 2002): If a protein of experimentally known localization is significantly similar in sequence to a query protein U, localization can be inferred for U. However, the level of 'significant sequence similarity' varies substantially between localizations, and is much higher than that required for correct inference of folds (Rost 1999; Nair and Rost 2002). Thus, even when accepting many errors less than 25% of the proteins in SWISS-PROT (Bairoch and Apweiler 2000) can be classified by homology into one of ten compartments (Nair and Rost 2002). Using text analysis of SWISS-PROT keywords to infer localization, we can annotate subcellular localization for about 48% of all proteins in SWISS-PROT (Nair and Rost 2002).
Sequence motifs predict successfully for some compartments. Another way to predict localization is to identify local sequence motifs such as signal peptides (von Heijne 1995; Nielsen, Engelbrecht et al. 1997; Emanuelsson, Nielsen et al. 2000; Nakai 2000) or nuclear localization signals (NLS) (Cokol, Nair et al. 2000; Nakai 2000). Proteins destined for the secretory pathway, the mitochondria and the chloroplast contain N-terminal targeting peptides that are recognized by the translocation machinery (Rusch and Kendall 1995; Schatz and Dobberstein 1996). Thus, prediction methods use only the N-terminal residues (Nielsen, Engelbrecht et al. 1997; Emanuelsson, Nielsen et al. 2000). Discriminant analysis has been applied to identify proteins imported into mitochondria (Claros and Vincens 1996). Many proteins destined for the nucleus contain NLS motifs that may occur anywhere in the sequence (Mattaj and Englmeier 1998; Weis 1998). Recently, we have collected a data set of experimental and potential NLS motifs as an aid to predicting nuclear localization (Cokol, Nair et al. 2000); some of these signals are also used for the export from the nucleus (Cokol, Nair et al. 2000). However, the vast majority of proteins have no known motif. Furthermore, a particular problem for methods detecting N-terminal signals is that start-codons are predicted with less than 70% accuracy by genome projects (Reinhardt and Hubbard 1998). Overall, known and predicted sequence motifs enable annotating about 30% of the proteins in six entirely sequenced eukaryotic proteomes (Liu and Rost 2002; Mott, Schultz et al. 2002).
Ab initio methods predict localization for all proteins at lower accuracy. A third approach to predicting localization has been suggested by the observation that the overall amino acid composition correlates with the native compartment (Nishikawa, Kubota et al. 1983; Nakashima and Nishikawa 1994). This observation has led to the development of a variety of prediction methods based solely on composition (Cedano, Aloy et al. 1997; Horton and Nakai 1997; Reinhardt and Hubbard 1998; Nakai and Horton 1999). Higher order correlations (residues i and (i+n), n=2,3,4) have been accounted for by using pseudo-amino acid composition (Pan, Zhang et al. 2003). With the availability of many completely sequenced genomes, phylogenetic profiles have been employed to identify subcellular localization (Marcotte, Xenarios et al. 2000). So far, this approach has been much less accurate than methods based solely on composition. PSORT II is a knowledge-based expert system that integrates rules based on amino acid composition with known sequence motifs (Nakai and Kanehisa 1992; Nakai and Horton 1999), and also uses other methods such as NNPSL (Reinhardt and Hubbard 1998). Thus, the accuracy of PSORT II somehow depends on the accuracy of the underlying original methods. Drawid & Gerstein have proposed a Bayesian system based on a diverse range of 30 different features (Drawid and Gerstein 2000). They applied their system to predicting localization of yeast proteins. Using a seven-fold jack-knife procedure on 1342 yeast proteins with known localization, they reported a prediction accuracy of 75% at 67% coverage (Drawid and Gerstein 2000).
What determines subcellular localization? Experimental studies of protein targeting have shown that the subcellular localization of a protein is determined by its three-dimensional (3D) structure and/or the presence of local sequence motifs. Mutation studies have shown that disrupting the structure of a protein causes aberrant localization. One way of incorporating global information about protein structure into predictions of localization is by using secondary structure content. A number of local sequence motifs have been shown to mediate protein targeting (von Heijne 1995; Weis 1998). Andrade et al. have previously concluded that the signal for subcellular localization is almost entirely due to the surface residues (Andrade, O'Donoghue et al. 1998). Here, we have utilized these aspects of protein structure information to aid predicting localization.
We describe LOC3Dnet and LOCnet (Fig. 5-1), two systems of neural networks that sort proteins into one of four localization classes: extra-cellular, cytoplasmic, nuclear and mitochondrial. One (LOC3Dnet) is specialized on sequences from the PDB, the other (LOCnet) on sequences from SWISS-PROT. We excluded helical membrane proteins and used proteins from other minor compartments only as 'false positives'. In particular, proteins from the secretory pathway that are retained in the Golgi apparatus or the Endoplasmic reticulum were treated as 'non extra-cellular'. The method used 3+1 layers to make the final decision. The first layer consisted of four dedicated neural networks that used particular features from protein sequences, alignments, and structure to pre-sort proteins into L/not-L (L = cytoplasmic, nuclear, extra-cellular, mitochondrial). The second layer neural networks combined different input features. The third layer used a simple jury decision (Rost and Sander 1993) to assign one of four localization-states to each protein. We applied the final system to predict the native subcellular compartment for all eukaryotic proteins in PDB. Toward this end, we added a fourth layer combining the information

from PredictNLS (Cokol, Nair et al. 2000), sequence homology (Nair and Rost 2003), automatic text analysis (Nair and Rost 2003) and the prediction system introduced here. The neural networks were trained and tested on PDB and SWISS-PROT sequences of experimentally annotated localizations. We distinguished the following input features: overall amino acid composition, the amino acid composition of surface residues, composition of three-state secondary structure (helix, strand, other), and the amino acid composition separated into three secondary structure states (helix, strand, other). We also predicted secondary structure and surface composition using PHDsec and PHDacc respectively (Rost and Sander 1994; Rost 1996). Since biased data sets tend to yield over-estimates in prediction accuracy (Rost 2002), we took great care to select sequence-unique subsets of the data to estimate performance. All results of the methods described here were based either on four-fold cross-validation experiments or on SWISS-PROT sequences that had no significant sequence similarity to any protein used for development. The LOC3D localization prediction server for protein structures and the results of our annotations for all eukaryotic chains in the PDB can be accessed through the web at http://cubic.bioc.columbia.edu/db/LOC3D/.
The LOC3Ddb
database is the first comprehensive database of predicted and inferred
subcellular localization for all proteins of known structure. The LOC3Ddb database can be useful in
complementing functional information for proteins from domain databases like
SMART (Ponting, Schultz et al. 1999), PFAM (Sonnhammer, Eddy et al. 1997) and functional site resources like
ELM (Puntervoll, Linding et al. 2003), ProtFun (Jensen, Gupta et al. 2002) and Prosite (Hofmann, Bucher et al. 1999).
Data sets used for development and evaluation. We selected all eukaryotic proteins with explicit annotations about subcellular localization in SWISS-PROT release 40 (Bairoch and Apweiler 2000). We excluded proteins annotated as MEMBRANE, POSSIBLE, PROBABLE, SPECIFIC PERIODS or BY SIMILARITY. We also excluded proteins annotated with multiple localizations. This left 8,980 eukaryotic proteins in our SWISS-PROT data set of experimentally annotated localization ('trusted SWISS-PROT set', Table 5-1). Next, we assigned localization to PDB chains (Bernstein, Koetzle et al. 1977) by searching for homologues in the 'trusted SWISS-PROT set'. We transferred the annotated localization for all PDB chains, which were aligned at HSSP-distances (Eq. 5-1) above 10 (number of PDB chains found given in Table 5-1). Above this homology threshold subcellular localization annotation can be reliably transferred at over 90% accuracy (Nair and Rost 2003). Training, test and validation sets were constructed such that no pair of proteins from any two sets had levels of sequence similarity above HSSP-distances of 5 (Eq. 5-1). We picked this value, since below this threshold assigning subcellular localization based solely on homology leads to significant errors (Nair and Rost 2002). Furthermore, the test set was redundancy reduced at HSSP-distances <10 using a simple greedy search (Hobohm and Sander 1994). This ensured that no two proteins in the test set had greater than 40% sequence identity over more than 100 residues (number of sequence unique chains given in Table 5-1). The reason for this reduction was to find a balance between biased data known to yield over-estimates (Nielsen, Engelbrecht et al. 1996; Rost 2002) and between data sets that were too small. Note that we did not have to define thresholds for significant sequence similarity between motifs, such as signal peptides (Nielsen, Engelbrecht et al. 1996), since we used the entire protein information.

All non-eukaryotic proteins were also excluded for testing. We identified eukaryotic proteins by using three methods: (1) PDB to SWISS-PROT links in the SWISS-PROT database, (2) using the source and organism entry in PDB and (3) first hit is eukaryotic protein when the chain is aligned to the SWISS-PROT database. Note: all data sets are available at: http://cubic.bioc.columbia.edu/results/2003/localization/.
SWISS-PROT-new set used only for testing. After we completed the development of all our methods, we used an additional data set to re-examine performance, namely, we collected all proteins that had been added to SWISS-PROT between release 40 and 41 (labelled 'SWISS-PROT-new'). We filtered out all of these new proteins that had HSSP-distances >5 to any previously used protein and found the sequence-unique subset of the new proteins (Table 5-1). We never used any of these proteins for development, and it is rather unlikely that the other methods tested used any of these.
HSSP-distance to measure pairwise sequence similarity. The HSSP-distance is defined as the distance from the HSSP threshold (Rost 1999); it is given by:
HSSP-DISTANCE = PIDE - HSSP_PIDE(q)
HSSP_PIDE(q)= q +
(Eq.
5-1)
where L is the length of the alignment between two proteins, PIDE the percentage of pairwise identical residues, and HSSP_PIDE(q) the revised HSSP-threshold for the level q.
Observed and predicted information about protein structure. The observed secondary structure was extracted from the DSSP assignments (Kabsch and Sander 1983). Exposed residue composition was calculated from the solvent accessible surface area (Connolly 1983) in the DSSP database (Kabsch and Sander 1983). Three state secondary structure was predicted using PHDsec. We predicted all residues as exposed that were predicted to have relative solvent accessibility > 10% by PHDacc (Rost 1996). We chose this threshold since it gave good prediction accuracy on a limited subset of the training sets.
Building profiles. Profile-based composition was calculated by aligning the sequences against the SWISS-PROT + TrEMBL database using MaxHom dynamic programming algorithm (Sander and Schneider 1991). The aligned sequences were filtered for redundancy at 95% pairwise sequence identity, i.e. pairs exceeding this limit were removed. Finally, we included only those proteins into the alignment that were above an HSSP-distance of 5 and had a pairwise sequence identity above 50% with respect to the guide protein of known localization. These thresholds were found to be optimal on a limited subset of the training data. Finally the profile composition was calculated by replacing each amino acid residue in the protein by the residue frequencies in the profile.
Cross-validation. We separated our data into three sets: training, validation and testing set. Finally, we rotated through the sets such that each protein was used for testing exactly once. We never used any information from the test set to optimise parameters. In particular, we determined the number of hidden units based on of the validation sets and did not change it when we rotated. We stopped training when the best classification was obtained on the validation set.
Neural network training and architectures for PDB chains. We used three levels of networks (Fig. 5-1). First, a feed-forward neural network architecture with one hidden layer and two output units was trained on class/non-class for each localization. Training was done with standard back-propagation including momentum term (details in (Rost and Sander 1993)). The two output units represent different strengths of yes/no predictions for each localization class. Only localization classes with sufficient training examples in the PDB were considered. The neural networks were trained on PDB chains using overall amino acid composition, surface residue composition (both twenty input units), three-state secondary structure composition (three input units), and amino acid composition in the three secondary structure states (sixty input units) as input. We applied 'balanced training', i.e. examples belonging to the ‘yes’ and ‘no’ classes were alternately presented to the network during training. For networks with three and 20 input units a configuration with three hidden nodes was chosen, while for the network using amino acid composition in a secondary structure state as input, a configuration with 9 hidden nodes was chosen. For the second level, the different first level networks were combined using a jury decision (sum over all first level outputs) and combining neural networks (input first level output). The training, test and validation sets remained the same for the second level networks. The second level networks had 6 input units and three hidden units. In the third level the combination networks for the different localizations were combined in a jury to give the final four-state localization prediction.
Neural network training and architectures for SWISS-PROT proteins. We used basically the same architectures as for PDB chains, with two major differences. First, we only trained and tested on predicted secondary structure and solvent accessibility (since structure is not known for most of these proteins). Second, we used additional input units for the final summary networks, namely each network 'saw' the composition of the 50 N-terminal (20 units) residues (for proteins shorter than this, we simply used the entire protein).
Final decision through simple winner-take-it-all on 2nd layer of networks. The second layer networks (Fig. 5-1) all have two output units with the values:
outL and out¬L for
L=
We converted these values into probabilities:
Then we predicted the protein in the localization L' with:
(Eq. 5-2)
The strength of this prediction was measured using the reliability index RI:
(Eq.
5-3)
Evaluating performance. Four-fold cross-validation was applied to test the neural networks. As a simple measure for performance we used the percentage accuracy (Q, number of correctly predicted test proteins as percentage of total number of test proteins). The accuracy/specificity and coverage/sensitivity of the two-state networks were measured using four ratios derived from TP (number of proteins predicted to be in localization i and observed to be in localization i, the true positives), TN (number of proteins predicted not to be in localization i and observed to be so, the true negatives), FP (number of proteins predicted to be in localization i and observed not to be in i, the false positives) and FN (number of proteins predicted not to be in localization i and observed to be in i, the false negatives). We used:
(Eq.
5-6)
(Eq.
5-7)
In other words, pL are all correctly predicted in localization L as percentage of all predicted in L, and oL all correctly predicted as percentage of those observed in L. We combined these two numbers (pL and oL) through the geometric average:
(Eq.
5-8)
The overall four-state accuracy was measured by the accuracy Q4:
Q4 =
(Eq.
5-9)
To determine the best two-state networks, we used the Matthews correlation coefficient (Matthews 1975):
(Eq.
5-10)
Prediction methods. The prediction accuracy of four publicly available methods was evaluated using the sequence-unique test set of eukaryotic PDB chains. The four methods tested were: (1) NNPSL: neural network based tool predicting localization from amino acid composition (Reinhardt and Hubbard 1998); (2) SubLoc: support vector machine prediction of localization from amino acid composition (Hua and Sun 2001); (3) PSORT II: integrated method based on detecting sorting signals and predictions from other methods like NNPSL and SignalP (Nakai and Horton 1999); and (4) TargetP: neural network based tool predicting localization based on N-terminal sequence information (Emanuelsson, Nielsen et al. 2000). All methods were run with default parameter settings.
Annotating
proteins with known 3-D structures. Localization
was first predicted for all protein structures in PDB using four different
methods (Fig. 5-2): (1) PredictNLS: prediction of nuclear proteins through nuclear
localization signals, (2) LOChom: inferring localization through sequence
homology, (3) LOCkey: inferring localization through automatic text analysis of
SWISS-PROT keywords, and (4) and LOC3Dnet: ab initio prediction through a
system of neural networks. The final prediction is
based on the method that predicts localization with the highest confidence.

Linear separation by principal component analysis not enough.
The overall amino acid composition (Fig. 5-3A), the surface composition
(Fig. 5-3B), the three-state secondary structure composition (Fig. 5-3C), and
the combined sequence-structure composition (Fig. 5-3D) all showed some
correlation with subcellular localization in two dimensions (the first two
principal components). The strongest signal was the residue composition
separated by three secondary structure states (HEL). However, in contrast to
previous studies (Andrade, O'Donoghue et
al. 1998), we could not fully discriminate
between the three major classes by a linear separation on any single feature.
The full principal component analysis (Methods) revealed that the eigenvalues
of the first eight principal components were of similar magnitude for overall
amino acid composition. Hence, projecting the composition vectors onto two
dimensions resulted in a considerable loss of information. In order to fully
resolve the signal for subcellular localization present in the different global
composition features, we implemented a neural network based machine-learning
algorithm.
Structural information improves accuracy. We trained four different neural networks that
were specialised to discriminate between one of four localizations
(cytoplasmic, extra-cellular, nuclear, and mitochondrial, Table 5-2) and all
others, i.e. each network had two output units. Then, we combined the outputs
from each specialist through a statistical jury decision (Rost and Sander 1993) to give the final four state
prediction of localization (Eq. 5-2).


Networks based on the secondary
structure state specific residue composition reached the highest accuracy (from
57% for sequence only to above 59% for secondary structure dependent
composition, Fig. 5-4A). Networks based on predicted surface (Fig. 5-4A)
performed slightly worse than those based on the observed data (Fig. 5-4A).
However, when using only the exposed surface predicted by PHD (Fig. 5-4A)
prediction accuracy dropped below that obtained by sequence alone (Fig. 5-4A).

Evolutionary information improves by three percentage points. The advantages of using
evolutionary information in the form of sequence profiles have been
demonstrated for secondary structure prediction by a number of researchers (Rost and Sander 1993). Using sequence profiles as input
(Methods), prediction accuracy increased by up to 3% (Fig. 5-4B) for pairwise
networks based on overall sequence and secondary structure. The number of
proteins in the alignments was on average similar to that obtained for all PDB
proteins solved over the last two years (data not shown, (Przybylski and Rost 2002)), in other words, the set of
proteins that we used to evaluate performance did not stand out from what is
typical for the PDB.
Nuclear and extra-cellular proteins predicted significantly better. The prediction accuracy for the
different localization classes showed similar trends when using different
sequence features as input to the neural network. Extra-cellular and nuclear localizations
were predicted most accurately while the cytoplasmic and mitochondrial classes
were predicted with a much lower accuracy (Fig. 5-5, Table 5-2).
Fig. 5-5: Pairwise first level neural
networks accurate for some localizations. The
accuracy versus coverage curve for the four pairwise neural networks
trained on amino acid composition separated into the observed secondary
structure states shows that extra-cellular and nuclear classes were
predicted very accurately (above 80% accuracy at 75% coverage). However,
prediction accuracy for the cytoplasmic and mitochondrial classes was much
lower (accuracy less than 65% at 75% coverage). The standard error in
prediction accuracy for each of the localization classes was roughly

0.15% points. Networks trained on other composition
features showed similar trends in prediction accuracy for the different
localization classes (data not shown).
Second level combination of simple networks improves significantly. We obtained by far the best results
for each of the four localizations using combinations of the previously
developed networks. We tried two versions: Combine outputs from first level
networks through a statistical jury decision, and feed first level network
output into a second level network. Amongst the combined networks, using
evolutionary information consistently improved over using single sequences by
about two percentage points (Fig. 5-6A+B). The combined networks tested on
predicted surface and secondary structure were only slightly less accurate than
those tested on the DSSP data. The accuracy of our final system was more than
six percentage points higher than networks based only on amino acid
composition. The profile-based combination networks performed best and were
thus used to make the final localization predictions.
Over 73% accuracy for most reliably predicted 50% of all proteins. So far, we reported levels of
accuracy valid when we forced a prediction for each protein. However, some of
the predictions were 'stronger' than others. We translated this prediction
strength into a reliability index (Eq. 5-3) and investigated the dependency of
prediction accuracy on this reliability (Fig. 5-7). When we made predictions
only for the most reliably predicted half of all proteins of known structure,
prediction accuracy exceeded 75% for both the networks based on predicted and
observed structural information (Fig. 5-7A).
This corresponds approximately to a
reliability index of 40 for both the profile based networks. Similarly,
prediction accuracy reached about 73% for the most strongly predicted half of
proteins of unknown structure, however, the actual value for this system was
slightly different, namely this point was reached for a reliability index of about
47 (Fig. 5-7B).


Comparison to other methods. Two publicly available methods address a
similar general-purpose prediction of localization without sequence motifs or
homology, namely the neural network based program NNPSL (Reinhardt and Hubbard 1998) and the support vector machine-based
program SubLoc (Hua and Sun 2001). We applied both methods to our
test set of 359 sequence-unique eukaryotic PDB chains. Since some of these
proteins were used for developing NNPSL and SubLoc, our test of these public
methods may slightly over-estimate their performance. On the PDB data, SubLoc
reached an overall four-state accuracy around 50%, NNPSL around 44% (Table 5-3).
Our networks that used amino acid composition alone reached a similar level (NetSeq in Table 5-3). Incorporating
predicted surface, secondary structure and evolutionary information the final
combination network performed significantly better (LOC3DnetPHD in Table 5-3). Our system trained on SWISS-PROT
sequences (LOCnet) was conceptually
identical to the one trained on PDB sequences (LOC3DnetPHD). Hence, we were
surprised that it performed significantly worse (over ten percentage points
reduction in Q4). This drop suggested that PDB sequences differ so
significantly from SWISS-PROT sequences that we need a specialist to predict subcellular
localization for PDB proteins. When comparing these methods on a data set of
sequence-unique SWISS-PROT proteins of known localization that had been added
between release 40 and 41 (and were neither used in our development nor in that
of the other methods tested), we got different results (Table 5-4): now SubLoc
reached an overall four-state accuracy around 54%, NNPSL around 52%, and our
system trained on SWISS-PROT sequences (LOCnet)
clearly outperformed the system trained on PDB sequences (LOC3DnetPHD). Again, we observed that using all the information
(predicted 1D structure and alignment profiles) yielded a sustained improvement
around eight percentage points (NetSeq
vs. LOCnet in Table 5-4).
Our current system was only inferior
to NNPSL and SubLoc for mitochondrial proteins. Comparing our system to methods
that also utilise sequence motifs (PSORT II) or that specialise on particular
general signals (TargetP), we confirmed that our method performed particularly
poorly on mitochondrial proteins: TargetP performed clearly best for
mitochondrial proteins. In contrast, it appeared that our system implicitly
picked up the presence of signal peptides used in TargetP.