Chapter 5

Better prediction of subcellular localization by combining evolutionary and structural information

The native subcellular compartment of a protein is one aspect of its function. Thus, predicting localization is an important step toward predicting function. Short zip code-like sequence fragments regulate some of the shuttling between compartments. Cataloguing and predicting such motifs is the most accurate means of determining localization in silico. However, only few motifs are currently known, and not all the trafficking appears regulated in this way. The amino acid composition of a protein correlates with its localization. All general prediction methods employed this observation. Here, we explored the evolutionary information contained in multiple alignments and aspects of protein structure to predict localization in absence of homology and targeting motifs. Our final system combined statistical rules and a variety of neural networks to achieve an overall four-state accuracy above 65%, a significant improvement over systems using only composition. The system was at its best for extra-cellular and nuclear proteins; it was significantly less accurate than TargetP for mitochondrial proteins. Interestingly, all methods that were developed on SWISS-PROT sequences failed grossly when fed with sequences from proteins of known structures taken from PDB. We therefore developed two separate systems: one for proteins of known structure and one for proteins of unknown structure. Finally, we applied the PDB-based system along with homology-based inferences and automatic text analysis to annotate all eukaryotic proteins in the PDB (http://cubic.bioc.columbia.edu/db/LOC3D). The LOC3D database currently contains predictions for over 8,700 eukaryotic protein chains taken from PDB. The web server can be used to predict subcellular localization for proteins for which only a predicted structure is available from threading servers. We imagine that this pilot method - certainly in combination with similar tools - may be valuable for target selection in structural genomics.

5.1 Introduction

Subcellular localization important to elucidate protein function. Proteins must be localized in the same subcellular compartment to cooperate towards a common function. Therefore, experimentally unravelling the native compartment of a protein constitutes one step on the long way to determining its role. The explosion of sequence information through large-scale sequencing projects has widened the gap between the number of sequences deposited in public databases and the experimental characterisation of the corresponding proteins (Rost 1996; Koonin 2000; Rost 2002). Using high-throughput methods of epitope-tagging and immuno-fluorescence analysis Snyder et al (Kumar, Agarwal et al. 2002) have recently reported localization data for the entire proteome of Saccharomyces cerevisiae (yeast). So far, the majority of large-scale experiments suggesting localization have been restricted to yeast. This is primarily due to the fidelity of homologous recombination in yeast and the concomitant ease with which integrated reporter gene fusions can be generated. In contrast, computational tools can provide fast and accurate localization predictions for any organism (Eisenberg, Marcotte et al. 2000; Lewis, Ashburner et al. 2000). In fact, attempts to predict subcellular localization has become one of the central problems in bioinformatics (Eisenhaber and Bork 1998; Nakai 2000; Emanuelsson and von Heijne 2001).

Inferring localization by homology relatively accurate but not always applicable. A variety of approaches have been used to classify proteins with respect to subcellular localization. One of the most reliable means is annotation transfer from homologues (Andrade, O'Donoghue et al. 1998; Bork and Koonin 1998; Nair and Rost 2002): If a protein of experimentally known localization is significantly similar in sequence to a query protein U, localization can be inferred for U. However, the level of 'significant sequence similarity' varies substantially between localizations, and is much higher than that required for correct inference of folds (Rost 1999; Nair and Rost 2002). Thus, even when accepting many errors less than 25% of the proteins in SWISS-PROT (Bairoch and Apweiler 2000) can be classified by homology into one of ten compartments (Nair and Rost 2002). Using text analysis of SWISS-PROT keywords to infer localization, we can annotate subcellular localization for about 48% of all proteins in SWISS-PROT (Nair and Rost 2002).

Sequence motifs predict successfully for some compartments. Another way to predict localization is to identify local sequence motifs such as signal peptides (von Heijne 1995; Nielsen, Engelbrecht et al. 1997; Emanuelsson, Nielsen et al. 2000; Nakai 2000) or nuclear localization signals (NLS) (Cokol, Nair et al. 2000; Nakai 2000). Proteins destined for the secretory pathway, the mitochondria and the chloroplast contain N-terminal targeting peptides that are recognized by the translocation machinery (Rusch and Kendall 1995; Schatz and Dobberstein 1996). Thus, prediction methods use only the N-terminal residues (Nielsen, Engelbrecht et al. 1997; Emanuelsson, Nielsen et al. 2000). Discriminant analysis has been applied to identify proteins imported into mitochondria (Claros and Vincens 1996). Many proteins destined for the nucleus contain NLS motifs that may occur anywhere in the sequence (Mattaj and Englmeier 1998; Weis 1998). Recently, we have collected a data set of experimental and potential NLS motifs as an aid to predicting nuclear localization (Cokol, Nair et al. 2000); some of these signals are also used for the export from the nucleus (Cokol, Nair et al. 2000). However, the vast majority of proteins have no known motif. Furthermore, a particular problem for methods detecting N-terminal signals is that start-codons are predicted with less than 70% accuracy by genome projects (Reinhardt and Hubbard 1998). Overall, known and predicted sequence motifs enable annotating about 30% of the proteins in six entirely sequenced eukaryotic proteomes (Liu and Rost 2002; Mott, Schultz et al. 2002).

Ab initio methods predict localization for all proteins at lower accuracy. A third approach to predicting localization has been suggested by the observation that the overall amino acid composition correlates with the native compartment (Nishikawa, Kubota et al. 1983; Nakashima and Nishikawa 1994). This observation has led to the development of a variety of prediction methods based solely on composition (Cedano, Aloy et al. 1997; Horton and Nakai 1997; Reinhardt and Hubbard 1998; Nakai and Horton 1999). Higher order correlations (residues i and (i+n), n=2,3,4) have been accounted for by using pseudo-amino acid composition (Pan, Zhang et al. 2003). With the availability of many completely sequenced genomes, phylogenetic profiles have been employed to identify subcellular localization (Marcotte, Xenarios et al. 2000). So far, this approach has been much less accurate than methods based solely on composition. PSORT II is a knowledge-based expert system that integrates rules based on amino acid composition with known sequence motifs (Nakai and Kanehisa 1992; Nakai and Horton 1999), and also uses other methods such as NNPSL (Reinhardt and Hubbard 1998). Thus, the accuracy of PSORT II somehow depends on the accuracy of the underlying original methods. Drawid & Gerstein have proposed a Bayesian system based on a diverse range of 30 different features (Drawid and Gerstein 2000). They applied their system to predicting localization of yeast proteins. Using a seven-fold jack-knife procedure on 1342 yeast proteins with known localization, they reported a prediction accuracy of 75% at 67% coverage (Drawid and Gerstein 2000).

What determines subcellular localization? Experimental studies of protein targeting have shown that the subcellular localization of a protein is determined by its three-dimensional (3D) structure and/or the presence of local sequence motifs. Mutation studies have shown that disrupting the structure of a protein causes aberrant localization. One way of incorporating global information about protein structure into predictions of localization is by using secondary structure content. A number of local sequence motifs have been shown to mediate protein targeting (von Heijne 1995; Weis 1998). Andrade et al. have previously concluded that the signal for subcellular localization is almost entirely due to the surface residues (Andrade, O'Donoghue et al. 1998). Here, we have utilized these aspects of protein structure information to aid predicting localization.

We describe LOC3Dnet and LOCnet (Fig. 5-1), two systems of neural networks that sort proteins into one of four localization classes: extra-cellular, cytoplasmic, nuclear and mitochondrial. One (LOC3Dnet) is specialized on sequences from the PDB, the other (LOCnet) on sequences from SWISS-PROT. We excluded helical membrane proteins and used proteins from other minor compartments only as 'false positives'. In particular, proteins from the secretory pathway that are retained in the Golgi apparatus or the Endoplasmic reticulum were treated as 'non extra-cellular'. The method used 3+1 layers to make the final decision. The first layer consisted of four dedicated neural networks that used particular features from protein sequences, alignments, and structure to pre-sort proteins into L/not-L (L = cytoplasmic, nuclear, extra-cellular, mitochondrial). The second layer neural networks combined different input features. The third layer used a simple jury decision (Rost and Sander 1993) to assign one of four localization-states to each protein. We applied the final system to predict the native subcellular compartment for all eukaryotic proteins in PDB. Toward this end, we added a fourth layer combining the information

from PredictNLS (Cokol, Nair et al. 2000), sequence homology (Nair and Rost 2003), automatic text analysis (Nair and Rost 2003) and the prediction system introduced here. The neural networks were trained and tested on PDB and SWISS-PROT sequences of experimentally annotated localizations. We distinguished the following input features: overall amino acid composition, the amino acid composition of surface residues, composition of three-state secondary structure (helix, strand, other), and the amino acid composition separated into three secondary structure states (helix, strand, other). We also predicted secondary structure and surface composition using PHDsec and PHDacc respectively (Rost and Sander 1994; Rost 1996). Since biased data sets tend to yield over-estimates in prediction accuracy (Rost 2002), we took great care to select sequence-unique subsets of the data to estimate performance. All results of the methods described here were based either on four-fold cross-validation experiments or on SWISS-PROT sequences that had no significant sequence similarity to any protein used for development. The LOC3D localization prediction server for protein structures and the results of our annotations for all eukaryotic chains in the PDB can be accessed through the web at http://cubic.bioc.columbia.edu/db/LOC3D/.

The LOC3Ddb database is the first comprehensive database of predicted and inferred subcellular localization for all proteins of known structure. The LOC3Ddb database can be useful in complementing functional information for proteins from domain databases like SMART (Ponting, Schultz et al. 1999), PFAM (Sonnhammer, Eddy et al. 1997) and functional site resources like ELM (Puntervoll, Linding et al. 2003), ProtFun (Jensen, Gupta et al. 2002) and Prosite (Hofmann, Bucher et al. 1999).

5.2 Materials and methods

Data sets used for development and evaluation. We selected all eukaryotic proteins with explicit annotations about subcellular localization in SWISS-PROT release 40 (Bairoch and Apweiler 2000). We excluded proteins annotated as MEMBRANE, POSSIBLE, PROBABLE, SPECIFIC PERIODS or BY SIMILARITY. We also excluded proteins annotated with multiple localizations. This left 8,980 eukaryotic proteins in our SWISS-PROT data set of experimentally annotated localization ('trusted SWISS-PROT set', Table 5-1). Next, we assigned localization to PDB chains (Bernstein, Koetzle et al. 1977) by searching for homologues in the 'trusted SWISS-PROT set'. We transferred the annotated localization for all PDB chains, which were aligned at HSSP-distances (Eq. 5-1) above 10 (number of PDB chains found given in Table 5-1). Above this homology threshold subcellular localization annotation can be reliably transferred at over 90% accuracy (Nair and Rost 2003). Training, test and validation sets were constructed such that no pair of proteins from any two sets had levels of sequence similarity above HSSP-distances of 5 (Eq. 5-1). We picked this value, since below this threshold assigning subcellular localization based solely on homology leads to significant errors (Nair and Rost 2002). Furthermore, the test set was redundancy reduced at HSSP-distances <10 using a simple greedy search (Hobohm and Sander 1994). This ensured that no two proteins in the test set had greater than 40% sequence identity over more than 100 residues (number of sequence unique chains given in Table 5-1). The reason for this reduction was to find a balance between biased data known to yield over-estimates (Nielsen, Engelbrecht et al. 1996; Rost 2002) and between data sets that were too small. Note that we did not have to define thresholds for significant sequence similarity between motifs, such as signal peptides (Nielsen, Engelbrecht et al. 1996), since we used the entire protein information.

All non-eukaryotic proteins were also excluded for testing. We identified eukaryotic proteins by using three methods: (1) PDB to SWISS-PROT links in the SWISS-PROT database, (2) using the source and organism entry in PDB and (3) first hit is eukaryotic protein when the chain is aligned to the SWISS-PROT database. Note: all data sets are available at: http://cubic.bioc.columbia.edu/results/2003/localization/.

SWISS-PROT-new set used only for testing. After we completed the development of all our methods, we used an additional data set to re-examine performance, namely, we collected all proteins that had been added to SWISS-PROT between release 40 and 41 (labelled 'SWISS-PROT-new'). We filtered out all of these new proteins that had HSSP-distances >5 to any previously used protein and found the sequence-unique subset of the new proteins (Table 5-1). We never used any of these proteins for development, and it is rather unlikely that the other methods tested used any of these.

HSSP-distance to measure pairwise sequence similarity. The HSSP-distance is defined as the distance from the HSSP threshold (Rost 1999); it is given by:

            HSSP-DISTANCE = PIDE - HSSP_PIDE(q)

            HSSP_PIDE(q)= q +                         (Eq. 5-1)

where L is the length of the alignment between two proteins, PIDE the percentage of pairwise identical residues, and HSSP_PIDE(q) the revised HSSP-threshold for the level q.

Observed and predicted information about protein structure. The observed secondary structure was extracted from the DSSP assignments (Kabsch and Sander 1983). Exposed residue composition was calculated from the solvent accessible surface area (Connolly 1983) in the DSSP database (Kabsch and Sander 1983). Three state secondary structure was predicted using PHDsec. We predicted all residues as exposed that were predicted to have relative solvent accessibility > 10% by PHDacc (Rost 1996). We chose this threshold since it gave good prediction accuracy on a limited subset of the training sets.

Building profiles. Profile-based composition was calculated by aligning the sequences against the SWISS-PROT + TrEMBL database using MaxHom dynamic programming algorithm (Sander and Schneider 1991). The aligned sequences were filtered for redundancy at 95% pairwise sequence identity, i.e. pairs exceeding this limit were removed. Finally, we included only those proteins into the alignment that were above an HSSP-distance of 5 and had a pairwise sequence identity above 50% with respect to the guide protein of known localization. These thresholds were found to be optimal on a limited subset of the training data. Finally the profile composition was calculated by replacing each amino acid residue in the protein by the residue frequencies in the profile.

Cross-validation. We separated our data into three sets: training, validation and testing set. Finally, we rotated through the sets such that each protein was used for testing exactly once. We never used any information from the test set to optimise parameters. In particular, we determined the number of hidden units based on of the validation sets and did not change it when we rotated. We stopped training when the best classification was obtained on the validation set.

Neural network training and architectures for PDB chains. We used three levels of networks (Fig. 5-1). First, a feed-forward neural network architecture with one hidden layer and two output units was trained on class/non-class for each localization. Training was done with standard back-propagation including momentum term (details in (Rost and Sander 1993)). The two output units represent different strengths of yes/no predictions for each localization class. Only localization classes with sufficient training examples in the PDB were considered. The neural networks were trained on PDB chains using overall amino acid composition, surface residue composition (both twenty input units), three-state secondary structure composition (three input units), and amino acid composition in the three secondary structure states (sixty input units) as input. We applied 'balanced training', i.e. examples belonging to the ‘yes’ and ‘no’ classes were alternately presented to the network during training. For networks with three and 20 input units a configuration with three hidden nodes was chosen, while for the network using amino acid composition in a secondary structure state as input, a configuration with 9 hidden nodes was chosen. For the second level, the different first level networks were combined using a jury decision (sum over all first level outputs) and combining neural networks (input first level output). The training, test and validation sets remained the same for the second level networks. The second level networks had 6 input units and three hidden units. In the third level the combination networks for the different localizations were combined in a jury to give the final four-state localization prediction.

Neural network training and architectures for SWISS-PROT proteins. We used basically the same architectures as for PDB chains, with two major differences. First, we only trained and tested on predicted secondary structure and solvent accessibility (since structure is not known for most of these proteins). Second, we used additional input units for the final summary networks, namely each network 'saw' the composition of the 50 N-terminal (20 units) residues (for proteins shorter than this, we simply used the entire protein).

Final decision through simple winner-take-it-all on 2nd layer of networks. The second layer networks (Fig. 5-1) all have two output units with the values:

            outL and out¬L for L= {cytoplasmic, extracellular, nuclear, mitochondrial}  

We converted these values into probabilities:

                       

Then we predicted the protein in the localization L' with:

                           (Eq. 5-2)

The strength of this prediction was measured using the reliability index RI:

                                                                                                          (Eq. 5-3)

 

Evaluating performance. Four-fold cross-validation was applied to test the neural networks. As a simple measure for performance we used the percentage accuracy (Q, number of correctly predicted test proteins as percentage of total number of test proteins). The accuracy/specificity and coverage/sensitivity of the two-state networks were measured using four ratios derived from TP (number of proteins predicted to be in localization i and observed to be in localization i, the true positives), TN (number of proteins predicted not to be in localization i and observed to be so, the true negatives), FP (number of proteins predicted to be in localization i and observed not to be in i, the false positives) and FN (number of proteins predicted not to be in localization i and observed to be in i, the false negatives). We used:

                                                        (Eq. 5-6)

                                                          (Eq. 5-7)

In other words, pL are all correctly predicted in localization L as percentage of all predicted in L, and oL all correctly predicted as percentage of those observed in L. We combined these two numbers (pL and oL) through the geometric average:

                                                                                     (Eq. 5-8)

 

 

The overall four-state accuracy was measured by the accuracy Q4:

 

            Q4 =                                   (Eq. 5-9)

To determine the best two-state networks, we used the Matthews correlation coefficient (Matthews 1975):

                                                (Eq. 5-10)

 

Prediction methods. The prediction accuracy of four publicly available methods was evaluated using the sequence-unique test set of eukaryotic PDB chains. The four methods tested were: (1) NNPSL: neural network based tool predicting localization from amino acid composition (Reinhardt and Hubbard 1998); (2) SubLoc: support vector machine prediction of localization from amino acid composition (Hua and Sun 2001); (3) PSORT II: integrated method based on detecting sorting signals and predictions from other methods like NNPSL and SignalP (Nakai and Horton 1999); and (4) TargetP: neural network based tool predicting localization based on N-terminal sequence information (Emanuelsson, Nielsen et al. 2000). All methods were run with default parameter settings.

Annotating proteins with known 3-D structures. Localization was first predicted for all protein structures in PDB using four different methods (Fig. 5-2): (1) PredictNLS: prediction of nuclear proteins through nuclear localization signals, (2) LOChom: inferring localization through sequence homology, (3) LOCkey: inferring localization through automatic text analysis of SWISS-PROT keywords, and (4) and LOC3Dnet: ab initio prediction through a system of neural networks. The final prediction is based on the method that predicts localization with the highest confidence.

 

5.3 Results

Linear separation by principal component analysis not enough.  The overall amino acid composition (Fig. 5-3A), the surface composition (Fig. 5-3B), the three-state secondary structure composition (Fig. 5-3C), and the combined sequence-structure composition (Fig. 5-3D) all showed some correlation with subcellular localization in two dimensions (the first two principal components). The strongest signal was the residue composition separated by three secondary structure states (HEL). However, in contrast to previous studies (Andrade, O'Donoghue et al. 1998), we could not fully discriminate between the three major classes by a linear separation on any single feature. The full principal component analysis (Methods) revealed that the eigenvalues of the first eight principal components were of similar magnitude for overall amino acid composition. Hence, projecting the composition vectors onto two dimensions resulted in a considerable loss of information. In order to fully resolve the signal for subcellular localization present in the different global composition features, we implemented a neural network based machine-learning algorithm.

Structural information improves accuracy. We trained four different neural networks that were specialised to discriminate between one of four localizations (cytoplasmic, extra-cellular, nuclear, and mitochondrial, Table 5-2) and all others, i.e. each network had two output units. Then, we combined the outputs from each specialist through a statistical jury decision (Rost and Sander 1993) to give the final four state prediction of localization (Eq. 5-2).

 

Networks based on the secondary structure state specific residue composition reached the highest accuracy (from 57% for sequence only to above 59% for secondary structure dependent composition, Fig. 5-4A). Networks based on predicted surface (Fig. 5-4A) performed slightly worse than those based on the observed data (Fig. 5-4A). However, when using only the exposed surface predicted by PHD (Fig. 5-4A) prediction accuracy dropped below that obtained by sequence alone (Fig. 5-4A).

Evolutionary information improves by three percentage points. The advantages of using evolutionary information in the form of sequence profiles have been demonstrated for secondary structure prediction by a number of researchers (Rost and Sander 1993). Using sequence profiles as input (Methods), prediction accuracy increased by up to 3% (Fig. 5-4B) for pairwise networks based on overall sequence and secondary structure. The number of proteins in the alignments was on average similar to that obtained for all PDB proteins solved over the last two years (data not shown, (Przybylski and Rost 2002)), in other words, the set of proteins that we used to evaluate performance did not stand out from what is typical for the PDB.

Nuclear and extra-cellular proteins predicted significantly better. The prediction accuracy for the different localization classes showed similar trends when using different sequence features as input to the neural network. Extra-cellular and nuclear localizations were predicted most accurately while the cytoplasmic and mitochondrial classes were predicted with a much lower accuracy (Fig. 5-5, Table 5-2).

Fig. 5-5: Pairwise first level neural networks accurate for some localizations. The accuracy versus coverage curve for the four pairwise neural networks trained on amino acid composition separated into the observed secondary structure states shows that extra-cellular and nuclear classes were predicted very accurately (above 80% accuracy at 75% coverage). However, prediction accuracy for the cytoplasmic and mitochondrial classes was much lower (accuracy less than 65% at 75% coverage). The standard error in prediction accuracy for each of the localization classes was roughly 0.15% points. Networks trained on other composition features showed similar trends in prediction accuracy for the different localization classes (data not shown).

 

Second level combination of simple networks improves significantly. We obtained by far the best results for each of the four localizations using combinations of the previously developed networks. We tried two versions: Combine outputs from first level networks through a statistical jury decision, and feed first level network output into a second level network. Amongst the combined networks, using evolutionary information consistently improved over using single sequences by about two percentage points (Fig. 5-6A+B). The combined networks tested on predicted surface and secondary structure were only slightly less accurate than those tested on the DSSP data. The accuracy of our final system was more than six percentage points higher than networks based only on amino acid composition. The profile-based combination networks performed best and were thus used to make the final localization predictions.

Over 73% accuracy for most reliably predicted 50% of all proteins. So far, we reported levels of accuracy valid when we forced a prediction for each protein. However, some of the predictions were 'stronger' than others. We translated this prediction strength into a reliability index (Eq. 5-3) and investigated the dependency of prediction accuracy on this reliability (Fig. 5-7). When we made predictions only for the most reliably predicted half of all proteins of known structure, prediction accuracy exceeded 75% for both the networks based on predicted and observed structural information (Fig. 5-7A).

This corresponds approximately to a reliability index of 40 for both the profile based networks. Similarly, prediction accuracy reached about 73% for the most strongly predicted half of proteins of unknown structure, however, the actual value for this system was slightly different, namely this point was reached for a reliability index of about 47 (Fig. 5-7B).

 

Comparison to other methods. Two publicly available methods address a similar general-purpose prediction of localization without sequence motifs or homology, namely the neural network based program NNPSL (Reinhardt and Hubbard 1998) and the support vector machine-based program SubLoc (Hua and Sun 2001). We applied both methods to our test set of 359 sequence-unique eukaryotic PDB chains. Since some of these proteins were used for developing NNPSL and SubLoc, our test of these public methods may slightly over-estimate their performance. On the PDB data, SubLoc reached an overall four-state accuracy around 50%, NNPSL around 44% (Table 5-3). Our networks that used amino acid composition alone reached a similar level (NetSeq in Table 5-3). Incorporating predicted surface, secondary structure and evolutionary information the final combination network performed significantly better (LOC3DnetPHD in Table 5-3). Our system trained on SWISS-PROT sequences (LOCnet) was conceptually identical to the one trained on PDB sequences (LOC3DnetPHD). Hence, we were surprised that it performed significantly worse (over ten percentage points reduction in Q4). This drop suggested that PDB sequences differ so significantly from SWISS-PROT sequences that we need a specialist to predict subcellular localization for PDB proteins. When comparing these methods on a data set of sequence-unique SWISS-PROT proteins of known localization that had been added between release 40 and 41 (and were neither used in our development nor in that of the other methods tested), we got different results (Table 5-4): now SubLoc reached an overall four-state accuracy around 54%, NNPSL around 52%, and our system trained on SWISS-PROT sequences (LOCnet) clearly outperformed the system trained on PDB sequences (LOC3DnetPHD). Again, we observed that using all the information (predicted 1D structure and alignment profiles) yielded a sustained improvement around eight percentage points (NetSeq vs. LOCnet in Table 5-4).

Our current system was only inferior to NNPSL and SubLoc for mitochondrial proteins. Comparing our system to methods that also utilise sequence motifs (PSORT II) or that specialise on particular general signals (TargetP), we confirmed that our method performed particularly poorly on mitochondrial proteins: TargetP performed clearly best for mitochondrial proteins. In contrast, it appeared that our system implicitly picked up the presence of signal peptides used in TargetP.