Chapter 3

Sequence conserved for subcellular localization

The more proteins have diverged in sequence, the more difficult it becomes for bioinformatics to infer similarities of protein function and structure from sequence. The precise thresholds used in automated genome annotations depend on the particular aspect of protein function transferred by homology. Here, we presented the first large-scale analysis of the relation between sequence similarity and identity in subcellular localization. Three results stood out: (1) the subcellular compartment is generally more conserved than might have been expected, given that short sequence motifs such as nuclear localization signals can alter the native compartment; (2) the sequence conservation of localization is similar between different compartments; and (3) it is similar to the conservation of structure and enzymatic activity. In particular, we found the transition between the regions of conserved and non-conserved localization to be very sharp, although the thresholds for conservation were less well defined than for structure and enzymatic activity. We found that a simple measure for sequence similarity accounting for pairwise sequence identity and alignment length, the HSSP-distance, distinguished accurately between protein pairs of identical and different localizations. In fact, BLAST expectation values outperformed the HSSP-distance only for alignments in the sub-twilight zone. We succeeded in slightly improving the accuracy of inferring localization through homology by fine-tuning the thresholds. Finally, we applied our results to the entire SWISS-PROT database and five entirely sequenced eukaryotes.

3.1 Introduction

Sequence conservation of protein function. Proteins retain the 'memory' of their evolutionary ancestry in their sequence, structure and function. High sequence similarity alone is considered to be sufficient evidence for common ancestry and is routinely used to infer structural and functional similarity (Bork and Koonin 1998). However, many proteins of similar structure have no discernible sequence similarity (Murzin 1998; Rost 1999; Yang and Honig 2000). The evolutionary conservation of protein structure, i.e. the relationship between similarity in sequence and in structure, has been explored extensively (Chothia and Lesk 1986; Sander and Schneider 1991; Abagyan and Batalov 1997; Alexandrov and Luethy 1998). The resulting thresholds enable transferring structural annotations (Sander and Schneider 1991). However, the thresholds for sequence similarity that imply similarity in function cannot be inferred from those for structure (Shah and Hunter 1997; Devos and Valencia 2000; Pawlowski, Jaroszewski et al. 2000; Rost 2002). One problem in establishing thresholds for the transfer of function is that the term 'protein function', albeit intuitive, is not well defined. Function is a complex phenomenon associated with many mutually overlapping levels: chemical, biochemical, cellular, organism-mediated, and developmental. These levels are related in complex ways; for example, protein kinases can be related to different cellular functions (such as cell cycle), and to a chemical function (transferase) plus a complex control mechanism by interaction with other proteins. This lack of a precise definition generates two specific problems for analysing the conservation of function. First, we have sufficiently large, machine-readable data only for very few aspects of function (Wilson, Kreychman et al. 2000). Second, the conservation differs significantly between different types of function (Devos and Valencia 2000; Pawlowski, Jaroszewski et al. 2000; Rost 2002).
Nevertheless, a better understanding of the relation between function and sequence is fundamentally important, since it can provide insights into the underlying mechanisms of evolving new functions through changes in sequence and structure (Thornton, Orengo et al. 1999). Large-scale genome sequencing projects have led to a rapidly widening gap between the number of known sequences and their functional annotations. Efforts at addressing this situation have largely relied on exploiting sequence similarity to infer functional similarity (Casari, Sander et al. 1995; Bork and Koonin 1998; Andrade, Brown et al. 1999; Koonin 2000). However, few large-scale studies have evaluated the accuracy of the transfer of function (Eisenhaber and Bork 1998; Karp 1998; Devos and Valencia 2000; Devos and Valencia 2001).

Three zones of sequence comparisons: From trivial (safe) via problematic (twilight) to impossible (midnight). In general, we can distinguish three regions of protein comparisons (Fig. 3-1): (1) Safe zone: for very high levels of sequence similarity, proteins have similar functions and structures; aligning the proteins is straightforward. (2) Twilight zone (Doolittle 1986): at some point of divergence the alignment of proteins becomes problematic; in fact, we can no longer safely infer similarity in a particular feature from sequence. However, typically a considerable fraction of the protein pairs identified in the twilight zone still have a particular feature in common. (3) Midnight zone: protein pairs in this zone have such low levels of sequence similarity that we can no longer detect their similarity from sequence alone (Rost 1997; Rost 1999). Interestingly, the vast majority of protein pairs with similar three-dimensional (3D) structure populate the midnight zone (Rost 1997; Brenner, Chothia et al. 1998; Rost, O'Donoghue et al. 1998; Yang and Honig 2000).

Fig. 3-1: Transition from safe via twilight to midnight zone of protein comparisons. Alignment methods maximise the sequence similarity between two proteins. When we want to translate these levels of sequence similarity to conclusions about similarity in structure/function, we can distinguish three major regions; the boundaries between these are not well defined. (1) Safe zone: all protein pairs in this region have similar structure/function, i.e., sequence similarity implies similarity in structure/function. (2) Twilight zone: most pairs in this region have similar structure/function. (3) Midnight zone: while many of the pairs in this region may have similar structure/function, most do not. The curves illustrate accuracy (or specificity, black line) and coverage (or sensitivity, grey line); the x-axis gives the pairwise sequence similarity, the y-axis the percentage of pairs that are similar above the given threshold (accuracy) and the percentage of similar pairs that are found above the given threshold (coverage). These sketched curves point out that there is a trade-off between accuracy and coverage: while the safe zone is defined by 100% accuracy, we typically find only few of the pairs with similar structure/function in this region of sequence similarity (low coverage). At the other extreme, we find many pairs of similar structure/function in the midnight zone (high coverage); however, the accuracy is very low. Obviously, the choice of appropriate thresholds constitutes a balance between the Scylla of '100% accuracy, no homologue found' and the Charybdis of 'many putative homologues found, most are not homologous'. The particular shape of the curves that describe accuracy and coverage depends on the problem at hand, i.e., on the particular feature of biological similarity that we try to infer (Fig. 3-6 compares the transition for a variety of features). Here, we focus on the problem of establishing thresholds that allow inferring subcellular localization through sequence similarity.

Statistical scores versus percentage sequence identity for conservation of structure. The most popular database search methods, BLAST and PSI-BLAST (Altschul, Gish et al. 1990; Altschul 1993; Altschul, Madden et al. 1997), use neither percent sequence identity nor raw alignment scores to characterise sequence similarity. Instead, they use probabilities or expectation values that reflect the statistical significance of a given alignment. It has been claimed that statistical scoring schemes are superior to scores based on pairwise sequence identity in identifying structural homologues (Pearson 1995; Abagyan and Batalov 1997; Brenner, Chothia et al. 1998). Statistical scores are clearly superior to simply measuring pairwise sequence identity. Sander and Schneider (1991) introduced another measure for sequence similarity to identify proteins of similar structure when building their HSSP database. Their measure, the HSSP-distance, relates alignment length and pairwise sequence identity. A minor modification of the original HSSP-curve resulted in a measure for sequence similarity that appears more successful than statistical scores in describing structural similarity for pairwise alignments (Rost 1999). Can we generalise the lessons learned from the conservation of structure to that of function?

Sharp transition of thresholds describing the sequence conservation of function. Recent efforts at investigating the sequence-function relation have utilised three different classifiers for function: (1) enzymatic activity as described by the Enzyme Commission (EC, (Webb 1992)) numbers (Hegyi and Gerstein 1999; Pawlowski, Jaroszewski et al. 2000; Todd, Orengo et al. 2001; Rost 2002), (2) SWISS-PROT keywords (Devos and Valencia 2000), and (3) the GeneOntology (GO, (Lewis, Ashburner et al. 2000)) classification as used in FLYBASE (Ashburner and Drysdale 1994; Wilson, Kreychman et al. 2000). Wilson et al. (2000) observed that the separation between proteins of similar and non-similar function is best described by sigmoidal curves that drop off sharply at particular thresholds for conservation. They claimed near perfect conservation of function as measured by EC and GO above 40-50% pairwise sequence identity. Other groups have published similar findings for enzymatic activity (Shah and Hunter 1997; Devos and Valencia 2000; Pawlowski, Jaroszewski et al. 2000; Todd, Orengo et al. 2001). More recently, we have found that most of these results appeared over-optimistic, i.e. that enzymatic function is less well conserved than anticipated (Rost 2002). Devos and Valencia (2000) have also investigated the sequence identity-to-function relationship for SWISS-PROT (Bairoch and Apweiler 2000) keywords and binding sites. While they found that the functional shapes of the threshold separating conserved and non-conserved function were similar between different aspects of function, the precise thresholds differed between EC, keywords, and binding sites. In contrast to all other groups, Pawlowski et al. (2000) reported thresholds for conservation that were not sharp. Their measure of sequence similarity was based on their in-house alignment program BASIC (Rychlewski, Zhang et al. 1999).

Statistical scores versus percentage sequence identity for conservation of function. Wilson et al. (2000) found percent identity to be better at quantifying functional similarity than statistical scoring schemes. However, their results also indicated that statistical scoring schemes are better at discriminating between highly diverged sequences. In contrast, Pawlowski et al. (2000) and our recent results (Rost 2002) suggested that both statistical Z-scores and BLAST expectation values were clearly superior to pairwise sequence identity at quantifying similarity in enzymatic function. However, our results also confirmed that a combination of alignment length and pairwise sequence identity outperforms BLAST scores at levels of low sequence divergence (Rost 2002).

Sequence conservation of subcellular localization. The subcellular localization of a protein is correlated with its function (Ferrigno and Silver 1999; Faust and Montenarh 2000; Pearce 2000). Obviously, we expect proteins with very similar sequences to be localized in similar cellular compartments. In fact, this assumption is applied in everyday sequence analysis and database annotations (Eisenhaber and Bork 1998; Bairoch and Apweiler 2000). However, while we do know that subcellular localization is evolutionarily imprinted onto the protein surface (Andrade, O'Donoghue et al. 1998), the precise threshold for the sequence conservation of subcellular localization has not yet been explored. Here, we investigated this conservation for different compartments, i.e. implicitly for different aspects of function (Liscovitch, Czarny et al. 1999; Sirover 1999). Using subcellular localization to quantify functional similarity, we re-discovered particular features of the sequence-function relationship noticed before for other types of function (Devos and Valencia 2000; Jaroszewski, Rychlewski et al. 2000; Todd, Orengo et al. 2001; Rost 2002). We benchmarked the effectiveness of various scoring schemes and proposed a new scheme for refined identification of functional homologues. We also compared the accuracy of homology transfer to the accuracy of prediction methods (Claros, Brunak et al. 1997; Nielsen, Brunak et al. 1999; Drawid and Gerstein 2000; Nakai 2001). Finally, we applied our scheme to annotate five entirely sequenced eukaryotes and the entire SWISS-PROT database. In conjunction with other studies relating structure to function, our work provided useful insights into the evolution of new functions through sequence and structural changes. Our results may lead to improvements in functional annotations of newly sequenced genomes.

3.2 Materials and methods

Data set. We selected all eukaryotic proteins with annotated subcellular localization in SWISS-PROT release 37 (Bairoch and Apweiler 2000). We excluded sequences annotated as "POSSIBLE", "PROBABLE" or "BY SIMILARITY". We also excluded membrane proteins and all sequences annotated with multiple localizations. This left 7,405 proteins with experimentally annotated localization (Table 3-1, All SWISS-PROT).

To reduce bias, we selected a representative data set of sequence-unique proteins. Proteins were clustered using a simple greedy algorithm starting with the largest and longest families (Hobohm, Scharf et al. 1992; Rost 2002). We investigated different thresholds for clustering the sequences. The major results of our work were insensitive to the particular choice of the threshold (data not shown). Note that the data reported were obtained when clustering at an HSSP-distance of 4 (Eq. 3-1), since that value defined the threshold of sequence conservation. The database comparisons for the clustering were performed by pairwise BLAST (Altschul, Gish et al. 1990; Altschul and Gish 1996).
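The greedy selection of a sequence-unique subset can be illustrated with a short Python sketch. This is not the code used in this work; it is one plausible reading of "starting with the largest and longest families", and the input structures (`lengths`, `neighbours`) are hypothetical. Two proteins are assumed to be neighbours when their pairwise alignment scores above the clustering threshold (here, an HSSP-distance of 4).

```python
from typing import Dict, List, Set


def greedy_unique_subset(lengths: Dict[str, int],
                         neighbours: Dict[str, Set[str]]) -> List[str]:
    """Select representatives such that no two of them are neighbours.

    lengths:    protein id -> sequence length
    neighbours: protein id -> ids of proteins above the clustering threshold
                (assumed symmetric; hypothetical input for illustration)
    """
    # Visit proteins from the largest family (most neighbours) downwards,
    # preferring longer sequences on ties, then the id for determinism.
    order = sorted(
        lengths,
        key=lambda p: (-len(neighbours.get(p, set())), -lengths[p], p),
    )
    removed: Set[str] = set()
    selected: List[str] = []
    for protein in order:
        if protein in removed:
            continue
        selected.append(protein)
        # Every protein similar to a representative leaves the pool.
        removed.update(neighbours.get(protein, set()))
    return selected


if __name__ == "__main__":
    toy_lengths = {"a": 300, "b": 200, "c": 150, "d": 400}
    toy_neighbours = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
    print(greedy_unique_subset(toy_lengths, toy_neighbours))
```

Visiting large families first keeps one representative per family while discarding as many redundant sequences as possible in a single pass.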

Generating alignments for pair comparisons. We aligned all sequences from the sequence-unique subset (Table 3-1) against all proteins of known localization using pairwise BLAST (Altschul, Gish et al. 1990; Altschul and Gish 1996). For all proteins from the sequence-unique subset, we generated PSI-BLAST (Altschul, Madden et al. 1997) profiles with three iterations against a filtered version of all currently known sequences (Przybylski and Rost 2002). These profiles were then aligned against all proteins of known localization.

Scores for measuring sequence similarity. The simplest way to measure sequence similarity is percentage pairwise sequence identity (PIDE), i.e. the number of residues identical between two proteins divided by the number of residues aligned (not counting gaps), expressed as a percentage. The second measure that we used was the statistical expectation value as reported by BLAST (E-VAL; note that we typically report the logarithm of this value in our figures). The third scoring scheme we used was the distance from the HSSP-curve (Sander and Schneider 1991; Rost 1999):

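   \text{HSSP-distance} = \text{PIDE} - \text{HSSP\_curve}(q)

   \text{HSSP\_curve}(q) = q + \begin{cases} 100 & \text{for } L \le 11 \\ 480 \cdot L^{-0.32\,(1 + e^{-L/1000})} & \text{for } 11 < L \le 450 \\ 19.5 & \text{for } L > 450 \end{cases}                        (3-1)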

where L was the length of the alignment between two proteins, PIDE the percentage of pairwise identical residues, and HSSP_curve(q) the revised HSSP-threshold for the level q. As described above, we chose q = 4 for clustering the data set to reduce the bias. However, to compile distances, we chose the threshold of q = 0.
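PIDE and the HSSP-distance can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in this work. It assumes the revised HSSP-curve as published by Rost (1999): 480·L^(-0.32·(1+e^(-L/1000))) for alignments of 12 to 450 residues, 100 for shorter and 19.5 for longer alignments. The function names and the gap character `-` are illustrative choices.

```python
import math


def pide(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over aligned positions (gaps ignored)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns in which both sequences carry a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)


def hssp_curve(length: int, q: float = 0.0) -> float:
    """Revised HSSP-threshold (in percent identity) for an alignment of `length`.

    Constants are assumed from Rost (1999); q shifts the curve upwards.
    """
    if length <= 11:
        base = 100.0
    elif length <= 450:
        base = 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    else:
        base = 19.5
    return q + base


def hssp_distance(pide_value: float, length: int, q: float = 0.0) -> float:
    """Vertical distance (percentage points) of an alignment from the curve."""
    return pide_value - hssp_curve(length, q)


if __name__ == "__main__":
    identity = pide("ACDE-F", "ACDQGF")
    print(identity, hssp_distance(identity, 150))
```

Positive distances lie above the curve; a pair is considered to be above the threshold for level q when `hssp_distance(pide_value, length, q)` is at least zero.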

Modifications to optimise detection of homologues. We introduced two modifications of the standard HSSP-distance (Eq. 3-1):

(1) Perpendicular HSSP-distance: To calculate the 'perpendicular HSSP-distance' from the HSSP-curve (Eq. 3-1), percentage sequence identity and alignment length have to be measured in comparable units. This was done by first identifying approximate saturation points (slope 0 or ∞) on the HSSP-curve. Using these saturation points, we re-scaled the alignment-length axis (L in Eq. 3-1) and expressed it in terms of percent identity. For a given alignment, the normal to the re-scaled HSSP-curve was first identified. The length of the normal gave the perpendicular HSSP-distance. We experimented with various re-scaling constants. Finally, a re-scaling constant of 0.26 for the length of alignment was observed to provide the best results.

(2) Scaled HSSP-distance: Two proteins with 100% identical residues (PIDE=100) over an alignment length of 25 residues have an HSSP-distance of only 33. However, we observed very few false positives even over relatively short fragments for very similar pairs (Fig. 3-5). This suggested that better identification of homologues might be possible by using a relative distance from the curve. The scaled HSSP-distance was defined as:

                                    (3-2)

where PIDE was the percentage pairwise sequence identity, and the HSSP_curve was as defined in Eq. 3-1 (with a threshold of 0).
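The perpendicular HSSP-distance described under (1) above can be approximated numerically. The following Python sketch rests on several assumptions that are not stated in the text: the re-scaling is taken to be a simple multiplication of the alignment length by 0.26, the HSSP-curve constants are those of Rost (1999), the normal is found by a brute-force search over a grid of alignment lengths, and the sign convention (positive above the curve) is an illustrative choice.

```python
import math


def hssp_curve(length: float) -> float:
    """Revised HSSP-curve for q = 0 (constants assumed from Rost 1999)."""
    if length <= 11:
        return 100.0
    if length <= 450:
        return 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    return 19.5


def perpendicular_hssp_distance(pide_value: float, length: float,
                                scale: float = 0.26, step: float = 0.5,
                                max_length: float = 2000.0) -> float:
    """Shortest distance from (length, PIDE) to the re-scaled HSSP-curve.

    The length axis is multiplied by `scale` so that both axes are in
    comparable units; the closest curve point is found by grid search.
    """
    best = math.inf
    for i in range(1, int(max_length / step) + 1):
        grid_length = i * step
        dx = scale * (grid_length - length)       # re-scaled length offset
        dy = hssp_curve(grid_length) - pide_value  # identity offset
        best = min(best, math.hypot(dx, dy))
    # Assumed sign convention: positive above the curve, negative below.
    sign = 1.0 if pide_value >= hssp_curve(length) else -1.0
    return sign * best


if __name__ == "__main__":
    print(perpendicular_hssp_distance(60.0, 100.0))
```

By construction the magnitude of this distance never exceeds the vertical HSSP-distance, because the point on the curve directly below or above the alignment is one of the candidates in the search.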

Definitions of accuracy and coverage. We used the following definitions to measure accuracy/specificity:

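   \text{accuracy}(T) = 100 \cdot \frac{\text{number of pairs with identical localization above threshold } T}{\text{number of all pairs above threshold } T}                        (3-3)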

with the thresholds given by either (1) percentage pairwise sequence identity, (2) BLAST expectation values, (3) the distance from the HSSP-curve (Eq. 3-1), (4) the scaled HSSP-distance (Eq. 3-2), or (5) the perpendicular HSSP-distance. We considered all pairs as 'true' that were experimentally found in the same subcellular compartment. In analogy, we used the following definitions for coverage/sensitivity:

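   \text{coverage}(T) = 100 \cdot \frac{\text{number of pairs with identical localization above threshold } T}{\text{number of all pairs with identical localization}}                        (3-4)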
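As an illustration, cumulative accuracy and coverage at a given threshold could be computed as follows. The sketch assumes that each pair is represented by a similarity score (higher meaning more similar) and a flag stating whether both proteins share the same experimentally known localization; this representation and the function name are hypothetical. For BLAST expectation values, where lower is better, the negative logarithm would have to be passed as the score.

```python
from typing import Iterable, List, Tuple


def accuracy_and_coverage(pairs: Iterable[Tuple[float, bool]],
                          threshold: float) -> Tuple[float, float]:
    """Return (accuracy, coverage) in percent for pairs scoring >= threshold.

    pairs: (similarity score, same_localization) for each aligned pair
    """
    all_pairs: List[Tuple[float, bool]] = list(pairs)
    # Localization flags of all pairs at or above the threshold.
    above = [same for score, same in all_pairs if score >= threshold]
    true_above = sum(above)
    all_true = sum(1 for _, same in all_pairs if same)
    accuracy = 100.0 * true_above / len(above) if above else 0.0
    coverage = 100.0 * true_above / all_true if all_true else 0.0
    return accuracy, coverage


if __name__ == "__main__":
    toy_pairs = [(10.0, True), (6.0, True), (5.0, False),
                 (1.0, True), (-3.0, False)]
    for t in (8.0, 4.0, -5.0):
        print(t, accuracy_and_coverage(toy_pairs, t))
```

Sweeping the threshold from high to low traces the trade-off sketched in Fig. 3-1: accuracy falls while coverage rises.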

The accuracy of prediction was measured using the ratios:

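   p_L = 100 \cdot \frac{\text{number of proteins correctly predicted in localization } L}{\text{number of proteins predicted in localization } L}, \qquad o_L = 100 \cdot \frac{\text{number of proteins correctly predicted in localization } L}{\text{number of proteins observed in localization } L}                        (3-5)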

Note that pL and oL measure two different aspects of prediction methods: oL reflects how many of the known proteins are correctly predicted, while pL reflects how many of the predicted proteins are correctly predicted. For example, a method that strongly over-predicts (like SignalP) yields a high oL and a low pL (Table 3-3).
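The two ratios can be made concrete with a small Python sketch. It assumes that observed and predicted localizations are available as two parallel lists of labels; the function name and the labels are illustrative only.

```python
from typing import Sequence, Tuple


def prediction_accuracy(observed: Sequence[str], predicted: Sequence[str],
                        localization: str) -> Tuple[float, float]:
    """Return (pL, oL) in percent for one localization class."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    # Proteins both observed and predicted in the given localization.
    correct = sum(1 for o, p in zip(observed, predicted)
                  if o == p == localization)
    n_predicted = sum(1 for p in predicted if p == localization)
    n_observed = sum(1 for o in observed if o == localization)
    p_l = 100.0 * correct / n_predicted if n_predicted else 0.0
    o_l = 100.0 * correct / n_observed if n_observed else 0.0
    return p_l, o_l


if __name__ == "__main__":
    obs = ["nuc", "nuc", "cyt", "ext"]
    # A toy method that over-predicts the nucleus.
    pred = ["nuc", "nuc", "nuc", "nuc"]
    print(prediction_accuracy(obs, pred, "nuc"))
```

In the toy example every observed nuclear protein is found (high oL), but only half of the nuclear predictions are correct (lower pL), which is the pattern expected for an over-predicting method.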

Prediction methods. The prediction accuracy of three publicly available subcellular localization predictors was evaluated using the sequence-unique test set (Table 3-1). The three predictors were: 1) NNPSL: neural network based tool for predicting subcellular localization based on amino acid composition (Reinhardt and Hubbard 1998); 2) SubLoc: a support vector machine based tool for predicting subcellular localization based on amino acid composition (Hua and Sun 2001); and 3) TargetP: neural network based tool for large-scale subcellular localization prediction based on N-terminal sequence information (Emanuelsson, Nielsen et al. 2000). All methods were run with default parameter settings.

Annotating localization based on homology. All proteins belonging to five entirely sequenced eukaryotic proteomes (Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana) and all proteins in the SWISS-PROT database were aligned by pairwise BLAST to our data set of proteins with experimentally known localization (Table 3-1). We measured sequence similarity by the scaled HSSP-distance and considered only alignment pairs above the conservation threshold (scaled HSSP-distance = 4). We estimated the accuracy of the annotation transfer by homology using the curves of accuracy versus HSSP-distance that were obtained for the different localizations. We annotated localization based on the known localization of the closest homologue. Only those proteins for which localization could be inferred with greater than 70% accuracy were annotated.
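The transfer rule described above can be sketched as follows. This is a hedged illustration, not the pipeline used in this work: hits are assumed to arrive as (annotated protein, scaled HSSP-distance, localization) tuples, and `estimated_accuracy` is a hypothetical stand-in for a lookup in the accuracy-versus-distance curves of the respective localization.

```python
from typing import Callable, Iterable, Optional, Tuple

# (id of annotated protein, scaled HSSP-distance, known localization)
Hit = Tuple[str, float, str]


def transfer_localization(
    hits: Iterable[Hit],
    estimated_accuracy: Callable[[str, float], float],
    distance_threshold: float = 4.0,
    min_accuracy: float = 70.0,
) -> Optional[Tuple[str, str, float]]:
    """Return (localization, homologue id, estimated accuracy) or None."""
    # Keep only hits at or above the conservation threshold.
    candidates = [h for h in hits if h[1] >= distance_threshold]
    if not candidates:
        return None
    # The closest homologue is the hit with the largest distance from the curve.
    homologue, distance, localization = max(candidates, key=lambda h: h[1])
    accuracy = estimated_accuracy(localization, distance)
    # Annotate only if the expected accuracy exceeds the cut-off.
    if accuracy <= min_accuracy:
        return None
    return localization, homologue, accuracy


if __name__ == "__main__":
    def toy_accuracy(localization: str, distance: float) -> float:
        # Hypothetical step function standing in for the empirical curves.
        return 90.0 if distance >= 20.0 else 60.0

    toy_hits = [("P1", 12.0, "nucleus"), ("P2", 30.0, "cytoplasm"),
                ("P3", 2.0, "nucleus")]
    print(transfer_localization(toy_hits, toy_accuracy))
```

Only the single closest homologue decides the annotation in this sketch; hits below the threshold are never consulted, even if they agree with each other.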

3.3 Results

Over nine million pair comparisons between proteins of known localization. The data set of proteins with experimentally known subcellular localization extracted from SWISS-PROT contained 7,405 proteins. Since many of these proteins were very similar in sequence (bias), we constructed a representative sequence-unique subset containing 1,248 proteins (Table 3-1). Our objective was to establish at which level of sequence similarity proteins reside in the same localization. In order to obtain unbiased estimates, we had to analyse the sequence-unique set. However, if we were to align all-against-all within the subset of 1,248 proteins, we would not find any close homologues, since by construction no pair in the sequence-unique set is closely related. Consequently, we could not deduce thresholds for everyday sequence comparisons. Therefore, we had to accept some bias by aligning all proteins from the sequence-unique set against all proteins from the full set. Thus, our results were based on over nine million pair comparisons (1,248 x 7,405). Although this number might appear high enough to allow statements about statistical significance, the data set was not evenly distributed between the 11 different compartments (Table 3-1). We distinguished between major compartments (nucleus, cytoplasm, mitochondria, extra-cellular space, and chloroplasts), for which we had sufficiently large sets, and minor compartments (vacuoles, Golgi, and periplasm), for which the sets were too small to generalise our findings. Three compartments (lysosome, ER, and peroxisome) ranged somewhere in between these two extremes.

Sequence conservation of localization similar for major compartments. The functional shapes for the sequence conservation of subcellular localization were similar across the major compartments (Fig. 3-2).

The thresholds for accurate inference of localization through homology were around HSSP-distances of 4 (Eq. 3-1, Fig. 3-2A). We observed similar functional behaviour for alignments generated using pairwise BLAST (Fig. 3-2A) and PSI-BLAST profiles (Fig. 3-2B). The cumulative coverage (Eq. 3-4), however, showed considerable variation for pairs belonging to the different localizations. Remarkably, the transition from safe zone to twilight zone (Fig. 3-1) occurred at an HSSP-distance close to that for the sequence conservation of protein structure (HSSP-distance of 0 (Rost 1999), Fig. 3-2).

Sharp transition from safe to twilight zone. If we want to infer the localization of a protein through homology to a protein of experimentally known localization, we have to relate sequence similarity to the conservation of the compartment. We explored three different ways of measuring sequence similarity (Methods): (1) BLAST expectation values (E-VAL, (Altschul and Gish 1996)), (2) percentage pairwise sequence identity (PIDE), and (3) the distance from the HSSP threshold (DIST, Eq. 3-1). The curves describing the sequence conservation of localization resembled sigmoidal relationships (Fig. 3-3).

Fig. 3-3: Average conservation of subcellular localization. The upper graphs show the performance of pairwise BLAST searches for the biased set, while the lower graphs show the performance of pairwise BLAST and PSI-BLAST searches on the sequence-unique subset. The filled symbols show cumulative accuracy and cumulative coverage (Eqs. 3-3 and 3-4) for pairwise BLAST; open symbols give the results from PSI-BLAST searches. For the biased set, a cumulative coverage of 1% corresponds to the identification of about 274K pairs from identical localization (true pairs), while for the sequence-unique subset a cumulative coverage of 1% corresponds to the identification of about 21K true pairs. Conservation thresholds for BLAST and PSI-BLAST are indicated by open and filled arrows, respectively. For HSSP-distance (C and F), the conservation threshold using BLAST was at HSSP-distance=4 (open arrow) for the biased and sequence-unique sets, while using PSI-BLAST the conservation threshold was at HSSP-distance=0 (filled arrow) for the sequence-unique set. The cumulative accuracy and cumulative coverage for the sequence-unique set were 87% and 0.36%, respectively, when using BLAST, and 91% and 0.4%, respectively, when using PSI-BLAST. For the cumulative accuracy versus percent sequence identity graphs (A and D), no sharp conservation thresholds could be established. The percent sequence identity graphs showed the largest variation between the biased and sequence-unique sets. In contrast, the graphs for BLAST E-values (B and E) and HSSP-distances (C and F, Eq. 3-1) were similar for the biased and the sequence-unique set. The conservation thresholds for PSI-BLAST occurred at a lower threshold than those for pairwise BLAST (D, E and F). The middle graphs plot the logarithm of the BLAST E-values (log to the base e). Note that BLAST E-values below 10^-200 did not suffice to safely infer localization. In contrast, at very high HSSP-distances and sequence identities localization could be reliably transferred.

 

Similar functional shapes describe the sequence conservation of enzymatic activity (Wilson, Kreychman et al. 2000; Todd, Orengo et al. 2001; Rost 2002), of gene ontology classes (Wilson, Kreychman et al. 2000), and of protein structure (Vogt, Etzold et al. 1995; Abagyan and Batalov 1997; Brenner, Chothia et al. 1998; Park, Karplus et al. 1998; Rost 1999; Jaroszewski, Rychlewski et al. 2000; Blake and Cohen 2001). The conservation of localization (Fig. 3-3) was characterised by a region of slow monotonic decrease in accuracy (safe zone), followed by a transition to a region in which the accuracy decreases sharply (twilight/midnight zones). The transition from safe to twilight zone was markedly sharper for the BLAST expectation values (Fig. 3-3B and Fig. 3-3E) and for the HSSP distance (Fig. 3-3C and Fig. 3-3F) than for pairwise sequence identity (Fig. 3-3A and Fig. 3-3D). Similar results have been reported for structure (Yang and Honig 2000). The curves relating accuracy (Eq. 3-3, not shown) and cumulative accuracy (Eq. 3-3, Fig. 3-3) respectively to sequence similarity were rather similar.

The results for the biased and unique data sets were surprisingly similar for the HSSP-distance and the BLAST expectation values (Fig. 3-3). However, the results differed significantly for pairwise sequence identity (Fig. 3-3A and Fig. 3-3D): the biased set suggested that we can correctly infer localization through homology for 90% of all proteins if we require about 50% identical residues (Fig. 3-3A).

This is similar to what many molecular biologists would use in everyday sequence analysis. In contrast, the sequence-unique set indicated that we need over 70% identical residues to correctly infer localization at a level of 90% accuracy for pairwise BLAST searches (Fig. 3-3D).

Expectation values outperformed HSSP distance for very diverged pairs. When analyzing the relation between accuracy (percentage of localizations correctly inferred above given threshold) and coverage (number of correct inferences made above threshold), we noticed that the HSSP-distance outperformed simple pairwise sequence identity for all thresholds (Fig. 3-4A: circles vs. squares).

However, the HSSP-distance was superior to the BLAST expectation values only for proteins of conserved structure (HSSP-distances above 0, Fig. 3-4A: circles vs. diamonds, pairwise BLAST transition marked by upper, left arrow). Surprisingly, the BLAST expectation values were slightly inferior to percentage pairwise sequence identity for very similar proteins (Fig. 3-4A, indicated by lower, right-hand arrow). For PSI-BLAST, percent sequence identity performed even better. Since the HSSP-distance gave the best prediction above the conservation threshold (Fig. 3-4A), using HSSP-distances to infer localization by homology improves the accuracy of information transfer significantly. In fact, the HSSP-curve derived to describe the sequence conservation of protein structure (Rost 1999) described the basic difference between protein pairs of identical and of different localizations surprisingly well (Fig. 3-5). We obtained similar graphs for the individual localizations and for PSI-BLAST profiles (data not shown).

Fig. 3-4: Performance for different measures of sequence similarity. The black lines and open symbols show cumulative coverage versus cumulative accuracy for PSI-BLAST searches, while grey lines and shaded symbols show the same for pairwise BLAST (A and B). The figure plots data only for cumulative accuracy above 80%, which is well below the threshold for conservation of localization. (A) For HSSP-distance (circles) and percent sequence identity (squares), PSI-BLAST vastly outperforms pairwise BLAST. However, using BLAST E-values, both BLAST and PSI-BLAST gave comparable performance at the conservation threshold (86% cumulative accuracy in figure). For both pairwise BLAST and PSI-BLAST, scoring the alignments using HSSP-distance (Eq. 3-1) gave the best coverage versus accuracy graphs. Using HSSP-distance for PSI-BLAST alignments gave the overall best performance. (B) For both pairwise BLAST and PSI-BLAST, using the scaled distance (Eq. 3-2) from the HSSP-curve improved performance compared to the HSSP-distance. The performance was worse when the perpendicular distance from the HSSP-curve was used. Overall, using PSI-BLAST alignments and the scaled distance from the HSSP-curve gave the best performance. (C) The curves for cumulative accuracy and coverage for the scaled HSSP-distance were similar to those obtained for the standard HSSP-distance (Fig. 3-3F).

Refining the thresholds to infer localization by homology. We might improve the accuracy of transferring experimental information about localization by homology in two ways. We could refine the original HSSP-curve used to determine the thresholds for this inference. However, the incorrect predictions shown in Fig. 3-5 suggested that this might not be simple.

Alternatively, we could modify the way of compiling the distance from the HSSP-curve used to establish thresholds. Towards this end, we explored the following alternatives: (1) standard HSSP-distance (Eq. 3-1, note that this is the distance used for all previous figures), (2) perpendicular HSSP-distance (Methods), and (3) a scaled HSSP-distance (Eq. 3-2). We found the scaled HSSP-distance slightly superior to the standard HSSP-distance, while the perpendicular HSSP-distance performed significantly worse (Fig. 3-4B). For pairwise BLAST searches, the scaled HSSP-distance discovered 13% more pairs of identical localization than the standard HSSP-distance at the same conservation threshold. The curves relating cumulative accuracy and cumulative coverage respectively to scaled HSSP-distance (Fig. 3-4C) were similar to those obtained for HSSP-distance (Fig. 3-3F).

Annotation transfer for entire SWISS-PROT and entire eukaryotes. Using the scaled HSSP-distance (Eq. 3-2), we annotated the subcellular localization based on homology for approximately one-fifth of the proteins in five entirely sequenced eukaryotes and the entire SWISS-PROT database (Table 3-2).

The percentage of proteins for which we could infer localization differed substantially, ranging from above 26% for human to around 13% for the worm. Previously, we predicted that about 20% of all fly proteins are nuclear (Cokol, Nair et al. 2000). In contrast, over 60% of all the proteins for which we could infer localization by homology for entire proteomes were nuclear. Obviously, this number reflected the current bias. The annotations of subcellular localization are available at http://cubic.bioc.columbia.edu/db/LocHom/.

Ab initio prediction methods were not necessarily more accurate. The accuracy of inferring subcellular localization by homology exceeded 80% at an HSSP-distance of 4 (Fig. 3-2). We compared the performance of three publicly available prediction methods on the same sequence-unique data set. Note that the resulting estimates for prediction accuracy are likely to constitute over-estimates, since many of our test proteins may have been used to develop those prediction methods.