Inferring sub-cellular localization through automated lexical analysis
| Author: | Rajesh Nair & Burkhard Rost |
| Quote: | Bioinformatics, 2002, 11, 2836-2847 (ISMB'2002 Proceedings). |
Motivation: The SWISS-PROT sequence database contains keywords of functional annotations for many proteins. In contrast, information about the sub-cellular localization is only available for few proteins. Experts can often infer localization from keywords describing protein function. We developed LOCkey, a fully automated method for lexical analysis of SWISS-PROT keywords that assigns sub-cellular localization. With the rapid growth in sequence data, the biochemical characterisation of sequences has been falling behind. Our method may be a useful tool for supplementing functional information already automatically available.
Results: The method reached a level of more than 82% accuracy in a full cross-validation test. Due to a lack of functional annotations, we could infer localization for less than half of all proteins in SWISS-PROT. We applied LOCkey to annotate five entirely sequenced proteomes, namely Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Arabidopsis thaliana (plant) and a subset of all human proteins. LOCkey found about 8000 new annotations of sub-cellular localization for these eukaryotes.
Availability: Annotations of localization for eukaryotes at: http://cubic.bioc.columbia.edu/services/LOCkey.
Contact: rost@columbia.edu
Key words: genome sequence analysis, predicting sub-cellular localization, protein function, lexical analysis.
Protein sequence-function gap. The number of completely sequenced genomes has been rapidly increasing. Currently, we know the full genomes for over 60 organisms [1, 2, 3, 4, 5, 6, 7]. This sequence explosion has widened the gap between the number of sequences deposited in public databases and the experimental characterisation of the corresponding proteins [8]. To bridge this gap, faster and more effective means of creating annotation are required [9, 10, 11, 12]. One promising approach is automatic annotation [13, 14]. One step towards understanding the function of a protein is elucidating its sub-cellular localization [15].
Database annotations of function often very detailed. Protein function may be described best in the context of molecular interactions. The SWISS-PROT database [16] contains functional annotations predominantly at a very detailed level of biochemical function, e.g. a protein may be annotated as a cdc2 kinase, but not as being involved in intra-cellular communication [17, 18] . We would like to complement these detailed annotations with descriptions in context of higher-order processes such as the regulation of gene expression, pathways, or signalling cascades [19, 20, 21, 17] . Descriptions at this level are available for only few proteins. To remedy this situation, a number of automatic and semi-automatic tools have been developed for functional annotation of proteins.
Classification of proteins into families of homologues. Many automatic annotations are derived from sequence similarity to proteins of known function [22, 1, 23, 24, 8, 25, 26, 27] . A typical similarity search starts by aligning the unknown protein U against databases with functional annotations through search tools such as BLAST [28] , FASTA [29] , or PSI-BLAST [30] . If a homologue H is found that has an annotation and significant sequence similarity to U, the annotation of H is transferred to U. Such inference of function is reliable only if the sequences of U and H are very similar [26, 31] . Several pitfalls of such transfers of function have been reported, e.g. inadequate knowledge of thresholds for 'significant sequence similarity', or using only the best database hit or ignoring the domain organisation of proteins [32, 33, 34, 26] .
Automatic annotation through functional descriptors. Annotation systems have been based on SWISS-PROT keywords [35, 17, 36, 10, 37, 16]. The common approach to classifying function is to first extract characteristic keywords for each of the functional classes from a set of proteins classified by experts. Using these keywords, a library of rules is created that associates a certain pattern of keywords with a functional class. Creating the 'rules library' is a difficult task for which a variety of solutions have been proposed. EUCLID [17] uses SWISS-PROT keywords to classify proteins into 14 classes of cellular function (according to the scheme proposed by Monika Riley [38, 19, 20, 39]). Using a simple voting scheme, the system assigns the unknown sequence to the functional class to which the majority of its keywords belong [35]. A disadvantage of using such dictionaries is that they can only 'discover' simple correlations between the known functional keywords. The method of the Apweiler group [10] annotates function for the TrEMBL [16] database based on SWISS-PROT keywords and PROSITE motifs [37]. The system generates a 'RuleBase' by extracting functional annotations from all SWISS-PROT proteins that contain the same PROSITE motif. If a PROSITE motif is discovered in an un-annotated TrEMBL sequence, the functional annotation is transferred from the 'RuleBase'. Recently [40], the group has implemented the C4.5 data-mining algorithm to automatically generate rules for keyword annotations found in SWISS-PROT. The rules are based on taxonomy, PROSITE motifs, and PFAM patterns in SWISS-PROT proteins belonging to different InterPro [41] families. The Meta_A annotator [36] is a partly automatic annotation evaluation system based on a combination of lexical analysis and libraries of expert rules. The rule libraries are derived from scanning the protein names, taxonomy information, commentaries and feature tables in SWISS-PROT. The system assigns one of twelve final sub-cellular localizations to each protein. Meta_A combines primary attributes with AND, OR and NOT logical operators to create a library of 'biological rules'. The rules relating lexical patterns to functional attributes are created by expert intervention, i.e. they resemble a dictionary [36]. Thus, the creation of the rule library is time-consuming and has to be repeated for each new application.
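The simple voting scheme described above for dictionary-based systems can be illustrated with a short sketch. It is not the EUCLID implementation: the dictionary `KEYWORD_CLASS`, its class names and the function `vote` are hypothetical and only show why such a scheme cannot capture correlations between keywords, since each keyword votes independently.

```python
from collections import Counter

# Hypothetical keyword-to-class dictionary (assumption for illustration);
# a real system derives it from expert-classified proteins.
KEYWORD_CLASS = {
    "Transcription regulation": "information",
    "DNA-binding": "information",
    "Glycolysis": "energy",
    "Kinase": "communication",
}


def vote(keywords, dictionary=KEYWORD_CLASS):
    """Return the class supported by most keywords, or None if none is known."""
    votes = Counter(dictionary[k] for k in keywords if k in dictionary)
    if not votes:
        return None
    # Each keyword contributes one independent vote; keyword combinations
    # carry no extra weight in this scheme.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(vote(["DNA-binding", "Transcription regulation", "Glycolysis"]))
```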
Algorithms for text categorisation. The problem of automatically extracting rules from SWISS-PROT keywords has parallels to the problem of 'Text Categorisation'. Text categorisation (TC) is the problem of assigning predefined categories to text documents such as journal articles or abstracts. Many statistical learning methods have been applied to this problem. These include nearest neighbour classifiers [42], multivariate regression models [43, 44], probabilistic Bayesian models [45] and symbolic rule learning [46]. Examples of M-ary (multiple category) classifiers are the k-Nearest Neighbour [47] and the Linear Least-squares Fit (LLSF) [48]. Here we describe LOCkey, a novel method that automatically assigns proteins to classes of sub-cellular localization based on a lexical analysis of SWISS-PROT keywords. LOCkey is based on M-ary classifiers that solve the classification problem accurately when the number of data points (proteins) and the dimensionality of the feature space (number of keywords) are not too large. In contrast to dictionary-based approaches, LOCkey is fully automated and the rule libraries are generated dynamically. Our method may be applied to any database with keywords of functional information and to any task involving higher-level classifications.
Implementation of algorithm. Instead of creating an a priori 'rule library', we generated all possible sets of rule libraries for a protein of unknown 'class' from a set of SWISS-PROT keywords. The protein was assigned based on the 'rule library' that solved the classification problem best. We applied the algorithm to infer one of ten classes of sub-cellular localization ( Table 1 ). The algorithm involved two separate steps: (1) build a data set of trusted vectors from proteins of known localization, and (2) classify unknown proteins ( Fig. 1 ).
Step 1: Building data set of trusted vectors. First, we compiled a data set with proteins of experimentally known localization. Then, we extracted a list of keywords from SWISS-PROT for this set from the 'keyindex' file. Since only a partial functional annotation was available for a large number of proteins, we merged keywords from homologous sequences to provide as complete an annotation as possible. In particular, we identified all SWISS-PROT sequences with HSSP distances (eqn. 1) > 15 to sequences in the sequence-unique subset (below) using BLAST [49, 30] and extracted their keywords. We merged these keywords with the keywords found for the corresponding protein from the sequence-unique set. Finally, we built a data set of binary vectors [50] for these keywords that represented the presence of a certain keyword by 1 and the absence by 0. To reduce the dimensionality of feature space, we retained only keywords with 'above random' classifying ability based on an entropy (eqn. 2) and normalised entropy (eqn. 3) cut-off [44]. The accuracy vs. coverage plots were not very sensitive to the particular cut-off chosen (plots given at cubic.columbia.edu/services/LOCkey). Most proteins had 2-5 keywords (Fig. 3 A).
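The construction of the binary keyword vectors can be sketched as follows. This is a minimal illustration under stated assumptions, not the original PERL code: the function names, the protein identifiers and the example keywords are hypothetical, and the homologue lists are assumed to have been found beforehand by the sequence search.

```python
def merge_homologue_keywords(own_keywords, homologue_keyword_lists):
    """Union of a protein's own keywords and those of its homologues."""
    merged = set(own_keywords)
    for keywords in homologue_keyword_lists:
        merged.update(keywords)
    return merged


def build_vocabulary(protein_keywords):
    """Sorted list of all keywords seen in the trusted proteins."""
    return sorted({kw for kws in protein_keywords.values() for kw in kws})


def to_binary_vector(keywords, vocabulary):
    """1 where the protein carries the keyword, 0 otherwise."""
    present = set(keywords)
    return tuple(1 if kw in present else 0 for kw in vocabulary)


if __name__ == "__main__":
    # Hypothetical toy data: two proteins of known localization.
    trusted = {
        "P1": merge_homologue_keywords({"Kinase"}, [{"ATP-binding"}]),
        "P2": {"Zinc-finger"},
    }
    vocabulary = build_vocabulary(trusted)
    for name, keywords in trusted.items():
        print(name, to_binary_vector(keywords, vocabulary))
```

Keeping the vocabulary sorted fixes the position of every keyword, so that vectors built for query proteins later have the same dimension as the trusted vectors.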
Fig. 1. : LOCkey algorithm. First, we compiled a sequence-unique data set of proteins of experimentally known sub-cellular localization. For these proteins, we then extracted keywords from SWISS-PROT. Next, we merged keywords found in homologues. We represented the keywords found in the proteins of known localization as binary vectors in the 'Trusted Vector Set'. For proteins of unknown localization U, the goal now became to compare these to the ‘Trusted Vector Set’. Toward this end, we first identified all keywords in U and in homologues of U. Then we constructed all possible keyword combinations (SUB vectors), and compared these to the ‘Trusted’ vectors. We found the best matching vector based on entropy criteria (Methods). Finally, we used this ‘best matching vector’ to infer localisation for the query.
Removing 'trivial' keywords. To evaluate the ability of the system to discover non-trivial correlations between variables, we excluded all keywords from the vector set that were biologically correlated with localization (e.g. 'DNA-binding'). We also excluded keywords that were observed to occur more than 90% of the time in proteins within a single sub-cellular localization. Thus, we excluded 81 keywords from our trusted vector set. After removing these keywords, 176 test proteins were left without any keyword. To minimise the effects of annotation errors, we retained only keywords that occurred in at least 10 protein families.
Step 2: Classifying proteins of unknown localization. To infer the localization of a protein U of unknown localization, we first retrieved all keywords for U from SWISS-PROT that matched in our 'trusted vector set' of informative keywords. Thus, we retrieved a vector V(U) that had the same dimension as the vectors in the 'trusted set'. Next, we generated all possible alternatives to V(U) for which one or many 1's were flipped to 0's. For example, for a protein with 3 keywords, we generated 2^3 - 1 = 7 sub-vectors V'(U): 111, 110, 101, 011, 100, 010 and 001. These sub-vectors constituted all possible keyword combinations for protein U. The final task was to find the keyword combination that yielded the best classification of U into one of ten classes of sub-cellular localizations, i.e. was most similar to one of the 'trusted vectors'. To achieve this, we retrieved all exact matches of any of the sub-vectors V'(U) to any of the proteins in the trusted vectors, i.e. found all proteins in the trusted set containing one of the keywords found in U. By construction of the sub-vectors, the proteins retrieved in this way may also contain keywords not found in U. Next, we simply counted how often the proteins retrieved belonged to a particular class C(i), i=1 … 10. We repeated this for each of the sub-vectors V'(U), and selected the finally assigned localization by minimising an entropy-based objective function ('prediction mode').
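The classification step can be sketched in a few lines. This is an illustration under assumptions, not the published implementation: keyword combinations are represented as sets instead of 0/1 vectors, the plain Shannon entropy of the class counts stands in for the entropy-based objective function, and `min_support` is a hypothetical parameter that mimics a minimum number of supporting trusted proteins.

```python
from collections import Counter
from itertools import combinations
from math import log2


def sub_vectors(keywords):
    """Yield all 2^n - 1 non-empty keyword combinations, largest first."""
    kws = sorted(keywords)
    for size in range(len(kws), 0, -1):
        for combo in combinations(kws, size):
            yield frozenset(combo)


def class_counts(combo, trusted):
    """Count classes of trusted proteins that contain every keyword in combo."""
    return Counter(loc for kws, loc in trusted if combo <= kws)


def entropy(counts):
    """Shannon entropy (bits) of a class-count distribution."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def classify(keywords, trusted, min_support=1):
    """Assign the majority class of the lowest-entropy keyword combination."""
    best = None
    for combo in sub_vectors(keywords):
        counts = class_counts(combo, trusted)
        if not counts or sum(counts.values()) < min_support:
            continue  # no (or too few) trusted proteins carry this combination
        h = entropy(counts)
        if best is None or h < best[0]:
            best = (h, counts.most_common(1)[0][0])
    return None if best is None else best[1]


if __name__ == "__main__":
    # Hypothetical trusted set: (keywords, localization) pairs.
    trusted = [
        (frozenset({"A", "B"}), "nuc"),
        (frozenset({"A", "B", "C"}), "nuc"),
        (frozenset({"A"}), "cyt"),
        (frozenset({"C"}), "mit"),
    ]
    print(classify({"A", "B"}, trusted))
```

Enumerating every combination is exponential in the number of keywords per protein, which is tolerable here because most proteins carry only a handful of keywords.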
Data sets. We selected all proteins with unambiguously annotated sub-cellular localization in SWISS-PROT release 40 [16]. We excluded sequences annotated as "POSSIBLE", "PROBABLE", "SPECIFIC PERIODS" or "BY SIMILARITY", and proteins with multiple annotations of localization. This left 13589 proteins in the 'Experimental data set' (Table 1). To reduce bias, we built a representative subset of sequence-unique proteins by using a simple greedy algorithm [51]. In particular, we accepted only pairs with an HSSP distance below 15 [52, 53]:
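$$\text{HSSP DISTANCE} = \text{PIDE} - \text{HSSP\_PIDE}(L) \qquad \text{(Eq. 1)}$$
$$\text{HSSP\_PIDE}(L) = \begin{cases} 100 & \text{for } L \le 11 \\ 480 \cdot L^{-0.32\,\left(1 + e^{-L/1000}\right)} & \text{for } 11 < L \le 450 \\ 19.5 & \text{for } L > 450 \end{cases}$$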
where PIDE is the percentage of pairwise identical residues and L is the alignment length. The final sequence-unique subset of known localization contained 3146 proteins. For entire-proteome predictions, we obtained the sequences with their alignments to SWISS-PROT proteins from cubic.bioc.columbia.edu/genomes [7] .
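As a worked example of Eq. 1, the sketch below evaluates the HSSP distance for a given alignment. It assumes the commonly published piecewise form of the HSSP curve; the length limits of 11 and 450 residues and the constants 480, 0.32 and 19.5 come from that form and are an assumption with respect to this text. The function names are hypothetical.

```python
from math import exp


def hssp_pide(length):
    """Length-dependent identity threshold (percent) of the assumed HSSP curve."""
    if length <= 11:
        return 100.0
    if length <= 450:
        return 480.0 * length ** (-0.32 * (1.0 + exp(-length / 1000.0)))
    return 19.5


def hssp_distance(pide, length):
    """Eq. 1: percentage identity minus the curve value at this alignment length."""
    return pide - hssp_pide(length)


if __name__ == "__main__":
    # For an alignment over 100 residues the curve lies near 29% identity,
    # so 34% identity corresponds to an HSSP distance of about 5.
    print(round(hssp_pide(100), 2), round(hssp_distance(34.0, 100), 2))
```

Under this form of the curve, a distance threshold of 5 for alignments of 100 residues corresponds to roughly 34% pairwise sequence identity.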
| Sub-cellular localization | SWISS-PROT a | Sequence-unique b |
| Nucleus | 3478 | 922 |
| Extra-cellular space | 2900 | 724 |
| Cytoplasm | 2642 | 544 |
| Mitochondria | 1743 | 467 |
| Chloroplast | 1648 | 197 |
| Endoplasmic reticulum | 568 | 108 |
| Peroxisome | 177 | 55 |
| Golgi apparatus | 167 | 52 |
| Lysosome | 163 | 49 |
| Vacuolar | 103 | 28 |
| SUM (all 10) | 13589 | 3146 |
a Number of proteins with known localization found in SWISS-PROT; b Number of sequence-unique proteins, i.e. representative subset of all SWISS-PROT proteins found (Methods). Note: we used the sequence-unique set as test set.
Sorting assignments by keyword entropy. For each of the remaining keywords, we calculated the Shannon Information SI [54] according to:
$$SI = -\sum_{i=1}^{N} P_i \log P_i \qquad \text{(Eq. 2)}$$
where N is the number of localization classes (10) and Pi are the probabilities of finding the keyword in one of the 10 classes of localization. Since the Shannon Information does not take into account the background distribution of proteins among the various localizations, we calculated a normalised Shannon Information normSI for each keyword.
where Xi was the fraction of proteins belonging to a given localization identified by the keyword and M the number of localizations in which the keyword was found. Finally, we defined the percent of fractional change in SI (and normSI) as:
where maxSI is the maximum possible Shannon Information and maxNormSI the maximum possible normalised Shannon Information. We included only those keywords in our set of trusted vectors that satisfied the criteria: fracSI > 25 and maxNormSI > 25.
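The keyword filter can be illustrated with a small sketch. Two points are assumptions and are not taken from the text: the logarithm is taken to base 2, and the percent fractional change is taken to be fracSI = 100 * (maxSI - SI) / maxSI, with maxSI the entropy of a keyword spread evenly over all N classes. The normalised variant is not sketched because its exact form is not recoverable here.

```python
from math import log2


def shannon_information(class_counts):
    """Shannon Information (bits) of a keyword's distribution over classes."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * log2(p) for p in probs)


def frac_si(class_counts, n_classes=10):
    """Assumed percent fractional change: 100 * (maxSI - SI) / maxSI."""
    max_si = log2(n_classes)  # keyword spread evenly over all classes
    return 100.0 * (max_si - shannon_information(class_counts)) / max_si


if __name__ == "__main__":
    # Hypothetical counts of one keyword over the ten localization classes.
    specific = [40, 2, 1, 0, 0, 0, 0, 0, 0, 0]
    unspecific = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
    print(frac_si(specific) > 25, frac_si(unspecific) > 25)
```

With this definition a keyword found in a single class scores 100 and a keyword spread evenly over all classes scores 0, so a cut-off such as 25 removes the least specific keywords.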
Inferring localization from keywords (prediction mode). To infer localization for test proteins we identified the keyword combinations that maximised fracSI and maxNormSI ( eqn. 4 ). Predictions were made only if at least one keyword combination could be found such that fracSI > 70 and maxNormSI > 70. Additionally, we required that the keyword combination was present in at least five families in the training set.
Evaluating performance accuracy. We evaluated performance by a five-fold cross-validation experiment, i.e. we partitioned the sequence-unique subset into five sets. Then we used four of the five sets to generate the data set of trusted keyword vectors (training set), and inferred (predicted) localization for the remaining fifth set (test set). We repeated this procedure five times such that each of the sets was used for testing once. The final levels of accuracy and coverage constituted averages over all five tests. The partitioning was performed using a greedy clustering algorithm starting with the largest and longest families [51] . An HSSP distance = 5 ( eqn. 1 ) was chosen for the clustering. This ensured that two 100 residue long sequences chosen from the training and test set have fewer than 35% pairwise identical residues.
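The partitioning can be sketched as a greedy assignment of whole families to folds, so that no family is split between training and test sets. This is a simplified illustration: families are assumed to be given as lists of protein identifiers (the clustering by HSSP distance is not shown), and the function names are hypothetical.

```python
def partition_families(families, n_folds=5):
    """Greedily place whole families, largest first, into the smallest fold."""
    folds = [[] for _ in range(n_folds)]
    for family in sorted(families, key=len, reverse=True):
        smallest = min(folds, key=len)
        smallest.extend(family)
    return folds


def cross_validation_splits(folds):
    """Yield (training, test) lists; each fold serves as the test set once."""
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


if __name__ == "__main__":
    # Hypothetical families of homologous proteins.
    families = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"], ["d1"], ["e1"], ["f1"]]
    folds = partition_families(families)
    for train, test in cross_validation_splits(folds):
        print(len(train), test)
```

Assigning the largest families first keeps the fold sizes balanced, because the many small families that follow can fill the remaining gaps.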
LOCkey yielded high accuracy but low coverage. We tested LOCkey in a five-fold cross-validation experiment (Methods). Some SWISS-PROT keywords are such that all biologists would immediately know the localization of the respective protein (e.g. DNA-binding); others need more expertise. To assess the ability of the system to discover non-trivial correlations between variables, we excluded all keywords predominantly associated with a single localization. This 'filtering-out of the most obvious' required removing 81 keywords from the 'trusted vectors'. For 176 of the proteins in the test set, we found no keywords. When choosing the entropy cut-off such that we could classify one fourth of all proteins (coverage in Fig. 2 A), our system correctly inferred one of ten classes of sub-cellular localization for 87% of the classified proteins (accuracy in Fig. 2 A). We noticed that the top two hits contained the observed localization at a similar level of accuracy (87%) at an entropy cut-off at which we assigned localization for about 35% of all proteins (dashed line in Fig. 2 A). Interestingly, this increased performance when considering the top two hits was mostly due to proteins from the chloroplast that were often confused with mitochondrial proteins. The reasons for the low coverage of the system were manifold. First, some proteins had no keyword (176 of 3146). Second, we required that the keyword pattern was present in at least five proteins in the vector set. Third, the vector set was too small (3146 proteins) to provide a good sample for all proteins. In other words, many keywords found in the testing set were not present in the 'training' set.
Performance varied substantially between classes. LOCkey was more successful for some classes than for others. In particular, extra-cellular, nuclear and mitochondrial proteins could be inferred more reliably than the other classes ( Fig. 2 B). On the other hand, performance was worst for proteins from the cytoplasm and chloroplasts ( Fig. 2 B). When we considered our findings correct if any of the top two hits was predicted in the observed localization, we noticed a considerable improvement for most of the major classes (data not shown). The only exceptions were cytoplasmic proteins for which the keywords yielded levels above 75% accuracy at entropy thresholds corresponding to levels of coverage around 10%. The detailed 'confusion matrix' ( Table 2 ) revealed that cytoplasmic proteins were most often confused with nuclear proteins, and proteins from the chloroplast were most often assigned incorrectly to mitochondria. Although the minor classes (Golgi, Endoplasmic reticulum, peroxisome, vacuoles, and lysosome) contained too few proteins to allow statistically significant conclusions, we noted that proteins from vacuoles and the lysosome were mostly confused with extra-cellular proteins ( Table 2 ).
Fig. 2. : Results for five-fold cross-validation. (A) Average over 10 classes. Results were obtained for a sequence-unique set. Keywords thought to be biologically correlated with sub-cellular localization and those observed to occur with high specificity in a single localization were excluded (Methods). The bold line represents accuracy versus coverage for the predicted localization. For example, at 25% coverage the system approaches a prediction accuracy of 87%. Prediction accuracy appeared higher when we considered our findings correct if the correct localization was one of the top two predicted localizations (grey line). (B) Major classes. Nuclear, extra-cellular and mitochondrial classes showed similar accuracy versus coverage statistics. Cytoplasmic and chloroplast proteins were predicted with a much lower accuracy.
| Prd a → / Obs b ↓ | nuc | ext | cyt | mit | pla | ret | oxi | gol | lys | vac | SUMobs |
| nuc | 340 | 5 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 350 |
| ext | 5 | 321 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 337 |
| cyt | 37 | 25 | 59 | 14 | 1 | 2 | 0 | 0 | 0 | 0 | 138 |
| mit | 10 | 0 | 5 | 151 | 20 | 4 | 0 | 0 | 0 | 0 | 190 |
| pla | 5 | 1 | 5 | 24 | 56 | 1 | 0 | 0 | 0 | 0 | 92 |
| ret | 1 | 4 | 2 | 1 | 1 | 2 | 0 | 2 | 0 | 0 | 13 |
| oxi | 2 | 0 | 1 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 9 |
| gol | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 17 | 0 | 0 | 21 |
| lys | 0 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
| vac | 0 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 5 |
| SUMprd | 400 | 365 | 90 | 198 | 78 | 10 | 0 | 19 | 1 | 0 | 1161 |
a Prd: predicted localization; b Obs: annotated localization. Abbreviations for localizations: nuc: nucleus; ext: extra-cellular space; cyt: cytoplasm; mit: mitochondria; pla: chloroplast; ret: endoplasmic reticulum; oxi: peroxisome; gol: Golgi apparatus; lys: lysosome; vac: vacuoles. The numbers give the proteins used in the five-fold cross-validation experiment for which LOCkey assigned any localization (correct classifications lie on the diagonal).
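The counts in Table 2 can be summarised per class with a short script. The sketch below copies the matrix from the table (rows: observed, columns: predicted); the function names are hypothetical, and the two per-class fractions are plain ratios from the matrix, not the accuracy and coverage curves of Fig. 2.

```python
CLASSES = ["nuc", "ext", "cyt", "mit", "pla", "ret", "oxi", "gol", "lys", "vac"]

# Rows: observed localization; columns: predicted localization (Table 2).
CONFUSION = [
    [340, 5, 4, 1, 0, 0, 0, 0, 0, 0],
    [5, 321, 11, 0, 0, 0, 0, 0, 0, 0],
    [37, 25, 59, 14, 1, 2, 0, 0, 0, 0],
    [10, 0, 5, 151, 20, 4, 0, 0, 0, 0],
    [5, 1, 5, 24, 56, 1, 0, 0, 0, 0],
    [1, 4, 2, 1, 1, 2, 0, 2, 0, 0],
    [2, 0, 1, 5, 0, 1, 0, 0, 0, 0],
    [0, 2, 2, 0, 0, 0, 0, 17, 0, 0],
    [0, 5, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 2, 1, 1, 0, 0, 0, 0, 1, 0],
]


def overall_accuracy(matrix):
    """Percentage of classified proteins on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


def per_class_fractions(matrix, names=CLASSES):
    """For each class: (correct / observed, correct / predicted), None if empty."""
    result = {}
    for i, name in enumerate(names):
        observed = sum(matrix[i])
        predicted = sum(row[i] for row in matrix)
        hit = matrix[i][i]
        result[name] = (
            hit / observed if observed else None,
            hit / predicted if predicted else None,
        )
    return result


if __name__ == "__main__":
    print(round(overall_accuracy(CONFUSION), 1))
    for name, fractions in per_class_fractions(CONFUSION).items():
        print(name, fractions)
```

Over all 1161 classified proteins, 946 lie on the diagonal, i.e. roughly 81.5%.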
Performance improves with number of keywords. LOCkey inferred the correct localization for almost all proteins for which we had many keywords. In fact, both accuracy and coverage approached 100% for proteins with more than 25 keywords ( Fig. 3 B). For proteins with few keywords, the accuracy decreased with increasing coverage ( Fig. 3 B). The improved coverage with increasing number of keywords was the result of discovering new keyword combinations that meet the entropy criteria (Methods). Since the keywords were non-specific to any localization class, the initial drop in accuracy was due to a larger fraction of keyword combinations that closely met the entropy criteria. These combinations were more often incorrectly predicted.
Annotating entire proteomes. Using LOCkey, we could provide sub-cellular localization annotations to 38-55% more proteins than by simple annotation transfer using homology (Table 3). To provide independent confirmation for our annotations, we checked the proteins inferred to be nuclear for nuclear localization signals (NLS; [55]) and the proteins inferred to be extra-cellular for predicted signal peptides [56]. More than 20% of all nuclear proteins identified by LOCkey (~ 9600) contained known NLSs (Table 3). All 3130 extra-cellular proteins identified by LOCkey in human and arabidopsis and over 60% of those in fly and worm contained signal peptides (Table 3). In contrast, only 2 of the predicted 76 extra-cellular proteins in yeast contained predicted signal peptides. Note that the numbers of proteins with signal peptides or NLSs that were not identified by LOCkey are not relevant, since neither SignalP nor our NLS database finds all extra-cellular or nuclear proteins at 100% accuracy. The relevant numbers in this context were the values for accuracy vs. coverage for the cross-validation experiment (Fig. 2). Unfortunately, the only way to assess whether or not LOCkey correctly identified nuclear and extra-cellular proteins without motifs is to await the respective experiments. If we can generalise the levels of accuracy found in the cross-validation experiment, we conclude that LOCkey indeed identified many proteins that could not have been classified reliably by motif-based methods.
LOCkey assignments were often not trivial. Experts can often assign localization from SWISS-PROT keywords. This may be impractical in the context of assigning localization for entire proteomes. However, when analysing the assignments from LOCkey in more detail (http://cubic.bioc.columbia.edu/services/LOCkey/), we found that often the biologists in our lab could not clearly infer localization from the SWISS-PROT keywords. In other words, LOCkey ‘discovered’ relations that require more specific expertise than we can expect from a typical expert annotator.
Fig. 3. : Performance improves with number of keywords. (A) Keyword distribution in test set: Most test proteins had 2-5 keywords. (B) Performance as a function of keywords: The prediction accuracy and coverage were both nearly 100% for proteins with more than 30 keywords. The coverage (thin line) tends to increase with the number of keywords. The accuracy was observed to decrease first (thick line) before increasing.
| Organism | Nprot a | OneKey b | LOCkey c | Homology d | signalP e | predictNLS f |
| Arabidopsis thaliana (plant) | 25456 | 6703 | 3598 | 1961 | 100 | 16 |
| Caenorhabditis elegans (worm) | 18898 | 3584 | 1999 | 1240 | 60 | 22 |
| Drosophila melanogaster (fly) | 14184 | 4010 | 2430 | 1501 | 66 | 24 |
| Homo sapiens (human, partial) | 31073 | 16522 | 10174 | 6057 | 100 | 23 |
| Saccharomyces cerevisiae (yeast) | 6306 | 3691 | 1747 | 837 | 3 | 20 |
| SUM | 95917 | 34510 | 19948 | 11596 | | |
a Nprot: number of proteins in proteome; b OneKey: number of proteins with at least one keyword in SWISS-PROT that matches our trusted vectors (System); c LOCkey: number of proteins for which LOCkey inferred sub-cellular localization in ten classes (Table 1; note: these results were obtained using the entropy thresholds that gave 87% testing accuracy, Fig. 2); d Homology: sub-cellular localization inferred using homology, i.e. sequence similarity to proteins of known localization taken from SWISS-PROT (at a threshold of HSSP distance > 15; at this distance the assignment through homology yielded levels around 90% accuracy, Nair & Rost, unpublished); e signalP: percentage of predicted extra-cellular proteins also predicted to contain a signal peptide (Nielsen et al., 1997); f predictNLS: percentage of predicted nuclear proteins also predicted to have a nuclear localization signal (Cokol et al., 2000). Note that LOCkey enabled the annotation of 8352 eukaryotic proteins of unknown localization (19948 - 11596).
LOCkey automatically provided high quality annotations of sub-cellular localization from SWISS-PROT keywords. However, for most test proteins, we could not infer localization at all. This low coverage originated from a lack of relevant functional information for many proteins. One solution to this problem could be to extract keywords from bibliography databases such as MEDLINE [57].
Of all the major localization classes, cytoplasmic proteins were predicted worst (Table 2); these were also the major source of error in predicting nuclear and extra-cellular proteins. One reason could be that experimental annotations are less accurate for cytoplasmic proteins. Another reason could be that proteins do in fact shuttle between the cytoplasm and other localizations and that our 'errors' really captured proteins that could also occur in the predicted class. This interpretation was somewhat supported by the finding that LOCkey often found the correct class in the first two hits. In other words, when replacing the binary classification accuracy (a protein can only be in one single localization) by a probabilistic measure (one protein can be in many compartments), LOCkey appeared more accurate.
We applied LOCkey to five entirely sequenced eukaryotic proteomes (yeast, worm, fly, human, and arabidopsis). We could infer localization for over 8300 proteins for which localization could not have been detected by any other automatic system. Three types of methods can infer or predict localization in the context of entire proteomes: (1) homology to proteins of known localization, (2) detection of sequence motifs, and (3) prediction from sequence and structure. In our group, we simultaneously work on all these types of methods. LOCkey is most relevant for the coverage achieved by homology-based methods, since it allows us to automatically increase the data set of proteins of known localization to which we can apply homology thresholds (Nair & Rost, unpublished).
The PERL-code [58] of LOCkey was optimised to provide fast annotations. Annotating the entire C. elegans proteome took less than four hours on a PIII 900 MHz machine. The algorithm is limited to problems with few data points in the vector set (n<<1000000) and with few keywords (n<<10000).
Since the algorithm was not tailored to inferring sub-cellular localization, we are currently implementing the same idea to recognise distant similarities to proteins of known structure, i.e. to the problem of fold recognition. Our preliminary results are encouraging.
Thanks to Jinfeng Liu (Columbia) for computer assistance and the genome data; to Dariusz Przybylski (Columbia) and Trevor Siggers (Columbia) for helpful discussions and to Kazimierz Wrzeszczynski (Columbia) and Henry Bigelow (Columbia) for valuable comments on the manuscript. We also thank the undisclosed reviewers for their helpful comments. The work was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health. Last, not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.
| 1. | Fleischmann, R. D., Adams, M. D.,White, O., Clayton, R. A., Kirkness, E. F. et al. (1995). Whole-genome randomsequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496-512. |
| 2. | Goffeau, A., Barrell, B. G., Bussey,H., Davis, R. W., Dujon, B. et al. (1996). Life with 6000 genes. Science, 274, 546-567. |
| 3. | The C. elegans Sequencing Consortium(1998). Genome sequence of the nematode C. elegans: a platform forinvestigating biology. Science, 282, 2012-2018. |
| 4. | Adams, M. D., Celniker, S. E., Holt,R. A., Evans, C. A., Gocayne, J. D. et al. (2000). The genome sequence ofDrosophila melanogaster. Science, 287, 2185-2195. |
| 5. | Arabidopsis Genome Initiative(2000). Analysis of the genome sequence of the flowering plant Arabidopsisthaliana. Nature, 408, 796-815. |
| 6. | Frishman, D. (2000). PEDANT: proteinextraction, description, and analysis tool. |
| 7. | Liu, J. & Rost, B. (2000).Analysing all proteins in entire genomes. |
| 8. | Koonin, E. V. (2000). Bridging thegap between sequence and function. Trends Genet,16, 16. |
| 9. | Baker, P. G. & Brass, A. (1998).Recent developments in biological sequence databases. Curr Opin Biotechnol, 9, 54-8. |
| 10. | Fleischmann, W., Moller, S.,Gateau, A. & Apweiler, R. (1999). A novel method for automatic functionalannotation of proteins. Bioinformatics, 15, 228-33. |
| 11. | Eisenberg, D., Marcotte, E. M.,Xenarios, I. & Yeates, T. O. (2000). Protein function in the post-genomicera. Nature, 405,823-6. |
| 12. | Lewis, S., Ashburner, M. &Reese, M. G. (2000). Annotating eukaryote genomes. Curr Opin Struct Biol, 10, 349-54. |
| 13. | Gaasterland, T. & Sensen, C. W.(1996). MAGPIE: automated genome interpretation. Trends Genet, 12, 76-8. |
| 14. | Apweiler, R., Gateau, A., Contrino,S., Martin, M. J., Junker, V. et al. (1997). Protein sequence annotation in thegenome era: the annotation concept of SWISS-PROT+TREMBL. Proc Int ConfIntell Syst Mol Biol, 5, 33-43. |
| 15. | Eisenhaber, F. & Bork, P.(1998). Wanted: subcellular localization of proteins based on sequence. Trendsin Cell Biology, 8,169-170. |
| 16. | Bairoch, A. & Apweiler, R.(2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in2000. Nucleic Acids Res, 28, 45-8. |
| 17. | Tamames, J., Ouzounis, C., Casari,G., Sander, C. & Valencia, A. (1998). EUCLID: automatic classification ofproteins in functional classes by their database annotations. Bioinformatics, 14, 542-3. |
| 18. | Apweiler, R. (2001). Functionalinformation in SWISS-PROT: the basis for large-scale characterisation ofprotein sequences. Brief Bioinform, 2, 9-18. |
| 19. | Riley, M. (1993). Function of thegene products in Escherichia coli. Microbiol. Rev., 57, 862-952. |
| 20. | Riley, M. & Labedan, B. (1997).Protein evolution viewed through Escherichia coli protein sequences:introducing the notion of a structural segment of homology, the module. Journalof Molecular Biology, 268, 857-868. |
| 21. | Bork, P., Dandekar, T.,Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M. et al. (1998). Predicting function:from genes to genomes and back. J Mol Biol,283, 707-25. |
| 22. | Casari, G., Andrade, M. A., Bork, P., Boyle, J., Daruvar, A. et al. (1995). Challenging times for bioinformatics. Nature, 376, 647-648. |
| 23. | Bork, P. & Gibson, T. J. (1996). Applying motif and profile searches. Methods in Enzymology, 266, 162-184. |
| 24. | Andrade, M. A., Brown, N. P., Leroy, C., Hoersch, S., de Daruvar, A. et al. (1999). Automated genome sequence analysis and annotation. Bioinformatics, 15, 391-412. |
| 25. | Tatusov, R. L., Galperin, M. Y., Natale, D. A. & Koonin, E. V. (2000). The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res, 28, 33-6. |
| 26. | Devos, D. & Valencia, A. (2001). Intrinsic errors in genome annotation. Trends in Genetics, 17, 429-431. |
| 27. | Remm, M., Storm, C. E. & Sonnhammer, E. L. (2001). Automatic Clustering of Orthologs and In-paralogs from Pairwise Species Comparisons. Journal of Molecular Biology, 314, 1041-1052. |
| 28. | Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215, 403-10. |
| 29. | Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A, 85, 2444-8. |
| 30. | Altschul, S., Madden, T., Shaffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402. |
| 31. | Rost, B. (2001). Enzyme function less conserved than anticipated. Journal of Molecular Biology, submitted. |
| 32. | Bork, P. & Koonin, E. V. (1998). Predicting functions from protein sequences--where are the bottlenecks? Nat Genet, 18, 313-8. |
| 33. | Doerks, T., Bairoch, A. & Bork, P. (1998). Protein annotation: detective work for function prediction. Trends Genet, 14, 248-50. |
| 34. | Galperin, M. Y. & Koonin, E. V. (2000). Who's your neighbor? New computational approaches for functional genomics. Nat Biotechnol, 18, 609-13. |
| 35. | Tamames, J., Ouzounis, C., Sander, C. & Valencia, A. (1996). Genomes with distinct function composition. FEBS Lett, 389, 96-101. |
| 36. | Eisenhaber, F. & Bork, P. (1999). Evaluation of human-readable annotation in biomolecular sequence databases with biological rule libraries. Bioinformatics, 15, 528-35. |
| 37. | Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. (1999). The PROSITE database, its status in 1999. Nucleic Acids Research, 27, 215-219. |
| 38. | Krawiec, S. & Riley, M. (1990). Organization of the bacterial chromosome. Microbiol. Rev., 54, 502-539. |
| 39. | Karp, P. D., Riley, M., Paley, S. M., Pellegrini-Toole, A. & Krummenacker, M. (1999). EcoCyc: encyclopedia of Escherichia coli genes and metabolism. Nucleic Acids Research, 27, 55-8. |
| 40. | Kretschmann, E., Fleischmann, W. & Apweiler, R. (2001). Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics, 17, 920-6. |
| 41. | Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Birney, E. et al. (2000). InterPro--an integrated documentation resource for protein families, domains and functional sites. Bioinformatics, 16, 1145-50. |
| 42. | Yang, Y. & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. The Fourteenth International Conference on Machine Learning, 412-420. |
| 43. | Yang, Y. & Chute, C. G. (1992). An application of least squares fit mapping to clinical classification. Proceedings - the Annual Symposium on Computer Applications in Medical Care, 460-4. |
| 44. | Schutze, H., Hull, D. A. & Pedersen, J. O. (1995). A comparison of classifiers and document representation for the routing problem. 18th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), 229-237. |
| 45. | Lewis, D. D. & Ringuette, M. (1994). Comparison of two learning algorithms for text categorization. Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR'94). |
| 46. | Apte, C., Damerau, F. & Weiss, S. (1994). Towards language independent automated learning of text categorization models. Proceedings of the 17th Annual ACM/SIGIR conference. |
| 47. | Dasarathy, B. V. (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, California. |
| 48. | Yang, Y. & Liu, X. (1999). A re-examination of text categorisation methods. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 42-49. |
| 49. | Altschul, S. F. & Gish, W. (1996). Local alignment statistics. Methods in Enzymology, 266, 460-480. |
| 50. | Salton, G. (1989). Automatic Text Processing. Addison-Wesley, Reading, MA. |
| 51. | Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Selection of representative protein data sets. Protein Science, 1, 409-17. |
| 52. | Sander, C. & Schneider, R. (1994). The HSSP database of protein structure-sequence alignments. Nucleic Acids Research, 22, 3597-3599. |
| 53. | Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Eng, 12, 85-94. |
| 54. | Shannon, C. E. (1951). Prediction and entropy of printed English. Bell System Tech. J., 30, 50-64. |
| 55. | Cokol, M., Nair, R. & Rost, B. (2000). Finding nuclear localisation signals. EMBO Reports, 1, 411-415. |
| 56. | Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int J Neural Syst, 8, 581-99. |
| 57. | Andrade, M. A. & Valencia, A. (1998). Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics, 14, 600-7. |
| 58. | Wall, L. & Schwartz, R. L. (1990). Programming perl. O'Reilly & Associates, Inc., Sebastopol, CA. |
| Contact: rost@columbia.edu | Version: Apr 2, 2002 |