
Title: Annotating protein function through lexical analysis
Author:Rajesh Nair & Burkhard Rost
Quote: AI Magazine, 25, 45-56

Annotating protein function through lexical analysis

Rajesh Nair 1,4, * & Burkhard Rost 1, 2, 3, *

1 CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
5 Dept. of Physics, Columbia Univ., 538 West 120th Street, New York, NY 10027, USA
* Corresponding author:  email = cubic@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/  Tel: +1-212-305-4018, fax: +1-212-305-7932

This article is published in (AI Magazine, issue, 2003 and pages) © copyright American Association for Artificial Intelligence (AAAI) Press (2003). AAAI Press is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.

 



Abstract

We now know the entire genomes of over 100 organisms. The experimental characterisation of the newly sequenced proteins lags behind this explosion of raw sequences (sequence-function gap). The rate at which expert annotators add experimental information into the more or less controlled vocabularies of databases crawls along at an even slower pace. Most methods that annotate protein function exploit sequence similarity by transferring experimental information from homologues. A crucial development aiding such homology-based information transfer is the emergence of large-scale, work- and management-intensive projects venturing to develop a comprehensive ontology for protein function, like the Gene Ontology project. In parallel, fully- or semi-automatic methods have successfully begun to mine the existing data through lexical analysis. Some of these tools target parsing controlled vocabularies from databases; others dare to mine free text from MEDLINE abstracts or full scientific papers. Automated text analysis has become a rapidly expanding discipline in bioinformatics. A few of these text-based tools have already been embedded into research projects.

Key words: text mining, protein function, sub-cellular localization, text analysis, MEDLINE abstracts.

 

Abbreviations used

DNA: deoxyribonucleic acid
EC: enzyme commission number (functional classification of enzymes)
GO: gene ontology (functional classification of proteins)
MIPS: Munich information center for protein sequences
SWISS-PROT: curated database of protein sequences
TrEMBL: protein entries derived from coding sequences present in the EMBL nucleotide sequence database


 

 

Introduction

Proteins are the machinery of life. The information for life is stored in a four-letter alphabet in the genome (DNA) [1, 2] . This four-letter DNA alphabet is first transcribed into another four-letter alphabet of biochemically less stable RNA nucleotides, and from these translated into a 20-letter amino acid alphabet constituting the basic language for proteins. Proteins are assembled by joining amino acids through peptide bonds; they differ greatly in the number of amino acids joined (from 30 to over 30,000) and in the arrangement and types of amino acids used (dubbed residues, when joined in proteins). The complete set of all proteins in one particular organism is referred to as its proteome. Proteins are the macromolecules that perform most important tasks in organisms, such as the catalysis of biochemical reactions, the transport of nutrients, and the recognition and transmission of signals. The plethora of aspects of the role of any particular protein is referred to as its 'function'. Although this is an intuitive statement, protein function is not a well-defined term. Instead, function is a complex phenomenon that is associated with many mutually overlapping levels: chemical, biochemical, cellular, organism mediated, developmental, and physiological. These levels are related in complex ways. For example, protein kinases can be related to different cellular functions (such as cell cycle), and to a chemical function (transferase) plus a complex control mechanism by interaction with other proteins; the same kinase may also be the culprit that leads to malfunction, or disease. Thus, identifying protein function is a step toward understanding diseases and toward identifying drug targets.

Gap between protein sequences and function. The first entire genome (DNA) sequence of a free living organism, Haemophilus influenzae, was published in 1995 [3] . Currently, we know the full genomes for over 100 organisms; for over 60 of these the data is publicly available and contributes about 250K protein sequences, i.e. about one fourth of all currently known protein sequences [4, 5, 6, 7] . The number of entirely sequenced genomes is expected to continue growing exponentially for at least the next few years. This explosion of sequence information has widened the gap between the number of protein sequences deposited in public databases and the experimental characterisation of the corresponding proteins [8, 9, 10, 11] . Bioinformatics plays a central role in bridging the sequence-function gap through the development of tools for faster and more effective prediction of protein function [12, 13, 14, 15, 16, 17] .

Here, we briefly review a few attempts at annotating function through homology transfer and automatic text analysis. The most widely used methods that allow guessing protein function rely on the ability to correctly mine the information deposited in public databases and in scientific journals. While we continue to need a more comprehensive ontology for protein function, developers have begun to successfully explore the marvels of an ever-increasing body of research in biology and medicine. There are two major types of methods attempting automatic lexical analysis: (1) parsing of controlled vocabulary from databases, and (2) mining unstructured text as available from scientific publications. We could not cover all the promising approaches that have mushroomed over the last 5-10 years. Therefore, we focus in detail on a few success-stories.

Annotations and annotation transfer of protein function

Molecular biology databases with functional information. Information about proteins is stored in public databases like SWISS-PROT and TrEMBL ( Table 1 ). SWISS-PROT [18, 19] is an expert-curated database that also contains annotations about function ( Fig. 1 ). These annotations are added by a team of expert annotators who extract this information primarily from journal publications [20] . TrEMBL [18] consists of entries that are derived from the translation of all coding sequences in the EMBL nucleotide sequence database that are not in SWISS-PROT. Unlike SWISS-PROT records, those in TrEMBL are awaiting manual annotation. SWISS-PROT currently contains 'only' 122,564 sequence entries (release 41), while the TrEMBL database contains 821,014 sequence entries (release 72) [21] .



Fig. 1: Protein entry in SWISS-PROT. The SWISS-PROT identifier for the protein MYOD_HUMAN is found under the header 'ID'. The type of protein and its source organism are found under the 'DE' and 'OS' headers respectively. Detailed functional information regarding the protein is found under the header 'CC'. This information is written in plain English and is not suitable for computer analysis. Following the 'KW' header are keywords describing the function of the protein. The keywords use a restricted vocabulary and are ideal for tools designed for text analysis. Sequence information is found at the very end following the 'SQ' header.





Table 1: Web sites of major databases and genome resources.

Database              URL
SWISS-PROT            http://www.ebi.ac.uk/swissprot/
TrEMBL                http://www.ebi.ac.uk/trembl/
Gene Ontology (GO)    http://www.geneontology.org/
MIPS                  http://mips.gsf.de/
Ensembl               http://www.ensembl.org/
PEP                   http://cubic.bioc.columbia.edu/db/PEP/



Annotations of function mostly through homology transfer. Experimentally determining protein function continues to be a laborious task that may take enormous resources; for example, more than a decade after its discovery, we still do not know the precise and entire functional role of the prion protein [22] . The automatic elucidation of protein function is therefore an appealing challenge [23, 24, 25] . Bioinformatics approaches toward this end typically exploit the fact that two proteins with similar sequences often have similar functions. The basic idea involves the following steps: (1) Extract the experimental information from the literature into the controlled vocabulary of annotated databases. (2) Establish thresholds T for pairwise sequence similarity that imply similarity in function. (3) For a protein U of unknown function, search the database for proteins {K} that have a sequence similarity to U of SIM(K,U) > T. (4) Finally, if any such protein K is found, transfer its annotation to U. Although this concept appears straightforward, in practice there are many hurdles to overcome: (i) it is very difficult to create controlled vocabularies [26, 18] ; (ii) one single number may not capture all the functional roles [26, 27] ; and (iii) to add to the complication, it seems that the precise values for thresholds of significant sequence similarity (T) are actually specific to particular functions, i.e. become T(F), and have to be re-established for any given task [28, 29, 30, 31, 32, 27, 33, 34, 35, 36] . In general, the inference of function is reliable only for very high levels of sequence similarity [37, 33, 34] . For example, to reliably (>90% accuracy) infer sub-cellular localization by homology transfer, over 80% pairwise sequence identity is required. Below this threshold the accuracy of annotation transfer rapidly decreases [33] . Several pitfalls in transferring annotations of function have been reported, e.g. inadequate knowledge of thresholds for 'significant sequence similarity', using only the best database hit, or ignoring the domain organisation of proteins [38, 39, 40, 37] . Despite all these problems, the majority of annotations about function in public databases result from homology transfer [10, 37, 15] . Few databases provide unambiguous pointers to the origin of the information. One problem arising from this is that it may be difficult to distinguish speculations from experimentally supported annotations.
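Steps (3) and (4) of the homology-transfer recipe above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function names, the toy database, and in particular `percent_identity`, which compares unaligned positions only and stands in for a real alignment-based similarity measure. The sketch returns every hit above the threshold rather than only the best one, since relying on the single best hit is one of the pitfalls mentioned above.

```python
from typing import Callable, Dict, List, Tuple


def percent_identity(a: str, b: str) -> float:
    """Toy SIM(K, U): percentage of identical positions, no alignment.

    Hypothetical stand-in for a real pairwise alignment score.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def transfer_annotations(
    query: str,
    annotated: Dict[str, str],
    threshold: float,
    sim: Callable[[str, str], float] = percent_identity,
) -> List[Tuple[str, float]]:
    """Return (annotation, similarity) for all proteins K with SIM(K, U) > T.

    `annotated` maps sequences of known proteins to their annotation.
    Hits are sorted from most to least similar; an empty list means
    that no annotation can be transferred at this threshold.
    """
    hits = []
    for sequence, annotation in annotated.items():
        score = sim(query, sequence)
        if score > threshold:
            hits.append((annotation, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data (invented sequences, not real proteins).
database = {"MKTAYIAKQR": "nucleus", "GGGGGGGGGG": "cytoplasm"}
hits = transfer_annotations("MKTAYIAKQK", database, threshold=80.0)
```

Because the appropriate threshold appears to depend on the function being transferred, T(F), the threshold is deliberately a parameter here; the value 80.0 merely echoes the sequence identity level quoted above for sub-cellular localization.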

Problem 1: Multiple levels of description. The function of a protein depends on the context. Database annotations of protein function are often confusing due to the variety of functional roles [41] . We need computer-readable hierarchical descriptions of function [42, 12] . Several groups and associations have ventured to introduce numerical schemata to define function. The first attempt was the introduction of Enzyme Commission numbers (EC, [43] ); this classification uses four digits to classify enzymatic activity. The first EC digit distinguishes the general types of enzymes; the second EC digit specifies the substrate (oxidoreductases), the group transferred (transferases), the type of bond (hydrolases, lyases, ligases), or the type of reorganisation (isomerases). The third and fourth digits provide more detail (for an excellent survey of structural aspects of enzymatic function see Todd, Orengo & Thornton [27] ). MIPS attempts to extend this idea to a wider perspective of more proteins and more roles through their classification catalogue [44] . Arguably, the most impressive effort at defining an ontology for protein function originates from the Gene Ontology (GO) consortium [26] . GO distinguishes three levels of protein function. (1) Molecular function: at the molecular level, the protein can, for example, catalyse a metabolic reaction, or recognize or transmit a signal. (2) Biological process: a set of many co-operating proteins is responsible for achieving broad biological goals, for example, mitosis, purine metabolism, or signal transduction cascades. (3) Cellular component: this category includes the structure of sub-cellular compartments, the localization of proteins, and macromolecular complexes. Examples include nucleus, telomere, and origin recognition complex. The sub-cellular localization of a protein is an essential attribute for this level. 
The totality of the physiological sub-systems and their interplay with various environmental stimuli determines properties of the phenotype, the morphology and physiology of the organism and its behaviour. GO is not complete. Nevertheless, GO constitutes the best set of definitions available today.
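To make the hierarchical nature of the four-digit EC scheme described above concrete, the following small Python sketch parses an EC number and counts how many leading levels two enzymes share. The type and function names are hypothetical; the mapping of the first digit to the six enzyme classes is the standard EC assignment for the classes named in the text.

```python
from typing import NamedTuple

# First EC digit: the general type of enzyme.
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


class ECNumber(NamedTuple):
    enzyme_class: int
    subclass: int
    sub_subclass: int
    serial: int


def parse_ec(text: str) -> ECNumber:
    """Parse a fully specified EC number such as 'EC 2.7.1.1'."""
    cleaned = text.strip()
    if cleaned.upper().startswith("EC"):
        cleaned = cleaned[2:].strip()
    parts = cleaned.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated digits, got {text!r}")
    return ECNumber(*(int(p) for p in parts))


def shared_levels(a: ECNumber, b: ECNumber) -> int:
    """Number of leading EC levels (0 to 4) on which two enzymes agree."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


hexokinase = parse_ec("EC 2.7.1.1")
glucokinase = parse_ec("2.7.1.2")
```

A hierarchical code of this kind lets a program compare functions at a chosen level of detail, which is the property that free-text annotations lack.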

Problem 2: Functional information not machine-readable. Nearly all databases present the protein sequence in formats that are more or less straightforward to parse by computers. However, annotations are mostly written in plain text using a rich biological vocabulary that often varies in different areas of research ( Fig. 1 ).

 

Automatic lexical analysis of controlled vocabularies

From details to summaries. Protein databases, like SWISS-PROT, usually contain functional annotations at a very detailed level of biochemical function, e.g. a given sequence is annotated as a cdc2 kinase, but not as being involved in intra-cellular communication [47, 19] . A number of text analysis tools have been implemented that infer various aspects of cellular function from database annotations of molecular function. Many methods explore the functional annotations in SWISS-PROT, especially the keyword annotations [47, 45, 13, 48] . SWISS-PROT currently contains over 800 keywords describing function. Semantic analysis of the keywords is used to categorise proteins into classes of cellular function [49, 50, 51, 52, 53, 54] . There are two types of methods: (1) fully-automated and (2) semi-automated methods.

(1) Fully-automated methods. The problem of automatically extracting rules from keywords has parallels to the problem of 'Text Categorization' (TC), i.e. the problem of assigning predefined categories to free text documents. Many statistical learning methods have been applied to this problem. These include nearest neighbour classifiers [55] , multivariate regression models [56, 57] , probabilistic Bayesian models [58] , and symbolic rule learning [59] . M-ary (multiple category) classifiers like the k-Nearest Neighbour [60] and the Linear Least-squares Fit (LLSF) have been intensively studied and are among the most accurate for text categorisation [61] . The majority of the tools for annotating function are based on one of the above methods. Some of the major methods in this category are LOCkey [48] , Spearmint [62, 63] and the SVM-based approach of Stapley et al. [64] ( Table 2 ).

(2) Semi-automated methods. These methods are based on building dictionaries of rules. Keywords characteristic of each of the functional classes are first extracted from a set of classified example proteins. Using these keywords a library of rules is created associating a certain pattern of occurrence of keywords to a functional class. The major methods in this category are EUCLID [47] , Meta_A [45] and RuleBase [13] . Below we review the LOCkey and EUCLID systems as examples of the two main approaches.




Table 2: Resources for text analysis.

Method      URL
LOCkey      http://cubic.bioc.columbia.edu/services/LOCkey/
GeneQuiz    http://jura.ebi.ac.uk:8765/ext-genequiz/
Meta_A      http://mendel.imp.univie.ac.at/CELL_LOC/
AbXtract    http://columba.ebi.ac.uk:8765/andrade/abx



LOCkey: information theory based classifier. The LOCkey system [48] is a novel M-ary classifier which predicts the sub-cellular localization of a protein based on SWISS-PROT keywords. The algorithm can be divided into two steps ( Fig. 2 ): (1) building data sets of trusted vectors for known proteins, and (2) classifying unknown proteins. Firstly, a list of keywords is extracted from SWISS-PROT for all proteins with known sub-cellular localization. Most proteins have 2-5 keywords. A data set of binary vectors [65] is generated for each protein by representing the presence of a certain keyword in the protein by 1 and its absence by 0. Secondly, to infer the sub-cellular localization of an unknown protein U, all keywords for U are read from SWISS-PROT. These keywords are translated into a binary keyword vector. From this original keyword vector, LOCkey generates the set of all alternative vectors obtained by flipping vector components of value 1 (presence of keyword) to 0 in all possible combinations. For example, for a protein with three keywords, there are 2^3 - 1 = 7 possible sub-vectors: 111, 110, 101, 011, 100, 010 and 001. These sub-vectors constitute all possible keyword combinations for protein U. The keyword combination, i.e. sub-vector, that yields the best classification of U into one of ten classes of sub-cellular localization is then sought. This is done by retrieving all exact matches of each of the sub-vectors to any of the proteins in the trusted set, i.e. by finding all proteins in the trusted set that contain all the keywords present in the sub-vector. By construction, the proteins retrieved in this way may also contain keywords not found in U. The next task is to estimate the 'surprise value' of the given assignment. Toward this end, LOCkey simply compiles the number of proteins belonging to each type of sub-cellular localization. 
This procedure is repeated in turn for each of the sub-vectors and localization is finally assigned to a protein by minimising an entropy-based objective function. The system accurately solves the classification problem when the number of data points (proteins) and dimensionality of the feature space (number of keywords) are not too large. LOCkey reached a level of more than 82% accuracy in a full cross-validation test. However, due to a lack of functional annotations, the system failed to infer localization for more than half of all proteins in the test set. For five entirely sequenced proteomes, namely Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Arabidopsis thaliana (plant) and a subset of all human proteins, the LOCkey system automatically found about 8000 new annotations about sub-cellular localization. LOCkey has been optimised to provide fast annotations. For example, annotating the entire C. elegans proteome took less than four hours on a PIII 900 MHz machine. The algorithm is limited to problems with relatively few data points (proteins) in the vector set (n<<1000000) and with few keywords (n<<10000).
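The sub-vector enumeration and entropy-guided choice described for LOCkey can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published implementation: the text does not give the exact objective function, so plain Shannon entropy of the localization counts is used as a stand-in, ties are broken in favour of the sub-vector supported by more trusted proteins, and all names and the toy trusted set are hypothetical. Keyword sets play the role of the binary vectors.

```python
from collections import Counter
from itertools import combinations
from math import log2


def sub_vectors(keywords):
    """Yield all 2^n - 1 non-empty keyword combinations of a protein."""
    ordered = sorted(keywords)
    for size in range(len(ordered), 0, -1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


def shannon_entropy(counts):
    """Entropy (bits) of a localization count table; 0 means unanimous."""
    total = sum(counts.values())
    return sum(-(c / total) * log2(c / total) for c in counts.values())


def classify(query_keywords, trusted):
    """Pick the localization from the lowest-entropy matching sub-vector.

    `trusted` is a list of (keyword_set, localization) pairs. A trusted
    protein matches a sub-vector if it contains all of its keywords.
    Returns None when no sub-vector matches any trusted protein.
    """
    best = None  # (entropy, -support, localization)
    for sub in sub_vectors(query_keywords):
        counts = Counter(loc for kws, loc in trusted if sub <= kws)
        if not counts:
            continue
        localization = counts.most_common(1)[0][0]
        candidate = (shannon_entropy(counts), -sum(counts.values()), localization)
        if best is None or candidate < best:
            best = candidate
    return None if best is None else best[2]


# Toy trusted set (invented for illustration).
trusted_set = [
    ({"DNA-binding", "Nuclear protein"}, "nucleus"),
    ({"DNA-binding", "Transcription regulation"}, "nucleus"),
    ({"Transmembrane", "Receptor"}, "membrane"),
    ({"DNA-binding", "Mitochondrion"}, "mitochondrion"),
]
prediction = classify(
    {"DNA-binding", "Transcription regulation", "Zinc-finger"}, trusted_set
)
```

The exhaustive enumeration grows as 2^n in the number of keywords per protein, which stays small for the handful of keywords a typical entry carries.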



Fig. 2: The LOCkey algorithm. A sequence unique data set of localization annotated SWISS-PROT proteins was first compiled. Keywords were extracted for these proteins and merged with any keywords found in homologues. The keywords were represented as binary vectors in the 'Trusted Vector Set'. An unknown query was first annotated with keywords through identification of SWISS-PROT homologues. Keywords for the query were represented as binary vectors. All possible keyword combinations were constructed (the SUB vectors). The best matching vector was found based on entropy criteria (see methods). This vector was used to infer localization for the query.





EUCLID: dictionary-based classification. The EUCLID system [47] uses SWISS-PROT keywords to classify proteins into 14 classes of cellular function according to the scheme originally proposed by Monika Riley [66, 67, 52, 54] . The 14 classes are then grouped into three broad functional classes (energy, information and communication). First, keywords characteristic of each of the functional classes are extracted from a set of classified example proteins provided by a human expert. This dictionary of characteristic keywords satisfies the following criteria: (1) Only keywords with functional meaning are used; keywords with no functional information are excluded (e.g. hypothetical or 3D-structure). (2) Only keywords appearing in more than one SWISS-PROT entry are considered. (3) Only keywords with more than 85% of their occurrences in a single functional category are included in the dictionaries. For assigning sequences to classes a simple voting scheme is used: a sequence is automatically classified into the functional class to which the majority of its keywords belong. The dictionary of keywords is then used to automatically assign all proteins from the database for which a sufficient match is found. Proteins thus assigned to a functional class are analyzed to extract a new, more extensive dictionary of characteristic keywords. The process is iterated until classification quality no longer increases ( Fig. 3 ). A limitation of this approach is that only simple correlations between keywords can be 'discovered'. The method is scalable and can be applied to very large protein databases. For the genome sequence of Mycoplasma genitalium [68] , the EUCLID system was able to classify 52% of the sequences at a classification accuracy of 82%. The EUCLID algorithm has been incorporated into the GeneQuiz system [69] . 
GeneQuiz is a semi-automated protein sequence analysis workbench whose principal purpose is to infer a specific and reliable functional assignment together with a broad cellular role for a query protein by analysis of annotations from sequence database matches.
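The dictionary construction, majority voting and iteration described for EUCLID can be sketched as follows in Python. The sketch makes several simplifying assumptions that are not taken from the original system: keyword occurrences are counted in the labelled examples rather than in all of SWISS-PROT, tied votes are left unclassified, the list of non-functional keywords is a placeholder, and iteration stops when no further protein can be assigned (the text instead stops when classification quality no longer increases). All names and data are hypothetical.

```python
from collections import Counter, defaultdict

# Placeholder list of keywords without functional meaning (criterion 1).
NON_FUNCTIONAL = frozenset({"Hypothetical protein", "3D-structure"})


def build_dictionary(examples, min_entries=2, min_fraction=0.85):
    """Map each characteristic keyword to its functional class.

    `examples` is a list of (keyword_set, functional_class) pairs.
    """
    per_keyword = defaultdict(Counter)
    for keywords, cls in examples:
        for kw in keywords:
            per_keyword[kw][cls] += 1
    dictionary = {}
    for kw, counts in per_keyword.items():
        total = sum(counts.values())
        if kw in NON_FUNCTIONAL or total < min_entries:  # criteria 1 and 2
            continue
        cls, top = counts.most_common(1)[0]
        if top / total > min_fraction:  # criterion 3
            dictionary[kw] = cls
    return dictionary


def vote(keywords, dictionary):
    """Majority vote over dictionary keywords; None if no votes or a tie."""
    votes = Counter(dictionary[kw] for kw in keywords if kw in dictionary)
    ranked = votes.most_common(2)
    if not ranked or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
        return None
    return ranked[0][0]


def euclid_iterate(seed, unlabelled, max_rounds=10):
    """Alternate between classifying proteins and rebuilding the dictionary."""
    labelled, remaining = list(seed), list(unlabelled)
    dictionary = build_dictionary(labelled)
    for _ in range(max_rounds):
        assigned = [(kws, vote(kws, dictionary)) for kws in remaining]
        newly = [(kws, cls) for kws, cls in assigned if cls is not None]
        if not newly:
            break
        labelled.extend(newly)
        remaining = [kws for kws, cls in assigned if cls is None]
        dictionary = build_dictionary(labelled)
    return dictionary, labelled


seed_examples = [
    ({"Glycolysis", "Kinase"}, "energy"),
    ({"Glycolysis", "ATP-binding"}, "energy"),
    ({"Transcription", "DNA-binding"}, "information"),
    ({"Transcription", "Repressor"}, "information"),
]
unknown = [{"Glycolysis", "Kinase"}, {"Kinase", "Transferase"}]
final_dictionary, final_labelled = euclid_iterate(seed_examples, unknown)
```

In this toy run the keyword "Kinase" is too rare to enter the seed dictionary, but it is gained in the second round once another kinase has been assigned through "Glycolysis", which is the kind of dictionary growth the iteration is meant to produce.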



Fig. 3: The EUCLID algorithm. Scheme of the iterative method used to classify sequences in three functional classes. The classification relies on the definition of a dictionary of keywords characteristic for a particular functional class. (1) Experts assign hexokinase from yeast (hxkb_yeast) to the ENERGY class. (2) A keyword dictionary is constructed scoring the keywords associated with hexokinase in the ENERGY class. (3) The same dictionary is then extended to classifying other proteins. The process is iterated until no more keywords are gained.






 

Mining free text from the literature

Digging deep into the vast, dark space of publications. Experimental results are usually published first in scientific journals. Since such publications do not conform to any standardised rules, this information is not computer-readable. At best, this lack of automation leads to a severe delay in incorporating the information into databases. Furthermore, a lot of the data will lie buried in the literature forever. The PDB database of protein structures defines standards to which the submitted data has to conform. One solution to the problem of burying information in the literature might be to adopt similar standards for the publication of functional information. For example, it could be required to deposit functional information into databases that have controlled vocabularies. While such a concept is currently being discussed, for the time being text-mining tools are the only means of retrieving functional information from the literature. In recent years, many groups have worked on dedicated problems in this area, like machine-selection of articles of interest [70, 71] , automated extraction of information using statistical methods [72, 73] or natural language processing techniques [74, 75, 76, 77] , as well as setting up specialized knowledge bases for storing molecular knowledge [78] . The invaluable electronic availability of scientific publications through MEDLINE [79] has not only severely impacted the ways of writing papers and doing science in general, it has also enabled the development of an avalanche of methods that mine these data. Automatic text-analysis tools can assist human annotators and can thus significantly shorten the time-lag of functional annotations. One of the most crucial bottlenecks for automated text analysis is the mapping of gene/protein names [80, 81, 15] . While this problem may be overcome in the near future by particular standards adopted by journals, this hurdle currently hinders the availability and usefulness of public methods considerably.

Natural language processing vs. field-specific 'grammars'. Many tools focus on mining MEDLINE abstracts. While the principal reason for this restriction is supposedly related to complexity (abstracts are available, fit onto a disk, and can be searched quickly), abstracts are occasionally easier to mine, since many papers contain less precise and less well supported sections in the text that are difficult for machines to distinguish from more informative sections [82, 83, 84] . The current version of MEDLINE contains nearly 12 million abstracts stored on approximately 43GB of disk space. A prominent example of methods that target entire papers is still restricted to a small number of journals [76, 85] . The task of unravelling information about function from MEDLINE abstracts can be approached from two different angles. On the one hand, computational techniques for understanding text written in natural language (NL) are based on lexical, syntactical and semantic analysis [65, 86] . In addition to indexing 'terms' in documents, natural language processing (NLP) methods extract and index higher level semantic structures composed of terms, and relationships between terms. This can be done in different ways [87] . However, this approach is confronted with the variability, fuzziness and complexity of human language [83] . The Genies system [76, 85] for automatically gathering and processing knowledge about molecular pathways and the IFBP transcription factor database [88] are NLP-based systems. An alternative approach, which may be more relevant in practice, is based on the treatment of text with statistical methods [89, 90] . In this approach, the possible relevance of words in a text is deduced by comparing the frequency of different words in this text with the frequency of the same words in reference sets of text [91] . 
Some of the major methods using the statistical approach are AbXtract [92, 90, 93] and the automatic pathway discovery tool of Ng and Wong [74] . There are advantages to each of these approaches (grammar or pattern matching). Generally, the less syntax is used, the more domain-specific the system. This allows the construction of a robust system relatively quickly, but many subtleties may be lost in the interpretation of sentences. In some applications, however, the domain-dependent pattern matching approach may be the only way to attain reasonable performance in the near future [94] .

AbXtract: extraction of keywords from abstracts. The AbXtract system [92, 90, 95, 96] is triggered by collections of abstracts related to a given protein, and it is able to extract functional information directly from MEDLINE abstracts. Relevant keywords are selected by their relative accumulation in comparison with a domain specific background distribution. To obtain a representative set of words (and their abundance) in protein families, the background distribution of abstracts is chosen so as to represent the widest range of protein families. For each of the representative set (dictionary) of words, two statistical parameters are computed: their frequency in each family and the deviation of the distribution of their frequencies in the set of families. Provided with a query family and an associated set of MEDLINE abstracts, words that are likely to be functionally important for the family (putative keywords) are found by comparison with the background set. This is done by measuring the frequency of the relevant word in the query family relative to its background frequency of occurrence using a z-score [90] . Words with a high z-score are likely to be potential keywords for the family. The system has been tested on a number of different protein families and showed a good ability to extract functionally important keywords. A modification of this algorithm, called SUISEKI [96, 81] , has been applied to the problem of extracting protein-protein interaction from MEDLINE abstracts. In addition to the statistical approach of AbXtract, SUISEKI (System for Information Extraction on Interactions) also takes advantage of the analysis of the syntactical structure of phrases and other developments in computational linguistics. The SUISEKI system was able to extract almost 70% of the interactions present in a relatively large text corpus at approximately 80% accuracy for the best defined interactions. 
The SUISEKI system discovered a total of 4657 protein-protein interactions between cell-cycle related proteins in yeast from ~5300 abstracts (~12 Mb). The authors identify a number of sources of error in mining MEDLINE abstracts. Currently, the most urgent problem is the large number of mistakes in identifying protein names. There is no systematic nomenclature for gene and protein names; this has led to a number of possible writing variants and synonyms. This babel obviously hampers detection and classification severely. The second major problem is caused by indirect
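The z-score keyword selection described for AbXtract above can be illustrated with a short Python sketch. It assumes a deliberately naive setup that is not taken from the original system: words are obtained by whitespace splitting, a family's word frequency is the word's share of all words in that family's abstracts, and words whose background frequency does not vary at all (for instance, words never seen in the background) are skipped because their z-score is undefined. Function names and the toy abstracts are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def word_frequencies(abstracts):
    """Relative frequency of each lower-cased word in one family's abstracts."""
    words = [w for text in abstracts for w in text.lower().split()]
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()} if total else {}


def keyword_z_scores(query_abstracts, background_families):
    """z-score of each query word against the background families.

    z = (frequency in query family - mean background frequency)
        / standard deviation of the background frequencies
    """
    background = [word_frequencies(family) for family in background_families]
    scores = {}
    for word, freq in word_frequencies(query_abstracts).items():
        values = [family.get(word, 0.0) for family in background]
        sigma = pstdev(values)
        if sigma > 0:  # undefined z-score otherwise; skipped in this sketch
            scores[word] = (freq - mean(values)) / sigma
    return scores


# Toy background: one list of abstracts per protein family (invented text).
background_families = [
    ["the kinase binds atp"],
    ["the receptor binds ligand"],
    ["the protease cleaves the substrate"],
]
query_family = ["the kinase phosphorylates the kinase substrate"]
z_scores = keyword_z_scores(query_family, background_families)
top_keyword = max(z_scores, key=z_scores.get)
```

In the toy data, a common word such as "the" is frequent in the query but equally frequent in the background, so it receives a low z-score, whereas a word enriched in the query family stands out.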

 

 


References

1.Alberts, B., Bray, D., Roberts, K.& Watson, J. (1994). Molecular Biology of the Cell. Garland Publishing, NewYork and London.
2.Lodish, H., Berk, A., Baltimore, D.& Darnell, J. (2000). Molecular Cell Biology. W H Freeman & Co, NewYork.
3.Fleischmann, R. D., Adams, M. D.,White, O., Clayton, R. A., Kirkness, E. F. et al. (1995). Whole-genome randomsequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496-512.
4.Liu, J. & Rost, B. (2001).Comparing function and structure between entire proteomes. Protein Science, 10, 1970-1979.
5.Carter, P., Liu, J. & Rost, B.(2003). PEP: Predictions for Entire Proteomes. Nucleic Acids Research, 31, 410-413.
6.Frishman, D., Mokrejs, M., Kosykh,D., Kastenmuller, G., Kolesov, G. et al. (2003). The PEDANT genome database. NucleicAcids Res, 31,207-11.
7.Pruess, M., Fleischmann, W.,Kanapin, A., Karavidopoulou, Y., Kersey, P. et al. (2003). The ProteomeAnalysis database: a tool for the in silico analysis of whole proteomes. NucleicAcids Research, 31,414-417.
8.Rost, B. & Sander, C. (1996).Bridging the protein sequence-structure gap by structure predictions. AnnualReview of Biophysics and Biomolecular Structure,25, 113-136.
9.Baker, P. G. & Brass, A. (1998).Recent developments in biological sequence databases. Curr Opin Biotechnol, 9, 54-8.
10.Koonin, E. V. (2000). Bridging thegap between sequence and function. Trends Genet,16, 16.
11.Lewis, S., Ashburner, M. &Reese, M. G. (2000). Annotating eukaryote genomes. Curr Opin Struct Biol, 10, 349-54.
12.Bork, P., Dandekar, T.,Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M. et al. (1998). Predicting function:from genes to genomes and back. J Mol Biol,283, 707-25..
13.Fleischmann, W., Moller, S.,Gateau, A. & Apweiler, R. (1999). A novel method for automatic functionalannotation of proteins. Bioinformatics, 15, 228-33..
14.Luscombe, N. M., Greenbaum, D.& Gerstein, M. (2001). What is bioinformatics? A proposed definition andoverview of the field. Methods Inf Med, 40, 346-58.
15.Valencia, A. (2002). Search andretrieve: Large-scale data generation is becoming increasingly important inbiological research. But how good are the tools to make sense of the data? EMBOReports, 3, 396-400.
16.Valencia, A. & Pazos, F.(2002). Computational methods for the prediction of protein interactions. CurrentOpinion in Structural Biology, 12, 368-373.
17. Rost, B., Liu, J., Nair, R., Wrzeszczynski, K. O. & Ofran, Y. (2003). Automatic prediction of protein function. Cellular and Molecular Life Sciences, submitted Mar 25, 2003.
18. Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res, 28, 45-8.
19. Apweiler, R. (2001). Functional information in SWISS-PROT: the basis for large-scale characterisation of protein sequences. Brief Bioinform, 2, 9-18.
20. Junker, V., Contrino, S., Fleischmann, W., Hermjakob, H., Lang, F. et al. (2000). The role SWISS-PROT and TrEMBL play in the genome research environment. J Biotechnol, 78, 221-234.
21. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M. C., Estreicher, A. et al. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res, 31, 365-70.
22. Harrison, P. M., Bamborough, P., Daggett, V., Prusiner, S. & Cohen, F. E. (1997). The prion folding problem. Current Opinion in Structural Biology, 7, 53-59.
23. Gaasterland, T. & Sensen, C. W. (1996). MAGPIE: automated genome interpretation. Trends Genet, 12, 76-8.
24. Apweiler, R., Gateau, A., Contrino, S., Martin, M. J., Junker, V. et al. (1997). Protein sequence annotation in the genome era: the annotation concept of SWISS-PROT+TREMBL. Proc Int Conf Intell Syst Mol Biol, 5, 33-43.
25. Eisenberg, D., Marcotte, E. M., Xenarios, I. & Yeates, T. O. (2000). Protein function in the post-genomic era. Nature, 405, 823-6.
26. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H. et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25, 25-9.
27. Todd, A. E., Orengo, C. A. & Thornton, J. M. (2001). Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol, 307, 1113-43.
28. Shah, I. & Hunter, L. (1997). Predicting enzyme function from sequence: a systematic appraisal. In Fifth International Conference on Intelligent Systems for Molecular Biology (Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C. et al., eds.), pp. 276-283, AAAI Press, Halkidiki, Greece.
29. Ouzounis, C., Perez-Irratxeta, C., Sander, C. & Valencia, A. (1998). Are binding residues conserved? Pac Symp Biocomput, 401-12.
30. Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Eng, 12, 85-94.
31. Devos, D. & Valencia, A. (2000). Practical limits of function prediction. Proteins, 41, 98-107.
32. Wilson, C. A., Kreychman, J. & Gerstein, M. (2000). Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol, 297, 233-49.
33. Nair, R. & Rost, B. (2002). Sequence conserved for subcellular localization. Protein Sci, 11, 2836-47.
34. Rost, B. (2002). Enzyme function less conserved than anticipated. Journal of Molecular Biology, 318, 595-608.
35. Wrzeszczynski, K. O. & Rost, B. (2003). Cataloguing proteins in cell cycle control. In Cell cycle checkpoint control protocols (Lieberman, H., ed.), pp. submitted, Humana Press, Totowa, NJ.
36. Wrzeszczynski, K. O. & Rost, B. (2003). In silico analysis of retention signals for Endoplasmic reticulum and Golgi apparatus. Proteins: Structure, Function, and Genetics, submitted.
37. Devos, D. & Valencia, A. (2001). Intrinsic errors in genome annotation. Trends in Genetics, 17, 429-431.
38. Bork, P. & Koonin, E. V. (1998). Predicting functions from protein sequences--where are the bottlenecks? Nat Genet, 18, 313-8.
39. Doerks, T., Bairoch, A. & Bork, P. (1998). Protein annotation: detective work for function prediction. Trends Genet, 14, 248-50.
40. Galperin, M. Y. & Koonin, E. V. (2000). Who's your neighbor? New computational approaches for functional genomics. Nat Biotechnol, 18, 609-13.
41. Attwood, T. K. (2000). Genomics. The Babel of bioinformatics. Science, 290, 471-3.
42. Overbeek, R., Larsen, N., Smith, W., Maltsev, N. & Selkov, E. (1997). Representation of function: the next step. Gene, 191, GC1-GC9.
43. Webb, E. C. (1992). Enzyme Nomenclature 1992. Recommendations of the Nomenclature committee of the International Union of Biochemistry and Molecular Biology. Academic Press, New York.
44. Mewes, H. W., Frishman, D., Gruber, C., Geier, B., Haase, D. et al. (2000). MIPS: a database for genomes and protein sequences. Nucleic Acids Res, 28, 37-40.
45. Eisenhaber, F. & Bork, P. (1999). Evaluation of human-readable annotation in biomolecular sequence databases with biological rule libraries. Bioinformatics, 15, 528-35.
46. Tsoka, S. & Ouzounis, C. A. (2000). Recent developments and future directions in computational genomics. FEBS Lett, 480, 42-8.
47. Tamames, J., Ouzounis, C., Casari, G., Sander, C. & Valencia, A. (1998). EUCLID: automatic classification of proteins in functional classes by their database annotations. Bioinformatics, 14, 542-3.
48. Nair, R. & Rost, B. (2002). Inferring sub-cellular localization through automated lexical analysis. Bioinformatics, 18 Suppl 1, S78-S86.
49. Bork, P., Ouzounis, C., Sander, C., Scharf, M., Schneider, R. et al. (1992). What's in a genome? Nature, 358, 287.
50. Riley, M. (1993). Functions of the gene products of Escherichia coli. Microbiol Rev, 57, 862-952.
51. Ouzounis, C., Casari, G., Sander, C., Tamames, J. & Valencia, A. (1996). Computational comparisons of model genomes. Trends in Biotechnology, 14, 280-285.
52. Riley, M. & Labedan, B. (1997). Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module. Journal of Molecular Biology, 268, 857-868.
53. Andrade, M. A., Ouzounis, C., Sander, C., Tamames, J. & Valencia, A. (1999). Functional classes in the three domains of life. Journal of Molecular Evolution, 49, 551-557.
54. Karp, P. D., Riley, M., Paley, S. M., Pellegrini-Toole, A. & Krummenacker, M. (1999). EcoCyc: encyclopedia of Escherichia coli genes and metabolism. Nucleic Acids Research, 27, 55-8.
55. Yang, Y. & Pederson, J. P. (1997). A comparative study on feature selection in text categorization. The Fourteenth International Conference on Machine Learning, 412-420.
56. Yang, Y. & Chute, C. G. (1992). An application of least squares fit mapping to clinical classification. Proceedings of the Annual Symposium on Computer Applications in Medical Care, 460-4.
57. Schutze, H., Hull, D. A. & Pederson, J. O. (1995). A comparison of classifiers and document representation for the routing problem. 18th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), 229-237.
58. Lewis, D. D. & Ringuette, M. (1994). Comparison of two learning algorithms for text categorization. Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR'94).
59. Apte, C., Damerau, F. & Weiss, S. (1994). Towards language independent automated learning of text categorization models. Proceedings of the 17th Annual ACM/SIGIR conference.
60. Dasarathy, B. V. (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, California.
61. Yang, Y. & Liu, X. (1999). A re-examination of text categorisation methods. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 42-49.
62. Kretschmann, E., Fleischmann, W. & Apweiler, R. (2001). Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics, 17, 920-6.
63. Bazzan, A. L., Engel, P. M., Schroeder, L. F. & Da Silva, S. C. (2002). Automated annotation of keywords for proteins related to mycoplasmataceae using machine learning techniques. Bioinformatics, 18 Suppl 2, S35-43.
64. Stapley, B. J., Kelley, L. A. & Sternberg, M. J. (2002). Predicting the sub-cellular location of proteins from text using support vector machines. Pac Symp Biocomput, 374-85.
65. Salton, G. (1989). Automatic Text Processing. Addison-Wesley, Reading, MA.
66. Krawiec, S. & Riley, M. (1990). Organization of the bacterial chromosome. Microbiol Rev, 54, 502-539.
67. Riley, M. (1993). Function of the gene products in Escherichia coli. Microbiol Rev, 57, 862-952.
68. Fraser, C. M., Gocayne, J. D., White, O., Adams, M. D., Clayton, R. A. et al. (1995). The minimal gene complement of Mycoplasma genitalium. Science, 270, 397-403.
69. Andrade, M. A., Brown, N. P., Leroy, C., Hoersch, S., de Daruvar, A. et al. (1999). Automated genome sequence analysis and annotation. Bioinformatics, 15, 391-412.
70. Shatkay, H., Edwards, S., Wilbur, W. J. & Boguski, M. (2000). Genes, themes and microarrays: using information retrieval for large-scale gene analysis. Proc Int Conf Intell Syst Mol Biol, 8, 317-28.
71. Iliopoulos, I., Enright, A. J. & Ouzounis, C. A. (2001). Textquest: document clustering of MEDLINE abstracts for concept discovery in molecular biology. Pac Symp Biocomput, 384-95.
72. Stapley, B. J. & Benoit, G. (2000). Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in MEDLINE abstracts. Pac Symp Biocomput, 529-40.
73. Stephens, M., Palakal, M., Mukhopadhyay, S., Raje, R. & Mostafa, J. (2001). Detecting gene relations from MEDLINE abstracts. Pac Symp Biocomput, 483-95.
74. Ng, S. K. & Wong, M. (1999). Toward routine automatic pathway discovery from on-line scientific text abstracts. Genome Inform Ser Workshop Genome Inform, 10, 104-112.
75. Thomas, J., Milward, D., Ouzounis, C., Pulman, S. & Carroll, M. (2000). Automatic extraction of protein interactions from scientific abstracts. Pac Symp Biocomput, 541-52.
76. Friedman, C., Kra, P., Yu, H., Krauthammer, M. & Rzhetsky, A. (2001). GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17 Suppl 1, S74-82.
77. Yakushiji, A., Tateisi, Y., Miyao, Y. & Tsujii, J. (2001). Event extraction from biomedical papers using a full parser. Pac Symp Biocomput, 408-19.
78. Stevens, R., Goble, C. A. & Bechhofer, S. (2000). Ontology-based knowledge representation for bioinformatics. Brief Bioinform, 1, 398-414.
79. Airozo, D., Allard, R., Brylawski, B., Canese, K., Kenton, D. et al. (1999). MEDLINE. 1999.
80. Hatzivassiloglou, V., Duboue, P. A. & Rzhetsky, A. (2001). Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics, 17, S97-S106.
81. Blaschke, C., Hirschman, L. & Valencia, A. (2002). Information extraction in molecular biology. Brief Bioinform, 3, 154-165.
82. Hersh, W. R., Evans, D. A., Monarch, I. A. & Gorman, P. N. (1992). Indexing effectiveness of linguistic and non-linguistic approaches to automatic indexing. Elsevier Science Publishers, Amsterdam.
83. Andrade, M. A. & Bork, P. (2000). Automated extraction of information in molecular biology. FEBS Lett, 476, 12-7.
84. Ding, J., Berleant, D., Nettleton, D. & Wurtele, E. (2002). Mining MEDLINE: abstracts, sentences, or phrases? Pac Symp Biocomput, 326-337.
85. Krauthammer, M., Kra, P., Iossifov, I., Gomez, S. M., Hripcsak, G. et al. (2002). Of truth and pathways: chasing bits of information through myriads of articles. Bioinformatics, 18 Suppl 1, S249-S257.
86. Cowie, J. & Lehnert, W. (1996). Information extraction. Commun. ACM, 39, 80-91.
87. Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval.
88. Ohta, Y., Yamamoto, Y., Okazaki, T., Uchiyama, I. & Takagi, T. (1997). Automatic construction of knowledge base from biological papers. Proc Int Conf Intell Syst Mol Biol, 5, 218-25.
89. Yang, Y. (1996). An evaluation of statistical approaches to MEDLINE indexing. Proc AMIA Annu Fall Symp, 358-362.
90. Andrade, M. A. & Valencia, A. (1998). Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics, 14, 600-7.
91. Berry, M. W., Dumais, S. T. & O'Brien, G. W. (1995). Using linear algebra for intelligent information retrieval. SIAM Rev., 37, 573-595.
92. Andrade, M. A. & Valencia, A. (1997). Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. Development of a prototype system. In Fifth International Conference on Intelligent Systems for Molecular Biology (Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C. et al., eds.), pp. 25-32, AAAI Press, Halkidiki, Greece.
93. Andrade, M., Blaschke, C. & Valencia, A. (1999). AbXtract: Automatic Abstract eXtraction of keywords associated to protein function.
94. Allen, J. (1995). Natural language understanding. Addison-Wesley Pub Co, New York.
95. Blaschke, C. (2001). Applications of information extraction techniques to molecular biology. U. Autonoma Madrid, PhD thesis.
96. Blaschke, C. & Valencia, A. (2001). The potential use of SUISEKI as a protein interaction discovery tool. Genome Inform Ser Workshop Genome Inform, 12, 123-134.
97. Blaschke, C. & Valencia, A. (2002). The frame-based module of the Suiseki information extraction system. IEEE Intelligent Systems, 17, 14-20.

Contact:    rost@columbia.edu Version:    May 5, 2003