CUBIC: NLProt / Data

Explanation

This page contains the data we used to develop NLProt: in particular, the text corpora with tagged protein names and the dictionary files used for the filtering procedure described in our paper. The only tag used in the corpora is <n>, which opens a protein name; </n> is the closing tag.
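
To illustrate the tagging format, here is a minimal Python sketch that pulls the tagged protein names out of a string. The names `extract_protein_names` and `strip_tags` and the sample sentence are hypothetical, and the sketch assumes non-nested tags; the layout of the actual corpus files may differ.

```python
import re

# Assumes non-nested <n>...</n> tags. Optional whitespace inside the tag
# brackets is tolerated in case a file writes "< n>" / "< /n>".
TAG_RE = re.compile(r"<\s*n\s*>(.*?)<\s*/\s*n\s*>", re.DOTALL)

def extract_protein_names(text):
    """Return the tagged protein names in order of appearance."""
    return [name.strip() for name in TAG_RE.findall(text)]

def strip_tags(text):
    """Return the text with the tags removed and the names kept."""
    return TAG_RE.sub(lambda m: m.group(1), text)

# Made-up sentence for illustration; not taken from any of the corpora.
sample = "Binding of <n>p53</n> to <n>MDM2</n> was reduced."
print(extract_protein_names(sample))
print(strip_tags(sample))
```

The non-greedy `(.*?)` keeps each match from running across several names in the same sentence.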

Tagged Corpora

  • Yapex corpus (we kept the original tagging by Franzen et al.; 200 abstracts; Yapex website)
  • GENIA corpus with tags converted by us (the original "protein_molecule" tag was mapped to <n>; 2000 abstracts; GENIA website)
  • BioCreative: a corpus used in the BioCreative competition (7,500 sentences for training; 2,500 for testing)
    Please note that we did not tag any of the corpora above ourselves.

  • Recent166: 166 recent abstracts (Nov/Dec 2003, from EMBO J and Cell), automatically tagged by the final version of our program
    The Recent166 corpus should not be used for training, since NLProt did not place all tags correctly.

Data used for Filtering

  • Common Dictionary: derived from the Merriam-Webster (MW) online dictionary (note that this file is incomplete, since our algorithm can query MW over the internet and continually adds words to this local version)
  • Species: List of species' names from SWISS-PROT
  • Tissue: List of tissue names from SWISS-PROT
  • Minerals: List of mineral/salt formulas and their names
  • Endings of Chemicals: List of 130 typical endings of chemicals

  • Filtering rules: List of all filtering rules used by NLProt to pre-filter input text
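
As a rough illustration of how word lists like these could be used to discard candidate tokens, here is a small Python sketch. It is not NLProt's actual filtering procedure, which is described in the paper; the function `is_filtered`, the lowercase matching, and all sample entries are assumptions made up for this example, and the real files may use a different format.

```python
def is_filtered(token, common_words, species, tissues, chemical_endings):
    """Return True if the token looks like a non-protein word.

    All lists are assumed to hold lowercase entries (an assumption,
    not the documented format of the downloadable files).
    """
    t = token.lower()
    # Exact lookup in the dictionary-style lists.
    if t in common_words or t in species or t in tissues:
        return True
    # Suffix check against typical endings of chemical names.
    return t.endswith(tuple(chemical_endings))

# Hypothetical sample entries, not taken from the actual data files.
common_words = {"binding", "reduced"}
species = {"mouse"}
tissues = {"liver"}
chemical_endings = ["ol", "ide"]

tokens = ["Binding", "p53", "liver", "ethanol", "MDM2"]
kept = [t for t in tokens
        if not is_filtered(t, common_words, species, tissues, chemical_endings)]
print(kept)
```

Sets are used for the exact lookups so that each check stays constant-time even with a large dictionary.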