CUBIC: NLProt / Index
The program
  • Submit form: submit a text to NLProt
  • License your copy of the command-line version of NLProt (Windows and Linux)
  • Read the help pages
  • Data used for developing NLProt
  • Brief description of the program
    NLProt is a tool for finding protein names in natural-language text. It is based on Support Vector Machines (SVMs) trained on contextual features of named entities in scientific language. In addition, simple filtering rules and a protein-name dictionary are used to increase performance.
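To illustrate what "contextual features" can mean in this setting, here is a minimal sketch that collects the words surrounding a candidate token. It is not NLProt's actual feature set: the function name `context_features`, the window size, the `<pad>` marker, and the example sentence are all assumptions made for illustration.

```python
def context_features(tokens, index, window=2):
    """Describe the token at `index` by its neighbours.

    Returns a dict mapping a relative offset such as "w[-1]" to the
    lowercased neighbouring word. Positions outside the sentence are
    filled with the placeholder "<pad>".
    """
    features = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the candidate token itself
        pos = index + offset
        inside = 0 <= pos < len(tokens)
        features[f"w[{offset:+d}]"] = tokens[pos].lower() if inside else "<pad>"
    return features


# Hypothetical example sentence; "Cdc28" is the candidate protein name.
tokens = "The kinase Cdc28 phosphorylates Sic1 in yeast".split()
features = context_features(tokens, tokens.index("Cdc28"))
for name, word in features.items():
    print(name, word)
```

Feature dictionaries of this kind would typically be converted to numeric vectors before being passed to a classifier such as an SVM.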
    NLProt reached a precision (accuracy) of 70% at a recall (coverage) of 85% when run on the 166 most recent abstracts of EMBL and Cell (Nov/Dec 2003). When run from the command line, NLProt takes about 1 second per abstract.
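For readers unfamiliar with the two measures, the sketch below shows how precision and recall are usually computed from counts of correct and incorrect predictions. The counts in the example are hypothetical, chosen only so that the arithmetic reproduces the 70% and 85% figures; they are not the counts from the NLProt evaluation.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from raw counts.

    precision = correct predictions / all predictions made
    recall    = correct predictions / all names that should be found
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts (NOT taken from the NLProt evaluation).
precision, recall = precision_recall(
    true_positives=119, false_positives=51, false_negatives=21
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these made-up counts, 119 of 170 predicted names are correct and 119 of 140 true names are found.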
    Contact
    E-Mail: mika@cubic.bioc.columbia.edu