CUBIC: NLProt / Help
Who should use NLProt

    NLProt is intended for researchers who want to build databases fully or partially automatically. NLProt finds protein names in free text with high accuracy and assigns the best-matching database IDs (SWISS-PROT, TrEMBL) to the names it finds.
Example Files
Input
     
  • Create a plain ASCII file on your machine containing the text you want to scan for protein names. Copy and paste the contents of this file into the text box on the submit page and press the Submit Text button. Note that the input text must consist of full sentences, because the algorithm needs the context surrounding each protein name in order to work properly.
  • Each request takes only a few seconds. The output then appears on the screen.
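Before pasting text into the submit page, it can help to check that it really is plain ASCII and reads as full sentences. The following is only a minimal sketch of such a check and is not part of NLProt; the function names find_non_ascii and ends_with_sentence_punctuation are hypothetical, and the sentence test is a crude heuristic.

```python
def find_non_ascii(text):
    """Return (index, character) pairs for every non-ASCII character.

    Reporting instead of deleting matters here: silently dropping a
    character such as a Greek letter could change a protein name.
    """
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


def ends_with_sentence_punctuation(text):
    """Crude heuristic: the text is non-empty and ends with '.', '!' or '?'."""
    stripped = text.strip()
    return bool(stripped) and stripped[-1] in ".!?"
```

Any character reported by find_non_ascii is best replaced by hand (for example, spelling out a Greek letter) before submitting.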
Output
The output of the program is either an ASCII or an HTML file, depending on the user's preferences. It contains the tagged input text (in HTML format, names are shown in red) followed by a detailed table listing all found (tagged) names. Each found name is listed together with its position, its score and, where one could be assigned, a database ID (SWISS-PROT, TrEMBL). In ASCII output, the < n> tag marks the beginning of a protein name and the < /n> tag marks its end. In the table at the end of the output file, TXT-POS is the position of the name in the text, SCORE is the NLProt output score for this name, and METHOD is the method by which the name was found. METHOD takes one of the following values:
  • SVM: the name was found by the SVM-system
  • projected: the name was found by the SVM-system, but at a different position in the text (thus the name was 'projected' onto the rest of the text).
  • dictionary: the name is a long name found in the dictionary (a long name that also appears in the dictionary is a strong indication of a protein name).
  • abbr.-ext.: the name is the long form of an abbreviation that was found by the SVM-system.

  • Additionally, NLProt searches the text for tissue types and species names in order to assign the correct UniProt ID (SWISS-PROT or TrEMBL) to each found name. In HTML output, tissues and species are marked in green and blue, respectively. In ASCII format, they are tagged with < t> and < s>, respectively.
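The tags described above make the ASCII output easy to post-process. The following is a minimal sketch of how tagged names might be pulled out of the tagged text; it is not part of NLProt. It assumes that tissue and species tags are closed with < /t> and < /s> in the same way as < /n> (only the opening tags are documented above), it tolerates optional spaces inside the angle brackets, and the function name extract_tagged and the sample sentence are made up for illustration.

```python
import re

# Map each tag letter to what it marks, as described in the help text.
TAG_KINDS = {"n": "protein", "t": "tissue", "s": "species"}

# An opening tag, the shortest possible content, then the matching
# closing tag. Whitespace inside the angle brackets is optional, so
# both "<n>" and "< n>" are accepted (closing tags for t and s are
# an assumption).
TAG_RE = re.compile(r"<\s*([nts])\s*>(.*?)<\s*/\s*\1\s*>", re.DOTALL)


def extract_tagged(text):
    """Return (kind, name) pairs in the order they appear in the text."""
    return [
        (TAG_KINDS[match.group(1)], match.group(2).strip())
        for match in TAG_RE.finditer(text)
    ]


# Made-up sample in the tagged ASCII style, not real NLProt output.
sample = "The < n>p53< /n> protein is expressed in < s>human< /s> < t>liver< /t>."
tagged = extract_tagged(sample)
```

The sketch reads only the tagged text. The table at the end of the output file (TXT-POS, SCORE, METHOD) is not parsed, because its exact layout is not specified here.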