
Help document for LOCtree predictions

Reference
  • Mimicking Cellular Sorting Improves Prediction of Subcellular Localization.
  • Rajesh Nair and Burkhard Rost
  • Journal of Molecular Biology, 348(1):85-100
  • Predicted subcellular localization: LOCtree classifies eukaryotic non-plant proteins into one of five subcellular classes (extra-cellular, organelles, nuclear, cytoplasmic or mitochondrial). Plant proteins are classified into one of six classes, with chloroplast being the sixth. The organelles class covers proteins sorted to the ER, Golgi, lysosome, peroxisome or vacuole. Proteins from Gram-positive bacteria are classified into one of two classes (cytoplasmic and extra-cellular), while proteins from Gram-negative bacteria are classified into one of three classes, with periplasm being the third. For eukaryotic proteins predicted to be Nuclear, LOCtree additionally tries to predict whether the protein is "DNA-binding" or "Not DNA-binding". The "DNA-binding" prediction module has not been published. Additional information regarding the LOCtree algorithm can be obtained from the original JMB article.
    Intermediate localization prediction: The novel feature of LOCtree is the prediction of intermediate subcellular classes using our SVM implementation, which mimics the cellular sorting machinery. Intermediate subcellular classes are predicted at much higher accuracy and can provide useful clues for inferring the true localization of the protein. See figure for further explanation of the predicted intermediate subcellular classes.
    Reliability index: Reliability index (RI) values range from 1 to 10, with 10 denoting the most confident predictions. The reliability index measures the strength of the SVM prediction. An RI of 10 implies that the prediction is among the top 10% strongest predictions for the predicted subcellular class, while an RI of 7 implies that the prediction is among the top 30%-40% strongest predictions. The performance of LOCtree has been rigorously evaluated on a non-redundant test set of proteins. Predictions with an RI of 3 or less are borderline cases that have a high chance of being wrong. For such predictions the user should try to corroborate the prediction using other sources.
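    The relationship between RI values and prediction strength can be sketched in a few lines. This is a minimal illustration, not part of LOCtree: the function names are hypothetical, and the linear decile mapping is an assumption extrapolated from the two examples given above (RI 10 and RI 7).

```python
from typing import Tuple


def ri_to_percentile_band(ri: int) -> Tuple[int, int]:
    """Return the (lower, upper) percent bounds of the "top X%" band for an RI.

    ASSUMPTION: bands are linear deciles, extrapolated from the two examples
    in the text (RI 10 -> top 10%, RI 7 -> top 30%-40%).
    """
    if not 1 <= ri <= 10:
        raise ValueError("RI must be between 1 and 10")
    lower = (10 - ri) * 10
    return lower, lower + 10


def is_borderline(ri: int) -> bool:
    """An RI of 3 or less marks a borderline prediction, as stated in the text."""
    return ri <= 3


if __name__ == "__main__":
    for ri in (10, 7, 3):
        low, high = ri_to_percentile_band(ri)
        flag = " (borderline, corroborate elsewhere)" if is_borderline(ri) else ""
        print(f"RI={ri}: among the top {low}%-{high}% strongest predictions{flag}")
```

    Under that assumption, an RI of 3 would correspond to the top 70%-80% band, which is consistent with treating such predictions as borderline.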
    Interpretation of example prediction: Consider the case where the intermediate localization prediction column reads "Not Secreted,Nuclear,Not DNA-binding" and the reliability index of intermediate localization prediction column reads "9,3,4". The two columns are read position by position: the protein is predicted to be "Not Secreted" with RI=9, "Nuclear" with RI=3 and "Not DNA-binding" with RI=4. The final predicted localization would be "Not DNA-binding" with RI=4. The weakest link in this prediction is the "Nuclear" step, which has an RI of only 3. Thus the prediction with the second highest confidence for this protein would fall in the "Non Nuclear" protein category. Only the prediction with the highest confidence is reported.
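    The example above can be worked through programmatically. The sketch below pairs the two comma-separated columns and finds the weakest step. The function names are hypothetical, and the column format is assumed to be exactly as quoted in the example.

```python
from typing import List, Tuple


def parse_intermediate(labels: str, ris: str) -> List[Tuple[str, int]]:
    """Pair each intermediate class label with its reliability index."""
    names = [part.strip() for part in labels.split(",")]
    values = [int(part) for part in ris.split(",")]
    if len(names) != len(values):
        raise ValueError("label and RI columns must have the same length")
    return list(zip(names, values))


def weakest_link(steps: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Return the step with the lowest reliability index."""
    return min(steps, key=lambda step: step[1])


if __name__ == "__main__":
    steps = parse_intermediate("Not Secreted,Nuclear,Not DNA-binding", "9,3,4")
    final_label, final_ri = steps[-1]   # last step is the final prediction
    weak_label, weak_ri = weakest_link(steps)
    print(f"Final prediction: {final_label} (RI={final_ri})")
    print(f"Weakest link: {weak_label} (RI={weak_ri})")
```

    Checking the weakest step is a quick way to see which decision along the sorting path is most likely to be wrong.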
    Datasets used: The sequence-unique datasets used for developing and testing LOCtree can be downloaded here.
    Additional prediction methods: In addition to the support vector machine (SVM) based subcellular localization prediction, the LOCtree server also reports nuclear localization signals predicted by PredictNLS, localization predictions based on SWISS-PROT keywords using LOCkey, and localization predictions based on the presence of PROSITE and PFAM signatures. Predictions from the different algorithms are reported separately because they rely on different features and have very different causes of wrong predictions. In general, PredictNLS is the most accurate, with nearly 100% accuracy, but has the lowest coverage. It is followed by the PROSITE/PFAM based predictions, then by the SWISS-PROT keyword based predictions and the LOCtree predictions. PredictNLS and PROSITE/PFAM based predictions all rely on the presence of functional motifs or signatures, though PFAM family assignments are not 100% accurate and thus constitute an additional source of error. SWISS-PROT keywords are quite often wrongly assigned to a protein, so the keyword based method has a lower accuracy than the previous two methods. In contrast to these methods, LOCtree predicts localization at 100% coverage, which leads to a lower average accuracy. However, LOCtree predictions at high reliability have an accuracy comparable to that of the other methods.
    LOCkey: keyword based annotations: LOCkey is a novel method for assigning proteins to subcellular classes based on lexical analysis of SWISS-PROT keywords. For a query protein U, SWISS-PROT keywords are assigned by first identifying the sequence homologues of U in the SWISS-PROT database. Next, all keywords for the homologues of U are extracted and merged. The merged keywords are then assigned to the query protein U. See figure for further explanation of the entropy-based algorithm used by LOCkey to infer subcellular localization. Read the LOCkey manuscript.
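    The keyword transfer step described above (extract the keywords of all homologues, then merge them) can be sketched as follows. This is an illustration only: the homologue search is assumed to have been done already, and the protein identifiers, keywords and function name are hypothetical.

```python
from typing import Dict, List, Set


def merge_homologue_keywords(homologue_keywords: Dict[str, List[str]]) -> Set[str]:
    """Merge the keywords of all homologues into one set for the query protein.

    ASSUMPTION: `homologue_keywords` maps each homologue already identified
    for the query protein U to its list of SWISS-PROT keywords.
    """
    merged: Set[str] = set()
    for keywords in homologue_keywords.values():
        merged.update(keywords)   # duplicates across homologues collapse
    return merged


if __name__ == "__main__":
    # Hypothetical homologues of a query protein U and their keywords.
    homologues = {
        "HOMOLOGUE_1": ["Nuclear protein", "DNA-binding"],
        "HOMOLOGUE_2": ["DNA-binding", "Transcription regulation"],
    }
    print(sorted(merge_homologue_keywords(homologues)))
```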
    Predicted subcellular localization using LOCkey: LOCkey assigns proteins to one of 10 subcellular classes: Extra-cellular, Nuclear, Cytoplasm, Mitochondria, Chloroplast, ER, Golgi, Peroxisome, Lysosome and Vacuole. A protein is assigned to a subcellular class only if the SWISS-PROT keywords associated with it meet pre-specified entropy cut-off criteria.
    Confidence of prediction using LOCkey: Confidence is assigned to a prediction based on the occurrence of the combination of keywords that best localizes a protein to a certain subcellular class. For example, if a protein is predicted as Nuclear with a confidence of 85%, the combination of keywords used to infer this localization was found in Nuclear proteins 85% of the time.
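    The confidence described above can be read as a simple frequency: among annotated proteins that carry the keyword combination, the fraction that belongs to the predicted class. The sketch below illustrates this reading on toy data. The function name and the data are hypothetical, and this is not the LOCkey implementation.

```python
from collections import Counter
from typing import Iterable, List, Optional, Set, Tuple


def keyword_confidence(
    combination: Iterable[str],
    annotated: List[Tuple[Set[str], str]],
) -> Optional[Tuple[str, float]]:
    """Return (most frequent class, fraction) for proteins carrying the combination.

    ASSUMPTION: `annotated` is a list of (keyword set, known localization)
    pairs standing in for a localization-annotated protein database.
    """
    combo = frozenset(combination)
    counts = Counter(loc for keywords, loc in annotated if combo <= keywords)
    total = sum(counts.values())
    if total == 0:
        return None   # combination never observed, so no prediction
    label, hits = counts.most_common(1)[0]
    return label, hits / total


if __name__ == "__main__":
    combo = {"DNA-binding", "Nuclear protein"}
    # Toy database: 17 Nuclear and 3 Cytoplasm proteins carry the combination.
    database = (
        [(set(combo), "Nuclear")] * 17
        + [(set(combo), "Cytoplasm")] * 3
        + [({"Kinase"}, "Cytoplasm")] * 5
    )
    label, confidence = keyword_confidence(combo, database)
    print(f"{label}: {confidence:.0%}")
```

    With this toy database the combination occurs in 20 proteins, 17 of them Nuclear, which gives the 85% figure used in the example above.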
    SWISS-PROT keywords used in LOCkey: LOCkey infers subcellular localization by assigning SWISS-PROT keywords to a protein and looking at the occurrence of these keywords in a localization-annotated database of SWISS-PROT proteins. Only SWISS-PROT keywords found to be correlated with subcellular localization, based on entropy criteria, are used.
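    The exact entropy criteria are not given here. As a rough illustration of the idea, the sketch below computes the Shannon entropy of the localizations observed for proteins carrying a keyword: a low entropy means the keyword is concentrated in few classes and is therefore informative. The function names and the cut-off value are assumptions chosen for illustration, not LOCkey's actual criteria.

```python
import math
from collections import Counter
from typing import Sequence


def localization_entropy(localizations: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the class distribution for one keyword."""
    if not localizations:
        raise ValueError("need at least one observed localization")
    counts = Counter(localizations)
    total = len(localizations)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def is_informative(localizations: Sequence[str], cutoff: float = 1.0) -> bool:
    """ASSUMPTION: a keyword is kept when its entropy is at or below `cutoff`."""
    return localization_entropy(localizations) <= cutoff


if __name__ == "__main__":
    specific = ["Nuclear"] * 9 + ["Cytoplasm"]
    spread = ["Nuclear", "Cytoplasm", "Mitochondria", "ER"]
    print(round(localization_entropy(specific), 3), is_informative(specific))
    print(round(localization_entropy(spread), 3), is_informative(spread))
```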
    PROSITE motif based annotations: Proteins are assigned to subcellular classes based on lexical analysis of the PROSITE and PFAM motifs or signatures found in the protein. The algorithm used to infer the subcellular class is similar to the one used by LOCkey. See figure for further explanation of the entropy-based algorithm used by LOCkey to infer subcellular localization.
    Predicted subcellular localization using PROSITE and PFAM motifs: Proteins are assigned to one of 10 subcellular classes: Extra-cellular, Nuclear, Cytoplasm, Mitochondria, Chloroplast, ER, Golgi, Peroxisome, Lysosome and Vacuole. A protein is assigned to a subcellular class only if the PROSITE or PFAM signature associated with it meets pre-specified entropy cut-off criteria.
    Confidence of prediction using PROSITE/PFAM signatures: Confidence is assigned to a prediction based on the occurrence of the combination of PROSITE or PFAM signatures that best localizes a protein to a certain subcellular class. For example, if a protein is predicted as Nuclear with a confidence of 85%, the combination of PROSITE/PFAM signatures used to infer this localization was found in Nuclear proteins 85% of the time.
    PROSITE/PFAM signatures used to assign localization: This column shows the PROSITE and PFAM signatures that were used to assign a subcellular class to this protein. Clicking on the links provides further information about the respective signatures. Only PROSITE and PFAM signatures found to be correlated with subcellular localization, based on entropy criteria, are used.

    Rajesh Nair
    Last modified: Mon Oct 3 17:03:56 EDT 2005