LOCkey: SWISS-PROT keyword based annotation of subcellular localization.



LOCkey: information theory based classifier. The LOCkey system is a novel M-ary classifier that predicts the sub-cellular localization of a protein from its SWISS-PROT keywords. The algorithm has two steps: (1) building a data set of trusted vectors for known proteins, and (2) classifying unknown proteins. First, a list of keywords is extracted from SWISS-PROT for all proteins with known sub-cellular localization. Most proteins have 2-5 keywords. A binary vector is generated for each protein by representing the presence of a keyword by 1 and its absence by 0. Second, to infer the sub-cellular localization of an unknown protein U, all keywords for U are read from SWISS-PROT and translated into a binary keyword vector. From this original keyword vector, LOCkey generates the set of all alternative vectors obtained by flipping components of value 1 (presence of keyword) to 0 in every possible combination. For example, for a protein with three keywords there are 2^3 - 1 = 7 possible sub-vectors: 111, 110, 101, 011, 100, 010 and 001. These sub-vectors constitute all possible keyword combinations for protein U. The goal is to find the keyword combination, i.e. sub-vector, that yields the best classification of U into one of ten classes of sub-cellular localization. To do so, LOCkey retrieves all exact matches of each sub-vector in the trusted set, i.e. all trusted proteins that contain every keyword present in the sub-vector. By construction, the proteins retrieved in this way may also contain keywords not found in U. The next task is to estimate the 'surprise value' of the given assignment. Toward this end, LOCkey simply compiles the number of retrieved proteins belonging to each type of sub-cellular localization.
This procedure is repeated for each sub-vector in turn, and a localization is finally assigned to the protein by minimising an entropy-based objective function. The system solves the classification problem accurately when the number of data points (proteins) and the dimensionality of the feature space (number of keywords) are not too large. LOCkey reached more than 82% accuracy in a full cross-validation test.
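
The procedure described above can be illustrated with a minimal Python sketch. It is not the LOCkey implementation: the text does not specify the exact objective function, so plain Shannon entropy of the localization counts is used here as a stand-in, and ties are broken in favour of the sub-vector supported by more trusted proteins (also an assumption). The function names, the keyword sets and the toy trusted set are hypothetical and serve only as illustration. Keyword sets are used in place of explicit 0/1 vectors; a subset of U's keywords corresponds to one sub-vector.

```python
from collections import Counter
from itertools import combinations
from math import log2


def sub_vectors(keywords):
    """Yield every non-empty subset of the keywords (2^n - 1 subsets for n keywords)."""
    kws = sorted(keywords)
    for size in range(len(kws), 0, -1):
        for combo in combinations(kws, size):
            yield frozenset(combo)


def localization_counts(subset, trusted):
    """Count localizations over trusted proteins containing every keyword in subset."""
    return Counter(loc for kws, loc in trusted if subset <= kws)


def shannon_entropy(counts):
    """Shannon entropy (bits) of a count distribution; assumed stand-in objective."""
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())


def classify(keywords, trusted):
    """Return (localization, best subset, entropy), or None if nothing matches."""
    best = None
    for subset in sub_vectors(keywords):
        counts = localization_counts(subset, trusted)
        if not counts:
            continue  # no trusted protein carries all keywords of this sub-vector
        # Assumed ranking: lowest entropy first, then largest number of matches.
        score = (shannon_entropy(counts), -sum(counts.values()))
        if best is None or score < best[0]:
            best = (score, subset, counts.most_common(1)[0][0])
    if best is None:
        return None
    return best[2], best[1], best[0][0]


# Hypothetical toy trusted set: (keyword set, known localization).
trusted_set = [
    (frozenset({"Nuclear protein", "DNA-binding", "Transcription regulation"}), "nucleus"),
    (frozenset({"Nuclear protein", "DNA-binding"}), "nucleus"),
    (frozenset({"DNA-binding", "Mitochondrion"}), "mitochondrion"),
    (frozenset({"Transit peptide", "Mitochondrion"}), "mitochondrion"),
]

unknown = {"DNA-binding", "Transcription regulation", "Zinc-finger"}
print(classify(unknown, trusted_set))
```

In this toy example the single keyword "DNA-binding" matches proteins from two localizations and therefore has non-zero entropy, whereas the combination of "DNA-binding" and "Transcription regulation" matches only nuclear proteins and is selected. Enumerating all 2^n - 1 subsets grows exponentially with the number of keywords, which is consistent with the remark that the approach works when the number of keywords is not too large.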