Chapter 1
Introduction†
Proteins are the machinery of life. Genetic information is stored in nucleic acids (DNA), while proteins are the workhorses that transform this information into physical reality. This program requires the coordinated effort of many different types of proteins. A protein molecule is a long unbranched chain of amino acids, drawn from twenty standard types, each linked to its neighbor through a covalent peptide bond (Fig. 1-1). Proteins are therefore also known as polypeptides. A striking characteristic of proteins is that they have well-defined 3-D structures. A stretched-out polypeptide chain has no biological activity; protein function arises from the ‘conformation’ of the protein, which is the 3-D arrangement of its atoms. Proteins are the most structurally complex and functionally sophisticated macromolecules known and perform a wide array of tasks in organisms, such as the catalysis of biochemical reactions, the transport of nutrients, and the recognition and transmission of signals. The full range of roles played by a particular protein is referred to as its 'function'.
Decoding protein function - a major challenge for modern biology. The NCBI RefSeq database now contains over 5 million sequences from more than 5000 organisms (Pruitt, Tatusova et al. 2007). The number of entirely sequenced genomes is expected to continue growing exponentially for at least the next few years.
With the availability of genome sequences of entire organisms, we are, for the first time, in a position to understand the expression, function, and regulation of the entire set of proteins encoded by an organism. This information will be invaluable for understanding how complex biological processes occur at a molecular level, how they differ in various cell types, and how they are altered in disease states (Zhu, Bilgin et al. 2003). Identifying protein function is a big step toward understanding diseases and identifying novel drug targets (Brutlag 1998). However, experimentally
determining protein function continues to be a laborious task requiring
enormous resources. For example, more than a decade after its discovery, we
still do not know the precise and entire functional role of the prion protein (Harrison, Bamborough et
al. 1997).
The rate at which expert annotators add experimental information to the more or
less controlled vocabularies of databases is slower still.
This has left a huge and rapidly widening gap between the number of sequences
deposited in databases and the experimental characterization of the
corresponding proteins (Bork and Koonin 1998; Smith 1998).
Bioinformatics plays a central role in bridging this sequence-function gap
through the development of tools for faster and more effective prediction of
protein function (Bork, Dandekar et al.
1998; Fleischmann, Moller et al. 1999; Luscombe, Greenbaum et al. 2001).
'Protein function' has myriad meanings. During the last decade, advanced artificial intelligence (AI) techniques have proved remarkably successful in addressing numerous problems in molecular biology. However, excluding some bright spots, AI-based tools for predicting protein function have in general lagged in performance. One of the major problems hindering the development of methods for predicting protein function is that proteins are multi-functional and can perform different functions in different cellular contexts. Proteins can perform molecular functions like catalyzing metabolic reactions and transmitting signals to other proteins or to DNA. At the same time, as members of a set of cooperating proteins, they can be responsible for physiological functions such as the regulation of gene expression, metabolic pathways and signalling cascades (Bork, Dandekar et al. 1998) (Fig. 1-2). The function of a protein can be associated with many mutually overlapping levels: chemical, biochemical, cellular, organism mediated, developmental and physiological (Rost, Liu et al. 2003). These levels are related in complex ways. For example, a protein kinase can be associated with a cellular function (such as the cell cycle) and with a chemical function (transferase), and it is subject to a complex control mechanism through interaction with other proteins; the same kinase may also be the culprit that leads to malfunction, or disease.
The variety of functional roles of a protein often results in confusing database annotations, which makes it difficult to develop tools for predicting protein function (Apweiler, Attwood et al. 2000). What we need for reliable automatic predictions are computer-readable hierarchical descriptions of function (Overbeek, Larsen et al. 1997; Bork, Dandekar et al. 1998; Ashburner, Ball et al. 2000). However, defining an ontology for protein function has proved to be an extremely difficult task. The most comprehensive effort at developing a controlled vocabulary for describing protein function originates from the Gene Ontology (GO) consortium (Ashburner, Ball et al. 2000). Subcellular localization, termed 'cellular component' in GO, is one of the three main classes used to organize protein function within the GO hierarchical classification scheme, the other two classes being molecular function and biological process.
Subcellular localization - an important aspect
of protein function. Living organisms can be divided into two types:
prokaryotes and eukaryotes. The defining feature of the prokaryotic cell is the
absence of a nucleus and of other membrane-bound organelles. During the course
of evolution, the cells of higher organisms, namely the eukaryotes, became
progressively divided into more elaborate membrane-bound sub-compartments
(Fig. 1-3). The major subcellular locations in eukaryotes are: extracellular space,
cytoplasm, nucleus, mitochondria, Golgi apparatus, endoplasmic reticulum (ER),
peroxisome, vacuoles, cytoskeleton, nucleoplasm, nucleolus, nuclear matrix and
ribosomes. Each membrane-bound sub-compartment, or organelle, performs
specialized cellular functions. For example, mitochondria are the powerhouses
of the cell while the nucleus houses its genetic material. Proteins must be
localized in the same subcellular compartment to cooperate towards a common
physiological function. Knowledge of the subcellular localization of a protein
can significantly improve target identification during the drug discovery
process. For example, secreted proteins and plasma membrane proteins are easily
accessible to drug molecules because they are located in the extracellular
space or on the cell surface. Aberrant subcellular localization of proteins has
been observed in cells affected by several diseases, such as cancer and Alzheimer’s
disease. The consequences of
mis-localization and mis-targeting are manifested in a number of human genetic
diseases, including cystic fibrosis (Skach 2000),
Subcellular localization prediction - an ideal testing ground for function prediction methods. Due to the physical compartmentalization of the cell, the subcellular localization of a protein is a much more easily identifiable functional feature than its other roles in the cell. In addition, the protein trafficking mechanism is relatively well understood, and computer-readable subcellular localization data are available for large numbers of proteins. Though some proteins can localize in multiple compartments, the majority of proteins are localized within a single compartment for the largest part of their lifetime. Therefore, predicting subcellular localization has become one of the main testing grounds for the development of prediction methods for protein function.

Recently, a
number of high-throughput experimental techniques, largely based on epitope and
green fluorescent protein (GFP) tagging, have become available for determining
subcellular localization. So far, the majority of large-scale experiments have
been restricted to yeast (Kumar, Agarwal et al.
2002; Huh, Falvo et al. 2003),
or to particular compartments, such as chloroplast proteins in Arabidopsis thaliana (thale cress) (Kleffmann, Russenberger
et al. 2004).
To date, high-throughput localization experiments cannot be performed for
mammalian or other higher eukaryotic proteomes. One significant obstacle is
that large-scale production of a collection of cell lines each with a defined
gene chromosomally tagged at the 3′-end is not yet possible (Davis 2004). In contrast, computational
tools can provide fast and accurate localization predictions for any organism (Bork and Koonin 1998;
Koonin 2000).
This has resulted in subcellular localization prediction becoming one of the
central challenges in bioinformatics (Eisenhaber and Bork
1998; Nakai 2000; Schneider and Fechner 2004).
The cell employs a complex trafficking mechanism to sort proteins. A basic knowledge of the protein sorting mechanism is essential for understanding the different methods for predicting subcellular localization. Most eukaryotic proteins are encoded in the nuclear genome and synthesized in the cytosol, and many need to be further sorted before they reach their final destinations (Fig. 1-4). The localization of a protein is largely determined by a trafficking system that is reasonably well understood for some organelles (Bar-Peled, Bassham et al. 1996; Schatz and Dobberstein 1996; Mattaj and Englmeier 1998; Bauer, Hofmann et al. 2000; Nakai 2000). There are three fundamentally different ways by which proteins move from one compartment to another (Alberts, Bray et al. 1994).

1) In ‘gated transport’, the protein traffic between the cytosol and nucleus occurs between topologically equivalent spaces, which are in continuity through the nuclear pore complexes. The nuclear pore complexes function as selective gates that actively transport specific macromolecules and macromolecular assemblies, although they also allow free diffusion of smaller molecules.
2) In ‘transmembrane transport’, membrane-bound protein translocators directly transport specific proteins across a membrane from the cytosol into a space that is topologically distinct. The transported protein molecule is unfolded during the translocation process. The initial transport of selected proteins from the cytosol into the ER lumen or from the cytosol into mitochondria, for example, occurs in this way.
3) In ‘vesicular transport’, membrane-enclosed transport intermediates ferry proteins from one compartment to another. The transport vesicles become loaded with a cargo of molecules derived from the lumen of one compartment as they pinch off from its membrane; they discharge their cargo into a second compartment by fusing with that compartment. The transfer of soluble proteins from the ER to the Golgi apparatus, for example, occurs in this way. Because the transported proteins do not cross a membrane, vesicular transport can move proteins only between compartments that are topologically equivalent.
Each of the three modes of protein transfer is usually guided by sorting signals in the transported protein that are recognized by complementary receptor proteins.
Protein trafficking proceeds via sorting signals. Sorting signals in proteins are of two types (Fig. 1-5). 1) The first type resides in a continuous stretch of amino acid sequence, typically 15–60 residues long. Some of these ‘signal sequences’ are removed from the mature protein by specialized signal peptidases once the sorting process has been completed. 2) The second type consists of a specific three-dimensional arrangement of atoms on the protein's surface that forms when the protein folds up. The amino acid residues that comprise this ‘signal patch’ can be distant from one another in the linear amino acid sequence, and they generally persist in the finished protein.
Signal sequences act like ‘address labels’. Each signal sequence specifies a particular destination in the cell. Protein sorting to the nucleus is usually mediated by short stretches of consecutive amino acids called nuclear localization signals (NLSs). NLSs can occur anywhere within the protein sequence and are usually abundant in positively charged residues (Table 1-1). Proteins destined for initial transfer to the ER usually have a signal sequence at their N-terminus, which characteristically includes a sequence composed of about 5–10 hydrophobic amino acids. Many of these proteins will in turn pass from the ER to the Golgi apparatus, but those with a specific sequence of four amino acids at their C-terminus are recognized as ER residents and are returned to the ER. Proteins destined for mitochondria have signal sequences of yet another type, in which positively charged amino acids alternate with hydrophobic ones. Finally, many proteins destined for peroxisomes have a signal peptide of three characteristic amino acids at their C-terminus (Table 1-1).
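The 'address label' idea can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: the motifs KDEL (ER retention) and SKL (peroxisomal targeting) are the common textbook examples of the four-residue and three-residue C-terminal signals and are not taken from Table 1-1; the set of hydrophobic residues, the 30-residue N-terminal window and the function names are hypothetical choices made for illustration. Real signals tolerate many variants, so exact matching of this kind misses most of them.

```python
# Hypothetical, simplified motif table (assumption): textbook C-terminal signals.
C_TERMINAL_MOTIFS = {
    "ER retention": "KDEL",   # four residues at the C-terminus
    "peroxisome": "SKL",      # three residues at the C-terminus
}

# One simplified choice of hydrophobic residues (assumption).
HYDROPHOBIC = set("AVLIMFWC")


def c_terminal_signal(seq):
    """Return the destination whose motif ends the sequence, or None."""
    seq = seq.upper()
    for destination, motif in C_TERMINAL_MOTIFS.items():
        if seq.endswith(motif):
            return destination
    return None


def has_hydrophobic_core(seq, n_term=30, min_run=5):
    """True if the first n_term residues contain a run of at least
    min_run consecutive hydrophobic residues."""
    run = 0
    for aa in seq.upper()[:n_term]:
        run = run + 1 if aa in HYDROPHOBIC else 0
        if run >= min_run:
            return True
    return False
```

A sequence for which `has_hydrophobic_core` is true and `c_terminal_signal` returns "ER retention" would, under these simplifications, be a candidate ER resident.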
The importance of each of these signal sequences for protein targeting has been shown by experiments in which the peptide is transferred from one protein to another by genetic engineering techniques. Placing the N-terminal ER signal sequence at the beginning of a cytosolic protein, for example, redirects the protein to the ER. Signal sequences are therefore both necessary and sufficient for protein targeting. Even though their amino acid sequences can vary greatly, the signal sequences of all proteins having the same destination are functionally interchangeable, and physical properties, such as hydrophobicity, often seem to be more important in the signal-recognition process than the exact amino acid sequence. The discovery of signal peptides was the first major breakthrough in understanding protein targeting, and Günter Blobel was awarded the Nobel Prize in Physiology or Medicine in 1999 for this discovery. The vast majority of signal sequences have not been documented. The situation is even worse for signal patches, and few structural motifs have been determined experimentally. In the absence of a clear understanding of the principles governing protein translocation, computational methods for predicting subcellular localization have pursued a number of conceptually distinct approaches.
Standard AI methods need extensive retooling for addressing biological problems. Recent progress in predicting protein function has been the result of applying advances in AI, namely machine learning (ML) methods like SVMs, neural networks, Bayesian networks, decision trees and HMMs. However, many issues, some of which are peculiar to biological data, must be considered before off-the-shelf ML techniques can be applied to biological problems. The major problem is the ‘noisiness’ of biological data. At the biological level, noise arises because proteins are multi-tasking: the observation of one function for a protein does not rule out other functional roles. At the experimental level, noise arises due to incorrect observations. At the curation level, noise enters databases due to misinterpretation of experimental observations by curators. A second issue is sequence similarity between training and test data sets, since even low levels of sequence similarity can imply homology, i.e. that a pair of proteins is evolutionarily related and hence likely to be functionally related. This makes it difficult to assess accurately the performance of AI-based methods on blind samples that share differing degrees of sequence similarity with training samples. This is an important point, since prediction using AI-based methods is most desired when function is to be inferred for a protein that shares no homology with proteins of known function. These issues necessitate significant retooling of off-the-shelf AI methods before application to a biological problem. To be successful, a method has to be rigorously tested and its prediction accuracy carefully assessed.
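The second issue, similarity between training and test sets, can be made concrete with a small Python sketch. It is a minimal illustration under stated assumptions: `crude_similarity` uses the standard-library `difflib.SequenceMatcher` ratio as a stand-in for a proper alignment-based identity measure, and the 0.25 cut-off and the function names are hypothetical. A real evaluation would use an alignment tool and a threshold justified for the problem at hand.

```python
from difflib import SequenceMatcher


def crude_similarity(a, b):
    """Crude stand-in (assumption) for alignment-based sequence identity.
    Returns a ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def filter_test_set(train, test, max_similarity=0.25, similarity=crude_similarity):
    """Keep only the test sequences whose similarity to every training
    sequence is at most max_similarity."""
    return [
        t for t in test
        if all(similarity(t, s) <= max_similarity for s in train)
    ]
```

Accuracy reported on the filtered test set then reflects performance on proteins without a close relative in the training data, which is the case where prediction is most needed.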
No straightforward strategy for predicting
localization. Methods for predicting the subcellular localization of
proteins have primarily explored four avenues: 1) annotation transfer from
homologous sequences, 2) predicting the sorting signals that the cell uses as
‘address labels’, 3) mining the functional information deposited in databases
and scientific literature, and 4) using the observation that the subcellular
localization depends in subtle ways on the amino acid composition.
Additionally, there are meta-methods which combine the outputs from a number of
primary methods in an optimal way to enhance accuracy and coverage. Sequence
similarity is perhaps the most frequently used method to annotate function for
unknown proteins and accounts for the majority of annotations about function in
public databases (Koonin 2000; Devos and
Valencia 2001; Valencia and Pazos 2002).
A major limitation of sequence homology based methods is that they are only
applicable when a sequence-similar protein with experimentally known
function is available. Hence, only a small fraction of known proteins can be
annotated using this approach (Eisenhaber and Bork
1998).
Since protein trafficking relies on the presence of sorting signals, ideally we
would like to predict the signals responsible for targeting. However, our
current knowledge of sorting signals is far from perfect and recent cell
biological studies seem to indicate that the protein sorting mechanism is far
more complex than previously thought. This makes it extremely difficult to
accurately identify sorting signals (Nakai 2001). In spite of their limited
applicability, methods that predict sorting signals provide the most useful
predictions since by pinpointing the ‘targeting signal’ they shed light on the
molecular mechanisms of protein translocation. Traditionally, expert human
annotators have been responsible for interpreting experimental data in the
scientific literature and annotating protein function in public databases (Apweiler, Gateau et al.
1997; Bairoch and Apweiler 1997). However, recent advances in data mining
techniques have made it possible to deploy automatic methods to complement the
role of ‘expert annotators’ and extract functional information directly from
biological databases, MEDLINE abstracts (Airozo, Allard et al.
1999)
and even full scientific papers. Due to the exponential growth in the size of
biological databases, a number of methods have recently been developed that
infer subcellular localization using automatic text analysis. Many recent
advances in predicting subcellular localization have been the result of using
the amino acid composition and other sequence derived features. These ab initio methods do not rely on ‘targeting signals’ or any other
feature directly associated with protein sorting. Since they utilize only features
derived or predicted from the primary sequence like the amino acid composition
or predicted secondary structure, they have the advantage of being applicable
to all protein sequences. Methods that can
accurately predict subcellular localization from the amino acid sequence alone
are invaluable in interpreting the wealth of data generated by large-scale
sequencing projects. Furthermore, predictions of localization can assist
high-throughput techniques to determine localization from cDNAs (Simpson, Wellenreuther
et al. 2000).
However, prediction accuracy for ab
initio methods still lags behind other approaches. This has led to the
development of combination methods, commonly referred to as
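The amino acid composition used by the ab initio methods described above can be sketched in a few lines of Python. This is a minimal illustration, not the feature set of any particular published method: the function name, the ordering of the twenty residues and the decision to ignore non-standard characters are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard residues


def composition(seq):
    """Return the relative frequencies of the twenty standard amino acids,
    in the order given by AMINO_ACIDS. Other characters are ignored."""
    counts = Counter(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

The resulting 20-dimensional vector can be computed for any protein sequence, which is why such features are applicable even when no sorting signal or homologue is known.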
Inferring localization through sequence homology. Traditionally, the first approach for annotating the function of an unknown protein relies on sequence similarity to proteins of known function (Koonin, Tatusov et al. 1996; Tamames, Ouzounis et al. 1998). The method works by first identifying a database protein of experimentally known function with significant sequence similarity to a query protein, U, and then transferring the experimental annotations of function from the homologue to U. Understanding the relation between function and sequence is of fundamental importance, since it provides insights into the underlying mechanisms of evolving new functions through changes in sequence and structure (Thornton, Orengo et al. 1999). Several studies have explored the relationship of sequence and structure similarity to conservation of various aspects of protein function (Orengo, Todd et al. 1999; Wilson, Kreychman et al. 2000; Pawlowski and Godzik 2001; Rost 2002). One major observation is the existence of sharp ‘conservation thresholds’ for sequence similarity: above the threshold, sequence-similar pairs of proteins share the same function, and below it, they have dissimilar functions. In practice, ad hoc thresholds of 50-60% sequence identity are often used for transferring functional annotations. Recent studies indicate that these levels of sequence similarity may not be sufficient to accurately infer function (Nair and Rost 2002; Rost 2002). Several pitfalls in transferring annotations of function have been reported, for example, inadequate knowledge of thresholds for 'significant sequence similarity', using only the best database hit, or ignoring the domain organization of proteins (Bork and Koonin 1998; Doerks, Bairoch et al. 1998; Galperin and Koonin 2000; Devos and Valencia 2001). In spite of this, homology-based approaches continue to be among the most reliable for annotating subcellular localization (Nair and Rost 2002; Gardy, Spencer et al. 2003; Wrzeszczynski and Rost 2004). However, homology transfer alone cannot bridge the sequence-function gap. Annotation transfer by sequence homology is applicable only in a limited number of cases. For the completely sequenced eukaryotic proteomes, subcellular localization can be inferred using homology for fewer than 25% of proteins.
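Threshold-based annotation transfer can be sketched as follows. This is a simplified illustration under stated assumptions: the `Hit` record, the function name and the default cut-off of 60% identity (the upper end of the ad hoc range mentioned above) are hypothetical, and the database search that produces the hits is assumed to have been run elsewhere. To avoid relying on the best hit alone, the sketch transfers a localization only when all hits above the threshold agree.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One database hit for the query protein (hypothetical record)."""
    protein_id: str
    percent_identity: float
    localization: str  # experimentally known localization of the hit


def transfer_localization(hits, min_identity=60.0):
    """Return the localization shared by all hits at or above min_identity,
    or None if there is no such hit or the hits disagree."""
    eligible = {h.localization for h in hits if h.percent_identity >= min_identity}
    return eligible.pop() if len(eligible) == 1 else None
```

Returning None for queries without a sufficiently similar annotated homologue makes the limited coverage of homology transfer explicit.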
Specialized prediction systems required for the different kingdoms. During the course of evolution, the amino acid sequences of proteins that are evolutionarily related and perform the same function in different organisms tend to diverge due to accumulated mutations in the genes. This tendency of protein sequences to diverge makes it hard to accurately identify a protein with annotated function that is evolutionarily related to an unknown query protein, especially when the organisms they belong to are only distantly related. The strength of AI methods for function prediction lies in their exceptional ability to detect and exploit remote similarities in various features of protein sequences. In general, the accuracy of machine learners increases with the size of training data. Combining experimental localization data for proteins from all organisms would result in the largest available training dataset. Ideally, such a combined predictor would provide the most accurate predictions for proteins from all organisms. However, in reality, protein sequences from distantly related organisms have diverged to such an extent that the inclusion of extremely diverged sequences in a combined predictor results in significantly reduced prediction accuracies. Hence, prediction methods employ specialized predictors for the most diverged kingdoms. In choosing the number of specialized predictors, the goal is to maximize both the size of available training data and the prediction accuracy. Since evolution is the driving force behind sequence and functional divergence, the phylogenetic classification of organisms is most often used to define kingdoms for training specialized predictors. The majority of subcellular localization prediction methods employ specialized predictors for Gram-positive bacteria, Gram-negative bacteria, plants and animals. Fungi are usually included in the animals group but some methods employ specialized predictors for them. Most prediction methods do not explicitly consider archaeal proteins, since archaea usually inhabit uncommon and extreme environments and very little experimental data is available for them.
Predicting sequence motifs involved in protein targeting. A number of methods have tried to predict localization by identifying local sequence motifs, such as signal peptides (von Heijne 1995; Nakai and Horton 1999) or nuclear localization signals (NLS) (Cokol, Nair et al. 2000; Nakai 2000), that are responsible for protein targeting. The prediction of N-terminal sorting signals has a long history originating from the early work on secretory signal peptides of von Heijne (von Heijne 1981; von Heijne 1985). N-terminal sorting signals direct proteins into the secretory pathway via the ER and also target proteins to the mitochondria (Voos, Martin et al. 1999) and to chloroplasts (Bruce 2000). Early methods for predicting signal peptides were essentially based on consensus signals, using linear discriminant functions with weight matrices. Modern machine learning (ML) techniques can predict whether a protein contains an N-terminal targeting peptide or not by automatically extracting correlations from the sequence data without any prior knowledge of targeting signals. As a consequence, the output of these predictors offers little insight into the protein sorting mechanism. The introduction of machine learning techniques like neural networks (NNs) and hidden Markov models (HMMs) (Nielsen, Brunak et al. 1999; Emanuelsson, Nielsen et al. 2000) has resulted in spectacular improvements in prediction accuracy. ML methods like NNs and HMMs learn to discriminate automatically from the data, using only a set of experimentally verified examples as input. It is now possible to predict secretory signal peptides (SPs) (Nielsen, Engelbrecht et al. 1997; Kall, Krogh et al. 2004), mitochondrial targeting peptides (mTPs) (Fujiwara, Asogawa et al. 1997; Emanuelsson, von Heijne et al. 2001) and chloroplast targeting peptides (cTPs) (Emanuelsson, Nielsen et al. 1999) quite reliably using machine learning techniques (Table 1-2).
SignalP (Nielsen, Engelbrecht et al. 1997; Bendtsen, Nielsen et al. 2004), which is currently in its third version, was the first predictor to apply advanced ML techniques to the problem of predicting signal peptides. It consists of two different predictors based on neural network (NN) and hidden Markov model (HMM) algorithms and is among the most widely used research tools in bioinformatics. The signal peptide prediction problem is posed to the machine learners in two ways: (1) classification of individual amino acids as belonging to the signal peptide or not, and (2) recognition of the cleavage sites against the background of all other sequence positions. SignalP consists of specialized predictors for three organism groups, namely eukaryotes, Gram-negative and Gram-positive bacteria. SignalP 3.0 can discriminate proteins containing signal peptides with an accuracy of 93% for eukaryotes, 95% for Gram-negative and 98% for Gram-positive bacteria. The major improvement over the years has been in predicting the cleavage site, for which gains in prediction accuracy are 6-7% for all three organism classes. The neural network outperforms the HMM-based classifier in discriminating signal peptides and predicting cleavage sites. However, the HMM-based classifier better distinguishes signal peptides from signal anchors, which anchor proteins to the membrane. A particular problem afflicting methods detecting N-terminal signals is that start codons are predicted with less than 70% accuracy by genome projects (Gaasterland and Oprea 2001; Lander, Linton et al. 2001; Venter, Adams et al. 2001).
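The weight-matrix idea behind the early signal peptide predictors mentioned above can be sketched in Python. This is a toy illustration and not the SignalP algorithm: the uniform background distribution, the pseudocount, the function names and the three-residue training windows in the usage note are all assumptions made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(windows, pseudocount=1.0):
    """Build log-odds weights, one dict per position, from equal-length
    aligned windows, relative to a uniform background (assumption)."""
    width = len(windows[0])
    background = 1.0 / len(AMINO_ACIDS)
    total = len(windows) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for pos in range(width):
        counts = Counter(w[pos] for w in windows)
        matrix.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def best_window(seq, matrix):
    """Return (start, score) of the highest-scoring window in seq.
    seq must be at least as long as the matrix is wide."""
    width = len(matrix)
    scored = [
        (sum(matrix[i].get(seq[start + i], 0.0) for i in range(width)), start)
        for start in range(len(seq) - width + 1)
    ]
    score, start = max(scored)
    return start, score
```

For example, a matrix built from the hypothetical windows "ASA", "ALA" and "AQA" assigns its highest score in the sequence "MKKLLLASAGG" to the window starting at index 6. The per-position weights remain directly readable, in contrast to the learned parameters of NNs and HMMs.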