
Title: Building a neural network for predicting protein features
Author: Marco Punta & Burkhard Rost
Quote: In Application of Artificial Neural Networks to Chemistry and Biology, David Livingston (ed.), Humana Press, in press.

Building a neural network for predicting protein features

Marco Punta 1,2,3,4,* & Burkhard Rost 1,2,3,4

1 Dept. of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Columbia University, 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA
4 New York Consortium On Membrane Protein Structure (NYCOMPS), Columbia University, 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA
* Corresponding author: punta@rostlab.org URL http://www.rostlab.org/

 

 



Abstract

ADD

 

Key words: neural networks, protein structure, protein function, over-fitting, performance estimate

 

Abbreviations used

3D    three-dimensional
aa    amino acids
AUC   Area Under the ROC Curve
I-H   input-hidden
FN    False Negatives
FP    False Positives
FPR   False Positive Rate
H-O   hidden-output
NFP   Number of Free Parameters
NHN   Number of Hidden Nodes
NN    Neural Network
PDB   Protein Data Bank
ROC   Receiver Operating Characteristics
SA    Solvent Accessibility
SS    Secondary Structure
TN    True Negatives
TP    True Positives
TPR   True Positive Rate



 

1. Introduction

1.1 Scope

Our goal is to introduce the reader to the use of neural networks (NN) in protein structure and function prediction. NN are very popular among computational biologists, who have to deal with extremely complex phenomena and very noisy data: NN can solve classification and regression tasks without requiring much prior knowledge of the problem, and they are tolerant of errors. Indeed, both supervised (e.g. feed-forward and recurrent) and unsupervised (e.g. Kohonen maps) NN have been applied to a vast number of problems: from detection of secondary structures (SS) (see Przybylski and Rost {Przybylski and Rost} for a review) to prediction of post-translational modifications (1-3); from identification of disordered regions (4) to prediction of metal binding sites (5){Passerini et al. in press}; from assignment of sub-cellular localization (6) to separation of proteins into functional classes (9), as well as many others. Developing a NN-based predictor requires addressing several different issues, such as deciding which type of NN to use, choosing the NN architecture, selecting the cost and activation functions, etc. However, most of these decisions are not specific to the use of NN for biological problems but are rather of more general interest; they are discussed in several textbooks and review papers (see, for example, (10)). Here, we prefer to discuss those aspects of NN development that are more directly related to the task of predicting structural and functional features of proteins. In particular, we focus on database selection, sample labeling, input feature encoding, and performance assessment. Our analysis concerns methods that use information from the protein sequence only, but the issues we discuss are generally relevant for developing any learning algorithm that aims to predict protein structural or functional features. The paper is organized in the following way:

 

 

1.2 Proteins

1.2.1 What are proteins?

Proteins perform many important functions in living organisms: from catalysis of chemical reactions to transport of nutrients, from recognition to signal transmission. They are polypeptide chains formed by a unique combination of 20 amino acids (aa) (11). These constituent molecules are characterized by a conserved region (backbone) and by a variable region (side-chain), where the side-chains confer different physico-chemical properties on each of the aa (in terms of, for example, size, charge, hydrophobicity). Proteins are synthesized by living cells starting from genes, most often stored in DNA molecules in the form of nucleic acid sequences (exceptions are, for example, RNA viruses, which use RNA to store genetic information). DNA is copied into RNA and RNA sequences are translated into proteins with the help of large RNA-protein complexes called ribosomes. Most proteins, when exiting the ribosome machinery, fold into a stable 3D structure and are translocated to the cellular compartment where they perform their function. Some others, called natively unstructured, are unable to fold under physiological conditions and remain disordered. Still, they are implicated in a large number of functional tasks such as molecular recognition, acetylation, glycosylation, etc. (12) (in a few reported cases they have been shown to become structured when binding to a substrate, i.e. another protein (13)). Finally, a small number of proteins (e.g. prions) have been observed in more than one stable 3D structure, with different conformations being responsible for different functions (or malfunctions) (14).

 

1.2.2 Protein structure determination

No matter if unique or not, flexible or rigid, folded or disordered, the 3D conformation is a key element for determining the capability of a protein to interact with its environment and perform a specific function. This is why over the last decades a great deal of experimental work has been devoted to protein structure determination.

 

Experimental methods that allow solving protein structure at atomic resolution include, most prominently, X-ray crystallography and NMR. Both methods have strengths and weaknesses: X-ray crystallography gives high-quality structures but provides a mostly static representation of the protein, with possible artifacts coming from crystal (i.e. non-native) contacts (15); NMR offers views of both the structure and dynamics of a protein in solution but lacks a robust resolution measure, thus making it difficult to assess the quality of the obtained structures (16). In other words, whereas we are able to tell a reliable X-ray structure from a less reliable one, making the same distinction for NMR-derived structures is more difficult. Recently, cryo-electron microscopy has also been used to produce atomic resolution protein structures. Although experimental techniques have continued to improve over the last years and high-throughput production of protein structures is now becoming a reality thanks to the Structural Genomics projects (17), our knowledge of the protein structural universe still appears very limited when compared to the number of known protein sequences. The latter have dramatically increased with the advent of genome (18) and environmental (19, 20) sequencing. Compared with the more than 3 million protein sequences found in TrEMBL (see Materials and Table 1), the Protein Data Bank (PDB) (21), i.e. the repository of all publicly available protein structures, contains only 36,000 entries (counting also multiple copies of the same protein as distinct entries). Most importantly, relevant protein classes that have proved tough to treat through experimental methods are considerably underrepresented in the PDB. A classic example is membrane proteins, estimated to constitute from 20 to 30% of all proteins and accounting for less than 1% of the PDB. Moreover, even in cases for which experimental determination of a structure goes smoothly, the average cost per structure is still high (22).

 

 

 

Table 1: Some popular databases used by computational biologists for developing their prediction methods.

 

 

Database/Method                                           URL
GenBank                                                   http://www.ncbi.nlm.nih.gov/Genbank/
EMBL nucleotide sequence database                         http://www.ebi.ac.uk/embl/
DDBJ                                                      http://www.ddbj.nig.ac.jp/
Swiss-Prot & TrEMBL (28)                                  http://ca.expasy.org/sprot/
UniProt (31)                                              http://www.ebi.uniprot.org/index.shtml and mirror sites
UniRef                                                    ftp://ftp.expasy.org/databases/uniprot/uniref/
PDB (21)                                                  http://www.rcsb.org/pdb/Welcome.do
PDB non-redundant sets                                    http://www.rcsb.org/pdb/clusterStatistics.do
CD-HIT (35)                                               http://bioinformatics.ljcrf.edu/cd-hi/
UniqueProt (41)                                           http://cubic.bioc.columbia.edu/services/uniqueprot/
EVA (39) unique set                                       http://cubic.bioc.columbia.edu/eva/res/weeks.html
BLAST/PSI-BLAST (29)                                      http://www.ncbi.nlm.nih.gov/BLAST/
BLASTClust                                                http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html
SCOP {Andreeva A., Howorth and Murzin A.G. 2004 NAR 32}   http://scop.mrc-lmb.cam.ac.uk/scop/
CATH (33)                                                 http://cathwww.biochem.ucl.ac.uk/latest/index.html
GO (34)                                                   http://www.geneontology.org/index.shtml

 

 

 

 

Computational methods can help to reduce the gap between known sequences and known structures. NN, in particular, can be instrumental for the prediction of features such as secondary structure, aa solvent accessibility, aa flexibility, intra-chain aa contacts, etc. {Przybylski and Rost}. These predictions can be used in combination with techniques such as homology modeling (for a recent review see Petrey and Honig (23), and see Jacobson and Sali (24) for the impact of homology modeling on drug design) and fold recognition (25) to produce full 3D structural models or, when these methodologies cannot be applied, as a primary source of structural information to assist experimental studies.

 

1.2.3 Experimental Analysis of Protein function

Although very important, structural knowledge is often not sufficient for determining the function of a protein. Indeed, similar or identical 3D structures can perform completely different functions and the same function can be performed by different 3D scaffolds (26). This is due to at least two factors: on one hand, function, and in particular biochemical function (e.g. enzymatic activity), is often carried out at a local level by a few key aa (i.e. small structural differences account for big functional differences); on the other hand, different environmental conditions (such as sub-cellular localization, temperature or the tissue the protein is expressed in) can lead to different functions (for a review, see Whisstock and Lesk (27)). As a consequence, specific experimental assays have been developed to determine function directly (that is, not necessarily using structural knowledge). Experimentalists use a wide range of methods and techniques and the information obtained is stored in numerous public databases. Arguably, the most important of them is Swiss-Prot (28) (a subset of TrEMBL), comprising more than 200,000 manually annotated protein entries, compiled from available structural and functional experimental data. Still, Swiss-Prot contains ten times fewer entries than TrEMBL; also, for several Swiss-Prot entries we have only very partial functional knowledge. For this reason, protein function prediction has recently taken center stage in computational biology.

In conclusion, the gap between known sequences and structurally and functionally annotated proteins keeps widening. As a consequence, although experimental techniques for protein structure and function determination continue to improve, computational methods in these areas are badly needed. NN-based methods can be (and have been widely) used for this purpose.

 

1.2.4 The concept of homology

Here, we want to introduce a concept that is central to most of computational biology and that has to be reckoned with both when building a dataset for developing a NN-based method and when assessing its performance: homology. Proteins, as we know them today (i.e. in contemporary organisms), were not all invented from scratch. According to evolutionary theory, organisms originated from a single common ancestor and diverged gradually in time. During the course of evolution, the ancestral genes propagated through the newly emerging species, thus generating copies that carried occasional mutations (which eventually accumulated over time). At times, novel genes emerged and were also transmitted to later species. As a consequence of this process, most contemporary genes and corresponding proteins have multiple relatives (called homologs) across, as well as within, species.

Close homology between two proteins can usually be detected from sequence similarity. Given two proteins and an alignment between them (obtained, for example, using PSI-BLAST (29)), if sequence similarity exceeds a certain threshold we can safely assume that it reflects a common evolutionary origin for the two proteins. In practice, >30% sequence identity on a region longer than 100 aa can be considered as a strong indication of homology. An important consequence of the evolutionary relationship between two proteins and the reason why homology is relevant to our discussion on prediction methods is that related proteins have a very similar structure and, often, similar function. When developing a NN for the prediction of a protein feature, the existence of these correlations is a factor to be taken into account (see Sections 3.3.2 and 3.3.3).
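The rule of thumb above (>30% sequence identity over a region longer than 100 aa) can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the two input strings are assumed to come from an existing pairwise alignment (with '-' marking gaps), and counting identity over gap-free columns is one of several possible conventions.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (percent identity, number of aligned residue pairs).

    Both strings are assumed to come from the same pairwise alignment,
    so they have equal length; '-' marks a gap. Columns with a gap in
    either sequence are ignored (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0, 0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs), len(pairs)


def likely_homologs(aligned_a, aligned_b, min_identity=30.0, min_length=100):
    """Apply the >30% identity over >100 aa rule of thumb from the text."""
    identity, length = percent_identity(aligned_a, aligned_b)
    return identity > min_identity and length > min_length


# Toy example: a 120-residue sequence aligned to itself passes the test,
# while a short identical fragment does not (too few aligned residues).
long_seq = "ACDEFGHIKL" * 12
print(likely_homologs(long_seq, long_seq))
print(likely_homologs("ACDEFGHIKL", "ACDEFGHIKL"))
```

Note that failing this test does not prove that two proteins are unrelated; it only means that homology cannot be inferred from this simple criterion.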

 

 

1.3 NN basics

As we said at the beginning, we are not going to elaborate on general aspects of NN such as learning algorithms, cost functions, etc. (for which, once again, we refer to Wu (10)). However, we need to introduce a few concepts and definitions that are relevant for the issues discussed in the Methods section. The basic elements of NN architecture are the layers. Although the minimal NN consists of only two layers (input and output, e.g. perceptrons), the most widely used NN are slightly more complicated, comprising at least three layers: input, hidden (one or more) and output (Fig. 1). The different layers are connected through junctions in a serial way: input nodes connect to hidden nodes (via I-H junctions) and hidden nodes connect to output nodes (via H-O junctions). The most common NN are fully connected, with each node in a layer being connected to all nodes in the next layer. The layers' constituents (nodes) and the junctions are nothing more than sets of numerical values. While the layer nodes describe different representations of the input data (either as given by the programmer, for input nodes, or as calculated by the NN, for hidden and output nodes), the junctions are the free parameters that have to be optimized during training. In a feed-forward NN, the input node values are always provided by the programmer and incorporate the mathematical representation of the input samples (i.e. the examples given to the NN to learn a particular prediction task). The hidden nodes constitute a set of numbers that is not interpretable by humans, each generated by the NN as a combination of input node and I-H junction values. The hidden layer is at the same time the way multi-layer NN learn very complicated classifications and patterns and the reason why NN are often referred to as black boxes. In brief, the NN transforms the input chosen by the programmer (input node values) into a new representation (hidden node values), which makes the classification of the input samples in the output space easier (where the output space is the one defined by the output nodes). The output layer is formed by values that represent the NN response to the initial input or the predicted classification of the input samples. In a supervised NN, the output values are used by the learning algorithm to determine how the NN free parameters (i.e. the junctions) have to be changed in order to improve the NN performance. In back-propagation, the NN update starts with the H-O junctions and ends with the I-H junctions (hence it goes backwards). The way this can be achieved is outlined in (10, 30).
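To make the flow from input nodes to hidden nodes to output nodes concrete, the following minimal sketch runs one forward pass through a fully connected network with 20 input, 10 hidden and 2 output nodes (the sizes used in Fig. 1). Several choices here are assumptions made only for illustration: the sigmoid activation, the absence of bias terms, the random junction values (an untrained network) and all function names.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard aa, one input node each


def sparse_encode(aa, on_value=100.0):
    """Sparse encoding: all input nodes are 0 except the one assigned to aa."""
    return [on_value if aa == x else 0.0 for x in AMINO_ACIDS]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(values, junctions):
    """Propagate node values through one set of junctions.

    junctions[i][j] connects node i of the current layer to node j of
    the next layer; no bias term is used (assumption of this sketch).
    """
    n_next = len(junctions[0])
    return [sigmoid(sum(v * junctions[i][j] for i, v in enumerate(values)))
            for j in range(n_next)]


def forward(aa, ih_junctions, ho_junctions):
    """Input -> hidden -> output for a single residue."""
    hidden = layer(sparse_encode(aa), ih_junctions)
    output = layer(hidden, ho_junctions)
    return hidden, output


# Random (untrained) junctions: 20x10 input-hidden and 10x2 hidden-output.
rng = random.Random(0)
ih = [[rng.uniform(-0.1, 0.1) for _ in range(10)] for _ in range(20)]
ho = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(10)]

hidden, output = forward("F", ih, ho)
n_free_parameters = 20 * 10 + 10 * 2  # junctions only
print(len(hidden), len(output), n_free_parameters)
```

Training would adjust the values in `ih` and `ho`; here they are left random, so the two output values carry no meaning beyond showing the data flow.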

 

 

 

 

Fig. 1: Schematic representation of a feed-forward NN that predicts the SS/non-SS state of an aa based solely on the aa type, using sparse encoding. In this example, the input is a phenylalanine residue (F); accordingly, the input vector is formed by 0s and by a single value of 100 at the node previously assigned to F (20 nodes overall). The input is translated into a new representation of 10 hidden nodes via 20x10 input-hidden (I-H) junctions. Finally, 10x2 hidden-output (H-O) junctions assign the hidden representation to one of two output classes (SS or non-SS).

 

 

 

 

 


2. MATERIALS

 

2.1 Databases

In this section we describe protein sequence, structure and function databases. We do not mean to provide a comprehensive list but to point to the most popular databases, which computational biologists use as a basis for developing their prediction methods. Web links are found in Table 1.

2.1.1 Sequence databases

NCBI GenBank, the EMBL nucleotide sequence database and the DNA Databank of Japan (DDBJ) are the world's largest repositories of DNA sequences. These databases are linked to one another and exchange data on a regular basis. They currently (April 2006) contain more than 60 million sequences. Automatic translation (and automatic annotation) for the subset of all coding sequences is found in TrEMBL (~3 million proteins). Finally, Swiss-Prot (28) is a manually annotated database containing information about protein function, domain structure, post-translational modifications, etc., comprising ~220,000 TrEMBL sequences. TrEMBL and Swiss-Prot have recently been incorporated into UniProt (31). Several other databases contain information about protein families, domains and functional sites; most of them are integrated into InterPro (32).

 

2.1.2 Structure databases

The Protein Data Bank (PDB (21)) is the primary source of information for protein structure. It contains all publicly available protein structures obtained by X-ray, NMR or electron microscopy techniques (~35,000, of which ~14,000 are unique at 95% sequence identity). Protein structures have been the object of several classification attempts. The most used schemes are SCOP {Andreeva 2004 NAR 32} and CATH (33), which classify PDB proteins in a hierarchical fashion, starting with overall secondary structure composition (only alpha, only beta, mixed alpha-beta) and ending with closely related homologous proteins.

 

2.1.3 Function databases

As we said in the Introduction, proteins that are part of Swiss-Prot (28) are annotated for known functional and structural features. The Gene Ontology (GO (34)) project describes the function of gene products according to three predefined classification schemes that encompass biological processes, cellular components and molecular functions.

 

2.1.4 Non-redundant and sequence-unique databases

Redundancy in sequence databases is a very common phenomenon. Redundancy reduction can aim at two distinct goals: 1) removing duplicate information (i.e. very similar sequences) thus effectively reducing the number of sequences one has to deal with; 2) creating a sequence-unique database for developing a de novo prediction method (i.e. a method that predicts a protein feature when no homologous template is available). A number of non-redundant databases are publicly available, as well as programs that can be used to create task-tailored unique datasets. The NCBI nr nucleotide and protein databases reduce redundancy at a very basic level by merging into a single entry proteins with the exact same sequence. UniProt maintains UniRef90 and UniRef50, two databases built using the CD-HIT algorithm (35) for clustering all sequences in UniProt at 90% and 50% sequence identity, respectively (e.g. no two sequences in UniRef90 have sequence identity >90%). nrdb90 (36) takes the union of several well-known databases (such as Swiss-Prot (28), TrEMBL, GenBank, PDB (21), etc.) and reduces redundancy at 90% identity.

 

Sequence-unique sets differ from generic non-redundant sets in that they are meant to maximally reduce the presence of homologous proteins (i.e. proteins with related structural and functional features). In practice, no approach exists that can guarantee that two proteins in a dataset are not homologous, but several available methods manage to successfully remove the most trivial correlations. It should also be noted that criteria to define uniqueness can vary according to the feature under consideration. Different thresholds or different measures may be needed to define a sequence-unique set with respect to e.g. structure (37) or sub-cellular localization (38). As far as structure is concerned, the PDB periodically updates several non-redundant sets (at different levels of sequence identity), created by using CD-HIT (35) or BLASTClust. The EVA server (39) maintains a continuously updated subset of sequence-unique PDB chains (no pair of proteins in this set has an HSSP-value above 0 (37, 40); note that, in this case, uniqueness has been specifically defined as related to structural similarity). Programs that can be used to create personalized sequence-unique datasets are, for example, CD-HIT (35) and UniqueProt (41).
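The general idea behind building a sequence-unique set can be illustrated with a simple greedy filter: a protein is kept only if it is not related to any protein kept so far. The sketch below is not how CD-HIT or UniqueProt work internally; `greedy_unique` and `toy_related` are hypothetical names, and the similarity test is a deliberately naive placeholder (ungapped comparison of left-aligned sequences) that should be replaced by a proper alignment-based measure such as the HSSP-value mentioned above.

```python
def greedy_unique(sequences, are_related):
    """Keep a sequence only if it is unrelated to every sequence kept so far.

    sequences: list of (name, sequence) pairs, processed in the given order.
    are_related: function (seq_a, seq_b) -> bool supplied by the caller.
    """
    kept = []
    for name, seq in sequences:
        if not any(are_related(seq, kept_seq) for _, kept_seq in kept):
            kept.append((name, seq))
    return kept


def toy_related(a, b, threshold=0.8):
    """Placeholder similarity test (assumption of this sketch).

    Compares the two sequences position by position without gaps and
    calls them related if the fraction of identical positions, relative
    to the shorter sequence, exceeds the threshold.
    """
    n = min(len(a), len(b))
    if n == 0:
        return False
    same = sum(x == y for x, y in zip(a, b))
    return same / n > threshold


# Toy example: p2 differs from p1 at one position out of ten and is dropped.
proteins = [
    ("p1", "ACDEFGHIKL"),
    ("p2", "ACDEFGHIKV"),
    ("p3", "WWYYPPGGSS"),
]
unique = greedy_unique(proteins, toy_related)
print([name for name, _ in unique])
```

The result of a greedy filter depends on the order in which proteins are processed, so it can be worth sorting the input first (for example, placing the best-resolved structures at the top).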

 


3. METHODS

 

The aim of this section is to show how NN can be used to predict protein structural and functional features (Fig. 2). As an example, we take the prediction of secondary structure (SS) elements in water-soluble globular proteins. We describe how relevant data can be extracted from the available databases, discuss the steps that have to be taken in order to avoid over-fitting, review encoding schemes for the most used protein sequence features and, finally, analyze ways to estimate the NN performance. Most of the issues that we discuss are not specific to the prediction task under consideration (i.e. SS), and understanding them can help in developing NN that attempt to predict any protein structural or functional feature.

 

 

 

 

Fig. 2: Developing a NN for predicting protein features. First, we have to create a dataset by selecting data from available databases and filtering out those occurrences that cannot be reliably labeled for the features we want to predict. After the remaining data have been labeled, we can proceed to split the dataset into several folds (3 as a minimum) in order to be able to train and assess the performance of the NN (see text). Once the datasets are defined, we have to decide which sequence (or structure) features are relevant for the classification (or regression) problem under study and how to encode them in the NN input (see Fig. 4). Next, we can train our NN and finally assess its performance taking advantage of the dataset splitting (see Fig. 5).

 

 

 

3.1 What is secondary structure?

As we mentioned in the Introduction, most proteins under physiological conditions are folded into a stable 3D structure. Although 3D protein structure is considerably diverse at the level of the whole chain, most structures can be seen as combinations of only a few recurrent local 3D motifs, usually referred to as SS. The most common among them are alpha-helices and beta-strands (Fig. 3). These structural motifs can be described by different properties. For example, helices and strands have characteristic backbone dihedral angle conformations (clearly visible in the so-called Ramachandran plot (42)). Energetically, alpha-helices are stabilized by hydrogen bonds between backbone amide and carbonyl groups of amino acids separated by 3 or 4 positions along the chain (Fig. 3). In contrast, beta-strands are stabilized by interactions with other strands, and form super-SS known as beta-sheets (Fig. 3); this allows them to bridge amino acids that are very distant along the protein sequence. Predicting alpha-helices and beta-strands is attractive since they are ubiquitous and, as mentioned previously, most 3D structures can be described as assemblies of these elements. Indeed, in the past, accurate predictions of helices and strands have led to important advances in database sequence search, fold recognition and de novo 3D structure prediction (43).

 

 

 

 

Fig. 3: Center: (tube representation of the protein backbone) we highlight alpha-helices (red) and beta-strands (yellow) on the backbone structure of a protein of unknown function (PDBid:1xne). Left: (ball and stick representation of the backbone atoms only, balls=atoms and sticks=covalent bonds, omitting hydrogen atoms) alpha helices are stabilized by hydrogen bonds (H-bonds) between the amide and carbonyl groups of the protein backbone. Right: (same representation as Left) beta strands are stabilized by inter-strand H-bonds between the amide and carbonyl groups of the protein backbone.

 

 

 

3.2 The prediction task

Although recently there have been attempts to predict SS in water-soluble globular proteins by using an increasing number of classes (helix, strand, turn, bend, etc.) (44), most available methods predict SS on a three-class basis (helix, strand, other). Here, for the sake of simplicity, we adopt a two-state classification. So, according to our definition a residue can be in either of two states: SS (helix or strand) or non-SS (other).
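The two-state labeling adopted here can be derived from a three-state assignment with a one-line mapping. In the sketch below, the one-letter codes (H for helix, E for strand, anything else for other) and the function name are assumptions made for illustration; the source of the three-state string is left open.

```python
# Assumed one-letter codes: H = helix, E = strand; every other symbol = other.
SS_STATES = {"H", "E"}


def to_two_state(three_state):
    """Map a per-residue three-state string to two-state labels.

    Returns one label per residue: 1 for SS (helix or strand),
    0 for non-SS (other).
    """
    return [1 if state in SS_STATES else 0 for state in three_state]


labels = to_two_state("LLHHHHLLEEEL")
print(labels)            # [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0]
print(sum(labels) / len(labels))  # fraction of residues labeled SS
```

Each residue yields one labeled sample, so the length of the label list equals the length of the protein.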

 

 

3.3 Dataset

3.3.1 Selecting protein sequences from the PDB

For developing a NN that predicts SS in water-soluble globular proteins, we need a database whose sequences can be reliably annotated for the presence of helices and strands. In general, when predicting structural features, the choice falls on the PDB (21). That is, unless the specific feature we are interested in turns out to be underrepresented in this database to such an extent that more data are needed. If this is the case (and it is not for SS of water-soluble globular proteins, as we will see), we may want to also rely on lower resolution data (although this implies some risks, as shown for the use of low-resolution data in transmembrane helix predictions (46)). As of April 2006, the PDB contains more than 36,000 protein structures (note that all PDB statistics reported in this paper refer to the PDB as of April 2006); however, not all structures are suitable for inclusion in our dataset. First, we want to remove all proteins that are not water-soluble or globular, such as membrane and coiled-coil proteins (<1% of the PDB). Second, we need to decide which resolution to allow for structures that are part of our dataset. In particular, given that SS is best assigned based on atomic-level structural information, we may want to consider only structures having high resolution. This would, however, automatically exclude all NMR structures, which constitute about 1/7 of the PDB or ~5,000 structures. In fact, as we mentioned in the Introduction, these structures have no well-defined associated resolution value. The question is then whether we want to favor quantity or quality. Indeed, while some NMR structures may have unreliable atomic coordinates, many others will still produce reliable SS annotation. So, whether or not to include NMR structures depends critically on the size of the original, un-filtered database. In general, considering only PDB structures with resolution values below 3.0 Å (note that, given the way resolution is measured, low values indicate high resolution) would return ~30,000 structures (i.e. almost all X-ray structures in the PDB); a slightly more conservative threshold, 2.5 Å, would still provide ~25,000 proteins (for PDB statistics go to http://www.rcsb.org/pdb/static.do?p=general_information/pdb_statistics/index.html). In the case of SS, each of these proteins would provide from tens to thousands of samples (since one residue=one sample), with a good balance between positives and negatives (with about 50% of all residues being found in helices or strands {Przybylski and Rost 2005}). Although the upcoming sections will show that these numbers do not correspond to the actual number of proteins that can be used for training (due to the need to split the dataset into several folds and to reduce redundancy or create a sequence-unique set), we will see that the PDB is large enough for developing a SS predictor that uses hundreds of nodes for describing the NN input (Section 3.6).
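The filtering decisions discussed above (protein type, experimental method, resolution threshold) can be summarized in a small selection function. The record layout used here (dictionary keys, method names) and the example entries are entirely hypothetical; real PDB metadata would first have to be parsed into such records.

```python
def select_structures(entries, max_resolution=2.5, include_nmr=False):
    """Return the ids of entries that pass the filters described in the text.

    entries: list of dicts with hypothetical keys 'id', 'method',
    'resolution' (in Angstrom, None if undefined), 'membrane' and
    'coiled_coil' (booleans).
    """
    selected = []
    for entry in entries:
        # Keep only water-soluble globular proteins.
        if entry["membrane"] or entry["coiled_coil"]:
            continue
        if entry["method"] == "X-ray":
            # Low values indicate high resolution.
            if (entry["resolution"] is not None
                    and entry["resolution"] < max_resolution):
                selected.append(entry["id"])
        elif entry["method"] == "NMR" and include_nmr:
            # NMR structures have no well-defined resolution value.
            selected.append(entry["id"])
    return selected


# Made-up entries for illustration only.
entries = [
    {"id": "xray_hi", "method": "X-ray", "resolution": 1.8,
     "membrane": False, "coiled_coil": False},
    {"id": "xray_lo", "method": "X-ray", "resolution": 3.2,
     "membrane": False, "coiled_coil": False},
    {"id": "nmr_1", "method": "NMR", "resolution": None,
     "membrane": False, "coiled_coil": False},
    {"id": "memb_1", "method": "X-ray", "resolution": 2.0,
     "membrane": True, "coiled_coil": False},
]
print(select_structures(entries))                    # quality first
print(select_structures(entries, include_nmr=True))  # quantity first
```

The `include_nmr` switch makes the quantity-versus-quality choice explicit, so that both variants of the dataset can be built and compared.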

 

3.3.2 Avoiding over-fitting

NN, like other learning algorithms, may suffer from under- or over-fitting. Both phenomena cause the NN to under-perform on new, previously unseen data (i.e. data not contained in the training set), and both are closely related to the number of free parameters (NFP) contained in the NN. If the NFP is too low, the NN may be unable to efficiently classify the training data (under-fitting); on the other hand, if the ratio between NFP and the number of training samples is high, the NN will end up fitting the training data so well as to eventually hamper its capability of generalizing to new ones (over-fitting). Under-fitting can be prevented by steadily increasing the NFP (typically, by increasing the number of hidden nodes) until improvement of the NN performance becomes insignificant. Avoiding over-fitting is not as easy; indeed, just looking at the performance of the NN on the training set will not provide us with the information we need on the diminishing generalization power of the NN. In general, having a low ratio between the NFP and the number of training samples (<1/10) helps but cannot guarantee that there is absolutely no over-fitting. In computational biology, the most popular way to tackle this problem is to use stop training (other techniques exist, such as weight decay, Bayesian learning, etc.). Stop training consists of splitting the dataset into two folds, training on the first and using the second (cross-training set, from now on) to check if and when the NN enters an over-fitting regime. In practice, training is stopped when the performance on the cross-training set starts decreasing, although it is best to continue to train until it is clear that the performance is really deteriorating and that we are not in the presence of a fluctuation. It is important to stress that, even when performing stop training, the ratio between NFP and the number of training samples should not be >>1/10. In fact, when the ratio is that high, the input space is sampled so sparsely that new (cross-training) samples will often fall far from known training samples, and hence it will be intrinsically difficult for the NN to predict them correctly (i.e. to generalize to new data). In conclusion, in order to obtain a close-to-optimal performance of the NN we need to both keep under control the ratio between NFP and training samples and perform stop training. Finally, note that for a correct evaluation of the NN performance the two sets introduced here are not sufficient. As we will see in Section 3.7, performance evaluation calls for a minimum three-fold split of the original dataset.
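Both safeguards discussed in this section, the NFP-to-samples ratio and stop training, are easy to express in code. The sketch below is one possible reading, not a prescribed procedure: the function names are hypothetical, the NFP counts junctions only (no bias terms), and the `patience` parameter (how many epochs without improvement we tolerate before stopping) is an assumption introduced to distinguish a real deterioration from a fluctuation.

```python
def number_of_free_parameters(n_input, n_hidden, n_output):
    """NFP of a fully connected three-layer NN, counting junctions only."""
    return n_input * n_hidden + n_hidden * n_output


def stop_training_epoch(cross_scores, patience=3):
    """Return the epoch at which the NN should be saved.

    cross_scores: performance on the cross-training set after each epoch
    (higher is better). Scanning stops once `patience` consecutive epochs
    have passed without improving on the best score seen so far.
    """
    if not cross_scores:
        raise ValueError("need at least one cross-training score")
    best_epoch, best_score = 0, cross_scores[0]
    for epoch, score in enumerate(cross_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # performance has really deteriorated
    return best_epoch


# Ratio check for a 20-10-2 network trained on 5,000 residues.
nfp = number_of_free_parameters(20, 10, 2)
print(nfp, nfp / 5000 < 1 / 10)

# The dip at epoch 3 is tolerated as a fluctuation; training stops
# after epochs 5-7 fail to improve on epoch 4.
scores = [0.60, 0.65, 0.70, 0.69, 0.72, 0.71, 0.70, 0.69, 0.75]
print(stop_training_epoch(scores))  # 4
```

With a larger `patience` the late improvement at the last epoch of this toy curve would have been found, which illustrates that the choice of how long to wait is itself a trade-off between training time and the risk of stopping on a fluctuation.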

 

Two other important issues concern the relative size of the datasets and their protein composition. As for size, we have to satisfy two conflicting constraints. On one hand, we would like to train the NN on as many samples as possible. On the other, the cross-training set needs to be large enough to represent our sample population well. In practice, the cross-training set is usually taken as a fraction of the training set (1/3, 1/5, 1/10 etc., see also Section 3.7 below). More complicated is the issue of how to assign proteins to each of the two datasets (dataset composition). Since the cross-training set has to simulate a set of data the NN has not seen before, a first obvious requirement is that it shares no sequence with the training set. But is this enough? As we discussed in the Introduction, proteins are subdivided into families (according to their evolutionary origin) and proteins in the same family (homologs) share many structural and functional features. So, the question is whether it is a good idea to perform stop training on a set that contains proteins homologous to some of the training sequences. The answer really depends on what we are trying to predict. In general, we have to keep in mind that the composition of the cross-training set relative to the training set should mirror as closely as possible the composition of new un-annotated data relative to the training set. As a consequence, if we believe that the range of applicability of our method will extend to the prediction of proteins with a known annotated homolog (i.e. a homolog for which the feature that we intend to predict in the target protein is known), it is correct to include in the cross-training set proteins homologous to the training sequences. For example, it has been shown that sub-cellular localization is poorly conserved in homologous proteins (38). Hence, predicting such a feature is of value even when a homolog of known localization is available. On the other hand, many structural features (such as SS) are found to be significantly correlated in homologous proteins, with the structure of the homolog often providing the best guess for that of the target (although predictions based on machine learning approaches can still be useful in regions in which the similarity between the two homologs is low). In practice, most NN methods in computational biology are optimized for producing de novo predictions, i.e. predictions on proteins with no known annotated homolog. This is a realistic case; in fact, 20-30% of all open reading frames in every newly sequenced genome encode such proteins (47, 48). Now, if our goal is to predict protein features de novo, we have to build the cross-training set so that none of its proteins has a close homolog in the training set. The way this is achieved is described in the following section.

 

3.3.3 Sequence-unique sets

Reduction of homology redundancy in a dataset is achieved by creating sequence-unique sets. The term sequence-unique refers to the property of these sets of having a single (unique) representative for each protein sequence family, i.e. for each group of homologous proteins defined through sequence similarity. In other words, if the cross-training set is unique with respect to the training set, then none of its proteins has a close relative in the training set. So, using such a set should provide a good estimate for the performance of the method in a no-available-homolog situation and will eventually constitute a lower bound for the performance in cases in which the target protein has detectable homology to a protein of the training set. So much for the presence of inter-set homologs. How should intra-set homology be treated instead? Training with more than one representative for each protein family is somewhat similar to jittering. Jittering is a technique that increases the number of training samples by creating artificial samples that are small variations of the real ones. This is exactly the effect of adding homologs to the training set, with the advantage that homologs are real data and, thus, the class they belong to is generally known (i.e. they do not represent noise). Jittering samples instead, being virtual data, are often noisy with respect to class assignment. In conclusion, using groups of homologous proteins in training may be advisable. However, it is important to realize that some protein families are larger than others and that, additionally, protein databases are biased towards particular types of proteins. As a consequence, just adding all members of the families represented in the training set can lead to a biased predictor (that is, biased towards the families with more members). One way to solve this problem is to balance the presence of homologs between the different families, by limiting the maximal number of members allowed for each family.
Similar considerations hold and are even more relevant when we talk about proteins in the cross-training set: the presence of large families could bias the evaluation of the method's performance, so homologs in this set should be added with great care. In fact, since there is no clear advantage in adding homologs to the cross-training set, our suggestion is not to add them at all.
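
The balancing strategy described above can be illustrated by a minimal Python sketch that caps the number of members admitted from each family. The function name, the mapping from protein identifiers to family labels and the toy data are our own assumptions for illustration; in practice, family assignments would have to come from a sequence clustering procedure.

```python
from collections import defaultdict

def cap_family_members(protein_to_family, max_per_family):
    """Keep at most max_per_family proteins from each family.

    Proteins are visited in insertion order, so the first members
    encountered for each family are the ones that are kept.
    """
    kept = []
    counts = defaultdict(int)
    for protein, family in protein_to_family.items():
        if counts[family] < max_per_family:
            kept.append(protein)
            counts[family] += 1
    return kept

# Hypothetical toy data: the "globin" family is over-represented.
toy = {"p1": "globin", "p2": "globin", "p3": "globin", "p4": "kinase"}
training = cap_family_members(toy, max_per_family=2)
print(training)
```

Setting max_per_family=1 reduces the set to one representative per family, which is the choice suggested above for the cross-training set.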

In the Materials section we discuss where to find ready-to-use sequence-unique sets and programs that can be used to create ad hoc sequence-unique sets.

Let us now go back to the problem of predicting SS. As we said previously, if we take into consideration all PDB structures having resolution lower than 3 Å, we retrieve about 30,000 structures. If we now make this dataset sequence-unique by using one of the criteria described in the Notes (for example, requiring that the HSSP-distance (37, 40) between two proteins in the set be always lower than 0), we are left with about 2,500 proteins. This is the set from which we have to generate the training and cross-training data.

 

 

3.4 Data labeling

Another fundamental step in the process of building a prediction method is the labeling of the samples. Problems related to sample labeling may be of different nature. On one side, the databases computational biologists rely upon are not devoid of errors (i.e. mis-annotations) (49, 50). This problem can be mitigated by an accurate selection of the database used (see Notes for examples of the most used and reliable databases) and by picking the most reliable annotations within those databases, if this kind of information is available (for example, Swiss-Prot includes both experimentally annotated proteins, i.e. more reliable, and proteins annotated by homology, i.e. less reliable). Once we believe we have reliable data annotation, we have to decide what kind of classification we want to adopt. In general, this choice is a trade-off between the desire to have an accurate feature definition (many classes) and the need to retrieve enough samples for each feature class. For example, when predicting metal binding sites in proteins, we might want to use one class for each metal, since each of them has different protein binding characteristics. In practice, due to the lack of a sufficient number of examples for most metals, we would probably be forced to group them into larger classes (e.g. transition, alkali, alkaline-earth metals, see Passerini et al. {Passerini et al. in press}).

 

As we said previously, we opt for a two-class labeling of the data (SS, non-SS). How are we going to assign our samples to one class or the other? For SS, assignment has to be done at the residue level (in other cases, e.g. when predicting function or sub-cellular localization, the assignment is done on a per-protein basis). SS can be defined in many different ways and, as a consequence, different criteria may provide different data labeling results. The most popular automatic methods for SS assignment are DSSP (51) and STRIDE (52) (for other methods, see {Przybylski and Rost 2005}). DSSP assigns a residue to one of 8 classes based on the evaluation of the interaction energies between backbone atoms; the classes are: H = alpha helix, B = residue in isolated beta-bridge, E = extended strand, G = 3-helix (3/10 helix), I = 5-helix (pi helix), T = hydrogen bonded turn, S = bend, and an eighth class for all other residues. STRIDE utilizes backbone angles and empirically derived energies for the hydrogen bonds. If we were to use e.g. DSSP, we would reduce the 8 classes to two: SS (H and E) and non-SS (all the others). (Note that classes G (3/10 helices), I (pi helices) and B (beta-bridges) could also be part of SS). By running DSSP on all proteins of our dataset, we can attach a SS or non-SS label to each of the residues.
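
The reduction from 8 DSSP classes to two labels can be sketched in a few lines of Python. This is only an illustration: the function name is ours, the input is assumed to be a string with one DSSP letter per residue, and we assume that the "other" class is written as "-" (DSSP output itself uses a blank for it).

```python
# Strict definition used in the text: only H and E count as SS.
SS_STATES = {"H", "E"}

def reduce_dssp(dssp_string, ss_states=SS_STATES):
    """Map a per-residue DSSP string to labels: 1 = SS, 0 = non-SS."""
    return [1 if state in ss_states else 0 for state in dssp_string]

# Hypothetical DSSP assignment for a 15-residue fragment.
labels = reduce_dssp("--HHHHTT-EEEE-S")
print(labels)

# Alternative, more permissive definition that also counts G, I and B as SS.
permissive = reduce_dssp("GIB", ss_states={"H", "E", "G", "I", "B"})
```

Whichever reduction is chosen, it has to be applied identically to the training, cross-training and validation data.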

 

 

3.5 Encoding protein sequence into the NN input

The primary sequence of a protein of length L is a series of L letters extracted from the 20-character alphabet of the natural aa. In the following paragraphs we discuss how information from the protein sequence can be encoded into the input of a NN to predict some structural or functional feature (here, SS). The first thing we need to decide is what kind of primary sequence information we want to use; the second is how we are going to feed it into the NN input. It is difficult to over-emphasize the importance of these two steps; indeed, much of the success of a method in predicting a desired feature depends on these decisions.

One crucial aspect of encoding is that similar vectors must represent similar input samples (according to any a priori idea of similarity that we may have for the samples), while dissimilar vectors must stand for dissimilar samples. This may appear as trivial advice; however, as we will see soon, it can be very easy to introduce unwanted similarities (or correlations) between unrelated input samples.

3.5.1 Amino acid type

What is it that determines the propensity of an aa to be in a SS conformation? The most trivial answer is: the residue side chain (i.e. the aa type). Indeed, the conformational space accessible to the backbone of each aa depends critically on its side-chain, with different aa having different constraints. Since one way to define SS is by specifying the backbone conformation, it is reasonable to think that different aa types will correspond to different SS propensities. This is confirmed by a simple analysis of aa relative frequencies in SS elements (53).

There are several ways in which the aa type can be translated into a numeric entry for a NN. One of the most used in computational biology is the so-called sparse or one-hot encoding. It consists in storing the aa type for a single sequence position in a 20-element vector, whose elements represent the 20 possible aa side chains. The vector is ordered, so that each aa is assigned a given vector element. For example, in order to input into the NN an aspartate for a given sequence position (D, Fig. 4A), we need to build the following vector: 0,0,...,1,...,0; where 1 corresponds to the D node and 0s to the nodes coding for the other aa. Note that the vector looks the same no matter what the position of D along the sequence is. Also, this particular choice of values, 1 and 0, has no special meaning and we could as well assign 0 to the D node and 1 to the others or use 100 instead of 1. The only important thing is that the notation be consistent throughout all the samples. Sparse encoding has proved to be efficient and is the most widely used scheme for feeding a NN with aa type information. In particular, it allows a straightforward transition from aa type to evolutionary profile encoding, as discussed in Section 3.5.3. One problem, though, is that it is very expensive in terms of the number of input nodes used. For inputting the aa type at a single sequence position, we have to use 20 nodes and if we want to use information from neighboring residues as well (see Section 3.5.2 below), we will easily need hundreds of nodes. This is a problem, since it significantly increases the NFP of the NN and hence the risk of over-fitting. In fact, when using a fully connected NN, each additional input node needs to be connected via new junctions to all hidden nodes. A less expensive type of encoding is obtained by using a truly binary representation for the aa alphabet. Under this scheme, by allowing more than one node at the same time to be equal to 1 (acceptable inputs are e.g. 1,0,0,0,0; 0,1,0,0,0; 1,1,0,0,0; 1,0,1,0,0; etc.), all we need in order to describe the 20 aa are 5 nodes. However, this type of representation may be less efficient than sparse encoding (see, for nucleotide encoding, (54)). A different approach consists in redefining the aa alphabet. Indeed, aa can be grouped into classes of similarity according to their physico-chemical properties. A classification comprising 4 classes may include, for example: hydrophobic, aromatic, polar and charged. By using this new alphabet (or a similar one), we would only need 4 input nodes for each sequence position, thus trading the loss of part of the information contained in the 20-letter alphabet for a smaller NFP. This scheme can also be easily extended for use with evolutionary profiles. Finding the optimal aa alphabet (i.e. the most efficient in terms of information content) is a challenge that has been addressed by several groups in the past (see, for example, (55-57)).
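
To make the difference between the 20-node sparse encoding and the 5-node binary encoding concrete, here is a minimal Python sketch. The alphabetical ordering of the aa and the particular binary code (position in the alphabet plus one, written in base 2) are assumptions chosen for illustration; any fixed ordering and any one-to-one code would serve equally well.

```python
# Assumed fixed ordering of the 20 natural aa (one-letter codes, alphabetical).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(aa):
    """Sparse encoding: 20 nodes, a single 1 at the position assigned to aa."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(aa)] = 1
    return vec

def binary5(aa):
    """Binary encoding: 5 nodes holding (index + 1) in base 2, lowest bit first.

    Adding 1 to the index ensures that no aa is encoded as all zeros.
    """
    code = AMINO_ACIDS.index(aa) + 1
    return [(code >> bit) & 1 for bit in range(5)]

print(one_hot("D"))   # a single 1 at the third position
print(binary5("D"))   # two nodes are 1 at the same time
```

Note that in the binary scheme two unrelated aa may share several active nodes, which is one way of introducing the unwanted similarities between input vectors discussed at the beginning of Section 3.5.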


Fig. 4: Encoding information from the protein sequence into a NN. A) Sparse encoding for the aa type: we use 20 nodes, one for each of the 20 natural aa. Each residue is assigned a specific position in the vector that is fixed and does not change from sample to sample. For an aspartate D, the input is a vector with nineteen 0s and a single 1 value corresponding to the D element (or node). B) Sparse encoding for the aa type of neighboring residues. In this case, if we consider a window of size N around the central D residue, we will have a vector with 20*N elements (or N vectors with 20 elements each). Note that the light blue separators in the figure do not represent nodes; they are separators that we added in order to show that the final input vector is obtained by joining the N vectors that code for each single position within the window. C) Encoding evolutionary profiles for neighboring residues using frequencies of occurrence in a MSA. The frequencies at each sequence position are calculated starting from a MSA. The input node values are no longer only 0s and 1s but the frequencies themselves.

 

 

 

3.5.2 Windows: using information from the aa neighborhood.

The probability that a certain aa adopts a SS conformation depends not only on its individual propensity for a particular SS but also on that of its neighbors. Indeed, the conformational space that is accessible to the backbone of a di-peptide or a tri-peptide depends on the interactions between the constituent aa {ADD cite paper Pollastri}. Another argument in favor of using information from neighboring aa is that SS are not formed by isolated aa. Indeed, helices and strands comprise a minimum of 3-4 aa and can actually span tens of aa (with helices longer on average than strands). Thus, we assume that for an aa to be in SS, it is necessary that at least some of its neighbors, preceding or following, also have a good SS propensity; this propensity should be identifiable from the aa composition at those positions. We can input this information into the NN by simply extending the notation devised for the single position input. If we are using sparse encoding, we will create an ordered vector with 20(2w+1) elements (with w the window half-length, which extends left and right of the aa whose SS we want to predict, Fig. 4B). In other words, every residue in the window will take up 20 input nodes. When using windows, there is one additional detail we have to take into account: N- and C-terminal residues. In fact, suppose that we use a 3-residue window (w=1) and that the residue we are trying to predict (let us call it residue i) is the first N-terminal residue; what values are we going to enter into the 20 nodes which are meant to contain information about the virtual position i-1 (virtual because it extends beyond the actual protein sequence)? One possible choice would be to leave the 20 nodes empty (i.e. all zero); however, in so doing we would be using the nodes, which are meant to represent the residue type, for a completely different task (i.e. to tell the NN that the central residue in the window is close to the N- or C-terminus).
As we already said, it is always better not to introduce unwanted correlations between the input features if this can be avoided with little harm. A safer, if slightly more parameter-expensive, procedure is to add one node per position, which takes the value 0 when the encoded position corresponds to an actual residue, and 1 when it is a virtual position. Indeed, this is the most common way N- and C-terminal residues are encoded in NN predictors.
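
The window encoding with one additional "virtual position" node per window position can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name, the alphabetical aa ordering and the placement of the extra node at the end of each 21-node block are our own assumptions.

```python
# Assumed fixed ordering of the 20 natural aa (one-letter codes, alphabetical).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_window(sequence, i, w):
    """Encode positions i-w .. i+w with 21 nodes each.

    Nodes 0-19 of each block hold the sparse encoding of the aa type;
    node 20 is 1 only for virtual positions beyond the sequence ends.
    """
    vector = []
    for pos in range(i - w, i + w + 1):
        nodes = [0] * 21
        if 0 <= pos < len(sequence):
            nodes[AMINO_ACIDS.index(sequence[pos])] = 1
        else:
            nodes[20] = 1  # virtual position: no residue type is set
        vector.extend(nodes)
    return vector

# First residue of a toy sequence, 3-residue window (w=1).
x = encode_window("DAK", i=0, w=1)
print(len(x))  # 21 * (2*1 + 1) nodes
```

For the first residue, the block for position i-1 contains only the virtual-position flag, so the 20 aa-type nodes keep their original meaning.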

 

3.5.3 Using evolutionary information.

As we mentioned in the Materials (Section 1.2.4), given a protein, we can generally find a certain number of sequences that are homologous to it. Now, given a multiple alignment of sequences belonging to the same protein family, we will find that different positions along the alignment will have different degrees of conservation among the aligned proteins. Since homologous proteins share similar structural and functional features, knowledge of the particular aa mixture allowed at a specific position can provide important information for many prediction tasks. In fact, the residue that we see at a given position in a protein sequence has to be regarded as one of the possible occurrences that are compatible with the conservation of the protein structure and function in that family. As a consequence, it is generally a big advantage to be able to include evolutionary information into the input of a NN. Historically, the use of this information in NN was first introduced for predicting SS, improving performance considerably (58). Since then, it has been applied to a vast variety of problems, often proving to be a fundamental ingredient for obtaining better performances.

 

In order to use evolutionary information in a NN, it is first necessary to produce a list of homologs for the protein that is being predicted. The most popular program that performs this task, by searching available sequence databases (such as Swiss-Prot (28) or TrEMBL), is PSI-BLAST (29). Besides providing a list of homologs (if they exist), PSI-BLAST also outputs i) the sequence alignments, ii) a position-specific scoring matrix that represents the aa substitution scores at each position in the protein family and iii) the frequency of occurrence of the 20 aa at each given position in the alignment. These data can be used to input information from the family multiple sequence alignment into a NN. The first and perhaps most straightforward approach consists in directly inputting into the NN the values of the PSI-BLAST position-specific scoring matrix. When using sparse encoding, this translates into feeding each of the 20 nodes representing the aa type at a given position with the values found in the column of the scoring matrix that corresponds to that position (i.e. the PSI-BLAST calculated scores for the aa in the original sequence to mutate into one of the other 19 aa, or to be conserved). Another very popular strategy consists in using the frequency of occurrence of the aa at each sequence position. The frequencies (normalized to range e.g. between 0 and 1) feed a 20-element ordered vector (see Fig. 4C).
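
As an illustration of frequency-based profile encoding, the following Python sketch computes raw aa frequencies for each column of a small multiple sequence alignment. It is a simplified stand-in under stated assumptions: the alignment is given as equal-length strings with "-" for gaps, gaps are ignored, and all sequences are weighted equally. It does not read PSI-BLAST output and does not reproduce the frequencies that PSI-BLAST reports.

```python
# Assumed fixed ordering of the 20 natural aa (one-letter codes, alphabetical).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(alignment):
    """Return one 20-element frequency vector per alignment column."""
    profile = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa in AMINO_ACIDS]  # drop gaps
        total = len(residues)
        profile.append(
            [residues.count(aa) / total if total else 0.0 for aa in AMINO_ACIDS]
        )
    return profile

# Hypothetical toy alignment of four homologs, three columns.
msa = ["DAK", "DGK", "EA-", "DAR"]
profile = column_frequencies(msa)
print(profile[0][AMINO_ACIDS.index("D")])  # fraction of D in the first column
```

Each 20-element vector replaces the 0/1 sparse vector for the corresponding sequence position, so the input layer keeps the same size.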

 

It is important to realize that the significance of quantities such as the calculated substitution scores or the aa frequencies in a family depends on the evolutionary spread of the proteins and on the number of sequences found in the multiple sequence alignment. To address the first issue, PSI-BLAST reports weighted frequencies instead of the actual frequencies from all aligned sequences, balancing the profile by reducing the weight of very similar aligned sequences. The second problem (dependency of profile significance on the number of aligned sequences) can be addressed by explicitly inputting into the NN the number of aligned sequences (by using a sparse encoding binning scheme, see Notes). In this way, we ask the NN to learn how to assign a different weight to conserved (or variable) positions in profiles generated from different numbers of sequences. In any case, when developing a method that relies upon evolutionary information, it is always important to report performance estimates as a function of the amount of evolutionary information available (e.g. as a function of the number of aligned sequences).

 

 

3.6 Final decisions on input features, NN architecture and training.

3.6.1 Building the NN.

At this point, we have all the elements we need to build a first simple NN for predicting protein SS. We can define, for example, a window of length 19 (w=9). Then, for each position in the window we can extract from the PSI-BLAST output the frequency of occurrence of each of the 20 aa at that position and feed these values into a corresponding number of input nodes (i.e. 19*20 = 380). Additionally, we want to use one node for each position in the window to account for N- and C-terminal aa. Hence, overall, the input nodes amount to 21*19 = 399 (although, to be precise, the central aa would not need the additional node since it cannot be virtual). How do we choose the number of hidden nodes (NHN)? The overall NFP in our NN is given by 399 times the NHN, plus the NHN times the number of output nodes. Say we choose to have only one output node (0=non-SS, 1=SS). Since our dataset includes about 2,500 proteins, most of them providing hundreds of samples, we expect the overall number of input samples to be of the order of 10^6. As we mentioned before, in order to reduce the risk of over-fitting, the ratio between the NFP and the number of training samples has to be lower than ~1/10 (even when performing stop-training), hence:

(399*NHN + NHN*1) / 10^6 < 1/10

and

NHN < 10^6 / (4*10^3)     or        NHN < 250

So any number between 5 and ~200 will do.
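
The arithmetic above can be checked with a short Python sketch. The function names are ours, and the NFP is counted exactly as in the text (I-H plus H-O junctions only, with no bias terms), which is an assumption about the architecture.

```python
def free_parameters(n_inputs, n_hidden, n_outputs):
    """NFP as counted in the text: I-H junctions plus H-O junctions."""
    return n_inputs * n_hidden + n_hidden * n_outputs

def hidden_node_bound(n_inputs, n_outputs, n_samples, samples_per_parameter=10):
    """Upper bound on NHN such that NFP / n_samples < 1 / samples_per_parameter."""
    return n_samples / (samples_per_parameter * (n_inputs + n_outputs))

n_inputs = 21 * 19          # 19 window positions, 20 aa nodes + 1 terminal node each
bound = hidden_node_bound(n_inputs, n_outputs=1, n_samples=10**6)
print(n_inputs, bound)      # NHN must stay below this bound

# With 200 hidden nodes the NFP-to-samples ratio is 0.08, i.e. below 1/10.
ratio = free_parameters(n_inputs, 200, 1) / 10**6
```

Because the number of samples is only an order-of-magnitude estimate, the bound should be read as a rough guide rather than a sharp limit.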

 

3.6.2 Balanced-training.

Often, when addressing a classification task, the different classes are unevenly distributed in the training set and the less populated classes (fewer samples) generally suffer from poorer performance. One way to counter the dataset imbalance is to perform balanced-training (58). Balanced-training consists in presenting the NN with the same number of samples from each of the prediction classes at each epoch. This implies either a re-sampling of the less populated classes or discarding samples from the more populated ones (or both). For example, in the classic three-class SS prediction problem (helix, strand, other), strands are generally under-represented in the training set and, as a consequence, the predictor's performance tends to be significantly lower for this class. It has been observed that using balanced-training can help improve the performance on strands (58). In the case of two-class SS prediction (SS or non-SS), the two classes turn out to be fairly well balanced, so that balanced-training is not really relevant.
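
One possible way to assemble a balanced epoch is sketched below in Python. It implements the re-sampling variant only (the less populated classes are over-sampled with replacement up to the size of the largest class); the function name and the toy data are assumptions for illustration, and this is not necessarily the exact procedure used in (58).

```python
import random

def balanced_epoch(samples_by_class, rng):
    """Draw the same number of samples from every class for one epoch.

    Each class contributes as many samples as the largest class; samples
    are drawn with replacement, so smaller classes are over-sampled.
    """
    n_per_class = max(len(samples) for samples in samples_by_class.values())
    epoch = []
    for label, samples in samples_by_class.items():
        drawn = [rng.choice(samples) for _ in range(n_per_class)]
        epoch.extend((sample, label) for sample in drawn)
    rng.shuffle(epoch)  # avoid presenting the classes in blocks
    return epoch

# Hypothetical toy data: strands are under-represented.
toy = {"helix": [1, 2, 3, 4], "strand": [5, 6], "other": [7, 8, 9]}
epoch = balanced_epoch(toy, random.Random(0))
```

Drawing a fresh epoch at every training cycle lets the NN see all samples of the larger classes over time while keeping the class proportions equal within each epoch.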

 

We now move on to discuss performance estimate for a NN. As we will see, the way performance is estimated also affects the way we have to split our dataset.

 

 

3.7 Performance estimate

3.7.1 Hold out.

In the section on how to avoid over-fitting (Section 3.3.2), we discussed the notion of stop-training and the necessity to introduce a cross-training set. In this paragraph we want to address the issue of performance evaluation and the need to introduce one more dataset: the validation or test set. It is essential to realize that the performance of a NN-based prediction method can be correctly evaluated on an unseen dataset only if the NN parameters are frozen. In other words, all NN parameters have to be chosen before any performance assessment is carried out. Having said this, it is clear that two sets (training and cross-training) do not suffice. In fact, we use training for producing different models (or sets of NN parameters) and cross-training for choosing what we believe is the optimal model (Fig. 5). In other words, cross-training is an integral part of the parameter optimization process. Hence the necessity to introduce a third set, which is meant to be left untouched (unseen) until all parameters have been fixed. This set is subject to the very same restrictions that we discussed for the cross-training set. First, it must not contain exact copies of proteins found either in the training or in the cross-training set. Second, if what we wish to evaluate is the method's performance on de novo predictions, it should be sequence-unique with respect to the previous two sets. Third, it must be representative of the sample population. It is interesting to note that mistakes (non-uniqueness, composition bias) in defining training, cross-training and validation sets have quite different consequences. In fact, mistakes in selecting the training set can cause the method to be sub-optimal and hence to under-perform; mistakes in selecting the validation set most often cause an over-estimate of the method's performance.


Fig. 5: NN Training, Cross-training and Validation/Test. A) Training is performed up to saturation; B) Cross-training on a set of unseen data (see text) is used to estimate the epoch (i.e. the training parameters) granting the best generalization (less over-fitting); C) Validation/Test is performed on a third set of data, seen neither during training nor cross-training. Arrows underline the entire protocol: from B) we identify the best generalizing epoch; then, we pick from A) the corresponding parameters (junction values); finally, we run once the NN with those parameters on the Validation/Test set. The performance value obtained on this latter set is our estimate of the NN performance.


In conclusion, our original dataset has to be split into three sets (Fig. 5): training, cross-training and validation set. The first is used to learn the classification problem; the second to decide when the training process should be stopped; the third to test the method's performance. This methodology is usually referred to as hold-out (meaning that one set is held out from the training process and saved for validation purposes).
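
A hold-out split can be sketched as follows. This is a minimal illustration under stated assumptions: the split is done per protein (not per residue), the input list is assumed to be already sequence-unique, and the function name, the fractions and the fixed random seed are our own choices.

```python
import random

def hold_out_split(proteins, cross_fraction=0.1, validation_fraction=0.1, seed=0):
    """Split a list of proteins into training, cross-training and validation sets."""
    shuffled = list(proteins)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_validation = int(len(shuffled) * validation_fraction)
    n_cross = int(len(shuffled) * cross_fraction)
    validation = shuffled[:n_validation]
    cross_training = shuffled[n_validation:n_validation + n_cross]
    training = shuffled[n_validation + n_cross:]
    return training, cross_training, validation

# Hypothetical identifiers standing in for a sequence-unique protein set.
proteins = [f"prot{i}" for i in range(100)]
training, cross_training, validation = hold_out_split(proteins, 0.2, 0.2)
print(len(training), len(cross_training), len(validation))
```

Splitting by protein rather than by residue keeps all windows extracted from one protein in the same set, so that no protein contributes samples to more than one set.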

 

3.7.2 Cross-validation

Another way to fairly evaluate a NN is to use cross-validation. In N-fold cross-validation, the original dataset is split into N subsets. N-2 subsets are used for training; one each for cross-training and validation. In this case, the NN is trained N times (with N typically equal to 5 or 10) until all subsets have served at least once as training, cross-training and validation set (hence the name cross-validation). The estimate of the NN performance is taken as the average performance over the N validation sets. The advantage of cross-validation over hold-out is that it allows using most of the available samples. On the other hand, it is time consuming (training has to be performed N times). For these reasons, cross-validation is mostly used when the number of available samples is relatively low.
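
The rotation of subsets in N-fold cross-validation can be made explicit with a small Python sketch. The particular rotation used here (the subset following the validation subset serves as cross-training set) is just one possible assignment, chosen for illustration.

```python
def cross_validation_rounds(n_folds):
    """Yield (training_folds, cross_training_fold, validation_fold) for N rounds.

    In each round one subset is used for validation, the next one for
    cross-training and the remaining N-2 subsets for training.
    """
    for k in range(n_folds):
        validation = k
        cross_training = (k + 1) % n_folds
        training = [f for f in range(n_folds) if f not in (validation, cross_training)]
        yield training, cross_training, validation

rounds = list(cross_validation_rounds(5))
for training, cross_training, validation in rounds:
    print(training, cross_training, validation)
```

Over the N rounds, every subset is used exactly once for validation and exactly once for cross-training; the final performance estimate is the average over the N validation results.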

 

3.7.3 Measures.

Several measures have been devised to assess the performance of NN. For classification problems, it is useful to first define a few basic quantities: true positives (TP) or correctly predicted positive samples (in our case, TP are e.g. SS samples); true negatives (TN) or correctly predicted negative samples (non-SS samples); false positives (FP) or samples that we predict as positives but are in fact negatives; false negatives (FN) or samples that we predict as negatives but are positives. Using these quantities, we can calculate the true positive rate and the false positive rate (TPR and FPR, respectively), according to the following equations:

FPR=FP/(FP+TN)                                                    (1)

TPR=TP/(TP+FN)                                                    (2)

TPR can be plotted against FPR to create a receiver operating characteristic (ROC) curve (Fig. 6). In other words, ROC curves show the fraction of correctly predicted positives (relative to all positives) as a function of the fraction of negatives incorrectly predicted as positives (relative to all negatives). A ROC curve is obtained by setting on the output of the NN a threshold (T) that separates predicted positives and predicted negatives. By sliding T over the entire NN output range we can generate a performance curve on the ROC plane. As an example, assume that the NN produces output values between 1 and 0 and that we call 1 a positive and 0 a negative. For T=1, the number of predicted positives (i.e. predictions with a score >1) is zero and so are TP and FP. Thus, from Eq. 1 and Eq. 2, TPR and FPR will equal 0 (Fig. 6). As we lower the threshold, the number of TP and FP increases. The last point on the ROC curve is obtained by choosing T=0. In this case, no negatives are predicted (i.e. TN=0 and FN=0) and from (1) and (2) we have: TPR=1 and FPR=1 (Fig. 6). ROC curves can be used to compare a method with a random baseline or with other methods that address the same prediction task. A very popular performance measure is the area under the ROC curve (AUC): the higher the AUC the better the method, with the maximal AUC equal to 1. In some circumstances, we may be interested in comparing two predictors not on the entire range of TPR and FPR but, for example, only at low FPR (i.e. we may want our method to predict only a few positives but with very high accuracy). In this case, we can restrict the AUC comparison to values below a certain FPR (say, 0.05) or pick a single FPR value and compare the two predictors based on the corresponding TPR. In general, we always expect a method to be better than random. The performance of a random predictor, i.e. a predictor that generates true positives and false positives at the same rate, is represented in the ROC space by a straight diagonal line (Fig. 6, dashed line). So, in order for our method to be better than random, its AUC has to be higher than 0.5. Note that, although it is important that the NN performance be better than random, this comparison is rarely interesting. In fact, it is often very easy to perform better than random. For example, in a two-class prediction problem, where one class is more populated than the other, we can obtain a better than random prediction by simply predicting the samples to be always in the most populated class. So, in this case, we may want to assess the success of our NN not with respect to AUC=0.5, but to the AUC obtained by this majority class predictor.
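
The construction of a ROC curve by sliding the threshold, and the calculation of the AUC, can be sketched in plain Python as follows. The function names and the toy scores are assumptions for illustration; the AUC is obtained with the trapezoidal rule, and the threshold is placed just below each distinct output value in turn.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step.

    labels are 1 for positives and 0 for negatives. At each step all samples
    with a score at least as high as the current value are predicted positive.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: no predicted positives
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))  # Eq. 1 and Eq. 2
    return points

def auc(points):
    """Area under the ROC points, computed with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical NN outputs for three positives and three negatives.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(auc(points))
```

The first point is always (0,0) and the last (1,1), in agreement with the limiting cases T=1 and T=0 discussed above.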


Fig. 6: ROC plots. The ROC is a curve in the FPR-TPR space (FPR=false positive rate, TPR=true positive rate, see text). The ROC of a random predictor (i.e. a method that predicts true positives and false positives at the same rate) is represented by the diagonal connecting the points (0,0) and (1,1) (dashed line). A NN predictor is expected to be better than random (thick line). The area under the ROC curve (AUC) is a measure of the performance of the NN (shaded area). For a random predictor, AUC=0.5.


Other popular measures are the so-called Q measures. A generic Qk measure reads:

Qk = 100 * Σ(i=1,k) Ci / N                                (3)

where k is the number of classes being predicted, Ci is the number of correctly predicted samples in class i and N is the total number of samples (over all classes). For example, if we classify residues as SS or non-SS (two classes), we will have to calculate a Q2. Maximal and minimal values for Qk are 100 and 0, respectively (independently of the number of classes k).

It is important to remember that, in the presence of a strong imbalance in the population of the different classes (in terms of number of samples), using Q measures can be inappropriate. For example, if in a two-state classification task only 1% of the samples are labeled as positives, Q2 will by and large reflect the method's performance on the negatives: a method predicting all samples to be negatives will have a Q2 = 99. One solution is to introduce a per-class accuracy, which (for positives) is defined as: ACC=TP/(TP+FP). This is usually reported together with the TPR, also called coverage or recall in this context.
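
The imbalance problem can be reproduced with a short Python sketch that computes Qk (Eq. 3) together with the per-class accuracy and coverage. The function names are ours, and returning 0.0 when a denominator is zero (e.g. when no positives are predicted) is a convention we assume for illustration; strictly, the quantity is undefined in that case.

```python
def q_measure(observed, predicted):
    """Qk (Eq. 3): percentage of correctly predicted samples over all classes."""
    correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * correct / len(observed)

def accuracy_and_coverage(observed, predicted, positive=1):
    """Per-class accuracy TP/(TP+FP) and coverage TP/(TP+FN) for one class."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == positive and p == positive)
    fp = sum(1 for o, p in pairs if o != positive and p == positive)
    fn = sum(1 for o, p in pairs if o == positive and p != positive)
    acc = tp / (tp + fp) if tp + fp else 0.0   # assumed convention for 0/0
    cov = tp / (tp + fn) if tp + fn else 0.0
    return acc, cov

# 1% positives; a trivial method predicts every sample as negative.
observed = [1] + [0] * 99
predicted = [0] * 100
print(q_measure(observed, predicted))              # high Q2 ...
print(accuracy_and_coverage(observed, predicted))  # ... but no positive is found
```

Reporting the per-class pair alongside Q2 makes this kind of trivial predictor immediately visible.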

 

Additional measures commonly used for performance assessment in computational biology include, for regression problems, correlations. Often, other task-tailored performance measures are defined.

In conclusion, it has to be stressed that no measure can be singled out as the best measure; in fact, what is best depends on the task and on the datasets at hand.

 

3.7.4 Comparison between different methods.

There exist a few issues that have to be taken into account when comparing two methods that address the same prediction task. First, in order for the comparison to be fair, it is necessary that the two methods be compared on the same dataset. This dataset has to represent as well as possible the protein feature that is being predicted and must contain proteins that were not used for training either of the two methods. Second, as previously said, it is necessary to make sure that the presence of homologous proteins is not biasing the results of the assessment. As we have already learnt, these two goals are usually achieved by ensuring that the test set is sequence-unique and sequence-unique with respect to the training sets. For methods assessed on structural data (such as SS), the easiest way to satisfy all these conditions is to perform the evaluation on proteins whose structure was solved after both methods were developed and which are sequence-unique with respect to proteins previously present in the PDB. This is what experiments like EVA (39) and LIVEBENCH (59) have tried to implement.

 


 


References

 

 

1.    Blom, N., Hansen, J., Blaas, D., and Brunak, S. (1996) Protein Sci 5, 2203-16.

2.    Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. (1997) Int J Neural Syst 8, 581-99.

3.    Nielsen, H., Brunak, S., and von Heijne, G. (1999) Protein Eng 12, 3-9.

4.    Li, X., Romero, P., Rani, M., Dunker, A. K., and Obradovic, Z. (1999) Genome Inform Ser Workshop Genome Inform 10, 30-40.

5.    Sodhi, J. S., Bryson, K., McGuffin, L. J., Ward, J. J., Wernisch, L., and Jones, D. T. (2004) J Mol Biol 342, 307-20.

6.    Nair, R., and Rost, B. (2003) Proteins 53, 917-30.

7.    Emanuelsson, O., Nielsen, H., Brunak, S., and von Heijne, G. (2000) J Mol Biol 300, 1005-16.

8.    Reinhardt, A., and Hubbard, T. (1998) Nucleic Acids Res 26, 2230-6.

9.    Jensen, L. J., Gupta, R., Blom, N., Devos, D., Tamames, J., Kesmir, C., Nielsen, H., Staerfeldt, H. H., Rapacki, K., Workman, C., Andersen, C. A., Knudsen, S., Krogh, A., Valencia, A., and Brunak, S. (2002) J Mol Biol 319, 1257-65.

10.  Wu, C. H. (1997) Comput Chem 21, 237-56.

11.  Creighton, T. E. (1993) Proteins: Structure and Molecular Properties, W.H. Freeman, New York, New York.

12.  Dunker, A. K., Brown, C. J., Lawson, J. D., Iakoucheva, L. M., and Obradovic, Z. (2002) Biochemistry 41, 6573-82.

13.  Dunker, A. K., Cortese, M. S., Romero, P., Iakoucheva, L. M., and Uversky, V. N. (2005) FEBS J 272, 5129-48.

14.  Soto, C., Estrada, L., and Castilla, J. (2006) Trends Biochem Sci 31, 150-5.

15.  Carugo, O., and Argos, P. (1997) Protein Sci 6, 2261-3.

16.  Snyder, D. A., Bhattacharya, A., Huang, Y. J., and Montelione, G. T. (2005) Proteins 59, 655-61.

17.  Brenner, S. E. (2001) Nat Rev Genet 2, 801-9.

18.  Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., et al. (1995) Science 269, 496-512.

19.  Venter, J. C., Remington, K., Heidelberg, J. F., Halpern, A. L., Rusch, D., Eisen, J. A., Wu, D., Paulsen, I., Nelson, K. E., Nelson, W., Fouts, D. E., Levy, S., Knap, A. H., Lomas, M. W., Nealson, K., White, O., Peterson, J., Hoffman, J., Parsons, R., Baden-Tillson, H., Pfannkoch, C., Rogers, Y. H., and Smith, H. O. (2004) Science 304, 66-74.

20.  Tringe, S. G., and Rubin, E. M. (2005) Nat Rev Genet 6, 805-14.

21.  Berman, H. M., Battistuz, T., Bhat, T. N., Bluhm, W. F., Bourne, P. E., Burkhardt, K., Feng, Z., Gilliland, G. L., Iype, L., Jain, S., Fagan, P., Marvin, J., Padilla, D., Ravichandran, V., Schneider, B., Thanki, N., Weissig, H., Westbrook, J. D., and Zardecki, C. (2002) Acta Crystallogr D Biol Crystallogr 58, 899-907.

22.  Chandonia, J. M., and Brenner, S. E. (2006) Science 311, 347-51.

23.  Petrey, D., and Honig, B. (2005) Mol Cell 20, 811-9.

24.  Jacobson, M., and Sali, A. (2004) Annual Reports in Medicinal Chemistry 39, 259-76.

25.  Godzik, A. (2003) Methods Biochem Anal 44, 525-46.

26.  Watson, J. D., Laskowski, R. A., and Thornton, J. M. (2005) Curr Opin Struct Biol 15, 275-84.

27.  Whisstock, J. C., and Lesk, A. M. (2003) Q Rev Biophys 36, 307-40.

28.  Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M. C., Estreicher, A., Gasteiger, E., Martin, M. J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S., and Schneider, M. (2003) Nucleic Acids Res 31, 365-70.

29.  Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997) Nucleic Acids Res 25, 3389-402.

30.  Rost, B. (2003) in "Artificial intelligence and heuristic methods for bioinformatics" (Frasconi, P., Ed.), pp. 34-50, Amsterdam: IOS Press.

31.  Wu, C. H., Apweiler, R., Bairoch, A., Natale, D. A., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M. J., Mazumder, R., O'Donovan, C., Redaschi, N., and Suzek, B. (2006) Nucleic Acids Res 34, D187-91.

32.  Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., Bradley, P., Bork, P., Bucher, P., Cerutti, L., Copley, R., Courcelle, E., Das, U., Durbin, R., Fleischmann, W., Gough, J., Haft, D., Harte, N., Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McDowall, J., Mitchell, A., Nikolskaya, A. N., Orchard, S., Pagni, M., Ponting, C. P., Quevillon, E., Selengut, J., Sigrist, C. J., Silventoinen, V., Studholme, D. J., Vaughan, R., and Wu, C. H. (2005) Nucleic Acids Res 33, D201-5.

33.  Pearl, F., Todd, A., Sillitoe, I., Dibley, M., Redfern, O., Lewis, T., Bennett, C., Marsden, R., Grant, A., Lee, D., Akpor, A., Maibaum, M., Harrison, A., Dallman, T., Reeves, G., Diboun, I., Addou, S., Lise, S., Johnston, C., Sillero, A., Thornton, J., and Orengo, C. (2005) Nucleic Acids Res 33, D247-51.

34.  Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000) Nat Genet 25, 25-9.

35.  Li, W., Jaroszewski, L., and Godzik, A. (2001) Bioinformatics 17, 282-3.

36.  Holm, L., and Sander, C. (1998) Bioinformatics 14, 423-9.

37.  Rost, B. (1999) Protein Eng 12, 85-94.

38.  Rost, B., Liu, J., Nair, R., Wrzeszczynski, K. O., and Ofran, Y. (2003) Cell Mol Life Sci 60, 2637-50.

39.  Koh, I. Y., Eyrich, V. A., Marti-Renom, M. A., Przybylski, D., Madhusudhan, M. S., Eswar, N., Grana, O., Pazos, F., Valencia, A., Sali, A., and Rost, B. (2003) Nucleic Acids Res 31, 3311-5.

40.  Sander, C., and Schneider, R. (1991) Proteins 9, 56-68.

41.  Mika, S., and Rost, B. (2003) Nucleic Acids Res 31, 3789-91.

42.  Ramachandran, G. N., Ramakrishnan, C., and Sasisekharan, V. (1963) J Mol Biol 7, 95-9.

43.  Dunbrack, R. L., Jr. (2006) Curr Opin Struct Biol 16, 374-84.

44.  Pollastri, G., Przybylski, D., Rost, B., and Baldi, P. (2002) Proteins 47, 228-35.

45.  Karchin, R., Cline, M., Mandel-Gutfreund, Y., and Karplus, K. (2003) Proteins 51, 504-14.

46.  Chen, C. P., Kernytsky, A., and Rost, B. (2002) Protein Sci 11, 2774-91.

47.  Siew, N., and Fischer, D. (2003) Proteins 53, 241-51.

48.  Siew, N., and Fischer, D. (2003) Structure 11, 7-9.

49.  Kyrpides, N. C., and Ouzounis, C. A. (1998) Science 281, 1457.

50.  Iyer, L. M., Aravind, L., Bork, P., Hofmann, K., Mushegian, A. R., Zhulin, I. B., and Koonin, E. V. (2001) Genome Biol 2, RESEARCH0051.

51.  Kabsch, W., and Sander, C. (1983) Biopolymers 22, 2577-637.

52.  Frishman, D., and Argos, P. (1995) Proteins 23, 566-79.

53.  Chou, P. Y., and Fasman, G. D. (1974) Biochemistry 13, 211-22.

54.  Demeler, B., and Zhou, G. W. (1991) Nucleic Acids Res 19, 1593-9.

55.  Fan, K., and Wang, W. (2003) J Mol Biol 328, 921-6.

56.  Wang, J., and Wang, W. (1999) Nat Struct Biol 6, 1033-8.

57.  Chan, H. S. (1999) Nat Struct Biol 6, 994-6.

58.  Rost, B., and Sander, C. (1993) J Mol Biol 232, 584-99.

59.  Rychlewski, L., and Fischer, D. (2005) Protein Sci 14, 240-5.

 

 


Tables

 

Table 1: Some popular databases used by computational biologists for developing their prediction methods.

Database/Method                                          URL
GenBank                                                  http://www.ncbi.nlm.nih.gov/Genbank/
EMBL nucleotide sequence database                        http://www.ebi.ac.uk/embl/
DDBJ                                                     http://www.ddbj.nig.ac.jp/
Swiss-Prot & TrEMBL (28)                                 http://ca.expasy.org/sprot/
UniProt (31)                                             http://www.ebi.uniprot.org/index.shtml and mirror sites
UniRef                                                   ftp://ftp.expasy.org/databases/uniprot/uniref/
PDB (21)                                                 http://www.rcsb.org/pdb/Welcome.do
PDB non-redundant sets                                   http://www.rcsb.org/pdb/clusterStatistics.do
CD-HIT (35)                                              http://bioinformatics.ljcrf.edu/cd-hi/
UniqueProt (41)                                          http://cubic.bioc.columbia.edu/services/uniqueprot/
EVA (39) unique set                                      http://cubic.bioc.columbia.edu/eva/res/weeks.html
BLAST/PSI-BLAST (29)                                     http://www.ncbi.nlm.nih.gov/BLAST/
BLASTClust                                               http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html
SCOP {Andreeva A., Howorth and Murzin A.G. 2004 NAR 32}  http://scop.mrc-lmb.cam.ac.uk/scop/
CATH (33)                                                http://cathwww.biochem.ucl.ac.uk/latest/index.html
GO (34)                                                  http://www.geneontology.org/index.shtml

 

 

 



Contact:    admin@rostlab.org Version:    Jun 30, 2008