| Title: | Predicting protein structure and function through evolutionary information |
| Author: | Burkhard Rost, Jinfeng Liu, Dariusz Przybylski, Rajesh Nair, Kazimierz O. Wrzeszczynski, Henry Bigelow & Yanay Ofran |
| Quote: | In 'Chemoinformatics - From Data to Knowledge' J Gasteiger & T Engel (eds.), 2002, Wiley, 1789-1811 |
Predicting protein structure and function through evolutionary information
| 1 | CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 4 | Dept. of Pharmacology, Columbia Univ., 630 West 168th Street, New York, NY 10032, USA |
| 5 | Dept. of Physics, Columbia Univ., 538 West 120th Street, New York, NY 10027, USA |
| 6 | Dept. of Medical Informatics, Columbia Univ., 630 West 168th Street, New York, NY 10032, USA |
| * | Corresponding author: email = rost@columbia.edu URL http://cubic.bioc.columbia.edu/ Tel: +1-212-305-3773, fax: +1-212-305-7932 |
Book chapter for:
'Chemoinformatics - From Data to Knowledge' J Gasteiger & T Engel (eds.), Wiley
This article is published in (Wiley, 2002, pages) © copyright Wiley (2002). Wiley is the only authorized source. All copying of this article including placing on another website requires the written permission of the copyright owner.
The ultimate goal of protein structure prediction is to extend our knowledge and understanding of the structures and functions of proteins beyond that which is possible by experiment. Virtually all techniques, including 1D, 2D, and 3D structure prediction, and diverse kinds of function prediction use profiles rather than single sequences as the ‘information object’ for prediction. Database methods rely on structural information to evaluate the fitness of a protein sequence for a given structure according to a statistical model. Energetic methods derive predictions by calculating the fitness according to thermodynamic and kinetic principles. The two approaches have their limitations: database methods suffer from sparse statistics and therefore often over-fit the data, while energetic methods must vastly simplify theory to be tractable with the limited computational power available. The best predictors almost always use a combination of both in an intelligent way.
| 1D | one-dimensional |
| 1D structure | one-dimensional (e.g. sequence or string of secondary structure) |
| 2D | two-dimensional |
| 2D structure | two-dimensional (e.g. inter-residue distances) |
| 3D | three-dimensional |
| 3D structure | three-dimensional (co-ordinates of protein structure) |
| PDB | Protein Data Bank of experimentally determined 3D structures of proteins [1] |
| PHD | Profile based neural network prediction of secondary structure [2] (PHDsec [3, 4, 2] ), solvent accessibility (PHDacc [5, 2] ), and transmembrane helices (PHDhtm [6, 2, 7] ) |
| PSIPRED | divergent profile (PSI-Blast) based neural network prediction [8] |
| SWISS-PROT | data base of protein sequences [9] |
| TrEMBL | translation of the EMBL-nucleotide database coding DNA to protein sequences [9] |
Proteins are the machinery of life. The information for life is stored by a four-letter alphabet in the genes (DNA). Proteins are, among others, the macromolecules that perform all important tasks in organisms, such as catalysis of biochemical reactions, transport of nutrients, recognition, and transmission of signals. Thus, genes are the blueprints or library, and proteins are the machinery of life. Proteins are formed by joining amino acids by peptide bonds into an unbranched chain. This protein sequence comprises a translation of the four-letter DNA alphabet into a 20-letter alphabet of native amino acids. Proteins differ in length (from 30 to over 30,000 amino acids), and in the arrangement of the amino acids (dubbed residues, when joined in proteins). In water, the chain folds up into a unique three-dimensional (3D) structure. The main driving force is the need to pack residues for which a contact with water is entropically unfavorable (hydrophobic residues) into the interior of the molecule. A detailed analysis of the underlying chemistry shows that this is only possible if the protein forms regular patterns of a substructure called secondary structure ( Fig. 1 ; for an excellent introduction into protein structure: [10] ; for a short review of the basic principles of folding: [11] ).
Sequence determines structure determines function. Protein three-dimensional (3D) structure (i.e. the co-ordinates of all atoms) determines protein function. But what determines 3D structure? The hypothesis that structure (also referred to as 'the fold') is uniquely determined by the specificity of the sequence has been verified for many proteins [12] . While it is now known that particular proteins (chaperones) often play a role in the folding pathway, and in correcting misfolds [13, 14, 15, 16] , it is still generally assumed that the final structure is at the free-energy minimum. Thus, all information about the native structure of a protein is coded in the amino acid sequence, plus its native solution environment. Can the code be deciphered, i.e. can 3D structure be predicted from sequence? In principle, the code could be deciphered from physico-chemical principles using, for example, molecular dynamics methods [17, 18] . In practice, however, such approaches are frustrated by two principal obstacles [19] . Firstly, energy differences between native and unfolded proteins are extremely small (order of 1 kcal/mol). Secondly, the high complexity resulting from the cooperativity of protein folding requires several orders of magnitude more computing time than we anticipate having over the next decades. Thus, the inaccuracy in experimentally determining the basic parameters, and the limited computing resources become fatal for predicting protein structure from first principles [20] . The only successful structure prediction tools are knowledge-based, using a combination of statistical theory and empirical rules.
The sequence-structure gap is rapidly increasing. Databases for protein sequences are expanding rapidly, largely due to large-scale genome sequencing projects. We now know the sequences for about a million proteins [9] . Only seven years after the first organism - the small prokaryote Haemophilus influenzae - was entirely sequenced [21] , we now have the entire genomes for over 50 organisms [22, 23, 24] from all kingdoms of life, amongst them five multi-cellular eukaryotes: Caenorhabditis elegans [25] , Drosophila melanogaster [26] , Arabidopsis thaliana [27] , Homo sapiens [28, 29] , and Mus musculus [30] . This implies that the explosion of genome, and hence, protein sequences is supposedly the only field outgrowing the speed in development of computer hardware. It also implies that despite significant improvements of structure determination techniques the gap between the number of proteins for which structure is deposited in public databases (PDB [31, 1] ), and the number of proteins for which sequences are known is increasing.
Can the egg be un-boiled? When an egg is boiled, the proteins it contains unfold. Can theory reverse this procedure, i.e., can the encrypted code of protein structure be deciphered? Even if not, can theory at least help to bridge the sequence-structure gap? For over 40 years, there has been an ardent search for methods that predict protein structure from sequence. Many methods initially looked very promising, but the hope has always been dashed. How well do we do?
No general prediction of structure from sequence, yet. John Moult (CARB, Washington DC) initiated the biennial experiment for a critical assessment of structure prediction (CASP): those who determine protein structures submit the sequences of proteins for which they are about to solve the structure to a 'to-be-predicted' database; for each entry in that database predictors can send in their predictions before a given deadline (the public release of the structure); finally, the results are compared and discussed during a workshop (in Asilomar, California). Four such experiments have been completed and published in special issues of the journal Proteins (Vol. 23 1995, Suppl. 1 1997, Suppl. 2 1999, Suppl. 3 2001); the prediction-season for CASP5 is this summer, and the meeting will be held in Dec 2002. Most groups that ever worked on protein structure prediction methods have participated in this large-scale enterprise. Undoubtedly, CASP has influenced the field in many ways. Many types of methods have improved significantly since 1994 (CASP1). However, while a few predictions were surprisingly close to the experimental structure at CASP4 [32] , overall we still cannot reliably predict protein structure from sequence.
This review focuses on prediction methods that actually contribute to bridging the sequence-structure gap with a view to analyzing entire genomes and proteomes. The first section briefly summarizes where we are today in protein structure prediction. The following sections sketch the problems and some of the solutions in database searches, and the prediction of protein structure in 1D, 2D, and 3D ( Fig. 1 ).
Fig. 1. : Protein structure in 1D, 2D and 3D. Representation of HIV-1 protease monomer (Protein Data Bank code 1HHP) in one, two and three dimensions. Each of the representations gives rise to a different type of prediction problem.
1D - predict secondary structure and solvent accessibility. From left to right: amino acids for the first 33 residues (one letter code, first column); alignment exemplified by 5 sequences (second column); secondary structure [70] (H, helix; E, strand; blank, other: third column), solvent accessibility (measured in Å, fourth column, [70] ), and a typical prediction by the neural network program PHD [2] for secondary structure and solvent accessibility (in italics, fifth and sixth column).
2D - predict contact map. The 3D structure can be projected onto a two-dimensional matrix of inter-residue distances or contacts (as shown here). The entry at position ij of the matrix gives the contact strength between residue i and residue j; the stronger a contact, the darker the marker. Horizontal and vertical lines give borders of secondary structure segments. Graph made with CONAN [298] .
3D - predict three-dimensional coordinates. The trace of the protein chain in 3D is plotted schematically as a ribbon Ca-trace. Strands are indicated by arrows; the short helix is visible on the right towards the end (C-term) of the protein. Graph made with MOLSCRIPT [299] . Predictions not shown.
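The projection of a 3D structure onto a 2D contact map can be sketched in a few lines. The example below is only an illustration under stated assumptions: it takes one coordinate per residue (for instance the C-alpha atom) and calls two residues 'in contact' when they are closer than a distance threshold. The function name `contact_map` and the 8 Å default are hypothetical choices for this sketch, not values taken from the figure.

```python
import math


def contact_map(coords, threshold=8.0):
    """Return a square boolean matrix: entry [i][j] is True when residues
    i and j lie within `threshold` (same unit as the coordinates).

    `coords` is a list of (x, y, z) tuples, one per residue. The 8.0
    default is an assumed, illustrative cut-off.
    """
    n = len(coords)
    return [
        [math.dist(coords[i], coords[j]) <= threshold for j in range(n)]
        for i in range(n)
    ]


# Toy coordinates (hypothetical): two neighbouring residues and a distant one.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

The matrix is symmetric by construction, which is why contact maps are usually drawn as a single triangle or mirrored across the diagonal.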
Many more protein sequences are known (~1,000,000) than protein structures (~20,000). Three major prediction techniques are used to extend this knowledge to sequences: homology modeling, threading, and 1D prediction. The results of many prediction efforts are examined closely to produce a non-redundant target list for structural genomics to complete our knowledge of protein structure missed by predictions. Because many costly research decisions are based on predictions, evaluating and fairly comparing prediction accuracies is crucial. The main pitfalls are associated with sparse or different datasets, and inappropriate measures given the intended use of the prediction.
Bridging the sequence-structure gap for 5 - 40% of all sequences. The gap between the number of known sequences (~ 1,000,000) and the number of known structures (~ 20,000) is widening rapidly. The most successful theoretical approach to bridging this gap is homology modeling. The principal idea originates from the following observation: each native protein sequence adopts a unique structure. However, many different sequences can adopt the same basic fold. In other words, proteins with similar sequences tend to fold into similar structures. Indeed, for a pair of naturally evolved proteins that have more than 33 in 100 pairwise identical residues we can infer that the two proteins fold into similar structures (note: for shorter alignments the level of significant pairwise sequence identity is much higher; for very long alignments, it approaches a value around 20% [33] , Fig. 4 ). Thus, if a sequence of unknown structure (U) has significant sequence similarity to a protein of known structure (T), it is possible to build an approximate 3D model for U based on the assumption that U has basically the same structure as T. This technique is referred to as homology or comparative modeling. It effectively raises the number of 'known' 3D structures by more than 10-fold. For example, for 30 of the about 100 completely sequenced organisms, the percentage of proteins with experimentally determined structures remains below 2% [23, 34] . Comparative modeling at very high levels of accuracy already more than doubles this number ( Fig. 2 ). We can predict the basic fold correctly for about 38% of all proteins in entirely sequenced proteomes ( Fig. 2 ).
|
Fig. 2. : Scope of structure prediction. For the entire proteomes of 30 organisms, we investigated the structural coverage by experiment and comparative modeling: black bars give high-accuracy models (2-3Å), gray bars accurate models (2-5Å), and striped bars give the coverage with low-accuracy models, i.e. models that give the basic fold correctly. The lowest threshold for which we can learn about structure through comparative modeling is a PSI-BLAST E-value of 10^-3. At this level, about 38% of all proteins have similarity to known structures. Threading techniques may bridge the sequence-structure gap by another 5-10 percentage points. However, for the remaining half of all proteins we currently cannot predict 3D structure. The only general methods available for this half are predictions in 1D and 2D. Note that the graph shows the coverage in terms of proteins. The coverage of residues for which we have structural knowledge is about 10 percentage points lower [34] . |
Widening the bridge by threading. Comparative modeling allows prediction of 3D structure for 5-40% of all protein sequences. However, there is evidence that most pairs of proteins with similar structure are remote homologues with less than 25% pairwise sequence identity [35] . These remote homologues cannot usually be recognized by conventional sequence alignments, as this level of sequence identity is not significant for structural similarity in the following sense. If one were to collect all pairwise alignments of < 25% sequence identity that result from a search with U against a database of protein sequences, then the vast majority (> 90% [33] ) of these pairs would be entirely unrelated proteins. Thus, most similar structures appear to be remote homologues, but most possible pairs at low levels of sequence identity are, in fact, unrelated. Consequently, searching for remote homologues is similar to the task of finding a needle in a haystack [36, 35, 37, 38, 33] . Techniques to manage this difficult task are referred to as 'threading techniques' [39, 40, 41, 42, 43, 44] . Most of these techniques are applicable if and only if the remote homologue to U has known structure. Once a remote homology is detected, remote homology modeling may be used to construct a 3D model. This could potentially reduce the sequence-structure gap by an additional ten percentage points ( Fig. 2 ).
Accurate prediction for 1D aspects of 3D structure. If neither a close nor a remote homologue can be detected for U , we are forced to simplify the prediction problem. There is a pay-off from making this simplification: using the rich diversity of information in current databases, it is possible to make very accurate 1D predictions from the sequence alone. Automatic prediction services are readily available for secondary structure, solvent accessibility, location and topology for transmembrane helices [7] , and the location of helices for the special class of coiled-coil proteins [45] . These simplified predictions are the only means that we have today to compare entire proteomes [46] . A few results are that most organisms have a similar percentage of helical transmembrane proteins (15-30%), and that eukaryotes have significantly higher fractions of coiled-coil proteins (10%) and of proteins with long regions lacking regular secondary structure (20-30%) than all other kingdoms [23, 34] .
Structural genomics to determine all native protein structures. In 2000, the National Institutes of Health (NIH) in the USA began to finance pilot projects for large-scale protein structure determination (structural genomics). Two major objectives of structural genomics have often been given. First, experimentally determine one protein structure for each natural protein [47] . Second, determine one structure for all missing links in pathways and biological mechanisms [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60] . These two objectives correspond to the two aspects of genome sequencing: (i) the mass of data, and (ii) the completeness of entirely sequenced organisms. One expected technical benefit from structural genomics is the development of techniques and protocols for large-scale expression, purification, crystallization and structure determination. An important benefit for molecular biology may be the determination of the structural scaffolds for most basic functional elements. A considerable increase in the fraction of proteins for which we have some structural information may also advance the determination of function for single proteins or entire proteomes. It is commonly assumed that the scaffolds of protein folds constitute one of the 'basic units' for evolution. If so, structural genomics will also help to better understand evolution. Structural genomics focuses on structural modules or domains. However, isolated domains do not always suffice to understand function. Instead, understanding function often requires studying complexes composed of many proteins. The difficulty of determining structures for large complexes will be prohibitive for the first round of structural genomics.
One representative for each structural family. The safest strategy to go about the goal of determining structures for all native proteins is to simply express, purify, crystallize and X-ray all protein sequences one by one, just in the way large-scale genome sequencing operates. However, sequencing is technically much simpler than is structure determination. None of the necessary steps - express, purify, crystallize, X-ray - has ever been accomplished on the scale of 'all proteins in a proteome'. In fact, in the second year of hands-on large-scale structural genomics, we realize that there is no single bottleneck in the process. Rather, all experimental steps constitute limiting factors for structural genomics. For example, in the NESG consortium about 2000 constructs led to about 35 protein structures [61] . Consequently, we have to find a way of focusing on some representative fraction of all proteins. This enterprise is much more complicated in many ways than anticipated. In fact, as much as the experimental community shifted the paradigm from 'crystallization is the bottleneck' to 'any step limits', bioinformatics realized gradually how difficult a systematic target selection is. The first problem results from the fact that our tools are not sufficient to point at sequences that are likely to adopt novel folds, since we cannot systematically unravel relations in the midnight zone of sequence alignment [50, 38, 33] . In contrast, the levels of pairwise sequence similarity that imply similarity in structure are well established [62, 63, 64, 65, 66, 33, 67] . Thus, it may seem that all bioinformatics has to do is to cluster all proteins into families of proteins with similar structures, exclude all clusters with known structures and define the remaining list as the target list for structural genomics [50, 59, 34] . In fact, this procedure describes the current modus operandi of structural genomics initiatives fairly well. 
Additionally, most groups exclude clusters that are particularly problematic due to the presence of membrane regions, and/or long regions of low-complexity. These unwanted proteins reduce the fraction of proteins for which we need experimental information to less than 50% of the entirely sequenced eukaryotes [34] . However, one important problem remains: cluster the sequences [68] . What complicates the situation even further in practice is that we have to dissect protein sequences into likely structural domains to enable clean-cut clustering [34] . In order to accomplish this task, we again need structure prediction methods [69] .
Seemingly improve accuracy by ignoring short segments. There are many ways to publish higher levels of accuracy. Amongst the simplest for secondary structure prediction is to convert 3₁₀ helices and beta-bulges assigned by DSSP [70] to non-regular structure. This yields higher levels of accuracy since all methods – on average – are better at predicting the middle of helices and strands than their caps, and hence are more accurate for longer regular secondary structure segments [71, 72] . However, when using predicted secondary structure to predict 3D structure, short helices are important. Thus, the more conservative conversion strategy appears more realistic.
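The effect of the conversion scheme is easy to see when the two strategies are written down side by side. The sketch below uses the standard one-letter DSSP codes (H, G, I, E, B, T, S, blank); the two mapping tables, their names, and the treatment of the isolated bridge code B as the 'short strand' class are assumptions made for illustration, not the exact tables used by any particular published method.

```python
# Conservative scheme (assumed): every helix type counts as helix (H),
# extended strand and isolated bridge count as strand (E).
CONSERVATIVE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}

# Lenient scheme (assumed): short elements (3-10 helix G, bridge B) are
# pushed into the non-regular class, which tends to inflate accuracy.
LENIENT = {"H": "H", "I": "H", "E": "E"}


def reduce_dssp(dssp_string, mapping):
    """Convert a DSSP state string to three states: H, E, or L (other).

    Any code missing from `mapping` (T, S, blank, ...) becomes 'L'.
    """
    return "".join(mapping.get(state, "L") for state in dssp_string)


example = "HGGEB T"  # hypothetical DSSP assignment for seven residues
conservative_states = reduce_dssp(example, CONSERVATIVE)
lenient_states = reduce_dssp(example, LENIENT)
```

Evaluating the same predictions against both reductions shows how much of a reported gain comes from the conversion rather than from the method.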
Comparing apples and oranges, or too few apples with one another. To overstate the point: there is NO value in comparing methods evaluated on different data sets. Nevertheless, many developers compare their method to the performance of other methods based on different data sets. Instead, developers should compare their results to public methods on the same data set (one not previously used to develop either method). Many methods predicting aspects of protein structure and function have to fight with limited data availability. This is not at all the case for structure prediction. Hundreds of new protein structures are added every year [1] . If, for whatever reason, small data sets have to be used, developers should painstakingly try to estimate what 'significant difference' means for their data set.
EVA: automatic evaluation of automatic prediction servers. In collaboration with Volker Eyrich (Columbia), Marc Marti-Renom, Andrej Sali (both Rockefeller), Florencio Pazos, and Alfonso Valencia (both CNB Madrid), we have started to address the above problems through the automatic server EVA [73, 74, 75] . Leszek Rychlewski (IIMCB Warsaw) and Dani Fischer (Ben-Gurion Univ.) are implementing similar ideas in LiveBench [76] . The simple concept is the following: take the N newest experimental structures added to PDB, send the sequences to all prediction servers, collect the results, and accumulate a continuous evaluation of prediction accuracy every week. EVA has been evaluating comparative modeling, fold recognition/threading, contact prediction, and secondary structure prediction methods for over two years now. It is instructive to monitor how the 'ranking' of methods may change from week to week due to insufficiently large data sets.
For every protein structure nature has invented, it has produced a myriad of sequence variants. Sequence alignment is our primary tool for both discriminating different structures and estimating residue structural neighbors among these variants. The HSSP curve, a calibration of alignment quality against structural knowledge, enables us to infer structural similarity without knowing the structure. A ‘profile’ calculated from a multiple sequence alignment extracts signal (conserved residues) from noise (non-conserved residues).
Basic concept. The principal problem of sequence alignments is to find the 'optimal' superposition between two strings of amino (or nucleic) acid sequences, i.e. to optimally align the two strings. One mathematical criterion for 'optimal' is the percentage of pairwise identical residues in the final superposition. A dynamic programming algorithm is guaranteed to find the optimal solution for a given 'objective function' (here pairwise identity) in time proportional to the product of the two sequence lengths [77] ( Fig. 3 ). For protein sequences this simple approach is not sufficient; finding the best alignment usually requires introduction of gaps in one sequence, or insertions in the other [78] ( Fig. 3 : rather than placing two dots into sequence T , residues A and E could be deleted in sequence U ; note: the gap increases the score from 2 to 4 identical residues). Usually, gaps are introduced mathematically by adding a constant (gap open penalty) to the final score (here number of identical residues). However, to sensitively align protein sequences, this is still not sufficient. The major addition to the simple approach described so far is to evaluate scores that are not based on residue identities but based on biochemical properties of the amino acids. For example, aligning two hydrophobic residues (I and L) is more beneficial than aligning a hydrophobic and a charged residue (L and K; note: when treating hydrophobic residues as identical, the score for the best gapped alignment in Fig. 3 increases from 4 to 6).
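The dynamic programming idea can be illustrated with a minimal global alignment in the style of Needleman and Wunsch. This is a sketch under simplifying assumptions: it charges the same penalty for every gapped position rather than a separate gap-open constant, it marks gaps with dots, and the hydrophobic residue set `HYDROPHOBIC` is a hypothetical grouping chosen only to show how a property-based score replaces pure identity. The example sequences are invented and are not those of Fig. 3.

```python
def identity_score(a, b):
    """Score 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


# Hypothetical grouping of hydrophobic residues, for illustration only.
HYDROPHOBIC = set("AVILMFWC")


def hydrophobic_aware_score(a, b):
    """Treat any two hydrophobic residues as if they were identical."""
    return 1 if a == b or (a in HYDROPHOBIC and b in HYDROPHOBIC) else 0


def global_alignment(seq_u, seq_t, score_fn=identity_score, gap=-1):
    """Return (score, aligned_u, aligned_t); gaps are shown as dots."""
    n, m = len(seq_u), len(seq_t)
    # dp[i][j] holds the best score for aligning seq_u[:i] with seq_t[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score_fn(seq_u[i - 1], seq_t[j - 1]),
                dp[i - 1][j] + gap,   # residue of U aligned to a gap
                dp[i][j - 1] + gap,   # residue of T aligned to a gap
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    out_u, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + score_fn(seq_u[i - 1], seq_t[j - 1])):
            out_u.append(seq_u[i - 1])
            out_t.append(seq_t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_u.append(seq_u[i - 1])
            out_t.append(".")
            i -= 1
        else:
            out_u.append(".")
            out_t.append(seq_t[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(out_u)), "".join(reversed(out_t))


score, aligned_u, aligned_t = global_alignment("ACDEFG", "ACFG")
```

The two nested loops fill one table cell per residue pair, which is where the quadratic cost comes from; swapping `score_fn` changes the objective function without touching the algorithm.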
Evolution distinguishes signal from noise. At the level of protein molecules, selective pressure results from the need to maintain function, which in turn requires maintenance of the specific 3D structure. This evolutionary history is the basis for the success in aligning protein (or nucleotide) sequences. Accordingly, conservation and mutation patterns observed in alignments contain very specific information about 3D structure. How much variation is tolerated without loss of structure or function? Obviously, it is more likely to find two strings of ten residues with five identical residues by chance than to find two strings of a hundred residues with fifty identical residues [79] . Therefore, pairwise sequence identity alone is not enough to distinguish between proteins of similar and dissimilar structure. Reinhard Schneider and Chris Sander pioneered the HSSP threshold that describes the line of similarity below which we cannot infer similarity in structure through similarity in sequence [62] . The details of the curve originally proposed did not survive the flood of new structures in the 1990s; however, the basic shape remains correct [33] . For very short alignments, we cannot infer anything about structural similarity ( [80] ; in the extreme: 11 identical residues have been observed in strand and helix [81] ; Fig. 4 A). For very long alignments, 20% pairwise sequence identity implies structural similarity ( Fig. 4 A). The transition from the 'safe' zone in which we can infer structural similarity without errors into the twilight zone [82] in which the sequence signal begins to fade, is characterized (1) by an increase in structural homologues by more than an order of magnitude (explosion of true pairs, Fig. 4 B), and (2) by an even more dramatic explosion of pairs that are structurally unrelated (false pairs, Fig. 4 B).
This simple observation highlights the struggle of today's sequence analysis: on the one hand, we want to intrude as far as possible into the twilight zone in order to unravel distant similarities. On the other hand, for each percentage point that we advance, the false pairs increase by some factor. For example, at HSSP-distances of 0 (marked by curve in Fig. 4 A), there are no false positives; at distance of -2.5 already half of the new relations found are false (dissimilar structures), at -5 the false pairs dominate by 4:1, and at -7.5 they dominate 9:1 ( Fig. 4 B).
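Why the same percentage identity means so much less for short alignments can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, crudely, that aligned positions match independently with a fixed probability; the value `P_MATCH = 0.06` is a hypothetical per-position chance of identity between unrelated sequences, chosen only for illustration.

```python
from math import comb


def chance_of_at_least(k, n, p):
    """Binomial tail: probability of at least k matches in n positions,
    assuming independent positions that each match with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


P_MATCH = 0.06  # assumed chance of identity at one position (illustrative)

# Both cases correspond to 50% pairwise sequence identity.
short_alignment = chance_of_at_least(5, 10, P_MATCH)
long_alignment = chance_of_at_least(50, 100, P_MATCH)
```

Under these assumptions the short alignment is many orders of magnitude more likely to arise by chance, which is the intuition behind making the significance threshold depend on alignment length.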
|
Fig. 4. : Thresholds of significant sequence similarity.
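$$\mathrm{HSSP\_distance} = \mathrm{PIDE} - \begin{cases} 100 & \text{for } L \le 11 \\ 480 \cdot L^{-0.32\,(1 + e^{-L/1000})} & \text{for } 11 < L \le 450 \\ 19.5 & \text{for } L > 450 \end{cases}$$
where PIDE is the percentage of pairwise identical residues between a pair of proteins over the alignment length of L residues, and the length-dependent threshold term is the curve given in [33] (the two thin lines illustrate HSSP distances of 10 and -10).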
|
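A length-dependent threshold of this kind is straightforward to evaluate in code. The sketch below assumes the curve published in [33]: a threshold of 100% identity for alignments of up to 11 residues, 480 · L^(-0.32 · (1 + e^(-L/1000))) for 11 < L ≤ 450, and 19.5% beyond that. The function names are hypothetical, and the constants should be checked against [33] before any serious use.

```python
import math


def hssp_threshold(length):
    """Percentage identity needed at alignment length `length`
    (assumed form of the curve from [33])."""
    if length <= 11:
        return 100.0
    if length <= 450:
        return 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    return 19.5


def hssp_distance(pide, length):
    """Distance of an alignment from the threshold curve, in percentage
    points of identity; positive values lie above the curve."""
    return pide - hssp_threshold(length)


# A hypothetical alignment: 50% identity over 500 residues.
example_distance = hssp_distance(50.0, 500)
```

Reporting the distance rather than the raw percentage identity lets hits of different lengths be ranked on one scale.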
Routine database searches by simplified procedures. Any sequence analysis starts with database searches: all known databases are scanned by sequence alignment procedures for proteins homologous to the search sequence U. When the pairwise sequence identity between U and a putative homologue H is high ( Fig. 4 ), alignment procedures are usually straightforward [83, 84, 85, 86, 87] . For less similar protein pairs, alignments may fail. Aligning two sequences by dynamic programming is a matter of seconds on a modern workstation. However, database searches require repeating this many times, and since the databases grow, CPU time becomes a constraint in everyday sequence analysis. This bottleneck is eased by methods that first find 'identical words' (sub-strings) and then grow the alignment around such blocks. The most widely used programs of this sort are BLAST and FASTA [84, 86] . In practice, advanced alignment algorithms typically first run a fast scan with BLAST and/or FASTA and then apply the full dynamic programming algorithm. This is also implemented in the fast and good alignment bestseller PSI-BLAST [88] . For users who want to fine-tune the final alignment, ClustalW and its newer implementation ClustalX provide an excellent tool [89, 90] .
Intruding into the twilight zone by profile-based alignments. It was also recognized very early on that information from the position-specific evolutionary exchange profile of a particular protein family facilitates discovering more distant members of that family [91] . Automatic database search methods successfully used position-specific profiles for searching [92] . However, the breakthrough to large-scale routine searches has been achieved by the development of PSI-BLAST [88] and Hidden Markov models [93, 94] . In particular, the gapped, profile-based, and iterated search tool PSI-BLAST continues to revolutionize the field of protein sequence analysis through its unique combination of speed and accuracy. More distant relationships are found through iteration starting from the safe zone of comparisons and intruding deeply and reliably into the twilight zone. However, users ought to be aware that the profiles used during the iteration may drift away from the original sequence [95] . In extreme cases - which happen for about 10% of the database searches [96] - the final profile will not even recognize close homologues of the query protein U with which the search was started.
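What a position-specific profile contains can be shown with a deliberately simple sketch: for each alignment column, count how often each residue occurs among the aligned family members. This is not how PSI-BLAST or Hidden Markov models build their profiles (those add sequence weighting, pseudo-counts and log-odds scores); the function name `column_profile` and the toy alignment are hypothetical.

```python
from collections import Counter


def column_profile(alignment, gap_chars="-."):
    """Return one dict per alignment column mapping residue -> frequency.

    `alignment` is a list of equal-length strings; gap characters are
    ignored when computing the frequencies of a column.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] not in gap_chars)
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Toy family alignment (invented): column 1 is conserved, column 3 varies.
toy_profile = column_profile(["ACD", "ACE", "A-D"])
```

A column dominated by one residue signals conservation, while a flat distribution signals a position that tolerates exchange; searching with such frequencies instead of a single sequence is what lets distant family members score well.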
Drawback: lack of sufficiently tested cut-off criteria. There are many different alignment methods available. Which is best? One of the difficulties in comparing different alignment procedures is the lack of well-defined criteria for measuring the alignment quality. Very few papers have attempted to define such measures for the comparison of various methods [97] . The second problem for users is that most methods do not supply a cut-off criterion for distinguishing between homologous and non-homologous sequences, or if they implicitly do (like the E values in PSI-BLAST), the values are not necessarily appropriate for the objective of the user. For example, significance thresholds are completely different when attempting to retrieve proteins of similar structure than when attempting to retrieve enzymes of similar function ( Fig. 4 D-F). There is no simple way around this dilemma; the only practical way out is to re-establish significance thresholds for different problems [33, 98] . One important lesson appears to hold in general: any threshold that does not account for the alignment length is not sufficient as illustrated by the poor performance of pairwise sequence identity ( Fig. 4 D) and the problems of PSI-BLAST E values in inferring similarity of function for very high levels of confidence ( Fig. 4 F right hand side).
A 1D prediction assigns a state from a discrete set, e.g. helix, strand, loop, defined by physical or functional criteria, to each residue in the sequence. Prediction is achieved by training, in most cases, a Neural Network (NN) on fixed-size sequence windows classified by the state of the central residue. Thus the predictive information is sequence-local. The main improvement in accuracy was due to using family-derived profiles for both training and input. The most widely used 1D predictions today are secondary structure, solvent accessibility, transmembrane strands and helices, and recently, regions of structural switches.
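As a minimal sketch of how such training examples might be assembled, the following code cuts a sequence into fixed-size windows and labels each with the state of its central residue. The window half-width, the padding symbol, and the one-letter state codes (H, E, L) are assumptions chosen for illustration, not the settings of any particular published method.

```python
def central_windows(sequence, states, half_width=8, pad="X"):
    """Yield (window, state) pairs: a fixed-size window and its central residue's state."""
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")
    # Pad both ends so that terminal residues also sit at the center of a window.
    padded = pad * half_width + sequence + pad * half_width
    for i, state in enumerate(states):
        yield padded[i:i + 2 * half_width + 1], state

# Hypothetical seven-residue fragment with assumed states (H = helix, L = loop).
examples = list(central_windows("MKVLAAG", "LHHHHLL", half_width=2))
```

Because each example sees only its own window, any information outside the window is invisible to the predictor, which is what "sequence-local" means here.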
Basic concept. The principal idea underlying most secondary structure prediction methods is the fact that segments of consecutive residues have preferences for certain secondary structure states [10, 2] . Thus, the prediction problem becomes a pattern-classification problem tractable by pattern recognition algorithms. The goal is to predict whether the residue at the center of a segment of typically 13-21 sequence-consecutive residues is in a helix, in a strand, or in neither of the two (no regular secondary structure, often referred to as the 'coil' or 'loop' state). Many different algorithms have been applied to tackle this simplest version of the protein structure prediction problem: physico-chemical principles, rule-based devices, expert systems, graph theory, linear and multi-linear statistics, nearest-neighbor algorithms, molecular dynamics, and neural networks [2] . However, until 1992 performance accuracy seemed to have been limited to about 60% (percentage of residues correctly predicted in either helix, strand, or other). The limited accuracy was argued to result from the fact that all methods used only information local in sequence (window of less than 20 adjacent residues). Local information was estimated to account for roughly 65% of the secondary structure formation. Two additional problems were common to all methods developed from 1957 to 1993: (1) strands were predicted at levels of accuracy only slightly superior to random predictions, and (2) predicted secondary structure segments were, on average, only half as long as observed segments. The latter two shortcomings could be surmounted by using a particular combination of neural networks [2] .
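The accuracy measure quoted above (percentage of residues correctly predicted in helix, strand, or other) can be written down directly. The sketch below assumes one-letter state strings (H, E, L) of equal length; the example strings are invented for illustration.

```python
def three_state_accuracy(observed, predicted):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

# Hypothetical observed and predicted state strings for a ten-residue fragment.
observed = "LHHHHLEEEL"
predicted = "LHHHLLEELL"
score = three_state_accuracy(observed, predicted)  # 8 of 10 residues agree
```

Note that a per-residue score of this kind does not reveal whether segments are predicted too short, which is why segment length was discussed as a separate shortcoming.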
Evolutionary information key to significantly improved predictions. On the one hand, about 75 out of 100 residues can be exchanged in a protein without changing structure. On the other hand, exchanges of 1-5 residues can already destabilize a protein structure. These statements may appear contradictory. However, the explanation is simple: evolution has explored exactly the unlikely exchanges of particular amino acids at particular positions that do not change structure, as a change of structure usually results in a loss of function (and thus would not survive). Thus, the residue exchange patterns extracted from a protein family (i.e. alignments of similar sequences) are highly indicative of the specific structural details for that family. The first method that reached a sustained level of a three-state prediction accuracy above 70% was the profile-based neural network system PHD which uses exactly such evolutionary information derived from multiple sequence alignments as input [2] . By stepwise incorporation of particular evolutionary information, prediction accuracy was pushed above 72% accuracy [3, 101, 4, 2] . The currently best methods (PROFsec [102] and PSIPRED [8] ) reach a level of 76% three-state per-residue accuracy [103, 95] . This constitutes a sustained level more than four percentage points above last century's best method not using diverged profiles. Fortunately, the improvement is valid for helix, strand and non-regular regions. Furthermore, significantly fewer residues are confused between the helix and strand states. Finally, some new methods also improve in a more global sense by improving the accuracy of assigning the secondary structural class (all-alpha, all-beta, alpha/beta, other) based on the predicted content of regular secondary structure.
Sources of improvement: 4 parts database growth, 3 extended search, 2 other. Jones suggested two causes for the improved accuracy: (1) training and (2) testing the method on PSI-BLAST profiles. Cuff & Barton examined in detail how different alignment methods improve prediction accuracy [104] . However, which fraction of the improvement results from the mere growth of the database, which from using more diverged profiles, and which from training on larger profiles? Using PHD from 1994 to separate the effects [96] , we first compared a non-iterative standard BLAST [84] search against SWISS-PROT [9] with one against SWISS-PROT + TrEMBL [9] + PDB [1] . The larger database improves performance by about two percentage points [96] . Secondly, we compared the standard BLAST against the big database with an iterative PSI-BLAST search. This yielded less than two percentage points additional improvement [96] . Thus, overall, the more divergent profile search against today's databases supposedly improves any method using alignment information by almost four percentage points (PHDpsi in Table 1 ). The improvement gained by using PSI-BLAST profiles to develop the method is relatively small. For example, PHDpsi was trained on a small database of not very divergent profiles in 1994, whereas PROFsec was trained on PSI-BLAST profiles of a 20 times larger database in 2000. The two differ by only one percentage point, and part of this difference resulted from implementing new concepts into PROF (Rost, unpublished; [105] ).
Secondary structure predictions now extremely useful, in practice. How good is a prediction accuracy of 72-76% in practice? It is certainly reasonably good compared with the prediction of secondary structure by comparative modeling [106] . However, prediction accuracy varies between different proteins, i.e., prediction accuracy for the best methods is 76%±10% (one standard deviation) [2] . For applications this implies that predictions can be as good as > 95%, but also as bad as < 54%. Can users distinguish one from the other? A few methods successfully use reliability indices that indicate residues for which predictions are, on average, likely to be more accurate. For PSIPRED and PROFsec the correlation between such a reliability index and accuracy is linear [2] . Thus, the reliability index effectively becomes a means to predict prediction accuracy, and hence to assess to which class a protein of unknown structure (U) belongs: to the well predicted, or to the badly predicted ones. Young, Kirshenbaum, Dill & Highsmith [107] have unraveled an impressive correlation between local secondary structure predictions and global conditions. The authors monitor regions for which secondary structure prediction methods give equally strong preferences for two different states. Such regions are processed combining simple statistics and expert-rules. The final method is tested on 16 proteins known to undergo structural rearrangements, and on a number of other proteins. The authors report no false positives, and identify most known structural switches. Subsequently, the group applied the method to the myosin family, identifying putative switching regions that were not known before, but appeared to be reasonable candidates [108] . Are secondary structure predictions accurate enough to help predict higher order aspects of protein structure automatically?
Basic concept. It has long been argued that if the segments of secondary structure could be accurately predicted, the 3D structure could be predicted by simply trying different arrangements of the segments in space [109] . One criterion for assessing each arrangement could be to use predictions of residue solvent accessibility [110, 111] . The principal goal is to predict the extent to which a residue embedded in a protein structure is accessible to solvent. Solvent accessibility can be described in several ways [110, 111] . The simplest is a two-state description distinguishing between residues that are buried (relative solvent accessibility < 16%) and exposed (relative solvent accessibility ≥ 16%). The classical method to predict accessibility is to assign either of the two states, buried or exposed, according to residue hydrophobicity. However, a neural network prediction of accessibility has been shown to be superior to simple hydrophobicity analyses [112, 104] . More recently, other advanced methods have been applied successfully to predict accessibility [113, 114, 115, 116, 117] . A particular twist comes from methods that predict the contact environment of a residue [118, 119] : rather than predicting that residue X is n% accessible, these methods predict that X is in contact with less than m other residues.
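The two-state description given above amounts to a single threshold on relative solvent accessibility. A minimal sketch, assuming accessibility values are already expressed as percentages of the maximal accessibility of each residue type (the input values are invented):

```python
def accessibility_state(relative_accessibility, threshold=16.0):
    """Two-state description: 'buried' below the threshold, 'exposed' at or above it."""
    return "exposed" if relative_accessibility >= threshold else "buried"

# Hypothetical relative accessibilities (in percent) for four residues.
states = [accessibility_state(value) for value in (0.0, 15.9, 16.0, 72.5)]
```

The threshold is kept as a parameter because other multi-state descriptions of accessibility use different cut-offs.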
Evolutionary information improves prediction accuracy. Solvent accessibility at each position of the protein structure is evolutionarily conserved within sequence families. This fact has been used to develop methods for predicting accessibility using multiple alignment information [37] . Prediction accuracy is about 75±7%, four percentage points higher than for methods not using alignment information. Recently, PHDacc has been superseded by PROFacc, a neural network based method that improves prediction accuracy significantly (Rost, unpublished; [102] ). PROFacc is particularly accurate in predicting surface residues reliably: almost 80% of the residues on the surface (>16% accessible) are correctly predicted. Predictions of solvent accessibility have also been used successfully for prediction-based threading, as a second criterion towards 3D prediction by packing secondary structure segments according to upper and lower bounds provided by accessibility predictions, and as basis for predicting functional sites [37] . More recently, predictions of accessibility were also used successfully to predict sub-cellular location (below).
Basic concept. Even in the optimistic scenario that in the near future most protein structures will be experimentally determined, one class of proteins will still represent a challenge for experimental determination of 3D structure: transmembrane proteins. The major obstacle with these proteins is that they do not crystallize, and are hardly tractable by NMR spectroscopy. Consequently, for this class of proteins structure prediction methods are even more needed than for globular water-soluble proteins. Fortunately, the prediction task is simplified by strong environmental constraints on transmembrane proteins: the lipid bilayer of the membrane reduces the degrees of freedom to such an extent that 3D structure formation becomes almost a 2D problem. Two major classes of membrane proteins are known: proteins which insert helices into the lipid bilayer ( Fig. 5 ), and proteins that form pores by a barrel of an even number of antiparallel β-strands (one of the larger of these families is the 16-stranded porins [120] ). For helical membrane proteins, elaborate combinations of expert-rules, hydrophobicity analysis and statistics were believed to yield a two-state per-residue accuracy of about 90% (residues predicted correctly as either transmembrane helix, or other; for recent reviews: [121, 122, 123, 124] ).
Evolutionary information improves prediction accuracy. For two methods the use of multiple alignment information is reported to clearly improve the accuracy of predicting transmembrane helices [125, 2] . In order to predict the orientation of the helices (i.e. the topology, Fig. 5 ) a simple rule is applied: positively charged residues occur more often in intra-cytoplasmic than in extra-cytoplasmic regions. The advanced neural network system has been improved significantly by adding a dynamic programming algorithm to the neural network output. The principal idea is to use the neural network output as an energy landscape and to find the optimal path through this landscape [7] . TMAP was another early application of multiple sequence alignments to determine membrane-spanning segments [125] . More recently, Hidden Markov Models used a combination of intelligence and evolutionary information (below).
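The simple orientation rule described above can be sketched as follows. The code assumes that helix positions have already been predicted and are given as half-open index ranges; counting only lysine and arginine as positively charged, and resolving ties in favor of a cytoplasmic N-terminus, are illustrative assumptions rather than the settings of any published method.

```python
def predict_n_terminus_side(sequence, helices):
    """Apply the positive-inside rule to loops delimited by predicted helices.

    helices: sorted, non-overlapping (start, end) half-open index ranges.
    Returns 'inside' if the N-terminal loop is predicted to be cytoplasmic.
    """
    bounds = [0] + [b for helix in helices for b in helix] + [len(sequence)]
    # Loops are the stretches between helices; consecutive loops alternate sides.
    loops = [sequence[bounds[i]:bounds[i + 1]] for i in range(0, len(bounds), 2)]
    positives = [sum(loop.count(aa) for aa in "KR") for loop in loops]
    same_side_as_n_term = sum(positives[0::2])
    opposite_side = sum(positives[1::2])
    return "inside" if same_side_as_n_term >= opposite_side else "outside"

# Hypothetical two-helix proteins with helices at residues 5-24 and 30-49.
helices = [(5, 25), (30, 50)]
protein_a = "MKRKK" + "L" * 20 + "DGSEN" + "A" * 20 + "RRKSTKGRKQ"
protein_b = "MDSEN" + "L" * 20 + "KRKRK" + "A" * 20 + "DDSEE"
side_a = predict_n_terminus_side(protein_a, helices)
side_b = predict_n_terminus_side(protein_b, helices)
```

In the first toy protein the positive charges cluster in the loops on the N-terminal side, in the second they cluster in the single loop on the opposite side.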
Grammatical rules reflect global aspects of membrane regions. The lipid bilayer constrains the structure of the membrane-passing regions of proteins in many ways. TMHMM pioneered building models of predicted membrane proteins considering a variety of such constraints in one consistent methodology [126, 127] . A similar concept was implemented in HMMTOP [128, 129] . TMHMM implements a cyclic HMM with seven states for transmembrane-helix (TMH) core, TMH-caps on the N- and C-terminal sides, non-membrane regions on the cytoplasmic side, two non-membrane regions on the non-cytoplasmic side, and a globular domain state in the middle of each non-membrane region. The two non-membrane regions on the non-cytoplasmic-side model short and long loops respectively, which correspond to two different membrane insertion mechanisms. In contrast, HMMTOP is an HMM representing the following five structural states: inside non-membrane region, inside TMH-cap, membrane helix, outside TMH-cap, and outside non-membrane region. Conceptually, this model is similar to the one used in MEMSAT [130] . It differed in the placement and interpretation of TMH-caps, which Tusnady et al. interpret as not being in the membrane [128] .
Comparing methods. We recently re-evaluated the performance of 27 advanced and simple prediction methods [124] . Although we spent considerable effort on comparing prediction methods, our comparisons suffered from one crucial problem: we do not have cross-validation data available for all methods and our high-resolution data sets are still too small. Nevertheless, we observed the following trends. While some of the advanced methods performed better than others, we showed in a thorough bootstrapping experiment based on various measures of accuracy that no method performed consistently best. In contrast, most simple hydrophobicity scale-based methods were significantly less accurate than any advanced method as they over-predicted membrane helices and confused membrane helices with hydrophobic regions outside of membranes. In contrast, the advanced methods usually distinguished correctly between membrane helical and other proteins. Nonetheless, few methods reliably distinguished between signal peptides and membrane helices. We could not verify a significant difference in performance between eukaryotic and prokaryotic proteins. Surprisingly, we found that proteins with more than five helices were predicted at a significantly lower accuracy than proteins with five or fewer. The important implication is that structurally unsolved multi-spanning membrane proteins, which are often important drug targets, will remain problematic for transmembrane helix prediction algorithms. One of the problems of current methods is similar to that of low-resolution experiments: very long and very short helices, as well as very short regions connecting helices are difficult to see.
Similar content of membrane helical proteins across all kingdoms of life. Despite this uncertainty in detail, the prediction of transmembrane helices is a valuable tool to quickly scan entire proteomes [131, 23, 34] . In fact, predictions of membrane helices and coiled-coil regions are among the few means that allow comparison of entirely sequenced organisms [46] . The surprising outcome of analyzing over 30 entirely sequenced organisms from all three kingdoms of life (eukaryotes, prokaryotes and archaea) is that while the percentage of proteins with membrane helices varies from 15-30%, there is no significant difference in the percentage between bacteria and multi-cellular organisms [23, 34, 46] .
Basic concept. β-barrel membrane proteins are found in the outer membranes (OMs) of Gram-negative bacteria, mitochondria and chloroplasts. In prokaryotes, they mediate non-specific, passive transport of ions and small molecules or can selectively pass molecules such as maltose and sucrose [132, 133] . In eukaryotic organelles, β-barrel membrane proteins have been suggested to be involved in voltage-dependent anion channels [134] . This wide range of functions is associated with a wide range of structural variants: barrels with as few as 8 and as many as 22 strands [135] . Of the β-barrel membrane proteins, porins are the best studied. Many porins are homotrimeric barrels containing 16 antiparallel β-strands per barrel; maltoporin from E. coli contains 18 strands [133] . A band of hydrophobic residues encircles the trimer [136, 137, 133] . Porins also contain a central channel that is partially blocked by a loop that folds inwardly and is attached to the inner side of the barrel wall [138] . This arrangement forms an "eyelet" which defines the size of solute molecule that can traverse the channel. Currently, high-resolution structures are only available for bacterial OM proteins [139] . Unlike for α-helical membrane proteins, there are no simple low-resolution experiments that yield large amounts of data for β-barrel membrane proteins. This has constrained the ability to develop prediction methods. Many β-strands contain alternating hydrophobic and hydrophilic side-chains. However, this simple rule usually does not suffice to identify membrane strands [135] . Methods that implement physico-chemical properties were applied successfully only in the context of experimental information [140, 141, 142] . All early attempts to predict membrane strands employed the amphipathicity and hydrophobicity of β-strands.
Paul and Rosenbusch attempted a minimal approach to predict and identify segments causing polypeptides to reverse their direction (turn identification) but they avoided hydrophobicity parameters [140] . In contrast, Jahnig suggested that a generalization of hydrophobicity analysis was sufficient to predict membrane-spanning amphiphilic α-helices and β-strands [143] . Unfortunately, membrane strands have no long stretch of consecutive hydrophobic residues. In fact, the overall hydrophobicity for β-barrel membrane proteins is similar to that of soluble proteins. Cowan and colleagues [144] suggested using the mean hydrophobicity of one side of a putative β-strand by averaging over hydrophobic moments [145] of every second residue within a sliding window [146, 142] . To improve the signal-to-noise ratio, they accounted for the band of aromatic residues in flanking positions of the β-strands. Another method that was considered for predicting membrane-spanning β-regions was a rule-based approach. Gromiha and colleagues combined amino acid preferences for β-strands with the surrounding hydrophobicity of the respective residues to predict β-strands [147, 148] .
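The idea of scoring only one face of a putative membrane beta-strand can be illustrated with a short sketch. It averages a hydropathy value over every second residue within a sliding window, once for each face. Using the Kyte-Doolittle hydropathy scale and a window of ten residues are assumptions made for illustration; the cited methods use hydrophobic moments and additional terms that are not reproduced here.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this illustration).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def mean_hydropathy(residues):
    """Average hydropathy of a string of residues."""
    return sum(KD[aa] for aa in residues) / len(residues)

def one_sided_profile(sequence, window=10):
    """For each window, average hydropathy separately over the two strand faces."""
    profile = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        # In a beta-strand, side chains of every second residue point to the same side.
        profile.append((start, mean_hydropathy(segment[0::2]), mean_hydropathy(segment[1::2])))
    return profile

# Hypothetical amphipathic stretch: valines alternate with serines.
profile = one_sided_profile("VSVSVSVSVS", window=10)
```

For the toy stretch, one face is strongly hydrophobic and the other hydrophilic, although the average over all ten residues would look unremarkable.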
Non-linear statistics enable the prediction of membrane beta-strands. Diederichs and colleagues proposed to use a neural network to predict the topology of the bacterial OM β-barrel proteins and to locate residues along the axes of the pores [149] . The neural network predicts the z-coordinate of C-alpha atoms in a coordinate frame with the outer membrane in the xy-plane, such that low z-values indicate periplasmic turns, medium z-values indicate transmembrane β-strands, and high z-values indicate extracellular loops. Most recently, Jacoboni, Fariselli, Casadio and colleagues applied a method combining neural networks and dynamic programming to predict the location of membrane strands [150] . The networks used alignment information as input and predicted whether or not a particular residue is part of a membrane strand. In the second step, the method simply finds the optimal path through the network prediction, much like the methods applied to predict membrane helical proteins [130, 6, 151] . Finally, the topology is assigned based on the location of the longest loop that is taken to be exterior. The authors estimated that their system correctly predicts about 93% of all known membrane-strands. It is not clear whether or not the estimates from Diederichs et al. and Jacoboni et al. will hold true for all beta-strand membrane proteins. The first problem is merely a technical one: for such a small set of experimentally known families (about 15 different families, Bigelow & Rost, unpublished) it is almost impossible to avoid over-training methods with many free parameters (such as neural networks). The second problem is one of principle: we have to assume that the 15 different beta-strand membrane families for which we have high-resolution structures are representative for all beta-strand membrane proteins. This may turn out to be an incorrect assumption.
Mutations at sequence-distant positions in a structural family are observed to be more strongly correlated when the positions are neighbors in structure. This second-order statistic is used to predict inter-residue contacts (2D prediction or contact prediction). A special case, inter-strand contacts, is a simpler prediction problem. Overall, the field of contact prediction is in its infancy, and accuracy is still rather low. However, such predictions can refine many 1D and 3D prediction algorithms, and their uses are actively being explored.
Prediction problem is a hard one, but the stakes are high. Given all inter-residue contacts or distances ( Fig. 1 ), 3D structure can be reconstructed by distance geometry or molecular dynamics. This is used for the determination of 3D structures by nuclear magnetic resonance (NMR) spectroscopy which produces experimental data of distances between protons [152] . Can inter-residue contacts be predicted? Obviously, some fraction of these contacts can be: helices and strands can be assigned based on hydrogen-bonding pattern between residues. Thus, a successful prediction of secondary structure implies a successful prediction of some fraction of all the contacts. However, contacts predicted from secondary structure assignment are short-ranged, i.e., between residues nearby in sequence. For a successful application of distance geometry, long-range contacts have to be predicted, i.e., contacts between residues far apart in sequence. A few methods have been proposed for the prediction of long-range inter-residue contacts. Two questions surround such methods: (1) how accurate are these prediction methods on average and (2) are all the important contacts predicted? The answers appear to have deterred many researchers from this important sub-field of protein structure prediction: accuracy is low and only some of the important contacts are correctly predicted. Nevertheless, contact predictions help to predict folds [153, 154, 155, 156] and can be used to predict protein-protein interactions [157, 158, 159] .
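To make the distinction between short-range and long-range contacts concrete, the sketch below derives contacts from 3D coordinates and filters them by sequence separation. The distance cut-off of 8 Å between representative atoms, the separation threshold, and the toy hairpin coordinates are assumptions for illustration only.

```python
import math

def contact_pairs(coords, cutoff=8.0, min_separation=1):
    """Return residue pairs (i, j), j - i >= min_separation, closer than cutoff."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                pairs.append((i, j))
    return pairs

# Hypothetical hairpin: residues 0-3 run one way, residues 4-7 run back 5 Å away.
hairpin = [(0.0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0),
           (11.4, 5, 0), (7.6, 5, 0), (3.8, 5, 0), (0.0, 5, 0)]
all_contacts = contact_pairs(hairpin)
long_range = contact_pairs(hairpin, min_separation=4)
```

Contacts between sequence neighbors dominate `all_contacts`, whereas `long_range` keeps only the pairs that actually constrain how the two halves of the chain pack against each other.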
Correlated mutations and neural networks. In sequence alignments, some pairs of positions appear to co-vary in a physico-chemically plausible manner, i.e., a 'loss of function' point mutation is often rescued by an additional mutation that compensates for the change [160] . One hypothesis is that compensations would be most effective in maintaining a structural motif if the mutated residues were spatial neighbors. Attempts have been made to quantify such a hypothesis and to use it for contact predictions [161, 162, 163] . In general, prediction accuracy is rather poor, with a direct trade-off between predicting enough contacts, and predicting only correct ones, e.g., taking 5% of the best-predicted long-range contacts (sequence separation above 10 residues) the prediction accuracy is about 50% [74, 103] . The signal from correlated mutations is somewhat independent of the signal from evolutionary conservation [164, 165] . Another obvious approach is to apply neural networks to the problem [166, 167, 168] . Recently, neural networks were combined with correlated mutations to yield the currently most accurate prediction tool [169] .
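One simple way to quantify co-variation between two alignment columns is their mutual information. This is used here only as an illustrative stand-in; it is not necessarily the correlation measure employed by the cited methods, and the toy columns are invented.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equally long alignment columns."""
    if len(col_a) != len(col_b) or not col_a:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    freq_a, freq_b = Counter(col_a), Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    return sum((count / n) * math.log2((count / n) / ((freq_a[a] / n) * (freq_b[b] / n)))
               for (a, b), count in freq_ab.items())

# Hypothetical columns from four aligned sequences.
covarying = mutual_information("AAVV", "DDKK")    # residues change together
independent = mutual_information("AAVV", "DKDK")  # no relation between columns
```

Column pairs with high scores and large sequence separation would be the candidates for predicted long-range contacts, subject to the trade-off between coverage and accuracy noted above.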
Simplifying the contact prediction problem. One simplification of the problem to predict inter-residue contacts focuses on predicting the contacts between residues in adjacent strands ( Fig. 1 ). Such an attempt is motivated by the hope that such interactions are more specific than are sequence-distant (long-range) contacts in general, and hence are easier to predict.
Identifying the correct β-strand alignment. The only method published for predicting inter-strand contacts is based on potentials of mean-force [170] similar to those used in the evaluation of strand-strand threading [171] . Propensities are compiled by database counts for 2 × 2 × 2 classes (parallel/anti-parallel, H-bonded/not H-bonded, N-/C-terminal). Each of the eight classes is divided further into five sub-classes in the following way. Suppose the two strand residues at positions i and j are close in space. Then the following five residue pairs are counted in separate tables: i/j-2, i/j-1, i/j, i/j+1, i/j+2. Such pseudo-potentials identify the correct β-strand alignment in 35-45% of the cases.
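The counting scheme for the five sub-classes can be sketched as follows. For brevity the sketch keeps one table per offset and ignores the division into the eight structural classes; the function name, the toy sequence, and the contact list are hypothetical.

```python
from collections import Counter

OFFSETS = (-2, -1, 0, 1, 2)

def count_offset_pairs(sequence, contacts):
    """Count residue pairs i/j+offset for each contact (i, j), one table per offset."""
    tables = {offset: Counter() for offset in OFFSETS}
    for i, j in contacts:
        for offset in OFFSETS:
            k = j + offset
            if 0 <= k < len(sequence):  # skip offsets that run off the sequence ends
                tables[offset][(sequence[i], sequence[k])] += 1
    return tables

# Hypothetical 13-residue fragment with one inter-strand contact between 0 and 10.
tables = count_offset_pairs("VTVEVGNGKIYVE", [(0, 10)])
```

Accumulated over a database of known structures, such tables could be turned into propensities by comparing observed counts with the counts expected by chance.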
Using evolutionary information to predict inter-strand contacts. Even if the locations of strands in the sequence are known exactly, the pseudo-potentials cannot predict the correct inter-strand contacts in most cases [170] . However, when using multiple alignment information, the signal-to-noise ratio increases such that inter-strand contacts have been predicted correctly for most of the strands inspected in some test cases [170] . For the purpose of reliable contact prediction, this result is inadequate, especially as the locations of the strands are not known precisely. Various test examples using predictions by PHDsec [2] as input to the strand pseudo-potentials indicate that the accuracy in predicting inter-strand contacts drops (T Hubbard, unpublished), but in some cases is still high enough to be useful for approximate modeling of 3D structure [172] . Implementing rules improves the accuracy [173] ; elaborate recurrent neural networks improve the prediction of beta-strand contacts even further [174] . However, all these methods do not systematically combine clusters of predicted contacts. This task was solved systematically by implementing a Hopfield-like [175] neural network [176] : expert-rule based constraints (e.g. each strand can only be contacting two other strands) are optimized to predict the entire contact map for beta-strand proteins. Unfortunately, none of these apparently helpful prediction methods is publicly available at the moment.
The ultimate goal of structure prediction is an atomic resolution (3D) model of a given sequence. This goal can only be achieved when the sequence has a close homologue of known structure. Both comparative modeling and threading combine statistical information and simplified models for the energy and entropy of folding to evaluate the fitness of a sequence to a particular structure. The major hurdles for both endeavors are 1) identifying homologues of known structure, 2) correctly aligning those homologues to the sequence of interest, and 3) adjusting the final structure to optimize its fitness. If no homologue of known structure exists, ab initio prediction is the only option. However, most ab initio methods are still being developed and trained on proteins with homologues of known structure.
Basic concept. Proteins with similar sequence have similar structure ( Fig. 4 ). The observation that proteins with fairly different sequences still have relatively similar structures is often incorrectly described as 'structure is more conserved than sequence'. This statement is misleading since, in fact, any residue substitution changes the structure in detail. The basic topology of a 3D structure is often described through a simple sketch ( Fig. 1 ) that is intuitively referred to as 'the fold'. The concept of 'a fold' is not well-defined. However, experts can recognize the similarity of two structures by visually classifying them into folds (as e.g. done in the SCOP database [177] ; note that the concept of 'fold similarity' is at the base of the data presented in Fig. 4 ). This is the pillar for the success of comparative modeling [178, 179, 51, 180, 181, 182, 183, 184, 185, 186, 67, 187, 188, 189] . The principal idea is to model the structure of U (protein of unknown structure) based on the template of a sequence homologue of known structure (T). Consequently, the precondition for comparative modeling is that a sequence homologue of known structure is found in PDB. Since comparative modeling is currently the only theoretical means to successfully predict 3D structure, this has two implications. First, comparative modeling is applicable to 'only' 5-40% of the known protein sequences ( Fig. 2 ). Second, as the template of a homologue is required, no unique 3D structure can be predicted by comparative modeling, i.e., no structure that has no similarity to any experimentally determined 3D structure.
High level of sequence identity: atomic resolution. The basic assumption of comparative modeling is that U and T have identical backbones (main chain C-alpha). The modeling task then becomes to correctly place the side chains of U into the backbone of T. For very high levels of sequence identity between U and T (ideally differing by one residue only), side chains can be 'grown' during molecular dynamics simulations [190] . For slightly lower levels (still of high sequence similarity), side chains are built based on similar environments in known structures [191, 192] . Rotamer libraries (libraries containing all side-chain orientations observed in known structures) are used in the following way. (1) Rotamer distributions are extracted from a database of non-redundant sequences. (2) Fragments of seven (helix, strand) or five residues (other) are compiled. (3) Fragments of the same length are successively shifted through the backbone of U. (4) For modeling the side chains of U, only those fragments from the rotamer library are accepted which have the same amino acid in the center as U, and for which the local backbone is similar to that around the evaluated position. Over the whole range of sequence identity between U and T for which comparative modeling is applicable, the accuracy of the model drops with decreasing similarity. For levels of at least 60% sequence identity, the resulting models are quite accurate [192, 193] ; for even higher values, the models are as accurate as is experimental structure determination. The limiting factor is the computation time required [194] .
Low level of sequence identity: loop regions sometimes correct. With decreasing sequence identity between the template of known structure T and the query protein U, the number of loops that have to be inserted to align the two grows. An accurate modeling of loop regions, however, implies solving the structure prediction problem. The problem is simplified in two ways. First, loop regions are often relatively short and can thus be simulated by molecular dynamics (note the CPU time required for molecular dynamics simulations grows exponentially with the number of residues of the polypeptide to be modeled). Second, the ends of the loop regions are fixed by the backbone of the template structure. Various methods are employed to model loop regions. The best have the orientation of the loop regions correct in some cases [193] . This illustrates the current limitations of molecular dynamics: not even short loop regions can be predicted from sequence. Furthermore, for experimental structure refinement (use of molecular dynamics to improve consistency and accuracy of experimental data), molecular dynamics is successfully applied to find a better solution when starting from an almost correct structure. However, for comparative modeling, molecular dynamics refinement usually reduces prediction accuracy [193] . Below about 40% sequence identity the accuracy of the sequence alignment used as basis for comparative modeling becomes an additional problem. Nevertheless, even down to levels of 25-30% sequence identity, comparative modeling produces coarse-grained models for the overall fold of proteins of unknown structure [187, 75] .
Basic concept. As noted above, the majority of protein pairs with similar structure populate the midnight zone in which the sequence signal alone does not suffice to detect the relation (Fig. 4). If we knew that a protein of unknown structure (U) has a structure similar to that of a protein of known structure (T) that has no detectable sequence similarity to U, we could build the 3D structure of U by (remote) homology modeling based on the template of T. Thus, remote homology modeling must solve three tasks: (1) the remote homologue (T) has to be detected, (2) U and T have to be correctly aligned, and (3) the homology modeling procedure has to be tailored to the harder problem of extremely low sequence identity (with many loop regions to be modeled). Methods that address these goals are often referred to as 'threading' or 'fold recognition' techniques. Most methods developed over the last decade have been primarily optimized to solve the first two problems. The basic idea is to thread the sequence of U into the known structure of T and to evaluate the fitness of sequence for structure by some kind of environment-based or knowledge-based potential [41, 195]. Threading is in some respects a harder problem than the prediction of 3D structure (NP-complete: [196]; no physical connection between remote homologues, as many remotely homologous protein pairs may have originated from different ancestors [35, 67]). However, the stakes are high: solving the threading problem could enable the prediction of thousands of protein structures. Indeed, threading has evolved to become one of the most active fields in the arena of protein structure prediction.
Variety of threading techniques. The optimism generated by one of the first papers published in the field in the 90s [197] has boosted attempts to develop threading methods. The principal idea has been to use structural propensities of amino acids (such as preferences for secondary structure formation, hydrophobicity, and polarity), and then to assess whether or not a given sequence with its structural preferences fits into the structural environment of a given structure [195]. A fundamentally different approach has been pushed by Manfred Sippl [198, 43]. The idea is to use the rich knowledge deposited in the database of protein structures (PDB) by extracting mean-force potentials. Such potentials monitor the observed distances between residue pairs of particular amino acids, with a particular sequence separation (number of residues between the two). Until 1995, most threading methods used mean-force-potentials [41, 42, 43]. A more recent generation of threading methods is based on 1D predictions [199, 37, 200, 201]: first, 1D structure (secondary structure and solvent accessibility) is predicted for a sequence of unknown structure; then the 1D structure is extracted from a library of known structures; and finally the observed and the predicted 1D structure strings are aligned by typical dynamic programming algorithms [78]. Most of the best current approaches attempt to incorporate some or all of the earlier ideas in one method. This is a difficult task, however, because of the great number of possible combinations. Has all this effort managed to crack the hard nut of threading?
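The final step of prediction-based threading, aligning a predicted 1D structure string against an observed one by dynamic programming, can be illustrated with a minimal sketch. It assumes a three-letter secondary structure alphabet (H, E, L) and a global Needleman-Wunsch alignment with a linear gap penalty; the function name `align_1d` and the scoring values are hypothetical choices for illustration, not the parameters of any published method.

```python
def align_1d(predicted, observed, match=2, mismatch=-1, gap=-2):
    """Globally align two 1D structure strings (e.g. 'HHEEL').

    Scores are illustrative assumptions, not published parameters.
    Returns (score, aligned_predicted, aligned_observed).
    """
    n, m = len(predicted), len(observed)
    # score[i][j] = best score for predicted[:i] aligned to observed[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if predicted[i - 1] == observed[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align two states
                              score[i - 1][j] + gap,     # gap in observed
                              score[i][j - 1] + gap)     # gap in predicted

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if predicted[i - 1] == observed[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                a.append(predicted[i - 1])
                b.append(observed[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(predicted[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(observed[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(a)), "".join(reversed(b))
```

In a threading setting, the query's predicted string would be aligned against the observed string of every structure in the fold library, and the library entries would be ranked by alignment score. Practical methods typically also score solvent accessibility and sequence similarity, which this sketch omits.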
Remote homologues can often be detected. First the good news: since the different threading approaches which have been proposed capture different aspects of protein structure, the correct remote homologue is likely to be found by at least one of them [76] . Now the bad news: so far, no single method has been able to detect the correct remote homologue for more than half of all test cases [76, 202, 74] . For the methods which have been rigorously evaluated using large test sets, the correct remote homologue is detected in less than about 40% of all cases [76, 202, 74] . However, this performance is clearly superior to that of traditional sequence alignments at this low level (<25%) of sequence identity. Furthermore, the success of the last Asilomar experiment on structure prediction [203] suggests that the likelihood to detect the correct remote homologue is reasonably high when the choice is refined by experts.
3D prediction by threading is still not reliable. Detecting the remote homology is only the first of the three obstacles. It appears that the second obstacle (correct alignment between U and T) is much more difficult and, unfortunately, there is no general solution so far [186, 204]. Thus the final step, building a 3D model, usually fails since the modeling procedures available today cannot correct the mistakes in the alignments. Although the last Asilomar experiment on structure prediction [203] suggested that major improvements have been accomplished over the last two years, there are still very few publications to date that report accurate 3D predictions from threading methods. Currently, the successful use of threading methods requires skeptical, expert user intervention to spot wrong hits and false alignments. It is still possible that threading methods will become the most successful structure prediction methods, but a lot of detailed work lies ahead.
Recent breakthrough in structure prediction? In the 1994 Asilomar meeting, none of the 3D ab initio methods were able to predict the correct protein structure [193] . Since that time, new methods have been proposed which indicate possible directions for the future. Several groups have obtained promising results using distance geometry methods [37] . Simplified force-fields in combination with dynamic optimization strategies have yielded promising, but still relatively inaccurate results [205, 206] . Srinivasan and Rose have reported very encouraging results with their hierarchical search method [207] . However, the second Asilomar experiment on structure prediction (forthcoming issue of Proteins) concluded similarly to the first: no prediction of 3D structure from sequence, yet.
Accurate prediction of 3D structure for coiled-coil proteins. Coiled-coils form a particular class of proteins. These proteins can be defined by a rather simple geometry of long helices, of which two or more wind around one another [45]. Nilges and Brünger [208] have achieved atomic accuracy in an ab initio prediction of the GCN4 leucine zipper using a hybrid molecular dynamics/simulated annealing search strategy. Recently, equally accurate models for three leucine zippers were obtained with faster calculations based on mean-force-potentials (O'Donoghue, in press).
Recognizing incorrect structures. The single most important theoretical advance in 3D prediction in recent years may have been the development of mean-force-potentials. Before these potentials, structure prediction was normally done with 'physical' potentials, i.e., bonds, angles, torsion angles, and van der Waals as well as electrostatic non-bonded terms which describe the internal energy of the molecule [20]. In contrast, the mean-force-potentials, derived from databases of protein structure [209], attempt to describe the free energy of the molecule. The physical potentials have been used very successfully to refine experimentally determined structures [152]. However, these terms cannot distinguish between a native fold and a grossly misfolded structure [209]. In contrast, mean-force-potentials of pairwise residue distances are quite successful in fold recognition, as well as in remote homology modeling [43]. It remains to be seen how best to combine these two different potentials. In one pilot study on the use of mean-force-potentials for 3D structure prediction, the best results were obtained by combining both potentials (O'Donoghue and Nilges, private communication).
Extracting principles about structure formation from structures? The mean-force-potential approach has recently been extended to study protein folding. Both the physical basis and the general characteristics of protein folding remain controversial [210]. Simulations and other studies indicate that the free energy balance of hydrogen bond formation is close to zero, or slightly unfavorable [211, 212], and that a specific fold is selected primarily by side-chain interactions [210]. Sippl et al. have extended the concept of deriving mean-force potentials to a formalism describing Helmholtz free energies of atom-pair interactions [213]. The formalism starts with the following two assumptions: (1) that protein structures can be described by Helmholtz free energies (or mean-force-potentials), and (2) that the distribution of intra-molecular distances in experimentally determined protein structures does not, on average, deviate substantially from the corresponding distribution in native proteins. To normalize the absolute free energy contributions, the ideal gas is chosen as the reference state (no internal interactions). Without any further assumptions or approximations, atom-atom mean-force-potentials are derived from a data set of known protein structures. The resulting Helmholtz mean-force-potentials unravel interesting principles about protein structure formation. (1) Backbone H-bonds (except for the α-helix interaction Oi ... Ni+4) do not contribute to the thermodynamic stability of native folds. (2) H-bond formation (except for Oi ... Ni+4) requires energy input to break H-bonds with water molecules; this energy is regained when H-bonds are formed intramolecularly. Once formed, H-bonds are locked in a deep, narrow minimum. (3) The energy gain of forming one ionic or two hydrophobic contacts can provide roughly the activation energy required for forming an H-bond.
Both the eloquence and the conclusions of the approach have prompted strong criticism, even unanimous rejection of these findings. Do we witness an error in a method laid out to spot errors, or the beginning of a new era of force fields? Further applications of these mean-force-potentials will be needed to answer this question.
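The way a mean-force-potential is extracted from a database of structures can be illustrated with the standard inverse-Boltzmann relation, E(r) = -kT ln[f_obs(r) / f_ref(r)], where f_obs is the observed frequency of a distance r for a given pair type and f_ref is the frequency in a reference state. The sketch below is a simplified illustration under stated assumptions: the function name, the binning, the pseudocounts, and the caller-supplied reference distribution are all hypothetical choices, and it does not reproduce the Helmholtz formalism of [213].

```python
import math
from collections import Counter


def mean_force_potential(observed, reference, bin_width=1.0, kT=1.0,
                         pseudocount=1.0):
    """Inverse-Boltzmann estimate of a distance-dependent potential.

    observed:  distances (in Angstrom) for one pair type, e.g. Leu-Val
    reference: distances for the reference state (an assumption of this
               sketch: supplied by the caller, e.g. pooled over all pairs)
    Returns a dict mapping distance-bin index to energy in units of kT.
    """
    obs_bins = Counter(int(d // bin_width) for d in observed)
    ref_bins = Counter(int(d // bin_width) for d in reference)
    bins = sorted(set(obs_bins) | set(ref_bins))

    # Pseudocounts keep sparsely populated bins from producing log(0).
    obs_total = len(observed) + pseudocount * len(bins)
    ref_total = len(reference) + pseudocount * len(bins)

    potential = {}
    for b in bins:
        f_obs = (obs_bins[b] + pseudocount) / obs_total
        f_ref = (ref_bins[b] + pseudocount) / ref_total
        # Distances seen more often than in the reference get negative energy.
        potential[b] = -kT * math.log(f_obs / f_ref)
    return potential
```

Bins that are over-represented relative to the reference state receive a negative (favorable) energy, and under-represented bins a positive one. The sparse statistics that limit database methods show up here directly: for rare pair types, the result depends strongly on the pseudocount and on the bin width.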
Several different approaches have been developed to predict aspects of protein function, including sub-cellular localization, post-translational modification, functional type, and protein-protein interactions. For the first two, the most successful approaches rely on identifying a ‘sequence motif’ described by a position-specific scoring matrix (PSSM). For mapping proteins to a set of functional types, text analysis has been applied with great success. Finally, a comprehensive knowledge of protein-protein interactions is necessary to unravel the cause and effect relationships actuated by genetic pathways. Both text analysis and computational approaches have been applied to this problem, notably in the DIP (Database of Interacting Proteins).
Basic concept. Bacterial cells generally consist of a single intracellular compartment surrounded by a plasma membrane. In contrast, eukaryotic cells are elaborately subdivided into functionally distinct, membrane-bounded compartments. Most eukaryotic proteins are encoded in the nuclear genome and synthesized in the cytosol, and many need to be further sorted into other sub-cellular compartments. The sorting signals that direct the movement of a protein through the cell, and thereby determine its eventual sub-cellular localization, are contained in its amino acid sequence [214, 215, 216] . Most proteins do not have a sorting signal; they remain in the cytosol as permanent residents. Many others, however, have specific sorting signals that direct their transport from the cytosol into the nucleus, the ER, mitochondria, plastids (in plants), or peroxisomes; sorting signals can also direct the transport of proteins from the ER to other destinations in the cell. For example, the active transport between the cytosol and the nucleus occurs through the presence of nuclear localization signals (NLSs). NLSs are short stretches of residues (usually positively charged amino acids) that can be found anywhere in the protein sequence [217] . When the final destination is the mitochondrion, the chloroplast, or the secretory pathway, sorting usually relies on the presence of an N-terminal targeting sequence that is recognized by the translocation machinery [218] . Transport of proteins between the Endoplasmic reticulum (ER) and the Golgi apparatus occurs through the action of transport vesicles, which ferry proteins from one compartment to the other [219] . Proteins must be localized in the same subcellular compartment to cooperate towards a common physiological function. Thus, the native subcellular localization of a protein is a strong indicator of gene/protein function. 
Aberrant sub-cellular localization of proteins has been observed in cells affected by several diseases, such as cancer and Alzheimer’s disease. Predicting subcellular localization has become one of the central problems in bioinformatics [220, 221].
Inferring localization through homology. One of the most reliable approaches to annotate subcellular localization is through annotation transfer from homologues [222]. If a protein of experimentally known localization is significantly similar in sequence to a query protein U, localization can be inferred for U. However, even when accepting many errors, less than 35% of the proteins in SWISS-PROT [9] can be classified by homology into one of ten subcellular localization classes.
Prediction of localization through sequence motifs. Another means for predicting localization is the identification of local sequence motifs such as signal peptides and nuclear localization signals (NLS). A number of neural network-based tools identify signal peptides that target proteins to the secretory pathway and the mitochondrion [223, 224]. In a recent benchmark study [225], these tools predicted the signal peptide cleavage site with over 84% accuracy, though the false positive rate was over 18%. We have collected a data set of experimental and potential NLS motifs to predict nuclear localization [226]. These were selected to yield 100% accuracy; however, most proteins have no known NLS. A particular problem for methods detecting N-terminal signals is that start codons are predicted with less than 70% accuracy by genome projects [227]. Overall, known and predicted sequence motifs enable annotating about 30% of the proteins in six eukaryotic proteomes [23, 34].
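How a motif described by a position-specific scoring matrix is located in a sequence can be shown with a short sketch. It assumes the matrix is given as a list of per-position dictionaries mapping residues to scores; the function name `scan_pssm`, the default score for unlisted residues, and the toy matrix favoring lysine and arginine are hypothetical and do not correspond to any published NLS motif collection.

```python
def scan_pssm(sequence, pssm, threshold, default=-1.0):
    """Slide a PSSM along a sequence and report high-scoring windows.

    pssm:      list of dicts, one per motif position, residue -> score
    default:   score for residues absent from a column (assumed value)
    Returns a list of (start_index, score) for windows with
    score >= threshold; start_index is 0-based.
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column.get(residue, default)
                    for column, residue in zip(pssm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy 4-column matrix rewarding basic residues (illustrative only).
TOY_PSSM = [{"K": 2.0, "R": 2.0}] * 4
```

For example, `scan_pssm("MAKKRKVLE", TOY_PSSM, threshold=8.0)` reports the single window starting at index 2 (KKRK). The choice of threshold controls the trade-off between accuracy and coverage noted above: a strict threshold yields few false positives but leaves most proteins without a hit.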
A third approach to predicting localization has been suggested by the observation that the total amino acid composition correlates with the subcellular localization [228, 229, 230] . This observation has led to the development of a variety of prediction methods based solely on composition [231, 227, 232] . With the availability of large numbers of completely sequenced genomes, phylogenetic profiles have been employed to identify subcellular localization. So far, this approach has been much less accurate than methods based solely on composition. Other methods have tried to integrate rules based on amino acid composition with databases of known signal sequences, e.g., PSORT II is a knowledge-based expert system that integrates the two kinds of information [233] . In particular, PSORT II uses other original prediction methods such as SignalP [234] , ChloroP [235] , and NNPSL [227] as input. Consequently, we may expect that PSORT II would improve if these original methods were improved. Drawid & Gerstein have proposed a Bayesian system based on a diverse range of 30 different features [236] . They applied their method to predicting localization of the full Saccharomyces cerevisiae proteome and provide estimates of the fraction of all yeast proteins found in different compartments.
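The idea behind composition-based prediction can be sketched in a few lines. The methods cited above use machine learning techniques trained on large data sets; the sketch below swaps in a much simpler nearest-profile rule purely for illustration. The function names and the Euclidean distance measure are assumptions, and the class profiles would have to be computed from proteins of experimentally known localization.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the 20-dimensional amino acid composition of a sequence."""
    counts = Counter(sequence)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def nearest_class(sequence, class_profiles):
    """Assign the localization class whose mean composition is closest.

    class_profiles: dict mapping class name -> 20-dimensional composition
    (hypothetical input; a stand-in for a trained classifier).
    """
    query = composition(sequence)

    def squared_distance(profile):
        return sum((q - p) ** 2 for q, p in zip(query, profile))

    return min(class_profiles,
               key=lambda name: squared_distance(class_profiles[name]))
```

Because the composition vector discards residue order, such a predictor cannot see sorting signals directly. It captures only the statistical tendency of proteins in a compartment to share a similar overall composition.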
| Predictor | Web address (URL) |
| ChloroP | |
| MitoProt | http://www.mips.biochem.mpg.de/cgi-bin/proj/medgen/mitofilter |
| NNPSL | |
| predictNLS | |
| Predotar | |
| PSORT | |
| SignalP | |
| SubLoc | |
| TargetP | |
| TMHMM | |
Predict localization: problem not solved, yet. The complex compartmentalization of the eukaryotic cell cannot as yet be accurately captured by bioinformatics tools. For proteins with annotated homologues and in cases where the targeting signal sequence can be identified, currently available sequence analysis tools do a good job. In cases where the sorting signals are presented in the context of a folded protein, they are very difficult to identify and one often has to resort to purely statistical approaches (amino acid composition). With improved fold recognition and three-dimensional structure prediction algorithms, it may eventually become possible both to detect these more complex sorting signals and to predict the localization of a protein based on its general surface characteristics [237, 238] .
Basic concept. Over 250 structural and regulatory post-translational modifications occur in proteins [239], but prediction has so far concentrated on a vital few. Prediction methods for post-translational modifications are generally based on highly conserved sequence motifs, or on more complex patterns of amino acid composition combined with structural properties or surface accessibility. Prominent post-translational modifications targeted for prediction include N-terminal signal peptide cleavage sites, proteolytic cleavage and more specifically proteasome cleavage sites, phosphorylation sites, lipid modifications, and N- and O-glycosylation [240].
Archiving known sequence motifs and predicting modifications. A database of phosphorylation sites, PhosphoBase [241], includes information on over 400 phosphorylated proteins, their phosphorylation sites, and the specific kinase of action. Furthermore, the site has incorporated the work of Blom, Gammeltoft & Brunak for the prediction of eukaryotic phosphorylation sites of independent sequences [242]. Specifically, predictions for peptide sequences at serine, threonine, and tyrosine residues were achieved with 69% to 96% sensitivity. The technique used a neural network approach based on both sequence and structure composition of the modification site. However, given the difficulty of predicting the structural conformation around the site of phosphorylation and the widely varying consensus sequences for kinase substrate specificity, prediction of protein phosphorylation remains difficult to fully optimize. A similar neural network approach based on charged residues within glycosylation sites together with sequence context and surface accessibility was used to identify O-glycosylation modifications at 83% accuracy [243]. The limited substrate specificity of both N-glycosylation and O-glycosylation has so far restricted the field mostly to the discovery of modification sites [244, 245], with limited success in the prediction of protein glycosylation. Among lipid modifications, prediction has advanced most for glycosylphosphatidylinositol (GPI) anchor modifications [246, 247]. The examination of amino acid consensus regions near the C-terminal (omega-site) GPI-modification site, and particularly of the physical properties in those regions, resulted in 83% accuracy for predicting the effect of mutations in the GPI sequence motif. Furthermore, an evaluation of N-terminal N-myristoylation by the enzyme myristoyl-CoA:protein N-myristoyltransferase (NMT) achieved 95% prediction sensitivity for NMT substrate sites [248].
Finally, a comprehensive study of protein degradation using proteasome digestion data led to the prediction of the MHC Class I ligand boundaries that result from proteasomal degradation of polypeptides [249]. Cleavage sites were determined with 65% accuracy, and 85% of non-cleavage sites were correctly identified.
Basic concept. Protein functions can be classified in different ways. Biochemically, functions can be categorized by the molecular mechanisms underlying the biological reactions, e.g., by the EC (Enzyme Commission) number. Alternatively, functions can be sorted according to their cellular roles, e.g., protein synthesis, energy metabolism, etc. One of the most widely used functional category schemes was originally proposed by Monica Riley for annotating the E. coli genome [250], and was later modified slightly and adopted by TIGR (The Institute for Genomic Research) [21, 251] and many other genome centers. The TIGR version of the classification (role category) has 16 different categories that can be further divided into sub-categories. They are: "Amino acid biosynthesis", "Biosynthesis of cofactors, prosthetic groups, and carriers", "Cell envelope", "Cellular processes", "Central intermediary metabolism", "DNA metabolism", "Energy metabolism", "Fatty acid and phospholipid metabolism", "Other categories", "Protein fate", "Protein synthesis", "Purines, pyrimidines, nucleosides, and nucleotides", "Regulatory functions", "Signal transduction", "Transcription", and "Transport and binding proteins". Functional class assignment is not only important in annotating individual proteins, but also provides the overall picture for whole genomes and allows for convenient functional comparison between organisms. However, automatic assignment is not a trivial task, even for proteins with experimental annotation, since annotations in current protein databases are too detailed, diverse, and non-standardized. The problem is more severe when there is no experimental information about the protein.
Functional class assignment by text analysis. SWISS-PROT is one of the most comprehensive curated protein databases. It has collected a wealth of functional annotations, and has condensed them into a predefined set of keywords, which makes it an ideal source for text analysis for functional class assignment. A text-mining tool, Euclid, was developed to automatically classify proteins into functional classes based on their SWISS-PROT annotations [252]. It learned the relationship between SWISS-PROT keywords and the classes from a set of manually classified sequences, and applied these relationships to new sequences. It was able to classify 52% of the sequences in the Mycoplasma genitalium genome, slightly fewer than human experts, and more than 80% of the classifications were correct. Although Euclid performs reasonably well, it is still desirable to have a straightforward way of relating functional annotation in databases to the functional class. Gene Ontology (GO) [253] is bringing this hope into reality. The goal of GO is to provide controlled vocabularies for the description of the molecular function, biological process, and cellular component of gene products. This set of vocabularies will be standardized and adopted by most major databases. The GO consortium has already provided indices of other classification systems to GO, including the TIGR role categories (http://www.geneontology.org/external2go/tigr2go). Therefore, as long as the functional annotations in a database use GO terms, it will take little effort to translate them into functional classes.
Functional class assignment by homology. For proteins without clear annotations in the databases, functional class can sometimes be inferred by sequence homology. The conservation of functional classes at different levels of sequence identity has been established by Devos and Valencia [254]. At sequence identity levels of more than 85%, functional classes are conserved in over 90% of the cases. The conservation drops to 70% if two sequences are between 30% and 70% identical. No reliable inference can be made if the identity level is below 20%. To address the problem of the relatively poor signal-to-noise ratio inherent in pairwise alignment, a profile HMM based method was developed by TIGR to obtain more accurate and sensitive functional assignment through homology [255]. TIGRFAMs is a collection of protein family HMMs that are built from highly curated multiple alignments of proteins thought to share the same function or to be members of the same family. HMM searches result in a score measuring the probability that the query protein is in the same family as the proteins used to build the model. TIGRFAMs has been used by TIGR as part of its automatic functional assignment toolkit, and by other genome centers as well.
Functional classes can be predicted from sequence. The most recent breakthrough in the field of predicting protein function came through a collaboration of the groups from Søren Brunak (CBS Copenhagen) and Alfonso Valencia (CNB Madrid). Their ends are to predict cellular function from sequence alone. Their means are a complicated, elaborate, and hierarchical system of neural networks [256] . A first group of networks is used to identify which particular global features describing a protein (like length or amino acid composition) separate best between any two types of functional classes. These basic predictions are then combined into a final prediction step, again through neural networks. The system is accurate enough to be applied in the context of entire genome analyses.
Basic concept. Every protein has a biological function, yet most biological functions are carried out by groups of proteins interacting in complex networks. Interactions between proteins can be physical, i.e., by chemically binding each other or by binding together to a third substrate, or they can be functional, e.g., by controlling each other's expression or by participating in the same biochemical pathway. To fully understand the molecular mechanism that underlies a certain biological function (or malfunction), we need to decipher its meticulous network of protein interactions. Therefore, an extensive research effort is invested in both experimental and computational methods that unravel protein-protein interactions [257, 258, 259, 260, 261]. In particular, many methods and databases attempt to draw complete maps of interactions for entire proteomes. Once it is known with which other proteins a newly discovered protein interacts, it will be easier to predict its function. Furthermore, it is hoped that these interaction maps will surrender the secrets of biological processes, and enhance the understanding of the molecular mechanisms that underlie them. A complete picture of all the proteins that are involved in a certain biological process would also break new ground in drug development by identifying new targets for drugs.
Databases and data-mining techniques compile existing information. A vast amount of information about protein-protein interactions already exists in the literature. However, this information is scattered across millions of text pages of scientific publications. A few different enterprises are aimed at extracting this information from the literature [262, 263, 264, 265, 266, 267, 261, 268] . The DIP database [268] is an example of a database that is dedicated to protein interactions. The curators of DIP manually survey the literature to find experimentally determined interactions. They also employ automatic techniques to obtain data from other databases. Other approaches to this problem use natural language processing algorithms, as well as other computational methods, to automatically extract interaction information from scientific papers [269, 270, 271, 272, 273, 274, 275] . SUISEKI, a system for information extraction on interactions, [270] , is reported to successfully extract 70-80% of the interactions in a large corpus of scientific abstracts.
Limitation of the experimental methods. Traditionally, the discovery of hitherto unknown protein-protein interactions is based on experiments that typically deal with a small number of proteins. Therefore, the study of protein interactions lags behind the large-scale sequencing efforts. Experimentalists try to deal with this ever-widening gap by developing high-throughput laboratory methods that study many proteins at once. The two-hybrid system is the most prominent among these new experimental methods [276, 277, 278, 279, 280, 281, 282, 283, 284]. Using the two-hybrid system, it was possible to identify thousands of hitherto unknown protein-protein interactions. Other systematic experimental techniques aim at unraveling large-scale networks of protein-protein interactions through mass spectrometry of isolated protein complexes [285, 286, 287, 288, 289], protein chips [290], and hybrid approaches [291, 292]. All these techniques will enhance our understanding of protein-protein networks significantly. Nevertheless, at this point the rates of false positives (detected interactions that do not really occur in the cell) and of false negatives (genuine interactions that are not detected) are difficult to assess. Furthermore, all methods involve extensive work and expertise, and are expensive.
Computational approaches predict protein-protein interactions. Many groups attempt to develop computational methods to predict protein-protein interactions in silico [261]. David Eisenberg and his co-workers developed methods that are based on comparative genomics. Their Rosetta stone method screens genomes for sequences that appear as two different chains in one genome, and are fused to create a single protein chain in another genome, which is evolutionarily younger [293]. The assumption is that evolution fused these two proteins into a single one because they interact with one another. Hence, when we find a pair of proteins of this sort, we suspect that they interact. Another comparative method developed by this group searches for pairs of proteins that always occur together in all known genomes, i.e., there is no genome in which only one of the two proteins occurs [294]. These types of protein pairs also very likely interact. Using these two methods, Eisenberg and his co-workers proposed thousands of protein pairs that may interact. However, there are no confirmed statistics regarding the reliability of the predictions of these methods. Alfonso Valencia and his group developed two methods that predict protein interactions from sequence [158, 159]. Conceptually, their approach is based on the assumption that interacting proteins evolve together; hence the mutations that occur in two interacting proteins in the course of evolution should be correlated. By analyzing the correlation between the mutations in different proteins across different species, they succeeded in correctly identifying protein interactions. Preliminary results indicate that the predictions of these methods have a low false negative rate. Sprinzak and Margalit predicted protein-protein interactions based on a very simple concept: proteins with similar binding motifs or domains are likely to interact [295]. The method can be improved by adding filters that take into account entire networks [296].
However, the major restriction of all these methods is that they cannot be applied to arbitrary pairs of proteins. Another shortcoming of these methods is that they merely indicate whether a pair of proteins interacts, but they do not identify the interaction sites - a crucial piece of information for molecular research.
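The co-occurrence criterion described above (no genome contains only one of the two proteins) amounts to comparing presence/absence profiles across genomes. A minimal sketch follows, assuming the input is a dictionary mapping each protein family to the set of genomes in which a homologue was detected; the function name and the data layout are hypothetical, and real analyses must additionally decide how homologues are detected and how many genomes are needed for a profile to be informative.

```python
from itertools import combinations


def cooccurring_pairs(presence):
    """Return protein pairs with identical, non-empty phylogenetic profiles.

    presence: dict mapping protein name -> set of genome names in which
              the protein occurs (hypothetical input format).
    A pair is reported when there is no genome containing only one of the
    two proteins, i.e. when both occur in exactly the same genomes.
    """
    pairs = []
    for a, b in combinations(sorted(presence), 2):
        # Skip proteins found in no genome: an empty profile carries no signal.
        if presence[a] and presence[a] == presence[b]:
            pairs.append((a, b))
    return pairs
```

With only a handful of genomes, many unrelated proteins share a profile by chance, so the number of genomes compared bounds the specificity of this test. A profile match also says nothing about which residues form the interface, which reflects the shortcoming noted above.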
Native 3D structures of proteins are encoded by a linear sequence of amino acid residues. To predict 3D structure from sequence is a task challenging enough to have occupied a generation of researchers. Have they finally succeeded in their goal? The bad news is: no, we still cannot predict structure for any sequence. The good news is: we have come closer, and growing databases facilitate the task.
Sequence alignment. Sequence alignment is the basis of the computational study of proteins. Virtually all the methods and algorithms in the field are based at least to some extent on sequence alignment. In particular, the database searching tools that underlie most of the prediction methods implement sequence alignment algorithms. Although the basic algorithms for alignments were introduced decades ago, recent advancements have improved the performance of these algorithms substantially. The growth of the sequence databases contributes to the improvements of these methods. One challenge that still remains is the development of a statistical system to assess alignment and database searching tools. This is of particular importance in the so-called twilight zone.
Prediction in 3D: theory bridges the sequence-structure gap. The only source of new, unique protein structures (structures for which no homologue exists in the database) is experiment. However, given the amount of time needed to determine a protein structure experimentally, more non-unique structures can be predicted at atomic resolution by homology modeling in a month than have been determined experimentally over the last three decades. Homology-derived models are frequently accurate at the level of atomic resolution. Unfortunately, most models have considerable co-ordinate errors in loop regions. Coarse-grained homology-derived models are available for almost one third of the sequences deposited in the SWISS-PROT database [297]. Threading techniques could increase this ratio considerably by finding more distant homologues. However, threading techniques are not yet reliable enough for large-scale sequence analyses.
Predictions in 1D: significant improvement by larger databases. The rich information contained in the growing sequence and structure databases has been used to improve the accuracy of predictions of some aspects of protein structure. Evolutionary information is successfully used for predictions of secondary structure, solvent accessibility, and transmembrane helices. These predictions of protein structure in 1D are significantly more accurate and more useful than they were five years ago. Some methods have indicated that 1D predictions can be useful as an intermediate step on the way to predicting 3D structure (inter-strand contacts; prediction-based threading). Another advantage of predictions in 1D is that they are not very CPU-intensive, i.e., 1D structure can be predicted overnight for all protein sequences of, for example, entire yeast chromosomes.
Predictions in 2D and 3D: so far of limited success. The prediction accuracy of chain-distant inter-residue contacts is so far relatively limited. Analysis of correlated mutations can be used to distinguish between alternative models (e.g. for threading techniques). The prediction of inter-strand contacts appears to be useful in some cases. An accurate method for the automatic prediction of contacts between residues not close in sequence remains to be developed. Most breakthroughs in protein structure prediction were achieved over the last six years. Thus, although we still cannot solve the general prediction problem, progress has been made. In particular, in the last few years the idea of replacing, or at least combining, physical-theoretical energy calculations with knowledge-derived ones has yielded exciting results.
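To make the notion of correlated mutations concrete, the sketch below scores pairs of alignment columns by mutual information. Mutual information is used here only as a simple stand-in; it is not the correlation measure of the methods cited in this chapter, and the toy alignment is hypothetical. The sketch also ignores corrections for phylogeny and for small alignment depth.

```python
from collections import Counter
from math import log2

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (x, y), c in count_ij.items():
        p_xy = c / n
        mi += p_xy * log2(p_xy / ((count_i[x] / n) * (count_j[y] / n)))
    return mi

# Hypothetical toy alignment: one string per sequence, one column per position.
alignment = ["AKLE", "AKLE", "GRLD", "GRLD"]
columns = list(zip(*alignment))

mi_01 = mutual_information(columns[0], columns[1])  # columns that change together
mi_02 = mutual_information(columns[0], columns[2])  # column 2 is fully conserved
print(mi_01, mi_02)
```

High-scoring column pairs are only candidates for spatial contacts; as stated above, the accuracy of such predictions for residues far apart in sequence remains limited.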
Prediction of function: first successes scored. As opposed to protein structure, which is expressed in a formal and rigorous way using the coordinates of each atom, function is a more vague and multifaceted notion. Consequently, it is easier to define the goals and to assess the performance of structure prediction methods than of function prediction methods. Function prediction is composed of a few different aspects. Each of them is based on different methods with different levels of success. The first breakthroughs have been made in predicting protein-protein interactions and cellular function from sequence. The combination of these methods may aid the advance of molecular biology considerably.
Thanks to Phil Carter and Hedi Hegyi (both Columbia University) for helpful comments. Particular thanks to Volker Eyrich (Columbia) for programming and maintaining most of the immensely valuable software that runs the EVA and META-PredictProtein servers! This work was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health (NIH), and the grant DBI-0131168 from the National Science Foundation (NSF). Last but not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.
| 1. | Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucl. Acids Res., 28, 235-242. |
| 2. | Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol., 266, 525-539. |
| 3. | Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-599. |
| 4. | Rost, B. & Sander, C. (1994). Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19, 55-72. |
| 5. | Rost, B. & Sander, C. (1994). Conservation and prediction of solvent accessibility in protein families. Proteins, 20, 216-226. |
| 6. | Rost, B., Casadio, R., Fariselli, P. & Sander, C. (1995). Prediction of helical transmembrane segments at 95% accuracy. Prot. Sci., 4, 521-533. |
| 7. | Rost, B., Casadio, R. & Fariselli, P. (1996). Topology prediction for helical transmembrane proteins at 86% accuracy. Prot. Sci., 5, 1704-1718. |
| 8. | Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195-202. |
| 9. | Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res., 28, 45-48. |
| 10. | Brändén, C. & Tooze, J. (1991). Introduction to Protein Structure. Garland Publ., New York, London. |
| 11. | Lattman, E. E. & Rose, G. D. (1993). Protein folding-what's the question? Proc. Natl. Acad. Sci. U.S.A., 90, 439-441. |
| 12. | Anfinsen, C. B. (1973). Principles that govern the folding of protein chains. Science, 181, 223-230. |
| 13. | Corrales, F. J. & Fersht, A. R. (1996). Kinetic significance of GroEL14 . (GroES7)2 complexes in molecular chaperone activity. Folding & Design, 1, 265-273. |
| 14. | Joachimiak, A. (1997). Capturing the misfolds: chaperone-peptide-binding motifs. Nat. Struct. Biol., 4, 430-434. |
| 15. | Ellis, R. J., Dobson, C. & Hartl, U. (1998). Sequence does specify protein conformation. TIBS, 23, 468. |
| 16. | Gottesman, M. E. & Hendrickson, W. A. (2000). Protein folding and unfolding by Escherichia coli chaperones and chaperonins. Curr. Opin. Microbiol., 3, 197-202. |
| 17. | Levitt, M. & Warshel, A. (1975). Computer simulation of protein folding. Nature, 253, 694-698. |
| 18. | Hagler, A. T. & Honig, B. (1978). On the formation of protein tertiary structure on a computer. Proc. Natl. Acad. Sci. U.S.A., 75, 554-558. |
| 19. | Honig, B. (1993). Theory and simulation. Curr. Opin. Str. Biol., 3, 223-224. |
| 20. | van Gunsteren, W. F. (1993). Molecular dynamics studies of proteins. Curr. Opin. Str. Biol., 3, 167-174. |
| 21. | Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F. et al. (1995). Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496-512. |
| 22. | Frishman, D. & Mewes (1997). PEDANTic genome analysis. TIGS, 13, 415-416. |
| 23. | Liu, J. & Rost, B. (2001). Comparing function and structure between entire proteomes. Prot. Sci., 10, 1970-1979. |
| 24. | Carter, P., Liu, J. & Rost, B. (2002). PEP: Predictions for Entire Proteomes. Nucl. Acids Res., submitted. |
| 25. | The C. elegans Sequencing Consortium (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 2012-2018. |
| 26. | Adams, M. D., Celniker, S. E., Holt, R. A., Evans, C. A., Gocayne, J. D. et al. (2000). The genome sequence of Drosophila melanogaster. Science, 287, 2185-2195. |
| 27. | Arabidopsis Genome Initiative (2000). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796-815. |
| 28. | The genome international sequencing consortium (2001). Initial sequencing and analysis of the human genome. Nature, 409, 860-921. |
| 29. | Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J. et al. (2001). The Human genome. Science, 291, 1304-1351. |
| 30. | Kawai, J., Shinagawa, A., Shibata, K., Yoshino, M., Itoh, M. et al. (2001). Functional annotation of a full-length mouse cDNA collection. Nature, 409, 685-690. |
| 31. | Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D. et al. (1977). The Protein Data Bank: a computer based archival file for macromolecular structures. J. Mol. Biol., 112, 535-542. |
| 32. | Lesk, A. M., Lo Conte, L. & Hubbard, T. J. P. (2001). Assessment of novel folds targets in CASP4: Predictions of three-dimensional structures, secondary structures, and interresidue contacts. Proteins, 45 Suppl 5, 98-118. |
| 33. | Rost, B. (1999). Twilight zone of protein sequence alignments. Prot. Engin., 12, 85-94. |
| 34. | Liu, J. & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics, 18, 922-933. |
| 35. | Rost, B. (1997). Protein structures sustain evolutionary drift. Folding & Design, 2, S19-S24. |
| 36. | Rost, B. & Valencia, A. (1996). Pitfalls of protein sequence analysis. Curr. Opin. Biotech., 7, 457-461. |
| 37. | Rost, B. & O'Donoghue, S. I. (1997). Sisyphus and prediction of protein structure. CABIOS, 13, 345-356. |
| 38. | Rost, B., O'Donoghue, S. & Sander, C. (1998). Midnight zone of protein structure evolution. EMBL Heidelberg, . |
| 39. | Jones, D. & Thornton, J. (1993). Protein fold recognition. J. Comp.-Aided Mol. Design, 7, 439-456. |
| 40. | Sippl, M. J. (1993). Boltzmann's principle, knowledge based mean fields and protein folding. An approach to the computational determination of protein structures. J. Comp.-Aided Mol. Design, 7, 473-501. |
| 41. | Wodak, S. J. & Rooman, M. J. (1993). Generating and testing protein folds. Curr. Opin. Str. Biol., 3, 247-259. |
| 42. | Bryant, S. H. & Altschul, S. F. (1995). Statistics of sequence-structure threading. Curr. Opin. Str. Biol., 5, 236-244. |
| 43. | Sippl, M. J. (1995). Knowledge-based potentials for proteins. Curr. Opin. Str. Biol., 5, 229-235. |
| 44. | Finkelstein, A. V. (1997). Protein structure: what is it possible to predict now? Curr. Opin. Str. Biol., 7, 60-71. |
| 45. | Lupas, A. (1996). Coiled coils: new structures and new functions. TIBS, 21, 375-382. |
| 46. | Rost, B. (2002). Did evolution leap to create the protein universe? Curr. Opin. Str. Biol., 12, 409-416. |
| 47. | NIGMS (2001). Structural genomics initiatives. 2001, . |
| 48. | Lima, C. D., Klein, M. G. & Hendrickson, W. A. (1997). Structure-based analysis of catalysis and substrate definition in the HIT protein family. Science, 278, 286-290. |
| 49. | Gaasterland, T. (1998). Structural genomics: bioinformatics in the driver's seat. Nat. Biotechnol., 16, 625-627. |
| 50. | Rost, B. (1998). Marrying structure and genomics. Structure, 6, 259-263. |
| 51. | Sali, A. (1998). 100,000 protein structures for the biologist. Nat. Struct. Biol., 5, 1029-1032. |
| 52. | Burley, S. K., Almo, S. C., Bonanno, J. B., Capel, M., Chance, M. R. et al. (1999). Structural genomics: beyond the human genome project. Nat. Gen., 23, 151-157. |
| 53. | Blundell, T. L. & Mizuguchi, K. (2000). Structural genomics: an overview. Prog Biophys Mol Biol, 73, 289-295. |
| 54. | Edwards, A. M., Arrowsmith, C. H., Christendat, D., Dharamsi, A., Friesen, J. D. et al. (2000). Protein production: feeding the crystallographers and NMR spectroscopists. Nat. Struct. Biol., 7, 970-972. |
| 55. | Moult, J. & Melamud, E. (2000). From fold to function. Curr. Opin. Str. Biol., 10, 384-389. |
| 56. | Shapiro, L. & Harris, T. (2000). Finding function through structural genomics. Curr. Opin. Biotech., 11, 31-35. |
| 57. | Brenner, S. E. (2001). A tour of structural genomics. Nature, 2, 801-809. |
| 58. | Thornton, J. (2001). Structural genomics takes off. TIBS, 26, 88-89. |
| 59. | Vitkup, D., Melamud, E., Moult, J. & Sander, C. (2001). Completeness in structural genomics. Nat. Struct. Biol., 8, 559-566. |
| 60. | Rost, B., Honig, B. & Valencia, A. (2002). Bioinformatics in structural genomics. Bioinformatics, 18, 897. |
| 61. | Montelione, G. T. (2001). Northeast Structural Genomics Consortium. . |
| 62. | Sander, C. & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins, 9, 56-68. |
| 63. | Abagyan, R. A. & Batalov, S. (1997). Do aligned sequences share the same fold? J. Mol. Biol., 273, 355-368. |
| 64. | Park, J., Teichmann, S. A., Hubbard, T. & Chothia, C. (1997). Intermediate sequences increase the detection of distant sequence homologies. J. Mol. Biol., 273, 349-354. |
| 65. | Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1998). Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. U.S.A., 95, 6073-6078. |
| 66. | Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D. et al. (1998). Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201-1210. |
| 67. | Yang, A. S. & Honig, B. (2000). An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. J. Mol. Biol., 301, 679-689. |
| 68. | Linial, M. & Yona, G. (2000). Methodologies for target selection in structural genomics. Prog. Biophys. molec. Biol., 73, 297-320. |
| 69. | Liu, J. & Rost, B. (2002). CHOP proteomes into structural domains. Bioinformatics, in preparation. |
| 70. | Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers, 22, 2577-2637. |
| 71. | Rost, B. & Sander, C. (1994). 1D secondary structure prediction through evolutionary profiles. In Protein Structure by Distance Analysis (Bohr, H. & Brunak, S., eds.), pp. 257-276, IOS Press, Amsterdam, Oxford, Washington. |
| 72. | Chandonia, J. M. & Karplus, M. (1999). New methods for accurate prediction of protein secondary structure. Proteins, 35, 293-306. |
| 73. | Rost WWW, B., Eyrich, V. A., Przybylski, D., Pazos, F., Valencia, A. et al. (2000). EVA - Evaluation of automatic protein structure prediction services. Columbia University / Rockefeller University / CNB Madrid, WWW document (http://cubic.bioc.columbia.edu/eva). |
| 74. | Eyrich, V., Martí-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242-1243. |
| 75. | Marti-Renom, M. A., Madhusudhan, M. S., Fiser, A., Rost, B. & Sali, A. (2002). Reliability of assessment of protein structure prediction methods. Structure, 10, 435-440. |
| 76. | Bujnicki, J. M., Elofsson, A., Fischer, D. & Rychlewski, L. (2001). LiveBench-1: continuous benchmarking of protein structure prediction servers. Prot. Sci., 10, 352-361. |
| 77. | Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443-53. |
| 78. | Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol., 147, 195-197. |
| 79. | Alexandrov, N. N. & Soloveyev, V. V. (1998). Statistical significance of ungapped sequence alignments. In HICCS' 98: Pacific Symposium on Biocomputing' 98 (Altman, R. B., Dunker, A. K., Hunter, L. & Klein, T. E., eds.), pp. 463-472, World Scientific, Maui, Hawaii, U.S.A. |
| 80. | Kabsch, W. & Sander, C. (1984). On the use of sequence homologies to predict protein structure: Identical pentapeptides can have completely different conformations. Proc. Natl. Acad. Sci. U.S.A., 81, 1075-1078. |
| 81. | Minor, D. L. J. & Kim, P. S. (1996). Context-dependent secondary structure formation of a designed protein sequence. Nature, 380, 730-734. |
| 82. | Doolittle, R. F. (1986). Of URFs and ORFs: a primer on how to analyze derived amino acid sequences. University Science Books, Mill Valley California. |
| 83. | Bryant, S. H. & Amzel, L. M. (1987). Correctly folded proteins make twice as many hydrophobic contacts. J. Int. Pept. Prot. Res., 29, 46-52. |
| 84. | Altschul, S. F. & Gish, W. (1996). Local alignment statistics. Meth. Enzymol., 266, 460-480. |
| 85. | Higgins, D. G., Thompson, J. D. & Gibson, T. J. (1996). Using CLUSTAL for multiple sequence alignments. Meth. Enzymol., 266, 383-402. |
| 86. | Pearson, W. R. (1996). Effective protein sequence comparison. Meth. Enzymol., 266, 227-258. |
| 87. | Taylor, W. R. (1996). Multiple protein sequence alignment: algorithms and gap insertion. Meth. Enzymol., 266, 343-367. |
| 88. | Altschul, S., Madden, T., Shaffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389-3402. |
| 89. | Jeanmougin, F., Thompson, J. D., Gouy, M., Higgins, D. G. & Gibson, T. J. (1998). Multiple sequence alignment with Clustal X. TIBS, 23, 403-405. |
| 90. | Higgins, D. G. (2000). Amino acid-based phylogeny and alignment. Adv. Prot. Chem., 54, 99-135. |
| 91. | Dickerson, R. E., Timkovich, R. & Almassy, R. J. (1976). The cytochrome fold and the evolution of bacterial energy metabolism. J. Mol. Biol., 100, 473-491. |
| 92. | Barton, G. J. (1996). Protein sequence alignment and database scanning. In Protein structure prediction (Sternberg, M. J. E., eds.), pp. 31-64, Oxford Univ. Press, Oxford. |
| 93. | Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14, 755-763. |
| 94. | Karplus, K., Barrett, C. & Hughey, R. (1998). Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14, 846-856. |
| 95. | Rost, B. (2001). Protein secondary structure prediction continues to rise. J. Struct. Biol., 134, 204-218. |
| 96. | Przybylski, D. & Rost, B. (2002). Alignments grow, secondary structure prediction improves. Proteins, 46, 195-205. |
| 97. | Henikoff, S. & Henikoff, J. G. (1993). Performance evaluation of amino acid substitution matrices. Proteins, 17, 49-61. |
| 98. | Rost, B. (2002). Enzyme function less conserved than anticipated. J. Mol. Biol., 318, 595-608. |
| 99. | Wrzeszczynski, K. O. & Rost, B. (2002). In silico analysis of retention signals for Endoplasmic reticulum and Golgi apparatus. Proteins, submitted. |
| 100. | Wrzeszczynski, K. O. & Rost, B. (2002). Cataloguing proteins in cell cycle control. In Cell cycle checkpoint control protocols (Lieberman, H., eds.), pp. submitted, Humana Press, Totowa, NJ. |
| 101. | Rost, B. & Sander, C. (1993). Improved prediction of protein secondary structure by use of sequence profiles and neural networks. Proc. Natl. Acad. Sci. U.S.A., 90, 7558-7562. |
| 102. | Rost WWW, B. (2000). PredictProtein - internet prediction service. . |
| 103. | Eyrich, V., Martí-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. 2001, . |
| 104. | Cuff, J. A. & Barton, G. J. (2000). Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins, 40, 502-511. |
| 105. | Rost WWW, B. (2000). Better secondary structure prediction through more data. Columbia University, WWW document (http://cubic.bioc.columbia.edu/predictprotein). |
| 106. | Rost, B., Sander, C. & Schneider, R. (1994). Redefining the goals of protein secondary structure prediction. J. Mol. Biol., 235, 13-26. |
| 107. | Young, M., Kirshenbaum, K., Dill, K. A. & Highsmith, S. (1999). Predicting conformational switches in proteins. Prot. Sci., 8, 1752-1764. |
| 108. | Kirshenbaum, K., Young, M. & Highsmith, S. (1999). Predicting allosteric switches in myosins. Prot. Sci., 8, 1806-1815. |
| 109. | Cohen, F. E. & Presnell, S. R. (1996). The combinatorial approach. In Protein structure prediction (Sternberg, M. J. E., eds.), pp. 207-228, Oxford Univ. Press, Oxford. |
| 110. | Lee, B. K. & Richards, F. M. (1971). The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol., 55, 379-400. |
| 111. | Chothia, C. (1976). The nature of the accessible and buried surfaces in proteins. J. Mol. Biol., 105, 1-12. |
| 112. | Holbrook, S. R., Muskal, S. M. & Kim, S.-H. (1990). Predicting surface exposure of amino acids from protein sequence. Prot. Engin., 3, 659-665. |
| 113. | Hirakawa, H., Muta, S. & Kuhara, S. (1999). The hydrophobic cores of proteins predicted by wavelet analysis. Bioinformatics, 15, 141-148. |
| 114. | Mucchielli-Giorgi, M. H., Hazout, S. & Tuffery, P. (1999). PredAcc: prediction of solvent accessibility. Bioinformatics, 15, 176-177. |
| 115. | Carugo, O. (2000). Predicting residue solvent accessibility from protein sequence by considering the sequence environment. Prot. Engin., 13, 607-609. |
| 116. | Li, X. & Pan, X. M. (2001). New method for accurate prediction of solvent accessibility from protein sequence. Proteins, 42, 1-5. |
| 117. | Naderi-Manesh, H., Sadeghi, M., Arab, S. & Moosavi Movahedi, A. A. (2001). Prediction of protein surface accessibility with information theory. Proteins, 42, 452-459. |
| 118. | Fariselli, P. & Casadio, R. (2000). Prediction of the number of residue contacts in proteins. Ismb, 8, 146-151. |
| 119. | Fariselli, P. & Casadio, R. (2001). RCNPRED: prediction of the residue co-ordination numbers in proteins. Bioinformatics, 17, 202-204. |
| 120. | von Heijne, G. (1996). Prediction of transmembrane protein topology. In Protein structure prediction (Sternberg, M. J. E., eds.), pp. 101-110, Oxford Univ. Press, Oxford. |
| 121. | Ikeda, M., Arai, M., Lao, D. M. & Shimizu, T. (2001). Transmembrane topology prediction methods: A re-assessment and improvement by a consensus method using a dataset of experimentally-characterized transmembrane topologies. In Silico Biol., 1, http://www.bioinfo.de/isb/2001/02/0003/. |
| 122. | Möller, S., Croning, D. R. & Apweiler, R. (2001). Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics, 17, 646-653. |
| 123. | Simon, I., Fiser, A. & Tusnady, G. E. (2001). Predicting protein conformation by statistical methods. Biochim Biophys Acta, 1549, 123-136. |
| 124. | Chen, C. P. & Rost, B. (2002). State-of-the-art in membrane prediction. Appl. Bioinf., 1, 21-35. |
| 125. | Persson, B. & Argos, P. (1996). Topology prediction of membrane proteins. Prot. Sci., 5, 363-371. |
| 126. | Sonnhammer, E. L. L., von Heijne, G. & Krogh, A. (1998). A hidden Markov model for predicting transmembrane helices in protein sequences. In Sixth International Conference on Intelligent Systems for Molecular Biology (ISMB98) (Glasgow, J., Littlejohn, T., Major, F., Lathrop, R., Sankoff, D. et al., eds.), pp. 175-182, AAAI Press, Montreal, Canada. |
| 127. | Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol., 305, 567-580. |
| 128. | Tusnady, G. E. & Simon, I. (1998). Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J. Mol. Biol., 283, 489-506. |
| 129. | Tusnady, G. E. & Simon, I. (2001). Topology of membrane proteins. J Chem Inf Comput Sci, 41, 364-368. |
| 130. | Jones, D. T., Taylor, W. R. & Thornton, J. M. (1994). A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochem., 33, 3038-3049. |
| 131. | Wallin, E. & von Heijne, G. (1998). Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Prot. Sci., 7, 1029-1038. |
| 132. | Schirmer, T., Keller, T. A., Wang, Y. F. & Rosenbusch, J. P. (1995). Structural basis for sugar translocation through maltoporin channels at 3.1 A resolution. Science, 267, 512-514. |
| 133. | Meyer, J. E. W., Hofnung, M. & Schulz, G. E. (1997). Structure of Maltoporin from Salmonella typhimurium ligated with a Nitrophenyl-maltotrioside. J. Mol. Biol., 266, 761-775. |
| 134. | Mannella, C. A. (1998). Conformational changes in the mitochondrial channel protein, VDAC, and their functional implications. J. Struct. Biol., 121, 207-218. |
| 135. | Schulz, G. E. (2000). beta-Barrel membrane proteins. Curr. Opin. Str. Biol., 10, 443-447. |
| 136. | Weiss, M. S. & Schulz, G. E. (1992). Structure of porin refined at 1.8 Å resolution. J. Mol. Biol., 227, 493-509. |
| 137. | Pebay-Peyroula, E., Garavito, R. M., Rosenbusch, J. P., Zulauf, M. & Timmins, P. A. (1995). Detergent structure in tetragonal crystals of OmpF porin. Structure, 3, 1051-1059. |
| 138. | Schirmer, T. (1998). General and specific porins from bacterial outer membranes. J. Struct. Biol., 121, 101-109. |
| 139. | Tamm, L. K., Arora, A. & Kleinschmidt, J. H. (2001). Structure and assembly of beta-barrel membrane proteins. J. Biol. Chem., 276, 32399-32402. |
| 140. | Paul, C. & Rosenbusch, J. P. (1985). Folding patterns of porin and bacteriorhodopsin. EMBO J., 4, 1594-1597. |
| 141. | Welte, W., Weiss, M. S., Nestel, U., Weckesser, J., Schiltz, E. et al. (1991). Prediction of the general structure of OmpF and PhoE from the sequence and structure of porin from Rhodobacter capsulatus: Orientation of porin in the membrane. Biochim. Biophys. Acta., 1080, 271-274. |
| 142. | Schirmer, T. & Cowan, S. W. (1993). Prediction of membrane spanning beta-strands and its application to maltoporin. Prot. Sci., 2, 1361-1363. |
| 143. | Jahnig, F. (1990). Structure predictions of membrane proteins are not that bad. TIBS, 15, 93-95. |
| 144. | Cowan, S. W., Schirmer, T., Rummel, G., Steiert, M., Ghosh, R. et al. (1992). Crystal structures explain functional properties of two E. coli porins. Nature, 358, 727-733. |
| 145. | Eisenberg, D., Schwartz, E., Komaromy, M. & Wall, R. (1984). Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J. Mol. Biol., 179, 125-142. |
| 146. | Vogel, H. & Jahnig, F. (1986). Models for the structure of outer-membrane proteins of Escherichia coli derived from raman spectroscopy and prediction methods. J. Mol. Biol., 190, 191-199. |
| 147. | Gromiha, M. M. & Ponnuswamy, P. K. (1993). Prediction of transmembrane beta-strands from hydrophobic characteristics of proteins. Int J Pept Protein Res, 42, 420-431. |
| 148. | Gromiha, M. M., Majumdar, R. & Ponnuswamy, P. K. (1997). Identification of membrane spanning beta strands in bacterial porins. Prot. Engin., 10, 497-500. |
| 149. | Diederichs, K., Freigang, J., Umhau, S., Zeth, K. & Breed, J. (1998). Prediction by a neural network of outer membrane beta-strand protein topology. Prot. Sci., 7, 2413-2420. |
| 150. | Jacoboni, I., Martelli, P. L., Fariselli, P., De Pinto, V. & Casadio, R. (2001). Prediction of the transmembrane regions of beta-barrel membrane proteins with a neural network-based predictor. Prot. Sci., 10, 779-787. |
| 151. | Rost, B., Casadio, R. & Fariselli, P. (1996). Refining neural network predictions for helical transmembrane proteins by dynamic programming. In Fourth International Conference on Intelligent Systems for Molecular Biology (States, D., Agarwal, P., Gaasterland, T., Hunter, L. & Smith, R. F., eds.), pp. 192-200, Menlo Park, CA: AAAI Press, St. Louis, M.O., U.S.A. |
| 152. | Nilges, M. (1996). Structure calculation from NMR data. Curr. Opin. Str. Biol., 6, 617-623. |
| 153. | Ortiz, A. R., Kolinski, A. & Skolnick, J. (1998). Fold assembly of small proteins using monte carlo simulations driven by restraints derived from multiple sequence alignments. J. Mol. Biol., 277, 419-448. |
| 154. | Ortiz, A. R., Kolinski, A. & Skolnick, J. (1998). Tertiary structure prediction of the KIX domain of CBP using Monte Carlo simulations driven by restraints derived from multiple sequence alignments. Proteins, 30, 287-294. |
| 155. | Pazos, F., Rost, B. & Valencia, A. (1999). A platform for integrating threading results with protein family analyses. Bioinformatics, 15, 1062-1063. |
| 156. | Pazos, F., Heredia, P., Valencia, A. & de las Rivas, J. (2001). Threading structural model of the manganese-stabilizing protein PsbO reveals presence of two possible beta-sandwich domains. Proteins, 45, 372-381. |
| 157. | Pazos, F., Helmer-Citterich, M., Ausiello, G. & Valencia, A. (1997). Correlated mutations contain information about protein-protein interaction. J. Mol. Biol., 271, 511-523. |
| 158. | Pazos, F. & Valencia, A. (2001). Similarity of phylogenetic trees as indicator of protein-protein interaction. Prot. Engin., 14, 609-614. |
| 159. | Pazos, F. & Valencia, A. (2002). In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins, 47, 219-227. |
| 160. | Altschuh, D., Lesk, A. M., Bloomer, A. C. & Klug, A. (1987). Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. J. Mol. Biol., 193, 693-707. |
| 161. | Goebel, U., Sander, C., Schneider, R. & Valencia, A. (1994). Correlated mutations and residue contacts in proteins. Proteins, 18, 309-317. |
| 162. | Neher, E. (1994). How frequent are correlated changes in families of protein sequences? Proc. Natl. Acad. Sci. U.S.A., 91, 98-102. |
| 163. | Taylor, W. R. & Hatrick, K. (1994). Compensating changes in protein multiple sequence alignment. Prot. Engin., 7, 341-348. |
| 164. | Olmea, O. & Valencia, A. (1997). Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Folding & Design, 2, S25-S32. |
| 165. | Olmea, O., Rost, B. & Valencia, A. (1999). Effective use of sequence correlation and conservation in fold recognition. J. Mol. Biol., 293, 1221-1239. |
| 166. | Lund, O., Frimand, K., Gorodkin, J., Bohr, H., Bohr, J. et al. (1997). Protein distance constraints predicted by neural networks and probability density functions. Prot. Engin., 10, 1241-1248. |
| 167. | Fariselli, P. & Casadio, R. (1999). A neural network based predictor of residue contacts in proteins. Prot. Engin., 12, 15-21. |
| 168. | Gorodkin, J., Lund, O., Andersen, C. A. & Brunak, S. (1999). Using sequence motifs for enhanced neural network prediction of protein distance constraints. Ismb, 95-105. |
| 169. | Fariselli, P., Olmea, O., Valencia, A. & Casadio, R. (2001). Prediction of contact maps with neural networks and correlated mutations. Prot. Engin., 14, 835-843. |
| 170. | Hubbard, T. J. P. (1994). Use of b-strand interaction pseudo-potential in protein structure prediction and modelling. In 27th Hawaii International Conference on System Sciences (Hunter, L., eds.), pp. 336-344, IEEE Society Press, Maui, Hawaii, USA. |
| 171. | Lifson, S. & Sander, C. (1980). Specific recognition in the tertiary structure of b-sheets in proteins. J. Mol. Biol., 139, 627-639. |
| 172. | Hubbard, T. J. P. & Park, J. (1995). Fold recognition and ab initio structure predictions using Hidden Markov models and b-strand pair potentials. Proteins, 23, 398-402. |
| 173. | Nagarajaram, H. A., Reddy, B. V. & Blundell, T. L. (1999). Analysis and prediction of inter-strand packing distances between beta-sheets of globular proteins. Prot. Engin., 12, 1055-1062. |
| 174. | Baldi, P., Pollastri, G., Andersen, C. A. & Brunak, S. (2000). Matching protein beta-sheet partners by feedforward and recurrent neural networks. Ismb, 8, 25-36. |
| 175. | Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A., 79, 2554-2558. |
| 176. | Asogawa, M. (1997). Beta-sheet prediction using inter-strand residue pairs and refinement with Hopfield neural network. Ismb, 5, 48-51. |
| 177. | Lo Conte, L., Brenner, S. E., Hubbard, T. J., Chothia, C. & Murzin, A. G. (2002). SCOP database in 2002: refinements accommodate structural genomics. Nucl. Acids Res., 30, 264-267. |
| 178. | Guex, N. & Peitsch, M. C. (1997). SWISS-MODEL and the Swiss-PdbViewer: An environment for comparative protein modelling. Electrophoresis, 18, 2714-2723. |
| 179. | Fetrow, J. S., Godzik, A. & Skolnick, J. (1998). Functional analysis of the Escherichia coli genome using the sequence-to-structure-to-function paradigm: identification of proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase activity. J. Mol. Biol., 282, 703-711. |
| 180. | Sanchez, R. & Sali, A. (1998). Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. U.S.A., 95, 13597-13602. |
| 181. | Bates, P. A. & Sternberg, M. J. (1999). Model building by comparison at CASP3: Using expert knowledge and computer automation. Proteins, 37, 47-54. |
| 182. | Burke, D. F., Deane, C. M., Nagarajaram, H. A., Campillo, N., Martin-Martinez, M. et al. (1999). An iterative structure-assisted approach to sequence alignment and comparative modeling. Proteins, Suppl 3, 55-60. |
| 183. | Dunbrack Jr, R. L. (1999). Comparative modeling of CASP3 targets using PSI-BLAST and SCWRL. Proteins, 37, 81-87. |
| 184. | Moult, J. (1999). Predicting protein three-dimensional structure. Curr. Opin. Biotech., 10, 583-588. |
| 185. | Sternberg, M. J., Bates, P. A., Kelley, L. A. & MacCallum, R. M. (1999). Progress in protein structure prediction: assessment of CASP3. Curr. Opin. Str. Biol., 9, 368-73. |
| 186. | Sauder, J. M., Arthur, J. W. & Dunbrack Jr, R. L. (2000). Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins, 40, 6-22. |
| 187. | Marti-Renom, M. A., Madhusudhan, M. S., Fiser, A. & Sali, A. (2001). Accuracy of comparative modelling. . |
| 188. | Melo, F., Sanchez, R. & Sali, A. (2002). Statistical potentials for fold assessment. Prot. Sci., 11, 430-448. |
| 189. | Pieper, U., Eswar, N., Stuart, A. C., Ilyin, V. A. & Sali, A. (2002). MODBASE, a database of annotated comparative protein structure models. Nucl. Acids Res., 30, 255-259. |
| 190. | Karplus, M. & Petsko, G. A. (1990). Molecular dynamics simulations in biology. Nature, 347, 631-639. |
| 191. | Summers, N. L. & Karplus, M. (1990). Modeling of globular proteins. J. Mol. Biol., 216, 991-1016. |
| 192. | May, A. C. W. & Blundell, T. L. (1994). Automated comparative modelling of protein structures. Curr. Opin. Biotech., 5, 355-360. |
| 193. | Moult, J., Pedersen, J. T., Judson, R. & Fidelis, K. (1995). A large-scale experiment to assess protein structure prediction methods. Proteins, 23, ii-iv. |
| 194. | Peitsch, M. C. (2002). About the use of protein models. Bioinformatics, 18, 934-938. |
| 195. | Fischer, D., Rice, D. W., Bowie, J. U. & Eisenberg, D. (1996). Assigning amino acid sequences to 3D protein folds. FASEB J., 10, 126-136. |
| 196. | Lathrop, R. H. (1994). The protein threading problem with sequence amino acid interaction preferences is NP-complete. Prot. Engin., 7, 1059-1068. |
| 197. | Bowie, J. U., Lüthy, R. &Eisenberg, D. (1991). A method to identify protein sequences that fold into aknown three-dimensional structure. Science, 253, 164-169. |
| 198. | Sippl, M. J. & Weitckus, S.(1992). Detection of native-like models for amino acid sequences of unknownthree-dimensional structure in a data base of known protein conformations.Proteins, 13, 258-271. |
| 199. | Fischer, D. & Eisenberg, D.(1996). Fold recognition using sequence-derived properties. Prot. Sci., 5,947-955. |
| 200. | Jones, D. T. (1999). GenTHREADER:an efficient and reliable protein fold recognition method for genomicsequences. J. Mol. Biol., 287, 797-815. |
| 201. | Kelley, L. A., MacCallum, R. M.& Sternberg, M. J. (2000). Enhanced genome annotation using structuralprofiles in the program 3D-PSSM. J. Mol. Biol., 299, 499-520. |
| 202. | Bujnicki, J. M., Elofsson, A.,Fischer, D. & Rychlewski, L. (2001). LiveBench-2: large-scale automatedevaluation of protein structure prediction servers. Proteins, Suppl, 184-91.. |
| 203. | Sippl, M. J., Lackner, P.,Domingues, F. S., Prlic, A., Malik, R. et al. (2001). Assessment of the CASP4fold recognition category. Proteins, Suppl, 55-67. |
| 204. | Elofsson, A. (2002). A study onprotein sequence alignment quality. Proteins, 46, 330-9.. |
| 205. | Elofsson, A., Le Grand, S. M.& Eisenberg, D. (1995). Local moves: an efficient algorithm for simulationof protein folding. Proteins, 23, 73-82. |
| 206. | Pedersen, J. T. & Moult, J.(1996). Genetic algorithms for protein structure prediction. Curr. Opin. Str.Biol., 6, 227-31. |
| 207. | Srinivasan, R. & Rose, G. D.(1995). LINUS: a hierarchic procedure to predict the fold of a protein. Proteins,22, 81-99. |
| 208. | Nilges, M. & Brünger, A.T. (1993). Successful prediction of coiled coil geometry of the GCN4 leucinezipper domain by simulated annealing: comparison to the X-ray. Proteins, 15,133-146. |
| 209. | Sippl, M. J. (1990). Thecalculation of conformational ensembles from potentials of mean force. Anapproach to the knowledge-based prediction of local structures of globularproteins. J. Mol. Biol., 213, 859-883. |
| 210. | Honig, B. & Cohen, F. E.(1996). Adding backbone to protein folding: why proteins are polypeptides.Folding & Design, 1, R17-R20. |
| 211. | Yang, A.-S. & Honig, B.(1995). Free energy determinants of secondary structure formation. 1.Alpha-helices. J. Mol. Biol., 252, 351-365. |
| 212. | Yang, A.-S. & Honig, B.(1995). Free energy determinants of secondary structure formation. 2.Antiparallel beta-sheets. J. Mol. Biol., 252, 366-376. |
| 213. | Sippl, M. J. (1996). Helmholtzfree energy of peptide hydrogen bonds in proteins. J. Mol. Biol., 260, 644-648. |
| 214. | Pfeffer, S. R. & Rotheman, J.E. (1987). Biosynthetic protein transport and sorting by the endoplasmicreticulum and Golgi. Annu. Rev. Biochem., 56, 829-852. |
| 215. | Schatz, G. & Dobberstein, B.(1996). Common principles of protein translocation across membranes. Science,271, 1519-26. |
| 216. | Mattaj, I. W. & Englmeier, L.(1998). Nucleocytoplasmic transport: the soluble phase. Annu Rev Biochem, 67,265-306. |
| 217. | Boulikas, T. (1993). Nuclearlocalization signals (NLS). Crit Rev Eukaryot Gene Expr, 3, 193-227. |
| 218. | von Heijne, G. (1995). Proteinsorting signals: simple peptides with complex functions. Exs, 73, 67-76. |
| 219. | Rothman, J. E. & Wieland, F.T. (1996). Protein sorting by transport vesicles. Science, 272, 227-234. |
| 220. | Eisenhaber, F. & Bork, P.(1998). Wanted: subcellular localization of proteins based on sequence. TICB,8, 169-170. |
| 221. | Nakai, K. (2000). Protein sortingsignals and prediction of subcellular localization. Adv Protein Chem, 54,277-344. |
| 222. | Bork, P., Dandekar, T.,Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M. et al. (1998). Predicting function:from genes to genomes and back. J. Mol. Biol., 283, 707-725. |
| 223. | Nielsen, H., Engelbrecht, J.,Brunak, S. & von Heijne, G. (1997). A neural network method foridentification of prokaryotic and eukaroytoic signal peptides and prediction oftheir cleavage sites. Internationl Journal of Neural Systems, 8, 581-599. |
| 224. | Emanuelsson, O., Nielsen, H.,Brunak, S. & von Heijne, G. (2000). Predicting subcellular localization ofproteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300,1005-1016. |
| 225. | Menne, K. M., Hermjakob, H. &Apweiler, R. (2000). A comparison of signal sequence prediction methods using atest set of signal peptides. Bioinformatics, 16, 741-742. |
| 226. | Cokol, M., Nair, R. & Rost, B.(2000). Finding nuclear localisation signals. EMBO Rep., 1, 411-415. |
| 227. | Reinhardt, A. & Hubbard, T.(1998). Using neural networks for prediction of the subcellular location ofproteins. Nucl. Acids Res., 26, 2230-2235. |
| 228. | Nishikawa, K. & Ooi, T.(1982). Correlation of the amino acid composition of a protein to itsstructural and biological characteristics. J. Biochem., 91, 1821-1824. |
| 229. | Nishikawa, K., Kubota, Y. &Ooi, T. (1983). Classification of proteins into groups based on amino acidcomposition and other characters. II. Grouping into four types. J. Biochem.(Tokyo), 94, 997-1007. |
| 230. | Nakashima, H. & Nishikawa, K.(1992). The amino acid composition is different between the cytoplasmic andextracellular sides in membrane proteins. FEBS Lett., 303, 141-146. |
| 231. | Cedano, J., Aloy, P.,Pérez-Pons, J. A. & Querol, E. (1997). Relation between amino acidcomposition and cellular location of proteins. J. Mol. Biol., 266, 594-600. |
| 232. | Hua, S. & Sun, Z. (2001).Support vector machine approach for protein subcellular localizationprediction. Bioinformatics, 17, 721-8. |
| 233. | Nakai, K. & Horton, P. (1999).PSORT: a program for detecting sorting signals in proteins and predicting theirsubcellular localization. TIBS, 24, 34-6. |
| 234. | Nielsen, H., Engelbrecht, J.,Brunak, S. & von Heijne, G. (1997). Identification of prokaryotic andeukaryotic signal peptides and prediction of their cleavage sites. Prot.Engin., 10, 1-6. |
| 235. | Emanuelsson, O., Nielsen, H. &von Heijne, G. (1999). ChloroP, a neural network-based method for predictingchloroplast transit peptides and their cleavage sites. Prot. Sci., 8, 978-984. |
| 236. | Drawid, A. & Gerstein, M.(2000). A Bayesian system integrating expression data with sequence patterns forlocalizing proteins: comprehensive application to the yeast genome. J. Mol.Biol., 301, 1059-1075. |
| 237. | Andrade, M. A., O'Donoghue, S. I.& Rost, B. (1998). Adaptation of protein surfaces to subcellular location.J. Mol. Biol., 276, 517-525. |
| 238. | Emanuelsson, O. & von Heijne,G. (2001). Prediction of organellar targeting signals. Biochim. Biophys. Ac.,1541, 114-119. |
| 239. | Garavelli, J. S., Hou, Z.,Pattabiraman, N. & Stephens, R. M. (2001). The RESID database of proteinstructure modifications and the NRL-3D sequence-structure database. Nucl. AcidsRes., 29, 199-201. |
| 240. | Nakai, K. (2001). Prediction of invivo fates of proteins in the era of genomics and proteomics. J. Struct. Biol.,134, 103-116. |
| 241. | Kreegipuu, A., Blom, N. &Brunak, S. (1999). PhosphoBase, a database of phosphorylation sites: release2.0. Nucl. Acids Res., 27, 237-9. |
| 242. | Blom, N., Gammeltoft, S. &Brunak, S. (1999). Sequence and structure-based prediction of eukaryoticprotein phosphorylation sites. J. Mol. Biol., 294, 1351-1362. |
| 243. | Hansen, J., Lund, O., Tolstrup,N., Gooley, A. A., Williams, K. L. et al. (1998). NetOglyc: Prediction of mucintype O-glycosylation sites based on sequence context and surface accessibility.Glycoconjugate Journal, 15, 115-130. |
| 244. | Christlet, T. H., Biswas, M. &Veluraja, K. (1999). A database analysis of potential glycosylatingAsn-X-Ser/Thr consensus sequences. Acta Crystallogr D Biol Crystallogr, 55 ( Pt8), 1414-20. |
| 245. | Petrescu, A. J., Petrescu, S. M.,Dwek, R. A. & Wormald, M. R. (1999). A statistical analysis of N- andO-glycan linkage conformations from crystallographic data. Glycobiology, 9,343-52. |
| 246. | Eisenhaber, B., Bork, P. &Eisenhaber, F. (1999). Prediction of potential GPI-modification sites inproprotein sequences. J. Mol. Biol., 292, 741-58. |
| 247. | Eisenhaber, B., Bork, P. &Eisenhaber, F. (2001). Post-translational GPI lipid anchor modification ofproteins in kingdoms of life: analysis of protein sequence data from completegenomes. Prot. Engin., 14, 17-25. |
| 248. | Maurer-Stroh, S., Eisenhaber, B.& Eisenhaber, F. (2002). N-terminal N-myristoylation of proteins:prediction of substrate proteins from amino acid sequence. J. Mol. Biol., 317,541-57. |
| 249. | Kesimir, C., Nussbaum, A. K.,Schild, H., Detours, V. & Brunak, S. (2002). Prediction of proteasomecleavage motifs by neural networks. Prot. Engin., 15, 287-296. |
| 250. | Riley, M. (1993). Functions of thegene products of Escherichia coli. Microbiol Rev, 57, 862-952.. |
| 251. | Fraser, C. M., Gocayne, J. D.,White, O., Adams, M. D., Clayton, R. A. et al. (1995). The minimal genecomplement of Mycoplasma genitalium. Science, 270, 397-403. |
| 252. | Tamames, J., Ouzounis, C., Casari,G., Sander, C. & Valencia, A. (1998). EUCLID: automatic classification ofproteins in functional classes by their database annotations. Bioinformatics,14, 542-3. |
| 253. | Ashburner, M., Ball, C. A., Blake,J. A., Botstein, D., Butler, H. et al. (2000). Gene ontology: tool for theunification of biology. The Gene Ontology Consortium. Nat. Gen., 25, 25-29. |
| 254. | Devos, D. & Valencia, A.(2000). Practical limits of function prediction. Proteins, 41, 98-107. |
| 255. | Haft, D. H., Loftus, B. J.,Richardson, D. L., Yang, F., Eisen, J. A. et al. (2001). TIGRFAMs: a proteinfamily resource for the functional identification of proteins. Nucl. AcidsRes., 29, 41-3.. |
| 256. | Jensen, L. J., Gupta, R., Blom,N., Devos, D., Tamames, J. et al. (2002). Prediction of human protein functionfrom post-translational modifications and localization features. J. Mol. Biol.,319, 1257-1265. |
| 257. | Sheinerman, F. B., Norel, R. &Honig, B. (2000). Electrostatic aspects of protein-protein interactions. Curr.Opin. Str. Biol., 10, 153-159. |
| 258. | Mann, M., Hendrickson, R. C. &Pandey, A. (2001). Analysis of proteins and proteomes by mass spectrometry.Annu. Rev. Biochem., 70, 437-473. |
| 259. | Michnick, S. W. (2001). Exploringprotein interactions by interaction-induced folding of proteins fromcomplementary peptide fragments. Curr. Opin. Str. Biol., 11, 472-477. |
| 260. | DeLano, W. (2002). Unravelling hotspots in binding interfaces: progress and challenges. Curr. Opin. Str. Biol.,12, 14-20. |
| 261. | Valencia, A. & Pazos, F.(2002). Computational methods for the prediction of protein interactions. Curr.Opin. Str. Biol., 12, 368-373. |
| 262. | Chen, Z. & Han, M. (2000).Building a protein interaction map: research in the post-genome era. Bioessays,22, 503-506. |
| 263. | Ghosh, D. (2000). Object-orientedtranscription factors database (ooTFD). Nucl. Acids Res., 28, 308-10. |
| 264. | Kel-Margoulis, O. V.,Romashchenko, A. G., Kolchanov, N. A., Wingender, E. & Kel, A. E. (2000).COMPEL: a database on composite regulatory elements providing combinatorialtranscriptional regulation. Nucl. Acids Res., 28, 311-5. |
| 265. | Bader, G. D., Donaldson, I.,Wolting, C., Ouellette, B. F., Pawson, T. et al. (2001). BIND-The biomolecularinteraction network database. Nucl. Acids Res., 29, 242-245. |
| 266. | Luscombe, N. M., Laskowski, R. A.& Thornton, J. M. (2001). Amino acid-base interactions: a three-dimensionalanalysis of protein-DNA interactions at an atomic level. Nucl. Acids Res., 29,2860-2874. |
| 267. | Qian, J., Stenger, B., Wilson, C.A., Lin, J., Jansen, R. et al. (2001). PartsList: a web-based system fordynamically ranking protein folds based on disparate attributes, includingwhole-genome expression and interaction information. Nucl. Acids Res., 29,1750-1764. |
| 268. | Xenarios, I., Salwinski, L., Duan,X. J., Higney, P., Kim, S. M. et al. (2002). DIP, the Database of InteractingProteins: a research tool for studying cellular networks of proteininteractions. Nucl. Acids Res., 30, 303-5. |
| 269. | Blaschke, C., Oliveros, J. C.& Valencia, A. (2001). Mining functional information associated withexpression arrays. Funct Integr Genomics, 1, 256-268. |
| 270. | Blaschke, C. & Valencia, A.(2001). The potential use of SUISEKI as a protein interaction discovery tool.Genome Inform Ser Workshop Genome Inform, 12, 123-134. |
| 271. | Marcotte, E. M., Xenarios, I.& Eisenberg, D. (2001). Mining literature for protein-protein interactions.Bioinformatics, 17, 359-363. |
| 272. | Rzhetsky, A. & Gomez, S. M.(2001). Birth of scale-free molecular networks and the number of distinct DNAand protein domains per genome. Bioinformatics, 17, 988-96. |
| 273. | Blaschke, C., Hirschman, L. &Valencia, A. (2002). Information extraction in molecular biology. BriefBioinform, 3, 154-165. |
| 274. | Krauthammer, M., Kra, P.,Iossifov, I., Gomez, S. M., Hripcsak, G. et al. (2002). Of truth and pathways:chasing bits of information through myriads of articles. Bioinformatics, 18,S249-S257. |
| 275. | Valencia, A. (2002). Search andretrieve: Large-scale data generation is becoming increasingly important inbiological research. But how good are the tools to make sense of the data? EMBORep., 3, 396-400. |
| 276. | Fields, S. & Song, O. (1989).A novel genetic system to detect protein-protein interactions. Nature, 340,245-246. |
| 277. | Flajolet, M., Rotondo, G., Daviet,L., Bergametti, F., Inchauspe, G. et al. (2000). A genomic approach of thehepatitis C virus generates a protein interaction map. Gene, 242, 369-79. |
| 278. | Legrain, P. & Selig, L.(2000). Genome-wide protein interaction maps using two-hybrid systems. FEBSLett., 480, 32-36. |
| 279. | McCraith, S., Holtzman, T., Moss,B. & Fields, S. (2000). Genome-wide analysis of vaccinia virusprotein-protein interactions. Proc. Natl. Acad. Sci. U.S.A., 97, 4879-4884. |
| 280. | Uetz, P., Giot, L., Cagney, G.,Mansfield, T. A., Judson, R. S. et al. (2000). A comprehensive analysis of protein-proteininteractions in Saccharomyces cerevisiae. Nature, 403, 623-627. |
| 281. | Uetz, P. & Hughes, R. E.(2000). Systematic and large-scale two-hybrid screens. Curr. Opin. Microbiol.,3, 303-308. |
| 282. | Walhout, A. J., Sordella, R., Lu,X., Hartley, J. L., Temple, G. F. et al. (2000). Protein interaction mapping inC. elegans using proteins involved in vulval development. Science, 287, 116-22. |
| 283. | Ito, T., Chiba, T., Ozawa, R.,Yoshida, M., Hattori, M. et al. (2001). A comprehensive two-hybrid analysis toexplore the yeast protein interactome. Proc. Natl. Acad. Sci. U.S.A., 98,4569-4574. |
| 284. | Rain, J. C., Selig, L., De Reuse,H., Battaglia, V., Reverdy, C. et al. (2001). The protein-protein interactionmap of Helicobacter pylori. Nature, 409, 211-215. |
| 285. | Jeong, H., Tombor, B., Albert, R.,Oltvai, Z. N. & Barabasi, A. L. (2000). The large-scale organization ofmetabolic networks. Nature, 407, 651-654. |
| 286. | Rowley, A., Choudhary, J. S.,Marzioch, M., Ward, M. A., Weir, M. et al. (2000). Applications of protein massspectrometry in cell biology. Methods, 20, 383-397. |
| 287. | Peng, J. & Gygi, S. P. (2001).Proteomics: the move to mixtures. J Mass Spectrom, 36, 1083-1091. |
| 288. | Gavin, A. C., Bosche, M., Krause,R., Grandi, P., Marzioch, M. et al. (2002). Functional organization of theyeast proteome by systematic analysis of protein complexes. Nature, 415,141-147. |
| 289. | Ho, Y., Gruhler, A., Heilbut, A.,Bader, G. D., Moore, L. et al. (2002). Systematic identification of proteincomplexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415,180-183. |
| 290. | Zhu, H., Bilgin, M., Bangham, R.,Hall, D., Casamayor, A. et al. (2001). Global analysis of protein activitiesusing proteome chips. Science, 293, 2101-2105. |
| 291. | Tong, A. H., Evangelista, M.,Parsons, A. B., Xu, H., Bader, G. D. et al. (2001). Systematic genetic analysiswith ordered arrays of yeast deletion mutants. Science, 294, 2364-2368. |
| 292. | Tong, A. H., Drees, B., Nardelli,G., Bader, G. D., Brannetti, B. et al. (2002). A combined experimental andcomputational strategy to define protein interaction networks for peptiderecognition modules. Science, 295, 321-324. |
| 293. | Marcotte, E. M., Pellegrini, M.,Thompson, M. J., Yeates, T. O. & Eisenberg, D. (1999). A combined algorithmfor genome-wide prediction of protein function. Nature, 402, 83-86. |
| 294. | Marcotte, E. M., Pellegrini, M.,Ng, H. L., Rice, D. W., Yeates, T. O. et al. (1999). Detecting protein functionand protein-protein interactions from genome sequences. Science, 285, 751-753. |
| 295. | Sprinzak, E. & Margalit, H.(2001). Correlated sequence-signatures as markers of protein-proteininteraction. J. Mol. Biol., 311, 681-692. |
| 296. | Gomez, S. M., Lo, S. H. &Rzhetsky, A. (2001). Probabilistic prediction of unknown metabolic andsignal-transduction networks. Genetics, 159, 1291-1298. |
| 297. | Schneider, R., de Daruvar, A.& Sander, C. (1997). The HSSP database of protein structure-sequencealignments. Nucl. Acids Res., 25, 226-230. |
| 298. | Scharf, M. (1988). CONAN (CONtactANalysis). . |
| 299. | Kraulis, P. (1991). MOLSCRIPT: aprogram to produce both detailed and schematic plots of protein structures. J.Appl. Cryst., 24, 946-950. |
| 300. | Holm, L. & Sander, C. (1999).Protein folds and families: sequence and structure alignments. Nucl. AcidsRes., 27, 244-247. |
| Contact: rost@columbia.edu | Version: Sep 13, 2002 |