| Title: | CHOP proteins into structural domain-like fragments |
| Author: | Jinfeng Liu , & Burkhard Rost |
| Quote: | Proteins, 2004, 55(3):678-688 |
| 1 | CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 4 | Dept. of Pharmacology, Columbia Univ., 630 West 168th Street, New York, NY 10032, USA |
| * | Corresponding authors: cubic@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/ Tel: +1-212-305-4018, fax: +1-212-305-7932 |
We developed a method, CHOP, that dissects proteins into domain-like fragments. The basic idea was to cut proteins from entirely sequenced organisms beginning from very reliable experimental information (PDB), proceeding to expert annotations of domain-like regions (Pfam-A), and completing through cuts based on termini of known proteins. In this way, CHOP dissected over two thirds of all proteins from 62 proteomes. Analysis of our structural domain-like fragments revealed four surprising results. First, over 70% of all dissected proteins contained more than one fragment. Second, most domains spanned about 100 residues on average. This average was similar for eukaryotic and prokaryotic proteins, and it is also valid - although not previously described - for all proteins in the PDB. Third, single-domain proteins were significantly longer than most domains in multi-domain proteins. Fourth, three-fourths of all domains appeared shorter than 210 residues. We believe that our CHOP fragments constitute an important resource for functional and structural genomics. Nevertheless, our main motivation to develop CHOP was that single-linkage clustering failed to adequately group full-length proteins. In contrast, CLUP - the simple clustering scheme introduced here - largely succeeded in grouping the CHOP fragments from 62 proteomes such that all members of one cluster shared a basic structural core. CLUP found over 63,000 multi- and over 118,000 single-member clusters. Although most fragments were restricted to a particular cluster, about 24% of the fragments were duplicated in at least two clusters. Our thresholds for grouping two fragments into the same cluster were rather conservative. Nevertheless, our results suggested that structural genomics initiatives have to target over 30,000 fragments to at least cover the multi-member clusters in 62 proteomes.
Key words: genome sequence analysis; protein domains; automatic sequence clustering; protein structure; structural genomics.
| 3D structure | three-dimensional co-ordinates of protein structure |
| CHOP | dissection into structural domain-like fragments introduced here ( Fig. 1 ) |
| CLUP | simple clustering algorithm introduced here |
| ORF | open reading frame (for simplicity we usually refer to ORFs from genome sequencing projects as 'proteins') |
| PDB | Protein Data Bank of experimentally determined 3D structures of proteins [1] |
| Pfam-A | expert curated database of protein families [2] |
| PrISM | automatic method assigning sequence-consecutive structural domains from PDB co-ordinates [3] |
| ProDom | automatic assignment of domain-like fragments from alignment information [4, 5, 6] |
| SCOP | Structural Classification of Proteins, i.e., expert-based classification and domain-dissection of protein structures [7] |
| SWISS-PROT | a database of protein sequences [8] . |
Domains are the structural units of proteins. While less than one percent of the proteins in entirely sequenced archaea and prokaryotes are longer than 1000 residues, over seven percent of the proteins in the six entirely sequenced eukaryotes (yeast, fly, worm, weed, human, and mouse) constitute such large macromolecules [9, 10] . Undoubtedly, all these large proteins are built of more than one domain. The term 'domain' is not well defined: many biologists refer to any consecutive segment, such as a coiled-coil helix or a nuclear localization signal, as a domain. Structural biologists view domains as semi-independent three-dimensional (3D) sub-units that are compact and may fold independently [11, 12, 13, 14, 15, 16, 17, 18] . Some structural domains appear to constitute units that evolve independently [19, 20, 21, 15, 22, 23, 24] . Methods that automatically assign structural domains from 3D co-ordinates identify domains through their compactness [25, 3, 26, 27, 7, 28, 29] . Structural domains are often related to particular functions. Hence, the domain organisation of a protein may contain information crucial for understanding structure and function. Structural biologists also care about guessing the domain organisation before solving the structure of a protein experimentally, since crystallisation is more likely to succeed when expressing fragments that constitute domains [16] , and since NMR spectroscopy - limited by protein length - is more likely to unravel the structure of long proteins when these are dissected into structural domains [30] .
Progress in identifying structural domains without structures. An increasing number of methods and databases address the problem of identifying structural domains from sequence [31, 5, 2, 6, 32, 33] . The first automatic method - ProDom - identified likely domains through 'boundaries' in multiple alignments [4, 5, 6] . The major problem of basing domain boundaries on alignments is that proteins are dissected into fragments that are too small [6, 33] ; in fact, entirely conserved segments as captured in the BLOCKS database [34, 35] usually span only short fragments of structural domains. This observation is by no means self-evident; rather, it points to the complexity of evolutionary constraints. DOMAINATION [36] , which delineates domains by analysing iterative PSI-BLAST alignments, explicitly attempts to elongate domain-like regions. Other automatic methods apply concepts from protein structure prediction (SnapDRAGON) [37] , statistics about domain size distributions [38] , a statistical approach toward combining various sources of information [39] , artificial neural networks [40, 41] , or other ways of exploring alignment information [42, 43] . Most recently, DomSSEA utilises information from alignments of predicted secondary structure segments to identify structural domains [44] . Another unique idea is to first predict inter-residue contacts by exploring correlated mutations and to then choose the domain such that the quotient intra/inter-domain contact becomes minimal [45] . None of these more recent methods has yet been experimentally verified on a large scale.
Clustering the protein universe without domains. Given the explosion of protein sequences, the necessity to cluster these data becomes increasingly urgent [46, 47, 48, 49, 50, 51, 33] . One of the most practical reasons is to speed up database comparisons; another is to reduce the bias and hence sharpen such searches [48, 52, 53, 54, 55, 56] . Maps of the universe of proteins like ProtoNet [57] , ProtoMap [46, 58] or BioSphere [51] tend to group proteins with similar function [59] . One problem is that such clustering methods typically begin with full-length proteins. Two groups attempted to first dissect proteins into domain-like fragments and to then cluster these fragments, through GeneRage [43] and PICASSO [47] ; the GeneRage algorithm fails to handle long, complex eukaryotic proteins [43, 60] , and PICASSO is not available for public testing.
Here, we introduce a hierarchical approach for chopping proteins into fragments that resemble structural domains (CHOP, Fig. 1 ). The hierarchy imposed begins from the most reliable information (proteins of known structure), continues to families that are well characterised by experts (Pfam-A), and finally explores the reliable information about N- and C-terminal ends of full-length proteins that have been characterised experimentally (SWISS-PROT). The objective is not to obtain all domain boundaries, but rather to identify only those boundaries for which we are confident. About 70% of the proteins from 62 entirely sequenced organisms can be dissected by CHOP; the length distribution of CHOP fragments resembles that of known structural domains. Next, we clustered all these fragments and un-chopped full-length proteins (CLUP). The CHOP fragments and the CLUP clusters have been successfully applied to the target selection process of the North East Structural Genomics Consortium (NESG, http://www.nesg.org/) and are publicly available through an SRS [61] interface at http://cubic.bioc.columbia.edu/srs/ [60] ; data in flat-file format will also be available upon request.
Fig. 1 : Concept of CHOP. We dissect proteins in a hierarchical manner beginning from PrISM/PDB domains [96, 97, 3] , i.e. very reliable data about boundaries of structural domains (CUT1). If we find a similarity to many PrISM domains, we use the longest to chop (P12). For the example shown, this leaves us with two fragments (P11 and P13) to be processed. The next step (CUT2) is the identification of matches in Pfam-A [2] . In the example, fragments P21, P23, and P25 remain; all three are aligned to all SWISS-PROT proteins and are cut further if we find full-length proteins in SWISS-PROT the N- or C-termini of which align to part of the fragments (longest alignment that covers >80% of the SWISS-PROT protein). In the example this dissects P25 into P36, P37 and into two un-digested fragments P35 and P38.
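The hierarchy described in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CHOP implementation: alignment hits are assumed to be precomputed and given as 1-based inclusive (start, end) regions on the query protein, the SWISS-PROT step is simplified to ordinary regions rather than terminus-anchored alignments, flanking pieces are re-examined with the same source before moving to the next one, and left-over pieces shorter than 30 residues are discarded (as mentioned in the Fig. 3 caption). The names `cut_by_source`, `chop`, and `MIN_LEN` are hypothetical.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # 1-based inclusive (start, end) on the protein
MIN_LEN = 30              # assumed cutoff: shorter left-overs are discarded


def cut_by_source(fragment: Region, hits: List[Region]) -> List[Tuple[Region, bool]]:
    """Cut `fragment` with the longest hit lying inside it, then recurse on the flanks.

    Returns (region, assigned) pairs; assigned=True marks a region taken
    from a hit, assigned=False marks a piece that is still uncut.
    """
    start, end = fragment
    inside = [h for h in hits if h[0] >= start and h[1] <= end]
    if not inside:
        return [(fragment, False)]
    best = max(inside, key=lambda h: h[1] - h[0])  # longest hit wins
    pieces: List[Tuple[Region, bool]] = []
    if best[0] > start:
        pieces += cut_by_source((start, best[0] - 1), hits)
    pieces.append((best, True))
    if best[1] < end:
        pieces += cut_by_source((best[1] + 1, end), hits)
    return pieces


def chop(length: int,
         hits_by_source: Dict[str, List[Region]],
         order: Tuple[str, ...] = ("PrISM", "Pfam-A", "SWISS-PROT"),
         ) -> List[Tuple[Region, str]]:
    """Apply the sources in order of decreasing reliability."""
    todo: List[Region] = [(1, length)]
    done: List[Tuple[Region, str]] = []
    for source in order:
        remaining: List[Region] = []
        for frag in todo:
            for region, assigned in cut_by_source(frag, hits_by_source.get(source, [])):
                if assigned:
                    done.append((region, source))
                else:
                    remaining.append(region)
        todo = remaining
    # Keep left-over pieces only if they are long enough.
    for region in todo:
        if region[1] - region[0] + 1 >= MIN_LEN:
            done.append((region, "uncut"))
    return sorted(done)


example = chop(500, {"PrISM": [(120, 260)], "Pfam-A": [(1, 100), (300, 380)]})
for region, source in example:
    print(region, source)
```

In this toy example the PrISM hit is cut out first, the Pfam-A hits are then cut from the two flanks, and the 19-residue piece between residues 101 and 119 is dropped because it falls below the assumed length cutoff.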
Most proteins had more than one fragment. We applied the three steps of CHOP hierarchically ( Fig. 1 ) to all proteins/ORFs from 62 entirely sequenced organisms (Table in Supplement ): 164,433 (69%) of the 238,492 proteins were dissected by CHOP. Over 70% of these chopped proteins had more than one fragment ( Fig. 2 A). As expected, eukaryotes have fewer single-domain proteins than do prokaryotes and archaea ( Fig. 2 A, inset; detailed graph in Fig. S3, Supplement ). In contrast to these data, most of the 30,309 PDB chains available in fall 2003 constituted single-domain proteins (18,292=60%). However, structural biologists have strong incentives to choose domain-like proteins and to determine the structure for fragments of longer proteins. In fact, for one fourth (4,854) of the seemingly single-domain proteins, we found the corresponding proteins in SWISS-PROT to be at least 50 residues longer than the PDB version. Assuming that these do then also constitute multi-domain proteins, less than 45% (13,438) of the PDB proteins have single domains. One extreme case of a 'mega' multi-domain protein - Drosophila melanogaster protein bt (flybase identifier: FBan0001479) with 7,107 residues - was chopped into 79 fragments. This is not surprising, since bt is a member of the titin family, well known for its titanic usage of immunoglobulin-like and fibronectin type 3 (Fn3) domains [62] . The longest human protein in our data set was the ovarian cancer-related tumour marker CA125 (TrEMBL id q8wxi7 [8] ) with 11,721 residues; it was cut into seven fragments in its middle while the C-terminus remained uncut for almost 9,000 residues; over 2,000 of the residues were in low-complexity regions, distributed more or less equally over the entire protein.
Fig. 2 : Fraction of proteins chopped and number of fragments per protein. (A) Distribution of the number of CHOP fragments per protein/ORF. For example, although CHOP is incomplete, we found that 70% of all chopped proteins appeared to have more than one structural domain-like fragment. The inset shows the percentages of single-domain proteins for eukaryotes, prokaryotes and archaea; the bars mark the minimal and maximal numbers within each kingdom (more details in Fig. S3, Supplement ). (B) Relation between number of fragments and protein length: on average, the number of CHOP fragments appeared to increase linearly with protein length. The line constitutes a linear regression fit of L = 150 + 106 N (R>0.99, with L being the length of the protein and N the number of fragments). For comparison, the average length of PDB proteins also increases with the number of domains they have, with a linear fit of L = 65 + 97 N (R=0.979).
Most domains span about 100 residues on average. When averaging over the lengths of all proteins with N CHOP fragments, we observed that the protein length increased linearly with the number of CHOP fragments ( Fig. 2 B): on average, each additional CHOP fragment in multi-fragment proteins stretched over ~106 residues (slope of linear fit in Fig. 2 B). When we compared structurally known PrISM domains on the same plot, we noted a very similar slope ( Fig. 2 B). Although the fit for PDB proteins was supported by less data, in particular for proteins with many domains, the two fits were strikingly parallel, suggesting that our result was not caused by the lack of precision and/or completeness of the CHOP procedure. The linear fits also unravelled another surprising observation: single CHOP fragments from the 62 proteomes extended over about 256 residues, those from PrISM domains over 163 residues. Both numbers were significantly larger than the averages for subsequent domains in multi-domain proteins. Thus, N-1 domains in proteins with N domains extend over 97-106 residues while one extends over 163-256 residues. (Note that the average length of the first fragment exceeded 300 residues for eukaryotic proteins [63] .) In order to establish that this unexpected finding was not caused by the particular way of presenting the data (average length vs. number of domains), we pooled all domain-like fragments and randomly 'assembled proteins' according to the observed distributions for the number of fragments per protein (data not shown). As expected, this control experiment yielded a line passing through 0. Hence, our finding is not explained by the particular presentation of the data, and the detailed fit is likely to constitute a more precise estimate of the average length of a structural domain than the one obtained from compiling a simple average over all domains currently annotated in PDB.
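The arithmetic behind these numbers follows directly from the two fits quoted in the Fig. 2 caption (L = 150 + 106 N for CHOP fragments, L = 65 + 97 N for PDB proteins). The short sketch below only re-evaluates those published coefficients; the function name `fitted_length` is hypothetical, and the coefficients are the rounded values from the caption, which is why the PDB value for N = 1 comes out as 162 rather than the 163 quoted in the text.

```python
def fitted_length(n_fragments: int, intercept: float, slope: float) -> float:
    """Average protein length predicted by the linear fit L = intercept + slope * N."""
    return intercept + slope * n_fragments


CHOP_FIT = (150, 106)  # rounded coefficients from the Fig. 2 caption
PDB_FIT = (65, 97)

for name, fit in (("CHOP", CHOP_FIT), ("PDB", PDB_FIT)):
    single = fitted_length(1, *fit)                           # length for N = 1
    extra = fitted_length(2, *fit) - fitted_length(1, *fit)   # length added per further fragment
    print(f"{name}: single-fragment length ~{single}, each further fragment adds ~{extra}")
```

The point of the comparison is that the value at N = 1 (intercept plus slope) is much larger than the slope itself, so one fragment per protein is on average considerably longer than each of the remaining N-1.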
47% of the fragments were similar to PrISM and Pfam-A. The number of domains for proteins of known structure fully agrees for only 40% of all proteins contained in PDB/PrISM and Pfam-A; for about 10% of the common proteins, PrISM/PDB and Pfam-A disagree by more than two domains ( Fig. S2, Supplement ). Since our goal is to generate structural domain-like fragments, we favoured the structure-derived (PrISM/PDB) over the sequence-derived (Pfam-A) domain assignment. PrISM domains were the origin of 20% of all CHOP fragments ( Fig. 3 ).
Fig. 3 : Sources of chopping. Left: One fifth of all fragments were obtained from homology to PrISM/PDB domains, and half did not find any similar region to cut; 15% of the 'fragments' were full-length proteins that were not cut at all. Right: Residue-wise, the PrISM fragments covered 17% of all the 87,732,248 residues in the 62 proteomes, Pfam-A fragments 24%, and 3% of the residues were discarded during the chopping since they were from fragments that were smaller than 30 residues. In other words, about 47% of all CHOP fragments appeared supported by very conservative data, and they covered 41% of all residues.
Lengths of CHOP fragments resembled structural domains. Are the remaining fragments and the uncut full-length proteins similar to structural domains, or did we simply not find the necessary data to dissect these? Although we could not answer this question conclusively, the length distribution of the remaining fragments suggested that, at least on average, these differed significantly from the whole set of full-length proteins (Fig. 4). In fact, the major problem in terms of length distribution appeared to be the over-representation of fragments shorter than 50 residues in the set of fragments that remained after chopping. In contrast, the length distribution of those full-length proteins that had not been chopped at all appeared more similar to Pfam-A regions than to the entire set of all full-length proteins. In other words, these untouched proteins constituted a subset of short proteins, many of which might in fact have single domains. The length distribution of all CHOP fragments (including the short remaining fragments and the untouched proteins) was most similar to the Pfam-A distribution, in which short (50% shorter than 106 residues) and long fragments (>350) are over-represented in comparison to 'real' structural domains as taken from PrISM. The obvious outliers were the fragments obtained through homology to full-length SWISS-PROT proteins: these tended to be much longer than fragments chopped according to PrISM (Fig. 4). However, since they accounted for only 4% of all CHOP fragments (Fig. 3A), they did not affect the length distribution of all CHOP fragments markedly. The length distribution for all CHOP fragments was similar for all three kingdoms (eukaryotes, prokaryotes and archaea, data not shown). However, we observed significant differences for those fragments that originated from SWISS-PROT termini: while almost 30% of the eukaryote SWISS-PROT fragments were longer than 500 residues, less than 5% of the archaeal and less than 10% of the prokaryote SWISS-PROT fragments were as long. Prokaryotic fragments from Pfam-A were slightly longer than eukaryotic ones, and long eukaryotic fragments were over-represented both in the set of untouched full-length proteins and in that of remaining fragments (data not shown).
Fig. 4 : Length-distribution of CHOP fragments. Note that except for the controls shown in the cumulative percentages in the top-right inset (full-length proteins), all curves describe the CHOP fragments, e.g. the thick black line with filled triangles shows the distribution for fragments that were chopped through similarity to PrISM domains. 'Remain' marks those fragments that remained N- and/or C-terminal from a region cut out according to similarity to either PrISM domains, Pfam-A regions, or SWISS-PROT termini; 'Full-length' marks proteins that were not touched at all by the CHOP algorithm. Both the cumulative and non-cumulative curves for all CHOP fragments (grey with open circles) were almost indistinguishable from those for Pfam-A fragments. Fragments cut according to SWISS-PROT termini (4%, Fig. 3) were closer to the distribution of all full-length proteins (control, light grey with downward pointing triangles) than the subset of proteins that remained untouched.
Partial agreement with SUPERFAMILY assignments for yeast. One difficulty in evaluating the CHOP procedure was that we used all available information. Cross-validation was not easily applicable since the different sources of domain-dissection overlapped only partially, and even where they did overlap, we could at best verify that our procedure was self-consistent (see above). The overall length distribution therefore provided an independent perspective on the global performance of CHOP. Lacking unused standards-of-truth, we compared CHOP to another method that implicitly has a partially similar goal, namely SUPERFAMILY [64] . SUPERFAMILY is a library of hidden Markov models (HMMs) for SCOP structural super-families. SUPERFAMILY contains data for over 100 proteomes; due to CPU restrictions, we had to limit the comparison to one particular proteome, namely yeast, the smallest entirely sequenced eukaryote with 6,349 proteins. SUPERFAMILY assigned 4,794 domains in a subset of 3,359 proteins. In contrast, CHOP found 6,000 domain-like fragments from PrISM and Pfam-A in a subset of 3,915 proteins. About 72% of the SUPERFAMILY domains are continuous in sequence (3,435); thus, their domain boundaries were directly comparable to CHOP fragments: 40% (1,380) of these SUPERFAMILY domains agreed with CHOP fragments (defined as 80% overlap between both assignments), 30% were significantly longer than CHOP fragments, and 28% shorter. SUPERFAMILY associations imply predictions for structure, since SUPERFAMILY models are based on proteins of known structure. The same is true, on average, for only 20% of the CHOP fragments ( Fig. 3 A). Therefore, it may appear surprising that both methods cover such a similar number of proteins in yeast (3,359 by SUPERFAMILY vs. 3,915 by CHOP). The reason is that we apply much more stringent thresholds in sequence similarity when we fragment a protein.
When we use CHOP fragments to pinpoint putative targets for structural genomics [63] , we add another step that is conceptually similar to SUPERFAMILY and thereby increase the coverage to levels more similar to those achieved by SUPERFAMILY.
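The agreement criterion used in the comparison with SUPERFAMILY ("80% overlap between both assignments") can be made concrete with a small sketch. The interpretation below is an assumption: two regions are taken to agree when their overlap covers at least 80% of each of them. Regions are 1-based inclusive (start, end) pairs, and the function names `overlap_len` and `agree` are hypothetical.

```python
from typing import Tuple

Region = Tuple[int, int]  # 1-based inclusive (start, end)


def overlap_len(a: Region, b: Region) -> int:
    """Number of residues shared by two regions (0 if they do not overlap)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def agree(a: Region, b: Region, min_frac: float = 0.8) -> bool:
    """Assumed criterion: the overlap covers at least `min_frac` of both regions."""
    shared = overlap_len(a, b)
    len_a = a[1] - a[0] + 1
    len_b = b[1] - b[0] + 1
    return shared >= min_frac * len_a and shared >= min_frac * len_b


print(agree((10, 110), (20, 115)))  # boundaries differ slightly
print(agree((1, 100), (1, 200)))    # one assignment is twice as long
```

Requiring the threshold on both regions is what separates "agreeing" assignments from cases in which one assignment is significantly longer or shorter than the other.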
Clustering sequence space must begin from fragments. Multi-domain proteins constitute the most troublesome challenge to clustering sequence space. When we ignored this challenge and single-linkage clustered full-length proteins from five entirely sequenced eukaryotes (yeast, fly, worm, weed, human), we observed that, no matter how we chose the thresholds for the clustering, we ended up with one big 'cluster' that pulled in almost half of all proteins [65] ( Table 1 ). Thus, without dissecting proteins into fragments, we cannot cluster entire proteomes in a way that ascertains that all proteins in one cluster have a structurally similar domain-like region. Could we succeed with single-linkage clustering if we knew all structural domains? We addressed this question by clustering all PrISM domains of known structure (47,582): we found 835 clusters with one member (singletons) and 3,039 clusters with more than one. Not surprisingly, the largest cluster with 2,156 members was mostly immunoglobulin related, and the second largest had 1,224 serine proteases/hydrolases. Although the results from clustering structural domains looked much more reasonable than those from clustering full-length eukaryotic proteins, the largest cluster still contained many unrelated domain pairs, for example, two immunoglobulin domains unrelated in sequence. To avoid this problem, we developed a clustering scheme (CLUP) that started with proteins that have few homologues, and avoided merging families based on transitive homology. For the same set of PrISM/PDB domains, CLUP yielded 835 singletons and 4,061 multi-member groups, the largest of which had 558 structural domains from the serine protease family. Visual inspection of the largest clusters did not reveal any grouping of unrelated proteins. Furthermore, no cluster was degenerate in the sense that each domain belonged to only one cluster.
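The difference between single-linkage clustering and a scheme that avoids transitive merging can be illustrated on a toy homology graph. The sketch below is not the CLUP algorithm (its details are given in Methods); it only assumes the two properties named in the text: seeds are visited in order of increasing number of homologues, and a cluster contains only the seed and its direct homologues. The function names and the example graph are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # symmetric homology relation: fragment -> direct homologues


def single_linkage(graph: Graph) -> List[Set[str]]:
    """Connected components: A~B and B~C puts A and C together (transitive)."""
    seen: Set[str] = set()
    clusters: List[Set[str]] = []
    for node in sorted(graph):
        if node in seen:
            continue
        component: Set[str] = set()
        stack = [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        clusters.append(component)
    return clusters


def seed_clusters(graph: Graph) -> List[Set[str]]:
    """Assumed CLUP-like scheme: each cluster is a seed plus its direct homologues only."""
    assigned: Set[str] = set()
    clusters: List[Set[str]] = []
    # Visit fragments with few homologues first (ties broken by name).
    for seed in sorted(graph, key=lambda n: (len(graph[n]), n)):
        if seed in assigned:
            continue
        members = {seed} | graph[seed]
        clusters.append(members)
        assigned |= members
    return clusters


# Chain A~B~C~D: A and D are linked only through intermediates.
chain: Graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(single_linkage(chain))  # one cluster containing all four fragments
print(seed_clusters(chain))   # two clusters; A and D are kept apart
```

Because a seed pulls in all of its direct homologues regardless of earlier assignments, a fragment can end up in more than one cluster in this sketch, which is the kind of degeneracy discussed for the CHOP fragments.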
Over 63,000 multi-member clusters from the CHOP fragments. We clustered all 499,465 CHOP fragments from 62 entirely sequenced genomes, including those uncut full-length proteins that were treated as single fragments (CLUP, Methods). CLUP grouped these fragments into 118,108 single- and 63,300 multi-member clusters. About 43% of the multi-member clusters had more than three and about 13% over ten members ( Fig. 5 A). The largest cluster contained 3,343 members; the seed for this cluster was an ATP-binding cassette from the ABC transporters. Our premise was that structural domains constitute something like the atom or basic unit of sequence space. If true, one such unit should not occur in different clusters. We could explicitly build such a constraint into our clustering scheme. However, the benefit of the simplicity of CLUP was that no such constraint was applied. Hence for our clustering scheme, the percentage of ambivalently clustered fragments constituted an indirect measure for the reliability of the CHOP fragmentation. The majority of CHOP fragments (76%) were single-cluster fragments, while only 3% of the fragments were associated with more than three clusters ( Fig. 5 B).
Fig. 5 : Cluster sizes and degeneracy. (A) For the subset of multi-member clusters, 41% have two, about two-thirds more than four, and about 13% more than 10 members. (B) Most CHOP fragments (76%) were associated with a single cluster (not degenerate), while only 3% of the fragments were associated with more than three clusters (highly degenerate).
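The degeneracy reported in Fig. 5 B is a simple count of how many clusters each fragment belongs to. A minimal sketch of that bookkeeping follows; the cluster contents are made up for illustration, and the function names `clusters_per_fragment` and `fraction` are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, Set


def clusters_per_fragment(clusters: Iterable[Set[str]]) -> Counter:
    """Count, for every fragment, the number of clusters that contain it."""
    counts: Counter = Counter()
    for members in clusters:
        counts.update(set(members))
    return counts


def fraction(counts: Counter, predicate: Callable[[int], bool]) -> float:
    """Fraction of fragments whose cluster count satisfies `predicate`."""
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if predicate(n)) / len(counts)


toy_clusters = [{"f1", "f2"}, {"f2", "f3"}, {"f4"}]  # f2 is shared by two clusters
counts = clusters_per_fragment(toy_clusters)
print("single-cluster fragments:", fraction(counts, lambda n: n == 1))
print("fragments in more than three clusters:", fraction(counts, lambda n: n > 3))
```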
CHOP is the first step toward complete domain dissection. On the one hand, CHOP failed to identify all domain boundaries: 31% of the proteins remained un-chopped; these accounted for 15% of all final CHOP 'fragments' ( Fig. 3 A). On the other hand, even the remaining CHOP fragments and those proteins that were not chopped were, on average, more similar to structural domains than to full-length proteins ( Fig. 4 ). The obvious difference between the length distribution of CHOP fragments and that of PrISM domains or Pfam-A regions was an abundance of short fragments. Since this abundance was largely due to the fragments remaining after chopping ( Fig. 4 ), it is not clear whether these fragments indicated the limitation of the CHOP procedure or simply constituted a large pool of domain-linking fragments. For example, about one fifth of the remaining fragments were probably signal peptides, and many originated from membrane-spanning helices [67, 9, 65] . We did find some examples in these remaining fragments that appeared to be valid functional regions rather than just left-over fragments. For example, the yeast Tup1 protein was dissected by CHOP into a C-terminal domain (residues 315-713) homologous to the PDB entry 1ERJ [68] and into an N-terminal fragment (1-315); it has been reported that this N-terminal portion of Tup1 is responsible for the oligomerisation of Tup1 and for mediating the repression of transcription [69] . Overall, CHOP fragments appeared less atomised than those obtained through prediction from, e.g., ProDom [6] or through expert annotation from SMART [70, 33] ( Fig. S4, Supplement ). Obviously, the length distribution of CHOP fragments constituted only a very crude means of evaluating the reliability of CHOP. However, given the nature of our procedure - we used all reliable information to chop - we had no data left that enabled a more comprehensive evaluation.
Cross-validating our procedure by chopping according to Pfam-A and SWISS-PROT only for proteins of known structure confirmed that CHOP was reasonably consistent. However, even this cross-validation was somewhat circular. The CHOP procedure never dissected proteins without strong evidence; this reduced the data set considerably without introducing too many mistakes. Currently, we are working on the next step, namely the development of a de novo prediction of domain boundaries that may help to distinguish 'single domain' from 'uncut because of lacking homology' for the remaining fragments and for the untouched full-length proteins.
Most proteins had more than one structural domain-like fragment. On average, we found a CHOP fragment for every 106 residues ( Fig. 2 B). Thus, most of the proteins from the 62 entirely sequenced organisms contained more than one domain-like fragment; in fact, over 70% of the proteins that could be chopped had multiple domain-like fragments ( Fig. 2 A). This number is significantly higher than the fraction of multi-domain proteins previously described in E. coli (6% with 2-4 domains) [71] and it also exceeds the percentage of multi-domain proteins analysed in a detailed comparison of structurally known enzymatic domain 'mosaics' (32%) [24, 72] . Presumably, the latter differs so much from what we observed because it reflects the bias of known structures in the PDB: most PDB proteins appear to be single-domain proteins [7] . However, we observed that the bias toward single-domain proteins in PDB partially originated from the fact that proteins are often cut in order to determine structure: less than 44% of our PrISM domain set appeared to correspond to single-domain proteins. Presumably, our estimate of the ratio of single- to multi-domain proteins provides a lower limit to the 'real' number of multi-domain proteins, since the CHOP algorithm misses domains that have not been characterised previously.
Do domains from single and multi-domain proteins differ? Given the CHOP algorithm, fragments left and right of a structural domain will often be very short (e.g. a signal peptide or a few helices before the globular domain in G-protein-coupled receptors). We therefore expected that the relation between length and number of fragments would differ between single- and multi-fragment proteins. More specifically, we expected that this 'left-over' effect would be more extreme for proteins with two than for those with five fragments (less 'left-over'). While the data confirmed the expected tendency, we observed that the protein length increased linearly with the number of CHOP fragments ( Fig. 2 B). In fact, on average CHOP fragments for multi-fragment proteins stretched over about 106 residues (slope of linear fit in Fig. 2 B), while single-fragment proteins were about 256 residues long (value of fit for N=1). We might argue that this difference between the average lengths of single- and multi-fragments originated from the 'left-over' problem. However, this interpretation was not consistent with the data, since the 'left-over effect' between proteins with two and five fragments appeared negligible. Did this suggest that there really is a genuine difference between the domains used in single- and those used in multiple-domain proteins? Maybe it is less expensive to shuffle short domains than long ones? Do proteins need a minimal length for entropic reasons in order to fold? Are shorter domains relics of ancient, longer domains? Or do catalytic functions require a minimal length, and do many of the shorter domains in multi-domain proteins increase the complexity of regulation [73] ? Our data could not shed light on these speculations.
CHOP procedure rather robust with respect to local parameter changes. The two major parameters for the CHOP procedure are the minimal coverage of a known domain, i.e. the minimal fraction of a PrISM or Pfam-A region that we require to be similar in sequence in order to chop, and the minimal level of sequence similarity. By default, we required 80% coverage and BLAST E-values < 10^-3. However, the overall number of fragments did not alter significantly when modifying these parameters. For example, the number of CHOP fragments varied by only 0.3% when the coverage threshold was changed from 90% to 70%, and by 0.6% when changing the BLAST E-value from 10^-3 to 10^-1 ( Fig. S1, Supplement ). We also checked the consistency of the sources for chopping by chopping according to only one source and verifying the fragments with another. The domain boundaries from similarity to SWISS-PROT proteins were rarely in conflict with those from similarity to PrISM domains (0.2%) and Pfam domains (1%). For proteins that could be chopped according to both PrISM and Pfam-A, the number of domain-like fragments resulting from applying the two methods independently was largely consistent: 40% of the proteins showed no difference, and 34% differed by one domain ( Fig. S2, Supplement ). When the two methods differed substantially, Pfam-A often detected multiple copies of a repeat, while PrISM treated them together as one structural domain. Examples for this are typically very short fragments such as helix-turn-helix motifs, or the Pfam-A dissection of beta-helices into single helices annotated independently. We removed some of these cases by introducing another parameter that removed all Pfam-A entries shorter than 30 residues. Thus, overall CHOP was rather consistent and fairly robust to local parameter changes.
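The default thresholds described in this paragraph can be written as a single acceptance test for an alignment hit. The sketch below is an assumption about how such a filter might look, not the CHOP code: coverage is taken as the aligned fraction of the known domain, the E-value must lie below 1e-3, and domain entries shorter than 30 residues are skipped (the text applies this last filter to Pfam-A entries). The function name `accept_hit` and its parameter names are hypothetical.

```python
def accept_hit(aligned_len: int,
               domain_len: int,
               evalue: float,
               min_coverage: float = 0.8,   # default coverage quoted in the text
               max_evalue: float = 1e-3,    # default BLAST E-value threshold
               min_domain_len: int = 30) -> bool:
    """Return True if a hit to a known domain is strong enough to cut with."""
    if domain_len < min_domain_len:
        return False  # very short entries (e.g. single repeats) are ignored
    coverage = aligned_len / domain_len
    return coverage >= min_coverage and evalue < max_evalue


print(accept_hit(85, 100, 1e-5))                      # passes both defaults
print(accept_hit(75, 100, 1e-5))                      # coverage too low
print(accept_hit(75, 100, 1e-5, min_coverage=0.7))    # passes with a relaxed coverage threshold
```

Exposing the thresholds as parameters makes it straightforward to repeat the robustness check described above, e.g. by varying the coverage between 0.7 and 0.9 and counting the resulting fragments.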
CLUP succeeded to some extent in clustering protein space. Previously, we showed that single-linkage clustering failed to generate reasonable clusters from full-length eukaryotic proteins [60] . In this study, we demonstrated the failure of such a simple clustering scheme more systematically ( Table 1 ). Even when clustering PDB domains with single-linkage we found many unrelated protein pairs in the resulting clusters. We presented a novel clustering scheme that did not merge clusters based on transitive homology. Overall, CLUP yielded clusters that were reasonable by the following criteria: (1) the largest cluster did not appear to join many completely unrelated proteins, (2) only about 24% of the clusters had cross-relations, i.e. two different clusters shared one or more fragments, and (3) the overall length distribution of the clusters appeared reasonable ( Fig. S4, Supplement ). As much as CHOP is a first step toward domain dissection, CLUP is a first step toward clustering these domain-like fragments. While systems like ProtoNet [57] , ProtoMap [46, 58] or BioSphere [51] explicitly map out the protein universe, CLUP has no notion of 'distance' between two clusters other than that we cannot merge the two. Instead, CLUP only groups all fragments that are likely to have some common structural core. We imagine that systems comprehensively mapping sequence space could benefit directly from using our CHOP fragments as input to their clustering. On the one hand, the success of CLUP relies on the fact that CHOP succeeded in dissecting a reasonable fraction of the proteomes into domain-like fragments. Much more advanced and intelligent clustering schemes have been proposed [46, 43, 58, 47, 74, 75, 49, 76, 51, 56, 57] . Some of these seemingly even cope with the complexity of eukaryotic proteomes. On the other hand, the advantage of CLUP over more complex systems may be that the clustering is based exclusively on pair-relations: while e.g. ProtoNet and BioSphere enable discovering non-trivial connections, both advanced systems cannot always 'name' the relation between any pair of proteins in two connected clusters. In contrast, all pairs in CLUP clusters most likely share a common structural fold-like region. This feature was crucial to applying CLUP to the task of selecting targets for structural genomics [63] . Our structural domain-like fragments may also constitute good starting points for reducing the noise in two-hybrid, TAP and mass-spectrometry experiments by probing protein-protein interactions between fragments rather than between full-length proteins.
How many clusters of close homologues are there? A back-of-the-envelope calculation led Cyrus Chothia to put forward the challenge that there are only 1000 different folds in nature [77]. This short note set off an avalanche of alternative estimates, most of which corrected the number upwards [78, 79, 80, 81, 82, 9, 83, 84]. Structural genomics initiatives strive to determine most existing folds experimentally [85, 86, 87, 88, 89, 90, 91, 92, 83, 16]. How many structures will these initiatives have to determine to cover fold space? Even if Chothia's estimate were right within an order of magnitude, i.e. if there were fewer than 10,000 folds in nature, it has been estimated that for technical reasons structural genomics would have to determine 2-10 times more structures to get a representative for each fold [86, 93, 83, 65]. However, all these estimates were based on a number of assumptions generalising from statistics about proteins of known structure; none of these estimates actually clustered sequences and explicitly selected targets for structural genomics. The methods that we presented here can be used for target selection. The only step that remains to actually select targets for structural genomics is to exclude all clusters for which we either already have structural information or which may not constitute high-priority targets for structural genomics [63]. About 30,000 of the multi-member clusters have no obvious sequence similarity to known structures and hence could constitute a minimal set of targets for structural genomics (data not shown). However, CLUP also created 118K singletons; some of these might find a connection to the multi-member clusters when we know more protein sequences. Yet, many may eventually support another daring challenge put forward by Moult and colleagues [84], namely that many folds have been realised only once in evolution. 
In any case, our clusters of structural domain-like fragments are likely to constitute a good starting point for both structural and functional genomics.
Structural domains for PDB [1] proteins were extracted by PrISM [3] . We chose PrISM rather than other databases such as SCOP [7] and CATH [29] for the following reasons: (1) All domains defined by PrISM are continuous in sequence. This was important to us to maximise the probability that a domain identified by sequence similarity is correct. (2) Since we have the program locally available, we can obtain the domain classification in 'real-time', i.e. while structures are being added to PDB.
CHOP implements three hierarchical steps that were applied by decreasing confidence in the accuracy of the information ( Fig. 1 ). In particular, we discarded all fragments from step S that overlapped with fragments identified in the previous step S-1 (more reliable identification of domain boundaries). At any step, we discarded fragments with less than 30 residues. For the set of all proteins in 62 organisms {P62}, we applied the following three steps.
(1) High reliability: PrISM domains. We applied a pairwise BLAST [94] search with each protein against all PrISM domains (49300 domains); we marked all hits at expectation values < 10^-2 that aligned at least 80% of the PrISM domain. If this criterion applied to more than one PrISM domain, we chose the longest. Next, we cut these domain-fragments from the list of all proteins, and repeated the search with the remaining fragments until we no longer found any similarity to PrISM. Note that the new BLAST is no longer strictly local. Therefore, the repeated search with fragments rather than full-length proteins may uncover similarities that were previously overlooked.
(2) Acceptable confidence: Pfam families. When comparing the length distribution of public curated databases of protein families, we found Pfam-A to come closest to the notion of structural domains [33]. Therefore, we assumed that Pfam-A constitutes the next best resource to increase coverage in our domain-dissection. For each of the remaining fragments, we searched for similarities in Pfam-A with HMMER [95] (global mode, Pfam release 7.0 with 5049 families, threshold 10^-2). If a similarity to any Pfam family was detected for the fragment, we again dissected it in the same way as before for the PrISM domains. The procedure was repeated until no similarity to Pfam was detected in any of the remaining fragments.
(3) Evolutionary relation unravelled by SWISS-PROT termini. If the N-terminal N1 residues of protein A are similar to ≥80% of an experimentally characterised, full-length protein B, while the N2 C-terminal residues of A find no similarity in known databases, we have a good reason to suspect that A has - at least - two domains. SWISS-PROT [8] contains full-length proteins that have typically been studied independently by experimentalists; in particular, SWISS-PROT is not prone to 'pollution' by short EST fragments. SWISS-PROT entries marked as "fragment" or shorter than 30 residues were excluded from our search. We identified all homologues of full-length proteins contained in SWISS-PROT [8] through pairwise BLAST searches with all the remaining fragments (BLAST E-value < 10^-2, covering at least 80% of the SWISS-PROT protein). As before, we iterated until no similarity remained.
In order to retain the integrity of domains, we imposed the restriction that cutting points were only introduced when the homology covered most (>80%) of the structural domain (PrISM), domain-like region (Pfam), or full-length protein. The final set of fragments was the combination of all fragments identified in the three steps and all remaining fragments that were longer than 30 residues. The dissection was not sensitive to our choice of parameters (BLAST E-value of 10^-2 and coverage of 80%).
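The three hierarchical steps can be illustrated with a minimal sketch. It is not the original CHOP implementation: the names `Hit`, `best_hit`, `chop_step` and `chop` are hypothetical, the search functions stand in for the BLAST and HMMER runs, and "longest" is interpreted here as the longest aligned region on the query. Because each later step only searches the regions left uncut by the earlier, more reliable steps, fragments from step S cannot overlap those from step S-1.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MIN_LEN = 30         # fragments shorter than this are discarded
MIN_COVERAGE = 0.8   # fraction of the known domain that must be aligned
MAX_EVALUE = 1e-2    # E-value threshold quoted in the Methods

@dataclass(frozen=True)
class Hit:
    start: int       # start on the query protein (0-based, inclusive)
    end: int         # end on the query protein (exclusive)
    coverage: float  # aligned fraction of the known domain or protein
    evalue: float

Segment = Tuple[int, int]
# Hypothetical stand-in for one database search (PrISM, Pfam-A, SWISS-PROT).
SearchFn = Callable[[Segment], List[Hit]]

def best_hit(seg: Segment, hits: List[Hit]) -> Optional[Hit]:
    """Return the longest hit inside `seg` that passes all thresholds."""
    ok = [h for h in hits
          if seg[0] <= h.start and h.end <= seg[1]
          and h.evalue < MAX_EVALUE
          and h.coverage >= MIN_COVERAGE
          and h.end - h.start >= MIN_LEN]
    return max(ok, key=lambda h: h.end - h.start, default=None)

def chop_step(segments: List[Segment], search: SearchFn):
    """Cut domains out of `segments` until the search finds nothing new."""
    domains, rest = [], []
    stack = list(segments)
    while stack:
        seg = stack.pop()
        hit = best_hit(seg, search(seg))
        if hit is None:
            rest.append(seg)          # nothing found: pass on to the next step
            continue
        domains.append((hit.start, hit.end))
        # Search the flanking pieces again if they are long enough.
        for left, right in ((seg[0], hit.start), (hit.end, seg[1])):
            if right - left >= MIN_LEN:
                stack.append((left, right))
    return sorted(domains), sorted(rest)

def chop(length: int, searches: List[SearchFn]) -> List[Segment]:
    """Apply the searches in order of decreasing reliability."""
    fragments, rest = [], [(0, length)]
    for search in searches:
        found, rest = chop_step(rest, search)
        fragments += found
    return sorted(fragments + rest)   # uncut remainders are kept as fragments
```

Calling `chop(len(sequence), [prism_search, pfam_search, swissprot_search])` with three user-supplied search functions would return the fragment boundaries in protein coordinates.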
Our goal for the final clustering is that all members of one cluster share a region of common structure that basically spans an entire structural domain. We approached this goal by clustering all the fragments (and uncut full-length proteins) obtained from CHOP. Toward this end, we simply ran an all-against-all PSI-BLAST [94] search with a threshold E-value of < 10^-3. The final clustering step involved the following iteration.
(1) Initialise: Put all CHOP fragments onto a stack C, sort (a) according to number of homologues found by PSI-BLAST (ascending), and (b) by ascending sequence length. In other words, we began with the smallest group and the shortest sequence.
(2) Iterate: Select the first fragment from the stack (small group, short sequence), consider it as seed, and create a cluster that contains this seed and all related fragments. Finally, remove the seed and all the related fragments from the stack. Note that while seeds and related fragments that are removed from the stack at this point may still become members of other clusters, they will not seed any other cluster.
(3) Repeat until completion: We repeat step 2 until no fragment is left in our stack.
(4) Merge clusters: Two clusters are merged if the regions of any common member that are homologous to the two seeds overlap significantly (80%).
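The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original CLUP code: the function names `clup` and `overlap_fraction` are hypothetical, the PSI-BLAST results are assumed to be available as a dictionary `hits[seed][member]` giving the region of `member` aligned to `seed`, the overlap is measured relative to the shorter of the two regions, and chained merges in step 4 are resolved with a simple union-find.

```python
from typing import Dict, List, Set, Tuple

Region = Tuple[int, int]  # aligned region on a member, [start, end)

def overlap_fraction(a: Region, b: Region) -> float:
    """Overlap of two regions relative to the shorter one (assumption)."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    shorter = min(a[1] - a[0], b[1] - b[0])
    return max(inter, 0) / shorter if shorter > 0 else 0.0

def clup(lengths: Dict[str, int],
         hits: Dict[str, Dict[str, Region]],
         min_overlap: float = 0.8) -> List[List[str]]:
    # Step 1: sort by number of homologues, then by length (both ascending).
    order = sorted(lengths, key=lambda f: (len(hits.get(f, {})), lengths[f]))

    # Steps 2 and 3: each fragment still on the stack seeds one cluster.
    removed: Set[str] = set()
    clusters: Dict[str, Dict[str, Region]] = {}
    for frag in order:
        if frag in removed:
            continue                      # already taken off the stack
        members = dict(hits.get(frag, {}))
        members[frag] = (0, lengths[frag])
        clusters[frag] = members          # removed fragments may still join
        removed.update(members)           # but they never seed a cluster

    # Step 4: merge clusters whose shared member is hit in overlapping regions.
    parent = {seed: seed for seed in clusters}

    def find(s: str) -> str:
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    seeds = list(clusters)
    for i, a in enumerate(seeds):         # quadratic in seeds; fine for a sketch
        for b in seeds[i + 1:]:
            common = clusters[a].keys() & clusters[b].keys()
            if any(overlap_fraction(clusters[a][m], clusters[b][m]) >= min_overlap
                   for m in common):
                parent[find(b)] = find(a)

    merged: Dict[str, Set[str]] = {}
    for seed, members in clusters.items():
        merged.setdefault(find(seed), set()).update(members)
    return sorted(sorted(m) for m in merged.values())
```

A fragment that is hit by two seeds in non-overlapping regions ends up in both clusters, which corresponds to the fragments reported as members of at least two clusters.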
Thanks to our experimental colleagues at the Northeast Structural Genomics Consortium (NESG) for their advice and strong support of our project. In particular, thanks to Guy Montelione (Rutgers) for his invaluable optimism in leading the NESG team. Thanks also to the other experimental teams around Tom Acton (Rutgers), Cheryl Arrowsmith and Aled Edwards (Toronto), Wayne Hendrickson, John Hunt and Liang Tong (Columbia), Mike Kennedy (Pacific Northwest Natl Laboratory, Richland) and George DeTitta (Buffalo). Thanks to our colleagues from target selection for crucially helpful discussions: Barry Honig and Sharon Goldsmith (Columbia) and Diana Murray (Cornell). Thanks to Mark Gerstein and his group (Yale) for pushing us to develop PEP. Particular thanks to Phil Carter (Columbia, New York and Imperial College, London) for building the databases PEP, CHOP, and CLUP, and to An-Suei Yang (Columbia) for providing and helping with PrISM. Thanks also to Michal Linial (Jerusalem) for insightful discussions that kept the fun part of science alive. This work was supported by a grant from the Protein Structure Initiative of the National Institutes of Health (P50 GM62413). Last but not least, thanks to all those who deposit their experimental data in public databases, in particular in the context of structural genomics, and to the teams around PDB (Helen Berman, Rutgers and Phil Bourne, UCSD), Pfam (Alex Bateman, Sanger and Erik Sonnhammer, Stockholm), and SWISS-PROT (Amos Bairoch, SIB Geneva) who maintain these databases that were central to this work.
| 1. | Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucl. Acids Res., 28, 235-242. |
| 2. | Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L. et al. (2002). The Pfam protein families database. Nucl. Acids Res., 30, 276-80. |
| 3. | Yang, A. S. & Honig, B. (2000). An integrated approach to the analysis and modeling of protein sequences and structures. III. A comparative study of sequence conservation in protein structural families using multiple structural alignments. J. Mol. Biol., 301, 691-711. |
| 4. | Sonnhammer, E. L., Eddy, S. R. & Durbin, R. (1997). Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins, 28, 405-420. |
| 5. | Corpet, F., Servant, F., Gouzy, J. & Kahn, D. (2000). ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucl. Acids Res., 28, 267-9. |
| 6. | Servant, F., Bru, C., Carrere, S., Courcelle, E., Gouzy, J. et al. (2002). ProDom: automated clustering of homologous domains. Brief Bioinform, 3, 246-51. |
| 7. | Lo Conte, L., Brenner, S. E., Hubbard, T. J., Chothia, C. & Murzin, A. G. (2002). SCOP database in 2002: refinements accommodate structural genomics. Nucl. Acids Res., 30, 264-7. |
| 8. | Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res., 28, 45-48. |
| 9. | Liu, J. & Rost, B. (2001). Comparing function and structure between entire proteomes. Prot. Sci., 10, 1970-1979. |
| 10. | Rost, B. (2002). Did evolution leap to create the protein universe? Curr. Opin. Str. Biol., 12, 409-416. |
| 11. | Rose, G. D. (1979). Hierarchic organization of domains in globular proteins. J. Mol. Biol., 134, 447-470. |
| 12. | Jaenicke, R. (1987). Folding and association of proteins. Prog. Biophys. molec. Biol., 49, 117-237. |
| 13. | Holm, L. & Sander, C. (1994). Parser for protein folding units. Proteins, 19, 256-268. |
| 14. | Hao, M. H. & Scheraga, H. A. (1998). Molecular mechanisms for cooperative folding of proteins. J. Mol. Biol., 277, 973-983. |
| 15. | Thornton, J. M., Orengo, C. A., Todd, A. E. & Pearl, F. M. (1999). Protein folds, functions and evolution. J. Mol. Biol., 293, 333-42. |
| 16. | Hurley, J. H., Anderson, D. E., Beach, B., Canagarajah, B., Ho, Y. S. et al. (2002). Structural genomics and signaling domains. TIBS, 27, 48-53. |
| 17. | Ponting, C. P. & Russell, R. R. (2002). The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct., 31, 45-71. |
| 18. | Sham, Y. Y., Ma, B., Tsai, C. J. & Nussinov, R. (2002). Thermal unfolding molecular dynamics simulation of Escherichia coli dihydrofolate reductase: thermal stability of protein domains and unfolding pathway. Proteins, 46, 308-320. |
| 19. | Zuckerkandl, E. & Pauling, L. (1965). Evolutionary divergence and convergence in proteins. In Evolving Genes And Proteins (Bryson, V. & Vogel, H. J., eds.), pp. 97-166, Academic Press, New York and London. |
| 20. | Baron, M., Norman, D. G. & Campbell, I. D. (1991). Protein modules. Trends Biochem. Sci., 16, 13-17. |
| 21. | Bennett, M. J., Schlunegger, M. P. & Eisenberg, D. (1996). 3D domain swapping: a mechanism for oligomer assembly. Prot. Sci., 5, 2455-2468. |
| 22. | Apic, G., Gough, J. & Teichmann, S. A. (2001). Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol., 310, 311-325. |
| 23. | Lupas, A. N., Ponting, C. P. & Russell, R. B. (2001). On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J. Struct. Biol., 134, 191-203. |
| 24. | Teichmann, S. A., Rison, S. C., Thornton, J. M., Riley, M., Gough, J. et al. (2001). The evolution and structural anatomy of the small molecule metabolic pathways in Escherichia coli. J. Mol. Biol., 311, 693-708. |
| 25. | Holm, L. & Sander, C. (1998). Dictionary of recurrent domains in protein structures. Proteins, 33, 88-96. |
| 26. | Dengler, U., Siddiqui, A. S. & Barton, G. J. (2001). Protein structural domains: analysis of the 3Dee domains database. Proteins, 42, 332-344. |
| 27. | Dietmann, S., Park, J., Notredame, C., Heger, A., Lappe, M. et al. (2001). A fully automatic evolutionary classification of protein folds: DALI domain dictionary version 3. Nucl. Acids Res., 29, 55-57. |
| 28. | Marchler-Bauer, A., Panchenko, A. R., Ariel, N. & Bryant, S. H. (2002). Comparison of sequence and structure alignments for protein domains. Proteins, 48, 439-446. |
| 29. | Orengo, C. A., Bray, J. E., Buchan, D. W., Harrison, A., Lee, D. et al. (2002). The CATH protein family database: A resource for structural and functional annotation of genomes. Proteomics, 2, 11-21. |
| 30. | Montelione, G. T., Zheng, D., Huang, Y. J., Gunsalus, K. C. & Szyperski, T. (2000). Protein NMR spectroscopy in structural genomics. Nat. Struct. Biol., 7, 982-985. |
| 31. | Ponting, C. P., Schultz, J., Milpetz, F. & Bork, P. (1999). SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucl. Acids Res., 27, 229-32. |
| 32. | Haft, D. H., Selengut, J. D. & White, O. (2003). The TIGRFAMs database of protein families. Nucl. Acids Res., 31, 371-373. |
| 33. | Liu, J. & Rost, B. (2003). Domains, motifs and clusters in the protein universe. Curr. Opin. Chem. Biol., 7, 5-11. |
| 34. | Henikoff, J. G. & Henikoff, S. (1996). Blocks database and its applications. Meth. Enzymol., 266, 88-104. |
| 35. | Henikoff, J. G., Greene, E. A., Pietrokovski, S. & Henikoff, S. (2000). Increased coverage of protein families with the blocks database servers. Nucl. Acids Res., 28, 228-30. |
| 36. | George, R. A. & Heringa, J. (2002). Protein domain identification and improved sequence similarity searching using PSI-BLAST. Proteins, 48, 672-81. |
| 37. | George, R. A. & Heringa, J. (2002). SnapDRAGON: a method to delineate protein structural domains from sequence data. J. Mol. Biol., 316, 839-851. |
| 38. | Wheelan, S. J., Marchler-Bauer, A. & Bryant, S. H. (2000). Domain size distributions can predict domain boundaries. Bioinformatics, 16, 613-618. |
| 39. | Kulikowski, C. A., Muchnik, I., Yun, H. J., Dayanik, A. A., Zhang, D. et al. (2001). Protein structural domain parsing by consensus reasoning over multiple knowledge sources and methods. Medinfo, 10, 965-969. |
| 40. | Murvai, J., Vlahovicek, K., Szepesvari, C. & Pongor, S. (2001). Prediction of protein functional domains from sequences using artificial neural networks. Genome Res., 11, 1410-1417. |
| 41. | Miyazaki, S., Kuroda, Y. & Yokoyama, S. (2002). Characterization and prediction of linker sequences of multi-domain proteins by a neural network. J. Struct. Funct. Gen., 2, 37-51. |
| 42. | Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O. & Eisenberg, D. (1999). A combined algorithm for genome-wide prediction of protein function. Nature, 402, 83-86. |
| 43. | Enright, A. J. & Ouzounis, C. A. (2000). GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16, 451-7. |
| 44. | Marsden, R. L., McGuffin, L. J. & Jones, D. T. (2002). Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Prot. Sci., 11, 2814-24. |
| 45. | Rigden, D. J. (2002). Use of covariance analysis for the prediction of structural domain boundaries from multiple protein sequence alignments. Prot. Engin., 15, 65-77. |
| 46. | Yona, G., Linial, N., Tishby, N. & Linial, M. (1998). A map of the protein space--an automatic hierarchical classification of all protein sequences. In Sixth International Conference on Intelligent Systems for Molecular Biology (ISMB98) (Glasgow, J., Littlejohn, T., Major, F., Lathrop, R., Sankoff, D. et al., eds.), pp. 212-221, AAAI Press, Montreal, Canada. |
| 47. | Heger, A. & Holm, L. (2001). Picasso: generating a covering set of protein family profiles. Bioinformatics, 17, 272-279. |
| 48. | Kriventseva, E. V., Biswas, M. & Apweiler, R. (2001). Clustering and analysis of protein families. Curr. Opin. Str. Biol., 11, 334-339. |
| 49. | Krause, A., Haas, S. A., Coward, E. & Vingron, M. (2002). SYSTERS, GeneNest, SpliceNest: exploring sequence space from genome to protein. Nucl. Acids Res., 30, 299-300. |
| 50. | Sasson, O., Linial, N. & Linial, M. (2002). The metric space of proteins-comparative study of clustering algorithms. Bioinformatics, 18, S14-S21. |
| 51. | Yona, G. & Levitt, M. (2002). Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol., 315, 1257-1275. |
| 52. | Li, W., Jaroszewski, L. & Godzik, A. (2001). Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282-283. |
| 53. | Przybylski, D. & Rost, B. (2002). Alignments grow, secondary structure prediction improves. Proteins, 46, 195-205. |
| 54. | Rost, B. (2002). Enzyme function less conserved than anticipated. J. Mol. Biol., 318, 595-608. |
| 55. | Wise, M. J. (2002). The POPPs: clustering and searching using peptide probability profiles. Bioinformatics, 18, S38-S45. |
| 56. | Kriventseva, E. V., Servant, F. & Apweiler, R. (2003). Improvements to CluSTr: the database of SWISS-PROT+TrEMBL protein clusters. Nucl. Acids Res., 31, 388-389. |
| 57. | Sasson, O., Vaaknin, A., Fleischer, H., Portugaly, E., Bilu, Y. et al. (2003). ProtoNet: hierarchical classification of the protein space. Nucl. Acids Res., 31, 348-352. |
| 58. | Yona, G., Linial, N. & Linial, M. (2000). ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucl. Acids Res., 28, 49-55. |
| 59. | Linial, M., Linial, N., Tishby, N. & Yona, G. (1997). Global self-organization of all known protein sequences reveals inherent biological signatures. J. Mol. Biol., 268, 539-556. |
| 60. | Carter, P., Liu, J. & Rost, B. (2003). PEP: Predictions for Entire Proteomes. Nucl. Acids Res., 31, 410-3. |
| 61. | Etzold, T. & Argos, P. (1993). SRS--an indexing and retrieval tool for flat file data libraries. CABIOS, 9, 49-57. |
| 62. | Amodeo, P., Fraternali, F., Lesk, A. M. & Pastore, A. (2001). Modularity and homology: modelling of the titin type I modules and their interfaces. J. Mol. Biol., 311, 283-296. |
| 63. | Liu, J., Acton, T., Goldsmith, S., Honig, B., Montelione, G. T. et al. (2003). Automatic target selection for structural genomics on eukaryotes. Prot. Sci., submitted. |
| 64. | Gough, J., Karplus, K., Hughey, R. & Chothia, C. (2001). Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol., 313, 903-19. |
| 65. | Liu, J. & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics, 18, 922-933. |
| 66. | Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley Press, Reading. |
| 67. | Nielsen, H., Brunak, S. & von Heijne, G. (1999). Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Prot. Engin., 12, 3-9. |
| 68. | Sprague, E. R., Redd, M. J., Johnson, A. D. & Wolberger, C. (2000). Structure of the C-terminal domain of Tup1, a corepressor of transcription in yeast. EMBO J., 19, 3016-27. |
| 69. | Tzamarias, D. & Struhl, K. (1994). Functional dissection of the yeast Cyc8-Tup1 transcriptional co-repressor complex. Nature, 369, 758-61. |
| 70. | Letunic, I., Goodstadt, L., Dickens, N. J., Doerks, T., Schultz, J. et al. (2002). Recent improvements to the SMART domain-based sequence annotation resource. Nucl. Acids Res., 30, 242-4. |
| 71. | Serres, M. H., Gopal, S., Nahum, L. A., Liang, P., Gaasterland, T. et al. (2001). A functional update of the Escherichia coli K-12 genome. Genome Biol., 2, RESEARCH0035. |
| 72. | Teichmann, S. A., Rison, S. C., Thornton, J. M., Riley, M., Gough, J. et al. (2001). Small-molecule metabolism: an enzyme mosaic. TIBTECH, 19, 482-486. |
| 73. | Dueber, J. E., Yeh, B. J., Chak, K. & Lim, W. A. (2003). Reprogramming control of an allosteric signaling switch through modular recombination. Science, 301, 1904-1908. |
| 74. | Remm, M., Storm, C. E. & Sonnhammer, E. L. (2001). Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol., 314, 1041-1052. |
| 75. | Abascal, F. & Valencia, A. (2002). Clustering of proximal sequence space for the identification of protein families. Bioinformatics, 18, 908-921. |
| 76. | Vlahovicek, K., Murvai, J., Barta, E. & Pongor, S. (2002). The SBASE protein domain library, release 9.0: an online resource for protein domain identification. Nucl. Acids Res., 30, 273-5. |
| 77. | Chothia, C. (1992). One thousand protein families for the molecular biologist. Nature, 357, 543-544. |
| 78. | Finkelstein, A. V. & Ptitsyn, O. B. (1987). Why do globular proteins fit the limited set of folding patterns? Prog. Biophys. molec. Biol., 50, 171-190. |
| 79. | Blundell, T. L. & Johnson, M. S. (1993). Catching a common fold. Prot. Sci., 2, 877-883. |
| 80. | Orengo, C. A., Jones, D. T. & Thornton, J. M. (1994). Protein superfamilies and domain superfolds. Nature, 372, 631-634. |
| 81. | Crippen, G. M. & Maiorov, V. N. (1995). How many protein folding motifs are there? J. Mol. Biol., 252, 144-151. |
| 82. | Wolf, Y. I., Grishin, N. V. & Koonin, E. V. (2000). Estimating the number of protein folds and families from complete genome data. J. Mol. Biol., 299, 897-905. |
| 83. | Vitkup, D., Melamud, E., Moult, J. & Sander, C. (2001). Completeness in structural genomics. Nat. Struct. Biol., 8, 559-566. |
| 84. | Coulson, A. F. & Moult, J. (2002). A unifold, mesofold, and superfold model of protein fold use. Proteins, 46, 61-71. |
| 85. | Rost, B. (1998). Marrying structure and genomics. Structure, 6, 259-263. |
| 86. | Sali, A. (1998). 100,000 protein structures for the biologist. Nat. Struct. Biol., 5, 1029-1032. |
| 87. | Shapiro, L. & Lima, C. D. (1998). The Argonne Structural Genomics Workshop: Lamaze class for the birth of a new science. Structure, 6, 265-267. |
| 88. | Burley, S. K., Almo, S. C., Bonanno, J. B., Capel, M., Chance, M. R. et al. (1999). Structural genomics: beyond the human genome project. Nat. Gen., 23, 151-157. |
| 89. | Montelione, G. T. & Anderson, S. (1999). Structural genomics: keystone for a Human Proteome Project. Nat. Struct. Biol., 6, 11-12. |
| 90. | Christendat, D., Yee, A., Dharamsi, A., Kluger, Y., Savchenko, A. et al. (2000). Structural proteomics of an archaeon. Nat. Struct. Biol., 7, 903-9. |
| 91. | Hendrickson, W. A. (2000). Synchrotron crystallography. TIBS, 25, 637-643. |
| 92. | Thornton, J. (2001). Structural genomics takes off. TIBS, 26, 88-89. |
| 93. | Linial, M. & Yona, G. (2000). Methodologies for target selection in structural genomics. Prog. Biophys. molec. Biol., 73, 297-320. |
| 94. | Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389-402. |
| 95. | Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14, 755-63. |
| 96. | Yang, A. S. & Honig, B. (2000). An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. J. Mol. Biol., 301, 679-689. |
| 97. | Yang, A. S. & Honig, B. (2000). An integrated approach to the analysis and modeling of protein sequences and structures. I. Protein structural alignment and a quantitative measure for protein structural distance. J. Mol. Biol., 301, 665-678. |
| Contact: rost@columbia.edu | Version: Dec 2, 2003 |