
Title: Analysing six types of protein-protein interfaces
Author: Yanay Ofran & Burkhard Rost
Quote: J Mol Biol, 325, 377-387

This article is published in (Journal of Molecular Biology, Vol xx, 2002, ppxx) © copyright Journal of Molecular Biology, Academic Press (2002). Academic Press is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.


Analysing six types of protein-protein interfaces

Yanay Ofran 1,2,* & Burkhard Rost 1,3,4,*

1 CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2 Dept. of Bio-medical Informatics, Columbia Univ., 630 West 168th Street, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
4 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
* Corresponding authors: email: ofran@cubic.bioc.columbia.edu, rost@columbia.edu; URL: http://cubic.bioc.columbia.edu/; Tel: +1-212-305-3773; Fax: +1-212-305-7932



 


Abstract

Non-covalent residue side-chain interactions occur in many different types of proteins and facilitate many biological functions. Are these differences manifested in the sequence compositions and/or the residue-residue contact preferences of the interfaces? Previous studies analysed small data sets and gave contradicting answers. Here, we introduced a new data-mining method that yielded the largest high-resolution data set of interactions analysed. We also introduced an information theory-based analysis method. Based on sequence features we were able to clearly differentiate six types of protein interfaces, each corresponding to a different functional or structural association between residues. Particularly, we found significant differences in amino acid composition and residue-residue preferences between interactions of residues within the same structural domain and between different domains, between permanent and transient interfaces, and between interactions associating homo- and hetero-oligomers. The differences between the six types were so substantial that using amino acid composition alone, we could statistically predict to which of the six types of interfaces a pool of 1000 residues belongs at 63-100% accuracy. All interfaces differed significantly from the background of all residues in SWISS-PROT, from the group of surface residues, and from internal residues that were not involved in non-trivial interactions. Overall, our results may suggest that the interface type could be predicted from sequence and that interface-type specific mean-field potentials may be adequate for certain applications.

 

Key words: protein-protein interaction, protein complexes, protein interface, protein folding, drug design, molecular recognition, interface area, protein modelling, bioinformatics, statistics.

 

Abbreviations used

DIP: database of interacting proteins [1]
PDB: Protein Data Bank of experimentally determined 3D structures of proteins [2, 3]
SWISS-PROT: human-curated database of annotated protein sequences [4]


 

 

 

Introduction

Do different types of interactions use different biochemical mechanisms? Non-covalent contacts between residue side chains are the basis for protein folding, protein assembly, and protein-protein interaction. These contacts occur under many different conditions, and facilitate a variety of interactions and associations within and between proteins. For example, residue-residue contacts determine protein structure by a myriad of interactions between residue side chains. Non-covalent interactions between side-chains also mediate the assembly of folded chains into multi-chain proteins. In these two instances the interactions are permanent in the sense that they typically last for the lifetime of a protein. However, non-covalent residue-residue interactions can also be transient as in receptor-ligand interaction or in signal transduction. These interactions typically last for only short times. Given the wide range of interfaces, one may hypothesise that different types of interactions are facilitated by different biochemical mechanisms.

Previous studies. Many studies have investigated whether or not the characteristics of interfaces differ between e.g. internal (within the same chain) and external (between different chains) interactions [5, 6, 7, 8, 9, 10, 11] . Although all studies analysed proteins of known structure, their results were contradictory. Three theoretical, technical and computational problems may account for these differences. (1) In order to draw veritable conclusions from the available structural data it is necessary to analyse as many proteins as possible. However, none of the studies fully exploited the wealth of data available in the Protein Data Bank (PDB) [2, 3] ; most analyses have been limited to relatively small, hand-selected data sets. One reason for analysing small data sets was that there is no simple way to automatically distinguish (i) interfaces between two chains that belong to one multi-chain protein and (ii) interfaces between two different proteins. (2) Due to small datasets, most studies could not distinguish between homo- and hetero-multimers or between permanent and transient interactions. Instead, they had to focus on comparing internal (within one chain) vs. external (between chains) interactions. (3) Most studies have described external interactions through surface patches. Such surface patches may not capture all aspects of protein interactions. For example, slightly buried residues with long side chains may be missed although they participate in interfaces. Furthermore, analyses of residue mutations have indicated that the contribution to the free energy of binding is not evenly distributed across the interface [12] : Some residues identified as part of a surface patch may form important contacts while others may not form contacts at all. Therefore, the analysis of surface patches may not capture all residue-residue contacts that underlie the interaction.

Different conclusions from analysing surface patches. Comparisons of protein interfaces have yielded contradictory results. Some studies report that the amino acid composition of different types of interfaces is similar [8, 13, 11]; others report significant differences [6, 9]. Most studies focused on comparing internal and external interfaces. A few studies distinguished external interfaces in more detail. For instance, Jones & Thornton [5] proposed a distinction between 'obligatory' interactions, i.e. interfaces between chains that are in permanent contact (e.g. multi-chain proteins), and transient interactions, i.e. interfaces between separate proteins that interact only transiently to carry out a particular biological task (e.g. signal transduction or receptor-ligand binding). Unfortunately, such a detailed distinction of external interfaces reduced the available hand-selected data sets even further. Nevertheless, two groups suggested that the composition differs between internal, transient, and obligatory interfaces [6, 9]. One might try to overcome the problem of non-representative data sets by assuming that all homo-oligomers constitute permanent and all hetero-oligomers transient interactions. If so, we could automatically classify the whole PDB into transient and permanent oligomers. However, there are many examples of permanent hetero-oligomers and of transient homo-oligomers. Furthermore, even if we accept this assumption, the literature still gives conflicting answers to the question of whether or not residue-residue preferences differ between homo- and hetero-oligomers.

We developed a simple data-mining method to analyse and sort structural data in a way that allows analysing interfaces in very large data sets of high-resolution structures. In particular, we sorted the data into different groups of homo- vs. hetero-oligomers and permanent vs. transient interactions. To our knowledge, this is the largest non-redundant data set of residue-residue contacts analysed thus far. We found significant differences in the sequence features between the following six types of interfaces:

  1. intra-domain: interfaces within one structural domain,
  2. domain-domain: interfaces between different domains within one chain,
  3. homo-obligomer: interfaces between permanently interacting identical chains,
  4. homo-complex: interfaces between transiently interacting identical protein chains,
  5. hetero-obligomer: interfaces between permanently interacting different protein chains,
  6. hetero-complex: interfaces between different transiently interacting protein chains.

We introduced the term 'obligomer' to denote interfaces between residues from two chains that are 'obligatory' in the sense introduced by Jones & Thornton [5] . In contrast, we referred to 'complexes' as interfaces between transiently interacting chains. In the literature, all interfaces between different chains ('hetero') are often referred to as protein-protein interactions. Note that while results from experiments such as yeast-two-hybrid systems [14, 15] are usually thought to reflect generic protein-protein interactions, these experimental means may also detect interfaces between identical chains ('homo') [2, 1] .

 

Results and Discussion

Accurate automatic distinction between homo- and hetero-interfaces. Most PDB records that describe the structure of more than one chain do not specify whether the different chains belong to a single protein (interacting permanently), or to several proteins (interacting transiently). This data-mining problem has often been quoted as the reason for using small data sets and/or for the particular way in which external interfaces were distinguished [Jones, 1997 #4;Jones, 1996 #5;Lo Conte, 1999 #19;Jernigan, 1996 #35;Bahar, 1997 #17;Bahar, 1996 #31;Keskin, 1998 #6;Sheinerman, 2000 #12;Nayal, 1999 #69;Fetrow, 2001 #70;Glaser, 2001 #16]. Here, we proposed an extremely simple solution: profit from the biological expertise that is at the heart of the SWISS-PROT database [4] (Methods). This simple procedure correctly and automatically reproduced the small data sets that were hand-selected for previous publications [5, 6, 9] , i.e. we found all the complexes identified in the literature also to be classified as complexes by our simple assignment method. Moreover, the resulting data sets were more than one order of magnitude larger than data sets analysed in most previous studies ( Table 1 ). Thus, we could analyse statistically significant sets even for a very fine-grained separation of six interface types.



Table 1: Data set statistics

Type of interface                  Number of contacts
Internal    intra-domain ¹              3,340,485
            domain-domain ²               255,144
External    homo-obligomers ³             218,104
            homo-complexes ⁴                3,077
            hetero-obligomers ⁵            18,886
            hetero-complexes ⁶            166,412

∆ Contacts: residues were defined as 'in contact' if the separation between the two closest atoms was ≤ 6 Å. We separated the following six types of interfaces (Methods for details):
¹ Intra-domain: contacts between residues in the same structural domain (according to the domain definition of PrISM [39]).
² Domain-domain: contacts between residues in different structural domains in the same chain.
³ Homo-obligomers: contacts between residues on two different chains that have identical sequence and are permanent in the sense that we have no evidence for any biological interaction of the monomer.
⁴ Homo-complexes: contacts between residues on two different chains that have identical sequence and are transient in the sense that another chain of that sequence is also observed in the cell as a functional monomer.
⁵ Hetero-obligomers: contacts between two non-identical chains from the same protein (permanent).
⁶ Hetero-complexes: contacts between two non-identical chains from two different proteins (transient).



 

Data set so large that 'statistical significance' no longer implies 'scientific meaning'. One generic problem of bioinformatics is the lack of suitable statistical tools. Significance tests such as χ² are sensitive to sample size [16]. Consequently, applying these tests to very large data sets is prone to inferential errors, especially when the number of degrees of freedom is low compared to the number of data points [16]. When we applied the standard χ² test to our data, we found that the differences between the amino acid compositions of the six interfaces were statistically extremely significant [Ofran, 2002 #71]. However, even when we randomly reshuffled our data sets, χ² indicated significant differences between such nonsense splits. Therefore, we introduced a method that used the Jensen-Shannon information to explore the self-consistency of the data (find-self procedure, Methods). This procedure revealed that the amino acid compositions differed significantly between the six interface types: samples of 1000 residues taken at random from each type of interface correctly identified their own type in 63-100% of the cases (Table 2). Note that this did not imply that >63% of the individual interfaces were correctly classified, but rather that the pool of contacts from each type of interface was consistent to a certain extent. Note furthermore that the absolute values of the percentages depend on the size of the samples drawn at each iteration (i.e. 1000, Methods).
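The idea behind the find-self procedure can be illustrated with a minimal sketch. The details of the actual procedure are given in Methods and are not reproduced here; the code below only assumes that pooled samples are compared through the Jensen-Shannon divergence of their amino acid compositions and that a sample is assigned to the type whose reference sample is closest. The function names (composition, jensen_shannon, find_self), the number of iterations and the data layout are illustrative assumptions, not the implementation used in this study.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(residues):
    """Relative frequency of each of the 20 amino acids in a sample."""
    counts = Counter(residues)
    total = len(residues)
    return [counts[aa] / total for aa in AMINO_ACIDS]

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (in bits) between two distributions."""
    def kl(a, b):
        # Terms with a zero probability in 'a' contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def find_self(pools, sample_size=1000, iterations=200, seed=0):
    """Count how often a random sample from each type is closest to each type.

    pools maps an interface-type name to a sequence of one-letter residue
    codes (hypothetical layout). Returns a confusion table of counts.
    """
    rng = random.Random(seed)
    confusion = {name: Counter() for name in pools}
    for _ in range(iterations):
        for name, residues in pools.items():
            query = composition(rng.sample(residues, sample_size))
            # One fresh reference sample per type, including the query's own type.
            references = {
                other: composition(rng.sample(other_residues, sample_size))
                for other, other_residues in pools.items()
            }
            best = min(references,
                       key=lambda other: jensen_shannon(query, references[other]))
            confusion[name][best] += 1
    return confusion
```

Dividing each row of the returned counts by the number of iterations gives percentages of the kind reported in Table 2; as noted above, these values depend on the sample size.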

Differences in sequence on two levels: amino acid composition and contact preferences. In general, the concept of 'difference in sequence' has two aspects: interfaces may differ in their amino acid composition and/or their residue-residue contact preferences. For example, complexes might have fewer negatively charged residues than obligomers, however, complexes might incorporate these fewer negative charges more often into salt-bridges. We investigated these two aspects (residue composition and residue-residue contact preferences) separately.

The six interface types differed significantly in their amino acid compositions. The find-self procedure unravelled that there are at least six types of interfaces that differ significantly in their amino acid composition ( Table 2 : all values on diagonal substantially higher than off-diagonal counts). In terms of the most coarse-grained separation, the sequence compositions differed most strongly between internal and external interfaces ( Table 1 S, Supplement). The least-distinct types were (a) domain-domain interfaces that overlapped with intra-domain interfaces and (b) hetero-obligomers that resembled domain-domain interfaces remarkably often (off-diagonal elements in Table 2 ). The former (a) was not surprising since domain-domain interfaces are formed between residues on the same chain, and thus are likely to be similar to intra-domain contacts. On the other hand, we can also view domain-domain interfaces as permanent interactions between independently folded units. Therefore, we may expect them to be similar to hetero-obligomers, which by definition, associate independently folded chains. Transient interfaces between identical chains (homo-complexes) constituted the seemingly most distinct interface type. The fact that our method could automatically distinguish between the interfaces of different chains of the same protein (hetero-obligomers) and different chains of different proteins (hetero-complexes) strongly indicated that the success of our data-mining approach in distinguishing complexes from obligomers was not restricted to the expert-curated data sets.



Table 2: Significant difference between amino acid composition of six interface types

                    internal                         external
                    intra-domain  domain-domain     homo-obligomer  homo-complex  hetero-obligomer  hetero-complex
intra-domain            75.9          19.4               0.4             -              4.3              0.2
domain-domain           18.6          62.7               0.9             -             16.4              1.4
homo-obligomer           0.9           1.5              78.0             -              2.2             17.4
homo-complex              -             -                0.2           99.8              -                -
hetero-obligomer         3.9          17.2               1.9             -             70.8              6.2
hetero-complex           0.1           1.2              16.3             -              6.3             76.1

∆ Numbers indicate how often (in percentage points) the amino acid composition of 1000 residues drawn at random from one interface type was most similar to a different set of 1000 residues randomly drawn from another interface ('-' indicates values ≤ 0.1). For example, in 75.9% of the cases, the composition of the 1000 residues from intra-domain contacts was more similar to the composition of another 1000-residue sample from the same data set than to 1000-residue samples from any other interface type. All values on the diagonal reflect correct identification of the respective class; off-diagonal elements reflect the confusion. For instance, 19.4% of the intra-domain samples were misclassified as domain-domain contacts. The symmetry of the table indicates a high level of consistency in the stochastic find-self procedure. The maximal standard deviation for each cell in the table was 4.4 percentage points.
Note that the percentages should not be misread to mean 'retrieval of individual interfaces'; rather, they refer to the retrieval of contacts from pools of 1000 contacts. The absolute values of the percentages obviously depend on the size of the randomly sampled pool (here 1000).



 

All interfaces differed in residue composition from background and surface. We used the residue compositions of all proteins in SWISS-PROT as the background to compare the compositions between the six interface types. We found that the composition of all interface types differed substantially from the composition of SWISS-PROT (Fig 1; Table 2 S in the Supplement shows that this difference was highly significant.) Nonetheless, our results showed why some studies report a strong correlation between the propensities of residues in internal and external interactions [8, 11] . Most residues show similar trends in internal contacts and many other types of interfaces ( Fig. 1 B). Indeed, we observed very strong correlations ( eqn. 3 ) between the amino acid distributions in all interface types (r>0.8 in all pairwise comparisons) except for homo-complexes. Furthermore, all interface types were also highly correlated to the distribution of amino acids in SWISS-PROT (r>0.8). Nevertheless, the find-self procedure indicated substantial differences between these distributions, suggesting that the correlation coefficients were not sensitive enough for this comparison. All interfaces differed significantly from exposed residues ( Table 3 ). Interestingly, about 1.3% of all the internal residues were found not to be in any non-trivial contact by our definition of contact. Most of these were at the ends of chains. When we added these 'free' residues as a separate class, we found that they again differed significantly from all other classes.



Fig. 1: Amino acid composition of six interface types. The propensities of all residues found in SWISS-PROT were used as background. If the frequency of an amino acid is similar to its frequency in SWISS-PROT, the height of the bar is close to zero. Over-representation results in a positive bar, and under-representation in a negative bar. The amino acids are given by their one-letter code, sorted by biophysical features.
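The bars described in the caption can be illustrated by a short sketch. The exact formula behind Fig. 1 is not stated in this section; the code below assumes a base-2 logarithm of the ratio between the frequency of a residue in an interface and its background frequency, which is zero for equal frequencies, positive for over-representation and negative for under-representation. The function name log_propensity and the input layout are hypothetical.

```python
import math
from collections import Counter

def log_propensity(interface_residues, background_freq):
    """Log2 ratio of interface frequency to background frequency per residue.

    interface_residues: sequence of one-letter codes found in an interface.
    background_freq: dict mapping one-letter code to background frequency
                     (e.g. derived from SWISS-PROT; values are assumed > 0).
    """
    counts = Counter(interface_residues)
    total = len(interface_residues)
    scores = {}
    for aa, bg in background_freq.items():
        freq = counts[aa] / total
        # A residue never observed in the interface gets minus infinity.
        scores[aa] = math.log2(freq / bg) if freq > 0 else float("-inf")
    return scores
```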





 

>>>Table 3<<<

Similarities and differences in and to the literature for composition. Some studies report substantial differences between internal and external interfaces [17] , while others report a high similarity [11] . Xu et al. [17] conclude from the differences they found that internal contacts are facilitated by different mechanisms than external ones. The substantial differences we found between internal and external contacts support this view. Theoretical and experimental works attempted to identify the residues that play key roles in each type of interaction. A few groups found polar and charged residues as well as salt bridges to be the major contributors for the formation of interactions [18, 17, 10] . Other studies reported that salt bridges are not an important factor in protein-protein interaction [7] , or that interfaces favour non-polar residues [19] . The detailed separation of six types of interfaces explained these contradictions: while we identified some general trends, most of the residues showed different behaviours in different types of interfaces. In particular, we found no clear common denominator for charged residues: lysine was under-represented in all types of interfaces, while arginine was over-represented. Most large hydrophobic residues were favoured in all types of interactions (in particular, histidine, methionine, and tyrosine). In contrast, serine, alanine and glycine were under-represented. The other residues demonstrated different trends in different types of interfaces, yet, bio-physically similar residues, such as leucine and isoleucine, or aspartic acid and glutamic acid, usually showed similar trends, indicating the reliability of the data. Overall, the composition of homo-complexes was most exceptional in that it frequently differed from the trends of all other interface types. Jones and Thornton [5] compared the propensities of residues in homo- and hetero-multimer interfaces. 
They conclude that hydrophobic residues are often more abundant in homo- than in hetero-multimers. When grouping all homo- and all hetero-multimeric interfaces, we found a similar trend. However, when we separated permanent and transient interactions, this distinction disappeared. Jones et al. [13] reported significant differences between the compositions of domain-domain interfaces and of the protein cores. Their conclusion was based on a standard χ² test. To revisit this point we checked whether our six data sets of contacts differed in composition (a) from SWISS-PROT as a whole (Table 2 S), (b) from exposed residues (Table 3) and (c) from residues that do not form any contact (Table 3 S). Our results indicated that each of these biophysical categories is characterised by a unique residue composition. In particular, we confirmed the earlier results [5, 6] that domain-domain interfaces differed from both the background and from intra-domain interfaces.

Analysing 'hot spot' residues. Bogan & Thorn [12] reported 'hot spots' in binding energy for protein interfaces. These spots are reported to be abundant in tryptophan, tyrosine and arginine, and depleted of serine, threonine, leucine and valine. Chakrabarti & Janin [20] take an approach similar to that of Bogan & Thorn [12] , and differentiate between the core of the interface and its rim. They find tryptophan and tyrosine to be over-represented in the core, and leucine and valine under-represented. While arginine also appears abundant in the core, its propensity does not exceed the expected level on protein surfaces. Overall, our data confirmed these findings. However, when looking at the different types of interfaces, some exceptions were revealed. For example, tryptophan, which was extremely over-represented in most interface types, was underrepresented in homo-complexes. Leucine, valine, and threonine showed different trends in different types of interfaces. In contrast, leucine, isoleucine and valine were remarkably similar in their preferences for all interfaces.

Residue contact preferences differed statistically. The residue-residue preferences also differed remarkably between the six types of interfaces. When we repeated the find-self procedure for the set of preferences, the results were very similar to those obtained for composition (Table 4). Overall, the off-diagonal pattern was similar to the one for residue composition (Table 2). The contact preferences were slightly more similar between the intra-domain and the domain-domain interfaces than between domain-domain and hetero-obligomer interfaces. However, there was still a clear similarity between the latter two. For residue compositions, we noticed a trend to confuse homo-obligomers with hetero-complexes. For contact preferences, this trend was intensified: over 15% of the samples of homo-obligomer contacts were misidentified as hetero-complexes and vice versa. Grouping the two internal interface types into one class and the four external types into another, we found that the residue contact preferences of these two classes differed substantially.



Table 4: Significant difference between contact preferences of six interface types

                    internal                         external
                    intra-domain  domain-domain     homo-obligomer  homo-complex  hetero-obligomer  hetero-complex
intra-domain            63.0          29.5               1.7             -              4.7              1.1
domain-domain           21.8          61.2               0.9             -             12.6              3.5
homo-obligomer           1.5           1.6              79.6             -              1.9             15.4
homo-complex              -             -                 -           100.0              -                -
hetero-obligomer         4.2           9.6               0.9             -             81.6              3.7
hetero-complex           0.6           4.6              15.9             -              6.9             72.0

∆ Same procedure as in Table 2; however, the randomly chosen samples were now 1000 contact preferences rather than 1000 [...] <5 percentage points.



 

Homo-complexes depleted in salt bridges and rich in contacts between identical residues. Hydrophobic-hydrophilic interactions dominated intra-domain, domain-domain and hetero-complex interfaces (Fig. 2A, 2B, 2F; red squares indicate highly preferred interactions, blue squares highly unlikely interactions). Cysteine bridges were observed more often than expected for all interface types. Similarly, salt bridges were also common, with the exception of homo-complexes, for which they were observed less often than expected. Homo-complexes exhibited an extreme general preference for interactions between identical amino acids. Furthermore, the contact preferences of homo-complexes were overall the most distinct of all interface types.



Fig. 2: Residue-residue preferences. (A) Intra-domain, (B) domain-domain, (C) obligatory homo-oligomers (homo-obligomers), (D) transient homo-oligomers (homo-complexes), (E) obligatory hetero-oligomers (hetero-obligomers), and (F) transient hetero-oligomers (hetero-complexes). A red square indicates that the interaction occurs more than expected; a blue square that it occurs less than expected. The amino acids are ordered according to hydrophobicity [38] with isoleucine as the most and arginine as the least hydrophobic.





 

Similarities and differences in and to the literature for contact preferences. The assessment of residue-residue preferences under different circumstances may be crucial for successful structure prediction, as well as for protein threading and drug design. Previous attempts to determine residue-residue preferences have yielded many scales and matrices [21, 22, 18, 8, 23, 11]. Most of these matrices use the same set of preferences for all different types of interactions or focus on internal interfaces. However, a few recent studies demonstrated the success of including data from external interactions to compile mean-field potentials for improving docking [24, 25, 26]. Bahar, Jernigan and colleagues present matrices of contact energies reflecting the attraction or repulsion between each residue pair [18]. Those matrices are given in RT units and hence are not fully comparable with our results, which are based on log odds. Yet, some interesting similarities and differences are noticeable. Bahar et al. find a high preference for pairing between identical amino acids. This observation may appear rather odd, because an interaction between two identically charged residues appears energetically extremely unfavourable [27, 28]. Our results appear to explain this oddity: homo-obligomers were the only type of interaction for which we observed strong preferences for interactions between identical amino acids (Fig. 2C). This observation might be explained by the evolutionary advantage of favouring identical-residue pairs in contacts between identical chains: while the conservation of non-identical contacts requires two 'neutral/beneficial' point mutations, identical contacts need only one (Shoshana Wodak, Brussels, private communication). For the other interface types, we observed strong self-interaction preferences exclusively for cysteines, which are known to stabilise interactions by forming cysteine bridges (Fig. 2).
Confirming earlier findings, we found salt-bridges abundant in interfaces [29, 7, 10] . Based on amino acid composition, some studies hypothesise that hydrophobic interactions are more frequent in permanent than in transient interactions [5] . This hypothesis is confirmed by our residue-residue preference data, both for homo- and hetero-obligomers.
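A log-odds contact preference of the kind mentioned above can be sketched as follows. The normalisation used for Fig. 2 is not given in this section; the sketch assumes that the expected frequency of an unordered residue pair follows from random pairing of the residues observed at contact ends, so that a positive score means 'more than expected' and a negative score 'less than expected'. The function name contact_log_odds and the input layout are hypothetical.

```python
import math
from collections import Counter

def contact_log_odds(contacts):
    """Log2 odds of observed vs. expected frequency for each residue pair.

    contacts: iterable of (aa1, aa2) tuples of one-letter codes; pairs are
              treated as unordered, so ('K', 'E') and ('E', 'K') are pooled.
    """
    pair_counts = Counter(tuple(sorted(pair)) for pair in contacts)
    n_pairs = sum(pair_counts.values())

    # Marginal frequency of each residue over all contact ends.
    end_counts = Counter()
    for (a, b), n in pair_counts.items():
        end_counts[a] += n
        end_counts[b] += n
    n_ends = 2 * n_pairs

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / n_pairs
        expected = (end_counts[a] / n_ends) * (end_counts[b] / n_ends)
        if a != b:
            # An unordered pair of different residues can arise in two orders.
            expected *= 2
        scores[(a, b)] = math.log2(observed / expected)
    return scores
```

Only pairs that occur at least once receive a score in this sketch; pairs that are never observed would need a pseudocount or explicit handling.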

Can we predict interfaces from sequence? Usually, interfaces have been defined structurally, i.e. according to the topography of the interacting macromolecules (surface patches). Jones & Thornton attempted to predict external interfaces from protein structures [30, 13]. Ultimately, we pursue a different objective, namely to predict protein-protein interactions directly from sequence. Hence, we had to replace the concept of external surface patches by that of sequence-consecutive interface segments (note that the data for the explicit analysis of segments are not shown). The similarities of our results to those obtained by some of the groups that analysed interface patches may or may not indicate that the two concepts are not that different, after all. Our direct analysis of sequence composition revealed significant differences between the types of interfaces, and between all interfaces and the background distribution. The method we used to explore statistical significance conceptually resembled a prediction method: we used the sequence-composition entropy of a pooled sample to 'predict' its interface type. Obviously, the high level of success (63-100%) did not imply that we could predict individual interfaces at this level of precision. Nevertheless, our data may suggest the feasibility of such a prediction method. Even if this speculation turns out to be over-optimistic, our results may still have an important impact on methods that attempt to infer protein function from protein structure and/or attempt to predict aspects of protein structure and function.

 

Conclusions

Our study differed in four important ways from previous analyses. (1) We data-mined a set of interfaces from PDB that was, to our knowledge, far larger than the data sets analysed before. (2) This large data set enabled us to base our analysis on a more fine-grained distinction of interfaces than previously explored. In particular, we distinguished between two types of internal interactions (intra-domain, domain-domain) and between four types of external interactions (homo-obligomers, homo-complexes, hetero-obligomers, and hetero-complexes). (3) We analysed interface contacts rather than surface patches. (4) We established the statistical significance of differences through a rigorous procedure derived from information theory. These four novel components together yielded results that appeared to unambiguously establish that the six types of interfaces analysed differed in both their amino acid compositions and their residue-contact preferences. It was suggested in the past that there are many different types of interactions and that each of them is based on different biophysical mechanisms. The results of the find-self procedure may confirm this intuition. Thus, many readers may a posteriori consider our data as expected. However, it was encouraging that our algorithm for distinguishing between multi-chain proteins and complexes of different proteins generated very consistent results. The success of this automatic method might eventually become the aspect of our work that will influence prediction methods most, as it allows creating large data sets of high-resolution structures.

 

 

Methods

Generation of data set. Today's PDB [2, 3] is biased; such bias can seriously impact statistical analyses [31] . To reduce the bias, we compiled the largest possible non-redundant subset of PDB: No pair of proteins in that set had more than 25% identity over 100 aligned residues [32] . The non-redundant set included 1812 high-resolution structures. We excluded NMR structures, theoretical models, and chains shorter than 30 residues. 936 of these proteins (51%) had resolutions below 2Å, 74 proteins (4%) had resolutions above 3Å. Our results did not change qualitatively when restricting the analysis to structures with higher resolution. We included all 1812 for the data shown in order to guarantee better statistics.

Elimination of packing complexes. When parsing PDB files, it is very hard to determine whether a pair of chains is merely a packing multimer or genuinely a biologically functional multimer. The problem is intensified when attempting to automatically parse hundreds of PDB files. Two approaches have been suggested for coping with this problem. One is based on calculating the reduction of solvent accessibility due to oligomerisation [33] and the other is based on measuring the conservation of contacting residues [34]. We used the PQS server [33], which applies the first method, to eliminate PDB files that appear to be packing complexes rather than biologically functional multimers.

Analysing interface contacts rather than interface patches. Typically, internal interfaces are defined in terms of contacting residues. External interfaces, however, are most often defined according to geometrically continuous patches of residues on the surface of a protein that exclude solvent by binding to another chain. The difference in definitions hampers the comparison between these two types of interactions. Furthermore, patch analysis might include some residues that are not really involved in the interactions (i.e. do not form inter-chain residue-residue contacts). It might also exclude residues that play a key role in the interaction. We replaced the notion of patches by defining the interface in terms of contacting residues both for internal and for external interfaces. We defined a residue pair to be in contact if the distance between the closest of their respective atoms was ≤ 6 Å and their sequence separation was ≥ 3 residues. Note that this particular definition included contacts between beta-sheets, while it ignored the contacts responsible for sharp beta-turns. Note furthermore that the same residue may participate in different interfaces. The choice of the distance cut-off threshold that defines a contact is not straightforward. Previous studies used distances between 4 and 12 Å between two C-alpha or C-beta atoms. However, the variations in the sizes of side chains might result in an under-representation of large residues in the data, as their side chains themselves can extend over several Ångstrøms. Hence, we defined contacts based on the distance between the closest pair of atoms of any two residues. This definition is more permissive than the ones used in other studies, thus classifying more residues as being in contact. However, it is not biased towards amino acids of any size. Thus, rather than biasing the data towards some residues, our permissive definition merely introduces 'white' noise.
Using this definition we parsed the set of 1812 PDB files to obtain all the contacting residues. Once we obtained the list of all pairs of contacting residues in these PDB files, we classified them into six types using the methods described hereunder.
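The contact definition above (closest-atom distance ≤ 6 Å, sequence separation ≥ 3) can be illustrated with a minimal sketch for residues within one chain. The input layout, the function names and the toy coordinates are assumptions made for illustration only; they are not the parser used in the study, and the sketch assumes that the separation filter is only meaningful for residues of the same chain.

```python
import math
from itertools import combinations

# Hypothetical input layout: each residue is
# (sequence_index, residue_name, [(x, y, z), ...]) with coordinates in Å.

def min_atom_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between any atom of A and any atom of B."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)

def intra_chain_contacts(residues, max_dist=6.0, min_separation=3):
    """Return index pairs (i, j) of residues in contact within one chain."""
    contacts = []
    for (i, _, atoms_i), (j, _, atoms_j) in combinations(residues, 2):
        if abs(i - j) < min_separation:
            continue  # too close in sequence to count as a contact
        if min_atom_distance(atoms_i, atoms_j) <= max_dist:
            contacts.append((i, j))
    return contacts

# Toy chain with made-up coordinates (not taken from any PDB entry).
chain = [
    (1, "ALA", [(0.0, 0.0, 0.0)]),
    (2, "GLY", [(3.8, 0.0, 0.0)]),
    (3, "SER", [(7.6, 0.0, 0.0)]),
    (4, "LEU", [(4.0, 3.0, 0.0), (2.0, 3.0, 0.0)]),
    (5, "LYS", [(20.0, 0.0, 0.0)]),
]
print(intra_chain_contacts(chain))
```

Because the closest pair of atoms is used rather than C-alpha or C-beta positions, a large side chain reaching towards a neighbour counts as a contact even when the backbones are far apart.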

Homo- vs. hetero-multimers. Using simple sequence comparisons we differentiated between homo- and hetero-multimers. Interactions between chains with more than 10% difference in sequence were defined as hetero-multimers. All other interactions were classified as homo-multimers, while interactions between .

Expert-driven automatic distinction between hetero-obligomers and hetero-complexes. A multimer can be permanent, i.e. all the functions of each of the chains can be carried out only in this multimeric state. Alternatively, it can be transient, i.e. one or more of the chains can also be functional in different contexts and not only in this particular multimer. Several studies have hypothesised that these two different types of interactions are based on different residue-residue contacts. However, it is hard to determine from the PDB file of a multimer which is the case. We introduced the following simple idea to achieve such a distinction automatically. SWISS-PROT files describe the sequence of a protein as it was studied in the lab. If the protein is studied in its multimeric state, then the sequences of all the chains will be submitted to SWISS-PROT in a single file. We hypothesised that experts typically add a new entry to the database if they study one of the chains by itself, that is, if there is experimental evidence identifying this chain as a separate functional protein. If true, we only have to map all chains in our data set to SWISS-PROT [4] and label chains by their respective SWISS-PROT identifiers. If non-identical chains from one PDB file appear in the same SWISS-PROT file, this indicates that there is no known situation in which they function separately in the cell. Hence, if we found two or more chains in the same SWISS-PROT file, we assumed that their association is obligatory (hetero-obligomer); otherwise we assumed that their association is transient (hetero-complex). Following this logic we divided our data set of hetero-multimers into two subsets. Contacts between chains in a PDB file that appear in the same SWISS-PROT file were classified as permanent interactions, or hetero-obligomers. Contacts between chains that appear in different SWISS-PROT files were classified as transient interactions, or hetero-complexes.
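The labelling rule described above can be sketched in a few lines. The sketch assumes that the mapping from PDB chains to SWISS-PROT entries has already been done; the function name and the entry identifiers are hypothetical placeholders, not real SWISS-PROT identifiers.

```python
from itertools import combinations

def classify_hetero_pairs(chain_to_entry):
    """Label each pair of non-identical chains from one PDB file.

    chain_to_entry maps a chain ID to the SWISS-PROT entry the chain was
    mapped to (placeholder identifiers; the mapping step is not shown).
    """
    labels = {}
    for a, b in combinations(sorted(chain_to_entry), 2):
        same_entry = chain_to_entry[a] == chain_to_entry[b]
        labels[(a, b)] = "hetero-obligomer" if same_entry else "hetero-complex"
    return labels

# Chains A and B share one entry; chain C has its own entry.
example = {"A": "ENTRY_1", "B": "ENTRY_1", "C": "ENTRY_2"}
print(classify_hetero_pairs(example))
```

All residue-residue contacts between two chains would then inherit the label of that chain pair.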

Database-driven distinction between homo-obligomers and homo-complexes. The same problem of differentiating permanent interactions from transient ones also exists for homo-multimers. However, the cross-reference to SWISS-PROT is not applicable to homo-multimers. To distinguish between homo-multimers that are obligatory and those that are transient, we used DIP, the Database of Interacting Proteins [1]. We used DIP to detect those among our homo-multimers that, according to DIP, appear as functional monomers in the cell. As we obtained our data from PDB and PQS, we can assume that all the homo-multimers in our data set appear in the cell as functional multimers. Therefore, those among them that are annotated by DIP to appear also as monomers should be classified as non-obligatory homo-multimers, or homo-complexes. All homomers that were not annotated as monomers in DIP were then classified as obligatory homomers, or homo-obligomers. Thus, we classified all the homo-multimer residue-residue contacts in our data sets as either homo-complexes or homo-obligomers.
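In code, this rule reduces to a set-membership test. The sketch below assumes that a list of proteins annotated as functional monomers has already been extracted from DIP; the function name and the protein names are hypothetical, and no DIP file format or query interface is implied.

```python
def classify_homo_multimers(homo_multimers, dip_monomers):
    """Label each homo-multimer by whether DIP also lists it as a monomer."""
    monomers = set(dip_monomers)
    return {
        name: "homo-complex" if name in monomers else "homo-obligomer"
        for name in homo_multimers
    }

# Placeholder names standing in for proteins from the PDB/PQS-derived set.
homo_set = ["protein_1", "protein_2", "protein_3"]
annotated_as_monomer = ["protein_2"]
print(classify_homo_multimers(homo_set, annotated_as_monomer))
```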

Establishing statistical significance by a find-self procedure based on Jensen-Shannon divergence. Most researchers are aware that standard significance tests are problematic when the data sets are too small [16]. Another problem with these tests, which is not commonly noted, is that applying significance tests to very large data sets with few degrees of freedom, like the ones analysed here, can also lead to severe inferential mistakes (Ofran, 2002; Royall, 1986 [16]). Therefore, we introduced a simple information theory-based procedure (Fig. 3) in order to answer the question whether or not the amino acid compositions and contact preferences differed significantly between the six groups of interfaces. We considered two groups to differ if we could correctly sort the interfaces into their respective groups using only their amino acid composition. Conceptually, the procedure resembles boot-strapping techniques [35]. Technically, it measured the Jensen-Shannon (JS) divergence [36] between random samples from each data set. For a pair of distributions p1 and p2 with prior probabilities (weights) π1 and π2, this measure is defined as:

$JS(p_1, p_2) = H(\pi_1 p_1 + \pi_2 p_2) - \pi_1 H(p_1) - \pi_2 H(p_2)$
with: $\pi_1, \pi_2 \ge 0$ and $\pi_1 + \pi_2 = 1$
and: $H(p) = -\sum_i p(x_i) \log_2 p(x_i)$
(Eq. 1)

where $\pi_1$ and $\pi_2$ are the weights of the two probability distributions $p_1$ and $p_2$, respectively, and $H(p)$ is the Shannon entropy [37].
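As an illustration of Eq. 1, the following minimal sketch computes the JS divergence between two discrete distributions given as lists of probabilities. The function names are illustrative, and the default of equal weights (0.5 each) is an assumption of this sketch; the text does not state which weights were used.

```python
import math

def shannon_entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(p1, p2, w1=0.5, w2=0.5):
    """Jensen-Shannon divergence of p1 and p2 with weights w1 + w2 = 1."""
    mixture = [w1 * a + w2 * b for a, b in zip(p1, p2)]
    return (shannon_entropy(mixture)
            - w1 * shannon_entropy(p1)
            - w2 * shannon_entropy(p2))

print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # non-overlapping distributions
```

With equal weights and base-2 logarithms the measure is bounded between 0 (identical distributions) and 1 (distributions with no overlap), which makes values comparable across pairs of samples.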

The following procedure measured how often interfaces from one group were most similar in their amino acid composition to interfaces from the same or any other group:

  1. Pick a random sample P with 1000 residues from one of the six sets of residue-residue contacts.
  2. Pick six random samples Q1-Q6, one from each of the six sets, each with 1000 residues.
  3. Find the sample Qp among Q1-Q6 that is least divergent from P by measuring the JS divergence between P and each of Q1-Q6.
  4. Record the types of interactions from which P and Qp were sampled.

We repeated this procedure 6000 times (1000 for each of the six types of interactions). If the residue composition differed significantly between the types, we expect Qp to be, in most cases, the sample drawn from the same population as P. That is, we expect that in most cases, P and the sample most similar to P were sampled from contacts of the same interface type.
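The four steps and their repetition can be sketched as follows. This is a toy illustration under stated assumptions: the pools contain made-up compositions, only two types are used instead of six, sampling is done with replacement, and the JS divergence uses equal weights. None of these choices is specified in the text, and all names are hypothetical.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(residues):
    """Relative frequency of each of the 20 amino acids in a sample."""
    counts = Counter(residues)
    return [counts[aa] / len(residues) for aa in AMINO_ACIDS]

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(p, q):
    """JS divergence with equal weights (an assumption of this sketch)."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(mix) - entropy(p) / 2 - entropy(q) / 2

def find_self(pools, trials=1000, size=1000, seed=0):
    """For each source type, count which type's sample was closest to P."""
    rng = random.Random(seed)
    hits = {name: Counter() for name in pools}
    for source in pools:
        for _ in range(trials):
            # Step 1: sample P from the source type.
            p = composition(rng.choices(pools[source], k=size))
            # Steps 2-3: sample one Q per type and measure divergence from P.
            divergences = {
                name: js_divergence(p, composition(rng.choices(pool, k=size)))
                for name, pool in pools.items()
            }
            # Step 4: record the type of the least divergent sample.
            hits[source][min(divergences, key=divergences.get)] += 1
    return hits

# Toy pools of one-letter residue codes with made-up compositions.
pools = {
    "type_a": list("A" * 900 + "L" * 100),
    "type_b": list("K" * 900 + "E" * 100),
}
result = find_self(pools, trials=20, size=200)
print({source: dict(counts) for source, counts in result.items()})
```

The fraction of trials in which a type "finds itself" (the diagonal of the resulting table) is the quantity of interest; values well above 1/6 for six types would indicate a distinguishable composition.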




Fig. 3: Sketch of find-self procedure. We sampled 1000 contacts from one data set (P, here hetero-complexes) and then another 1000 from each of the six interface types (Q1-Q6). Then, we measured the divergence between the amino acid composition of P and each of the samples Q1-Q6. If the data set from which P was sampled has a unique and distinguishable composition, we expect the sample most similar to P to be Q6 (here, the sample drawn from hetero-complexes). This process was repeated 1000 times for each type of interface (6000 total).


 




 

Measuring residue-residue preferences. After we had established that the amino acid composition differed between the six interface types, we used the six lists to compute the likelihood of forming contacts between each pair of amino acids. In particular, we compiled the log odds ratio of the observed frequency of the pair over its expected frequency:

Lx(i,j)=log2(P(i,j)/(P(i)*P(j))) (Eq. 2)

where the subscript x represents one of the six types of interfaces (intra-domain, domain-domain, homo-obligomers, homo-complex, hetero-obligomers and hetero-complex), i and j are types of amino acids, P(i,j) is the probability of a contact between amino acids of type i and j in interfaces of type x, and P(i) and P(j) are the probability of occurrence for amino acids i and j respectively in the interfaces of type x. Hence, the denominator described the probability of a contact between i and j if the formation of contacts between i and j were random. Based on this equation, we generated six matrices for the likelihood for all possible contacts in each interface type.
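A minimal sketch of Eq. 2 for one interface type is given below. It assumes that each contact is counted in both directions, so that the matrix is symmetric, and that P(i) is estimated as the marginal frequency of amino acid i among all contact partners. Both choices, as well as the function name and the toy contact list, are assumptions of this sketch rather than details given in the text.

```python
import math
from collections import Counter

def log_odds(contacts):
    """Log-odds score (base 2) for every observed amino acid pair.

    contacts: list of (aa_i, aa_j) tuples for one interface type.
    """
    pair_counts = Counter()
    for a, b in contacts:
        pair_counts[(a, b)] += 1  # count each contact in both directions
        pair_counts[(b, a)] += 1
    total = sum(pair_counts.values())

    # Marginal frequency of each amino acid among contact partners.
    residue_counts = Counter()
    for (a, _), n in pair_counts.items():
        residue_counts[a] += n

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total
        expected = (residue_counts[a] / total) * (residue_counts[b] / total)
        scores[(a, b)] = math.log2(observed / expected)
    return scores

# Toy contact list (made-up data).
toy_contacts = [("K", "E"), ("K", "E"), ("L", "L"), ("L", "L")]
scores = log_odds(toy_contacts)
print(scores[("K", "E")], scores[("L", "L")])
```

Positive scores mark pairs observed more often than expected from composition alone; pairs that never occur have no finite score and are simply absent from the result in this sketch.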

Standard correlation. We applied the following standard correlation coefficient to compare our results to the literature:

(Eq. 3)

where x and y are two data sets (e.g. internal residues and SWISS-PROT), xi and yi are the propensities of amino acid i, and <x>, <y> denote the mean over all 20 amino acids.
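The formula for Eq. 3 is not reproduced in this version of the text. Assuming that "standard correlation coefficient" refers to the Pearson coefficient, which is consistent with the symbols defined above, it would take the following form (the symbol r is introduced here for illustration only):

```latex
r(x, y) = \frac{\sum_{i=1}^{20} \left(x_i - \langle x \rangle\right)\left(y_i - \langle y \rangle\right)}
               {\sqrt{\sum_{i=1}^{20} \left(x_i - \langle x \rangle\right)^2 \; \sum_{i=1}^{20} \left(y_i - \langle y \rangle\right)^2}}
```

Under this assumption, r = 1 indicates that the propensities of the two data sets rise and fall together across the 20 amino acids, and r near 0 indicates no linear relationship.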

 

 


Acknowledgements

Thanks to Lukasz Salwinski (UCLA) and Ioannis Xenarios (UCLA, Lausanne) for their help in obtaining homo-complexes from DIP; thanks to Jinfeng Liu (Columbia) for computer assistance and Henry Bigelow (Columbia) for invaluable comments on the manuscript. We are also grateful for the invaluable comments from two unknown referees, from Shoshana Wodak (Brussels), and from Barry Honig (Columbia). The work of YO and BR was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health. Last, not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases, in particular to Phil Bourne (UCSD), Amos Bairoch (Geneva), Rolf Apweiler (EBI) and their teams.

 

 

References

1. Xenarios, I., Rice, D. W., Salwinski, L., Baron, M. K., Marcotte, E. M. et al. (2000). DIP: the database of interacting proteins. Nucl. Acids Res., 28, 289-91.
2. Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. F., Jr., Brice, M. D. et al. (1977). The Protein Data Bank. A computer-based archival file for macromolecular structures. Eur. J. Biochem., 80, 319-24.
3. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucl. Acids Res., 28, 235-242.
4. Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res., 28, 45-8.
5. Jones, S. & Thornton, J. M. (1996). Principles of protein-protein interactions. Proc Natl Acad Sci U S A, 93, 13-20.
6. Jones, S. & Thornton, J. M. (1997). Analysis of protein-protein interaction sites using surface patches. J. Mol. Biol., 272, 121-132.
7. McCoy, A. J., Chandana Epa, V. & Colman, P. M. (1997). Electrostatic complementarity at protein/protein interfaces. J. Mol. Biol., 268, 570-84.
8. Keskin, O., Bahar, I., Badretdinov, A. Y., Ptitsyn, O. B. & Jernigan, R. L. (1998). Empirical solvent-mediated potentials hold for both intra-molecular and inter-molecular inter-residue interactions. Prot. Sci., 7, 2578-86.
9. Lo Conte, L., Chothia, C. & Janin, J. (1999). The atomic structure of protein-protein recognition sites. J. Mol. Biol., 285, 2177-98.
10. Sheinerman, F. B., Norel, R. & Honig, B. (2000). Electrostatic aspects of protein-protein interactions. Curr Opin Struct Biol, 10, 153-9.
11. Glaser, F., Steinberg, D. M., Vakser, I. A. & Ben-Tal, N. (2001). Residue frequencies and pairing preferences at protein-protein interfaces. Proteins, 43, 89-102.
12. Bogan, A. A. & Thorn, K. S. (1998). Anatomy of hot spots in protein interfaces. J. Mol. Biol., 280, 1-9.
13. Jones, S., Marin, A. & Thornton, J. M. (2000). Protein domain interfaces: characterization and comparison with oligomeric protein interfaces. Prot. Engin., 13, 77-82.
14. Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S. et al. (2000). A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623-627.
15. Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. et al. (2001). A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. U.S.A., 98, 4569-4574.
16. Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. The American Statistician, 40, 313-315.
17. Xu, D., Lin, S. L. & Nussinov, R. (1997). Protein binding versus protein folding: the role of hydrophilic bridges in protein associations. J. Mol. Biol., 265, 68-84.
18. Bahar, I. & Jernigan, R. L. (1997). Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J. Mol. Biol., 266, 195-214.
19. Zhou, H. X. & Shan, Y. (2001). Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins, 44, 336-43.
20. Chakrabarti, P. & Janin, J. (2002). Dissecting protein-protein recognition sites. Proteins, 47, 334-43.
21. Hendlich, M., Lackner, P., Weitckus, S., Flöckner, H., Froschauer, R. et al. (1990). Identification of Native Protein Folds Amongst a Large Number of Incorrect Models. The Calculation of Low Energy Conformations from Potentials of Mean Force. J. Mol. Biol., 216, 167-180.
22. Sippl, M. J. (1995). Knowledge-based potentials for proteins. Curr. Opin. Str. Biol., 5, 229-235.
23. Prlic, A., Domingues, F. S. & Sippl, M. J. (2000). Structure-derived substitution matrices for alignment of distantly related sequences. Prot. Engin., 13, 545-550.
24. Moont, G., Gabb, H. A. & Sternberg, M. J. (1999). Use of pair potentials across protein interfaces in screening predicted docked complexes. Proteins, 35, 364-373.
25. Aloy, P., Querol, E., Aviles, F. X. & Sternberg, M. J. (2001). Automated structure-based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J. Mol. Biol., 311, 395-408.
26. Aloy, P. & Russell, R. B. (2002). Interrogating protein interaction networks through structural biology. Proc. Natl. Acad. Sci. U.S.A., 99, 5896-5901.
27. Schueler, O. & Margalit, H. (1995). Conservation of salt bridges in protein families. J. Mol. Biol., 248, 125-135.
28. Polticelli, F., Ascenzi, P., Bolognesi, M. & Honig, B. (1999). Structural determinants of trypsin affinity and specificity for cationic inhibitors. Prot. Sci., 8, 2621-2629.
29. Honig, B. & Nicholls, A. (1995). Classical electrostatics in biology and chemistry. Science, 268, 1144-1149.
30. Jones, S. & Thornton, J. M. (1997). Prediction of protein-protein interaction sites using patch analysis. J. Mol. Biol., 272, 133-143.
31. Rost, B. (2002). Enzyme function less conserved than anticipated. J. Mol. Biol., 318, 595-608.
32. Rost, B. (1999). Twilight zone of protein sequence alignments. Prot. Engin., 12, 85-94.
33. Henrick, K. & Thornton, J. M. (1998). PQS: a protein quaternary structure file server. TIBS, 23, 358-61.
34. Elcock, A. H. & McCammon, J. A. (2001). Identification of protein oligomerization states by analysis of interface conservation. Proc Natl Acad Sci U S A, 98, 2990-4.
35. Efron, B., Halloran, E. & Holmes, S. (1996). Bootstrap confidence levels for phylogenetic trees. Proc. Natl. Acad. Sci. U.S.A., 93, 13429-13434.
36. Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37, 145-151.
37. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Tech. J., 27, 379-423/623-656.
38. Kyte, J. & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157, 105-32.
39. Yang, A. S. & Honig, B. (2000). An integrated approach to the analysis and modeling of protein sequences and structures. I. Protein structural alignment and a quantitative measure for protein structural distance. J. Mol. Biol., 301, 665-78.
40. Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-637.
41. Rost, B. & Sander, C. (1994). Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19, 55-72.

Contact:    rost@columbia.edu Version:    Oct 29, 2002