Supporting online material

for:
SNAP: predictions for functional effects of non-synonymous polymorphisms

Yana Bromberg & Burkhard Rost

 

Table of Contents for Supporting Online Material

Table 1: Data set summary

Table 2: Full set of evaluated features

Table 3: Other measures of performance

Table 4: PMD/EC data set predictions made by SNAPannotated

Figure 1: Choosing features for inputs to networks

Figure 2: Differential performance for exposed and buried SNPs

References

Short description of Supporting Online Material

This material comprises four tables and two figures. Table 1 reports the numbers of samples in the sets used to develop SNAP (classified by solvent accessibility, source of data, and functional effect). Table 2 gives a full description of the features evaluated in developing SNAP. Table 3 reports measures of method performance other than accuracy (Matthews correlation and mutual information). Table 4 is a confusion matrix of SNAPannotated predictions. Figure 1 depicts the contribution of the various inputs to the performance of neural networks on data sets of differing solvent accessibility. Figure 2 illustrates the difference in SNAP prediction performance across those data sets.

 


Material

Table 1: Data set summary.

                    Proteins   PMD* Non-Neutral   PMD Neutral   EC* Neutral   Total Neutral   Total
Before Split        6821       40641              14334         26840         41174           81815
After Split**       6413       39987              14148         26682         40830           80817
Buried Set          5144       19741              5151          9649          14800           34541
Intermediate Set    4841       12285              4362          8711          13073           25358
Exposed Set         4150       7961               4635          8322          12957           20918

* PMD denotes mutants extracted from the PMD database [12, 13]; EC denotes neutral-only pseudo-entries obtained by comparing sequences with the same EC number [14].

** The data set shrank slightly after the ten-fold split for cross-validation.
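The column relationships in Table 1 can be checked with a few lines of arithmetic. The sketch below is only a reader-side consistency check, not part of the SNAP pipeline. The names `TABLE_1`, `row_is_consistent`, and `subsets_cover_split` are hypothetical. The protein column is left out of the subset check because it does not add up across the three accessibility sets, presumably because one protein can contribute variants to several sets.

```python
# Rows of Table 1, in column order:
# (proteins, PMD non-neutral, PMD neutral, EC neutral, total neutral, total)
TABLE_1 = {
    "Before Split":     (6821, 40641, 14334, 26840, 41174, 81815),
    "After Split":      (6413, 39987, 14148, 26682, 40830, 80817),
    "Buried Set":       (5144, 19741, 5151, 9649, 14800, 34541),
    "Intermediate Set": (4841, 12285, 4362, 8711, 13073, 25358),
    "Exposed Set":      (4150, 7961, 4635, 8322, 12957, 20918),
}


def row_is_consistent(row):
    """Check the two sums implied by the column headers of one row."""
    _, pmd_non_neutral, pmd_neutral, ec_neutral, total_neutral, total = row
    return (pmd_neutral + ec_neutral == total_neutral
            and pmd_non_neutral + total_neutral == total)


def subsets_cover_split(table):
    """Check that the three accessibility sets add up to the 'After Split' row.

    The protein column (index 0) is skipped on purpose: it is not additive.
    """
    parts = [table[name] for name in ("Buried Set", "Intermediate Set", "Exposed Set")]
    summed = tuple(sum(column) for column in zip(*parts))
    return summed[1:] == table["After Split"][1:]


if __name__ == "__main__":
    print(all(row_is_consistent(row) for row in TABLE_1.values()))
    print(subsets_cover_split(TABLE_1))
```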


Table 2: Full set of evaluated features.

Feature                   Description
Bio-Chemical              Difference in: mass, hydrophobic class (phobic/neutral/philic), charge (positive/neutral/negative), C-beta branching (present/absent), surface (tiny/small/regular), presence of buried charge, and proline in a helix
Sequence                  A vector of 21 nodes, in which a single node, different for each residue, is on
PSI-BLAST Profiles [1]    A vector of scores and frequencies from the PSI-BLAST PSSM
                          A vector of frequencies of each residue at the position in the alignment
                          Absolute difference (and direction) in score and frequency values at the mutant position between wt and mutant residues
                          A vector of 0's and the frequency of the given residue at the position in the alignment
1D rsa [2]                Relative accessibility and reliability of the accessibility prediction
1D ss [2]                 3-node output of the secondary structure predictor and reliability of the prediction
                          A vector of three nodes, each turned on one at a time to represent the helix/loop/extended strand states (plus reliability of the prediction)
Sequence ss & sa [2]      Difference in raw output of helix/loop/strand and accessibility values in predictions from sequence made for mutant and wt residues
Pfam [3]                  Presence of a domain affected by the mutation and the e-value of the hit
                          Above information, an indication of agreement with a consensus sequence, and the BLOSUM62 value of the likelihood of a match of the original residue to the residue in the consensus
                          Above information, the BLOSUM62 value of the likelihood of a match of the mutant residue to the residue in the consensus, proximity of the start and end of the domain within 10 residues of the mutant, and presence of the mutant position in other domains
PSIC Profile [4]          The difference of scores of the original and mutant residues, subjected to a cutoff measure splitting the space into 3 segments
                          PSIC score (and sign) for each residue, absolute difference between the scores of the wt and mutant, and number of sequences aligned
                          PSIC score (and sign) for the mutant, difference as described above, and number of sequences aligned
Flexibility               Raw output of the B-value predictor network
Transition                Vector of 3 values, each showing the likelihood of a stretch of residues where the residue concerned is 1st, 2nd, or 3rd (sliding window of 3)
SWISS-PROT [5]            A vector of 6 nodes, where 5 are class annotations (Methods) and the 6th is the PHAT value of the substitution if the mutant is in a trans-membrane region
PolyPhen [6]              2-node prediction vector and 3-node "source of evidence" vector
SIFT [7]                  SIFT score, final prediction, and number of sequences at the position
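The "Sequence" feature above (a vector of 21 nodes with exactly one node on) is a one-hot encoding. A minimal sketch follows. The residue ordering and the role of the 21st node are assumptions, since Table 2 does not specify them: here the first 20 nodes are the standard amino acids in alphabetical one-letter order, and the last node catches anything else. `AMINO_ACIDS` and `encode_residue` are hypothetical names.

```python
# Assumption: alphabetical one-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VECTOR_LENGTH = 21  # 20 standard residues plus one catch-all node (assumed)


def encode_residue(residue):
    """Return a 21-element list with exactly one node set to 1."""
    vector = [0] * VECTOR_LENGTH
    # Only single letters can match a standard residue.
    index = AMINO_ACIDS.find(residue.upper()) if len(residue) == 1 else -1
    # Anything that is not a standard residue turns on the last node.
    vector[index if index >= 0 else VECTOR_LENGTH - 1] = 1
    return vector


if __name__ == "__main__":
    print(encode_residue("A"))
    print(encode_residue("X"))
```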


Table 3: Other measures of performance

                 Matthews correlation*   Mutual Information**
SIFT [7]         0.49                    0.18
PolyPhen [6]     0.50                    0.19
SNAP             0.56                    0.24
SNAPannotated    0.58                    0.26

* Matthews correlation [8] was computed using the standard formula:

$$ MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} $$

where TP, FP, TN, and FN are the numbers of true positives (correctly predicted non-neutral samples), false positives (neutral samples predicted to be non-neutral), true negatives (correctly predicted neutral samples), and false negatives (non-neutral samples predicted to be neutral), respectively.

** Normalized Mutual Information (MI) [9, 10] was computed as the Kullback-Leibler (KL) divergence [11] between the joint distribution and the product of the individual distributions:

$$ MI = \sum_{i,j} P(i,j) \log \frac{P(i,j)}{P_{observation}(i) \, P_{prediction}(j)} $$

normalized to the 0<MI<1 range by dividing by the entropy value (H) as follows:

$$ MI_{normalized} = \frac{MI}{H} $$

where i represents the observed class and j represents the predicted class. Thus P(i,j) is the probability that a sample observed in class i is predicted to be in class j (#_samples i predicted as j / #_total samples). Similarly, P_observation(i) is the probability of observation in class i (#_samples i / #_total samples), and P_prediction(j) is the probability of prediction in class j (#_predictions j / #_total predictions).
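The mutual information described in this footnote can be sketched in code. The text does not say which entropy H is used for the normalization. As an assumption, the sketch divides by the entropy of the observed class distribution. The function name `normalized_mutual_information` is hypothetical. The logarithm base cancels in the ratio as long as MI and H use the same base.

```python
import math


def normalized_mutual_information(counts):
    """counts[i][j] = number of samples observed in class i and predicted as class j.

    Assumption: H is the entropy of the observed class distribution.
    """
    total = sum(sum(row) for row in counts)
    n_obs, n_pred = len(counts), len(counts[0])
    p_obs = [sum(counts[i]) / total for i in range(n_obs)]
    p_pred = [sum(counts[i][j] for i in range(n_obs)) / total for j in range(n_pred)]

    # KL divergence of the joint distribution from the product of the marginals.
    mi = 0.0
    for i in range(n_obs):
        for j in range(n_pred):
            if counts[i][j] == 0:
                continue  # a zero-probability cell contributes nothing
            p_ij = counts[i][j] / total
            mi += p_ij * math.log2(p_ij / (p_obs[i] * p_pred[j]))

    h_obs = -sum(p * math.log2(p) for p in p_obs if p > 0)
    return mi / h_obs


if __name__ == "__main__":
    # Counts from Table 4; rows are observed (non-neutral, neutral),
    # columns are predicted (non-neutral, neutral).
    table_4 = [[33291, 6696], [10338, 30492]]
    print(round(normalized_mutual_information(table_4), 2))
```

Under this assumption, the SNAPannotated counts of Table 4 give a value that rounds to the 0.26 listed in Table 3.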


Table 4: PMD/EC data set predictions made by SNAPannotated

                         Non-Neutral observed   Neutral observed
Non-Neutral predicted    TP = 33291             FP = 10338
Neutral predicted        FN = 6696              TN = 30492

PMD/EC denotes the data set of mutants extracted from the PMD database and from enzyme (EC) data (Methods; Table 1).
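The Matthews correlation defined under Table 3 can be recomputed directly from this confusion matrix. The sketch below uses the standard formula. The function name `matthews_correlation` and the zero-denominator convention are illustrative choices, not taken from the source.

```python
import math


def matthews_correlation(tp, fp, tn, fn):
    """Standard Matthews correlation coefficient for a two-state prediction."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention chosen here for degenerate matrices
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Counts from Table 4 (SNAPannotated on the PMD/EC data set).
    print(round(matthews_correlation(tp=33291, fp=10338, tn=30492, fn=6696), 2))
```

With the Table 4 counts, the result rounds to the 0.58 reported for SNAPannotated in Table 3.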


Fig. 1

 

Fig. 1: Choosing features for inputs to networks. All available features are considered for each set. At each subsequent step, the feature that yields the highest overall two-state accuracy (Eqn. 2 in Methods) is added to the input. Once one representation of a particular feature has been added to the input, the other representations of that feature are no longer considered in the following steps. For a more detailed explanation of the features and the selection process see Methods.
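The selection procedure in this caption is a greedy forward selection. The sketch below illustrates only the control flow, under stated assumptions. `evaluate` stands in for training a network and measuring its overall two-state accuracy. The stopping parameter `max_steps` and all names are hypothetical, since the caption gives no stopping rule. The toy scores in the usage example are invented for illustration only.

```python
def greedy_forward_selection(candidates, evaluate, max_steps=None):
    """Greedily add one feature representation per step.

    candidates: dict mapping feature name -> list of alternative representations
    evaluate:   callable taking a list of representations, returning an accuracy
    Returns a list of (representation, accuracy) tuples in order of selection.
    """
    remaining = dict(candidates)
    selected, history = [], []
    while remaining and (max_steps is None or len(selected) < max_steps):
        best = None
        for feature, representations in remaining.items():
            for representation in representations:
                score = evaluate(selected + [representation])
                if best is None or score > best[0]:
                    best = (score, feature, representation)
        score, feature, representation = best
        selected.append(representation)
        history.append((representation, score))
        # The feature's other representations are dropped from later steps.
        del remaining[feature]
    return history


if __name__ == "__main__":
    # Toy stand-in for network accuracy: additive, made-up contributions.
    toy_gain = {"psic_a": 0.3, "psic_b": 0.2, "pfam_a": 0.1, "sequence": 0.25}
    toy_candidates = {"PSIC": ["psic_a", "psic_b"], "Pfam": ["pfam_a"], "Sequence": ["sequence"]}
    steps = greedy_forward_selection(toy_candidates, lambda reps: sum(toy_gain[r] for r in reps))
    print([name for name, _ in steps])
```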


Fig. 2

 

Fig. 2: Differential performance for exposed and buried SNPs. The PMD/EC data set was split according to the level of solvent accessibility predicted by PROFacc for the nsSNP: buried (red lines; relative solvent accessibility RSA<9%), intermediate (green lines; 9%<=RSA<36%), and exposed (blue lines; RSA>36%). The x-axis shows the accuracy of predictions and the y-axis the coverage (Eqn. 3 and 4). The non-neutral curves (full lines) clearly indicate better performance on more buried residues (red curve further toward the top right than green, and green further than blue). In contrast, neutral effects (dashed lines) were predicted slightly better for exposed than for buried residues.
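The three accessibility classes used for this split can be expressed as a small threshold function. This is a sketch of the stated cutoffs only. The function name `accessibility_class` is hypothetical. The caption does not assign a value of exactly 36%, so the sketch assumes it counts as exposed.

```python
BURIED_CUTOFF = 9.0    # percent relative solvent accessibility
EXPOSED_CUTOFF = 36.0  # percent relative solvent accessibility


def accessibility_class(rsa_percent):
    """Map a predicted relative solvent accessibility (in %) to a class label."""
    if rsa_percent < BURIED_CUTOFF:
        return "buried"
    if rsa_percent < EXPOSED_CUTOFF:
        return "intermediate"
    # Assumption: RSA of exactly 36% is treated as exposed.
    return "exposed"


if __name__ == "__main__":
    for rsa in (5, 9, 35.9, 50):
        print(rsa, accessibility_class(rsa))
```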

 

 


References for Supporting Online Material

 

 

1.         Altschul, S.F., et al., Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Research, 1997. 25(17): p. 3389-3402.

2.         Rost, B., How to use protein 1D structure predicted by PROFphd, in The Proteomics Protocols Handbook, J.E. Walker, Editor. 2005, Humana: Totowa NJ. p. 875-901.

3.         Bateman, A., et al., The Pfam Protein Families Database. Nucleic Acids Research, 2004. 32(Database Issue): p. D138-D141.

4.         Sunyaev, S.R., et al., PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. Protein Engineering, 1999. 12(5): p. 387-394.

5.         Boeckmann, B., et al., The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res, 2003. 31(1): p. 365-70.

6.         Ramensky, V., P. Bork, and S.R. Sunyaev, Human non-synonymous SNPs: server and survey. Nucleic Acids Research, 2002. 30(17): p. 3894-3900.

7.         Ng, P.C. and S. Henikoff, SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Research, 2003. 31: p. 3812-3814.

8.         Matthews, B.W., Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta, 1975. 405(2): p. 442-51.

9.         Rost, B. and C. Sander, Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, 1993. 232: p. 584-599.

10.       Rost, B., C. Sander, and R. Schneider, Redefining the goals of protein secondary structure prediction. J Mol Biol, 1994. 235(1): p. 13-26.

11.       Kullback, S. and R.A. Leibler, On information and sufficiency. Annals of Mathematical Statistics, 1951. 22: p. 79-86.

12.       Nishikawa, K., et al., Constructing a protein mutant database. Protein Engineering, 1994. 7(5): p. 773.

13.       Kawabata, T., M. Ota, and K. Nishikawa, The protein mutant database. Nucleic Acids Research, 1999. 27: p. 355-357.

14.       NC-IUBMB, Nomenclature committee of the international union of biochemistry and molecular biology (NC-IUBMB), Enzyme Supplement 5. European Journal of Biochemistry, 1999. 264(2): p. 610-650.