Supporting online material
for:
SNAP: predictions for functional effects of non-synonymous polymorphisms
Yana Bromberg & Burkhard Rost
Table 1: Data set summary
Table 2: Full set of evaluated features
Table 3: Other measures of performance
Table 4: PMD/EC data set predictions made by SNAPannotated
Figure 1: Choosing features for inputs to networks
Figure 2: Differential performance for exposed and buried SNPs
References
Four tables and two figures are included. Table 1 reports the numbers of
samples in the sets used to develop SNAP (classified by solvent accessibility,
source of data, and functional effect). Table 2 contains a full description of
the features evaluated in developing SNAP. Table 3 reports measures of method
performance other than accuracy (Matthews correlation and mutual information).
Table 4 is a confusion matrix of SNAPannotated predictions. Figure 1 depicts
the contribution of the various inputs to the performance of the neural
networks on the differentially accessible data sets. Figure 2 illustrates the
difference in SNAP prediction performance across these data sets.
Table 1: Data set summary.
| | Proteins | PMD* Non-Neutral | PMD Neutral | EC* Neutral | Total Neutral | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Before Split | 6821 | 40641 | 14334 | 26840 | 41174 | 81815 |
| After Split** | 6413 | 39987 | 14148 | 26682 | 40830 | 80817 |
| Buried Set | 5144 | 19741 | 5151 | 9649 | 14800 | 34541 |
| Intermediate Set | 4841 | 12285 | 4362 | 8711 | 13073 | 25358 |
| Exposed Set | 4150 | 7961 | 4635 | 8322 | 12957 | 20918 |
* PMD annotation indicates mutants extracted from the PMD database [12, 13],
while EC annotation represents neutral-only pseudo-entries extracted by
comparing sequences with the same EC number [14].
** Indicates a slight reduction of the data set after a ten-fold split for
cross-validation purposes.
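The column totals in Table 1 can be checked with a few lines of arithmetic. The sketch below is not part of the original analysis; the names `TABLE_1`, `row_is_consistent`, and `accessibility_sets_sum_to_split` are illustrative. It verifies that PMD Neutral plus EC Neutral gives Total Neutral, that PMD Non-Neutral plus Total Neutral gives Total, and that the sample counts of the three accessibility sets add up to the After Split row. The protein column is left out of the last check because one protein can contribute samples to several sets.

```python
# Rows of Table 1 as
# (proteins, PMD non-neutral, PMD neutral, EC neutral, total neutral, total).
TABLE_1 = {
    "Before Split": (6821, 40641, 14334, 26840, 41174, 81815),
    "After Split": (6413, 39987, 14148, 26682, 40830, 80817),
    "Buried Set": (5144, 19741, 5151, 9649, 14800, 34541),
    "Intermediate Set": (4841, 12285, 4362, 8711, 13073, 25358),
    "Exposed Set": (4150, 7961, 4635, 8322, 12957, 20918),
}


def row_is_consistent(row):
    """Check the two totals reported in a single row."""
    _, pmd_non_neutral, pmd_neutral, ec_neutral, total_neutral, total = row
    return (pmd_neutral + ec_neutral == total_neutral
            and pmd_non_neutral + total_neutral == total)


def accessibility_sets_sum_to_split(table):
    """Check that buried + intermediate + exposed equals the After Split row."""
    subsets = ("Buried Set", "Intermediate Set", "Exposed Set")
    # Columns 1..5 are sample counts; column 0 (proteins) is skipped.
    return all(
        sum(table[name][col] for name in subsets) == table["After Split"][col]
        for col in range(1, 6)
    )


rows_ok = all(row_is_consistent(row) for row in TABLE_1.values())
sets_ok = accessibility_sets_sum_to_split(TABLE_1)
```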
Table 2: Full set of evaluated features.
| Feature | Description |
| --- | --- |
| Bio-Chemical | Difference in: mass, hydrophobic class (phobic/neutral/philic), charge (pos/neutral/negative), c-beta branch (present/absent), surface (tiny/small/regular), presence of buried charge, and proline in a helix |
| Sequence | A vector of 21, where a single node, different for each residue, is on. |
| PSI-BLAST Profiles [1] | A vector of scores and frequencies from PSI-BLAST PSSM |
| | A vector of frequencies of each residue at the position in alignment |
| | Absolute difference (and direction) in score and frequency values at mutant position between wt and mutant residues |
| | A vector of 0's and frequency of given residue at position in alignment. |
| 1D rsa [2] | Relative accessibility and reliability of accessibility prediction |
| 1D ss [2] | 3-node output of sec. structure predictor & reliability of prediction |
| | A vector of three nodes, where each is turned on one at a time to represent helix/loop/ext_strand states (plus reliability of prediction) |
| Sequence ss & sa [2] | Difference in raw output of helix/loop/strand and accessibility values in predictions from sequence made for mutant and wt residues. |
| Pfam [3] | Presence of a domain affected by mutation and the e-value of the hit. |
| | Above information and indication of presence of a consensus sequence agreement and BLOSUM62's value of the likelihood of a match of the original residue to the residue in the consensus. |
| | Above information and BLOSUM62's value of the likelihood of a match of the mutant residue to the residue in the consensus, proximity of start and end of domain within 10 residues of mutant, and presence of mutant position in other domains |
| PSIC Profile [4] | The difference of scores of the original and mutant residues, subjected to a cutoff measure splitting space into 3 segments |
| | PSIC score (and sign) for each residue, absolute difference between the scores of the wt and mutant, and number of sequences aligned |
| | PSIC score (and sign) for mutant, difference as described above, and number of sequences aligned |
| Flexibility | Raw output of B-value predictor network |
| Transition | Vector of 3 values, each showing the likelihood of a stretch of residues where the residue concerned is 1st, 2nd, or 3rd (sliding window of 3) |
| SWISS-PROT [5] | A vector of 6 nodes, where 5 are class annotations (Methods), and the 6th is PHAT value of substitution if mutant is in trans-membrane region |
| PolyPhen [6] | 2-node prediction vector and 3-node "source of evidence" vector |
| SIFT [7] | SIFT score, final prediction, and number of sequences at position |
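Two of the encodings in Table 2 are one-hot vectors: the 21-node Sequence input and the three-node secondary structure input. The sketch below illustrates the idea only. The residue ordering, the use of the 21st node for anything outside the 20 standard amino acids, and the function names are assumptions, since the table does not specify them.

```python
# 20 standard amino acids; this ordering is an assumption of the sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SEQUENCE_VECTOR_SIZE = 21  # vector length given in Table 2

# State names follow Table 2 (helix/loop/ext_strand); the order is assumed.
SS_STATES = ("helix", "loop", "ext_strand")


def encode_residue(residue):
    """Return a 21-node vector with exactly one node switched on."""
    vector = [0] * SEQUENCE_VECTOR_SIZE
    index = AMINO_ACIDS.find(residue.upper())
    # Assumption: non-standard residues are mapped to the last node.
    vector[index if index >= 0 else SEQUENCE_VECTOR_SIZE - 1] = 1
    return vector


def encode_secondary_structure(state):
    """Return a 3-node vector with the node of the given state switched on."""
    return [1 if name == state else 0 for name in SS_STATES]


alanine = encode_residue("A")
loop = encode_secondary_structure("loop")
```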
Table 3: Other measures of performance
| | Matthews correlation* | Mutual Information** |
| --- | --- | --- |
| SIFT [7] | 0.49 | 0.18 |
| PolyPhen [6] | 0.50 | 0.19 |
| SNAP | 0.56 | 0.24 |
| SNAPannotated | 0.58 | 0.26 |
* Matthews correlation [8] was computed using the standard formula:

$$ MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $$

where TP, FP, TN, and FN are the numbers of true positives (correctly
predicted non-neutral samples), false positives (neutral samples predicted to
be non-neutral), true negatives (correctly predicted neutral samples), and
false negatives (non-neutral samples predicted to be neutral), respectively.

** Normalized Mutual Information (MI) [9, 10] was computed as the
Kullback-Leibler (KL) [11] divergence between the joint distribution and the
product of the individual distributions:

$$ MI = \sum_{i,j} P(i,j) \log \frac{P(i,j)}{P_{observation}(i) \, P_{prediction}(j)} $$

normalized to the 0<MI<1 range by dividing by the entropy value (H) as follows:

$$ MI_{normalized} = \frac{MI}{H} $$

where i represents the observed class and j represents the predicted class.
Thus P(i,j) is the probability that a sample observed in class i is predicted
to be in class j (#_samples i predicted as j / #_total samples). Similarly,
Pobservation(i) is the probability of observation in class i
(#_samples i / #_total samples), and Pprediction(j) is the probability of
prediction in class j (#_predictions j / #_total predictions).
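Both measures can be computed directly from a two-class confusion matrix. The sketch below is an illustration, not the original implementation. The function names are hypothetical, the logarithm base (2) is an assumption, and so is the choice of the entropy of the observed class distribution as H, since the text names the entropy without saying which distribution it is taken over. The counts are the SNAPannotated values reported in Table 4.

```python
import math


def matthews_correlation(tp, fp, tn, fn):
    """Matthews correlation from the four confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def normalized_mutual_information(counts):
    """counts[i][j] = samples observed in class i and predicted as class j."""
    total = sum(sum(row) for row in counts)
    p_obs = [sum(row) / total for row in counts]
    p_pred = [sum(col) / total for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n:  # empty cells contribute nothing to the sum
                p_ij = n / total
                mi += p_ij * math.log2(p_ij / (p_obs[i] * p_pred[j]))
    # Assumption: H is the entropy of the observed class distribution.
    entropy = -sum(p * math.log2(p) for p in p_obs if p)
    return mi / entropy if entropy else 0.0


# SNAPannotated counts from Table 4.
TP, FP, TN, FN = 33291, 10338, 30492, 6696
mcc = matthews_correlation(TP, FP, TN, FN)
# Rows are observed classes (non-neutral, neutral); columns are predictions.
nmi = normalized_mutual_information([[TP, FN], [FP, TN]])
print(round(mcc, 2), round(nmi, 2))  # prints 0.58 0.26
```

Under these assumptions the result matches the SNAPannotated row of Table 3. Because the observed classes are almost balanced, H is close to 1 bit, so the choice of entropy has little effect on the rounded value.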
Table 4: PMD/EC data set predictions made by SNAPannotated
| | Non-Neutral observed | Neutral observed |
| --- | --- | --- |
| Non-Neutral predicted | TP = 33291 | FP = 10338 |
| Neutral predicted | FN = 6696 | TN = 30492 |
PMD/EC data set annotation indicates mutants extracted from the PMD database
and from enzyme data (Methods; Table 1).
Fig. 1: Choosing features for inputs to networks. All available features are considered for each set. At each subsequent step, the feature that yields the highest overall two-state accuracy (Eqn. 2 in Methods) is added to the input. Once one representation of a particular feature has been added to the input, its other representations are no longer considered in the following steps. For a more detailed explanation of the features and the selection process, see Methods.
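The selection procedure described in the caption is a greedy forward selection. The sketch below shows the general idea under stated assumptions: `evaluate` is a hypothetical callable standing in for training and testing a network on a given input set, the feature and representation names are placeholders, and the toy gains are made-up numbers used only to exercise the loop. The caption does not give a stopping criterion, so this version runs until every feature has been used or an optional step limit is reached.

```python
def greedy_feature_selection(candidates, evaluate, max_steps=None):
    """Greedy forward selection over features with alternative representations.

    candidates: dict mapping a feature name to a list of its representations.
    evaluate:   hypothetical callable that takes a list of
                (feature, representation) pairs and returns an accuracy.
    Returns a list of ((feature, representation), accuracy) in selection order.
    """
    selected, history = [], []
    remaining = {name: list(reps) for name, reps in candidates.items()}
    while remaining and (max_steps is None or len(selected) < max_steps):
        # Score every representation of every feature not yet used.
        scores = {
            (name, rep): evaluate(selected + [(name, rep)])
            for name, reps in remaining.items()
            for rep in reps
        }
        if not scores:
            break
        best = max(scores, key=scores.get)
        selected.append(best)
        history.append((best, scores[best]))
        # All other representations of the chosen feature are dropped.
        del remaining[best[0]]
    return history


# Toy example with made-up additive gains (not results from the paper).
toy_gain = {
    ("feature_a", "rep_1"): 0.05,
    ("feature_a", "rep_2"): 0.07,
    ("feature_b", "rep_1"): 0.10,
}


def toy_evaluate(selection):
    return 0.5 + sum(toy_gain[choice] for choice in selection)


history = greedy_feature_selection(
    {"feature_a": ["rep_1", "rep_2"], "feature_b": ["rep_1"]}, toy_evaluate
)
order = [choice for choice, _ in history]
```

In the toy run, `feature_b` is picked first, then the better of the two `feature_a` representations. The remaining `feature_a` representation is never revisited.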
Fig. 2: Differential performance for exposed and buried SNPs. The PMD/EC data
set was split according to the level of solvent accessibility predicted by
PROFacc for the nsSNP: buried (red lines; relative solvent accessibility
RSA < 9%), intermediate (green lines; 9% ≤ RSA < 36%), and exposed (blue lines;
RSA > 36%). The x-axis shows the accuracy of predictions, and the y-axis shows
the coverage (Eqn. 3 and 4). The non-neutral curves (full lines) clearly
indicate better performance on more buried residues (red curve closer to the
top right than green, and green closer than blue). In contrast, neutral
effects (dashed lines) were predicted slightly better for exposed than for
buried residues.
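The three-way split by predicted accessibility amounts to a simple threshold rule. The sketch below is a minimal illustration using the 9% and 36% cutoffs from the caption. The function name is hypothetical, and assigning a value of exactly 36% to the exposed class is an assumption, since the caption leaves that boundary open.

```python
def accessibility_class(rsa_percent):
    """Assign a residue to a set from its predicted RSA, given in percent."""
    if rsa_percent < 9:
        return "buried"
    if rsa_percent < 36:
        return "intermediate"
    # Assumption: a value of exactly 36% is treated as exposed.
    return "exposed"


classes = [accessibility_class(value) for value in (2.0, 9.0, 35.9, 60.0)]
```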
References
1. Altschul, S.F., et al., Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Research, 1997. 25(17): p. 3389-3402.
2. Rost, B., How to use protein 1D structure predicted by PROFphd, in The Proteomics Protocols Handbook, J.E. Walker, Editor. 2005, Humana: Totowa NJ. p. 875-901.
3. Bateman, A., et al., The Pfam Protein Families Database. Nucleic Acids Research, 2004. 32(Database Issue): p. D138-D141.
4. Sunyaev, S.R., et al., PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. Protein Engineering, 1999. 12(5): p. 387-394.
5. Boeckmann, B., et al., The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research, 2003. 31(1): p. 365-370.
6. Ramensky, V., P. Bork, and S.R. Sunyaev, Human non-synonymous SNPs: server and survey. Nucleic Acids Research, 2002. 30(17): p. 3894-3900.
7. Ng, P.C. and S. Henikoff, SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Research, 2003. 31: p. 3812-3814.
8. Matthews, B.W., Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta, 1975. 405(2): p. 442-451.
9. Rost, B. and C. Sander, Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, 1993. 232: p. 584-599.
10. Rost, B., C. Sander, and R. Schneider, Redefining the goals of protein secondary structure prediction. Journal of Molecular Biology, 1994. 235(1): p. 13-26.
11. Kullback, S. and R.A. Leibler, On information and sufficiency. Annals of Mathematical Statistics, 1951. 22: p. 79-86.
12. Nishikawa, K., et al., Constructing a protein mutant database. Protein Engineering, 1994. 7(5): p. 773.
13. Kawabata, T., M. Ota, and K. Nishikawa, The protein mutant database. Nucleic Acids Research, 1999. 27: p. 355-357.
14. NC-IUBMB, Nomenclature committee of the international union of biochemistry and molecular biology (NC-IUBMB), Enzyme Supplement 5. European Journal of Biochemistry, 1999. 264(2): p. 610-650.