Table of Contents
Preface
1 Introduction
1.1 Compartmentalization of the cell and the sorting of proteins
1.2 Advanced artificial intelligence (AI) techniques pave the way for predicting subcellular localization
1.3 Importance of accurate assessment of prediction accuracy
2 Finding nuclear localization signals
2.1 Introduction
2.2 Materials and methods
2.3 Results and discussions
3 Sequence conserved for subcellular localization
3.1 Introduction
3.2 Materials and methods
3.3 Results
3.4 Discussions and conclusions
4 Inferring subcellular localization through automated lexical analysis
4.1 Introduction
4.2 Materials and methods
4.3 Results
4.4 Discussions and conclusions
5 Better prediction of subcellular localization by combining evolutionary and structural information
5.1 Introduction
5.2 Materials and methods
5.3 Results
5.4 Discussions and conclusions
6 Predicting subcellular localization based on functional hierarchies
6.1 Introduction
6.2 Materials and methods
6.3 Results
6.4 Discussions and conclusions
Bibliography
Appendix
I Glossary
List of Figures
1-1 The structure of proteins
1-2 Characterization of protein function
1-3 The major compartments of eukaryotic cells
1-4 A simplified “roadmap” of protein traffic
1-5 Two ways in which a sorting signal can be built into a protein
2-1 Simplified scheme for nuclear import
2-2 Scheme for the concept of 'in silico mutagenesis'
2-3 NLS motif also used for DNA-binding
3-1 Transition from safe over twilight to midnight zone of protein comparisons
3-2 Sequence conservation for major classes of subcellular localization
3-3 Average conservation of subcellular localization
3-4 Performance for different measures of sequence similarity
3-5 Percentage pairwise sequence identity vs. length of alignment
3-6 Conservation of function and structure
4-1 LOCkey algorithm
4-2 Results for five-fold cross-validation
4-3 Performance improves with number of keywords
5-1 Final neural network architecture
5-2 The LOC3D system
5-3 Maximal linear separation of subcellular localization
5-4 Structural and evolutionary information improves prediction accuracy
5-5 Pairwise first level neural networks accurate for some localizations
5-6 Better prediction through combining neural networks
5-7 Over 75% accuracy for the most reliably predicted half of all proteins
6-1 Hierarchical architecture of LOCtree
6-2 Benchmarking LOCtree using large-scale experimental data
List of Tables
1-1 Typical signal sequences involved in protein sorting
1-2 Services for subcellular localization prediction
2-1 Accuracy and coverage of NLS motifs
2-2 Nuclear proteins in seven entirely sequenced proteomes
2-3 DNA-binding regions in genomes
3-1 Experimentally annotated subcellular localization data from SWISS-PROT
3-2 Inferring subcellular localization by homology
3-3 Assessing accuracy of prediction methods on sequence-unique data set
4-1 Number of proteins in 'trusted' data sets
4-2 'Confusion Matrix' for LOCkey
4-3 Automatically annotating subcellular localization for five proteomes
5-1 Number of proteins in data set
5-2 Neural network performance on test set of sequence-unique PDB chains
5-3 Comparison on test set of sequence-unique PDB chains
5-4 Comparison on sequence-unique set of new SWISS-PROT proteins
5-5 Predicted subcellular localization for all eukaryotic PDB chains
6-1 Number of proteins in data set
6-2 Accuracy of LOCtree on non-redundant test set of eukaryotic non-plant proteins
6-3 Accuracy of LOCtree on non-redundant test set of plant proteins
6-4 Accuracy of LOCtree on non-redundant test set of prokaryotic proteins by localization classes
6-5 Comparison of LOCtree to other publicly available predictors
6-6 Comparison on sequence-unique set of new SWISS-PROT proteins
6-7 Performance of LOCtree based on large scale Yeast localization data
6-8 LOCtree predicts localization for entire genomes
Acknowledgements
I am deeply indebted to my advisor Dr.
Burkhard Rost for his continuous support during my years at
Special thanks go to all members of Dr. Rost's
lab for their generous help. It is always a great pleasure to work with them. I
would like to especially thank Jinfeng Liu for helping me with biology concepts
and with computers, Dariusz Przybylski for extremely interesting discussions
that helped me refine many of my ideas, and Kazimierz Wrzeszczynski, Trevor
Siggers and Cinque Soto for all the intellectual discussions I’ve had with
them. The NLSdb database was constructed in collaboration with Phil Carter.
I am also grateful to Dr. Christina Leslie
for discussions on support vector machines, and to Dr. Barry Honig and Dr. Wayne
Hendrickson for interesting suggestions on the connection between nuclear
localization signals and DNA-binding.
I am thankful to the members of my thesis
committee and defense committee, Dr. Horst Stormer, Dr. Aron Pinczuk, Dr.
Timothy Halpin-Healy, and Dr. Christina Leslie, for reviewing my thesis and
for their helpful advice.
Finally, I would like to take this
opportunity to thank other professors on this campus, including Dr. Brian
Greene, Dr. Hal Evans, and Dr. Alan Blaer, for their kind help during my
graduate study here.
Dedication
This dissertation is dedicated
to my mother Padmavathi Radhakrishnan,
and my wife Christina Schlecht,
for their love, encouragement and
support.
Preface
Towards the end of my second year as a graduate student in the Physics department, I started exploring the idea of applying the skills I had learned in physics to other areas of science. I was fascinated by the recent developments in biology, particularly in theoretical biology. Around this time, sequencing of the human genome was nearing completion and for the first time in history, we were on the brink of understanding biological systems in their entirety. As a theoretical physicist, it seemed to me the time was ripe to apply ideas from theoretical physics to solve problems in computational biology.
With encouragement from Burkhard, I started looking at the problem of predicting the subcellular localization of proteins. The subcellular localization of a protein is an important aspect of its function. At the time, the first computational methods to address this problem were being developed, and they seemed to fare poorly. The early methods were conceptually simplistic in that they focused solely on the amino acid composition of the protein. At first, I experimented with improving predictions by replacing the amino acid composition with the composition of the protein surface, which has evolved to function in the specific cellular environment of the protein. However, this did not lead to any significant improvement in predictions. The prediction problem seemed particularly intractable since a number of experimental studies had revealed a complex cellular sorting mechanism that was responsible for the active transport of proteins in the cell.
A major problem for the development of prediction tools was the limited number of proteins for which localization had been determined experimentally. I realized that the annotations in protein databases were often incomplete and that an expert could easily infer the subcellular localization for a large number of proteins based on their functional annotations. To remedy this situation, I developed the LOCkey algorithm, which automatically infers localization based on ‘functional keyword’ annotations.
Rather than taking a ‘one size fits all’ approach, I decided to pursue an integrated approach to predicting localization. Cellular sorting is mediated by targeting signals, and nuclear localization signals (NLSs) are responsible for targeting proteins to the nucleus. A large number of NLSs had been identified experimentally, but they could explain the nuclear targeting of fewer than 10% of known nuclear proteins. In collaboration with Murat Cokol, I developed a procedure of ‘in silico’ mutagenesis to discover the key residues in NLSs.
Using this procedure, we could explain the nuclear localization of over 40% of the known nuclear proteins. However, we could not extend this procedure to the other subcellular classes. This left me still looking for ways to predict the subcellular localization of a protein when only its amino acid sequence is known. I found that prediction accuracy could be significantly improved by incorporating predicted features such as the secondary structure of the protein and evolutionary information in the form of sequence profiles.
Another major problem was that traditional machine learning algorithms, in their standard implementation, have a parallel architecture, whereas protein trafficking pathways have a hierarchical architecture. I developed LOCtree for predicting the subcellular localization of a protein from its amino acid sequence. LOCtree uses a novel architecture of hierarchical support vector machines (SVMs) and predicts localization by mimicking the cellular sorting mechanism. LOCtree was over 20% more accurate than the best publicly available localization prediction server. Taken together, the prediction servers that I developed during the course of my thesis represent a significant breakthrough in addressing the problem of predicting the subcellular localization of a protein.
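To make concrete what the early composition-based methods mentioned above used as input, the following minimal Python sketch computes the amino acid composition of a sequence, that is, the fraction of each of the 20 standard residues. The function name `amino_acid_composition` and the toy sequence are assumptions made purely for illustration; this is not the implementation of any particular published predictor.

```python
from collections import Counter
from typing import Dict

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Return the fraction of each standard amino acid in a protein sequence.

    Non-standard symbols (e.g. 'X') are ignored when computing fractions.
    """
    seq = sequence.upper()
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # A fixed-order, 20-entry mapping; residues that do not occur get 0.0.
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Hypothetical toy sequence, used only for illustration.
    composition = amino_acid_composition("MKKLLPKKKRKV")
    for aa, fraction in composition.items():
        if fraction > 0:
            print(f"{aa}: {fraction:.3f}")
```

The resulting 20-dimensional vector discards all information about residue order, which is one way to see why a representation of this kind cannot capture localized targeting signals such as NLSs.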