Table of Contents

Preface

1  Introduction
1.1  Compartmentalization of the cell and the sorting of proteins
1.2  Advanced artificial intelligence (AI) techniques pave the way for predicting subcellular localization
1.3  Importance of accurate assessment of prediction accuracy

2  Finding nuclear localization signals
2.1  Introduction
2.2  Materials and methods
2.3  Results and discussions
2.3.1  Improved accuracy and coverage of NLS database
2.3.2  Specific NLS motifs used to bind DNA
2.3.3  Availability of data set and program

3  Sequence conserved for subcellular localization
3.1  Introduction
3.2  Materials and methods
3.3  Results
3.4  Discussions and conclusions

4  Inferring subcellular localization through automated lexical analysis
4.1  Introduction
4.2  Materials and methods
4.3  Results
4.4  Discussions and conclusions

5  Better prediction of subcellular localization by combining evolutionary and structural information
5.1  Introduction
5.2  Materials and methods
5.3  Results
5.4  Discussions and conclusions

6  Predicting subcellular localization based on functional hierarchies
6.1  Introduction
6.2  Materials and methods
6.3  Results
6.4  Discussions and conclusions

Bibliography

Appendix
I  Glossary


List of Figures

1-1  The structure of proteins
1-2  Characterization of protein function
1-3  The major compartments of eukaryotic cells
1-4  A simplified “roadmap” of protein traffic
1-5  Two ways in which a sorting signal can be built into a protein
2-1  Simplified scheme for nuclear import
2-2  Scheme for the concept of 'in silico mutagenesis'
2-3  NLS motif also used for DNA-binding
3-1  Transition from safe over twilight to midnight zone of protein comparisons
3-2  Sequence conservation for major classes of subcellular localization
3-3  Average conservation of subcellular localization
3-4  Performance for different measures of sequence similarity
3-5  Percentage pairwise sequence identity vs. length of alignment
3-6  Conservation of function and structure
4-1  LOCkey algorithm
4-2  Results for five-fold cross-validation
4-3  Performance improves with number of keywords
5-1  Final neural network architecture
5-2  The LOC3D system
5-3  Maximal linear separation of subcellular localization
5-4  Structural and evolutionary information improves prediction accuracy
5-5  Pairwise first level neural networks accurate for some localizations
5-6  Better prediction through combining neural networks
5-7  Over 75% accuracy for the most reliably predicted half of all proteins
6-1  Hierarchical architecture of LOCtree
6-2  Benchmarking LOCtree using large-scale experimental data


List of Tables

1-1  Typical signal sequences involved in protein sorting
1-2  Services for subcellular localization prediction
2-1  Accuracy and coverage of NLS motifs
2-2  Nuclear proteins in seven entirely sequenced proteomes
2-3  DNA-binding regions in genomes
3-1  Experimentally annotated subcellular localization data from SWISS-PROT
3-2  Inferring subcellular localization by homology
3-3  Assessing accuracy of prediction methods on sequence unique data set
4-1  Number of proteins in 'trusted' data sets
4-2  'Confusion Matrix' for LOCkey
4-3  Automatically annotating subcellular localization for five proteomes
5-1  Number of proteins in data set
5-2  Neural network performance on test set of sequence-unique PDB chains
5-3  Comparison on test set of sequence-unique PDB chains
5-4  Comparison on sequence-unique set of new SWISS-PROT proteins
5-5  Predicted subcellular localization for all eukaryotic PDB chains
6-1  Number of proteins in data set
6-2  Accuracy of LOCtree on non-redundant test set of eukaryotic non-plant proteins
6-3  Accuracy of LOCtree on non-redundant test set of plant proteins
6-4  Accuracy of LOCtree on non-redundant test set of prokaryotic proteins by localization classes
6-5  Comparison of LOCtree to other publicly available predictors
6-6  Comparison on sequence-unique set of new SWISS-PROT proteins
6-7  Performance of LOCtree based on large scale Yeast localization data
6-8  LOCtree predicts localization for entire genomes


Acknowledgements

 

I am deeply indebted to my advisor, Dr. Burkhard Rost, for his continuous support during my years at Columbia. Burkhard led me into the field of computational biology and bioinformatics, and his encouragement and guidance have made this thesis possible.

Special thanks go to all members of Dr. Rost's lab for their generous help. It is always a great pleasure to work with them. I would especially like to thank Jinfeng Liu for helping me with biology concepts and with computers, Dariusz Przybylski for extremely interesting discussions that helped me refine many of my ideas, and Kazimierz Wrzeszczynski, Trevor Siggers, and Cinque Soto for all the intellectual discussions I’ve had with them. The NLSdb database was constructed in collaboration with Phil Carter.

I am also grateful to Dr. Christina Leslie for discussions on support vector machines, and to Dr. Barry Honig and Dr. Wayne Hendrickson for interesting suggestions on the connection between nuclear localization signals and DNA-binding.

I am thankful to the members of my thesis committee and defense committee, Dr. Horst Stormer, Dr. Aron Pinczuk, Dr. Timothy Halpin-Healy, and Dr. Christina Leslie, for reviewing my thesis and for their helpful advice.

Finally, I would like to take this opportunity to thank other professors on this campus, including Dr. Brian Greene, Dr. Hal Evans, and Dr. Alan Blaer, for their kind help during my graduate studies here.

Dedication

 

 

This dissertation is dedicated

to my mother Padmavathi Radhakrishnan,

and my wife Christina Schlecht,

for their love, encouragement and support.


Preface

 

Towards the end of my second year as a graduate student in the Physics department, I started exploring the idea of applying the skills I had learned in physics to other areas of science. I was fascinated by the recent developments in biology, particularly in theoretical biology. Around this time, sequencing of the human genome was nearing completion and, for the first time in history, we were on the brink of understanding biological systems in their entirety. As a theoretical physicist, it seemed to me that the time was ripe to apply ideas from theoretical physics to problems in computational biology.

With encouragement from Burkhard, I started looking at the problem of predicting the subcellular localization of proteins. The subcellular localization of a protein is an important aspect of its function. At that time, the first computational methods to address this problem were being developed, and they seemed to fare poorly. The early methods were conceptually simplistic in that they focused solely on the amino acid composition of the protein. At first, I experimented with improving predictions by replacing the amino acid composition with the composition of the protein surface, which has evolved to function in the specific cellular environment of the protein. However, this did not lead to any significant improvement in predictions.

The prediction problem seemed particularly intractable, since a number of experimental studies had revealed a complex cellular sorting mechanism responsible for the active transport of proteins in the cell. A major obstacle to the development of prediction tools was the limited number of proteins for which localization had been determined experimentally. I realized that the annotations in protein databases were often incomplete, and that an expert could easily infer the subcellular localization of a large number of proteins from their functional annotations. To remedy this situation, I developed the LOCkey algorithm, which automatically infers localization from ‘functional keyword’ annotations.

Rather than taking a ‘one size fits all’ approach, I decided to pursue an integrated approach to predicting localization. Cellular sorting is mediated by targeting signals, and nuclear localization signals (NLSs) are responsible for targeting proteins to the nucleus. A large number of NLSs had been identified experimentally, but they could explain the nuclear targeting of fewer than 10% of known nuclear proteins. In collaboration with Murat Cokol, I developed a procedure of ‘in silico’ mutagenesis to discover the key residues in NLSs. Using this procedure, we could explain the nuclear localization of over 40% of the known nuclear proteins. However, we could not extend this procedure to the other subcellular classes.

This left me still looking for ways to predict the subcellular localization of a protein when only its amino acid sequence is known. I found that prediction accuracy could be significantly improved by incorporating predicted features, such as the secondary structure of the protein, and evolutionary information in the form of sequence profiles. Another major problem was that traditional machine learning algorithms, in their standard implementation, have a parallel architecture, whereas protein trafficking pathways have a hierarchical architecture. I therefore developed LOCtree to predict the subcellular localization of a protein from its amino acid sequence. LOCtree uses a novel architecture of hierarchical support vector machines (SVMs) and predicts localization by mimicking the cellular sorting mechanism. LOCtree was over 20% more accurate than the best publicly available localization prediction server. Taken together, the prediction servers that I have developed during the course of my thesis work represent a significant breakthrough in addressing the problem of predicting the subcellular localization of a protein.