
Title: Annotating proteins from Endoplasmic Reticulum and Golgi apparatus in eukaryotic proteomes
Author:Kazimierz O Wrzeszczynski & Burkhard Rost

Kazimierz O Wrzeszczynski 1 & Burkhard Rost ?

1 CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
3 Integrated Program in Cellular, Molecular and Biophysical Studies, Columbia University, 630 West 168th Street, New York, NY 10032, USA
* Corresponding author: cubic@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/  Tel: +1-212-305-4018, fax: +1-212-305-7932

This article is published in (CMLS, issue, 2003 and pages) copyright Cellular and Molecular Life Sciences, Birkhäuser (2003). OMJ is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.


Abstract

The sub-cellular localization of a native protein constitutes one coarse-grained aspect of its function. Transport between compartments is often regulated through short sequence motifs. Here, we analysed experimentally characterised ER/Golgi retrieval motifs and investigated the accuracy of homology-transfer. Only the C-terminal ER retrieval motifs KDEL, HDEL and AIAKE were sufficiently specific. However, even unspecific motifs may help, provided we know the probability for localization given this motif. We provided such estimates. We also rigorously estimated the accuracy and coverage for inferring ER and Golgi localization through homology-transfer by sequence similarity. In entire proteomes, we could thereby annotate 3304 ER (3182 membrane) and 1853 Golgi proteins (759 membrane). We identified another 5157 globular and 3941 membrane putative ER or Golgi proteins. Each experimental annotation yielded, on average, 1-3 high-accuracy and 5-6 low-accuracy homology-transfers in the six proteomes. These numbers will increase with each new experimental annotation.

 

Key words: endoplasmic reticulum, Golgi apparatus, genome sequence analysis, sub-cellular localization, protein sequence motifs.

 

Abbreviations used

BIG: merger of three databases: PDB, Swiss-Prot and TrEMBL
C-term: carboxy-terminal, i.e. end of protein
ER: endoplasmic reticulum
EVAL: BLAST expectation value
HVAL: HSSP-value, i.e. function correlating pairwise sequence identity and alignment length (eqn. 1)
ORF: open reading frame
PHDsec: profile-based neural network prediction of secondary structure [1, 2, 3]
PIDE: pairwise percentage of identical residues
PDB: database of protein structures [4]
Swiss-Prot: annotated protein sequence database [5]
TrEMBL: translated EMBL database of un-annotated protein sequences [5]


 

Notation used: globular, describes all proteins that are neither integral membrane proteins nor attached to the membrane; proteome, all the proteins in an organism as the 'proteome' of that organism; retention/recycle/retrieval, strictly motifs such as KDEL are shown to guarantee the retrieval or recycling of proteins back into the ER rather than the retention in the ER; trusted, with the term 'trusted data set' we refer to a set for which the sub-cellular localization has been annotated experimentally; [XY] means either amino acid X or Y at the given position.

 

Introduction

Trafficking through eukaryotic cells. The major constituents of eukaryotic cells are: extra-cellular space, cytoplasm, nucleus, mitochondria, Golgi apparatus, endoplasmic reticulum, peroxisomes, and lysosomes. The native sub-cellular localization of a protein is assumed to be determined largely by a trafficking system that is reasonably well captured experimentally for some of the organelles [6, 7, 8, 9, 10, 11, 12] . The system has two main branches [13] . On the first branch, proteins are synthesised on cytoplasmic ribosomes, and from there can go to the nucleus, mitochondria or peroxisomes. The second branch leads from the ribosomes attached to the endoplasmic reticulum to the Golgi apparatus, then to lysosomes or secretory vesicles, and on to the extra-cellular space. At each branch point in the trafficking system, a 'decision is made': either retain the protein in the current compartment or transport it onward to the next. For many examples, we have experimental evidence that membrane transport complexes 'make these decisions' by recognising motifs on the proteins that are shuttled. The most comprehensively characterised branch is the second one, leading to secretion [14, 15, 16, 17, 18] . Most proteins destined for this branch are assumed to have an N-terminal signal peptide that causes them to be transferred into the endoplasmic reticulum as they are being synthesised; most proteins lacking this signal are synthesised in the cytoplasm and follow the first branch of protein trafficking. Note that some proteins appear to be secreted through a different pathway and clearly lack signal peptides.

Protein sorting through the secretory pathway. The secretory pathway involves a complex protein transport system between its organelles that operates without significant loss of organelle-resident proteins. This highly selective process allows for the post-translational modification and maturation of newly synthesised proteins passing through the endoplasmic reticulum (ER) and Golgi apparatus while strictly sorting and retaining resident soluble and membrane-bound proteins [19, 20] . A small fraction of proteins undergo ER/Golgi-independent protein secretion. This process is performed through at least four distinct pathways under varying cellular conditions [21] . The common assumption is that proteins are kept within the ER, and to a much lesser extent within the Golgi apparatus (Golgi), through specific short peptides that act as signals for sorting mechanisms mediated by retention and by retrieval/recycling [22, 23, 24] .

Technically, groups of a few, specific residues are referred to as sequence motifs. Secreted proteins usually contain an N-terminal signal peptide of 10-30 residues that is cleaved upon successful transport through the extra-cellular membranes in the ER [22] ; these signal peptides have distinct sequence features and can therefore be predicted accurately for proteins of unknown localization [18, 25] . Conceptually, we can distinguish between three types of motifs that are recognised by transport proteins: (i) generic, sequence-consecutive motifs with common features such as cleaved signal peptides, (ii) specific, sequence-consecutive motifs like nuclear localization signals, and (iii) non-sequence-consecutive motifs recognisable only after the protein has folded. While binding motifs that require details of the three-dimensional, folded protein structure are common for all kinds of protein function, such as small molecule or enzyme binding, surprisingly few such cases have been implicated in the regulation of protein trafficking. One prominent exception is the mannose-6-phosphate (M6P) receptors that bind M6P-containing soluble acid hydrolases in the Golgi and transport them on to the endosomal-lysosomal system [26, 27] . Only the first type of motif (generic and sequence-consecutive) can currently be predicted; this allows identifying proteins with signal peptides [18, 28] , chloroplast transit peptides [29, 28] , as well as peroxisomal [30] and mitochondrial targeting signals [31, 28] . For the second type of specific, sequence-consecutive motifs, all computational biology can do so far is to archive these in databases that can be queried to find unknown nuclear proteins [32, 33, 34, 35] .

Computational analysis of ER and Golgi retrieval signals has been largely limited to PROSITE [36] and PSORT [37] , both of which rely only on the classical ER and Golgi retrieval motifs (or conservative derivations of these classical motifs); few attempts analysed Golgi and ER proteins on the level of entirely sequenced proteomes [38] . Large-scale genomic consortia rely in part on valid function and sub-cellular localization information for their target selection decisions. Large-scale experimental efforts can adequately account for a proportion of a specific proteome [39, 40] but are often limited by the experimental design. Therefore computational efforts are often needed to complete an entire proteomic analysis. Our lab has recently reported that sequence conservation established using the HSSP-distance value correlated well with sub-cellular localization [41] . We have further applied this technique specifically to the ER and Golgi organelles. Here, we analysed to which extent ER and Golgi proteins can be identified through such short sequence motifs and/or through sequence similarity to proteins known to reside in these two compartments. This analysis required three steps: (1) collect known signals from literature and databases, (2) build unbiased, trusted data sets of proteins experimentally known to reside in ER and Golgi, and (3) test specificity and accuracy of the signals found. (Note: we failed to uncover novel motifs through motif-finding algorithms.) The first part of our work explored the limits of how far we can reach when trying to predict ER and Golgi proteins from experimentally known and theoretically refined signals. Next, we established thresholds for significant sequence similarity, i.e. for when we can accurately infer ER and Golgi location through homology. 
Finally, we applied our results to annotate ER and Golgi proteins in the proteomes of Saccharomyces cerevisiae (yeast), Drosophila melanogaster (fruit-fly), Caenorhabditis elegans (worm), Arabidopsis thaliana (weed), Homo sapiens (human), and Mus musculus (mouse).

 

Methods

Trusted data sets of proteins with known localization. We retrieved all proteins from Swiss-Prot [5] that had experimental annotations about sub-cellular localization, removing all with 'putatively known' localization that contained either 'PROBABLE', 'PUTATIVE', or 'BY SIMILARITY' as additional qualifiers in Swiss-Prot. We split these proteins into 'trusted ER/Golgi' and 'trusted non-ER/non-Golgi'. As another control data set, we also retrieved all non-eukaryotic Swiss-Prot proteins. The resulting trusted data sets contained 676 ER proteins, 131 lumenal ER proteins, 104 proteins with a [KH]DEL C-terminal motif, 545 ER membrane proteins, 312 trusted Golgi proteins, and 194 Golgi membrane proteins. A non-ER/non-Golgi set of 8417 localization annotated eukaryotic proteins was used to identify false positives (numbers summarised in Table in Supporting Online Material). Additional experimentally annotated yeast proteins were collected from the Yeast GFP Fusion Localization Database - http://yeastgfp.ucsf.edu. However, we considered only ORFs without the keyword 'Hypothetical Protein'. This increased the total number of ER proteins to 784 and the Golgi trusted set to 351. The trusted data sets can be obtained from 'ER-GolgiDB at http://cubic.bioc.columbia.edu/db/ERGolgiDB.
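
The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input format (pairs of a protein identifier and a free-text localization comment) and the function names are assumptions made for the example; only the three qualifiers come from the text.

```python
# Qualifiers that mark a localization annotation as non-experimental (from the text).
UNCERTAIN_QUALIFIERS = ("PROBABLE", "PUTATIVE", "BY SIMILARITY")


def is_experimental(comment: str) -> bool:
    """Return True if the comment carries none of the uncertain qualifiers."""
    upper = comment.upper()
    return not any(q in upper for q in UNCERTAIN_QUALIFIERS)


def split_trusted(entries):
    """Split (protein_id, localization_comment) pairs into two trusted sets.

    Entries with an uncertain qualifier are dropped. The remaining entries
    go to 'trusted ER/Golgi' or to 'trusted non-ER/non-Golgi'.
    The keyword test below is a simplification assumed for this sketch.
    """
    er_golgi, other = [], []
    for protein_id, comment in entries:
        if not is_experimental(comment):
            continue
        upper = comment.upper()
        if "ENDOPLASMIC RETICULUM" in upper or "GOLGI" in upper:
            er_golgi.append(protein_id)
        else:
            other.append(protein_id)
    return er_golgi, other
```

For example, an entry annotated "Golgi membrane (By similarity)" would be discarded, whereas "Endoplasmic reticulum lumen" would enter the trusted ER/Golgi set.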

Data sets for entire proteomes. The human proteins constituted the set currently available in the latest versions of Swiss-Prot (release 40) and TrEMBL (release 22) [5] ; Drosophila melanogaster was obtained from http://www.fruitfly.org/ (release 2), Mus musculus was obtained from http://www.ensembl.org/Mus_musculus/ and Caenorhabditis elegans from http://www.sanger.ac.uk/Projects/C_elegans/wormpep/ (wormpep 65). All remaining proteomes (weed and yeast) were downloaded from ftp://ncbi.nih.gov/genbank/genomes/.

Aligning proteins. We aligned the trusted ER and Golgi proteins against all proteins of known localization using pairwise BLAST [42] . Next, we built PSI-BLAST profiles for all data sets using a filtered version of all currently known sequences with three iterations [43] . These profiles were then aligned against all proteins of known localization. After compiling the results for the sequence conservation ( Fig. 1 ), we changed these profiles such that we only included homologues with HVALs above 40 for ER and above 20 for Golgi proteins. We based our identification of ER/Golgi proteins in entire proteomes on these three data sets: trusted, trusted families, and unique subset of trusted families.

Scores for measuring sequence similarity.  The simplest way to measure sequence similarity is percentage pairwise sequence identity (PIDE), i.e. the percentage of residues identical between two proteins (not counting gaps). Another measure is the statistical expectation values as reported by BLAST (EVAL, note: we typically report the logarithm of this value in our figures). As third measure we used the HSSP-value (HVAL) [44, 45] :

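$$
\mathrm{HVAL} = \mathrm{PIDE} -
\begin{cases}
100 & \text{for } L \le 11 \\
480 \cdot L^{-0.32 \cdot (1 + e^{-L/1000})} & \text{for } 11 < L \le 450 \\
19.5 & \text{for } L > 450
\end{cases}
\qquad (\text{eqn. 1})
$$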

where L was the number of residues aligned between two proteins, and PIDE the percentage of pairwise identical residues. The HSSP-value reflects whether an alignment is above the HSSP-curve [44, 45] (HVAL>0) or below (HVAL<0). In the first case (HVAL>0), the HSSP-value can be seen as a degree of sequence proximity or similarity (the higher the value, the more similar the two proteins), whereas in the latter case (HVAL<0) it estimates the distance, or level of divergence, between two proteins (the more negative the value, the less similar the two proteins). An HSSP-value of 0 defines the line above which (almost) no two naturally evolved proteins differ grossly in their three-dimensional structures. To illustrate the curve: for alignment lengths around 100 residues, 33% pairwise sequence identity suffices to infer structure, above 250 residues 21% is significant, and below 11 residues even 100% identity is not enough to infer structural (or functional) similarity. Although the HSSP-curve was derived to describe structural similarity, we noted that it also constitutes a sensitive approach when distinguishing between proteins of similar and dissimilar enzymatic activity [46] , between the largest four compartments (nucleus, extra-cellular space, cytoplasm and mitochondria) [41] , and between proteins involved in cell-cycle control [47] .
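
To make the HSSP-value concrete, the following sketch computes it for a single alignment. It assumes the commonly published form of the HSSP-curve from the cited work [44, 45] (a piecewise function of alignment length with constants 480, 0.32, 1000 and a plateau of 19.5 beyond 450 residues); the function names are chosen for this example only.

```python
import math


def hssp_curve(length: int) -> float:
    """Percentage identity required at a given alignment length.

    Assumed form of the HSSP-curve: 100% for very short alignments,
    a decaying power law up to 450 residues, and a constant beyond that.
    """
    if length <= 11:
        return 100.0
    if length <= 450:
        return 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    return 19.5


def hval(pide: float, length: int) -> float:
    """HSSP-value: distance of an alignment from the HSSP-curve.

    pide   -- percentage of pairwise identical residues (0-100)
    length -- number of aligned residues
    Positive values lie above the curve, negative values below it.
    """
    return pide - hssp_curve(length)
```

With this form, the curve lies near 21% identity for 250 aligned residues, in line with the illustration in the text, and alignments of 11 residues or fewer can never reach HVAL>0.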

Sequence-unique subsets. We built sequence-unique subsets for all types of proteins under consideration to avoid bias that is likely to skew estimates for accuracy and coverage [46] . A set was defined as 'sequence-unique' if no pair of proteins in the set had an HVAL>0 ( eqn. 1). Given an all-against-all pairwise alignment for the biased set, we simply used a greedy search to find the largest subset that fulfilled the above condition. (Note: a tool performing this type of reduction is available through the web [48] .)
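
One possible greedy reduction is sketched below. It is not the tool referenced in [48]; the ordering heuristic (proteins with the fewest similar partners first) and all names are assumptions for illustration. A greedy search of this kind yields a large sequence-unique subset but does not guarantee the largest possible one.

```python
def sequence_unique_subset(proteins, similar_pairs):
    """Greedily pick proteins so that no chosen pair is similar.

    proteins      -- iterable of protein identifiers
    similar_pairs -- iterable of (a, b) pairs whose alignment has HVAL > 0
    Returns a list of identifiers in which no two members are similar.
    """
    # Build the similarity graph from the all-against-all comparison.
    neighbours = {p: set() for p in proteins}
    for a, b in similar_pairs:
        if a != b and a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)

    chosen, excluded = [], set()
    # Heuristic: proteins with few similar partners are taken first, since
    # each of them removes few other candidates from the pool.
    for p in sorted(neighbours, key=lambda q: (len(neighbours[q]), q)):
        if p in excluded:
            continue
        chosen.append(p)
        excluded.update(neighbours[p])
    return chosen
```

For a set A, B, C, D in which only the pairs (A, B) and (B, C) exceed the threshold, the sketch keeps A, C and D and drops B.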

Measuring accuracy and coverage. We used the following definition to measure accuracy/specificity:

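$$
\mathrm{accuracy}(T) = 100 \cdot \frac{\text{number of true pairs above threshold } T}{\text{number of all pairs above threshold } T}
\qquad (\text{eqn. 2})
$$

with the thresholds given by either (1) percentage pairwise sequence identity (PIDE), (2) BLAST expectation values (EVAL), or (3) the HSSP-value (HVAL). We considered all pairs as 'true' that were experimentally found in the same sub-cellular compartment. In analogy, we used the following definition for coverage/sensitivity:

$$
\mathrm{coverage}(T) = 100 \cdot \frac{\text{number of true pairs above threshold } T}{\text{number of all true pairs}}
\qquad (\text{eqn. 3})
$$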
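
The two measures can be illustrated with a short sketch. It assumes the usual cumulative definitions (accuracy as the percentage of true pairs among all pairs above a threshold, coverage as the percentage of all true pairs that are found above the threshold); the data layout and function name are hypothetical.

```python
def accuracy_coverage(pairs, threshold):
    """Cumulative accuracy and coverage at a similarity threshold.

    pairs     -- list of (score, is_true) tuples; is_true marks pairs
                 experimentally found in the same compartment
    threshold -- only pairs with score > threshold are counted as predicted
    Returns (accuracy_percent, coverage_percent); NaN where undefined.
    """
    above = [is_true for score, is_true in pairs if score > threshold]
    true_above = sum(1 for is_true in above if is_true)
    total_true = sum(1 for _, is_true in pairs if is_true)
    accuracy = 100.0 * true_above / len(above) if above else float("nan")
    coverage = 100.0 * true_above / total_true if total_true else float("nan")
    return accuracy, coverage
```

Raising the threshold typically trades coverage for accuracy, which is the behaviour plotted in the sequence conservation curves.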

 

Results and Discussion



Retention and recycle signals can predict unknown ER and Golgi proteins

Collecting retention and retrieval/recycle motifs from literature and databases. We retrieved experimentally annotated retention and recycle signals for ER and Golgi from the literature, Swiss-Prot [5] and PROSITE [49] . The resulting list was tiny in comparison to that obtained previously for nuclear localization signals [32, 35] . Supposedly, the reason is that many soluble and membrane proteins in the ER have rather specific retention signals ( Table 1 ). Predominantly, the C-terminal motifs KDEL, HDEL, and closely related derivatives have been experimentally related to the retrieval mechanism [19] . Other motifs implicated in ER and Golgi targeting include [20] : (1) the C-terminal motif HDEF in the Ca2+-binding protein Calumenin [50] , (2) the Di-lysine motif KK [51] , the Di-arginine motif RR [52] or RKR (RxR) [53] , (3) the tyrosine-based tetra-peptide motif Yxxh (where x can be any amino acid and h signifies a hydrophobic residue), predominately associated in vesicular traffic sorting mechanisms [54] , has also been shown as a localization motif as evident in YQRL of TGN38 [55] and for the retrieval of UCE [56] , (4) the Di-acidic ER-export motifs [DE] often associated with the Yxxh motif [57] , (5) the cytoplasmic tail FxFxD motif in DPAP-A necessary for retrieval back to the Golgi [58] , and (6) the targeting domain GRIP, found in peripheral Golgi membrane proteins [59, 60] . The only motifs that were previously available to automatic proteome searches were KDEL and HDEL, as well as some derivatives of these deposited in PROSITE [49] and PSORT [37] . For the Golgi apparatus, other than C-terminal YQRL also used by PSORT, there is currently no other specific sequence motif available for automatic database searches [24] .



Table 1 : Analysing ER and Golgi retention and retrieval signals.

Sequence motif (1) | Total N | Eukaryotes N | Eukaryotes % | Non-Eukaryotes N | Non-Eukaryotes % | ER/Golgi N | ER/Golgi % | Non-ER/Non-Golgi N | Non-ER/Non-Golgi % | Non-Annotated N | Non-Annotated %

Endoplasmic reticulum (ER) (2)
KDEL-C-term | 67 | 60 | 90 | 7 | 10 | 60 | 90 | 0 | 0 | 0 | 0
KDEL | 1201 | 636 | 53 | 565 | 47 | 76 | 6 | 560 | 47 | 230 | 41
HDEL-C-term | 64 | 64 | 100 | 0 | 0 | 62 | 97 | 2 | 3 | 2 | 100
HDEL | 498 | 261 | 52 | 237 | 48 | 68 | 14 | 193 | 38 | 121 | 63
HDEF-C-term | 4 | 3 | 75 | 1 | 25 | 2 | 50 | 1 | 25 | 0 | 0
HDEF | 91 | 50 | 55 | 41 | 45 | 2 | 2 | 48 | 53 | 28 | 58
KKxx-C-term | 907 | 492 | 52 | 415 | 46 | 55 | 6 | 437 | 48 | 211 | 48
KKxx-C-term (membrane protein subset) | 254 | 183 | 72 | 71 | 28 | 55 | 22 | 128 | 50 | 21 | 16
KKxx | 57848 | 32493 | 56 | 25355 | 44 | 810 | 1 | 31683 | 55 | 15171 | 48
KxKxx-C-term | 810 | 420 | 52 | 390 | 48 | 42 | 5 | 378 | 47 | 177 | 47
KxKxx-C-term (membrane protein subset) | 230 | 139 | 60 | 91 | 40 | 42 | 18 | 97 | 42 | 25 | 26
xxRR | 83869 | 39769 | 47 | 44100 | 53 | 1062 | 1 | 38707 | 46 | 16050 | 41
KKFF-C-term | 8 | 5 | 63 | 3 | 37 | 3 | 38 | 3 | 25 | 2 | 67
KKFF | 416 | 234 | 56 | 93 | 22 | 7 | 2 | 316 | 76 | 118 | 37
KKAA-C-term | 29 | 7 | 24 | 22 | 76 | 5 | 17 | 2 | 7 | 0 | 0
KKAA | 1639 | 824 | 50 | 815 | 50 | 40 | 2 | 784 | 48 | 267 | 34
AIAKE-C-term | 10 | 10 | 100 | 0 | 0 | 10 | 100 | 0 | 0 | 0 | 0
AIAKE | 161 | 55 | 34 | 106 | 66 | 11 | 7 | 44 | 27 | 11 | 25
CRAR | 199 | 127 | 64 | 72 | 36 | 0 | 0 | 127 | 64 | 42 | 33

Golgi apparatus (3)
YQRL | 442 | 212 | 48 | 230 | 52 | 10 | 2 | 202 | 46 | 83 | 41
YKGL | 632 | 304 | 48 | 328 | 52 | 5 | 1 | 299 | 47 | 143 | 48
YHPL | 150 | 70 | 47 | 80 | 53 | 7 | 5 | 65 | 43 | 29 | 45
Yxxh | 135637 | 62800 | 46 | 72837 | 54 | 859 | 1 | 62941 | 45 | 27729 | 44
NPFKD | 17 | 13 | 76 | 4 | 24 | 0 | 0 | 13 | 76 | 8 | 62
FxFxD | 4971 | 2513 | 51 | 2458 | 49 | 67 | 1 | 2446 | 49 | 1101 | 45
FQFND | 7 | 4 | 57 | 3 | 43 | 3 | 43 | 1 | 14 | 1 | 100
PxPxP | 8856 | 2766 | 31 | 4023 | 45 | 139 | 2 | 4694 | 53 | 3088 | 66
[DE] | 131139 | 59784 | 46 | 71355 | 54 | 834 | 1 | 58941 | 45 | 25843 | 44
GRIP-motif (5) | 11 | 11 | 100 | 0 | 0 | 10 | 90 | 1 | 10 | 1 | 100
GRIP-motif (shortened) (6) | 58 | 32 | 55 | 24 | 41 | 10 | 18 | 24 | 41 | 11 | 46

C-term variations (4)
PROSITE Pattern (7) | 232 | 197 | 85 | 35 | 15 | 167 | 72 | 30 | 13 | 13 | 43
[KH]DEL | 131 | 124 | 95 | 7 | 5 | 122 | 93 | 2 | 2 | 2 | 100
[KHR][DENQ]EL | 203 | 174 | 86 | 29 | 14 | 157 | 77 | 17 | 8 | 9 | 52
[KHR][DENQ] [87] L | 230 | 187 | 81 | 43 | 19 | 159 | 72 | 28 | 1 | 13 | 46
[KHRDENQAS] [DENQIYCV] [DENQ]L | 696 | 428 | 61 | 268 | 39 | 193 | 28 | 235 | 33 | 107 | 45
[KRDEAVYF][KRDEVYFMQ] [KHED][DK]EL | 80 | 59 | 74 | 21 | 26 | 50 | 63 | 9 | 11 | 5 | 55

 Columns: Total: number of proteins found in Swiss-Prot that have the respective motif; N: number of proteins found in subset; %: percentage of proteins in subset (column 'Total' gives 100%); Eukaryotes: all eukaryotic proteins; Non-Eukaryotes: since only eukaryotes have ER and Golgi, this column estimates a lower bound for the false positives; ER/Golgi: subset of eukaryotic proteins that have the respective motif and are experimentally known to be in either ER (for ER motifs) or Golgi (for Golgi motifs), this column gives a lower bound for the true positives (percentage is based on total number); Non-ER/Non-Golgi: subset of eukaryotic proteins that have the respective motif and either are experimentally known to be neither in ER nor in Golgi or do not contain any sub-cellular localization information in the Swiss-Prot database (percentage is based on total number); Non-Annotated: subset of Non-ER/Non-Golgi that does not contain any localization information in Swiss-Prot. The Non-Eukaryotes and Non-ER/Non-Golgi columns together provide the total percentage of false positives (FP). The total numbers were: ER = 1060, of which 324 (30%) were annotated as Probable, Putative or By Similarity and 72 (7%) were Viral/Prokaryotic/Archaea; Golgi = 495, of which 163 (33%) were annotated as Probable, Putative or By Similarity and 9 (2%) were Viral/Prokaryotic/Archaea.

1 'C-term' indicates the carboxy-terminal (last) residue of the protein; motifs are given by the one-letter code of the respective amino acids with the following conventions: [AG] means either A or G, 'x' stands for 'any' amino acid, 'h' stands for any hydrophobic amino acid.

2 Source of ER motifs: KDEL [19] , HDEL [19] , KKxx [51] [88] , xxRR [52] , KKFF [89] , KKAA [90, 91] , HDEF [50] , AIAKE [92] , CRAR [93] .

3 Source of Golgi motifs: YQRL [94] , YKGL [95] , YHPL [56] , Yxxh [55] , NPFKD [56] , FxFxD [58] , FQFND [58] , PxPxP [96] , shott [57] , GRIP-motif [59, 60] .

4 C-term variations: most of these motifs were compiled for this work.

5 The consensus pattern of the GRIP-motif is described by: 
[DEA]Y[LIT][KR][KHN][VI][VILF]XX[YF][MIL].

6 Shortened derivative of GRIP-motif: [DEA]Y[LIT][KR][KHN][VI][VILF]

7 ER retrieval motif found in PROSITE: [KHRQSA][DENQ]EL [36] .
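
The percentages in the ER/Golgi column of Table 1 can be read as rough lower-bound estimates for the probability of ER or Golgi localization given a motif. The sketch below recomputes this column for four ER rows; the dictionary layout and function name are assumptions for illustration, while the counts are copied from Table 1.

```python
# (total proteins with motif, proteins experimentally known in ER), from Table 1.
TABLE1_ER_COUNTS = {
    "KDEL-C-term": (67, 60),
    "HDEL-C-term": (64, 62),
    "KKxx-C-term": (907, 55),
    "AIAKE-C-term": (10, 10),
}


def percent_localized(total: int, n_localized: int) -> float:
    """Lower-bound estimate of P(localization | motif), in percent."""
    if total == 0:
        raise ValueError("motif was not found in any protein")
    return 100.0 * n_localized / total


for motif, (total, n_er) in TABLE1_ER_COUNTS.items():
    print(f"{motif}: {percent_localized(total, n_er):.0f}% known ER")
```

The estimate is a lower bound because proteins without any localization annotation count against the motif even though some of them may reside in the ER.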



Validating motifs against databases. For each motif found ( Table 1 ), we retrieved all proteins with this motif deposited in Swiss-Prot [5] , TrEMBL [5] , and PDB [4] . Next, we extracted a subset of proteins annotated in Swiss-Prot by their experimentally known sub-cellular localization. This subset, along with a grouping of all Swiss-Prot species into eukaryotes and non-eukaryotes, provided two means of assessing the specificity/accuracy of a given motif. The most specific ER motifs were KDEL and HDEL when restricted to the carboxy-terminus ( Table 1 ). These two retrieved 131 proteins from Swiss-Prot, most of which have already been experimentally characterised as 'retained in the ER' (data not shown). While the KDEL motif was also present in a few non-eukaryotic proteins, the HDEL motif was found in only two eukaryotic non-ER proteins, and those two were orthologues of the protein 'Protein Kinase C Substrate' in bovine and human (g19p_human and g19p_bovin). Whereas this identification of ER and Golgi localization from such motifs clearly seems very reliable, the data also illustrated the other problem of these two motifs: they occur frequently in non-ER proteins at positions other than the C-termini. In other words, in order to rely on KDEL/HDEL to infer localization, we must know the C-terminus of the full-length protein. All other ER motifs published were either very unspecific (found in many non-ER proteins) or far too specific (found in very few ER protein families), or both. For example, the Di-lysine motifs (KKxx and KxKxx) retrieved all known ER proteins when located at the C-terminal position of membrane proteins; however, the matches also included a set of 128 proteins (KKxx) and 378 proteins (KxKxx), most of which could not be classified as ER proteins ( Table 1 ). When these motifs (and the more difficult to distinguish N-terminal Di-arginine motif xxRR) were searched in proteins that are not membrane proteins, and even more so when the motifs were not restricted to the termini, the high sensitivity came at the cost of an extremely low specificity: both motifs were found in most non-ER proteins. In fact, over 80% of the matches were wrong. Overall, the information contained in the published Golgi motifs was even less promising. For example, the most sensitive GRIP-motif [59, 60] was found in 11 proteins, mainly orthologues of each other. A generalised GRIP-motif matched in slightly more proteins, none from the Golgi, and many from non-eukaryotic proteins. Similarly, Yxxh matched most known Golgi proteins; however, it also matched almost the entire Swiss-Prot database. Thus, only the C-terminal motifs KDEL, HDEL, and AIAKE suffice to accurately annotate ER proteins. All other experimentally characterised retention and recycle motifs for ER and Golgi need to be combined with other means of annotation.
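
The difference between a motif anywhere in the sequence and a motif restricted to the C-terminus can be illustrated with regular expressions. The sketch below translates the motif notation of Table 1 ('x' for any residue, 'h' for a hydrophobic residue, square brackets for alternatives) into a pattern. The set of residues treated as hydrophobic is an assumption made for this example, as are the function names.

```python
import re

# Assumption for this sketch: residues counted as hydrophobic for 'h'.
HYDROPHOBIC_CLASS = "[ACFILMVWY]"


def motif_to_regex(motif: str, c_terminal: bool = False) -> "re.Pattern[str]":
    """Translate motif notation into a compiled regular expression.

    'x' or 'X' outside brackets matches any residue, 'h' any hydrophobic
    residue; bracketed groups such as [KH] are kept as character classes.
    With c_terminal=True the motif must end at the last residue.
    """
    parts = []
    in_class = False
    for ch in motif:
        if ch == "[":
            in_class = True
            parts.append(ch)
        elif ch == "]":
            in_class = False
            parts.append(ch)
        elif in_class:
            parts.append(ch)
        elif ch in ("x", "X"):
            parts.append(".")
        elif ch == "h":
            parts.append(HYDROPHOBIC_CLASS)
        else:
            parts.append(re.escape(ch))
    pattern = "".join(parts) + (r"\Z" if c_terminal else "")
    return re.compile(pattern)


def has_motif(sequence: str, motif: str, c_terminal: bool = False) -> bool:
    """True if the protein sequence contains the motif."""
    return motif_to_regex(motif, c_terminal).search(sequence) is not None
```

A sequence such as MKDELTAYIA (invented for the example) contains KDEL but does not end with it, so it matches only the unrestricted search; this is the situation that makes the internal KDEL and HDEL matches in Table 1 so unspecific.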

 



ER and Golgi localization conserved at high levels of sequence similarity

We explored the power of using sequence similarity for the entire proteins to identify ER and Golgi proteins. Toward this end we had (1) to establish thresholds for sequence similarity that enable accurate inference by homology, and (2) to build family profiles of known ER/Golgi proteins. The final 'prediction step' requires searching with a query protein of unknown localization against these family profiles. We could have simplified this final step by aligning all query proteins against the known ER/Golgi proteins. However, sequence-profile alignments are more sensitive and more specific than sequence-sequence alignments. Note that we looked for similarities over the entire proteins, rather than for similarities between short signal peptides [61] .

ER and Golgi proteins correctly detected by homology at high levels of similarity. We aligned all experimentally known ER and Golgi (true positives) and all known non-ER and non-Golgi proteins (true negatives) by pairwise BLAST [62, 42] and by the more powerful PSI-BLAST [63] (Methods); alignments were ranked by expectation values [62] (EVAL), percentage pairwise sequence identity (PIDE), and the HSSP-value (HVAL eqn. 1). At HVAL=0, the accuracy for homology inference was 65% ( Fig. 1 top); it increased to 98% at HVAL>40. The majority of false positives (non-ER proteins) at high HSSP-values were of two specific types: heat shock protein 70 and elongation factor alpha. These two, large families are not exclusive to the ER rather they are also abundant in other cellular compartments. They caused the transition between the regions of mostly incorrect inference (HVAL<20) and mostly correct inference (HVAL>40) to be more gradual for ER than for Golgi proteins. The accuracy for Golgi proteins ( Fig. 1 bottom) was slightly higher than that for the ER proteins: 98% accuracy was reached at HVAL>20. We also investigated the effect from database bias [46] , confirming that biased data sets incorrectly suggested much higher levels of accuracy at all thresholds (data not shown). At high levels of accuracy, the coverage versus accuracy curve was slightly higher for HSSP-values than for expectation values (data not shown). Thus, we relied on the HSSP-value for the annotations of entire proteomes.
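
A threshold-based homology transfer of this kind could be sketched as follows. The HVAL thresholds are those listed in Table 2 (45 and 5 for ER, 23 and 7 for Golgi, corresponding to roughly 98% and 75-78% estimated accuracy); the input format, the confidence labels and the tie-breaking rule are assumptions made for this example.

```python
# HVAL thresholds taken from Table 2: 'high' ~98%, 'low' ~75% (ER) / ~78% (Golgi).
THRESHOLDS = {
    "ER": {"high": 45, "low": 5},
    "Golgi": {"high": 23, "low": 7},
}


def transfer_annotation(hits):
    """Infer localization for one query from its hits to trusted proteins.

    hits -- list of (compartment, hval) tuples, one per alignment between
            the query and a protein of experimentally known localization
    Returns (compartment, confidence) for the best-supported transfer,
    or None if no hit exceeds the low-accuracy threshold.
    """
    best = None
    for compartment, value in hits:
        levels = THRESHOLDS.get(compartment)
        if levels is None:
            continue  # compartment not covered by this sketch
        if value > levels["high"]:
            confidence = "high"
        elif value > levels["low"]:
            confidence = "low"
        else:
            continue
        # Prefer high-confidence transfers, then the larger HVAL.
        rank = (confidence == "high", value)
        if best is None or rank > best[0]:
            best = (rank, compartment, confidence)
    return None if best is None else (best[1], best[2])
```

For instance, a query with hits [("ER", 30), ("Golgi", 25)] would be labelled as a high-confidence Golgi protein, because 25 exceeds the Golgi threshold of 23 while 30 stays below the ER threshold of 45.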




Fig. 1 : Sequence conservation for Endoplasmic Reticulum (ER) and Golgi apparatus.
We aligned all experimentally annotated, sequence-unique ER and Golgi proteins (ER: top graphs, Golgi: bottom graphs) against all true negatives using BLAST (squares) and PSI-BLAST (circles). Solid lines with filled symbols describe cumulative accuracy/specificity (percentage of correctly identified localized proteins at given threshold, eqn. 2).

 


Detailed distinction of ER and Golgi proteins. We also collected data sets for more specific subsets of ER proteins: (1) lumenal, (2) proteins containing only the [KH]DEL motif (see below), (3) proteins containing the Swiss-Prot annotation PREVENT SECRETION FROM ER, and (4) membrane proteins. For the first three subsets of ER proteins, each sequence-unique set contained very few proteins: 21 of 131 total for the lumenal set, 14 of 102 total for those with a [KH]DEL motif, and 30 of 212 total for proteins with a Swiss-Prot ER retention annotation. While these sets were too specific and too small to establish reliable conservation thresholds, the detailed distinction of ER/Golgi-subtypes could be used to annotate proteomes. Although the sequence-unique sets of ER and Golgi membrane proteins were also rather small, we could analyse the sequence conservation for these subsets. We compared two different sets of true negatives: (i) all non-ER/non-Golgi proteins, and (ii) only non-ER/non-Golgi membrane proteins. Not surprisingly, inference by homology was more accurate when using the additional constraint that the protein had to be in the membrane ( Fig. 2 diamonds vs. triangles). Due to the small numbers of proteins in the set, the higher levels of accuracy may not hold in general. However, the data certainly supported the assumption that homology inference for ER/Golgi membrane proteins is at least as accurate as that for all other ER/Golgi proteins. We applied this result to searching ER/Golgi membrane proteins in proteomes.




Fig. 2 : Sequence conservation for ER and Golgi membrane proteins.
We aligned all sequence-unique ER (top graphs) and Golgi (bottom graphs) membrane proteins against all non-ER/non-Golgi proteins (triangles) and against all non-ER/non-Golgi membrane proteins (diamonds). Solid lines with filled symbols describe cumulative accuracy/specificity ( eqn. 2); dotted lines with open symbols describe cumulative coverage/selectivity ( eqn. 3). We measured sequence similarity (A) by the HSSP-value ( eqn. 1; left graphs), and (B) by the logarithm of the BLAST E-values (right graphs).

 


 



Annotating ER and Golgi proteins in six eukaryotic proteomes

Identifying ER and Golgi proteins in six proteomes. We aimed at annotating as many ER/Golgi proteins as possible through homology and retention and recycle signals in six entirely sequenced eukaryotes (yeast, weed, worm, fly, mouse, and human). Swiss-Prot currently annotates 257 ER and 204 Golgi proteins in these six proteomes ( Table 2 , column labelled 'ER(Golgi)-trusted'). Alignments using the trusted data sets added 718 potential ER and 800 potential Golgi proteins ( Table 2 , rows labelled by an HVAL corresponding to 98% accuracy). 41 of these proteins were previously annotated as 'Hypothetical protein'. Swiss-Prot also contains annotations for localization based on sequence similarity to proteins of experimentally known localization. In order to establish how many of the proteins identified by our homology-inference were also annotated by Swiss-Prot, we identified the closest Swiss-Prot homologue for each protein in any of the six proteomes from the PEP database (Predictions for Entire Proteomes) [64] . This revealed that most of our annotations at the threshold corresponding to 98% accuracy were also annotated by Swiss-Prot as either 'probable', 'putative' or 'by similarity'. In contrast, most putative annotations at sequence similarity thresholds that correspond to 75% accuracy are not annotated as ER/Golgi by Swiss-Prot. At this threshold, we could propose another 3304 possible ER and 1853 possible Golgi proteins ( Table 2 , rows labelled by (75%) and (78%) for ER and Golgi, respectively). While we expect that a majority of these annotations are likely to be false, these subsets constitute a good 'hunting-ground' for the discovery of uncharacterised ER and Golgi proteins. Overall, each experimental annotation in our trusted set yielded about 1-3 (lower value for ER, higher for Golgi) homology-transfers at high accuracy (98%) and about 5-6 at low accuracy (>75%). The entire set of results is publicly available at http://cubic.bioc.columbia.edu.

Identifying ER and Golgi membrane proteins. We found a total of 3941 putative ER and Golgi membrane proteins in the six proteomes at a threshold corresponding to 75% accuracy. In most proteomes we could expand reliable annotations (98% accuracy threshold) for ER membrane proteins between 2.5-fold (human) and 8-fold (weed). At the same accuracy threshold, we also identified ER membrane proteins in worm, for which our initial trusted set contained no ER membrane proteins ( Fig. 3 ). Homology inference allowed annotating between 190 (98% accuracy) and 759 (75% accuracy) Golgi membrane proteins ( Fig. 3 ). We also identified 155 possible lumenal ER proteins at 75% accuracy (data not shown) based solely on the much smaller but less reliable motif-only data sets. Identifying lumenal ER proteins is particularly relevant, as 82% of the current 257 experimentally annotated ER proteins within our data set are membrane-associated.



Table 2: ER and Golgi proteins in eukaryotic proteomes.

| Proteome | HVAL (%) | Total | ER(Golgi)-trusted | Annotated-ER(Golgi) | Annotated-other | Hypothetical |
|---|---|---|---|---|---|---|
| A. Endoplasmic reticulum: | | | | | | |
| Saccharomyces cerevisiae (yeast) | 45 (98) | 53 | 51 | 51 | 2 | 0 |
| | 5 (75) | 149 | | 64 | 85 | 14 |
| Arabidopsis thaliana (weed) | 45 (98) | 38 | 9 | 22 | 16 | 0 |
| | 5 (75) | 570 | | 126 | 444 | 9 |
| Caenorhabditis elegans (worm) | 45 (98) | 12 | 5 | 9 | 3 | 1 |
| | 5 (75) | 394 | | 96 | 298 | 138 |
| Drosophila melanogaster (fruit-fly) | 45 (98) | 17 | 8 | 14 | 3 | 0 |
| | 5 (75) | 367 | | 169 | 198 | 2 |
| Mus musculus (mouse) | 45 (98) | 289 | 82 | 269 | 20 | 0 |
| | 5 (75) | 860 | | 412 | 448 | 5 |
| Homo sapiens (human) | 45 (98) | 309 | 102 | 274 | 35 | 0 |
| | 5 (75) | 964 | | 426 | 538 | 8 |
| All 6 proteomes | 45 (98) | 718 | 257 | 639 | 79 | 1 |
| | 38 (95) | 830 | | 686 | 144 | 3 |
| | 27 (90) | 1098 | | 795 | 303 | 14 |
| | 17 (85) | 1528 | | 930 | 598 | 44 |
| | 10 (80) | 2328 | | 1151 | 1177 | 123 |
| | 5 (75) | 3304 | | 1293 | 2011 | 176 |
| B. Golgi apparatus: | | | | | | |
| Saccharomyces cerevisiae (yeast) | 23 (98) | 70 | 52 | 53 | 17 | 8 |
| | 7 (75) | 119 | | 55 | 64 | 10 |
| Arabidopsis thaliana (weed) | 23 (98) | 70 | 10 | 31 | 39 | 5 |
| | 7 (75) | 260 | | 80 | 180 | 15 |
| Caenorhabditis elegans (worm) | 23 (98) | 57 | 7 | 23 | 34 | 27 |
| | 7 (75) | 185 | | 52 | 133 | 96 |
| Drosophila melanogaster (fruit-fly) | 23 (98) | 61 | 9 | 40 | 21 | 0 |
| | 7 (75) | 185 | | 76 | 109 | 3 |
| Mus musculus (mouse) | 23 (98) | 195 | 47 | 145 | 50 | 0 |
| | 7 (75) | 425 | | 206 | 219 | 1 |
| Homo sapiens (human) | 23 (98) | 347 | 79 | 273 | 74 | 0 |
| | 7 (75) | 679 | | 357 | 322 | 9 |
| All 6 proteomes | 23 (98) | 800 | 204 | 565 | 235 | 40 |
| | 16 (95) | 1110 | | 675 | 435 | 66 |
| | 12 (90) | 1358 | | 728 | 630 | 99 |
| | 8 (85) | 1726 | | 812 | 914 | 125 |
| | 7 (78) | 1853 | | 826 | 1027 | 134 |

The first column identifies each eukaryotic proteome examined for ER (A) and Golgi (B) proteins. The second column, HVAL, i.e. the HSSP-value threshold (eqn. 1), marks the threshold for sequence similarity; the corresponding estimated percentage accuracy for inference by homology at this threshold (Fig. 1) is given in brackets; note that the 98% thresholds differ between ER (HVAL>45) and Golgi proteins (HVAL>23). All following columns give the number of proteins identified at the given thresholds for similarity: Total gives the total number of proteins found in each proteome; ER(Golgi)-trusted shows the number of proteins with experimental annotations in our trusted data sets of ER and Golgi; Annotated-ER(Golgi) gives the number of proteins for which the closest Swiss-Prot homologue (taken from the PEP database [64]) has ER(Golgi) annotations marked as PROBABLE, PUTATIVE or BY SIMILARITY; Annotated-other gives the number of proteins for which the closest Swiss-Prot homologue, irrespective of the similarity threshold, is not annotated as either ER (A) or Golgi (B); Hypothetical lists the number of proteins for which the only previous annotation was 'Hypothetical protein'. The entire set of results is publicly available at http://cubic.bioc.columbia.edu/ERGolgiDB/.
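As a quick consistency check on Table 2, the per-proteome counts can be summed and compared with the 'All 6 proteomes' rows. The Python sketch below uses only numbers copied from the table; the variable and function names are illustrative choices, and 'high' and 'low' simply denote the strictest and the most permissive threshold listed for each compartment.

```python
# "Total" counts per proteome, copied from Table 2.
# Order: yeast, weed, worm, fruit-fly, mouse, human.
TOTALS = {
    "ER": {"high": [53, 38, 12, 17, 289, 309],
           "low": [149, 570, 394, 367, 860, 964]},
    "Golgi": {"high": [70, 70, 57, 61, 195, 347],
              "low": [119, 260, 185, 185, 425, 679]},
}
# "ER(Golgi)-trusted" counts per proteome, same order.
TRUSTED = {"ER": [51, 9, 5, 8, 82, 102], "Golgi": [52, 10, 7, 9, 47, 79]}
# Values reported in the "All 6 proteomes" rows.
REPORTED = {
    "ER": {"high": 718, "low": 3304, "trusted": 257},
    "Golgi": {"high": 800, "low": 1853, "trusted": 204},
}


def check_table():
    """Return the (compartment, column) pairs whose sums disagree with the table."""
    mismatches = []
    for compartment in TOTALS:
        for level in ("high", "low"):
            if sum(TOTALS[compartment][level]) != REPORTED[compartment][level]:
                mismatches.append((compartment, level))
        if sum(TRUSTED[compartment]) != REPORTED[compartment]["trusted"]:
            mismatches.append((compartment, "trusted"))
    return mismatches


if __name__ == "__main__":
    print(check_table())  # prints [] because all sums agree with the table
```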





Fig. 3: Identification of ER and Golgi membrane proteins in six eukaryotic proteomes.
Gray bars (A: ER and B: Golgi) represent total counts of proteins within each set of proteomes. The first bar (darkest of each colour) represents the total number of annotated membrane proteins found in Swiss-Prot for each proteome. The next two bars (lighter in colour) represent the number of additional membrane proteins found at thresholds corresponding to 98% accuracy and 75% accuracy, respectively.

 



Closer inspection of a few examples. Many close homologues of ER and Golgi proteins also contained retention motifs, thereby increasing the reliability of the inference (Table 3). Specifically, the ER-lumenal protein Cyclophilin-B (Swiss-Prot identifier cypb_bovin, Table 3) from the cyclophilin-type peptidyl-prolyl cis-trans isomerase family is sequence-similar to proteins of that family that are not in the ER. However, since it also contains the ER recycle motif, we can correctly pick the annotation of the closest ER family member. Similarly, the heat shock proteins hs7c_caeel and hs7f_caeel differ mainly at their ends: hs7c_caeel contains a C-terminal KDEL motif and is ER, whereas hs7f_caeel lacks the motif and is mitochondrial. KDEL-like motifs can also be accurate enough to find more distant homologues in a protein family, for example in the evolutionarily conserved family of boca chaperones [65, 66]. Boca proteins have very little similarity to other known chaperones that reside in the ER (the closest homologue from our ER trusted list is protein disulfide isomerase in mouse; Swiss-Prot identifier pdi_mouse; HVAL to boca=-14 with only 13% identical residues). This lack of sequence similarity to other chaperones may be crucial for boca to specifically play a role in the quality control and trafficking of only one particular set of proteins, namely the LDL receptors. However, the six proteins resembling boca have variations of the generic and accurate [KH]DEL C-terminal ER recycle motifs that generally are far less accurate, in particular the C-terminal motifs KKEL, RDEL, and REDL (with an estimated accuracy of 61%, data not shown). REDL and KKEL were not present in any trusted ER protein; KKEL, in particular, was found in many non-ER proteins, including a ribosomal protein from mitochondrial DNA (Swiss-Prot identifier rt13_acaca). Why would some boca-homologues diverge to less specific motifs? The answer remains unclear.
Other regulatory mechanisms might prevent mis-localization for these proteins. All these examples illustrate how we can increase the accuracy in annotation by combining sequence similarity and motifs.

Recent large-scale experiments verified our estimates. Obviously, our homology-transfer based assignments become stronger with increasing experimental data. However, here, we also utilized large-scale experimental data for cross-examination of our estimates. The TRIPLES database annotates results from large-scale experimental annotations of localization for yeast [67, 39]. While TRIPLES distinguishes ER proteins, it does not classify Golgi proteins separately. Overall, TRIPLES annotates 74 ER proteins (and another 60 annotated as ER and something else). Of the proteins that we annotated at high accuracy and that were classified by TRIPLES, only one, ydl093w (Swiss-Prot identifier pmt5_yeast; HVAL=27 to pmt1_yeast), is not found to be ER by TRIPLES; rather, it is annotated as cytoplasmic. Given that ydl093w appears to have globular, non-membrane-associated regions in the cytoplasm, the TRIPLES result was actually compatible with our annotations. Overall, we identified 46 proteins as ER (at 75% accuracy/HVAL 5) for which TRIPLES had non-ER annotations. Similarly, we examined the Yeast GFP fusion localization database (YEAST-GFP) [40] that, in contrast to TRIPLES, also annotates Golgi proteins. Overall, YEAST-GFP classified 295 ER and 144 Golgi proteins (including proteins with experimental annotations for more than one compartment). GFP is usually fused at the C-terminus; as many ER-recycle signals are C-terminal, GFP fusions may be particularly difficult for ER proteins [40]. Nevertheless, the YEAST-GFP results agree very well with our findings at high accuracy (98%). At that level, we identified 31 ER proteins, 30 of which YEAST-GFP also classified as ER; the only exception was yal026c, the membrane calcium-transporting ATPase DRS2, which YEAST-GFP classifies as Golgi (members of this protein family have been identified in both the ER and the trans-Golgi network [68, 69]). Note that yal026c is classified as cytoplasmic by TRIPLES.
At the same level of accuracy (98%) we annotated an additional 23 ER proteins that were not classified by YEAST-GFP. At 98% accuracy, we annotated 19 proteins as Golgi that overlapped with YEAST-GFP; 13 of these were also classified as Golgi by YEAST-GFP, while the other six were classified as ER. Looking in more detail, we found annotations that may indicate a dual localization for five of the six (ydr498c-sc20_yeast, ynr026c-sc12_yeast, ycr067c-sed4_yeast, ygl054c-erv4_yeast, ygl145w-tp20_yeast; given are the genome identifiers and the identifiers of the corresponding annotated Swiss-Prot files). At 98% accuracy, we annotated an additional 54 Golgi proteins that were not classified by YEAST-GFP. Thus, the comparison to the latest large-scale experimental data suggested that overall our estimates were fairly accurate, and that even almost complete experimental coverage still requires homology-transfer for a considerable number of proteins.

Latest data surprisingly specific to yeast. The large-scale experimental results in TRIPLES and YEAST-GFP obviously make our efforts to annotate through homology-transfer much more powerful since they increase the number of proteins in the data sets of experimentally trusted proteins. In particular, the recent YEAST-GFP data increased our sets by a total of 108 ER proteins (30 of which corresponded to sequence-unique additions) and 49 (12 unique) Golgi proteins. Surprisingly, these additional annotations did not yield many highly reliable annotations for other proteomes: the 108 ER proteins identified only 12 non-yeast homologues at 98% accuracy (raising the total from 718 to 730) and 554 at 75% (total from 3304 to 3638). The 49 Golgi proteins yielded another 7 non-yeast proteins at 98% (total to 856) and another 187 at 78% (total to 2040). The increases from the YEAST-GFP data at lower accuracy were similar to the yield of our original trusted data, e.g. each trusted ER protein from our original set (676) yielded about four annotations in the proteomes (2949 when subtracting yeast). In contrast, the original data yielded about 2/3 annotations at high accuracy (676 yielded 459 annotations), while the new data gave only 1/10 new annotations (108 yielded 12 annotations). Thus, the new data turned out to be fairly specific to yeast.
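The yield figures quoted above follow from simple ratios of homology-transfers to experimental annotations. The short Python sketch below reproduces that arithmetic using only the counts given in the paragraph; the function name `yield_per_annotation` is an illustrative choice.

```python
def yield_per_annotation(n_trusted, n_transfers):
    """Homology-transfers obtained per experimentally annotated protein."""
    return n_transfers / n_trusted


# Counts quoted in the paragraph above (ER proteins, non-yeast proteomes).
original_low = yield_per_annotation(676, 2949)   # original trusted set, low accuracy
original_high = yield_per_annotation(676, 459)   # original trusted set, high accuracy
new_high = yield_per_annotation(108, 12)         # YEAST-GFP additions, high accuracy

if __name__ == "__main__":
    # prints 4.36 0.68 0.11, i.e. about four, about 2/3 and about 1/10
    print(f"{original_low:.2f} {original_high:.2f} {new_high:.2f}")
```

The roughly six-fold drop from 0.68 to 0.11 at high accuracy is the basis for calling the new data fairly specific to yeast.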



Table 3: ER and Golgi homologues with differently annotated localization.*

| True positive | Homologue | PIDE | HVAL | EVAL | Localization |
|---|---|---|---|---|---|
| ER | | | | | |
| hs7c_caeel | hs7f_caeel | 52 | 25 | -147 | Mitochondria |
| aca2_arath | aca1_arath | 75 | 52 | 0 | Chloroplast |
| cypb_bovin | cyp4_bovin | 48 | 28 | -47 | Cytoplasm |
| dha4_rat | dhac_rat | 26 | 5 | -33 | Cytoplasm |
| scj1_yeast | sis1_yeast | 44 | 3 | -17 | Nucleus |
| calu_mouse | cb45_mouse | 27 | 3 | -32 | Golgi |
| bip5_tobac | gr75_mouse | 52 | 28 | -157 | Mitochondria |
| fd61_soybn | fd6c_soybn | 25 | -1 | -23 | Chloroplast |
| Golgi | | | | | |
| maba_rat | pspd_rat | 35 | 8 | -32 | Extracellular |
| rb6b_human | ypt7_yeast | 35 | 7 | -31 | Vacuole |
| ynd1_yeast | ntpa_pea | 26 | 3 | -28 | Nucleus |
| arse_human | ids_human | 35 | 0 | -17 | Lysosome |
| cb45_mouse | rcn1_mouse | 26 | 3 | -30 | ER |
* ER/Golgi proteins and their close homologues are presented using the Swiss-Prot identifier; bold identifiers mark proteins containing ER retrieval signals. PIDE is the percentage of identical residues, HVAL is the HSSP-value (eqn. 1), and EVAL is the PSI-BLAST expectation value rounded to the closest exponent, each measured between the two proteins of a pair.
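For readers who want to relate PIDE and HVAL, the following Python sketch computes an HSSP-value from percentage identity and alignment length. Eqn. 1 is not reproduced in this section, so the functional form used here is an assumption: it is the HSSP-curve as commonly stated following [45], and the function names `hssp_curve` and `hssp_value` are illustrative.

```python
import math


def hssp_curve(length):
    """Length-dependent identity threshold (%) of the HSSP-curve.

    Assumed form (not taken from this section): 100 for alignments of up to
    11 residues, 19.5 for alignments longer than 450 residues, and a
    power-law decay in between.
    """
    if length <= 11:
        return 100.0
    if length <= 450:
        exponent = -0.32 * (1.0 + math.exp(-length / 1000.0))
        return 480.0 * length ** exponent
    return 19.5


def hssp_value(pide, length):
    """Distance (in percentage points) of an alignment from the HSSP-curve."""
    return pide - hssp_curve(length)


if __name__ == "__main__":
    # A long alignment at 50% identity lies well above the curve.
    print(hssp_value(50.0, 500))  # prints 30.5
```

Under this assumed form, the difference PIDE - HVAL in each row of Table 3 corresponds to the curve value at the respective alignment length, which is why pairs with similar PIDE can have different HVAL.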


Proteins with KDEL and HDEL diverged less than expected. Only 114 proteins in all six eukaryotic proteomes contained the C-terminal ER recycle motifs KDEL or HDEL. Of these 114 proteins, 39 have a close homologue in Swiss-Prot that is experimentally annotated as ER. Assuming that all proteins with C-terminal KDEL or HDEL motifs identified in the eukaryotic proteomes are retained in the ER, we can analyse to what extent these proteins could have been identified based on homology alone. Somewhat surprisingly, almost half of all these proteins mapped to proteins in our trusted profile families above thresholds corresponding to >80% accuracy (Fig. 4), and about 60% were found in proteins that shared a similar fold with one of the proteins in the trusted families (HVAL=0, grey line in Fig. 4). All of the [KH]DEL proteins mapped to trusted families at HVALs <-10. Although this value was too low to infer similarities, it was significantly higher than the comparable values for all ER proteins (Fig. 1). This observation suggested a rather puzzling conclusion: on the one hand, there is no good reason to expect that two proteins containing C-terminal [KH]DEL motifs are evolutionarily related; on the other hand, the majority of the proteins with this motif have not diverged very much from the ER proteins we know. In other words, proteins with this motif diverged, on average, less than proteins with the same nuclear localization signals [32, 35].
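Scanning a proteome for the C-terminal motifs KDEL or HDEL amounts to a simple pattern match at the end of each sequence. The Python sketch below illustrates this; the sequences are hypothetical toy examples, not real proteins, and the helper name `has_c_terminal_khdel` is an assumption of this sketch.

```python
import re

# C-terminal ER recycle motifs named in the text: KDEL or HDEL at the very end.
KHDEL = re.compile(r"[KH]DEL$")


def has_c_terminal_khdel(sequence):
    """Return True if the protein sequence ends in KDEL or HDEL."""
    return KHDEL.search(sequence.strip().upper()) is not None


# Hypothetical toy sequences, for illustration only.
EXAMPLES = {
    "toy_lumenal": "MKLSLVAAMLLLLSAARAEEKDEL",  # motif at the C-terminus
    "toy_internal": "MSTNPKPQRKDELTKR",         # motif present, but not C-terminal
}

if __name__ == "__main__":
    for name, seq in EXAMPLES.items():
        print(name, has_c_terminal_khdel(seq))
```

Anchoring the pattern at the end of the sequence matters: an internal KDEL, as in the second toy sequence, does not count as a C-terminal recycle motif.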




Fig. 4 : Mapping [KH]DEL proteins onto the trusted set.
We retrieved all proteins in six eukaryotic proteomes that contain the motifs KDEL or HDEL and aligned these to all ER proteins in our trusted families. The y-axis gives the percentage of [KH]DEL proteins found above a certain HSSP-value ( eqn. 1; for comparison the estimates for accuracy are copied from Fig. 1 to the right y-axis). The grey line at an HSSP-value of 0 marks the point above which protein pairs share a similar 3D fold.
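The curve in Fig. 4 is a cumulative percentage: for each HSSP-value, the share of [KH]DEL proteins whose best alignment to a trusted family lies above that value. A minimal Python sketch of this computation follows; the list of best-hit HVALs is hypothetical and serves only to show the mechanics.

```python
def percent_above(best_hvals, threshold):
    """Percentage of proteins whose best HVAL to a trusted family exceeds threshold."""
    if not best_hvals:
        return 0.0
    n_above = sum(1 for hval in best_hvals if hval > threshold)
    return 100.0 * n_above / len(best_hvals)


def cumulative_curve(best_hvals, thresholds):
    """Return (threshold, percentage above threshold) pairs for plotting."""
    return [(t, percent_above(best_hvals, t)) for t in thresholds]


# Hypothetical best-hit HVALs for ten proteins, for illustration only.
TOY_HVALS = [-8, -3, 0, 2, 6, 12, 20, 35, 48, 60]

if __name__ == "__main__":
    for threshold, percent in cumulative_curve(TOY_HVALS, [-10, 0, 10]):
        print(threshold, percent)
```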


 

 

Conclusions

Annotating sub-cellular localization is an essential aspect of large-scale experimental endeavours. Most methods that predict localization in the absence of experimental annotations either leave out ER and Golgi or predict them with limited success, owing to the lack of experimental data [70, 71, 72, 73, 37, 25, 32, 28, 74, 75, 76, 77, 78, 41, 79, 80, 81, 82]. We introduced estimates for the accuracy in inferring ER and Golgi localization based on experimentally characterised short sequence motifs and based on sequence similarity to experimentally characterised proteins (homology-transfer). Our results suggested that most of the few currently known retention and recycle motifs for ER and Golgi proteins are too specific and/or too inaccurate to be used in isolation to automatically annotate entire proteomes. Nevertheless, these motifs might be crucial indicators provided we know their reliability (Table 1). Providing such estimates was one of the major tasks that we addressed here. Similarly, we found that even very high levels of sequence similarity might not suffice to infer ER and Golgi localization without errors (Table 3). In fact, we confirmed our previous findings [83] that only extremely high levels of pairwise sequence identity (>80%, Fig. 1) or extremely low PSI-BLAST expectation values (<10^-100, Fig. 1 right panel) enabled accurate homology-transfer, and only at low coverage (few true positives identified at the threshold). When we also considered alignment length (HVAL, eqn. 1), we could find thresholds for reliable homology-transfer at significantly higher levels of coverage (Fig. 1 central panels). Similar observations have been made for other cellular organelles [41]. We therefore applied these safe HVAL thresholds to the six entirely sequenced eukaryotes (human, mouse, fly, worm, weed, yeast). Knowing what to expect on average at a given threshold for sequence similarity was essential for such large-scale homology-transfer. Combining motifs and sequence similarity yielded the most reliable annotations (Fig. 4).
Another way to increase the reliability of inferring ER and Golgi proteins was to separate lumenal and membrane proteins (Fig. 2). This might be explored in the context of other targeting mechanisms for membrane proteins [84, 85]. Our exploration of ER and Golgi resident proteins can also assist large-scale proteomic endeavours, which often do not cover the entire proteome, can be limited to specific organelles, or are unable to distinguish between lumenal and membrane proteins (Table 2). Finally, more specific methods based on intrinsic protein structural features such as membrane boundaries [80], post-translational modifications or functional domains [86] may improve ER and Golgi localization prediction so that it encompasses all proteins localized within these cellular compartments.

 



Acknowledgements

Thanks to Jinfeng Liu (Columbia) for computer assistance and to Rajesh Nair (Columbia) and Dariusz Przybylski (Columbia) for providing preliminary information, programs and knowledgeable discussions. Thanks to Richard Mann (Columbia) for bringing boca to our attention and for helpful discussions on this family. This work was supported by the grant DBI-0131168 from the National Science Foundation (NSF) and by a grant to the Northeast Structural Genomics Consortium from the Protein Structure Initiative of National Institutes of Health (P50 GM62413). Last, but not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.

 

 

References

1. Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-599.
2. Rost, B. & Sander, C. (1994). Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19, 55-72.
3. Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol., 266, 525-539.
4. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucl. Acids Res., in press.
5. Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res., 28, 45-8.
6. Pfeffer, S. R. & Rothman, J. E. (1987). Biosynthetic protein transport and sorting by the endoplasmic reticulum and Golgi. Annu Rev Biochem, 56, 829-52.
7. Pemberton, L. F., Blobel, G. & Rosenblum, J. S. (1998). Transport routes through the nuclear pore complex. Curr. Opin. Cell Biol., 10, 392-399.
8. Adam, S. A. (1999). Transport pathways of macromolecules between the nucleus and the cytoplasm. Curr. Opin. Cell Biol., 11, 402-6.
9. Chen, X. & Schnell, D. J. (1999). Protein import into chloroplasts. Trends in Cell Biology, 9, 222-227.
10. Hettema, E. H., Distel, B. & Tabak, H. F. (1999). Import of proteins into peroxisomes. Biochim. Biophys. Acta, 1451, 17-34.
11. Hood, J. K. & Silver, P. A. (1999). In or out? Regulating nuclear transport. Curr. Opin. Cell Biol., 11, 241-247.
12. Koehler, C. M., Merchant, S. & Schatz, G. (1999). How membrane proteins travel across the mitochondrial intermembrane space. TIBS, 24, 428-432.
13. Darnell, J., Lodish, H. & Baltimore, D. (1990). Molecular cell biology. Freeman, New York.
14. von Heijne, G. (1985). Signal sequences. The limits of variation. J. Mol. Biol., 184, 99-105.
15. Briggs, M. S. & Gierasch, L. M. (1986). Molecular mechanisms of protein secretion: The role of the signal sequence. Adv. Prot. Chem., 38, 109-180.
16. Sjöström, M., Wold, S., Wieslander, Å. & Rilfors, L. (1987). Signal peptide amino acid sequences in Escherichia coli contain information related to final protein localization. EMBO J., 6, 823-831.
17. Verner, K. & Schatz, G. (1988). Protein translocation across membranes. Science, 241, 1307-1313.
18. Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Prot. Engin., 10, 1-6.
19. Pelham, H. R. (1990). The retention signal for soluble proteins of the endoplasmic reticulum. TIBS, 15, 483-6.
20. Gleeson, P. A. (1998). Targeting of proteins to the Golgi apparatus. Histochem Cell Biol, 109, 517-32.
21. Nickel, W. (2003). The mystery of nonclassical protein secretion. A current view on cargo proteins and potential export routes. Eur. J. Biochem., 270, 2109-19.
22. Teasdale, R. D. & Jackson, M. R. (1996). Signal-mediated sorting of membrane proteins between the endoplasmic reticulum and the golgi apparatus. Annu Rev Cell Dev Biol, 12, 27-54.
23. Mellman, I. & Warren, G. (2000). The road taken: past and future foundations of membrane traffic. Cell, 100, 99-112.
24. Saint-Jore-Dupas, C., Gomord, V. & Paris, N. (2004). Protein localization in the plant Golgi apparatus and the trans-Golgi network. Cell Mol Life Sci, 61, 159-71.
25. Nielsen, H., Brunak, S. & von Heijne, G. (1999). Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Prot. Engin., 12, 3-9.
26. Traub, L. M. & Kornfeld, S. (1997). The trans-Golgi network: a late secretory sorting station. Curr. Opin. Cell Biol., 9, 527-33.
27. Ghosh, P., Dahms, N. M. & Kornfeld, S. (2003). Mannose 6-phosphate receptors: new twists in the tale. Nat Rev Mol Cell Biol, 4, 202-12.
28. Emanuelsson, O., Nielsen, H., Brunak, S. & von Heijne, G. (2000). Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005-16.
29. Emanuelsson, O., Nielsen, H. & von Heijne, G. (1999). ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Prot. Sci., 8, 978-84.
30. Emanuelsson, O., Elofsson, A., von Heijne, G. & Cristobal, S. (2003). In silico prediction of the peroxisomal proteome in fungi, plants and animals. J. Mol. Biol., 330, 443-56.
31. Claros, M. G., Brunak, S. & von Heijne, G. (1997). Prediction of N-terminal protein sorting signals. Curr. Opin. Str. Biol., 7, 394-398.
32. Cokol, M., Nair, R. & Rost, B. (2000). Finding nuclear localization signals. EMBO Rep., 1, 411-5.
33. la Cour, T., Gupta, R., Rapacki, K., Skriver, K., Poulsen, F. M. et al. (2003). NESbase version 1.0: a database of nuclear export signals. Nucl. Acids Res., 31, 393-6.
34. Nair, R., Carter, P. & Rost, B. (2003). NLSdb: database of nuclear localization signals. Nucl. Acids Res., 31, 397-9.
35. Nair, R. & Rost, B. (2003). LOC3D: annotate sub-cellular localization for protein structures. Nucl. Acids Res., 31, 3337-40.
36. Bucher, P. & Bairoch, A. (1994). A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. Proc Int Conf Intell Syst Mol Biol, 2, 53-61.
37. Nakai, K. & Horton, P. (1999). PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. TIBS, 24, 34-6.
38. Kimata, Y., Ooboki, K., Nomura-Furuwatari, C., Hosoda, A., Tsuru, A. et al. (2000). Identification of a novel mammalian endoplasmic reticulum-resident KDEL protein using an EST database motif search. Gene, 261, 321-7.
39. Kumar, A., Cheung, K. H., Tosches, N., Masiar, P., Liu, Y. et al. (2002). The TRIPLES database: a community resource for yeast molecular biology. Nucl. Acids Res., 30, 73-5.
40. Huh, W. K., Falvo, J. V., Gerke, L. C., Carroll, A. S., Howson, R. W. et al. (2003). Global analysis of protein localization in budding yeast. Nature, 425, 686-91.
41. Nair, R. & Rost, B. (2002). Sequence conserved for subcellular localization. Prot. Sci., 11, 2836-47.
42. Altschul, S. F. & Gish, W. (1996). Local alignment statistics. Meth. Enzymol., 266, 460-480.
43. Przybylski, D. & Rost, B. (2002). Alignments grow, secondary structure prediction improves. Proteins, 46, 195-205.
44. Sander, C. & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins, 9, 56-68.
45. Rost, B. (1999). Twilight zone of protein sequence alignments. Prot. Engin., 12, 85-94.
46. Rost, B. (2002). Enzyme function less conserved than anticipated. J. Mol. Biol., 318, 595-608.
47. Wrzeszczynski, K. O. & Rost, B. (2004). Cataloging proteins in cell cycle control. Methods Mol Biol, 241, 219-33.
48. Mika, S. & Rost, B. (2003). UniqueProt: creating sequence-unique protein data sets. Nucl. Acids Res., submitted.
49. Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. (1999). The PROSITE database, its status in 1999. Nucl. Acids Res., 27, 215-219.
50. Yabe, D., Nakamura, T., Kanazawa, N., Tashiro, K. & Honjo, T. (1997). Calumenin, a Ca2+-binding protein retained in the endoplasmic reticulum with a novel carboxyl-terminal sequence, HDEF. J. Biol. Chem., 272, 18232-9.
51. Jackson, M. R., Nilsson, T. & Peterson, P. A. (1990). Identification of a consensus motif for retention of transmembrane proteins in the endoplasmic reticulum. EMBO J., 9, 3153-62.
52. Schutze, M. P., Peterson, P. A. & Jackson, M. R. (1994). An N-terminal double-arginine motif maintains type II membrane proteins in the endoplasmic reticulum. EMBO J., 13, 1696-705.
53. Zerangue, N., Schwappach, B., Jan, Y. N. & Jan, L. Y. (1999). A new ER trafficking signal regulates the subunit stoichiometry of plasma membrane K(ATP) channels. Neuron, 22, 537-48.
54. Kirchhausen, T. (2002). Clathrin adaptors really adapt. Cell, 109, 413-6.
55. Bos, K., Wraight, C. & Stanley, K. K. (1993). TGN38 is maintained in the trans-Golgi network by a tyrosine-containing motif in the cytoplasmic domain. EMBO J., 12, 2219-28.
56. Rohrer, J. & Kornfeld, R. (2001). Lysosomal Hydrolase Mannose 6-Phosphate Uncovering Enzyme Resides in the trans-Golgi Network. Mol Biol Cell, 12, 1623-31.
57. Bannykh, S. I., Nishimura, N. & Balch, W. E. (1998). Getting into the Golgi. Trends Cell Biol, 8, 21-5.
58. Nothwehr, S. F., Roberts, C. J. & Stevens, T. H. (1993). Membrane protein retention in the yeast Golgi apparatus: dipeptidyl aminopeptidase A is retained by a cytoplasmic signal containing aromatic residues. J. Cell Biol., 121, 1197-209.
59. Kjer-Nielsen, L., Teasdale, R. D., van Vliet, C. & Gleeson, P. A. (1999). A novel Golgi-localisation domain shared by a class of coiled-coil peripheral membrane proteins. Curr. Biol., 9, 385-8.
60. Munro, S. & Nichols, B. J. (1999). The GRIP domain - a novel Golgi-targeting domain found in several coiled-coil proteins. Curr. Biol., 9, 377-80.
61. Nielsen, H., Engelbrecht, J., von Heijne, G. & Brunak, S. (1996). Defining a similarity threshold for a functional protein sequence pattern: the signal peptide cleavage site. Proteins, 24, 165-177.
62. Altschul, S. F. (1993). A protein alignment scoring system sensitive at all evolutionary distances. J. Mol. Evol., 36, 290-300.
63. Altschul, S., Madden, T., Shaffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389-3402.
64. Carter, P., Liu, J. & Rost, B. (2003). PEP: Predictions for Entire Proteomes. Nucl. Acids Res., 31, 410-3.
65. Culi, J. & Mann, R. S. (2003). Boca, an endoplasmic reticulum protein required for wingless signaling and trafficking of LDL receptor family members in Drosophila. Cell, 112, 343-54.
66. Herz, J. & Marschang, P. (2003). Coaxing the LDL receptor family into the fold. Cell, 112, 289-92.
67. Ross-Macdonald, P., Coelho, P. S., Roemer, T., Agarwal, S., Kumar, A. et al. (1999). Large-scale analysis of the yeast genome by transposon tagging and gene disruption. Nature, 402, 413-8.
68. Hua, Z., Fatheddin, P. & Graham, T. R. (2002). An essential subfamily of Drs2p-related P-type ATPases is required for protein trafficking between Golgi complex and endosomal/vacuolar system. Mol Biol Cell, 13, 3162-77.
69. Hua, Z. & Graham, T. R. (2003). Requirement for neo1p in retrograde transport from the Golgi complex to the endoplasmic reticulum. Mol Biol Cell, 14, 4971-83.
70. Nakashima, H. & Nishikawa, K. (1994). Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol., 238, 54-61.
71. Andrade, M. A., O'Donoghue, S. I. & Rost, B. (1998). Adaptation of protein surfaces to subcellular location. J. Mol. Biol., 276, 517-25.
72. Reinhardt, A. & Hubbard, T. (1998). Using neural networks for prediction of the subcellular location of proteins. Nucl. Acids Res., 26, 2230-6.
73. Chou, K. C. & Elrod, D. W. (1999). Protein subcellular location prediction. Prot. Engin., 12, 107-18.
74. Nakai, K. (2000). Protein sorting signals and prediction of subcellular localization. Adv Protein Chem, 54, 277-344.
75. Fujiwara, Y. & Asogawa, M. (2001). Prediction of subcellular localizations using amino acid composition and order. Genome Inform Ser Workshop Genome Inform, 12, 103-12.
76. Chou, K. C. & Cai, Y. D. (2002). Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem., 277, 45765-9.
77. Emanuelsson, O. (2002). Predicting protein subcellular localisation from amino acid sequence information. Brief Bioinform, 3, 361-76.
78. Mott, R., Schultz, J., Bork, P. & Ponting, C. P. (2002). Predicting protein cellular localization using a domain projection method. Genome Res., 12, 1168-74.
79. Nair, R. & Rost, B. (2002). Inferring sub-cellular localization through automated lexical analysis. Bioinformatics, 18 Suppl 1, S78-S86.
80. Yuan, Z. & Teasdale, R. D. (2002). Prediction of Golgi Type II membrane proteins based on their transmembrane domains. Bioinformatics, 18, 1109-15.
81. Nair, R. & Rost, B. (2003). Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins, 53, 917-30.
82. Zhou, G. P. & Doctor, K. (2003). Subcellular location prediction of apoptosis proteins. Proteins, 50, 44-8.
83. Rost, B., Liu, J., Nair, R., Wrzeszczynski, K. O. & Ofran, Y. (2003). Automatic prediction of protein function. Cell Mol Life Sci, 60, 2637-50.
84. Rayner, J. C. & Pelham, H. R. (1997). Transmembrane domain-dependent sorting of proteins to the ER and plasma membrane in yeast. EMBO J., 16, 1832-41.
85. Wattenberg, B. & Lithgow, T. (2001). Targeting of C-terminal (tail)-anchored proteins: understanding how cytoplasmic activities are anchored to intracellular membranes. Traffic, 2, 66-71.
86. Ponting, C. P. (2000). Proteins of the endoplasmic-reticulum-associated degradation pathway: domain detection and function prediction. Biochem. J., 351 Pt 2, 527-35.
87. El Hamel, C., Chevalier, S., De, E., Orange, N. & Molle, G. (2001). Isolation and characterisation of the major outer membrane protein of Erwinia carotovora. Biochim Biophys Acta, 1515, 12-22.
88. Itin, C., Kappeler, F., Linstedt, A. D. & Hauri, H. P. (1995). A novel endocytosis signal related to the KKXX ER-retrieval signal. EMBO J., 14, 2250-6.
89. Itin, C., Schindler, R. & Hauri, H. P. (1995). Targeting of protein ERGIC-53 to the ER/ERGIC/cis-Golgi recycling pathway. J. Cell Biol., 131, 57-67.
90. Andersson, H., Kappeler, F. & Hauri, H. P. (1999). Protein targeting to endoplasmic reticulum by dilysine signals involves direct retention in addition to retrieval. J. Biol. Chem., 274, 15080-4.
91. Dogic, D., Dubois, A., de Chassey, B., Lefkir, Y. & Letourneur, F. (2001). ERGIC-53 KKAA signal mediates endoplasmic reticulum retrieval in yeast. Eur. J. Cell Biol., 80, 151-5.
92. Arber, S., Krause, K. H. & Caroni, P. (1992). s-cyclophilin is retained intracellularly via a unique COOH-terminal sequence and colocalizes with the calcium storage protein calreticulin. J. Cell Biol., 116, 113-25.
93. Boyd, G. W., Doward, A. I., Kirkness, E. F., Millar, N. S. & Connolly, C. N. (2003). Cell surface expression of 5-HT3 receptors is controlled by an endoplasmic reticulum retention signal. J. Biol. Chem.
94. Humphrey, J. S., Peters, P. J., Yuan, L. C. & Bonifacino, J. S. (1993). Localization of TGN38 to the trans-Golgi network: involvement of a cytoplasmic tyrosine-containing sequence. J. Cell Biol., 120, 1123-35.
95. Voorhees, P., Deignan, E., van Donselaar, E., Humphrey, J., Marks, M. S. et al. (1995). An acidic sequence within the cytoplasmic domain of furin functions as a determinant of trans-Golgi network localization and internalization from the cell surface. EMBO J., 14, 4961-75.
96. Ugur, O. & Jones, T. L. (2000). A proline-rich region and nearby cysteine residues target XLalphas to the Golgi complex region. Mol Biol Cell, 11, 1421-32.

Contact:    rost@columbia.edu Version:    Mar 27, 2004