Short yeast ORFs: expressed protein or not?
Dept Biochemistry & Mol Biophysics, 630 West 168th street, New York, NY 10032; rost@columbia.edu; http://dodo.bioc.columbia.edu/~rost/
contact e-mail:rost@columbia.edu
Sequencing the entire genome of Saccharomyces cerevisiae (yeast)
revealed about 2500 ORFs with fewer than 100 residues. Most of
these supposedly do not correspond to expressed proteins. However,
some do. How could theory help to separate the wheat from the
chaff? Here, I introduced a simple measure for the 'globularity'
of a protein. I used this measure to develop a novel method that
firstly predicted the globularity, and secondly compared the predicted
globularity to the database background. The difference between
these two values provided an indication for how likely a sequence
would adopt a typical globular protein structure. On average,
globular domains differed from randomly chosen fragments of these
domains and - to some extent - from native protein chains extending
over the core domain. Thus, the method might be useful for predicting
domains from sequence. Analysing a set of 2427 short yeast ORFs,
I first predicted membrane helices. The results indicated that
most short ORFs with putative membrane helices would not correspond
to expressed proteins. Assuming that ORFs are more likely expressed
if similar to typical globular proteins, I sorted all short yeast
ORFs according to their predicted globularity. About half of
the non-membrane ORFs resembled globular domains.
Key words: genome sequence analysis, predicting globularity, protein domains, protein structure prediction, solvent accessibility, multiple alignments, transmembrane helices.
The problem. Large genome sequencing projects generate protein sequences at breath-taking speed. We already know all sequences for more than 16 entire organisms [1, 2] . One problem is the assembly of short segments. For sequences of fewer than 100 residues it is technically difficult to determine whether or not the corresponding open reading frame (ORF) corresponds to an expressed protein. A prominent example of this problem was encountered when sequencing yeast [9, 10, 11, 12] . More than 2000 (i.e. about one fourth) of all open reading frames from the yeast genome are shorter than 100 residues. By comparing the distribution of the lengths of all proteins in yeast and other organisms, we can infer that many of the short proteins are supposedly not expressed [13] (Casari, unpublished). However, can we also make theoretical statements about the individual genes, i.e. can we predict whether or not a particular short ORF differs from expressed proteins, and thus is not likely to be expressed?
Addressing the problem by straightforward database searches. A database search is a straightforward solution to the problem of predicting whether or not a short open reading frame corresponds to an expressed protein. (1) The short open reading frame in question is used for a search against a database of expressed proteins (such as SWISS-PROT [4] ). (2) 'Hits' are considered if and only if the entire protein found matches the search sequence. Unfortunately, this simple protocol did not yield many assignments for the short open reading frames in yeast (Results).
The idea for a database-independent solution. Here, I
described an alternative, novel solution to the problem of identifying
expressed proteins in a set of short open reading frames. The
principal assumption was that expressed proteins would be more
likely to 'look like globular proteins' than open reading frames
corresponding to non-expressed proteins ( Fig. 1 ). The method
proceeded as follows. (1) Trivial cases were spotted by database
comparisons. (2) Proteins with putative membrane regions were
detected, and excluded. (3) The remaining open reading frames
(the majority) were ranked by their predicted 'globularity'. Additionally,
I explored to what extent database searches indicated 'globularity'.
Indeed, only a small fraction of the database hits turned out
to permit strong conclusions about the expression of the respective
ORFs.
Fig. 1. Globularity of a protein structure. The principal idea was that ORFs are less likely to be expressed if the resulting protein would not be sufficiently 'globular' to fold into a typical protein structure. The most globular (lowest ratio of surface to volume) protein unit is the domain. Protein chains are - on average - clearly less globular than are domains. Fragments generated by cutting a number of consecutive residues from a protein domain should also be less globular than the domain.
Globular proteins resemble a simple model of closely packed spheres.
Proteins exhibit densities characteristic of solid states [14] .
This was reflected by the good correlation between the observed number
of exposed residues and the simple model of proteins composed of
closely packed residue spheres ( eqn. 2, Fig. 2 ). Considering a residue as exposed when more than 16%
of its surface was accessible to solvent implied that about
half the residues were exposed. Interestingly, the relation between
the number of residues more than 9% exposed and the protein length
fell between the two simple models of spheres and cubes (eqs.
2 and 1, data not shown).
Fig. 2. Protein surface vs. protein length. For all globular protein domains the number of exposed residues (relative accessibility > 16%) was related to the length of the protein. Obviously, the observed points (crosses) could be approximated by the simple model of proteins as bodies consisting of perfect spheres (the residues; line with open circles). The even simpler model describing residues as cubes over-estimated the number of residues on the protein surface (line with open rectangle).
Overall number of exposed residues well predicted. The
correlation between the number of exposed residues predicted by
PHDacc [15, 7] , and the number of residues observed (according
to DSSP [3] ) was, on average, rather high ( Fig. 3 A). Nevertheless,
the relation between protein length and predicted number of exposed
residues was better approximated by a slight correction to the
over-simplified model of dense spheres ( eqn. 3 ). In particular,
the free parameters were chosen to be a = 0.84 and b = 0.41
( eqn. 3 ); the observed data were approximated by a = 1 and
b = 1.2 (instead of using eqn. 1, eqn. 2 ).
Both resulting distributions for the difference between the expected
and the observed (respectively predicted) number of exposed residues
were approximately Gaussian and peaked around values of 1 (Fig.
3B). About 20% (756) of the 3457 globular domains fell
outside a region of ± 1 σ, and
4% (142) outside a region of ± 2 σ.
Interestingly, using domains rather than single protein chains
proved essential for this analysis. Initially, I used a set of
structures with single chains for the analysis. For these, PHDacc
under-predicted the number of exposed residues ( Fig. 3 C showed
the correlation for 716 single-chain proteins).
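The fractions of domains falling outside the ± 1 σ and ± 2 σ regions can be tallied with a few lines of code. The following is only a minimal sketch: the function name `fraction_outside` is hypothetical, and it uses the sample mean and sample standard deviation as a stand-in for the peak and width of the Gaussian fit described above, which is an assumption rather than the procedure used in this study.

```python
from statistics import mean, stdev


def fraction_outside(values, k):
    """Fraction of values lying more than k standard deviations from the mean.

    Assumption: the sample mean and sample standard deviation approximate
    the peak and width of a Gaussian fit to the distribution.
    """
    values = list(values)
    if len(values) < 2:
        raise ValueError("need at least two values to estimate a spread")
    centre = mean(values)
    sigma = stdev(values)
    outside = sum(1 for v in values if abs(v - centre) > k * sigma)
    return outside / len(values)


if __name__ == "__main__":
    # Toy differences between predicted and expected exposed residues
    toy = [0, 0, 0, 0, 0, 0, 0, 0, 10, -10]
    print(fraction_outside(toy, 1))
    print(fraction_outside(toy, 3))
```

For a truly Gaussian distribution about 32% of the points would be expected outside ± 1 σ and about 5% outside ± 2 σ, so such a tally gives a quick check of how Gaussian an observed distribution is.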
Fig. 3. Predicted protein surface vs. protein length. Although the method used to predict solvent accessibility (PHDacc [6, 7] ) was not 100% accurate, overall the numbers for predicted and observed exposed residues correlated quite well (A). The difference between the expected and the predicted number of exposed residues (Δg, eqn. 4) was approximately Gaussian, peaking around a difference of one, with one standard deviation σ of about 10 (B; curve with crosses). About 70% of all points fell into a ± σ interval of the peak (B; curve with dotted circles). Two data sets were explored: domains, and entire proteins. Interestingly, the correlation between the predicted and observed exposed residues was clearly higher for domains (A) than it was for entire proteins (C).
Random fragments differed from globular domains. The
randomly selected domain fragments (Methods) differed from the
globular proteins in that, on average, fewer residues were exposed
( Fig. 4 ). The peak of the distribution for the difference in
predicted and expected ( eqn. 3 ) number of residues differed significantly
from that of the distribution for globular domains ( Fig. 4 A).
However, the random distribution was broader than the domain
distribution. Thus, the two were not clearly separated. The
majority of all fragments fell outside the one-σ
fit to the domain data ( Fig. 4 B). Furthermore, discarding sequences
with Δg ≤ -5 ( eqn. 4 ) retained 80% of the
domains, but only 30% of the fragments; discarding sequences
with Δg ≤ -10 retained 90% of the domains, and 60% of
the fragments ( Fig. 4 B).
Fig. 4. 'Globularity' of random protein fragments. The distribution of Δg (eqn. 4) for randomly chosen fragments of globular domains was clearly shifted towards lower values (too few exposed residues predicted, A). For example, about 80% of the domains had values > -5 (B, upper line with stars), whereas only about 30% of the fragments fell above -5 (B, lower line with crosses).
Most database hits not sufficient to conclude expression of
ORFs. For naturally evolved proteins, similar sequences
imply similar structures. The level of pairwise sequence identity
necessary to guarantee structural similarity depends on the length
of the aligned regions [16] : for alignments shorter than 100
residues (as the investigated short yeast ORFs) the minimal level
of sequence identity indicating structural similarity is above
30% [17] . However, when comparing ORFs that may not be expressed
to the SWISS-PROT [4] protein sequence database, the level
of sequence identity alone is not sufficient. Here, I applied three
rather conservative 'filters'. An alignment was considered to
indicate expression if (1) the minimal number of aligned residues was
30 (length of shortest ORF considered), (2) the minimal pairwise
sequence identity was 30% (more for shorter alignments [17] ),
and (3) the lengths of the ORF and the database hit did not differ
by more than 30%. This analysis covered 2427 yeast ORFs with
lengths between 30 and 100 residues. More than half of these could
be aligned to known proteins ( Table 2 ; the total number of proteins
found to have similar regions amounted to 8106). However, the
more conservative three-step filter retained only 90 ORFs (Table
1, Table 2 ; the total number of proteins found to be entirely
similar to a yeast ORF was 317). The yield was further
diminished by the fact that 45 of the 90 database hits supporting
expression of the respective ORF were yeast proteins previously
known to be expressed.
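The three filters described above can be sketched as a single predicate. This is an illustration only: the class `Hit` and the function `supports_expression` are hypothetical names, the length difference is assumed to be measured relative to the longer of the two sequences (the text does not specify the reference length), and the stricter, length-dependent identity threshold for shorter alignments [17] is not implemented.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise alignment between a short ORF and a database protein."""
    aligned_length: int      # number of aligned residues
    percent_identity: float  # pairwise sequence identity over the alignment, in %
    orf_length: int          # length of the yeast ORF
    hit_length: int          # length of the database protein


def supports_expression(hit, min_aligned=30, min_identity=30.0,
                        max_length_diff=0.30):
    """Return True if the alignment passes all three conservative filters."""
    # (1) at least 30 aligned residues
    if hit.aligned_length < min_aligned:
        return False
    # (2) at least 30% pairwise sequence identity
    if hit.percent_identity < min_identity:
        return False
    # (3) lengths differ by no more than 30%
    #     (assumption: relative to the longer sequence)
    longer = max(hit.orf_length, hit.hit_length)
    return abs(hit.orf_length - hit.hit_length) <= max_length_diff * longer


if __name__ == "__main__":
    print(supports_expression(Hit(80, 45.0, 90, 100)))
    print(supports_expression(Hit(80, 45.0, 40, 100)))
```

The third filter is the one that distinguishes a hit covering the entire database protein from a hit to a region of a much longer protein, which explains why so few of the aligned ORFs remained.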
Putative membrane helices for one fifth of all ORFs.
For 550 of the 2427 ORFs PHDhtm [8] predicted the existence
of membrane helices ( Table 2 ). This fraction is slightly smaller
than that detected for expressed yeast proteins [2] .
However, the number ought to be viewed with caution. Firstly,
PHDhtm occasionally confuses signal peptides with transmembrane
helices. Secondly, the false-positive rate (globular proteins
predicted to contain membrane helices) of PHDhtm is particularly
high for proteins with only one transmembrane helix [2] , and
more than one membrane helix was detected for only 79 of the 550
proteins (stronger predictions listed in the Appendix). Finally,
the distribution of proteins with 1, 2, and 3 predicted membrane
helices appeared to differ from what might be expected for a set
of short expressed yeast proteins. Due to the problems that arise
with short yeast ORFs, a detailed comparison was not possible.
However, compiling the statistics on all yeast proteins between
30 and 150 residues known to be expressed (total of 141) showed
that about one third of these had one transmembrane helix (HTM),
one third two HTMs, and almost one third three HTMs ( Table 3 App;
[18] ).
Separating more and less likely to-be-expressed yeast ORFs.
The distribution of the difference between the number of predicted
(PHDacc) and expected (fit eqn. 3 ) exposed residues for the short
yeast ORFs deviated both from that found for globular domains,
and from that found for protein fragments ( Fig. 5 ). In particular,
the number of ORFs for which too many exposed residues
were predicted was substantially higher. However, according to
the control ( Fig. 4 ) a similar distribution described the set
of entire proteins (rather than that of domains). Thus, although
this end of the distribution (which covered about 10% of all short
yeast ORFs without membrane regions, Fig. 5 ) revealed less globular
sequences, it did not reveal ORFs less likely to be expressed.
On the other hand, for about 40% of the yeast ORFs Δg ( eqn. 4 ) was below -6. The corresponding percentage was below 20% for
the globular proteins, and above 65% for the random fragments
( Fig. 5 ). The percentage of the yeast ORFs that
fell in the most globular region was about 50% ( Fig. 5 ; sorted
lists of the least and most globular yeast ORFs given in the Appendix).
Fig. 5. 'Globularity' for short yeast ORFs. The distribution of Δg (eqn. 4) for the short yeast ORFs (dotted circles) differed from that for random fragments (crosses), as well as from that for globular domains (filled triangles) and entire proteins (open triangles). The shaded boxes highlighted the regions that were occupied by about 30% (light grey), and 60% (darker grey) of all globular domains.
The concept of 'globularity' established. The simplified description of protein 'globularity' established here was motivated by the attempt to find a simple and intuitive way to correlate protein length and 'globularity'. The number of exposed residues obtained with the particular definition used (relative solvent accessibility > 16%) was surprisingly similar to that given by the over-simplified naïve model describing proteins as consisting of spheres (Methods, Fig. 2 ). Given the observed relation between protein length and the number of exposed residues, the difference between the number of exposed residues actually observed for a particular protein and that expected according to a fit to this relation ( eqn. 1, eqn. 2 ) provided a measure of how much the 'globularity' of a particular protein differed from the database background (Δg, eqn. 4 ). The distributions of Δg differed between the sets of globular domains, entire globular proteins, and randomly chosen fragments thereof ( Fig. 4 and Fig. 5 ). Thus, the simple measure Δg qualitatively matched our expectation ( Fig. 1 ).
Qualitative prediction of 'globularity' possible. Predicted and observed globularity correlated surprisingly well for globular domains ( Fig. 3 A). This evidence suggested that if globularity is a meaningful measure, it can be predicted from sequence. Interestingly, the prediction method PHDacc [6, 7] under-predicted exposed residues for the less globular entire proteins ( Fig. 3 C). On average, the distributions for Δg ( eqn. 4 ) differed between randomly chosen domain fragments and entire domains (Fig. 4). This allowed a qualitative distinction between domains and fragments. However, frequently there was a fragment that had a Δg closer to the database average than did the entire respective domain (data not shown). Thus, the measure Δg was not sufficient to predict domains from sequence. Nevertheless, it may be useful in combination with other methods [19, 20, 21] .
Most yeast ORFs with predicted membrane helices may not be expressed. For about 23% of the yeast ORFs, transmembrane helices were predicted by PHDhtm [7, 8] . However, the comparison with transmembrane helix predictions for a subset of yeast proteins known to be expressed indicated that many of the membrane helix predictions for the short ORFs were supposedly false positives. The exact number was difficult to estimate. Assuming that most predictions for proteins with more than one membrane helix were correct, and that the ratio of proteins with one and two membrane helices in yeast is similar to that in other organisms, the conclusion was that about half of the proteins predicted to have one membrane helix were false positives. This high error rate is completely atypical for the prediction method (usually about 2% false positives). This atypical feature of many short yeast ORFs indicated that most of the proteins predicted to contain membrane helices might not be expressed. The reliability of the membrane predictions provided some criterion for sorting the respective yeast ORFs according to the extent to which they looked 'normal'.
Sorting short yeast ORFs by 'globularity'. Most short yeast ORFs (62%, Table 2 ) were similar to some expressed protein in SWISS-PROT. However, for only about 4% ( Table 2 ) did the alignment unequivocally support the conclusion that the respective yeast ORF could be expressed. The remaining short yeast ORFs that were not predicted to have transmembrane helices were sorted according to their 'globularity'. About 40% fell into the region defined by most globular domains ( Fig. 5 ). About 20-40% appeared to lack the features of globular proteins ( Fig. 5 ). Thus, the benefit of the method proposed here would be to first focus on the more 'globular' ORFs when hunting for expressed proteins in the pool of short yeast ORFs, and to discard the 20-40% non-globular ones immediately. Furthermore, about 10% of the yeast ORFs were defined by a globularity typical for short proteins that extend over the core domain ( Fig. 5 ).
Availability and usefulness of the method. The method
described will be made available through the PredictProtein server
[22] . The results indicated that the method was not sufficient
to entirely predict domains, nor to predict whether or not an
ORF is expressed. However, for both tasks the tool appeared to
provide crucial assistance.
Correlating protein length and solvent accessibility. One measure for the globularity of a protein is its overall solvent accessibility. Suppose a protein is a perfect box, consisting of small boxes of equal size (the residues). Then eqn. 1 holds for the number of boxes on the surface,
where the left-hand side is the number of residues on the
surface, and Nres is the total number
of residues. The example was simplified to the extent that author
and reader could easily verify the correctness of this simple
relation between protein length and residues exposed to solvent.
From solid state physics we know that the packing density of
the simple box-to-box packing is about 80% of the density of a
cubic-hexagonal lattice (i.e. imagining residues to be spheres,
rather than boxes). Thus the simple model of a protein sphere
can be described approximately by eqn. 2.
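The box model lends itself to the easy verification mentioned above. The following minimal sketch assumes that the 'perfect box' is a cube built from n x n x n unit boxes (so that the total number of residues is n cubed); the closed form used here is simply what this assumption implies and is not claimed to be the exact form of eqn. 1. Both function names are hypothetical.

```python
from itertools import product


def surface_boxes_brute_force(n):
    """Count unit boxes with at least one face on the surface of an n x n x n cube."""
    count = 0
    for x, y, z in product(range(n), repeat=3):
        # A box is on the surface if any coordinate lies on an outer layer
        if 0 in (x, y, z) or (n - 1) in (x, y, z):
            count += 1
    return count


def surface_boxes_closed_form(n):
    """All boxes minus the buried inner cube of side n - 2."""
    if n < 2:
        return n ** 3  # a single box (or none) has no buried core
    return n ** 3 - (n - 2) ** 3


if __name__ == "__main__":
    for n in range(1, 7):
        total = n ** 3
        surface = surface_boxes_closed_form(n)
        print(n, total, surface, surface_boxes_brute_force(n))
```

Under this assumption a cube of 125 boxes has 98 of them on the surface, which illustrates why a cube-based model tends to place a large share of the residues of a short protein on the surface.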
Compiling relative solvent accessibility. Solvent accessibility was computed from known protein structures based on the program DSSP [3] . The normalisation to relative accessibility was performed according to [6] . Residues were considered exposed when more than 16% of their relative surface was accessible to solvent. (Note: this particular value was chosen as it assured that roughly half of the residues in a globular protein were exposed (data not given). The particular choice is also supported by other structural considerations [23, 24, 25, 26, 27, 28] )
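Counting exposed residues with the 16% threshold can be sketched as follows. The function name `count_exposed` and the input format (one relative accessibility value in percent per residue) are assumptions made for illustration; computing and normalising the accessibility values themselves (DSSP [3] , normalisation according to [6] ) is outside the scope of this sketch.

```python
def count_exposed(relative_accessibility, threshold=16.0):
    """Number of residues whose relative solvent accessibility (in %)
    is strictly greater than the threshold."""
    return sum(1 for acc in relative_accessibility if acc > threshold)


if __name__ == "__main__":
    # Hypothetical per-residue relative accessibilities, in percent
    toy_protein = [0.0, 16.0, 17.0, 50.0, 3.5]
    print(count_exposed(toy_protein))
```

Note that a residue at exactly 16% is treated as buried here, following the wording 'more than 16%'.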
Predicting solvent accessibility. Solvent accessibility was predicted using the program PHDacc [7] . Residues predicted at levels of > 16% relative accessibility were considered to be exposed. Prediction errors required fitting the relation between sequence length and number of exposed residues with two free parameters ( eqn. 3 ),
where the fitted quantity was the number of predicted exposed
residues and the two free parameters were a
and b (the particular choice
used in this study is specified in Results). (Note that this functional
form was rather similar to the one introduced by Joël Janin
[24] .)
Describing the 'globularity' of a protein with respect to the database. Given the number of exposed residues as a measure for the 'globularity' of a protein, and the relation between protein length and the number of exposed residues ( eqn. 1, eqn. 2, eqn. 3 ), the 'globularity' of a particular protein could be related to that of the database background. The simple difference (Δg, eqn. 4 ) used in this study was defined as
the number of exposed residues
observed (DSSP), respectively predicted (PHD), minus
the number of surface residues expected from the fit of the protein length
( eqn. 2, eqn. 3 ). Negative values of Δg thus indicated that too few exposed residues were observed or predicted.
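The difference measure and the cut-off analysis applied to it in Results can be sketched in a few lines. This is a minimal illustration under stated assumptions: the sign convention (exposed minus expected) is inferred from the caption of Fig. 4, where too few predicted exposed residues correspond to lower values; the expected number is passed in as a plain number because the functional form of the fit is not reproduced here; and the names `delta_g` and `fraction_above` are hypothetical.

```python
def delta_g(n_exposed, n_expected):
    """Difference between the observed (or predicted) number of exposed
    residues and the number expected from the fit to protein length.

    Assumed sign convention: negative values mean too few exposed residues.
    """
    return n_exposed - n_expected


def fraction_above(delta_values, cutoff):
    """Fraction of sequences whose difference value lies above the cut-off."""
    values = list(delta_values)
    if not values:
        raise ValueError("no values given")
    return sum(1 for v in values if v > cutoff) / len(values)


if __name__ == "__main__":
    # Hypothetical (exposed, expected) pairs for five sequences
    pairs = [(30, 42.0), (35, 42.0), (40, 44.0), (50, 50.0), (48, 45.0)]
    deltas = [delta_g(obs, exp) for obs, exp in pairs]
    print(deltas)
    print(fraction_above(deltas, -5))
```

Applying `fraction_above` with the same cut-off to domains, fragments, and ORFs gives the kind of coverage comparison reported for cut-offs of -5 and -10.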
Data set of structurally known single-chain proteins and protein domains. Initially, I used a set of single-chain protein structures taken from PDB [29] . Using the top hierarchy of the FSSP database [30] I restricted the set such that no pair in the set had a significant level of pairwise sequence identity [16] . However, even single chains were often not completely 'globular'. Thus, I finally chose a set of 3457 domains compiled in part by visual inspection [31] .
Generation of random subsets of non-globular fragments. For each domain (protein) in the set of 3457 domains (716 proteins) I generated fragments choosing begin and end points at random (i.e. uninterrupted fragments of various lengths were chopped from the proteins). The random generation of non-globular fragments was constrained in two ways. (1) Fragments were forced to be shorter than 80% of the proteins from which they were taken. (2) The distribution of the lengths of the selected fragments exactly mirrored the distribution observed for globular proteins. In total about 30,000 fragments were created at random. The random background set thus chosen provided rather conservative estimates for the goal of predicting the globularity of short open reading frames.
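The fragment generation can be sketched as follows. This minimal example implements only constraint (1), i.e. uninterrupted fragments strictly shorter than 80% of the parent sequence; constraint (2), matching the length distribution of globular proteins, is deliberately left out. The function name `random_fragment` and its parameters are hypothetical, and drawing the length uniformly is an assumption.

```python
import math
import random
from fractions import Fraction


def random_fragment(sequence, rng, max_fraction=Fraction(4, 5), min_length=1):
    """Cut one uninterrupted fragment, strictly shorter than max_fraction
    of the parent sequence, at a random position."""
    n = len(sequence)
    # Largest integer length strictly below max_fraction * n
    # (exact arithmetic avoids floating-point rounding at the boundary)
    max_length = math.ceil(max_fraction * n) - 1
    if max_length < min_length:
        raise ValueError("sequence too short to cut a fragment")
    length = rng.randint(min_length, max_length)  # inclusive bounds
    start = rng.randint(0, n - length)
    return sequence[start:start + length]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    parent = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKLMNPQR"
    for _ in range(3):
        print(random_fragment(parent, rng))
```

To honour constraint (2) one could, for example, draw the target length from the empirical length distribution of the domains instead of uniformly, retrying with a different parent whenever the drawn length violates constraint (1).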
Data set of short yeast proteins. Short open reading frames were taken from the original yeast analysis performed by the GeneQuiz consortium [32] . Alignments were obtained by the dynamic programming algorithm MaxHom [16, 5] .
Predicting transmembrane helices for short open-reading frames.
The distinction between proteins with and without transmembrane
regions was realised by running the program PHDhtm [7, 8]
on all short open reading frames. The cut-off value for the distinction
was set to the default value (0.8) at which about 2% of the proteins
for which membrane regions were detected are expected to be globular
[8] .
Particular thanks to Christine Orengo (London) for her database
of protein domains. Thanks to Sean O'Donoghue (EMBL, Heidelberg)
for discussions, and to Gerrit Vriend (EMBL, Heidelberg) for financial
support. Last but not least, thanks to all those who enabled this
analysis by depositing information about protein structures, protein
sequences, and genome sequences in public databases, and to those
who maintain such databases.