| Title: | Distinguishing protein-coding from non-coding RNA through support vector machines |
| Author: | Jinfeng Liu, Julian Gough, & Burkhard Rost |
| Quote: | PLoS Genetics, 2006, Apr. 2(4):e29. Epub 2006 Apr 28. |
Distinguishing protein-coding from non-coding RNA through support vector machines
| 1 | Dept. of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| 4 | Genome Exploration Research Group (Genome Network Project Core Group), RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan |
| * | Corresponding author: server@rostlab.org URL http://www.rostlab.org/ Tel: +1-212-851-4669, fax: +1-212-305-7932 |
RIKEN's FANTOM project revealed many previously unknown coding sequences, as well as, an unexpected degree of variation in transcripts resulting from alternative promoter usage and splicing. Ever more transcripts that do not code for proteins have been identified by transcriptome studies, in general. Increasing evidence points to the important cellular roles of such non-coding RNAs (ncRNA). The distinction of protein-coding from non-coding RNA transcripts is therefore an important problem in understanding the transcriptome and in annotating fully sequenced organisms. Very few in silico methods have specifically addressed this problem. Here, we described a novel method based on support vector machines (SVMs) that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, predicted secondary structure content, predicted percentage of exposed residues, compositional entropy, number of homologues from database searches, and alignment entropy. Nucleotide frequencies were also incorporated into the method. Confirmed coding cDNAs for eukaryotic proteins from the Swiss-Prot database constituted the set of true positives, non-coding RNAs from RNAdb and NONCODE the true negatives. Ten-fold cross-validation suggested that our SVM-based method distinguished coding from non-coding at about 97% specificity and 98% sensitivity. Applied to 102,801 mouse cDNAs from the FANTOM-3 data set, our method reliably identified over 14,000 ncRNAs and estimated the total number of ncRNAs to be about 28,000.
Key words: genome sequence analysis, non-coding RNA, support vector machines, multiple alignments, machine learning, protein features.
There are two types of RNA: messenger RNAs (mRNAs), which are translated into proteins, and non-coding RNAs (ncRNAs), which function as RNA molecules. Besides textbook examples such as tRNAs and rRNAs, non-coding RNAs have been found to carry out very diverse functions, from mRNA splicing, RNA modification, to translational regulation. It has been estimated that non-coding RNAs make up vast majority of transcription output of higher eukaryotes. Discriminating mRNA from ncRNA has become an important biological and computational problem. The authors describe a computational method based on a machine-learning algorithm known as Support Vector Machine (SVM) that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, secondary structure content, and protein alignment information. The method is applied to the data set from the FANTOM-3 large-scale mouse cDNA sequencing project; it identifies over 14,000 ncRNAs in mouse and estimates the total number of ncRNAs in the FANTOM-3 data to be about 28,000.
| FANTOM | Functional ANnoTation Of Mouse cDNAs |
| ncRNA | non-coding RNA |
| ROC | receiver operating characteristics |
| SVM | support vector machine. |
Francis Crick about 50 years ago [1] , holds that genetic information is stored in DNA, that it is transferred into RNA through transcription by DNA polymerase, and that the information is finally decoded when RNA is translated into proteins. The paradigm has been that proteins largely constitute the machinery that makes life live: proteins carry out all structural, catalytic and regulatory functions. In this view, RNAs mostly play the passive role of a 'messenger'. When in vitro RNA splicing experiments revealed that RNAs can act as enzymes [2] , this was initially considered an exception. Many more biological functions of RNAs are known today. RNAs can be divided into two classes: messenger RNAs (mRNAs), which are translated into proteins, and non-coding RNAs (ncRNAs), which are functional as RNA molecules rather than encoding proteins. The textbook examples of ncRNAs are tRNAs and rRNAs. ncRNAs have been found to carry out very diverse functions, from mRNA splicing (snRNAs), RNA modification (snoRNAs), to translational regulation (microRNAs), and chromatin structure modulation (XIST) [3, 4] . The functions of many ncRNAs remain unknown. During the first two phases of the Functional Annotation of Mouse cDNAs (FANTOM) project, only 17,594 of the 33,409 transcriptional units were determined to have coding potential; all remaining units were considered as putative ncRNAs [5, 6] . The third phase of FANTOM [7] has discovered even more putative ncRNAs. Many of these ncRNAs are polyadenylated just as mRNAs; possibly more surprisingly, many of these ncRNAs are not short, rather they are, on average, even longer than protein-coding transcripts [6] . It has been estimated that 98% of the human genomic output may be ncRNAs, and speculated that differences in organism complexity may originate mainly from the vast difference in the amount of ncRNAs between higher eukaryotes and simpler organisms, rather than from the difference in protein coding genes [8] . Given the importance of ncRNAs and the increasing body of large-scale data, the development of computational methods that distinguish between mRNA and ncRNA, i.e. between protein-coding and non-coding, becomes both increasingly urgent and increasingly feasible.
Few computational approaches have been specifically designed for the distinction between mRNA and ncRNAs. Instead, methods developed for related tasks (detection of protein-coding regions within cDNAs, ESTs, or prokaryotic genomic DNAs) have been applied to this task. The analysis of DNA alignments and codon usage is one of the most widely used strategies. CSTminer [9] identifies conserved sequence tags by comparing cross-species DNA alignments, based on the observation that in homologous regions, synonymous changes (i.e., base changes without amino acid substitution) occur more frequently than non-synonymous ones, and that non-synonymous substitutions often result in amino acid changes that preserve structural similarity. QRNA [10] adopts a similar strategy by using a Hidden Markov Model (HMM) to identify ncRNAs. CRITICA [11] identifies protein-coding sequences in prokaryotic genomic DNAs through the combination information from DNA alignments and from dicodon usage. Statistical models for mRNA constitute an alternative strategy. DIANA-EST [12] distinguishes coding ESTs from un-translated regions (UTR) or out-of-frame cDNA windows with artificial neural networks, and ESTScan [13] detects coding sequences within ESTs through Hidden Markov Models (HMM). Alignments to known proteins in the databases were used by rsCDS [14] to identify coding regions in the FANTOM-2 project [5] . Additional unpublished methods have also been used in the FANTOM-3 distinction between protein-coding and non-coding RNA [15] . These include methods based on protein domain-like regions from Pfam [16] and SUPERFAMILY [17, 18] , imposing a size threshold for longest ORF, and mTRANS, which uses both ORF length and cDNA coding potentials according to common features found in cDNA.
Most of the above methods succeed partially in the identification of protein-coding regions from cDNAs. However, the sustained performance in specifically distinguishing mRNA from ncRNA has not been thoroughly evaluated for any method due to the previous lack of data. Many methods were assessed using 5' and 3' un-translated regions as negative examples, which may not have the same characteristics as ncRNAs. Growing interest in ncRNA functions has prompted the construction of several databases cataloging different types of ncRNAs. Rfam [19] provides structure-annotated multiple sequence alignments for non-coding RNA families, most of which are structural ncRNAs. The major focus of Rfam is on the use of alignments and covariance models for automatically analyzing and annotating sequences, in analogy to Pfam [16] . However, for many Rfam sequences the direct experimental evidence for their transcription is missing. RNAdb [20] catalogues more than 800 experimentally studied mammalian ncRNAs, including microRNAs and snoRNAs, but not structural RNAs. NONCODE [21] contains more than 5,000 manually curated ncRNAs from 861 organisms, with more than 80% of the entries based on experimental data. It also classifies ncRNAs by their cellular function.
Here, we introduced a novel method for the distinction between protein-coding and non-coding RNAs. The particular focus of this method was on the reliable distinction for long transcripts as abundantly identified by FANTOM-3. Support Vector Machines (SVMs), like other supervised machine learning algorithms, try to learn decision rules from labeled input data (in this case, known protein-coding RNAs and ncRNAs) and use these rules to classify novel data. SVMs have a unique way of mapping input data into a very high dimensional feature space using kernel functions, and of identifying a hyperplane in this highly complex space that maximizes the distance from the closest samples to the hyperplane [22] . SVMs have been widely used for pattern recognition, as well as image and text classification. Recently, they have also been extensively applied to many problems in computational biology [23] , including the detection of distant relatives of proteins [24] , the prediction of subcellular localization [25, 26] , and the classification of microarray data [27] . In this study, we trained SVMs using eukaryotic ncRNAs from RNAdb and NONCODE databases, and showed that protein features can be used to reliably distinguish protein-coding RNAs from ncRNAs. Sustained performance was established through rigorous cross-validation. Applying the method to the FANTOM-3 data estimated the number of ncRNAs in mouse to be about 28,000.
We trained SVMs on three different data sets of ncRNA. The first were 778 experimentally documented mammalian ncRNAs from RNAdb. The cross-validated SVM for this set identified 99% of the known protein-coding RNAs, while 14% of the ncRNAs were incorrectly predicted as coding (Table S1, Supporting Information). When applying the SVM trained on this set to the classification of eukaryotic ncRNAs from NONCODE and from randomly generated DNAs, the fractions of incorrect predictions were 8% and 9% respectively (Table 1 ). We used random DNAs to assess the ability of the SVM in extracting useful features that distinguish signals from random background noise. Our second training set comprised eukaryotic ncRNAs from the NONCODE (set NONCODE-E). Cross-validation indicated a similar performance for predicting mRNA, but considerably lower false predictions for ncRNAs of around 5%. When this SVM was applied to the RNAdb set and random DNAs the error rates were 10% and 7% (Table 1 ).
The best SVM was trained on the ncRNAs from both RNAdb and NONCODE-E. This set included 1158 mammalian ncRNAs and 1512 non-mammalian eukaryotic ncRNAs. The specificity and sensitivity for the coding RNAs were similar to the other sets. However, the accuracy for ncRNAs was notably higher when the SVM trained on this set was tested on RNAdb, NONCODE-NE (prokaryotic and archaean ncRNAs from NONCODE), and random DNAs (Table 1). A detailed analysis of the 166 incorrect predictions showed that 102 (61%) were from mammalians, the remaining 64 from non-mammalian eukaryotes. The difference in performance indicated that the mammalian ncRNAs in RNAdb differed from those in NONCODE-E. On the other hand, SVMs trained on the combination of RNAdb and NONCODE-E appeared to capture the general features of ncRNAs, and performed well on all data sets, even on from NONCODE-NE. This was further confirmed by the high accuracy on Rfam (Table 1) mostly containing rRNAs and tRNAs that are not included in RNAdb and NONCODE, and that served therefore as means of independent evaluation for the performance.
| Negative part of the training set | False prediction rates for testing ncRNA sets | ||||
| NONCODE-E b | NONCODE-NE c | RNAdb | Rfam | Random DNA | |
| NONCODE-E | 0 | 4.1% | 9.9% | 8.7% | 7.3% |
| RNAdb | 7.7% | 9.1% | 0.1% | 28.4% | 8.8% |
| NONCODE-E and RNAdb | 0.3% | 3.8% | 6.4% | 7.7% | 8.2% |
a SVMs trained on ncRNA data sets on the first column and a common positive coding RNA set were tested on the data sets on other columns. The SVM parameters were tuned so that all three SVMs predicted same number of coding RNAs from the positive set during cross-validation, i.e., the same sensitivity for coding prediction. False prediction rate was defined as percentage of ncRNAs in the data sets that were predicted as coding, e.g., 9.9% of ncRNAs were predicted incorrectly as coding when the SVM was trained on NONCODE-E and tested on RNAdb.
b. Eukaryotic ncRNAs from NONCODE database
c. Prokaryotic and archaean ncRNAs from NONCODE database
Our major hypothesis for the development of our method was that native proteins have particular structural and sequence properties that distinguish them from putative translation products of ncRNAs. The protein characteristics that we selected included: peptide length, amino acid composition, percentage of residues that were predicted by PROFsec [28, 29, 30] to adopt the secondary structure of alpha-helix or beta-strand, percentage of residues predicted by PROFacc [31, 29, 30] to be exposed to solvent, sequence complexity as measured by sequence compositional entropy [32] , number of homologues from database searches, and sequence conservation computed as alignment entropy. We chose the longest possible translation products from each of the three forward frames, including those without obvious start codons. This was motivated by the attempt to account for sequencing errors: our method would be able to capture most of the protein regardless of the position of the error (immediately after the start codon/middle of ORF) and the type of the error (point mutations or insertions/deletions).
One important measure of performance is provided by receiver operating characteristics (ROC) curves [33] that plot sensitivity against (1-specificity) over the full range of the specificity (Fig. 1). The area under the ROC curve (ROC score) describes the overall performance of the method under different thresholds. The ROC score of above 0.98 achieved by our SVM (Fig. 1) indicated an excellent performance. At the particular decision boundary chosen by a prediction method, the performance is more intuitively measured by specificity (eqn. 1), sensitivity (eqn. 2), or the harmonic mean of those two known as F-measure (eqn. 3). Since the numbers of mRNA and ncRNA differ significantly in our data set, reporting performance for both aspects (predict mRNA and predict ncRNA) becomes crucial (Table 2). At the threshold selected by our SVM classifier protein-coding RNAs were predicted at F>97% (Table 2) and ncRNAs at F>94% (Table 2). We compared our method with ESTScan [13] , one of the publicly available methods that have been applied in the discrimination of ncRNAs [15] . Run on our data set with "-O -S" options so that only positive strands were analyzed, and using the default threshold of 0, ESTScan was about ten percentage points less accurate than our SVM-based method (Table 2 ). In contrast, using the same input features to train another machine-learning algorithm, namely a Naive Bayes classifier (Methods), we were surprised to find that this simpler method yielded only slightly worse performance (Table 2).
|
Fig. 1: The receiver operating characteristics (ROC) curves for the SVM classifiers. ROC curve plots true positive rate (i.e., sensitivity) against false positive rate (i.e., 1-specificity). Shown are the ROC curves for SVMs using all features, two of top single features, and ESTScan. The diagonal line indicated random prediction. |
| Method | Coding prediction | Non-coding predicition | ||||
| F-measure | Specificity | Sensitivity | F-measure | Specificity | Sensitivity | |
| SVM | 97.4% | 97.1% | 97.8% | 94.5% | 95.2% | 93.8% |
| Naive Bayes | 96.7% | 95.6% | 97.9% | 92.8% | 95.3% | 90.5% |
| ESTScan | 86.7% | 84.7% | 88.7% | 69.9% | 73.7% | 66.5% |
a All three methods were evaluated on the same data set, with the non-coding set being the combination of NONCODE-E and RNAdb. The performance numbers for SVM and Naive Bayes were from 10-fold cross-validation.
We investigated the importance of individual input features by training separate SVMs on each feature. Most single features already discriminated coding from non-coding. However, the combination of all features performed much better than single feature alone. The top-performing individual features were the number of database homologues and peptide length, followed by alignment entropy and amino acid composition (Fig. 2). It is not surprising that the length or the number of homologues, alone, discriminate at reasonable accuracy. Most ncRNAs are short (average length of ncRNAs was 526 nucleotides in our data set, compared with 1746 nucleotides for coding sequences), and having homologous proteins is certainly a very good indication for protein-coding. In fact, some methods exclusively relied on one of these two features: rsCDS [14] analyzed alignments with protein databases to make coding cDNA assignment for FANTOM-2 project, and Òlongest ORFÓ was also tested to evaluate ncRNAs in FANTOM-3 [15] . However, there are obvious limitations to these two features. Many ncRNAs (about 10% in our data set) can be conceptually translated into peptides longer than 100 amino acids, and cDNAs coding for short proteins would also be incorrectly predicted as non-coding based on length alone. Similarly, protein alignment-based predictions would go wrong for cDNAs coding for novel proteins and ncRNAs that happen to align with some mis-annotated hypothetical proteins in the protein databases.
It was for these considerations that we chose to combine eight different protein properties, each of which has a good discrimination power by itself. The resulting combined SVM improved both specificity and sensitivity by more than three percentage points. The other advantage of combining all features was that predictions become more robust against mistakes in a single feature (e.g. incorrect alignment, incorrect peptide length). In such cases, other features can still contribute enough to the overall score and lead to correct predictions. For example, since our positive samples were taken from mRNAs that code for proteins in the Swiss-Prot database, it is trivial for them to find homologues in the databases. Consequently, SVMs using protein alignment alone are likely to perform worse on novel transcripts than on the well-defined data sets used to develop the SVM. In contrast, SVMs based on additional input features are likely to suffer less from this imbalance because they also rely on other signatures. Additionally incorporating nucleotide frequencies (single nucleotide, di-nucleotide, and triplet) into the multi-feature SVM slightly improved performance (about one percentage point) over the SVM with only protein-derived features.
|
Fig. 2 : Performance of the SVM classifiers with different input features. F-measures (harmonic mean of specificity and sensitivity, see Methods) were calculated for different SVMs on both the coding (A) and non-coding (B) predictions. Since the size of coding set was twice as big as the non-coding set, the percentage of incorrect predictions was bigger for the non-coding set, hence the smaller F-measures. When used alone, all input features can achieve the F-measures of 67.6 to 90.9 on the non-coding set. The combination of these features improved the performance to 97.4 for coding and 94.5 for non-coding. In comparison, ESTScan got F-measures of 86.7 and 69.9. The top features were number of homologues in the protein database and peptide length. |
Prediction specificity and SVM output scores were highly
correlated: the further the raw score was away from the SVM decision boundary
(score=0), the more accurate the prediction (Fig. 3 A). For the extreme ends of
scores >0.5 (coding) and <-1 (non-coding), specificity exceeded 96%. As
scores approached 0, specificity dropped sharply to about 50%, i.e. to levels
reminiscent of random predictions. Therefore, the SVM output score is likely to
be a good indicator for the reliability of novel predictions. Most predictions
in our cross-validation tests fell into the region of high reliability (Fig. 3).
|
Fig. 3 : Distribution and reliability of SVM scores. (A) The gray shade indicates the prediction accuracy as a function of the SVM scores (left y-axis). Predictions are most accurate (>98%) when SVM scores are >0 5 <<<-1 (for non-coding). Also shown are the score distributions (right y-axis) of three data set, training/testing set, random DNAs, and FANTOM-3 transcripts. The solid vertical line in the middle indicates the SVM decision boundary (score=0). (B) Most SVM predictions for training/testing set are in the high specificity range. FANTOM-3 transcripts have larger fraction of low accuracy predictions. |
We inspected a few cases for which our SVM was very wrong indeed, i.e., ncRNAs with high scores and protein-coding with very low scores. In many ncRNA cases, the experimental evidence for non-coding was ambiguous. For example, human ST7OT3 mRNA (RNAdb ID: LIT1900, GenBank accession: AF400044) is part of a complex multi-transcript system at the RAY1/ST7 locus. This locus contains two non-coding sense strand genes (ST7OT3 and ST7OT4) that overlap with many alternative forms of the coding RAY1/ST7 transcript, and two non-coding genes on the antisense strand (ST7OT1 and ST7OT2) [34] . Although there was no explicit evidence that ST7OT3 is protein-coding, the authors suggested it possible. Another transcript from this locus was also annotated as non-coding in RNAdb (RNAdb ID: LIT2007, GenBank accession: NM_018412). However, according to GenBankÕs annotation, it seems that it is coding for ST7 isoform A, i.e., this may more likely be an example for a possible mis-annotation in RNAdb than for a serious mistake in our method. Another example was ncRNA u1056 from NONCODE (GenBank accession: AF222983). This 7291nt long DNA encodes partial CDS of human DISC1 protein gene and DISC2 non-coding RNA gene since it is the flanking region of chromosome translocation breaking point associated with schizophrenia [35] . This is certainly not an example that the SVMs can learn! Most of those coding RNAs with very confident non-coding scores were very short proteins. The average protein length from the top ten mis-predicted coding RNAs was 41 amino acids. Additionally, these short proteins had very few database homologues. They are likely to be the most difficult samples for any ncRNA classifiers. Some of the incorrectly predicted coding cDNAs encode proteins that are annotated as 'hypothetical protein' in SWISS-PROT, i.e. may not be coding after all. Overall, this detailed analysis appeared to indicate that our estimates might be too conservative. However, the few cases for which we could analyze what appeared to be bad mistakes due to related experimental work did not suffice to make any general statement. Nevertheless, this detailed analysis did clearly underline the importance of the output score as a measure for prediction reliability.
The performance might not appear dramatically different between multi- and single-feature SVMs. However, the most important advantage of using multiple features became apparent when trying to identify reliable output scores in single feature-based predictions: we could not (Fig. S1, Supporting Information). In other words, there would have been no way to identify any subset of reliable ncRNA predictions in the FANTOM-3 data set, had we only used single feature SVMs.
We applied our novel method to 102,801 transcripts from the FANTOM-3 project. 14 transcripts were shorter than 80 nucleotides. Since this was the threshold we used for our training, we automatically classified those 14 as non-coding. Of the remaining 102,787 transcripts, the SVM predicted 28,316 to be non-coding. However, the fraction of highly reliable predictions was lower than that for our cross-validation set. The SVM output scores were in the reliable region of <-1 for only 13,873 transcripts (13.5%); another 14,443 (14.0%) were predicted as non-coding with lower confidence (Fig. 3 B). Our ncRNA predictions were in general agreement with the FANTOM-3Õs non-coding annotation [7, 15] , most of which was done by using consensus from several previously developed computational methods. Of the 28,316 transcripts that we predicted as ncRNAs, 22,918 (80.9%) were annotated as non-coding. Among our confident predictions, 2,348 were not annotated as non-coding by the FANTOM consortium. This would be the most interesting set for experimental verification. The correlation between SVM scores and prediction accuracy provided an alternative means for estimating the number of non-coding cDNAs in the FANTOM-3 data set. More precisely, the estimate can be based upon the correlation and the score distribution of the data. For example, if we knew 1,000 samples scored between Ð0.7 and Ð0.6, and 79% of the predictions were correct in this range, the number of coding and non-coding RNAs would be approximately 790 and 210, respectively. Based on this approach, we estimated the FANTOM-3 data set to contain 27,787 ncRNAs. Note that by the same logic, we predicted that while most of those 28K would be in our set of ncRNA predictions, we estimated 2834 of the ncRNAs to be found in the set of our protein-coding predictions.
Throughout the three FANTOM projects the importance of ncRNAs has become increasingly obvious, in particular for mammalian genomes. In parallel, detection methods have evolved. FANTOM-3 topped the state-of-the-art by combining many prediction methods through linear averaging [15] . Here, we demonstrated that the combination of different protein features through SVMs improves performance significantly over single features. SVMs are an excellent framework for the efficient incorporation of different features; in fact, they are far more successful in accomplishing this than simple linear averages would be. Sometimes SVMs trained on different features are needed in different situations. For example, if the orientation of the input RNA is not clear, it may be better to use an SVM that trained on protein features from all six frames (both forward and reverse). For FANTOM-3 transcripts orientation errors were rare; therefore, SVMs trained only on three forward frames identified 2,000 ncRNAs more than those trained on six frames. Presumably, most of these were anti-sense ncRNAs that were predicted as coding on the reverse frames. SVMs trained on protein alignment information are inevitably biased toward the known proteins. In predictions for novel transcripts unrelated to known proteins it may therefore be advisable to avoid this bias by using SVMs trained without protein alignment features.
We introduced a comprehensive collection of data sets for the evaluation of methods that distinguish between protein-coding RNAs and ncRNAs. Most of these data became available only recently. They were essential in the development of a novel prediction method that distinguishes between mRNA and ncRNA. Our method was essentially based on the assumption that features prone to capturing general characteristics of native proteins will help considerably in this distinction task. We used support vector machines to combine many relevant features that we chose initially largely by intuition rather than by optimization. The performance of our method appeared significantly higher than that of simpler methods. It clearly suggested that our initial hypothesis about the value of protein-related features was largely correct. The performance may be further improved by incorporating other features into the SVM, such as output from ESTScan and other methods. The major strength of our multi-feature SVM was that it enabled the solution of two different tasks: (1) the annotation of the most reliable subset of all ncRNAs in the FANTOM-3 data (14,000) that are most likely to be confirmed by more detailed, experimental follow-up studies, and (2) the estimate of the number of ncRNA in the FANTOM-3 set (28,000). The second task is very different from the first because our estimates for performance can be translated into estimated numbers although we cannot pinpoint exactly which 28,000 transcripts are indeed non-coding.
Protein-coding RNA. We first selected all eukaryotic proteins in the Swiss-Prot [36] database, and then removed sequence redundancy so that no protein pair in the set exceeded a sequence similarity above an HSSP value of 0. The HSSP-curve [37, 38] relates alignment length to pairwise sequence identity or similarity; for alignments of 100 residues, HSSP=0 corresponds to 33% pairwise sequence identity, for alignments longer than 250 residues to about 20%. cDNAs for these proteins were extracted from GenBank [39] . Potential sequence redundancy at the DNA level was further removed by running NCBI BLASTCLUST [40] with '-L 0.7' option so that no pair of sequences were similar over the 70% of the full length. Our final coding DNA set (positive set) contained 5610 coding cDNAs.
Non-coding RNA. Three eukaryotic non-coding RNA sets were used in this study as the negative set for training and testing. One is from RNAdb [20] , one from NONCODE [21] (named NONCODE-E), and the other is the combination of the two. In all cases, sequences shorter than 80 nucleotides were excluded since potential translation products from these RNAs were too short for meaningful protein sequence alignments and secondary structure predictions. Sequence redundancy was removed in the same way as for the coding RNAs using BLASTCLUST. The final number of non-coding RNAs was 778 in the RNAdb set, 2178 in the NONCODE-E set, and 2670 in the combination set. We also tested our SVMs on three additional negative sets although they were not used during training. The Rfam set contained 29009 non-redundant ncRNAs longer than 80 nucleotides in the Rfam database [19] , the "random DNA" set included 2000 randomly generated DNAs with length ranging from 80 to 3000 nucleotides, and NONCODE-NE (non-eukaryotes) set had 683 ncRNAs from prokaryotes and archaebacteria.
Support Vector Machines (SVMs) [22] are a machine-learning system based on recent advances in statistical learning theory. In our study, we used SVMlight [41] , a publicly available implementation of SVMs. The input features to the SVMs are the protein properties of potential peptides from the RNAs. Specifically, each input RNA was translated into the three forward reading frames; properties of the longest potential peptide from each frame were encoded into variables ranging from 0 to 1, and input to the SVMs. The following properties were used: (i) peptide length (4 variables for length intervals of 20, 40, 80, and longer than 80 amino acids), (ii) amino acid composition (20 variables), (iii) average hydrophobicity [42] (1 variable), (iv) secondary structure content (percentage of residues in secondary structure classes helix, strand, and others as predicted by PROFsec [31, 29, 30] , 3 variables), (v) percentage of residues that are exposed to solvent as predicted by PROFacc [31, 29, 30] (1 variable), (vi) sequence compositional entropy [32] describing sequence complexity (1 variable), (vii) number of homologues from database searches using PSI-BLAST [43] against an in-house protein database (similar to NCBI's nr database) for four iterations with e-value threshold of 1 (1 variable), (viii) alignment entropy: relative entropy between the observed fractions of amino acids and the respective background probabilities was calculated for each position in the multiple alignment and averaged over the full length of the sequence (1 variable).
In addition to these protein features, the frequencies of
mono-, di-, and tri-nucleotides were also calculated for the entire input RNA
and used as SVM inputs (84 variables). The total number of input variables was
180. These features were selected based on their distinguishing power when used
alone in the SVMs. The SVMs were trained using the radial basis function (RBF)
kernel. The g parameter for the RBF-kernel and the C parameter for the
trade-off between training error and margin were determined by optimizing on
small subset of the training data. Cost factor
Cross validation. We tested the performance of our SVM classifiers via 10-fold cross-validation experiments. The data set (coding + non-coding) was randomly divided into 10 equal-sized subsets. During each test, a SVM was trained on 9 subsets, and tested on the 10th one. This procedure was repeated 10 times so that each subset has been used as test set exactly once. The performance was measured for each test, and the mean was reported. We also tested our method via 3-fold, 5-fold, 20-fold, and 50-fold cross-validation, and the results were virtually identical. The XiAlpha and Leave-one-out estimates of the generalized error were reported in the Supporting Information (Fig. S3).
Performance measures. Several measures were used to evaluate the performance of the SVMs. All of them were derived from TP (True Positive, coding RNA predicted as coding), FP (False Positive, ncRNA predicted as coding), TN (True Negative, ncRNA predicted as non-coding), and FN (False Negative, coding RNA predicted as non-coding).
Specificity = TP/(TP+FP) (eqn. 1)
Sensitivity = TP/(TP+FN) (eqn. 2)
F =
(eqn. 3)
Specificity (accuracy, precision, eqn. 1) and sensitivity (coverage, recall, eqn. 2) are the most commonly used measures. The harmonic mean of these two numbers (eqn. 3), known as F-measure [44] , is often used as a single-value performance benchmark. Since both prediction categories (coding and non-coding) are of interests and the sizes of the two classes differ significantly, measuring performance for only one class is sometimes misleading and fails to capture the full picture; therefore, we reported the prediction performance for both classes in this study. We also evaluated the SVMs using receiver operating characteristics (ROC) curves [33] . SVM output scores were first sorted, and then the rate of true and false positives was calculated by setting the threshold to each score in the list. The ROC curves plot true positive rate as a function of false positive rate. The area under the ROC curve (the ROC score) is the average sensitivity over all possible specificity values, which can be used as a metric for prediction performance over different thresholds. A totally random predictor will produce a curve around the diagonal line from bottom-left to top-right and get the score of ~0.5, while a perfect predictor will get a curve along the left and top boundary of the square and receive a score of 1.
Comparison with a Naive Bayes classifier. We compared our SVM with a Naive Bayes classifier downloaded from http://fuzzy.cs.uni-magdeburg.de/~borgelt/bayes.html. The Naive Bayes classifier was trained with default parameters using exactly the same input features and encoding as our SVM, and evaluated via the same 10-fold cross-validation procedure.
We are grateful to Ken Pang (University of Queensland, Australia) for providing the RNAdb data set, and Martin Frith (University of Queensland, Australia), William Noble (University of Washington, US), Rajesh Nair (Columbia University, US) and Henry Bigelow (Columbia University, US) for helpful discussions. Thanks to the FANTOM Consortium Core Group, especially Piero Carninci, for sharing their exciting data with us. The work of JL and BR was supported by the grants U54-GM074958-01 and RO1-LM07329-01 from the National Library of Medicine (NLM). JG was supported by Research Grant for the RIKEN Genome Exploration Research Project from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government to Y.H., a grant of the Genome Network Project from the Ministry of Education, Culture, Sports, Science and Technology, Japan. to Y.H., grant for the Strategic Programs for R&D of RIKEN to Y.H., Research Grants for Preventure Program C of Japan Science and Technology Agency (JST) to Y.H. Last, not least, thanks to all those who contributed to the spectacular advances in all three FANTOM projects.
| 1. | Crick, F. H. (1958). On protein synthesis. Symp SocExp Biol, 12, 138-63. |
| 2. | Cech, T. R., Zaug, A. J. & Grabowski, P. J. (1981).In vitro splicing of the ribosomal RNA precursor of Tetrahymena: involvement ofa guanosine nucleotide in the excision of the intervening sequence. Cell, 27, 487-96. |
| 3. | Eddy, S. R. (2001). Non-coding RNA genes and the modernRNA world. Nat Rev Genet, 2,919-29. |
| 4. | Huttenhofer, A., Schattner, P. & Polacek, N. (2005).Non-coding RNAs: hope or hype? Trends Genet,21, 289-97. |
| 5. | Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono,H. et al. (2002). Analysis of the mouse transcriptome based on functionalannotation of 60,770 full-length cDNAs. Nature, 420, 563-73. |
| 6. | Suzuki, M. & Hayashizaki, Y. (2004). Mouse-centriccomparative transcriptomics of protein coding and non-coding RNAs. Bioessays, 26, 833-43. |
| 7. | Carninci, P., Kasukawa, T., Katayama, S., Gough, J.,Frith, M. et al. (2005). The Transcriptional Landscape of the Mammalian Genome.Science,in press. |
| 8. | Mattick, J. S. (2001). Non-coding RNAs: the architectsof eukaryotic complexity. EMBO Rep.,2, 986-91. |
| 9. | Mignone, F., Grillo, G., Liuni, S. & Pesole, G. (2003).Computational identification of protein coding potential of conserved sequencetags through cross-species evolutionary analysis. Nucl. Acids Res., 31, 4639-45. |
| 10. | Rivas, E. & Eddy, S. R. (2001). Noncoding RNA genedetection using comparative sequence analysis. BMC Bioinformatics, 2, 8. |
| 11. | Badger, J. H. & Olsen, G. J. (1999). CRITICA:coding region identification tool invoking comparative analysis. Mol BiolEvol, 16, 512-24. |
| 12. | Hatzigeorgiou, A. G., Fiziev, P. & Reczko, M.(2001). DIANA-EST: a statistical analysis. Bioinformatics, 17, 913-9. |
| 13. | Lottaz, C., Iseli, C., Jongeneel, C. V. & Bucher,P. (2003). Modeling sequencing errors by combining Hidden Markov models. Bioinformatics, 19 Suppl 2, II103-II112. |
| 14. | Furuno, M., Kasukawa, T., Saito, R., Adachi, J.,Suzuki, H. et al. (2003). CDS annotation in full-length cDNA sequence. GenomeRes., 13, 1478-87. |
| 15. | Frith, M. C., Bailey, T. L., Kasukawa, T., Mignone, F.,Kummerfeld, S. K. et al. (2005). Discrimination of Non-Protein-Coding Transcriptsfrom Protein-Coding mRNA. PLoS Genetics,submitted. |
| 16. | Bateman, A., Coin, L., Durbin, R., Finn, R. D.,Hollich, V. et al. (2004). The Pfam protein families database. Nucl. AcidsRes., 32, D138-41. |
| 17. | Gough, J., Karplus, K., Hughey, R. & Chothia, C.(2001). Assignment of homology to genome sequences using a library of hiddenMarkov models that represent all proteins of known structure. J. Mol. Biol., 313, 903-19. |
| 18. | Gough, J. & Chothia, C. (2002). SUPERFAMILY: HMMsrepresenting all proteins of known structure. SCOP sequence searches,alignments and genome assignments. Nucl. Acids Res., 30, 268-72. |
| 19. | Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna,A., Eddy, S. R. et al. (2005). Rfam: annotating non-coding RNAs in completegenomes. Nucl. Acids Res., 33,D121-4. |
| 20. | Pang, K. C., Stephen, S., Engstrom, P. G.,Tajul-Arifin, K., Chen, W. et al. (2005). RNAdb--a comprehensive mammaliannoncoding RNA database. Nucl. Acids Res.,33, D125-30. |
| 21. | Liu, C., Bai, B., Skogerbo, G., Cai, L., Deng, W. etal. (2005). NONCODE: an integrated knowledge database of non-coding RNAs. Nucl.Acids Res., 33, D112-5. |
| 22. | Vapnik, V. N. (1995). The nature of statisticallearning theory. Springer, . |
| 23. | Noble, W. S. (2004). Support vector machine applicationsin computational biology. In Kernel Methods in Computational Biology(Schoelkopf, B., Tsuda, K. & Vert, J.-P., eds.), pp. 71-92, MIT Press,Cambridge, MA. |
| 24. | Jaakkola, T., Diekhans, M. & Haussler, D. (2000). Adiscriminative framework for detecting remote protein homologies. J. Comp.Biol., 7, 95-114. |
| 25. | Hua, S. & Sun, Z. (2001). Support vector machineapproach for protein subcellular localization prediction. Bioinformatics, 17, 721-8. |
| 26. | Nair, R. & Rost, B. (2005). Mimicking cellularsorting improves prediction of subcellular localization. J. Mol. Biol., 348, 85-100. |
| 27. | Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N.,Sugnet, C. W. et al. (2000). Knowledge-based analysis of microarray geneexpression data by using support vector machines. Proc Natl Acad Sci U S A, 97, 262-7. |
| 28. | Rost, B. (2001). Protein secondary structure predictioncontinues to rise. J. Struct. Biol.,134, 204-218. |
| 29. | Rost, B., Yachdav, G. & Liu, J. (2004). ThePredictProtein server. Nucl. Acids Res.,32, W321-6. |
| 30. | Rost, B. (2005). How to use protein 1D structurepredicted by PROFphd. In The Proteomics Protocols Handbook (Walker, J. E.,eds.), pp. 875-901, Humana, Totowa NJ. |
| 31. | Rost, B. (1996). PHD: predicting one-dimensionalprotein structure by profile based neural networks. Meth. Enzymol., 266, 525-539. |
| 32. | Wootton, J. C. & Federhen, S. (1996). Analysis ofcompositionally biased regions in sequence databases. Meth. Enzymol., 266, 554-571. |
| 33. | Hanley, J. A. & McNeil, B. J. (1982). The meaningand use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29-36. |
| 34. | Vincent, J. B., Petek, E., Thevarkunnel, S.,Kolozsvari, D., Cheung, J. et al. (2002). The RAY1/ST7 tumor-suppressor locuson chromosome 7q31 represents a complex multi-transcript system. Genomics, 80, 283-94. |
| 35. | Millar, J. K., Wilson-Annan, J. C., Anderson, S.,Christie, S., Taylor, M. S. et al. (2000). Disruption of two novel genes by atranslocation co-segregating with schizophrenia. Hum Mol Genet, 9, 1415-23. |
| 36. | Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A. et al. (2003). The SWISS-PROT protein knowledgebase and itssupplement TrEMBL in 2003. Nucl. Acids Res.,31, 365-370. |
| 37. | Sander, C. & Schneider, R. (1991). Database ofhomology-derived structures and the structural meaning of sequence alignment. Proteins, 9, 56-68. |
| 38. | Rost, B. (1999). Twilight zone of protein sequencealignments. Prot. Engin., 12,85-94. |
| 39. | Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J.,Ostell, J. & Wheeler, D. L. (2005). GenBank. Nucl. Acids Res., 33, D34-8. |
| 40. | McGinnis, S. & Madden, T. L. (2004). BLAST: at thecore of a powerful and diverse set of sequence analysis tools. Nucl. AcidsRes., 32, W20-5. |
| 41. | Joachims, T. (1998). Making large-Scale SVM LearningPractical. In Advances in Kernel Methods: Support Vector Learning (Schlkopf ,B., C., B. C. J. & J., S. A., eds.), pp. 169-184, The MIT press, Cambridge,MA. |
| 42. | Kyte, J. & Doolittle, R. F. (1982). A simple methodfor displaying the hydropathic character of a protein. J. Mol. Biol., 157, 105-32. |
| 43. | Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang,J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation ofprotein database search programs. Nucl. Acids Res., 25, 3389-402.. |
| 44. | van Rijsbergen, C. J. (1979). Information retrieval.Butterworths, London. |
| Contact: server@rostlab.org | Version: Aug 22, 2006 |