Li Z, Zhou P, Kwon E, Fitzgerald KA, Weng Z, Zhou C. Flnc: Machine Learning Improves the Identification of Novel Long Noncoding RNAs from Stand-Alone RNA-Seq Data. Noncoding RNA 2022; 8:70. [PMID: 36287122 PMCID: PMC9607125 DOI: 10.3390/ncrna8050070] [Received: 09/12/2022] [Revised: 10/01/2022] [Accepted: 10/06/2022] [Indexed: 01/16/2025]
Abstract
Long noncoding RNAs (lncRNAs) play critical regulatory roles in human development and disease. Although there are over 100,000 samples with available RNA sequencing (RNA-seq) data, many lncRNAs have yet to be annotated. The conventional approach to identifying novel lncRNAs from RNA-seq data is to find transcripts without coding potential, but this approach has a false discovery rate of 30-75%. Other existing methods either identify only multi-exon lncRNAs, missing single-exon lncRNAs, or require transcriptional initiation profiling data (such as H3K4me3 ChIP-seq data), which are unavailable for many samples with RNA-seq data. Because of these limitations, current methods cannot accurately identify novel lncRNAs from existing RNA-seq data. To address this problem, we have developed software, Flnc, to accurately identify both novel and annotated full-length lncRNAs, including single-exon lncRNAs, directly from RNA-seq data without requiring transcriptional initiation profiles. Flnc integrates machine learning models built by incorporating four types of features: transcript length, promoter signature, multiple exons, and genomic location. Flnc achieves state-of-the-art prediction power with an AUROC score over 0.92. Flnc significantly improves prediction accuracy, from less than 50% with the conventional approach to over 85%. Flnc is available on GitHub.
Affiliation(s)
- Zixiu Li
  - Division of Biostatistics and Health Services Research, Department of Population and Quantitative Health Sciences, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
- Peng Zhou
  - Division of Biostatistics and Health Services Research, Department of Population and Quantitative Health Sciences, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
- Euijin Kwon
  - Division of Biostatistics and Health Services Research, Department of Population and Quantitative Health Sciences, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
- Katherine A. Fitzgerald
  - Program in Innate Immunity, Division of Infectious Disease and Immunology, Department of Medicine, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
- Zhiping Weng
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
- Chan Zhou
  - Division of Biostatistics and Health Services Research, Department of Population and Quantitative Health Sciences, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
  - The RNA Therapeutics Institute, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
  - UMass Cancer Center, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA