1
Annotation depth confounds direct comparison of gene expression across species. BMC Bioinformatics 2021; 22:499. [PMID: 34654362 PMCID: PMC8518172 DOI: 10.1186/s12859-021-04414-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/02/2021] [Accepted: 09/30/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Comparisons of the molecular framework among organisms can be made at both structural and functional levels. One of the most common top-down approaches for functional comparison is RNA sequencing. Estimating organismal transcriptional responses in this way is of interest for understanding the evolution of molecular activity, which bears on questions ranging from basic biology to pre-clinical species selection and translation. However, direct comparison between species is often hindered by evolutionary divergence in the structure of the molecular framework, as well as by large differences in the depth of our understanding of the genetic background between humans and other species. Here, we focus on the latter. We attempt to understand how differences in transcriptome annotation affect direct gene abundance comparisons between species. RESULTS We examine and suggest some straightforward approaches for direct comparison given the currently available tools, using a sample dataset from human, cynomolgus monkey, dog, rat and mouse with a common quantitation and normalization approach. In addition, we examine how variation in genome annotation depth and quality across species may affect these direct comparisons. CONCLUSIONS Our findings suggest that further efforts toward better genome annotation or computational normalization tools may be of strong interest.
2
Zhou Y, Yang B, Wang J, Zhu J, Tian G. A scaling-free minimum enclosing ball method to detect differentially expressed genes for RNA-seq data. BMC Genomics 2021; 22:479. [PMID: 34174824 PMCID: PMC8234728 DOI: 10.1186/s12864-021-07790-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/17/2020] [Accepted: 06/10/2021] [Indexed: 12/13/2022] Open
Abstract
Background Identifying differentially expressed genes between the same or different species is an urgent demand for biological and medical research. For RNA-seq data, systematic technical effects and different sequencing depths are usually encountered when conducting experiments. Normalization is regarded as an essential step in the discovery of biologically important changes in expression. The present methods usually involve normalization of the data with a scaling factor, followed by detection of significant genes. However, more than one scaling factor may exist because of the complexity of real data. Consequently, methods that normalize data by a single scaling factor may deliver suboptimal performance or may not even work. The development of modern machine learning techniques has provided a new perspective regarding discrimination between differentially expressed (DE) and non-DE genes. However, in reality, the non-DE genes comprise only a small set and may contain housekeeping genes (in same species) or conserved orthologous genes (in different species). Therefore, the process of detecting DE genes can be formulated as a one-class classification problem, where only non-DE genes are observed, while DE genes are completely absent from the training data. Results In this study, we transform the problem to an outlier detection problem by treating DE genes as outliers, and we propose a scaling-free minimum enclosing ball (SFMEB) method to construct a smallest possible ball to contain the known non-DE genes in a feature space. The genes outside the minimum enclosing ball can then be naturally considered to be DE genes. Compared with the existing methods, the proposed SFMEB method does not require data normalization, which is particularly attractive when the RNA-seq data include more than one scaling factor. Furthermore, the SFMEB method could be easily extended to different species without normalization.
Conclusions Simulation studies demonstrate that the SFMEB method works well in a wide range of settings, especially when the data are heterogeneous or include biological replicates. Analysis of the real data also supports the conclusion that the SFMEB method outperforms other existing competitors. The R package of the proposed method is available at https://bioconductor.org/packages/MEB. Supplementary Information The online version contains supplementary material available at (10.1186/s12864-021-07790-0).
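The outlier-detection idea in this abstract can be illustrated with a minimal sketch. It is not the SFMEB implementation: it assumes a plain Euclidean feature space (for example, one coordinate per condition), and it approximates the minimum enclosing ball with the Badoiu-Clarkson iteration, a technique chosen here for brevity. The function names `enclosing_ball` and `flag_outside` are hypothetical.

```python
import math

def enclosing_ball(points, iterations=1000):
    """Approximate the minimum enclosing ball of `points`.

    Badoiu-Clarkson iteration: repeatedly step the center toward the
    current farthest point with a shrinking step size 1/(i+1).
    This is an illustrative stand-in, not the published SFMEB solver.
    """
    center = list(points[0])
    for i in range(1, iterations + 1):
        far = max(points, key=lambda p: math.dist(p, center))
        center = [c + (f - c) / (i + 1) for c, f in zip(center, far)]
    # Use the largest training distance as the radius so that every
    # known non-DE gene lies inside the ball by construction.
    radius = max(math.dist(p, center) for p in points)
    return center, radius

def flag_outside(center, radius, candidates):
    """Return True for each candidate lying outside the ball (treated as DE)."""
    return [math.dist(p, center) > radius for p in candidates]

# Toy features for known non-DE genes (assumed values, for illustration only).
non_de = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
center, radius = enclosing_ball(non_de)
flags = flag_outside(center, radius, [(1.0, 1.0), (5.0, 5.0)])
```

In this toy case the first candidate sits near the center of the ball and the second lies far outside it, so only the second would be called DE.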
Affiliation(s)
- Yan Zhou
  - College of Mathematics and Statistics, Institute of Statistical Sciences, Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Shenzhen University, Shenzhen, China
- Bin Yang
  - College of Mathematics and Statistics, Institute of Statistical Sciences, Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Shenzhen University, Shenzhen, China
- Junhui Wang
  - School of Data Science, City University of Hong Kong, Hong Kong
- Jiadi Zhu
  - College of Mathematics and Statistics, Institute of Statistical Sciences, Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Shenzhen University, Shenzhen, China
- Guoliang Tian
  - Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China
3
Chowdhury HA, Bhattacharyya DK, Kalita JK. Differential Expression Analysis of RNA-seq Reads: Overview, Taxonomy, and Tools. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:566-586. [PMID: 30281477 DOI: 10.1109/tcbb.2018.2873010] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 06/08/2023]
Abstract
Analysis of RNA-sequence (RNA-seq) data is widely used in transcriptomic studies and it has many applications. We review RNA-seq data analysis from RNA-seq reads to the results of differential expression analysis. In addition, we perform a descriptive comparison of tools used in each step of RNA-seq data analysis along with a discussion of important characteristics of these tools. A taxonomy of tools is also provided. A discussion of issues in quality control and visualization of RNA-seq data is also included along with useful tools. Finally, we provide some guidelines for the RNA-seq data analyst, along with research issues and challenges which should be addressed.
4
Zhou Y, Wan X, Zhang B, Tong T. Classifying next-generation sequencing data using a zero-inflated Poisson model. Bioinformatics 2018; 34:1329-1335. [PMID: 29186294 DOI: 10.1093/bioinformatics/btx768] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 07/18/2017] [Accepted: 11/24/2017] [Indexed: 11/14/2022] Open
Abstract
Motivation With the development of high-throughput techniques, RNA-sequencing (RNA-seq) is becoming increasingly popular as an alternative for gene expression analysis, such as RNA profiling and classification. Identifying which type of disease a new patient belongs to from RNA-seq data has been recognized as a vital problem in medical research. As RNA-seq data are discrete, statistical methods developed for classifying microarray data cannot be readily applied to RNA-seq data classification. Witten proposed a Poisson linear discriminant analysis (PLDA) to classify RNA-seq data in 2011. Note, however, that count datasets are frequently characterized by excess zeros in real RNA-seq or microRNA sequence data (e.g. when the sequencing depth is insufficient, or for small RNAs with a length of 18-30 nucleotides). Therefore, it is desirable to develop a new model to analyze RNA-seq data with an excess of zeros. Results In this paper, we propose a Zero-Inflated Poisson Logistic Discriminant Analysis (ZIPLDA) for RNA-seq data with an excess of zeros. The new method assumes that the data are from a mixture of two distributions: one is a point mass at zero, and the other follows a Poisson distribution. We then consider a logistic relation between the probability of observing zeros and the mean of the genes and the sequencing depth in the model. Simulation studies show that the proposed method performs better than, or at least as well as, the existing methods in a wide range of settings. Two real datasets, including a breast cancer RNA-seq dataset and a microRNA-seq dataset, are also analyzed, and the results agree with the simulations in showing that the proposed method outperforms the existing competitors. Availability and implementation The software is available at http://www.math.hkbu.edu.hk/~tongt. Contact xwan@comp.hkbu.edu.hk or tongt@hkbu.edu.hk. Supplementary information Supplementary data are available at Bioinformatics online.
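The mixture described in this abstract, a point mass at zero combined with a Poisson component, has a simple probability mass function, sketched below. The logistic link in `zero_probability` is only an assumed form: the abstract states that the zero probability depends logistically on the gene mean and the sequencing depth, but gives no parameterization, so the coefficients `a`, `b` and `c` are hypothetical placeholders.

```python
import math

def poisson_pmf(k, lam):
    """Standard Poisson probability of observing count k with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zero_probability(mean, depth, a=0.0, b=-0.5, c=-0.5):
    """Hypothetical logistic link for the excess-zero probability.

    The coefficients a, b, c are illustrative assumptions, not values
    or a functional form taken from the paper.
    """
    eta = a + b * mean + c * depth
    return 1.0 / (1.0 + math.exp(-eta))

def zip_pmf(k, lam, pi_zero):
    """Zero-inflated Poisson: with probability pi_zero emit a structural
    zero, otherwise draw from Poisson(lam)."""
    poisson_part = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + poisson_part if k == 0 else poisson_part

# Excess zeros: P(X = 0) is larger than under a plain Poisson model.
p_zero_zip = zip_pmf(0, 3.0, 0.2)
p_zero_poisson = poisson_pmf(0, 3.0)
total_mass = sum(zip_pmf(k, 3.0, 0.2) for k in range(60))
```

With these assumed negative coefficients, the zero probability falls as the gene mean or the depth grows, which matches the intuition that shallow sequencing produces more zeros.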
Affiliation(s)
- Yan Zhou
  - College of Mathematics and Statistics, Institute of Statistical Sciences, Shenzhen University, Shenzhen 518060, China
- Xiang Wan
  - Department of Computer Science, and Institute of Computational and Theoretical Studies, Hong Kong Baptist University, Kowloon Tong, Hong Kong
- Baoxue Zhang
  - School of Statistics, Capital University of Economics and Business, Beijing 100070, China
- Tiejun Tong
  - Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
5
A statistical normalization method and differential expression analysis for RNA-seq data between different species. BMC Bioinformatics 2019; 20:163. [PMID: 30925894 PMCID: PMC6441199 DOI: 10.1186/s12859-019-2745-1] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 09/06/2018] [Accepted: 03/18/2019] [Indexed: 02/06/2023] Open
Abstract
Background High-throughput techniques bring novel tools and also statistical challenges to genomic research. Identifying genes with differential expression between different species is an effective way to discover evolutionarily conserved transcriptional responses. To remove systematic variation between different species for a fair comparison, normalization serves as a crucial pre-processing step that adjusts for the varying sample sequencing depths and other confounding technical effects. Results In this paper, we propose a scale based normalization (SCBN) method by taking into account the available knowledge of conserved orthologous genes and by using the hypothesis testing framework. Considering the different gene lengths and unmapped genes between different species, we formulate the problem from the perspective of hypothesis testing and search for the optimal scaling factor that minimizes the deviation between the empirical and nominal type I errors. Conclusions Simulation studies show that the proposed method performs significantly better than the existing competitor in a wide range of settings. An RNA-seq dataset of different species is also analyzed and it coincides with the conclusion that the proposed method outperforms the existing method. For practical applications, we have also developed an R package named “SCBN”, which is freely available at http://www.bioconductor.org/packages/devel/bioc/html/SCBN.html. Electronic supplementary material The online version of this article (10.1186/s12859-019-2745-1) contains supplementary material, which is available to authorized users.
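The search described in this abstract, choosing the scaling factor whose empirical type I error on conserved genes is closest to the nominal level, can be sketched as a grid search. The per-gene test below is an assumed stand-in, since the abstract does not specify the statistic: conditional on the total count, the species-1 count is treated as binomial and tested with a normal approximation. All function names, the candidate grid and the toy data are hypothetical.

```python
import math

def two_sided_p(x, y, p_null):
    """Normal-approximation binomial test of x out of n = x + y.
    An assumed stand-in for the paper's test statistic."""
    n = x + y
    if n == 0:
        return 1.0
    z = (x - n * p_null) / math.sqrt(n * p_null * (1.0 - p_null))
    return math.erfc(abs(z) / math.sqrt(2.0))

def empirical_type1(genes, depth_x, depth_y, scale, alpha):
    """Fraction of conserved genes rejected at level alpha for a given scale.
    Each gene is (count_x, count_y, length_x, length_y)."""
    rejections = 0
    for x, y, len_x, len_y in genes:
        exposure_x = len_x * depth_x
        exposure_y = scale * len_y * depth_y
        p_null = exposure_x / (exposure_x + exposure_y)
        if two_sided_p(x, y, p_null) < alpha:
            rejections += 1
    return rejections / len(genes)

def search_scale(genes, depth_x, depth_y, candidates, alpha=0.05):
    """Pick the candidate scale whose empirical type I error on the
    conserved genes is closest to the nominal level alpha."""
    return min(
        candidates,
        key=lambda c: abs(empirical_type1(genes, depth_x, depth_y, c, alpha) - alpha),
    )

# Toy conserved orthologs: species 2 counts are twice species 1 counts.
conserved = [(100, 200, 1000, 1000), (50, 100, 500, 500), (300, 600, 2000, 2000)]
best = search_scale(conserved, 1e6, 1e6, [0.5, 1.0, 2.0, 4.0])
```

In this toy data every wrong candidate rejects all three conserved genes, while the scale of 2 rejects none, so the search settles on 2. Real data would need a finer grid and a test that accounts for overdispersion.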
6
Athanasiadou R, Neymotin B, Brandt N, Wang W, Christiaen L, Gresham D, Tranchina D. A complete statistical model for calibration of RNA-seq counts using external spike-ins and maximum likelihood theory. PLoS Comput Biol 2019; 15:e1006794. [PMID: 30856174 PMCID: PMC6428340 DOI: 10.1371/journal.pcbi.1006794] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 08/07/2018] [Revised: 03/21/2019] [Accepted: 01/16/2019] [Indexed: 01/09/2023] Open
Abstract
A fundamental assumption, common to the vast majority of high-throughput transcriptome analyses, is that the expression of most genes is unchanged among samples and that total cellular RNA remains constant. As the number of analyzed experimental systems increases, however, different independent studies demonstrate that this assumption is often violated. We present a calibration method using RNA spike-ins that allows for the measurement of absolute cellular abundance of RNA molecules. We apply the method to pooled RNA from cell populations of known sizes. For each transcript, we compute a nominal abundance that can be converted to absolute by dividing by a scale factor determined in separate experiments: the yield coefficient of the transcript relative to that of a reference spike-in measured with the same protocol. The method is derived by maximum likelihood theory in the context of a complete statistical model for sequencing counts contributed by cellular RNA and spike-ins. The counts are based on a sample from a fixed number of cells to which a fixed population of spike-in molecules has been added. We illustrate and evaluate the method with applications to two global expression data sets, one from the model eukaryote Saccharomyces cerevisiae proliferating at different growth rates, and one from differentiating cardiopharyngeal cell lineages in the chordate Ciona robusta. We tested the method in a technical replicate dilution study, and in a k-fold validation study.
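The conversion described in this abstract, from nominal to absolute abundance by dividing by a relative yield coefficient, amounts to simple arithmetic. The sketch below assumes the simplest ratio form for the nominal abundance (transcript counts relative to spike-in counts, scaled by spike-in molecules per cell). The paper derives its estimator by maximum likelihood, so this ratio is an illustrative assumption, and all names and numbers are hypothetical.

```python
def nominal_abundance(transcript_counts, spike_counts, spike_molecules, n_cells):
    """Assumed ratio estimator: molecules per cell implied by the
    transcript-to-spike-in count ratio, before any yield correction."""
    return (transcript_counts / spike_counts) * spike_molecules / n_cells

def absolute_abundance(nominal, relative_yield):
    """Convert nominal to absolute abundance by dividing by the yield
    coefficient of the transcript relative to the reference spike-in."""
    return nominal / relative_yield

# Toy numbers: 200 transcript reads, 1000 spike-in reads,
# 1e6 spike-in molecules added to 1e5 cells.
nominal = nominal_abundance(200, 1000, 1e6, 1e5)
absolute = absolute_abundance(nominal, relative_yield=0.5)
```

A relative yield below 1 means the transcript is recovered less efficiently than the spike-in, so its absolute abundance is higher than the nominal value suggests.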
Affiliation(s)
- Rodoniki Athanasiadou
  - Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York, United States of America
- Benjamin Neymotin
  - Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York, United States of America
- Nathan Brandt
  - Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York, United States of America
- Wei Wang
  - Center for Developmental Genetics, Department of Biology, New York University, New York, New York, United States of America
- Lionel Christiaen
  - Center for Developmental Genetics, Department of Biology, New York University, New York, New York, United States of America
- David Gresham
  - Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York, United States of America
- Daniel Tranchina
  - Department of Biology, New York University, New York, New York, United States of America
  - Courant Institute of Mathematical Sciences, New York University, New York, New York, United States of America
7
Liu P, Yang X, Zhang H, Pu J, Wei K. Analysis of change in microRNA expression profiles of lung cancer A549 cells treated with Radix tetrastigma hemsleyani flavonoids. Onco Targets Ther 2018; 11:4283-4300. [PMID: 30100735 PMCID: PMC6065472 DOI: 10.2147/ott.s164276] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 01/01/2023] Open
Abstract
Background The aim of this study was to determine the inhibitory effects of Radix tetrastigma hemsleyani (RTH) flavonoids on human lung adenocarcinoma A549 cells and the underlying molecular mechanism. RTH is an important traditional Chinese herb that has been widely used in cancer therapy. As an important type of active substance, RTH flavones (RTHF) have been shown to have good antiproliferative effects on various cancer cells. MicroRNAs (miRNAs) are small, noncoding RNA molecules that play important roles in cancer progression and prevention. However, the miRNA profile of RTHF-treated A549 cells has not yet been studied. Materials and methods The miRNA expression profile changes of A549 cells treated with RTHF were determined using miRNA-seq analysis. Furthermore, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses of the target genes of differentially expressed miRNAs (DE-miRNAs) were carried out. Results In this study, we identified 162 miRNAs that displayed expression changes >1.2-fold in RTHF-treated A549 cells. GO analysis results showed that target genes of DE-miRNAs were significantly enriched in protein binding, binding, cell, cell part, intracellular, cellular process, single-organism process, and single-organism cellular process. Pathway analysis illustrated that target genes of DE-miRNAs are mainly involved in the endocytosis, axon guidance, lysosome, melanogenesis, and acute myeloid leukemia pathways. Conclusion These results may assist in better understanding the anticancer effects of RTHF in A549 cells.
Affiliation(s)
- Peigang Liu
  - Center for Medicinal Resources Research, Zhejiang Academy of Traditional Chinese Medicine, Hangzhou 310007, People's Republic of China
- Xu Yang
  - Center for Medicinal Resources Research, Zhejiang Academy of Traditional Chinese Medicine, Hangzhou 310007, People's Republic of China
- Hongjian Zhang
  - Center for Medicinal Resources Research, Zhejiang Academy of Traditional Chinese Medicine, Hangzhou 310007, People's Republic of China
- Jinbao Pu
  - Center for Medicinal Resources Research, Zhejiang Academy of Traditional Chinese Medicine, Hangzhou 310007, People's Republic of China
- Kemin Wei
  - Center for Medicinal Resources Research, Zhejiang Academy of Traditional Chinese Medicine, Hangzhou 310007, People's Republic of China
8
Zhou Y, Wang J, Zhao Y, Tong T. Discriminant Analysis and Normalization Methods for Next-Generation Sequencing Data. New Frontiers of Biostatistics and Bioinformatics 2018. [DOI: 10.1007/978-3-319-99389-8_18] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/03/2022]