1. Chen S, Wang P, Guo H, Zhang Y. Deciphering gene expression patterns using large-scale transcriptomic data and its applications. Brief Bioinform 2024; 25:bbae590. [PMID: 39541191 PMCID: PMC11562847 DOI: 10.1093/bib/bbae590] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2024] [Revised: 10/07/2024] [Accepted: 10/31/2024] [Indexed: 11/16/2024]
Abstract
Gene expression varies stochastically across genders, racial groups, and health statuses. Deciphering these patterns is crucial for identifying informative genes, classifying samples, and understanding diseases such as cancer. This study analyzes 11,252 bulk RNA-seq samples (10,512 cancer tissue samples and 740 normal samples) to explore the expression patterns of 19,156 genes. Additionally, 4,884 single-cell RNA-seq samples are examined. Statistical analysis using 16 probability distributions shows that normal samples display a wider range of distributions than cancer samples. Cancer samples tend to favor asymmetric distributions such as the generalized extreme value, log-normal, and Gaussian mixture distributions. In contrast, certain genes in normal samples exhibit symmetric distributions. Remarkably, more than 95.5% of genes exhibit non-normal distributions, which challenges traditional assumptions. Furthermore, distributions differ significantly between bulk and single-cell RNA-seq data. Many cancer driver genes exhibit distinct distribution patterns across sample types, suggesting potential for gene selection and classification based on distribution characteristics. A novel skewness-based metric is proposed to quantify distribution variation across datasets, and genes with significant skewness differences are shown to have biological relevance. Finally, an improved naïve Bayes method incorporating gene-specific distributions demonstrates superior performance over traditional methods in simulations. This work enhances understanding of gene expression and its application in omics-based gene selection and sample classification.
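The abstract above does not define its skewness-based metric, so the following is only a minimal sketch of the general idea, assuming the metric compares a gene's sample skewness between two groups. The function names, the use of the biased Fisher-Pearson coefficient, and the toy expression values are all illustrative assumptions, not the authors' method or data.

```python
def sample_skewness(values):
    """Biased Fisher-Pearson skewness g1 = m3 / m2**1.5 from central moments."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values to estimate skewness")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        # Skewness is undefined for constant data; 0.0 is a convention chosen here.
        return 0.0
    return m3 / m2 ** 1.5


def skewness_difference(group_a, group_b):
    """Hypothetical stand-in metric: absolute gap between two groups' skewness."""
    return abs(sample_skewness(group_a) - sample_skewness(group_b))


# Toy expression values for one hypothetical gene (not real data).
normal_like = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0]  # symmetric around 5
cancer_like = [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 9.0]  # long right tail

print(sample_skewness(normal_like))
print(sample_skewness(cancer_like))
print(skewness_difference(normal_like, cancer_like))
```

Under this assumed definition, a gene whose expression is symmetric in one group and right-skewed in the other receives a large score, which is the kind of distribution shift the abstract describes between normal and cancer samples.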
Affiliation(s)
- Shunjie Chen
- School of Mathematics and Statistics, Henan University, Jinming Avenue, 475004, Kaifeng, China
- Pei Wang
- School of Mathematics and Statistics, Henan University, Jinming Avenue, 475004, Kaifeng, China
- Henan Engineering Research Center for Industrial Internet of Things, Henan University, Mingli Road, 450046, Zhengzhou, China
- Haiping Guo
- School of Mathematics and Statistics, Henan University, Jinming Avenue, 475004, Kaifeng, China
- Yujie Zhang
- School of Mathematics and Statistics, Henan University, Jinming Avenue, 475004, Kaifeng, China
2. Jiao Z, Lai Y, Kang J, Gong W, Ma L, Jia T, Xie C, Xiang S, Cheng W, Heinz A, Desrivières S, Schumann G, Sun F, Feng J. A model-based approach to assess reproducibility for large-scale high-throughput MRI-based studies. Neuroimage 2022; 255:119166. [PMID: 35398282 DOI: 10.1016/j.neuroimage.2022.119166] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/11/2021] [Revised: 03/26/2022] [Accepted: 03/30/2022] [Indexed: 12/21/2022]
Abstract
Magnetic Resonance Imaging (MRI) technology has been increasingly used in neuroscience studies. The reproducibility of statistically significant findings generated by MRI-based studies, especially association studies (phenotype vs. MRI metric) and task-induced brain activation, has recently been heavily debated. However, most currently available reproducibility measures depend on thresholds for the test statistics and cannot be used to evaluate overall study reproducibility. It is also crucial to elucidate the relationship between overall study reproducibility and sample size in an experimental design. In this study, we proposed a model-based reproducibility index to quantify reproducibility, which can be used in large-scale high-throughput MRI-based studies including both association studies and task-induced brain activation. We performed model-based reproducibility assessments for a few association studies and task-induced brain activation using several recent large sMRI/fMRI databases. For large-sample association studies between brain structure/function features and some basic physiological phenotypes (i.e., sex and BMI), we demonstrated that the model-based reproducibility of these studies is more than 0.99. Similar results were observed for MID task activation. Furthermore, we proposed a model-based analytical tool to evaluate the minimal sample size needed to achieve a desired model-based reproducibility. Additionally, we evaluated the model-based reproducibility of gray matter volume (GMV) changes for UK Biobank (UKB) vs. the Parkinson's Progression Markers Initiative (PPMI) and for UKB vs. the Human Connectome Project (HCP). We demonstrated that both sample size and study-specific experimental factors play important roles in the model-based reproducibility assessments for different experiments. In summary, a systematic assessment of reproducibility is fundamental to current large-scale high-throughput MRI-based studies.
Affiliation(s)
- Zeyu Jiao
- Shanghai Center for Mathematical Sciences, Fudan University, 220 Handan Road, Shanghai, China; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China
- Yinglei Lai
- School of Mathematical Sciences, University of Science and Technology of China, 96 Jinzhai Road, Hefei, Anhui 230026, China
- Jujiao Kang
- Shanghai Center for Mathematical Sciences, Fudan University, 220 Handan Road, Shanghai, China; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China
- Weikang Gong
- Center for Functional MRI of the Brain (FMRIB), Nuffield Department of Clinical Neurosciences, Wellcome Center for Integrative Neuroimaging, University of Oxford, Oxford, United Kingdom
- Liang Ma
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- Tianye Jia
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China; Center for Population Neuroscience and Precision Medicine (PONS), Institute of Psychiatry, Psychology and Neuroscience, SGDP Center, King's College London, United Kingdom
- Chao Xie
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China
- Shitong Xiang
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China
- Wei Cheng
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China
- Andreas Heinz
- Department of Psychiatry and Psychotherapy CCM, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Berlin, Germany
- Sylvane Desrivières
- Center for Population Neuroscience and Precision Medicine (PONS), Institute of Psychiatry, Psychology and Neuroscience, SGDP Center, King's College London, United Kingdom
- Gunter Schumann
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Center for Population Neuroscience and Precision Medicine (PONS), Institute of Psychiatry, Psychology and Neuroscience, SGDP Center, King's College London, United Kingdom; PONS Research Group, Department of Psychiatry and Psychotherapy, Campus Charite Mitte, Humboldt University, Berlin, Germany
- Fengzhu Sun
- Quantitative and Computational Biology Department, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, United States
- Jianfeng Feng
- Shanghai Center for Mathematical Sciences, Fudan University, 220 Handan Road, Shanghai, China; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China; Department of Computer Science, University of Warwick, Coventry CV4 7AL, United Kingdom; School of Life Science and the Collaborative Innovation Center for Brain Science, Fudan University, Shanghai, China.
3. Qin W, Wang X, Zhao H, Lu H. A Novel Joint Gene Set Analysis Framework Improves Identification of Enriched Pathways in Cross Disease Transcriptomic Analysis. Front Genet 2019; 10:293. [PMID: 31031796 PMCID: PMC6473067 DOI: 10.3389/fgene.2019.00293] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 12/21/2018] [Accepted: 03/19/2019] [Indexed: 12/25/2022]
Abstract
Motivation: Gene set enrichment analysis is a widely accepted expression analysis tool that aims to detect coordinated expression changes within pre-defined gene sets rather than in individual genes. Compared with individual differentially expressed (DE) gene analysis, gene set analysis yields more reproducible and interpretable results and can detect small but consistent changes within a gene set that DE gene analysis would miss. There have been many successful gene set analysis applications in human diseases. However, when the sample size of a disease study is small and no other public data sets of the same disease are available, there is insufficient power to detect pathways of importance to the disease. Results: We have developed a novel joint gene set analysis statistical framework that aims to improve the power of identifying enriched gene sets by integrating multiple similar disease data sets. Through comprehensive simulation studies, we demonstrated that our proposed framework obtained much better AUC scores than single data set analysis and another meta-analysis method in identifying enriched pathways. When applied to two real data sets, the proposed framework retained the enriched gene sets identified by single data set analysis and uniquely identified up to 200% more disease-related gene sets, demonstrating the improved identification power gained from information shared between similar diseases. We expect that the proposed framework will enable researchers to better explore public data sets when the sample size of their study is limited.
Affiliation(s)
- Wenyi Qin
- Center for Biomedical Informatics, Shanghai Children's Hospital, Shanghai Jiaotong University, Shanghai, China
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, United States
- Department of Genetics, School of Medicine, Yale University, New Haven, CT, United States
- Xujun Wang
- Department of Bioinformatics and Biostatistics, SJTU-Yale Joint Center for Biostatistics, Shanghai Jiaotong University, Shanghai, China
- Hongyu Zhao
- Department of Bioinformatics and Biostatistics, SJTU-Yale Joint Center for Biostatistics, Shanghai Jiaotong University, Shanghai, China
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, United States
- Hui Lu
- Center for Biomedical Informatics, Shanghai Children's Hospital, Shanghai Jiaotong University, Shanghai, China
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, United States
- Department of Bioinformatics and Biostatistics, SJTU-Yale Joint Center for Biostatistics, Shanghai Jiaotong University, Shanghai, China
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, United States
4. Shen C, Ding Y, Tang J, Guo F. Multivariate Information Fusion With Fast Kernel Learning to Kernel Ridge Regression in Predicting LncRNA-Protein Interactions. Front Genet 2019; 9:716. [PMID: 30697228 PMCID: PMC6340980 DOI: 10.3389/fgene.2018.00716] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 09/22/2018] [Accepted: 12/21/2018] [Indexed: 12/31/2022]
Abstract
Long non-coding RNAs (lncRNAs) constitute a large class of transcribed RNA molecules that are longer than 200 nucleotides and do not encode proteins. They play an important role in regulating gene expression by interacting with the homologous RNA-binding proteins. Because wet-lab experimental methods are laborious and time-consuming, computational approaches for predicting lncRNA-protein interactions (LPIs) deserve greater attention. An in-depth review of state-of-the-art in silico investigations leads to the conclusion that there is still room to improve both accuracy and speed. This paper proposes a novel method for identifying LPIs that employs Kernel Ridge Regression based on Fast Kernel Learning (LPI-FKLKRR). The approach uses four distinct similarity measures each for the lncRNA space and the protein space. Notably, Gene Ontology (GO) information is extracted for proteins to improve the quality of information in the protein space. When the heterogeneous kernels are integrated, Fast Kernel Learning (FastKL) handles the weight optimization. Kernel Ridge Regression (KRR) is then applied to obtain the final predicted associations. Experimental results show that LPI-FKLKRR performs markedly better than existing LPI prediction schemes. On a benchmark dataset, the proposed LPI-FKLKRR model obtains the best Area Under the Precision-Recall Curve (AUPR) of 0.6950, outperforming the integrated LPLNP (AUPR: 0.4584), RWR (AUPR: 0.2827), CF (AUPR: 0.2357), LPIHN (AUPR: 0.2299), and LPBNI (AUPR: 0.3302). Combined with the experimental results of a case study on a novel dataset, this suggests that LPI-FKLKRR will be a useful tool for LPI prediction.
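The kernel-combination and ridge-regression steps described above can be illustrated with a small sketch. It is not the LPI-FKLKRR implementation: the kernel weights are fixed by hand rather than learned by FastKL, a single feature space is used instead of separate lncRNA and protein spaces, and the function names, RBF kernels, toy points, and labels are all assumptions made for illustration.

```python
import math


def rbf_kernel(points, gamma):
    """Gaussian (RBF) kernel matrix for a list of feature vectors."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q)))
             for q in points] for p in points]


def combine_kernels(kernels, weights):
    """Weighted sum of same-sized kernel matrices (weights assumed given)."""
    n = len(kernels[0])
    return [[sum(w * k[i][j] for k, w in zip(kernels, weights))
             for j in range(n)] for i in range(n)]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def krr_fit(kernel, targets, ridge):
    """Dual coefficients alpha = (K + ridge * I)^-1 y."""
    n = len(targets)
    regularized = [[kernel[i][j] + (ridge if i == j else 0.0)
                    for j in range(n)] for i in range(n)]
    return solve(regularized, targets)


def krr_fitted_values(kernel, alpha):
    """Scores for the training items: K @ alpha."""
    return [sum(k * a for k, a in zip(row, alpha)) for row in kernel]


# Toy data: four one-dimensional items with hypothetical interaction labels.
points = [[0.0], [1.0], [2.0], [3.0]]
labels = [1.0, 1.0, 0.0, 0.0]

# Two similarity views of the same items, merged with hand-picked weights.
kernel = combine_kernels([rbf_kernel(points, 0.5), rbf_kernel(points, 2.0)],
                         [0.5, 0.5])
alpha = krr_fit(kernel, labels, ridge=0.1)
scores = krr_fitted_values(kernel, alpha)
print(scores)
```

In a fuller pipeline the hand-picked weights would be replaced by a kernel-learning step, and the scores would be ranked against held-out interactions to compute a metric such as AUPR.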
Affiliation(s)
- Cong Shen
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Yijie Ding
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Jijun Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China; Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States
- Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
5. Qin W, Lu H. A novel joint analysis framework improves identification of differentially expressed genes in cross disease transcriptomic analysis. BioData Min 2018; 11:3. [PMID: 29467826 PMCID: PMC5819186 DOI: 10.1186/s13040-018-0163-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 11/11/2017] [Accepted: 01/29/2018] [Indexed: 11/22/2022]
Abstract
Motivation: Detecting differentially expressed (DE) genes between a disease group and a normal control group is one of the most common analyses of genome-wide transcriptomic data. Since most studies do not have many samples, researchers have used meta-analysis to pool different datasets for the same disease. Even then, in many cases the statistical power is still insufficient. Because many diseases share the same disease genes, it is desirable to design a statistical framework that can identify the diseases' common and specific DE genes simultaneously to improve identification power. Results: We developed a novel empirical Bayes based mixture model to identify DE genes in a specific study by leveraging the information shared across multiple expression data sets from different diseases. The effectiveness of joint analysis was demonstrated through comprehensive simulation studies and two real data applications. The simulation results showed that our method consistently outperformed single data set analysis and two other meta-analysis methods in identification power. In real data analysis, our method overall demonstrated better identification power in detecting DE genes and prioritized more disease-related genes and disease-related pathways than single data set analysis. In the application to Huntington's disease, our method identified over 150% more disease-related genes. We expect that our method will provide researchers with a new way of utilizing available data sets from different diseases when the sample size for the disease of interest is limited. Electronic supplementary material: The online version of this article (10.1186/s13040-018-0163-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Wenyi Qin
- Department of Bioengineering, University of Illinois at Chicago, 851 S. Morgan, Rm 218, Chicago, IL 60607 USA
- Hui Lu
- Department of Bioengineering, University of Illinois at Chicago, 851 S. Morgan, Rm 218, Chicago, IL 60607 USA; SJTU-Yale Joint Center for Biostatistics, Department of Bioinformatics and Biostatistics, Shanghai Jiaotong University, Shanghai, China; Shanghai Engineering Research Center for Big Data in Pediatric Precision Medicine, Shanghai, China