1
|
Costa-Silva J, Domingues DS, Menotti D, Hungria M, Lopes FM. Temporal progress of gene expression analysis with RNA-Seq data: A review on the relationship between computational methods. Comput Struct Biotechnol J 2022; 21:86-98. [PMID: 36514333 PMCID: PMC9730150 DOI: 10.1016/j.csbj.2022.11.051] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2022] [Revised: 11/25/2022] [Accepted: 11/25/2022] [Indexed: 12/03/2022] Open
Abstract
Analysis of differential gene expression from RNA-seq data has become a standard for several research areas. The steps for the computational analysis include many data types and file formats, and a wide variety of computational tools that can be applied alone or together as pipelines. This paper presents a review of the differential expression analysis pipeline, addressing its steps and the respective objectives, the principal methods available in each step, and their properties, therefore introducing an organized overview to this context. This review aims to address mainly the aspects involved in the differentially expressed gene (DEG) analysis from RNA sequencing data (RNA-seq), considering the computational methods. In addition, a timeline of the computational methods for DEG is shown and discussed, and the relationships existing between the most important computational tools are presented by an interaction network. A discussion on the challenges and gaps in DEG analysis is also highlighted in this review. This paper will serve as a tutorial for new entrants into the field and help established users update their analysis pipelines.
Collapse
Affiliation(s)
- Juliana Costa-Silva
- Department of Informatics – Federal University of Paraná, Rua Coronel Francisco Heráclito dos Santos, 100, 81531-990 Curitiba, Paraná, Brazil
| | - Douglas S. Domingues
- Department of Genetics, “Luiz de Queiroz” College of Agriculture, University of São Paulo, Av. Pádua Dias, 11, 13418-900 Piracicaba, São Paulo, Brazil
| | - David Menotti
- Department of Informatics – Federal University of Paraná, Rua Coronel Francisco Heráclito dos Santos, 100, 81531-990 Curitiba, Paraná, Brazil
| | - Mariangela Hungria
- Department of Soil Biotecnology - Embrapa Soybean, Cx. Postal 231, 86000-970 Londrina, Paraná, Brazil
| | - Fabrício Martins Lopes
- Department of Computer Science, Universidade Tecnológica Federal do Paraná – UTFPR, Av. Alberto Carazzai, 1640, 86300-000, Cornélio Procópio, Paraná, Brazil
| |
Collapse
|
2
|
Yin H, Tao J, Peng Y, Xiong Y, Li B, Li S, Yang H. MSPJ: Discovering potential biomarkers in small gene expression datasets via ensemble learning. Comput Struct Biotechnol J 2022; 20:3783-3795. [PMID: 35891786 PMCID: PMC9304602 DOI: 10.1016/j.csbj.2022.07.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2022] [Revised: 07/10/2022] [Accepted: 07/11/2022] [Indexed: 11/24/2022] Open
Abstract
In transcriptomics, differentially expressed genes (DEGs) provide fine-grained phenotypic resolution for comparisons between groups and insights into molecular mechanisms underlying the pathogenesis of complex diseases or phenotypes. The robust detection of DEGs from large datasets is well-established. However, owing to various limitations (e.g., the low availability of samples for some diseases or limited research funding), small sample size is frequently used in experiments. Therefore, methods to screen reliable and stable features are urgently needed for analyses with limited sample size. In this study, MSPJ, a new machine learning approach for identifying DEGs was proposed to mitigate the reduced power and improve the stability of DEG identification in small gene expression datasets. This ensemble learning-based method consists of three algorithms: an improved multiple random sampling with meta-analysis, SVM-RFE (support vector machines-recursive feature elimination), and permutation test. MSPJ was compared with ten classical methods by 94 simulated datasets and large-scale benchmarking with 165 real datasets. The results showed that, among these methods MSPJ had the best performance in most small gene expression datasets, especially those with sample size below 30. In summary, the MSPJ method enables effective feature selection for robust DEG identification in small transcriptome datasets and is expected to expand research on the molecular mechanisms underlying complex diseases or phenotypes.
Collapse
Key Words
- AUC, area under the ROC curve (AUC)
- DEGs, differentially expressed genes
- Differentially expressed genes
- FDR, false positive rate
- Feature selection
- GA, genetic algorithm
- GEO, Gene Expression Omnibus
- GO, gene ontology
- MSPJ, the Joint method of Meta-analysis, SVM-RFE, and Permutation test
- Machine learning
- RF, random forest
- ROC, receiver operating characteristic
- Random sampling
- SAM, significance analysis of microarrays
- SMDs, standardized mean differences
- SNR, signal noise ratio
- SVM-RFE, support vector machines-recursive feature elimination
- Small sample size
- mRMR, minimum-redundancy-maximum-relevance
Collapse
Affiliation(s)
- HuaChun Yin
- Department of Neurosurgery, Xinqiao Hospital, The Army Medical University, Chongqing 400037, China.,College of Life Sciences, Chongqing Normal University, Chongqing 401331, China.,Department of Neurobiology, Chongqing Key Laboratory of Neurobiology, The Army Medical University, Chongqing 400038, China
| | - JingXin Tao
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
| | - Yuyang Peng
- Department of Neurosurgery, Xinqiao Hospital, The Army Medical University, Chongqing 400037, China
| | - Ying Xiong
- Department of Neurobiology, Chongqing Key Laboratory of Neurobiology, The Army Medical University, Chongqing 400038, China
| | - Bo Li
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
| | - Song Li
- Department of Neurosurgery, Xinqiao Hospital, The Army Medical University, Chongqing 400037, China.,Guangyang Bay Laboratory, Chongqing Institute for Brain and Intelligence, Chongqing, China
| | - Hui Yang
- Department of Neurosurgery, Xinqiao Hospital, The Army Medical University, Chongqing 400037, China.,Guangyang Bay Laboratory, Chongqing Institute for Brain and Intelligence, Chongqing, China
| |
Collapse
|
3
|
Carrillo-Perez F, Morales JC, Castillo-Secilla D, Gevaert O, Rojas I, Herrera LJ. Machine-Learning-Based Late Fusion on Multi-Omics and Multi-Scale Data for Non-Small-Cell Lung Cancer Diagnosis. J Pers Med 2022; 12:601. [PMID: 35455716 PMCID: PMC9025878 DOI: 10.3390/jpm12040601] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2022] [Revised: 03/29/2022] [Accepted: 04/06/2022] [Indexed: 01/27/2023] Open
Abstract
Differentiation between the various non-small-cell lung cancer subtypes is crucial for providing an effective treatment to the patient. For this purpose, machine learning techniques have been used in recent years over the available biological data from patients. However, in most cases this problem has been treated using a single-modality approach, not exploring the potential of the multi-scale and multi-omic nature of cancer data for the classification. In this work, we study the fusion of five multi-scale and multi-omic modalities (RNA-Seq, miRNA-Seq, whole-slide imaging, copy number variation, and DNA methylation) by using a late fusion strategy and machine learning techniques. We train an independent machine learning model for each modality and we explore the interactions and gains that can be obtained by fusing their outputs in an increasing manner, by using a novel optimization approach to compute the parameters of the late fusion. The final classification model, using all modalities, obtains an F1 score of 96.81±1.07, an AUC of 0.993±0.004, and an AUPRC of 0.980±0.016, improving those results that each independent model obtains and those presented in the literature for this problem. These obtained results show that leveraging the multi-scale and multi-omic nature of cancer data can enhance the performance of single-modality clinical decision support systems in personalized medicine, consequently improving the diagnosis of the patient.
Collapse
Affiliation(s)
- Francisco Carrillo-Perez
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18170 Granada, Spain; (J.C.M.); (I.R.); (L.J.H.)
- Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, 1265 Welch Rd, Stanford, CA 94305, USA;
| | - Juan Carlos Morales
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18170 Granada, Spain; (J.C.M.); (I.R.); (L.J.H.)
| | - Daniel Castillo-Secilla
- Fujitsu Technology Solutions S.A, CoE Data Intelligence, Camino del Cerro de los Gamos, 1, Pozuelo de Alarcón, 28224 Madrid, Spain;
| | - Olivier Gevaert
- Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, 1265 Welch Rd, Stanford, CA 94305, USA;
| | - Ignacio Rojas
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18170 Granada, Spain; (J.C.M.); (I.R.); (L.J.H.)
| | - Luis Javier Herrera
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18170 Granada, Spain; (J.C.M.); (I.R.); (L.J.H.)
| |
Collapse
|
4
|
Bajo-Morales J, Prieto-Prieto JC, Herrera LJ, Rojas I, Castillo-Secilla D. COVID-19 Biomarkers Recognition & Classification Using Intelligent Systems. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220328125029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023]
Abstract
Background:
SARS-CoV-2 has paralyzed mankind due to its high transmissibility and its associated mortality, causing millions of infections and deaths worldwide. The search for gene expression biomarkers from the host transcriptional response to infection may help understand the underlying mechanisms by which the virus causes COVID-19. This research proposes a smart methodology integrating different RNA-Seq datasets from SARS-CoV-2, other respiratory diseases, and healthy patients.
Methods:
The proposed pipeline exploits the functionality of the ‘KnowSeq’ R/Bioc package, integrating different data sources and attaining a significantly larger gene expression dataset, thus endowing the results with higher statistical significance and robustness in comparison with previous studies in the literature. A detailed preprocessing step was carried out to homogenize the samples and build a clinical decision system for SARS-CoV-2. It uses machine learning techniques such as feature selection algorithm and supervised classification system. This clinical decision system uses the most differentially expressed genes among different diseases (including SARS-Cov-2) to develop a four-class classifier.
Results:
The multiclass classifier designed can discern SARS-CoV-2 samples, reaching an accuracy equal to 91.5%, a mean F1-Score equal to 88.5%, and a SARS-CoV-2 AUC equal to 94% by using only 15 genes as predictors. A biological interpretation of the gene signature extracted reveals relations with processes involved in viral responses.
Conclusion:
This work proposes a COVID-19 gene signature composed of 15 genes, selected after applying the feature selection ‘minimum Redundancy Maximum Relevance’ algorithm. The integration among several RNA-Seq datasets was a success, allowing for a considerable large number of samples and therefore providing greater statistical significance to the results than previous studies. Biological interpretation of the selected genes was also provided.
Collapse
Affiliation(s)
- Javier Bajo-Morales
- Department of Computer Architecture and Technology, University of Granada. C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
| | - Juan Carlos Prieto-Prieto
- Nuclear Medicine Department, IMIBIC, University Hospital Reina Sofia, Menéndez Pidal Avenue, 14004, Córdoba, Spain
| | - Luis Javier Herrera
- Department of Computer Architecture and Technology, University of Granada. C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
| | - Ignacio Rojas
- Department of Computer Architecture and Technology, University of Granada. C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
| | - Daniel Castillo-Secilla
- Department of Computer Architecture and Technology, University of Granada. C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
| |
Collapse
|
5
|
Bajo-Morales J, Galvez JM, Prieto-Prieto JC, Herrera LJ, Rojas I, Castillo-Secilla D. Heterogeneous Gene Expression Cross-Evaluation of Robust Biomarkers
Using Machine Learning Techniques Applied to Lung Cancer. Curr Bioinform 2022. [DOI: 10.2174/1574893616666211005114934] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
Background:
Nowadays, gene expression analysis is one of the most promising pillars for
understanding and uncovering the mechanisms underlying the development and spread of cancer. In this
sense, Next Generation Sequencing technologies, such as RNA-Seq, are currently leading the market
due to their precision and cost. Nevertheless, there is still an enormous amount of non-analyzed data obtained
from older technologies, such as Microarray, which could still be useful to extract relevant
knowledge.
Methods:
Throughout this research, a complete machine learning methodology to cross-evaluate the
compatibility between both RNA-Seq and Microarray sequencing technologies is described and implemented.
In order to show a real application of the designed pipeline, a lung cancer case study is addressed
by considering two detected subtypes: adenocarcinoma and squamous cell carcinoma. Transcriptomic
datasets considered for our study have been obtained from the public repositories
NCBI/GEO, ArrayExpress and GDC-Portal. From them, several gene experiments have been carried
out with the aim of finding gene signatures for these lung cancer subtypes, linked to both transcriptomic
technologies. With these DEGs selected, intelligent predictive models capable of classifying new samples
belonging to these cancer subtypes have been developed.
Results:
The predictive models built using one technology are capable of discerning samples from a different
technology. The classification results are evaluated in terms of accuracy, F1-score and ROC
curves along with AUC. Finally, the biological information of the gene sets obtained and their relationship
with lung cancer are reviewed, encountering strong biological evidence linking them to the disease.
Conclusion:
Our method has the capability of finding strong gene signatures which are also independent
of the transcriptomic technology used to develop the analysis. In addition, our article highlights the
potential of using heterogeneous transcriptomic data to increase the amount of samples for the studies,
increasing the statistical significance of the results.
Collapse
Affiliation(s)
- Javier Bajo-Morales
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez
Montero, 2, 18014, Granada, Spain
| | - Juan Manuel Galvez
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez
Montero, 2, 18014, Granada, Spain
| | - Juan Carlos Prieto-Prieto
- Nuclear Medicine Department, IMIBIC, University Hospital Reina Sofia, Menéndez
Pidal Avenue, 14004, Córdoba, Spain
| | - Luis Javier Herrera
- Department of Computer Architecture and Technology, University of Granada. C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada,Spain
| | - Ignacio Rojas
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez
Montero, 2, 18014, Granada, Spain
| | - Daniel Castillo-Secilla
- Department of Computer Architecture and Technology, University of Granada. C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada,Spain
| |
Collapse
|
6
|
|
7
|
Huang G, Zhang H, Qu Y, Huang K, Gong X, Wei J, Du H. ARMT: An automatic RNA-seq data mining tool based on comprehensive and integrative analysis in cancer research. Comput Struct Biotechnol J 2021; 19:4426-4434. [PMID: 34471489 PMCID: PMC8379379 DOI: 10.1016/j.csbj.2021.08.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2021] [Revised: 07/19/2021] [Accepted: 08/06/2021] [Indexed: 11/02/2022] Open
Abstract
The comprehensive and integrative analysis of RNA-seq data, in different molecular layers from diverse samples, holds promise to address the full-scale complexity of biological systems. Recent advances in gene set variant analysis (GSVA) are providing exciting opportunities for revealing the specific biological processes of cancer samples. However, it is still urgently needed to develop a tool, which combines GSVA and different molecular characteristic analysis, as well as prognostic characteristics of cancer patients to reveal the biological processes of disease comprehensively. Here, we develop ARMT, an automatic tool for RNA-Seq data analysis. ARMT is an efficient and integrative tool with user-friendly interface to analyze related molecular characters of single gene and gene set comprehensively based on transcriptome and genomic data, which builds the bridge for deeper information between genes and pathways, to further accelerate scientific findings. ARMT can be installed easily from https://github.com/Dulab2020/ARMT.
Collapse
Affiliation(s)
- Guanda Huang
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Haibo Zhang
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Yimo Qu
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Kaitang Huang
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Xiaocheng Gong
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Jinfen Wei
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Hongli Du
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| |
Collapse
|