1
Keivanimehr AR, Akbari M. TinyML and edge intelligence applications in cardiovascular disease: A survey. Comput Biol Med 2025; 186:109653. [PMID: 39798504 DOI: 10.1016/j.compbiomed.2025.109653] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2024] [Revised: 01/01/2025] [Accepted: 01/01/2025] [Indexed: 01/15/2025]
Abstract
Tiny machine learning (TinyML) and edge intelligence have emerged as pivotal paradigms for enabling machine learning on resource-constrained devices situated at the extreme edge of networks. In this paper, we explore the transformative potential of TinyML in facilitating pervasive, low-power cardiovascular monitoring and real-time analytics for patients with cardiac anomalies, leveraging wearable devices as the primary interface. To begin with, we provide an overview of TinyML software and hardware enablers, accompanied by an examination of networking solutions such as Low-power Wide area network (LPWAN) that facilitate the seamless deployment of TinyML frameworks. Following this, we delve into the methodologies of knowledge distillation, quantization, and pruning, which represent the cornerstone strategies for optimizing machine learning models to operate efficiently within resource-constrained environments. Furthermore, our discussion extends to the role of efficient deep neural networks tailored specifically for cardiovascular monitoring on wearable devices with limited computational resources. Through a comprehensive review, we analyze the applications of prominent artificial neural network architectures including Convolutional Neural Networks (CNNs), Autoencoders, Deep Belief Networks (DBNs), and Transformers in the domain of Electrocardiogram (ECG) analytics, shedding light on their efficacy and potential in advancing healthcare technology.
Affiliation(s)
- Ali Reza Keivanimehr
  - Department of Management, Science and Technology, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran.
- Mohammad Akbari
  - Department of Computer Science, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran.
2
Lin Y, Li C, Wang X, Li H. Development of a machine learning-based risk assessment model for loneliness among elderly Chinese: a cross-sectional study based on Chinese longitudinal healthy longevity survey. BMC Geriatr 2024; 24:939. [PMID: 39543473 PMCID: PMC11562678 DOI: 10.1186/s12877-024-05443-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2024] [Accepted: 10/07/2024] [Indexed: 11/17/2024]
Abstract
BACKGROUND: Loneliness is prevalent among the elderly and has intensified due to global aging trends. It adversely affects both mental and physical health. Traditional scales for measuring loneliness may yield biased results due to varying definitions. The advancements in machine learning offer new opportunities for improving the measurement and assessment of loneliness through the development of risk assessment models.
METHODS: Data from the 2018 Chinese Longitudinal Healthy Longevity Survey, involving about 16,000 participants aged ≥ 65 years, were used. The study examined the relationships between loneliness and factors such as functional limitations, living conditions, environmental influences, age-related health issues, and health behaviors. Using R 4.4.1, seven assessment models were developed: logistic regression, ridge regression, support vector machines, K-nearest neighbors, decision trees, random forests, and multi-layer perceptron. Models were evaluated based on ROC curves, accuracy, precision, recall, F1 scores, and AUC.
RESULTS: Loneliness prevalence among elderly Chinese was 23.4%. Analysis identified 15 evaluative factors and evaluated seven models. Multi-layer perceptron stands out for its strong nonlinear mapping capability and adaptability to complex data, making it one of the most effective models for assessing loneliness risk.
CONCLUSION: The study found a 23.4% prevalence of loneliness among elderly individuals in China. SHAP values indicated that marital status has the strongest evaluative value across all forecasting periods. Specifically, elderly individuals who are never married, widowed, divorced, or separated are more likely to experience loneliness compared to their married counterparts.
Affiliation(s)
- Youbei Lin
  - Jinzhou Medical University, School of Nursing, Jinzhou City, Liaoning Province, 121001, China
- Chuang Li
  - Jinzhou Medical University, School of Nursing, Jinzhou City, Liaoning Province, 121001, China
- Xiuli Wang
  - The First Affiliated Hospital of Jinzhou Medical University, Jinzhou City, Liaoning Province, 121001, China
- Hongyu Li
  - Jinzhou Medical University, School of Nursing, Jinzhou City, Liaoning Province, 121001, China.
3
Yan F, Jiang L, Chen D, Ceccarelli M, Guo Y. Reinventing gene expression connectivity through regulatory and spatial structural empowerment via principal node aggregation graph neural network. Nucleic Acids Res 2024; 52:e60. [PMID: 38884259 PMCID: PMC11260459 DOI: 10.1093/nar/gkae514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2024] [Accepted: 06/04/2024] [Indexed: 06/18/2024]
Abstract
The intricacies of the human genome, manifested as a complex network of genes, transcend conventional representations in text or numerical matrices. The intricate gene-to-gene relationships inherent in this complexity find a more suitable depiction in graph structures. In the pursuit of predicting gene expression, an endeavor shared by predecessors like the L1000 and Enformer methods, we introduce a novel spatial graph-neural network (GNN) approach. This innovative strategy incorporates graph features, encompassing both regulatory and structural elements. The regulatory elements include pair-wise gene correlation, biological pathways, protein-protein interaction networks, and transcription factor regulation. The spatial structural elements include chromosomal distance, histone modification and Hi-C inferred 3D genomic features. Principal Node Aggregation models, validated independently, emerge as frontrunners, demonstrating superior performance compared to traditional regression and other deep learning models. By embracing the spatial GNN paradigm, our method significantly advances the description of the intricate network of gene interactions, surpassing the performance, predictable scope, and initial requirements set by previous methods.
Affiliation(s)
- Fengyao Yan
  - Department of Public Health and Sciences, University of Miami, Miami, FL 33126, USA
  - Department of Computer Science, University of South Carolina, Columbia, SC 29201, USA
- Limin Jiang
  - Department of Public Health and Sciences, University of Miami, Miami, FL 33126, USA
- Danqian Chen
  - Department of Public Health and Sciences, University of Miami, Miami, FL 33126, USA
- Michele Ceccarelli
  - Department of Public Health and Sciences, University of Miami, Miami, FL 33126, USA
- Yan Guo
  - Department of Public Health and Sciences, University of Miami, Miami, FL 33126, USA
4
Tian H, Tang L, Yang Z, Xiang Y, Min Q, Yin M, You H, Xiao Z, Shen J. Current understanding of functional peptides encoded by lncRNA in cancer. Cancer Cell Int 2024; 24:252. [PMID: 39030557 PMCID: PMC11265036 DOI: 10.1186/s12935-024-03446-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Accepted: 07/09/2024] [Indexed: 07/21/2024]
Abstract
Dysregulated gene expression and imbalance of transcriptional regulation are typical features of cancer. RNA always plays a key role in these processes. Human transcripts contain many RNAs without long open reading frames (ORF, > 100 aa) and that are more than 200 bp in length. They are usually regarded as long non-coding RNA (lncRNA) which play an important role in cancer regulation, including chromatin remodeling, transcriptional regulation, translational regulation and as miRNA sponges. With the advancement of ribosome profiling and sequencing technologies, increasing research evidence revealed that some ORFs in lncRNA can also encode peptides and participate in the regulation of multiple organ tumors, which undoubtedly opens a new chapter in the field of lncRNA and oncology research. In this review, we discuss the biological function of lncRNA in tumors, the current methods to evaluate their coding potential and the role of functional small peptides encoded by lncRNA in cancers. Investigating the small peptides encoded by lncRNA and understanding the regulatory mechanisms of these functional peptides may contribute to a deeper understanding of cancer and the development of new targeted anticancer therapies.
Affiliation(s)
- Hua Tian
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
  - School of Nursing, Chongqing College of Humanities, Science & Technology, Chongqing, China
- Lu Tang
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
- Zihan Yang
  - Department of Pathology, The Affiliated Hospital of Southwest Medical University, Luzhou, China, 646000
- Yanxi Xiang
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
- Qi Min
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
- Mengshuang Yin
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
- Huili You
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
- Zhangang Xiao
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China.
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China.
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China.
  - Gulin Traditional Chinese Medicine Hospital, Luzhou, China.
  - Department of Pharmacology, School of Pharmacy, Sichuan College of Traditional Chinese Medicine, Mianyang, China.
- Jing Shen
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China.
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China.
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China.
5
Dey V, Ning X. Improving Anticancer Drug Selection and Prioritization via Neural Learning to Rank. J Chem Inf Model 2024; 64:4071-4088. [PMID: 38740382 PMCID: PMC11134508 DOI: 10.1021/acs.jcim.3c01060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Revised: 03/27/2024] [Accepted: 04/16/2024] [Indexed: 05/16/2024]
Abstract
Personalized cancer treatment requires a thorough understanding of complex interactions between drugs and cancer cell lines in varying genetic and molecular contexts. To address this, high-throughput screening has been used to generate large-scale drug response data, facilitating data-driven computational models. Such models can capture complex drug-cell line interactions across various contexts in a fully data-driven manner. However, accurately prioritizing the most effective drugs for each cell line still remains a significant challenge. To address this, we developed multiple neural ranking approaches that leverage large-scale drug response data across multiple cell lines from diverse cancer types. Unlike existing approaches that primarily utilize regression and classification techniques for drug response prediction, we formulated the objective of drug selection and prioritization as a drug ranking problem. In this work, we proposed multiple pairwise and listwise neural ranking methods that learn latent representations of drugs and cell lines and then use those representations to score drugs in each cell line via a learnable scoring function. Specifically, we developed neural pairwise and listwise ranking methods, Pair-PushC and List-One on top of the existing methods, pLETORg and ListNet, respectively. Additionally, we proposed a novel listwise ranking method, List-All, that focuses on all the effective drugs instead of the top effective drug, unlike List-One. We also provide an exhaustive empirical evaluation with state-of-the-art regression and ranking baselines on large-scale data sets across multiple experimental settings. Our results demonstrate that our proposed ranking methods mostly outperform the best baselines with significant improvements of as much as 25.6% in terms of selecting truly effective drugs within the top 20 predicted drugs (i.e., hit@20) across 50% test cell lines. 
Furthermore, our analyses suggest that the learned latent spaces from our proposed methods demonstrate informative clustering structures and capture relevant underlying biological features. Moreover, our comprehensive evaluation provides a thorough and objective comparison of the performance of different methods (including our proposed ones).
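The hit@20 figure quoted in this abstract can be illustrated with a minimal sketch. It assumes one plausible reading of the metric: the number of truly effective drugs that appear among the k highest-scored drugs for a single cell line. The function name `hit_at_k`, its arguments, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Set


def hit_at_k(scores: Sequence[float], effective: Set[int], k: int = 20) -> int:
    """Count truly effective drugs among the k highest-scored drugs.

    Assumption: `scores[i]` is a model's predicted score for drug i in one
    cell line, and `effective` holds the indices of drugs known to work.
    """
    # Rank drug indices by predicted score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top k and count how many are in the effective set.
    return sum(1 for i in ranked[:k] if i in effective)


# Toy data (hypothetical): five drugs, of which drugs 0 and 3 are effective.
toy_scores = [0.9, 0.1, 0.4, 0.8, 0.3]
toy_hits = hit_at_k(toy_scores, effective={0, 3}, k=2)
```

Averaging this count over the test cell lines would give a dataset-level score; whether the paper normalizes or aggregates it differently is not stated in the abstract.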
Affiliation(s)
- Vishal Dey
  - Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, United States
- Xia Ning
  - Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, United States
  - Biomedical Informatics, The Ohio State University, Columbus, Ohio 43210, United States
  - Translational Data Analytics Institute, The Ohio State University, Columbus, Ohio 43210, United States
6
Meimetis N, Pullen KM, Zhu DY, Nilsson A, Hoang TN, Magliacane S, Lauffenburger DA. AutoTransOP: translating omics signatures without orthologue requirements using deep learning. NPJ Syst Biol Appl 2024; 10:13. [PMID: 38287079 PMCID: PMC10825146 DOI: 10.1038/s41540-024-00341-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2023] [Accepted: 01/17/2024] [Indexed: 01/31/2024]
Abstract
The development of therapeutics and vaccines for human diseases requires a systematic understanding of human biology. Although animal and in vitro culture models can elucidate some disease mechanisms, they typically fail to adequately recapitulate human biology as evidenced by the predominant likelihood of clinical trial failure. To address this problem, we developed AutoTransOP, a neural network autoencoder framework, to map omics profiles from designated species or cellular contexts into a global latent space, from which germane information for different contexts can be identified without the typically imposed requirement of matched orthologues. This approach was found in general to perform at least as well as current alternative methods in identifying animal/culture-specific molecular features predictive of other contexts-most importantly without requiring homology matching. For an especially challenging test case, we successfully applied our framework to a set of inter-species vaccine serology studies, where 1-to-1 mapping between human and non-human primate features does not exist.
Affiliation(s)
- Nikolaos Meimetis
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Krista M Pullen
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Daniel Y Zhu
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Avlant Nilsson
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, SE, 41296, Sweden
- Trong Nghia Hoang
  - School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99164-236, USA
- Sara Magliacane
  - Institute of Informatics, University of Amsterdam, Amsterdam, The Netherlands
  - MIT-IBM Watson AI Lab, Cambridge, MA, 02139, USA
- Douglas A Lauffenburger
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
7
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023]
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Affiliation(s)
- Tianwei Yue
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Yuanxin Wang
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Longxiang Zhang
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Chunming Gu
  - Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA
- Haoru Xue
  - The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Wenping Wang
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Qi Lyu
  - Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Yujie Dun
  - School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China
8
Wang X, Zeng H, Lin L, Huang Y, Lin H, Que Y. Deep learning-empowered crop breeding: intelligent, efficient and promising. Front Plant Sci 2023; 14:1260089. [PMID: 37860239 PMCID: PMC10583549 DOI: 10.3389/fpls.2023.1260089] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/18/2023] [Accepted: 09/13/2023] [Indexed: 10/21/2023]
Abstract
Crop breeding is one of the main approaches to increase crop yield and improve crop quality. However, the breeding process faces challenges such as complex data, difficulties in data acquisition, and low prediction accuracy, resulting in low breeding efficiency and long cycle. Deep learning-based crop breeding is a strategy that applies deep learning techniques to improve and optimize the breeding process, leading to accelerated crop improvement, enhanced breeding efficiency, and the development of higher-yielding, more adaptive, and disease-resistant varieties for agricultural production. This perspective briefly discusses the mechanisms, key applications, and impact of deep learning in crop breeding. We also highlight the current challenges associated with this topic and provide insights into its future application prospects.
Affiliation(s)
- Xiaoding Wang
  - Fujian Provincial Key Lab of Network Security & Cryptology, College of Computer and Cyber Security, Fujian Normal University, Fuzhou, China
- Haitao Zeng
  - Fujian Provincial Key Lab of Network Security & Cryptology, College of Computer and Cyber Security, Fujian Normal University, Fuzhou, China
- Limei Lin
  - Fujian Provincial Key Lab of Network Security & Cryptology, College of Computer and Cyber Security, Fujian Normal University, Fuzhou, China
- Yanze Huang
  - School of Computer Science and Mathematics, Fujian Provincial Key Laboratory of Big Data Mining and Applications, Fujian University of Technology, Fuzhou, China
- Hui Lin
  - Fujian Provincial Key Lab of Network Security & Cryptology, College of Computer and Cyber Security, Fujian Normal University, Fuzhou, China
- Youxiong Que
  - Key Laboratory of Sugarcane Biology and Genetic Breeding, Ministry of Agriculture and Rural Affairs, Fujian Agriculture and Forestry University, Fuzhou, China
  - National Key Laboratory for Tropical Crop Breeding, Institute of Tropical Bioscience and Biotechnology, Chinese Academy of Tropical Agricultural Sciences, Hainan, China
9
Morabito F, Adornetto C, Monti P, Amaro A, Reggiani F, Colombo M, Rodriguez-Aldana Y, Tripepi G, D’Arrigo G, Vener C, Torricelli F, Rossi T, Neri A, Ferrarini M, Cutrona G, Gentile M, Greco G. Genes selection using deep learning and explainable artificial intelligence for chronic lymphocytic leukemia predicting the need and time to therapy. Front Oncol 2023; 13:1198992. [PMID: 37719021 PMCID: PMC10501728 DOI: 10.3389/fonc.2023.1198992] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/02/2023] [Accepted: 07/31/2023] [Indexed: 09/19/2023]
Abstract
Analyzing gene expression profiles (GEP) through artificial intelligence provides meaningful insight into cancer disease. This study introduces DeepSHAP Autoencoder Filter for Genes Selection (DSAF-GS), a novel deep learning and explainable artificial intelligence-based approach for feature selection in genomics-scale data. DSAF-GS exploits the autoencoder's reconstruction capabilities without changing the original feature space, enhancing the interpretation of the results. Explainable artificial intelligence is then used to select the informative genes for chronic lymphocytic leukemia prognosis of 217 cases from a GEP database comprising roughly 20,000 genes. The model for prognosis prediction achieved an accuracy of 86.4%, a sensitivity of 85.0%, and a specificity of 87.5%. According to the proposed approach, predictions were strongly influenced by CEACAM19 and PIGP, moderately influenced by MKL1 and GNE, and poorly influenced by other genes. The 10 most influential genes were selected for further analysis. Among them, FADD, FIBP, GNE, IGF1R, MKL1, PIGP, and SLC39A6 were identified in the Reactome pathway database as involved in signal transduction, transcription, protein metabolism, immune system, cell cycle, and apoptosis. Moreover, according to the network model of the 3D protein-protein interaction (PPI) explored using the NetworkAnalyst tool, FADD, FIBP, IGF1R, QTRT1, GNE, SLC39A6, and MKL1 appear coupled into a complex network. Finally, all 10 selected genes showed a predictive power on time to first treatment (TTFT) in univariate analyses on a basic prognostic model including IGHV mutational status, del(11q) and del(17p), NOTCH1 mutations, β2-microglobulin, Rai stage, and B-lymphocytosis known to predict TTFT in CLL. However, only IGF1R [hazard ratio (HR) 1.41, 95% CI 1.08-1.84, P=0.013], COL28A1 (HR 0.32, 95% CI 0.10-0.97, P=0.045), and QTRT1 (HR 7.73, 95% CI 2.48-24.04, P<0.001) genes were significantly associated with TTFT in multivariable analyses when combined with the prognostic factors of the basic model, ultimately increasing the Harrell's c-index and the explained variation to 78.6% (versus 76.5% of the basic prognostic model) and 52.6% (versus 42.2% of the basic prognostic model), respectively. Also, the goodness of model fit was enhanced (χ2 = 20.1, P=0.002), indicating its improved performance above the basic prognostic model. In conclusion, DSAF-GS identified a group of significant genes for CLL prognosis, suggesting future directions for bio-molecular research.
Affiliation(s)
- Carlo Adornetto
  - Department of Mathematics and Computer Science, University of Calabria, Cosenza, Italy
- Paola Monti
  - Mutagenesis and Cancer Prevention Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
- Adriana Amaro
  - Tumor Epigenetics Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
- Francesco Reggiani
  - Tumor Epigenetics Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
- Monica Colombo
  - Molecular Pathology Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
- Giovanni Tripepi
  - Consiglio Nazionale delle Ricerche, Istituto di Fisiologia Clinica del Consiglio Nazionale delle Ricerche (CNR), Reggio Calabria, Italy
- Graziella D’Arrigo
  - Consiglio Nazionale delle Ricerche, Istituto di Fisiologia Clinica del Consiglio Nazionale delle Ricerche (CNR), Reggio Calabria, Italy
- Claudia Vener
  - Department of Oncology and Hemato-Oncology, University of Milan, Milan, Italy
- Federica Torricelli
  - Laboratory of Translational Research, Azienda Unità Sanitaria Locale - Istituto di Ricovero e Cura a Carattere Scientifico (USL-IRCCS) of Reggio Emilia, Reggio Emilia, Italy
- Teresa Rossi
  - Laboratory of Translational Research, Azienda Unità Sanitaria Locale - Istituto di Ricovero e Cura a Carattere Scientifico (USL-IRCCS) of Reggio Emilia, Reggio Emilia, Italy
- Antonino Neri
  - Scientific Directorate, Azienda Unità Sanitaria Locale - Istituto di Ricovero e Cura a Carattere Scientifico (USL-IRCCS) of Reggio Emilia, Reggio Emilia, Italy
- Manlio Ferrarini
  - Unità Operativa (UO) Molecular Pathology, Ospedale Policlinico San Martino Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS), Genoa, Italy
- Giovanna Cutrona
  - Molecular Pathology Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
- Massimo Gentile
  - Hematology Unit, Department of Onco-Hematology, Azienda Ospedaliera (A.O.) of Cosenza, Cosenza, Italy
  - Department of Pharmacy and Health and Nutritional Sciences, University of Calabria, Cosenza, Italy
- Gianluigi Greco
  - Department of Mathematics and Computer Science, University of Calabria, Cosenza, Italy
10
Wang D, Gao L, Gao X, Wang C, Tian S. Identification of monotonically expressed long non-coding RNA signatures for breast cancer using variational autoencoders. PLoS One 2023; 18:e0289971. [PMID: 37561760 PMCID: PMC10414641 DOI: 10.1371/journal.pone.0289971] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2023] [Accepted: 07/29/2023] [Indexed: 08/12/2023]
Abstract
As breast cancer is a multistage progression disease resulting from a genetic sequence of mutations, understanding the genes whose expression values increase or decrease monotonically across pathologic stages can provide insightful clues about how breast cancer initiates and advances. Utilizing variational autoencoder (VAE) networks in conjunction with traditional statistical testing, we successfully ascertain long non-coding RNAs (lncRNAs) that exhibit monotonically differential expression values in breast cancer. Subsequently, we validate that the identified lncRNAs really present monotonically changed patterns. The proposed procedure identified 248 monotonically decreasing expressed and 115 increasing expressed lncRNAs. They correspond to a total of 65 and 33 genes respectively, which possess unique known gene symbols. Some of them are associated with breast cancer, as suggested by previous studies. Furthermore, enriched pathways by the target mRNAs of these identified lncRNAs include the Wnt signaling pathway, human papillomavirus (HPV) infection, and Rap 1 signaling pathway, which have been shown to play crucial roles in the initiation and development of breast cancer. Additionally, we trained a VAE model using the entire dataset. To assess the effectiveness of the identified lncRNAs, a microarray dataset was employed as the test set. The results obtained from this evaluation were deemed satisfactory. In conclusion, further experimental validation of these lncRNAs with a large-sized study is warranted, and the proposed procedure is highly recommended.
Affiliation(s)
- Dongjiao Wang
  - Department of Gynecological Oncology, The First Hospital of Jilin University, Changchun, Jilin, People’s Republic of China
- Ling Gao
  - Department of Radiation Oncology, The First Hospital of Jilin University, Changchun, Jilin, People’s Republic of China
- Xinliang Gao
  - Department of Thoracic Surgery, The First Hospital of Jilin University, Changchun, Jilin, People’s Republic of China
- Chi Wang
  - Department of Internal Medicine, College of Medicine, University of Kentucky, Lexington, Kentucky, United States of America
  - Markey Cancer Center, University of Kentucky, Lexington, KY, United States of America
- Suyan Tian
  - Division of Clinical Research, The First Hospital of Jilin University, Changchun, Jilin, People’s Republic of China
11
|
Zhu Y, Wang M, Yin X, Zhang J, Meijering E, Hu J. Deep Learning in Diverse Intelligent Sensor Based Systems. Sensors (Basel) 2022; 23:62. [PMID: 36616657 PMCID: PMC9823653 DOI: 10.3390/s23010062] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/02/2022] [Revised: 12/06/2022] [Accepted: 12/14/2022] [Indexed: 05/27/2023]
Abstract
Deep learning has become a predominant method for solving data analysis problems in virtually all fields of science and engineering. The increasing complexity and the large volume of data collected by diverse sensor systems have spurred the development of deep learning methods and have fundamentally transformed the way the data are acquired, processed, analyzed, and interpreted. With the rapid development of deep learning technology and its ever-increasing range of successful applications across diverse sensor systems, there is an urgent need to provide a comprehensive investigation of deep learning in this domain from a holistic view. This survey paper aims to contribute to this by systematically investigating deep learning models/methods and their applications across diverse sensor systems. It also provides a comprehensive summary of deep learning implementation tips and links to tutorials, open-source codes, and pretrained models, which can serve as an excellent self-contained reference for deep learning practitioners and those seeking to innovate deep learning in this space. In addition, this paper provides insights into research topics in diverse sensor systems where deep learning has not yet been well-developed, and highlights challenges and future opportunities. This survey serves as a catalyst to accelerate the application and transformation of deep learning in diverse sensor systems.
Affiliation(s)
- Yanming Zhu
  - School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
- Min Wang
  - School of Engineering and Information Technology, University of New South Wales, Canberra, ACT 2612, Australia
- Xuefei Yin
  - School of Engineering and Information Technology, University of New South Wales, Canberra, ACT 2612, Australia
- Jue Zhang
  - School of Engineering and Information Technology, University of New South Wales, Canberra, ACT 2612, Australia
- Erik Meijering
  - School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
- Jiankun Hu
  - School of Engineering and Information Technology, University of New South Wales, Canberra, ACT 2612, Australia
|
12
|
Cheng X, Dai C, Wen Y, Wang X, Bo X, He S, Peng S. NeRD: a multichannel neural network to predict cellular response of drugs by integrating multidimensional data. BMC Med 2022; 20:368. [PMID: 36244991 PMCID: PMC9575288 DOI: 10.1186/s12916-022-02549-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 04/02/2022] [Accepted: 09/01/2022] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND: Considering the heterogeneity of tumors, it is a key issue in precision medicine to predict the drug response of each individual. The accumulation of various types of drug informatics and multi-omics data facilitates the development of efficient models for drug response prediction. However, the selection of high-quality data sources and the design of suitable methods remain a challenge. METHODS: In this paper, we design NeRD, a multidimensional data integration model based on the PRISM drug response database, to predict the cellular response of drugs. Four feature extractors, including drug structure extractor (DSE), molecular fingerprint extractor (MFE), miRNA expression extractor (mEE), and copy number extractor (CNE), are designed for different types and dimensions of data. A fully connected network is used to fuse all features and make predictions. RESULTS: Experimental results demonstrate the effective integration of the global and local structural features of drugs, as well as the features of cell lines from different omics data. For all metrics tested on the PRISM database, NeRD surpassed previous approaches. We also verified that NeRD has strong reliability in the prediction results of new samples. Moreover, unlike other algorithms, when the amount of training data was reduced, NeRD maintained stable performance. CONCLUSIONS: NeRD's feature fusion provides a new idea for drug response prediction, which is of great significance for precise cancer treatment.
Affiliation(s)
- Xiaoxiao Cheng
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Chong Dai
  - College of Life Science and Technology, Beijing University of Chemical Technology, Beijing, China; Department of Biotechnology, Beijing Institute of Health Service and Transfusion Medicine, Beijing, China
- Yuqi Wen
  - Department of Biotechnology, Beijing Institute of Health Service and Transfusion Medicine, Beijing, China
- Xiaoqi Wang
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Xiaochen Bo
  - Department of Biotechnology, Beijing Institute of Health Service and Transfusion Medicine, Beijing, China
- Song He
  - Department of Biotechnology, Beijing Institute of Health Service and Transfusion Medicine, Beijing, China
- Shaoliang Peng
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China; The State Key Laboratory of Chemo/Biosensing and Chemometrics, Hunan University, Changsha, China
|
13
|
Zhang B, Fan T. Knowledge structure and emerging trends in the application of deep learning in genetics research: A bibliometric analysis [2000–2021]. Front Genet 2022; 13:951939. [PMID: 36081985 PMCID: PMC9445221 DOI: 10.3389/fgene.2022.951939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2022] [Accepted: 07/13/2022] [Indexed: 11/13/2022] Open
Abstract
Introduction: Deep learning technology has been widely used in genetic research because of its characteristics of computability, statistical analysis, and predictability. Herein, we aimed to summarize standardized knowledge and potentially innovative approaches for deep learning applications in genetics by evaluating publications, to encourage more research. Methods: The Science Citation Index Expanded™ (SCIE) database was searched for publications on deep learning applications in genomics. Original articles and reviews were considered. In this study, we derived a clustered network from 69,806 references that were cited by the 1,754 related manuscripts identified. We used CiteSpace and VOSviewer to identify countries, institutions, journals, co-cited references, keywords, subject evolution, path, current characteristics, and emerging topics. Results: We assessed the rapidly increasing number of publications concerning deep learning applications in genomics and identified 1,754 articles focusing on this subject. A total of 101 countries and 2,487 institutes contributed publications. The United States of America had the most publications (728/1,754) and the highest h-index, and has collaborated closely with China and Germany. The reference clusters of SCI articles fell into seven categories: deep learning, logic regression, variant prioritization, random forests, scRNA-seq (single-cell RNA-seq), genomic regulation, and recombination. The keywords representing the research frontiers by year were prediction (2016–2021), sequence (2017–2021), mutation (2017–2021), and cancer (2019–2021). Conclusion: Here, we summarized the current literature on deep learning for genetics applications and analyzed the current research characteristics and future trajectories in this field. This work aims to provide resources for further intensive exploration and to encourage more researchers to pursue deep learning applications in genetics.
Affiliation(s)
- Bijun Zhang
  - Department of Clinical Genetics, Shengjing Hospital of China Medical University, Shenyang, China
- Ting Fan
  - Department of Computer, School of Intelligent Medicine, China Medical University, Shenyang, China
  - *Correspondence: Ting Fan
|
14
|
Towards computational solutions for precision medicine based big data healthcare system using deep learning models: A review. Comput Biol Med 2022; 149:106020. [DOI: 10.1016/j.compbiomed.2022.106020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/07/2022] [Revised: 08/16/2022] [Accepted: 08/20/2022] [Indexed: 12/14/2022]
|
15
|
Zhao X, Liu T, Wang G. Ensemble classification based signature discovery for cancer diagnosis in RNA expression profiles across different platforms. Brief Bioinform 2022; 23:6590877. [PMID: 35605226 DOI: 10.1093/bib/bbac185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2022] [Revised: 04/21/2022] [Accepted: 04/23/2022] [Indexed: 11/13/2022] Open
Abstract
Molecular signatures have been reported in great numbers for the diagnosis of many cancers over the last 20 years. However, statistical methods and machine learning approaches routinely yield false-positive signatures, which causes subsequent biological experiments to fail. As a result, signature discovery has gradually become non-mainstream work in bioinformatics. There are three critical weaknesses that make an identified signature unreliable. First, a signature is wrongly thought to be a gene set in which each component is differentially expressed between or among sample groups. Second, many false-positive differentially expressed genes may be found, even if samples from the cancer and normal groups can be separated in one-dimensional space. Third, cross-platform validation results for a discovered signature are always poor. To solve these problems, we propose a new feature selection framework based on ensemble classification to discover signatures for cancer diagnosis. A procedure for transforming data among expression profiles from different platforms is also designed. Signatures are found on simulated and real data representing different carcinomas across different platforms, and false positives are suppressed. The experimental results demonstrate the effectiveness of our method.
Affiliation(s)
- Xudong Zhao
  - College of Information and Computer Engineering, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China
- Tong Liu
  - College of Information and Computer Engineering, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China
- Guohua Wang
  - College of Information and Computer Engineering, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China; State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China
|
16
|
Su L, Xu C, Zeng S, Su L, Joshi T, Stacey G, Xu D. Large-Scale Integrative Analysis of Soybean Transcriptome Using an Unsupervised Autoencoder Model. Front Plant Sci 2022; 13:831204. [PMID: 35310659 PMCID: PMC8927983 DOI: 10.3389/fpls.2022.831204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2021] [Accepted: 02/09/2022] [Indexed: 06/14/2023]
Abstract
Plant tissues are distinguished by their gene expression patterns, which can help identify tissue-specific highly expressed genes and their differential functional modules. For this purpose, large-scale soybean transcriptome samples were collected and processed starting from raw sequencing reads in a uniform analysis pipeline. To address the gene expression heterogeneity in different tissues, we utilized an adversarial deconfounding autoencoder (AD-AE) model to map gene expressions into a latent space and adapted a standard unsupervised autoencoder (AE) model to help effectively extract meaningful biological signals from the noisy data. As a result, four groups of 1,743, 914, 2,107, and 1,451 genes were found to be highly expressed specifically in leaf, root, seed, and nodule tissues, respectively. To obtain key transcription factors (TFs), hub genes, and their functional modules in each tissue, we constructed tissue-specific gene regulatory networks (GRNs) and differential correlation networks using corrected and compressed gene expression data. We validated our results against the literature and with gene enrichment analysis, which confirmed many identified tissue-specific genes. Our study represents the largest gene expression analysis in soybean tissues to date. It provides valuable targets for tissue-specific research and helps uncover broader biological patterns. The code is open source and publicly available at https://github.com/LingtaoSu/SoyMeta.
Affiliation(s)
- Lingtao Su
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Chunhui Xu
  - Institute for Data Science and Informatics, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Shuai Zeng
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Li Su
  - Institute for Data Science and Informatics, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Trupti Joshi
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - Institute for Data Science and Informatics, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - Department of Health Management and Informatics and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Gary Stacey
  - Division of Plant Sciences and Technology and Biochemistry Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Dong Xu
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - Institute for Data Science and Informatics, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
|
17
|
Artificial Intelligence and Cardiovascular Genetics. Life (Basel) 2022; 12:279. [PMID: 35207566 PMCID: PMC8875522 DOI: 10.3390/life12020279] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/30/2021] [Revised: 01/26/2022] [Accepted: 02/09/2022] [Indexed: 12/13/2022] Open
Abstract
Polygenic diseases, which are genetic disorders caused by the combined action of multiple genes, pose unique and significant challenges for the diagnosis and management of affected patients. A major goal of cardiovascular medicine has been to understand how genetic variation leads to the clinical heterogeneity seen in polygenic cardiovascular diseases (CVDs). Recent advances and emerging technologies in artificial intelligence (AI), coupled with the ever-increasing availability of next generation sequencing (NGS) technologies, now provide researchers with unprecedented possibilities for dynamic and complex biological genomic analyses. Combining these technologies may lead to a deeper understanding of heterogeneous polygenic CVDs, better prognostic guidance, and, ultimately, greater personalized medicine. Advances will likely be achieved through increasingly frequent and robust genomic characterization of patients, as well as the integration of genomic data with other clinical data, such as cardiac imaging, coronary angiography, and clinical biomarkers. This review discusses the current opportunities and limitations of genomics; provides a brief overview of AI; and identifies the current applications, limitations, and future directions of AI in genomics.
|
18
|
Liu Q, Hu P. Extendable and explainable deep learning for pan-cancer radiogenomics research. Curr Opin Chem Biol 2022; 66:102111. [PMID: 34999476 DOI: 10.1016/j.cbpa.2021.102111] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/21/2021] [Revised: 12/06/2021] [Accepted: 12/13/2021] [Indexed: 12/12/2022]
Abstract
Radiogenomics is a field in which medical images and genomic profiles are jointly analyzed to answer critical clinical questions. Specifically, the goal is to identify non-invasive imaging biomarkers that are associated with both genomic features and clinical outcomes. Deep learning is an advanced computer science technique that has been applied in many fields, including medical image and genomic data analysis. This review summarizes the current state of deep learning in pan-cancer radiogenomic research, discusses its limitations, and indicates potential future directions. Traditional machine learning in radiomics, genomics, and radiogenomics is also briefly discussed. We also summarize the main pan-cancer radiogenomic research resources. Two characteristics of deep learning are emphasized when discussing its application to pan-cancer radiogenomics: extendibility and explainability.
Affiliation(s)
- Qian Liu
  - Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, Manitoba, R3E 0W3, Canada; Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, R3E 0W3, Canada; Department of Statistics, University of Manitoba, Winnipeg, Manitoba, R3E 0W3, Canada
- Pingzhao Hu
  - Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, Manitoba, R3E 0W3, Canada; Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, R3E 0W3, Canada
|
19
|
Viaud G, Mayilvahanan P, Cournede PH. Representation Learning for the Clustering of Multi-Omics Data. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:135-145. [PMID: 33600320 DOI: 10.1109/tcbb.2021.3060340] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 06/12/2023]
Abstract
The integration of several sources of data for the identification of subtypes of diseases has gained attention over the past few years. The heterogeneity and the high dimensions of the data sets call for an adequate representation of the data. We summarize the field of representation learning for the multi-omics clustering problem and we investigate several techniques to learn relevant combined representations, using methods from group factor analysis (PCA, MFA and extensions) and from machine learning with autoencoders. We highlight the importance of appropriately designing and training the latter, notably with a novel combination of a disjointed deep autoencoder (DDAE) architecture and a layer-wise reconstruction loss. These different representations can then be clustered to identify biologically meaningful clusters of patients. We provide a unifying framework for model comparison between statistical and deep learning approaches with the introduction of a new weighted internal clustering index that evaluates how well the clustering information is retained from each source, favoring contributions from all data sets. We apply our methodology to two case studies for which previous works of integrative clustering exist, TCGA Breast Cancer and TARGET Neuroblastoma, and show how our method can yield good and well-balanced clusters across the different data sources.
|
20
|
Zhang L, Yang Y, Chai L, Li Q, Liu J, Lin H, Liu L. A deep learning model to identify gene expression level using cobinding transcription factor signals. Brief Bioinform 2021; 23:6447678. [PMID: 34864886 DOI: 10.1093/bib/bbab501] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 08/23/2021] [Revised: 10/13/2021] [Accepted: 11/01/2021] [Indexed: 01/02/2023] Open
Abstract
Gene expression is directly controlled by transcription factors (TFs) in a complex combinatorial manner. It remains a challenging task to systematically infer how the cooperative binding of TFs drives gene activity. Here, we quantitatively analyzed the correlation between TFs and surveyed the TF interaction networks associated with gene expression in GM12878 and K562 cell lines. We identified six TF modules associated with gene expression in each cell line. Furthermore, according to the enrichment characteristics of TFs in these TF modules around a target gene, a convolutional neural network model, called TFCNN, was constructed to identify gene expression level. Results showed that the TFCNN model achieved good prediction performance for gene expression. The average area under the receiver operating characteristic curve (AUC) reached 0.975 and 0.976 in GM12878 and K562 cell lines, respectively. By comparison, we found that the TFCNN model outperformed the prediction models based on SVM and LDA. This is because the TFCNN model could better extract the combinatorial interaction among TFs. Further analysis indicated that the abundant binding of regulatory TFs dominates expression of target genes, while the cooperative interaction between TFs has a subtle regulatory effect. Moreover, gene expression could be regulated by different TF combinations in a nonlinear way. These results are helpful for deciphering the mechanism by which TF combinations regulate gene expression.
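For readers unfamiliar with the AUC metric quoted in the abstract above, the following is a minimal illustrative sketch of how an area under the ROC curve can be computed from classifier scores. It is not taken from the cited paper: the function name `roc_auc` and the toy scores are assumptions, and the pairwise-comparison formulation (the probability that a randomly chosen positive example outscores a randomly chosen negative one) is one standard way to compute the quantity.

```python
def roc_auc(scores, labels):
    """Return the ROC AUC for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): three of the four positive/negative
# pairs are ranked correctly.
print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; rank-based implementations give the same value more efficiently on large data sets.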
Affiliation(s)
- Lirong Zhang
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Yanchao Yang
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Lu Chai
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Qianzhong Li
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Junjie Liu
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Hao Lin
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Li Liu
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
|
21
|
Umarov R, Li Y, Arner E. DeepCellState: An autoencoder-based framework for predicting cell type specific transcriptional states induced by drug treatment. PLoS Comput Biol 2021; 17:e1009465. [PMID: 34610009 PMCID: PMC8519465 DOI: 10.1371/journal.pcbi.1009465] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/03/2021] [Revised: 10/15/2021] [Accepted: 09/20/2021] [Indexed: 11/18/2022] Open
Abstract
Drug treatment induces cell type specific transcriptional programs, and as the number of combinations of drugs and cell types grows, the cost for exhaustive screens measuring the transcriptional drug response becomes intractable. We developed DeepCellState, a deep learning autoencoder-based framework, for predicting the induced transcriptional state in a cell type after drug treatment, based on the drug response in another cell type. Training the method on a large collection of transcriptional drug perturbation profiles, prediction accuracy improves significantly over baseline and alternative deep learning approaches when applying the method to two cell types, with improved accuracy when generalizing the framework to additional cell types. Treatments with drugs or whole drug families not seen during training are predicted with similar accuracy, and the same framework can be used for predicting the results from other interventions, such as gene knock-downs. Finally, analysis of the trained model shows that the internal representation is able to learn regulatory relationships between genes in a fully data-driven manner.
Affiliation(s)
- Ramzan Umarov
  - Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
  - * E-mail: (RU); (EA)
- Yu Li
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong, People’s Republic of China
- Erik Arner
  - Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
  - Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
  - * E-mail: (RU); (EA)
|
22
|
Cao C, Kwok D, Edie S, Li Q, Ding B, Kossinna P, Campbell S, Wu J, Greenberg M, Long Q. kTWAS: integrating kernel machine with transcriptome-wide association studies improves statistical power and reveals novel genes. Brief Bioinform 2021; 22:5985285. [PMID: 33200776 DOI: 10.1093/bib/bbaa270] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 07/01/2020] [Revised: 09/17/2020] [Accepted: 09/18/2020] [Indexed: 12/31/2022] Open
Abstract
The power of genotype-phenotype association mapping studies increases greatly when contributions from multiple variants in a focal region are meaningfully aggregated. Currently, there are two popular categories of variant aggregation methods. Transcriptome-wide association studies (TWAS) represent a set of emerging methods that select variants based on their effect on gene expressions, providing pretrained linear combinations of variants for downstream association mapping. In contrast, kernel methods such as the sequence kernel association test (SKAT) model genotypic and phenotypic variance using various kernel functions that capture genetic similarity between subjects, allowing nonlinear effects to be included. From the perspective of machine learning, these two methods cover two complementary aspects of feature engineering: feature selection/pruning and feature aggregation. Thus far, no thorough comparison has been made between these categories, and no methods exist that incorporate the advantages of both TWAS- and kernel-based methods. In this work, we developed a novel method called kernel-based TWAS (kTWAS) that applies TWAS-like feature selection to a SKAT-like kernel association test, combining the strengths of both approaches. Through extensive simulations, we demonstrate that kTWAS has higher power than TWAS and multiple SKAT-based protocols, and we identify novel disease-associated genes in Wellcome Trust Case Control Consortium genotyping array data and MSSNG (Autism) sequence data. The source code for kTWAS and our simulations is available in our GitHub repository (https://github.com/theLongLab/kTWAS).
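As a rough illustration of the idea summarized above (variant weights plugged into a kernel-style association statistic), the sketch below computes a variance-component score statistic with a weighted linear kernel, Q = Σ_j w_j² (Σ_i g_ij r_i)², where r_i are phenotype residuals under an intercept-only null model. This is not the kTWAS implementation: the function name, the toy data, and the use of per-variant weights as stand-ins for expression-derived weights are all assumptions, and the step that converts Q into a p-value is omitted.

```python
def kernel_score_statistic(genotypes, phenotypes, weights):
    """Weighted linear-kernel score statistic (illustrative sketch only).

    genotypes:  list of rows, one per subject, each a list of allele counts
    phenotypes: list of quantitative trait values, one per subject
    weights:    one hypothetical weight per variant (e.g. expression-derived)
    """
    n = len(phenotypes)
    mean_y = sum(phenotypes) / n
    # Residuals under a null model containing only an intercept.
    residuals = [y - mean_y for y in phenotypes]

    q = 0.0
    for j, w in enumerate(weights):
        # Score for variant j: covariance-like sum of genotype times residual.
        s_j = sum(genotypes[i][j] * residuals[i] for i in range(n))
        q += (w ** 2) * (s_j ** 2)
    return q


# Toy data (hypothetical): three subjects, two variants.
G = [[0, 1], [1, 0], [2, 1]]
y = [1.0, 2.0, 3.0]
w = [0.5, 2.0]
print(kernel_score_statistic(G, y, w))  # 1.0
```

Setting a variant's weight to zero removes it from the statistic entirely, which is how a weighting scheme can double as feature selection in this kind of test.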
Affiliation(s)
- Chen Cao
  - Department of Biochemistry & Molecular Biology, University of Calgary
- Devin Kwok
  - Department of Mathematics & Statistics, University of Calgary
- Qing Li
  - Department of Biochemistry & Molecular Biology, University of Calgary
- Bowei Ding
  - Department of Mathematics & Statistics, University of Calgary
- Pathum Kossinna
  - Department of Biochemistry & Molecular Biology, University of Calgary
- Jingjing Wu
  - Department of Mathematics & Statistics, University of Calgary
- Quan Long
  - Departments of Biochemistry & Molecular Biology, Medical Genetics and Mathematics & Statistics
|
23
|
Zrimec J, Buric F, Kokina M, Garcia V, Zelezniak A. Learning the Regulatory Code of Gene Expression. Front Mol Biosci 2021; 8:673363. [PMID: 34179082 PMCID: PMC8223075 DOI: 10.3389/fmolb.2021.673363] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/27/2021] [Accepted: 05/24/2021] [Indexed: 11/13/2022] Open
Abstract
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
Affiliation(s)
- Jan Zrimec
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Filip Buric
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Mariia Kokina
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
- Victor Garcia
  - School of Life Sciences and Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
- Aleksej Zelezniak
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Science for Life Laboratory, Stockholm, Sweden
|
24
|
Patel N, Bush WS. Modeling transcriptional regulation using gene regulatory networks based on multi-omics data sources. BMC Bioinformatics 2021; 22:200. [PMID: 33874910 PMCID: PMC8056605 DOI: 10.1186/s12859-021-04126-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/20/2020] [Accepted: 04/09/2021] [Indexed: 11/17/2022] Open
Abstract
Background: Transcriptional regulation is complex, requiring multiple cis (local) and trans acting mechanisms working in concert to drive gene expression, with disruption of these processes linked to multiple diseases. Previous computational attempts to understand the influence of regulatory mechanisms on gene expression have used prediction models containing input features derived from cis regulatory factors. However, local chromatin looping and trans-acting mechanisms are known to also influence transcriptional regulation, and their inclusion may improve model accuracy and interpretation. In this study, we create a general model of transcription factor influence on gene expression by incorporating both cis and trans gene regulatory features. Results: We describe a computational framework to model gene expression for GM12878 and K562 cell lines. This framework weights the impact of transcription factor-based regulatory data using multi-omics gene regulatory networks to account for both cis and trans acting mechanisms, and measures of the local chromatin context. These prediction models perform significantly better compared to models containing cis-regulatory features alone. Models that additionally integrate long distance chromatin interactions (or chromatin looping) between distal transcription factor binding regions and gene promoters also show improved accuracy. As a demonstration of their utility, effect estimates from these models were used to weight cis-regulatory rare variants for sequence kernel association test analyses of gene expression. Conclusions: Our models generate refined effect estimates for the influence of individual transcription factors on gene expression, allowing characterization of their roles across the genome. This work also provides a framework for integrating multiple data types into a single model of transcriptional regulation.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04126-3.
Affiliation(s)
- Neel Patel
- Department of Nutrition, Case Western Reserve University, Cleveland, OH, USA; Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA
- William S Bush
- Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA.
25
Kuksin M, Morel D, Aglave M, Danlos FX, Marabelle A, Zinovyev A, Gautheret D, Verlingue L. Applications of single-cell and bulk RNA sequencing in onco-immunology. Eur J Cancer 2021; 149:193-210. [PMID: 33866228 DOI: 10.1016/j.ejca.2021.03.005] [Citation(s) in RCA: 81] [Impact Index Per Article: 20.3] [Received: 12/22/2020] [Revised: 02/26/2021] [Accepted: 03/04/2021] [Indexed: 02/08/2023]
Abstract
The rising interest for precise characterization of the tumour immune contexture has recently brought forward the high potential of RNA sequencing (RNA-seq) in identifying molecular mechanisms engaged in the response to immunotherapy. In this review, we provide an overview of the major principles of single-cell and conventional (bulk) RNA-seq applied to onco-immunology. We describe standard preprocessing and statistical analyses of data obtained from such techniques and highlight some computational challenges relative to the sequencing of individual cells. We notably provide examples of gene expression analyses such as differential expression analysis, dimensionality reduction, clustering and enrichment analysis. Additionally, we used public data sets to exemplify how deconvolution algorithms can identify and quantify multiple immune subpopulations from either bulk or single-cell RNA-seq. We give examples of machine and deep learning models used to predict patient outcomes and treatment effect from high-dimensional data. Finally, we balance the strengths and weaknesses of single-cell and bulk RNA-seq regarding their applications in the clinic.
Affiliation(s)
- Maria Kuksin
- ENS de Lyon, 15 Parvis René Descartes, 69007, Lyon, France; Département d'Innovations Thérapeutiques et Essais Précoces (DITEP), Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France
- Daphné Morel
- Département d'Innovations Thérapeutiques et Essais Précoces (DITEP), Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France; Département de Radiothérapie, Gustave Roussy Cancer Campus, Gustave Roussy, 114 rue Edouard Vaillant, 94800, Villejuif, France; INSERM UMR1030, Molecular Radiotherapy and Therapeutic Innovations, Gustave Roussy, 114 rue Edouard Vaillant, 94800, Villejuif, France
- Marine Aglave
- INSERM US23, CNRS UMS 3655, Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France
- Aurélien Marabelle
- Département d'Innovations Thérapeutiques et Essais Précoces (DITEP), Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France; INSERM U1015, Gustave Roussy, Université Paris Saclay, France
- Andrei Zinovyev
- Institut Curie, PSL Research University, F-75005, Paris, France; INSERM, U900, F-75005, Paris, France; MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, F-75006, Paris, France; Laboratory of Advanced Methods for High-dimensional Data Analysis, Lobachevsky University, 603000, Nizhny Novgorod, Russia
- Daniel Gautheret
- Institute for Integrative Biology of the Cell, UMR 9198, CEA, CNRS, Université Paris-Saclay, Gif-Sur-Yvette, France; IHU PRISM, Gustave Roussy Cancer Campus, Gustave Roussy, 114 Rue Edouard Vaillant, 94800, Villejuif, France; Université Paris-Saclay, France
- Loïc Verlingue
- Département d'Innovations Thérapeutiques et Essais Précoces (DITEP), Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France; INSERM UMR1030, Molecular Radiotherapy and Therapeutic Innovations, Gustave Roussy, 114 rue Edouard Vaillant, 94800, Villejuif, France; Institut Curie, PSL Research University, F-75005, Paris, France; Université Paris-Saclay, France.
26
Fatima N, Rueda L. iSOM-GSN: an integrative approach for transforming multi-omic data into gene similarity networks via self-organizing maps. Bioinformatics 2021; 36:4248-4254. [PMID: 32407457 DOI: 10.1093/bioinformatics/btaa500] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 12/18/2019] [Revised: 04/27/2020] [Accepted: 05/07/2020] [Indexed: 01/04/2023] Open
Abstract
MOTIVATION One of the main challenges in applying graph convolutional neural networks (CNNs) on gene-interaction data is the lack of understanding of the vector space to which they belong, and also the inherent difficulties involved in representing those interactions on a significantly lower dimension, viz Euclidean spaces. The challenge becomes more prevalent when dealing with various types of heterogeneous data. We introduce a systematic, generalized method, called iSOM-GSN, used to transform 'multi-omic' data with higher dimensions onto a 2D grid. Afterwards, we apply a CNN to predict disease states of various types. Based on the idea of Kohonen's self-organizing map, we generate a 2D grid for each sample for a given set of genes that represent a gene similarity network. RESULTS We have tested the model to predict breast and prostate cancer using gene expression, DNA methylation and copy number alteration. Prediction accuracies in the 94-98% range were obtained for tumor stages of breast cancer and calculated Gleason scores of prostate cancer with just 14 input genes for both cases. The scheme not only outputs nearly perfect classification accuracy, but also provides an enhanced scheme for representation learning, visualization, dimensionality reduction and interpretation of multi-omic data. AVAILABILITY AND IMPLEMENTATION The source code and sample data are available via a Github project at https://github.com/NaziaFatima/iSOM_GSN. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nazia Fatima
- School of Computer Science, University of Windsor, Windsor, ON N9B 3P4, Canada
- Luis Rueda
- School of Computer Science, University of Windsor, Windsor, ON N9B 3P4, Canada
27
López-Cortés XA, Matamala F, Maldonado C, Mora-Poblete F, Scapim CA. A Deep Learning Approach to Population Structure Inference in Inbred Lines of Maize. Front Genet 2020; 11:543459. [PMID: 33329691 PMCID: PMC7732446 DOI: 10.3389/fgene.2020.543459] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 03/17/2020] [Accepted: 10/19/2020] [Indexed: 11/16/2022] Open
Abstract
Analysis of population genetic variation and structure is a common practice for genome-wide studies, including association mapping, ecology, and evolution studies in several crop species. In this study, machine learning (ML) clustering methods, K-means (KM), and hierarchical clustering (HC), in combination with non-linear and linear dimensionality reduction techniques, deep autoencoder (DeepAE) and principal component analysis (PCA), were used to infer population structure and individual assignment of maize inbred lines, i.e., dent field corn (n = 97) and popcorn (n = 86). The results revealed that the HC method in combination with DeepAE-based data preprocessing (DeepAE-HC) was the most effective method to assign individuals to clusters (with 96% of correct individual assignments), whereas DeepAE-KM, PCA-HC, and PCA-KM were assigned correctly 92, 89, and 81% of the lines, respectively. These findings were consistent with both Silhouette Coefficient (SC) and Davies-Bouldin validation indexes. Notably, DeepAE-HC also had better accuracy than the Bayesian clustering method implemented in InStruct. The results of this study showed that deep learning (DL)-based dimensional reduction combined with ML clustering methods is a useful tool to determine genetically differentiated groups and to assign individuals into subpopulations in genome-wide studies without having to consider previous genetic assumptions.
Affiliation(s)
- Felipe Matamala
- Department of Computer Sciences and Industries, Catholic University of the Maule, Talca, Chile
- Carlos Maldonado
- Instituto de Ciencias Agroalimentarias, Animales y Ambientales, Universidad de O’Higgins, San Fernando, Chile
28
Emon MA, Heinson A, Wu P, Domingo-Fernández D, Sood M, Vrooman H, Corvol JC, Scordis P, Hofmann-Apitius M, Fröhlich H. Clustering of Alzheimer's and Parkinson's disease based on genetic burden of shared molecular mechanisms. Sci Rep 2020; 10:19097. [PMID: 33154531 PMCID: PMC7645798 DOI: 10.1038/s41598-020-76200-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 02/12/2020] [Accepted: 10/23/2020] [Indexed: 02/07/2023] Open
Abstract
One of the visions of precision medicine has been to re-define disease taxonomies based on molecular characteristics rather than on phenotypic evidence. However, achieving this goal is highly challenging, specifically in neurology. Our contribution is a machine-learning based joint molecular subtyping of Alzheimer's (AD) and Parkinson's Disease (PD), based on the genetic burden of 15 molecular mechanisms comprising 27 proteins (e.g. APOE) that have been described in both diseases. We demonstrate that our joint AD/PD clustering using a combination of sparse autoencoders and sparse non-negative matrix factorization is reproducible and can be associated with significant differences of AD and PD patient subgroups on a clinical, pathophysiological and molecular level. Hence, clusters are disease-associated. To our knowledge this work is the first demonstration of a mechanism based stratification in the field of neurodegenerative diseases. Overall, we thus see this work as an important step towards a molecular mechanism-based taxonomy of neurological disorders, which could help in developing better targeted therapies in the future by going beyond classical phenotype based disease definitions.
Affiliation(s)
- Mohammad Asif Emon
- Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), 53754, Sankt Augustin, Germany
- Bonn-Aachen International Center for IT, University of Bonn, Endenicher Allee 19c, 53115, Bonn, Germany
- Ashley Heinson
- UCB Pharma (UCB Celltech Ltd.), 208 Bath Road, Slough, SL1 3WE, Berkshire, UK
- Ping Wu
- UCB Pharma (UCB Celltech Ltd.), 208 Bath Road, Slough, SL1 3WE, Berkshire, UK
- Daniel Domingo-Fernández
- Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), 53754, Sankt Augustin, Germany
- Bonn-Aachen International Center for IT, University of Bonn, Endenicher Allee 19c, 53115, Bonn, Germany
- Meemansa Sood
- Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), 53754, Sankt Augustin, Germany
- Bonn-Aachen International Center for IT, University of Bonn, Endenicher Allee 19c, 53115, Bonn, Germany
- Henri Vrooman
- Department of Radiology and Nuclear Medicine, Department of Medical Informatics, Erasmus MC, University Medical Center Rotterdam, PO Box 2040, 3000 CA, Rotterdam, The Netherlands
- Phil Scordis
- UCB Pharma (UCB Celltech Ltd.), 208 Bath Road, Slough, SL1 3WE, Berkshire, UK
- Martin Hofmann-Apitius
- Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), 53754, Sankt Augustin, Germany
- Bonn-Aachen International Center for IT, University of Bonn, Endenicher Allee 19c, 53115, Bonn, Germany
- Holger Fröhlich
- Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), 53754, Sankt Augustin, Germany.
- Bonn-Aachen International Center for IT, University of Bonn, Endenicher Allee 19c, 53115, Bonn, Germany.
- UCB Pharma (UCB Biosciences GmbH), Alfred-Nobel-Str. 10, 40789, Monheim, Germany.
29
Application of deep learning in genomics. SCIENCE CHINA-LIFE SCIENCES 2020; 63:1860-1878. [PMID: 33051704 DOI: 10.1007/s11427-020-1804-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 06/13/2020] [Accepted: 08/15/2020] [Indexed: 12/19/2022]
Abstract
In recent years, deep learning has been widely used in diverse fields of research, such as speech recognition, image classification, autonomous driving and natural language processing. Deep learning has showcased dramatically improved performance in complex classification and regression problems, where the intricate structure in the high-dimensional data is difficult to discover using conventional machine learning algorithms. In biology, applications of deep learning are gaining increasing popularity in predicting the structure and function of genomic elements, such as promoters, enhancers, or gene expression levels. In this review paper, we described the basic concepts in machine learning and artificial neural network, followed by elaboration on the workflow of using convolutional neural network in genomics. Then we provided a concise introduction of deep learning applications in genomics and synthetic biology at the levels of DNA, RNA and protein. Finally, we discussed the current challenges and future perspectives of deep learning in genomics.
30
Kang M, Lee S, Lee D, Kim S. Learning Cell-Type-Specific Gene Regulation Mechanisms by Multi-Attention Based Deep Learning With Regulatory Latent Space. Front Genet 2020; 11:869. [PMID: 33133123 PMCID: PMC7561362 DOI: 10.3389/fgene.2020.00869] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 02/07/2020] [Accepted: 07/16/2020] [Indexed: 12/13/2022] Open
Abstract
Epigenetic gene regulation is a major control mechanism of gene expression. Most existing methods for modeling control mechanisms of gene expression use only a single epigenetic marker and very few methods are successful in modeling complex mechanisms of gene regulations using multiple epigenetic markers on transcriptional regulation. In this paper, we propose a multi-attention based deep learning model that integrates multiple markers to characterize complex gene regulation mechanisms. In experiments with 18 cell line multi-omics data, our proposed model predicted the gene expression level more accurately than the state-of-the-art model. Moreover, the model successfully revealed cell-type-specific gene expression control mechanisms. Finally, the model was used to identify genes enriched for specific cell types in terms of their functions and epigenetic regulation.
Affiliation(s)
- Minji Kang
- Bioinformatics Institute, Seoul National University, Seoul, South Korea
- Sangseon Lee
- Bioinformatics Institute, Seoul National University, Seoul, South Korea
- Dohoon Lee
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
- Sun Kim
- Bioinformatics Institute, Seoul National University, Seoul, South Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea; Department of Computer Science and Engineering, Institute of Engineering Research, Seoul National University, Seoul, South Korea
31
Ferreira MF, Camacho R, Teixeira LF. Using autoencoders as a weight initialization method on deep neural networks for disease detection. BMC Med Inform Decis Mak 2020; 20:141. [PMID: 32819347 PMCID: PMC7439655 DOI: 10.1186/s12911-020-01150-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 05/22/2020] [Accepted: 06/08/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND As of today, cancer is still one of the most prevalent and high-mortality diseases, summing more than 9 million deaths in 2018. This has motivated researchers to study the application of machine learning-based solutions for cancer detection to accelerate its diagnosis and help its prevention. Among several approaches, one is to automatically classify tumor samples through their gene expression analysis. METHODS In this work, we aim to distinguish five different types of cancer through RNA-Seq datasets: thyroid, skin, stomach, breast, and lung. To do so, we have adopted a previously described methodology, with which we compare the performance of 3 different autoencoders (AEs) used as a deep neural network weight initialization technique. Our experiments consist in assessing two different approaches when training the classification model - fixing the weights after pre-training the AEs, or allowing fine-tuning of the entire network - and two different strategies for embedding the AEs into the classification network, namely by only importing the encoding layers, or by inserting the complete AE. We then study how varying the number of layers in the first strategy, the AEs latent vector dimension, and the imputation technique in the data preprocessing step impacts the network's overall classification performance. Finally, with the goal of assessing how well does this pipeline generalize, we apply the same methodology to two additional datasets that include features extracted from images of malaria thin blood smears, and breast masses cell nuclei. We also discard the possibility of overfitting by using held-out test sets in the images datasets. RESULTS The methodology attained good overall results for both RNA-Seq and image extracted data. 
We outperformed the established baseline for all the considered datasets, achieving an average F1 score of 99.03, 89.95, and 98.84 and an MCC of 0.99, 0.84, and 0.98, for the RNA-Seq (when detecting thyroid cancer), the Malaria, and the Wisconsin Breast Cancer data, respectively. CONCLUSIONS We observed that the approach of fine-tuning the weights of the top layers imported from the AE reached higher results, for all the presented experiences, and all the considered datasets. We outperformed all the previous reported results when comparing to the established baselines.
Affiliation(s)
- Mafalda Falcão Ferreira
- Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, s/n, Porto, 4200-465, Portugal; INESC TEC - Institute for Systems and Computer Engineering, Technology and Science, Porto, Portugal
- Rui Camacho
- Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, s/n, Porto, 4200-465, Portugal; INESC TEC - Institute for Systems and Computer Engineering, Technology and Science, Porto, Portugal
- Luís F Teixeira
- Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, s/n, Porto, 4200-465, Portugal; INESC TEC - Institute for Systems and Computer Engineering, Technology and Science, Porto, Portugal
32
Zhou X, Chai H, Zhao H, Luo CH, Yang Y. Imputing missing RNA-sequencing data from DNA methylation by using a transfer learning-based neural network. Gigascience 2020; 9:giaa076. [PMID: 32649756 PMCID: PMC7350980 DOI: 10.1093/gigascience/giaa076] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 12/20/2019] [Revised: 04/23/2020] [Accepted: 06/24/2020] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Gene expression plays a key intermediate role in linking molecular features at the DNA level and phenotype. However, owing to various limitations in experiments, the RNA-seq data are missing in many samples while there exist high-quality of DNA methylation data. Because DNA methylation is an important epigenetic modification to regulate gene expression, it can be used to predict RNA-seq data. For this purpose, many methods have been developed. A common limitation of these methods is that they mainly focus on a single cancer dataset and do not fully utilize information from large pan-cancer datasets. RESULTS Here, we have developed a novel method to impute missing gene expression data from DNA methylation data through a transfer learning-based neural network, namely, TDimpute. In the method, the pan-cancer dataset from The Cancer Genome Atlas (TCGA) was utilized for training a general model, which was then fine-tuned on the specific cancer dataset. By testing on 16 cancer datasets, we found that our method significantly outperforms other state-of-the-art methods in imputation accuracy with a 7-11% improvement under different missing rates. The imputed gene expression was further proved to be useful for downstream analyses, including the identification of both methylation-driving and prognosis-related genes, clustering analysis, and survival analysis on the TCGA dataset. More importantly, our method was indicated to be useful for general purposes by an independent test on the Wilms tumor dataset from the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) project. CONCLUSIONS TDimpute is an effective method for RNA-seq imputation with limited training samples.
Affiliation(s)
- Xiang Zhou
- School of Data and Computer Science, Sun Yat-sen University, 132 East Waihuan Road, Guangzhou 510006, China
- Hua Chai
- School of Data and Computer Science, Sun Yat-sen University, 132 East Waihuan Road, Guangzhou 510006, China
- Huiying Zhao
- Sun Yat-sen Memorial Hospital, Sun Yat-sen University, 107 Yan Jiang West Road, Guangzhou 510120, China
- Ching-Hsing Luo
- School of Data and Computer Science, Sun Yat-sen University, 132 East Waihuan Road, Guangzhou 510006, China
- Yuedong Yang
- School of Data and Computer Science, Sun Yat-sen University, 132 East Waihuan Road, Guangzhou 510006, China
- Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of Education, 132 East Waihuan Road, Guangzhou 510006, China
33
Liu S, Li T, Ding H, Tang B, Wang X, Chen Q, Yan J, Zhou Y. A hybrid method of recurrent neural network and graph neural network for next-period prescription prediction. INT J MACH LEARN CYB 2020; 11:2849-2856. [PMID: 33727983 PMCID: PMC7308113 DOI: 10.1007/s13042-020-01155-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 12/30/2019] [Accepted: 06/10/2020] [Indexed: 01/17/2023]
Abstract
Electronic health records (EHRs) have been widely used to help physicians to make decisions by predicting medical events such as diseases, prescriptions, outcomes, and so on. How to represent patient longitudinal medical data is the key to making these predictions. Recurrent neural network (RNN) is a popular model for patient longitudinal medical data representation from the view of patient status sequences, but it cannot represent complex interactions among different types of medical information, i.e., temporal medical event graphs, which can be represented by graph neural network (GNN). In this paper, we propose a hybrid method of RNN and GNN, called RGNN, for next-period prescription prediction from two views, where RNN is used to represent patient status sequences, and GNN is used to represent temporal medical event graphs. Experiments conducted on the public MIMIC-III ICU data show that the proposed method is effective for next-period prescription prediction, and RNN and GNN are mutually complementary.
Affiliation(s)
- Sicen Liu
- Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
- Tao Li
- Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
- Haoyang Ding
- Yidu Cloud (Beijing) Technology Co., Ltd, Beijing, China
- Buzhou Tang
- Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
- PengCheng Laboratory, Shenzhen, China
- Xiaolong Wang
- Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
- Qingcai Chen
- Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
- PengCheng Laboratory, Shenzhen, China
- Jun Yan
- Yidu Cloud (Beijing) Technology Co., Ltd, Beijing, China
- Yi Zhou
- Zhongshan School of Medicine, Sun Yat-Sen University, Guangzhou, China
34
Lekschas F, Peterson B, Haehn D, Ma E, Gehlenborg N, Pfister H. Peax: Interactive Visual Pattern Search in Sequential Data Using Unsupervised Deep Representation Learning. COMPUTER GRAPHICS FORUM : JOURNAL OF THE EUROPEAN ASSOCIATION FOR COMPUTER GRAPHICS 2020; 39:167-179. [PMID: 34334852 PMCID: PMC8323802 DOI: 10.1111/cgf.13971] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 05/20/2023]
Abstract
We present Peax, a novel feature-based technique for interactive visual pattern search in sequential data, like time series or data mapped to a genome sequence. Visually searching for patterns by similarity is often challenging because of the large search space, the visual complexity of patterns, and the user's perception of similarity. For example, in genomics, researchers try to link patterns in multivariate sequential data to cellular or pathogenic processes, but a lack of ground truth and high variance makes automatic pattern detection unreliable. We have developed a convolutional autoencoder for unsupervised representation learning of regions in sequential data that can capture more visual details of complex patterns compared to existing similarity measures. Using this learned representation as features of the sequential data, our accompanying visual query system enables interactive feedback-driven adjustments of the pattern search to adapt to the users' perceived similarity. Using an active learning sampling strategy, Peax collects user-generated binary relevance feedback. This feedback is used to train a model for binary classification, to ultimately find other regions that exhibit patterns similar to the search target. We demonstrate Peax's features through a case study in genomics and report on a user study with eight domain experts to assess the usability and usefulness of Peax. Moreover, we evaluate the effectiveness of the learned feature representation for visual similarity search in two additional user studies. We find that our models retrieve significantly more similar patterns than other commonly used techniques.
Affiliation(s)
- Daniel Haehn
- Harvard School of Engineering and Applied Sciences
- University of Massachusetts Boston
- Eric Ma
- Novartis Institutes for BioMedical Research
35
Seal DB, Das V, Goswami S, De RK. Estimating gene expression from DNA methylation and copy number variation: A deep learning regression model for multi-omics integration. Genomics 2020; 112:2833-2841. [PMID: 32234433 DOI: 10.1016/j.ygeno.2020.03.021] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 11/28/2019] [Revised: 03/17/2020] [Accepted: 03/22/2020] [Indexed: 12/21/2022]
Abstract
Gene expression analysis plays a significant role for providing molecular insights in cancer. Various genetic and epigenetic factors (being dealt under multi-omics) affect gene expression giving rise to cancer phenotypes. A recent growth in understanding of multi-omics seems to provide a resource for integration in interdisciplinary biology since they altogether can draw the comprehensive picture of an organism's developmental and disease biology in cancers. Such large scale multi-omics data can be obtained from public consortium like The Cancer Genome Atlas (TCGA) and several other platforms. Integrating these multi-omics data from varied platforms is still challenging due to high noise and sensitivity of the platforms used. Currently, a robust integrative predictive model to estimate gene expression from these genetic and epigenetic data is lacking. In this study, we have developed a deep learning-based predictive model using Deep Denoising Auto-encoder (DDAE) and Multi-layer Perceptron (MLP) that can quantitatively capture how genetic and epigenetic alterations correlate with directionality of gene expression for liver hepatocellular carcinoma (LIHC). The DDAE used in the study has been trained to extract significant features from the input omics data to estimate the gene expression. These features have then been used for back-propagation learning by the multilayer perceptron for the task of regression and classification. We have benchmarked the proposed model against state-of-the-art regression models. Finally, the deep learning-based integration model has been evaluated for its disease classification capability, where an accuracy of 95.1% has been obtained.
Affiliation(s)
- Dibyendu Bikash Seal
- A. K. Choudhury School of Information Technology, University of Calcutta, JD-2, Sector III, Salt Lake City, Kolkata 700106, India
- Vivek Das
- Novo Nordisk Research Center Seattle, Inc., 530 Fairview Ave N # 5000, Seattle, WA 98109, United States
- Saptarsi Goswami
- Bangabasi Morning College, 35 Rajkumar Chakraborty Sarani, Scott Ln, Kolkata 700009, India
- Rajat K De
- Machine Intelligence Unit, Indian Statistical Institute, 203 Barrackpore Trunk Road, Kolkata 700108, India.
36
Classical and Deep Learning Paradigms for Detection and Validation of Key Genes of Risky Outcomes of HCV. ALGORITHMS 2020. [DOI: 10.3390/a13030073] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/18/2022]
Abstract
Hepatitis C virus (HCV) is one of the most dangerous viruses worldwide. It is the foremost cause of the hepatic cirrhosis, and hepatocellular carcinoma, HCC. Detecting new key genes that play a role in the growth of HCC in HCV patients using machine learning techniques paves the way for producing accurate antivirals. In this work, there are two phases: detecting the up/downregulated genes using classical univariate and multivariate feature selection methods, and validating the retrieved list of genes using Insilico classifiers. However, the classification algorithms in the medical domain frequently suffer from a deficiency of training cases. Therefore, a deep neural network approach is proposed here to validate the significance of the retrieved genes in classifying the HCV-infected samples from the disinfected ones. The validation model is based on the artificial generation of new examples from the retrieved genes’ expressions using sparse autoencoders. Subsequently, the generated genes’ expressions data are used to train conventional classifiers. Our results in the first phase yielded a better retrieval of significant genes using Principal Component Analysis (PCA), a multivariate approach. The retrieved list of genes using PCA had a higher number of HCC biomarkers compared to the ones retrieved from the univariate methods. In the second phase, the classification accuracy can reveal the relevance of the extracted key genes in classifying the HCV-infected and disinfected samples.
|
37
|
de Jongh RP, van Dijk AD, Julsing MK, Schaap PJ, de Ridder D. Designing Eukaryotic Gene Expression Regulation Using Machine Learning. Trends Biotechnol 2020; 38:191-201. [DOI: 10.1016/j.tibtech.2019.07.007] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 05/08/2019] [Revised: 07/12/2019] [Accepted: 07/19/2019] [Indexed: 12/11/2022]
|
38
|
Classification of Kidney Cancer Data Using Cost-Sensitive Hybrid Deep Learning Approach. Symmetry (Basel) 2020. [DOI: 10.3390/sym12010154] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/18/2022] Open
Abstract
Recently, large-scale bioinformatics and genomic data have been generated using advanced biotechnology methods, thus increasing the importance of analyzing such data. Numerous data mining methods have been developed to process genomic data in the field of bioinformatics. We extracted significant genes for the prognosis prediction of 1157 patients using gene expression data from patients with kidney cancer. We then proposed an end-to-end, cost-sensitive hybrid deep learning (COST-HDL) approach with a cost-sensitive loss function for classification tasks on imbalanced kidney cancer data. To address the data imbalance problem, we combined a deep symmetric autoencoder (the decoder mirrors the encoder in layer structure), trained with a reconstruction loss for non-linear feature extraction, with a neural network trained with a balanced classification loss for prognosis prediction. Combined clinical data from patients with kidney cancer and gene data were used to determine the optimal classification model and estimate classification accuracy by sample type, primary diagnosis, tumor stage, and vital status as risk factors representing the state of patients. Experimental results showed that the COST-HDL approach was more efficient with gene expression data for kidney cancer prognosis than other conventional machine learning and data mining techniques. These results could be applied to extract features from gene biomarkers for prognosis prediction of kidney cancer and for prevention and early diagnosis.
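The abstract does not give the form of the cost-sensitive loss, so the following is only a minimal sketch of one common choice: a class-weighted binary cross-entropy with inverse-frequency weights. The function names (`class_weights`, `weighted_bce`), the weighting scheme, and the toy labels are illustrative assumptions, not the COST-HDL implementation.

```python
import math

def class_weights(labels):
    """Inverse-frequency weights (an assumed scheme): n / (2 * class_count)."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}

def weighted_bce(labels, probs, weights, eps=1e-12):
    """Mean binary cross-entropy where each sample is scaled by its class weight."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(labels)

# Toy imbalanced data: 1 minority-class sample, 9 majority-class samples.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# A classifier that predicts "probably negative" for everyone.
probs = [0.1] * 10

w = class_weights(labels)
plain = weighted_bce(labels, probs, {0: 1.0, 1: 1.0})
balanced = weighted_bce(labels, probs, w)
print(w, plain, balanced)
```

With these weights the single misclassified minority sample dominates the loss, so a model that ignores the minority class is penalised more heavily than under the unweighted loss.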
|
39
|
Clinical-learning versus machine-learning for transdiagnostic prediction of psychosis onset in individuals at-risk. Transl Psychiatry 2019; 9:259. [PMID: 31624229 PMCID: PMC6797779 DOI: 10.1038/s41398-019-0600-9] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 02/09/2019] [Revised: 05/03/2019] [Accepted: 05/31/2019] [Indexed: 02/08/2023] Open
Abstract
Predicting the onset of psychosis in individuals at-risk is based on robust prognostic model building methods including a priori clinical knowledge (also termed clinical-learning) to preselect predictors or machine-learning methods to select predictors automatically. To date, there is no empirical research comparing the prognostic accuracy of these two methods for the prediction of psychosis onset. In a first experiment, no improved performance was observed when machine-learning methods (LASSO and RIDGE) were applied, using the same predictors, to an individualised, transdiagnostic, clinically based risk calculator previously developed on the basis of clinical-learning (predictors: age, gender, age by gender, ethnicity, ICD-10 diagnostic spectrum) and externally validated twice. In a second experiment, two refined versions of the published model which expanded the granularity of the ICD-10 diagnosis were introduced: ICD-10 diagnostic categories and ICD-10 diagnostic subdivisions. Although these refined versions showed an increase in apparent performance, their external performance was similar to the original model. In a third experiment, the three refined models were analysed under machine-learning and clinical-learning with a variable event per variable ratio (EPV). The best performing model under low EPVs was obtained through machine-learning approaches. The development of prognostic models on the basis of a priori clinical knowledge, large samples and adequate events per variable is a robust clinical prediction method to forecast psychosis onset in patients at-risk, and is comparable to machine-learning methods, which are more difficult to interpret and implement. Machine-learning methods should be preferred for high dimensional data when no a priori knowledge is available.
|
40
|
Wang H, Li C, Zhang J, Wang J, Ma Y, Lian Y. A new LSTM-based gene expression prediction model: L-GEPM. J Bioinform Comput Biol 2019; 17:1950022. [PMID: 31617459 DOI: 10.1142/s0219720019500227] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 02/03/2023]
Abstract
Molecular biology combined with in silico machine learning and deep learning has facilitated the broad application of gene expression profiles for gene function prediction, optimal crop breeding, disease-related gene discovery, and drug screening. Although the acquisition cost of genome-wide expression profiles has been steadily declining, the cost of generating a compendium of expression profiles from thousands of samples remains high. The Library of Integrated Network-Based Cellular Signatures (LINCS) program used approximately 1000 landmark genes to predict the expression of the remaining target genes by linear regression; however, this approach ignored the nonlinear features influencing gene expression relationships, limiting the accuracy of the experimental results. We herein propose a gene expression prediction model, L-GEPM, based on long short-term memory (LSTM) neural networks, which captures the nonlinear features affecting gene expression and uses the learned features to predict the target genes. A comparison of the experimental errors and fitting behavior of different prediction models shows that the LSTM-based model, L-GEPM, achieves low error and a superior fit.
Affiliation(s)
- Huiqing Wang
- College of Information and Computer, Taiyuan University of Technology, P. R. China
| | - Chun Li
- College of Information and Computer, Taiyuan University of Technology, P. R. China
| | - Jianhui Zhang
- College of Information and Computer, Taiyuan University of Technology, P. R. China
| | - Jingjing Wang
- College of Information and Computer, Taiyuan University of Technology, P. R. China
| | - Yue Ma
- College of Information and Computer, Taiyuan University of Technology, P. R. China
| | - Yuanyuan Lian
- College of Information and Computer, Taiyuan University of Technology, P. R. China
| |
|
41
|
Sparse Convolutional Denoising Autoencoders for Genotype Imputation. Genes (Basel) 2019; 10:genes10090652. [PMID: 31466333 PMCID: PMC6769581 DOI: 10.3390/genes10090652] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 06/26/2019] [Revised: 08/23/2019] [Accepted: 08/24/2019] [Indexed: 12/14/2022] Open
Abstract
Genotype imputation, where missing genotypes can be computationally imputed, is an essential tool in genomic analysis ranging from genome-wide associations to phenotype prediction. Traditional genotype imputation methods are typically based on haplotype-clustering algorithms, hidden Markov models (HMMs), and statistical inference. Deep learning-based methods have recently been reported to suitably address missing data problems in various fields. To explore the performance of deep learning for genotype imputation, in this study, we propose a deep model called a sparse convolutional denoising autoencoder (SCDA) to impute missing genotypes. We constructed the SCDA model using a convolutional layer that can extract various correlation or linkage patterns in the genotype data, and applied a sparse weight matrix resulting from L1 regularization to handle high-dimensional data. We comprehensively evaluated the performance of the SCDA model in different scenarios for genotype imputation on yeast and human genotype data. Our results showed that SCDA has strong robustness and significantly outperforms popular reference-free imputation methods. This study thus points to another novel application of deep learning models for missing data imputation in genomic studies.
|
42
|
Synergy of ICESat-2 and Landsat for Mapping Forest Aboveground Biomass with Deep Learning. REMOTE SENSING 2019. [DOI: 10.3390/rs11121503] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Indexed: 11/16/2022]
Abstract
Spatially continuous estimates of forest aboveground biomass (AGB) are essential to supporting the sustainable management of forest ecosystems and providing invaluable information for quantifying and monitoring terrestrial carbon stocks. The launch of the Ice, Cloud, and land Elevation Satellite-2 (ICESat-2) on September 15th, 2018 offers an unparalleled opportunity to assess AGB at large scales using along-track samples that will be provided during its three-year mission. The main goal of this study was to investigate deep learning (DL) neural networks for mapping AGB with ICESat-2, using simulated photon-counting lidar (PCL)-estimated AGB for daytime, nighttime, and no noise scenarios, Landsat imagery, canopy cover, and land cover maps. The study was carried out in Sam Houston National Forest located in south-east Texas, using a simulated PCL-estimated AGB along two years of planned ICESat-2 profiles. The primary tasks were to investigate and determine neural network architecture, examine the hyper-parameter settings, and subsequently generate wall-to-wall AGB maps. A first set of models was developed using vegetation indices calculated from single-date Landsat imagery, canopy cover, and land cover, and a second set of models was generated using metrics from one year of Landsat imagery with canopy cover and land cover maps. To assess the effectiveness of the final models, comparisons with Random Forest (RF) models were made. The deep neural network (DNN) models achieved R² values of 0.42, 0.49, and 0.50 for the daytime, nighttime, and no noise scenarios respectively. With the extended dataset containing metrics calculated from Landsat images acquired on different dates, substantial improvements in model performance for all data scenarios were noted. The R² values increased to 0.64, 0.66, and 0.67 for the daytime, nighttime, and no noise scenarios. Comparisons with RF prediction models highlighted similar results, with the same R² and root mean square error (RMSE) range (15–16 Mg/ha) for daytime and nighttime scenarios. Findings suggest that there is potential for mapping AGB using a combinatory approach with ICESat-2 and Landsat-derived products with DL.
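For readers unfamiliar with the two metrics reported above, the sketch below computes the coefficient of determination and the RMSE in their standard forms. The AGB values are made-up toy numbers in Mg/ha, not data from the study, and the function names are illustrative.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the inputs."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Hypothetical reference and predicted AGB values (Mg/ha).
agb_reference = [10.0, 20.0, 30.0, 40.0]
agb_predicted = [12.0, 18.0, 33.0, 37.0]

r2 = r_squared(agb_reference, agb_predicted)
err = rmse(agb_reference, agb_predicted)
print(r2, err)
```

Two models can share a similar RMSE while differing in R² if the variance of the reference values differs, which is why the abstract reports both.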
|
43
|
Zhong H, Kim S, Zhi D, Cui X. Predicting gene expression using DNA methylation in three human populations. PeerJ 2019; 7:e6757. [PMID: 31106051 PMCID: PMC6500370 DOI: 10.7717/peerj.6757] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 07/19/2018] [Accepted: 03/10/2019] [Indexed: 12/30/2022] Open
Abstract
Background: DNA methylation, an important epigenetic mark, is well known for its regulatory role in gene expression, especially the negative correlation in the promoter region. However, its correlation with gene expression across the genome at the human population level has not been well studied. In particular, it is unclear whether the genome-wide DNA methylation profile of an individual can predict her/his gene expression profile. Previous studies were mostly limited to association analyses between single CpG site methylation and gene expression. It is not known whether DNA methylation of a gene has enough prediction power to serve as a surrogate for gene expression in existing human study cohorts with DNA samples rather than RNA samples. Results: We examined DNA methylation in the gene region for predicting gene expression across individuals in non-cancer tissues of three human population datasets: adipose tissue from the Multiple Tissue Human Expression Resource Projects (MuTHER), peripheral blood mononuclear cells (PBMC) from asthma and normal control study participants, and lymphoblastoid cell lines (LCL) from healthy individuals. Three prediction models were investigated: single linear regression, multiple linear regression, and least absolute shrinkage and selection operator (LASSO) penalized regression. Our results showed that LASSO regression has superior performance among these methods. However, the prediction power is generally low and varies across datasets. Only 30 and 42 genes were found to have cross-validation R² greater than 0.3 in the PBMC and adipose datasets, respectively. A substantially larger number of genes (258) were identified in the LCL dataset, which was generated from a more homogeneous cell line sample source. We also demonstrated that prediction power is better when no CpG probe is excluded due to cross hybridization or SNP effect.
Conclusion: In our three population analyses, DNA methylation of CpG sites in the gene region has limited prediction power for gene expression across individuals with linear regression models. The prediction power potentially varies depending on tissue, cell type, and data sources. In our analyses, the combination of LASSO regression and all probes on the methylation array, with none excluded, provides the best prediction for gene expression.
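To make the LASSO step concrete, the following is a minimal sketch of L1-penalized least squares solved by cyclic coordinate descent with soft-thresholding. It is a generic textbook formulation, not the authors' pipeline; the function names, the penalty value, and the toy "methylation" matrix (three centred CpG columns, expression depending negatively on the first only) are assumptions made for illustration.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent.

    No intercept is fitted, so the columns of X and y are assumed centred.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, starting from beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes j's own fit.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

# Toy data: rows are individuals, columns are three hypothetical CpG sites.
X = [
    [-2.5, 1.0, 1.0],
    [-1.5, -2.0, 1.0],
    [-0.5, 1.0, -2.0],
    [0.5, 1.0, -2.0],
    [1.5, -2.0, 1.0],
    [2.5, 1.0, 1.0],
]
y = [-2.0 * row[0] for row in X]  # expression falls as site 1 methylation rises

beta = lasso_cd(X, y, lam=0.5)
beta_heavy = lasso_cd(X, y, lam=10.0)
print(beta, beta_heavy)
```

On this toy input the penalty shrinks the first coefficient toward zero and leaves the two uninformative sites at exactly zero, and a sufficiently large penalty zeroes every coefficient. In practice the penalty would be chosen by cross-validation, as the cross-validation R² in the abstract suggests.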
Affiliation(s)
- Huan Zhong
- Department of Biology, Hong Kong Baptist University, Hong Kong, China
| | - Soyeon Kim
- School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
| | - Degui Zhi
- School of Biomedical Informatics, University of Texas Health Center at Houston, Houston, TX, United States of America
| | - Xiangqin Cui
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States of America
| |
|
44
|
Deep Learning in the Biomedical Applications: Recent and Future Status. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9081526] [Citation(s) in RCA: 76] [Impact Index Per Article: 12.7] [Indexed: 02/06/2023]
Abstract
Deep neural networks are currently the most effective machine learning technology in the biomedical domain. In this domain, the areas of interest are the Omics (genomics, transcriptomics, proteomics, and metabolomics), bioimaging (study of biological cells and tissue), medical imaging (study of human organs by creating visual representations), BBMI (study of the brain and body machine interface), and public and medical health management (PmHM). This paper reviews the major deep learning concepts pertinent to such biomedical applications. Concise overviews are provided for the Omics and the BBMI. We end our analysis with a critical discussion, interpretation, and relevant open challenges.
|
45
|
Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet 2019; 51:12-18. [PMID: 30478442 PMCID: PMC11180539 DOI: 10.1038/s41588-018-0295-5] [Citation(s) in RCA: 439] [Impact Index Per Article: 73.2] [Received: 05/14/2018] [Accepted: 09/26/2018] [Indexed: 12/13/2022]
Abstract
Deep learning methods are a class of machine learning techniques capable of identifying highly complex patterns in large datasets. Here, we provide a perspective and primer on deep learning applications for genome analysis. We discuss successful applications in the fields of regulatory genomics, variant calling and pathogenicity scores. We include general guidance for how to effectively use deep learning methods as well as a practical guide to tools and resources. This primer is accompanied by an interactive online tutorial.
Affiliation(s)
- James Zou
- Department of Biomedical Data Science, Stanford University, Palo Alto, CA, USA.
- Chan-Zuckerberg Biohub, San Francisco, CA, USA.
- Department of Electrical Engineering, Stanford University, Palo Alto, CA, USA.
| | - Mikael Huss
- Peltarion, Stockholm, Sweden
- Department of Learning, Informatics, Management and Ethics, Karolinska Institutet, Stockholm, Sweden
| | - Abubakar Abid
- Department of Electrical Engineering, Stanford University, Palo Alto, CA, USA
| | - Pejman Mohammadi
- Scripps Research Translational Institute, La Jolla, CA, USA
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Ali Torkamani
- Scripps Research Translational Institute, La Jolla, CA, USA
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Amalio Telenti
- Scripps Research Translational Institute, La Jolla, CA, USA.
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA.
| |
|
46
|
Gold MP, LeNail A, Fraenkel E. Shallow Sparsely-Connected Autoencoders for Gene Set Projection. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2019; 24:374-385. [PMID: 30963076 PMCID: PMC6417803] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
Abstract
When analyzing biological data, it can be helpful to consider gene sets, or predefined groups of biologically related genes. Methods exist for identifying gene sets that are differential between conditions, but large public datasets from consortium projects and single-cell RNA-Sequencing have opened the door for gene set analysis using more sophisticated machine learning techniques, such as autoencoders and variational autoencoders. We present shallow sparsely-connected autoencoders (SSCAs) and variational autoencoders (SSCVAs) as tools for projecting gene-level data onto gene sets. We tested these approaches on single-cell RNA-Sequencing data from blood cells and on RNA-Sequencing data from breast cancer patients. Both SSCA and SSCVA can recover known biological features from these datasets and the SSCVA method often outperforms SSCA (and six existing gene set scoring algorithms) on classification and prediction tasks.
Affiliation(s)
- Maxwell P. Gold
- Department of Biological Engineering, Massachusetts Institute of Technology, 21 Ames St. Cambridge, MA, 02139, USA
| | - Alexander LeNail
- Department of Biological Engineering, Massachusetts Institute of Technology, 21 Ames St. Cambridge, MA, 02139, USA
| | - Ernest Fraenkel
- Department of Biological Engineering, Massachusetts Institute of Technology, 21 Ames St. Cambridge, MA, 02139, USA
| |
|
47
|
|