1. Arango-Argoty G, Haghighi M, Sun GJ, Choe EY, Markovets A, Barrett JC, Lai Z, Jacob E. An artificial intelligence-based model for prediction of clonal hematopoiesis variants in cell-free DNA samples. NPJ Precis Oncol 2025; 9:147. [PMID: 40394286 DOI: 10.1038/s41698-025-00921-w] [Received: 12/05/2024] [Accepted: 04/23/2025] [Indexed: 05/22/2025]
Abstract
Circulating tumor DNA is a critical biomarker in cancer diagnostics, but its accurate interpretation requires careful consideration of clonal hematopoiesis (CH), which can contribute variants to cell-free DNA (cfDNA) and potentially obscure true tumor-derived signals. Accurate detection of somatic variants of CH origin in plasma samples remains challenging in the absence of matched white blood cell sequencing. Here we present an open-source machine learning framework (MetaCH) that classifies variants in cfDNA from plasma-only samples as being of CH or tumor origin, surpassing state-of-the-art classification rates.
Affiliation(s)
- Etai Jacob
  - Oncology R&D, AstraZeneca, Waltham, MA, USA
2. Yaacov A, Ben Cohen G, Landau J, Hope T, Simon I, Rosenberg S. Cancer mutational signatures identification in clinical assays using neural embedding-based representations. Cell Rep Med 2024; 5:101608. [PMID: 38866015 PMCID: PMC11228799 DOI: 10.1016/j.xcrm.2024.101608] [Received: 09/27/2023] [Revised: 03/28/2024] [Accepted: 05/16/2024] [Indexed: 06/14/2024]
Abstract
While mutational signatures provide a plethora of prognostic and therapeutic insights, their application to the targeted gene panels used in clinical settings is extremely limited. We develop a mutational representation model (which learns and embeds specific mutation signature connections) that enables prediction of dominant signatures from only a few mutations. We predict the dominant signatures across more than 60,000 tumors sequenced with gene panels, delineating their landscape across different cancers. Dominant signature predictions in gene panels are of clinical importance. These include UV, tobacco, and apolipoprotein B mRNA editing enzyme, catalytic polypeptide (APOBEC) signatures that are associated with better survival, independently of mutational burden. Further analyses reveal gene and mutation associations with signatures, such as SBS5 with TP53 and APOBEC with FGFR3 S249C. In a clinical use case, the APOBEC signature is a robust and specific predictor of resistance to epidermal growth factor receptor-tyrosine kinase inhibitors (EGFR-TKIs). Our model provides an easy-to-use way to detect signatures in clinical-setting assays, with many possible clinical implications for an unprecedented number of cancer patients.
Affiliation(s)
- Adar Yaacov
  - Gaffin Center for Neuro-Oncology, Sharett Institute for Oncology, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
  - The Wohl Institute for Translational Medicine, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
  - Department of Microbiology and Molecular Genetics, IMRIC, Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
- Gil Ben Cohen
  - Gaffin Center for Neuro-Oncology, Sharett Institute for Oncology, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
  - The Wohl Institute for Translational Medicine, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
- Jakob Landau
  - Gaffin Center for Neuro-Oncology, Sharett Institute for Oncology, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
  - The Wohl Institute for Translational Medicine, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
- Tom Hope
  - School of Computer Science and Engineering, Hebrew University of Jerusalem, Jerusalem, Israel
- Itamar Simon
  - Department of Microbiology and Molecular Genetics, IMRIC, Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
- Shai Rosenberg
  - Gaffin Center for Neuro-Oncology, Sharett Institute for Oncology, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
  - The Wohl Institute for Translational Medicine, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
3. Gharavi E, LeRoy NJ, Zheng G, Zhang A, Brown DE, Sheffield NC. Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets. Bioengineering (Basel) 2024; 11:263. [PMID: 38534537 PMCID: PMC10967841 DOI: 10.3390/bioengineering11030263] [Received: 12/14/2023] [Revised: 02/20/2024] [Accepted: 02/22/2024] [Indexed: 03/28/2024]
Abstract
As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
Affiliation(s)
- Erfaneh Gharavi
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Nathan J. LeRoy
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Guangtao Zheng
  - Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Aidong Zhang
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Donald E. Brown
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Nathan C. Sheffield
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
4. Poulsgaard GA, Sørensen SG, Juul RI, Nielsen MM, Pedersen JS. Sequence dependencies and mutation rates of localized mutational processes in cancer. Genome Med 2023; 15:63. [PMID: 37592287 PMCID: PMC10436389 DOI: 10.1186/s13073-023-01217-z] [Received: 02/21/2023] [Accepted: 08/02/2023] [Indexed: 08/19/2023]
Abstract
BACKGROUND: Cancer mutations accumulate through replication errors and DNA damage coupled with incomplete repair. Individual mutational processes often show nucleotide sequence and functional region preferences. As a result, some sequence contexts mutate at much higher rates than others, with additional variation found between functional regions. Mutational hotspots, with recurrent mutations across cancer samples, represent genomic positions with elevated mutation rates, often caused by highly localized mutational processes.
METHODS: We count the 11-mer genomic sequences across the genome, and using the PCAWG set of 2583 pan-cancer whole genomes, we associate 11-mers with mutational signatures, hotspots of single nucleotide variants, and specific genomic regions. We evaluate the mutation rates of individual and combined sets of 11-mers and derive mutational sequence motifs.
RESULTS: We show that hotspots generally identify highly mutable sequence contexts. Using these, we show that some mutational signatures are enriched in hotspot sequence contexts, corresponding to well-defined sequence preferences for the underlying localized mutational processes. This includes signature 17b (of unknown etiology) and signatures 62 (POLE deficiency), 7a (UV), and 72 (linked to lymphomas). In some cases, the mutation rate and sequence preference increase further when focusing on certain genomic regions, such as signature 62 in transcribed regions, where the mutation rate is increased up to 9-fold over the cancer type and mutational signature average.
CONCLUSIONS: We summarize our findings in a catalog of localized mutational processes, their sequence preferences, and their estimated mutation rates.
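The 11-mer counting step described in the methods can be illustrated with a minimal sketch. This is not the authors' pipeline: the helper name `count_kmers`, the toy sequence, and the decision to skip windows that contain non-ACGT characters are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(sequence, k=11):
    """Count every overlapping k-mer in a DNA sequence (hypothetical helper)."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Assumption: windows containing ambiguous bases (e.g. N) are skipped.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


# Toy input; a real analysis would stream over whole chromosomes.
toy_sequence = "ACGTACGTACGTACGTNACGT"
kmer_counts = count_kmers(toy_sequence, k=11)
```

Per-k-mer mutation rates could then be estimated by dividing the number of observed mutations at each k-mer by these genome-wide counts, although the normalization used in the paper is not specified in the abstract.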
Affiliation(s)
- Gustav Alexander Poulsgaard
  - Department of Clinical Medicine, Aarhus University, Palle Juul-Jensens Boulevard 82, 8200, Aarhus N, Denmark
  - Department of Molecular Medicine (MOMA), Aarhus University Hospital, Palle Juul-Jensens Boulevard 99, 8200, Aarhus N, Denmark
- Simon Grund Sørensen
  - Department of Clinical Medicine, Aarhus University, Palle Juul-Jensens Boulevard 82, 8200, Aarhus N, Denmark
  - Department of Molecular Medicine (MOMA), Aarhus University Hospital, Palle Juul-Jensens Boulevard 99, 8200, Aarhus N, Denmark
- Randi Istrup Juul
  - Department of Clinical Medicine, Aarhus University, Palle Juul-Jensens Boulevard 82, 8200, Aarhus N, Denmark
  - Department of Molecular Medicine (MOMA), Aarhus University Hospital, Palle Juul-Jensens Boulevard 99, 8200, Aarhus N, Denmark
- Morten Muhlig Nielsen
  - Department of Clinical Medicine, Aarhus University, Palle Juul-Jensens Boulevard 82, 8200, Aarhus N, Denmark
  - Department of Molecular Medicine (MOMA), Aarhus University Hospital, Palle Juul-Jensens Boulevard 99, 8200, Aarhus N, Denmark
- Jakob Skou Pedersen
  - Department of Clinical Medicine, Aarhus University, Palle Juul-Jensens Boulevard 82, 8200, Aarhus N, Denmark
  - Department of Molecular Medicine (MOMA), Aarhus University Hospital, Palle Juul-Jensens Boulevard 99, 8200, Aarhus N, Denmark
  - Bioinformatics Research Centre (BiRC), Aarhus University, University City 81, Building 1872, 3rd Floor, 8000, Aarhus C, Denmark
5. Liu Z, Samee M. Structural underpinnings of mutation rate variations in the human genome. Nucleic Acids Res 2023; 51:7184-7197. [PMID: 37395403 PMCID: PMC10415140 DOI: 10.1093/nar/gkad551] [Received: 04/19/2022] [Revised: 06/06/2023] [Accepted: 06/15/2023] [Indexed: 07/04/2023]
Abstract
Single nucleotide mutation rates have critical implications for human evolution and genetic diseases. Importantly, the rates vary substantially across the genome and the principles underlying such variations remain poorly understood. A recent model explained much of this variation by considering higher-order nucleotide interactions in the 7-mer sequence context around mutated nucleotides. This model's success implicates a connection between DNA shape and mutation rates. DNA shape, i.e. structural properties like helical twist and tilt, is known to capture interactions between nucleotides within a local context. Thus, we hypothesized that changes in DNA shape features at and around mutated positions can explain mutation rate variations in the human genome. Indeed, DNA shape-based models of mutation rates showed similar or improved performance over current nucleotide sequence-based models. These models accurately characterized mutation hotspots in the human genome and revealed the shape features whose interactions underlie mutation rate variations. DNA shape also impacts mutation rates within putative functional regions like transcription factor binding sites where we find a strong association between DNA shape and position-specific mutation rates. This work demonstrates the structural underpinnings of nucleotide mutations in the human genome and lays the groundwork for future models of genetic variations to incorporate DNA shape.
Affiliation(s)
- Zian Liu
  - Department of Integrative Physiology, Baylor College of Medicine, Houston, TX 77030, USA
- Md Abul Hassan Samee
  - Department of Integrative Physiology, Baylor College of Medicine, Houston, TX 77030, USA
6. Sanjaya P, Maljanen K, Katainen R, Waszak SM, Aaltonen LA, Stegle O, Korbel JO, Pitkänen E. Mutation-Attention (MuAt): deep representation learning of somatic mutations for tumour typing and subtyping. Genome Med 2023; 15:47. [PMID: 37420249 PMCID: PMC10326961 DOI: 10.1186/s13073-023-01204-4] [Received: 06/19/2022] [Accepted: 06/21/2023] [Indexed: 07/09/2023]
Abstract
BACKGROUND: Cancer genome sequencing enables accurate classification of tumours and tumour subtypes. However, prediction performance is still limited using exome-only sequencing and for tumour types with low somatic mutation burden such as many paediatric tumours. Moreover, the ability to leverage deep representation learning in discovery of tumour entities remains unknown.
METHODS: We introduce here Mutation-Attention (MuAt), a deep neural network to learn representations of simple and complex somatic alterations for prediction of tumour types and subtypes. In contrast to many previous methods, MuAt utilizes the attention mechanism on individual mutations instead of aggregated mutation counts.
RESULTS: We trained MuAt models on 2587 whole cancer genomes (24 tumour types) from the Pan-Cancer Analysis of Whole Genomes (PCAWG) and 7352 cancer exomes (20 types) from The Cancer Genome Atlas (TCGA). MuAt achieved prediction accuracy of 89% for whole genomes and 64% for whole exomes, and a top-5 accuracy of 97% and 90%, respectively. MuAt models were found to be well-calibrated and perform well in three independent whole cancer genome cohorts with 10,361 tumours in total. We show MuAt to be able to learn clinically and biologically relevant tumour entities including acral melanoma, SHH-activated medulloblastoma, SPOP-associated prostate cancer, microsatellite instability, POLE proofreading deficiency, and MUTYH-associated pancreatic endocrine tumours without these tumour subtypes and subgroups being provided as training labels. Finally, scrutiny of MuAt attention matrices revealed both ubiquitous and tumour-type specific patterns of simple and complex somatic mutations.
CONCLUSIONS: Integrated representations of somatic alterations learnt by MuAt were able to accurately identify histological tumour types and identify tumour entities, with potential to impact precision cancer medicine.
Affiliation(s)
- Prima Sanjaya
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
  - Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - iCAN Digital Precision Cancer Medicine Flagship, Helsinki, Finland
- Katri Maljanen
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
  - Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - iCAN Digital Precision Cancer Medicine Flagship, Helsinki, Finland
- Riku Katainen
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
  - Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - iCAN Digital Precision Cancer Medicine Flagship, Helsinki, Finland
  - Department of Medical and Clinical Genetics, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Sebastian M Waszak
  - Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo and Oslo University Hospital, Oslo, Norway
  - Swiss Institute for Experimental Cancer Research School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - Department of Neurology, University of California, San Francisco (UCSF), San Francisco, CA, USA
- Lauri A Aaltonen
  - Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - Department of Medical and Clinical Genetics, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Oliver Stegle
  - Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
  - Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Jan O Korbel
  - Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
  - Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Esa Pitkänen
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
  - Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - iCAN Digital Precision Cancer Medicine Flagship, Helsinki, Finland
  - Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
7. Abdollahi S, Lin PC, Chiang JH. DiaDeL: An Accurate Deep Learning-Based Model With Mutational Signatures for Predicting Metastasis Stage and Cancer Types. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1336-1343. [PMID: 34570707 DOI: 10.1109/tcbb.2021.3115504] [Indexed: 06/13/2023]
Abstract
Mutational signatures help identify cancer-associated genes that are involved in tumorigenesis pathways. These pathways, in turn, guide precision medicine approaches to finding appropriate drugs and treatments. The pattern of mutations varies between cancer types. Some mutations dysregulate protein function, so their accumulation is responsible for cancer development and might be associated with particular cancer types. Therefore, mutations can serve as an informative feature set for distinguishing cancer types. There are several options for representing mutations. One might employ binary values to mark mutated regions. Another potential method for extracting features is to use mutation interpreters. In this study, we investigate the trinucleotide mutational pattern of each cancer type. Moreover, we extract salient NMF-based mutational signatures across various cancer types. Then, we identify cancer-associated genes of a target cancer based on its salient signatures. We evaluate the cancer-associated genes using survival and gene expression analysis in different stages of cancer. Furthermore, we introduce DiaDeL, a deep learning-based binary classifier. The DiaDeL model uses mutational signatures as input features and distinguishes one cancer type from the others. Our proposed model outperforms six state-of-the-art methods, achieving an accuracy of 0.824 and an AUC of 0.88. The source code is available at https://github.com/sabdollahi/DiaDeL.
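A trinucleotide mutational pattern is commonly encoded as 96 channels: six pyrimidine-centred substitution types, each with 16 combinations of flanking bases. The sketch below illustrates that widely used convention and is not taken from the DiaDeL code; the function names and toy mutations are hypothetical, and whether DiaDeL uses exactly this encoding is an assumption.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def sbs96_channel(context, alt):
    """Map a 3-base reference context and alternate base to a channel label."""
    ref = context[1]
    if ref in "AG":
        # Purine reference: report the mutation on the opposite strand instead.
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
        ref = context[1]
    return f"{context[0]}[{ref}>{alt}]{context[2]}"


def all_channels():
    """Enumerate the 96 channels in a fixed order."""
    substitutions = [("C", "A"), ("C", "G"), ("C", "T"),
                     ("T", "A"), ("T", "C"), ("T", "G")]
    return [f"{five}[{ref}>{alt}]{three}"
            for ref, alt in substitutions
            for five in "ACGT"
            for three in "ACGT"]


def mutation_spectrum(mutations):
    """Count mutations per channel; the result is one 96-element row."""
    counts = Counter(sbs96_channel(context, alt) for context, alt in mutations)
    return [counts[channel] for channel in all_channels()]


# Toy mutations as (reference trinucleotide, alternate base) pairs.
toy_mutations = [("TCA", "T"), ("TGA", "A"), ("ATG", "C")]
spectrum = mutation_spectrum(toy_mutations)
```

Stacking one such row per tumour gives the non-negative samples-by-channels matrix that an NMF-based signature extraction would factorize.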
8. Abdollahi S, Dehghanian SZ, Hung LY, Yang SJ, Chen DP, Medeiros LJ, Chiang JH, Chang KC. Deciphering genes associated with diffuse large B-cell lymphoma with lymphomatous effusions: A mutational accumulation scoring approach. Biomark Res 2021; 9:74. [PMID: 34635181 PMCID: PMC8504051 DOI: 10.1186/s40364-021-00330-8] [Received: 07/27/2021] [Accepted: 09/22/2021] [Indexed: 12/11/2022]
Abstract
Introduction: Earlier studies have shown that lymphomatous effusions in patients with diffuse large B-cell lymphoma (DLBCL) are associated with a very poor prognosis, even worse than for non-effusion-associated patients with stage IV disease. We hypothesized that certain genetic abnormalities were associated with lymphomatous effusions, which would help to identify related pathways, oncogenic mechanisms, and therapeutic targets.
Methods: We compared whole-exome sequencing on DLBCL samples involving solid organs (n = 22) and involving effusions (n = 9). We designed a mutational accumulation-based approach to score each gene and used mutation interpreters to identify candidate pathogenic genes associated with lymphomatous effusions. Moreover, we performed gene-set enrichment analysis from a microarray comparison of effusion-associated versus non-effusion-associated DLBCL cases to extract the related pathways.
Results: We found that genes involved in identified pathways or with high accumulation scores in the effusion-based DLBCL cases were associated with migration/invasion. We validated expression of 8 selected genes in DLBCL cell lines and clinical samples: MUC4, SLC35G6, TP53BP2, ARAP3, IL13RA1, PDIA4, HDAC1 and MDM2, and validated expression of 3 proteins (MUC4, HDAC1 and MDM2) in an independent cohort of DLBCL cases with (n = 31) and without (n = 20) lymphomatous effusions. We found that overexpression of HDAC1 and MDM2 correlated with the presence of lymphomatous effusions, and HDAC1 overexpression was associated with the poorest prognosis.
Conclusion: Our findings suggest that DLBCL associated with lymphomatous effusions may be associated mechanistically with the TP53-MDM2 pathway and HDAC-related chromatin remodeling mechanisms.
Supplementary Information: The online version contains supplementary material available at 10.1186/s40364-021-00330-8.
Affiliation(s)
- Sina Abdollahi
  - Intelligent Information Retrieval Lab, Department of Computer Science and Information Engineering, National Cheng Kung University, 701, Tainan, Taiwan
- Liang-Yi Hung
  - Department of Biotechnology and Bioindustry Sciences, College of Bioscience and Biotechnology, National Cheng Kung University, Tainan, Taiwan
  - Department of Pharmacology, College of Medicine, National Cheng Kung University, Tainan, Taiwan
  - University Center for Bioscience and Biotechnology, National Cheng Kung University, Tainan, Taiwan
  - Cancer Molecular Biology and Drug Discovery, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
  - Graduate Institute of Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
- Shiang-Jie Yang
  - Institute of Basic Medical Sciences, College of Medicine, National Cheng Kung University, Tainan, Taiwan
- Dao-Peng Chen
  - Kim Forest Enterprise Co., Ltd, New Taipei City, Taiwan
- L Jeffrey Medeiros
  - Department of Hematopathology, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Jung-Hsien Chiang
  - Intelligent Information Retrieval Lab, Department of Computer Science and Information Engineering, National Cheng Kung University, 701, Tainan, Taiwan
  - Institute of Medical Informatics, National Cheng Kung University, Tainan, Taiwan
- Kung-Chao Chang
  - Department of Pathology, College of Medicine, National Cheng Kung University Hospital, National Cheng Kung University, 138 Sheng-Li Road, 704, Tainan, Taiwan
  - Department of Pathology, College of Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
  - Department of Pathology, Kaohsiung Medical University Hospital, Kaohsiung, Taiwan
  - Center for Cancer Research, Kaohsiung Medical University, Kaohsiung, Taiwan
9. Iuchi H, Matsutani T, Yamada K, Iwano N, Sumi S, Hosoda S, Zhao S, Fukunaga T, Hamada M. Representation learning applications in biological sequence analysis. Comput Struct Biotechnol J 2021; 19:3198-3208. [PMID: 34141139 PMCID: PMC8190442 DOI: 10.1016/j.csbj.2021.05.039] [Received: 02/26/2021] [Revised: 05/10/2021] [Accepted: 05/20/2021] [Indexed: 12/16/2022]
Abstract
Although remarkable advances have been reported in high-throughput sequencing, the ability to aptly analyze a substantial amount of rapidly generated biological (DNA/RNA/protein) sequencing data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention. In this method, biological sequences are regarded as sentences, while the single nucleotides/amino acids or k-mers in these sequences represent the words. Embedding, an essential step in NLP, converts these words into vectors. Representation learning is an approach to learning this conversion, and it can be applied to biological sequences. Vectorized biological sequences can then be used for function and structure estimation, or as input for other probabilistic models. Considering the importance of, and the growing trend in, applying representation learning to biological research, in the present study we review the existing knowledge on representation learning for biological sequence analysis.
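The "sequences as sentences, k-mers as words" idea can be sketched in a few lines. This is an illustrative example rather than a method from the review: the helper names, the choice of k = 3, and the toy sequences are assumptions. The integer ids it produces are what an embedding layer would map to vectors; the embedding step itself is not shown.

```python
def kmer_tokenize(sequence, k=3):
    """Split a sequence into overlapping k-mer 'words' (hypothetical helper)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocabulary(sentences):
    """Assign each distinct k-mer an integer id in order of first appearance."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence:
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


# Toy DNA sequences standing in for a corpus.
sequences = ["ATGCGT", "GCGTAA"]
sentences = [kmer_tokenize(seq, k=3) for seq in sequences]
vocabulary = build_vocabulary(sentences)

# Each sequence becomes a list of token ids, ready for an embedding lookup.
encoded = [[vocabulary[word] for word in sentence] for sentence in sentences]
```

Overlapping k-mers keep local context at the cost of a vocabulary that grows as 4^k for DNA, so k is usually chosen as a trade-off between context and sparsity.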
Affiliation(s)
- Hitoshi Iuchi
  - Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Taro Matsutani
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
  - Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Keisuke Yamada
  - School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Natsuki Iwano
  - Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Shunsuke Sumi
  - Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
  - Department of Life Science Frontiers, Center for iPS Cell Research and Application, Kyoto University, Kyoto 606-8507, Japan
- Shion Hosoda
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
  - Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Shitao Zhao
  - Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Tsukasa Fukunaga
  - Waseda Institute for Advanced Study, Waseda University, Tokyo 169-0051, Japan
  - Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-0032, Japan
- Michiaki Hamada
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
  - Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
  - School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
  - Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan