Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 2021;37:2112-2120. [PMID: 33538820 PMCID: PMC11025658 DOI: 10.1093/bioinformatics/btab083] [Citation(s) in RCA: 340] [Impact Index Per Article: 85.0] [Received: 09/10/2020] [Revised: 12/31/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022] Open

Number

Cited by Other Article(s)

101

Cheng S, Wei Y, Zhou Y, Xu Z, Wright DN, Liu J, Peng Y. Deciphering genomic codes using advanced NLP techniques: a scoping review. ARXIV 2024:arXiv:2411.16084v1. [PMID: 39650606 PMCID: PMC11623714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/11/2024]

102

He Y, Zhou F, Bai J, Gao Y, Huang X, Wang Y. ViTax: adaptive hierarchical viral taxonomy classification with a taxonomy belief tree on a foundation model. Brief Bioinform 2024;26:bbaf041. [PMID: 39921398 PMCID: PMC11805961 DOI: 10.1093/bib/bbaf041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2024] [Revised: 12/18/2024] [Accepted: 01/20/2025] [Indexed: 02/10/2025] Open

Abstract

Viruses exert a profound influence on both human health and the global ecosystem, yet they remain largely unexplored. Precise taxonomic classification of viral sequences is essential for discovering novel viruses, elucidating their functions, and assessing their implications for public health and environmental monitoring. Traditional taxonomy methods based on genome references are limited by the vast number of unexplored viruses, rapid mutation rates, and high genetic diversity. Additionally, highly imbalanced species distribution and significant variances in inter-species genomic distances across taxonomic units pose challenges to classifier training. Conceptualizing genomic sequences as sentences in a natural language, large language models provide novel approaches for extracting intrinsic viral genome characteristics. In this study, we introduce ViTax, a virus taxonomy classification tool powered by HyenaDNA, a large language foundation model for long-range genomic sequences at single nucleotide resolution. ViTax integrates supervised prototypical contrastive learning to address the highly imbalanced distributions across various taxonomic clades and demonstrates superior performance to current leading methods in virus taxonomy, particularly significant for long sequences. Moreover, ViTax designs a belief mapping tree using the Lowest Common Ancestor algorithm to adaptively assign a sequence to the lowest taxonomy clade with confidence. For the open-set problem, where sequences belong to novel and unexplored genera, ViTax can adaptively assign them to a higher level of known taxonomy with outstanding performance. These capabilities make ViTax a robust tool for advancing the accuracy and reliability of viral taxonomy classification. The code is available at https://github.com/Ying-Lab/ViTax.
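The belief-tree idea in the abstract above can be illustrated with a small sketch. This is not ViTax's implementation: the toy taxonomy, the names `PARENT`, `lineage`, and `assign_clade`, and the 0.8 threshold are all assumptions made for illustration. The sketch sums genus-level probabilities up a taxonomy tree and returns the lowest ancestor of the top-scoring genus whose accumulated belief clears the threshold.

```python
from collections import defaultdict

# Hypothetical toy taxonomy (child -> parent); names are illustrative only.
PARENT = {
    "genus_A": "family_X",
    "genus_B": "family_X",
    "genus_C": "family_Y",
    "family_X": "order_1",
    "family_Y": "order_1",
    "order_1": "root",
}


def lineage(node):
    """Return the path from a node up to the root, starting with the node."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def assign_clade(leaf_probs, threshold=0.8):
    """Return (clade, belief) for the lowest clade that is confident enough.

    leaf_probs maps each genus to a classifier probability. The threshold
    value is an assumption, not a figure taken from the paper.
    """
    # Every ancestor accumulates the probability mass of its descendants.
    belief = defaultdict(float)
    for leaf, p in leaf_probs.items():
        for node in lineage(leaf):
            belief[node] += p

    # Walk upward from the most probable genus and stop at the first
    # (lowest) node whose accumulated belief reaches the threshold.
    best_leaf = max(leaf_probs, key=leaf_probs.get)
    for node in lineage(best_leaf):
        if belief[node] >= threshold:
            return node, belief[node]
    return "root", belief["root"]


if __name__ == "__main__":
    print(assign_clade({"genus_A": 0.9, "genus_B": 0.05, "genus_C": 0.05}))
    print(assign_clade({"genus_A": 0.5, "genus_B": 0.4, "genus_C": 0.1}))
```

In this sketch, a sequence whose probability is split between two genera of the same family is assigned to that family rather than forced into either genus, which mirrors the open-set behaviour the abstract describes.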

103

Lu Q, Xu J, Zhang R, Liu H, Wang M, Liu X, Yue Z, Gao Y. RiceSNP-ABST: a deep learning approach to identify abiotic stress-associated single nucleotide polymorphisms in rice. Brief Bioinform 2024;26:bbae702. [PMID: 39757606 PMCID: PMC11962596 DOI: 10.1093/bib/bbae702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2024] [Revised: 11/16/2024] [Accepted: 12/23/2024] [Indexed: 01/07/2025] Open

Affiliation(s)

Quan Lu School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
Jiajun Xu School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
Renyi Zhang School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
Hangcheng Liu School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
Meng Wang School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
Xiaoshuang Liu Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
Zhenyu Yue School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
Yujia Gao School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China

104

Wang Y, Kong S, Zhou C, Wang Y, Zhang Y, Fang Y, Li G. A review of deep learning models for the prediction of chromatin interactions with DNA and epigenomic profiles. Brief Bioinform 2024;26:bbae651. [PMID: 39708837 DOI: 10.1093/bib/bbae651] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2024] [Revised: 10/29/2024] [Accepted: 12/03/2024] [Indexed: 12/23/2024] Open

Affiliation(s)

Yunlong Wang Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 97 Buxin Road, Dapeng New District, Shenzhen 518120, China
Siyuan Kong Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 97 Buxin Road, Dapeng New District, Shenzhen 518120, China
Cong Zhou Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China College of Informatics, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
Yanfang Wang State Key Laboratory of Animal Biotech Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences (CAAS), No. 2 West Yuanmingyuan Rd, Haidian District, Beijing 100193, China
Yubo Zhang Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 97 Buxin Road, Dapeng New District, Shenzhen 518120, China Sequencing Facility, Frederick National Laboratory for Cancer Research, 8560 Progress Drive, Frederick, MD 21701, United States
Yaping Fang Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China College of Informatics, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
Guoliang Li Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China College of Informatics, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China

105

Ding Z, Wei R, Xia J, Mu Y, Wang J, Lin Y. Exploring the potential of large language model-based chatbots in challenges of ribosome profiling data analysis: a review. Brief Bioinform 2024;26:bbae641. [PMID: 39668339 PMCID: PMC11638007 DOI: 10.1093/bib/bbae641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2024] [Revised: 11/02/2024] [Accepted: 11/27/2024] [Indexed: 12/14/2024] Open

Affiliation(s)

Zheyu Ding School of Pharmacy, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
Rong Wei School of Pharmacy, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
Jianing Xia School of Pharmacy, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
Yonghao Mu School of Pharmacy, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
Jiahuan Wang School of Pharmacy, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
Yingying Lin School of Pharmacy, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China

106

Cho HN, Jun TJ, Kim YH, Kang H, Ahn I, Gwon H, Kim Y, Seo J, Choi H, Kim M, Han J, Kee G, Park S, Ko S. Task-Specific Transformer-Based Language Models in Health Care: Scoping Review. JMIR Med Inform 2024;12:e49724. [PMID: 39556827 PMCID: PMC11612605 DOI: 10.2196/49724] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Revised: 07/10/2023] [Accepted: 10/21/2024] [Indexed: 11/20/2024] Open

Abstract

BACKGROUND

Transformer-based language models have shown great potential to revolutionize health care by advancing clinical decision support, patient interaction, and disease prediction. However, despite their rapid development, the implementation of transformer-based language models in health care settings remains limited. This is partly due to the lack of a comprehensive review, which hinders a systematic understanding of their applications and limitations. Without clear guidelines and consolidated information, both researchers and physicians face difficulties in using these models effectively, resulting in inefficient research efforts and slow integration into clinical workflows.

OBJECTIVE

This scoping review addresses this gap by examining studies on medical transformer-based language models and categorizing them into 6 tasks: dialogue generation, question answering, summarization, text classification, sentiment analysis, and named entity recognition.

METHODS

We conducted a scoping review following the Cochrane scoping review protocol. A comprehensive literature search was performed across databases, including Google Scholar and PubMed, covering publications from January 2017 to September 2024. Studies involving transformer-derived models in medical tasks were included. Data were categorized into 6 key tasks.

RESULTS

Our key findings revealed both advancements and critical challenges in applying transformer-based models to health care tasks. For example, models like MedPIR involving dialogue generation show promise but face privacy and ethical concerns, while question-answering models like BioBERT improve accuracy but struggle with the complexity of medical terminology. The BioBERTSum summarization model aids clinicians by condensing medical texts but needs better handling of long sequences.

CONCLUSIONS

This review attempted to provide a consolidated understanding of the role of transformer-based language models in health care and to guide future research directions. By addressing current challenges and exploring the potential for real-world applications, we envision significant improvements in health care informatics. Addressing the identified challenges and implementing proposed solutions can enable transformer-based language models to significantly improve health care delivery and patient outcomes. Our review provides valuable insights for future research and practical applications, setting the stage for transformative advancements in medical informatics.

Affiliation(s)

Ha Na Cho Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
Tae Joon Jun Big Data Research Center, Asan Institute for Life Sciences, Asan Medical Center, Seoul, Republic of Korea
Young-Hak Kim Division of Cardiology, Department of Information Medicine, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
Heejun Kang Division of Cardiology, Asan Medical Center, Seoul, Republic of Korea
Imjin Ahn Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
Hansle Gwon Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
Yunha Kim Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
Jiahn Seo Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
Heejung Choi Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
Minkyoung Kim Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
Jiye Han Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
Gaeun Kee Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
Seohyun Park Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
Soyoung Ko Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea

107

Nguyen E, Poli M, Durrant MG, Kang B, Katrekar D, Li DB, Bartie LJ, Thomas AW, King SH, Brixi G, Sullivan J, Ng MY, Lewis A, Lou A, Ermon S, Baccus SA, Hernandez-Boussard T, Ré C, Hsu PD, Hie BL. Sequence modeling and design from molecular to genome scale with Evo. Science 2024;386:eado9336. [PMID: 39541441 PMCID: PMC12057570 DOI: 10.1126/science.ado9336] [Citation(s) in RCA: 35] [Impact Index Per Article: 35.0] [Received: 02/27/2024] [Accepted: 09/09/2024] [Indexed: 11/16/2024]

Affiliation(s)

Eric Nguyen Arc Institute, Palo Alto, CA, USA Department of Bioengineering, Stanford University, Stanford, CA, USA
Michael Poli Department of Computer Science, Stanford University, Stanford, CA, USA TogetherAI, San Francisco, CA, USA
Matthew G. Durrant Arc Institute, Palo Alto, CA, USA
Brian Kang Arc Institute, Palo Alto, CA, USA Department of Bioengineering, Stanford University, Stanford, CA, USA
Dhruva Katrekar Arc Institute, Palo Alto, CA, USA
David B. Li Arc Institute, Palo Alto, CA, USA Department of Bioengineering, Stanford University, Stanford, CA, USA
Liam J. Bartie Arc Institute, Palo Alto, CA, USA
Armin W. Thomas Stanford Data Science, Stanford University, Stanford, CA, USA
Samuel H. King Arc Institute, Palo Alto, CA, USA Department of Bioengineering, Stanford University, Stanford, CA, USA
Garyk Brixi Arc Institute, Palo Alto, CA, USA Department of Genetics, Stanford University, Stanford, CA, USA
Jeremy Sullivan Arc Institute, Palo Alto, CA, USA
Madelena Y. Ng Stanford Center for Biomedical Informatics Research, Stanford, CA, USA
Ashley Lewis Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
Aaron Lou Department of Computer Science, Stanford University, Stanford, CA, USA
Stefano Ermon Department of Computer Science, Stanford University, Stanford, CA, USA CZ Biohub, San Francisco, CA, USA
Stephen A. Baccus Department of Neurobiology, Stanford University, Stanford, CA, USA
Tina Hernandez-Boussard Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
Christopher Ré Department of Computer Science, Stanford University, Stanford, CA, USA
Patrick D. Hsu Arc Institute, Palo Alto, CA, USA Department of Bioengineering and Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
Brian L. Hie Arc Institute, Palo Alto, CA, USA Stanford Data Science, Stanford University, Stanford, CA, USA Department of Chemical Engineering, Stanford University, Stanford, CA, USA

108

Theodoris CV. Learning the language of DNA. Science 2024;386:729-730. [PMID: 39541478 DOI: 10.1126/science.adt3007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2024]

109

Cao C, Wang C, Dai Q, Zou Q, Wang T. CRBPSA: CircRNA-RBP interaction sites identification using sequence structural attention model. BMC Biol 2024;22:260. [PMID: 39543602 PMCID: PMC11566611 DOI: 10.1186/s12915-024-02055-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2024] [Accepted: 10/30/2024] [Indexed: 11/17/2024] Open

110

Kumar A, Dixit S, Srinivasan K, M D, Vincent PMDR. Personalized cancer vaccine design using AI-powered technologies. Front Immunol 2024;15:1357217. [PMID: 39582860 PMCID: PMC11581883 DOI: 10.3389/fimmu.2024.1357217] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Received: 12/17/2023] [Accepted: 09/24/2024] [Indexed: 11/26/2024] Open

111

Liu J, Shen H, Yang Y, Yang M, Zhang Q, Chen K, Li X. Transformer-based representation learning and multiple-instance learning for cancer diagnosis exclusively from raw sequencing fragments of bisulfite-treated plasma cell-free DNA. Mol Oncol 2024;18:2755-2769. [PMID: 39380154 PMCID: PMC11547222 DOI: 10.1002/1878-0261.13745] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2023] [Revised: 07/31/2024] [Accepted: 09/24/2024] [Indexed: 10/10/2024] Open

Abstract

Early cancer diagnosis from bisulfite-treated cell-free DNA (cfDNA) fragments requires tedious data analytical procedures. Here, we present a deep-learning-based approach for early cancer interception and diagnosis (DECIDIA) that can achieve accurate cancer diagnosis exclusively from bisulfite-treated cfDNA sequencing fragments. DECIDIA relies on transformer-based representation learning of DNA fragments and weakly supervised multiple-instance learning for classification. We systematically evaluate the performance of DECIDIA for cancer diagnosis and cancer type prediction on a curated dataset of 5389 samples that consist of colorectal cancer (CRC; n = 1574), hepatocellular cell carcinoma (HCC; n = 1181), lung cancer (n = 654), and non-cancer control (n = 1980). DECIDIA achieved an area under the receiver operating curve (AUROC) of 0.980 (95% CI, 0.976-0.984) in 10-fold cross-validation settings on the CRC dataset by differentiating cancer patients from cancer-free controls, outperforming benchmarked methods that are based on methylation intensities. Noticeably, DECIDIA achieved an AUROC of 0.910 (95% CI, 0.896-0.924) on the externally independent HCC testing set in distinguishing HCC patients from cancer-free controls, although there was no HCC data used in model development. In the settings of cancer-type classification, we observed that DECIDIA achieved a micro-average AUROC of 0.963 (95% CI, 0.960-0.966) and an overall accuracy of 82.8% (95% CI, 81.8-83.9). In addition, we distilled four sequence signatures from the raw sequencing reads that exhibited differential patterns in cancer versus control and among different cancer types. Our approach represents a new paradigm towards eliminating the tedious data analytical procedures for liquid biopsy that uses bisulfite-treated cfDNA methylome.
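The multiple-instance setting in the abstract above (one label per sample, many unlabeled fragments per sample) can be sketched as follows. The abstract does not say how DECIDIA pools fragment-level outputs, so attention-weighted averaging is an assumption here, and `softmax` and `bag_score` are hypothetical helper names. In a real model the per-fragment scores and attention logits would come from learned layers on top of the transformer embeddings; fixed numbers stand in for them in this sketch.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def bag_score(instance_scores, attention_logits):
    """Combine per-fragment scores into one sample-level score.

    instance_scores: one score per cfDNA fragment (for example a cancer
    probability). attention_logits: one unnormalized weight per fragment.
    Both inputs are placeholders for learned model outputs.
    """
    if len(instance_scores) != len(attention_logits):
        raise ValueError("need one attention logit per instance score")
    weights = softmax(attention_logits)
    return sum(w * s for w, s in zip(weights, instance_scores))


if __name__ == "__main__":
    # Equal logits reduce attention pooling to a plain mean.
    print(bag_score([0.2, 0.8], [0.0, 0.0]))
    # A large logit lets one informative fragment dominate the sample score.
    print(bag_score([0.1, 0.9], [0.0, 10.0]))
```

Only the sample-level score is compared with the sample label during training, which is why the approach is described as weakly supervised: no fragment ever needs its own label.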

Affiliation(s)

Jilei Liu Tianjin Cancer Institute, Tianjin's Clinical Research Center for Cancer, National Clinical Research Center for Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China
Hongru Shen Tianjin Cancer Institute, Tianjin's Clinical Research Center for Cancer, National Clinical Research Center for Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China
Yichen Yang Tianjin Cancer Institute, Tianjin's Clinical Research Center for Cancer, National Clinical Research Center for Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China
Meng Yang Tianjin Cancer Institute, Tianjin's Clinical Research Center for Cancer, National Clinical Research Center for Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China
Qiang Zhang Department of Maxillofacial and Otorhinolaryngology Oncology, Tianjin's Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China
Kexin Chen Department of Epidemiology and Biostatistics, Key Laboratory of Molecular Cancer Epidemiology of Tianjin, Tianjin's Clinical Research Center for Cancer, Key Laboratory of Prevention and Control of Major Diseases in the Population Ministry of Education, National Clinical Research Center for Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China
Xiangchun Li Tianjin Cancer Institute, Tianjin's Clinical Research Center for Cancer, National Clinical Research Center for Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China

112

Romeijn L, Bernatavicius A, Vu D. MycoAI: Fast and accurate taxonomic classification for fungal ITS sequences. Mol Ecol Resour 2024;24:e14006. [PMID: 39152642 DOI: 10.1111/1755-0998.14006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2024] [Revised: 07/12/2024] [Accepted: 08/06/2024] [Indexed: 08/19/2024]

113

Song T, Song H, Pan Z, Gao Y, Dai H, Wang X. DeepDualEnhancer: A Dual-Feature Input DNABert Based Deep Learning Method for Enhancer Recognition. Int J Mol Sci 2024;25:11744. [PMID: 39519295 PMCID: PMC11546905 DOI: 10.3390/ijms252111744] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2024] [Revised: 10/23/2024] [Accepted: 10/28/2024] [Indexed: 11/16/2024] Open

114

Xu R, Li D, Yang W, Wang G, Li Y. Improving ncRNA family prediction using multi-modal contrastive learning of sequence and structure. Bioinformatics 2024;40:btae640. [PMID: 39460948 PMCID: PMC11639665 DOI: 10.1093/bioinformatics/btae640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2024] [Revised: 10/15/2024] [Accepted: 10/22/2024] [Indexed: 10/28/2024] Open

Abstract

MOTIVATION

Recent advancements in high-throughput sequencing technology have significantly increased the focus on non-coding RNA (ncRNA) research within the life sciences. Despite this, the functions of many ncRNAs remain poorly understood. Research suggests that ncRNAs within the same family typically share similar functions, underlining the importance of understanding their roles. There are two primary methods for predicting ncRNA families: biological and computational. Traditional biological methods are not suitable for large-scale data prediction due to the significant human and resource requirements. Concurrently, most existing computational methods either rely solely on ncRNA sequence data or are exclusively based on the secondary structure of ncRNA molecules. These methods fail to fully utilize the rich multimodal information available from ncRNAs, thereby preventing them from learning more comprehensive and in-depth feature representations.

RESULTS

To tackle these problems, we proposed MM-ncRNAFP, a multi-modal contrastive learning framework for ncRNA family prediction. We first used a pre-trained language model to encode the primary sequences of a large mammalian ncRNA dataset. Then, we adopted a contrastive learning framework with an attention mechanism to fuse the secondary structure information obtained by graph neural networks. The MM-ncRNAFP method can effectively fuse multi-modal information. Experimental comparisons with several competitive baselines demonstrated that MM-ncRNAFP can achieve more comprehensive representations of ncRNA features by integrating both sequence and structural information. This integration significantly enhances the performance of ncRNA family prediction. Ablation experiments and qualitative analyses were performed to verify the effectiveness of each component in our model. Moreover, since our model is pre-trained on a large amount of ncRNA data, it has the potential to bring significant improvements to other ncRNA-related tasks.

AVAILABILITY AND IMPLEMENTATION

MM-ncRNAFP and the datasets are available at https://github.com/xuruiting2/MM-ncRNAFP.

115

Li C, Wang H, Wen Y, Yin R, Zeng X, Li K. GenoM7GNet: An Efficient N⁷-Methylguanosine Site Prediction Approach Based on a Nucleotide Language Model. IEEE/ACM Trans Comput Biol Bioinform 2024;21:2258-2268. [PMID: 39302806 DOI: 10.1109/tcbb.2024.3459870] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/22/2024]

116

Yu Z, Zhang Y. Foundation model for comprehensive transcriptional regulation analysis. Natl Sci Rev 2024;11:nwae355. [PMID: 39555104 PMCID: PMC11565239 DOI: 10.1093/nsr/nwae355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2024] [Revised: 09/22/2024] [Accepted: 10/11/2024] [Indexed: 11/19/2024] Open

117

Yu X, Yani C, Wang Z, Long H, Zeng R, Liu X, Anas B, Ren J. iDNA-ITLM: An interpretable and transferable learning model for identifying DNA methylation. PLoS One 2024;19:e0301791. [PMID: 39480834 PMCID: PMC11527195 DOI: 10.1371/journal.pone.0301791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Accepted: 03/20/2024] [Indexed: 11/02/2024] Open

118

Jyoti, Ritu, Gupta S, Shankar R. Comprehensive analysis of computational approaches in plant transcription factors binding regions discovery. Heliyon 2024;10:e39140. [PMID: 39640721 PMCID: PMC11620080 DOI: 10.1016/j.heliyon.2024.e39140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2024] [Revised: 08/23/2024] [Accepted: 10/08/2024] [Indexed: 12/07/2024] Open

119

Shao B, Yan J. A long-context language model for deciphering and generating bacteriophage genomes. Nat Commun 2024;15:9392. [PMID: 39477977 PMCID: PMC11525655 DOI: 10.1038/s41467-024-53759-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2024] [Accepted: 10/22/2024] [Indexed: 11/02/2024] Open

120

Nahali S, Safari L, Khanteymoori A, Huang J. StructmRNA a BERT based model with dual level and conditional masking for mRNA representation. Sci Rep 2024;14:26043. [PMID: 39472486 PMCID: PMC11522565 DOI: 10.1038/s41598-024-77172-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2024] [Accepted: 10/21/2024] [Indexed: 11/02/2024] Open

121

Kabir A, Bhattarai M, Peterson S, Najman-Licht Y, Rasmussen K, Shehu A, Bishop A, Alexandrov B, Usheva A. DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors. Nucleic Acids Res 2024;52:e91. [PMID: 39271116 PMCID: PMC11514457 DOI: 10.1093/nar/gkae783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2024] [Revised: 08/21/2024] [Accepted: 08/29/2024] [Indexed: 09/15/2024] Open

122

Zhao H, Song G. Antiviral Peptide-Generative Pre-Trained Transformer (AVP-GPT): A Deep Learning-Powered Model for Antiviral Peptide Design with High-Throughput Discovery and Exceptional Potency. Viruses 2024;16:1673. [PMID: 39599788 PMCID: PMC11599114 DOI: 10.3390/v16111673] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2024] [Revised: 10/20/2024] [Accepted: 10/23/2024] [Indexed: 11/29/2024] Open

123

La Fleur A, Shi Y, Seelig G. Decoding biology with massively parallel reporter assays and machine learning. Genes Dev 2024;38:843-865. [PMID: 39362779 PMCID: PMC11535156 DOI: 10.1101/gad.351800.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/05/2024]

124

Chiliński M, Plewczynski D. HiCDiffusion - diffusion-enhanced, transformer-based prediction of chromatin interactions from DNA sequences. BMC Genomics 2024;25:964. [PMID: 39407104 PMCID: PMC11481779 DOI: 10.1186/s12864-024-10885-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2024] [Accepted: 10/09/2024] [Indexed: 10/19/2024] Open

125

Bunne C, Roohani Y, Rosen Y, Gupta A, Zhang X, Roed M, Alexandrov T, AlQuraishi M, Brennan P, Burkhardt DB, Califano A, Cool J, Dernburg AF, Ewing K, Fox EB, Haury M, Herr AE, Horvitz E, Hsu PD, Jain V, Johnson GR, Kalil T, Kelley DR, Kelley SO, Kreshuk A, Mitchison T, Otte S, Shendure J, Sofroniew NJ, Theis F, Theodoris CV, Upadhyayula S, Valer M, Wang B, Xing E, Yeung-Levy S, Zitnik M, Karaletsos T, Regev A, Lundberg E, Leskovec J, Quake SR. How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities. ARXIV 2024:arXiv:2409.11654v2. [PMID: 39398201 PMCID: PMC11468656] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 10/15/2024]

Affiliation(s)

Charlotte Bunne Department of Computer Science, Stanford University, Stanford, CA, USA Genentech, South San Francisco, CA, USA Chan Zuckerberg Initiative, Redwood City, CA, USA School of Computer and Communication Sciences and School of Life Sciences, EPFL, Lausanne, Switzerland
Yusuf Roohani Department of Computer Science, Stanford University, Stanford, CA, USA Chan Zuckerberg Initiative, Redwood City, CA, USA Arc Institute, Palo Alto, CA, USA
Yanay Rosen Department of Computer Science, Stanford University, Stanford, CA, USA Chan Zuckerberg Initiative, Redwood City, CA, USA
Ankit Gupta Chan Zuckerberg Initiative, Redwood City, CA, USA KTH Royal Institute of Technology, Science for Life Laboratory, Department of Protein Science, Stockholm, Sweden
Xikun Zhang Department of Computer Science, Stanford University, Stanford, CA, USA Chan Zuckerberg Initiative, Redwood City, CA, USA Department of Bioengineering, Stanford University, Stanford, CA, USA
Marcel Roed Department of Computer Science, Stanford University, Stanford, CA, USA Chan Zuckerberg Initiative, Redwood City, CA, USA
Theo Alexandrov Department of Pharmacology, University of California, San Diego, CA, USA Department of Bioengineering, University of California, San Diego, CA, USA
Mohammed AlQuraishi Department of Systems Biology, Columbia University, New York, NY, USA
Patricia Brennan Chan Zuckerberg Initiative, Redwood City, CA, USA
Daniel B Burkhardt Cellarity, Somerville, MA, USA
Andrea Califano Department of Systems Biology, Columbia University, New York, NY, USA Vagelos College of Physicians and Surgeons, Columbia University Irving Medical Center, New York, NY, USA Chan Zuckerberg Biohub New York, NY, USA
Jonah Cool Chan Zuckerberg Initiative, Redwood City, CA, USA
Abby F Dernburg Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
Kirsty Ewing Chan Zuckerberg Initiative, Redwood City, CA, USA
Emily B Fox Department of Computer Science, Stanford University, Stanford, CA, USA Department of Statistics, Stanford University, Stanford, CA, USA Chan Zuckerberg Biohub San Francisco, CA, USA
Matthias Haury Chan Zuckerberg Institute for Advanced Biological Imaging, Redwood City, CA, USA
Amy E Herr Chan Zuckerberg Biohub San Francisco, CA, USA Department of Bioengineering, University of California, Berkeley, CA, USA
Eric Horvitz Microsoft Research, Redmond, WA, USA
Patrick D Hsu Arc Institute, Palo Alto, CA, USA Department of Bioengineering, University of California, Berkeley, CA, USA Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
Viren Jain Google Research, Mountain View, CA, USA
Gregory R Johnson NewLimit, San Francisco, CA, USA
Thomas Kalil Schmidt Futures, USA
David R Kelley Calico Life Sciences LLC, San Francisco, CA, USA
Shana O Kelley Chan Zuckerberg Biohub Chicago, IL, USA Northwestern University, Evanston, IL, USA
Anna Kreshuk Cell Biology and Biophysics Unit, European Molecular Biology Laboratory, Heidelberg, Germany
Tim Mitchison Department of Systems Biology, Harvard Medical School, Boston, MA, USA
Stephani Otte Chan Zuckerberg Institute for Advanced Biological Imaging, Redwood City, CA, USA
Jay Shendure Department of Genome Sciences, University of Washington, Seattle, WA, USA Brotman Baty Institute for Precision Medicine, Seattle, WA, USA Seattle Hub for Synthetic Biology, Seattle, WA, USA Howard Hughes Medical Institute, Seattle, WA, USA
Nicholas J Sofroniew EvolutionaryScale, PBC, USA
Fabian Theis Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany School of Computing, Information and Technology, Technical University of Munich, Munich, Germany TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
Christina V Theodoris Gladstone Institute of Cardiovascular Disease, Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA Department of Pediatrics, University of California, San Francisco, CA, USA
Srigokul Upadhyayula Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA Chan Zuckerberg Biohub San Francisco, CA, USA Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Marc Valer Chan Zuckerberg Initiative, Redwood City, CA, USA
Bo Wang Department of Computer Science, University of Toronto, Toronto, Ontario, Canada Vector Institute, Toronto, Ontario, Canada
Eric Xing Carnegie Mellon University, School of Computer Science, Pittsburgh, PA, USA Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
Serena Yeung-Levy Department of Computer Science, Stanford University, Stanford, CA, USA Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
Marinka Zitnik Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA, USA Broad Institute of MIT and Harvard, Cambridge, MA, USA
Theofanis Karaletsos Chan Zuckerberg Initiative, Redwood City, CA, USA
Aviv Regev Genentech, South San Francisco, CA, USA
Emma Lundberg Chan Zuckerberg Initiative, Redwood City, CA, USA KTH Royal Institute of Technology, Science for Life Laboratory, Department of Protein Science, Stockholm, Sweden Department of Bioengineering, Stanford University, Stanford, CA, USA Department of Pathology, Stanford University, Stanford, CA, USA
Jure Leskovec Department of Computer Science, Stanford University, Stanford, CA, USA Chan Zuckerberg Initiative, Redwood City, CA, USA
Stephen R Quake Chan Zuckerberg Initiative, Redwood City, CA, USA Department of Bioengineering, Stanford University, Stanford, CA, USA Department of Applied Physics, Stanford University, Stanford, CA, USA

126

Naghipourfar M, Chen S, Howard MK, Macdonald CB, Saberi A, Hagen T, Mofrad MRK, Coyote-Maestas W, Goodarzi H. A Suite of Foundation Models Captures the Contextual Interplay Between Codons. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.10.10.617568. [PMID: 39416097 PMCID: PMC11482952 DOI: 10.1101/2024.10.10.617568] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 10/19/2024]

127

Lal A, Garfield D, Biancalani T, Eraslan G. Designing realistic regulatory DNA with autoregressive language models. Genome Res 2024;34:1411-1420. [PMID: 39322281 PMCID: PMC11529870 DOI: 10.1101/gr.279142.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2024] [Accepted: 08/19/2024] [Indexed: 09/27/2024]

128

Zhang G, Xie H, Dai X. DeepIndel: An Interpretable Deep Learning Approach for Predicting CRISPR/Cas9-Mediated Editing Outcomes. Int J Mol Sci 2024;25:10928. [PMID: 39456711 PMCID: PMC11507043 DOI: 10.3390/ijms252010928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2024] [Revised: 10/01/2024] [Accepted: 10/08/2024] [Indexed: 10/28/2024] Open

129

Rafi AM, Nogina D, Penzar D, Lee D, Lee D, Kim N, Kim S, Kim D, Shin Y, Kwak IY, Meshcheryakov G, Lando A, Zinkevich A, Kim BC, Lee J, Kang T, Vaishnav ED, Yadollahpour P, Kim S, Albrecht J, Regev A, Gong W, Kulakovskiy IV, Meyer P, de Boer CG. A community effort to optimize sequence-based deep learning models of gene regulation. Nat Biotechnol 2024:10.1038/s41587-024-02414-w. [PMID: 39394483 DOI: 10.1038/s41587-024-02414-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2024] [Accepted: 08/29/2024] [Indexed: 10/13/2024]

Affiliation(s)

Abdul Muntakim Rafi University of British Columbia, Vancouver, British Columbia, Canada
Daria Nogina Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
Dmitry Penzar Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia AIRI, Moscow, Russia Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
Dohoon Lee Seoul National University, Seoul, South Korea
Danyeong Lee Seoul National University, Seoul, South Korea
Nayeon Kim Seoul National University, Seoul, South Korea
Sangyeup Kim Seoul National University, Seoul, South Korea
Dohyeon Kim Seoul National University, Seoul, South Korea
Yeojin Shin Seoul National University, Seoul, South Korea
Il-Youp Kwak Chung-Ang University, Seoul, South Korea
Georgy Meshcheryakov Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
Andrey Lando Yandex, Moscow, Russia
Arsenii Zinkevich Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
Byeong-Chan Kim Chung-Ang University, Seoul, South Korea
Juhyun Lee Chung-Ang University, Seoul, South Korea
Taein Kang Chung-Ang University, Seoul, South Korea
Eeshit Dhaval Vaishnav Broad Institute of MIT and Harvard, Cambridge, MA, USA Sequome, Inc., South San Francisco, CA, USA
Payman Yadollahpour Broad Institute of MIT and Harvard, Cambridge, MA, USA
Sun Kim Seoul National University, Seoul, South Korea
Jake Albrecht Sage Bionetworks, Seattle, WA, USA
Aviv Regev Broad Institute of MIT and Harvard, Cambridge, MA, USA Genentech, San Francisco, CA, USA
Wuming Gong University of Minnesota, Minneapolis, MN, USA
Ivan V Kulakovskiy Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
Pablo Meyer Health Care and Life Sciences, IBM Research, New York, NY, USA
Carl G de Boer University of British Columbia, Vancouver, British Columbia, Canada

130

Wang J. Deep Learning in Hematology: From Molecules to Patients. Clin Hematol Int 2024;6:19-42. [PMID: 39417017 PMCID: PMC11477942 DOI: 10.46989/001c.124131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2024] [Accepted: 06/29/2024] [Indexed: 10/19/2024] Open

131

Crombie TA, Rajaei M, Saxena AS, Johnson LM, Saber S, Tanny RE, Ponciano JM, Andersen EC, Zhou J, Baer CF. Direct inference of the distribution of fitness effects of spontaneous mutations from recombinant inbred Caenorhabditis elegans mutation accumulation lines. Genetics 2024;228:iyae136. [PMID: 39139098 PMCID: PMC12098947 DOI: 10.1093/genetics/iyae136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2024] [Revised: 07/30/2024] [Accepted: 08/02/2024] [Indexed: 08/15/2024] Open

132

Yang Y, Li G, Pang K, Cao W, Zhang Z, Li X. Deciphering 3'UTR Mediated Gene Regulation Using Interpretable Deep Representation Learning. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024;11:e2407013. [PMID: 39159140 PMCID: PMC11497048 DOI: 10.1002/advs.202407013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/24/2024] [Revised: 07/23/2024] [Indexed: 08/21/2024]

133

Lam HYI, Ong XE, Mutwil M. Large language models in plant biology. TRENDS IN PLANT SCIENCE 2024;29:1145-1155. [PMID: 38797656 DOI: 10.1016/j.tplants.2024.04.013] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/30/2024] [Revised: 04/29/2024] [Accepted: 04/30/2024] [Indexed: 05/29/2024]

134

Zhang Y, Mao M, Zhang R, Liao YT, Wu VCH. DeepPL: A deep-learning-based tool for the prediction of bacteriophage lifecycle. PLoS Comput Biol 2024;20:e1012525. [PMID: 39418300 PMCID: PMC11521287 DOI: 10.1371/journal.pcbi.1012525] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2024] [Revised: 10/29/2024] [Accepted: 09/30/2024] [Indexed: 10/19/2024] Open

135

Hou A, Luo H, Liu H, Luo L, Ding P. Multi-scale DNA language model improves 6 mA binding sites prediction. Comput Biol Chem 2024;112:108129. [PMID: 39067351 DOI: 10.1016/j.compbiolchem.2024.108129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2024] [Revised: 06/05/2024] [Accepted: 06/10/2024] [Indexed: 07/30/2024]

136

Kumar Halder A, Agarwal A, Jodkowska K, Plewczynski D. A systematic analyses of different bioinformatics pipelines for genomic data and its impact on deep learning models for chromatin loop prediction. Brief Funct Genomics 2024;23:538-548. [PMID: 38555493 DOI: 10.1093/bfgp/elae009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2023] [Revised: 02/07/2024] [Accepted: 03/04/2024] [Indexed: 04/02/2024] Open

137

İhtiyar MN, Özgür A. Generative language models on nucleotide sequences of human genes. Sci Rep 2024;14:22204. [PMID: 39333252 PMCID: PMC11437190 DOI: 10.1038/s41598-024-72512-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2023] [Accepted: 09/09/2024] [Indexed: 09/29/2024] Open

Abstract

Language models, especially transformer-based ones, have achieved colossal success in natural language processing. To be precise, studies like BERT for natural language understanding and works like GPT-3 for natural language generation are very important. If we consider DNA sequences as a text written with an alphabet of four letters representing the nucleotides, they are similar in structure to natural languages. This similarity has led to the development of discriminative language models such as DNABERT in the field of DNA-related bioinformatics. To our knowledge, however, the generative side of the coin is still largely unexplored. Therefore, we have focused on the development of an autoregressive generative language model such as GPT-3 for DNA sequences. Since working with whole DNA sequences is challenging without extensive computational resources, we decided to conduct our study on a smaller scale and focus on nucleotide sequences of human genes, i.e. unique parts of DNA with specific functions, rather than the whole DNA. This decision has not significantly changed the structure of the problem, as both DNA and genes can be considered as 1D sequences consisting of four different nucleotides without losing much information and without oversimplification. First of all, we systematically studied an almost entirely unexplored problem and observed that recurrent neural networks (RNNs) perform best, while simple techniques such as N-grams are also promising. Another beneficial point was learning how to work with generative models on languages we do not understand, unlike natural languages. The importance of using real-world tasks beyond classical metrics such as perplexity was noted. In addition, we examined whether the data-hungry nature of these models can be altered by selecting a language with minimal vocabulary size, four due to four different types of nucleotides. The reason for reviewing this was that choosing such a language might make the problem easier. However, in this study, we found that this did not change the amount of data required very much.

138

AlSaad R, Abd-Alrazaq A, Boughorbel S, Ahmed A, Renault MA, Damseh R, Sheikh J. Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook. J Med Internet Res 2024;26:e59505. [PMID: 39321458 PMCID: PMC11464944 DOI: 10.2196/59505] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2024] [Revised: 08/07/2024] [Accepted: 08/20/2024] [Indexed: 09/27/2024] Open

139

Yang K, Islas N, Jewell S, Jha A, Radens CM, Pleiss JA, Lynch KW, Barash Y, Choi PS. Machine learning-optimized targeted detection of alternative splicing. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.09.20.614162. [PMID: 39386495 PMCID: PMC11463589 DOI: 10.1101/2024.09.20.614162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 10/12/2024]

140

Fu L, Shi J, Huang B. Binning Metagenomic Contigs Using Contig Embedding and Decomposed Tetranucleotide Frequency. BIOLOGY 2024;13:755. [PMID: 39452065 PMCID: PMC11505167 DOI: 10.3390/biology13100755] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/28/2024] [Revised: 09/15/2024] [Accepted: 09/23/2024] [Indexed: 10/26/2024]

141

Todhunter ME, Jubair S, Verma R, Saqe R, Shen K, Duffy B. Artificial intelligence and machine learning applications for cultured meat. Front Artif Intell 2024;7:1424012. [PMID: 39381621 PMCID: PMC11460582 DOI: 10.3389/frai.2024.1424012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2024] [Accepted: 08/21/2024] [Indexed: 10/10/2024] Open

142

Phan H, Brouard C, Mourad R. Semi-supervised learning with pseudo-labeling compares favorably with large language models for regulatory sequence prediction. Brief Bioinform 2024;25:bbae560. [PMID: 39489607 PMCID: PMC11531863 DOI: 10.1093/bib/bbae560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2024] [Revised: 09/13/2024] [Accepted: 10/17/2024] [Indexed: 11/05/2024] Open

143

Li Q, Hu Z, Wang Y, Li L, Fan Y, King I, Jia G, Wang S, Song L, Li Y. Progress and opportunities of foundation models in bioinformatics. Brief Bioinform 2024;25:bbae548. [PMID: 39461902 PMCID: PMC11512649 DOI: 10.1093/bib/bbae548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2024] [Revised: 08/20/2024] [Accepted: 10/12/2024] [Indexed: 10/29/2024] Open

144

Xu J, Gao Y, Lu Q, Zhang R, Gui J, Liu X, Yue Z. RiceSNP-BST: a deep learning framework for predicting biotic stress-associated SNPs in rice. Brief Bioinform 2024;25:bbae599. [PMID: 39562160 PMCID: PMC11576077 DOI: 10.1093/bib/bbae599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2024] [Revised: 10/07/2024] [Accepted: 11/04/2024] [Indexed: 11/21/2024] Open

145

Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic Language Models: Opportunities and Challenges. ARXIV 2024:arXiv:2407.11435v2. [PMID: 39070037 PMCID: PMC11275703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 07/30/2024]

146

Chao KH, Mao A, Salzberg SL, Pertea M. Splam: a deep-learning-based splice site predictor that improves spliced alignments. Genome Biol 2024;25:243. [PMID: 39285451 PMCID: PMC11406845 DOI: 10.1186/s13059-024-03379-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2023] [Accepted: 08/28/2024] [Indexed: 09/19/2024] Open

147

Sanabria M, Hirsch J, Poetsch AR. Distinguishing word identity and sequence context in DNA language models. BMC Bioinformatics 2024;25:301. [PMID: 39272021 PMCID: PMC11395559 DOI: 10.1186/s12859-024-05869-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2023] [Accepted: 07/12/2024] [Indexed: 09/15/2024] Open

148

Bhattacharya M, Pal S, Chatterjee S, Lee SS, Chakraborty C. Large language model to multimodal large language model: A journey to shape the biological macromolecules to biological sciences and medicine. MOLECULAR THERAPY. NUCLEIC ACIDS 2024;35:102255. [PMID: 39377065 PMCID: PMC11456558 DOI: 10.1016/j.omtn.2024.102255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 10/09/2024]

149

Sereshki S, Lonardi S. Predicting Differentially Methylated Cytosines in TET and DNMT3 Knockout Mutants via a Large Language Model. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.02.592257. [PMID: 39282350 PMCID: PMC11398415 DOI: 10.1101/2024.05.02.592257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 09/22/2024]

150

Yu CQ, Wang XF, Li LP, You ZH, Ren ZH, Chu P, Guo F, Wang ZY. RBNE-CMI: An Efficient Method for Predicting circRNA-miRNA Interactions via Multiattribute Incomplete Heterogeneous Network Embedding. J Chem Inf Model 2024. [PMID: 39231016 DOI: 10.1021/acs.jcim.4c01118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/06/2024]