Reference Citation Analysis

For: Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Wei CH, Comeau DC, Antunes R, Matos S, Chen Q, Elangovan A, Panyam NC, Verspoor K, Liu H, Wang Y, Liu Z, Altinel B, Hüsünbeyi ZM, Özgür A, Fergadis A, Wang CK, Dai HJ, Tran T, Kavuluru R, Luo L, Steppi A, Zhang J, Qu J, Lu Z. Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine. Database (Oxford) 2019;2019:5303240. [PMID: 30689846 PMCID: PMC6348314 DOI: 10.1093/database/bay147] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 03/02/2018] [Accepted: 12/19/2018] [Indexed: 12/16/2022]

Cited by Other Article(s)

Chen Q, Hu Y, Peng X, Xie Q, Jin Q, Gilson A, Singer MB, Ai X, Lai PT, Wang Z, Keloth VK, Raja K, Huang J, He H, Lin F, Du J, Zhang R, Zheng WJ, Adelman RA, Lu Z, Xu H. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nat Commun 2025;16:3280. [PMID: 40188094 PMCID: PMC11972378 DOI: 10.1038/s41467-025-56989-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Accepted: 02/07/2025] [Indexed: 04/07/2025] Open

Affiliation(s)

Qingyu Chen Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA; National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Yan Hu McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
Xueqing Peng Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Qianqian Xie Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Qiao Jin National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Aidan Gilson Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Maxwell B Singer Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Xuguang Ai Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Po-Ting Lai National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Zhizheng Wang National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Vipina K Keloth Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Kalpana Raja Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Jimin Huang Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Huan He Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Fongci Lin Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Jingcheng Du McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
Rui Zhang Division of Computational Health Sciences, Department of Surgery, Medical School, University of Minnesota, Minneapolis, MN, USA; Center for Learning Health System Sciences, University of Minnesota, Minneapolis, MN, 55455, USA
W Jim Zheng McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
Ron A Adelman Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Zhiyong Lu National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Hua Xu Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA

Ramos MC, Collison CJ, White AD. A review of large language models and autonomous agents in chemistry. Chem Sci 2025;16:2514-2572. [PMID: 39829984 PMCID: PMC11739813 DOI: 10.1039/d4sc03921a] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/15/2024] [Accepted: 12/03/2024] [Indexed: 01/22/2025] Open

Savosina P, Druzhilovskiy D, Filimonov D, Poroikov V. WWAD: the most comprehensive small molecule World Wide Approved Drug database of therapeutics. Front Pharmacol 2024;15:1473279. [PMID: 39359251 PMCID: PMC11444997 DOI: 10.3389/fphar.2024.1473279] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Accepted: 08/28/2024] [Indexed: 10/04/2024] Open

Islamaj R, Wei CH, Lai PT, Luo L, Coss C, Gokal Kochar P, Miliaras N, Rodionov O, Sekiya K, Trinh D, Whitman D, Lu Z. The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop. Database (Oxford) 2024;2024:baae071. [PMID: 39126204 PMCID: PMC11315767 DOI: 10.1093/database/baae071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 06/03/2024] [Accepted: 07/09/2024] [Indexed: 08/12/2024]

Abstract

The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing the participants the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, gene/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them which are disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most of the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies, namely, diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to Single Nucleotide Polymorphism Database. Finally, each annotated relationship is categorized as 'novel' depending on whether it is a novel finding or experimental verification in the publication it is expressed in. This distinction helps differentiate novel findings from other relationships in the same text that provides known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of newly published 400 articles to serve as the test data for the challenge. All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.

Islamaj R, Lai PT, Wei CH, Luo L, Almeida T, Jonker RAA, Conceição SIR, Sousa DF, Phan CP, Chiang JH, Li J, Pan D, Meesawad W, Tsai RTH, Sarol MJ, Hong G, Valiev A, Tutubalina E, Lee SM, Hsu YY, Li M, Verspoor K, Lu Z. The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII. Database (Oxford) 2024;2024:baae069. [PMID: 39114977 PMCID: PMC11306928 DOI: 10.1093/database/baae069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 05/27/2024] [Accepted: 07/09/2024] [Indexed: 08/11/2024]

Abstract

The BioRED track at BioCreative VIII calls for a community effort to identify, semantically categorize, and highlight the novelty factor of the relationships between biomedical entities in unstructured text. Relation extraction is crucial for many biomedical natural language processing (NLP) applications, from drug discovery to custom medical solutions. The BioRED track simulates a real-world application of biomedical relationship extraction, and as such, considers multiple biomedical entity types, normalized to their specific corresponding database identifiers, as well as defines relationships between them in the documents. The challenge consisted of two subtasks: (i) in Subtask 1, participants were given the article text and human expert annotated entities, and were asked to extract the relation pairs, identify their semantic type and the novelty factor, and (ii) in Subtask 2, participants were given only the article text, and were asked to build an end-to-end system that could identify and categorize the relationships and their novelty. We received a total of 94 submissions from 14 teams worldwide. The highest F-score performances achieved for the Subtask 1 were: 77.17% for relation pair identification, 58.95% for relation type identification, 59.22% for novelty identification, and 44.55% when evaluating all of the above aspects of the comprehensive relation extraction. The highest F-score performances achieved for the Subtask 2 were: 55.84% for relation pair, 43.03% for relation type, 42.74% for novelty, and 32.75% for comprehensive relation extraction. The entire BioRED track dataset and other challenge materials are available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/ and https://codalab.lisn.upsaclay.fr/competitions/13377 and https://codalab.lisn.upsaclay.fr/competitions/13378. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/, https://codalab.lisn.upsaclay.fr/competitions/13377, and https://codalab.lisn.upsaclay.fr/competitions/13378.

Affiliation(s)

Rezarta Islamaj National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
Po-Ting Lai National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
Chih-Hsuan Wei National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
Ling Luo School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
Tiago Almeida Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
Richard A. A Jonker Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
Sofia I. R Conceição Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Edifício C6 Campo Grande, Lisbon 1749-016, Portugal
Diana F Sousa Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Edifício C6 Campo Grande, Lisbon 1749-016, Portugal
Cong-Phuoc Phan Department of Computer Science and Information Engineering, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan, Republic of China
Jung-Hsien Chiang Department of Computer Science and Information Engineering, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan, Republic of China
Jiru Li School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
Dinghao Pan School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
Wilailack Meesawad Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Rd., Zhongli District, Taoyuan City 32001, Taiwan, Republic of China
Richard Tzong-Han Tsai Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Rd., Zhongli District, Taoyuan City 32001, Taiwan, Republic of China; Research Center for Humanities and Social Sciences, Academia Sinica, No. 128, Section 2, Academia Rd., Nangang District, Taoyuan City 115201, Taiwan, Republic of China
M. Janina Sarol School of Information Sciences, University of Illinois at Urbana-Champaign, 614 E. Daniel St, Champaign, IL 61820, United States
Gibong Hong School of Information Sciences, University of Illinois at Urbana-Champaign, 614 E. Daniel St, Champaign, IL 61820, United States
Airat Valiev Higher School of Economics University, 20 Myasnitskaya St, Moscow 101000, Russia
Elena Tutubalina Artificial Intelligence Research Institute (AIRI), 32 Kutuzovskiy St, Moscow 121170, Russia; Kazan Federal University, 18 Kremlevskaya St, Kazan 420008, Russia
Shao-Man Lee Miin Wu School of Computing, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan, Republic of China
Yi-Yu Hsu Miin Wu School of Computing, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan, Republic of China
Mingjie Li School of Computing Technologies, RMIT University, 124 La Trobe Street, Melbourne, Victoria 3000, Australia
Karin Verspoor School of Computing Technologies, RMIT University, 124 La Trobe Street, Melbourne, Victoria 3000, Australia
Zhiyong Lu National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States

Lai PT, Wei CH, Luo L, Chen Q, Lu Z. BioREx: Improving biomedical relation extraction by leveraging heterogeneous datasets. J Biomed Inform 2023;146:104487. [PMID: 37673376 DOI: 10.1016/j.jbi.2023.104487] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/07/2023] [Revised: 08/18/2023] [Accepted: 09/02/2023] [Indexed: 09/08/2023]

Abstract

Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.

Bachman JA, Gyori BM, Sorger PK. Automated assembly of molecular mechanisms at scale from text mining and curated databases. Mol Syst Biol 2023;19:e11325. [PMID: 36938926 PMCID: PMC10167483 DOI: 10.15252/msb.202211325] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 08/31/2022] [Revised: 02/24/2023] [Accepted: 02/27/2023] [Indexed: 03/21/2023] Open

Zhao W, Zhang J, Yang J, Jiang X, He T. Document-Level Chemical-Induced Disease Relation Extraction via Hierarchical Representation Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022;19:2782-2793. [PMID: 34077368 DOI: 10.1109/tcbb.2021.3086090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]

Chen Q, Du J, Allot A, Lu Z. LitMC-BERT: Transformer-Based Multi-Label Classification of Biomedical Literature With An Application on COVID-19 Literature Curation. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022;19:2584-2595. [PMID: 35536809 PMCID: PMC9647722 DOI: 10.1109/tcbb.2022.3173562] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/11/2021] [Revised: 04/19/2022] [Accepted: 04/22/2022] [Indexed: 05/20/2023]

Chen Q, Allot A, Leaman R, Islamaj R, Du J, Fang L, Wang K, Xu S, Zhang Y, Bagherzadeh P, Bergler S, Bhatnagar A, Bhavsar N, Chang YC, Lin SJ, Tang W, Zhang H, Tavchioski I, Pollak S, Tian S, Zhang J, Otmakhova Y, Yepes AJ, Dong H, Wu H, Dufour R, Labrak Y, Chatterjee N, Tandon K, Laleye FAA, Rakotoson L, Chersoni E, Gu J, Friedrich A, Pujari SC, Chizhikova M, Sivadasan N, Vg S, Lu Z. Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations. Database (Oxford) 2022;2022:baac069. [PMID: 36043400 PMCID: PMC9428574 DOI: 10.1093/database/baac069] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 04/20/2022] [Revised: 08/02/2022] [Accepted: 08/13/2022] [Indexed: 05/03/2023]

Abstract

The coronavirus disease 2019 (COVID-19) pandemic has been severely impacting global society since December 2019. The related findings such as vaccine and drug development have been reported in biomedical literature, at a rate of about 10 000 articles on COVID-19 per month. Such rapid growth significantly challenges manual curation and interpretation. For instance, LitCovid is a literature database of COVID-19-related articles in PubMed, which has accumulated more than 200 000 articles with millions of accesses each month by users worldwide. One primary curation task is to assign up to eight topics (e.g. Diagnosis and Treatment) to the articles in LitCovid. The annotated topics have been widely used for navigating the COVID literature, rapidly locating articles of interest and other downstream studies. However, annotating the topics has been the bottleneck of manual curation. Despite the continuing advances in biomedical text-mining methods, few have been dedicated to topic annotations in COVID-19 literature. To close the gap, we organized the BioCreative LitCovid track to call for a community effort to tackle automated topic annotation for COVID-19 literature. The BioCreative LitCovid dataset, consisting of over 30 000 articles with manually reviewed topics, was created for training and testing. It is one of the largest multi-label classification datasets in biomedical scientific literature. Nineteen teams worldwide participated and made 80 submissions in total. Most teams used hybrid systems based on transformers. The highest performing submissions achieved 0.8875, 0.9181 and 0.9394 for macro-F1-score, micro-F1-score and instance-based F1-score, respectively. Notably, these scores are substantially higher (e.g. 12% higher for macro F1-score) than the corresponding scores of the state-of-the-art multi-label classification method. The level of participation and results demonstrate a successful track and help close the gap between dataset curation and method development. The dataset is publicly available via https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/ for benchmarking and further development. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/.

Affiliation(s)

Qingyu Chen National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, Bethesda 20892, USA
Alexis Allot National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, Bethesda 20892, USA
Robert Leaman National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, Bethesda 20892, USA
Rezarta Islamaj National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, Bethesda 20892, USA
Jingcheng Du School of Biomedical Informatics, UT Health, TX, Houston 77030, USA
Li Fang Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
Kai Wang Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, USA; Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
Shuo Xu College of Economics and Management, Beijing University of Technology, Beijing, QC, China
Yuefu Zhang College of Economics and Management, Beijing University of Technology, Beijing, QC, China
Parsa Bagherzadeh CLaC Labs, Concordia University, Montreal, Canada
Sabine Bergler CLaC Labs, Concordia University, Montreal, Canada
Aakash Bhatnagar Navrachana University, Vadodara, India
Nidhir Bhavsar Navrachana University, Vadodara, India
Yung-Chun Chang Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
Sheng-Jie Lin Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
Wentai Tang College of Computer Science and Technology, Dalian University of Technology, Dalian, China
Hongtong Zhang College of Computer Science and Technology, Dalian University of Technology, Dalian, China
Ilija Tavchioski Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia; Jožef Stefan Institute, Ljubljana, Slovenia
Senja Pollak Jožef Stefan Institute, Ljubljana, Slovenia
Shubo Tian Department of Statistics, Florida State University, Tallahassee, FL, USA
Jinfeng Zhang Department of Statistics, Florida State University, Tallahassee, FL, USA
Yulia Otmakhova School of Computing and Information Systems, University of Melbourne, Melbourne, AU-VIC, Australia
Antonio Jimeno Yepes School of Computing Technologies, RMIT University, Melbourne, AU-VIC, Australia
Hang Dong Centre for Medical Informatics, Usher Institute, University of Edinburgh, Edinburgh, UK
Honghan Wu Institute of Health Informatics, University College London, London, UK
Richard Dufour LS2N, Nantes University, Nantes, France
Yanis Labrak LIA, Avignon University, Avignon, France
Niladri Chatterjee Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
Kushagri Tandon Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
Fréjus A A Laleye Opscidia, Paris, France
Loïc Rakotoson Opscidia, Paris, France
Emmanuele Chersoni Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
Jinghang Gu Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
Annemarie Friedrich Bosch Center for Artificial Intelligence, Renningen, Germany
Subhash Chandra Pujari Institute of Computer Science, Heidelberg University, Heidelberg, Germany; Bosch Center for Artificial Intelligence, Renningen, Germany
Mariia Chizhikova SINAI Group, Department of Computer Science, Advanced Studies Center in ICT (CEATIC), Universidad de Jaén, Jaén, Spain
Naveen Sivadasan TCS Research, Life Sciences, Hyderabad, India
Saipradeep Vg TCS Research, Life Sciences, Hyderabad, India
Zhiyong Lu National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, Bethesda 20892, USA

Sharma VS, Fossati A, Ciuffa R, Buljan M, Williams EG, Chen Z, Shao W, Pedrioli PGA, Purcell AW, Martínez MR, Song J, Manica M, Aebersold R, Li C. PCfun: a hybrid computational framework for systematic characterization of protein complex function. Brief Bioinform 2022;23:6611913. [PMID: 35724564 PMCID: PMC9310514 DOI: 10.1093/bib/bbac239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2022] [Revised: 05/05/2022] [Accepted: 05/21/2022] [Indexed: 11/14/2022] Open

Affiliation(s)

Varun S Sharma Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland; CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna, Austria
Andrea Fossati Quantitative Biosciences Institute (QBI) and Department of Cellular and Molecular Pharmacology, University of California, San Francisco, CA 94158, USA; J. David Gladstone Institutes, San Francisco, CA 94158, USA
Rodolfo Ciuffa Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland
Marija Buljan Empa - Swiss Federal Laboratories for Materials Science and Technology, St. Gallen, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
Evan G Williams Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
Zhen Chen Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
Wenguang Shao Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland
Patrick G A Pedrioli Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland
Anthony W Purcell Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
María Rodríguez Martínez IBM Research Europe, Zurich, Switzerland
Jiangning Song Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
Matteo Manica IBM Research Europe, Zurich, Switzerland
Ruedi Aebersold Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland; Faculty of Science, University of Zurich, Switzerland
Chen Li Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland; Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia

Liu X, Tan K, Dong S. Multi-granularity sequential neural network for document-level biomedical relation extraction. Inf Process Manag 2021. [DOI: 10.1016/j.ipm.2021.102718] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]

Islamaj R, Wei CH, Cissel D, Miliaras N, Printseva O, Rodionov O, Sekiya K, Ward J, Lu Z. NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition. J Biomed Inform 2021;118:103779. [PMID: 33839304 PMCID: PMC11037554 DOI: 10.1016/j.jbi.2021.103779] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 01/04/2021] [Revised: 03/14/2021] [Accepted: 04/05/2021] [Indexed: 10/21/2022]

Islamaj R, Leaman R, Kim S, Kwon D, Wei CH, Comeau DC, Peng Y, Cissel D, Coss C, Fisher C, Guzman R, Kochar PG, Koppel S, Trinh D, Sekiya K, Ward J, Whitman D, Schmidt S, Lu Z. NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature. Sci Data 2021;8:91. [PMID: 33767203 PMCID: PMC7994842 DOI: 10.1038/s41597-021-00875-1] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 11/26/2020] [Accepted: 01/19/2021] [Indexed: 11/13/2022] Open

Affiliation(s)

Rezarta Islamaj National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Robert Leaman National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Sun Kim National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Dongseop Kwon National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Chih-Hsuan Wei National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Donald C Comeau National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Yifan Peng National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
David Cissel National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Cathleen Coss National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Carol Fisher National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Rob Guzman National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Preeti Gokal Kochar National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Stella Koppel National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Dorothy Trinh National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Keiko Sekiya National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Janice Ward National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Deborah Whitman National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Susan Schmidt National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Zhiyong Lu National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA

Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network. Sci Rep 2021;11:1696. [PMID: 33462256 PMCID: PMC7813825 DOI: 10.1038/s41598-020-80441-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2020] [Accepted: 12/17/2020] [Indexed: 11/17/2022] Open

Oughtred R, Rust J, Chang C, Breitkreutz B, Stark C, Willems A, Boucher L, Leung G, Kolas N, Zhang F, Dolma S, Coulombe‐Huntington J, Chatr‐aryamontri A, Dolinski K, Tyers M. The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci 2021;30:187-200. [PMID: 33070389 PMCID: PMC7737760 DOI: 10.1002/pro.3978] [Citation(s) in RCA: 914] [Impact Index Per Article: 228.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2020] [Revised: 10/09/2020] [Accepted: 10/13/2020] [Indexed: 02/06/2023]

Maier-Hein L, Reinke A, Kozubek M, Martel AL, Arbel T, Eisenmann M, Hanbury A, Jannin P, Müller H, Onogur S, Saez-Rodriguez J, van Ginneken B, Kopp-Schneider A, Landman BA. BIAS: Transparent reporting of biomedical image analysis challenges. Med Image Anal 2020;66:101796. [PMID: 32911207 PMCID: PMC7441980 DOI: 10.1016/j.media.2020.101796] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2019] [Revised: 06/12/2020] [Accepted: 07/27/2020] [Indexed: 12/12/2022]

Affiliation(s)

Lena Maier-Hein Division of Computer Assisted Medical Interventions (CAMI), German Cancer Research Center (DKFZ), Im Neuenheimer Feld 223, Heidelberg 69120, Germany
Annika Reinke Division of Computer Assisted Medical Interventions (CAMI), German Cancer Research Center (DKFZ), Im Neuenheimer Feld 223, Heidelberg 69120, Germany
Michal Kozubek Centre for Biomedical Image Analysis, Masaryk University, Botanická 68a, Brno 60200, Czech Republic
Anne L Martel Physical Sciences, Sunnybrook Research Institute, 2075 Bayview Avenue, Rm M6-609, Toronto ON M4N 3M5, Canada; Department Medical Biophysics, University of Toronto, 101 College St Suite 15-701, Toronto, ON M5G 1L7, Canada
Tal Arbel Centre for Intelligent Machines, McGill University, 3480 University Street, McConnell Engineering Building, Room 425, Montreal QC H3A 0E9, Canada
Matthias Eisenmann Division of Computer Assisted Medical Interventions (CAMI), German Cancer Research Center (DKFZ), Im Neuenheimer Feld 223, Heidelberg 69120, Germany
Allan Hanbury Institute of Information Systems Engineering, Technische Universität (TU) Wien, Favoritenstraße 9-11/194-04, Vienna 1040, Austria; Complexity Science Hub Vienna, Josefstädter Straße 39, Vienna 1080, Austria
Pierre Jannin Laboratoire Traitement du Signal et de l'Image (LTSI) - UMR_S 1099, Université de Rennes 1, Inserm, Rennes, Cedex 35043, France
Henning Müller University of Applied Sciences Western Switzerland (HES-SO), Rue du Technopole 3, Sierre 3960, Switzerland; Medical Faculty, University of Geneva, Rue Gabrielle-Perret-Gentil 4, Geneva 1211, Switzerland
Sinan Onogur Division of Computer Assisted Medical Interventions (CAMI), German Cancer Research Center (DKFZ), Im Neuenheimer Feld 223, Heidelberg 69120, Germany
Julio Saez-Rodriguez Institute of Computational Biomedicine, Heidelberg University, Faculty of Medicine, Im Neuenheimer Feld 267, Heidelberg 69120, Germany; Heidelberg University Hospital, Im Neuenheimer Feld 267, Heidelberg 69120, Germany; Joint Research Centre for Computational Biomedicine, Rheinisch-Westfälische Technische Hochschule (RWTH) Aachen, Faculty of Medicine, Aachen 52074, Germany
Bram van Ginneken Department of Radiology and Nuclear Medicine, Medical Image Analysis, Radboud University Center, Nijmegen 6525 GA, The Netherlands
Annette Kopp-Schneider Division of Biostatistics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 581, Heidelberg, 69120, Germany
Bennett A Landman Electrical Engineering, Vanderbilt University, Nashville, Tennessee TN 37235-1679, USA

Qu J, Steppi A, Zhong D, Hao J, Wang J, Lung PY, Zhao T, He Z, Zhang J. Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach. BMC Genomics 2020;21:773. [PMID: 33167858 PMCID: PMC7654050 DOI: 10.1186/s12864-020-07185-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2020] [Accepted: 10/26/2020] [Indexed: 11/17/2022] Open

Wang J, Li M, Diao Q, Lin H, Yang Z, Zhang Y. Biomedical document triage using a hierarchical attention-based capsule network. BMC Bioinformatics 2020;21:380. [PMID: 32938366 PMCID: PMC7495737 DOI: 10.1186/s12859-020-03673-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Chen Q, Du J, Kim S, Wilbur WJ, Lu Z. Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records. BMC Med Inform Decis Mak 2020;20:73. [PMID: 32349758 PMCID: PMC7191680 DOI: 10.1186/s12911-020-1044-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open

Abstract

Background

Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts on the development of related datasets and models in the general domain, both datasets and models are limited in biomedical and clinical domains. The BioCreative/OHNLP2018 organizers have made the first attempt to annotate 1068 sentence pairs from clinical notes and have called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge.

Methods

We developed models using traditional machine learning and deep learning approaches. In the post-challenge phase, we focused on two models: the Random Forest and the Encoder Network. We applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network accordingly.

Results

The official results showed that our best submission was the ensemble of eight models. It achieved a Pearson correlation coefficient of 0.8328, the highest performance among 13 submissions from 4 teams. In the post-challenge phase, the performance of both the Random Forest and the Encoder Network improved; in particular, the correlation of the Encoder Network improved by ~13%. During the challenge task, no end-to-end deep learning model performed better than machine learning models that take manually crafted features. In contrast, with the sentence embeddings pre-trained on biomedical corpora, the Encoder Network now achieves a correlation of ~0.84, which is higher than the original best model. The ensembled model taking the improved versions of the Random Forest and Encoder Network as inputs further increased performance to 0.8528.

Conclusions

Deep learning models with sentence embeddings pre-trained on biomedical corpora achieve the highest performance on the test set. Through error analysis, we find that end-to-end deep learning models and traditional machine learning models with manually-crafted features complement each other by finding different types of sentences. We suggest a combination of these models can better find similar sentences in practice.
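
The evaluation metric named in the abstract above, the Pearson correlation coefficient between predicted and gold similarity scores, can be sketched as follows. This is only an illustrative sketch: the scores are made-up toy values, and combining the two models by simple averaging is an assumption made here for illustration, since the abstract does not say how the ensemble combines its inputs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ensemble(*model_scores):
    """Average the per-sentence-pair predictions of several models.

    Averaging is an assumption for this sketch, not the method in the paper.
    """
    return [sum(column) / len(column) for column in zip(*model_scores)]


# Hypothetical toy data: gold similarity scores and two models' predictions.
gold = [0.0, 1.5, 3.0, 4.5, 5.0]
random_forest_scores = [0.5, 1.0, 3.5, 4.0, 4.5]
encoder_scores = [0.2, 2.0, 2.5, 4.8, 5.0]

ensemble_scores = average_ensemble(random_forest_scores, encoder_scores)
for name, scores in [
    ("random forest", random_forest_scores),
    ("encoder", encoder_scores),
    ("ensemble", ensemble_scores),
]:
    print(f"{name}: r = {pearson(gold, scores):.4f}")
```

A higher coefficient means the predicted ranking and spacing of sentence-pair similarities track the human annotations more closely.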

Chen Q, Lee K, Yan S, Kim S, Wei CH, Lu Z. BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale. PLoS Comput Biol 2020;16:e1007617. [PMID: 32324731 PMCID: PMC7237030 DOI: 10.1371/journal.pcbi.1007617] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2019] [Revised: 05/19/2020] [Accepted: 12/19/2019] [Indexed: 12/14/2022] Open

Abstract

A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. The capturing of the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings, which involve the learning of vector representations of concepts using machine learning models, have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is one of the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we respectively performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which is, to the best of our knowledge, by far the most comprehensive evaluation. The intrinsic evaluation results demonstrate that BioConceptVec consistently has, by a large margin, better performance than existing concept embeddings in identifying similar and related concepts. More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec.

Capturing the semantics of related biological concepts, such as genes and mutations, is of significant importance to many research tasks in computational biology such as protein-protein interaction detection, gene-drug association prediction, and biomedical literature-based discovery. Here, we propose to leverage state-of-the-art text mining tools and machine learning models to learn the semantics via vector representations (a.k.a. embeddings) of over 400,000 biological concepts mentioned in the entire set of PubMed abstracts. Our learned embeddings, namely BioConceptVec, can capture related concepts based on their surrounding contextual information in the literature, which goes beyond exact term match or co-occurrence-based methods. BioConceptVec has been thoroughly evaluated in multiple bioinformatics tasks consisting of over 25 million instances from nine different biological datasets. The evaluation results demonstrate that BioConceptVec has better performance than existing methods in all tasks. Finally, BioConceptVec is made freely available to the research community and general public.
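
As a rough illustration of how vector representations can surface related concepts, the sketch below ranks concepts by cosine similarity to a query concept. The concept names and three-dimensional vectors are hypothetical toy values, not taken from BioConceptVec, and cosine similarity is assumed here as a common choice of relatedness measure rather than the one the authors necessarily used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_related(query, embeddings, top_k=2):
    """Return the top_k concepts closest to `query`, excluding the query itself."""
    query_vec = embeddings[query]
    scored = [
        (name, cosine(query_vec, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real concept embeddings have far more dimensions.
toy_embeddings = {
    "Gene_A": [0.9, 0.1, 0.0],
    "Gene_B": [0.8, 0.2, 0.1],
    "Drug_X": [0.0, 0.1, 0.9],
}

for name, score in most_related("Gene_A", toy_embeddings):
    print(f"{name}: {score:.3f}")
```

Because the ranking depends only on vector direction, two concepts can score as related even when they never share an exact term, which is the property the summary above contrasts with exact-match and co-occurrence methods.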

Döring K, Qaseem A, Becer M, Li J, Mishra P, Gao M, Kirchner P, Sauter F, Telukunta KK, Moumbock AFA, Thomas P, Günther S. Automated recognition of functional compound-protein relationships in literature. PLoS One 2020;15:e0220925. [PMID: 32126064 PMCID: PMC7053725 DOI: 10.1371/journal.pone.0220925] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2019] [Accepted: 01/29/2020] [Indexed: 11/18/2022] Open

Caufield JH, Ping P. New advances in extracting and learning from protein-protein interactions within unstructured biomedical text data. Emerg Top Life Sci 2019;3:357-369. [PMID: 33523203 DOI: 10.1042/etls20190003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2019] [Revised: 07/11/2019] [Accepted: 07/16/2019] [Indexed: 12/14/2022]