Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Chen Q, Allot A, Leaman R, Islamaj R, Du J, Fang L, Wang K, Xu S, Zhang Y, Bagherzadeh P, Bergler S, Bhatnagar A, Bhavsar N, Chang YC, Lin SJ, Tang W, Zhang H, Tavchioski I, Pollak S, Tian S, Zhang J, Otmakhova Y, Yepes AJ, Dong H, Wu H, Dufour R, Labrak Y, Chatterjee N, Tandon K, Laleye FAA, Rakotoson L, Chersoni E, Gu J, Friedrich A, Pujari SC, Chizhikova M, Sivadasan N, Vg S, Lu Z. Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations. Database (Oxford) 2022;2022:baac069. [PMID: 36043400 PMCID: PMC9428574 DOI: 10.1093/database/baac069] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2022] [Revised: 08/02/2022] [Accepted: 08/13/2022] [Indexed: 05/03/2023]

Cited by Other Article(s)

Chen Q, Hu Y, Peng X, Xie Q, Jin Q, Gilson A, Singer MB, Ai X, Lai PT, Wang Z, Keloth VK, Raja K, Huang J, He H, Lin F, Du J, Zhang R, Zheng WJ, Adelman RA, Lu Z, Xu H. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nat Commun 2025;16:3280. [PMID: 40188094 PMCID: PMC11972378 DOI: 10.1038/s41467-025-56989-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2023] [Accepted: 02/07/2025] [Indexed: 04/07/2025] Open

Affiliation(s)

Qingyu Chen Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Yan Hu McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
Xueqing Peng Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Qianqian Xie Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Qiao Jin National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Aidan Gilson Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Maxwell B Singer Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Xuguang Ai Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Po-Ting Lai National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Zhizheng Wang National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Vipina K Keloth Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Kalpana Raja Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Jimin Huang Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Huan He Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Fongci Lin Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Jingcheng Du McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
Rui Zhang Division of Computational Health Sciences, Department of Surgery, Medical School, University of Minnesota, Minneapolis, MN, USA Center for Learning Health System Sciences, University of Minnesota, Minneapolis, MN, 55455, USA
W Jim Zheng McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
Ron A Adelman Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Zhiyong Lu National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
Hua Xu Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.

Anderson LN, Hoyt CT, Zucker JD, McNaughton AD, Teuton JR, Karis K, Arokium-Christian NN, Warley JT, Stromberg ZR, Gyori BM, Kumar N. Computational tools and data integration to accelerate vaccine development: challenges, opportunities, and future directions. Front Immunol 2025;16:1502484. [PMID: 40124369 PMCID: PMC11925797 DOI: 10.3389/fimmu.2025.1502484] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2024] [Accepted: 01/23/2025] [Indexed: 03/25/2025] Open

Huang TY, Chong CF, Lin HY, Chen TY, Chang YC, Lin MC. A pre-trained language model for emergency department intervention prediction using routine physiological data and clinical narratives. Int J Med Inform 2024;191:105564. [PMID: 39121529 DOI: 10.1016/j.ijmedinf.2024.105564] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2024] [Revised: 07/15/2024] [Accepted: 07/20/2024] [Indexed: 08/12/2024]

Abstract

INTRODUCTION

The urgency and complexity of emergency room (ER) settings require precise and swift decision-making processes for patient care. Ensuring the timely execution of critical examinations and interventions is vital for reducing diagnostic errors, but the literature highlights a need for innovative approaches to optimize diagnostic accuracy and patient outcomes. In response, our study endeavors to create predictive models for timely examinations and interventions by leveraging the patient's symptoms and vital signs recorded during triage, and in so doing, augment traditional diagnostic methodologies.

METHODS

Focusing on four key areas (medication dispensing, vital interventions, laboratory testing, and emergency radiology exams), the study employed Natural Language Processing (NLP) and seven advanced machine learning techniques. The research was centered around the innovative use of BioClinicalBERT, a state-of-the-art NLP framework.

RESULTS

BioClinicalBERT emerged as the superior model, outperforming others in predictive accuracy. The integration of physiological data with patient narrative symptoms demonstrated greater effectiveness compared to models based solely on textual data. The robustness of our approach was confirmed by an Area Under the Receiver Operating Characteristic curve (AUROC) score of 0.9.

CONCLUSION

The findings of our study underscore the feasibility of establishing a decision support system for emergency patients, targeting timely interventions and examinations based on a nuanced analysis of symptoms. By using an advanced natural language processing technique, our approach shows promise for enhancing diagnostic accuracy. However, the current model is not yet fully mature for direct implementation into daily clinical practice. Recognizing the imperative nature of precision in the ER environment, future research endeavors must focus on refining and expanding predictive models to include detailed timely examinations and interventions. Although the progress achieved in this study represents an encouraging step towards a more innovative and technology-driven paradigm in emergency care, full clinical integration warrants further exploration and validation.

Xu S, Zhang Y, Chen L, An X. Is metadata of articles about COVID-19 enough for multilabel topic classification task? Database (Oxford) 2024;2024:baae106. [PMID: 39432499 PMCID: PMC11492800 DOI: 10.1093/database/baae106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2023] [Revised: 06/03/2024] [Accepted: 09/12/2024] [Indexed: 10/23/2024]

Zong H, Wu R, Cha J, Feng W, Wu E, Li J, Shao A, Tao L, Li Z, Tang B, Shen B. Advancing Chinese biomedical text mining with community challenges. J Biomed Inform 2024;157:104716. [PMID: 39197732 DOI: 10.1016/j.jbi.2024.104716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2024] [Revised: 08/22/2024] [Accepted: 08/25/2024] [Indexed: 09/01/2024]

Affiliation(s)

Hui Zong Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Rongrong Wu Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Jiaxue Cha Shanghai Key Laboratory of Signaling and Disease Research, Laboratory of Receptor-Based Bio-Medicine, Collaborative Innovation Center for Brain Science, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
Weizhe Feng Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Erman Wu Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Jiakun Li Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China; Department of Urology, West China Hospital, Sichuan University, Chengdu 610041, China
Aibin Shao Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China
Liang Tao Faculty of Business Information, Shanghai Business School, Shanghai 201400, China
Zuofeng Li Takeda Co. Ltd., Shanghai 200040, China
Buzhou Tang Department of Computer Science, Harbin Institute of Technology, Shenzhen 518055, China
Bairong Shen Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, China.

Luo L, Ning J, Zhao Y, Wang Z, Ding Z, Chen P, Fu W, Han Q, Xu G, Qiu Y, Pan D, Li J, Li H, Feng W, Tu S, Liu Y, Yang Z, Wang J, Sun Y, Lin H. Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks. J Am Med Inform Assoc 2024;31:1865-1874. [PMID: 38422367 PMCID: PMC11339499 DOI: 10.1093/jamia/ocae037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Revised: 01/08/2024] [Accepted: 02/16/2024] [Indexed: 03/02/2024] Open

Affiliation(s)

Ling Luo School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Jinzhong Ning School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Yingwen Zhao School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Zhijun Wang School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Zeyuan Ding School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Peng Chen School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Weiru Fu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Qinyu Han School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Guangtao Xu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Yunzhi Qiu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Dinghao Pan School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Jiru Li School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Hao Li School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Wenduo Feng School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Senbo Tu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Yuqi Liu School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Zhihao Yang School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Jian Wang School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Yuanyuan Sun School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Hongfei Lin School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China

Sarol MJ, Hong G, Guerra E, Kilicoglu H. Integrating deep learning architectures for enhanced biomedical relation extraction: a pipeline approach. Database (Oxford) 2024;2024:baae079. [PMID: 39197056 PMCID: PMC11352595 DOI: 10.1093/database/baae079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2024] [Revised: 06/21/2024] [Accepted: 08/14/2024] [Indexed: 08/30/2024]

Abstract

Biomedical relation extraction from scientific publications is a key task in biomedical natural language processing (NLP) and can facilitate the creation of large knowledge bases, enable more efficient knowledge discovery, and accelerate evidence synthesis. In this paper, building upon our previous effort in the BioCreative VIII BioRED Track, we propose an enhanced end-to-end pipeline approach for biomedical relation extraction (RE) and novelty detection (ND) that effectively leverages existing datasets and integrates state-of-the-art deep learning methods. Our pipeline consists of four tasks performed sequentially: named entity recognition (NER), entity linking (EL), RE, and ND. We trained models using the BioRED benchmark corpus that was the basis of the shared task. We explored several methods for each task and combinations thereof: for NER, we compared a BERT-based sequence labeling model that uses the BIO scheme with a span classification model. For EL, we trained a convolutional neural network model for diseases and chemicals and used an existing tool, PubTator 3.0, for mapping other entity types. For RE and ND, we adapted the BERT-based, sentence-bound PURE model to bidirectional and document-level extraction. We also performed extensive hyperparameter tuning to improve model performance. We obtained our best performance using BERT-based models for NER, RE, and ND, and the hybrid approach for EL. Our enhanced and optimized pipeline showed substantial improvement compared to our shared task submission, NER: 93.53 (+3.09), EL: 83.87 (+9.73), RE: 46.18 (+15.67), and ND: 38.86 (+14.9). While the performances of the NER and EL models are reasonably high, RE and ND tasks remain challenging at the document level. Further enhancements to the dataset could enable more accurate and useful models for practical use. We provide our models and code at https://github.com/janinaj/e2eBioMedRE/. Database URL: https://github.com/janinaj/e2eBioMedRE/.

Madan S, Lentzen M, Brandt J, Rueckert D, Hofmann-Apitius M, Fröhlich H. Transformer models in biomedicine. BMC Med Inform Decis Mak 2024;24:214. [PMID: 39075407 PMCID: PMC11287876 DOI: 10.1186/s12911-024-02600-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2023] [Accepted: 07/08/2024] [Indexed: 07/31/2024] Open

Du J, Soysal E, Wang D, He L, Lin B, Wang J, Manion FJ, Li Y, Wu E, Yao L. Machine learning models for abstract screening task - A systematic literature review application for health economics and outcome research. BMC Med Res Methodol 2024;24:108. [PMID: 38724903 PMCID: PMC11080200 DOI: 10.1186/s12874-024-02224-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2023] [Accepted: 04/18/2024] [Indexed: 05/13/2024] Open

Badenes-Olmedo C, Corcho O. Lessons learned to enable question answering on knowledge graphs extracted from scientific publications: A case study on the coronavirus literature. J Biomed Inform 2023;142:104382. [PMID: 37156393 PMCID: PMC10163941 DOI: 10.1016/j.jbi.2023.104382] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2022] [Revised: 04/14/2023] [Accepted: 05/03/2023] [Indexed: 05/10/2023]

Systematic Guidelines for Effective Utilization of COVID-19 Databases in Genomic, Epidemiologic, and Clinical Research. Viruses 2023;15:v15030692. [PMID: 36992400 PMCID: PMC10059256 DOI: 10.3390/v15030692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2023] [Revised: 02/27/2023] [Accepted: 03/04/2023] [Indexed: 03/09/2023] Open

Jimeno Yepes AJ, Verspoor K. Classifying literature mentions of biological pathogens as experimentally studied using natural language processing. J Biomed Semantics 2023;14:1. [PMID: 36721225 PMCID: PMC9889128 DOI: 10.1186/s13326-023-00282-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2022] [Accepted: 01/17/2023] [Indexed: 02/02/2023] Open

Abstract

BACKGROUND

Information pertaining to mechanisms, management and treatment of disease-causing pathogens including viruses and bacteria is readily available from research publications indexed in MEDLINE. However, identifying the literature that specifically characterises these pathogens and their properties based on experimental research, important for understanding of the molecular basis of diseases caused by these agents, requires sifting through a large number of articles to exclude incidental mentions of the pathogens, or references to pathogens in other non-experimental contexts such as public health.

OBJECTIVE

In this work, we lay the foundations for the development of automatic methods for characterising mentions of pathogens in scientific literature, focusing on the task of identifying research that involves the experimental study of a pathogen. There are no manually annotated pathogen corpora available for this purpose, while such resources are necessary to support the development of machine learning-based models. We therefore aim to fill this gap, producing a large data set automatically from MEDLINE under some simplifying assumptions for the task definition, and using it to explore automatic methods that specifically support the detection of experimentally studied pathogen mentions in research publications.

METHODS

We developed a pathogen mention characterisation literature data set, READBiomed-Pathogens, automatically using NCBI resources, which we make available. Resources such as the NCBI Taxonomy, MeSH and GenBank can be used effectively to identify relevant literature about experimentally researched pathogens, more specifically using MeSH to link to MEDLINE citations including titles and abstracts with experimentally researched pathogens. We experiment with several machine learning-based natural language processing (NLP) algorithms leveraging this data set as training data, to model the task of detecting papers that specifically describe experimental study of a pathogen.

RESULTS

We show that our data set READBiomed-Pathogens can be used to explore natural language processing configurations for experimental pathogen mention characterisation. READBiomed-Pathogens includes citations related to organisms including bacteria, viruses, and a small number of toxins and other disease-causing agents.

CONCLUSIONS

We studied the characterisation of experimentally studied pathogens in scientific literature, developing several natural language processing methods supported by an automatically developed data set. As a core contribution of the work, we presented a methodology to automatically construct a data set for pathogen identification using existing biomedical resources. The data set and the annotation code are made publicly available. The performance of the pathogen mention identification and characterisation algorithms was additionally evaluated on a small manually annotated data set; the results show that the data set we generated allows pathogens of interest to be characterised.

TRIAL REGISTRATION

N/A.

Comprehensively identifying Long Covid articles with human-in-the-loop machine learning. PATTERNS (NEW YORK, N.Y.) 2022;4:100659. [PMID: 36471749 PMCID: PMC9712067 DOI: 10.1016/j.patter.2022.100659] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/29/2022] [Revised: 09/19/2022] [Accepted: 11/17/2022] [Indexed: 12/05/2022]

Rabby G, Berka P. Multi-class classification of COVID-19 documents using machine learning algorithms. J Intell Inf Syst 2022;60:571-591. [PMID: 36465147 PMCID: PMC9707112 DOI: 10.1007/s10844-022-00768-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2022] [Revised: 11/16/2022] [Accepted: 11/17/2022] [Indexed: 11/30/2022]

Abstract

In most biomedical research paper corpora, document classification is a crucial task. Because of the global epidemic, it is also crucial for researchers across a variety of fields to identify the relevant scientific research papers accurately and quickly from a flood of biomedical research papers. Classification can also assist learners or researchers in assigning a research paper to an appropriate category and help them find the relevant research paper within a very short time. A biomedical document classifier needs to be designed differently to go beyond a "general" text classifier because it does not depend only on the text itself (i.e. on titles and abstracts) but can also utilize other information such as entities extracted using medical taxonomies or bibliometric data. The main objective of this research was to find out which types of information or features, and which representation methods, influence the biomedical document classification task. For this reason, we ran several experiments on conventional text classification methods with different kinds of features extracted from the titles, abstracts, and bibliometric data. These procedures include data cleaning, feature engineering, and multi-class classification. Eleven different variants of input data tables were created and analyzed using ten machine learning algorithms. We also evaluate the data efficiency and interpretability of these models as essential features of any biomedical research paper classification system for handling specifically the COVID-19 related health crisis. Our major findings are that TF-IDF representations outperform the entity extraction methods and that the abstract itself provides sufficient information for correct classification. Out of the machine learning algorithms used, the best performance over various forms of document representation was achieved by Random Forest and Neural Network (BERT). Our results lead to a concrete guideline for practitioners on biomedical document classification.

Gu J, Chersoni E, Wang X, Huang CR, Qian L, Zhou G. LitCovid ensemble learning for COVID-19 multi-label classification. Database (Oxford) 2022;2022:6846687. [PMID: 36426767 PMCID: PMC9693804 DOI: 10.1093/database/baac103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2022] [Revised: 10/27/2022] [Accepted: 11/04/2022] [Indexed: 11/27/2022]

Abstract

The Coronavirus Disease 2019 (COVID-19) pandemic has shifted the focus of research worldwide, and more than 10 000 new articles per month have concentrated on COVID-19-related topics. Considering this rapidly growing literature, the efficient and precise extraction of the main topics of COVID-19-relevant articles is of great importance. The manual curation of this information for biomedical literature is labor-intensive and time-consuming, and as such the procedure is insufficient and difficult to maintain. In response to these complications, the BioCreative VII community has proposed a challenging task, LitCovid Track, calling for a global effort to automatically extract semantic topics for COVID-19 literature. This article describes our work on the BioCreative VII LitCovid Track. We proposed the LitCovid Ensemble Learning (LCEL) method for the tasks and integrated multiple biomedical pretrained models to address the COVID-19 multi-label classification problem. Specifically, seven different transformer-based pretrained models were ensembled for the initialization and fine-tuning processes independently. To enhance the representation abilities of the deep neural models, diverse additional biomedical knowledge was utilized to facilitate the fruitfulness of the semantic expressions. Simple yet effective data augmentation was also leveraged to address the learning deficiency during the training phase. In addition, given the imbalanced label distribution of the challenging task, a novel asymmetric loss function was applied to the LCEL model, which explicitly adjusted the negative-positive importance by assigning different exponential decay factors and helped the model focus on the positive samples. After the training phase, an ensemble bagging strategy was adopted to merge the outputs from each model for final predictions. The experimental results show the effectiveness of our proposed approach, as LCEL obtains the state-of-the-art performance on the LitCovid dataset. 
Database URL: https://github.com/JHnlp/LCEL.
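
The asymmetric loss described in the abstract above can be illustrated with a minimal Python sketch. It assumes the commonly used asymmetric focal formulation, in which positive and negative labels receive separate focusing exponents. The function name `asymmetric_loss`, the parameters `gamma_pos` and `gamma_neg`, and their default values are illustrative assumptions and are not taken from the LCEL code, whose exact loss may differ.

```python
import math


def asymmetric_loss(probs, labels, gamma_pos=0.0, gamma_neg=4.0, eps=1e-8):
    """Mean asymmetric focal-style loss over one document's label vector.

    probs  -- predicted probability for each topic label (values in [0, 1])
    labels -- gold 0/1 indicator for each topic label
    gamma_pos, gamma_neg -- hypothetical focusing exponents; a larger
        gamma_neg down-weights negatives that are already scored low
    """
    total = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            # Positive label: penalise low predicted probability.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative label: the factor p ** gamma_neg shrinks the
            # contribution of easy negatives (small p).
            total += -(p ** gamma_neg) * math.log(max(1.0 - p, eps))
    return total / len(probs)


if __name__ == "__main__":
    # One document, three candidate topics, only the first is correct.
    probs = [0.7, 0.2, 0.1]
    labels = [1, 0, 0]
    print(asymmetric_loss(probs, labels, gamma_pos=0.0, gamma_neg=0.0))
    print(asymmetric_loss(probs, labels, gamma_pos=0.0, gamma_neg=4.0))
```

With both exponents set to 0 the function reduces to ordinary binary cross-entropy. Raising `gamma_neg` shrinks the contribution of negatives the model already scores low, so the comparatively rare positive labels carry more of the loss.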

Xu S, Li L, Wang C, An X, Yang G. An improved author-topic (AT) model with authorship credit allocation schemes. J Inf Sci 2022. [DOI: 10.1177/01655515221133530] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]

Abstract

Authorship credit allocation schemes have attracted considerable research attention. However, no consensus about which one is the best has been attained until now, and limited evidence from practical tasks has been reported. Therefore, this study uses the author interest discovery task as a real-world task case to provide valuable insights into authorship credit allocation schemes and guidelines for further practical applications. For this purpose, a novel model, ATcredit, is proposed to strengthen the Author-Topic (AT) model with an authorship credit allocation scheme, and collapsed Gibbs sampling is used to approximate the posterior and estimate model parameters. Extensive experiments using the SynBio dataset reveal several interesting findings as follows. (a) Any scheme for allocating unequal authorship credits performs better than its equal-credit counterpart with our ATcredit model in terms of perplexity. (b) The fixed versions of four out of the six schemes work better than their flexible counterparts with our ATcredit model, regardless of the hyper-authorship strategy. (c) The variation coefficient of credit awards can serve as a criterion to decide whether the hyper-authorship strategy should be used. (d) When the number of authors in a scholarly article is less than three, the six authorship credit allocation schemes are similar to each other with our ATcredit model in terms of perplexity. (e) The harmonic counting scheme performs the best, followed by the arithmetic counting scheme, and the network-based counting scheme performs the worst with our ATcredit model in terms of perplexity. (f) The arithmetic counting scheme is similar to the harmonic counting scheme in terms of the normalised mutual information (NMI) of discovered interests, but the geometric counting scheme is different from the axiomatic and network-based counting schemes.
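
To make two of the schemes named in the abstract above concrete, here is a minimal Python sketch of harmonic and arithmetic counting in their standard bibliometric definitions, where credit depends only on byline rank. The function names are hypothetical, and the paper's own variants (for example its fixed and flexible versions, or its hyper-authorship strategy) are not reproduced here.

```python
def harmonic_credits(n_authors):
    """Harmonic counting: the author at rank i gets weight 1/i, normalised to sum to 1."""
    weights = [1.0 / rank for rank in range(1, n_authors + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def arithmetic_credits(n_authors):
    """Arithmetic counting: the author at rank i gets weight N + 1 - i, normalised to sum to 1."""
    weights = [n_authors + 1 - rank for rank in range(1, n_authors + 1)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Compare the two schemes for bylines of different lengths.
    for n in (2, 3, 5):
        harmonic = [round(c, 3) for c in harmonic_credits(n)]
        arithmetic = [round(c, 3) for c in arithmetic_credits(n)]
        print(n, harmonic, arithmetic)
```

Under these definitions the two schemes assign identical credits when there are one or two authors and diverge as the byline grows. This covers only two of the six schemes studied, so it should be read as an illustration rather than an explanation of the reported findings.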

Chen Q, Allot A, Leaman R, Wei CH, Aghaarabi E, Guerrerio J, Xu L, Lu Z. LitCovid in 2022: an information resource for the COVID-19 literature. Nucleic Acids Res 2022;51:D1512-D1518. [PMID: 36350613 PMCID: PMC9825538 DOI: 10.1093/nar/gkac1005] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2022] [Revised: 10/11/2022] [Accepted: 10/19/2022] [Indexed: 11/11/2022] Open

Chen Q, Du J, Allot A, Lu Z. LitMC-BERT: Transformer-Based Multi-Label Classification of Biomedical Literature With An Application on COVID-19 Literature Curation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022;19:2584-2595. [PMID: 35536809 PMCID: PMC9647722 DOI: 10.1109/tcbb.2022.3173562] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/11/2021] [Revised: 04/19/2022] [Accepted: 04/22/2022] [Indexed: 05/20/2023]