Menke JD, Ming S, Radhakrishna S, Kilicoglu H, Smalheiser NR. Enhancing automated indexing of publication types and study designs in biomedical literature using full-text features.
MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2025:2025.04.23.25326300. [PMID: 40343026 PMCID: PMC12060953 DOI: 10.1101/2025.04.23.25326300]
[Indexed: 05/11/2025]
Abstract
Objective
Searching for biomedical articles by publication type or study design is essential for tasks like evidence synthesis. Prior work has relied solely on PubMed information or a limited set of types (e.g., randomized controlled trials). This study builds on our previous work by leveraging full-text features, alternative text representations, and advanced optimization techniques.
Methods
Using a dataset of PubMed articles published between 1987 and 2023 with human-curated indexing terms, we fine-tuned BERT-based encoders (PubMedBERT, BioLinkBERT, SPECTER, SPECTER2, SPECTER2-Clf) to investigate whether text representations based on different pre-training objectives could benefit the task. We incorporated textual and verbalized metadata features, full-text extraction (rule-based, extractive, and abstractive summarization), and additional topical information about the articles. To improve calibration and mitigate label noise, we used asymmetric loss and label smoothing. We also explored contrastive learning approaches (SimCSE, ADNCE, HeroCon, WeighCon). Models were evaluated using precision, recall, F1 score (both micro- and macro-averaged), and area under the ROC curve (AUC).
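To make the loss described above concrete, the following is a minimal sketch of a per-document asymmetric loss with label smoothing for multi-label classification. It is not the authors' implementation: the function name, the hyperparameter values (`gamma_pos`, `gamma_neg`, `margin`, `smoothing`), and the way smoothing is combined with the asymmetric terms are all assumptions chosen for illustration.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def asymmetric_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, smoothing=0.1, eps=1e-8):
    """Mean asymmetric loss over the labels of one document.

    logits  : raw scores, one per label
    targets : 0/1 gold labels, one per label
    All default hyperparameters are illustrative placeholders.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        p = _sigmoid(z)
        # Assumed smoothing scheme: pull hard 0/1 targets toward 0.5.
        y_s = y * (1.0 - smoothing) + 0.5 * smoothing
        # Positive term: focal-style down-weighting of easy positives.
        pos = -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        # Negative term: shift the probability by a margin so that very
        # easy negatives contribute nothing, then down-weight the rest.
        p_m = max(p - margin, 0.0)
        neg = -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
        total += y_s * pos + (1.0 - y_s) * neg
    return total / len(logits)
```

In this sketch, a larger `gamma_neg` than `gamma_pos` reduces the influence of the many easy negative labels, which is the usual motivation for an asymmetric loss when each article carries only a few of the possible publication-type labels.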
Results
Fine-tuning SPECTER2-base with the addition of the MeSH term "Animals", asymmetric loss with label smoothing, and WeighCon contrastive loss significantly improved performance over the previous best architecture (micro-F1: 0.664 → 0.679 [+2.2%]; macro-F1: 0.663 → 0.690 [+4.1%]; p < 0.0001). Asymmetric loss and the use of SPECTER2-base instead of PubMedBERT contributed most to this gain. Full-text features yielded relative improvements of 2.4% (micro-F1) and 1.8% (macro-F1) over the baseline (micro-F1: 0.616 → 0.631; macro-F1: 0.556 → 0.566; p < 0.0001). Topical label splitting and contrastive learning provided minor, non-significant improvements.
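The percentage gains quoted above are relative changes in F1, not absolute differences in F1 points. The short sketch below recomputes them from the reported before/after scores; the helper name `relative_gain` and the dictionary layout are illustrative only.

```python
def relative_gain(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (before, after) F1 scores as reported in the Results.
reported = {
    "best model vs. previous best, micro-F1": (0.664, 0.679),
    "best model vs. previous best, macro-F1": (0.663, 0.690),
    "full-text vs. baseline, micro-F1": (0.616, 0.631),
    "full-text vs. baseline, macro-F1": (0.556, 0.566),
}

for name, (before, after) in reported.items():
    print(f"{name}: {relative_gain(before, after):.2f}% relative")
```

The recomputed values agree with the reported percentages to within about 0.1 point, which is the tolerance expected when the underlying F1 scores are rounded to three decimals.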
Conclusion
Full-text features, enhanced document representations, and fine-tuning optimizations improve publication type and study design indexing. Future work should refine label accuracy, better distill relevant article information, and expand label sets to meet the needs of the research community. Data, code, and models are available at https://github.com/ScienceNLP-Lab/MultiTagger-v2.