Chen S, Li Y, Lu S, Van H, Aerts HJWL, Savova GK, Bitterman DS. Evaluating the ChatGPT family of models for biomedical reasoning and classification. J Am Med Inform Assoc 2024;31:940-948. [PMID: 38261400; PMCID: PMC10990500; DOI: 10.1093/jamia/ocad256]
[Received: 10/11/2023] [Revised: 12/15/2023] [Accepted: 12/19/2023] [Indexed: 01/24/2024]
Abstract
OBJECTIVE
Large language models (LLMs) have shown impressive ability in biomedical question-answering, but have not been adequately investigated for more specific biomedical applications. This study investigates the ChatGPT family of models (GPT-3.5, GPT-4) on biomedical tasks beyond question-answering.
MATERIALS AND METHODS
We evaluated model performance with 11 122 samples for two fundamental tasks in the biomedical domain: classification (n = 8676) and reasoning (n = 2446). The first task involves classifying health advice in scientific literature, while the second task is detecting causal relations in biomedical literature. We used 20% of the dataset for prompt development, including zero- and few-shot settings with and without chain-of-thought (CoT). We then evaluated the best prompts from each setting on the remaining dataset, comparing them to models using simple features (bag-of-words [BoW] with logistic regression) and fine-tuned BioBERT models.
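The BoW-with-logistic-regression baseline described above can be sketched as follows. This is an illustrative toy example using scikit-learn, not the authors' actual pipeline; the example sentences, labels, and vectorizer settings are assumptions for demonstration only.

```python
# Minimal sketch of a bag-of-words (BoW) + logistic regression baseline
# for binary text classification (e.g., health-advice detection).
# Toy data and hyperparameters are hypothetical, not from the study.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Patients should increase daily exercise to reduce cardiovascular risk.",
    "The study measured serum glucose levels at baseline.",
    "Clinicians should consider routine screening for hypertension.",
    "Samples were stored at low temperature before analysis.",
]
train_labels = [1, 0, 1, 0]  # 1 = contains health advice, 0 = no advice

# CountVectorizer builds the sparse BoW feature matrix;
# LogisticRegression fits a linear classifier on those counts.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

test_texts = [
    "Adults should limit added sugar intake.",
    "Data were collected between 2015 and 2019.",
]
preds = model.predict(test_texts)
```

A pipeline like this trains in seconds on CPU, which is consistent with the large gap in wall-clock cost reported in the Results.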
RESULTS
Fine-tuning BioBERT produced the best classification (F1: 0.800-0.902) and reasoning (F1: 0.851) results. Among LLM approaches, few-shot CoT achieved the best classification (F1: 0.671-0.770) and reasoning (F1: 0.682) results, comparable to the BoW model (F1: 0.602-0.753 and 0.675 for classification and reasoning, respectively). It took 78 h to obtain the best LLM results, compared to 0.078 and 0.008 h for the top-performing BioBERT and BoW models, respectively.
DISCUSSION
The simple BoW model performed similarly to the most complex LLM prompting. Prompt engineering required significant investment.
CONCLUSION
Despite the excitement around ChatGPT, fine-tuning remained the best strategy for these two fundamental biomedical natural language processing tasks.