Lee CE, Kim JS, Min JH, Han SW. SimSon: simple contrastive learning of SMILES for molecular property prediction. Bioinformatics 2025; 41:btaf275. [PMID: 40341364 PMCID: PMC12124188 DOI: 10.1093/bioinformatics/btaf275]
[Received: 10/01/2024] [Revised: 03/21/2025] [Accepted: 05/07/2025] [Indexed: 05/10/2025]
Abstract
MOTIVATION
Molecular property prediction with deep learning has accelerated drug discovery and retrosynthesis. However, the shortage of labeled molecular data and the difficulty of generalizing across vast chemical space remain significant hurdles. This study proposes a self-supervised framework for learning representations of Simplified Molecular Input Line Entry System (SMILES) strings, which we call Simple SMILES contrastive learning (SimSon). SimSon was pre-trained on unlabeled SMILES data with contrastive learning to learn these representations.
RESULTS
Our findings demonstrate that contrastive learning with randomized SMILES improves the model's generalization and robustness, as it captures the global semantic context at the molecular level. In downstream tasks, SimSon performs competitively with graph-based methods and outperforms them on certain benchmark datasets. These results indicate that SimSon effectively captures structural information from SMILES. The potential applications of SimSon extend to bioinformatics and cheminformatics, encompassing areas such as drug discovery and drug-drug interaction prediction.
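The abstract does not state the exact training objective. As a rough illustration only, the sketch below assumes a SimCLR-style NT-Xent loss, in which the embeddings of two randomized SMILES strings of the same molecule form a positive pair and all other embeddings in the batch act as negatives. The function names `cosine` and `nt_xent_loss`, the temperature of 0.1, and the toy two-dimensional embeddings are hypothetical and are not taken from the SimSon source code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(view_a, view_b, temperature=0.1):
    """NT-Xent loss over a batch of paired embeddings (assumed objective).

    view_a[i] and view_b[i] stand for the embeddings of two different
    randomized SMILES strings of molecule i (the positive pair).
    """
    embeddings = view_a + view_b
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same molecule
        # Similarities to every other embedding in the batch, excluding itself.
        logits = [
            cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(2 * n)
            if j != i
        ]
        # Numerically stable log-sum-exp for the denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        positive_logit = cosine(embeddings[i], embeddings[positive]) / temperature
        total += log_denominator - positive_logit
    return total / (2 * n)


# Toy embeddings: each row of view_b is close to the matching row of view_a.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

matched = nt_xent_loss(view_a, view_b)
mismatched = nt_xent_loss(view_a, view_b[::-1])  # positives deliberately mispaired
print(matched < mismatched)
```

Under this assumed objective, the loss is low when both views of a molecule map to nearby embeddings and high when they do not, which is one way a model could be pushed toward representations that do not depend on the particular SMILES string used to write a molecule.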
AVAILABILITY AND IMPLEMENTATION
The source code is available at https://github.com/lee00206/SimSon.