1
Almeida DS, Almeida MV, Sampaio JV, Gaieta EM, Costa AHS, Rabelo FFA, Cavalcante CL, Sartori GR, Silva JHM. AbSet: A Standardized Data Set of Antibody Structures for Machine Learning Applications. J Chem Inf Model 2025; 65:4767-4774. [PMID: 40349368 PMCID: PMC12117563 DOI: 10.1021/acs.jcim.5c00410] [Received: 02/25/2025] [Revised: 04/23/2025] [Accepted: 04/30/2025] [Indexed: 05/14/2025]
Abstract
Machine learning algorithms have played a fundamental role in the development of therapeutic antibodies by being trained on data sets of sequences and/or structures. However, structural data sets remain limited, especially those that include antibody-antigen complexes. Additionally, many of the available structures are not standardized, and antibody-specific databases often do not provide molecular descriptors that could enhance ML models. To address this gap, we introduce AbSet, a curated dataset comprising over 800,000 antibody structures and corresponding molecular descriptors, including both experimentally determined and in silico-generated antibody-antigen complexes. We systematically retrieved antibody structures from the Protein Data Bank (PDB), applied rigorous standardization protocols, and expanded the dataset through large-scale protein-protein docking to generate structural variants of antibody-antigen interactions. Each model was classified as high, medium, acceptable, or incorrect quality based on structural similarity to reference experimental complexes. This classification enables both the construction of a decoy set of confirmed non-binders and the generation of high-confidence augmented structural data for machine learning applications. AbSet is publicly available via the Zenodo repository, with accompanying scripts hosted on GitHub (https://github.com/SFBBGroup/AbSet.git).
Affiliation(s)
- Diego S. Almeida
  - Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil
  - Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Rio de Janeiro 21040-900, Brazil
- Matheus V. Almeida
  - Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil
- Jean V. Sampaio
  - Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil
  - Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Rio de Janeiro 21040-900, Brazil
- Eduardo M. Gaieta
  - Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil
  - Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Rio de Janeiro 21040-900, Brazil
- Andrielly H. S. Costa
  - Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil
  - Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Rio de Janeiro 21040-900, Brazil
- Geraldo R. Sartori
  - Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil
- João H. M. Silva
  - Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil
  - Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Rio de Janeiro 21040-900, Brazil
  - Pasteur-Fiocruz Center on Immunology and Immunotherapy, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61760-000, Brazil
2
Dudzic P, Janusz B, Satława T, Chomicz D, Gawłowski T, Grabowski R, Jóźwiak P, Tarkowski M, Mycielski M, Wróbel S, Krawczyk K. RIOT-Rapid Immunoglobulin Overview Tool-annotation of nucleotide and amino acid immunoglobulin sequences using an open germline database. Brief Bioinform 2024; 26:bbae632. [PMID: 39656773 DOI: 10.1093/bib/bbae632] [Received: 08/16/2024] [Revised: 10/16/2024] [Accepted: 11/22/2024] [Indexed: 12/17/2024]
Abstract
Antibodies are a cornerstone of the immune system, playing a pivotal role in identifying and neutralizing infections caused by bacteria, viruses, and other pathogens. Understanding their structure and function can provide insights into both the body's natural defenses and the principles behind many therapeutic interventions, including vaccines and antibody-based drugs. The analysis and annotation of antibody sequences, including the identification of variable, diversity, joining, and constant genes, as well as the delineation of framework regions and complementarity-determining regions, is essential for understanding their structure and function. Analyzing large volumes of antibody sequences is now routine in antibody discovery and requires fast and accurate tools. While there are existing tools designed for the annotation and numbering of antibody sequences, they often have limitations such as being restricted to either nucleotide or amino acid sequences, slow execution times, or reliance on germline databases that are closed, frequently changed, or have sparse coverage for some species. Here, we present the Rapid Immunoglobulin Overview Tool (RIOT), a novel open-source solution for antibody numbering that addresses these shortcomings. RIOT handles nucleotide and amino acid sequence processing, comes integrated with an Open Germline Receptor Database, and is computationally efficient. We hope that the tool will facilitate rapid annotation of antibody sequencing outputs for the benefit of understanding antibody biology and discovering novel therapeutics.
Affiliation(s)
- Paweł Dudzic
  - NaturalAntibody S.A., Al. Piastów 22, 71-064 Szczecin, Poland
- Bartosz Janusz
  - NaturalAntibody S.A., Al. Piastów 22, 71-064 Szczecin, Poland
- Tadeusz Satława
  - NaturalAntibody S.A., Al. Piastów 22, 71-064 Szczecin, Poland
- Dawid Chomicz
  - NaturalAntibody S.A., Al. Piastów 22, 71-064 Szczecin, Poland
- Rafał Grabowski
  - NaturalAntibody S.A., Al. Piastów 22, 71-064 Szczecin, Poland
- Przemek Jóźwiak
  - NaturalAntibody S.A., Al. Piastów 22, 71-064 Szczecin, Poland
- Sonia Wróbel
  - NaturalAntibody S.A., Al. Piastów 22, 71-064 Szczecin, Poland
- Konrad Krawczyk
  - NaturalAntibody S.A., Al. Piastów 22, 71-064 Szczecin, Poland
3
Dudzic P, Chomicz D, Kończak J, Satława T, Janusz B, Wrobel S, Gawłowski T, Jaszczyszyn I, Bielska W, Demharter S, Spreafico R, Schulte L, Martin K, Comeau SR, Krawczyk K. Large-scale data mining of four billion human antibody variable regions reveals convergence between therapeutic and natural antibodies that constrains search space for biologics drug discovery. MAbs 2024; 16:2361928. [PMID: 38844871 PMCID: PMC11164219 DOI: 10.1080/19420862.2024.2361928] [Received: 01/30/2024] [Accepted: 05/27/2024] [Indexed: 06/12/2024]
Abstract
The naïve human antibody repertoire has theoretical access to an estimated > 10^15 antibodies. Identifying subsets of this prohibitively large space where therapeutically relevant antibodies may be found is useful for development of these agents. It was previously demonstrated that, despite the immense sequence space, different individuals can produce the same antibodies. It was also shown that therapeutic antibodies, which typically follow seemingly unnatural development processes, can arise independently naturally. To check for biases in how the sequence space is explored, we data mined public repositories to identify 220 bioprojects with a combined seven billion reads. Of these, we created a subset of human bioprojects that we make available as the AbNGS database (https://naturalantibody.com/ngs/). AbNGS contains 135 bioprojects with four billion productive human heavy variable region sequences and 385 million unique complementarity-determining region (CDR)-H3s. We find that 270,000 (0.07% of 385 million) unique CDR-H3s are highly public in that they occur in at least five of 135 bioprojects. Of 700 unique therapeutic CDR-H3, a total of 6% has direct matches in the small set of 270,000. This observation extends to a match between CDR-H3 and V-gene call as well. Thus, the subspace of shared ('public') CDR-H3s shows utility for serving as a starting point for therapeutic antibody design.
Affiliation(s)
- Lukas Schulte
  - Global Computational Biology & Digital Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riß, Germany
- Kyle Martin
  - Biotherapeutics Discovery, Boehringer Ingelheim, Ridgefield, CT, USA
- Stephen R. Comeau
  - Biotherapeutics Discovery, Boehringer Ingelheim, Ridgefield, CT, USA