1
|
Beck TL, Carloni P, Asthagiri DN. All-Atom Biomolecular Simulation in the Exascale Era. J Chem Theory Comput 2024; 20:1777-1782. [PMID: 38382017 DOI: 10.1021/acs.jctc.3c01276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/23/2024]
Abstract
Exascale supercomputers have opened the door to dynamic simulations, facilitated by AI/ML techniques, that model biomolecular motions over unprecedented length and time scales. This new capability holds the potential to revolutionize our understanding of fundamental biological processes. Here we report on some of the major advances that were discussed at a recent CECAM workshop in Pisa, Italy, on the topic with a primary focus on atomic-level simulations. First, we highlight examples of current large-scale biomolecular simulations and the future possibilities enabled by crossing the exascale threshold. Next, we discuss challenges to be overcome in optimizing the usage of these powerful resources. Finally, we close by listing several grand challenge problems that could be investigated with this new computer architecture.
Collapse
Affiliation(s)
- Thomas L Beck
- National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States
| | - Paolo Carloni
- INM-9/IAS-5 Computational Biomedicine, Forschungszentrum Jülich, Wilhelm-Johnen-Straße, D-54245 Jülich, Germany
- Department of Physics, RWTH Aachen University, D-52078 Aachen, Germany
| | - Dilipkumar N Asthagiri
- National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States
| |
Collapse
|
2
|
Ekim B, Sahlin K, Medvedev P, Berger B, Chikhi R. Efficient mapping of accurate long reads in minimizer space with mapquik. Genome Res 2023; 33:1188-1197. [PMID: 37399256 PMCID: PMC10538364 DOI: 10.1101/gr.277679.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2023] [Accepted: 06/26/2023] [Indexed: 07/05/2023]
Abstract
DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps-fundamental bottlenecks to read mapping-for both the human and maize genomes with [Formula: see text] sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a [Formula: see text] speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a [Formula: see text] speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled from not only minimizer-space seeding but also a novel heuristic [Formula: see text] pseudochaining algorithm, which improves upon the long-standing [Formula: see text] bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
Collapse
Affiliation(s)
- Bariş Ekim
- Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
| | - Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, SE-106 91 Stockholm, Sweden
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
| | - Rayan Chikhi
- Department of Computational Biology, Institut Pasteur, 75015 Paris, France
| |
Collapse
|
3
|
Berger B, Yu YW. Navigating bottlenecks and trade-offs in genomic data analysis. Nat Rev Genet 2023; 24:235-250. [PMID: 36476810 DOI: 10.1038/s41576-022-00551-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 10/27/2022] [Indexed: 12/12/2022]
Abstract
Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences as well as to study cell to cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that often the analytical pipelines are struggling to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever increasing importance. Recent methodological advances, such as data sketching, accelerators and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.
Collapse
Affiliation(s)
- Bonnie Berger
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
| | - Yun William Yu
- Department of Computer and Mathematical Sciences, University of Toronto Scarborough, Toronto, Ontario, Canada
- Tri-Campus Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
| |
Collapse
|
4
|
BioBlocksLab: A portable DIY Bio Lab using BioBlocks language. Methods 2023; 210:36-43. [PMID: 36641111 DOI: 10.1016/j.ymeth.2023.01.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2022] [Revised: 01/04/2023] [Accepted: 01/06/2023] [Indexed: 01/13/2023] Open
Abstract
Standard molecular biology laboratories are usually made with complex, sophisticated, and expensive equipment. Unfortunately, most of these labs are not affordable for everyone. In this paper, we show how we built a portable bio lab BioBlocksLab, made of four modules: a centrifuge, a thermocycler, electrophoresis, and an incubator. We also propose a new version of a blockly programming language to describe experimental lab protocols, called BioBlocks 2.0, which is based on the Microsoft MakeCode platform from the open-source project Microsoft Programming Experience Toolkit (PXT). We run BioBlocks programs of real lab protocols to control different hardware modules with biological reagents and get positive results. We offer an easy, affordable, and open-source way for everyone to do experiments with Do-It-Yourself (DIY) portable bio-labs.
Collapse
|
5
|
Sequre: a high-performance framework for secure multiparty computation enables biomedical data sharing. Genome Biol 2023; 24:5. [PMID: 36631897 PMCID: PMC9832703 DOI: 10.1186/s13059-022-02841-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2022] [Accepted: 12/21/2022] [Indexed: 01/12/2023] Open
Abstract
Secure multiparty computation (MPC) is a cryptographic tool that allows computation on top of sensitive biomedical data without revealing private information to the involved entities. Here, we introduce Sequre, an easy-to-use, high-performance framework for developing performant MPC applications. Sequre offers a set of automatic compile-time optimizations that significantly improve the performance of MPC applications and incorporates the syntax of Python programming language to facilitate rapid application development. We demonstrate its usability and performance on various bioinformatics tasks showing up to 3-4 times increased speed over the existing pipelines with 7-fold reductions in codebase sizes.
Collapse
|
6
|
Smajlović H, Shajii A, Berger B, Cho H, Numanagić I. Sequre: a high-performance framework for rapid development of secure bioinformatics pipelines. IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, WORKSHOPS AND PHD FORUM : [PROCEEDINGS]. IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, WORKSHOPS AND PHD FORUM 2022; 2022:164-165. [PMID: 35958356 PMCID: PMC9364365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Affiliation(s)
| | | | | | - Hyunghoon Cho
- Broad Institute of MIT and Harvard, Massachusetts, USA
| | | |
Collapse
|
7
|
Išerić H, Alkan C, Hach F, Numanagić I. Fast characterization of segmental duplication structure in multiple genome assemblies. Algorithms Mol Biol 2022; 17:4. [PMID: 35303886 PMCID: PMC8932185 DOI: 10.1186/s13015-022-00210-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2021] [Accepted: 02/08/2022] [Indexed: 11/29/2022] Open
Abstract
MOTIVATION The increasing availability of high-quality genome assemblies raised interest in the characterization of genomic architecture. Major architectural elements, such as common repeats and segmental duplications (SDs), increase genome plasticity that stimulates further evolution by changing the genomic structure and inventing new genes. Optimal computation of SDs within a genome requires quadratic-time local alignment algorithms that are impractical due to the size of most genomes. Additionally, to perform evolutionary analysis, one needs to characterize SDs in multiple genomes and find relations between those SDs and unique (non-duplicated) segments in other genomes. A naïve approach consisting of multiple sequence alignment would make the optimal solution to this problem even more impractical. Thus there is a need for fast and accurate algorithms to characterize SD structure in multiple genome assemblies to better understand the evolutionary forces that shaped the genomes of today. RESULTS Here we introduce a new approach, BISER, to quickly detect SDs in multiple genomes and identify elementary SDs and core duplicons that drive the formation of such SDs. BISER improves earlier tools by (i) scaling the detection of SDs with low homology to multiple genomes while introducing further 7-33[Formula: see text] speed-ups over the existing tools, and by (ii) characterizing elementary SDs and detecting core duplicons to help trace the evolutionary history of duplications to as far as 300 million years. AVAILABILITY AND IMPLEMENTATION BISER is implemented in Seq programming language and is publicly available at https://github.com/0xTCG/biser .
Collapse
Affiliation(s)
- Hamza Išerić
- Department of Computer Science, University of Victoria, Victoria, BC, V8P 5C2, Canada
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, 06800, Ankara, Turkey
| | - Faraz Hach
- Vancouver Prostate Centre, Vancouver, BC, V6H 3Z6, Canada
- Department of Urologic Sciences, University of British Columbia, Vancouver, BC, V5Z 1M9, Canada
| | - Ibrahim Numanagić
- Department of Computer Science, University of Victoria, Victoria, BC, V8P 5C2, Canada.
| |
Collapse
|
8
|
Ekim B, Berger B, Chikhi R. Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. Cell Syst 2021; 12:958-968.e6. [PMID: 34525345 PMCID: PMC8562525 DOI: 10.1016/j.cels.2021.08.009] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2021] [Revised: 08/01/2021] [Accepted: 08/19/2021] [Indexed: 10/20/2022]
Abstract
DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. Here, we define an algorithmic approach, mdBG, that makes use of minimizer-space de Bruijn graphs to enable long-read genome assembly. mdBG achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without compromising accuracy. A human genome is assembled in under 10 min using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 min using 1 GB RAM. In addition, we constructed a minimizer-space de Bruijn graph-based representation of 661,405 bacterial genomes, comprising 16 million nodes and 45 million edges, and successfully search it for anti-microbial resistance (AMR) genes in 12 min. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics, and pangenomics. Code for constructing mdBGs is freely available for download at https://github.com/ekimb/rust-mdbg/.
Collapse
Affiliation(s)
- Barış Ekim
- Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA; Department of Mathematics, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA; Department of Mathematics, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA.
| | - Rayan Chikhi
- Department of Computational Biology, Institut Pasteur, Paris 75015, France.
| |
Collapse
|