1
Sahlin K, Baudeau T, Cazaux B, Marchet C. A survey of mapping algorithms in the long-reads era. Genome Biol 2023; 24:133. [PMID: 37264447 PMCID: PMC10236595 DOI: 10.1186/s13059-023-02972-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 08/03/2022] [Accepted: 05/12/2023] [Indexed: 06/03/2023] Open
Abstract
It has been over a decade since the first publication of a method dedicated entirely to mapping long reads. The distinctive characteristics of long reads led methods to move from the seed-and-extend framework used for short reads to a seed-and-chain framework, owing to the abundance of seeds in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings (http://bcazaux.polytech-lille.net/Minimap2/).
Affiliation(s)
- Kristoffer Sahlin
  - Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91, Stockholm, Sweden
- Thomas Baudeau
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
- Bastien Cazaux
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
- Camille Marchet
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
2
Nakamura R, Motai Y, Kumagai M, Wike CL, Nishiyama H, Nakatani Y, Durand NC, Kondo K, Kondo T, Tsukahara T, Shimada A, Cairns BR, Aiden EL, Morishita S, Takeda H. CTCF looping is established during gastrulation in medaka embryos. Genome Res 2021; 31:968-980. [PMID: 34006570 PMCID: PMC8168583 DOI: 10.1101/gr.269951.120] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 08/06/2020] [Accepted: 03/30/2021] [Indexed: 12/23/2022]
Abstract
Chromatin looping plays an important role in genome regulation. However, because ChIP-seq and loop-resolution Hi-C (DNA-DNA proximity ligation) are extremely challenging in mammalian early embryos, the developmental stage at which cohesin-mediated loops form remains unknown. Here, we study early development in medaka (the Japanese killifish, Oryzias latipes) at 12 time points before, during, and after gastrulation (the onset of cell differentiation) and characterize transcription, protein binding, and genome architecture. We find that gastrulation is associated with drastic changes in genome architecture, including the formation of the first loops between sites bound by the insulator protein CTCF and a large increase in the size of contact domains. In contrast, CTCF binding is fixed throughout embryogenesis. Loops form long after genome-wide transcriptional activation, and long after the domain formation seen in mouse embryos. These results suggest that, although loops may play a role in differentiation, they are not required for zygotic transcription. When we repeated our experiments in zebrafish, loops did not emerge until gastrulation, that is, well after zygotic genome activation. We observe that loop positions are highly conserved in synteny blocks of medaka and zebrafish, indicating that the 3D genome architecture has been maintained for >110–200 million years of evolution.
Affiliation(s)
- Ryohei Nakamura
  - Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo 113-0033, Japan
- Yuichi Motai
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa 277-8562, Japan
- Masahiko Kumagai
  - Advanced Analysis Center, National Agriculture and Food Research Organization, Tsukuba, Ibaraki 305-8602, Japan
- Candice L Wike
  - Howard Hughes Medical Institute, Department of Oncological Sciences and Huntsman Cancer Institute, University of Utah School of Medicine, Salt Lake City, Utah 84112, USA
- Haruyo Nishiyama
  - Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo 113-0033, Japan
- Yoichiro Nakatani
  - Department of Cancer Genome Informatics, Graduate School of Medicine, Osaka University, Osaka 565-0871, Japan
- Neva C Durand
  - The Center for Genome Architecture, Baylor College of Medicine, Houston, Texas 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
  - Department of Computer Science, Department of Computational and Applied Mathematics, Rice University, Houston, Texas 77005, USA
  - Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas 77030, USA
- Kaori Kondo
  - RIKEN-IMS, Laboratory for Developmental Genetics, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
- Takashi Kondo
  - RIKEN-IMS, Laboratory for Developmental Genetics, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
- Tatsuya Tsukahara
  - Department of Neurobiology, Harvard Medical School, Boston, Massachusetts 02115, USA
- Atsuko Shimada
  - Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo 113-0033, Japan
- Bradley R Cairns
  - Howard Hughes Medical Institute, Department of Oncological Sciences and Huntsman Cancer Institute, University of Utah School of Medicine, Salt Lake City, Utah 84112, USA
- Erez Lieberman Aiden
  - The Center for Genome Architecture, Baylor College of Medicine, Houston, Texas 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
  - Department of Computer Science, Department of Computational and Applied Mathematics, Rice University, Houston, Texas 77005, USA
  - Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas 77030, USA
- Shinichi Morishita
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa 277-8562, Japan
- Hiroyuki Takeda
  - Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo 113-0033, Japan
3
Numanagić I, Gökkaya AS, Zhang L, Berger B, Alkan C, Hach F. Fast characterization of segmental duplications in genome assemblies. Bioinformatics 2018; 34:i706-i714. [PMID: 30423092 PMCID: PMC6129265 DOI: 10.1093/bioinformatics/bty586] [Citation(s) in RCA: 52] [Impact Index Per Article: 7.4] [Indexed: 01/07/2023] Open
Abstract
Motivation: Segmental duplications (SDs), or low-copy repeats, are segments of DNA > 1 Kbp with high sequence identity that are copied to other regions of the genome. SDs are among the most important sources of evolution and a common cause of genomic structural variation, and several are associated with diseases of genomic origin, including schizophrenia and autism. Despite their functional importance, SDs present one of the major hurdles for de novo genome assembly due to the ambiguity they cause in building and traversing both state-of-the-art overlap-layout-consensus and de Bruijn graphs. This causes SD regions to be misassembled, collapsed into a unique representation, or completely missing from assembled reference genomes for various organisms. In turn, this missing or incorrect information limits our ability to fully understand the evolution and the architecture of the genomes. Despite the essential need to accurately characterize SDs in assemblies, only one tool has been developed for this purpose, called Whole-Genome Assembly Comparison (WGAC); its primary goal is SD detection. WGAC comprises several steps that employ different tools and custom scripts, which makes this strategy difficult and time-consuming to use. Thus there is still a need for algorithms to characterize within-assembly SDs quickly, accurately, and in a user-friendly manner.
Results: Here we introduce SEgmental Duplication Evaluation Framework (SEDEF) to rapidly detect SDs through sophisticated filtering strategies based on Jaccard similarity and local chaining. We show that SEDEF accurately detects SDs while maintaining a substantial speedup over WGAC that translates into practical run times of minutes instead of weeks. Notably, our algorithm captures up to 25% 'pairwise error' between segments, whereas previous studies focused on only 10%, allowing us to more deeply track the evolutionary history of the genome.
Availability and implementation: SEDEF is available at https://github.com/vpc-ccg/sedef.
Affiliation(s)
- Ibrahim Numanagić
  - Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
- Alim S Gökkaya
  - Department of Computer Engineering, Bilkent University, Ankara, Turkey
- Lillian Zhang
  - Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
- Can Alkan
  - Department of Computer Engineering, Bilkent University, Ankara, Turkey
- Faraz Hach
  - Vancouver Prostate Centre, Vancouver, Canada
  - Department of Urologic Sciences, University of British Columbia, Vancouver, Canada
4
Canzar S, Salzberg SL. Short Read Mapping: An Algorithmic Tour. Proceedings of the IEEE. Institute of Electrical and Electronics Engineers 2017; 105:436-458. [PMID: 28502990 PMCID: PMC5425171 DOI: 10.1109/jproc.2015.2455551] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Indexed: 06/07/2023]
Abstract
Ultra-high-throughput next-generation sequencing (NGS) technology allows us to determine the sequence of nucleotides of many millions of DNA molecules in parallel. Accompanied by a dramatic reduction in cost since its introduction in 2004, NGS technology has provided a new way of addressing a wide range of biological and biomedical questions, from the study of human genetic disease to the analysis of gene expression, protein-DNA interactions, and patterns of DNA methylation. The data generated by NGS instruments comprise huge numbers of very short DNA sequences, or 'reads', that carry little information by themselves. These reads therefore have to be pieced together by well-engineered algorithms to reconstruct biologically meaningful measurements, such as the level of expression of a gene. To solve this complex, high-dimensional puzzle, reads must be mapped back to a reference genome to determine their origin. Due to sequencing errors and to genuine differences between the reference genome and the individual being sequenced, this mapping process must be tolerant of mismatches, insertions, and deletions. Although optimal alignment algorithms to solve this problem have long been available, the practical requirements of aligning hundreds of millions of short reads to the 3-billion-base-pair human genome have stimulated the development of new, more efficient methods, which today are used routinely throughout the world for the analysis of NGS data.
5
Abouelhoda MI, Kurtz S, Ohlebusch E. CoCoNUT: an efficient system for the comparison and analysis of genomes. BMC Bioinformatics 2008; 9:476. [PMID: 19014477 PMCID: PMC3224568 DOI: 10.1186/1471-2105-9-476] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Received: 05/18/2008] [Accepted: 11/12/2008] [Indexed: 11/10/2022] Open
Abstract
Background: Comparative genomics is the analysis and comparison of genomes from different species. This area of research is driven by the large number of sequenced genomes and relies heavily on efficient algorithms and software to perform pairwise and multiple genome comparisons. Results: Most of the software tools available are tailored for one specific task. In contrast, we have developed a novel system, CoCoNUT (Computational Comparative geNomics Utility Toolkit), that allows solving several different tasks in a unified framework: (1) finding regions of high similarity among multiple genomic sequences and aligning them, (2) comparing two draft or multi-chromosomal genomes, (3) locating large segmental duplications in large genomic sequences, and (4) mapping cDNA/EST to genomic sequences. Conclusion: CoCoNUT is competitive with other software tools with respect to the quality of the results. The use of state-of-the-art algorithms and data structures allows CoCoNUT to solve comparative genomics tasks more efficiently than previous tools. With the improved user interface (including an interactive visualization component), CoCoNUT provides a unified, versatile, and easy-to-use software tool for large-scale studies in comparative genomics.