1
Tanade C, Khan NS, Rakestraw E, Ladd WD, Draeger EW, Randles A. Establishing the longitudinal hemodynamic mapping framework for wearable-driven coronary digital twins. NPJ Digit Med 2024; 7:236. [PMID: 39242829 PMCID: PMC11379815 DOI: 10.1038/s41746-024-01216-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2023] [Accepted: 08/05/2024] [Indexed: 09/09/2024] Open
Abstract
Understanding the evolving nature of coronary hemodynamics is crucial for early disease detection and monitoring progression. We require digital twins that mimic a patient's circulatory system by integrating continuous physiological data and computing hemodynamic patterns over months. Current models match clinical flow measurements but are limited to single heartbeats. To this end, we introduced the longitudinal hemodynamic mapping framework (LHMF), designed to tackle critical challenges: (1) computational intractability of explicit methods; (2) boundary conditions reflecting varying activity states; and (3) accessible computing resources for clinical translation. We show negligible error (0.0002-0.004%) between LHMF and explicit data of 750 heartbeats. We deployed LHMF across traditional and cloud-based platforms, demonstrating high-throughput simulations on heterogeneous systems. Additionally, we established LHMFC, where hemodynamically similar heartbeats are clustered to avoid redundant simulations, accurately reconstructing longitudinal hemodynamic maps (LHMs). This study captured 3D hemodynamics over 4.5 million heartbeats, paving the way for cardiovascular digital twins.
Affiliation(s)
- Cyrus Tanade
- Department of Biomedical Engineering, Duke University, Durham, NC, 27708, USA
- Nusrat Sadia Khan
- Department of Biomedical Engineering, Duke University, Durham, NC, 27708, USA
- Emily Rakestraw
- Department of Biomedical Engineering, Duke University, Durham, NC, 27708, USA
- William D Ladd
- Department of Biomedical Engineering, Duke University, Durham, NC, 27708, USA
- Erik W Draeger
- Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA, 94550, USA
- Amanda Randles
- Department of Biomedical Engineering, Duke University, Durham, NC, 27708, USA.
2
Hadar N, Dolgin V, Oustinov K, Yogev Y, Poleg T, Safran A, Freund O, Agam N, Jean MM, Proskorovski-Ohayon R, Wormser O, Drabkin M, Halperin D, Eskin-Schwartz M, Narkis G, Sued-Hendrickson S, Aminov I, Gombosh M, Aharoni S, Birk OS. VARista: a free web platform for streamlined whole-genome variant analysis across T2T, hg38, and hg19. Hum Genet 2024; 143:695-701. [PMID: 38607411 DOI: 10.1007/s00439-024-02671-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2023] [Accepted: 03/24/2024] [Indexed: 04/13/2024]
Abstract
With the increasing importance of genomic data in understanding genetic diseases, there is an essential need for efficient and user-friendly tools that simplify variant analysis. Although multiple tools exist, many present barriers such as steep learning curves, limited reference genome compatibility, or costs. We developed VARista, a free web-based tool, to address these challenges and provide a streamlined solution for researchers, particularly those focusing on rare monogenic diseases. VARista offers a user-centric interface that eliminates much of the technical complexity typically associated with variant analysis. The tool directly supports VCF files generated using reference genomes hg19, hg38, and the emerging T2T, with seamless remapping capabilities between them. Features such as gene summaries and links, tissue and cell-specific gene expression data for both adults and fetuses, as well as automated PCR design and integration with tools such as SpliceAI and AlphaMissense, enable users to focus on the biology and the case itself. As we demonstrate, VARista proved effective in narrowing down potential disease-causing variants, prioritizing them effectively, and providing meaningful biological context, facilitating rapid decision-making. VARista stands out as a freely available and comprehensive tool that consolidates various aspects of variant analysis into a single platform that embraces the forefront of genomic advancements. Its design inherently supports a shift in focus from technicalities to critical thinking, thereby promoting better-informed decisions in genetic disease research. Given its unique capabilities and user-centric design, VARista has the potential to become an essential asset for the genomic research community. https://VARista.link.
Affiliation(s)
- Noam Hadar
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Vadim Dolgin
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Katya Oustinov
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Yuval Yogev
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Tomer Poleg
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Amit Safran
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Ofek Freund
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Nadav Agam
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Matan M Jean
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Regina Proskorovski-Ohayon
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Ohad Wormser
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Max Drabkin
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Daniel Halperin
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Marina Eskin-Schwartz
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Genetics Institute, Soroka University Medical Center, Beer-Sheva, Israel
- Ginat Narkis
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Genetics Institute, Soroka University Medical Center, Beer-Sheva, Israel
- Sufa Sued-Hendrickson
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Ilana Aminov
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Maya Gombosh
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Sarit Aharoni
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel
- Ohad S Birk
- The Morris Kahn Laboratory of Human Genetics at the National Institute of Biotechnology in the Negev and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer Sheva, Israel.
- Genetics Institute, Soroka University Medical Center, Beer-Sheva, Israel.
3
Morishita S. A whole-genome shotgun approach to human reference genome sequencing. Nat Rev Genet 2024; 25:236. [PMID: 38307946 DOI: 10.1038/s41576-024-00703-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2024]
Affiliation(s)
- Shinichi Morishita
- Department of Computational Biology and Medical Sciences, University of Tokyo, Tokyo, Japan.
4
Venkadesh S, Santarelli A, Boesen T, Dong HW, Ascoli GA. Combinatorial quantification of distinct neural projections from retrograde tracing. Nat Commun 2023; 14:7271. [PMID: 37949860 PMCID: PMC10638408 DOI: 10.1038/s41467-023-43124-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2023] [Accepted: 11/01/2023] [Indexed: 11/12/2023] Open
Abstract
Comprehensive quantification of neuronal architectures underlying anatomical brain connectivity remains challenging. We introduce a method to identify distinct axonal projection patterns from a source to a set of target regions and the count of neurons with each pattern. A source region projecting to n targets could have 2^n - 1 theoretically possible projection types, although only a subset of these types typically exists. By injecting uniquely labeled retrograde tracers in k target regions (k < n), one can experimentally count the cells expressing different color combinations in the source region. The neuronal counts for different color combinations from n-choose-k experiments provide constraints for a model that is robustly solvable using evolutionary algorithms. Here, we demonstrate this method's reliability for 4 targets using simulated triple injection experiments. Furthermore, we illustrate the experimental application of this framework by quantifying the projections of male mouse primary motor cortex to the primary and secondary somatosensory and motor cortices.
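The counting argument in this abstract can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the target labels and the helper names (`projection_types`, `injection_experiments`, `observed_colors`) are hypothetical, and the sketch assumes that a neuron shows a tracer color exactly when it projects to the target where that tracer was injected.

```python
from itertools import combinations

def projection_types(targets):
    """Every non-empty subset of targets: 2**n - 1 candidate projection types."""
    return [frozenset(c)
            for r in range(1, len(targets) + 1)
            for c in combinations(targets, r)]

def injection_experiments(targets, k):
    """Every way to choose k of the n targets for tracer injection (n-choose-k)."""
    return [frozenset(c) for c in combinations(targets, k)]

def observed_colors(projection, injected):
    """Colors seen in a neuron: only the injected targets it projects to
    (assumption made for this sketch)."""
    return projection & injected

# Hypothetical labels for n = 4 targets and triple injections (k = 3).
targets = ["A", "B", "C", "D"]
types = projection_types(targets)
experiments = injection_experiments(targets, 3)

print(len(types))        # 15 candidate projection types
print(len(experiments))  # 4 triple-injection experiments
```

Each experiment only reveals the intersection of a neuron's projection pattern with the injected targets, which is why counts from several experiments are needed to constrain the full set of projection types.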
Affiliation(s)
- Siva Venkadesh
- Interdisciplinary Program in Neuroscience, George Mason University, Fairfax, VA, 22030, USA
- Center for Neural Informatics, Structures, and Plasticity, George Mason University, Fairfax, VA, 22030, USA
- Anthony Santarelli
- UCLA Brain Research & Artificial Intelligence Nexus, Department of Neurobiology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90089, USA
- Tyler Boesen
- UCLA Brain Research & Artificial Intelligence Nexus, Department of Neurobiology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90089, USA
- Hong-Wei Dong
- UCLA Brain Research & Artificial Intelligence Nexus, Department of Neurobiology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90089, USA
- Giorgio A Ascoli
- Interdisciplinary Program in Neuroscience, George Mason University, Fairfax, VA, 22030, USA.
- Center for Neural Informatics, Structures, and Plasticity, George Mason University, Fairfax, VA, 22030, USA.
5
Tanade C, Rakestraw E, Ladd W, Draeger E, Randles A. Cloud Computing to Enable Wearable-Driven Longitudinal Hemodynamic Maps. INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS : [PROCEEDINGS]. SC (CONFERENCE : SUPERCOMPUTING) 2023; 2023:82. [PMID: 38939612 PMCID: PMC11210499 DOI: 10.1145/3581784.3607101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024]
Abstract
Tracking hemodynamic responses to treatment and stimuli over long periods remains a grand challenge. Moving from established single-heartbeat technology to longitudinal profiles would require continuous data describing how the patient's state evolves, new methods to extend the temporal domain over which flow is sampled, and high-throughput computing resources. While personalized digital twins can accurately measure 3D hemodynamics over several heartbeats, state-of-the-art methods would require hundreds of years of wallclock time on leadership scale systems to simulate one day of activity. To address these challenges, we propose a cloud-based, parallel-in-time framework leveraging continuous data from wearable devices to capture the first 3D patient-specific, longitudinal hemodynamic maps. We demonstrate the validity of our method by establishing ground truth data for 750 beats and comparing the results. Our cloud-based framework is based on an initial fixed set of simulations to enable the wearable-informed creation of personalized longitudinal hemodynamic maps.
Affiliation(s)
- Erik Draeger
- Lawrence Livermore National Lab, Livermore, CA, USA
6
Han X, Guo S, Ji N, Li T, Liu J, Ye X, Wang Y, Yun Z, Xiong F, Rong J, Liu D, Ma H, Wang Y, Huang Y, Zhang P, Wu W, Ding L, Hawrylycz M, Lein E, Ascoli GA, Xie W, Liu L, Zhang L, Peng H. Whole human-brain mapping of single cortical neurons for profiling morphological diversity and stereotypy. SCIENCE ADVANCES 2023; 9:eadf3771. [PMID: 37824619 PMCID: PMC10569712 DOI: 10.1126/sciadv.adf3771] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/29/2022] [Accepted: 04/18/2023] [Indexed: 10/14/2023]
Abstract
Quantifying neuron morphology and distribution at the whole-brain scale is essential to understand the structure and diversity of cell types. It is exceedingly challenging to reuse recent technologies of single-cell labeling and whole-brain imaging to study human brains. We propose adaptive cell tomography (ACTomography), a low-cost, high-throughput, and high-efficacy tomography approach, based on adaptive targeting of individual cells. We established a platform to inject dyes into cortical neurons in surgical tissues of 18 patients with brain tumors or other conditions and one donated fresh postmortem brain. We collected three-dimensional images of 1746 cortical neurons, of which 852 neurons were reconstructed to quantify local dendritic morphology, and mapped to standard atlases. In our data, human neurons are more diverse across brain regions than by subject age or gender. The strong stereotypy within cohorts of brain regions allows generating a statistical tensor field of neuron morphology to characterize anatomical modularity of a human brain.
Affiliation(s)
- Xiaofeng Han
- Institute for Brain and Intelligence, Southeast University, Nanjing, China
- Shuxia Guo
- Institute for Brain and Intelligence, Southeast University, Nanjing, China
- Nan Ji
- Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Beijing Key Laboratory of Brain Tumor, Beijing, China
- Tian Li
- Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Jian Liu
- Institute for Brain and Intelligence, Southeast University, Nanjing, China
- Xiangqiao Ye
- Institute for Brain and Intelligence, Southeast University, Nanjing, China
- Yi Wang
- Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Zhixi Yun
- Institute for Brain and Intelligence, Southeast University, Nanjing, China
- Feng Xiong
- Institute for Brain and Intelligence, Southeast University, Nanjing, China
- Jing Rong
- Institute for Brain and Intelligence, Southeast University, Nanjing, China
- Di Liu
- Institute for Brain and Intelligence, Southeast University, Nanjing, China
- Hui Ma
- Institute for Brain and Intelligence, Southeast University, Nanjing, China
- Yujin Wang
- Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Yue Huang
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Peng Zhang
- Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Wenhao Wu
- Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Liya Ding
- Institute for Brain and Intelligence, Southeast University, Nanjing, China
- Ed Lein
- Allen Institute for Brain Science, Seattle, WA, USA
- Giorgio A. Ascoli
- Center for Neural Informatics, Krasnow Institute for Advanced Studies and Bioengineering Department, College of Engineering and Computing, George Mason University, Fairfax, VA, USA
- Wei Xie
- Institute for Brain and Intelligence, Southeast University, Nanjing, China
- The Key Laboratory of Developmental Genes and Human Disease, Ministry of Education, School of Life Science and Technology, Southeast University, Nanjing, China
- Lijuan Liu
- Institute for Brain and Intelligence, Southeast University, Nanjing, China
- Liwei Zhang
- Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- China National Clinical Research Center for Neurological Diseases, Beijing, China
- Beijing Key Laboratory of Brain Tumor, Beijing, China
- Hanchuan Peng
- Institute for Brain and Intelligence, Southeast University, Nanjing, China
7
LeBlanc P, Ma L. Microbiome subcommunity learning with logistic-tree normal latent Dirichlet allocation. Biometrics 2023; 79:2321-2332. [PMID: 36222326 PMCID: PMC10090221 DOI: 10.1111/biom.13772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2021] [Accepted: 09/26/2022] [Indexed: 11/28/2022]
Abstract
Mixed-membership (MM) models such as latent Dirichlet allocation (LDA) have been applied to microbiome compositional data to identify latent subcommunities of microbial species. These subcommunities are informative for understanding the biological interplay of microbes and for predicting health outcomes. However, microbiome compositions typically display substantial cross-sample heterogeneities in subcommunity compositions (that is, variability in the proportions of microbes in shared subcommunities across samples), which are not accounted for in prior analyses. As a result, LDA can produce inference that is highly sensitive to the specification of the number of subcommunities and often divides a single subcommunity into multiple artificial ones. To address this limitation, we incorporate the logistic-tree normal (LTN) model into LDA to form a new MM model. This model allows cross-sample variation in the composition of each subcommunity around some "centroid" composition that defines the subcommunity. Incorporation of auxiliary Pólya-Gamma variables enables a computationally efficient collapsed blocked Gibbs sampler to carry out Bayesian inference under this model. By accounting for such heterogeneity, our new model restores the robustness of the inference to the specification of the number of subcommunities and allows meaningful subcommunities to be identified.
Affiliation(s)
- Patrick LeBlanc
- Department of Statistical Sciences, Duke University, Durham, North Carolina, USA
- Li Ma
- Department of Statistical Sciences, Duke University, Durham, North Carolina, USA
- Department of Biostatistics and Bioinformatics, Duke University Medical School, Durham, North Carolina, USA
8
Schiml VC, Delogu F, Kumar P, Kunath B, Batut B, Mehta S, Johnson JE, Grüning B, Pope PB, Jagtap PD, Griffin TJ, Arntzen MØ. Integrative meta-omics in Galaxy and beyond. ENVIRONMENTAL MICROBIOME 2023; 18:56. [PMID: 37420292 PMCID: PMC10329324 DOI: 10.1186/s40793-023-00514-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Accepted: 07/05/2023] [Indexed: 07/09/2023]
Abstract
BACKGROUND: 'Omics methods have empowered scientists to tackle the complexity of microbial communities on a scale not attainable before. Individually, omics analyses can provide great insight; while combined as "meta-omics", they enhance the understanding of which organisms occupy specific metabolic niches, how they interact, and how they utilize environmental nutrients. Here we present three integrative meta-omics workflows, developed in Galaxy, for enhanced analysis and integration of metagenomics, metatranscriptomics, and metaproteomics, combined with our newly developed web-application, ViMO (Visualizer for Meta-Omics) to analyse metabolisms in complex microbial communities. RESULTS: In this study, we applied the workflows on a highly efficient cellulose-degrading minimal consortium enriched from a biogas reactor to analyse the key roles of uncultured microorganisms in complex biomass degradation processes. Metagenomic analysis recovered metagenome-assembled genomes (MAGs) for several constituent populations including Hungateiclostridium thermocellum, Thermoclostridium stercorarium and multiple heterogenic strains affiliated to Coprothermobacter proteolyticus. The metagenomics workflow was developed as two modules, one standard, and one optimized for improving the MAG quality in complex samples by implementing a combination of single- and co-assembly, and dereplication after binning. The exploration of the active pathways within the recovered MAGs can be visualized in ViMO, which also provides an overview of the MAG taxonomy and quality (contamination and completeness), and information about carbohydrate-active enzymes (CAZymes), as well as KEGG annotations and pathways, with counts and abundances at both mRNA and protein level. To achieve this, the metatranscriptomic reads and metaproteomic mass-spectrometry spectra are mapped onto predicted genes from the metagenome to analyse the functional potential of MAGs, as well as the actual expressed proteins and functions of the microbiome, all visualized in ViMO. CONCLUSION: Our three workflows for integrative meta-omics in combination with ViMO present a progression in the analysis of 'omics data, particularly within Galaxy, but also beyond. The optimized metagenomics workflow allows for detailed reconstruction of microbial communities consisting of MAGs with high quality, and thus improves analyses of the metabolism of the microbiome, using the metatranscriptomics and metaproteomics workflows.
Affiliation(s)
- Valerie C Schiml
- Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences (NMBU), P.O. Box 5003, 1432, Ås, Norway
- Francesco Delogu
- Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences (NMBU), P.O. Box 5003, 1432, Ås, Norway
- Praveen Kumar
- Department of Biochemistry, Biophysics and Molecular Biology, University of Minnesota, Minneapolis, MN, 55455, USA
- Benoit Kunath
- Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences (NMBU), P.O. Box 5003, 1432, Ås, Norway
- Bérénice Batut
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Subina Mehta
- Department of Biochemistry, Biophysics and Molecular Biology, University of Minnesota, Minneapolis, MN, 55455, USA
- James E Johnson
- Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN, 55455, USA
- Björn Grüning
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Phillip B Pope
- Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences (NMBU), P.O. Box 5003, 1432, Ås, Norway
- Faculty of Biosciences, Norwegian University of Life Sciences (NMBU), P.O. Box 5003, 1432, Ås, Norway
- Pratik D Jagtap
- Department of Biochemistry, Biophysics and Molecular Biology, University of Minnesota, Minneapolis, MN, 55455, USA
- Timothy J Griffin
- Department of Biochemistry, Biophysics and Molecular Biology, University of Minnesota, Minneapolis, MN, 55455, USA
- Magnus Ø Arntzen
- Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences (NMBU), P.O. Box 5003, 1432, Ås, Norway.
9
Hadar N, Narkis G, Amar S, Varnavsky M, Palti GC, Safran A, Birk OS. STRavinsky STR database and PGTailor PGT tool demonstrate superiority of CHM13-T2T over hg38 and hg19 for STR-based applications. Eur J Hum Genet 2023; 31:738-743. [PMID: 37055538 PMCID: PMC10325972 DOI: 10.1038/s41431-023-01352-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2023] [Revised: 03/18/2023] [Accepted: 03/23/2023] [Indexed: 04/15/2023] Open
Abstract
Short-Tandem-Repeats (STRs) have long been studied for possible roles in biological phenomena, and are utilized in multiple applications such as forensics, evolutionary studies and pre-implantation-genetic-testing (PGT). The two reference genomes most used by clinicians and researchers are GRCh37/hg19 and GRCh38/hg38, both constructed using mainly short-read-sequencing (SRS) in which all-STR-containing-reads cannot be assembled to the reference genome. With the introduction of long-read-sequencing (LRS) methods and the generation of the CHM13 reference genome, also known as T2T, many previously unmapped STRs were finally localized within the human genome. We generated STRavinsky, a compact STR database for three reference genomes, including T2T. We proceeded to demonstrate the advantages of T2T over hg19 and hg38, identifying nearly double the number of STRs throughout all chromosomes. Through STRavinsky, providing a resolution down to a specific genomic coordinate, we demonstrated extreme propensity of TGGAA repeats in p arms of acrocentric chromosomes, substantially corroborating early molecular studies suggesting a possible role in formation of Robertsonian translocations. Moreover, we delineated unique propensity of TGGAA repeats specifically in chromosome 16q11.2 and in 9q12. Finally, we harness the superior capabilities of T2T and STRavinsky to generate PGTailor, a novel web application dramatically facilitating design of STR-based PGT tests in mere minutes.
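As a rough illustration of what "resolution down to a specific genomic coordinate" means for a repeat such as TGGAA, the sketch below scans a sequence for uninterrupted runs of one motif. It is not how STRavinsky was built: the function name `find_motif_runs`, the `min_copies` threshold, and the toy sequence are all assumptions made for this example, and only perfect, uninterrupted repeats are reported.

```python
import re

def find_motif_runs(sequence, motif, min_copies=3):
    """Return (start, end, copies) for each uninterrupted run of `motif`
    repeated at least `min_copies` times; coordinates are 0-based, end-exclusive."""
    pattern = re.compile(f"(?:{re.escape(motif)}){{{min_copies},}}")
    return [(m.start(), m.end(), (m.end() - m.start()) // len(motif))
            for m in pattern.finditer(sequence)]

# Toy sequence: four copies of TGGAA, a spacer, then only two copies.
toy = "ACGT" + "TGGAA" * 4 + "CCC" + "TGGAA" * 2
print(find_motif_runs(toy, "TGGAA"))  # [(4, 24, 4)]
```

The two-copy stretch at the end is skipped because it falls below `min_copies`; lowering the threshold would report it as a second run.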
Affiliation(s)
- Noam Hadar
- Morris Kahn Laboratory of Human Genetics, NIBN and Faculty of Health Sciences, Ben Gurion University of the Negev, Beer Sheva, Israel
- Ginat Narkis
- Morris Kahn Laboratory of Human Genetics, NIBN and Faculty of Health Sciences, Ben Gurion University of the Negev, Beer Sheva, Israel
- Genetics Institute, Soroka Medical Center, Beer Sheva, Israel
- Shirly Amar
- Genetics Institute, Soroka Medical Center, Beer Sheva, Israel
- Amit Safran
- Morris Kahn Laboratory of Human Genetics, NIBN and Faculty of Health Sciences, Ben Gurion University of the Negev, Beer Sheva, Israel
- Ohad S Birk
- Morris Kahn Laboratory of Human Genetics, NIBN and Faculty of Health Sciences, Ben Gurion University of the Negev, Beer Sheva, Israel.
- Genetics Institute, Soroka Medical Center, Beer Sheva, Israel.
10
Camilli A. De Novo Genome Sequencing, Annotation, and Taxonomy of Unknown Bacteria. Cold Spring Harb Protoc 2023; 2023:1-3. [PMID: 36283838 PMCID: PMC10586727 DOI: 10.1101/pdb.top107847] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2023]
Abstract
Whole-genome sequencing of viruses and bacteria has become routine thanks to advances in DNA-sequencing technologies. Parallel advances in computing power and software design allow for billions of base pairs of sequence information to be analyzed in hours to minutes. Here, I describe methods to isolate known as well as new species of bacteria from the environment; to purify, sequence, assemble, and bioinformatically annotate their genomes; and to determine their place in the tree of life by phylogenetic analysis. The protocol introduced here was developed as part of Cold Spring Harbor's Advanced Bacterial Genetics course.
Affiliation(s)
- Andrew Camilli
- Department of Molecular Biology and Microbiology, Tufts University, School of Medicine, Boston, Massachusetts 02111, USA
11
Abstract
A near-complete sequence outlines a path for a more inclusive reference.
12
The Vista of Application of Specific Anaphylaxis Accurate Diagnosis Based on DNA Single-Nucleotide Methylation Sites. CONTRAST MEDIA & MOLECULAR IMAGING 2021; 2021:8202068. [PMID: 34908915 PMCID: PMC8635942 DOI: 10.1155/2021/8202068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2021] [Revised: 10/21/2021] [Accepted: 10/29/2021] [Indexed: 11/29/2022]
Abstract
Anaphylaxis has rapidly spread around the world in the last several decades. Environmental factors seem to play a major role, and epigenetic marks, especially DNA methylation, are receiving more attention. We discuss several classifications of open GEO data by machine learning, using the top 100 specific methylation region values (normalized M-values on line), which classify specific anaphylaxis after monoallergen exposure remarkably well. We then sequenced the whole-genome DNA methylation of six people (3 patients with wormwood monoallergen atopic rhinitis and 3 people with normal immunity) during the pollen season and analyzed the differences at the single-nucleotide and DNA-region levels. The results diverged clearly (the differential single nucleotides were mostly distributed in nongene regions but the differential DNA regions of GWAS, on the other hand), which may have caused most single nucleotides to be concealed in the regions' sequences. Therefore, we suggest a more “pragmatic” approach that directly finds specific single-nucleotide changes after exposure to atopic allergens instead of complex correlativity. It may be possible to use DNA methylation marks to accurately diagnose anaphylaxis and to build a machine learning classification based on single methylated CpGs.
|
13
|
Rahman A, Pachter L. SWALO: scaffolding with assembly likelihood optimization. Nucleic Acids Res 2021; 49:e117. [PMID: 34417615 PMCID: PMC8599790 DOI: 10.1093/nar/gkab717] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/22/2020] [Revised: 06/16/2021] [Accepted: 08/16/2021] [Indexed: 01/01/2023] Open
Abstract
Scaffolding, i.e. ordering and orienting contigs, is an important step in genome assembly. We present a method for scaffolding using second generation sequencing reads based on likelihoods of genome assemblies. A generative model for sequencing is used to obtain maximum likelihood estimates of gaps between contigs and to estimate whether linking contigs into scaffolds would lead to an increase in the likelihood of the assembly. We then link contigs if they can be unambiguously joined or if the corresponding increase in likelihood is substantially greater than that of other possible joins of those contigs. The method is implemented in a tool called Swalo with approximations to make it efficient and applicable to large datasets. Analysis on real and simulated datasets reveals that it consistently makes more correct joins than, or a similar number of correct joins to, other scaffolders while linking very few contigs incorrectly, thus outperforming other scaffolders and demonstrating that substantial improvement in genome assembly may be achieved through the use of statistical models. Swalo is freely available for download at https://atifrahman.github.io/SWALO/.
Affiliation(s)
- Atif Rahman
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA.,Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - Lior Pachter
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA.,Departments of Mathematics and Molecular & Cell Biology, University of California, Berkeley, CA 94720, USA.,Departments of Biology and Computing & Mathematical Sciences, California Institute of Technology, Pasadena, CA 91103, USA
| |
|
14
|
Berger B, Waterman MS, Yu YW. Levenshtein Distance, Sequence Comparison and Biological Database Search. IEEE Transactions on Information Theory 2021; 67:3287-3294. [PMID: 34257466 PMCID: PMC8274556 DOI: 10.1109/tit.2020.2996543] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Indexed: 06/12/2023]
Abstract
Levenshtein edit distance has played a central role, both past and present, in sequence alignment in particular and biological database similarity search in general. We start our review with a history of dynamic programming algorithms for computing Levenshtein distance and sequence alignments. We then describe how those algorithms led to heuristics employed in the most widely used software in bioinformatics, BLAST, a program to search DNA and protein databases for evolutionarily relevant similarities. More recently, the advent of modern genomic sequencing and the volume of data it generates have resulted in a return to the problem of local alignment. We conclude with how the mathematical formulation of Levenshtein distance as a metric made possible additional optimizations to similarity search in biological contexts. These modern optimizations are built around the low metric entropy and fractional dimensionality of biological databases, enabling orders of magnitude acceleration of biological similarity search.
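As an illustration of the dynamic programming approach this abstract refers to, the following is a minimal sketch of the classic two-row recurrence for Levenshtein distance. It is not taken from the review; the function name `levenshtein` and the unit costs for insertion, deletion, and substitution are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b with unit-cost insert, delete, substitute."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b; row 0 is the cost of j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string is i deletions
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Keeping only two rows reduces memory to O(len(b)) while the running time stays O(len(a) * len(b)); recovering the alignment itself would require keeping the full table or a traceback.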
Affiliation(s)
- Bonnie Berger
- Department of Mathematics and Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, and also with the Department of Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
| | - Michael S Waterman
- Quantitative and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089 USA
| | - Yun William Yu
- Department of Mathematics, University of Toronto, Toronto, ON M5S 2E4, Canada, and also with the Department of Computer and Mathematical Sciences, University of Toronto at Scarborough, Toronto, ON M1C 1A4, Canada
| |
|
15
|
Gómez-Muñoz C, García-Ortega LF, Montalvo-Arredondo J, Pérez-Ortega E, Damas-Buenrostro LC, Riego-Ruiz L. Long insert clone experimental evidence for assembly improvement and chimeric chromosomes detection in an allopentaploid beer yeast. G3-Genes Genomes Genetics 2021; 11:6188626. [PMID: 33768233 PMCID: PMC8495930 DOI: 10.1093/g3journal/jkab088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2020] [Accepted: 03/12/2021] [Indexed: 11/18/2022]
Abstract
Lager beer is made with the hybrid Saccharomyces pastorianus. Many publicly available S. pastorianus genome assemblies are highly fragmented due to the difficulties of assembling hybrid genomes, such as the presence of homeologous chromosomes from both parental types, and translocations between them. To improve the assembly of a previously sequenced lager yeast hybrid Saccharomyces sp. 790 and elucidate its genome structure, we proposed the use of alternative experimental evidence. We determined the phylogenetic position of Saccharomyces sp. 790 and established it as S. pastorianus 790. Then, we obtained from this yeast a bacterial artificial chromosome (BAC) genomic library with its BAC-end sequences (BESs). To analyze these data, we developed a pipeline (applicable to other assemblies) that classifies BES pairs alignments according to their orientation. For the case of S. pastorianus 790, paired-end BESs alignments validated parts of the assembly and unpaired-end ones suggested contig joins or misassemblies. Importantly, the BACs library was preserved and used for verification experiments. Unpaired-end alignments were used to upgrade the previous assembly and provided an improved detection of translocations. With this, we proposed a genome structure of S. pastorianus 790, which was similar to that of other lager yeasts; however, when we estimated chromosome copy number and experimentally measured its genome size, we discovered that one key difference is the outstanding S. pastorianus 790 ploidy level (allopentaploid). Altogether, our results show the value of combining bioinformatic analyses with experimental data such as long-insert clone information to improve a short-read assembly of a hybrid genome.
Affiliation(s)
- Cintia Gómez-Muñoz
- División de Biología Molecular, Instituto Potosino de Investigación Científica y Tecnológica, A.C., San Luis Potosí, Mexico, 78216
| | - Luis Fernando García-Ortega
- División de Biología Molecular, Instituto Potosino de Investigación Científica y Tecnológica, A.C., San Luis Potosí, Mexico, 78216.,Departamento de Ingeniería Genética, Centro de Investigación y de Estudios Avanzados del IPN, Irapuato, Mexico, 36824
| | - Javier Montalvo-Arredondo
- División de Biología Molecular, Instituto Potosino de Investigación Científica y Tecnológica, A.C., San Luis Potosí, Mexico, 78216.,Dirección General Académica, Universidad Autónoma Agraria Antonio Narro, Saltillo, Mexico, 25315
| | | | | | - Lina Riego-Ruiz
- División de Biología Molecular, Instituto Potosino de Investigación Científica y Tecnológica, A.C., San Luis Potosí, Mexico, 78216
| |
|
16
|
|
17
|
Nakabayashi R, Morishita S. HiC-Hiker: a probabilistic model to determine contig orientation in chromosome-length scaffolds with Hi-C. Bioinformatics 2020; 36:3966-3974. [PMID: 32369554 PMCID: PMC7672694 DOI: 10.1093/bioinformatics/btaa288] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 11/09/2019] [Revised: 03/09/2020] [Accepted: 04/27/2020] [Indexed: 01/18/2023] Open
Abstract
Motivation: De novo assembly of reference-quality genomes used to require enormously laborious tasks. In particular, it is extremely time-consuming to build genome markers for ordering assembled contigs along chromosomes; thus, they are only available for well-established model organisms. To resolve this issue, recent studies demonstrated that Hi-C could be a powerful and cost-effective means to output chromosome-length scaffolds for non-model species with no genome marker resources, because the Hi-C contact frequency between a pair of two loci can be a good estimator of their genomic distance, even if there is a large gap between them. Indeed, state-of-the-art methods such as 3D-DNA are now widely used for locating contigs in chromosomes. However, it remains challenging to reduce errors in contig orientation because shorter contigs have fewer contacts with their neighboring contigs. These orientation errors lower the accuracy of gene prediction, read alignment, and synteny block estimation in comparative genomics. Results: To reduce these contig orientation errors, we propose a new algorithm, named HiC-Hiker, which has a firm grounding in probabilistic theory, rigorously models Hi-C contacts across contigs, and effectively infers the most probable orientations via the Viterbi algorithm. We compared HiC-Hiker and 3D-DNA using human and worm genome contigs generated from short reads, evaluated their performances, and observed a remarkable reduction in the contig orientation error rate from 4.3% (3D-DNA) to 1.7% (HiC-Hiker). Our algorithm can consider long-range information between distal contigs and precisely estimates Hi-C read contact probabilities among contigs, which may also be useful for determining the ordering of contigs. Availability and implementation: HiC-Hiker is freely available at: https://github.com/ryought/hic_hiker.
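To make the idea of inferring orientations with the Viterbi algorithm concrete, here is a minimal sketch over a chain of contigs with two states each (0 = forward, 1 = reverse). It is not the HiC-Hiker model: the function name `viterbi_orientations`, the `emission` and `transition` log-score tables, and the toy numbers are all hypothetical, and a real method would derive such scores from Hi-C contact counts.

```python
def viterbi_orientations(emission, transition):
    """Return the highest-scoring orientation (0 or 1) for each contig.

    emission[i][s]      : hypothetical log-score of contig i having orientation s
    transition[i][p][s] : hypothetical log-score of contig i in orientation p
                          being followed by contig i+1 in orientation s
    """
    n = len(emission)
    if n == 0:
        return []
    best = [list(emission[0])]  # best[i][s]: best score of a path ending in state s
    back = []                   # back[i-1][s]: predecessor state on that best path
    for i in range(1, n):
        row, ptr = [], []
        for s in (0, 1):
            cands = [best[i - 1][p] + transition[i - 1][p][s] for p in (0, 1)]
            p_best = 0 if cands[0] >= cands[1] else 1
            row.append(cands[p_best] + emission[i][s])
            ptr.append(p_best)
        best.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = 0 if best[-1][0] >= best[-1][1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy scores: contig 0 is anchored forward, the first junction favors a
    # flip, and the second junction favors keeping the same orientation.
    emission = [[0.0, -10.0], [0.0, 0.0], [0.0, 0.0]]
    transition = [
        [[-1.0, -0.1], [-0.1, -1.0]],
        [[-0.1, -1.0], [-1.0, -0.1]],
    ]
    print(viterbi_orientations(emission, transition))
```

This sketch only couples adjacent contigs; the abstract notes that the actual algorithm also considers long-range information between distal contigs, which a first-order chain like this one does not capture.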
Affiliation(s)
- Ryo Nakabayashi
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba 277-8562, Japan
| | - Shinichi Morishita
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba 277-8562, Japan
| |
|
18
|
Hao M, Qiao H, Gao Y, Wang Z, Qiao X, Chen X, Qi H. A mixed culture of bacterial cells enables an economic DNA storage on a large scale. Commun Biol 2020; 3:416. [PMID: 32737399 PMCID: PMC7395121 DOI: 10.1038/s42003-020-01141-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 03/03/2020] [Accepted: 07/02/2020] [Indexed: 11/25/2022] Open
Abstract
DNA has emerged as a novel potential material for mass data storage, offering the possibility to cheaply solve a great data storage problem. Large oligonucleotide pools have demonstrated high potential for large-scale data storage in the test tube, while living cells offer high fidelity in information replication. Here we show that a mixed culture of bacterial cells carrying a large oligo pool, assembled in a high-copy-number plasmid, is a stable material for large-scale data storage. The underlying principle was explored by deep bioinformatic analysis. Although homology assembly showed sequence-context-dependent bias, the large oligonucleotide pools in the mixed culture were constant over multiple successive passages. Finally, over ten thousand distinct oligos encompassing 2304 Kbps and encoding 445 KB of digital data were stored in cells. This is the largest storage in living cells reported so far and presents a previously unreported approach for bridging the gap between in vitro and in vivo systems. Hao, Qiao, Gao et al. show that over ten thousand oligonucleotides encoding 445 KB of digital data can be stored in cultured bacterial cells. Data storage in living cells increases the information storage capacity while enabling its economical propagation.
Affiliation(s)
- Min Hao
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China.,Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
| | - Hongyan Qiao
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China.,Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
| | - Yanmin Gao
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China.,Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
| | - Zhaoguan Wang
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China.,Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
| | - Xin Qiao
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China.,Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
| | - Xin Chen
- Center for Applied Mathematics, Tianjin University, Tianjin, China
| | - Hao Qi
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China. .,Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China.
| |
|
19
|
Giani AM, Gallo GR, Gianfranceschi L, Formenti G. Long walk to genomics: History and current approaches to genome sequencing and assembly. Comput Struct Biotechnol J 2019; 18:9-19. [PMID: 31890139 PMCID: PMC6926122 DOI: 10.1016/j.csbj.2019.11.002] [Citation(s) in RCA: 109] [Impact Index Per Article: 21.8] [Received: 08/29/2019] [Revised: 11/03/2019] [Accepted: 11/06/2019] [Indexed: 12/13/2022] Open
Abstract
Genomes represent the starting point of genetic studies. Since the discovery of DNA structure, scientists have devoted great efforts to determine their sequence in an exact way. In this review we provide a comprehensive historical background of the improvements in DNA sequencing technologies that have accompanied the major milestones in genome sequencing and assembly, ranging from early sequencing methods to Next-Generation Sequencing platforms. We then focus on the advantages and challenges of the current technologies and approaches, collectively known as Third Generation Sequencing. As these technical advancements have been accompanied by progress in analytical methods, we also review the bioinformatic tools currently employed in de novo genome assembly, as well as some applications of Third Generation Sequencing technologies and high-quality reference genomes.
Key Words
- BAC, Bacterial Artificial Chromosome
- Bioinformatics
- Genome assembly
- HGP, Human Genome Project
- HMW, high molecular weight
- HapMap, haplotype map
- NGS, Next Generation Sequencing
- Next-generation
- OLC, Overlap-Layout-Consensus
- QV, Quality Value
- Reference
- SBS, Sequencing by Synthesis
- SMRT, Single Molecule Real-Time
- SNPs, Single Nucleotide Polymorphisms
- SRA, Short Read Archive
- SV, Structural Variant
- Sequencing
- TGS, Third Generation Sequencing
- Third-generation
- WGS, Whole Genome Sequencing
- ZMW, Zero-Mode Waveguide
- bp, base pair
- dNTPs, deoxynucleoside triphosphates
- ddNTP, 2′,3′-dideoxynucleoside triphosphate
Affiliation(s)
- Alice Maria Giani
- Department of Surgery, Weill Cornell Medical College, New York, NY, USA
| | | | | | | |
|
20
|
Andonov R, Djidjev H, François S, Lavenier D. Complete assembly of circular and chloroplast genomes based on global optimization. J Bioinform Comput Biol 2019; 17:1950014. [PMID: 31288643 DOI: 10.1142/s0219720019500148] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
This paper focuses on the last two stages of genome assembly, namely, scaffolding and gap-filling, and shows that they can be solved as part of a single optimization problem. Our approach is based on modeling genome assembly as a problem of finding a simple path in a specific graph that satisfies as many distance constraints as possible encoding the insert-size information. We formulate it as a mixed-integer linear programming (MILP) problem and apply an optimization solver to find the exact solutions on a benchmark of chloroplasts. We show that the presence of repetitions in the set of unitigs is the main reason for the existence of multiple equivalent solutions that are associated to alternative subpaths. We also describe two sufficient conditions and we design efficient algorithms for identifying these subpaths. Comparisons of the results achieved by our tool with the ones obtained with recent assemblers are presented.
Affiliation(s)
- Rumen Andonov
- * Univ Rennes, Inria, CNRS, IRISA, F-35000 Rennes, France
| | - Hristo Djidjev
- † Los Alamos National Laboratory, Los Alamos, NM 87545, USA
| | | | | |
|
21
|
Characterization and evolutionary dynamics of complex regions in eukaryotic genomes. Science China-Life Sciences 2019; 62:467-488. [PMID: 30810961 DOI: 10.1007/s11427-018-9458-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 10/01/2018] [Accepted: 11/05/2018] [Indexed: 01/07/2023]
Abstract
Complex regions in eukaryotic genomes are typically characterized by duplications of chromosomal stretches that often include one or more genes repeated in a tandem array or in relatively close proximity. Nevertheless, the repetitive nature of these regions, together with the often high sequence identity among repeats, have made complex regions particularly recalcitrant to proper molecular characterization, often being misassembled or completely absent in genome assemblies. This limitation has prevented accurate functional and evolutionary analyses of these regions. This is becoming increasingly relevant as evidence continues to support a central role for complex genomic regions in explaining human disease, developmental innovations, and ecological adaptations across phyla. With the advent of long-read sequencing technologies and suitable assemblers, the development of algorithms that can accommodate sample heterozygosity, and the adoption of a pangenomic-like view of these regions, accurate reconstructions of complex regions are now within reach. These reconstructions will finally allow for accurate functional and evolutionary studies of complex genomic regions, underlying the generation of genotype-phenotype maps of unprecedented resolution.
|
22
|
Li W, Freudenberg J, Freudenberg J. Alignment-free approaches for predicting novel Nuclear Mitochondrial Segments (NUMTs) in the human genome. Gene 2019; 691:141-152. [PMID: 30630097 DOI: 10.1016/j.gene.2018.12.040] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 09/05/2018] [Revised: 12/07/2018] [Accepted: 12/14/2018] [Indexed: 10/27/2022]
Abstract
The nuclear human genome harbors sequences of mitochondrial origin, indicating an ancestral transfer of DNA from the mitogenome. Several Nuclear Mitochondrial Segments (NUMTs) have been detected by alignment-based sequence similarity search, as implemented in the Basic Local Alignment Search Tool (BLAST). Identifying NUMTs is important for the comprehensive annotation and understanding of the human genome. Here we explore the possibility of detecting NUMTs in the human genome by alignment-free sequence similarity search, such as k-mers (k-tuples, k-grams, oligos of length k) distributions. We find that when k=6 or larger, the k-mer approach and BLAST search produce almost identical results, e.g., detect the same set of NUMTs longer than 3 kb. However, when k=5 or k=4, certain signals are only detected by the alignment-free approach, and these may indicate yet unrecognized, and potentially more ancestral NUMTs. We introduce a "Manhattan plot" style representation of NUMT predictions across the genome, which are calculated based on the reciprocal of the Jensen-Shannon divergence between the nuclear and mitochondrial k-mer frequencies. The further inspection of the k-mer-based NUMT predictions however shows that most of them contain long-terminal-repeat (LTR) annotations, whereas BLAST-based NUMT predictions do not. Thus, similarity of the mitogenome to LTR sequences is recognized, which we validate by finding the mitochondrial k-mer distribution closer to those for transposable sequences and specifically, close to some types of LTR.
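The alignment-free comparison described in this abstract rests on k-mer frequency distributions and the Jensen-Shannon divergence between them. The sketch below shows one way to compute both using only the standard library. The function names, the use of overlapping k-mers, and the base-2 logarithm are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter
from math import log2


def kmer_frequencies(seq: str, k: int) -> dict:
    """Relative frequencies of overlapping k-mers in seq (empty if len(seq) < k)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def jensen_shannon_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence (base 2) between two frequency dictionaries."""
    jsd = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            jsd += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * log2(qi / mi)
    return jsd


if __name__ == "__main__":
    # Toy sequences standing in for a genomic window and a mitogenome.
    window = kmer_frequencies("ACGTACGTTTGACA", 4)
    mito = kmer_frequencies("ACGTACGATTGACC", 4)
    print(jensen_shannon_divergence(window, mito))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (no shared k-mers). A score based on its reciprocal, as the abstract describes, would need a guard against division by zero when two windows have identical k-mer profiles.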
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA.
| | - Jerome Freudenberg
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA
| | - Jan Freudenberg
- Regeneron Genetics Center, Regeneron Pharmaceuticals, Inc., Tarrytown, NY, USA
| |
|
23
|
Maxson Jones K, Ankeny RA, Cook-Deegan R. The Bermuda Triangle: The Pragmatics, Policies, and Principles for Data Sharing in the History of the Human Genome Project. Journal of the History of Biology 2018; 51:693-805. [PMID: 30390178 PMCID: PMC7307446 DOI: 10.1007/s10739-018-9538-7] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 05/24/2023]
Abstract
The Bermuda Principles for DNA sequence data sharing are an enduring legacy of the Human Genome Project (HGP). They were adopted by the HGP at a strategy meeting in Bermuda in February of 1996 and implemented in formal policies by early 1998, mandating daily release of HGP-funded DNA sequences into the public domain. The idea of daily sharing, we argue, emanated directly from strategies for large, goal-directed molecular biology projects first tested within the "community" of C. elegans researchers, and were introduced and defended for the HGP by the nematode biologists John Sulston and Robert Waterston. In the C. elegans community, and subsequently in the HGP, daily sharing served the pragmatic goals of quality control and project coordination. Yet in the HGP human genome, we also argue, the Bermuda Principles addressed concerns about gene patents impeding scientific advancement, and were aspirational and flexible in implementation and justification. They endured as an archetype for how rapid data sharing could be realized and rationalized, and permitted adaptation to the needs of various scientific communities. Yet in addition to the support of Sulston and Waterston, their adoption also depended on the clout of administrators at the US National Institutes of Health (NIH) and the UK nonprofit charity the Wellcome Trust, which together funded 90% of the HGP human sequencing effort. The other nations wishing to remain in the HGP consortium had to accommodate to the Bermuda Principles, requiring exceptions from incompatible existing or pending data access policies for publicly funded research in Germany, Japan, and France. We begin this story in 1963, with the biologist Sydney Brenner's proposal for a nematode research program at the Laboratory of Molecular Biology (LMB) at the University of Cambridge. 
We continue through 2003, with the completion of the HGP human reference genome, and conclude with observations about policy and the historiography of molecular biology.
Affiliation(s)
- Kathryn Maxson Jones
- Department of History, Princeton University, Princeton, NJ, USA.
- MBL McDonnell Foundation Scholar, Marine Biological Laboratory, Woods Hole, MA, USA.
| | - Rachel A Ankeny
- School of Humanities, The University of Adelaide, Adelaide, Australia
| | - Robert Cook-Deegan
- School for the Future of Innovation in Society, Consortium for Science, Policy & Outcomes, Arizona State University, Barrett & O'Connor Washington Center, Washington, D.C., USA
| |
|
24
|
Harrison OB, Schoen C, Retchless AC, Wang X, Jolley KA, Bray JE, Maiden MCJ. Neisseria genomics: current status and future perspectives. Pathog Dis 2018; 75:3861976. [PMID: 28591853 PMCID: PMC5827584 DOI: 10.1093/femspd/ftx060] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 04/10/2017] [Accepted: 06/05/2017] [Indexed: 12/17/2022] Open
Abstract
High-throughput whole genome sequencing has unlocked a multitude of possibilities enabling members of the Neisseria genus to be examined with unprecedented detail, including the human pathogens Neisseria meningitidis and Neisseria gonorrhoeae. To maximise the potential benefit of this for public health, it is becoming increasingly important to ensure that this plethora of data are adequately stored, disseminated and made readily accessible. Investigations facilitating cross-species comparisons as well as the analysis of global datasets will allow differences among and within species and across geographic locations and different times to be identified, improving our understanding of the distinct phenotypes observed. Recent advances in high-throughput platforms that measure the transcriptome, proteome and/or epigenome are also becoming increasingly employed to explore the complexities of Neisseria biology. An integrated approach to the analysis of these is essential to fully understand the impact these may have in the Neisseria genus. This article reviews the current status of some of the tools available for next generation sequence analysis at the dawn of the ‘post-genomic’ era.
Affiliation(s)
| | - Christoph Schoen
- Institute for Hygiene and Microbiology, University of Würzburg, Würzburg 97080, Germany
| | - Adam C Retchless
- Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
| | - Xin Wang
- Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
| | - Keith A Jolley
- Department of Zoology, University of Oxford, Oxford OX1 3SY, UK
| | - James E Bray
- Department of Zoology, University of Oxford, Oxford OX1 3SY, UK
| | | |
|
25
|
Silva PJ, Schaibley VM, Ramos KS. Academic medical centers as innovation ecosystems to address population -omics challenges in precision medicine. J Transl Med 2018; 16:28. [PMID: 29448963 PMCID: PMC5815198 DOI: 10.1186/s12967-018-1401-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 12/06/2017] [Accepted: 02/05/2018] [Indexed: 01/08/2023] Open
Abstract
While the promise of the Human Genome Project provided significant insights into the structure of the human genome, the complexities of disease at the individual level have made it difficult to utilize -omic information in clinical decision making. Some of the existing constraints have been minimized by technological advancements that have reduced the cost of sequencing at a rate far in excess of Moore's Law (a halving in cost per unit output every 18 months). The reduction in sequencing costs has made it economically feasible to create large data commons capturing the diversity of disease across populations. Until recently, these data have primarily been consumed in clinical research, but they are now increasingly being considered in clinical decision-making. Such advances are disrupting common diagnostic business models around which academic medical centers (AMCs) and molecular diagnostic companies have collaborated over the last decade. Proprietary biomarkers and patents on proprietary diagnostic content are no longer driving biomarker collaborations between industry and AMCs. Increasingly the scope of the data commons and biorepositories that AMCs can assemble through a nexus of academic and pharma collaborations is driving a virtuous cycle of precision medicine capabilities that make an AMC relevant and highly competitive. A rebalancing of proprietary strategies and open innovation strategies is warranted to enable institutional precision medicine asset portfolios. The scope of the AMC's clinical trial and research collaboration portfolios with industry is increasingly dependent on the currency of data, and less on patents. Intrapreneurial support of internal service offerings, clinical trials and clinical laboratory services for example, will be important new points of emphasis at the academic-industry interface.
Streamlining these new models of industry collaboration for AMCs is a new area in which technology transfer offices can offer partnerships and add value beyond the traditional intellectual property offering.
Affiliation(s)
- Patrick J. Silva
- Office of the Senior Vice President Health Sciences, University of Arizona Health Sciences, Drachman Hall, Room B207, 1295 North Martin Avenue, P.O. Box 210202, Tucson, AZ 85721-0202 USA
| | - Valerie M. Schaibley
- Center for Applied Genetics and Genomic Medicine, University of Arizona, 1295 North Martin Avenue, Drachman Hall, Room B207, Tucson, AZ 85721-0202 USA
| | - Kenneth S. Ramos
- Office of the Senior Vice President Health Sciences, University of Arizona Health Sciences, Drachman Hall, Room B207, 1295 North Martin Avenue, P.O. Box 210202, Tucson, AZ 85721-0202 USA
- University of Arizona College of Medicine-Phoenix, 550 E. Van Buren Street, Phoenix, 85004 USA
- University of Arizona College of Medicine-Tucson, 1295 North Martin Avenue, Drachman Hall, Room B207, P.O. Box 210202, Tucson, AZ 85721-0202 USA
- Center for Applied Genetics and Genomic Medicine, University of Arizona, 1295 North Martin Avenue, Drachman Hall, Room B207, Tucson, AZ 85721-0202 USA
| |
|
26
|
Abstract
Here, I argue that computational thinking and techniques are so central to the quest of understanding life that today all biology is computational biology. Computational biology brings order into our understanding of life, it makes biological concepts rigorous and testable, and it provides a reference map that holds together individual insights. The next modern synthesis in biology will be driven by mathematical, statistical, and computational methods being absorbed into mainstream biological training, turning biology into a quantitative science.
Affiliation(s)
- Florian Markowetz
- University of Cambridge, Cancer Research UK Cambridge Institute, Cambridge, United Kingdom
| |
|
27
|
Huang J, Zhang C, Zhao X, Fei Z, Wan K, Zhang Z, Pang X, Yin X, Bai Y, Sun X, Gao L, Li R, Zhang J, Li X. The Jujube Genome Provides Insights into Genome Evolution and the Domestication of Sweetness/Acidity Taste in Fruit Trees. PLoS Genet 2016; 12:e1006433. [PMID: 28005948 PMCID: PMC5179053 DOI: 10.1371/journal.pgen.1006433] [Citation(s) in RCA: 89] [Impact Index Per Article: 11.1] [Received: 06/14/2016] [Accepted: 10/20/2016] [Indexed: 12/26/2022] Open
Abstract
Jujube (Ziziphus jujuba Mill.) belongs to the Rhamnaceae family and is a popular fruit tree species with immense economic and nutritional value. Here, we report a draft genome of the dry jujube cultivar 'Junzao' and the genome resequencing of 31 geographically diverse accessions of cultivated and wild jujubes (Ziziphus jujuba var. spinosa). Comparative analysis revealed that the genome of 'Dongzao', a fresh jujube, was ~86.5 Mb larger than that of the 'Junzao', partially due to the recent insertions of transposable elements in the 'Dongzao' genome. We constructed eight proto-chromosomes of the common ancestor of Rhamnaceae and Rosaceae, two sister families in the order Rosales, and elucidated the evolutionary processes that have shaped the genome structures of modern jujubes. Population structure analysis revealed the complex genetic background of jujubes resulting from extensive hybridizations between jujube and its wild relatives. Notably, several key genes that control fruit organic acid metabolism and sugar content were identified in the selective sweep regions. We also identified S-locus genes controlling gametophytic self-incompatibility and investigated haplotype patterns of the S locus in the jujube genomes, which would provide a guideline for parent selection for jujube crossbreeding. This study provides valuable genomic resources for jujube improvement, and offers insights into jujube genome evolution and its population structure and domestication.
Affiliation(s)
- Jian Huang
- College of Forestry, Northwest A&F University, Yangling, China
- Center for Jujube Engineering and Technology of State Forestry Administration, Northwest A&F University, Yangling, China
| | - Chunmei Zhang
- College of Forestry, Northwest A&F University, Yangling, China
- Center for Jujube Engineering and Technology of State Forestry Administration, Northwest A&F University, Yangling, China
| | - Xing Zhao
- Novogene Bioinformatics Institute, Beijing, China
| | - Zhangjun Fei
- Boyce Thompson Institute, Cornell University, Ithaca, New York, United States of America
| | - KangKang Wan
- Novogene Bioinformatics Institute, Beijing, China
| | - Zhong Zhang
- College of Forestry, Northwest A&F University, Yangling, China
- Center for Jujube Engineering and Technology of State Forestry Administration, Northwest A&F University, Yangling, China
| | - Xiaoming Pang
- College of Biological Sciences and Technology, Beijing Forestry University, Beijing, China
| | - Xiao Yin
- College of Forestry, Northwest A&F University, Yangling, China
| | - Yang Bai
- Boyce Thompson Institute, Cornell University, Ithaca, New York, United States of America
| | - Xiaoqing Sun
- Novogene Bioinformatics Institute, Beijing, China
| | - Lizhi Gao
- Plant Germplasm and Genomics Center, Germplasm Bank of Wild Species in Southwest China, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, China
| | - Ruiqiang Li
- Novogene Bioinformatics Institute, Beijing, China
| | - Jinbo Zhang
- Novogene Bioinformatics Institute, Beijing, China
| | - Xingang Li
- College of Forestry, Northwest A&F University, Yangling, China
- Center for Jujube Engineering and Technology of State Forestry Administration, Northwest A&F University, Yangling, China
| |
|
28
|
Carvalho AB, Dupim EG, Goldstein G. Improved assembly of noisy long reads by k-mer validation. Genome Res 2016; 26:1710-1720. [PMID: 27831497 PMCID: PMC5131822 DOI: 10.1101/gr.209247.116] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 05/01/2016] [Accepted: 09/29/2016] [Indexed: 11/24/2022]
Abstract
Genome assembly depends critically on read length. Two recent technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, produce read lengths >20 kb, which yield de novo genome assemblies with vastly greater contiguity than those based on Sanger, Illumina, or other technologies. However, the very high error rates of these two new technologies (∼15% per base) make assembly imprecise at repeats longer than the read length and computationally expensive. Here we show that the contiguity and quality of the assembly of these noisy long reads can be significantly improved at a minimal cost, by leveraging the low error rate and low cost of Illumina short reads. Namely, k-mers from the PacBio raw reads that are not present in Illumina reads (which account for ∼95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ∼5% of k-mers that are error free, read overlap sensitivity is dramatically increased. Of equal importance, the validation procedure can be extended to exclude repetitive k-mers, which prevents read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure using one long-read technology (PacBio) and one assembler (MHAP/Celera Assembler), but it is very likely to yield analogous improvements with alternative long-read technologies and assemblers, such as Oxford Nanopore and BLASR/DALIGNER/Falcon, respectively.
Affiliation(s)
- Antonio Bernardo Carvalho
- Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro, Brazil
| | - Eduardo G Dupim
- Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro, Brazil
| | - Gabriel Goldstein
- Departamento de Genética, Universidade Federal do Rio de Janeiro, CEP 21941-971, Rio de Janeiro, Brazil
| |
|
29
|
Making sense of genomes of parasitic worms: Tackling bioinformatic challenges. Biotechnol Adv 2016; 34:663-686. [DOI: 10.1016/j.biotechadv.2016.03.001] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 11/05/2015] [Revised: 02/25/2016] [Accepted: 03/01/2016] [Indexed: 01/25/2023]
|
30
|
Beier S, Himmelbach A, Schmutzer T, Felder M, Taudien S, Mayer KFX, Platzer M, Stein N, Scholz U, Mascher M. Multiplex sequencing of bacterial artificial chromosomes for assembling complex plant genomes. PLANT BIOTECHNOLOGY JOURNAL 2016; 14:1511-22. [PMID: 26801048 PMCID: PMC5066668 DOI: 10.1111/pbi.12511] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 07/14/2015] [Revised: 11/11/2015] [Accepted: 11/13/2015] [Indexed: 05/02/2023]
Abstract
Hierarchical shotgun sequencing remains the method of choice for assembling high-quality reference sequences of complex plant genomes. The efficient exploitation of current high-throughput technologies and powerful computational facilities for large-insert clone sequencing necessitates the sequencing and assembly of a large number of clones in parallel. We developed a multiplexed pipeline for shotgun sequencing and assembling individual bacterial artificial chromosomes (BACs) using the Illumina sequencing platform. We illustrate our approach by sequencing 668 barley BACs (Hordeum vulgare L.) in a single Illumina HiSeq 2000 lane. Using a newly designed parallelized computational pipeline, we obtained sequence assemblies of individual BACs that consist, on average, of eight sequence scaffolds and represent >98% of the genomic inserts. Our BAC assemblies are clearly superior to a whole-genome shotgun assembly regarding contiguity, completeness and the representation of the gene space. Our methods may be employed to rapidly obtain high-quality assemblies of a large number of clones to assemble map-based reference sequences of plant and animal species with complex genomes by sequencing along a minimum tiling path.
Affiliation(s)
- Sebastian Beier
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Stadt Seeland, Germany
| | - Axel Himmelbach
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Stadt Seeland, Germany
| | - Thomas Schmutzer
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Stadt Seeland, Germany
| | - Marius Felder
- Leibniz Institute on Aging-Fritz Lipmann Institute (FLI), Jena, Germany
| | - Stefan Taudien
- Leibniz Institute on Aging-Fritz Lipmann Institute (FLI), Jena, Germany
| | - Klaus F X Mayer
- Plant Genome and System Biology (PGSB), Helmholtz Center Munich, German Research Center for Environmental Health (GmbH), Neuherberg, Germany
| | - Matthias Platzer
- Leibniz Institute on Aging-Fritz Lipmann Institute (FLI), Jena, Germany
| | - Nils Stein
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Stadt Seeland, Germany
| | - Uwe Scholz
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Stadt Seeland, Germany
| | - Martin Mascher
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Stadt Seeland, Germany
| |
|
31
|
Bermudez-Santana CI. APLICACIONES DE LA BIOINFORMÁTICA EN LA MEDICINA: EL GENOMA HUMANO. ¿CÓMO PODEMOS VER TANTO DETALLE? ACTA BIOLÓGICA COLOMBIANA 2016. [DOI: 10.15446/abc.v21n1supl.51233] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022] Open
Abstract
Bioinformatics is a new field that supports part of the biological research aimed at identifying gene variants that can be discovered from whole-genome studies. With this motivation, an overview is presented of the main contributions of bioinformatics to the sequencing of the first human genome. The main computational advances developed to meet the demands of next-generation sequencing methods for re-sequencing a human genome are also summarized. Finally, some of the new challenges that must be addressed to apply personalized genomics in the development of medicine are introduced.
|
32
|
Chaisson MJP, Wilson RK, Eichler EE. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 2015; 16:627-40. [PMID: 26442640 DOI: 10.1038/nrg3933] [Citation(s) in RCA: 226] [Impact Index Per Article: 25.1] [Indexed: 12/31/2022]
Abstract
The discovery of genetic variation and the assembly of genome sequences are both inextricably linked to advances in DNA-sequencing technology. Short-read massively parallel sequencing has revolutionized our ability to discover genetic variation but is insufficient to generate high-quality genome assemblies or resolve most structural variation. Full resolution of variation is only guaranteed by complete de novo assembly of a genome. Here, we review approaches to genome assembly, the nature of gaps or missing sequences, and biases in the assembly process. We describe the challenges of generating a complete de novo genome assembly using current technologies and the impact that being able to perfectly sequence the genome would have on understanding human disease and evolution. Finally, we summarize recent technological advances that improve both contiguity and accuracy and emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.
Affiliation(s)
- Mark J P Chaisson
- Department of Genome Sciences, University of Washington, Foege Building S-413A, Box 355065, 3720 15th Ave NE, Seattle, Washington 98195, USA
| | - Richard K Wilson
- McDonnell Genome Institute, Department of Medicine, Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington, Foege Building S-413A, Box 355065, 3720 15th Ave NE, Seattle, Washington 98195, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
| |
|
33
|
Fierst JL. Using linkage maps to correct and scaffold de novo genome assemblies: methods, challenges, and computational tools. Front Genet 2015; 6:220. [PMID: 26150829 PMCID: PMC4473057 DOI: 10.3389/fgene.2015.00220] [Citation(s) in RCA: 98] [Impact Index Per Article: 10.9] [Received: 04/08/2015] [Accepted: 06/08/2015] [Indexed: 01/05/2023] Open
Abstract
Modern high-throughput DNA sequencing has made it possible to inexpensively produce genome sequences, but in practice many of these draft genomes are fragmented and incomplete. Genetic linkage maps based on recombination rates between physical markers have been used in biology for over 100 years and a linkage map, when paired with a de novo sequencing project, can resolve mis-assemblies and anchor chromosome-scale sequences. Here, I summarize the methodology behind integrating de novo assemblies and genetic linkage maps, outline the current challenges, review the available software tools, and discuss new mapping technologies.
Affiliation(s)
- Janna L. Fierst
- Department of Biological Sciences, University of Alabama, Tuscaloosa, AL, USA
| |
|
34
|
Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol 2015; 33:623-30. [DOI: 10.1038/nbt.3238] [Citation(s) in RCA: 687] [Impact Index Per Article: 76.3] [Received: 08/14/2014] [Accepted: 04/08/2015] [Indexed: 02/07/2023]
|
35
|
Chapman JA, Mascher M, Buluç A, Barry K, Georganas E, Session A, Strnadova V, Jenkins J, Sehgal S, Oliker L, Schmutz J, Yelick KA, Scholz U, Waugh R, Poland JA, Muehlbauer GJ, Stein N, Rokhsar DS. A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome. Genome Biol 2015. [PMID: 25637298 DOI: 10.1186/s13059-015-0582-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022] Open
Abstract
Polyploid species have long been thought to be recalcitrant to whole-genome assembly. By combining high-throughput sequencing, recent developments in parallel computing, and genetic mapping, we derive, de novo, a sequence assembly representing 9.1 Gbp of the highly repetitive 16 Gbp genome of hexaploid wheat, Triticum aestivum, and assign 7.1 Gb of this assembly to chromosomal locations. The genome representation and accuracy of our assembly are comparable to, or even exceed, those of a chromosome-by-chromosome shotgun assembly. Our assembly and mapping strategy uses only short-read sequencing technology and is applicable to any species where it is possible to construct a mapping population.
Affiliation(s)
- Jarrod A Chapman
- Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA, 94598, USA.
| | - Martin Mascher
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Stadt Seeland, Germany.
| | - Aydın Buluç
- Computational Research Division and National Energy Research Supercomputing Center (NERSC), Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA.
| | - Kerrie Barry
- Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA, 94598, USA.
| | - Evangelos Georganas
- Computational Research Division and National Energy Research Supercomputing Center (NERSC), Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA.
- Department of Electrical Engineering and Computer Science, Computer Science Division, University of California, Berkeley, CA, 94720, USA.
| | - Adam Session
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA.
| | - Veronika Strnadova
- Department of Computer Science, University of California, Santa Barbara, CA, 93106, USA.
| | - Jerry Jenkins
- Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA, 94598, USA.
- HudsonAlpha Institute of Biotechnology, Huntsville, AL, 35806, USA.
| | - Sunish Sehgal
- Department of Plant Pathology, Kansas State University, Manhattan, KS, 65506, USA.
- Present address: Department of Plant Science, South Dakota State University, Brookings, SD, 57007, USA.
| | - Leonid Oliker
- Computational Research Division and National Energy Research Supercomputing Center (NERSC), Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA.
| | - Jeremy Schmutz
- Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA, 94598, USA.
- HudsonAlpha Institute of Biotechnology, Huntsville, AL, 35806, USA.
| | - Katherine A Yelick
- Computational Research Division and National Energy Research Supercomputing Center (NERSC), Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA.
- Department of Electrical Engineering and Computer Science, Computer Science Division, University of California, Berkeley, CA, 94720, USA.
| | - Uwe Scholz
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Stadt Seeland, Germany.
| | - Robbie Waugh
- Division of Plant Sciences, University of Dundee & The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK.
| | - Jesse A Poland
- Department of Plant Pathology, Kansas State University, Manhattan, KS, 65506, USA.
| | - Gary J Muehlbauer
- Departments of Agronomy and Plant Genetics, and Plant Biology, University of Minnesota, St Paul, MN, 55108, USA.
| | - Nils Stein
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Stadt Seeland, Germany.
| | - Daniel S Rokhsar
- Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA, 94598, USA.
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA.
| |
|
36
|
Chapman JA, Mascher M, Buluç A, Barry K, Georganas E, Session A, Strnadova V, Jenkins J, Sehgal S, Oliker L, Schmutz J, Yelick KA, Scholz U, Waugh R, Poland JA, Muehlbauer GJ, Stein N, Rokhsar DS. A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome. Genome Biol 2015; 16:26. [PMID: 25637298 PMCID: PMC4373400 DOI: 10.1186/s13059-015-0582-8] [Citation(s) in RCA: 164] [Impact Index Per Article: 18.2] [Received: 09/12/2014] [Accepted: 01/06/2015] [Indexed: 11/10/2022] Open
|
37
|
Wang FQ, Zhong J, Zhao Y, Xiao J, Liu J, Dai M, Zheng G, Zhang L, Yu J, Wu J, Duan B. Genome sequencing of high-penicillin producing industrial strain of Penicillium chrysogenum. BMC Genomics 2014; 15 Suppl 1:S11. [PMID: 24564352 PMCID: PMC4046689 DOI: 10.1186/1471-2164-15-s1-s11] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND Owing to the importance of Penicillium chrysogenum in medicine, the genome of the low-penicillin-producing laboratory strain Wisconsin54-1255 has been sequenced and fully annotated. Through classical mutagenesis of Wisconsin54-1255, penicillin titers and productivities have increased dramatically, but the underlying genome structural variations remain poorly understood. Genome sequencing of a high-penicillin-producing industrial strain is therefore informative. RESULTS To gain insight into the genome structural variations of a high-penicillin-producing strain, we sequenced the industrial strain P. chrysogenum NCPC10086. By whole-genome comparative analysis, we observed a large number of mutations, insertions and deletions, and structural variations. There are 69 new genes that do not exist in the genome sequence of Wisconsin54-1255, some of which are involved in energy metabolism, nitrogen metabolism and glutathione metabolism. Most importantly, we discovered a 53.7 Kb "new shift fragment" in the seven-copy determinative penicillin biosynthesis cluster in NCPC10086, and the arrangement type of the amplified region is unique. Moreover, we identified two large-scale translocations in NCPC10086, containing genes involved in energy metabolism, nitrogen metabolism and the peroxisome pathway. Finally, we found some non-synonymous mutations in genes that participate in the homogentisate pathway or act as regulators of penicillin biosynthesis. CONCLUSIONS We provide the first high-quality genome sequence of an industrial high-penicillin strain of P. chrysogenum and a comparative genome analysis with a low-producing experimental strain. The genomic variations we discovered are related to energy metabolism, nitrogen metabolism and other processes. These findings offer potential insights into the high-penicillin-yielding mechanism and may inform future metabolic engineering.
Affiliation(s)
- Fu-Qiang Wang
- New Drug Research and Development Center of North China Pharmaceutical Group Corporation, National Engineering Research Center of Microbial Medicine, Hebei Industry Microbial Metabolic Engineering & Technology Research Center, Shijiazhuang, Hebei 050015 China
| | - Jun Zhong
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101 China
- University of Chinese Academy of Sciences, Beijing, 100049 China
| | - Ying Zhao
- New Drug Research and Development Center of North China Pharmaceutical Group Corporation, National Engineering Research Center of Microbial Medicine, Hebei Industry Microbial Metabolic Engineering & Technology Research Center, Shijiazhuang, Hebei 050015 China
| | - Jingfa Xiao
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101 China
| | - Jing Liu
- New Drug Research and Development Center of North China Pharmaceutical Group Corporation, National Engineering Research Center of Microbial Medicine, Hebei Industry Microbial Metabolic Engineering & Technology Research Center, Shijiazhuang, Hebei 050015 China
| | - Meng Dai
- New Drug Research and Development Center of North China Pharmaceutical Group Corporation, National Engineering Research Center of Microbial Medicine, Hebei Industry Microbial Metabolic Engineering & Technology Research Center, Shijiazhuang, Hebei 050015 China
| | - Guizhen Zheng
- New Drug Research and Development Center of North China Pharmaceutical Group Corporation, National Engineering Research Center of Microbial Medicine, Hebei Industry Microbial Metabolic Engineering & Technology Research Center, Shijiazhuang, Hebei 050015 China
| | - Li Zhang
- New Drug Research and Development Center of North China Pharmaceutical Group Corporation, National Engineering Research Center of Microbial Medicine, Hebei Industry Microbial Metabolic Engineering & Technology Research Center, Shijiazhuang, Hebei 050015 China
| | - Jun Yu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101 China
| | - Jiayan Wu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101 China
| | - Baoling Duan
- New Drug Research and Development Center of North China Pharmaceutical Group Corporation, National Engineering Research Center of Microbial Medicine, Hebei Industry Microbial Metabolic Engineering & Technology Research Center, Shijiazhuang, Hebei 050015 China
| |
|
38
|
Abstract
MOTIVATION The de novo assembly of large, complex genomes is a significant challenge with currently available DNA sequencing technology. While many de novo assembly software packages are available, comparatively little attention has been paid to assisting the user with the assembly. RESULTS This article addresses the practical aspects of de novo assembly by introducing new ways to perform quality assessment on a collection of sequence reads. The software implementation calculates per-base error rates, paired-end fragment-size distributions and coverage metrics in the absence of a reference genome. Additionally, the software will estimate characteristics of the sequenced genome, such as repeat content and heterozygosity that are key determinants of assembly difficulty.
|
39
|
Li W, Freudenberg J, Miramontes P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics 2014; 15:2. [PMID: 24386976 PMCID: PMC3927684 DOI: 10.1186/1471-2105-15-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 08/25/2013] [Accepted: 12/17/2013] [Indexed: 11/10/2022] Open
Abstract
Background The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp. Results We observe that the proportion of non-singleton k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values of k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank frequency plots exhibit a concave Zipf’s curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications. Conclusion Read mappability as measured by the proportion of singletons increases steadily up to a length scale of around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read length much more slowly than linearly. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, USA.
| |
|
40
|
Patel LR, Nykter M, Chen K, Zhang W. Cancer genome sequencing: understanding malignancy as a disease of the genome, its conformation, and its evolution. Cancer Lett 2013; 340:152-60. [PMID: 23111104 PMCID: PMC3632661 DOI: 10.1016/j.canlet.2012.10.018] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 09/13/2012] [Revised: 10/15/2012] [Accepted: 10/19/2012] [Indexed: 12/11/2022]
Abstract
Advances in cancer genomics have been propelled by the steady evolution of molecular profiling technologies. Over the past decade, high-throughput sequencing technologies have matured to the point necessary to support disease-specific shotgun sequencing. This has compelled whole-genome sequencing studies across a broad panel of malignancies. The emergence of high-throughput sequencing technologies has inspired new chemical and computational techniques enabling interrogation of cancer-specific genomic and transcriptomic variants, previously unannotated genes, and chromatin structure. Finally, recent progress in single-cell sequencing holds great promise for studies interrogating the consequences of tumor evolution in cancers presenting with genomic heterogeneity.
Affiliation(s)
- Lalit R. Patel
- MD/PhD Program, The University of Texas Health Science Center, Houston, TX, USA
- Department of Pathology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Matti Nykter
- Department of Signal Processing, Tampere University of Technology, Tampere 33101, Finland
| | - Kexin Chen
- Department of Epidemiology and Biostatistics, Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China
| | - Wei Zhang
- Department of Pathology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
|
41
|
Ghodsi M, Hill CM, Astrovskaya I, Lin H, Sommer DD, Koren S, Pop M. De novo likelihood-based measures for comparing genome assemblies. BMC Res Notes 2013; 6:334. [PMID: 23965294 PMCID: PMC3765854 DOI: 10.1186/1756-0500-6-334] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 05/13/2013] [Accepted: 08/13/2013] [Indexed: 12/12/2022] Open
Abstract
Background The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments “read” by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. These “gold standards” can be expensive to produce and may only cover a small fraction of the genome, which limits their applicability to newly generated genome sequences. Here we introduce a de novo probabilistic measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics. Results We demonstrate that our de novo score can be computed quickly and accurately in a practical setting even for large datasets, by estimating the score from a relatively small sample of the reads. To demonstrate the benefits of our score, we measure the quality of the assemblies generated in the GAGE and Assemblathon 1 assembly “bake-offs” with our metric. Even without knowledge of the true reference sequence, our de novo metric closely matches the reference-based evaluation metrics used in the studies and outperforms other de novo metrics traditionally used to measure assembly quality (such as N50). Finally, we highlight the application of our score to optimize assembly parameters used in genome assemblers, which enables better assemblies to be produced, even without prior knowledge of the genome being assembled. 
Conclusion Likelihood-based measures, such as the one proposed here, will become the new standard for de novo assembly evaluation.
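The idea of scoring an assembly by the probability of the reads given the assembled sequence can be sketched as follows. This is a deliberately simplified illustration, not the authors' model: it assumes error-free reads sampled uniformly from a single strand, and the function names and the `floor` probability for reads that do not match are hypothetical choices.

```python
import math


def count_occurrences(assembly: str, read: str) -> int:
    """Count exact (possibly overlapping) occurrences of read in assembly."""
    count, start = 0, assembly.find(read)
    while start != -1:
        count += 1
        start = assembly.find(read, start + 1)
    return count


def log_likelihood(assembly: str, reads: list[str], floor: float = 1e-9) -> float:
    """Sum of log P(read | assembly) under uniform, error-free sampling."""
    total = 0.0
    for read in reads:
        positions = len(assembly) - len(read) + 1
        p = count_occurrences(assembly, read) / positions if positions > 0 else 0.0
        # Unmatched reads get a small floor probability instead of log(0).
        total += math.log(max(p, floor))
    return total


if __name__ == "__main__":
    reads = ["ACGT", "GTTG", "TGCA"]   # toy reads drawn from "ACGTTGCA"
    complete = "ACGTTGCA"
    truncated = "ACGTT"
    print(log_likelihood(complete, reads), log_likelihood(truncated, reads))
```

In this toy case the complete sequence scores higher than the truncated one because every read can be placed, which mirrors the property stated in the abstract that the true genome maximizes the score.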
Affiliation(s)
- Mohammadreza Ghodsi
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA.
|
42
|
Abstract
High-throughput shotgun sequence data make it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual's genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an individual are limited to low coverage, preventing the confident calling of genotypes necessary to directly count the proportion of heterozygous sites. Here, we present a method for estimating an individual's genome-wide rate of heterozygosity from low-coverage sequence data, without an intermediate step that calls genotypes. Our method jointly learns the shared allele distribution between the individual and a panel of other individuals, together with the sequencing error distributions and the reference bias. We show our method works well, first, by its performance on simulated sequence data and, second, on real sequence data where we obtain estimates using low-coverage data consistent with those from higher coverage. We apply our method to obtain estimates of the rate of heterozygosity for 11 humans from diverse worldwide populations and through this analysis reveal the complex dependency of local sequencing coverage on the true underlying heterozygosity, which complicates the estimation of heterozygosity from sequence data. We show how we can use filters to correct for the confounding arising from sequencing depth. We find in practice that ratios of heterozygosity are more interpretable than absolute estimates and show that we obtain excellent conformity of ratios of heterozygosity with previous estimates from higher-coverage data.
|
43
|
BIOCOMPUTATION: Some history and prospects. Biosystems 2013; 112:196-203. [DOI: 10.1016/j.biosystems.2012.12.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 10/31/2012] [Revised: 12/10/2012] [Accepted: 12/13/2012] [Indexed: 11/24/2022]
|
44
|
Lonardi S, Duma D, Alpert M, Cordero F, Beccuti M, Bhat PR, Wu Y, Ciardo G, Alsaihati B, Ma Y, Wanamaker S, Resnik J, Bozdag S, Luo MC, Close TJ. Combinatorial pooling enables selective sequencing of the barley gene space. PLoS Comput Biol 2013; 9:e1003010. [PMID: 23592960 PMCID: PMC3617026 DOI: 10.1371/journal.pcbi.1003010] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 10/29/2012] [Accepted: 02/05/2013] [Indexed: 11/23/2022] Open
Abstract
For the vast majority of species – including many economically or ecologically important organisms – progress in biological research is hampered by the lack of a reference genome sequence. Despite recent advances in sequencing technologies, several factors still limit the availability of such a critical resource. At the same time, many research groups and international consortia have already produced BAC libraries and physical maps and now are in a position to proceed with the development of whole-genome sequences organized around a physical map anchored to a genetic map. We propose a BAC-by-BAC sequencing protocol that combines combinatorial pooling design and second-generation sequencing technology to efficiently approach de novo selective genome sequencing. We show that combinatorial pooling is a cost-effective and practical alternative to exhaustive DNA barcoding when preparing sequencing libraries for hundreds or thousands of DNA samples, such as, in this case, gene-bearing minimum-tiling-path BAC clones. The novelty of the protocol hinges on the computational ability to efficiently compare hundreds of millions of short reads and assign them to the correct BAC clones (deconvolution) so that the assembly can be carried out clone-by-clone. Experimental results on simulated data for the rice genome show that the deconvolution is very accurate, and the resulting BAC assemblies have high quality. Results on real data for a gene-rich subset of the barley genome confirm that the deconvolution is accurate and the BAC assemblies have good quality. While our method cannot provide the level of completeness that one would achieve with a comprehensive whole-genome sequencing project, we show that it is quite successful in reconstructing the gene sequences within BACs. In the case of plants such as barley, this level of sequence knowledge is sufficient to support critical end-point objectives such as map-based cloning and marker-assisted breeding.
The problem of obtaining the full genomic sequence of an organism has been solved either via a global brute-force approach (called whole-genome shotgun) or by a divide-and-conquer strategy (called clone-by-clone). Both approaches have advantages and disadvantages in terms of cost, manual labor, and the ability to deal with sequencing errors and highly repetitive regions of the genome. With the advent of second-generation sequencing instruments, the whole-genome shotgun approach has been the preferred choice. The clone-by-clone strategy is, however, still very relevant for large complex genomes. In fact, several research groups and international consortia have produced clone libraries and physical maps for many economically or ecologically important organisms and now are in a position to proceed with sequencing. In this manuscript, we demonstrate the feasibility of this approach on the gene-space of a large, very repetitive plant genome. The novelty of our approach is that, in order to take advantage of the throughput of the current generation of sequencing instruments, we pool hundreds of clones using a special type of “smart” pooling design that allows one to establish with high accuracy the source clone from the sequenced reads in a pool. Extensive simulations and experimental results support our claims.
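The deconvolution step, in which a read is traced back to its source clone from the set of pools it was seen in, can be sketched with a toy pooling design. This is an illustrative simplification under stated assumptions: the clone names, pool signatures, and the function `candidate_clones` are hypothetical, and the subset test below stands in for whatever decoding rule the actual pooling design uses.

```python
def candidate_clones(signatures: dict[str, set[int]], observed_pools: set[int]) -> list[str]:
    """Return clones whose full pool signature is contained in the observed pools."""
    observed = set(observed_pools)
    # A clone is a candidate source only if the read shows up in every pool it was placed in.
    return sorted(clone for clone, pools in signatures.items() if set(pools) <= observed)


if __name__ == "__main__":
    # Hypothetical design: each clone is placed in a distinct pair of pools.
    signatures = {"BAC_A": {0, 1}, "BAC_B": {0, 2}, "BAC_C": {1, 3}}
    print(candidate_clones(signatures, {0, 2}))     # read seen in pools 0 and 2
    print(candidate_clones(signatures, {0, 1, 3}))  # read shared by overlapping clones
```

A read observed in the union of two signatures is reported for both clones, which is the behavior one would want for sequence shared between overlapping clones; handling sequencing errors and missing pools would need a more tolerant rule.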
Affiliation(s)
- Stefano Lonardi
- Department of Computer Science and Engineering, University of California, Riverside, California, USA.
|
45
|
Forde BM, O'Toole PW. Next-generation sequencing technologies and their impact on microbial genomics. Brief Funct Genomics 2013; 12:440-53. [PMID: 23314033 DOI: 10.1093/bfgp/els062] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Indexed: 12/27/2022] Open
Abstract
Next-generation sequencing technologies have had a dramatic impact in the field of genomic research through the provision of a low cost, high-throughput alternative to traditional capillary sequencers. These new sequencing methods have surpassed their original scope and now provide a range of utility-based applications, which allow for a more comprehensive analysis of the structure and content of microbial genomes than was previously possible. With the commercialization of a third generation of sequencing technologies imminent, we discuss the applications of current next-generation sequencing methods and explore their impact on and contribution to microbial genome research.
Affiliation(s)
- Brian M Forde
- Department of Microbiology, University College Cork, Cork, Ireland.
|
46
|
Wit P, Pespeni MH, Ladner JT, Barshis DJ, Seneca F, Jaris H, Therkildsen NO, Morikawa M, Palumbi SR. The simple fool's guide to population genomics via RNA‐Seq: an introduction to high‐throughput sequencing data analysis. Mol Ecol Resour 2012; 12:1058-67. [DOI: 10.1111/1755-0998.12003] [Citation(s) in RCA: 196] [Impact Index Per Article: 16.3] [Received: 03/13/2012] [Revised: 07/16/2012] [Accepted: 07/27/2012] [Indexed: 11/30/2022]
Affiliation(s)
- Pierre Wit
- Department of Biology Stanford University Hopkins Marine Station 120 Ocean view Blvd Pacific Grove CA 93950 USA
| | - Melissa H. Pespeni
- Department of Biology Stanford University Hopkins Marine Station 120 Ocean view Blvd Pacific Grove CA 93950 USA
- Department of Biology Indiana University 915 E. Third Street Myers Hall 150 Bloomington IN 47405‐7107 USA
| | - Jason T. Ladner
- Department of Biology Stanford University Hopkins Marine Station 120 Ocean view Blvd Pacific Grove CA 93950 USA
| | - Daniel J. Barshis
- Department of Biology Stanford University Hopkins Marine Station 120 Ocean view Blvd Pacific Grove CA 93950 USA
| | - François Seneca
- Department of Biology Stanford University Hopkins Marine Station 120 Ocean view Blvd Pacific Grove CA 93950 USA
| | - Hannah Jaris
- Department of Biology Stanford University Hopkins Marine Station 120 Ocean view Blvd Pacific Grove CA 93950 USA
| | - Nina Overgaard Therkildsen
- National Institute of Aquatic Resources Technical University of Denmark Vejlsøvej 39 8600 Silkeborg Denmark
| | | | - Stephen R. Palumbi
- Department of Biology Stanford University Hopkins Marine Station 120 Ocean view Blvd Pacific Grove CA 93950 USA
|
47
|
Fouhy F, Ross RP, Fitzgerald GF, Stanton C, Cotter PD. Composition of the early intestinal microbiota: knowledge, knowledge gaps and the use of high-throughput sequencing to address these gaps. Gut Microbes 2012; 3:203-20. [PMID: 22572829 PMCID: PMC3427213 DOI: 10.4161/gmic.20169] [Citation(s) in RCA: 156] [Impact Index Per Article: 13.0] [Indexed: 02/06/2023] Open
Abstract
The colonization, development and maturation of the newborn gastrointestinal tract, which begins immediately at birth and continues for two years, is modulated by numerous factors including mode of delivery, feeding regime, maternal diet/weight, probiotic and prebiotic use and antibiotic exposure pre-, peri- and post-natally. While in the past culture-based approaches were used to assess the impact of these factors on the gut microbiota, these have now largely been replaced by culture-independent DNA-based approaches and, most recently, high-throughput sequencing-based forms thereof. The aim of this review is to summarize recent research into the modulatory factors that affect the acquisition and development of the infant gut microbiota, to outline the knowledge recently gained through the use of culture-independent techniques and, in particular, to highlight advances in high-throughput sequencing and how these technologies have filled, and will continue to fill, gaps in our knowledge of the human intestinal microbiota.
Affiliation(s)
- Fiona Fouhy
- Teagasc Food Research Centre; Moorepark; Fermoy, Cork Ireland, Microbiology Department; University College Cork; Cork, Ireland
| | - R. Paul Ross
- Teagasc Food Research Centre; Moorepark; Fermoy, Cork Ireland, Alimentary Pharmabiotic Centre; Cork, Ireland
| | - Gerald F. Fitzgerald
- Microbiology Department; University College Cork; Cork, Ireland, Alimentary Pharmabiotic Centre; Cork, Ireland
| | - Catherine Stanton
- Teagasc Food Research Centre; Moorepark; Fermoy, Cork Ireland, Alimentary Pharmabiotic Centre; Cork, Ireland, Correspondence to: Catherine Stanton, and Paul D. Cotter,
| | - Paul D. Cotter
- Teagasc Food Research Centre; Moorepark; Fermoy, Cork Ireland, Alimentary Pharmabiotic Centre; Cork, Ireland, Correspondence to: Catherine Stanton, and Paul D. Cotter,
|
48
|
Medvedev P, Pham S, Chaisson M, Tesler G, Pevzner P. Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. J Comput Biol 2011; 18:1625-34. [PMID: 21999285 DOI: 10.1089/cmb.2011.0151] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.8] [Indexed: 11/13/2022] Open
Abstract
The recent proliferation of next generation sequencing with short reads has enabled many new experimental opportunities but, at the same time, has raised formidable computational challenges in genome assembly. One of the key advances that has led to an improvement in contig lengths has been mate pairs, which facilitate the assembly of repeating regions. Mate pairs have been algorithmically incorporated into most next generation assemblers as various heuristic post-processing steps to correct the assembly graph or to link contigs into scaffolds. Such methods have allowed the identification of longer contigs than would be possible with single reads; however, they can still fail to resolve complex repeats. Thus, improved methods for incorporating mate pairs will have a strong effect on contig length in the future. Here, we introduce the paired de Bruijn graph, a generalization of the de Bruijn graph that incorporates mate pair information into the graph structure itself instead of analyzing mate pairs at a post-processing step. This graph has the potential to be used in place of the de Bruijn graph in any de Bruijn graph based assembler, maintaining all other assembly steps such as error-correction and repeat resolution. Through assembly results on simulated perfect data, we argue that this can effectively improve the contig sizes in assembly.
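The contrast between an ordinary de Bruijn graph and one whose nodes carry mate-pair information can be sketched on a toy sequence. This is a minimal illustration under strong assumptions, not the authors' implementation: paired k-mers are taken at an exactly known distance `d` from an error-free sequence, and the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Ordinary de Bruijn graph: (k-1)-mer prefix -> list of (k-1)-mer suffixes."""
    graph = defaultdict(list)
    for km in kmers:
        graph[km[:-1]].append(km[1:])
    return dict(graph)


def paired_kmers(genome, k, d):
    """Return (a, b) pairs of k-mers where b starts exactly d positions after a."""
    return [(genome[i:i + k], genome[i + d:i + d + k])
            for i in range(len(genome) - d - k + 1)]


def paired_de_bruijn(pairs):
    """Paired graph: nodes are pairs of (k-1)-mers, one edge per paired k-mer."""
    graph = defaultdict(list)
    for a, b in pairs:
        graph[(a[:-1], b[:-1])].append((a[1:], b[1:]))
    return dict(graph)


if __name__ == "__main__":
    genome, k, d = "ACGTACGA", 3, 2  # toy sequence with the repeat "ACG"
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    print(de_bruijn(kmers)["CG"])                      # node "CG" branches after the repeat
    print(paired_de_bruijn(paired_kmers(genome, k, d)))  # nodes are split by their mates
```

In the ordinary graph the node `CG` has two successors because the repeat `ACG` is followed by different bases; in the paired graph the same prefix is split into distinct nodes by its mate label, which is the intuition behind using mate pairs in the graph structure itself.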
Affiliation(s)
- Paul Medvedev
- Department of Computer Science and Engineering, University of California, San Diego, California, USA.
|
49
|
Kim Y, Jo AR, Jang DH, Cho YJ, Chun J, Min BM, Choi Y. Toll-like receptor 9 mediates oral bacteria-induced IL-8 expression in gingival epithelial cells. Immunol Cell Biol 2011; 90:655-63. [PMID: 21968713 DOI: 10.1038/icb.2011.85] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Indexed: 01/10/2023]
Abstract
Previously, we reported that various oral bacteria regulate interleukin (IL)-8 production differently in gingival epithelial cells. The aim of this study was to characterize the pattern recognition receptor(s) that mediate bacteria-induced IL-8 expression. Among ligands that mimic bacterial components, only a Toll-like receptor (TLR) 9 ligand enhanced IL-8 expression as determined by ELISA. Both normal and immortalized human gingival epithelial (HOK-16B) cells expressed TLR9 intracellularly and showed enhanced IL-8 expression in response to CpG-oligonucleotide. The ability of eight strains of four oral bacterial species to induce IL-8 expression in HOK-16B cells, and their invasion capacity were examined in the absence or presence of 2% human serum. The ability of purified bacterial DNA (bDNA) to induce IL-8 was also examined. Six out of eight strains increased IL-8 production in the absence of serum. Usage of an endosomal acidification blocker or a TLR9 antagonist inhibited the IL-8 induction by two potent strains. In the presence of serum, many strains lost the ability to induce IL-8 and presented substantially reduced invasion capacity. The IL-8-inducing ability of bacteria in the absence or presence of serum showed a strong positive correlation with their invasion index. The IL-8-inducing ability of bacteria in the absence of human serum was also correlated with the immunostimulatory activity of its bDNA. The observed immunostimulatory activity of the bDNA could not be linked to its CpG motif content. In conclusion, oral bacteria induce IL-8 in gingival epithelial cells through TLR9 and the IL-8-inducing ability depends on the invasive capacity and immunostimulating DNA.
Affiliation(s)
- Youngsook Kim
- Department of Oromaxillofacial Infection & Immunity, School of Dentistry and Dental Research Institute, Seoul National University, Seoul, Republic of Korea
|
50
|
Jackson SA, Iwata A, Lee SH, Schmutz J, Shoemaker R. Sequencing crop genomes: approaches and applications. New Phytologist 2011; 191:915-925. [PMID: 21707621 DOI: 10.1111/j.1469-8137.2011.03804.x] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.6] [Indexed: 05/11/2023]
Abstract
Many challenges face plant scientists, in particular those working on crop production, such as a projected increase in population, decrease in water and arable land, changes in weather patterns and predictability. Advances in genome sequencing and resequencing can and should play a role in our response to meeting these challenges. However, several barriers prevent rapid and effective deployment of these tools to a wide variety of crops. Because of the complexity of crop genomes, de novo sequencing with next-generation sequencing technologies is a process fraught with difficulties that then create roadblocks to the utilization of these genome sequences for crop improvement. Collecting rapid and accurate phenotypes in crop plants is a hindrance to integrating genomics with crop improvement, and advances in informatics are needed to put these tools in the hands of the scientists on the ground.
Affiliation(s)
- Scott A Jackson
- Institute for Plant Breeding, Genetics and Genomics, University of Georgia, 111 Riverbend Rd, Athens, GA 30602, USA
| | - Aiko Iwata
- Institute for Plant Breeding, Genetics and Genomics, University of Georgia, 111 Riverbend Rd, Athens, GA 30602, USA
| | - Suk-Ha Lee
- Department of Plant Science and Research Institute for Agriculture and Life Sciences, Seoul National University, Seoul 151-921, Korea
| | - Jeremy Schmutz
- HudsonAlpha Genome Sequencing Center, Huntsville, AL 35806, USA
| | - Randy Shoemaker
- USDA-ARS, Corn Insects and Crop Genetics Research Unit, Ames, IA 50011, USA
|