301
|
Mirza B, Wang W, Wang J, Choi H, Chung NC, Ping P. Machine Learning and Integrative Analysis of Biomedical Big Data. Genes (Basel) 2019; 10:E87. [PMID: 30696086 PMCID: PMC6410075 DOI: 10.3390/genes10020087] [Citation(s) in RCA: 176] [Impact Index Per Article: 29.3] [Received: 12/02/2018] [Revised: 01/08/2019] [Accepted: 01/21/2019] [Indexed: 12/11/2022] Open
Abstract
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Affiliation(s)
- Bilal Mirza
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Wei Wang
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Scalable Analytics Institute (ScAi), University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Jie Wang
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Howard Choi
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Neo Christopher Chung
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland.
- Peipei Ping
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Scalable Analytics Institute (ScAi), University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Medicine (Cardiology), University of California Los Angeles, Los Angeles, CA 90095, USA.
|
302
|
Abstract
We polled the Editorial Board of Genome Biology to ask where they see genomics going in the next few years. Here are some of their responses.
|
303
|
Marchet C, Lecompte L, Silva CD, Cruaud C, Aury JM, Nicolas J, Peterlongo P. De novo clustering of long reads by gene from transcriptomics data. Nucleic Acids Res 2019; 47:e2. [PMID: 30260405 PMCID: PMC6326815 DOI: 10.1093/nar/gky834] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 04/06/2018] [Revised: 09/04/2018] [Accepted: 09/10/2018] [Indexed: 02/07/2023] Open
Abstract
Long-read sequencing currently provides sequences of several thousand base pairs. It is therefore possible to obtain complete transcripts, offering an unprecedented vision of the cellular transcriptome. However the literature lacks tools for de novo clustering of such data, in particular for Oxford Nanopore Technologies reads, because of the inherent high error rate compared to short reads. Our goal is to process reads from whole transcriptome sequencing data accurately and without a reference genome in order to reliably group reads coming from the same gene. This de novo approach is therefore particularly suitable for non-model species, but can also serve as a useful pre-processing step to improve read mapping. Our contribution both proposes a new algorithm adapted to clustering of reads by gene and a practical and free access tool that allows to scale the complete processing of eukaryotic transcriptomes. We sequenced a mouse RNA sample using the MinION device. This dataset is used to compare our solution to other algorithms used in the context of biological clustering. We demonstrate that it is the best approach for transcriptomics long reads. When a reference is available to enable mapping, we show that it stands as an alternative method that predicts complementary clusters.
Affiliation(s)
- Corinne Da Silva
  - Commissariat à l’Énergie Atomique (CEA), Institut de Biologie François Jacob, Genoscope, 91000 Evry, France
- Corinne Cruaud
  - Commissariat à l’Énergie Atomique (CEA), Institut de Biologie François Jacob, Genoscope, 91000 Evry, France
- Jean-Marc Aury
  - Commissariat à l’Énergie Atomique (CEA), Institut de Biologie François Jacob, Genoscope, 91000 Evry, France
|
304
|
Single-Molecule Sequencing: Towards Clinical Applications. Trends Biotechnol 2019; 37:72-85. [DOI: 10.1016/j.tibtech.2018.07.013] [Citation(s) in RCA: 112] [Impact Index Per Article: 18.7] [Received: 05/04/2018] [Revised: 07/16/2018] [Accepted: 07/18/2018] [Indexed: 12/31/2022]
|
305
|
Zascavage RR, Thorson K, Planz JV. Nanopore sequencing: An enrichment-free alternative to mitochondrial DNA sequencing. Electrophoresis 2019; 40:272-280. [PMID: 30511783 PMCID: PMC6590251 DOI: 10.1002/elps.201800083] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 02/14/2018] [Revised: 10/25/2018] [Accepted: 11/03/2018] [Indexed: 12/31/2022]
Abstract
Mitochondrial DNA sequence data are often utilized in disease studies, conservation genetics and forensic identification. The current approaches for sequencing the full mtGenome typically require several rounds of PCR enrichment during Sanger or MPS protocols followed by fairly tedious assembly and analysis. Here we describe an efficient approach to sequencing directly from genomic DNA samples without prior enrichment or extensive library preparation steps. A comparison is made between libraries sequenced directly from native DNA and the same samples sequenced from libraries generated with nine overlapping mtDNA amplicons on the Oxford Nanopore MinION™ device. The native and amplicon library preparation methods and alternative base calling strategies were assessed to establish error rates and identify trends of discordance between the two library preparation approaches. For the complete mtGenome, 16 569 nucleotides, an overall error rate of approximately 1.00% was observed. As expected with mtDNA, the majority of error was detected in homopolymeric regions. The use of a modified basecaller that corrects for ambiguous signal in homopolymeric stretches reduced the error rate for both library preparation methods to approximately 0.30%. Our study indicates that direct mtDNA sequencing from native DNA on the MinION™ device provides comparable results to those obtained from common mtDNA sequencing methods and is a reliable alternative to approaches using PCR-enriched libraries.
Affiliation(s)
- Roxanne R. Zascavage
  - Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, Fort Worth, TX, USA
  - Department of Criminology and Criminal Justice, University of Texas at Arlington, Arlington, TX, USA
- Kelcie Thorson
  - Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, Fort Worth, TX, USA
  - Zoetis Inc., Parsippany, NJ, USA
- John V. Planz
  - Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, Fort Worth, TX, USA
|
306
|
Abstract
Somatic structural variants undoubtedly play important roles in driving tumourigenesis. This is evident despite the substantial technical challenges that remain in accurately detecting structural variants and their breakpoints in tumours and in spite of our incomplete understanding of the impact of structural variants on cellular function. Developments in these areas of research contribute to the ongoing discovery of structural variation with a clear impact on the evolution of the tumour and on the clinical importance to the patient. Recent large whole genome sequencing studies have reinforced our impression of each tumour as a unique combination of mutations but paradoxically have also discovered similar genome-wide patterns of single-nucleotide and structural variation between tumours. Statistical methods have been developed to deconvolute mutation patterns, or signatures, that recur across samples, providing information about the mutagens and repair processes that may be active in a given tumour. These signatures can guide treatment by, for example, highlighting vulnerabilities in a particular tumour to a particular chemotherapy. Thus, although the complete reconstruction of the full evolutionary trajectory of a tumour genome remains currently out of reach, valuable data are already emerging to improve the treatment of cancer.
Affiliation(s)
- Ailith Ewing
  - MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, EH4 2XU, UK
- Colin Semple
  - MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, EH4 2XU, UK
|
307
|
|
308
|
Lin YY, Wu PC, Chen PL, Oyang YJ, Chen CY. HAHap: a read-based haplotyping method using hierarchical assembly. PeerJ 2018; 6:e5852. [PMID: 30397550 PMCID: PMC6214236 DOI: 10.7717/peerj.5852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2018] [Accepted: 09/27/2018] [Indexed: 11/20/2022] Open
Abstract
Background: The need for read-based phasing arises with advances in sequencing technologies. The minimum error correction (MEC) approach is the primary trend to resolve haplotypes by reducing conflicts in a single nucleotide polymorphism-fragment matrix. However, it is frequently observed that the solution with the optimal MEC might not be the real haplotypes, due to the fact that MEC methods consider all positions together and sometimes the conflicts in noisy regions might mislead the selection of corrections. To tackle this problem, we present a hierarchical assembly-based method designed to progressively resolve local conflicts. Results: This study presents HAHap, a new phasing algorithm based on hierarchical assembly. HAHap leverages high-confident variant pairs to build haplotypes progressively. The phasing results by HAHap on both real and simulated data, compared to other MEC-based methods, revealed better phasing error rates for constructing haplotypes using short reads from whole-genome sequencing. We compared the number of error corrections (ECs) on real data with other methods, and it reveals the ability of HAHap to predict haplotypes with a lower number of ECs. We also used simulated data to investigate the behavior of HAHap under different sequencing conditions, highlighting the applicability of HAHap in certain situations.
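As an illustration of the minimum error correction (MEC) objective named in the abstract above, the following minimal sketch scores one candidate haplotype against a small SNP-fragment matrix. It is not HAHap's algorithm: the function names (`mismatches`, `mec_score`), the `0`/`1`/`-` allele encoding, and the toy fragments are all assumptions made for this example.

```python
def mismatches(fragment: str, haplotype: str) -> int:
    """Count positions where a read fragment disagrees with a haplotype.

    Positions marked '-' are not covered by the fragment and are ignored.
    """
    return sum(1 for f, h in zip(fragment, haplotype) if f != "-" and f != h)


def mec_score(fragments: list[str], haplotype: str) -> int:
    """MEC score of a diploid phasing (haplotype plus its complement).

    Each fragment is assigned to whichever of the two haplotypes it
    conflicts with least; the score is the total number of allele calls
    that would have to be corrected to remove every conflict.
    """
    complement = "".join("1" if a == "0" else "0" for a in haplotype)
    return sum(
        min(mismatches(f, haplotype), mismatches(f, complement))
        for f in fragments
    )


# Hypothetical SNP-fragment matrix: rows are reads, columns are SNP sites.
fragments = ["01-0", "10-1", "-110", "0-00"]
score = mec_score(fragments, "0110")
print(score)
```

A MEC-based phaser searches for the haplotype that minimizes this score; the abstract's point is that the minimizing solution is not always the true haplotype pair when some regions are noisy.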
Affiliation(s)
- Yu-Yu Lin
  - Department of Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
- Ping Chun Wu
  - Taipei Blood Center, Taiwan Blood Services Foundation, Taipei, Taiwan
- Pei-Lung Chen
  - Graduate Institute of Medical Genomics and Proteomics, College of Medicine, National Taiwan University, Taipei, Taiwan
- Yen-Jen Oyang
  - Department of Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
- Chien-Yu Chen
  - Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, Taiwan
|
309
|
Hehir-Kwa JY, Tops BBJ, Kemmeren P. The clinical implementation of copy number detection in the age of next-generation sequencing. Expert Rev Mol Diagn 2018; 18:907-915. [PMID: 30221560 DOI: 10.1080/14737159.2018.1523723] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 12/18/2022]
Abstract
Introduction: The role of copy number variants (CNVs) in disease is now well established. In parallel NGS technologies, such as long-read technologies, there is continual development and data analysis methods continue to be refined. Clinical exome sequencing data is now a reality for many diagnostic laboratories in both congenital genetics and oncology. This provides the ability to detect and report both SNVs and structural variants, including CNVs, using a single assay for a wide range of patient cohorts. Areas covered: Currently, whole-genome sequencing is mainly restricted to research applications and clinical utility studies. Furthermore, detecting the full-size spectrum of CNVs as well as somatic events remains difficult for both exome and whole-genome sequencing. As a result, the full extent of genomic variants in an individual's genome is still largely unknown. Recently, new sequencing technologies have been introduced which maintain the long-range genomic context, aiding the detection of CNVs and structural variants. Expert commentary: The development of long-read sequencing promises to resolve many CNV and SV detection issues but is yet to become established. The current challenge for clinical CNV detection is how to fully exploit all the data which is generated by high throughput sequencing technologies.
Affiliation(s)
- Jayne Y Hehir-Kwa
  - Princess Máxima Center for Pediatric Oncology, Utrecht, Netherlands
- Bastiaan B J Tops
  - Princess Máxima Center for Pediatric Oncology, Utrecht, Netherlands
- Patrick Kemmeren
  - Princess Máxima Center for Pediatric Oncology, Utrecht, Netherlands
|
310
|
Piégu B, Arensburger P, Guillou F, Bigot Y. But where did the centromeres go in the chicken genome models? Chromosome Res 2018; 26:297-306. [PMID: 30225548 DOI: 10.1007/s10577-018-9585-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 07/17/2018] [Revised: 08/31/2018] [Accepted: 09/03/2018] [Indexed: 11/30/2022]
Abstract
The chicken genome was the third vertebrate to be sequenced. To date, its sequence and feature annotations are used as the reference for avian models in genome sequencing projects developed on birds and other Sauropsida species, and in genetic studies of domesticated birds of economic and evolutionary biology interest. Therefore, an accurate description of this genome model is important to a wide number of scientists. Here, we review the location and features of a very basic element, the centromeres of chromosomes in the galGal5 genome model. Centromeres are elements that are not determined by their DNA sequence but by their epigenetic status, in particular by the accumulation of the histone-like protein CENP-A. Comparison of data from several public sources (primarily marker probes flanking centromeres using fluorescent in situ hybridization done on giant lampbrush chromosomes and CENP-A ChIP-seq datasets) with galGal5 annotations revealed that centromeres are likely inappropriately mapped in 9 of the 16 galGal5 chromosome models in which they are described. Analysis of karyology data confirmed that the location of the main CENP-A peaks in chromosomes is the best means of locating the centromeres in 25 galGal5 chromosome models, the majority of which (16) are fully sequenced and assembled. This data re-analysis reaffirms that several sources of information should be examined to produce accurate genome annotations, particularly for basic structures such as centromeres that are epigenetically determined.
Affiliation(s)
- Benoît Piégu
  - PRC, UMR INRA0085, CNRS 7247, Centre INRA Val de Loire, 37380, Nouzilly, France
- Peter Arensburger
  - Biological Sciences Department, California State Polytechnic University, Pomona, CA, 91768, USA
- Florian Guillou
  - PRC, UMR INRA0085, CNRS 7247, Centre INRA Val de Loire, 37380, Nouzilly, France
- Yves Bigot
  - PRC, UMR INRA0085, CNRS 7247, Centre INRA Val de Loire, 37380, Nouzilly, France
|
311
|
Peona V, Weissensteiner MH, Suh A. How complete are “complete” genome assemblies?-An avian perspective. Mol Ecol Resour 2018; 18:1188-1195. [DOI: 10.1111/1755-0998.12933] [Citation(s) in RCA: 84] [Impact Index Per Article: 12.0] [Received: 03/19/2018] [Revised: 06/11/2018] [Accepted: 07/06/2018] [Indexed: 12/26/2022]
Affiliation(s)
- Valentina Peona
  - Department of Evolutionary Biology; Evolutionary Biology Centre; Uppsala University; Uppsala Sweden
- Matthias H. Weissensteiner
  - Department of Evolutionary Biology; Evolutionary Biology Centre; Uppsala University; Uppsala Sweden
  - Division of Evolutionary Biology; Faculty of Biology; Ludwig-Maximilian University of Munich; Planegg-Martinsried Germany
- Alexander Suh
  - Department of Evolutionary Biology; Evolutionary Biology Centre; Uppsala University; Uppsala Sweden
|
312
|
Rogers J. Adding resolution and dimensionality to comparative genomics: moving from reference genomes to clade genomics. Genome Biol 2018; 19:115. [PMID: 30107805 PMCID: PMC6090731 DOI: 10.1186/s13059-018-1500-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022] Open
Abstract
The main goal and promise of comparative genomics has been to create a comprehensive catalog of genomic information and function across the phenomenal diversity of living systems. A recent study has demonstrated the evolutionary insights possible by generating high-quality whole-genome assemblies from multiple species of a clade.
Affiliation(s)
- Jeffrey Rogers
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA.
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.
|
313
|
Guppy JL, Jones DB, Jerry DR, Wade NM, Raadsma HW, Huerlimann R, Zenger KR. The State of "Omics" Research for Farmed Penaeids: Advances in Research and Impediments to Industry Utilization. Front Genet 2018; 9:282. [PMID: 30123237 PMCID: PMC6085479 DOI: 10.3389/fgene.2018.00282] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 05/14/2018] [Accepted: 07/09/2018] [Indexed: 12/19/2022] Open
Abstract
Elucidating the underlying genetic drivers of production traits in agricultural and aquaculture species is critical to efforts to maximize farming efficiency. "Omics" based methods (i.e., transcriptomics, genomics, proteomics, and metabolomics) are increasingly being applied to gain unprecedented insight into the biology of many aquaculture species. While the culture of penaeid shrimp has increased markedly, the industry continues to be impeded in many regards by disease, reproductive dysfunction, and a poor understanding of production traits. Extensive effort has been, and continues to be, applied to develop critical genomic resources for many commercially important penaeids. However, the industry application of these genomic resources, and the translation of the knowledge derived from "omics" studies has not yet been completely realized. Integration between the multiple "omics" resources now available (i.e., genome assemblies, transcriptomes, linkage maps, optical maps, and proteomes) will prove critical to unlocking the full utility of these otherwise independently developed and isolated resources. Furthermore, emerging "omics" based techniques are now available to address longstanding issues with completing keystone genome assemblies (e.g., through long-read sequencing), and can provide cost-effective industrial scale genotyping tools (e.g., through low density SNP chips and genotype-by-sequencing) to undertake advanced selective breeding programs (i.e., genomic selection) and powerful genome-wide association studies. In particular, this review highlights the status, utility and suggested path forward for continued development, and improved use of "omics" resources in penaeid aquaculture.
Affiliation(s)
- Jarrod L. Guppy
  - Australian Research Council Industrial Transformation Research Hub for Advanced Prawn Breeding, James Cook University, Townsville, QLD, Australia
  - College of Science and Engineering and Centre for Sustainable Tropical Fisheries and Aquaculture, James Cook University, Townsville, QLD, Australia
- David B. Jones
  - Australian Research Council Industrial Transformation Research Hub for Advanced Prawn Breeding, James Cook University, Townsville, QLD, Australia
  - College of Science and Engineering and Centre for Sustainable Tropical Fisheries and Aquaculture, James Cook University, Townsville, QLD, Australia
- Dean R. Jerry
  - Australian Research Council Industrial Transformation Research Hub for Advanced Prawn Breeding, James Cook University, Townsville, QLD, Australia
  - College of Science and Engineering and Centre for Sustainable Tropical Fisheries and Aquaculture, James Cook University, Townsville, QLD, Australia
- Nicholas M. Wade
  - Australian Research Council Industrial Transformation Research Hub for Advanced Prawn Breeding, James Cook University, Townsville, QLD, Australia
  - Aquaculture Program, CSIRO Agriculture & Food, Queensland Bioscience Precinct, St Lucia, QLD, Australia
- Herman W. Raadsma
  - Australian Research Council Industrial Transformation Research Hub for Advanced Prawn Breeding, James Cook University, Townsville, QLD, Australia
  - Faculty of Science, Sydney School of Veterinary Science, The University of Sydney, Camden, NSW, Australia
- Roger Huerlimann
  - Australian Research Council Industrial Transformation Research Hub for Advanced Prawn Breeding, James Cook University, Townsville, QLD, Australia
  - College of Science and Engineering and Centre for Sustainable Tropical Fisheries and Aquaculture, James Cook University, Townsville, QLD, Australia
- Kyall R. Zenger
  - Australian Research Council Industrial Transformation Research Hub for Advanced Prawn Breeding, James Cook University, Townsville, QLD, Australia
  - College of Science and Engineering and Centre for Sustainable Tropical Fisheries and Aquaculture, James Cook University, Townsville, QLD, Australia
|
314
|
Beretta S, Patterson MD, Zaccaria S, Della Vedova G, Bonizzoni P. HapCHAT: adaptive haplotype assembly for efficiently leveraging high coverage in long reads. BMC Bioinformatics 2018; 19:252. [PMID: 29970002 PMCID: PMC6029272 DOI: 10.1186/s12859-018-2253-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 11/15/2017] [Accepted: 06/18/2018] [Indexed: 01/08/2023] Open
Abstract
Background: Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes since their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue which may be mitigated, however, with larger sets of reads, when this error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads deal only with limited coverages. Results: Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverages. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This allows our method to significantly reduce the required computational resources, allowing to consider datasets composed of higher coverages. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60× coverage reveals improvements in accuracy and recall achieved by considering a higher coverage with lower runtimes. Conclusions: Our method leverages the long-range information of sequencing reads that allows to obtain assembled haplotypes fragmented in a lower number of unphased haplotype blocks. At the same time, our method is also able to deal with higher coverages to better correct the errors in the original reads and to obtain more accurate haplotypes as a result.
Availability: HapCHAT is available at http://hapchat.algolab.eu under the GNU Public License (GPL).
Affiliation(s)
- Stefano Beretta
  - Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
- Murray D Patterson
  - Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
- Simone Zaccaria
  - Department of Computer Science, Princeton University, Princeton, New Jersey, USA
- Gianluca Della Vedova
  - Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
- Paola Bonizzoni
  - Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
|
315
|
van Dijk EL, Jaszczyszyn Y, Naquin D, Thermes C. The Third Revolution in Sequencing Technology. Trends Genet 2018; 34:666-681. [PMID: 29941292 DOI: 10.1016/j.tig.2018.05.008] [Citation(s) in RCA: 637] [Impact Index Per Article: 91.0] [Received: 03/23/2018] [Revised: 05/18/2018] [Accepted: 05/29/2018] [Indexed: 12/16/2022]
Abstract
Forty years ago the advent of Sanger sequencing was revolutionary as it allowed complete genome sequences to be deciphered for the first time. A second revolution came when next-generation sequencing (NGS) technologies appeared, which made genome sequencing much cheaper and faster. However, NGS methods have several drawbacks and pitfalls, most notably their short reads. Recently, third-generation/long-read methods appeared, which can produce genome assemblies of unprecedented quality. Moreover, these technologies can directly detect epigenetic modifications on native DNA and allow whole-transcript sequencing without the need for assembly. This marks the third revolution in sequencing technology. Here we review and compare the various long-read methods. We discuss their applications and their respective strengths and weaknesses and provide future perspectives.
Affiliation(s)
- Erwin L van Dijk
  - Institute for Integrative Biology of the Cell, UMR9198, CNRS CEA Université Paris-Sud, Université Paris-Saclay, 9198 Gif sur Yvette Cedex, France
- Yan Jaszczyszyn
  - Institute for Integrative Biology of the Cell, UMR9198, CNRS CEA Université Paris-Sud, Université Paris-Saclay, 9198 Gif sur Yvette Cedex, France
- Delphine Naquin
  - Institute for Integrative Biology of the Cell, UMR9198, CNRS CEA Université Paris-Sud, Université Paris-Saclay, 9198 Gif sur Yvette Cedex, France
- Claude Thermes
  - Institute for Integrative Biology of the Cell, UMR9198, CNRS CEA Université Paris-Sud, Université Paris-Saclay, 9198 Gif sur Yvette Cedex, France
|
316
|
Bourne SD, Hudson J, Holman LE, Rius M. Marine Invasion Genomics: Revealing Ecological and Evolutionary Consequences of Biological Invasions. Population Genomics 2018. [DOI: 10.1007/13836_2018_21] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 12/16/2022]
|
317
|
Kyriakidou M, Tai HH, Anglin NL, Ellis D, Strömvik MV. Current Strategies of Polyploid Plant Genome Sequence Assembly. Front Plant Sci 2018; 9:1660. [PMID: 30519250 PMCID: PMC6258962 DOI: 10.3389/fpls.2018.01660] [Citation(s) in RCA: 114] [Impact Index Per Article: 16.3] [Received: 04/13/2018] [Accepted: 10/25/2018] [Indexed: 05/14/2023]
Abstract
Polyploidy or duplication of an entire genome occurs in the majority of angiosperms. The understanding of polyploid genomes is important for the improvement of those crops, which humans rely on for sustenance and basic nutrition. As climate change continues to pose a potential threat to agricultural production, there will increasingly be a demand for plant cultivars that can resist biotic and abiotic stresses and also provide needed and improved nutrition. In the past decade, Next Generation Sequencing (NGS) has fundamentally changed the genomics landscape by providing tools for the exploration of polyploid genomes. Here, we review the challenges of the assembly of polyploid plant genomes, and also present recent advances in genomic resources and functional tools in molecular genetics and breeding. As genomes of diploid and less heterozygous progenitor species are increasingly available, we discuss the lack of complexity of these currently available reference genomes as they relate to polyploid crops. Finally, we review recent approaches of haplotyping by phasing and the impact of third generation technologies on polyploid plant genome assembly.
Affiliation(s)
- Maria Kyriakidou
  - Department of Plant Science, McGill University, Montreal, QC, Canada
- Helen H. Tai
  - Fredericton Research and Development Centre, Agriculture and Agri-Food Canada, Fredericton, NB, Canada
- Martina V. Strömvik
  - Department of Plant Science, McGill University, Montreal, QC, Canada
  - *Correspondence: Martina V. Strömvik
|
318
|
Luikart G, Kardos M, Hand BK, Rajora OP, Aitken SN, Hohenlohe PA. Population Genomics: Advancing Understanding of Nature. Population Genomics 2018. [DOI: 10.1007/13836_2018_60] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Indexed: 02/07/2023]
|