401
Ma JE, Jiang HY, Li LM, Zhang XJ, Li HM, Li GY, Mo DY, Chen JP. SMRT sequencing of the full-length transcriptome of the Sunda pangolin (Manis javanica). Gene 2019; 692:208-216. [PMID: 30664913 DOI: 10.1016/j.gene.2019.01.008] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 09/26/2018] [Revised: 12/26/2018] [Accepted: 01/11/2019] [Indexed: 10/27/2022]
Abstract
It is widely known that transcriptional diversity contributes greatly to biological regulation in eukaryotes. With the development of next-generation sequencing (NGS) technologies, several studies on RNA sequencing have considerably improved our understanding of transcriptome complexity. However, obtaining full-length (FL) transcripts remains a considerable challenge because of difficulties in short read-based assembly. In the present study, single-molecule real-time (SMRT) sequencing and NGS were combined to generate the complete and FL transcriptome of Manis javanica. The results provide a comprehensive set of reference transcripts and hence contribute to the improved annotation of the M. javanica genome. We obtained 45,530 high-confidence transcripts from 19,109 genic loci, of which 8014 genes have not yet been annotated within the M. javanica genome. Furthermore, we revealed 8824 long-chain noncoding RNAs (lncRNAs). A total of 30,199 alternative splicing (AS) and 11,184 alternative polyadenylation (APA) events were identified in the sequencing data. The structure and expression level of 59 digestive enzyme genes, including 13 carbohydrase genes, 28 lipase genes and 18 protease genes, were analyzed, which might provide original data for further research on M. javanica.
Affiliation(s)
- Jing-E Ma
- Guangdong Key Laboratory of Animal Conservation and Resource Utilization, Guangdong Public Laboratory of Wild Animal Conservation and Utilization, Guangdong Institute of Applied Biological Resources, Guangzhou, Guangdong, China
- Hai-Ying Jiang
- Guangdong Key Laboratory of Animal Conservation and Resource Utilization, Guangdong Public Laboratory of Wild Animal Conservation and Utilization, Guangdong Institute of Applied Biological Resources, Guangzhou, Guangdong, China
- Lin-Miao Li
- Guangdong Key Laboratory of Animal Conservation and Resource Utilization, Guangdong Public Laboratory of Wild Animal Conservation and Utilization, Guangdong Institute of Applied Biological Resources, Guangzhou, Guangdong, China
- Xiu-Juan Zhang
- Guangdong Key Laboratory of Animal Conservation and Resource Utilization, Guangdong Public Laboratory of Wild Animal Conservation and Utilization, Guangdong Institute of Applied Biological Resources, Guangzhou, Guangdong, China
- Hui-Ming Li
- Guangdong Key Laboratory of Animal Conservation and Resource Utilization, Guangdong Public Laboratory of Wild Animal Conservation and Utilization, Guangdong Institute of Applied Biological Resources, Guangzhou, Guangdong, China
- Guan-Yu Li
- Guangdong Key Laboratory of Animal Conservation and Resource Utilization, Guangdong Public Laboratory of Wild Animal Conservation and Utilization, Guangdong Institute of Applied Biological Resources, Guangzhou, Guangdong, China
- Da-Ying Mo
- Guangdong Key Laboratory of Animal Conservation and Resource Utilization, Guangdong Public Laboratory of Wild Animal Conservation and Utilization, Guangdong Institute of Applied Biological Resources, Guangzhou, Guangdong, China
- Jin-Ping Chen
- Guangdong Key Laboratory of Animal Conservation and Resource Utilization, Guangdong Public Laboratory of Wild Animal Conservation and Utilization, Guangdong Institute of Applied Biological Resources, Guangzhou, Guangdong, China.
402
A Sequence-Based Novel Approach for Quality Evaluation of Third-Generation Sequencing Reads. Genes (Basel) 2019; 10:genes10010044. [PMID: 30646604 PMCID: PMC6356754 DOI: 10.3390/genes10010044] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/29/2018] [Revised: 01/07/2019] [Accepted: 01/08/2019] [Indexed: 11/19/2022]
Abstract
The advent of third-generation sequencing (TGS) technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, provides new possibilities for contig assembly, scaffolding, and high-performance computing in bioinformatics because of their long reads. However, the high error rate and poor quality of TGS reads pose new challenges for accurate genome assembly and long-read alignment. Efficient processing methods are needed to prioritize high-quality reads and so improve the results of error correction and assembly. In this study, we proposed a novel Read Quality Evaluation and Selection Tool (REQUEST) for evaluating the quality of third-generation long reads. REQUEST generates training data of high-quality and low-quality reads, which are characterized by their nucleotide combinations. A linear regression model was built to score the quality of reads. The method was tested on three datasets from different species. The results showed that the top-scored reads prioritized by REQUEST achieved higher alignment accuracies. The contig assembly results based on the top-scored reads also outperformed conventional approaches that use all reads. REQUEST is able to distinguish high-quality reads from low-quality ones without using reference genomes, making it a promising alternative to alignment-based algorithms for sequence-quality evaluation.
403
Bolger AM, Poorter H, Dumschott K, Bolger ME, Arend D, Osorio S, Gundlach H, Mayer KFX, Lange M, Scholz U, Usadel B. Computational aspects underlying genome to phenome analysis in plants. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2019; 97:182-198. [PMID: 30500991 PMCID: PMC6849790 DOI: 10.1111/tpj.14179] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Received: 07/24/2018] [Revised: 11/06/2018] [Accepted: 11/16/2018] [Indexed: 05/18/2023]
Abstract
Recent advances in genomics technologies have greatly accelerated the progress in both fundamental plant science and applied breeding research. Concurrently, high-throughput plant phenotyping is becoming widely adopted in the plant community, promising to alleviate the phenotypic bottleneck. While these technological breakthroughs are significantly accelerating quantitative trait locus (QTL) and causal gene identification, challenges to enable even more sophisticated analyses remain. In particular, care needs to be taken to standardize, describe and conduct experiments robustly while relying on plant physiology expertise. In this article, we review the state of the art regarding genome assembly and the future potential of pangenomics in plant research. We also describe the necessity of standardizing and describing phenotypic studies using the Minimum Information About a Plant Phenotyping Experiment (MIAPPE) standard to enable the reuse and integration of phenotypic data. In addition, we show how deep phenotypic data might yield novel trait-trait correlations and review how to link phenotypic data to genomic data. Finally, we provide perspectives on the golden future of machine learning and their potential in linking phenotypes to genomic features.
Affiliation(s)
- Anthony M. Bolger
- Institute for Biology I, BioSC, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
- Hendrik Poorter
- Forschungszentrum Jülich (FZJ), Institute of Bio- and Geosciences (IBG-2) Plant Sciences, Wilhelm-Johnen-Straße, 52428 Jülich, Germany
- Department of Biological Sciences, Macquarie University, North Ryde, NSW 2109, Australia
- Kathryn Dumschott
- Institute for Biology I, BioSC, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
- Marie E. Bolger
- Forschungszentrum Jülich (FZJ), Institute of Bio- and Geosciences (IBG-2) Plant Sciences, Wilhelm-Johnen-Straße, 52428 Jülich, Germany
- Daniel Arend
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstraße 3, 06466 Seeland, Germany
- Sonia Osorio
- Department of Molecular Biology and Biochemistry, Instituto de Hortofruticultura Subtropical y Mediterránea “La Mayora”, Universidad de Málaga-Consejo Superior de Investigaciones Científicas, Campus de Teatinos, 29071 Málaga, Spain
- Heidrun Gundlach
- Plant Genome and Systems Biology (PGSB), Helmholtz Zentrum München (HMGU), Ingolstädter Landstraße 1, 85764 Neuherberg, Germany
- Klaus F. X. Mayer
- Plant Genome and Systems Biology (PGSB), Helmholtz Zentrum München (HMGU), Ingolstädter Landstraße 1, 85764 Neuherberg, Germany
- Matthias Lange
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstraße 3, 06466 Seeland, Germany
- Uwe Scholz
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstraße 3, 06466 Seeland, Germany
- Björn Usadel
- Institute for Biology I, BioSC, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
- Forschungszentrum Jülich (FZJ), Institute of Bio- and Geosciences (IBG-2) Plant Sciences, Wilhelm-Johnen-Straße, 52428 Jülich, Germany
404
Cheng YW, Chen YM, Zhao QQ, Zhao X, Wu YR, Chen DZ, Liao LD, Chen Y, Yang Q, Xu LY, Li EM, Xu JZ. Long Read Single-Molecule Real-Time Sequencing Elucidates Transcriptome-Wide Heterogeneity and Complexity in Esophageal Squamous Cells. Front Genet 2019; 10:915. [PMID: 31636653 PMCID: PMC6787290 DOI: 10.3389/fgene.2019.00915] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 06/21/2019] [Accepted: 08/29/2019] [Indexed: 02/05/2023]
Abstract
Esophageal squamous cell carcinoma is a leading cause of cancer death. Mapping transcriptional landscapes, including isoforms, fusion transcripts, and long noncoding RNAs, has played a central role in understanding the regulatory mechanisms of malignant processes. However, canonical methods such as short-read RNA-seq struggle to define entire polyadenylated RNA molecules. Here, we combined single-molecule real-time sequencing with RNA-seq to generate high-quality long reads and to survey the transcriptional program in esophageal squamous cells. Compared with the recent annotations of the human transcriptome (Ensembl 38 release 91), single-molecule real-time data identified many unannotated transcripts, novel isoforms of known genes and an expanding repository of long intergenic noncoding RNAs (lincRNAs). By integrating with annotation of the lincRNA catalog, 1,521 esophageal-cancer-specific lincRNAs were defined from single-molecule real-time reads. Kyoto Encyclopedia of Genes and Genomes enrichment analysis indicated that these lincRNAs and their target genes are involved in a variety of cancer signaling pathways. Isoform usage analysis revealed shifted alternative splicing patterns, which can be recaptured from clinical samples or are supported by previous studies. Using rigorous search criteria, we also detected multiple transcript fusions that are not documented in current gene fusion databases or readily identified from RNA-seq reads. Two novel fusion transcripts were verified by real-time PCR and Sanger sequencing. Overall, our long-read single-molecule sequencing largely expands the current understanding of the full-length transcriptome in esophageal cells and provides novel insights into transcriptional diversity during oncogenic transformation.
Affiliation(s)
- Yin-Wei Cheng
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
- Department of Biochemistry and Molecular Biology, Shantou University Medical College, Shantou, China
- Computational Systems Biology Lab, Department of Bioinformatics, Shantou University Medical College (SUMC), Shantou, China
- Yun-Mei Chen
- Tianjin Novogene Bioinformatics Technology Co., Ltd, Tianjin, China
- Qian-Qian Zhao
- Computational Systems Biology Lab, Department of Bioinformatics, Shantou University Medical College (SUMC), Shantou, China
- Xing Zhao
- Computational Systems Biology Lab, Department of Bioinformatics, Shantou University Medical College (SUMC), Shantou, China
- Ya-Ru Wu
- Computational Systems Biology Lab, Department of Bioinformatics, Shantou University Medical College (SUMC), Shantou, China
- Dan-Ze Chen
- Computational Systems Biology Lab, Department of Bioinformatics, Shantou University Medical College (SUMC), Shantou, China
- Lian-Di Liao
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
- China Institute of Oncologic Pathology, Shantou University Medical College, Shantou, China
- Yang Chen
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
- Department of Biochemistry and Molecular Biology, Shantou University Medical College, Shantou, China
- Qian Yang
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
- Department of Biochemistry and Molecular Biology, Shantou University Medical College, Shantou, China
- Li-Yan Xu
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
- China Institute of Oncologic Pathology, Shantou University Medical College, Shantou, China
- *Correspondence: Li-Yan Xu; En-Min Li; Jian-Zhen Xu
- En-Min Li
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
- Department of Biochemistry and Molecular Biology, Shantou University Medical College, Shantou, China
- *Correspondence: Li-Yan Xu; En-Min Li; Jian-Zhen Xu
- Jian-Zhen Xu
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
- Computational Systems Biology Lab, Department of Bioinformatics, Shantou University Medical College (SUMC), Shantou, China
- *Correspondence: Li-Yan Xu; En-Min Li; Jian-Zhen Xu
405
Efficiency of PacBio long read correction by 2nd generation Illumina sequencing. Genomics 2019; 111:43-49. [DOI: 10.1016/j.ygeno.2017.12.011] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 07/28/2017] [Revised: 12/11/2017] [Accepted: 12/17/2017] [Indexed: 12/17/2022]
406
Luo S, Tang M, Frandsen PB, Stewart RJ, Zhou X. The genome of an underwater architect, the caddisfly Stenopsyche tienmushanensis Hwang (Insecta: Trichoptera). Gigascience 2018; 7:5202446. [PMID: 30476205 PMCID: PMC6302954 DOI: 10.1093/gigascience/giy143] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 07/02/2018] [Accepted: 11/15/2018] [Indexed: 12/18/2022]
Abstract
Background: Caddisflies (Insecta: Trichoptera) are a highly adapted freshwater group of insects that split from a common ancestor with Lepidoptera. They are the most diverse (>16,000 species) of the strictly aquatic insect orders and are widely employed as bio-indicators in water quality assessment and monitoring. Among the numerous adaptations to aquatic habitats, caddisfly larvae use silk and materials from the environment (e.g., stones, sticks, leaf matter) to build composite structures such as fixed retreats and portable cases. Understanding how caddisflies have adapted to aquatic habitats will help explain the evolution and subsequent diversification of the group. Findings: We sequenced a retreat-builder caddisfly Stenopsyche tienmushanensis Hwang and assembled a high-quality genome from both Illumina and Pacific Biosciences (PacBio) sequencing. In total, 601.2 M Illumina reads (90.2 Gb) and 16.9 M PacBio subreads (89.0 Gb) were generated. The 451.5 Mb assembled genome has a contig N50 of 1.29 Mb, has a longest contig of 4.76 Mb, and covers 97.65% of the 1,658 insect single-copy genes as assessed by Benchmarking Universal Single-Copy Orthologs. The genome comprises 36.76% repetitive elements. A total of 14,672 predicted protein-coding genes were identified. The genome revealed gene expansions in specific groups of the cytochrome P450 family and olfactory binding proteins, suggesting potential genomic features associated with pollutant tolerance and mate finding. In addition, the complete gene complex of the highly repetitive H-fibroin, the major protein component of caddisfly larval silk, was assembled. Conclusions: We report the draft genome of Stenopsyche tienmushanensis, the highest-quality caddisfly genome so far. The genome information will be an important resource for the study of caddisflies and may shed light on the evolution of aquatic insects.
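For readers unfamiliar with the contig N50 statistic quoted in this abstract, a minimal Python sketch follows. It assumes the common definition (the length of the contig at which the longest contigs, taken in descending order, first cover at least half of the total assembly length). The function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Assumed definition: sort contigs from longest to shortest and
    return the length of the contig at which the running total first
    reaches at least half of the summed assembly length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, not data from the study).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))  # 10 + 9 + 8 = 27 is half of 54, so this prints 8
```

A larger N50 indicates that more of the assembly sits in long contigs, which is why it is reported alongside the longest contig and the BUSCO completeness figure.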
Affiliation(s)
- Shiqi Luo
- Beijing Advanced Innovation Center for Food Nutrition and Human Health, College of Plant Protection, China Agricultural University, 2 Yuanmingyuan West Road, Haidian District, Beijing 100193, China
- Min Tang
- Beijing Advanced Innovation Center for Food Nutrition and Human Health, College of Plant Protection, China Agricultural University, 2 Yuanmingyuan West Road, Haidian District, Beijing 100193, China
- Paul B Frandsen
- Department of Plant and Wildlife Sciences, Brigham Young University, 701 E University Parkway Drive, Provo, UT 84602, USA; Data Science Lab, Smithsonian Institution, 600 Maryland Ave SW, Washington, DC 20002, USA
- Russell J Stewart
- Department of Biomedical Engineering, University of Utah, 20 South 2030 East, Salt Lake City, UT 84112, USA
- Xin Zhou
- Beijing Advanced Innovation Center for Food Nutrition and Human Health, College of Plant Protection, China Agricultural University, 2 Yuanmingyuan West Road, Haidian District, Beijing 100193, China
407
Wee Y, Bhyan SB, Liu Y, Lu J, Li X, Zhao M. The bioinformatics tools for the genome assembly and analysis based on third-generation sequencing. Brief Funct Genomics 2018; 18:1-12. [DOI: 10.1093/bfgp/ely037] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 08/17/2018] [Revised: 10/03/2018] [Accepted: 10/19/2018] [Indexed: 02/06/2023]
Affiliation(s)
- YongKiat Wee
- School of Science and Engineering, Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Queensland, Australia
- Salma Begum Bhyan
- School of Science and Engineering, Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Queensland, Australia
- Yining Liu
- The School of Public Health, Institute for Chemical Carcinogenesis, Guangzhou Medical University, Dongfengxi Road, Guangzhou, China
- Jiachun Lu
- The School of Public Health, Institute for Chemical Carcinogenesis, Guangzhou Medical University, Dongfengxi Road, Guangzhou, China
- The School of Public Health, The First Affiliated Hospital, Guangzhou Medical University, Guangzhou, China
- Xiaoyan Li
- Beijing Anzhen Hospital, Capital Medical University, Beijing Institute of Heart, Lung & Blood Vessel Disease, Beijing, China
- Min Zhao
- School of Science and Engineering, Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Queensland, Australia
408
Zeng D, Chen X, Peng J, Yang C, Peng M, Zhu W, Xie D, He P, Wei P, Lin Y, Zhao Y, Chen X. Single-molecule long-read sequencing facilitates shrimp transcriptome research. Sci Rep 2018; 8:16920. [PMID: 30446694 PMCID: PMC6240054 DOI: 10.1038/s41598-018-35066-3] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 01/10/2018] [Accepted: 10/31/2018] [Indexed: 12/26/2022]
Abstract
Although shrimp are of great economic importance, few full-length shrimp transcriptomes are available. Here, we used Pacific Biosciences single-molecule real-time (SMRT) long-read sequencing technology to generate transcripts from the Pacific white shrimp (Litopenaeus vannamei). We obtained 322,600 full-length non-chimeric reads, from which we generated 51,367 high-quality unique full-length transcripts. We corrected errors in the SMRT sequences by comparison with Illumina-produced short reads. We successfully annotated 81.72% of all unique SMRT transcripts against the NCBI non-redundant database, 58.63% against Swiss-Prot, 45.38% against Gene Ontology, 32.57% against Clusters of Orthologous Groups of proteins (COG), and 47.83% against Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Across all transcripts, we identified 3,958 long non-coding RNAs (lncRNAs) and 80,650 simple sequence repeats (SSRs). Our study provides a rich set of full-length cDNA sequences for L. vannamei, which will greatly facilitate shrimp transcriptome research.
Affiliation(s)
- Digang Zeng
- Guangxi Key Laboratory of Aquatic Genetic Breeding and Healthy Aquaculture, Guangxi Academy of Fisheries Sciences, Nanning, Guangxi, P.R. China
- Xiuli Chen
- Guangxi Key Laboratory of Aquatic Genetic Breeding and Healthy Aquaculture, Guangxi Academy of Fisheries Sciences, Nanning, Guangxi, P.R. China
- Jinxia Peng
- Guangxi Key Laboratory of Aquatic Genetic Breeding and Healthy Aquaculture, Guangxi Academy of Fisheries Sciences, Nanning, Guangxi, P.R. China
- Chunling Yang
- Guangxi Key Laboratory of Aquatic Genetic Breeding and Healthy Aquaculture, Guangxi Academy of Fisheries Sciences, Nanning, Guangxi, P.R. China
- Min Peng
- Guangxi Key Laboratory of Aquatic Genetic Breeding and Healthy Aquaculture, Guangxi Academy of Fisheries Sciences, Nanning, Guangxi, P.R. China
- Weilin Zhu
- Guangxi Key Laboratory of Aquatic Genetic Breeding and Healthy Aquaculture, Guangxi Academy of Fisheries Sciences, Nanning, Guangxi, P.R. China
- Daxiang Xie
- Guangxi Key Laboratory of Aquatic Genetic Breeding and Healthy Aquaculture, Guangxi Academy of Fisheries Sciences, Nanning, Guangxi, P.R. China
- Pingping He
- Guangxi Key Laboratory of Aquatic Genetic Breeding and Healthy Aquaculture, Guangxi Academy of Fisheries Sciences, Nanning, Guangxi, P.R. China
- Pinyuan Wei
- Guangxi Key Laboratory of Aquatic Genetic Breeding and Healthy Aquaculture, Guangxi Academy of Fisheries Sciences, Nanning, Guangxi, P.R. China
- Yong Lin
- Guangxi Key Laboratory of Aquatic Genetic Breeding and Healthy Aquaculture, Guangxi Academy of Fisheries Sciences, Nanning, Guangxi, P.R. China
- Yongzhen Zhao
- Guangxi Key Laboratory of Aquatic Genetic Breeding and Healthy Aquaculture, Guangxi Academy of Fisheries Sciences, Nanning, Guangxi, P.R. China.
- Xiaohan Chen
- Guangxi Key Laboratory of Aquatic Genetic Breeding and Healthy Aquaculture, Guangxi Academy of Fisheries Sciences, Nanning, Guangxi, P.R. China.
409
Zhang B, Liu J, Wang X, Wei Z. Full-length RNA sequencing reveals unique transcriptome composition in bermudagrass. PLANT PHYSIOLOGY AND BIOCHEMISTRY : PPB 2018; 132:95-103. [PMID: 30176433 DOI: 10.1016/j.plaphy.2018.08.039] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 06/25/2018] [Accepted: 08/29/2018] [Indexed: 05/20/2023]
Abstract
Bermudagrass [Cynodon dactylon (L.) Pers.] is an important perennial warm-season turfgrass species with great economic value. However, the reference genome and transcriptome information are still deficient in bermudagrass, which severely impedes functional and molecular breeding studies. In this study, through analyzing a mixture sample of leaves, stolons, shoots, roots and flowers with single-molecule long-read sequencing technology from Pacific Biosciences (PacBio), we reported the first full-length transcriptome dataset of bermudagrass (C. dactylon cultivar Yangjiang) comprising 78,192 unigenes. Among the unigenes, 66,409 were functionally annotated, whereas 27,946 were found to have two or more isoforms. The annotated full-length unigenes provided many new insights into gene sequence characteristics and systematic phylogeny of bermudagrass. By comparison with transcriptome dataset in nine grass species, KEGG pathway analyses further revealed that C4 photosynthesis-related genes, notably the phosphoenolpyruvate carboxylase and pyruvate, phosphate dikinase genes, are specifically enriched in bermudagrass. These results not only explained the possible reason why bermudagrass flourishes in warm areas but also provided a solid basis for future studies in this important turfgrass species.
Affiliation(s)
- Bing Zhang
- College of Animal Science and Technology, Yangzhou University, Yangzhou 225009, China.
- Jianxiu Liu
- Institute of Botany, Jiangsu Province and Chinese Academy of Sciences, Nanjing 210014, China
- Xiaoshan Wang
- College of Animal Science and Technology, Yangzhou University, Yangzhou 225009, China
- Zhenwu Wei
- College of Animal Science and Technology, Yangzhou University, Yangzhou 225009, China
410
Bakhtiari M, Shleizer-Burko S, Gymrek M, Bansal V, Bafna V. Targeted genotyping of variable number tandem repeats with adVNTR. Genome Res 2018; 28:1709-1719. [PMID: 30352806 PMCID: PMC6211647 DOI: 10.1101/gr.235119.118] [Citation(s) in RCA: 54] [Impact Index Per Article: 7.7] [Received: 02/09/2018] [Accepted: 10/02/2018] [Indexed: 12/20/2022]
Abstract
Whole-genome sequencing is increasingly used to identify Mendelian variants in clinical pipelines. These pipelines focus on single-nucleotide variants (SNVs) and also structural variants, while ignoring more complex repeat sequence variants. Here, we consider the problem of genotyping Variable Number Tandem Repeats (VNTRs), composed of inexact tandem duplications of short (6–100 bp) repeating units. VNTRs span 3% of the human genome, are frequently present in coding regions, and have been implicated in multiple Mendelian disorders. Although existing tools recognize VNTR carrying sequence, genotyping VNTRs (determining repeat unit count and sequence variation) from whole-genome sequencing reads remains challenging. We describe a method, adVNTR, that uses hidden Markov models to model each VNTR, count repeat units, and detect sequence variation. adVNTR models can be developed for short-read (Illumina) and single-molecule (Pacific Biosciences [PacBio]) whole-genome and whole-exome sequencing, and show good results on multiple simulated and real data sets.
Affiliation(s)
- Mehrdad Bakhtiari
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093, USA
- Sharona Shleizer-Burko
- Department of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Melissa Gymrek
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093, USA; Department of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Vikas Bansal
- Department of Pediatrics, University of California, San Diego, La Jolla, California 92093, USA
- Vineet Bafna
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093, USA
411
Sohn JI, Nam JW. The present and future of de novo whole-genome assembly. Brief Bioinform 2018; 19:23-40. [PMID: 27742661 DOI: 10.1093/bib/bbw096] [Citation(s) in RCA: 80] [Impact Index Per Article: 11.4] [Received: 06/03/2016] [Indexed: 12/15/2022]
Abstract
With the advent of next-generation sequencing (NGS) technology, various de novo assembly algorithms based on the de Bruijn graph have been developed to construct chromosome-level sequences. However, numerous technical and computational challenges in de novo assembly remain, although many bright ideas and heuristics have been suggested to tackle them in both experimental and computational settings. In this review, we categorize de novo assemblers on the basis of the type of de Bruijn graph (Hamiltonian or Eulerian) and discuss the challenges of de novo assembly for short NGS reads regarding computational complexity and assembly ambiguity. Then, we discuss how the limitations of short reads can be overcome by using a single-molecule sequencing platform that generates long reads of up to several kilobases. In fact, long-read assembly has caused a paradigm shift in whole-genome assembly in terms of algorithms and supporting steps. We also summarize (i) hybrid assemblies using both short and long reads and (ii) overlap-based assemblies for long reads and discuss their challenges and future prospects. This review provides guidelines to determine the optimal approach for a given input data type, computational budget or genome.
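To make the de Bruijn graph mentioned in this abstract concrete, here is a minimal Python sketch of the Eulerian-style formulation, in which every k-mer in a read becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function name `de_bruijn_edges`, the choice of k, and the toy read are assumptions for illustration only. Real assemblers add error correction, reverse complements, and graph simplification, none of which is shown here.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build an adjacency list for a de Bruijn graph of the given reads.

    Nodes are (k-1)-mers. Each k-mer found in a read contributes one
    edge from its prefix (first k-1 bases) to its suffix (last k-1 bases).
    Repeated k-mers produce repeated edges.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Toy example (hypothetical read): the 3-mers are ACG and CGT.
print(de_bruijn_edges(["ACGT"], 3))  # {'AC': ['CG'], 'CG': ['GT']}
```

In this formulation, reconstructing a sequence corresponds to finding a path that uses every edge once, which is the Eulerian view the review contrasts with the Hamiltonian one.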
412
Tilak MK, Botero-Castro F, Galtier N, Nabholz B. Illumina Library Preparation for Sequencing the GC-Rich Fraction of Heterogeneous Genomic DNA. Genome Biol Evol 2018; 10:616-622. [PMID: 29385572 PMCID: PMC5808798 DOI: 10.1093/gbe/evy022] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Accepted: 01/18/2018] [Indexed: 02/06/2023]
Abstract
Standard Illumina libraries are biased toward sequences of intermediate GC-content. This results in an underrepresentation of GC-rich regions in sequencing projects of genomes with heterogeneous base composition, such as mammals and birds. We developed a simple, cost-effective protocol to enrich sheared genomic DNA in its GC-rich fraction by subtracting AT-rich DNA. This was achieved by heating DNA up to 90 °C before applying Illumina library preparation. We tested the new approach on chicken DNA and found that heated DNA increased average coverage in the GC-richest chromosomes by a factor up to six. Using a Taq polymerase supposedly appropriate for PCR amplification of GC-rich sequences had a much weaker effect. Our protocol should greatly facilitate sequencing and resequencing of the GC-richest regions of heterogeneous genomes, in combination with standard short-read and long-read technologies.
Affiliation(s)
- Marie-Ka Tilak
- Institut des Sciences de l'Evolution, ISEM, Université de Montpellier, CNRS, IRD, EPHE, France
- Fidel Botero-Castro
- Institut des Sciences de l'Evolution, ISEM, Université de Montpellier, CNRS, IRD, EPHE, France
- Nicolas Galtier
- Institut des Sciences de l'Evolution, ISEM, Université de Montpellier, CNRS, IRD, EPHE, France
- Benoit Nabholz
- Institut des Sciences de l'Evolution, ISEM, Université de Montpellier, CNRS, IRD, EPHE, France
413
Lakhundi S, Zhang K. Methicillin-Resistant Staphylococcus aureus: Molecular Characterization, Evolution, and Epidemiology. Clin Microbiol Rev 2018; 31:e00020-18. [PMID: 30209034 PMCID: PMC6148192 DOI: 10.1128/cmr.00020-18] [Citation(s) in RCA: 903] [Impact Index Per Article: 129.0] [Indexed: 11/20/2022]
Abstract
Staphylococcus aureus, a major human pathogen, has a collection of virulence factors and the ability to acquire resistance to most antibiotics. This ability is further augmented by constant emergence of new clones, making S. aureus a "superbug." Clinical use of methicillin has led to the appearance of methicillin-resistant S. aureus (MRSA). The past few decades have witnessed the existence of new MRSA clones. Unlike traditional MRSA residing in hospitals, the new clones can invade community settings and infect people without predisposing risk factors. This evolution continues with the buildup of the MRSA reservoir in companion and food animals. This review focuses on imparting a better understanding of MRSA evolution and its molecular characterization and epidemiology. We first describe the origin of MRSA, with emphasis on the diverse nature of staphylococcal cassette chromosome mec (SCCmec). mecA and its new homologues (mecB, mecC, and mecD), SCCmec types (13 SCCmec types have been discovered to date), and their classification criteria are discussed. The review then describes various typing methods applied to study the molecular epidemiology and evolutionary nature of MRSA. Starting with the historical methods and continuing to the advanced whole-genome approaches, typing of collections of MRSA has shed light on the origin, spread, and evolutionary pathways of MRSA clones.
Affiliation(s)
- Sahreena Lakhundi
- Centre for Antimicrobial Resistance, Alberta Health Services/Calgary Laboratory Services/University of Calgary, Calgary, Alberta, Canada
| | - Kunyan Zhang
- Centre for Antimicrobial Resistance, Alberta Health Services/Calgary Laboratory Services/University of Calgary, Calgary, Alberta, Canada
- Department of Pathology and Laboratory Medicine, University of Calgary, Calgary, Alberta, Canada
- Department of Microbiology, Immunology and Infectious Diseases, University of Calgary, Calgary, Alberta, Canada
- Department of Medicine, University of Calgary, Calgary, Alberta, Canada
- The Calvin, Phoebe and Joan Snyder Institute for Chronic Diseases, University of Calgary, Calgary, Alberta, Canada
| |
|
414
|
Shi Y, Liu Y, Zhang S, Zou R, Tang J, Mu W, Peng Y, Dong S. Assembly and comparative analysis of the complete mitochondrial genome sequence of Sophora japonica 'JinhuaiJ2'. PLoS One 2018; 13:e0202485. [PMID: 30114217 PMCID: PMC6095553 DOI: 10.1371/journal.pone.0202485] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 06/14/2018] [Accepted: 08/04/2018] [Indexed: 11/18/2022]
Abstract
Sophora japonica L. (Faboideae, Leguminosae) is an important traditional Chinese herb with a long history of cultivation. Its flower buds and fruits contain abundant flavonoids, and therefore, the plants are cultivated for the industrial extraction of rutin. Here, we determined the complete nucleotide sequence of the mitochondrial genome of S. japonica ‘JinhuaiJ2’, the most widely planted variety in the Guangxi region of China. The total length of the mtDNA sequence is 484,916 bp, with a GC content of 45.4%. Sophora japonica mtDNA harbors 32 known protein-coding genes, 17 tRNA genes, and three rRNA genes, with 17 cis-spliced and five trans-spliced introns disrupting eight protein-coding genes. The gene-coding regions, intron regions, and intergenic spacers account for 7.5%, 5.8% and 86.7% of the genome, respectively. The gene profile of the S. japonica mitogenome differs from that of the other Faboideae species by only one or two gene gains or losses. Four of the 17 cis-spliced introns showed distinct length variations in the Faboideae, which could be attributed to homologous recombination of short repeats, a few bases long, located precisely at the edges of the putative deletions. This reflects the importance of small repeats in sequence evolution in Faboideae mitogenomes. Repeated sequences of the S. japonica mitogenome are mainly composed of small repeats, with only 20 medium-sized repeats and one large repeat, adding up to 4% of its mitogenome length. Among the 25 pseudogene fragments detected in the intergenic spacer regions, the two largest ones and their corresponding functional gene copies are located in two different sets of medium-sized repeats, pointing to their origin from homologous recombination. As we further observed recombined reads associated with the longest repeats of 2,160 bp in a PacBio long-read data set of just 15× depth, repeat-mediated homologous recombination may play an important role in the mitogenomic evolution of S. japonica. Our study provides insight into the genetic background of this important herb species and into mitogenomic evolution in the Faboideae species.
Affiliation(s)
- Yancai Shi
- Guangxi Institute of Botany, Chinese Academy of Sciences, Guilin, Guangxi, China
| | - Yang Liu
- Fairy Lake Botanical Garden, Shenzhen & Chinese Academy of Sciences, Shenzhen, Guangdong, China
- BGI-Shenzhen, Shenzhen, China
| | - Shouzhou Zhang
- Fairy Lake Botanical Garden, Shenzhen & Chinese Academy of Sciences, Shenzhen, Guangdong, China
| | - Rong Zou
- Guangxi Institute of Botany, Chinese Academy of Sciences, Guilin, Guangxi, China
| | - Jianmin Tang
- Guangxi Institute of Botany, Chinese Academy of Sciences, Guilin, Guangxi, China
| | | | - Yang Peng
- Fairy Lake Botanical Garden, Shenzhen & Chinese Academy of Sciences, Shenzhen, Guangdong, China
| | - Shanshan Dong
- Fairy Lake Botanical Garden, Shenzhen & Chinese Academy of Sciences, Shenzhen, Guangdong, China
- * E-mail:
| |
|
415
|
Complete Genome Sequences of Canadian Epidemic Methicillin-Resistant Staphylococcus aureus Strains CMRSA3 and CMRSA6. Microbiol Resour Announc 2018; 7:MRA00892-18. [PMID: 30533890 PMCID: PMC6256457 DOI: 10.1128/mra.00892-18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2018] [Accepted: 07/10/2018] [Indexed: 11/20/2022]
Abstract
Methicillin-resistant Staphylococcus aureus (MRSA) clonal complex 8 (CC8) sequence type 239 (ST239) represents a predominant hospital-associated MRSA sublineage present worldwide. The Canadian epidemic MRSA strains CMRSA3 and CMRSA6 are moderately virulent members of this group but are closely related to the highly virulent strain TW20. Whole-genome sequencing of CMRSA3 and CMRSA6 was conducted to identify genetic determinants associated with their virulence.
|
416
|
Panthee S, Hamamoto H, Ishijima SA, Paudel A, Sekimizu K. Utilization of Hybrid Assembly Approach to Determine the Genome of an Opportunistic Pathogenic Fungus, Candida albicans TIMM 1768. Genome Biol Evol 2018; 10:2017-2022. [PMID: 30059981 PMCID: PMC6097704 DOI: 10.1093/gbe/evy166] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Accepted: 07/28/2018] [Indexed: 11/25/2022]
Abstract
Candida albicans TIMM1768 is a highly virulent strain utilized as a model organism for the study of gastrointestinal and oral candidiasis. Despite being a model strain, identification of its genetic determinants of pathogenesis is hindered by the unavailability of its genome sequence. In this study, we determined the genome sequence of C. albicans TIMM1768 using reads obtained from portable MinION and benchtop Ion PGM sequencers. Genome annotation and a comparative analysis with published genomes revealed that the TIMM1768 strain was close to Candida albicans CHN1, and we identified a significant number of genes encoding for pathogenesis. The availability of the C. albicans TIMM1768 genome will facilitate comparative genomic analysis of Candida species, including studies of its virulence mechanisms and the development of treatment strategies for severe candidiasis.
Affiliation(s)
- Suresh Panthee
- Institute of Medical Mycology, Teikyo University, Hachioji, Tokyo, Japan
| | - Hiroshi Hamamoto
- Institute of Medical Mycology, Teikyo University, Hachioji, Tokyo, Japan
| | - Sanae A Ishijima
- Institute of Medical Mycology, Teikyo University, Hachioji, Tokyo, Japan
| | - Atmika Paudel
- Institute of Medical Mycology, Teikyo University, Hachioji, Tokyo, Japan
| | - Kazuhisa Sekimizu
- Institute of Medical Mycology, Teikyo University, Hachioji, Tokyo, Japan
- Genome Pharmaceuticals Institute Co., Ltd, Bunkyo, Tokyo, Japan
| |
|
417
|
Choudhury O, Chakrabarty A, Emrich SJ. HECIL: A Hybrid Error Correction Algorithm for Long Reads with Iterative Learning. Sci Rep 2018; 8:9936. [PMID: 29967328 PMCID: PMC6028576 DOI: 10.1038/s41598-018-28364-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 01/29/2018] [Accepted: 05/31/2018] [Indexed: 11/22/2022]
Abstract
Second-generation DNA sequencing techniques generate short reads that can result in fragmented genome assemblies. Third-generation sequencing platforms mitigate this limitation by producing longer reads that span across complex and repetitive regions. However, the usefulness of such long reads is limited because of high sequencing error rates. To exploit the full potential of these longer reads, it is imperative to correct the underlying errors. We propose HECIL (Hybrid Error Correction with Iterative Learning), a hybrid error correction framework that determines a correction policy for erroneous long reads, based on optimal combinations of decision weights obtained from short read alignments. We demonstrate that HECIL outperforms state-of-the-art error correction algorithms for an overwhelming majority of evaluation metrics on diverse, real-world data sets including E. coli, S. cerevisiae, and the malaria vector mosquito A. funestus. Additionally, we provide an optional avenue of improving the performance of HECIL's core algorithm by introducing an iterative learning paradigm that enhances the correction policy at each iteration by incorporating knowledge gathered from previous iterations via data-driven confidence metrics assigned to prior corrections.
Affiliation(s)
- Olivia Choudhury
- Postdoctoral Researcher, IBM Research, Cambridge, MA, 02142, USA.
| | - Ankush Chakrabarty
- Visiting Research Scientist, Mitsubishi Electric Research Laboratories, Cambridge, MA, 02139, USA
| | - Scott J Emrich
- Associate Professor, Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, 37996, USA
| |
|
418
|
Sohn JI, Nam K, Hong H, Kim JM, Lim D, Lee KT, Do YJ, Cho CY, Kim N, Chai HH, Nam JW. Whole genome and transcriptome maps of the entirely black native Korean chicken breed Yeonsan Ogye. Gigascience 2018; 7:5052204. [PMID: 30010758 PMCID: PMC6065499 DOI: 10.1093/gigascience/giy086] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 01/15/2018] [Revised: 05/19/2018] [Accepted: 07/04/2018] [Indexed: 12/30/2022]
Abstract
Background: Yeonsan Ogye (YO), an indigenous Korean chicken breed (Gallus gallus domesticus), has entirely black external features and internal organs. In this study, the draft genome of YO was assembled using a hybrid de novo assembly method that takes advantage of high-depth Illumina short reads (376.6X) and low-depth Pacific Biosciences (PacBio) long reads (9.7X). Findings: The contig and scaffold NG50s of the hybrid de novo assembly were 362.3 Kbp and 16.8 Mbp, respectively. The completeness (97.6%) of the draft genome (Ogye_1.1) was evaluated with single-copy orthologous genes using Benchmarking Universal Single-Copy Orthologs and found to be comparable to the current chicken reference genome (galGal5; 97.4%; contigs were assembled with high-depth PacBio long reads (50X) and scaffolded with short reads) and superior to other avian genomes (92%-93%; assembled with short read-only or hybrid methods). Compared to galGal4 and galGal5, the draft genome included 551 structural variations including the fibromelanosis (FM) locus duplication, related to hyperpigmentation. To comprehensively reconstruct transcriptome maps, RNA sequencing and reduced representation bisulfite sequencing data were analyzed from 20 tissues, including 4 black tissues (skin, shank, comb, and fascia). The maps included 15,766 protein-coding and 6,900 long noncoding RNA genes, many of which were tissue-specifically expressed and displayed tissue-specific DNA methylation patterns in the promoter regions. Conclusions: We expect that the resulting genome sequence and transcriptome maps will be valuable resources for studying domestic chicken breeds, including black-skinned chickens, as well as for understanding genomic differences between breeds and the evolution of hyperpigmented chickens and functional elements related to hyperpigmentation.
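The contiguity figures quoted in this abstract (NG50) and in several neighbouring entries (N50) follow a simple definition: sort the contig or scaffold lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size (N50) or half of an estimated genome size (NG50). The sketch below illustrates that standard definition only; it is not taken from any of the cited papers, and the function name `n50` and its `genome_size` parameter are assumptions made for illustration.

```python
from typing import Iterable, Optional


def n50(lengths: Iterable[int], genome_size: Optional[int] = None) -> Optional[int]:
    """Return N50 of the given lengths, or NG50 when genome_size is supplied.

    Returns None when the sequences cover less than half of genome_size,
    in which case NG50 is undefined.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    # N50 uses the assembly's own total; NG50 uses the external genome estimate.
    total = genome_size if genome_size is not None else sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return None


if __name__ == "__main__":
    contigs = [2, 2, 2, 3, 3, 4, 8, 8]       # toy contig lengths
    print(n50(contigs))                       # N50 relative to the assembly size
    print(n50(contigs, genome_size=40))       # NG50 relative to a 40-unit genome
```

Because NG50 is measured against a fixed genome size rather than each assembly's own total, it allows assemblies of different total length to be compared on the same scale.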
Affiliation(s)
- Jang-il Sohn
- Department of Life Science, Hanyang University, Seoul, 133-791, Republic of Korea
- Research Institute for Convergence of Basic Sciences, Hanyang University, Seoul, 133-791, Republic of Korea
| | - Kyoungwoo Nam
- Department of Life Science, Hanyang University, Seoul, 133-791, Republic of Korea
| | - Hyosun Hong
- Department of Life Science, Hanyang University, Seoul, 133-791, Republic of Korea
| | - Jun-Mo Kim
- Department of Animal Science and Technology, Chung-Ang University, Anseong, Gyeonggi-do, 17546, Republic of Korea
| | - Dajeong Lim
- Department of Animal Biotechnology & Environment, National Institute of Animal Science, RDA, Wanju, 55365, Republic of Korea
| | - Kyung-Tai Lee
- Department of Animal Biotechnology & Environment, National Institute of Animal Science, RDA, Wanju, 55365, Republic of Korea
| | - Yoon Jung Do
- Department of Animal Biotechnology & Environment, National Institute of Animal Science, RDA, Wanju, 55365, Republic of Korea
| | - Chang Yeon Cho
- Animal Genetic Resource Research Center, National Institute of Animal Science, RDA, Namwon, 55717, Republic of Korea
| | - Namshin Kim
- Personalized Genomic Medicine Research Center, KRIBB, Daejeon, 34141, Republic of Korea
| | - Han-Ha Chai
- Department of Animal Biotechnology & Environment, National Institute of Animal Science, RDA, Wanju, 55365, Republic of Korea
- College of Pharmacy, Chonnam National University, Kwangju, 61186, Republic of Korea
| | - Jin-Wu Nam
- Department of Life Science, Hanyang University, Seoul, 133-791, Republic of Korea
- Research Institute for Convergence of Basic Sciences, Hanyang University, Seoul, 133-791, Republic of Korea
| |
|
419
|
Yin D, Ji C, Ma X, Li H, Zhang W, Li S, Liu F, Zhao K, Li F, Li K, Ning L, He J, Wang Y, Zhao F, Xie Y, Zheng H, Zhang X, Zhang Y, Zhang J. Genome of an allotetraploid wild peanut Arachis monticola: a de novo assembly. Gigascience 2018; 7:5040258. [PMID: 29931126 PMCID: PMC6009596 DOI: 10.1093/gigascience/giy066] [Citation(s) in RCA: 71] [Impact Index Per Article: 10.1] [Received: 01/22/2018] [Revised: 03/13/2018] [Accepted: 05/24/2018] [Indexed: 12/16/2022]
Abstract
Arachis monticola (2n = 4x = 40) is the only allotetraploid wild peanut within the Arachis genus and section, with an AABB-type genome of ∼2.7 Gb in size. The AA-type subgenome is derived from diploid wild peanut Arachis duranensis, and the BB-type subgenome is derived from diploid wild peanut Arachis ipaensis. A. monticola is regarded either as the direct progenitor of the cultivated peanut or as an introgressive derivative between the cultivated peanut and wild species. The large polyploid genome structure and enormous nearly identical regions of the genome make the assembly of chromosomal pseudomolecules very challenging. Here we report the first reference-quality assembly of the A. monticola genome, using a series of advanced technologies. The final whole genome of A. monticola is ∼2.62 Gb and has a contig N50 and scaffold N50 of 106.66 Kb and 124.92 Mb, respectively. The vast majority (91.83%) of the assembled sequence was anchored onto the 20 pseudo-chromosomes, and 96.07% of assemblies were accurately separated into AA- and BB-subgenomes. We demonstrated the efficiency of a current strategy for de novo assembly of the highly complex allotetraploid species, wild peanut (A. monticola), based on whole-genome shotgun sequencing, single molecule real-time sequencing, high-throughput chromosome conformation capture technology, and BioNano optical genome maps. These combined technologies produced a reference-quality genome of the allotetraploid wild peanut, which is valuable for understanding peanut domestication and evolution within the Arachis genus and among legume crops.
Affiliation(s)
- Dongmei Yin
- College of Agronomy, Henan Agricultural University, Zhengzhou 450002, China
| | - Changmian Ji
- Biomarker Technologies Corporation, Beijing 101300, China
| | - Xingli Ma
- College of Agronomy, Henan Agricultural University, Zhengzhou 450002, China
| | - Hang Li
- Biomarker Technologies Corporation, Beijing 101300, China
| | - Wanke Zhang
- State Key Lab of Plant Genomics, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, China
| | - Song Li
- Biomarker Technologies Corporation, Beijing 101300, China
| | - Fuyan Liu
- Biomarker Technologies Corporation, Beijing 101300, China
| | - Kunkun Zhao
- College of Agronomy, Henan Agricultural University, Zhengzhou 450002, China
| | - Fapeng Li
- College of Agronomy, Henan Agricultural University, Zhengzhou 450002, China
| | - Ke Li
- College of Agronomy, Henan Agricultural University, Zhengzhou 450002, China
| | - Longlong Ning
- College of Agronomy, Henan Agricultural University, Zhengzhou 450002, China
| | - Jialin He
- College of Agronomy, Henan Agricultural University, Zhengzhou 450002, China
| | - Yuejun Wang
- National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200032, China
| | - Fei Zhao
- National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200032, China
| | - Yilin Xie
- National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200032, China
| | - Hongkun Zheng
- Biomarker Technologies Corporation, Beijing 101300, China
| | - Xingguo Zhang
- College of Agronomy, Henan Agricultural University, Zhengzhou 450002, China
| | - Yijing Zhang
- National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200032, China
| | - Jinsong Zhang
- State Key Lab of Plant Genomics, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, China
| |
|
420
|
Rupp O, MacDonald ML, Li S, Dhiman H, Polson S, Griep S, Heffner K, Hernandez I, Brinkrolf K, Jadhav V, Samoudi M, Hao H, Kingham B, Goesmann A, Betenbaugh MJ, Lewis NE, Borth N, Lee KH. A reference genome of the Chinese hamster based on a hybrid assembly strategy. Biotechnol Bioeng 2018; 115:2087-2100. [PMID: 29704459 PMCID: PMC6045439 DOI: 10.1002/bit.26722] [Citation(s) in RCA: 78] [Impact Index Per Article: 11.1] [Received: 11/02/2017] [Revised: 03/13/2018] [Accepted: 04/25/2018] [Indexed: 12/20/2022]
Abstract
Accurate and complete genome sequences are essential in biotechnology to facilitate genome‐based cell engineering efforts. The current genome assemblies for Cricetulus griseus, the Chinese hamster, are fragmented and replete with gap sequences and misassemblies, consistent with most short‐read‐based assemblies. Here, we completely resequenced C. griseus using single molecule real time sequencing and merged this with Illumina‐based assemblies. This generated a more contiguous and complete genome assembly than either technology alone, reducing the number of scaffolds by >28‐fold, with 90% of the sequence in the 122 longest scaffolds. Most genes are now found in single scaffolds, including up‐ and downstream regulatory elements, enabling improved study of noncoding regions. With >95% of the gap sequence filled, important Chinese hamster ovary cell mutations have been detected in draft assembly gaps. This new assembly will be an invaluable resource for continued basic and pharmaceutical research.
Affiliation(s)
- Oliver Rupp
- Bioinformatics and Systems Biology, Justus-Liebig-University Giessen, Giessen, Germany
| | - Madolyn L MacDonald
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware; Delaware Biotechnology Institute, Newark, Delaware
| | - Shangzhong Li
- Department of Bioengineering, University of California, San Diego, California; Novo Nordisk Foundation Center for Biosustainability, University of California, San Diego, California
| | - Heena Dhiman
- Austrian Center of Industrial Biotechnology, Vienna, Austria; Department of Biotechnology, University of Natural Resources and Life Sciences, Vienna, Austria
| | - Shawn Polson
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware; Delaware Biotechnology Institute, Newark, Delaware
| | - Sven Griep
- Bioinformatics and Systems Biology, Justus-Liebig-University Giessen, Giessen, Germany
| | - Kelley Heffner
- Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, Maryland
| | - Inmaculada Hernandez
- Department of Biotechnology, University of Natural Resources and Life Sciences, Vienna, Austria
| | - Karina Brinkrolf
- Department of Bioresources, Fraunhofer Institute for Molecular Biology and Applied Ecology, Giessen, Germany
| | - Vaibhav Jadhav
- Austrian Center of Industrial Biotechnology, Vienna, Austria
| | - Mojtaba Samoudi
- Novo Nordisk Foundation Center for Biosustainability, University of California, San Diego, California; Department of Pediatrics, University of California, San Diego, California
| | - Haiping Hao
- Johns Hopkins University Deep Sequencing and Microarray Core, Johns Hopkins University, Baltimore, Maryland
| | | | - Alexander Goesmann
- Bioinformatics and Systems Biology, Justus-Liebig-University Giessen, Giessen, Germany
| | - Michael J Betenbaugh
- Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, Maryland
| | - Nathan E Lewis
- Department of Bioengineering, University of California, San Diego, California; Novo Nordisk Foundation Center for Biosustainability, University of California, San Diego, California; Department of Pediatrics, University of California, San Diego, California
| | - Nicole Borth
- Austrian Center of Industrial Biotechnology, Vienna, Austria; Department of Biotechnology, University of Natural Resources and Life Sciences, Vienna, Austria
| | - Kelvin H Lee
- Delaware Biotechnology Institute, Newark, Delaware; Department of Chemical and Biomolecular Engineering, University of Delaware, Newark, Delaware
| |
|
421
|
A Whole Genome Assembly of the Horn Fly, Haematobia irritans, and Prediction of Genes with Roles in Metabolism and Sex Determination. G3-GENES GENOMES GENETICS 2018; 8:1675-1686. [PMID: 29602812 PMCID: PMC5940159 DOI: 10.1534/g3.118.200154] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 02/04/2023]
Abstract
Haematobia irritans, commonly known as the horn fly, is a globally distributed blood-feeding pest of cattle that is responsible for significant economic losses to cattle producers. Chemical insecticides are the primary means for controlling this pest, but problems with insecticide resistance have become common in the horn fly. To provide a foundation for identification of genomic loci for insecticide resistance and for discovery of new control technology, we report the sequencing, assembly, and annotation of the horn fly genome. The assembled genome is 1.14 Gb, comprising 76,616 scaffolds with an N50 scaffold length of 23 Kb. Using RNA-Seq data, we have predicted 34,413 gene models, of which 19,185 have been assigned functional annotations. Comparative genomics analysis with the Dipteran flies Musca domestica L., Drosophila melanogaster, and Lucilia cuprina shows that the horn fly is most closely related to M. domestica, sharing 8,748 orthologous clusters, followed by D. melanogaster and L. cuprina, sharing 7,582 and 7,490 orthologous clusters, respectively. We also identified a gene locus for the sodium channel protein in which mutations have been previously reported that confer target site resistance to the most common class of pesticides used in fly control. Additionally, we identified 276 genomic loci encoding members of metabolic enzyme gene families such as cytochrome P450s, esterases and glutathione S-transferases, and several genes orthologous to sex determination pathway genes in other Dipteran species.
|
422
|
Di Genova A, Ruz GA, Sagot MF, Maass A. Fast-SG: an alignment-free algorithm for hybrid assembly. Gigascience 2018; 7:4993155. [PMID: 29741627 PMCID: PMC6007556 DOI: 10.1093/gigascience/giy048] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 12/08/2017] [Revised: 03/01/2018] [Accepted: 04/19/2018] [Indexed: 12/01/2022]
Abstract
Background: Long-read sequencing technologies are the ultimate solution for genome repeats, allowing near reference-level reconstructions of large genomes. However, long-read de novo assembly pipelines are computationally intense and require a considerable amount of coverage, thereby hindering their broad application to the assembly of large genomes. Alternatively, hybrid assembly methods that combine short- and long-read sequencing technologies can reduce the time and cost required to produce de novo assemblies of large genomes. Results: Here, we propose a new method, called Fast-SG, that uses a new ultrafast alignment-free algorithm specifically designed for constructing a scaffolding graph using light-weight data structures. Fast-SG can construct the graph from either short or long reads. This allows the reuse of efficient algorithms designed for short-read data and permits the definition of novel modular hybrid assembly pipelines. Using comprehensive standard datasets and benchmarks, we show how Fast-SG outperforms the state-of-the-art short-read aligners when building the scaffolding graph and can be used to extract linking information from either raw or error-corrected long reads. We also show how a hybrid assembly approach using Fast-SG with shallow long-read coverage (5X) and moderate computational resources can produce long-range and accurate reconstructions of the genomes of Arabidopsis thaliana (Ler-0) and human (NA12878). Conclusions: Fast-SG opens a door to achieve accurate hybrid long-range reconstructions of large genomes with low effort, high portability, and low cost.
Affiliation(s)
- Alex Di Genova
- Facultad de Ingeniería y Ciencias, Universidad Adolfo Ibáñez, Santiago, Chile
- Mathomics Bioinformatics Laboratory, Center for Mathematical Modeling, University of Chile, Av. Beauchef 851., 7th floor, Santiago, Chile
- Inria Grenoble Rhône-Alpes, 655, Avenue de l’Europe, 38334 Montbonnot, France
- CNRS, UMR5558, Université Claude Bernard Lyon 1, 43, Boulevard du 11 Novembre 1918, 69622 Villeurbanne, France
- Fondap Center for Genome Regulation, Av. Blanco Encalada 2085, 3rd floor, Santiago, Chile
| | - Gonzalo A Ruz
- Facultad de Ingeniería y Ciencias, Universidad Adolfo Ibáñez, Santiago, Chile
- Center of Applied Ecology and Sustainability (CAPES), Santiago, Chile
| | - Marie-France Sagot
- Inria Grenoble Rhône-Alpes, 655, Avenue de l’Europe, 38334 Montbonnot, France
- CNRS, UMR5558, Université Claude Bernard Lyon 1, 43, Boulevard du 11 Novembre 1918, 69622 Villeurbanne, France
| | - Alejandro Maass
- Mathomics Bioinformatics Laboratory, Center for Mathematical Modeling, University of Chile, Av. Beauchef 851., 7th floor, Santiago, Chile
- Fondap Center for Genome Regulation, Av. Blanco Encalada 2085, 3rd floor, Santiago, Chile
- Department of Mathematical Engineering, University of Chile, Av. Beauchef 851., 5th floor, Santiago, Chile
| |
|
423
|
Pavlovich SS, Lovett SP, Koroleva G, Guito JC, Arnold CE, Nagle ER, Kulcsar K, Lee A, Thibaud-Nissen F, Hume AJ, Mühlberger E, Uebelhoer LS, Towner JS, Rabadan R, Sanchez-Lockhart M, Kepler TB, Palacios G. The Egyptian Rousette Genome Reveals Unexpected Features of Bat Antiviral Immunity. Cell 2018; 173:1098-1110.e18. [PMID: 29706541 PMCID: PMC7112298 DOI: 10.1016/j.cell.2018.03.070] [Citation(s) in RCA: 179] [Impact Index Per Article: 25.6] [Received: 09/20/2017] [Revised: 01/22/2018] [Accepted: 03/27/2018] [Indexed: 12/27/2022]
Abstract
Bats harbor many viruses asymptomatically, including several notorious for causing extreme virulence in humans. To identify differences between antiviral mechanisms in humans and bats, we sequenced, assembled, and analyzed the genome of Rousettus aegyptiacus, a natural reservoir of Marburg virus and the only known reservoir for any filovirus. We found an expanded and diversified KLRC/KLRD family of natural killer cell receptors, MHC class I genes, and type I interferons, which dramatically differ from their functional counterparts in other mammals. Such concerted evolution of key components of bat immunity is strongly suggestive of novel modes of antiviral defense. An evaluation of the theoretical function of these genes suggests that an inhibitory immune state may exist in bats. Based on our findings, we hypothesize that tolerance of viral infection, rather than enhanced potency of antiviral defenses, may be a key mechanism by which bats asymptomatically host viruses that are pathogenic in humans.
Affiliation(s)
- Stephanie S Pavlovich
- Department of Microbiology, Boston University School of Medicine, Boston, MA 02118, USA
| | - Sean P Lovett
- Center for Genome Sciences, United States Army Research Institute of Infectious Diseases (USAMRIID), Frederick, MD 21702, USA
| | - Galina Koroleva
- Center for Genome Sciences, United States Army Research Institute of Infectious Diseases (USAMRIID), Frederick, MD 21702, USA
| | - Jonathan C Guito
- Viral Special Pathogens Branch, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
| | - Catherine E Arnold
- Center for Genome Sciences, United States Army Research Institute of Infectious Diseases (USAMRIID), Frederick, MD 21702, USA
| | - Elyse R Nagle
- Center for Genome Sciences, United States Army Research Institute of Infectious Diseases (USAMRIID), Frederick, MD 21702, USA
| | - Kirsten Kulcsar
- Center for Genome Sciences, United States Army Research Institute of Infectious Diseases (USAMRIID), Frederick, MD 21702, USA
| | - Albert Lee
- Departments of Systems Biology and Biomedical Informatics, Columbia University, New York, NY 10032, USA
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD 20892, USA
| | - Adam J Hume
- Department of Microbiology, Boston University School of Medicine, Boston, MA 02118, USA; National Emerging Infectious Diseases Laboratory, Boston University, Boston, MA 02118, USA
| | - Elke Mühlberger
- Department of Microbiology, Boston University School of Medicine, Boston, MA 02118, USA; National Emerging Infectious Diseases Laboratory, Boston University, Boston, MA 02118, USA
| | - Luke S Uebelhoer
- Viral Special Pathogens Branch, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
| | - Jonathan S Towner
- Viral Special Pathogens Branch, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
| | - Raul Rabadan
- Departments of Systems Biology and Biomedical Informatics, Columbia University, New York, NY 10032, USA
| | - Mariano Sanchez-Lockhart
- Center for Genome Sciences, United States Army Research Institute of Infectious Diseases (USAMRIID), Frederick, MD 21702, USA; Department of Pathology and Microbiology, University of Nebraska Medical Center, Omaha, NE 68198, USA
| | - Thomas B Kepler
- Department of Microbiology, Boston University School of Medicine, Boston, MA 02118, USA; Department of Mathematics and Statistics, Boston University, Boston, MA 02215, USA; National Emerging Infectious Diseases Laboratory, Boston University, Boston, MA 02118, USA.
| | - Gustavo Palacios
- Center for Genome Sciences, United States Army Research Institute of Infectious Diseases (USAMRIID), Frederick, MD 21702, USA.
| |
Collapse
|
424
|
Bachmann JA, Tedder A, Laenen B, Steige KA, Slotte T. Targeted Long-Read Sequencing of a Locus Under Long-Term Balancing Selection in Capsella. G3 (Bethesda, Md.) 2018; 8:1327-1333. [PMID: 29476024 PMCID: PMC5873921 DOI: 10.1534/g3.117.300467] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 11/22/2017] [Accepted: 02/20/2018] [Indexed: 11/18/2022]
Abstract
Rapid advances in short-read DNA sequencing technologies have revolutionized population genomic studies, but there are genomic regions where this technology reaches its limits. Limitations mostly arise due to the difficulties in assembly or alignment to genomic regions of high sequence divergence and high repeat content, which are typical characteristics for loci under strong long-term balancing selection. Studying genetic diversity at such loci therefore remains challenging. Here, we investigate the feasibility and error rates associated with targeted long-read sequencing of a locus under balancing selection. For this purpose, we generated bacterial artificial chromosomes (BACs) containing the Brassicaceae S-locus, a region under strong negative frequency-dependent selection which has previously proven difficult to assemble in its entirety using short reads. We sequence S-locus BACs with single-molecule long-read sequencing technology and conduct de novo assembly of these S-locus haplotypes. By comparing repeated assemblies resulting from independent long-read sequencing runs on the same BAC clone we do not detect any structural errors, suggesting that reliable assemblies are generated, but we estimate an indel error rate of 5.7×10⁻⁵. A similar error rate was estimated based on comparison of Illumina short-read sequences and BAC assemblies. Our results show that, until de novo assembly of multiple individuals using long-read sequencing becomes feasible, targeted long-read sequencing of loci under balancing selection is a viable option with low error rates for single nucleotide polymorphisms or structural variation. We further find that short-read sequencing is a valuable complement, allowing correction of the relatively high rate of indel errors that result from this approach.
Affiliation(s)
- Jörg A Bachmann
- Department of Ecology, Environment and Plant Sciences, Science for Life Laboratory, Stockholm University, Sweden
- Andrew Tedder
- Department of Ecology, Environment and Plant Sciences, Science for Life Laboratory, Stockholm University, Sweden
- Benjamin Laenen
- Department of Ecology, Environment and Plant Sciences, Science for Life Laboratory, Stockholm University, Sweden
- Kim A Steige
- Department of Ecology, Environment and Plant Sciences, Science for Life Laboratory, Stockholm University, Sweden
- Tanja Slotte
- Department of Ecology, Environment and Plant Sciences, Science for Life Laboratory, Stockholm University, Sweden
|
425
|
Tardaguila M, de la Fuente L, Marti C, Pereira C, Pardo-Palacios FJ, Del Risco H, Ferrell M, Mellado M, Macchietto M, Verheggen K, Edelmann M, Ezkurdia I, Vazquez J, Tress M, Mortazavi A, Martens L, Rodriguez-Navarro S, Moreno-Manzano V, Conesa A. SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res 2018; 28:396-411. [PMID: 29440222 PMCID: PMC5848618 DOI: 10.1101/gr.222976.117] [Citation(s) in RCA: 264] [Impact Index Per Article: 37.7] [Received: 03/17/2017] [Accepted: 01/08/2018] [Indexed: 01/15/2023]
Abstract
High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in well-annotated mammalian species. The advances in sequencing technology have created a need for studies and tools that can characterize these novel variants. Here, we present SQANTI, an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline using 47 unique descriptors. We apply SQANTI to a neuronal mouse transcriptome using Pacific Biosciences (PacBio) long reads and illustrate how the tool is effective in characterizing and describing the composition of the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that an important number of the novel transcripts are technical artifacts of the sequencing approach and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, resulting more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact in the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes.
Affiliation(s)
- Manuel Tardaguila
- Department of Microbiology and Cell Science, Institute for Food and Agricultural Sciences, Genetics Institute, University of Florida, Gainesville, Florida 32611, USA
- Lorena de la Fuente
- Genomics of Gene Expression Laboratory, Centro de Investigaciones Principe Felipe (CIPF), 46012 Valencia, Spain
- Cristina Marti
- Genomics of Gene Expression Laboratory, Centro de Investigaciones Principe Felipe (CIPF), 46012 Valencia, Spain
- Cécile Pereira
- Department of Microbiology and Cell Science, Institute for Food and Agricultural Sciences, Genetics Institute, University of Florida, Gainesville, Florida 32611, USA
- Hector Del Risco
- Department of Microbiology and Cell Science, Institute for Food and Agricultural Sciences, Genetics Institute, University of Florida, Gainesville, Florida 32611, USA
- Marc Ferrell
- Department of Microbiology and Cell Science, Institute for Food and Agricultural Sciences, Genetics Institute, University of Florida, Gainesville, Florida 32611, USA
- Marissa Macchietto
- Department of Developmental and Cell Biology, University of California, Irvine, California 92617, USA
- Kenneth Verheggen
- VIB-UGent Center for Medical Biotechnology, VIB, B-9000 Ghent, Belgium
- Department of Biochemistry, Ghent University, B-9000 Ghent, Belgium
- Mariola Edelmann
- Department of Microbiology and Cell Science, Institute for Food and Agricultural Sciences, Genetics Institute, University of Florida, Gainesville, Florida 32611, USA
- Iakes Ezkurdia
- Centro Nacional de Investigaciones Cardiovasculares CNIC, 28029 Madrid, Spain
- Jesus Vazquez
- Centro Nacional de Investigaciones Cardiovasculares CNIC, 28029 Madrid, Spain
- Michael Tress
- Spanish National Cancer Research Centre (CNIO), 28029 Madrid, Spain
- Ali Mortazavi
- Department of Developmental and Cell Biology, University of California, Irvine, California 92617, USA
- Lennart Martens
- VIB-UGent Center for Medical Biotechnology, VIB, B-9000 Ghent, Belgium
- Department of Biochemistry, Ghent University, B-9000 Ghent, Belgium
- Susana Rodriguez-Navarro
- Gene Expression and mRNA Metabolism Laboratory, CSIC, IBV, 46010 Valencia, Spain
- Gene Expression and mRNA Metabolism Laboratory, CIPF, 46012 Valencia, Spain
- Ana Conesa
- Department of Microbiology and Cell Science, Institute for Food and Agricultural Sciences, Genetics Institute, University of Florida, Gainesville, Florida 32611, USA
- Genomics of Gene Expression Laboratory, Centro de Investigaciones Principe Felipe (CIPF), 46012 Valencia, Spain
|
426
|
Wang JR, Holt J, McMillan L, Jones CD. FMLRC: Hybrid long read error correction using an FM-index. BMC Bioinformatics 2018; 19:50. [PMID: 29426289 PMCID: PMC5807796 DOI: 10.1186/s12859-018-2051-3] [Citation(s) in RCA: 90] [Impact Index Per Article: 12.9] [Received: 02/08/2017] [Accepted: 02/01/2018] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND: Long read sequencing is changing the landscape of genomic research, especially de novo assembly. Despite the high error rate inherent to long read technologies, increased read lengths dramatically improve the continuity and accuracy of genome assemblies. However, the cost and throughput of these technologies limit their application to complex genomes. One solution is to decrease the cost and time to assemble novel genomes by leveraging "hybrid" assemblies that use long reads for scaffolding and short reads for accuracy. RESULTS: We describe a novel method leveraging a multi-string Burrows-Wheeler Transform with auxiliary FM-index to correct errors in long read sequences using a set of complementary short reads. We demonstrate that our method efficiently produces significantly more high quality corrected sequence than existing hybrid error-correction methods. We also show that our method produces more contiguous assemblies, in many cases, than existing state-of-the-art hybrid and long-read only de novo assembly methods. CONCLUSION: Our method accurately corrects long read sequence data using complementary short reads. We demonstrate higher total throughput of corrected long reads and a corresponding increase in contiguity of the resulting de novo assemblies. Improved throughput and computational efficiency relative to existing methods will help make more economical use of emerging long read sequencing technologies.
Affiliation(s)
- Jeremy R. Wang
- Department of Genetics, University of North Carolina at Chapel Hill, CB 3280, 3144 Genome Sciences Building, 250 Bell Tower Dr, Chapel Hill, 27599 NC USA
- James Holt
- Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC USA
- Leonard McMillan
- Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC USA
- Corbin D. Jones
- Department of Biology and Integrative Program for Biological and Genome Sciences, University of North Carolina at Chapel Hill, Chapel Hill, NC USA
|
427
|
Dominguez Del Angel V, Hjerde E, Sterck L, Capella-Gutierrez S, Notredame C, Vinnere Pettersson O, Amselem J, Bouri L, Bocs S, Klopp C, Gibrat JF, Vlasova A, Leskosek BL, Soler L, Binzer-Panchal M, Lantz H. Ten steps to get started in Genome Assembly and Annotation. F1000Res 2018; 7. [PMID: 29568489 PMCID: PMC5850084 DOI: 10.12688/f1000research.13598.1] [Citation(s) in RCA: 51] [Impact Index Per Article: 7.3] [Accepted: 01/19/2018] [Indexed: 12/16/2022] Open
Abstract
As a part of the ELIXIR-EXCELERATE efforts in capacity building, we present here 10 steps to facilitate researchers getting started in genome assembly and genome annotation. The guidelines given are broadly applicable, intended to be stable over time, and cover all aspects from start to finish of a general assembly and annotation project. Intrinsic properties of genomes are discussed, as is the importance of using high quality DNA. Different sequencing technologies and generally applicable workflows for genome assembly are also detailed. We cover structural and functional annotation and encourage readers to also annotate transposable elements, something that is often omitted from annotation workflows. The importance of data management is stressed, and we give advice on where to submit data and how to make your results Findable, Accessible, Interoperable, and Reusable (FAIR).
Affiliation(s)
- Erik Hjerde
- Department of Chemistry, Norstruct, UiT The Arctic University of Norway, Tromsø, 9019, Norway
- Lieven Sterck
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Ghent, Belgium; VIB-UGent Center for Plant Systems Biology, Ghent University - VIB, Technologiepark 927, 9052 Ghent, Belgium
- Salvadors Capella-Gutierrez
- Spanish National Bioinformatics Institute (INB), Barcelona, Spain; Barcelona Supercomputing Center (BSC), Centro Nacional de Supercomputación, Barcelona, Spain
- Cederic Notredame
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Olga Vinnere Pettersson
- Uppsala Genome Center, NGI/SciLifeLab, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, SE-752 37, Sweden
- Joelle Amselem
- URGI, INRA, Université Paris-Saclay, Versailles, 78026, France
- Laurent Bouri
- Institut Français de Bioinformatique, UMS3601-CNRS, Université Paris-Saclay, Orsay, 91403, France
- Stephanie Bocs
- CIRAD, UMR AGAP, Montpellier, 34398, France; AGAP, Cirad, INRA, Montpellier SupAgro, Universite Montpellier, Montpellier, France; South Green Bioinformatics Platform, Montpellier, France
- Jean-Francois Gibrat
- Institut Français de Bioinformatique, UMS3601-CNRS, Université Paris-Saclay, Orsay, 91403, France; Unité de recherche, INRA, Université Paris-Saclay, 78350 Jouy-en-Josas, France
- Anna Vlasova
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Brane L Leskosek
- Faculty of Medicine, Institute for Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
- Lucile Soler
- IMBIM/NBIS/SciLifeLab, Uppsala University, Uppsala, Sweden
- Henrik Lantz
- IMBIM/NBIS/SciLifeLab, Uppsala University, Uppsala, Sweden
|
428
|
Lin R, Qin F, Shen B, Shi Q, Liu C, Zhang X, Jiao Y, Lu J, Gao Y, Suarez-Fernandez M, Lopez-Moya F, Lopez-Llorca LV, Wang G, Mao Z, Ling J, Yang Y, Cheng X, Xie B. Genome and secretome analysis of Pochonia chlamydosporia provide new insight into egg-parasitic mechanisms. Sci Rep 2018; 8:1123. [PMID: 29348510 PMCID: PMC5773674 DOI: 10.1038/s41598-018-19169-5] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 08/25/2017] [Accepted: 12/22/2017] [Indexed: 11/24/2022] Open
Abstract
Pochonia chlamydosporia infects eggs and females of economically important plant-parasitic nematodes. The fungal isolates parasitizing different nematodes are genetically distinct. To understand their intraspecific genetic differentiation, parasitic mechanisms, and adaptive evolution, we assembled seven putative chromosomes of P. chlamydosporia strain 170 isolated from root-knot nematode eggs (~44 Mb, including 7.19% of transposable elements) and compared them with the genome of the strain 123 (~41 Mb) isolated from cereal cyst nematode. We focus on secretomes of the fungus, which play important roles in pathogenicity and fungus-host/environment interactions, and identified 1,750 secreted proteins, with a high proportion of carboxypeptidases, subtilisins, and chitinases. We analyzed the phylogenies of these genes and predicted new pathogenic molecules. By comparative transcriptome analysis, we found that secreted proteins involved in responses to nutrient stress are mainly comprised of proteases and glycoside hydrolases. Moreover, 32 secreted proteins undergoing positive selection and 71 duplicated gene pairs encoding secreted proteins are identified. Two duplicated pairs encoding secreted glycosyl hydrolases (GH30), which may be related to the fungal endophytic process and are lost in many insect-pathogenic fungi but present in nematophagous fungi, are putatively acquired from bacteria by horizontal gene transfer. The results aid understanding of the genetic origins and evolution of parasitism-related genes.
Affiliation(s)
- Runmao Lin
- College of Life Sciences, Beijing Normal University, Beijing, China; Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
- Feifei Qin
- College of Life Sciences, Beijing Normal University, Beijing, China; Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
- Baoming Shen
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
- Qianqian Shi
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
- Chichuan Liu
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
- Xi Zhang
- College of Life Sciences, Beijing Normal University, Beijing, China; Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
- Yang Jiao
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
- Jun Lu
- College of Life Sciences, Beijing Normal University, Beijing, China
- Yaoyao Gao
- College of Life Sciences, Beijing Normal University, Beijing, China
- Marta Suarez-Fernandez
- Laboratory of Plant Pathology, Department of Marine Sciences and Applied Biology, University of Alicante, Alicante, Spain
- Federico Lopez-Moya
- Laboratory of Plant Pathology, Department of Marine Sciences and Applied Biology, University of Alicante, Alicante, Spain
- Luis Vicente Lopez-Llorca
- Laboratory of Plant Pathology, Department of Marine Sciences and Applied Biology, University of Alicante, Alicante, Spain
- Gang Wang
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
- Zhenchuan Mao
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
- Jian Ling
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
- Yuhong Yang
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
- Xinyue Cheng
- College of Life Sciences, Beijing Normal University, Beijing, China; Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, Beijing, China
- Bingyan Xie
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China; Key Laboratory of Biology and Genetic Improvement of Horticultural Crops, Ministry of Agriculture, Beijing, China
|
429
|
An D, Cao HX, Li C, Humbeck K, Wang W. Isoform Sequencing and State-of-Art Applications for Unravelling Complexity of Plant Transcriptomes. Genes (Basel) 2018; 9:genes9010043. [PMID: 29346292 PMCID: PMC5793194 DOI: 10.3390/genes9010043] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 11/27/2017] [Revised: 12/30/2017] [Accepted: 01/15/2018] [Indexed: 12/12/2022] Open
Abstract
Single-molecule real-time (SMRT) sequencing developed by PacBio, also called third-generation sequencing (TGS), offers longer reads than second-generation sequencing (SGS). Given its ability to obtain full-length transcripts without assembly, isoform sequencing (Iso-Seq) of transcriptomes by PacBio is advantageous for genome annotation, identification of novel genes and isoforms, as well as the discovery of long non-coding RNA (lncRNA). In addition, Iso-Seq gives access to the direct detection of alternative splicing, alternative polyadenylation (APA), gene fusion, and DNA modifications. Such applications of Iso-Seq facilitate the understanding of gene structure, post-transcriptional regulatory networks, and subsequently proteomic diversity. In this review, we summarize its applications in plant transcriptome study, specifically pointing out challenges associated with each step in the experimental design and highlighting the development of bioinformatic pipelines. We aim to provide the community with an integrative overview of and comprehensive guidance on Iso-Seq, and thus to promote its applications in plant research.
Affiliation(s)
- Dong An
- School of Agriculture and Biology, Shanghai Jiao Tong University, 800 Dong Chuan Road, Shanghai 200240, China
- Hieu X Cao
- Institute of Biology/Plant Physiology, Martin-Luther-University of Halle-Wittenberg, Weinbergweg 10, 06120 Halle, Germany
- Changsheng Li
- School of Agriculture and Biology, Shanghai Jiao Tong University, 800 Dong Chuan Road, Shanghai 200240, China
- Klaus Humbeck
- Institute of Biology/Plant Physiology, Martin-Luther-University of Halle-Wittenberg, Weinbergweg 10, 06120 Halle, Germany
- Wenqin Wang
- School of Agriculture and Biology, Shanghai Jiao Tong University, 800 Dong Chuan Road, Shanghai 200240, China
|
430
|
Filichkin SA, Hamilton M, Dharmawardhana PD, Singh SK, Sullivan C, Ben-Hur A, Reddy ASN, Jaiswal P. Abiotic Stresses Modulate Landscape of Poplar Transcriptome via Alternative Splicing, Differential Intron Retention, and Isoform Ratio Switching. Frontiers in Plant Science 2018; 9:5. [PMID: 29483921 PMCID: PMC5816337 DOI: 10.3389/fpls.2018.00005] [Citation(s) in RCA: 75] [Impact Index Per Article: 10.7] [Received: 10/20/2017] [Accepted: 01/03/2018] [Indexed: 05/19/2023]
Abstract
Abiotic stresses affect plant physiology, development, and growth, and alter pre-mRNA splicing. Western poplar is a model woody tree and a potential bioenergy feedstock. To investigate the extent of stress-regulated alternative splicing (AS), we conducted an in-depth survey of leaf, root, and stem xylem transcriptomes under drought, salt, or temperature stress. Analysis of approximately one billion genome-aligned RNA-Seq reads from tissue- or stress-specific libraries revealed over fifteen million novel splice junctions. Transcript models supported by both RNA-Seq and single molecule isoform sequencing (Iso-Seq) data revealed a broad array of novel stress- and/or tissue-specific isoforms. Analysis of Iso-Seq data also resulted in the discovery of 15,087 novel transcribed regions of which 164 show AS. Our findings demonstrate that abiotic stresses profoundly perturb transcript isoform profiles and trigger widespread intron retention (IR) events. Stress treatments often increased or decreased retention of specific introns, a phenomenon described here as differential intron retention (DIR). Many differentially retained introns were regulated in a stress- and/or tissue-specific manner. A subset of transcripts harboring super stress-responsive DIR events showed persisting fluctuations in the degree of IR across all treatments and tissue types. To investigate coordinated dynamics of intron-containing transcripts in the study, we quantified the absolute copy number of isoforms of two conserved transcription factors (TFs) using Droplet Digital PCR. This case study suggests that stress treatments can be associated with coordinated switches in relative ratios between fully spliced and intron-retaining isoforms and may play a role in adjusting the transcriptome to abiotic stresses.
Affiliation(s)
- Sergei A. Filichkin
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, United States
- Michael Hamilton
- Department of Computer Science, Colorado State University, Fort Collins, CO, United States
- Sunil K. Singh
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, United States
- Christopher Sullivan
- Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, United States
- Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, CO, United States
- Anireddy S. N. Reddy
- Department of Biology and Program in Cell and Molecular Biology, Colorado State University, Fort Collins, CO, United States
- Pankaj Jaiswal
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, United States
- *Correspondence: Pankaj Jaiswal
|
431
|
Zaccaron AZ, Bluhm BH. The genome sequence of Bipolaris cookei reveals mechanisms of pathogenesis underlying target leaf spot of sorghum. Sci Rep 2017; 7:17217. [PMID: 29222463 PMCID: PMC5722872 DOI: 10.1038/s41598-017-17476-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 09/07/2017] [Accepted: 11/24/2017] [Indexed: 11/23/2022] Open
Abstract
Bipolaris cookei (=Bipolaris sorghicola) causes target leaf spot, one of the most prevalent foliar diseases of sorghum. Little is known about the molecular basis of pathogenesis in B. cookei, in large part due to a paucity of resources for molecular genetics, such as a reference genome. Here, a draft genome sequence of B. cookei was obtained and analyzed. A hybrid assembly strategy utilizing Illumina and Pacific Biosciences sequencing technologies produced a draft nuclear genome of 36.1 Mb, organized into 321 scaffolds with L50 of 31 and N50 of 378 kb, from which 11,189 genes were predicted. Additionally, a finished mitochondrial genome sequence of 135,790 bp was obtained, which contained 75 predicted genes. Comparative genomics revealed that B. cookei possessed substantially fewer carbohydrate-active enzymes and secreted proteins than closely related Bipolaris species. Novel genes involved in secondary metabolism, including genes implicated in ophiobolin biosynthesis, were identified. Among 37 B. cookei genes induced during sorghum infection, one encodes a putative effector with a limited taxonomic distribution among plant pathogenic fungi. The draft genome sequence of B. cookei provided novel insights into target leaf spot of sorghum and is an important resource for future investigation.
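For readers unfamiliar with the contiguity statistics quoted above: N50 is the scaffold length at which scaffolds of that length or longer account for at least half of the assembly, and L50 is the number of scaffolds needed to reach that point. The following is a minimal Python sketch of that standard definition, not the authors' pipeline; the function name `n50_l50` and the toy lengths are illustrative assumptions.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)  # longest first
    half = sum(ordered) / 2                  # half of the total assembly size
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # `length` is the N50; `count` scaffolds were needed, so it is the L50
            return length, count
    raise ValueError("lengths must be non-empty")


# Hypothetical toy assembly, not data from the paper.
toy_scaffolds = [100, 80, 60, 40, 20]
print(n50_l50(toy_scaffolds))  # (80, 2)
```

Read this way, an L50 of 31 with an N50 of 378 kb means that the 31 longest scaffolds together cover at least half of the 36.1 Mb assembly, and the shortest of those 31 is 378 kb long.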
Affiliation(s)
- Alex Z Zaccaron
- Department of Plant Pathology, University of Arkansas, Division of Agriculture, Fayetteville, AR, 72701, USA
- Burton H Bluhm
- Department of Plant Pathology, University of Arkansas, Division of Agriculture, Fayetteville, AR, 72701, USA
|
432
|
Liu Y, Lan C, Blumenstein M, Li J. Bi-level error correction for PacBio long reads. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 17:899-905. [PMID: 29990239 DOI: 10.1109/tcbb.2017.2780832] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The latest sequencing technologies such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines can generate long reads thousands of bases in length, much longer than the reads of a few hundred bases generated by Illumina machines. However, these long reads are prone to much higher error rates, for example 15%, making downstream analysis and applications very difficult. Error correction is a process to improve the quality of sequencing data. Hybrid correction strategies have recently been proposed that use Illumina reads with low error rates to fix sequencing errors in the noisy long reads, with good performance. In this paper, we propose a new method named Bicolor, a bi-level framework of hybrid error correction for further improving the quality of PacBio long reads. At the first level, our method uses a de Bruijn graph-based error correction idea to search paths in pairs of solid k-mers iteratively with an increasing length of k-mer. At the second level, we combine the processed results under different parameters from the first level. In particular, a multiple sequence alignment algorithm is used to align those similar long reads, followed by a voting algorithm which determines the final base at each position of the reads. We compare the performance of Bicolor with three state-of-the-art methods on three real data sets. Results demonstrate that Bicolor always achieves the highest identity ratio. Bicolor also achieves a higher alignment ratio () and a higher number of aligned reads than the current methods on two data sets. On the third data set, our method is closely competitive with the current methods in terms of number of aligned reads and genome coverage. The C++ source code of our algorithm is freely available at https://github.com/yuansliu/Bicolor.
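The second-level voting step described above can be illustrated with a toy column-wise majority vote. This is a minimal Python sketch that assumes the reads have already been aligned to equal length with `-` as the gap symbol; it is not Bicolor's C++ implementation, and the name `column_vote` and the example reads are hypothetical.

```python
from collections import Counter


def column_vote(aligned_reads, gap="-"):
    """Majority-vote consensus over pre-aligned, equal-length reads.

    Columns where the gap symbol wins the vote are dropped from the result.
    Ties go to the symbol seen first in the column (Counter ordering).
    """
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("aligned reads must be non-empty and share one length")
    consensus = []
    for column in zip(*aligned_reads):  # iterate position by position
        base, _ = Counter(column).most_common(1)[0]
        if base != gap:
            consensus.append(base)
    return "".join(consensus)


# Hypothetical aligned versions of one long read, not data from the paper.
aligned = ["ACG-TA", "ACGTTA", "AC--TA"]
print(column_vote(aligned))  # ACGTA
```

In the example, the fourth column holds two gaps and one `T`, so the gap wins and the spurious insertion is removed from the consensus.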
|
433
|
Teeling EC, Vernes SC, Dávalos LM, Ray DA, Gilbert MTP, Myers E. Bat Biology, Genomes, and the Bat1K Project: To Generate Chromosome-Level Genomes for All Living Bat Species. Annu Rev Anim Biosci 2017; 6:23-46. [PMID: 29166127 DOI: 10.1146/annurev-animal-022516-022811] [Citation(s) in RCA: 143] [Impact Index Per Article: 17.9] [Indexed: 11/09/2022]
Abstract
Bats are unique among mammals, possessing some of the rarest mammalian adaptations, including true self-powered flight, laryngeal echolocation, exceptional longevity, unique immunity, contracted genomes, and vocal learning. They provide key ecosystem services, pollinating tropical plants, dispersing seeds, and controlling insect pest populations, thus driving healthy ecosystems. They account for more than 20% of all living mammalian diversity, and their crown-group evolutionary history dates back to the Eocene. Despite their great numbers and diversity, many species are threatened and endangered. Here we announce Bat1K, an initiative to sequence the genomes of all living bat species (n∼1,300) to chromosome-level assembly. The Bat1K genome consortium unites bat biologists (>148 members as of writing), computational scientists, conservation organizations, genome technologists, and any interested individuals committed to a better understanding of the genetic and evolutionary mechanisms that underlie the unique adaptations of bats. Our aim is to catalog the unique genetic diversity present in all living bats to better understand the molecular basis of their unique adaptations; uncover their evolutionary history; link genotype with phenotype; and ultimately better understand, promote, and conserve bats. Here we review the unique adaptations of bats and highlight how chromosome-level genome assemblies can uncover the molecular basis of these traits. We present a novel sequencing and assembly strategy and review the striking societal and scientific benefits that will result from the Bat1K initiative.
Affiliation(s)
- Emma C Teeling
  - School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland
- Sonja C Vernes
  - Neurogenetics of Vocal Communication Group, Max Planck Institute for Psycholinguistics, Nijmegen, 6500 AH, The Netherlands; Donders Centre for Cognitive Neuroimaging, Nijmegen, 6525 EN, The Netherlands
- Liliana M Dávalos
  - Department of Ecology and Evolution, Stony Brook University, Stony Brook, New York 11794-5245, USA
- David A Ray
  - Department of Biological Sciences, Texas Tech University, Lubbock, Texas 79409, USA
- M Thomas P Gilbert
  - Natural History Museum of Denmark, University of Copenhagen, 1350 Copenhagen, Denmark; University Museum, Norwegian University of Science and Technology, 7491 Trondheim, Norway
- Eugene Myers
  - Max Planck Institute for Molecular Cell Biology and Genetics, 01307 Dresden, Germany
- *Full list of Bat1K Consortium members in Supplemental Appendix
|
434
|
Li Y, Wei W, Feng J, Luo H, Pi M, Liu Z, Kang C. Genome re-annotation of the wild strawberry Fragaria vesca using extensive Illumina- and SMRT-based RNA-seq datasets. DNA Res 2017; 25:61-70. [PMID: 29036429 PMCID: PMC5824900 DOI: 10.1093/dnares/dsx038] [Citation(s) in RCA: 50] [Impact Index Per Article: 6.3] [Received: 07/13/2017] [Accepted: 08/29/2017] [Indexed: 01/30/2023] Open
Abstract
The genome of the wild diploid strawberry species Fragaria vesca, an ideal model system of cultivated strawberry (Fragaria × ananassa, octoploid) and other Rosaceae family crops, was first published in 2011 and followed by a new assembly (Fvb). However, the annotation for Fvb mainly relied on ab initio predictions and included only predicted coding sequences, therefore an improved annotation is highly desirable. Here, a new annotation version named v2.0.a2 was created for the Fvb genome by a pipeline utilizing one PacBio library, 90 Illumina RNA-seq libraries, and 9 small RNA-seq libraries. Altogether, 18,641 genes (55.6% out of 33,538 genes) were augmented with information on the 5′ and/or 3′ UTRs, 13,168 (39.3%) protein-coding genes were modified or newly identified, and 7,370 genes were found to possess alternative isoforms. In addition, 1,938 long non-coding RNAs, 171 miRNAs, and 51,714 small RNA clusters were integrated into the annotation. This new annotation of F. vesca is substantially improved in both accuracy and integrity of gene predictions, beneficial to the gene functional studies in strawberry and to the comparative genomic analysis of other horticultural crops in Rosaceae family.
Affiliation(s)
- Yongping Li
  - Key Laboratory of Horticultural Plant Biology (Ministry of Education), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Wei Wei
  - Key Laboratory of Horticultural Plant Biology (Ministry of Education), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Jia Feng
  - Key Laboratory of Horticultural Plant Biology (Ministry of Education), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Huifeng Luo
  - Key Laboratory of Horticultural Plant Biology (Ministry of Education), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Mengting Pi
  - Key Laboratory of Horticultural Plant Biology (Ministry of Education), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Zhongchi Liu
  - Key Laboratory of Horticultural Plant Biology (Ministry of Education), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China; Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742, USA
- Chunying Kang
  - Key Laboratory of Horticultural Plant Biology (Ministry of Education), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
|
435
|
Argout X, Martin G, Droc G, Fouet O, Labadie K, Rivals E, Aury JM, Lanaud C. The cacao Criollo genome v2.0: an improved version of the genome for genetic and functional genomic studies. BMC Genomics 2017; 18:730. [PMID: 28915793 PMCID: PMC5603072 DOI: 10.1186/s12864-017-4120-9] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Received: 03/16/2017] [Accepted: 09/06/2017] [Indexed: 11/21/2022] Open
Abstract
Background: Theobroma cacao L., native to the Amazonian basin of South America, is an economically important fruit tree crop for tropical countries as a source of chocolate. The first draft genome of the species, from a Criollo cultivar, was published in 2011. Although it is a useful resource, some improvements are possible, including identifying misassemblies, reducing the number of scaffolds and gaps, and anchoring un-anchored sequences to the 10 chromosomes. Methods: We used an NGS-based approach to significantly improve the assembly of the Belizean Criollo B97-61/B2 genome. We combined four Illumina large-insert mate-pair libraries with 52x of Pacific Biosciences long reads to correct misassembled regions and reduce the number of scaffolds. We then used genotyping by sequencing (GBS) methods to increase the proportion of the assembly anchored to chromosomes. Results: The scaffold number decreased from 4,792 in assembly V1 to 554 in V2, while the scaffold N50 size increased from 0.47 Mb in V1 to 6.5 Mb in V2. A total of 96.7% of the assembly was anchored to the 10 chromosomes, compared to 66.8% in the previous version. Unknown sites (Ns) were reduced from 10.8% to 5.7%. In addition, we updated the functional annotations and performed a new RefSeq structural annotation based on RNAseq evidence. Conclusion: Theobroma cacao Criollo genome version 2 will be a valuable resource for the investigation of complex traits at the genomic level and for future comparative genomics and genetics studies in the cacao tree. New functional tools and annotations are available on the Cocoa Genome Hub (http://cocoa-genome-hub.southgreen.fr).
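The scaffold N50 statistic reported in this abstract can be computed as in the following minimal Python sketch. It uses the common definition (the length of the scaffold at which the cumulative sum, taken from longest to shortest, first reaches half of the total assembly length); the function name `n50` and the example lengths are hypothetical and are not taken from the cacao assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths (arbitrary units).
scaffold_lengths = [80, 70, 50, 40, 30, 20]
print(n50(scaffold_lengths))
```

Here the total is 290, and the two longest scaffolds (80 + 70 = 150) already cover at least half of it, so the N50 is 70. Merging scaffolds raises N50 even when the total length is unchanged, which is why the statistic is used to summarise contiguity.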
Affiliation(s)
- X Argout
  - CIRAD, UMR AGAP, F-34398, Montpellier, France
- G Martin
  - CIRAD, UMR AGAP, F-34398, Montpellier, France
- G Droc
  - CIRAD, UMR AGAP, F-34398, Montpellier, France
- O Fouet
  - CIRAD, UMR AGAP, F-34398, Montpellier, France
- K Labadie
  - Commissariat à l'Energie Atomique (CEA), Institut de Génomique (IG) Genoscope, F-92057, Evry, France
- E Rivals
  - Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM), CNRS et Université de Montpellier, 34095, Cedex 5, Montpellier, France; Institut de Biologie Computationnelle (IBC), Université de Montpellier, Montpellier, France
- J M Aury
  - Commissariat à l'Energie Atomique (CEA), Institut de Génomique (IG) Genoscope, F-92057, Evry, France
- C Lanaud
  - CIRAD, UMR AGAP, F-34398, Montpellier, France
|
436
|
Zaccaron AZ, Woloshuk CP, Bluhm BH. Comparative genomics of maize ear rot pathogens reveals expansion of carbohydrate-active enzymes and secondary metabolism backbone genes in Stenocarpella maydis. Fungal Biol 2017; 121:966-983. [PMID: 29029703 DOI: 10.1016/j.funbio.2017.08.006] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 06/08/2017] [Revised: 08/15/2017] [Accepted: 08/18/2017] [Indexed: 12/11/2022]
Abstract
Stenocarpella maydis is a plant pathogenic fungus that causes Diplodia ear rot, one of the most destructive diseases of maize. To date, little information is available regarding the molecular basis of pathogenesis in this organism, in part due to limited genomic resources. In this study, a 54.8 Mb draft genome assembly of S. maydis was obtained with Illumina and PacBio sequencing technologies, and analyzed. Comparative genomic analyses with the predominant maize ear rot pathogens Aspergillus flavus, Fusarium verticillioides, and Fusarium graminearum revealed an expanded set of carbohydrate-active enzymes for cellulose and hemicellulose degradation in S. maydis. Analyses of predicted genes involved in starch degradation revealed six putative α-amylases, four extracellular and two intracellular, and two putative γ-amylases, one of which appears to have been acquired from bacteria via horizontal transfer. Additionally, 87 backbone genes involved in secondary metabolism were identified, which represents one of the largest known assemblages among Pezizomycotina species. Numerous secondary metabolite gene clusters were identified, including two clusters likely involved in the biosynthesis of diplodiatoxin and chaetoglobosins. The draft genome of S. maydis presented here will serve as a useful resource for molecular genetics, functional genomics, and analyses of population diversity in this organism.
Affiliation(s)
- Alex Z Zaccaron
  - Department of Plant Pathology, University of Arkansas, Division of Agriculture, Fayetteville, AR 72701, USA
- Charles P Woloshuk
  - Department of Botany and Plant Pathology, Purdue University, West Lafayette, IN, USA
- Burton H Bluhm
  - Department of Plant Pathology, University of Arkansas, Division of Agriculture, Fayetteville, AR 72701, USA
|
437
|
A transcriptome atlas of rabbit revealed by PacBio single-molecule long-read sequencing. Sci Rep 2017; 7:7648. [PMID: 28794490 PMCID: PMC5550469 DOI: 10.1038/s41598-017-08138-z] [Citation(s) in RCA: 66] [Impact Index Per Article: 8.3] [Received: 12/07/2016] [Accepted: 07/07/2017] [Indexed: 01/08/2023] Open
Abstract
It is widely acknowledged that transcriptional diversity contributes substantially to biological regulation in eukaryotes. Since the advent of second-generation sequencing technologies, a large number of RNA sequencing studies have considerably improved our understanding of transcriptome complexity. However, obtaining full-length transcripts remains a major challenge because of the difficulties of short read-based assembly. In the present study, we employ PacBio single-molecule long-read sequencing technology for whole-transcriptome profiling in rabbit (Oryctolagus cuniculus). In total, we obtain 36,186 high-confidence transcripts from 14,474 genic loci, of which more than 23% of genic loci and 66% of isoforms are not yet annotated in the current reference genome. Furthermore, about 17% of transcripts are computationally predicted to be non-coding RNAs. Up to 24,797 alternative splicing (AS) and 11,184 alternative polyadenylation (APA) events are detected within this de novo constructed transcriptome. The results provide a comprehensive set of reference transcripts and hence contribute to the improved annotation of the rabbit genome.
|
438
|
Haghshenas E, Hach F, Sahinalp SC, Chauve C. CoLoRMap: Correcting Long Reads by Mapping short reads. Bioinformatics 2017; 32:i545-i551. [PMID: 27587673 DOI: 10.1093/bioinformatics/btw463] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Indexed: 01/27/2023] Open
Abstract
MOTIVATION Second-generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. Recent developments in long-read sequencing methods offer a promising way to address this issue. However, so far long reads are characterized by a high error rate, and assembling from long reads requires a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads. RESULTS We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read, and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods. AVAILABILITY AND IMPLEMENTATION The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap. CONTACT ehaghshe@sfu.ca or cedric.chauve@sfu.ca. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
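The first idea in this abstract, a shortest path through overlapping short reads that minimizes the edit score, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the CoLoRMap implementation: each mapped short read is reduced to a hypothetical tuple `(start, end, edit_cost)` on long-read coordinates (half-open intervals), an edge joins two reads when the second starts inside the first and extends past it, and the path cost is simply the sum of per-read edit costs. The function name `cheapest_chain` is also hypothetical.

```python
import heapq


def cheapest_chain(reads, region_start, region_end):
    """Dijkstra search for the cheapest chain of overlapping reads.

    reads: list of (start, end, edit_cost) with half-open intervals.
    Returns (total_cost, [read indices]) covering the region, or None.
    """
    # Source nodes: reads that cover the first position of the region.
    heap = [(cost, i) for i, (s, e, cost) in enumerate(reads)
            if s <= region_start < e]
    heapq.heapify(heap)
    best = {i: cost for cost, i in heap}
    prev = {i: None for _, i in heap}

    while heap:
        cost, i = heapq.heappop(heap)
        if cost > best[i]:
            continue  # stale queue entry
        s_i, e_i, _ = reads[i]
        if e_i >= region_end:
            # First target popped is optimal; walk back to build the chain.
            chain = []
            while i is not None:
                chain.append(i)
                i = prev[i]
            return cost, chain[::-1]
        for j, (s_j, e_j, c_j) in enumerate(reads):
            # Edge i -> j: j overlaps the end of i and extends the chain.
            if s_i < s_j < e_i and e_j > e_i:
                new_cost = cost + c_j
                if new_cost < best.get(j, float("inf")):
                    best[j] = new_cost
                    prev[j] = i
                    heapq.heappush(heap, (new_cost, j))
    return None


# Hypothetical mappings of five short reads onto a 100-base long read.
mapped = [(0, 40, 2), (30, 70, 5), (35, 80, 1), (60, 100, 0), (75, 100, 3)]
result = cheapest_chain(mapped, 0, 100)
print(result)
```

For this toy input the search returns a total cost of 3 via reads 0, 2 and 3. Because all costs are non-negative, the first read popped from the queue that reaches the end of the region is guaranteed to close a minimum-cost chain.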
Affiliation(s)
- Ehsan Haghshenas
  - School of Computing Sciences MADD-Gen Graduate Program, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
- Faraz Hach
  - School of Computing Sciences Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
- S Cenk Sahinalp
  - School of Computing Sciences Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada, School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA
- Cedric Chauve
  - Department of Mathematics, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
|
439
|
Sahraeian SME, Mohiyuddin M, Sebra R, Tilgner H, Afshar PT, Au KF, Bani Asadi N, Gerstein MB, Wong WH, Snyder MP, Schadt E, Lam HYK. Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis. Nat Commun 2017; 8:59. [PMID: 28680106 PMCID: PMC5498581 DOI: 10.1038/s41467-017-00050-4] [Citation(s) in RCA: 213] [Impact Index Per Article: 26.6] [Received: 10/07/2016] [Accepted: 05/02/2017] [Indexed: 12/30/2022] Open
Abstract
RNA-sequencing (RNA-seq) is an essential technique for transcriptome studies, and hundreds of analysis tools have been developed since its debut. Although recent efforts have attempted to assess the latest available tools, they have not evaluated the analysis workflows comprehensively to unleash the power within RNA-seq. Here we conduct an extensive study analysing a broad spectrum of RNA-seq workflows. Surpassing the expression analysis scope, our work also includes assessment of RNA variant-calling, RNA editing and RNA fusion detection techniques. Specifically, we examine both short- and long-read RNA-seq technologies, 39 analysis tools resulting in ~120 combinations, and ~490 analyses involving 15 samples with a variety of germline, cancer and stem cell data sets. We report the performance and propose a comprehensive RNA-seq analysis protocol, named RNACocktail, along with a computational pipeline achieving high accuracy. Validation on different samples reveals that our proposed protocol could help researchers extract more biologically relevant predictions by broad analysis of the transcriptome. RNA-seq is widely used for transcriptome analysis. Here, the authors analyse a wide spectrum of RNA-seq workflows and present a comprehensive analysis protocol named RNACocktail as well as a computational pipeline leveraging the widely used tools for accurate RNA-seq analysis.
Affiliation(s)
- Robert Sebra
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
- Hagen Tilgner
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Pegah T Afshar
  - Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA
- Kin Fai Au
  - Department of Internal Medicine, University of Iowa, Iowa City, IA, 52242, USA
- Mark B Gerstein
  - Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA
- Wing Hung Wong
  - Statistics; Health Research and Policy, Stanford University, Stanford, CA, 94305, USA
- Michael P Snyder
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Eric Schadt
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
- Hugo Y K Lam
  - Roche Sequencing Solutions, Belmont, CA, 94002, USA
|
440
|
Kremer FS, McBride AJA, Pinto LDS. Approaches for in silico finishing of microbial genome sequences. Genet Mol Biol 2017; 40:553-576. [PMID: 28898352 PMCID: PMC5596377 DOI: 10.1590/1678-4685-gmb-2016-0230] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 09/25/2016] [Accepted: 03/13/2017] [Indexed: 12/15/2022] Open
Abstract
The introduction of next-generation sequencing (NGS) had a significant effect on the availability of genomic information, leading to an increase in the number of sequenced genomes from a large spectrum of organisms. Unfortunately, due to the limitations imposed by short-read sequencing platforms, most of these newly sequenced genomes remain "drafts", incomplete representations of the whole genetic content. Previous genome sequencing studies indicated that finishing a genome sequenced by NGS, even a bacterial one, may require additional sequencing to fill the gaps, making the entire process very expensive. As such, several in silico approaches have been developed to optimize genome assemblies and facilitate the finishing process. The present review aims to explore some free (open source, in many cases) tools that are available to facilitate genome finishing.
Affiliation(s)
- Frederico Schmitt Kremer
  - Programa de Pós-Graduação em Biotecnologia (PPGB), Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Pelotas, Brazil
- Alan John Alexander McBride
  - Programa de Pós-Graduação em Biotecnologia (PPGB), Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Pelotas, Brazil
- Luciano da Silva Pinto
  - Programa de Pós-Graduação em Biotecnologia (PPGB), Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Pelotas, Brazil
|
441
|
Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol 2017; 13:e1005595. [PMID: 28594827 PMCID: PMC5481147 DOI: 10.1371/journal.pcbi.1005595] [Citation(s) in RCA: 5235] [Impact Index Per Article: 654.4] [Received: 01/13/2017] [Revised: 06/22/2017] [Accepted: 05/22/2017] [Indexed: 12/11/2022] Open
Abstract
The Illumina DNA sequencing platform generates accurate but short reads, which can be used to produce accurate but fragmented genome assemblies. Pacific Biosciences and Oxford Nanopore Technologies DNA sequencing platforms generate long reads that can produce complete genome assemblies, but the sequencing is more expensive and error-prone. There is significant interest in combining data from these complementary sequencing technologies to generate more accurate "hybrid" assemblies. However, few tools exist that truly leverage the benefits of both types of data, namely the accuracy of short reads and the structural resolving power of long reads. Here we present Unicycler, a new tool for assembling bacterial genomes from a combination of short and long reads, which produces assemblies that are accurate, complete and cost-effective. Unicycler builds an initial assembly graph from short reads using the de novo assembler SPAdes and then simplifies the graph using information from short and long reads. Unicycler uses a novel semi-global aligner to align long reads to the assembly graph. Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long-read depth and accuracy are low. Unicycler is open source (GPLv3) and available at github.com/rrwick/Unicycler.
Affiliation(s)
- Ryan R. Wick
  - Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Victoria, Australia
- Louise M. Judd
  - Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Victoria, Australia
- Claire L. Gorrie
  - Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Victoria, Australia
- Kathryn E. Holt
  - Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Victoria, Australia
|
442
|
Liu X, Mei W, Soltis PS, Soltis DE, Barbazuk WB. Detecting alternatively spliced transcript isoforms from single-molecule long-read sequences without a reference genome. Mol Ecol Resour 2017; 17:1243-1256. [PMID: 28316149 DOI: 10.1111/1755-0998.12670] [Citation(s) in RCA: 81] [Impact Index Per Article: 10.1] [Received: 08/26/2016] [Revised: 03/03/2017] [Accepted: 03/07/2017] [Indexed: 01/11/2023]
Abstract
Alternative splicing (AS) is a major source of transcript and proteome diversity, but examining AS in species without well-annotated reference genomes remains difficult. Research on both human and mouse has demonstrated the advantages of using Iso-Seq™ data for isoform-level transcriptome analysis, including the study of AS and gene fusion. We applied Iso-Seq™ to investigate AS in Amborella trichopoda, a phylogenetically pivotal species that is sister to all other living angiosperms. Our data show that, compared with RNA-Seq data, the Iso-Seq™ platform provides better recovery of large transcripts, new gene locus identification and gene model correction. Reference-based AS detection with Iso-Seq™ data identifies AS within a higher fraction of multi-exonic genes than observed for published RNA-Seq analysis (45.8% vs. 37.5%). These data demonstrate that the Iso-Seq™ approach is useful for detecting AS events. Using the Iso-Seq-defined transcript collection in Amborella as a reference, we further describe a pipeline for detection of AS isoforms from PacBio Iso-Seq™ without using a reference sequence (de novo). Results using this pipeline show a 66%-76% overall success rate in identifying AS events. This de novo AS detection pipeline provides a method to accurately characterize and identify bona fide alternatively spliced transcripts in any nonmodel system that lacks a reference genome sequence. Hence, our pipeline has huge potential applications and benefits for the broader biology community.
Affiliation(s)
- Xiaoxian Liu
  - Department of Biology, University of Florida, Gainesville, FL, 32611-8525, USA; Florida Museum of Natural History, University of Florida, Gainesville, FL, 32611-7800, USA
- Wenbin Mei
  - Department of Biology, University of Florida, Gainesville, FL, 32611-8525, USA
- Pamela S Soltis
  - Florida Museum of Natural History, University of Florida, Gainesville, FL, 32611-7800, USA; Genetics Institute, University of Florida, Gainesville, FL, 32610, USA
- Douglas E Soltis
  - Department of Biology, University of Florida, Gainesville, FL, 32611-8525, USA; Florida Museum of Natural History, University of Florida, Gainesville, FL, 32611-7800, USA; Genetics Institute, University of Florida, Gainesville, FL, 32610, USA
- W Brad Barbazuk
  - Department of Biology, University of Florida, Gainesville, FL, 32611-8525, USA; Genetics Institute, University of Florida, Gainesville, FL, 32610, USA
|
444
|
Hoang NV, Furtado A, Mason PJ, Marquardt A, Kasirajan L, Thirugnanasambandam PP, Botha FC, Henry RJ. A survey of the complex transcriptome from the highly polyploid sugarcane genome using full-length isoform sequencing and de novo assembly from short read sequencing. BMC Genomics 2017; 18:395. [PMID: 28532419 PMCID: PMC5440902 DOI: 10.1186/s12864-017-3757-8] [Citation(s) in RCA: 116] [Impact Index Per Article: 14.5] [Received: 10/04/2016] [Accepted: 05/03/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Despite the economic importance of sugarcane in sugar and bioenergy production, there is not yet a reference genome available. Most of the sugarcane transcriptomic studies have been based on Saccharum officinarum gene indices (SoGI), expressed sequence tags (ESTs) and de novo assembled transcript contigs from short-reads; hence knowledge of the sugarcane transcriptome is limited in relation to transcript length and number of transcript isoforms. RESULTS The sugarcane transcriptome was sequenced using PacBio isoform sequencing (Iso-Seq) of a pooled RNA sample derived from leaf, internode and root tissues, of different developmental stages, from 22 varieties, to explore the potential for capturing full-length transcript isoforms. A total of 107,598 unique transcript isoforms were obtained, representing about 71% of the total number of predicted sugarcane genes. The majority of this dataset (92%) matched the plant protein database, while just over 2% was novel transcripts, and over 2% was putative long non-coding RNAs. About 56% and 23% of total sequences were annotated against the gene ontology and KEGG pathway databases, respectively. Comparison with de novo contigs from Illumina RNA-Sequencing (RNA-Seq) of the internode samples from the same experiment and public databases showed that the Iso-Seq method recovered more full-length transcript isoforms, had a higher N50 and average length of largest 1,000 proteins; whereas a greater representation of the gene content and RNA diversity was captured in RNA-Seq. Only 62% of PacBio transcript isoforms matched 67% of de novo contigs, while the non-matched proportions were attributed to the inclusion of leaf/root tissues and the normalization in PacBio, and the representation of more gene content and RNA classes in the de novo assembly, respectively. 
About 69% of PacBio transcript isoforms and 41% of de novo contigs aligned with the sorghum genome, indicating the high conservation of orthologs in the genic regions of the two genomes. CONCLUSIONS The transcriptome dataset should contribute to improved sugarcane gene models and sugarcane protein predictions; and will serve as a reference database for analysis of transcript expression in sugarcane.
Affiliation(s)
- Nam V Hoang
  - Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Room 2.245, Level 2, The John Hay Building, Queensland Biosciences Precinct [#80], 306 Carmody Road, St. Lucia, QLD, 4072, Australia; College of Agriculture and Forestry, Hue University, Hue, Vietnam
- Agnelo Furtado
  - Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Room 2.245, Level 2, The John Hay Building, Queensland Biosciences Precinct [#80], 306 Carmody Road, St. Lucia, QLD, 4072, Australia
- Patrick J Mason
  - Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Room 2.245, Level 2, The John Hay Building, Queensland Biosciences Precinct [#80], 306 Carmody Road, St. Lucia, QLD, 4072, Australia
- Annelie Marquardt
  - Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Room 2.245, Level 2, The John Hay Building, Queensland Biosciences Precinct [#80], 306 Carmody Road, St. Lucia, QLD, 4072, Australia; Sugar Research Australia, Indooroopilly, QLD, 4068, Australia
- Lakshmi Kasirajan
  - Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Room 2.245, Level 2, The John Hay Building, Queensland Biosciences Precinct [#80], 306 Carmody Road, St. Lucia, QLD, 4072, Australia; ICAR - Sugarcane Breeding Institute, Coimbatore, Tamil Nadu, India
- Prathima P Thirugnanasambandam
  - Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Room 2.245, Level 2, The John Hay Building, Queensland Biosciences Precinct [#80], 306 Carmody Road, St. Lucia, QLD, 4072, Australia; ICAR - Sugarcane Breeding Institute, Coimbatore, Tamil Nadu, India
- Frederik C Botha
  - Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Room 2.245, Level 2, The John Hay Building, Queensland Biosciences Precinct [#80], 306 Carmody Road, St. Lucia, QLD, 4072, Australia; Sugar Research Australia, Indooroopilly, QLD, 4068, Australia
- Robert J Henry
  - Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Room 2.245, Level 2, The John Hay Building, Queensland Biosciences Precinct [#80], 306 Carmody Road, St. Lucia, QLD, 4072, Australia
|
445
|
Zimin AV, Puiu D, Luo MC, Zhu T, Koren S, Marçais G, Yorke JA, Dvořák J, Salzberg SL. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res 2017. [PMID: 28130360 DOI: 10.1101/gr.213405.116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/06/2023]
Abstract
Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
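The abstract above reports contiguity as an N50 contig size. As a minimal sketch of how that statistic is usually computed (the function name `n50` and the toy contig lengths are illustrative assumptions, not taken from the paper), N50 is the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the total assembly size:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the summed length of all contigs.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly, not data from the paper.
toy_contigs = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_contigs)
```

Because N50 depends only on the distribution of contig lengths, it says nothing about correctness, which is why the authors separately check the contigs against optical maps and BAC-based assemblies.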
Affiliation(s)
- Aleksey V Zimin
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
- Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA
- Daniela Puiu
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
- Ming-Cheng Luo
- Department of Plant Sciences, University of California, Davis, California 95616, USA
- Tingting Zhu
- Department of Plant Sciences, University of California, Davis, California 95616, USA
- Sergey Koren
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Guillaume Marçais
- Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
- James A Yorke
- Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA
- Departments of Mathematics and Physics, University of Maryland, College Park, Maryland 20742, USA
- Jan Dvořák
- Department of Plant Sciences, University of California, Davis, California 95616, USA
- Steven L Salzberg
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
- Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, Maryland 21218, USA
|
446
|
Bao E, Lan L. HALC: High throughput algorithm for long read error correction. BMC Bioinformatics 2017; 18:204. [PMID: 28381259 PMCID: PMC5382505 DOI: 10.1186/s12859-017-1610-3] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 10/27/2016] [Accepted: 03/24/2017] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus lead to low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis. RESULTS Here, we introduce HALC, a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement so that a long read region can be aligned to at least one contig region, including its true genome region's repeats in the contigs sufficiently similar to it (similar repeat based alignment approach). It then constructs a contig graph and, for each long read, references the other long reads' alignments to find the most accurate alignment and correct it with the aligned contig regions (long read support based validation approach). Even though some long read regions without the true genome regions in the contigs are corrected with their repeats, this approach makes it possible to further refine these long read regions with the initial insufficient short reads and correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data sets, HALC was able to obtain 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms. CONCLUSIONS The HALC software can be downloaded for free from this site: https://github.com/lanl001/halc.
Affiliation(s)
- Ergude Bao
- School of Software Engineering, Beijing Jiaotong University, 3 Shangyuan Residence, Haidian District, Beijing, 100044 China
- Department of Botany and Plant Sciences, University of California, Riverside, 900 University Ave., Riverside, CA, 92521 USA
- Lingxiao Lan
- School of Software Engineering, Beijing Jiaotong University, 3 Shangyuan Residence, Haidian District, Beijing, 100044 China
|
447
|
Li Y, Dai C, Hu C, Liu Z, Kang C. Global identification of alternative splicing via comparative analysis of SMRT- and Illumina-based RNA-seq in strawberry. The Plant Journal: For Cell and Molecular Biology 2017; 90:164-176. [PMID: 27997733 DOI: 10.1111/tpj.13462] [Citation(s) in RCA: 115] [Impact Index Per Article: 14.4] [Received: 08/30/2016] [Revised: 11/23/2016] [Accepted: 12/14/2016] [Indexed: 05/21/2023]
Abstract
Alternative splicing (AS) is a key post-transcriptional regulatory mechanism, yet little is known about its roles in fruit crops. Here, AS was globally analyzed in the wild strawberry Fragaria vesca genome with RNA-seq data derived from different stages of fruit development. The AS landscape was characterized and compared between the single-molecule, real-time (SMRT) and Illumina RNA-seq platforms. While SMRT has a lower sequencing depth, it identifies more genes undergoing AS (57.67% of detected multiexon genes) than Illumina (33.48%), illustrating the efficacy of SMRT in AS identification. We investigated different modes of AS in the context of fruit development; the percentage of intron retention (IR) is markedly reduced whereas that of alternative acceptor sites (AA) is significantly increased post-fertilization when compared with pre-fertilization. When all the identified transcripts were combined, a total of 66.43% of detected multiexon genes in strawberry undergo AS, some of which lead to a gain or loss of conserved domains in the gene products. The work demonstrates that SMRT sequencing is highly powerful in AS discovery and provides a rich data resource for later functional studies of different isoforms. Further, shifting AS modes may contribute to rapid changes of gene expression during fruit set.
Affiliation(s)
- Yongping Li
- Key Laboratory of Horticultural Plant Biology (Ministry of Education), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China
- Cheng Dai
- College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China
- Chungen Hu
- Key Laboratory of Horticultural Plant Biology (Ministry of Education), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China
- Zhongchi Liu
- Key Laboratory of Horticultural Plant Biology (Ministry of Education), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China
- Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD, 20742, USA
- Chunying Kang
- Key Laboratory of Horticultural Plant Biology (Ministry of Education), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, 430070, China
|
448
|
Genome-wide analysis of complex wheat gliadins, the dominant carriers of celiac disease epitopes. Sci Rep 2017; 7:44609. [PMID: 28300172 PMCID: PMC5353739 DOI: 10.1038/srep44609] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.1] [Received: 10/21/2016] [Accepted: 02/09/2017] [Indexed: 01/08/2023] Open
Abstract
Gliadins, specified by six compound chromosomal loci (Gli-A1/B1/D1 and Gli-A2/B2/D2) in hexaploid bread wheat, are the dominant carriers of celiac disease (CD) epitopes. Because of their complexity, genome-wide characterization of gliadins is a strong challenge. Here, we approached this challenge by combining transcriptomic, proteomic and bioinformatic investigations. Through third-generation RNA sequencing, full-length transcripts were identified for 52 gliadin genes in the bread wheat cultivar Xiaoyan 81. Of them, 42 were active and predicted to encode 25 α-, 11 γ-, one δ- and five ω-gliadins. Comparative proteomic analysis between Xiaoyan 81 and six newly-developed mutants each lacking one Gli locus indicated the accumulation of 38 gliadins in the mature grains. A novel group of α-gliadins (the CSTT group) was recognized to contain very few or no CD epitopes. The δ-gliadins identified here or previously did not carry CD epitopes. Finally, the mutant lacking Gli-D2 showed significant reductions in the most celiac-toxic α-gliadins and derivative CD epitopes. The insights and resources generated here should aid further studies on gliadin functions in CD and the breeding of healthier wheat.
|
449
|
Salmela L, Walve R, Rivals E, Ukkonen E. Accurate self-correction of errors in long reads using de Bruijn graphs. Bioinformatics 2017; 33:799-806. [PMID: 27273673 PMCID: PMC5351550 DOI: 10.1093/bioinformatics/btw321] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Received: 03/19/2016] [Revised: 05/03/2016] [Accepted: 05/16/2016] [Indexed: 12/04/2022] Open
Abstract
MOTIVATION New long read sequencing technologies, like PacBio SMRT and Oxford Nanopore, can produce sequencing reads up to 50 000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilization of the reads in, e.g. de novo genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second generation sequencing technologies to correct the long reads. RESULTS We present an error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of k-mers, and second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments, the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75×, the throughput of the new method is at least 20% higher. AVAILABILITY AND IMPLEMENTATION LoRMA is freely available at http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/. CONTACT leena.salmela@cs.helsinki.fi.
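The first phase described above builds de Bruijn graphs over k-mers of increasing length. The sketch below illustrates only that underlying idea and is not LoRMA's implementation: it keeps k-mers seen at least `min_count` times (a hypothetical threshold) and links each k-mer's (k-1)-base prefix to its (k-1)-base suffix. The function names and toy reads are assumptions made for illustration, and the actual correction and polishing steps are omitted.

```python
from collections import Counter, defaultdict


def solid_kmers(reads, k, min_count=2):
    """Return the k-mers that occur at least min_count times in the reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_count}


def de_bruijn_graph(kmers):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it leads to."""
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical toy reads: the third read carries one substitution (T -> A).
toy_reads = ["ACGTAC", "ACGTAC", "ACGAAC"]

# Build one graph per k, mimicking the "increasing k" idea on a tiny scale.
graphs = {k: de_bruijn_graph(solid_kmers(toy_reads, k)) for k in (3, 4)}
```

In this toy case the k-mers introduced by the substitution occur only once, so they fall below the threshold and never enter the graph. A corrector can then look for a path through the remaining nodes that best explains the erroneous read.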
Affiliation(s)
- Leena Salmela
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
- Riku Walve
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
- Eric Rivals
- LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, Montpellier, France
- Esko Ukkonen
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
|
450
|
Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 2017; 27:722-736. [PMID: 28298431 PMCID: PMC5411767 DOI: 10.1101/gr.215087.116] [Citation(s) in RCA: 4775] [Impact Index Per Article: 596.9] [Received: 08/23/2016] [Accepted: 03/03/2017] [Indexed: 12/11/2022]
Abstract
Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
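The overlapping strategy mentioned above is based on tf-idf weighted MinHash. As a rough illustration of the plain, unweighted MinHash idea only (not Canu's code, and without the tf-idf weighting), the sketch below estimates the Jaccard similarity of two reads' k-mer sets from the fraction of matching per-seed minimum hashes. The function names, `k=4`, `num_hashes=64` and the seeded BLAKE2b hashing scheme are all assumptions chosen for the example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seeded_hash(seed, kmer):
    """Deterministic 64-bit hash of a k-mer under a given seed (illustrative scheme)."""
    digest = hashlib.blake2b(f"{seed}:{kmer}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=4, num_hashes=64):
    """Keep, for each seed, the smallest hash over all k-mers of seq.

    Assumes len(seq) >= k so that the k-mer set is non-empty.
    """
    kmer_set = kmers(seq, k)
    return [min(seeded_hash(seed, kmer) for kmer in kmer_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)


# Hypothetical toy reads; real long reads would be thousands of bases long.
read_a = "ACGTACGTGACCTGA"
read_b = "ACGTACGTGACCTGA"
similarity = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
```

Comparing short fixed-size sketches instead of full k-mer sets is what makes all-versus-all overlap detection tractable. The weighting described in the abstract additionally changes how much each k-mer contributes, which this sketch does not attempt.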
Affiliation(s)
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Jason R Miller
- J. Craig Venter Institute, Rockville, Maryland 20850, USA
- Nicholas H Bergman
- National Biodefense Analysis and Countermeasures Center, Frederick, Maryland 21702, USA
- Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
|