1
|
Wang ZF, Yu EP, Fu L, Deng HG, Zhu WG, Xu FX, Cao HL. Chromosome-scale assemblies of three Ormosia species: repetitive sequences distribution and structural rearrangement. Gigascience 2025; 14:giaf047. [PMID: 40378137 PMCID: PMC12083454 DOI: 10.1093/gigascience/giaf047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2024] [Revised: 12/12/2024] [Accepted: 03/27/2025] [Indexed: 05/18/2025] Open
Abstract
BACKGROUND The genus Ormosia belongs to the Fabaceae family; almost all Ormosia species are endemic to China, which is considered one of the centers of this genus. Thus, genomic studies on the genus are needed to better understand species evolution and ensure the conservation and utilization of these species. We performed a chromosome-scale assembly of O. purpureiflora and updated the chromosome-scale assemblies of O. emarginata and O. semicastrata for comparative genomics. FINDINGS The genome assembly sizes of the 3 species ranged from 1.42 to 1.58 Gb, with O. purpureiflora being the largest. Repetitive sequences accounted for 74.0-76.3% of the genomes, and the predicted gene counts ranged from 50,517 to 55,061. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis indicated 97.0-98.4% genome completeness, whereas the long terminal repeat (LTR) assembly index values ranged from 13.66 to 17.56, meeting the "reference genome" quality standard. Gene completeness, assessed using BUSCO and OMArk, ranged from 95.1% to 96.3% and from 97.1% to 98.1%, respectively. Characterizing genome architectures further revealed that inversions were the main structural rearrangements in Ormosia. In numbers, density distributions of repetitive elements revealed the types of Helitron and terminal inverted repeat (TIR) elements and the types of Gypsy and unknown LTR retrotransposons (LTR-RTs) concentrated in different regions on the chromosomes, whereas Copia LTR-RTs were generally evenly distributed along the chromosomes in Ormosia. Compared with the sister species Lupinus albus, Ormosia species had lower numbers and percentages of resistance (R) genes and transcription factor genes. Genes related to alkaloid, terpene, and flavonoid biosynthesis were found to be duplicated through tandem or proximal duplications. Notably, some genes associated with growth and defense were absent in O. purpureiflora. By resequencing 153 genotypes (∼30 Gb of data per sample) from 6 O. purpureiflora (sub)populations, we identified 40,146 single nucleotide polymorphisms. Corresponding to its very small populations, O. purpureiflora exhibited low genetic diversity. CONCLUSIONS The Ormosia genome assemblies provide valuable resources for studying the evolution, conservation, and potential utility of both Ormosia and Fabaceae species.
Affiliation(s)
- Zheng-Feng Wang
- Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- Key Laboratory of National Forestry and Grassland Administration on Plant Conservation and Utilization in Southern China, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- South China National Botanical Garden, Guangzhou 510650, China
| | - En-Ping Yu
- Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- Key Laboratory of National Forestry and Grassland Administration on Plant Conservation and Utilization in Southern China, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- South China National Botanical Garden, Guangzhou 510650, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Lin Fu
- Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- Key Laboratory of National Forestry and Grassland Administration on Plant Conservation and Utilization in Southern China, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- South China National Botanical Garden, Guangzhou 510650, China
- Key Laboratory of Plant Resources Conservation and Sustainable Utilization, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
| | - Hua-Ge Deng
- Management Office of Guangdong Luofushan Provincial Nature Reserve, Huizhou 516133, China
| | - Wei-Guang Zhu
- Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- Key Laboratory of National Forestry and Grassland Administration on Plant Conservation and Utilization in Southern China, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- South China National Botanical Garden, Guangzhou 510650, China
| | - Feng-Xia Xu
- Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- Key Laboratory of National Forestry and Grassland Administration on Plant Conservation and Utilization in Southern China, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- South China National Botanical Garden, Guangzhou 510650, China
- Key Laboratory of Plant Resources Conservation and Sustainable Utilization, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
| | - Hong-Lin Cao
- Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- Key Laboratory of National Forestry and Grassland Administration on Plant Conservation and Utilization in Southern China, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- South China National Botanical Garden, Guangzhou 510650, China
| |
|
2
|
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
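To make the terms in this abstract concrete, the following is a minimal sketch (not taken from the review) of counting k-mers in a DNA string and listing the k-mers of a given length that never occur in it. The function names `count_kmers` and `absent_kmers` are hypothetical. Note that nullomers are usually defined as the shortest sequences absent from a genome; this sketch only reports absent k-mers for one fixed k and ignores reverse complements.

```python
from collections import Counter
from itertools import product


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def absent_kmers(seq: str, k: int) -> list:
    """Return, in sorted order, all DNA k-mers of length k that do not occur in seq."""
    present = count_kmers(seq, k)
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [kmer for kmer in all_kmers if kmer not in present]
```

For real genomes, dedicated k-mer counters are far more memory-efficient than a Python `Counter`; the sketch is only meant to show the definitions.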
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
| | - Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | | | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
| |
|
3
|
Sami A, El-Metwally S, Rashad MZ. MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads. BMC Bioinformatics 2024; 25:61. [PMID: 38321434 PMCID: PMC10848413 DOI: 10.1186/s12859-024-05681-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 01/29/2024] [Indexed: 02/08/2024] Open
Abstract
BACKGROUND The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality as errors become more prevalent. This introduces the need to utilize different approaches to detect and filtrate errors, and data quality assurance is moved from the hardware space to the software preprocessing stages. RESULTS We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms. These models are trained on features extracted through the computation of Term Frequency-Inverse Document Frequency (TF_IDF) values from various datasets such as E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1 and Metriaclima zebra. Notably, Naive Bayes demonstrated robust performance across various datasets, displaying high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, tools like Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed rates above 99%. BFC and RECKONER exceeded 98%, while Fiona had 95.78%. For the Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For the Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also producing reasonable numbers of mapped reads to the reference genome. 
CONCLUSIONS This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.
Affiliation(s)
- Amira Sami
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
| | - Sara El-Metwally
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
- Biomedical Informatics Department, Faculty of Computer Science and Engineering, New Mansoura University, Gamasa, 35712, Egypt
| | - M Z Rashad
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
| |
|
4
|
Wang ZF, Fu L, Yu EP, Zhu WG, Zeng SJ, Cao HL. Chromosome-level genome assembly and demographic history of Euryodendron excelsum in monotypic genus endemic to China. DNA Res 2024; 31:dsad028. [PMID: 38147541 PMCID: PMC10781514 DOI: 10.1093/dnares/dsad028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2023] [Revised: 12/04/2023] [Accepted: 12/22/2023] [Indexed: 12/28/2023] Open
Abstract
Euryodendron excelsum is in a monotypic genus Euryodendron, endemic to China. It has intermediate morphisms in the Pentaphylacaceae or Theaceae families, which make it distinct. Due to anthropogenic disturbance, E. excelsum is currently found in very restricted and fragmented areas with extremely small populations. Although much research and effort has been applied towards its conservation, its long-term survival mechanisms and evolutionary history remain elusive, especially from a genomic aspect. Therefore, using a combination of long/short whole genome sequencing, RNA sequencing reads, and Hi-C data, we assembled and annotated a high-quality genome for E. excelsum. The genome assembly of E. excelsum comprised 1,059,895,887 bp with 99.66% anchored into 23 pseudo-chromosomes and a 99.0% BUSCO completeness. Comparative genomic analysis revealed the expansion of terpenoid and flavonoid secondary metabolite genes, and displayed a tandem and/or proximal duplication framework of these genes. E. excelsum also displayed genes associated with growth, development, and defence adaptation from whole genome duplication. Demographic analysis indicated that its fluctuations in population size and its recent population decline were related to cold climate changes. The E. excelsum genome assembly provides a highly valuable resource for evolutionary and ecological research in the future, aiding its conservation, management, and restoration.
Affiliation(s)
- Zheng-Feng Wang
- Guangdong Provincial Key Laboratory of Applied Botany, Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- South China National Botanical Garden, Guangzhou, Guangdong 510650, China
| | - Lin Fu
- Guangdong Provincial Key Laboratory of Applied Botany, Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- South China National Botanical Garden, Guangzhou, Guangdong 510650, China
| | - En-Ping Yu
- Guangdong Provincial Key Laboratory of Applied Botany, Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- South China National Botanical Garden, Guangzhou, Guangdong 510650, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Wei-Guang Zhu
- Guangdong Provincial Key Laboratory of Applied Botany, Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- South China National Botanical Garden, Guangzhou, Guangdong 510650, China
| | - Song-Jun Zeng
- Guangdong Provincial Key Laboratory of Applied Botany, Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- South China National Botanical Garden, Guangzhou, Guangdong 510650, China
| | - Hong-Lin Cao
- Guangdong Provincial Key Laboratory of Applied Botany, Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, China
- South China National Botanical Garden, Guangzhou, Guangdong 510650, China
| |
|
5
|
Długosz M, Deorowicz S. Illumina reads correction: evaluation and improvements. Sci Rep 2024; 14:2232. [PMID: 38278837 PMCID: PMC11222498 DOI: 10.1038/s41598-024-52386-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Accepted: 01/18/2024] [Indexed: 01/28/2024] Open
Abstract
The paper focuses on the correction of Illumina WGS sequencing reads. We provide an extensive evaluation of the existing correctors. To this end, we measure the impact of correction on variant calling (VC) as well as de novo assembly. The evaluation shows that, in selected cases, read correction improves the quality of VC results. We also examine the algorithms' behaviour when processing Illumina NovaSeq reads, whose quality characteristics differ from those of older sequencers. We show that most of the algorithms are ready to cope with such reads. Finally, we introduce a new version of RECKONER, our read corrector, by optimizing it and equipping it with a new correction strategy. Currently, RECKONER can correct high-coverage human reads in less than 2.5 h, copes with two types of read errors (indels and substitutions), and uses a new correction verification technique based on two lengths of oligomers.
Affiliation(s)
- Maciej Długosz
- Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100, Gliwice, Poland
| | - Sebastian Deorowicz
- Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100, Gliwice, Poland
| |
|
6
|
Li X, Shao M. On de novo Bridging Paired-end RNA-seq Data. ACM-BCB ... ... : THE ... ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE. ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE 2023; 2023:41. [PMID: 38045531 PMCID: PMC10692976 DOI: 10.1145/3584371.3612987] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/05/2023]
Abstract
High-throughput short-read RNA-seq protocols often produce paired-end reads, with the middle portion of the fragments being unsequenced. We explore whether the full-length fragments can be computationally reconstructed from the two sequenced ends in the absence of a reference genome, a problem we refer to here as de novo bridging. Solving this problem provides longer, more informative RNA-seq reads, and benefits downstream RNA-seq analysis such as transcript assembly, expression quantification, and differential splicing analysis. However, de novo bridging is a challenging and complicated task owing to alternative splicing, transcript noise, and sequencing errors. It remains unclear whether the data provide sufficient information for accurate bridging, let alone whether efficient algorithms exist that determine the true bridges. Methods have been proposed to bridge paired-end reads in the presence of a reference genome (called reference-based bridging), but these algorithms are far from scaling to de novo bridging, as the underlying compacted de Bruijn graph (cdBG) used in the latter task often contains millions of vertices and edges. We designed a new truncated Dijkstra's algorithm for this problem, and proposed a novel algorithm that reuses the shortest path tree to avoid running the truncated Dijkstra's algorithm from scratch for all vertices, for a further speedup. These innovative techniques result in scalable algorithms that can bridge all paired-end reads in a cdBG with millions of vertices. Our experiments showed that paired-end RNA-seq reads can be accurately bridged to a large extent. The resulting tool is freely available at https://github.com/Shao-Group/rnabridge-denovo.
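The abstract does not spell out how the truncated Dijkstra's algorithm works, so the following is only a generic illustration of the idea of bounding a shortest-path search, not the authors' method. It assumes a hypothetical adjacency-list graph with non-negative edge weights and a hypothetical `max_dist` cutoff (one might imagine it standing in for a maximum fragment length), and it simply refuses to explore any path longer than that cutoff.

```python
import heapq


def truncated_dijkstra(graph: dict, source, max_dist: float) -> dict:
    """Shortest distances from source, restricted to vertices within max_dist.

    graph maps each vertex to a list of (neighbor, weight) pairs with
    non-negative weights. Paths longer than max_dist are never expanded,
    so the search touches only a neighborhood of the source.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd > max_dist:
                continue  # truncation: do not explore beyond the cutoff
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

The cutoff is what keeps each search local on a graph with millions of vertices; the shortest-path-tree reuse mentioned in the abstract is not shown here.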
Affiliation(s)
- Xiang Li
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA
| | - Mingfu Shao
- Department of Computer Science and Engineering, Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA
| |
|
7
|
Gao Y, Liao HB, Liu TH, Wu JM, Wang ZF, Cao HL. Draft genome and transcriptome of Nepenthes mirabilis, a carnivorous plant in China. BMC Genom Data 2023; 24:21. [PMID: 37060047 PMCID: PMC10103442 DOI: 10.1186/s12863-023-01126-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/11/2022] [Accepted: 04/06/2023] [Indexed: 04/16/2023] Open
Abstract
OBJECTIVES Nepenthes belongs to the monotypic family Nepenthaceae, one of the largest carnivorous plant families. Nepenthes species show impressive adaptive radiation and suffer from being overexploited in nature. Nepenthes mirabilis is the most widely distributed species and the only Nepenthes species that is naturally distributed within China. Herein, we report the genome and transcriptome assemblies of N. mirabilis. The assemblies will be useful resources for comparative genomics, to understand the adaptation and conservation of carnivorous species. DATA DESCRIPTION This work produced ~139.5 Gb of N. mirabilis whole genome sequencing reads using leaf tissues, and ~21.7 Gb and ~27.9 Gb of raw RNA-seq reads for its leaves and flowers, respectively. Transcriptome assembly obtained 339,802 transcripts, in which 79,758 open reading frames (ORFs) were identified. Function analysis indicated that these ORFs were mainly associated with proteolysis and DNA integration. The assembled genome was 691,409,685 bp with 159,555 contigs/scaffolds and an N50 of 10,307 bp. The BUSCO assessment of the assembled genome and transcriptome indicated 91.1% and 93.7% completeness, respectively. A total of 42,961 genes, coding for 45,461 proteins, were predicted in the genome. The predicted genes were annotated using multiple databases, facilitating future functional analyses. This is the first genome report on the Nepenthaceae family.
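The N50 statistic quoted in this abstract can be illustrated with a short sketch. It uses the common definition (the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size); the function name `n50` is hypothetical, and the abstract does not say which tool the authors used to compute the value.

```python
def n50(lengths: list) -> int:
    """Return the N50 of a list of contig/scaffold lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the cumulative length first reaches at least half
    of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")
```

An N50 of 10,307 bp for a 691 Mb assembly therefore means that half of the assembled bases lie in contigs or scaffolds of at least that length.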
Affiliation(s)
- Yuan Gao
- Zhongshan Management Centre of the Natural Protected Area, Zhongshan, China
| | - Hao-Bin Liao
- Zhongshan Management Centre of the Natural Protected Area, Zhongshan, China
| | - Ting-Hong Liu
- Guangdong Provincial Key Laboratory of Applied Botany, Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, China
| | - Jia-Ming Wu
- Zhongshan Management Centre of the Natural Protected Area, Zhongshan, China
| | - Zheng-Feng Wang
- Guangdong Provincial Key Laboratory of Applied Botany, Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, China.
| | - Hong-Lin Cao
- Guangdong Provincial Key Laboratory of Applied Botany, Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, China.
| |
|
8
|
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
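As an illustration of the de Bruijn graph (DBG) strategy named in this abstract, here is a minimal textbook-style sketch, not the procedure of any particular assembler. Nodes are (k-1)-mers and each k-mer in a read contributes an edge from its prefix to its suffix. The helper names `build_de_bruijn` and `extend_unambiguous` are hypothetical, and the sketch ignores sequencing errors, reverse complements, and in-degree checks that real assemblers need.

```python
from collections import defaultdict


def build_de_bruijn(reads: list, k: int) -> dict:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguous(graph: dict, start: str) -> str:
    """Walk forward from start while exactly one successor exists; return the spelled sequence."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:
            break  # stop on cycles instead of looping forever
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig
```

Two overlapping reads such as `ACGTT` and `CGTTA` share k-mers, so their paths merge in the graph and a single walk spells the combined sequence; a branch (a node with two successors) is where the walk stops and where misassemblies can arise.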
|
9
|
Kallenborn F, Cascitti J, Schmidt B. CARE 2.0: reducing false-positive sequencing error corrections using machine learning. BMC Bioinformatics 2022; 23:227. [PMID: 35698033 PMCID: PMC9195321 DOI: 10.1186/s12859-022-04754-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/03/2022] [Accepted: 05/30/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools. RESULTS We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders-of-magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 is able to achieve high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads CARE 2.0 generates only 1.2M false positives (FPs) (and 801.4M true positives (TPs)) at a highly competitive runtime while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data. CONCLUSION False-positive corrections can negatively influence down-stream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, higher-quality datasets are produced which improve k-mer analysis and de-novo assembly in real-world datasets which demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE.
Affiliation(s)
- Felix Kallenborn
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
| | - Julian Cascitti
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
| | - Bertil Schmidt
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
| |
|
10
|
Liu HL, Harris AJ, Wang ZF, Chen HF, Li ZA, Wei X. The genome of the Paleogene relic tree Bretschneidera sinensis: insights into trade-offs in gene family evolution, demographic history, and adaptive SNPs. DNA Res 2022; 29:6523039. [PMID: 35137004 PMCID: PMC8825261 DOI: 10.1093/dnares/dsac003] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 11/10/2021] [Indexed: 11/13/2022] Open
Abstract
Among relic species, genomic information may provide the key to inferring their long-term survival. Therefore, in this study, we investigated the genome of the Paleogene relic tree species, Bretschneidera sinensis, which is a rare endemic species within southeastern Asia. Specifically, we assembled a high-quality genome for B. sinensis using PacBio high-fidelity and high-throughput chromosome conformation capture reads and annotated it with long and short RNA sequencing reads. Using the genome, we then detected a trade-off between active and passive disease defences among the gene families. Gene families involved in salicylic acid and MAPK signalling pathways expanded as active defence mechanisms against disease, but families involved in terpene synthase activity as passive defences contracted. When inferring the long evolutionary history of B. sinensis, we detected population declines corresponding to historical climate change around the Eocene–Oligocene transition and to climatic fluctuations in the Quaternary. Additionally, based on this genome, we identified 388 single nucleotide polymorphisms (SNPs) that were likely under selection, and showed diverse functions in growth and stress responses. Among them, we further found 41 climate-associated SNPs. The genome of B. sinensis and the SNP dataset will be important resources for understanding extinction/diversification processes using comparative genomics in different lineages.
Affiliation(s)
- Hai-Lin Liu
- Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
- Environmental Horticulture Research Institute, Guangdong Academy of Agricultural Sciences, Guangzhou, 510640, China
- Key Laboratory of Ornamental Plant Germplasm Innovation and Utilization, Guangzhou, 510640, China
| | - A J Harris
- Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650, China
- Key Laboratory of Plant Resources Conservation and Sustainable Utilization, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650, China
| | - Zheng-Feng Wang
- Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650, China
- Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou), Guangzhou, 511458, China
- Center of Plant Ecology, Core Botanical Gardens, Chinese Academy of Sciences, Guangzhou, 510650, China
- Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650, China
| | - Hong-Feng Chen
- Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650, China
- Key Laboratory of Plant Resources Conservation and Sustainable Utilization, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650, China
| | - Zhi-An Li
- Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650, China
- Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou), Guangzhou, 511458, China
- Center of Plant Ecology, Core Botanical Gardens, Chinese Academy of Sciences, Guangzhou, 510650, China
- Key Laboratory of Vegetation Restoration and Management of Degraded Ecosystems, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650, China
| | - Xiao Wei
- Guangxi Institute of Botany, Chinese Academy of Sciences, Guilin, 541006, China
|
11
|
Casasa S, Biddle JF, Koutsovoulos GD, Ragsdale EJ. Polyphenism of a Novel Trait Integrated Rapidly Evolving Genes into Ancestrally Plastic Networks. Mol Biol Evol 2021; 38:331-343. [PMID: 32931588 PMCID: PMC7826178 DOI: 10.1093/molbev/msaa235] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Indexed: 12/22/2022]
Abstract
Developmental polyphenism, the ability to switch between phenotypes in response to environmental variation, involves the alternating activation of environmentally sensitive genes. Consequently, understanding how a polyphenic response evolves requires a comparative analysis of the components that make up environmentally sensitive networks. Here, we inferred coexpression networks for a morphological polyphenism, the feeding-structure dimorphism of the nematode Pristionchus pacificus. In this species, individuals produce alternative forms of a novel trait (moveable teeth, which in one morph enable predatory feeding) in response to environmental cues. To identify the origins of polyphenism network components, we independently inferred coexpression modules for more conserved transcriptional responses, including in an ancestrally nonpolyphenic nematode species. Further, through genome-wide analyses of these components across the nematode family (Diplogastridae) in which the polyphenism arose, we reconstructed how network components have changed. To achieve this, we assembled and resolved the phylogenetic context for five genomes of species representing the breadth of Diplogastridae and a hypothesized outgroup. We found that gene networks instructing alternative forms arose from ancestral plastic responses to environment, specifically starvation-induced metabolism and the formation of a conserved diapause (dauer) stage. Moreover, loci from rapidly evolving gene families were integrated into these networks with higher connectivity than throughout the rest of the P. pacificus transcriptome. In summary, we show that the modular regulatory outputs of a polyphenic response evolved through the integration of conserved plastic responses into networks with genes of high evolutionary turnover.
Affiliation(s)
- Sofia Casasa
- Department of Biology, Indiana University, Bloomington, Bloomington, IN
- Joseph F Biddle
- Department of Biology, Indiana University, Bloomington, Bloomington, IN
- Erik J Ragsdale
- Department of Biology, Indiana University, Bloomington, Bloomington, IN
|
12
|
Mitchell K, Brito JJ, Mandric I, Wu Q, Knyazev S, Chang S, Martin LS, Karlsberg A, Gerasimov E, Littman R, Hill BL, Wu NC, Yang HT, Hsieh K, Chen L, Littman E, Shabani T, Enik G, Yao D, Sun R, Schroeder J, Eskin E, Zelikovsky A, Skums P, Pop M, Mangul S. Benchmarking of computational error-correction methods for next-generation sequencing data. Genome Biol 2020; 21:71. [PMID: 32183840 PMCID: PMC7079412 DOI: 10.1186/s13059-020-01988-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 05/18/2019] [Accepted: 03/06/2020] [Indexed: 12/16/2022]
Abstract
BACKGROUND Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown. RESULTS In this paper, we evaluate the ability of error correction algorithms to fix errors across different types of datasets that contain various levels of heterogeneity. We highlight the advantages and limitations of computational error correction techniques across different domains of biology, including immunogenomics and virology. To demonstrate the efficacy of our technique, we apply the UMI-based high-fidelity sequencing protocol to eliminate sequencing errors from both simulated data and the raw reads. We then perform a realistic evaluation of error-correction methods. CONCLUSIONS In terms of accuracy, we find that method performance varies substantially across different types of datasets with no single method performing best on all types of examined data. Finally, we also identify the techniques that offer a good balance between precision and sensitivity.
Affiliation(s)
- Keith Mitchell
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Jaqueline J Brito
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
- Igor Mandric
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- Qiaozhen Wu
- Department of Mathematics, University of California Los Angeles, 520 Portola Plaza, Los Angeles, CA, 90095, USA
- Sergey Knyazev
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- Sei Chang
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Lana S Martin
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
- Aaron Karlsberg
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
- Ekaterina Gerasimov
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- Russell Littman
- UCLA Bioinformatics, 621 Charles E Young Dr S, Los Angeles, CA, 90024, USA
- Brian L Hill
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Nicholas C Wu
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Harry Taegyun Yang
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Kevin Hsieh
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Linus Chen
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Eli Littman
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Taylor Shabani
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- German Enik
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Douglas Yao
- Department of Molecular, Cell, and Developmental Biology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
- Ren Sun
- Department of Molecular and Medical Pharmacology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
- Jan Schroeder
- Epigenetics & Reprogramming Laboratory, Monash University, 15 Innovation Walk, Melbourne, VIC, 3800, Australia
- Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Alex Zelikovsky
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, Russia, 119991
- Pavel Skums
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- Mihai Pop
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20742, USA
- Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
|
13
|
Heydari M, Miclotte G, Van de Peer Y, Fostier J. Illumina error correction near highly repetitive DNA regions improves de novo genome assembly. BMC Bioinformatics 2019; 20:298. [PMID: 31159722 PMCID: PMC6545690 DOI: 10.1186/s12859-019-2906-2] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 01/15/2019] [Accepted: 05/17/2019] [Indexed: 11/10/2022]
Abstract
BACKGROUND Several standalone error correction tools have been proposed to correct sequencing errors in Illumina data in order to facilitate de novo genome assembly. However, in a recent survey, we showed that state-of-the-art assemblers often did not benefit from this pre-correction step. We found that many error correction tools introduce new errors in reads that overlap highly repetitive DNA regions such as low-complexity patterns or short homopolymers, ultimately leading to a more fragmented assembly. RESULTS We propose BrownieCorrector, an error correction tool for Illumina sequencing data that focuses on the correction of only those reads that overlap short DNA patterns that are highly repetitive in the genome. BrownieCorrector extracts all reads that contain such a pattern and clusters them into different groups using a community detection algorithm that takes into account both the sequence similarity between overlapping reads and their respective paired-end reads. Each cluster holds reads that originate from the same genomic region and hence each cluster can be corrected individually, thus providing a consistent correction for all reads within that cluster. CONCLUSIONS BrownieCorrector is benchmarked using six real Illumina datasets for different eukaryotic genomes. The prior use of BrownieCorrector improves assembly results over the use of uncorrected reads in all cases. In comparison with other error correction tools, BrownieCorrector leads to the best assembly results in most cases even though less than 2% of the reads within a dataset are corrected. Additionally, we investigate the impact of error correction on hybrid assembly where the corrected Illumina reads are supplemented with PacBio data. Our results confirm that BrownieCorrector improves the quality of hybrid genome assembly as well. BrownieCorrector is written in standard C++11 and released under GPL license. BrownieCorrector relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at https://github.com/biointec/browniecorrector. ELECTRONIC SUPPLEMENTARY MATERIAL The online version of this article (10.1186/s12859-019-2906-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Mahdi Heydari
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium; Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
- Giles Miclotte
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium; Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
- Yves Van de Peer
- Bioinformatics Institute Ghent, Ghent, B-9052, Belgium; Center for Plant Systems Biology, VIB, Ghent, B-9052, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, B-9052, Belgium; Department of Genetics, Genome Research Institute, University of Pretoria, Pretoria, South Africa
- Jan Fostier
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium; Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
|
14
|
Ragsdale EJ, Koutsovoulos G, Biddle JF. A draft genome for a species of Halicephalobus (Panagrolaimidae). J Nematol 2019; 51:1-4. [PMID: 31814372 PMCID: PMC6909384 DOI: 10.21307/jofnem-2019-068] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/19/2019] [Indexed: 12/19/2022]
Abstract
Halicephalobus is a clade of small, exclusively parthenogenic nematodes that have sometimes colonized remarkable habitats. Given their phylogenetic closeness to other parthenogenic panagrolaimid species with which they likely share a sexually reproducing ancestor, Halicephalobus species provide a point of comparison for parallelisms in the evolution of asexuality. Here, we present a draft genome of a putatively new species of Halicephalobus isolated from termites in Japan.
Affiliation(s)
- Erik J Ragsdale
- Department of Biology, Indiana University, 915 E. 3rd St., Bloomington, IN, 47405
- Joseph F Biddle
- Department of Biology, Indiana University, 915 E. 3rd St., Bloomington, IN, 47405
|