1. Novák P, Hoštáková N, Neumann P, Macas J. DANTE and DANTE_LTR: lineage-centric annotation pipelines for long terminal repeat retrotransposons in plant genomes. NAR Genom Bioinform 2024; 6:lqae113. [PMID: 39211332 PMCID: PMC11358816 DOI: 10.1093/nargab/lqae113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2024] [Revised: 07/18/2024] [Accepted: 08/15/2024] [Indexed: 09/04/2024] [Open Access]
Abstract
Long terminal repeat (LTR) retrotransposons constitute a predominant class of repetitive DNA elements in most plant genomes. With the increasing number of sequenced plant genomes, there is an ongoing demand for computational tools facilitating efficient annotation and classification of LTR retrotransposons in plant genome assemblies. Herein, we introduce DANTE, a computational pipeline for Domain-based ANnotation of Transposable Elements, designed for sensitive detection of these elements via their conserved protein domain sequences. The identified protein domains are subsequently inputted into the DANTE_LTR pipeline to annotate complete element sequences by detecting their structural features, such as LTRs, in adjacent genomic regions. Leveraging domain sequences allows for precise classification of elements into phylogenetic lineages, offering a more granular annotation compared with coarser conventional superfamily-based classification methods. The efficiency and accuracy of this approach were evidenced via annotation of LTR retrotransposons in 93 plant genomes. Results were benchmarked against several established pipelines, showing that DANTE_LTR is capable of identifying significantly more intact LTR retrotransposons. DANTE and DANTE_LTR are provided as user-friendly Galaxy tools accessible via a public server (https://repeatexplorer-elixir.cerit-sc.cz), installable on local Galaxy instances from the Galaxy tool shed or executable from the command line.
Affiliation(s)
- Petr Novák
  - Biology Centre, Czech Academy of Sciences, Branišovská 31, České Budějovice, Czech Republic
- Nina Hoštáková
  - Biology Centre, Czech Academy of Sciences, Branišovská 31, České Budějovice, Czech Republic
- Pavel Neumann
  - Biology Centre, Czech Academy of Sciences, Branišovská 31, České Budějovice, Czech Republic
- Jiří Macas
  - Biology Centre, Czech Academy of Sciences, Branišovská 31, České Budějovice, Czech Republic
2. Krause GR, Shands W, Wheeler TJ. Sensitive and error-tolerant annotation of protein-coding DNA with BATH. Bioinformatics Advances 2024; 4:vbae088. [PMID: 38966592 PMCID: PMC11223822 DOI: 10.1093/bioadv/vbae088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Revised: 05/03/2024] [Accepted: 06/10/2024] [Indexed: 07/06/2024]
Abstract
Summary: We present BATH, a tool for highly sensitive annotation of protein-coding DNA based on direct alignment of that DNA to a database of protein sequences or profile hidden Markov models (pHMMs). BATH is built on top of the HMMER3 code base, and simplifies the annotation workflow for pHMM-based translated sequence annotation by providing a straightforward input interface and easy-to-interpret output. BATH also introduces novel frameshift-aware algorithms to detect frameshift-inducing nucleotide insertions and deletions (indels). BATH matches the accuracy of HMMER3 for annotation of sequences containing no errors, and produces superior accuracy to all tested tools for annotation of sequences containing nucleotide indels. These results suggest that BATH should be used when high annotation sensitivity is required, particularly when frameshift errors are expected to interrupt protein-coding regions, as is true with long-read sequencing data and in the context of pseudogenes.
Availability and implementation: The software is available at https://github.com/TravisWheelerLab/BATH.
Affiliation(s)
- Genevieve R Krause
  - R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ 85721, United States
  - Department of Computer Science, University of Montana, Missoula, MT 59812, United States
- Walt Shands
  - Department of Computer Science, University of Montana, Missoula, MT 59812, United States
  - Genomics Institute, UC Santa Cruz, Santa Cruz, CA 95060, United States
- Travis J Wheeler
  - R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ 85721, United States
  - Department of Computer Science, University of Montana, Missoula, MT 59812, United States
3. Krause GR, Shands W, Wheeler TJ. Sensitive and error-tolerant annotation of protein-coding DNA with BATH. bioRxiv 2024:2023.12.31.573773. [PMID: 38260252 PMCID: PMC10802276 DOI: 10.1101/2023.12.31.573773] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 01/24/2024]
Abstract
We present BATH, a tool for highly sensitive annotation of protein-coding DNA based on direct alignment of that DNA to a database of protein sequences or profile hidden Markov models (pHMMs). BATH is built on top of the HMMER3 code base, and simplifies the annotation workflow for pHMM-based annotation by providing a straightforward input interface and easy-to-interpret output. BATH also introduces novel frameshift-aware algorithms to detect frameshift-inducing nucleotide insertions and deletions (indels). BATH matches the accuracy of HMMER3 for annotation of sequences containing no errors, and produces superior accuracy to all tested tools for annotation of sequences containing nucleotide indels. These results suggest that BATH should be used when high annotation sensitivity is required, particularly when frameshift errors are expected to interrupt protein-coding regions, as is true with long-read sequencing data and in the context of pseudogenes.
Affiliation(s)
- Genevieve R Krause
  - R. Ken Coit College of Pharmacy, University of Arizona, Tucson, Arizona, USA
  - Department of Computer Science, University of Montana, Missoula, Montana, USA
- Walt Shands
  - Department of Computer Science, University of Montana, Missoula, Montana, USA
  - UC Santa Cruz Genomics Institute, Santa Cruz, California, USA
- Travis J Wheeler
  - R. Ken Coit College of Pharmacy, University of Arizona, Tucson, Arizona, USA
  - Department of Computer Science, University of Montana, Missoula, Montana, USA
4. Dabbaghie F, Srikakulam SK, Marschall T, Kalinina OV. PanPA: generation and alignment of panproteome graphs. Bioinformatics Advances 2023; 3:vbad167. [PMID: 38145107 PMCID: PMC10748787 DOI: 10.1093/bioadv/vbad167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Revised: 11/13/2023] [Accepted: 11/23/2023] [Indexed: 12/26/2023]
Abstract
Motivation: Compared to eukaryotes, prokaryote genomes are more diverse through different mechanisms, including a higher mutation rate and horizontal gene transfer. Therefore, using a linear representative reference can cause a reference bias. Graph-based pangenome methods have been developed to tackle this problem. However, comparisons in DNA space are still challenging due to this high diversity. In contrast, amino acid sequences have higher similarity due to evolutionary constraints, whereby a single amino acid may be encoded by several synonymous codons. Coding regions cover the majority of the genome in prokaryotes. Thus, panproteomes present an attractive alternative leveraging the higher sequence similarity while not losing much of the genome in non-coding regions.
Results: We present PanPA, a method that takes a set of multiple sequence alignments of protein sequences, indexes them, and builds a graph for each multiple sequence alignment. In the querying step, it can align DNA or amino acid sequences back to these graphs. We first showcase that PanPA generates correct alignments on a panproteome from 1350 Escherichia coli. To demonstrate that panproteomes allow comparisons at longer phylogenetic distances, we compare DNA and protein alignments from 1073 Salmonella enterica assemblies against the E. coli reference genome, pangenome, and panproteome using BWA, GraphAligner, and PanPA, respectively, with PanPA aligning around 22% more sequences. We also aligned a DNA short-read whole-genome sequencing (WGS) sample from S. enterica against the E. coli reference with BWA and the panproteome with PanPA, where PanPA was able to find alignments for 68% of the reads compared to 5% with BWA.
Availability and implementation: PanPA is available at https://github.com/fawaz-dabbaghieh/PanPA.
Affiliation(s)
- Fawaz Dabbaghie
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany
  - Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Center for Infection Research (HZI), Saarbrücken, Germany
- Sanjay K Srikakulam
  - Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Center for Infection Research (HZI), Saarbrücken, Germany
  - Graduate School of Computer Science, Saarland University, 66123 Saarbrücken, Germany
  - Interdisciplinary Graduate School of Natural Product Research, Saarland University, 66123 Saarbrücken, Germany
- Tobias Marschall
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany
- Olga V Kalinina
  - Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Center for Infection Research (HZI), Saarbrücken, Germany
  - Drug Bioinformatics, Medical Faculty, Saarland University, 66421 Homburg, Germany
  - Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
5. Yao Y, Frith MC. Improved DNA-Versus-Protein Homology Search for Protein Fossils. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1691-1699. [PMID: 35617174 DOI: 10.1109/tcbb.2022.3177855] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/15/2023]
Abstract
Protein fossils, i.e., noncoding DNA descended from coding DNA, arise frequently from transposable elements (TEs), decayed genes, and viral integrations. They can reveal, and mislead about, evolutionary history and relationships. They have been detected by comparing DNA to protein sequences, but current methods are not optimized for this task. We describe a powerful DNA-protein homology search method. We use a 64×21 substitution matrix, which is fitted to sequence data, automatically learning the genetic code. We detect subtly homologous regions by considering alternative possible alignments between them, and calculate significance (probability of occurring by chance between random sequences). Our method detects TE protein fossils much more sensitively than blastx, and faster. Of the ~7 major categories of eukaryotic TE, three were long thought absent in mammals: we find two of them in the human genome, polinton and DIRS/Ngaro. This method increases our power to find ancient fossils, and perhaps to detect non-standard genetic codes. The alternative-alignments and significance paradigm is not specific to DNA-protein comparison, and could benefit homology search generally. This is an extended version of a conference paper (Yao & Frith, 2021).
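The shape of a 64×21 codon-versus-amino-acid matrix can be illustrated with a minimal Python sketch. This is not the authors' fitted matrix: the abstract says their scores are learned from sequence data, whereas the toy values here (`match=4`, `mismatch=-2`) are hard-coded from the standard genetic code. The names `toy_matrix` and `gapless_score` are hypothetical and exist only for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]  # 64 rows
CODON_TO_AA = dict(zip(CODONS, AMINO))
SYMBOLS = "ACDEFGHIKLMNPQRSTVWY*"  # 20 amino acids plus stop = 21 columns


def toy_matrix(match=4, mismatch=-2):
    """Build a 64x21 matrix with assumed toy scores (not fitted to data)."""
    return [
        [match if CODON_TO_AA[codon] == symbol else mismatch for symbol in SYMBOLS]
        for codon in CODONS
    ]


def gapless_score(matrix, dna, protein):
    """Score an in-frame, gapless alignment of DNA codons to protein residues."""
    row = {codon: i for i, codon in enumerate(CODONS)}
    col = {symbol: j for j, symbol in enumerate(SYMBOLS)}
    total = 0
    for k, residue in enumerate(protein):
        codon = dna[3 * k : 3 * k + 3]
        total += matrix[row[codon]][col[residue]]
    return total
```

A fitted matrix would replace the two toy values with one learned score per (codon, symbol) cell, which is how such a matrix could reflect codon usage or a non-standard genetic code.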
6. Bağcı C, Albrecht B, Huson DH. MAIRA: Protein-based Analysis of MinION Reads on a Laptop. Methods Mol Biol 2023; 2649:223-234. [PMID: 37258865 DOI: 10.1007/978-1-0716-3072-3_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
Third-generation sequencing technologies are being increasingly used in microbiome research and this has given rise to new challenges in computational microbiome analysis. Oxford Nanopore's MinION is a portable sequencer that streams data that can be basecalled on-the-fly. Here we give an introduction to the MAIRA software, which is designed to analyze MinION sequencing reads from a microbiome sample, as they are produced in real-time, on a laptop. The software processes reads in batches and updates the presented analysis after each batch. There are two analysis steps: First, protein alignments are calculated to determine which genera might be present in a sample. When strong evidence for a genus is found, then, in a second step, a more detailed analysis is performed by aligning the reads against the proteins of all species in the detected genus. The program presents a detailed analysis of species, antibiotic resistance genes, and virulence factors.
Affiliation(s)
- Caner Bağcı
  - Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany
  - International Max Planck Research School "From Molecules to Organisms", Max Planck Institute for Developmental Biology, Tübingen, Germany
- Daniel H Huson
  - Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany
7. Dang C, Wu Z, Zhang M, Li X, Sun Y, Wu R, Zheng Y, Xia Y. Microorganisms as bio-filters to mitigate greenhouse gas emissions from high-altitude permafrost revealed by nanopore-based metagenomics. iMeta 2022; 1:e24. [PMID: 38868568 PMCID: PMC10989947 DOI: 10.1002/imt2.24] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/04/2021] [Revised: 01/11/2022] [Accepted: 01/18/2022] [Indexed: 06/14/2024]
Abstract
The distinct climatic and geographical conditions make high-altitude permafrost on the Tibetan Plateau suffer more severe degradation than polar permafrost. However, the microbial responses associated with greenhouse gas production in thawing permafrost remain obscure. Here we applied nanopore-based long-read metagenomics and high-throughput RNA-seq to explore microbial functional activities within the freeze-thaw cycle in the active layers of permafrost at the Qilian Mountain. A bioinformatic framework was established to facilitate phylogenetic and functional annotation of the unassembled nanopore metagenome. By deploying this strategy, 42% more genera could be detected and 58% more genes were annotated to the nitrogen and methane cycles. With the aid of this improved resolution, we observed vigorous aerobic methane oxidation by Methylomonas, which could serve as a bio-filter to mitigate CH4 emissions from permafrost. This filtering effect could be further corroborated by both on-site gas-phase measurements and incubation experiments, which showed that CO2 was the major form of carbon released from permafrost. Despite the increased transcriptional activities of aceticlastic methanogenesis pathways in the thawed permafrost active layer, CH4 generated during the thawing process could be effectively consumed by the microbiome. Additionally, the nitrogen metabolism in permafrost tends to be a closed cycle, and active N2O consumption by the topsoil community was detected in the near-surface gas phase. Our findings reveal that although the increased thawed state facilitated heterotrophic nitrogen and methane metabolism, effective microbial methane oxidation in the active layer could serve as a bio-filter to relieve the overall warming potential of greenhouse gases emitted from thawed permafrost.
Affiliation(s)
- Chenyuan Dang
  - School of Environmental Science and Engineering, College of Engineering, Southern University of Science and Technology, Shenzhen, China
  - Laboratory of High-Resolution Mass Spectrometry Technologies, Dalian Institute of Chemical Physics, Chinese Academy of Sciences (CAS), Dalian, China
- Ziqi Wu
  - School of Environmental Science and Engineering, College of Engineering, Southern University of Science and Technology, Shenzhen, China
- Miao Zhang
  - School of Environmental Science and Engineering, College of Engineering, Southern University of Science and Technology, Shenzhen, China
- Xiang Li
  - School of Environmental Science and Engineering, College of Engineering, Southern University of Science and Technology, Shenzhen, China
  - Shenzhen Key Laboratory of Marine Archaea Geo-Omics, Department of Ocean Science and Engineering, Southern University of Science and Technology, Shenzhen, China
- Yuqin Sun
  - Shenzhen Key Laboratory of Marine Archaea Geo-Omics, Department of Ocean Science and Engineering, Southern University of Science and Technology, Shenzhen, China
  - State Environmental Protection Key Laboratory of Integrated Surface Water-Groundwater Pollution Control, School of Environmental Science and Engineering, Southern University of Science and Technology, Shenzhen, China
- Ren'an Wu
  - Laboratory of High-Resolution Mass Spectrometry Technologies, Dalian Institute of Chemical Physics, Chinese Academy of Sciences (CAS), Dalian, China
- Yan Zheng
  - Shenzhen Key Laboratory of Marine Archaea Geo-Omics, Department of Ocean Science and Engineering, Southern University of Science and Technology, Shenzhen, China
  - State Environmental Protection Key Laboratory of Integrated Surface Water-Groundwater Pollution Control, School of Environmental Science and Engineering, Southern University of Science and Technology, Shenzhen, China
- Yu Xia
  - School of Environmental Science and Engineering, College of Engineering, Southern University of Science and Technology, Shenzhen, China
  - Shenzhen Key Laboratory of Marine Archaea Geo-Omics, Department of Ocean Science and Engineering, Southern University of Science and Technology, Shenzhen, China
8. Frith MC. Paleozoic Protein Fossils Illuminate the Evolution of Vertebrate Genomes and Transposable Elements. Mol Biol Evol 2022; 39:6555113. [PMID: 35348724 PMCID: PMC9004415 DOI: 10.1093/molbev/msac068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] [Open Access]
Abstract
Genomes hold a treasure trove of protein fossils: fragments of formerly protein-coding DNA, which mainly come from transposable elements (TEs) or host genes. These fossils reveal ancient evolution of TEs and genomes, and many fossils have been exapted to perform diverse functions important for the host's fitness. However, old and highly degraded fossils are hard to identify, standard methods (e.g. BLAST) are not optimized for this task, and few Paleozoic protein fossils have been found. Here, a recently optimized method is used to find protein fossils in vertebrate genomes. It finds Paleozoic fossils predating the amphibian/amniote divergence from most major TE categories, including virus-related Polinton and Gypsy elements. It finds 10 fossils in the human genome (8 from TEs and 2 from host genes) that predate the last common ancestor of all jawed vertebrates, probably from the Ordovician period. It also finds types of transposon and retrotransposon not found in human before. These fossils have extreme sequence conservation, indicating exaptation: some have evidence of gene-regulatory function, and they tend to lie nearest to developmental genes. Some ancient fossils suggest "genome tectonics", where two fragments of one TE have drifted apart by up to megabases, possibly explaining gene deserts and large introns. This paints a picture of great TE diversity in our aquatic ancestors, with patchy TE inheritance by later vertebrates, producing new genes and regulatory elements on the way. Host-gene fossils too have contributed anciently conserved DNA segments. This paves the way to further studies of ancient protein fossils.
Affiliation(s)
- Martin C Frith
  - Artificial Intelligence Research Center, AIST, Tokyo, Japan
  - Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan
  - Computational Bio Big-Data Open Innovation Laboratory, AIST, Tokyo, Japan
9. Shrestha AMS, Guiao JEB, Santiago KCR. Assembly-free rapid differential gene expression analysis in non-model organisms using DNA-protein alignment. BMC Genomics 2022; 23:97. [PMID: 35120462 PMCID: PMC8815227 DOI: 10.1186/s12864-021-08278-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2021] [Accepted: 12/22/2021] [Indexed: 11/16/2022] [Open Access]
Abstract
Background: RNA-seq is being increasingly adopted for gene expression studies in a panoply of non-model organisms, with applications spanning the fields of agriculture, aquaculture, ecology, and environment. For organisms that lack a well-annotated reference genome or transcriptome, a conventional RNA-seq data analysis workflow requires constructing a de novo transcriptome assembly and annotating it against a high-confidence protein database. The assembly serves as a reference for read mapping, and the annotation is necessary for functional analysis of genes found to be differentially expressed. However, assembly is computationally expensive. It is also prone to errors that impact expression analysis, especially since sequencing depth is typically much lower for expression studies than for transcript discovery.
Results: We propose a shortcut, in which we obtain counts for differential expression analysis by directly aligning RNA-seq reads to the high-confidence proteome that would have been otherwise used for annotation. By avoiding assembly, we drastically cut down computational costs: the running time on a typical dataset improves from the order of tens of hours to under half an hour, and the memory requirement is reduced from the order of tens of Gbytes to tens of Mbytes. We show through experiments on simulated and real data that our pipeline not only reduces computational costs, but has higher sensitivity and precision than a typical assembly-based pipeline. A Snakemake implementation of our workflow is available at: https://bitbucket.org/project_samar/samar.
Conclusions: The flip side of RNA-seq becoming accessible to even modestly resourced labs has been that the time, labor, and infrastructure cost of bioinformatics analysis has become a bottleneck. Assembly is one such resource-hungry process, and we show here that it can be avoided for quick and easy, yet more sensitive and precise, differential gene expression analysis in non-model organisms.
Affiliation(s)
- Anish M S Shrestha
  - Bioinformatics Lab, Advanced Research Institute for Informatics, Computing, and Networking (AdRIC), De La Salle University, Manila, Philippines
  - Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines
- Joyce Emlyn B Guiao
  - Bioinformatics Lab, Advanced Research Institute for Informatics, Computing, and Networking (AdRIC), De La Salle University, Manila, Philippines
  - Department of Mathematics and Statistics, College of Science, De La Salle University, Manila, Philippines
- Kyle Christian R Santiago
  - Bioinformatics Lab, Advanced Research Institute for Informatics, Computing, and Networking (AdRIC), De La Salle University, Manila, Philippines
  - Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines
10. Improved Large-Scale Homology Search by Two-Step Seed Search Using Multiple Reduced Amino Acid Alphabets. Genes (Basel) 2021; 12:genes12091455. [PMID: 34573438 PMCID: PMC8469100 DOI: 10.3390/genes12091455] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/19/2021] [Revised: 09/17/2021] [Accepted: 09/18/2021] [Indexed: 12/02/2022] [Open Access]
Abstract
Metagenomic analysis, a technique used to comprehensively analyze microorganisms present in the environment, requires performing high-precision homology searches on large amounts of sequencing data, the size of which has increased dramatically with the development of next-generation sequencing. NCBI BLAST is the most widely used software for performing homology searches, but its speed is insufficient for the throughput of current DNA sequencers. In this paper, we propose a new, high-performance homology search algorithm that employs a two-step seed search strategy using multiple reduced amino acid alphabets to identify highly similar subsequences. Additionally, we evaluated the validity of the proposed method against several existing tools. Our method was faster than any other existing program for ≤120,000 queries, while DIAMOND, an existing tool, was the fastest method for >120,000 queries.
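The idea of filtering seeds through more than one reduced amino acid alphabet can be sketched in Python as follows. This is an illustration only, not the algorithm from the paper: the residue groupings in `COARSE` and `FINE`, the seed length `k`, and the function names are all assumptions made for the example. Step 1 looks up k-mer matches under a coarse alphabet, and step 2 keeps only the candidates that still match under a finer one.

```python
from collections import defaultdict

# Hypothetical residue groupings, chosen for illustration only.
COARSE = ["AGPST", "ILMV", "FWY", "DENQ", "HKR", "C"]
FINE = ["AG", "ST", "P", "IV", "LM", "FY", "W", "DE", "NQ", "KR", "H", "C"]


def make_encoder(groups):
    """Return a function that rewrites a protein in a reduced alphabet."""
    table = {aa: group[0] for group in groups for aa in group}
    return lambda seq: "".join(table[aa] for aa in seq)


encode_coarse = make_encoder(COARSE)
encode_fine = make_encoder(FINE)


def two_step_seeds(query, subject, k=4):
    """Return (query_pos, subject_pos) seeds that pass both alphabets."""
    coarse_subject = encode_coarse(subject)
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[coarse_subject[j : j + k]].append(j)

    coarse_query = encode_coarse(query)
    fine_query, fine_subject = encode_fine(query), encode_fine(subject)
    hits = []
    for i in range(len(query) - k + 1):
        # Step 1: candidate positions from the coarse-alphabet index.
        for j in index.get(coarse_query[i : i + k], []):
            # Step 2: keep candidates that also match in the finer alphabet.
            if fine_query[i : i + k] == fine_subject[j : j + k]:
                hits.append((i, j))
    return hits
```

In this sketch the coarse step trades specificity for sensitivity, and the finer step discards candidates before any costlier alignment would run.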
11. McHugh AJ, Yap M, Crispie F, Feehily C, Hill C, Cotter PD. Microbiome-based environmental monitoring of a dairy processing facility highlights the challenges associated with low microbial-load samples. NPJ Sci Food 2021; 5:4. [PMID: 33589631 PMCID: PMC7884712 DOI: 10.1038/s41538-021-00087-2] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 09/18/2020] [Accepted: 01/13/2021] [Indexed: 01/01/2023] [Open Access]
Abstract
Efficient and accurate identification of microorganisms throughout the food chain can potentially allow the identification of sources of contamination and the timely implementation of control measures. High-throughput DNA sequencing represents a potential means through which microbial monitoring can be enhanced. While Illumina sequencing platforms are most typically used, newer portable platforms, such as the Oxford Nanopore Technologies (ONT) MinION, offer the potential for rapid analysis of food chain microbiomes. Initial assessment of the ability of rapid MinION-based sequencing to identify microbes within a simple mock metagenomic mixture is performed. Subsequently, we compare the performance of both ONT and Illumina sequencing for environmental monitoring of an active food processing facility. Overall, ONT MinION sequencing provides accurate classification to species level, comparable to Illumina-derived outputs. However, while the MinION-based approach offers easy library preparation and portability, the high concentration of DNA required is a limiting factor.
Affiliation(s)
- Aoife J McHugh
  - Food Bioscience Department, Teagasc Food Research Centre, Cork, Ireland
  - School of Microbiology, University College Cork, Cork, Ireland
- Min Yap
  - Food Bioscience Department, Teagasc Food Research Centre, Cork, Ireland
  - School of Microbiology, University College Cork, Cork, Ireland
- Fiona Crispie
  - Food Bioscience Department, Teagasc Food Research Centre, Cork, Ireland
  - APC Microbiome Ireland, Cork, Ireland
- Conor Feehily
  - Food Bioscience Department, Teagasc Food Research Centre, Cork, Ireland
  - APC Microbiome Ireland, Cork, Ireland
- Colin Hill
  - School of Microbiology, University College Cork, Cork, Ireland
  - APC Microbiome Ireland, Cork, Ireland
- Paul D Cotter
  - Food Bioscience Department, Teagasc Food Research Centre, Cork, Ireland
  - APC Microbiome Ireland, Cork, Ireland
12. Gwak HJ, Lee SJ, Rho M. Application of computational approaches to analyze metagenomic data. J Microbiol 2021; 59:233-241. [DOI: 10.1007/s12275-021-0632-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/02/2020] [Revised: 01/18/2021] [Accepted: 01/19/2021] [Indexed: 01/04/2023]
13. Salloum T, Moussa R, Rahy R, Al Deek J, Khalifeh I, El Hajj R, Hall N, Hirt RP, Tokajian S. Expanded genome-wide comparisons give novel insights into population structure and genetic heterogeneity of Leishmania tropica complex. PLoS Negl Trop Dis 2020; 14:e0008684. [PMID: 32946436 PMCID: PMC7526921 DOI: 10.1371/journal.pntd.0008684] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 01/30/2020] [Revised: 09/30/2020] [Accepted: 08/06/2020] [Indexed: 12/18/2022] [Open Access]
Abstract
Leishmania tropica is one of the main causative agents of cutaneous leishmaniasis (CL). Population structures of L. tropica appear to be genetically highly diverse. However, the relationships among L. tropica strain genomic diversity, protein-coding gene evolution and biogeography are still poorly understood. In this study, we sequenced the genomes of three new clinical L. tropica isolates, two derived from a recent outbreak of CL in camps hosting Syrian refugees in Lebanon and one historical isolate from Azerbaijan, to further refine comparative genome analyses. In silico multilocus microsatellite typing (MLMT) was performed to integrate the current diversity of genome sequence data into the wider available MLMT genetic population framework. Single nucleotide polymorphisms (SNPs), gene copy number variations (CNVs) and chromosome ploidy were investigated across the 18 available L. tropica genomes, with a main focus on protein-coding genes. MLMT divided the strains into three populations that broadly correlated with their geographical distribution but not with the populations defined by SNPs. Unique SNP profiles divided the 18 strains into five populations based on principal component analysis. Gene ontology enrichment analysis of the protein-coding genes with population-specific SNP profiles revealed various biological processes, including iron acquisition, sterol synthesis and drug resistance. This study further highlights the complex links between the important genomic heterogeneity of L. tropica and the parasite's broad geographic distribution. Unique sequence features in protein-coding genes identified in distinct populations reveal potential novel markers that could be exploited for the development of more accurate typing schemes to further improve our knowledge of the evolution and epidemiology of the parasite, and highlight protein variants of potential functional importance underlying L. tropica-specific biology.
Affiliation(s)
- Tamara Salloum
  - Department of Natural Sciences, School of Arts and Sciences, Lebanese American University, Byblos, Lebanon
- Rim Moussa
  - Department of Natural Sciences, School of Arts and Sciences, Lebanese American University, Byblos, Lebanon
- Ryan Rahy
  - Department of Natural Sciences, School of Arts and Sciences, Lebanese American University, Byblos, Lebanon
- Jospin Al Deek
  - Department of Natural Sciences, School of Arts and Sciences, Lebanese American University, Byblos, Lebanon
- Ibrahim Khalifeh
  - Department of Pathology and Laboratory Medicine, American University of Beirut, Beirut, Lebanon
- Rana El Hajj
  - Department of Pathology and Laboratory Medicine, American University of Beirut, Beirut, Lebanon
- Neil Hall
  - Earlham Institute, Norwich Research Park, University of East Anglia, Norwich, United Kingdom
- Robert P. Hirt
  - Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, United Kingdom
- Sima Tokajian
  - Department of Natural Sciences, School of Arts and Sciences, Lebanese American University, Byblos, Lebanon
14
|
Albrecht B, Bağcı C, Huson DH. MAIRA- real-time taxonomic and functional analysis of long reads on a laptop. BMC Bioinformatics 2020; 21:390. [PMID: 32938391 PMCID: PMC7495841 DOI: 10.1186/s12859-020-03684-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/19/2022] Open
Abstract
Background Advances in mobile sequencing devices and laptop performance make metagenomic sequencing and analysis in the field a technologically feasible prospect. However, metagenomic analysis pipelines are usually designed to run on servers and in the cloud. Results MAIRA is a new standalone program for interactive taxonomic and functional analysis of long read metagenomic sequencing data on a laptop, without requiring external resources. The program performs fast, online, genus-level analysis, and on-demand, detailed taxonomic and functional analysis. It uses two levels of frame-shift-aware alignment of DNA reads against protein reference sequences, and then performs detailed analysis using a protein synteny graph. Conclusions We envision this software being used by researchers in the field, when access to servers or cloud facilities is difficult, or by individuals that do not routinely access such facilities, such as medical researchers, crop scientists, or teachers.
Affiliation(s)
- Benjamin Albrecht
  - Department of Computer Science, University of Tübingen, Sand 14, Tübingen, Germany
- Caner Bağcı
  - Department of Computer Science, University of Tübingen, Sand 14, Tübingen, Germany
  - International Max Planck Research School From Molecules to Organisms, Max Planck Institute for Developmental Biology and Eberhard Karls University Tübingen, Max-Planck-Ring 5, Tübingen, 72076, Germany
- Daniel H Huson
  - Department of Computer Science, University of Tübingen, Sand 14, Tübingen, Germany
15
McNair K, Zhou C, Dinsdale EA, Souza B, Edwards RA. PHANOTATE: a novel approach to gene identification in phage genomes. Bioinformatics 2019; 35:4537-4542. [PMID: 31329826 PMCID: PMC6853651 DOI: 10.1093/bioinformatics/btz265] [Citation(s) in RCA: 159] [Impact Index Per Article: 26.5] [Received: 02/02/2018] [Revised: 03/07/2019] [Accepted: 04/15/2019] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Currently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design, they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, having adjacent genes that overlap and genes completely inside of other longer genes. This non-delineated genome structure makes it difficult for gene prediction using the currently available gene annotators. Here we present PHANOTATE, a novel method for gene calling specifically designed for phage genomes. Although the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths: where open reading frames are favorable, and overlaps and gaps are less favorable, but still possible. We represent this network of connections as a weighted graph, and use dynamic programing to find the optimal path. RESULTS We compare PHANOTATE to other gene callers by annotating a set of 2133 complete phage genomes from GenBank, using PHANOTATE and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with PHANOTATE predicting more genes than the other three. We searched for these extra genes in both GenBank's non-redundant protein database and all of the metagenomes in the sequence read archive, and found that they are present at levels that suggest that these are functional protein-coding genes. AVAILABILITY AND IMPLEMENTATION https://github.com/deprekate/PHANOTATE. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
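The path-finding idea described in this abstract can be illustrated with a minimal sketch. This is not PHANOTATE's implementation: the scoring values, the function name `best_path`, and the simplification of chaining ORFs directly (rather than building an explicit graph) are all assumptions made for illustration. Each ORF carries a negative (favorable) weight, each gap or overlap between consecutive ORFs adds a positive penalty, and dynamic programming finds the chain with the lowest total weight.

```python
from typing import List, Tuple

# (start, stop, weight); a more negative weight means a more favorable ORF.
ORF = Tuple[int, int, float]


def best_path(orfs: List[ORF], gap_cost: float = 0.01,
              overlap_cost: float = 0.05) -> Tuple[List[ORF], float]:
    """Return the lowest-cost chain of ORFs and its total cost.

    gap_cost and overlap_cost are hypothetical per-base penalties.
    """
    if not orfs:
        return [], 0.0
    orfs = sorted(orfs, key=lambda o: (o[0], o[1]))
    n = len(orfs)
    best = [o[2] for o in orfs]   # cost of the best chain ending at ORF i
    prev = [-1] * n               # back-pointer for path recovery
    for j in range(n):
        sj, ej, wj = orfs[j]
        for i in range(j):
            _, ei, _ = orfs[i]
            if ej <= ei:
                continue          # ORF j ends inside ORF i: skipped in this sketch
            dist = sj - ei        # positive: gap, negative: overlap
            link = gap_cost * dist if dist >= 0 else overlap_cost * (-dist)
            cand = best[i] + link + wj
            if cand < best[j]:
                best[j], prev[j] = cand, i
    end = min(range(n), key=best.__getitem__)
    cost = best[end]
    path = []
    while end != -1:
        path.append(orfs[end])
        end = prev[end]
    return path[::-1], cost
```

With the toy input `[(0, 100, -5.0), (90, 200, -4.0), (95, 150, -0.5), (300, 400, -3.0)]`, the chain accepts a 10-base overlap between the first two ORFs and a 100-base gap before the last one, because both penalties are smaller than the gain from including those ORFs.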
Affiliation(s)
- Katelyn McNair
  - Computational Sciences Research Center, San Diego State University, San Diego, CA 92182, USA
- Carol Zhou
  - Lawrence Livermore National Laboratory, Livermore, CA 94550, USA
- Brian Souza
  - Lawrence Livermore National Laboratory, Livermore, CA 94550, USA
- Robert A Edwards
  - Computational Sciences Research Center, San Diego State University, San Diego, CA 92182, USA
  - Department of Biology, San Diego State University, San Diego, CA 92182, USA
  - Viral Information Institute, San Diego State University, San Diego, CA 92182, USA
16
Eren K, Murrell B. RIFRAF: a frame-resolving consensus algorithm. Bioinformatics 2019; 34:3817-3824. [PMID: 29850783 DOI: 10.1093/bioinformatics/bty426] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 11/30/2017] [Accepted: 05/22/2018] [Indexed: 01/08/2023] Open
Abstract
Motivation Protein coding genes can be studied using long-read next generation sequencing. However, high rates of indel sequencing errors are problematic, corrupting the reading frame. Even the consensus of multiple independent sequence reads retains indel errors. To solve this problem, we introduce Reference-Informed Frame-Resolving multiple-Alignment Free template inference algorithm (RIFRAF), a sequence consensus algorithm that takes a set of error-prone reads and a reference sequence and infers an accurate in-frame consensus. RIFRAF uses a novel structure, analogous to a two-layer hidden Markov model: the consensus is optimized to maximize alignment scores with both the set of noisy reads and with a reference. The template-to-reads component of the model encodes the preponderance of indels, and is sensitive to the per-base quality scores, giving greater weight to more accurate bases. The reference-to-template component of the model penalizes frame-destroying indels. A local search algorithm proceeds in stages to find the best consensus sequence for both objectives. Results Using Pacific Biosciences SMRT sequences from an HIV-1 env clone, NL4-3, we compare our approach to other consensus and frame correction methods. RIFRAF consistently finds a consensus sequence that is more accurate and in-frame, especially with small numbers of reads. It was able to perfectly reconstruct over 80% of consensus sequences from as few as three reads, whereas the best alternative required twice as many. RIFRAF is able to achieve these results and keep the consensus in-frame even with a distantly related reference sequence. Moreover, unlike other frame correction methods, RIFRAF can detect and keep true indels while removing erroneous ones. Availability and implementation RIFRAF is implemented in Julia, and source code is publicly available at https://github.com/MurrellGroup/Rifraf.jl. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kemal Eren
  - Bioinformatics and Systems Biology, University of California San Diego, La Jolla, CA, USA
- Ben Murrell
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
17
Suvorova YM, Korotkova MA, Skryabin KG, Korotkov EV. Search for potential reading frameshifts in cds from Arabidopsis thaliana and other genomes. DNA Res 2019; 26:157-170. [PMID: 30726896 PMCID: PMC6476729 DOI: 10.1093/dnares/dsy046] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 04/27/2018] [Accepted: 12/07/2018] [Indexed: 01/01/2023] Open
Abstract
A new mathematical method for potential reading frameshift detection in protein-coding sequences (cds) was developed. The algorithm is adjusted to the triplet periodicity of each analysed sequence using dynamic programming and a genetic algorithm. This does not require any preliminary training. Using the developed method, cds from the Arabidopsis thaliana genome were analysed. In total, the algorithm found 9,930 sequences containing one or more potential reading frameshift(s). This is ∼21% of all analysed sequences of the genome. The Type I and Type II error rates were estimated as 11% and 30%, respectively. Similar results were obtained for the genomes of Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Rattus norvegicus and Xenopus tropicalis. Also, the developed algorithm was tested on 17 bacterial genomes. We compared our results with the previously obtained data on the search for potential reading frameshifts in these genomes. This study discussed the possibility that the reading frameshift seems like a relatively frequently encountered mutation; and this mutation could participate in the creation of new genes and proteins.
Affiliation(s)
- Y M Suvorova
  - Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia
- M A Korotkova
  - National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), Moscow, Russia
- K G Skryabin
  - Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia
- E V Korotkov
  - Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia
  - National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), Moscow, Russia
18
Dilthey AT, Jain C, Koren S, Phillippy AM. Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps. Nat Commun 2019; 10:3066. [PMID: 31296857 PMCID: PMC6624308 DOI: 10.1038/s41467-019-10934-2] [Citation(s) in RCA: 91] [Impact Index Per Article: 15.2] [Received: 12/07/2018] [Accepted: 06/11/2019] [Indexed: 12/20/2022] Open
Abstract
Metagenomic sequence classification should be fast, accurate and information-rich. Emerging long-read sequencing technologies promise to improve the balance between these factors but most existing methods were designed for short reads. MetaMaps is a new method, specifically developed for long reads, capable of mapping a long-read metagenome to a comprehensive RefSeq database with >12,000 genomes in <16 GB of RAM on a laptop computer. Integrating approximate mapping with probabilistic scoring and EM-based estimation of sample composition, MetaMaps achieves >94% accuracy for species-level read assignment and r2 > 0.97 for the estimation of sample composition on both simulated and real data when the sample genomes or close relatives are present in the classification database. To address novel species and genera, which are comparatively harder to predict, MetaMaps outputs mapping locations and qualities for all classified reads, enabling functional studies (e.g. gene presence/absence) and detection of incongruities between sample and reference genomes. Sequencing platforms, such as Oxford Nanopore or Pacific Biosciences generate long-read data that preserve long-range genomic information but have high error rates. Here, the authors develop MetaMaps, a computational tool for strain-level metagenomic assignment and compositional estimation using long reads.
Affiliation(s)
- Alexander T Dilthey
  - Institute of Medical Microbiology and Hospital Hygiene, Heinrich-Heine-University Düsseldorf, Düsseldorf, North Rhine-Westphalia, Germany
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
- Chirag Jain
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
  - Georgia Institute of Technology, Atlanta, GA, 30332, USA
- Sergey Koren
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
- Adam M Phillippy
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, 20892, USA
19
Suvorova YM, Pugacheva VM, Korotkov EV. A Database of Potential Reading Frame Shifts in Coding Sequences from Different Eukaryotic Genomes. Biophysics (Nagoya-shi) 2019. [DOI: 10.1134/s0006350919030217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022] Open
20
Bağcı C, Beier S, Górska A, Huson DH. Introduction to the Analysis of Environmental Sequences: Metagenomics with MEGAN. Methods Mol Biol 2019; 1910:591-604. [PMID: 31278678 DOI: 10.1007/978-1-4939-9074-0_19] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 12/12/2022]
Abstract
Metagenomics has become a part of the standard toolkit for scientists interested in studying microbes in the environment. Compared to 16S rDNA sequencing, which allows coarse taxonomic profiling of samples, shotgun metagenomic sequencing provides a more detailed analysis of the taxonomic and functional content of samples. Long read technologies, such as developed by Pacific Biosciences or Oxford Nanopore, produce much longer stretches of informative sequence, greatly simplifying the difficult and time-consuming process of metagenomic assembly. MEGAN6 provides a wide range of analysis and visualization methods for the analysis of short and long read metagenomic data. A simple and efficient analysis pipeline for metagenomic analysis consists of the DIAMOND alignment tool on short reads, or the LAST alignment tool on long reads, followed by MEGAN. This approach performs taxonomic and functional abundance analysis, supports comparative analysis of large-scale experiments, and allows one to involve experimental metadata in the analysis.
Affiliation(s)
- Caner Bağcı
  - Algorithms in Bioinformatics, Faculty of Computer Science, University of Tübingen, Tübingen, Germany
- Sina Beier
  - Algorithms in Bioinformatics, Faculty of Computer Science, University of Tübingen, Tübingen, Germany
- Anna Górska
  - Algorithms in Bioinformatics, Faculty of Computer Science, University of Tübingen, Tübingen, Germany
- Daniel H Huson
  - Algorithms in Bioinformatics, Faculty of Computer Science, University of Tübingen, Tübingen, Germany
21
Tanizawa Y, Fujisawa T, Arita M, Nakamura Y. Generating Publication-Ready Prokaryotic Genome Annotations with DFAST. Methods Mol Biol 2019; 1962:215-226. [PMID: 31020563 DOI: 10.1007/978-1-4939-9173-0_13] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 12/24/2022]
Abstract
DDBJ Fast Annotation and Submission Tool (DFAST) is a genome annotation pipeline for prokaryotes, which also assists data submission to the public sequence database. It is available both as a web service and as a stand-alone tool that runs on local machines. DFAST can annotate a typical-sized bacterial genome within 5 min. The default annotation workflow contains a gene prediction phase for protein coding sequence, rRNA, tRNA, and CRISPR, and a functional annotation phase to infer protein functions. DFAST generates result files in standard annotation formats and data files for submission to DNA Data Bank of Japan (DDBJ). In this chapter, the annotation workflow and applications of DFAST are introduced.
Affiliation(s)
- Yasuhiro Tanizawa
  - Department of Informatics, National Institute of Genetics, Shizuoka, Japan
- Takatomo Fujisawa
  - Department of Informatics, National Institute of Genetics, Shizuoka, Japan
- Masanori Arita
  - Department of Informatics, National Institute of Genetics, Shizuoka, Japan
  - RIKEN Center for Sustainable Resource Science, Yokohama, Kanagawa, Japan
- Yasukazu Nakamura
  - Department of Informatics, National Institute of Genetics, Shizuoka, Japan
22
Parallels between experimental and natural evolution of legume symbionts. Nat Commun 2018; 9:2264. [PMID: 29891837 PMCID: PMC5995829 DOI: 10.1038/s41467-018-04778-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 01/18/2018] [Accepted: 05/11/2018] [Indexed: 12/29/2022] Open
Abstract
The emergence of symbiotic interactions has been studied using population genomics in nature and experimental evolution in the laboratory, but the parallels between these processes remain unknown. Here we compare the emergence of rhizobia after the horizontal transfer of a symbiotic plasmid in natural populations of Cupriavidus taiwanensis, over 10 MY ago, with the experimental evolution of symbiotic Ralstonia solanacearum for a few hundred generations. In spite of major differences in terms of time span, environment, genetic background, and phenotypic achievement, both processes resulted in rapid genetic diversification dominated by purifying selection. We observe no adaptation in the plasmid carrying the genes responsible for the ecological transition. Instead, adaptation was associated with positive selection in a set of genes that led to the co-option of the same quorum-sensing system in both processes. Our results provide evidence for similarities in experimental and natural evolutionary transitions and highlight the potential of comparisons between both processes to understand symbiogenesis. It is unclear if experimental evolution is a good model for natural processes. Here, Clerissi et al. find parallels between the evolution of symbiosis in rhizobia after horizontal transfer of a plasmid over 10 million years ago and experimentally evolved symbionts.
23
Huson DH, Albrecht B, Bağcı C, Bessarab I, Górska A, Jolic D, Williams RBH. MEGAN-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biol Direct 2018; 13:6. [PMID: 29678199 PMCID: PMC5910613 DOI: 10.1186/s13062-018-0208-7] [Citation(s) in RCA: 104] [Impact Index Per Article: 14.9] [Received: 10/19/2017] [Accepted: 03/29/2018] [Indexed: 11/15/2022] Open
Abstract
Background There are numerous computational tools for taxonomic or functional analysis of microbiome samples, optimized to run on hundreds of millions of short, high quality sequencing reads. Programs such as MEGAN allow the user to interactively navigate these large datasets. Long read sequencing technologies continue to improve and produce increasing numbers of longer reads (of varying lengths in the range of 10k-1M bps, say), but of low quality. There is an increasing interest in using long reads in microbiome sequencing, and there is a need to adapt short read tools to long read datasets. Methods We describe a new LCA-based algorithm for taxonomic binning, and an interval-tree based algorithm for functional binning, that are explicitly designed for long reads and assembled contigs. We provide a new interactive tool for investigating the alignment of long reads against reference sequences. For taxonomic and functional binning, we propose to use LAST to compare long reads against the NCBI-nr protein reference database so as to obtain frame-shift aware alignments, and then to process the results using our new methods. Results All presented methods are implemented in the open source edition of MEGAN, and we refer to this new extension as MEGAN-LR (MEGAN long read). We evaluate the LAST+MEGAN-LR approach in a simulation study, and on a number of mock community datasets consisting of Nanopore reads, PacBio reads and assembled PacBio reads. We also illustrate the practical application on a Nanopore dataset that we sequenced from an anammox bioreactor community. Reviewers This article was reviewed by Nicola Segata together with Moreno Zolfo, Pete James Lockhart and Serghei Mangul. Conclusion This work extends the applicability of the widely-used metagenomic analysis software MEGAN to long reads. Our study suggests that the presented LAST+MEGAN-LR pipeline is sufficiently fast and accurate.
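For readers unfamiliar with LCA-based binning, the basic lowest-common-ancestor idea that such methods build on can be sketched as follows. This is the plain, naive LCA over a parent map, not the new long-read algorithm the abstract describes; the toy taxonomy and the function names `lineage` and `naive_lca` are assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def lineage(taxon: str, parent: Dict[str, str]) -> List[str]:
    """Return the path from a taxon up to the root (taxon first)."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def naive_lca(taxa: Iterable[str], parent: Dict[str, str]) -> str:
    """Lowest node shared by the lineages of all given taxa.

    Assumes a single-rooted taxonomy, so a common ancestor always exists.
    """
    taxa = list(taxa)
    if not taxa:
        raise ValueError("at least one taxon is required")
    common = lineage(taxa[0], parent)          # ordered leaf -> root
    for taxon in taxa[1:]:
        ancestors = set(lineage(taxon, parent))
        common = [node for node in common if node in ancestors]
    return common[0]                           # first survivor is the lowest


# Hypothetical toy taxonomy (child -> parent), for illustration only.
TOY_PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia albertii": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
}
```

A read whose alignments hit only the two Escherichia species would be placed on the genus node, while adding a Salmonella hit pushes the placement up to the family node.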
Affiliation(s)
- Daniel H Huson
  - Center for Bioinformatics, University of Tübingen, Sand 14, Tübingen, 72076, Germany
  - Life Sciences Institute, National University of Singapore, 28 Medical Drive, Singapore, 117456, Singapore
- Benjamin Albrecht
  - Center for Bioinformatics, University of Tübingen, Sand 14, Tübingen, 72076, Germany
- Caner Bağcı
  - Center for Bioinformatics, University of Tübingen, Sand 14, Tübingen, 72076, Germany
  - IMPRS 'From Molecules to Organisms', Tübingen, Germany
- Irina Bessarab
  - Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, 28 Medical Drive, Singapore, 117456, Singapore
- Anna Górska
  - Center for Bioinformatics, University of Tübingen, Sand 14, Tübingen, 72076, Germany
  - IMPRS 'From Molecules to Organisms', Tübingen, Germany
- Dino Jolic
  - Max-Planck Institute for Developmental Biology, Tübingen, 72076, Germany
  - IMPRS 'From Molecules to Organisms', Tübingen, Germany
- Rohan B H Williams
  - Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, 28 Medical Drive, Singapore, 117456, Singapore
24
Capela D, Marchetti M, Clérissi C, Perrier A, Guetta D, Gris C, Valls M, Jauneau A, Cruveiller S, Rocha EPC, Masson-Boivin C. Recruitment of a Lineage-Specific Virulence Regulatory Pathway Promotes Intracellular Infection by a Plant Pathogen Experimentally Evolved into a Legume Symbiont. Mol Biol Evol 2017; 34:2503-2521. [PMID: 28535261 DOI: 10.1093/molbev/msx165] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Indexed: 12/12/2022] Open
Abstract
Ecological transitions between different lifestyles, such as pathogenicity, mutualism and saprophytism, have been very frequent in the course of microbial evolution, and often driven by horizontal gene transfer. Yet, how genomes achieve the ecological transition initiated by the transfer of complex biological traits remains poorly known. Here, we used experimental evolution, genomics, transcriptomics and high-resolution phenotyping to analyze the evolution of the plant pathogen Ralstonia solanacearum into legume symbionts, following the transfer of a natural plasmid encoding the essential mutualistic genes. We show that a regulatory pathway of the recipient R. solanacearum genome involved in extracellular infection of natural hosts was reused to improve intracellular symbiosis with the Mimosa pudica legume. Optimization of intracellular infection capacity was gained through mutations affecting two components of a new regulatory pathway, the transcriptional regulator efpR and a region upstream from the RSc0965-0967 genes of unknown functions. Adaptive mutations caused the downregulation of efpR and the over-expression of a downstream regulatory module, the three unknown genes RSc3146-3148, two of which encoding proteins likely associated to the membrane. This over-expression led to important metabolic and transcriptomic changes and a drastic qualitative and quantitative improvement of nodule intracellular infection. In addition, these adaptive mutations decreased the virulence of the original pathogen. The complete efpR/RSc3146-3148 pathway could only be identified in the genomes of the pathogenic R. solanacearum species complex. Our findings illustrate how the rewiring of a genetic network regulating virulence allows a radically different type of symbiotic interaction and contributes to ecological transitions and trade-offs.
Affiliation(s)
- Delphine Capela
  - LIPM, Université de Toulouse, INRA, CNRS, Castanet-Tolosan, France
- Marta Marchetti
  - LIPM, Université de Toulouse, INRA, CNRS, Castanet-Tolosan, France
- Camille Clérissi
  - LIPM, Université de Toulouse, INRA, CNRS, Castanet-Tolosan, France
  - Microbial Evolutionary Genomics, Institut Pasteur, Paris, France
  - CNRS, UMR3525, Paris, France
- Anthony Perrier
  - LIPM, Université de Toulouse, INRA, CNRS, Castanet-Tolosan, France
- Dorian Guetta
  - LIPM, Université de Toulouse, INRA, CNRS, Castanet-Tolosan, France
- Carine Gris
  - LIPM, Université de Toulouse, INRA, CNRS, Castanet-Tolosan, France
- Marc Valls
  - Department of Genetics, University of Barcelona and Centre for Research in Agricultural Genomics (CSIC-IRTA-UAB-UB), Edifici CRAG, Campus UAB, Bellaterra, Spain
- Alain Jauneau
  - Fédération de Recherches Agrobiosciences, Interactions, Biodiversity, Plateforme d'Imagerie TRI, CNRS, UPS, Castanet-Tolosan, France
- Stéphane Cruveiller
  - CNRS-UMR8030 and Commissariat à l'Energie Atomique et aux Energies Alternatives CEA/DRF/IG/GEN LABGeM, Evry, France
- Eduardo P C Rocha
  - Microbial Evolutionary Genomics, Institut Pasteur, Paris, France
  - CNRS, UMR3525, Paris, France
25
Steinbiss S, Silva-Franco F, Brunk B, Foth B, Hertz-Fowler C, Berriman M, Otto TD. Companion: a web server for annotation and analysis of parasite genomes. Nucleic Acids Res 2016; 44:W29-34. [PMID: 27105845 PMCID: PMC4987884 DOI: 10.1093/nar/gkw292] [Citation(s) in RCA: 92] [Impact Index Per Article: 10.2] [Received: 01/29/2016] [Accepted: 04/08/2016] [Indexed: 01/25/2023] Open
Abstract
Currently available sequencing technologies enable quick and economical sequencing of many new eukaryotic parasite (apicomplexan or kinetoplastid) species or strains. Compared to SNP calling approaches, de novo assembly of these genomes enables researchers to additionally determine insertion, deletion and recombination events as well as to detect complex sequence diversity, such as that seen in variable multigene families. However, there currently are no automated eukaryotic annotation pipelines offering the required range of results to facilitate such analyses. A suitable pipeline needs to perform evidence-supported gene finding as well as functional annotation and pseudogene detection up to the generation of output ready to be submitted to a public database. Moreover, no current tool includes quick yet informative comparative analyses and a first pass visualization of both annotation and analysis results. To overcome those needs we have developed the Companion web server (http://companion.sanger.ac.uk) providing parasite genome annotation as a service using a reference-based approach. We demonstrate the use and performance of Companion by annotating two Leishmania and Plasmodium genomes as typical parasite cases and evaluate the results compared to manually annotated references.
Affiliation(s)
- Sascha Steinbiss
  - Parasite Genomics, Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK
- Fatima Silva-Franco
  - Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
- Brian Brunk
  - Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA
- Bernardo Foth
  - Parasite Genomics, Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK
- Matthew Berriman
  - Parasite Genomics, Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK
- Thomas D Otto
  - Parasite Genomics, Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK
26
Jimenez J, Duncan CDS, Gallardo M, Mata J, Perez-Pulido AJ. AnABlast: a new in silico strategy for the genome-wide search of novel genes and fossil regions. DNA Res 2015; 22:439-49. [PMID: 26494834 PMCID: PMC4675712 DOI: 10.1093/dnares/dsv025] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 07/23/2015] [Accepted: 09/25/2015] [Indexed: 12/15/2022] Open
Abstract
Genome annotation, assisted by computer programs, is one of the great advances in modern biology. Nevertheless, the in silico identification of small and complex coding sequences is still challenging. We observed that amino acid sequences inferred from coding-but rarely from non-coding-DNA sequences accumulated alignments in low-stringency BLAST searches, suggesting that this alignments accumulation could be used to highlight coding regions in sequenced DNA. To investigate this possibility, we developed a computer program (AnABlast) that generates profiles of accumulated alignments in query amino acid sequences using a low-stringency BLAST strategy. To validate this approach, all six-frame translations of DNA sequences between every two annotated exons of the fission yeast genome were analysed with AnABlast. AnABlast-generated profiles identified three new copies of known genes, and four new genes supported by experimental evidence. New pseudogenes, ancestral carboxyl- and amino-terminal subtractions, complex gene rearrangements, and ancient fragments of mitDNA and of bacterial origin, were also inferred. Thus, this novel in silico approach provides a powerful tool to uncover new genes, as well as fossil-coding sequences, thus providing insight into the evolutionary history of annotated genomes.
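The six-frame translation step mentioned in this abstract can be sketched in a few lines. This is a generic illustration using the standard genetic code, not AnABlast's own code; the function names and the choice to mark unrecognised codons with `X` are assumptions.

```python
from itertools import product
from typing import Dict

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> Dict[str, str]:
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Each of the six resulting amino acid strings could then serve as a query in a protein-level search; stop codons appear as `*` so that uninterrupted stretches remain visible.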
Affiliation(s)
- Juan Jimenez
  - Centro Andaluz de Biología del Desarrollo, Universidad Pablo de Olavide de Sevilla/CSIC, Sevilla, Spain
- Caia D S Duncan
  - Department of Biochemistry, University of Cambridge, Cambridge, UK
- María Gallardo
  - Centro Andaluz de Biología del Desarrollo, Universidad Pablo de Olavide de Sevilla/CSIC, Sevilla, Spain
- Juan Mata
  - Department of Biochemistry, University of Cambridge, Cambridge, UK
- Antonio J Perez-Pulido
  - Centro Andaluz de Biología del Desarrollo, Universidad Pablo de Olavide de Sevilla/CSIC, Sevilla, Spain
27
Sheetlin S, Park Y, Frith MC, Spouge JL. ALP & FALP: C++ libraries for pairwise local alignment E-values. Bioinformatics 2015; 32:304-5. [PMID: 26428291 DOI: 10.1093/bioinformatics/btv575] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 08/21/2015] [Accepted: 09/28/2015] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Pairwise local alignment is an indispensable tool for molecular biologists. In real time (i.e. in about 1 s), ALP (Ascending Ladder Program) calculates the E-values for protein-protein or DNA-DNA local alignments of random sequences, for arbitrary substitution score matrix, gap costs and letter abundances; and FALP (Frameshift Ascending Ladder Program) performs a similar task, although more slowly, for frameshifting DNA-protein alignments. AVAILABILITY AND IMPLEMENTATION To permit other C++ programmers to implement the computational efficiencies in ALP and FALP directly within their own programs, C++ source codes are available in the public domain at http://go.usa.gov/3GTSW under 'ALP' and 'FALP', along with the standalone programs ALP and FALP. CONTACT spouge@nih.gov SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sergey Sheetlin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD 20894, USA
- Yonil Park
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD 20894, USA
- Martin C Frith
  - Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan
- John L Spouge
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD 20894, USA