1
Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024] Open
Abstract
Traditional alignment-based methods meet serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then for each DNA sequence or protein sequence in a dataset, our method converts the sequence into a feature vector that represents the sequence information based on CGR weighted by the DL model to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences of 8 datasets of viruses to construct the phylogenetic trees. For each dataset, we compared the Robinson-Foulds (RF) distance between the phylogenetic tree constructed by CGRWDL and the reference tree with the distances obtained by other advanced methods. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and the RF scores between these trees and the reference trees are smaller than those obtained with other methods.
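To make the CGR idea in this abstract concrete, here is a minimal Python sketch of how k-mers of a DNA sequence can be laid out on a 2^k x 2^k CGR grid and turned into a frequency vector. It illustrates plain CGR only and does not reproduce the DL-model weighting described for CGRWDL. The corner assignment and the function names `kmer_cell` and `cgr_frequency_vector` are assumptions made for illustration.

```python
from collections import Counter

# Corner assignment (x, y) is one common convention, assumed here for illustration.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def kmer_cell(kmer):
    """Map a k-mer to its (row, col) cell in the 2^k x 2^k CGR grid.

    In CGR each new point lies halfway between the previous point and the
    corner of the current base, so the last base of the k-mer fixes the most
    significant bit of each coordinate.
    """
    row = col = 0
    for j, base in enumerate(kmer):
        cx, cy = CORNERS[base]
        col |= cx << j
        row |= cy << j
    return row, col

def cgr_frequency_vector(seq, k):
    """Return normalized k-mer frequencies laid out on the flattened CGR grid."""
    seq = seq.upper()
    size = 1 << k
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(b in CORNERS for b in kmer):  # skip k-mers with ambiguous bases
            counts[kmer_cell(kmer)] += 1
    total = sum(counts.values())
    vec = [0.0] * (size * size)
    for (row, col), c in counts.items():
        vec[row * size + col] = c / total
    return vec
```

Two sequences can then be compared through any vector distance between their `cgr_frequency_vector` outputs; which distance and which weighting the paper uses is not stated in the abstract.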
Affiliation(s)
- Ting Wang
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Zu-Guo Yu
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Jinyan Li
  - School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
2
Chao P, Zhang X, Zhang L, Yang A, Wang Y, Chen X. Proteomics-based vaccine targets annotation and design of multi-epitope vaccine against antibiotic-resistant Streptococcus gallolyticus. Sci Rep 2024; 14:4836. [PMID: 38418560 PMCID: PMC10901886 DOI: 10.1038/s41598-024-55372-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 02/22/2024] [Indexed: 03/01/2024] Open
Abstract
Streptococcus gallolyticus is a non-motile, gram-positive bacterium that causes infective endocarditis. S. gallolyticus has developed resistance to existing antibiotics, and no vaccine is currently available. Therefore, it is essential to develop an effective S. gallolyticus vaccine. Core proteomics was used in this study together with subtractive proteomics and a reverse vaccinology approach to find antigenic proteins that could be utilized for the design of the S. gallolyticus multi-epitope vaccine. The pipeline identified two antigenic proteins as potential vaccine targets: penicillin-binding protein and the ATP synthase subunit. T and B cell epitopes from the specific proteins were forecasted employing several immunoinformatics and bioinformatics resources. A vaccine (360 amino acids) was created using a combination of seven cytotoxic T cell lymphocyte (CTL), three helper T cell lymphocyte (HTL), and five linear B cell lymphocyte (LBL) epitopes. To increase immune responses, the vaccine was paired with a cholera enterotoxin subunit B (CTB) adjuvant. The developed vaccine was highly antigenic, non-allergenic, and stable for human use. The vaccine's binding affinity and molecular interactions with the human immunological receptor TLR4 were studied using molecular mechanics/generalized Born surface area (MMGBSA), molecular docking, and molecular dynamic (MD) simulation analyses. Escherichia coli (strain K12) plasmid vector pET-28a(+) was used to examine the ability of the vaccine to be expressed. According to the outcomes of these computer experiments, the vaccine is quite promising in terms of developing a protective immunity against diseases. However, in vitro and animal research are required to validate our findings.
Affiliation(s)
- Peng Chao
  - Department of Cardiology, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, China
- Xueqin Zhang
  - Department of Nephrology, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, China
- Lei Zhang
  - Department of Cardiology, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, China
- Aiping Yang
  - Department of Traditional Chinese Medicine, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, China
- Yong Wang
  - Department of Cardiology, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, China
- Xiaoyang Chen
  - Department of Cardiology, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, China
3
Guo X, Guo Y, Chen H, Liu X, He P, Li W, Zhang MQ, Dai Q. Systematic comparison of genome information processing and boundary recognition tools used for genomic island detection. Comput Biol Med 2023; 166:107550. [PMID: 37826950 DOI: 10.1016/j.compbiomed.2023.107550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Revised: 09/12/2023] [Accepted: 09/28/2023] [Indexed: 10/14/2023]
Abstract
Genomic islands are fragments of foreign DNA that are found in bacterial and archaeal genomes, and are typically associated with symbiosis or pathogenesis. While numerous genomic island detection methods have been proposed, there has been limited evaluation of the efficiency of the genome information processing and boundary recognition tools. In this study, we conducted a review of the statistical methods involved in genomic signatures, host signature extraction, informative signature selection, divergence measures, and boundary detection steps in genomic island prediction. We compared the performances of these methods on simulated experiments using alien fragments obtained from both artificial and real genomes. Our results indicate that among the nine genomic signatures evaluated, genomic signature frequency and full probability performed the best. However, their performance declined when normalized to their expectations and variances, such as Z-score and composition vector. Based on our experiments of the E. coli genome, we found that the confidence intervals of the window variances achieved the best performance in the signature extraction of the host, with the best confidence interval being 1.5-2 times the standard error. Ordered kurtosis was most effective in selecting informative signatures from a single genome, without requiring prior knowledge from other datasets. Among the three divergence measures evaluated, the two-sample t-test was the most successful, and a non-overlapping window with a small eye window (size 2) was best suited for identifying compositionally distinct regions. Finally, the maximum of the Markovian Jensen-Shannon divergence score, in terms of GC-content bias, was found to make boundary detection faster while maintaining a similar error rate.
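As a rough illustration of the window-based divergence idea in this abstract, the following Python sketch scores non-overlapping windows of a genome by the Jensen-Shannon divergence between each window's k-mer distribution and that of the whole genome. It uses the plain (non-Markovian) Jensen-Shannon divergence, not the Markovian variant the abstract names, and the function names, window layout and parameters are assumptions for illustration rather than the evaluated tools.

```python
import math
from collections import Counter

def kmer_distribution(seq, k):
    """Relative k-mer frequencies of a sequence as a dict."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    div = 0.0
    for key in set(p) | set(q):
        pi, qi = p.get(key, 0.0), q.get(key, 0.0)
        mi = 0.5 * (pi + qi)
        if pi > 0:
            div += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * math.log2(qi / mi)
    return div

def window_scores(genome, k, window):
    """Score non-overlapping windows by their divergence from the whole genome."""
    host = kmer_distribution(genome, k)
    scores = []
    for start in range(0, len(genome) - window + 1, window):
        local = kmer_distribution(genome[start:start + window], k)
        scores.append((start, jensen_shannon(local, host)))
    return scores
```

Windows with unusually high scores are candidates for compositionally distinct regions; choosing the threshold is the part the abstract addresses with confidence intervals, which this sketch leaves out.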
Affiliation(s)
- Xiangting Guo
  - Zhejiang Sci-Tech University, Hangzhou, 310018, China
- Yichu Guo
  - Zhejiang Sci-Tech University, Hangzhou, 310018, China
- Hu Chen
  - Zhejiang Sci-Tech University, Hangzhou, 310018, China
- Xiaoqing Liu
  - College of Sciences, Hangzhou Dianzi University, Hangzhou, 310018, China
- Pingan He
  - Zhejiang Sci-Tech University, Hangzhou, 310018, China
- Wenshu Li
  - Zhejiang Sci-Tech University, Hangzhou, 310018, China
- Michael Q Zhang
  - Center for Systems Biology, University of Texas at Dallas, Richardson, TX, 75080, USA
  - Center for Synthetic and Systems Biology, TNLIST, Tsinghua University, Beijing, 100084, China
- Qi Dai
  - Zhejiang Sci-Tech University, Hangzhou, 310018, China
  - Center for Systems Biology, University of Texas at Dallas, Richardson, TX, 75080, USA
4
Li X, Li H, Yang Z, Wu Y, Zhang M. Exploring objective feature sets in constructing the evolution relationship of animal genome sequences. BMC Genomics 2023; 24:634. [PMID: 37872534 PMCID: PMC10594854 DOI: 10.1186/s12864-023-09747-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Accepted: 10/17/2023] [Indexed: 10/25/2023] Open
Abstract
BACKGROUND: Exploring evolution regularities of genome sequences and constructing more objective species evolution relationships at the genomic level are high-profile topics. Based on the evolution mechanism of genome sequences proposed in our previous research, we found that only the 8-mers containing CG or TA dinucleotides correlate directly with the evolution of genome sequences, and the relative frequency rather than the actual frequency of these 8-mers is more suitable to characterize the evolution of genome sequences.
RESULT: Therefore, two types of feature sets were obtained: the relative frequency sets of CG1 + CG2 8-mers and TA1 + TA2 8-mers. The evolution relationships of mammals and reptiles were constructed by the relative frequency set of CG1 + CG2 8-mers, and two types of evolution relationships of insects were constructed by the relative frequency sets of CG1 + CG2 8-mers and TA1 + TA2 8-mers respectively. Through comparison and analysis, we found that the evolution relationships are consistent with the known conclusions. According to the evolution mechanism, we considered that the evolution relationship constructed by CG1 + CG2 8-mers reflects the evolution state of genome sequences in current time, and the evolution relationship constructed by TA1 + TA2 8-mers reflects the evolution state in the early stage.
CONCLUSION: Our study provides objective feature sets in constructing evolution relationships at the genomic level.
Affiliation(s)
- Xiaolong Li
  - Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
- Hong Li
  - Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
- Zhenhua Yang
  - School of Economics and Management, Inner Mongolia University of Science and Technology, Baotou, 014010, China
- Yuan Wu
  - Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
- Mengchuan Zhang
  - Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
5
Sengupta S, Azad RK. Leveraging comparative genomics to uncover alien genes in bacterial genomes. Microb Genom 2023; 9:mgen000939. [PMID: 36748570 PMCID: PMC9973850 DOI: 10.1099/mgen.0.000939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023] Open
Abstract
A significant challenge in bacterial genomics is to catalogue genes acquired through the evolutionary process of horizontal gene transfer (HGT). Both comparative genomics and sequence composition-based methods have often been invoked to quantify horizontally acquired genes in bacterial genomes. Comparative genomics methods rely on completely sequenced genomes and therefore the confidence in their predictions increases as the databases become more enriched in completely sequenced genomes. Recent developments including in microbial genome sequencing call for reassessment of alien genes based on information-rich resources currently available. We revisited the comparative genomics approach and developed a new algorithm for alien gene detection. Our algorithm compared favourably with the existing comparative genomics-based methods and is capable of detecting both recent and ancient transfers. It can be used as a standalone tool or in concert with other complementary algorithms for comprehensively cataloguing alien genes in bacterial genomes.
Affiliation(s)
- Soham Sengupta
  - Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, Texas, 76203, USA
- Rajeev K Azad
  - Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, Texas, 76203, USA
  - Department of Mathematics, University of North Texas, Denton, Texas, 76203, USA
6
Ma Z, Lu YY, Wang Y, Lin R, Yang Z, Zhang F, Wang Y. Metric learning for comparing genomic data with triplet network. Brief Bioinform 2022; 23:6679451. [DOI: 10.1093/bib/bbac345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2022] [Revised: 07/20/2022] [Accepted: 07/26/2022] [Indexed: 11/13/2022] Open
Abstract
Many biological applications are essentially pairwise comparison problems, such as evolutionary relationships on genomic sequences, contigs binning on metagenomic data, cell type identification on gene expression profiles of single-cells, etc. To make pairwise comparisons, it is necessary to adopt a suitable dissimilarity metric. However, not all the metrics can be fully adapted to all possible biological applications. It is necessary to employ metric learning based on data adaptive to the application of interest. Therefore, in this study, we proposed MEtric Learning with Triplet network (MELT), which learns a nonlinear mapping from original space to the embedding space in order to keep similar data closer and dissimilar data far apart. MELT is a weakly supervised and data-driven comparison framework that offers more adaptive and accurate dissimilarity learned in the absence of the label information when the supervised methods are not applicable. We applied MELT in three typical applications of genomic data comparison, including hierarchical genomic sequences, longitudinal microbiome samples and longitudinal single-cell gene expression profiles, which have no distinctive grouping information. In the experiments, MELT demonstrated its empirical utility in comparison to many widely used dissimilarity metrics. MELT is also expected to accommodate a more extensive set of applications in large-scale genomic comparisons. MELT is available at https://github.com/Ying-Lab/MELT.
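The "keep similar data closer and dissimilar data far apart" objective of a triplet network is usually expressed as a triplet margin loss. The Python sketch below shows that standard loss on already-embedded vectors; the Euclidean distance, the margin value and the function names are assumptions for illustration, and MELT's actual loss and network architecture may differ.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

In a trained model the three inputs would be outputs of the learned embedding network, and the loss would be averaged over many sampled triplets.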
Affiliation(s)
- Zhi Ma
  - Department of Automation, Xiamen University, China
  - National Institute for Data Science in Health and Medicine, Xiamen University
- Yang Young Lu
  - Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
- Yiwen Wang
  - Department of Automation, Xiamen University, China
- Renhao Lin
  - Department of Automation, Xiamen University, China
- Zizi Yang
  - Department of Automation, Xiamen University, China
- Fang Zhang
  - Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
- Ying Wang
  - Department of Automation, Xiamen University, China
  - National Institute for Data Science in Health and Medicine, Xiamen University
  - Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen, Fujian 361005, China
  - Fujian Key Laboratory of Genetics and Breeding of Marine Organisms, Xiamen, 361100, China
7
Birth N, Dencker T, Morgenstern B. Insertions and deletions as phylogenetic signal in an alignment-free context. PLoS Comput Biol 2022; 18:e1010303. [PMID: 35939516 PMCID: PMC9387925 DOI: 10.1371/journal.pcbi.1010303] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/18/2021] [Revised: 08/18/2022] [Accepted: 06/14/2022] [Indexed: 11/18/2022] Open
Abstract
Most methods for phylogenetic tree reconstruction are based on sequence alignments; they infer phylogenies from substitutions that may have occurred at the aligned sequence positions. Gaps in alignments are usually not employed as phylogenetic signal. In this paper, we explore an alignment-free approach that uses insertions and deletions (indels) as an additional source of information for phylogeny inference. For a set of four or more input sequences, we generate so-called quartet blocks of four putative homologous segments each. For pairs of such quartet blocks involving the same four sequences, we compare the distances between the two blocks in these sequences, to obtain hints about indels that may have happened between the blocks since the respective four sequences have evolved from their last common ancestor. A prototype implementation that we call Gap-SpaM is presented to infer phylogenetic trees from these data, using a quartet-tree approach or, alternatively, under the maximum-parsimony paradigm. This approach should not be regarded as an alternative to established methods, but rather as a complementary source of phylogenetic information. Interestingly, however, our software is able to produce phylogenetic trees from putative indels alone that are comparable to trees obtained with existing alignment-free methods.
Phylogenetic tree inference based on DNA or protein sequence comparison is a fundamental task in computational biology. Given a multiple alignment of a set of input sequences, most approaches compare aligned sequence positions to each other, to find a suitable tree, based on a model of molecular evolution. Insertions and deletions that may have happened since the input sequences evolved from their last common ancestor are ignored by most phylogeny methods.
Herein, we show that insertions and deletions can provide an additional source of information for phylogeny inference, and that such information can be obtained with a simple alignment-free approach. We provide an implementation of this idea that we call Gap-SpaM. The proposed approach is complementary to existing phylogeny methods since it is based on a completely different source of information. It is, thus, not meant to be an alternative to those existing methods but rather as a possible additional source of information for tree inference.
Affiliation(s)
- Niklas Birth
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany
- Thomas Dencker
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany
- Burkhard Morgenstern
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany
  - Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany
  - Campus-Institute Data Science (CIDAS), Göttingen, Germany
8
Wang Y, Sun F, Lin W, Zhang S. AC-PCoA: Adjustment for confounding factors using principal coordinate analysis. PLoS Comput Biol 2022; 18:e1010184. [PMID: 35830390 PMCID: PMC9278763 DOI: 10.1371/journal.pcbi.1010184] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2021] [Accepted: 05/08/2022] [Indexed: 12/01/2022] Open
Abstract
Confounding factors exist widely in various biological data owing to technical variations, population structures and experimental conditions. Such factors may mask the true signals and lead to spurious associations in the respective biological data, making it necessary to adjust confounding factors accordingly. However, existing confounder correction methods were mainly developed based on the original data or the pairwise Euclidean distance, either one of which is inadequate for analyzing different types of data, such as sequencing data. In this work, we proposed a method called Adjustment for Confounding factors using Principal Coordinate Analysis, or AC-PCoA, which reduces data dimension and extracts the information from different distance measures using principal coordinate analysis, and adjusts confounding factors across multiple datasets by minimizing the associations between lower-dimensional representations and confounding variables. Application of the proposed method was further extended to classification and prediction. We demonstrated the efficacy of AC-PCoA on three simulated datasets and five real datasets. Compared to the existing methods, AC-PCoA shows better results in visualization, statistical testing, clustering, and classification.
With today’s unprecedented amount of data, researchers are challenged by the need to enhance meaningful signals without the interference of unwanted confounders hidden inside the data. Data visualization is an important step toward exploring and explaining data in order to intuitively identify the dominant patterns. Principal coordinate analysis (PCoA), as a visualization tool, allows flexible ways to define pairwise distances and project the samples into lower dimensions without changing the distances. However, when visualizing large-scale biological datasets, the true patterns are often hindered by unwanted confounding variations, either biologically or technically in origin.
To eliminate these confounding factors and recover underlying signals, we proposed a method called Adjustment for Confounding factors using Principal Coordinate Analysis, or AC-PCoA, and showed that it significantly outperforms existing methods in visualization through three simulation studies and five real datasets. We further showed that the low-dimensional representations given by AC-PCoA provide promising results in statistical testing, clustering, and classification as well.
Affiliation(s)
- Yu Wang
  - School of Mathematical Sciences, Fudan University, Shanghai, China
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China
- Fengzhu Sun
  - Quantitative and Computational Biology Department, University of Southern California, Los Angeles, California, United States of America
- Wei Lin
  - School of Mathematical Sciences, Fudan University, Shanghai, China
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China
  - State Key Laboratory of Medical Neurobiology, MOE Frontiers Center for Brain Science, and Institutes of Brain Science, Fudan University, Shanghai, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, China
  - Key Laboratory of Mathematics for Nonlinear Science (Fudan University), Ministry of Education, Shanghai, China
  - Shanghai Key Laboratory for Contemporary Applied Mathematics (Fudan University), Shanghai, China
- Shuqin Zhang
  - School of Mathematical Sciences, Fudan University, Shanghai, China
  - Key Laboratory of Mathematics for Nonlinear Science (Fudan University), Ministry of Education, Shanghai, China
  - Shanghai Key Laboratory for Contemporary Applied Mathematics (Fudan University), Shanghai, China
9
An accurate alignment-free protein sequence comparator based on physicochemical properties of amino acids. Sci Rep 2022; 12:11158. [PMID: 35778592 PMCID: PMC9247937 DOI: 10.1038/s41598-022-15266-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2021] [Accepted: 06/21/2022] [Indexed: 11/08/2022] Open
Abstract
Bio-sequence comparators are one of the most basic and significant methods for assessing biological data, and so, due to the importance of proteins, protein sequence comparators are particularly crucial. On the other hand, the complexity of the problem, the growing number of extracted protein sequences, and the growth of studies and data analysis applications addressing protein sequences have necessitated the development of a rapid and accurate approach to account for the complexities in this field. As a result, we propose a protein sequence comparison approach, called PCV, which improves comparison accuracy by producing vectors that encode sequence data as well as physicochemical properties of the amino acids. At the same time, by partitioning the long protein sequences into fixed-length blocks and providing an encoding vector for each block, this method allows for parallel and fast implementation. To evaluate the performance of PCV, like other alignment-free methods, we used 12 benchmark datasets including classes with homologous sequences which may require a simple preprocessing search tool to select the homologous data. We then compared the protein sequence comparison outcomes to those of alternative alignment-based and alignment-free methods, using various evaluation criteria. These results indicate that our method provides significant improvement in sequence classification accuracy compared to the alternative alignment-free methods and has an average correlation of about 94% with the ClustalW method as our reference method, while considerably reducing the processing time.
10
Kröber E, Kanukollu S, Wende S, Bringel F, Kolb S. A putatively new family of alphaproteobacterial chloromethane degraders from a deciduous forest soil revealed by stable isotope probing and metagenomics. Environmental Microbiome 2022; 17:24. [PMID: 35527282 PMCID: PMC9080209 DOI: 10.1186/s40793-022-00416-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2021] [Accepted: 04/19/2022] [Indexed: 06/14/2023]
Abstract
BACKGROUND: Chloromethane (CH₃Cl) is the most abundant halogenated organic compound in the atmosphere and substantially responsible for the destruction of the stratospheric ozone layer. Since anthropogenic CH₃Cl sources have become negligible with the application of the Montreal Protocol (1987), natural sources, such as vegetation and soils, have increased proportionally in the global budget. CH₃Cl-degrading methylotrophs occurring in soils might be an important and overlooked sink.
RESULTS AND CONCLUSIONS: The objective of our study was to link the biotic CH₃Cl sink with the identity of active microorganisms and their biochemical pathways for CH₃Cl degradation in a deciduous forest soil. When tested in laboratory microcosms, biological CH₃Cl consumption occurred in leaf litter, senescent leaves, and organic and mineral soil horizons. Highest consumption rates, around 2 mmol CH₃Cl g⁻¹ dry weight h⁻¹, were measured in organic soil and senescent leaves, suggesting that top soil layers are active (micro-)biological CH₃Cl degradation compartments of forest ecosystems. The DNA of these [¹³C]-CH₃Cl-degrading microbial communities was labelled using stable isotope probing (SIP), and the corresponding taxa and their metabolic pathways studied using high-throughput metagenomics sequencing analysis. [¹³C]-labelled Metagenome-Assembled Genome closely related to the family Beijerinckiaceae may represent a new methylotroph family of Alphaproteobacteria, which is found in metagenome databases of forest soils samples worldwide. Gene markers of the only known pathway for aerobic CH₃Cl degradation, via the methyltransferase system encoded by the CH₃Cl utilisation genes (cmu), were undetected in the DNA-SIP metagenome data, suggesting that biological CH₃Cl sink in this deciduous forest soil operates by a cmu-independent metabolism.
Affiliation(s)
- Eileen Kröber
  - Max-Planck Institute for Marine Microbiology, Celsiusstrasse 1, 28359 Bremen, Germany
  - Microbial Biogeochemistry, RA Landscape Functioning, ZALF Leibniz Centre for Agricultural Landscape Research, Müncheberg, Germany
- Saranya Kanukollu
  - Microbial Biogeochemistry, RA Landscape Functioning, ZALF Leibniz Centre for Agricultural Landscape Research, Müncheberg, Germany
- Sonja Wende
  - Microbial Biogeochemistry, RA Landscape Functioning, ZALF Leibniz Centre for Agricultural Landscape Research, Müncheberg, Germany
- Françoise Bringel
  - Génétique Moléculaire, Génomique, Microbiologie (GMGM), Université de Strasbourg, UMR 7156 CNRS, Strasbourg, France
- Steffen Kolb
  - Microbial Biogeochemistry, RA Landscape Functioning, ZALF Leibniz Centre for Agricultural Landscape Research, Müncheberg, Germany
  - Thaer Institute, Faculty of Life Sciences, Humboldt University of Berlin, Berlin, Germany
11
Aledo JC. Phylogenies from unaligned proteomes using sequence environments of amino acid residues. Sci Rep 2022; 12:7497. [PMID: 35523825 PMCID: PMC9076898 DOI: 10.1038/s41598-022-11370-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2022] [Accepted: 04/21/2022] [Indexed: 11/09/2022] Open
Abstract
Alignment-free methods for sequence comparison and phylogeny inference have attracted a great deal of attention in recent years. Several algorithms have been implemented in diverse software packages. Despite the great number of existing methods, most of them are based on word statistics. Although they propose different filtering and weighting strategies and explore different metrics, their performance may be limited by the phylogenetic signal preserved in these words. Herein, we present a different approach based on the species-specific amino acid neighborhood preferences. These differential preferences can be assessed in the context of vector spaces. In this way, a distance-based method to build phylogenies has been developed and implemented into an easy-to-use R package. Tests run on real-world datasets show that this method can reconstruct phylogenetic relationships with high accuracy, and often outperforms other alignment-free approaches. Furthermore, we present evidence that the new method can perform reliably on datasets formed by non-orthologous protein sequences, that is, the method not only does not require the identification of orthologous proteins, but also does not require their presence in the analyzed dataset. These results suggest that the neighborhood preference of amino acids conveys a phylogenetic signal that may be of great utility in phylogenomics.
Affiliation(s)
- Juan Carlos Aledo
  - Department of Molecular Biology and Biochemistry, University of Málaga, 29071, Málaga, Spain
12
Xue Y, Bao Y, Zhang Z, Zhao W, Xiao J, He S, Zhang G, Li Y, Zhao G, Chen R, Zeng J, Zhang Y, Shang Y, Mai J, Shi S, Lu M, Bu C, Zhang Z, Du Z, Xiao J, Wang Y, Kang H, Xu T, Hao L, Bao Y, Jia P, Jiang S, Qian Q, Zhu T, Shang Y, Zong W, Jin T, Zhang Y, Zou D, Bao Y, Xiao J, Zhang Z, Jiang S, Du Q, Feng C, Ma L, Zhang S, Wang A, Dong L, Wang Y, Zou D, Zhang Z, Liu W, Yan X, Ling Y, Zhao G, Zhou Z, Zhang G, Kang W, Jin T, Zhang T, Ma S, Yan H, Liu Z, Ji Z, Cai Y, Wang S, Song M, Ren J, Zhou Q, Qu J, Zhang W, Bao Y, Liu G, Chen X, Chen T, Zhang S, Sun Y, Yu C, Tang B, Zhu J, Dong L, Zhai S, Sun Y, Chen Q, Yang X, Zhang X, Sang Z, Wang Y, Zhao Y, Chen H, Lan L, Wang Y, Zhao W, Ma Y, Jia Y, Zheng X, Chen M, Zhang Y, Zou D, Zhu T, Xu T, Chen M, Niu G, Zong W, Pan R, Jing W, Sang J, Liu C, Xiong Y, Sun Y, Zhai S, Chen H, Zhao W, Xiao J, Bao Y, Hao L, Zhang M, Wang G, Zou D, Yi L, Zhao W, Zong W, Wu S, Xiong Z, Li R, Zong W, Kang H, Xiong Z, Ma Y, Jin T, Gong Z, Yi L, Zhang M, Wu S, Wang G, Li R, Liu L, Li Z, Liu C, Zou D, Li Q, Feng C, Jing W, Luo S, Ma L, Wang J, Shi Y, Zhou H, Zhang P, Song T, Li Y, He S, Xiong Z, Yang F, Li M, Zhao W, Wang G, Li Z, Ma Y, Zou D, Zong W, Kang H, Jia Y, Zheng X, Li R, Tian D, Liu X, Li C, Teng X, Song S, Liu L, Zhang Y, Niu G, Li Q, Li Z, Zhu T, Feng C, Liu X, Zhang Y, Xu T, Chen R, Teng X, Zhang R, Zou D, Ma L, Xu F, Wang Y, Ling Y, Zhou C, Wang H, Teschendorff AE, He Y, Zhang G, Yang Z, Song S, Ma L, Zou D, Tian D, Li C, Zhu J, Li L, Li N, Gong Z, Chen M, Wang A, Ma Y, Teng X, Cui Y, Duan G, Zhang M, Jin T, Wu G, Huang T, Jin E, Zhao W, Kang H, Wang Z, Du Z, Zhang Y, Li R, Zeng J, Hao L, Jiang S, Chen H, Li M, Xiao J, Zhang Z, Zhao W, Xue Y, Bao Y, Ning W, Xue Y, Tang B, Liu Y, Sun Y, Duan G, Cui Y, Zhou Q, Dong L, Jin E, Liu X, Zhang L, Mao B, Zhang S, Zhang Y, Wang G, Zhao W, Wang Z, Zhu Q, Li X, Zhu J, Tian D, Kang H, Li C, Zhang S, Song S, Li M, Zhao W, Liu Y, Wang Z, Luo H, Zhu J, Wu X, Tian D, Li C, Zhao W, Jing H, Zhu J, Tang B, 
Zou D, Liu L, Pan Y, Liu C, Chen M, Liu X, Zhang Y, Li Z, Feng C, Du Q, Chen R, Zhu T, Ma L, Zou D, Jiang S, Zhang Z, Gong Z, Zhu J, Li C, Jiang S, Ma L, Tang B, Zou D, Chen M, Sun Y, Shi L, Song S, Zhang Z, Li M, Xiao J, Xue Y, Bao Y, Du Z, Zhao W, Li Z, Du Q, Jiang S, Ma L, Zhang Z, Xiong Z, Li M, Zou D, Zong W, Li R, Chen M, Du Z, Zhao W, Bao Y, Ma Y, Zhang X, Lan L, Xue Y, Bao Y, Jiang S, Feng C, Zhao W, Xiao J, Bao Y, Zhang Z, Zuo Z, Ren J, Zhang X, Xiao Y, Li X, Zhang X, Xiao Y, Li X, Liu D, Zhang C, Xue Y, Zhao Z, Jiang T, Wu W, Zhao F, Meng X, Chen M, Peng D, Xue Y, Luo H, Gao F, Ning W, Xue Y, Lin S, Xue Y, Liu C, Guo A, Yuan H, Su T, Zhang YE, Zhou Y, Chen M, Guo G, Fu S, Tan X, Xue Y, Zhang W, Xue Y, Luo M, Guo A, Xie Y, Ren J, Zhou Y, Chen M, Guo G, Wang C, Xue Y, Liao X, Gao X, Wang J, Xie G, Guo A, Yuan C, Chen M, Tian F, Yang D, Gao G, Tang D, Xue Y, Wu W, Chen M, Gou Y, Han C, Xue Y, Cui Q, Li X, Li CY, Luo X, Ren J, Zhang X, Xiao Y, Li X. Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2022. Nucleic Acids Res 2022; 50:D27-D38. [PMID: 34718731 PMCID: PMC8728233 DOI: 10.1093/nar/gkab951] [Citation(s) in RCA: 297] [Impact Index Per Article: 148.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2021] [Revised: 09/29/2021] [Accepted: 10/08/2021] [Indexed: 12/21/2022] Open
Abstract
The National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB), provides a family of database resources to support global research in both academia and industry. With multi-omics data accumulating at ever-faster rates, CNCB-NGDC is constantly scaling up and updating its core database resources through big data archiving, curation, integration and analysis. In the past year, efforts have been made to synthesize the growing data and knowledge, particularly in single-cell omics and precision medicine research, and a series of resources have been newly developed, updated and enhanced. Moreover, CNCB-NGDC has continued its daily updates of SARS-CoV-2 genome sequences, variants, haplotypes and literature. In particular, OpenLB, an open library of bioscience, has been established to provide easy and open access to a substantial number of abstract texts from PubMed, bioRxiv and medRxiv. In addition, Database Commons has been significantly updated by cataloguing a full list of global databases, and BLAST tools have been newly deployed to provide online sequence search services. All these resources, along with their services, are publicly accessible at https://ngdc.cncb.ac.cn.
|
13
|
Chen NWG, Ruh M, Darrasse A, Foucher J, Briand M, Costa J, Studholme DJ, Jacques M. Common bacterial blight of bean: a model of seed transmission and pathological convergence. MOLECULAR PLANT PATHOLOGY 2021; 22:1464-1480. [PMID: 33942466 PMCID: PMC8578827 DOI: 10.1111/mpp.13067] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/21/2020] [Revised: 03/22/2021] [Accepted: 03/22/2021] [Indexed: 05/31/2023]
Abstract
BACKGROUND Xanthomonas citri pv. fuscans (Xcf) and Xanthomonas phaseoli pv. phaseoli (Xpp) are the causal agents of common bacterial blight of bean (CBB), an important disease worldwide that remains difficult to control. These pathogens belong to distinct species within the Xanthomonas genus and have undergone a dynamic evolutionary history including the horizontal transfer of genes encoding factors probably involved in adaptation to and pathogenicity on common bean. Seed transmission is a key point of the CBB disease cycle, favouring both vertical transmission of the pathogen and worldwide distribution of the disease through global seed trade. TAXONOMY Kingdom: Bacteria; phylum: Proteobacteria; class: Gammaproteobacteria; order: Lysobacterales (also known as Xanthomonadales); family: Lysobacteraceae (also known as Xanthomonadaceae); genus: Xanthomonas; species: X. citri pv. fuscans and X. phaseoli pv. phaseoli (Xcf-Xpp). HOST RANGE The main host of Xcf-Xpp is the common bean (Phaseolus vulgaris). Lima bean (Phaseolus lunatus) and members of the Vigna genus (Vigna aconitifolia, Vigna angularis, Vigna mungo, Vigna radiata, and Vigna umbellata) are also natural hosts of Xcf-Xpp. Natural occurrence of Xcf-Xpp has been reported for a handful of other legumes such as Calopogonium sp., Pueraria sp., pea (Pisum sativum), Lablab purpureus, Macroptilium lathyroides, and Strophostyles helvola. There are conflicting reports concerning the natural occurrence of CBB agents on tepary bean (Phaseolus acutifolius) and cowpea (Vigna unguiculata subsp. unguiculata). SYMPTOMS CBB symptoms occur on all aerial parts of beans, that is, seedlings, leaves, stems, pods, and seeds. Symptoms initially appear as water-soaked spots evolving into necrosis on leaves, pustules on pods, and cankers on twigs. In severe infections, defoliation and wilting may occur. 
DISTRIBUTION CBB is distributed worldwide, meaning that it is frequently encountered in most places where bean is cultivated in the Americas, Asia, Africa, and Oceania, except for arid tropical areas. Xcf-Xpp are regulated nonquarantine pathogens in Europe and are listed in the A2 list by the European and Mediterranean Plant Protection Organization (EPPO). GENOME The genome consists of a single circular chromosome plus one to four extrachromosomal plasmids of various sizes, for a total mean size of 5.27 Mb with 64.7% GC content and an average predicted number of 4,181 coding sequences. DISEASE CONTROL Management of CBB is based on integrated approaches that comprise measures aimed at avoiding Xcf-Xpp introduction through infected seeds, cultural practices to limit Xcf-Xpp survival between host crops, whenever possible the use of tolerant or resistant bean genotypes, and chemical treatments, mainly restricted to copper compounds. The use of pathogen-free seeds is essential in an effective management strategy and requires appropriate sampling, detection, and identification methods. USEFUL WEBSITES: https://gd.eppo.int/taxon/XANTPH, https://gd.eppo.int/taxon/XANTFF, and http://www.cost.eu/COST_Actions/ca/CA16107.
Affiliation(s)
- Nicolas W. G. Chen
- Univ Angers, Institut Agro, INRAE, IRHS, SFR QUASAV, F‐49000 Angers, France
| | - Mylène Ruh
- Univ Angers, Institut Agro, INRAE, IRHS, SFR QUASAV, F‐49000 Angers, France
| | - Armelle Darrasse
- Univ Angers, Institut Agro, INRAE, IRHS, SFR QUASAV, F‐49000 Angers, France
| | - Justine Foucher
- Univ Angers, Institut Agro, INRAE, IRHS, SFR QUASAV, F‐49000 Angers, France
| | - Martial Briand
- Univ Angers, Institut Agro, INRAE, IRHS, SFR QUASAV, F‐49000 Angers, France
| | - Joana Costa
- University of Coimbra, Centre for Functional Ecology ‐ Science for People & the Planet, Department of Life Sciences, Coimbra, Portugal
| | - David J. Studholme
- Biosciences, College of Life and Environmental Sciences, University of Exeter, Exeter, UK
| | | |
|
14
|
Wu YQ, Yu ZG, Tang RB, Han GS, Anh VV. An Information-Entropy Position-Weighted K-Mer Relative Measure for Whole Genome Phylogeny Reconstruction. Front Genet 2021; 12:766496. [PMID: 34745231 PMCID: PMC8568955 DOI: 10.3389/fgene.2021.766496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2021] [Accepted: 09/29/2021] [Indexed: 11/30/2022] Open
Abstract
Alignment-based methods are at a disadvantage in sequence comparison and phylogeny reconstruction because of their high computational cost in both time and space. Alignment-free methods, on the other hand, incur low computational costs and have recently gained popularity in the field of bioinformatics. Here we propose a new alignment-free method for phylogenetic tree reconstruction based on whole genome sequences. A key component is a measure called the information-entropy position-weighted k-mer relative measure (IEPWRMkmer), which combines the position-weighted measure of k-mers proposed by our group with the information entropy of k-mer frequencies. The Manhattan distance is used to calculate the pairwise distance between species. Finally, we use the Neighbor-Joining method to construct the phylogenetic tree. To evaluate the performance of this method, we perform phylogenetic analysis on two datasets used by other researchers. The results demonstrate that the IEPWRMkmer method is efficient and reliable. The source code of our method is available at https://github.com/wuyaoqun37/IEPWRMkmer.
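The distance step mentioned in this abstract can be illustrated with a minimal sketch. It is not the IEPWRMkmer measure itself: it assumes plain relative k-mer frequencies as the feature vector (no position weighting, no entropy term), and the function names `kmer_frequencies`, `manhattan` and `distance_matrix` are hypothetical. It only shows how a Manhattan distance matrix between sequences might be built before a tree method such as Neighbor-Joining is applied.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Return relative frequencies of all 4**k DNA k-mers, in a fixed order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def manhattan(u, v):
    """Sum of absolute coordinate differences between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_matrix(seqs, k=2):
    """Pairwise Manhattan distances between k-mer frequency vectors.

    Assumption: plain frequencies stand in for the weighted measure
    described in the abstract.
    """
    vecs = [kmer_frequencies(s, k) for s in seqs]
    return [[manhattan(a, b) for b in vecs] for a in vecs]
```

The resulting square matrix is the kind of input a Neighbor-Joining implementation would expect; identical sequences give a distance of zero, and sequences with no k-mers in common give the maximum value of 2.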
Affiliation(s)
- Yao-Qun Wu
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan, China; Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang, China
| | - Zu-Guo Yu
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan, China
| | - Run-Bin Tang
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan, China
| | - Guo-Sheng Han
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan, China
| | - Vo V Anh
- Faculty of Science, Engineering and Technology, Swinburne University of Technology, Hawthorn, VIC, Australia
| |
|
15
|
Geptop 2.0: Accurately Select Essential Genes from the List of Protein-Coding Genes in Prokaryotic Genomes. Methods Mol Biol 2021. [PMID: 34709630 DOI: 10.1007/978-1-0716-1720-5_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2023]
Abstract
Computational tools offer an alternative way to identify essential genes that is low-cost and time-efficient. Using the experimental essentiality sets deposited in the DEG and OGEE databases as reference, we developed an automated computational tool named Geptop to select essential genes from the set of protein-coding genes in a prokaryotic genome. It uses a reciprocal best hit strategy for homology search and evolutionary distance for weight assignment. The latest version of Geptop is 2.0 (http://guolab.whu.edu.cn/geptop), which predicts gene essentiality with a mean AUC of 0.84 in prokaryotes and is more stable. This chapter briefly introduces the tool and explains how to use it.
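The reciprocal best hit idea named in this abstract can be sketched generically as follows. This is not Geptop's code: the score dictionaries, gene identifiers and the function names `best_hits` and `reciprocal_best_hits` are hypothetical, and the similarity scores are assumed to come from some external homology search.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score
    (assumed here to be "higher is better").
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)
```

A gene whose best hit points elsewhere in the reverse search is dropped, which is what makes the criterion stricter than a one-way best hit.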
|
16
|
Ramanathan N, Ramamurthy J, Natarajan G. Numerical Characterization of DNA Sequences for Alignment-free Sequence Comparison - A Review. Comb Chem High Throughput Screen 2021; 25:365-380. [PMID: 34382516 DOI: 10.2174/1386207324666210811101437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2020] [Revised: 06/16/2021] [Accepted: 06/24/2021] [Indexed: 11/22/2022]
Abstract
BACKGROUND Biological macromolecules, namely DNA, RNA, and protein, have their building blocks organized in a particular sequence, and this sequential arrangement encodes the evolutionary history of the organism (species). Hence, biological sequences have been used for studying evolutionary relationships among species. This is usually carried out by multiple sequence alignment (MSA). Due to certain limitations of MSA, alignment-free sequence comparison methods were developed. The present review covers alignment-free sequence comparison methods based on numerical characterization of DNA sequences. DISCUSSION The graphical representation of DNA sequences by chaos game representation and other 2-dimensional and 3-dimensional methods is discussed. The evolution of numerical characterization from the various graphical representations and the application of the DNA invariants thus computed in phylogenetic analysis are presented. The extension of computing molecular descriptors in chemometrics to the calculation of a new set of DNA invariants, and their use in alignment-free sequence comparison in an N-dimensional space and in the construction of phylogenetic trees, is also reviewed. CONCLUSION The phylogenetic trees constructed by the alignment-free sequence comparison methods using DNA invariants were found to be better than those constructed using alignment-based tools such as PHYLIP and ClustalW. One of the graphical representation methods has now been extended to study viral sequences of infectious diseases, identifying conserved regions for the design of peptide-based vaccines by combining numerical characterization and graphical representation.
Affiliation(s)
- Natarajan Ramanathan
- Department of Chemistry, Sri Sarada Niketan College for Women, Karur-639005, Tamil Nadu. India
| | - Jayalakshmi Ramamurthy
- Department of Computer Science, Sri Sarada Niketan College for Women, Karur-639005, Tamil Nadu. India
| | - Ganapathy Natarajan
- Department of Mechanical Engineering and Industrial Engineering, University of Wisconsin, Platteville, WI 53818. United States
| |
|
17
|
CVTree: A Parallel Alignment-free Phylogeny and Taxonomy Tool based on Composition Vectors of Genomes. GENOMICS PROTEOMICS & BIOINFORMATICS 2021; 19:662-667. [PMID: 34119695 PMCID: PMC9040009 DOI: 10.1016/j.gpb.2021.03.006] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/22/2020] [Revised: 02/23/2021] [Accepted: 03/06/2021] [Indexed: 11/21/2022]
Abstract
CVTree is an alignment-free algorithm to infer phylogenetic relationships from genome sequences. It has been successfully applied to study the phylogeny and taxonomy of viruses, prokaryotes, and fungi based on whole genomes, as well as chloroplasts, mitochondria, and metagenomes. Here we present the standalone software for the CVTree algorithm. In the software, an extensible parallel workflow for the CVTree algorithm was designed. Based on the workflow, new alignment-free methods were also implemented. By examining the phylogeny and taxonomy of 13,903 prokaryotes based on 16S rRNA sequences, we show that the CVTree software is an efficient and effective tool for studying phylogeny and taxonomy based on genome sequences. Code availability: https://github.com/ghzuo/cvtree.
|
18
|
Muggia L, Ametrano CG, Sterflinger K, Tesei D. An Overview of Genomics, Phylogenomics and Proteomics Approaches in Ascomycota. Life (Basel) 2020; 10:E356. [PMID: 33348904 PMCID: PMC7765829 DOI: 10.3390/life10120356] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2020] [Revised: 12/10/2020] [Accepted: 12/12/2020] [Indexed: 12/26/2022] Open
Abstract
Fungi are among the most successful eukaryotes on Earth: they have evolved strategies to survive in the most diverse environments and stressful conditions and have been selected and exploited for multiple aims by humans. The characteristic features intrinsic to Fungi have required evolutionary changes and adaptations at deep molecular levels. Omics approaches, nowadays including genomics, metagenomics, phylogenomics, transcriptomics, metabolomics, and proteomics, have enormously advanced the understanding of fungal diversity at diverse taxonomic levels, under changeable conditions and in still under-investigated environments. These approaches can be applied both to environmental communities and to individual organisms, either in nature or in axenic culture, and have led traditional morphology-based fungal systematics to increasingly implement molecular-based approaches. The advent of next-generation sequencing technologies was key to boosting advances in fungal genomics and proteomics research. Much effort has also been directed towards the development of methodologies for optimal genomic DNA and protein extraction and separation. To date, the number of proteomics investigations in Ascomycetes exceeds that carried out in any other fungal group. This is primarily due to the preponderance of their involvement in plant and animal diseases and multiple industrial applications, and therefore the need to understand the biological basis of the infectious process to develop mechanisms for biological control, as well as to detect key proteins with roles in stress survival. Here we present an overview, as comprehensive as possible, of the major advances, mainly of the past decade, in the fields of genomics (including phylogenomics) and proteomics of Ascomycota, focusing particularly on those reporting on opportunistic pathogenic, extremophilic, polyextremotolerant and lichenized fungi. We also present a review of the most widely used genome sequencing technologies and methods for DNA sequence and protein analyses applied so far to fungi.
Affiliation(s)
- Lucia Muggia
- Department of Life Sciences, University of Trieste, 34127 Trieste, Italy
| | - Claudio G. Ametrano
- Grainger Bioinformatics Center, Department of Science and Education, The Field Museum, Chicago, IL 60605, USA
| | - Katja Sterflinger
- Academy of Fine Arts Vienna, Institute of Natural Sciences and Technology in the Arts, 1090 Vienna, Austria
| | - Donatella Tesei
- Department of Biotechnology, University of Natural Resources and Life Sciences, 1190 Vienna, Austria
| |
|
19
|
Song K. Classifying the Lifestyle of Metagenomically-Derived Phages Sequences Using Alignment-Free Methods. Front Microbiol 2020; 11:567769. [PMID: 33304326 PMCID: PMC7693541 DOI: 10.3389/fmicb.2020.567769] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2020] [Accepted: 10/22/2020] [Indexed: 01/20/2023] Open
Abstract
Phages are viruses that infect bacteria. They can be classified into two categories based on their lifestyles: temperate and lytic. Metavirome sequencing can now generate a large number of fragments from the viral genomic sequences of an entire environmental community, which makes it impossible to determine their lifestyles through experiments. Thus, there is a need to develop computational methods for annotating phage contigs and predicting their lifestyles. Alignment-based methods for classifying phage lifestyle are limited by incompletely assembled genomes and nucleotide databases. Alignment-free methods based on the frequencies of k-mers have been widely used for genome and metagenome comparison and do not rely on the completeness of genomes or nucleotide databases. To mimic fragmented metagenomic sequences, the genomic sequences of temperate and lytic phages were split into non-overlapping fragments of different lengths; I then comprehensively compared nine alignment-free dissimilarity measures, with a wide range of choices of k-mer length and Markov order, for predicting the lifestyles of these phage contigs. The dissimilarity measure d2S performed better than the other dissimilarity measures for classifying the lifestyles of phages. Thus, I propose that the alignment-free method d2S can be used for predicting the lifestyles of phages derived from metagenomic data.
Affiliation(s)
- Kai Song
- School of Mathematics and Statistics, Qingdao University, Qingdao, China
| |
|
20
|
Yang Z, Li H, Jia Y, Zheng Y, Meng H, Bao T, Li X, Luo L. Intrinsic laws of k-mer spectra of genome sequences and evolution mechanism of genomes. BMC Evol Biol 2020; 20:157. [PMID: 33228538 PMCID: PMC7684957 DOI: 10.1186/s12862-020-01723-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2020] [Accepted: 11/10/2020] [Indexed: 11/17/2022] Open
Abstract
Background K-mer spectra of DNA sequences contain important information about sequence composition and sequence evolution. We want to reveal the evolution rules of genome sequences by studying the k-mer spectra of genome sequences. Results The intrinsic laws of k-mer spectra of 920 genome sequences from primate to prokaryote were analyzed. We found that there are two types of evolution selection modes in genome sequences, named as CG Independent Selection and TA Independent Selection. There is a mutual inhibition relationship between CG and TA independent selections. We found that the intensity of CG and TA independent selections correlates closely with genome evolution and G + C content of genome sequences. The living habits of species are related closely to the independent selection modes adopted by species genomes. Consequently, we proposed an evolution mechanism of genomes in which the genome evolution is determined by the intensities of the CG and TA independent selections and the mutual inhibition relationship. Besides, by the evolution mechanism of genomes, we speculated the evolution modes of prokaryotes in mild and extreme environments in the anaerobic age and the evolving process of prokaryotes from anaerobic to aerobic environment on earth as well as the originations of different eukaryotes. Conclusion We found that there are two independent selection modes in genome sequences. The evolution of genome sequence is determined by the two independent selection modes and the mutual inhibition relationship between them.
Affiliation(s)
- Zhenhua Yang
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China; School of Economics and Management, Inner Mongolia University of Science & Technology, Baotou, 014010, China
| | - Hong Li
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Yun Jia
- College of Science, Inner Mongolia University of Technology, Hohhot, 010051, China
| | - Yan Zheng
- Baotou Medical College, Inner Mongolia University of Science & Technology, Baotou, 014040, China
| | - Hu Meng
- School of Life Science & Technology, Inner Mongolia University of Science & Technology, Baotou, 014010, China
| | - Tonglaga Bao
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Xiaolong Li
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Liaofu Luo
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
| |
|
21
|
Bonnici V, Maresi E, Giugno R. Challenges in gene-oriented approaches for pangenome content discovery. Brief Bioinform 2020; 22:5901976. [PMID: 32893299 DOI: 10.1093/bib/bbaa198] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2020] [Revised: 05/14/2020] [Accepted: 08/04/2020] [Indexed: 01/17/2023] Open
Abstract
Given a group of genomes, represented as the sets of genes that belong to them, the discovery of the pangenomic content is based on the search for genetic homology among the genes in order to cluster them into families. Thus, pangenomic analyses investigate the membership of the families in the given genomes. This approach is referred to as the gene-oriented approach, in contrast to other definitions of the problem that take into account different genomic features. In the past years, several tools have been developed to discover and analyse pangenomic contents. Because of the hardness of the problem, each tool applies a different strategy for discovering the pangenomic content. As a result, the performance of each tool depends on the composition of the input genomes. This review reports the main analysis instruments provided by the current state-of-the-art tools for the discovery of pangenomic contents. Moreover, unlike previous works, the presented study compares pangenomic tools from a methodological perspective, analysing the causes that lead a given methodology to outperform other tools. The analysis is performed by taking into account different bacterial populations, which are synthetically generated by changing evolutionary parameters. The benchmarks used to compare the pangenomic tools, in addition to the computational pipeline developed for this purpose, are available at https://github.com/InfOmics/pangenes-review. Contact: V. Bonnici, R. Giugno Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
Affiliation(s)
| | - Emiliano Maresi
- The Microsoft Research, University of Trento Centre for Computational and Systems Biology
| | - Rosalba Giugno
- Computer Science and Bioinformatics, referent of the Master Degree in Medical Bioinformatics
| |
|
22
|
Systematic Analysis of REBASE Identifies Numerous Type I Restriction-Modification Systems with Duplicated, Distinct hsdS Specificity Genes That Can Switch System Specificity by Recombination. mSystems 2020; 5:5/4/e00497-20. [PMID: 32723795 PMCID: PMC7394358 DOI: 10.1128/msystems.00497-20] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022] Open
Abstract
N6-Adenine DNA methyltransferases associated with some Type I and Type III restriction-modification (R-M) systems are able to undergo phase variation, randomly switching expression ON or OFF by varying the length of locus-encoded simple sequence repeats (SSRs). This variation of methyltransferase expression results in genome-wide methylation differences and global changes in gene expression. These epigenetic regulatory systems are called phasevarions, phase-variable regulons, and are widespread in bacteria. A distinct switching system has also been described in Type I R-M systems, based on recombination-driven changes in hsdS genes, which dictate the DNA target site. In order to determine the prevalence of recombination-driven phasevarions, we generated a program called RecombinationRepeatSearch to interrogate REBASE and identify the presence and number of inverted repeats of hsdS downstream of Type I R-M loci. We report that 3.9% of Type I R-M systems have duplicated variable hsdS genes containing inverted repeats capable of phase variation. 
We report the presence of these systems in the major pathogens Enterococcus faecalis and Listeria monocytogenes, which could have important implications for pathogenesis and vaccine development. These data suggest that in addition to SSR-driven phasevarions, many bacteria have independently evolved phase-variable Type I R-M systems via recombination between multiple, variable hsdS genes. IMPORTANCE Many bacterial species contain DNA methyltransferases that have random on/off switching of expression. These systems, called phasevarions (phase-variable regulons), control the expression of multiple genes by global methylation changes. In every previously characterized phasevarion, genes involved in pathobiology, antibiotic resistance, and potential vaccine candidates are randomly varied in their expression, commensurate with methyltransferase switching. Our systematic study to determine the extent of phasevarions controlled by invertible Type I R-M systems will provide valuable information for understanding how bacteria regulate genes and is key to the study of physiology, virulence, and vaccine development; therefore, it is critical to identify and characterize phase-variable methyltransferases controlling phasevarions.
|
23
|
Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters. Proc Natl Acad Sci U S A 2020; 117:16961-16968. [PMID: 32641514 PMCID: PMC7382288 DOI: 10.1073/pnas.1903436117] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with sequencing errors and polymorphisms. In response, some tools have been augmented with spaced seeds, which are capable of tolerating mismatches. However, spaced seeds have seen little practical use in classification because they bring increased computational and memory costs compared to methods that use k-mers. These limitations have also caused the design and length of practical spaced seeds to be constrained, since storing spaced seeds can be costly. To address these challenges, we have designed a probabilistic data structure called a multiindex Bloom Filter (miBF), which can store multiple spaced seed sequences with a low memory cost that remains static regardless of seed length or seed design. We formalize how to minimize the false-positive rate of miBFs when classifying sequences from multiple targets or references. Available within BioBloom Tools, we illustrate the utility of miBF in two use cases: read-binning for targeted assembly, and taxonomic read assignment. In our benchmarks, an analysis pipeline based on miBF shows higher sensitivity and specificity for read-binning than sequence alignment-based methods, also executing in less time. Similarly, for taxonomic classification, miBF enables higher sensitivity than a conventional spaced seed-based approach, while using half the memory and an order of magnitude less computational time.
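The mismatch tolerance of spaced seeds described in this abstract can be illustrated with a minimal sketch. It does not implement the multiindex Bloom filter or any part of BioBloom Tools: it uses ordinary Python sets, and the seed pattern, the sequences and the function names `spaced_words` and `shared_words` are hypothetical choices for illustration.

```python
def spaced_words(seq, seed):
    """Yield the spaced word at every position of `seq`.

    `seed` is a string of '1' (position must match) and '0' (don't care).
    Only the characters under '1' positions are kept.
    """
    span = len(seed)
    keep = [i for i, c in enumerate(seed) if c == "1"]
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        yield "".join(window[i] for i in keep)


def shared_words(a, b, seed):
    """Spaced words that occur in both sequences under the given seed."""
    return set(spaced_words(a, seed)) & set(spaced_words(b, seed))
```

With the seed `1111` (a contiguous 4-mer), the sequences `ACGTAC` and `ACTTAC` share no words because every window overlaps the single mismatch. With the seed `1101`, the first window ignores the mismatching position, so the two sequences share the spaced word `ACT`.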
|
24
|
Leal NC, Campos TL, Rezende AM, Docena C, Mendes-Marques CL, de Sá Cavalcanti FL, Wallau GL, Rocha IV, Cavalcanti CLB, Veras DL, Alves LR, Andrade-Figueiredo M, de Barros MPS, de Almeida AMP, de Morais MMC, Leal-Balbino TC, Xavier DE, de-Melo-Neto OP. Comparative Genomics of Acinetobacter baumannii Clinical Strains From Brazil Reveals Polyclonal Dissemination and Selective Exchange of Mobile Genetic Elements Associated With Resistance Genes. Front Microbiol 2020; 11:1176. [PMID: 32655514 PMCID: PMC7326025 DOI: 10.3389/fmicb.2020.01176] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2020] [Accepted: 05/08/2020] [Indexed: 12/13/2022] Open
Abstract
Acinetobacter baumannii is an opportunistic bacterial pathogen infecting immunocompromised patients and has gained attention worldwide due to its increased antimicrobial resistance. Here, we report a comparative whole-genome sequencing and analysis coupled with an assessment of antibiotic resistance of 46 Acinetobacter strains (45 A. baumannii plus one Acinetobacter nosocomialis) originated from five hospitals from the city of Recife, Brazil, between 2010 and 2014. An average of 3,809 genes were identified per genome, although only 2,006 genes were single copy orthologs or core genes conserved across all sequenced strains, with an average of 42 new genes found per strain. We evaluated genetic distance through a phylogenetic analysis and MLST as well as the presence of antibiotic resistance genes, virulence markers and mobile genetic elements (MGE). The phylogenetic analysis recovered distinct monophyletic A. baumannii groups corresponding to five known (ST1, ST15, ST25, ST79, and ST113) and one novel ST (ST881, related to ST1). A large number of ST specific genes were found, with the ST79 strains having the largest number of genes in common that were missing from the other STs. Multiple genes associated with resistance to β-lactams, aminoglycosides and other antibiotics were found. Some of those were clearly mapped to defined MGEs and an analysis of those revealed known elements as well as a novel Tn7-Tn3 transposon with a clear ST specific distribution. An association of selected resistance/virulence markers with specific STs was indeed observed, as well as the recent spread of the OXA-253 carbapenemase encoding gene. Virulence genes associated with the synthesis of the capsular antigens were noticeably more variable in the ST113 and ST79 strains. Indeed, several resistance and virulence genes were common to the ST79 and ST113 strains only, despite a greater genetic distance between them, suggesting common means of genetic exchange. 
Our comparative analysis reveals the spread of multiple STs and the genomic plasticity of A. baumannii from different hospitals in a single metropolitan area. It also highlights differences in the spread of resistance markers and other MGEs between the investigated STs, impacting on the monitoring and treatment of Acinetobacter in the ongoing and future outbreaks.
Affiliation(s)
- Nilma C Leal
- Aggeu Magalhães Institute (IAM), Fundação Oswaldo Cruz (Fiocruz), Recife, Brazil
- Túlio L Campos
- Aggeu Magalhães Institute (IAM), Fundação Oswaldo Cruz (Fiocruz), Recife, Brazil
- Antonio M Rezende
- Aggeu Magalhães Institute (IAM), Fundação Oswaldo Cruz (Fiocruz), Recife, Brazil
- Cássia Docena
- Aggeu Magalhães Institute (IAM), Fundação Oswaldo Cruz (Fiocruz), Recife, Brazil
- Felipe L de Sá Cavalcanti
- Aggeu Magalhães Institute (IAM), Fundação Oswaldo Cruz (Fiocruz), Recife, Brazil; Department of Pathology, Institute of Biological Sciences, University of Pernambuco, Recife, Brazil
- Gabriel L Wallau
- Aggeu Magalhães Institute (IAM), Fundação Oswaldo Cruz (Fiocruz), Recife, Brazil
- Igor V Rocha
- Aggeu Magalhães Institute (IAM), Fundação Oswaldo Cruz (Fiocruz), Recife, Brazil
- Dyana L Veras
- Aggeu Magalhães Institute (IAM), Fundação Oswaldo Cruz (Fiocruz), Recife, Brazil
- Lilian R Alves
- Department of Tropical Medicine, Federal University of Pernambuco, Recife, Brazil
- Danilo E Xavier
- Aggeu Magalhães Institute (IAM), Fundação Oswaldo Cruz (Fiocruz), Recife, Brazil
|
25
|
Dong J, Liu S, Zhang Y, Dai Y, Wu Q. A New Alignment-Free Whole Metagenome Comparison Tool and Its Application on Gut Microbiomes of Wild Giant Pandas. Front Microbiol 2020; 11:1061. [PMID: 32612579 PMCID: PMC7309450 DOI: 10.3389/fmicb.2020.01061] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/26/2020] [Accepted: 04/29/2020] [Indexed: 11/13/2022] Open
Abstract
The comparison of metagenomes is crucial for studying the relationship between microbial communities and environmental factors. One recently published alignment-free whole metagenome comparison method based on k-mer frequencies, Libra, showed higher resolutions than the present fastest method, Mash, on whole metagenomic sequencing reads, but it did not perform as well on the assembled contigs. Here, we developed a new alignment-free tool, KmerFreqCalc, for the comparison of the whole metagenomic data, which first calculated the frequencies of both forward and reverse complementary sequences of k-mers like Mash and then computed the cosine distance between the samples based on k-mer frequency vectors like Libra. We applied KmerFreqCalc on the assembled contigs of the gut microbiomes of wild giant pandas and compared the results to Libra and Mash. The results indicated that KmerFreqCalc was able to detect the subtle difference between giant panda samples caused by seasonal diet change, showing better clustering than Libra and Mash. Therefore, KmerFreqCalc has high resolution and accuracy for assembled contigs, being very suitable for comparison of samples with low dissimilarity.
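The two steps described in this abstract (k-mer frequencies that account for both strands, then a cosine distance between frequency vectors) can be sketched as follows. This is a minimal illustration, not KmerFreqCalc's code: merging each k-mer with its reverse complement into a canonical form is an assumption about how "both forward and reverse complementary sequences" are handled, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_frequencies(sequences: list[str], k: int) -> dict[str, float]:
    """Relative frequencies of canonical k-mers over all sequences of a sample."""
    counts: Counter[str] = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers with ambiguous bases
                counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def cosine_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """1 - cosine similarity between two sparse frequency vectors."""
    dot = sum(value * q.get(kmer, 0.0) for kmer, value in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        raise ValueError("cannot compare an empty frequency vector")
    return 1.0 - dot / (norm_p * norm_q)


# Toy contigs (made up): a sequence and its reverse complement look identical
# once k-mers are made strand-independent.
sample_a = kmer_frequencies(["ACGTTGCAAC"], k=3)
sample_b = kmer_frequencies(["GTTGCAACGT"], k=3)
distance_ab = cosine_distance(sample_a, sample_b)
```

Because assembled contigs have arbitrary orientation, strand-independent counting of this kind keeps the distance from depending on which strand a contig happens to be reported on.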
Affiliation(s)
- Jiuhong Dong
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- Shuai Liu
- Institute of Physical Science and Information Technology, Anhui University, Hefei, China; Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Yaran Zhang
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
- Yi Dai
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- Qi Wu
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
|
26
|
Li J, Gu T, Li L, Wu X, Shen L, Yu R, Liu Y, Qiu G, Zeng W. Complete genome sequencing and comparative genomic analyses of Bacillus sp. S3, a novel hyper Sb(III)-oxidizing bacterium. BMC Microbiol 2020; 20:106. [PMID: 32354325 PMCID: PMC7193398 DOI: 10.1186/s12866-020-01737-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/13/2019] [Accepted: 02/25/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Antimonite [Sb(III)]-oxidizing bacteria have great potential in the environmental bioremediation of Sb-polluted sites. Bacillus sp. S3, previously isolated from antimony-contaminated soil, displayed high Sb(III) resistance and Sb(III) oxidation efficiency. However, information on the genome and evolutionary features of Bacillus sp. S3 is very scarce. RESULTS Here, we identified a 5,436,472 bp chromosome with 40.30% GC content and a 241,339 bp plasmid with 36.74% GC content in the complete genome of Bacillus sp. S3. Genomic annotation showed that Bacillus sp. S3 contained a key aioB gene potentially encoding As(III)/Sb(III) oxidase, which was not shared with other Bacillus strains. Furthermore, a wide variety of genes associated with Sb(III) and other heavy metal(loid)s were also identified in Bacillus sp. S3, reflecting its adaptive advantage for growth in the harsh eco-environment. Based on the analysis of phylogenetic relationships and average nucleotide identities (ANI), Bacillus sp. S3 was shown to be a novel species within the Bacillus genus. The majority of mobile genetic elements (MGEs) were distributed on chromosomes within the Bacillus genus. Pan-genome analysis showed that the 45 genomes contained 554 core genes, and many unique genes were identified in the analyzed genomes. Whole-genome alignment showed that the Bacillus genus frequently underwent large-scale evolutionary events. In addition, the origin and evolution analysis of Sb(III)-resistance genes revealed the evolutionary relationships and horizontal gene transfer (HGT) events among the Bacillus genus. The assessment of the functionality of heavy metal(loid) resistance genes emphasized their indispensable role in the harsh eco-environment of the Bacillus genus. Real-time quantitative PCR (RT-qPCR) analysis indicated that Sb(III)-related genes were all induced under Sb(III) stress, while the arsC gene was down-regulated.
CONCLUSIONS The results of this study shed light on the molecular mechanisms by which Bacillus sp. S3 copes with Sb(III), extend our understanding of the evolutionary relationships between Bacillus sp. S3 and other closely related species, and further enrich the genetic data sources for Sb(III) resistance.
Affiliation(s)
- Jiaokun Li
- School of Minerals Processing and Bioengineering, Central South University, Changsha, 410083, China; Key Laboratory of Biometallurgy, Ministry of Education, Central South University, Changsha, 410083, China
- Tianyuan Gu
- School of Minerals Processing and Bioengineering, Central South University, Changsha, 410083, China; Key Laboratory of Biometallurgy, Ministry of Education, Central South University, Changsha, 410083, China
- Liangzhi Li
- School of Minerals Processing and Bioengineering, Central South University, Changsha, 410083, China; Key Laboratory of Biometallurgy, Ministry of Education, Central South University, Changsha, 410083, China
- Xueling Wu
- School of Minerals Processing and Bioengineering, Central South University, Changsha, 410083, China; Key Laboratory of Biometallurgy, Ministry of Education, Central South University, Changsha, 410083, China
- Li Shen
- School of Minerals Processing and Bioengineering, Central South University, Changsha, 410083, China; Key Laboratory of Biometallurgy, Ministry of Education, Central South University, Changsha, 410083, China
- Runlan Yu
- School of Minerals Processing and Bioengineering, Central South University, Changsha, 410083, China; Key Laboratory of Biometallurgy, Ministry of Education, Central South University, Changsha, 410083, China
- Yuandong Liu
- School of Minerals Processing and Bioengineering, Central South University, Changsha, 410083, China; Key Laboratory of Biometallurgy, Ministry of Education, Central South University, Changsha, 410083, China
- Guanzhou Qiu
- School of Minerals Processing and Bioengineering, Central South University, Changsha, 410083, China; Key Laboratory of Biometallurgy, Ministry of Education, Central South University, Changsha, 410083, China
- Weimin Zeng
- School of Minerals Processing and Bioengineering, Central South University, Changsha, 410083, China; Key Laboratory of Biometallurgy, Ministry of Education, Central South University, Changsha, 410083, China
|
27
|
A comparative genomics study of 23 Aspergillus species from section Flavi. Nat Commun 2020; 11:1106. [PMID: 32107379 PMCID: PMC7046712 DOI: 10.1038/s41467-019-14051-y] [Citation(s) in RCA: 92] [Impact Index Per Article: 23.0] [Received: 04/16/2019] [Accepted: 12/02/2019] [Indexed: 02/01/2023] Open
Abstract
Section Flavi encompasses both harmful and beneficial Aspergillus species, such as Aspergillus oryzae, used in food fermentation and enzyme production, and Aspergillus flavus, food spoiler and mycotoxin producer. Here, we sequence 19 genomes spanning section Flavi and compare 31 fungal genomes including 23 Flavi species. We reassess their phylogenetic relationships and show that the closest relative of A. oryzae is not A. flavus, but A. minisclerotigenes or A. aflatoxiformans and identify high genome diversity, especially in sub-telomeric regions. We predict abundant CAZymes (598 per species) and prolific secondary metabolite gene clusters (73 per species) in section Flavi. However, the observed phenotypes (growth characteristics, polysaccharide degradation) do not necessarily correlate with inferences made from the predicted CAZyme content. Our work, including genomic analyses, phenotypic assays, and identification of secondary metabolites, highlights the genetic and metabolic diversity within section Flavi. Aspergillus fungi classified within the section Flavi include harmful and beneficial species. Here, Kjærbølling et al. analyse the genomes of 23 Flavi species, showing high genetic diversity and potential for synthesis of over 13,700 CAZymes and 1600 secondary metabolites.
|
28
|
Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B. The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS One 2020; 15:e0228070. [PMID: 32040534 PMCID: PMC7010260 DOI: 10.1371/journal.pone.0228070] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 01/07/2020] [Accepted: 01/08/2020] [Indexed: 12/14/2022] Open
Abstract
We study the number N_k of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences (i.e., the number of substitutions per site that occurred since they evolved from their last common ancestor) can be estimated from the slope of a function F that depends on N_k and that is affine-linear within a certain range of k. Integers k_min and k_max can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(k_min) and F(k_max). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.
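The quantity at the centre of this abstract, the number of length-k word matches between two sequences, can be computed with a short sketch. It assumes that a "match" is any pair of positions (one in each sequence) at which the same k-mer starts, so that the count equals the sum over words of the product of their occurrence counts. The function F and the slope-based distance estimate are not reproduced, since the abstract does not define them; function names and the toy sequences are hypothetical.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Occurrences of every contiguous length-k word in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_matches(seq_a: str, seq_b: str, k: int) -> int:
    """Number of position pairs (i, j) with seq_a[i:i+k] == seq_b[j:j+k]."""
    counts_a = count_kmers(seq_a, k)
    counts_b = count_kmers(seq_b, k)
    # Counter returns 0 for words absent from seq_b, so they contribute nothing.
    return sum(n * counts_b[word] for word, n in counts_a.items())


def match_profile(seq_a: str, seq_b: str, k_values: range) -> dict[int, int]:
    """Match counts as a function of k, the raw input to a slope-based estimate."""
    return {k: kmer_matches(seq_a, seq_b, k) for k in k_values}


# Toy pair (made up): the second sequence carries one substitution.
profile = match_profile("ACGTACGT", "ACGTTCGT", range(1, 9))
```

For small k most matches are random background hits, while for larger k only matches inside shared, unmutated stretches survive; how quickly the count decays with k is what carries the information about the substitution rate.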
Affiliation(s)
- Sophie Röhling
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Alexander Linne
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Jendrik Schellhorn
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Thomas Dencker
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Burkhard Morgenstern
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany
|
29
|
Salwan R, Sharma V. Molecular and biotechnological aspects of secondary metabolites in actinobacteria. Microbiol Res 2020; 231:126374. [DOI: 10.1016/j.micres.2019.126374] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 09/28/2019] [Revised: 11/10/2019] [Accepted: 11/11/2019] [Indexed: 12/21/2022]
|
30
|
Agüero-Chapin G, Galpert D, Molina-Ruiz R, Ancede-Gallardo E, Pérez-Machado G, De la Riva GA, Antunes A. Graph Theory-Based Sequence Descriptors as Remote Homology Predictors. Biomolecules 2019; 10:E26. [PMID: 31878100 PMCID: PMC7022958 DOI: 10.3390/biom10010026] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 11/19/2019] [Revised: 12/16/2019] [Accepted: 12/18/2019] [Indexed: 12/23/2022] Open
Abstract
Alignment-free (AF) methodologies have increased in popularity in the last decades as alternative tools to alignment-based (AB) algorithms for performing comparative sequence analyses. They have been especially useful to detect remote homologs within the twilight zone of highly diverse gene/protein families and superfamilies. The most popular alignment-free methodologies, as well as their applications to classification problems, have been described in previous reviews. Despite a new set of graph theory-derived sequence/structural descriptors that have been gaining relevance in the detection of remote homology, they have been omitted as AF predictors when the topic is addressed. Here, we first go over the most popular AF approaches used for detecting homology signals within the twilight zone and then bring out the state-of-the-art tools encoding graph theory-derived sequence/structure descriptors and their success for identifying remote homologs. We also highlight the tendency of integrating AF features/measures with the AB ones, either into the same prediction model or by assembling the predictions from different algorithms using voting/weighting strategies, for improving the detection of remote signals. Lastly, we briefly discuss the efforts made to scale up AB and AF features/measures for the comparison of multiple genomes and proteomes. Alongside the achieved experiences in remote homology detection by both the most popular AF tools and other less known ones, we provide our own using the graphical-numerical methodologies, MARCH-INSIDE, TI2BioP, and ProtDCal. We also present a new Python-based tool (SeqDivA) with a friendly graphical user interface (GUI) for delimiting the twilight zone by using several similar criteria.
Affiliation(s)
- Guillermin Agüero-Chapin
- CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos s/n 4450-208 Porto, Portugal
- Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre, 4169-007 Porto, Portugal
- Deborah Galpert
- Departamento de Ciencia de la Computación, Universidad Central "Marta Abreu" de Las Villas (UCLV), Santa Clara 54830, Cuba
- Reinaldo Molina-Ruiz
- Centro de Bioactivos Químicos (CBQ), Universidad Central "Marta Abreu" de Las Villas (UCLV), Santa Clara 54830, Cuba
- Evys Ancede-Gallardo
- Programa de Doctorado en Fisicoquímica Molecular, Facultad de Ciencias Exactas, Universidad Andrés Bello, Av. República 239, Santiago 8370146, Chile
- Gisselle Pérez-Machado
- EpiDisease S.L. Spin-Off of Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), 46980 Valencia, Spain
- Gustavo A. De la Riva
- Laboratorio de Biotecnología Aplicada S. de R.L. de C.V., GRECA Inc., Carretera La Piedad-Carapán, km 3.5, La Piedad, Michoacán 59300, Mexico
- Tecnológico Nacional de México, Instituto Tecnológico de la Piedad, Av. Ricardo Guzmán Romero, Santa Fe, La Piedad de Cavadas, Michoacán 59370, Mexico
- Agostinho Antunes
- CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos s/n 4450-208 Porto, Portugal
- Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre, 4169-007 Porto, Portugal
|
31
|
Tang K, Ren J, Sun F. Afann: bias adjustment for alignment-free sequence comparison based on sequencing data using neural network regression. Genome Biol 2019; 20:266. [PMID: 31801606 PMCID: PMC6891986 DOI: 10.1186/s13059-019-1872-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 05/02/2019] [Accepted: 10/29/2019] [Indexed: 11/27/2022] Open
Abstract
Alignment-free methods, more time and memory efficient than alignment-based methods, have been widely used for comparing genome sequences or raw sequencing samples without assembly. However, in this study, we show that alignment-free dissimilarity calculated based on sequencing samples can be overestimated compared with the dissimilarity calculated based on their genomes, and this bias can significantly decrease the performance of the alignment-free analysis. Here, we introduce a new alignment-free tool, Alignment-Free methods Adjusted by Neural Network (Afann) that successfully adjusts this bias and achieves excellent performance on various independent datasets. Afann is freely available at https://github.com/GeniusTang/Afann.
Affiliation(s)
- Kujin Tang
- Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA
- Jie Ren
- Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA
- Fengzhu Sun
- Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA
|
32
|
Song K, Ren J, Sun F. Reads Binning Improves Alignment-Free Metagenome Comparison. Front Genet 2019; 10:1156. [PMID: 31824565 PMCID: PMC6881972 DOI: 10.3389/fgene.2019.01156] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 01/10/2019] [Accepted: 10/22/2019] [Indexed: 12/26/2022] Open
Abstract
Comparing metagenomic samples is a critical step in understanding the relationships among microbial communities. Recently, next-generation sequencing (NGS) technologies have produced a massive amount of short reads data for microbial communities from different environments. The assembly of these short reads can, however, be time-consuming and challenging. In addition, alignment-based methods for metagenome comparison are limited by incomplete genome and/or pathway databases. In contrast, alignment-free methods for metagenome comparison do not depend on the completeness of genome or pathway databases. Still, the existing alignment-free methods, d2S and d2*, which model k-tuple patterns using only one Markov chain for each sample, neglect the heterogeneity within metagenomic data wherein potentially thousands of types of microorganisms are sequenced. To address this imperfection in d2S and d2*, we organized NGS sequences into different reads bins and constructed several corresponding Markov models. Next, we modified the definition of our previous alignment-free methods, d2S and d2*, to make them more compatible with a scheme of analysis which uses the proposed reads bins. We then used two simulated and three real metagenomic datasets to test the effect of the k-tuple size and Markov orders of background sequences on the performance of these de novo alignment-free methods. For dependable comparison of metagenomic samples, our newly developed alignment-free methods with reads binning outperformed alignment-free methods without reads binning in detecting the relationship among microbial communities, including whether they form groups or change according to some environmental gradients.
Affiliation(s)
- Kai Song
- School of Mathematics and Statistics, Qingdao University, Qingdao, China
- Jie Ren
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, United States
- Fengzhu Sun
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, United States
|
33
|
Zhou Y, Zhang W, Wu H, Huang K, Jin J. A high-resolution genomic composition-based method with the ability to distinguish similar bacterial organisms. BMC Genomics 2019; 20:754. [PMID: 31638897 PMCID: PMC6805505 DOI: 10.1186/s12864-019-6119-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 06/26/2019] [Accepted: 09/20/2019] [Indexed: 12/03/2022] Open
Abstract
Background Genomic composition has been found to be species specific and is used to differentiate bacterial species. To date, almost no published composition-based approaches are able to distinguish between most closely related organisms, including intra-genus species and intra-species strains. Thus, it is necessary to develop a novel approach to address this problem. Results Here, we initially determine that the “tetranucleotide-derived z-value Pearson correlation coefficient” (TETRA) approach is representative of other published statistical methods. Then, we devise a novel method called “Tetranucleotide-derived Z-value Manhattan Distance” (TZMD) and compare it with the TETRA approach. Our results show that TZMD reflects the maximal genome difference, while TETRA does not in most conditions, demonstrating in theory that TZMD provides improved resolution. Additionally, our analysis of real data shows that TZMD improves species differentiation and clearly differentiates similar organisms, including similar species belonging to the same genospecies, subspecies and intraspecific strains, most of which cannot be distinguished by TETRA. Furthermore, TZMD is able to determine clonal strains with the TZMD = 0 criterion, which intrinsically encompasses identical composition, high average nucleotide identity and high percentage of shared genomes. Conclusions Our extensive assessment demonstrates that TZMD has high resolution. This study is the first to propose a composition-based method for differentiating bacteria at the strain level and to demonstrate that composition is also strain specific. TZMD is a powerful tool and the first easy-to-use approach for differentiating clonal and non-clonal strains. Therefore, as the first composition-based algorithm for strain typing, TZMD will facilitate bacterial studies in the future.
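The contrast this abstract draws between a Pearson correlation (TETRA) and a Manhattan distance (TZMD) over tetranucleotide z-values can be illustrated with a small sketch. The z-value computation itself is not shown, and the two short vectors below are made-up toy numbers rather than real 256-dimensional z-value profiles; function names are hypothetical, and whether the published TZMD applies any normalisation is not stated in the abstract.

```python
from math import sqrt


def pearson_correlation(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / (spread_x * spread_y)


def manhattan_distance(x: list[float], y: list[float]) -> float:
    """Sum of absolute coordinate-wise differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


# Toy z-value profiles (made up): genome_b has the same pattern as genome_a
# but every value is twice as large.
genome_a = [1.0, -2.0, 0.5, 3.0]
genome_b = [2.0 * value for value in genome_a]

correlation = pearson_correlation(genome_a, genome_b)
distance = manhattan_distance(genome_a, genome_b)
```

The correlation is (up to rounding) 1 because Pearson's coefficient ignores linear rescaling, whereas the Manhattan distance is 6.5 because every coordinate-wise difference is counted. This is one plausible reading of why a distance on the z-values can separate profiles that a correlation treats as identical, and why a distance of zero requires identical composition.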
Affiliation(s)
- Yizhuang Zhou
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China; Peking-Tsinghua Center for Life Science, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, People's Republic of China
- Wenting Zhang
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Huixian Wu
- China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China; Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Kai Huang
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China; China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China; Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Junfei Jin
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China; China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China; Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
|
34
|
Huang GD, Liu XM, Huang TL, Xia LC. The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer. Synth Syst Biotechnol 2019; 4:150-156. [PMID: 31508512 PMCID: PMC6723412 DOI: 10.1016/j.synbio.2019.08.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 05/06/2019] [Revised: 07/14/2019] [Accepted: 08/05/2019] [Indexed: 12/21/2022] Open
Abstract
Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase of sequencing depth, hundreds of thousands of contigs are routinely assembled from metagenomics studies, which challenges alignment-based HGT analysis by overwhelming the known reference sequences. Detecting HGT by k-mer statistics thus becomes an attractive alternative. These alignment-free statistics have demonstrated high performance and efficiency in whole-genome and transcriptome comparisons. To adapt k-mer statistics for HGT detection, we developed two aggregative statistics, TsumS and Tsum*, which subsample metagenome contigs by their representative regions and summarize the regional D2S and D2* metrics by their upper bounds. We systematically studied the power of the aggregative statistics at different k-mer sizes using simulations. Our analysis showed that, in general, the power of TsumS and Tsum* increases with sequencing coverage and reaches a maximum power >80% at k = 6, with 5% Type-I error and a coverage ratio >0.2x. The statistical power of TsumS and Tsum* was evaluated with realistic simulations of HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.
Affiliation(s)
- Guan-Da Huang
- School of Physics and Optoelectronics, South China University of Technology, Guangzhou, 510640, China
- Xue-Mei Liu
- School of Physics and Optoelectronics, South China University of Technology, Guangzhou, 510640, China
- Tian-Lai Huang
- School of Physics and Optoelectronics, South China University of Technology, Guangzhou, 510640, China
- Li-C Xia
- Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
|
35
|
Pei S, Dong R, He RL, Yau SST. Large-Scale Genome Comparison Based on Cumulative Fourier Power and Phase Spectra: Central Moment and Covariance Vector. Comput Struct Biotechnol J 2019; 17:982-994. [PMID: 31384399 PMCID: PMC6661692 DOI: 10.1016/j.csbj.2019.07.003] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 05/01/2019] [Revised: 06/24/2019] [Accepted: 07/10/2019] [Indexed: 01/04/2023] Open
Abstract
Genome comparison is a vital research area of bioinformatics. For large-scale genome comparisons, Multiple Sequence Alignment (MSA) methods have been impractical to use due to their algorithmic complexity. In this study, we propose a novel alignment-free method based on the one-to-one correspondence between a DNA sequence and its complete central moment vector of the cumulative Fourier power and phase spectra. In addition, the covariance between the four nucleotides in the power and phase spectra is included. We use the cumulative Fourier power and phase spectra to define a 28-dimensional vector for each DNA sequence. Euclidean distances between the vectors measure the dissimilarity between DNA sequences. We test the method on datasets of different sizes and types, including simulated DNA sequences, exon-intron sequences, and complete genomes. The results show that our method is more accurate and efficient for performing hierarchical clustering than other alignment-free methods and MSA methods.
Affiliation(s)
- Shaojun Pei
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
| | - Rui Dong
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
| | - Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
| | - Stephen S.-T. Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
| |
|
36
|
The Shared and Specific Genes and a Comparative Genomics Analysis within Three Hanseniaspora Strains. Int J Genomics 2019; 2019:7910865. [PMID: 31281829 PMCID: PMC6589277 DOI: 10.1155/2019/7910865] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/29/2018] [Revised: 02/17/2019] [Accepted: 04/16/2019] [Indexed: 11/21/2022]
Abstract
Kloeckera apiculata plays an important role in the inhibition of citrus postharvest blue and green mould diseases. This study was based on the previous genome sequencing of K. apiculata strain 34-9. After homologous comparison, scaffold 27 was defined as the mitochondrial (mt) sequence of K. apiculata 34-9. The comparison showed a high level of sequence identity between scaffold 27 and the known mtDNA of Hanseniaspora uvarum. The genome sequence of H. vineae T02/19AF showed several short and discontinuous fragments homologous to the mtDNA of H. uvarum. The shared and specific genes of K. apiculata, H. uvarum, and H. vineae were analysed by family using the TreeFam methodology. GO analysis was used to classify the shared and specific genes. Most of the gene families were classified into the functional categories of cellular component and metabolic processes. The whole-genome phylogram and genome synteny analysis showed that K. apiculata was more closely related to H. uvarum than to H. vineae. The genomic comparisons clearly displayed the locations of the homologous regions in each genome. This analysis could contribute to discovering the genomic similarities and differences within the genus Hanseniaspora. In addition, some regions were not collinearity-matched in the genome of K. apiculata compared with that of H. uvarum or H. vineae, and these sequences might have resulted from evolutionary variations.
|
37
|
Alanjary M, Steinke K, Ziemert N. AutoMLST: an automated web server for generating multi-locus species trees highlighting natural product potential. Nucleic Acids Res 2019; 47:W276-W282. [PMID: 30997504 PMCID: PMC6602446 DOI: 10.1093/nar/gkz282] [Citation(s) in RCA: 229] [Impact Index Per Article: 45.8] [Received: 02/07/2019] [Revised: 03/29/2019] [Accepted: 04/10/2019] [Indexed: 12/31/2022]
Abstract
Understanding the evolutionary background of a bacterial isolate has applications for a wide range of research. However, generating an accurate species phylogeny remains challenging. Reliance on 16S rDNA for species identification remains popular. Unfortunately, this widespread method suffers from low resolution at the species level due to high sequence conservation. There is now a wealth of genomic data that can be used to yield more accurate species designations via modern phylogenetic methods and multiple genetic loci. However, these often require extensive expertise and time. The Automated Multi-Locus Species Tree (autoMLST) was thus developed to provide a rapid 'one-click' pipeline to simplify this workflow at: https://automlst.ziemertlab.com. This server utilizes Multi-Locus Sequence Analysis (MLSA) to produce high-resolution species trees; it does not perform multi-locus sequence typing (MLST), a related classification method. The resulting phylogenetic tree also includes helpful annotations, such as species clade designations and secondary metabolite counts, to aid natural product prospecting. Distinct from currently available web interfaces, autoMLST can automate the selection of reference genomes and out-group organisms based on one or more query genomes. This enables a wide range of researchers to perform rigorous phylogenetic analyses more rapidly than with manual MLSA workflows.
Affiliation(s)
- Mohammad Alanjary
- Interfaculty Institute of Microbiology and Infection Medicine Tübingen, University of Tübingen, Tübingen, Germany
- German Centre for Infection Research (DZIF), Partner Site Tübingen, Tübingen, Germany
| | - Katharina Steinke
- Interfaculty Institute of Microbiology and Infection Medicine Tübingen, University of Tübingen, Tübingen, Germany
- German Centre for Infection Research (DZIF), Partner Site Tübingen, Tübingen, Germany
| | - Nadine Ziemert
- Interfaculty Institute of Microbiology and Infection Medicine Tübingen, University of Tübingen, Tübingen, Germany
- German Centre for Infection Research (DZIF), Partner Site Tübingen, Tübingen, Germany
| |
|
38
|
Lu YY, Tang K, Ren J, Fuhrman JA, Waterman MS, Sun F. CAFE: aCcelerated Alignment-FrEe sequence analysis. Nucleic Acids Res 2017; 45:W554-W559. [PMID: 28472388 PMCID: PMC5793812 DOI: 10.1093/nar/gkx351] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 02/27/2017] [Accepted: 04/20/2017] [Indexed: 12/13/2022]
Abstract
Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, $d_2^*$ and $d_2^S$ are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows for both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmap, principal coordinate analysis and network display. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely available at https://github.com/younglululu/CAFE.
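As a rough illustration of the simplest class of measure mentioned here (one "based solely on the k-mer frequencies") and of a PHYLIP-style square distance matrix, a minimal sketch follows. It is not CAFE's code and does not reproduce any of its 28 measures exactly; the function names, the Euclidean distance choice, and the default k are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k):
    """Relative frequency of every DNA k-mer, in a fixed lexicographic order."""
    seq = seq.upper()
    words = (seq[i:i + k] for i in range(len(seq) - k + 1))
    # Skip k-mers containing ambiguous characters such as N.
    counts = Counter(w for w in words if set(w) <= set(ALPHABET))
    total = sum(counts.values())
    return [
        counts["".join(p)] / total if total else 0.0
        for p in product(ALPHABET, repeat=k)
    ]


def distance_matrix(seqs, k=3):
    """Pairwise Euclidean distances between k-mer frequency vectors."""
    names = list(seqs)
    vecs = {name: kmer_frequencies(seqs[name], k) for name in names}
    matrix = [[math.dist(vecs[a], vecs[b]) for b in names] for a in names]
    return names, matrix


def to_phylip(names, matrix):
    """Format a square distance matrix in PHYLIP style (10-character names)."""
    lines = [str(len(names))]
    for name, row in zip(names, matrix):
        lines.append(f"{name[:10]:<10} " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines)


names, matrix = distance_matrix(
    {"s1": "ACGTACGTACGT", "s2": "ACGTACGTACGT", "s3": "TTTTTTGGGGGG"}, k=2
)
print(to_phylip(names, matrix))
```

A matrix in this layout can be passed to standard distance-based tree builders (for example neighbor joining) that read PHYLIP input.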
Affiliation(s)
- Yang Young Lu
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA
| | - Kujin Tang
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA
| | - Jie Ren
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA
| | - Jed A Fuhrman
- Department of Biological Sciences and Wrigley Institute for Environmental Studies, University of Southern California, Los Angeles, CA 90089, USA
| | - Michael S Waterman
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA; Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, 200433 Shanghai, China
| | - Fengzhu Sun
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA; Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, 200433 Shanghai, China
| |
|
39
|
Wen QF, Liu S, Dong C, Guo HX, Gao YZ, Guo FB. Geptop 2.0: An Updated, More Precise, and Faster Geptop Server for Identification of Prokaryotic Essential Genes. Front Microbiol 2019; 10:1236. [PMID: 31214154 PMCID: PMC6558110 DOI: 10.3389/fmicb.2019.01236] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 03/02/2019] [Accepted: 05/17/2019] [Indexed: 12/16/2022]
Abstract
Geptop has performed effectively in the identification of prokaryotic essential genes since its first release in 2013. It estimates gene essentiality for prokaryotes based on orthology and phylogeny. Genome-scale essentiality data are now available for more prokaryotic species, and the information has been collected into public essential gene repositories such as DEG and OGEE. A faster and more accurate toolkit is needed to keep up with the growing volume of prokaryotic genome data. We updated Geptop by adding more validated essentiality data to the reference set (from 19 to 37 species) and by introducing multi-process technology to accelerate computation. Compared with Geptop 1.0 and other gene essentiality prediction models, Geptop 2.0 generates more stable predictions and finishes the computation in a shorter time. The software is available both as an online server and as a downloadable standalone application. We hope that the improved Geptop 2.0 will facilitate research in gene essentiality and the development of novel antibacterial drugs. The gene essentiality prediction tool is available at http://cefg.uestc.cn/geptop.
Affiliation(s)
- Qing-Feng Wen
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Shuo Liu
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Chuan Dong
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Hai-Xia Guo
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Yi-Zhou Gao
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Feng-Biao Guo
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
|
40
|
Leimeister CA, Schellhorn J, Dörrer S, Gerth M, Bleidorn C, Morgenstern B. Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences. Gigascience 2019; 8:giy148. [PMID: 30535314 PMCID: PMC6436989 DOI: 10.1093/gigascience/giy148] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 05/10/2018] [Revised: 09/10/2018] [Accepted: 11/20/2018] [Indexed: 11/20/2022]
Abstract
Word-based or 'alignment-free' sequence comparison has become an active research area in bioinformatics. While previous word-frequency approaches calculated rough measures of sequence similarity or dissimilarity, some new alignment-free methods are able to accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Here, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation of this approach is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can be used to calculate high-quality phylogenetic trees for dozens of whole-proteome sequences in a matter of seconds or minutes and often outperforms other alignment-free approaches. The source code of our software is available through Github: https://github.com/jschellh/ProtSpaM.
Affiliation(s)
- Chris-Andre Leimeister
- University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Jendrik Schellhorn
- University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Svenja Dörrer
- University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Michael Gerth
- Institute for Integrative Biology, University of Liverpool, Biosciences Building, Crown Street, L69 7ZB Liverpool, UK
| | - Christoph Bleidorn
- University of Göttingen, Department of Animal Evolution and Biodiversity, Untere Karspüle 2, 37073 Göttingen, Germany
- Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), 28006 Madrid, Spain
| | - Burkhard Morgenstern
- University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Justus-von-Liebig-Weg 11, 37077 Göttingen
| |
|
41
|
Prabha R, Singh DP. Cyanobacterial phylogenetic analysis based on phylogenomics approaches render evolutionary diversification and adaptation: an overview of representative orders. 3 Biotech 2019; 9:87. [PMID: 30800598 DOI: 10.1007/s13205-019-1635-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/01/2018] [Accepted: 02/11/2019] [Indexed: 12/12/2022]
Abstract
Phylogenetic studies based on a definite set of marker genes usually reconstruct evolutionary relationships among the prokaryotic species. Based on specific target sequences, such studies represent variations and allow identification of similarities or dissimilarities in organisms. With the advent of completely sequenced genomes and the accumulation of information on whole prokaryotic genomes, phylogenetic reconstructions should be considered more reliable if they are based on entire genomes. We applied phylogenomics approaches to completely sequenced cyanobacterial genomes to reconstruct the phylogeny of species that represented major taxonomic classes and belonged to distinctly different habitats (freshwater, marine, soils, and rocks). We did not rely on describing the phylogeny of all representative classes of cyanobacterial species on the basis of a single ribosomal gene, the 16S rDNA gene. Instead, we combined molecular marker and phylogenomics approaches (genome alignment, gene content and gene order, composition vector, and protein domain content) to accurately infer the phylogenetic relationships of species. We have shown that this approach reflects the impact of evolution on the organisms and its connection with the ecological adaptation of cyanobacteria in different habitats. The analysis revealed that members from marine habitats occupy a different profile than those from freshwater. The impact of GC content and genomic repetitiveness on the diversification of cyanobacterial species, and their possible role in adaptation, was also reflected. Members occupying similar habitats cover more evolutionary distance together and also evolve various strategies for adaptation and survival, through genomic repetitiveness, preferences for genes of particular functions, or modified GC content. Genomes undergo different changes for their adaptation to diverse habitats.
Affiliation(s)
- Ratna Prabha
- ICAR-National Bureau of Agriculturally Important Microorganisms, Kushmaur, Maunath Bhanjan, 275101 India
- Department of Biotechnology, Mewar University, Gangrar, Chittorgarh, Rajasthan India
| | - Dhananjaya P Singh
- ICAR-National Bureau of Agriculturally Important Microorganisms, Kushmaur, Maunath Bhanjan, 275101 India
| |
|
42
|
Sarmashghi S, Bohmann K, Gilbert MTP, Bafna V, Mirarab S. Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biol 2019; 20:34. [PMID: 30760303 PMCID: PMC6374904 DOI: 10.1186/s13059-019-1632-4] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 03/04/2018] [Accepted: 01/16/2019] [Indexed: 01/10/2023]
Abstract
The ability to inexpensively describe taxonomic diversity is critical in this era of rapid climate and biodiversity changes. The recent genome-skimming approach extends current barcoding practices beyond short markers by applying low-pass sequencing and recovering whole organelle genomes computationally. This approach discards the nuclear DNA, which constitutes the vast majority of the data. In contrast, we suggest using all unassembled reads. We introduce an assembly-free and alignment-free tool, Skmer, to compute genomic distances between the query and reference genome skims. Skmer shows excellent accuracy in estimating distances and identifying the closest match in reference datasets.
Affiliation(s)
- Shahab Sarmashghi
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, 92093 CA USA
| | - Kristine Bohmann
- Evolutionary Genomics, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark
- School of Biological Sciences, University of East Anglia, Norwich, Norfolk UK
| | - M. Thomas P. Gilbert
- Evolutionary Genomics, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark
- Norwegian University of Science and Technology, Trondheim, 7491 Norway
| | - Vineet Bafna
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, 92093 CA USA
| | - Siavash Mirarab
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, 92093 CA USA
| |
|
43
|
PVTree: A Sequential Pattern Mining Method for Alignment Independent Phylogeny Reconstruction. Genes (Basel) 2019; 10:genes10020073. [PMID: 30678245 PMCID: PMC6410268 DOI: 10.3390/genes10020073] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 11/27/2018] [Revised: 01/04/2019] [Accepted: 01/14/2019] [Indexed: 11/21/2022]
Abstract
A phylogenetic tree is essential to understanding evolution and is usually constructed through multiple sequence alignment, which suffers from a heavy computational burden and requires sophisticated parameter tuning. Recently, alignment-free methods based on k-mer profiles or common substrings have provided alternative ways to construct phylogenetic trees. However, most of these methods ignore the global similarities between sequences or some specific valuable features, e.g., patterns that are frequent across the whole dataset. To make further improvement, we propose an alignment-free algorithm based on sequential pattern mining, in which each sequence is converted into a binary representation of the sequential patterns found among the sequences. The phylogenetic tree is then constructed by clustering the distance matrix calculated from the pattern vectors. To increase accuracy for highly divergent sequences, we weight the patterns and filter out redundant sub-patterns. Both simulated and real data demonstrate that our method outperforms other alignment-free methods, especially for large sequence sets with low similarity.
|
44
|
Polyphyly in 16S rRNA-based LVTree Versus Monophyly in Whole-genome-based CVTree. Genomics Proteomics Bioinformatics 2018; 16:310-319. [PMID: 30550857 PMCID: PMC6364046 DOI: 10.1016/j.gpb.2018.06.005] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/22/2018] [Revised: 05/11/2018] [Accepted: 06/25/2018] [Indexed: 11/23/2022]
Abstract
We report an important but long-overlooked manifestation of the low resolution power of 16S rRNA sequence analysis at the species level: in 16S rRNA-based phylogenetic trees, polyphyletic placements of closely related species are abundant compared with those in genome-based phylogeny. This phenomenon makes the demarcation of genera within many families ambiguous in 16S rRNA-based taxonomy. In this study, we reconstructed the phylogenetic relationships of more than ten thousand prokaryote genomes using the CVTree method, which is based on whole-genome information. Many genera that are polyphyletic in 16S rRNA-based trees are well resolved as monophyletic clusters by CVTree. We believe that, with genome sequencing of prokaryotes becoming commonplace, genome-based phylogeny is destined to play a definitive role in the construction of a natural and objective taxonomy.
|
45
|
Sahay S, Shome R, Sankarasubramanian J, Vishnu US, Prajapati A, Natesan K, Shome BR, Rahman H, Rajendhran J. Genome sequence analysis of the Indian strain Mannheimia haemolytica serotype A2 from ovine pneumonic pasteurellosis. Ann Microbiol 2018. [DOI: 10.1007/s13213-018-1410-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
|
46
|
Tang K, Ren J, Cronn R, Erickson DL, Milligan BG, Parker-Forney M, Spouge JL, Sun F. Alignment-free genome comparison enables accurate geographic sourcing of white oak DNA. BMC Genomics 2018; 19:896. [PMID: 30526482 PMCID: PMC6288960 DOI: 10.1186/s12864-018-5253-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 08/24/2018] [Accepted: 11/15/2018] [Indexed: 01/14/2023]
Abstract
Background: The application of genomic data and bioinformatics for the identification of restricted or illegally-sourced natural products is urgently needed. The taxonomic identity and geographic provenance of raw and processed materials have implications in sustainable-use commercial practices, and relevance to the enforcement of laws that regulate or restrict illegally harvested materials, such as timber. Improvements in genomics make it possible to capture and sequence partial-to-complete genomes from challenging tissues, such as wood and wood products. Results: In this paper, we report the success of an alignment-free genome comparison method, $d_2^*$, that differentiates different geographic sources of white oak (Quercus) species with a high level of accuracy using a very small amount of genomic data. The method is robust to sequencing errors, different sequencing laboratories and sequencing platforms. Conclusions: This method offers an approach based on genome-scale data, rather than panels of pre-selected markers for specific taxa. The method provides a generalizable platform for the identification and sourcing of materials using a unified next generation sequencing and analysis framework.
Affiliation(s)
- Kujin Tang
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
| | - Jie Ren
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
| | - Richard Cronn
- Pacific Northwest Research Station, USDA Forest Service, Corvallis, OR, 97331, USA
| | - David L Erickson
- DNA4 Technologies LLC, bwtech@UMBC Research & Technology Park, Baltimore, MD, 21227, USA
| | - Brook G Milligan
- Conservation Genomics Laboratory, Department of Biology, New Mexico State University, Las Cruces, NM, 88003, USA
| | | | - John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
| | - Fengzhu Sun
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA; Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
| |
|
47
|
Insights into the genome sequence of ovine Pasteurella multocida type A strain associated with pneumonic pasteurellosis. Small Rumin Res 2018. [DOI: 10.1016/j.smallrumres.2018.10.004] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 01/08/2023]
|
48
|
Abstract
Background: Pan-genome approaches afford the discovery of homology relations in a set of genomes, by determining how some gene families are distributed among a given set of genomes. The retrieval of a complete gene distribution among a class of genomes is an NP-hard problem because computational costs increase with the number of analyzed genomes; in fact, all-against-all gene comparisons are required to completely solve the problem. In the presence of phylogenetically distant genomes, the variability introduced by gene duplication and transmission makes the task of recognizing homologous genes even more difficult. A challenge in this field is to design fast and adaptive similarity measures in order to find a suitable pan-genome structure of homology relations. Results: We present PanDelos, a standalone tool for the discovery of pan-genome contents among phylogenetically distant genomes. The methodology is based on information theory and network analysis. It is parameter-free because thresholds are automatically deduced from the context. PanDelos avoids sequence alignment by introducing a measure based on k-mer multiplicity. The k-mer length is defined according to general arguments rather than empirical considerations. Candidate homology relations are integrated into a global network, and groups of homologous genes are extracted by applying a community detection algorithm. Conclusions: PanDelos outperforms existing approaches, Roary and EDGAR, in terms of running times and quality of content discovery. Tests were run on collections of real genomes, previously used in analogous studies, and on synthetic benchmarks that provide a fully trusted ground truth. The software is available at https://github.com/GiugnoLab/PanDelos.
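A similarity "based on k-mer multiplicity" can be illustrated with a generalized Jaccard index over k-mer multisets, in which each k-mer contributes the minimum of its two counts to the intersection and the maximum to the union. This is a minimal sketch under the assumption that such a measure is representative; the abstract does not give the exact formula, and the function names and the fixed k below are hypothetical.

```python
from collections import Counter


def kmer_multiset(seq, k):
    """Count every overlapping k-mer in seq, keeping multiplicities."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def generalized_jaccard(seq_a, seq_b, k):
    """Multiplicity-aware Jaccard similarity between two sequences.

    Returns a value in [0, 1]; 1.0 means identical k-mer multisets.
    """
    a, b = kmer_multiset(seq_a, k), kmer_multiset(seq_b, k)
    keys = a.keys() | b.keys()
    shared = sum(min(a[w], b[w]) for w in keys)
    union = sum(max(a[w], b[w]) for w in keys)
    return shared / union if union else 0.0


print(generalized_jaccard("MKVLAAGIVG", "MKVLAAGLVG", k=3))
```

Unlike a plain set-based Jaccard index, repeated k-mers matter here, so two sequences that share a k-mer but contain it a different number of times score below 1.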
Affiliation(s)
- Vincenzo Bonnici
- Department of Computer Science, University of Verona, Strada le Grazie, 15, Verona, 37134, Italy
| | - Rosalba Giugno
- Department of Computer Science, University of Verona, Strada le Grazie, 15, Verona, 37134, Italy
| | - Vincenzo Manca
- Department of Computer Science, University of Verona, Strada le Grazie, 15, Verona, 37134, Italy
| |
|
49
|
Chen NWG, Serres-Giardi L, Ruh M, Briand M, Bonneau S, Darrasse A, Barbe V, Gagnevin L, Koebnik R, Jacques MA. Horizontal gene transfer plays a major role in the pathological convergence of Xanthomonas lineages on common bean. BMC Genomics 2018; 19:606. [PMID: 30103675 PMCID: PMC6090828 DOI: 10.1186/s12864-018-4975-4] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 02/11/2018] [Accepted: 07/31/2018] [Indexed: 12/18/2022]
Abstract
BACKGROUND Host specialization is a hallmark of numerous plant pathogens including bacteria, fungi, oomycetes and viruses. Yet, the molecular and evolutionary bases of host specificity are poorly understood. In some cases, pathological convergence is observed for individuals belonging to distant phylogenetic clades. This is the case for Xanthomonas strains responsible for common bacterial blight of bean, spread across four genetic lineages. All the strains from these four lineages converged for pathogenicity on common bean, implying possible gene convergences and/or sharing of a common arsenal of genes conferring the ability to infect common bean. RESULTS To search for genes involved in common bean specificity, we used a combination of whole-genome analyses without a priori, including a genome scan based on k-mer search. Analysis of 72 genomes from a collection of Xanthomonas pathovars unveiled 115 genes bearing DNA sequences specific to strains responsible for common bacterial blight, including 20 genes located on a plasmid. Of these 115 genes, 88 were involved in successive events of horizontal gene transfers among the four genetic lineages, and 44 contained nonsynonymous polymorphisms unique to the causal agents of common bacterial blight. CONCLUSIONS Our study revealed that host specificity of common bacterial blight agents is associated with a combination of horizontal transfers of genes, and highlights the role of plasmids in these horizontal transfers.
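The core idea of a k-mer based genome scan, finding sequence words present in every genome of a target group and absent from all other genomes, can be sketched in a few lines. This is a toy illustration only, not the authors' pipeline: the function names are hypothetical, reverse complements are ignored, and whole genomes would need a far more memory-efficient k-mer index than Python sets.

```python
def kmers(seq, k):
    """Set of distinct overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def group_specific_kmers(target_genomes, other_genomes, k):
    """k-mers shared by all target genomes and found in no other genome."""
    if not target_genomes:
        return set()
    shared = set.intersection(*(kmers(g, k) for g in target_genomes))
    background = set().union(*(kmers(g, k) for g in other_genomes))
    return shared - background


specific = group_specific_kmers(
    target_genomes=["AACCGG", "TTCCGG"],
    other_genomes=["AACCTT"],
    k=3,
)
print(sorted(specific))
```

In a real analysis the specific k-mers would then be mapped back to annotated genes to obtain a list of candidate group-specific genes.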
Affiliation(s)
- Nicolas W. G. Chen
- IRHS, INRA, AGROCAMPUS OUEST, Université d’Angers, SFR4207 QUASAV, 42, rue Georges Morel, 49071 Beaucouzé, France
| | - Laurana Serres-Giardi
- IRHS, INRA, AGROCAMPUS OUEST, Université d’Angers, SFR4207 QUASAV, 42, rue Georges Morel, 49071 Beaucouzé, France
| | - Mylène Ruh
- IRHS, INRA, AGROCAMPUS OUEST, Université d’Angers, SFR4207 QUASAV, 42, rue Georges Morel, 49071 Beaucouzé, France
| | - Martial Briand
- IRHS, INRA, AGROCAMPUS OUEST, Université d’Angers, SFR4207 QUASAV, 42, rue Georges Morel, 49071 Beaucouzé, France
| | - Sophie Bonneau
- IRHS, INRA, AGROCAMPUS OUEST, Université d’Angers, SFR4207 QUASAV, 42, rue Georges Morel, 49071 Beaucouzé, France
| | - Armelle Darrasse
- IRHS, INRA, AGROCAMPUS OUEST, Université d’Angers, SFR4207 QUASAV, 42, rue Georges Morel, 49071 Beaucouzé, France
| | - Valérie Barbe
- CEA/DSV/IG/Genoscope, 2 rue Gaston Crémieux, BP5706, 91057 Evry, France
| | - Lionel Gagnevin
- CIRAD, UMR PVBMT, F-97410 Saint-Pierre, La Réunion France
- IRD, CIRAD, Université de Montpellier, IPME, Montpellier, France
| | - Ralf Koebnik
- IRD, CIRAD, Université de Montpellier, IPME, Montpellier, France
| | - Marie-Agnès Jacques
- IRHS, INRA, AGROCAMPUS OUEST, Université d’Angers, SFR4207 QUASAV, 42, rue Georges Morel, 49071 Beaucouzé, France
| |
|
50
|
Comparative studies of alignment, alignment-free and SVM based approaches for predicting the hosts of viruses based on viral sequences. Sci Rep 2018; 8:10032. [PMID: 29968780 PMCID: PMC6030160 DOI: 10.1038/s41598-018-28308-x] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 01/31/2018] [Accepted: 06/15/2018] [Indexed: 12/05/2022]
Abstract
Predicting the hosts of newly discovered viruses is important for pandemic surveillance of infectious diseases. We investigated the use of alignment-based and alignment-free methods and support vector machine using mononucleotide frequency and dinucleotide bias to predict the hosts of viruses, and applied these approaches to three datasets: rabies virus, coronavirus, and influenza A virus. For coronavirus, we used the spike gene sequences, while for rabies and influenza A viruses, we used the more conserved nucleoprotein gene sequences. We compared the three methods under different scenarios and showed that their performances are highly correlated with the variability of sequences and sample size. For conserved genes like the nucleoprotein gene, longer k-mers than mono- and dinucleotides are needed to better distinguish the sequences. We also showed that both alignment-based and alignment-free methods can accurately predict the hosts of viruses. When alignment is difficult to achieve or highly time-consuming, alignment-free methods can be a promising substitute to predict the hosts of new viruses.
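The two sequence features named in this abstract, mononucleotide frequency and dinucleotide bias, can be computed as shown below. The sketch assumes the common odds-ratio definition of dinucleotide bias (observed dinucleotide frequency divided by the product of the two mononucleotide frequencies); the study may define it differently, and the function name is hypothetical. Training a support vector machine on the resulting features is not shown.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def mono_and_dinucleotide_features(seq):
    """Return (mononucleotide frequencies, dinucleotide odds ratios).

    Assumes seq contains only A, C, G and T and has length >= 2.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    freq = {base: mono[base] / n for base in ALPHABET}
    bias = {}
    for x, y in product(ALPHABET, repeat=2):
        expected = freq[x] * freq[y]
        observed = di[x + y] / (n - 1)
        # Odds ratio: above 1 means over-represented, below 1 under-represented.
        bias[x + y] = observed / expected if expected > 0 else 0.0
    return freq, bias


freq, bias = mono_and_dinucleotide_features("ACGT" * 5)
print(freq["A"], bias["AC"], bias["AA"])
```

The 4 frequencies and 16 odds ratios together form a 20-dimensional feature vector per sequence, which is the kind of fixed-length input a classifier requires.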
|