1
Hu G, Zhou T, Zhou P, Yau SST. Novel natural vector with asymmetric covariance for classifying biological sequences. Gene 2025; 962:149532. [PMID: 40367998 DOI: 10.1016/j.gene.2025.149532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2025] [Revised: 04/07/2025] [Accepted: 04/23/2025] [Indexed: 05/16/2025]
Abstract
The genome sequences of organisms form a large and complex landscape, presenting a significant challenge in bioinformatics: how to utilize mathematical tools to describe and analyze this space effectively. The ability to compare relationships between different organisms depends on creating a rational mapping rule that can uniformly encode genome sequences of varying lengths as vectors in a measurable space. This mapping would enable researchers to apply modern mathematical and machine learning techniques to otherwise challenging genomic comparisons. The natural vector method has been proposed as a concise and effective approach to accomplish this. However, its various iterations have certain limitations. In response, we carefully analyze the strengths and weaknesses of these natural vector methods and propose an improved version, the asymmetric covariance natural vector (ACNV) method. This new method incorporates k-mer information alongside covariance computations with asymmetric properties between base positions. We tested ACNV on microbial genome sequence datasets, including bacterial, fungal, and viral sequences, evaluating its performance in terms of classification accuracy and convex hull separation. The results demonstrate that ACNV effectively captures sequence characteristics, showcasing its robust sequence representation capabilities and highlighting its elegant geometric properties.
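For orientation, the following is a minimal sketch of the classical 12-dimensional natural vector that ACNV builds on, not of ACNV itself: the abstract does not give the k-mer or asymmetric covariance formulas, so they are not attempted here. The function name `natural_vector` is an assumption, as is the convention of reporting zeros for a base that never occurs. For each base it records the count, the mean position, and the second central moment of the positions normalized by the base count times the sequence length.

```python
from typing import List


def natural_vector(seq: str, alphabet: str = "ACGT") -> List[float]:
    """Classical natural vector: counts, mean positions, normalized 2nd moments.

    Illustrative sketch only; positions are 1-based, and a base that does
    not occur contributes zeros (an assumed convention).
    """
    seq = seq.upper()
    n = len(seq)
    counts: List[float] = []
    means: List[float] = []
    moments: List[float] = []
    for base in alphabet:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, scaled by n_k * n so that sequences of
        # different lengths map to comparable values.
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments
```

Because every sequence maps to a vector of the same length regardless of its own length, ordinary distances and classifiers can be applied directly to the result.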
Affiliation(s)
- Guoqing Hu
  - Beijing Institute of Mathematical Sciences and Applications (BIMSA), 101408, Beijing, China
- Tao Zhou
  - Department of Mathematical Sciences, Tsinghua University, 100084, Beijing, China
- Piyu Zhou
  - Beijing Institute of Mathematical Sciences and Applications (BIMSA), 101408, Beijing, China; State Key Laboratory of Mathematical Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 100190, Beijing, China; University of Chinese Academy of Sciences, 100049, Beijing, China
- Stephen Shing-Toung Yau
  - Beijing Institute of Mathematical Sciences and Applications (BIMSA), 101408, Beijing, China; Department of Mathematical Sciences, Tsinghua University, 100084, Beijing, China
2
Khodaei M, Edwards SV, Beerli P. Estimating Genome-wide Phylogenies Using Probabilistic Topic Modeling. bioRxiv 2024:2023.12.20.572577. [PMID: 39605625 PMCID: PMC11601389 DOI: 10.1101/2023.12.20.572577] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Methods for rapidly inferring the evolutionary history of species or populations with genome-wide data are progressing, but computational constraints still limit our abilities in this area. We developed an alignment-free method to infer genome-wide phylogenies and implemented it in the Python package TopicContml. The method uses probabilistic topic modeling (specifically, Latent Dirichlet Allocation or LDA) to extract 'topic' frequencies from k-mers, which are derived from multilocus DNA sequences. These extracted frequencies then serve as an input for the program Contml in the PHYLIP package, which is used to generate a species tree. We evaluated the performance of TopicContml on simulated datasets with gaps and three biological datasets: (1) 14 DNA sequence loci from two Australian bird species distributed across nine populations, (2) 5162 loci from 80 mammal species, and (3) raw, unaligned, non-orthologous PacBio sequences from 12 bird species. Our empirical results and simulated data suggest that our method is efficient and statistically robust. We also assessed the uncertainty of the estimated relationships among clades using a bootstrap procedure.
3
Van Etten J, Stephens TG, Bhattacharya D. A k-mer-Based Approach for Phylogenetic Classification of Taxa in Environmental Genomic Data. Syst Biol 2023; 72:1101-1118. [PMID: 37314057 DOI: 10.1093/sysbio/syad037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 03/20/2023] [Accepted: 06/12/2023] [Indexed: 06/15/2023]
Abstract
In the age of genome sequencing, whole-genome data is readily and frequently generated, leading to a wealth of new information that can be used to advance various fields of research. New approaches, such as alignment-free phylogenetic methods that utilize k-mer-based distance scoring, are becoming increasingly popular given their ability to rapidly generate phylogenetic information from whole-genome data. However, these methods have not yet been tested using environmental data, which often tends to be highly fragmented and incomplete. Here, we compare the results of one alignment-free approach (which utilizes the D2 statistic) to traditional multi-gene maximum likelihood trees in 3 algal groups that have high-quality genome data available. In addition, we simulate lower-quality, fragmented genome data using these algae to test method robustness to genome quality and completeness. Finally, we apply the alignment-free approach to environmental metagenome assembled genome data of unclassified Saccharibacteria and Trebouxiophyte algae, and single-cell amplified data from uncultured marine stramenopiles to demonstrate its utility with real datasets. We find that in all instances, the alignment-free method produces phylogenies that are comparable, and often more informative, than those created using the traditional multi-gene approach. The k-mer-based method performs well even when there are significant missing data that include marker genes traditionally used for tree reconstruction. Our results demonstrate the value of alignment-free approaches for classifying novel, often cryptic or rare, species, that may not be culturable or are difficult to access using single-cell methods, but fill important gaps in the tree of life.
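As a rough illustration of the k-mer-based scoring mentioned above, the sketch below computes the plain D2 statistic, i.e. the sum over all shared k-mers of the product of their counts in the two sequences. The function names are assumptions, and this is the unnormalized textbook form; the study may well have used a normalized variant, which is not reproduced here.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_x: str, seq_y: str, k: int) -> int:
    """Plain D2 statistic: inner product of the two k-mer count vectors."""
    counts_x = kmer_counts(seq_x, k)
    counts_y = kmer_counts(seq_y, k)
    # A Counter returns 0 for k-mers it has not seen, so only shared
    # k-mers contribute to the sum.
    return sum(c * counts_y[word] for word, c in counts_x.items())
```

A larger D2 value indicates more shared k-mer content; in practice the score is usually normalized and converted to a distance before a tree is built.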
Affiliation(s)
- Julia Van Etten
  - Graduate Program in Ecology and Evolution, Rutgers, The State University of New Jersey, 14 College Farm Road, New Brunswick, NJ 08901, USA
- Timothy G Stephens
  - Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA
- Debashish Bhattacharya
  - Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, 59 Dudley Road, New Brunswick, NJ 08901, USA
4
Zhang J, Richards ZT, Adam AAS, Chan CX, Shinzato C, Gilmour J, Thomas L, Strugnell JM, Miller DJ, Cooke I. Evolutionary responses of a reef-building coral to climate change at the end of the last glacial maximum. Mol Biol Evol 2022; 39:msac201. [PMID: 36219871 PMCID: PMC9578555 DOI: 10.1093/molbev/msac201] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 04/03/2022] [Revised: 09/04/2022] [Accepted: 09/13/2022] [Indexed: 11/13/2022]
Abstract
Climate change threatens the survival of coral reefs on a global scale, primarily through mass bleaching and mortality as a result of marine heatwaves. While these short-term effects are clear, predicting the fate of coral reefs over the coming century is a major challenge. One way to understand the longer-term effects of rapid climate change is to examine the response of coral populations to past climate shifts. Coastal and shallow-water marine ecosystems such as coral reefs have been reshaped many times by sea-level changes during the Pleistocene, yet, few studies have directly linked this with its consequences on population demographics, dispersal, and adaptation. Here we use powerful analytical techniques, afforded by haplotype phased whole-genomes, to establish such links for the reef-building coral, Acropora digitifera. We show that three genetically distinct populations are present in northwestern Australia, and that their rapid divergence since the last glacial maximum (LGM) can be explained by a combination of founder-effects and restricted gene flow. Signatures of selective sweeps, too strong to be explained by demographic history, are present in all three populations and overlap with genes that show different patterns of functional enrichment between inshore and offshore habitats. In contrast to rapid divergence in the host, we find that photosymbiont communities are largely undifferentiated between corals from all three locations, spanning almost 1000 km, indicating that selection on host genes and not acquisition of novel symbionts, has been the primary driver of adaptation for this species in northwestern Australia.
Affiliation(s)
- Jia Zhang
  - Department of Molecular and Cell Biology, James Cook University, Townsville, QLD, 4811, Australia
  - Centre for Tropical Bioinformatics and Molecular Biology, James Cook University, Townsville, QLD, 4811, Australia
  - ARC Centre of Excellence for Coral Reef Studies, James Cook University, Townsville, QLD, 4811, Australia
- Zoe T Richards
  - Coral Conservation and Research Group, Trace and Environmental DNA Laboratory, School of Molecular and Life Sciences, Curtin University, Bentley, WA 6102, Australia
  - Collections and Research, Western Australian Museum, 49 Kew Street Welshpool, WA 6106, Australia
- Arne A S Adam
  - Coral Conservation and Research Group, Trace and Environmental DNA Laboratory, School of Molecular and Life Sciences, Curtin University, Bentley, WA 6102, Australia
- Cheong Xin Chan
  - The University of Queensland, School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, Brisbane, QLD 4072, Australia
- Chuya Shinzato
  - Atmosphere and Ocean Research Institute, The University of Tokyo, 277-8564, Chiba, Japan
- James Gilmour
  - Australia Institute of Marine Science, Indian Oceans Marine Research Centre, Crawley, WA, 6009, Australia
- Luke Thomas
  - Australia Institute of Marine Science, Indian Oceans Marine Research Centre, Crawley, WA, 6009, Australia
  - Oceans Graduate School, The UWA Oceans Institute, The University of Western Australia, Perth, WA, 6009, Australia
- Jan M Strugnell
  - Department of Marine Biology and Aquaculture, James Cook University, Townsville, QLD, 4811, Australia
  - Centre for Sustainable Fisheries and Aquaculture, James Cook University, Townsville, QLD, 4811, Australia
- David J Miller
  - Department of Molecular and Cell Biology, James Cook University, Townsville, QLD, 4811, Australia
  - Centre for Tropical Bioinformatics and Molecular Biology, James Cook University, Townsville, QLD, 4811, Australia
  - ARC Centre of Excellence for Coral Reef Studies, James Cook University, Townsville, QLD, 4811, Australia
  - Marine Climate Change Unit, Okinawa Institute of Science and Technology, Onna-son, Okinawa, Japan 904-0495
- Ira Cooke
  - Department of Molecular and Cell Biology, James Cook University, Townsville, QLD, 4811, Australia
  - Centre for Tropical Bioinformatics and Molecular Biology, James Cook University, Townsville, QLD, 4811, Australia
5
Lo R, Dougan KE, Chen Y, Shah S, Bhattacharya D, Chan CX. Alignment-Free Analysis of Whole-Genome Sequences From Symbiodiniaceae Reveals Different Phylogenetic Signals in Distinct Regions. Frontiers in Plant Science 2022; 13:815714. [PMID: 35557718 PMCID: PMC9087856 DOI: 10.3389/fpls.2022.815714] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/15/2021] [Accepted: 04/04/2022] [Indexed: 05/24/2023]
Abstract
Dinoflagellates of the family Symbiodiniaceae are predominantly essential symbionts of corals and other marine organisms. Recent research reveals extensive genome sequence divergence among Symbiodiniaceae taxa and high phylogenetic diversity hidden behind subtly different cell morphologies. Using an alignment-free phylogenetic approach based on sub-sequences of fixed length k (i.e. k-mers), we assessed the phylogenetic signal among whole-genome sequences from 16 Symbiodiniaceae taxa (including the genera of Symbiodinium, Breviolum, Cladocopium, Durusdinium and Fugacium) and two strains of Polarella glacialis as outgroup. Based on phylogenetic trees inferred from k-mers in distinct genomic regions (i.e. repeat-masked genome sequences, protein-coding sequences, introns and repeats) and in protein sequences, the phylogenetic signal associated with protein-coding DNA and the encoded amino acids is largely consistent with the Symbiodiniaceae phylogeny based on established markers, such as large subunit rRNA. The other genome sequences (introns and repeats) exhibit distinct phylogenetic signals, supporting the expected differential evolutionary pressure acting on these regions. Our analysis of conserved core k-mers revealed the prevalence of conserved k-mers (>95% core 23-mers among all 18 genomes) in annotated repeats and non-genic regions of the genomes. We observed 180 distinct repeat types that are significantly enriched in genomes of the symbiotic versus free-living Symbiodinium taxa, suggesting an enhanced activity of transposable elements linked to the symbiotic lifestyle. We provide evidence that representation of alignment-free phylogenies as dynamic networks enhances the ability to generate new hypotheses about genome evolution in Symbiodiniaceae. 
These results demonstrate the potential of alignment-free phylogenetic methods as a scalable approach for inferring comprehensive, unbiased whole-genome phylogenies of dinoflagellates and more broadly of microbial eukaryotes.
Affiliation(s)
- Rosalyn Lo
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Katherine E. Dougan
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Yibi Chen
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Sarah Shah
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Debashish Bhattacharya
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, United States
- Cheong Xin Chan
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
6
Dougan KE, González-Pech RA, Stephens TG, Shah S, Chen Y, Ragan MA, Bhattacharya D, Chan CX. Genome-powered classification of microbial eukaryotes: focus on coral algal symbionts. Trends Microbiol 2022; 30:831-840. [DOI: 10.1016/j.tim.2022.02.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/10/2021] [Revised: 01/20/2022] [Accepted: 02/01/2022] [Indexed: 12/20/2022]
7
Bohnsack KS, Kaden M, Abel J, Saralajew S, Villmann T. The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers. Entropy (Basel, Switzerland) 2021; 23:1357. [PMID: 34682081 PMCID: PMC8534762 DOI: 10.3390/e23101357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2021] [Revised: 10/11/2021] [Accepted: 10/14/2021] [Indexed: 11/16/2022]
Abstract
In the present article we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon-, Rényi-, and Tsallis-entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which allows substantial knowledge extraction in addition to the high classification ability due to the model-inherent robustness. Any potential (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by interpretable models. This knowledge may assist the user in the analysis and understanding of the used data and considered task. After theoretical justification of the concepts, we demonstrate the approach for various example data sets covering different areas in biomolecular sequence analysis.
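To make the idea of a mutual-information fingerprint concrete, here is a minimal sketch of the ordinary Shannon mutual information function of a sequence at a given lag: the mutual information between the symbol at position i and the symbol at position i + lag, estimated from pair frequencies. It does not implement the resolved, Rényi, or Tsallis variants studied in the article; the function name and the choice to estimate marginals from the same pairs are assumptions.

```python
import math
from collections import Counter


def mutual_information(seq: str, lag: int) -> float:
    """Shannon mutual information (in bits) between symbols `lag` apart."""
    seq = seq.upper()
    pairs = [(seq[i], seq[i + lag]) for i in range(len(seq) - lag)]
    if not pairs:
        return 0.0  # sequence too short for this lag
    total = len(pairs)
    joint = Counter(pairs)
    # Marginals are taken from the same pairs so the estimate is
    # a proper mutual information and never negative.
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / total
        p_a = left[a] / total
        p_b = right[b] / total
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi
```

Evaluating the function over a range of lags, for example `[mutual_information(s, t) for t in range(1, 51)]`, yields a fixed-length profile that could serve as classifier input.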
Affiliation(s)
- Katrin Sophie Bohnsack
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany
- Marika Kaden
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany
- Julia Abel
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany
- Sascha Saralajew
  - Bosch Center for Artificial Intelligence, 71272 Renningen, Germany
- Thomas Villmann
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany
8
Gupta A, Kulkarni M, Mukherjee A. Accurate prediction of B-form/A-form DNA conformation propensity from primary sequence: A machine learning and free energy handshake. Patterns 2021; 2:100329. [PMID: 34553171 PMCID: PMC8441556 DOI: 10.1016/j.patter.2021.100329] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 03/01/2021] [Revised: 03/25/2021] [Accepted: 07/20/2021] [Indexed: 11/26/2022]
Abstract
DNA carries the genetic code of life, with different conformations associated with different biological functions. Predicting the conformation of DNA from its primary sequence, although desirable, is a challenging problem owing to the polymorphic nature of DNA. We have deployed a host of machine learning algorithms, including the popular state-of-the-art LightGBM (a gradient boosting model), for building prediction models. We used the nested cross-validation strategy to address the issues of “overfitting” and selection bias. This simultaneously provides an unbiased estimate of the generalization performance of a machine learning algorithm and allows us to tune the hyperparameters optimally. Furthermore, we built a secondary model based on SHAP (SHapley Additive exPlanations) that offers crucial insight into model interpretability. Our detailed model-building strategy and robust statistical validation protocols tackle the formidable challenge of working on small datasets, which is often the case in biological and medical data.
- A robust machine learning model to predict A- or B-DNA conformation
- Outcome of machine learning model is explained with free energy values
- Our approach works well under class imbalance and limited data constraints
The sequence in the genome of an organism encodes all the information of life. We combine a data-driven approach using machine learning (ML) and the results of free energy calculations to offer a fresh perspective on this long-standing problem of prediction of DNA conformation (A or B) from the sequence. We trained our ML model using sophisticated state-of-the-art algorithms such as LightGBM along with a nested cross-validation strategy to overcome the common problems associated with data bias and overfitting when constrained by limited data size. Our study will serve the broader interest of researchers who are not only seeking accurate and reliable predictive models but also want to understand the physical and chemical origins behind the predictions.
Affiliation(s)
- Abhijit Gupta
  - Department of Chemistry, Indian Institute of Science Education and Research, Pune, Maharashtra 411008, India
- Mandar Kulkarni
  - Division of Biophysical Chemistry, Lund University, Chemical Center, P.O.B. 124, 22100 Lund, Sweden
- Arnab Mukherjee
  - Department of Chemistry, Indian Institute of Science Education and Research, Pune, Maharashtra 411008, India
9
Léonard RR, Leleu M, Van Vlierberghe M, Cornet L, Kerff F, Baurain D. ToRQuEMaDA: tool for retrieving queried Eubacteria, metadata and dereplicating assemblies. PeerJ 2021; 9:e11348. [PMID: 33996287 PMCID: PMC8106394 DOI: 10.7717/peerj.11348] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/24/2020] [Accepted: 04/04/2021] [Indexed: 11/20/2022]
Abstract
TQMD is a tool for high-performance computing clusters which downloads, stores and produces lists of dereplicated prokaryotic genomes. It has been developed to counter the ever-growing number of prokaryotic genomes and their uneven taxonomic distribution. It is based on word-based alignment-free methods (k-mers), an iterative single-linkage approach and a divide-and-conquer strategy to remain both efficient and scalable. We studied the performance of TQMD by verifying the influence of its parameters and heuristics on the clustering outcome. We further compared TQMD to two other dereplication tools (dRep and Assembly-Dereplicator). Our results showed that TQMD is primarily optimized to dereplicate at higher taxonomic levels (phylum/class), as opposed to the other dereplication tools, but also works at lower taxonomic levels (species/strain) like the other dereplication tools. TQMD is available from source and as a Singularity container at https://bitbucket.org/phylogeno/tqmd.
Affiliation(s)
- Raphaël R Léonard
  - InBioS - Centre d'Ingénierie des Protéines, Université de Liège, Liège, Belgium; InBioS - PhytoSYSTEMS, Eukaryotic Phylogenomics, Université de Liège, Liège, Belgium
- Marie Leleu
  - InBioS - PhytoSYSTEMS, Eukaryotic Phylogenomics, Université de Liège, Liège, Belgium; UGSF - Unité de Glycobiologie Structurale et Fonctionnelle, Université de Lille/CNRS, Lille, France
- Mick Van Vlierberghe
  - InBioS - PhytoSYSTEMS, Eukaryotic Phylogenomics, Université de Liège, Liège, Belgium
- Luc Cornet
  - InBioS - PhytoSYSTEMS, Eukaryotic Phylogenomics, Université de Liège, Liège, Belgium; Mycology and Aerobiology, Sciensano, Service Public Fédéral, Bruxelles, Belgium
- Frédéric Kerff
  - InBioS - Centre d'Ingénierie des Protéines, Université de Liège, Liège, Belgium
- Denis Baurain
  - InBioS - PhytoSYSTEMS, Eukaryotic Phylogenomics, Université de Liège, Liège, Belgium
10
Jacobus AP, Stephens TG, Youssef P, González-Pech R, Ciccotosto-Camp MM, Dougan KE, Chen Y, Basso LC, Frazzon J, Chan CX, Gross J. Comparative Genomics Supports That Brazilian Bioethanol Saccharomyces cerevisiae Comprise a Unified Group of Domesticated Strains Related to Cachaça Spirit Yeasts. Front Microbiol 2021; 12:644089. [PMID: 33936002 PMCID: PMC8082247 DOI: 10.3389/fmicb.2021.644089] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 12/19/2020] [Accepted: 03/08/2021] [Indexed: 01/05/2023]
Abstract
Ethanol production from sugarcane is a key renewable fuel industry in Brazil. Major drivers of this alcoholic fermentation are Saccharomyces cerevisiae strains that originally were contaminants to the system and yet prevail in the industrial process. Here we present newly sequenced genomes (using Illumina short-read and PacBio long-read data) of two monosporic isolates (H3 and H4) of the S. cerevisiae PE-2, a predominant bioethanol strain in Brazil. The assembled genomes of H3 and H4, together with 42 draft genomes of sugarcane-fermenting (fuel ethanol plus cachaça) strains, were compared against those of the reference S288C and diverse S. cerevisiae. All genomes of bioethanol yeasts have amplified SNO2(3)/SNZ2(3) gene clusters for vitamin B1/B6 biosynthesis, and display ubiquitous presence of a particular family of SAM-dependent methyl transferases, rare in S. cerevisiae. Widespread amplifications of quinone oxidoreductases YCR102C/YLR460C/YNL134C, and the structural or punctual variations among aquaporins and components of the iron homeostasis system, likely represent adaptations to industrial fermentation. Interesting is the pervasive presence among the bioethanol/cachaça strains of a five-gene cluster (Region B) that is a known phylogenetic signature of European wine yeasts. Combining genomes of H3, H4, and 195 yeast strains, we comprehensively assessed whole-genome phylogeny of these taxa using an alignment-free approach. The 197-genome phylogeny substantiates that bioethanol yeasts are monophyletic and closely related to the cachaça and wine strains. Our results support the hypothesis that biofuel-producing yeasts in Brazil may have been co-opted from a pool of yeasts that were pre-adapted to alcoholic fermentation of sugarcane for the distillation of cachaça spirit, which historically is a much older industry than the large-scale fuel ethanol production.
Affiliation(s)
- Ana Paula Jacobus
  - Laboratory for Genomics and Experimental Evolution of Yeasts, Institute for Bioenergy Research, São Paulo State University, Rio Claro, Brazil
- Timothy G Stephens
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- Pierre Youssef
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- Raul González-Pech
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- Michael M Ciccotosto-Camp
  - Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia; School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
- Katherine E Dougan
  - Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia; School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
- Yibi Chen
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia; Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia; School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
- Luiz Carlos Basso
  - Biological Science Department, Escola Superior de Agricultura Luiz de Queiroz, University of São Paulo (USP), Piracicaba, Brazil
- Jeverson Frazzon
  - Institute of Food Science and Technology, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil
- Cheong Xin Chan
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia; Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia; School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
- Jeferson Gross
  - Laboratory for Genomics and Experimental Evolution of Yeasts, Institute for Bioenergy Research, São Paulo State University, Rio Claro, Brazil
11
Abstract
Inferring phylogenetic relationships among hundreds or thousands of microbial genomes is an increasingly common task. The conventional phylogenetic approach adopts multiple sequence alignment to compare gene-by-gene, concatenated multigene or whole-genome sequences, from which a phylogenetic tree would be inferred. These alignments follow the implicit assumption of full-length contiguity among homologous sequences. However, common events in microbial genome evolution (e.g., structural rearrangements and genetic recombination) violate this assumption. Moreover, aligning hundreds or thousands of sequences is computationally intensive and not scalable to the rate at which genome data are generated. Therefore, alignment-free methods present an attractive alternative strategy. Here we describe a scalable alignment-free strategy to infer phylogenetic relationships using complete genome sequences of bacteria and archaea, based on short, subsequences of length k (k-mers). We describe how this strategy can be extended to infer evolutionary relationships beyond a tree-like structure, to better capture both vertical and lateral signals of microbial evolution.
12
Mapping sequence to feature vector using numerical representation of codons targeted to amino acids for alignment-free sequence analysis. Gene 2020; 766:145096. [PMID: 32919006 DOI: 10.1016/j.gene.2020.145096] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 04/25/2020] [Revised: 08/16/2020] [Accepted: 08/24/2020] [Indexed: 12/17/2022]
Abstract
Phylogenetic analysis based on sequence similarity that reflects real biological taxa is a major challenge. In this paper, we propose a novel alignment-free method, CoFASA (Codon Feature based Amino acid Sequence Analyser), for similarity analysis of nucleotide sequences. First, we assign numerical weights to the four nucleotides. We then calculate a score for each codon based on the numerical values of its constituent nucleotides, termed the degree of the codon. Accordingly, we obtain the degree of each amino acid from the degrees of the codons that encode it. Using the degrees of the twenty amino acids and their relative abundance within a given sequence, we generate 20-dimensional features for every coding DNA sequence or protein sequence. We use the features for performing phylogenetic analysis of the set of candidate sequences. We use multiple protein sequences derived from Beta-globin (BG), NADH dehydrogenase subunit 5 (ND5), Transferrins (TFs), Xylanases, low identity (<40%) and high identity (⩾40%) protein sequences (encompassing 533 and 1064 protein families) for experimental assessments. We compare our results with sixteen well-known methods, including both alignment-based and alignment-free methods. Various assessment indices are used, such as the Pearson correlation coefficient, RF (Robinson-Foulds) distance and ROC score for performance analysis. Compared with alignment-based methods (ClustalW, ClustalΩ, MAFFT, and MUSCLE), CoFASA shows very similar results. Further, CoFASA shows better performance in comparison to well-known alignment-free methods, including LZW-Kernel, jD2Stat, FFP, spaced, and AFKS-D2s, in predicting taxonomic relationships among candidate taxa. Overall, we observe that the features derived by CoFASA are very useful in separating the sequences according to their taxonomic labels. Our method is cost-effective and, at the same time, produces consistent and satisfactory outcomes.
13
|
The Nubeam reference-free approach to analyze metagenomic sequencing reads. Genome Res 2020; 30:1364-1375. [PMID: 32883749 PMCID: PMC7545149 DOI: 10.1101/gr.261750.120] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 08/26/2019] [Accepted: 07/30/2020] [Indexed: 01/04/2023]
Abstract
We present Nubeam (nucleotide be a matrix) as a novel reference-free approach to analyze short sequencing reads. Nubeam represents nucleotides by matrices, transforms a read into a product of matrices, and assigns numbers to reads based on the product matrix. Nubeam capitalizes on the noncommutative property of matrix multiplication, such that different reads are assigned different numbers and similar reads similar numbers. A sample, which is a collection of reads, becomes a collection of numbers that form an empirical distribution. We demonstrate that the genetic difference between samples can be quantified by the distance between empirical distributions. Nubeam includes the k-mer method as a special case, but unlike the k-mer method, it is convenient for Nubeam to account for GC bias and nucleotide quality. As a reference-free approach, Nubeam avoids reference bias and mapping bias, and can work with organisms without reference genomes. Thus, Nubeam is ideal to analyze data sets from metagenomics whole genome shotgun (WGS) sequencing, where the amount of unmapped reads is substantial. When applied to a WGS sequencing data set to quantify distances between metagenomics samples from various human body habitats, Nubeam recapitulates findings made by mapping-based methods and sheds light on contributions of unmapped reads. Nubeam is also useful in analyzing 16S rRNA sequencing data, which is a more prevalent type of data set in metagenomics studies. In our analysis, Nubeam recapitulated the findings that natural microbiota in mouse gut are resilient under challenges, and Nubeam detected differences in vaginal microbiota between cases of polycystic ovary syndrome and healthy controls.
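The idea of turning a read into a product of matrices can be sketched as below. This is an illustration under stated assumptions, not the Nubeam software: the two 2x2 matrices, the per-nucleotide binary indicator encoding, and the first-row sum used as the summary number are choices made here for demonstration, and the function names are hypothetical. The point is only that matrix multiplication is order-sensitive, so reads with the same composition but different order can receive different numbers.

```python
# Two non-commuting 2x2 integer matrices (assumed for illustration).
M_ONE = ((1, 1), (0, 1))   # used where the read matches the chosen base
M_ZERO = ((1, 0), (1, 1))  # used where it does not
IDENTITY = ((1, 0), (0, 1))


def matmul2(x, y):
    """Multiply two 2x2 matrices stored as nested tuples."""
    return (
        (x[0][0] * y[0][0] + x[0][1] * y[1][0], x[0][0] * y[0][1] + x[0][1] * y[1][1]),
        (x[1][0] * y[0][0] + x[1][1] * y[1][0], x[1][0] * y[0][1] + x[1][1] * y[1][1]),
    )


def indicator_product(read, base):
    """Product of matrices for the 0/1 indicator sequence of `base` in `read`."""
    result = IDENTITY
    for nucleotide in read:
        result = matmul2(result, M_ONE if nucleotide == base else M_ZERO)
    return result


def read_numbers(read):
    """One number per nucleotide: the first-row sum of each product matrix.

    The first-row sum is a hypothetical summary chosen because, unlike the
    trace, it is not invariant under cyclic shifts of the read.
    """
    return tuple(sum(indicator_product(read, base)[0]) for base in "ACGT")
```

For example, `read_numbers("AC")` and `read_numbers("CA")` differ even though both reads contain one A and one C. A sample of reads then becomes a collection of such tuples whose empirical distribution can be compared between samples.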
|
14
|
Murphy RG, Roddy AC, Srivastava S, Baena E, Waugh D, O’Sullivan JM, McArt DG, Jain S, LaBonte M. Prostate cancer heterogeneity assessment with multi-regional sampling and alignment-free methods. NAR Genom Bioinform 2020; 2:lqaa062. [PMID: 32856020 PMCID: PMC7440682 DOI: 10.1093/nargab/lqaa062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2020] [Revised: 07/16/2020] [Accepted: 08/05/2020] [Indexed: 11/14/2022] Open
Abstract
Combining alignment-free methods for phylogenetic analysis with multi-regional sampling using next-generation sequencing can provide an assessment of intra-patient tumour heterogeneity. From multi-regional sampling divergent branching, we validated two different lesions within a patient's prostate. Where multi-regional sampling has not been used, a single sample from one of these areas could misguide as to which drugs or therapies would best benefit this patient, due to the fact these tumours appear to be genetically different. This application has the power to render, in a fraction of the time used by other approaches, intra-patient heterogeneity and decipher aberrant biomarkers. Another alignment-free method for calling single-nucleotide variants from raw next-generation sequencing samples has determined possible variants and genomic locations that may be able to characterize the differences between the two main branching patterns. Alignment-free approaches have been applied to relevant clinical multi-regional samples and may be considered as a valuable option for comparing and determining heterogeneity to help deliver personalized medicine through more robust efforts in identifying targetable pathways and therapeutic strategies. Our study highlights the application these tools could have on patient-aligned treatment indications.
Affiliation(s)
- Ross G Murphy
- Movember FASTMAN Centre of Excellence, Patrick G Johnston Centre for Cancer Research, School of Medicine, Dentistry and Biomedical Sciences, Queen’s University Belfast, Belfast BT9 7AE, UK
- Aideen C Roddy
- Movember FASTMAN Centre of Excellence, Patrick G Johnston Centre for Cancer Research, School of Medicine, Dentistry and Biomedical Sciences, Queen’s University Belfast, Belfast BT9 7AE, UK
- Shambhavi Srivastava
- Movember FASTMAN Centre of Excellence, Patrick G Johnston Centre for Cancer Research, School of Medicine, Dentistry and Biomedical Sciences, Queen’s University Belfast, Belfast BT9 7AE, UK
- Molecular Oncology, Cancer Research UK Manchester Institute, The University of Manchester, Alderley Park SK10 4TG, UK
- Belfast–Manchester Movember Centre of Excellence, Cancer Research UK Manchester Institute, The University of Manchester, Alderley Park SK10 4TG, UK
- Esther Baena
- Belfast–Manchester Movember Centre of Excellence, Cancer Research UK Manchester Institute, The University of Manchester, Alderley Park SK10 4TG, UK
- Prostate Oncobiology, Cancer Research UK Manchester Institute, The University of Manchester, Alderley Park SK10 4TG, UK
- David J Waugh
- Movember FASTMAN Centre of Excellence, Patrick G Johnston Centre for Cancer Research, School of Medicine, Dentistry and Biomedical Sciences, Queen’s University Belfast, Belfast BT9 7AE, UK
- School of Biomedical Sciences, Faculty of Health, Queensland University of Technology, Brisbane, Queensland, QLD 4000, Australia
- Joe M. O’Sullivan
- Movember FASTMAN Centre of Excellence, Patrick G Johnston Centre for Cancer Research, School of Medicine, Dentistry and Biomedical Sciences, Queen’s University Belfast, Belfast BT9 7AE, UK
- Northern Ireland Cancer Centre, Belfast Health & Social Care Trust, Belfast BT9 7JL, UK
- Darragh G McArt
- Movember FASTMAN Centre of Excellence, Patrick G Johnston Centre for Cancer Research, School of Medicine, Dentistry and Biomedical Sciences, Queen’s University Belfast, Belfast BT9 7AE, UK
- Suneil Jain
- Movember FASTMAN Centre of Excellence, Patrick G Johnston Centre for Cancer Research, School of Medicine, Dentistry and Biomedical Sciences, Queen’s University Belfast, Belfast BT9 7AE, UK
- Northern Ireland Cancer Centre, Belfast Health & Social Care Trust, Belfast BT9 7JL, UK
- Melissa J LaBonte
- Movember FASTMAN Centre of Excellence, Patrick G Johnston Centre for Cancer Research, School of Medicine, Dentistry and Biomedical Sciences, Queen’s University Belfast, Belfast BT9 7AE, UK
|
15
|
Brock DA, Noh S, Hubert AN, Haselkorn TS, DiSalvo S, Suess MK, Bradley AS, Tavakoli-Nezhad M, Geist KS, Queller DC, Strassmann JE. Endosymbiotic adaptations in three new bacterial species associated with Dictyostelium discoideum: Paraburkholderia agricolaris sp. nov., Paraburkholderia hayleyella sp. nov., and Paraburkholderia bonniea sp. nov. PeerJ 2020; 8:e9151. [PMID: 32509456 PMCID: PMC7247526 DOI: 10.7717/peerj.9151] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 02/20/2019] [Accepted: 04/17/2020] [Indexed: 12/24/2022] Open
Abstract
Here we give names to three new species of Paraburkholderia that can remain in symbiosis indefinitely in the spores of a soil dwelling eukaryote, Dictyostelium discoideum. The new species P. agricolaris sp. nov., P. hayleyella sp. nov., and P. bonniea sp. nov. are widespread across the eastern USA and were isolated as internal symbionts of wild-collected D. discoideum. We describe these sp. nov. using several approaches. Evidence that they are each a distinct new species comes from their phylogenetic position, average nucleotide identity, genome-genome distance, carbon usage, reduced length, cooler optimal growth temperature, metabolic tests, and their previously described ability to invade D. discoideum amoebae and form a symbiotic relationship. All three of these new species facilitate the prolonged carriage of food bacteria by D. discoideum, though they themselves are not food. Further studies of the interactions of these three new species with D. discoideum should be fruitful for understanding the ecology and evolution of symbioses.
Affiliation(s)
- Debra A. Brock
- Department of Biology, Washington University in St. Louis, St Louis, MO, United States of America
- Suegene Noh
- Department of Biology, Colby College, Waterville, ME, United States of America
- Alicia N.M. Hubert
- Department of Biology, Washington University in St. Louis, St Louis, MO, United States of America
- Tamara S. Haselkorn
- Department of Biology, University of Central Arkansas, Conway, AR, United States of America
- Susanne DiSalvo
- Department of Biological Sciences, Southern Illinois University at Edwardsville, Edwardsville, IL, United States of America
- Melanie K. Suess
- Department of Earth and Planetary Sciences, Washington University in St. Louis, St Louis, MO, United States of America
- Alexander S. Bradley
- Department of Earth and Planetary Sciences, Division of Biology and Biomedical Sciences, Washington University in St. Louis, St Louis, MO, United States of America
- Katherine S. Geist
- Department of Biology, Washington University in St. Louis, St Louis, MO, United States of America
- David C. Queller
- Department of Biology, Washington University in St. Louis, St Louis, MO, United States of America
- Joan E. Strassmann
- Department of Biology, Washington University in St. Louis, St Louis, MO, United States of America
|
16
|
Miller JB, McKinnon LM, Whiting MF, Kauwe JSK, Ridge PG. Codon Pairs are Phylogenetically Conserved: A comprehensive analysis of codon pairing conservation across the Tree of Life. PLoS One 2020; 15:e0232260. [PMID: 32401752 PMCID: PMC7219770 DOI: 10.1371/journal.pone.0232260] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 08/28/2019] [Accepted: 04/10/2020] [Indexed: 11/27/2022] Open
Abstract
Identical codon pairing and co-tRNA codon pairing increase translational efficiency within genes when two codons that encode the same amino acid are translated by the same tRNA before it diffuses from the ribosome. We examine the phylogenetic signal in both identical and co-tRNA codon pairing across 23 428 species using alignment-free and parsimony methods. We determined that conserved codon pairing typically has a smaller window size than the length of a ribosome, and codon pairing tracks phylogenies across various taxonomic groups. We report a comprehensive analysis of codon pairing, including the extent to which each codon pairs. Our parsimony method generally recovers phylogenies that are more congruent with the established phylogenies than our alignment-free method. However, four of the ten taxonomic groups did not have sufficient orthologous codon pairings and were therefore analyzed using only the alignment-free methods. Since the recovered phylogenies using only codon pairing largely match phylogenies from the Open Tree of Life and the NCBI taxonomy, and are comparable to trees recovered by other algorithms, we propose that codon pairing biases are phylogenetically conserved and should be considered in conjunction with other phylogenomic techniques.
Affiliation(s)
- Justin B. Miller
- Department of Biology, Brigham Young University, Provo, UT, United States of America
- Lauren M. McKinnon
- Department of Biology, Brigham Young University, Provo, UT, United States of America
- Michael F. Whiting
- Department of Biology, Brigham Young University, Provo, UT, United States of America
- M.L. Bean Museum, Brigham Young University, Provo, UT, United States of America
- John S. K. Kauwe
- Department of Biology, Brigham Young University, Provo, UT, United States of America
- Perry G. Ridge
- Department of Biology, Brigham Young University, Provo, UT, United States of America
|
17
|
Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA. Alignment-free inference of hierarchical and reticulate phylogenomic relationships. Brief Bioinform 2019; 20:426-435. [PMID: 28673025 PMCID: PMC6433738 DOI: 10.1093/bib/bbx067] [Citation(s) in RCA: 60] [Impact Index Per Article: 10.0] [Received: 02/08/2017] [Revised: 05/04/2017] [Indexed: 11/22/2022] Open
Abstract
We are amidst an ongoing flood of sequence data arising from the application of high-throughput technologies, and a concomitant fundamental revision in our understanding of how genomes evolve individually and within the biosphere. Workflows for phylogenomic inference must accommodate data that are not only much larger than before, but often more error prone and perhaps misassembled, or not assembled in the first place. Moreover, genomes of microbes, viruses and plasmids evolve not only by tree-like descent with modification but also by incorporating stretches of exogenous DNA. Thus, next-generation phylogenomics must address computational scalability while rethinking the nature of orthogroups, the alignment of multiple sequences and the inference and comparison of trees. New phylogenomic workflows have begun to take shape based on so-called alignment-free (AF) approaches. Here, we review the conceptual foundations of AF phylogenetics for the hierarchical (vertical) and reticulate (lateral) components of genome evolution, focusing on methods based on k-mers. We reflect on what seems to be successful, and on where further development is needed.
|
18
|
Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, Comin M, Kim SH, Vinga S, Almeida JS, Chan CX, James BT, Sun F, Morgenstern B, Karlowski WM. Benchmarking of alignment-free sequence comparison methods. Genome Biol 2019; 20:144. [PMID: 31345254 PMCID: PMC6659240 DOI: 10.1186/s13059-019-1755-7] [Citation(s) in RCA: 113] [Impact Index Per Article: 18.8] [Received: 05/16/2019] [Accepted: 07/03/2019] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS: Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION: The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
- Hani Z Girgis
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
- Chris-Andre Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Kujin Tang
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Anna Katharina Lau
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Sophie Röhling
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Jae Jin Choi
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Michael S Waterman
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
- Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
- Sung-Hou Kim
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Susana Vinga
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- Jonas S Almeida
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NIH/NCI), Bethesda, USA
- Cheong Xin Chan
- Institute for Molecular Bioscience, and School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
- Benjamin T James
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
- Fengzhu Sun
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
- Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
|
19
|
Criscuolo A. A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies. Research Ideas and Outcomes 2019. [DOI: 10.3897/rio.5.e36178] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Indexed: 12/26/2022] Open
Abstract
This paper describes a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools. For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution. These pairwise evolutionary distances are then used to infer a phylogenetic tree and assess a confidence support for each internal branch. Analyses of both simulated and real genome datasets show that this bioinformatics procedure allows accurate phylogenetic trees to be reconstructed with fast running times, especially when launched on multiple threads. Implemented in a publicly available script, named JolyTree, this procedure is a useful approach for quickly inferring species trees without the burden and potential biases of multiple sequence alignments.
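The step of turning a pairwise dissimilarity into an estimated number of substitutions can be illustrated with one widely used transformation, the Mash distance, which converts a Jaccard index j of two k-mer sets into D = -(1/k) ln(2j / (1 + j)). The abstract does not state which dissimilarity or transformation the procedure uses, so treating the Mash formula as the example is an assumption made here, and the helper names are hypothetical.

```python
import math


def kmer_set(sequence, k):
    """All distinct k-mers of a sequence (exact sets, no sketching)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two k-mer sets."""
    union = a | b
    if not union:
        raise ValueError("both k-mer sets are empty")
    return len(a & b) / len(union)


def mash_style_distance(jaccard_index, k):
    """Mash distance: -(1/k) * ln(2j / (1 + j)), capped at 1 when j is 0."""
    if not 0.0 <= jaccard_index <= 1.0:
        raise ValueError("Jaccard index must lie in [0, 1]")
    if jaccard_index == 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * jaccard_index / (1.0 + jaccard_index)) / k)
```

A matrix of such pairwise distances is the kind of input a distance-based tree-building method expects.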
|
20
|
Guo Y, Cooper MM, Bromberg R, Marletta MA. A Dual-H-NOX Signaling System in Saccharophagus degradans. Biochemistry 2018; 57:6570-6580. [PMID: 30398342 DOI: 10.1021/acs.biochem.8b01058] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/29/2022]
Abstract
Nitric oxide (NO) is a critical signaling molecule involved in the regulation of a wide variety of physiological processes across every domain of life. In most aerobic and facultative anaerobic bacteria, heme-nitric oxide/oxygen binding (H-NOX) proteins selectively sense NO and inhibit the activity of a histidine kinase (HK) located on the same operon. This NO-dependent inhibition of the cognate HK alters the phosphorylation of the downstream response regulators. In the marine bacterium Saccharophagus degradans (Sde), in addition to a typical H-NOX (Sde 3804)/HK (Sde 3803) pair, an orphan H-NOX (Sde 3557) with no associated signaling protein has been identified distant from the H-NOX/HK pair in the genome. The characterization reported here elucidates the function of both H-NOX proteins. Sde 3557 exhibits a weaker binding affinity with the kinase, yet both Sde 3804 and Sde 3557 are functional H-NOXs with proper gas binding properties and kinase inhibition activity. Additionally, Sde 3557 has an NO dissociation rate that is significantly slower than that of Sde 3804, which may confer prolonged kinase inhibition in vivo. While it is still unclear whether Sde 3557 has another signaling partner or shares the histidine kinase with Sde 3804, Sde 3557 is the only orphan H-NOX characterized to date. S. degradans is likely using a dual-H-NOX system to fine-tune the downstream response of NO signaling.
Affiliation(s)
- Yirui Guo
- California Institute for Quantitative Biosciences, University of California, Berkeley, Berkeley, California 94720, United States
- Matthew M Cooper
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, California 94720, United States
- Raquel Bromberg
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas 75390, United States
- Michael A Marletta
- California Institute for Quantitative Biosciences, University of California, Berkeley, Berkeley, California 94720, United States
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, California 94720, United States
- Department of Chemistry, University of California, Berkeley, Berkeley, California 94720, United States
|
21
|
Bernard G, Greenfield P, Ragan MA, Chan CX. k-mer Similarity, Networks of Microbial Genomes, and Taxonomic Rank. mSystems 2018; 3:e00257-18. [PMID: 30505941 PMCID: PMC6247013 DOI: 10.1128/msystems.00257-18] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 10/12/2018] [Accepted: 11/02/2018] [Indexed: 01/27/2023] Open
Abstract
Microbial genomes have been shaped by parent-to-offspring (vertical) descent and lateral genetic transfer. These processes can be distinguished by alignment-based inference and comparison of phylogenetic trees for individual gene families, but this approach is not scalable to whole-genome sequences, and a tree-like structure does not adequately capture how these processes impact microbial physiology. Here we adopted alignment-free approaches based on k-mer statistics to infer phylogenomic networks involving 2,783 completely sequenced bacterial and archaeal genomes and compared the contributions of rRNA, protein-coding, and plasmid sequences to these networks. Our results show that the phylogenomic signal arising from ribosomal RNAs is strong and extends broadly across all taxa, whereas that from plasmids is strong but restricted to closely related groups, particularly Proteobacteria. However, the signal from the other chromosomal regions is restricted in breadth. We show that mean k-mer similarity can correlate with taxonomic rank. We also link the implicated k-mers to genome annotation (thus, functions) and define core k-mers (thus, core functions) in specific phyletic groups. Highly conserved functions in most phyla include amino acid metabolism and transport as well as energy production and conversion. Intracellular trafficking and secretion are the most prominent core functions among Spirochaetes, whereas energy production and conversion are not highly conserved among the largely parasitic or commensal Tenericutes. These observations suggest that differential conservation of functions relates to niche specialization and evolutionary diversification of microbes. Our results demonstrate that k-mer approaches can be used to efficiently identify phylogenomic signals and conserved core functions at the multigenome scale. IMPORTANCE Genome evolution of microbes involves parent-to-offspring descent, and lateral genetic transfer that convolutes the phylogenomic signal. 
This study investigated phylogenomic signals among thousands of microbial genomes based on short subsequences without using multiple-sequence alignment. The signal from ribosomal RNAs is strong across all taxa, and the signal of plasmids is strong only in closely related groups, particularly Proteobacteria. However, the signal from other chromosomal regions (∼99% of the genomes) is remarkably restricted in breadth. The similarity of subsequences is found to correlate with taxonomic rank and informs on conserved and differential core functions relative to niche specialization and evolutionary diversification of microbes. These results provide a comprehensive, alignment-free view of microbial genome evolution as a network, beyond a tree-like structure.
Affiliation(s)
- Guillaume Bernard
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- Paul Greenfield
- Commonwealth Scientific and Industrial Research Organisation (CSIRO), North Ryde, NSW, Australia
- Mark A. Ragan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- Cheong Xin Chan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
|
22
|
Kalesinskas L, Cudone E, Fofanov Y, Putonti C. S-plot2: Rapid Visual and Statistical Analysis of Genomic Sequences. Evol Bioinform Online 2018; 14:1176934318797354. [PMID: 30245567 PMCID: PMC6144591 DOI: 10.1177/1176934318797354] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 04/29/2018] [Accepted: 08/08/2018] [Indexed: 12/12/2022] Open
Abstract
With the daily release of data from whole genome sequencing projects, tools to facilitate comparative studies are hard-pressed to keep pace. Graphical software solutions can readily recognize synteny by measuring similarities between sequences. Nevertheless, regions of dissimilarity can prove to be equally informative; these regions may harbor genes acquired via lateral gene transfer (LGT), signify gene loss or gain, or include coding regions under strong selection. Previously, we developed the software S-plot. This tool employed an alignment-free approach for comparing bacterial genomes and generated a heatmap representing the genomes’ similarities and dissimilarities in nucleotide usage. In prior studies, this tool proved valuable in identifying genome rearrangements as well as exogenous sequences acquired via LGT in several bacterial species. Herein, we present the next generation of this tool, S-plot2. Similar to its predecessor, S-plot2 creates an interactive, 2-dimensional heatmap capturing the similarities and dissimilarities in nucleotide usage between genomic sequences (partial or complete). This new version, however, includes additional metrics for analysis, new reporting options, and integrated BLAST query functionality for the user to interrogate regions of interest. Furthermore, S-plot2 can evaluate larger sequences, including whole eukaryotic chromosomes. To illustrate some of the applications of the tool, 2 case studies are presented. The first examines strain-specific variation across the Pseudomonas aeruginosa genome and strain-specific LGT events. In the second case study, corresponding human, chimpanzee, and rhesus macaque autosomes were studied and lineage specific contributions to divergence were estimated. S-plot2 provides a means to both visually and quantitatively compare nucleotide sequences, from microbial genomes to eukaryotic chromosomes. 
The case studies presented illustrate just 2 potential applications of the tool, highlighting its capability to identify and investigate the variation in molecular divergence rates across sequences. S-plot2 is freely available through https://bitbucket.org/lkalesinskas/splot and is supported on the Linux and MS Windows operating systems.
Affiliation(s)
- Laurynas Kalesinskas
- Bioinformatics Program, Loyola University Chicago, Chicago, IL, USA
- Department of Biology, Loyola University Chicago, Chicago, IL, USA
- Evan Cudone
- Bioinformatics Program, Loyola University Chicago, Chicago, IL, USA
- Department of Mathematics and Statistics, Loyola University Chicago, Chicago, IL, USA
- Yuriy Fofanov
- Department of Pharmacology and Toxicology, The University of Texas Medical Branch at Galveston, Galveston, TX, USA
- Catherine Putonti
- Bioinformatics Program, Loyola University Chicago, Chicago, IL, USA
- Department of Biology, Loyola University Chicago, Chicago, IL, USA
- Department of Computer Science, Loyola University Chicago, Chicago, IL, USA
|
23
|
Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F. Alignment-Free Sequence Analysis and Applications. Annu Rev Biomed Data Sci 2018; 1:93-114. [PMID: 31828235 PMCID: PMC6905628 DOI: 10.1146/annurev-biodatasci-080917-013431] [Citation(s) in RCA: 58] [Impact Index Per Article: 8.3] [Indexed: 01/06/2023]
Abstract
Genome and metagenome comparisons based on large amounts of next generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are generally computationally efficient. Thus, they contribute significantly to genome and metagenome comparison. Recently, novel statistical approaches have been developed for the comparison of both long and shotgun sequences. These approaches have been applied to many problems including the comparison of gene regulatory regions, genome sequences, metagenomes, binning contigs in metagenomic data, identification of virus-host interactions, and detection of horizontal gene transfers. We provide an updated review of these applications and other related developments of word-count based approaches for alignment-free sequence analysis.
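A word-count comparison of the kind this review covers can be sketched with the classical D2 statistic, the inner product of two k-mer count vectors. The cosine-style normalization below is a choice made here for illustration, and the function names are hypothetical; the review itself discusses more refined statistics that this sketch does not implement.

```python
import math
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping word of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def d2(x, y):
    """D2 statistic: sum over shared words of the product of their counts."""
    return sum(x[word] * y[word] for word in x.keys() & y.keys())


def d2_dissimilarity(seq_a, seq_b, k):
    """1 minus the cosine of the two count vectors (illustrative normalization)."""
    x, y = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    norm = math.sqrt(d2(x, x) * d2(y, y))
    if norm == 0:
        raise ValueError("a sequence is shorter than k")
    return 1.0 - d2(x, y) / norm
```

Because only word counts are needed, the same functions apply to assembled sequences and to pooled reads alike, which is the property that makes such statistics attractive for NGS data.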
Affiliation(s)
- Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Xin Bai
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| | - Yang Young Lu
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Kujin Tang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, Fujian, China
| | - Gesine Reinert
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
|
24
|
Curran DM, Gilleard JS, Wasmuth JD. MIPhy: identify and quantify rapidly evolving members of large gene families. PeerJ 2018; 6:e4873. [PMID: 29868279 PMCID: PMC5983006 DOI: 10.7717/peerj.4873] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 02/28/2018] [Accepted: 05/10/2018] [Indexed: 11/20/2022] Open
Abstract
After transitioning to a new environment, species often exhibit rapid phenotypic innovation. One of the fastest mechanisms for this is duplication followed by specialization of existing genes. When this happens to a member of a gene family, it tends to leave a detectable phylogenetic signature of lineage-specific expansions and contractions. These can be identified by analyzing the gene family across several species and identifying patterns of gene duplication and loss that do not correlate with the known relationships between those species. This signature, termed phylogenetic instability, has been previously linked to adaptations that change the way an organism samples and responds to its environment; conversely, low phylogenetic instability has been previously linked to proteins with endogenous functions. With the increase in genome-level data, there is a need to identify and quantify phylogenetic instability. Here, we present Minimizing Instability in Phylogenetics (MIPhy), a tool that solves this problem by quantifying the incongruence of a gene's evolutionary history. The motivation behind MIPhy was to produce a tool to aid in interpreting phylogenetic trees. It can predict which members of a gene family are under adaptive evolution, working only from a gene tree and the relationship between the species under consideration. While it does not conduct any estimation of positive selection, which is the typical indication of adaptive evolution, the results tend to agree. We demonstrate the usefulness of MIPhy by accurately predicting which members of the mammalian cytochrome P450 gene superfamily metabolize xenobiotics and which metabolize endogenous compounds. Our predictions correlate very well with known substrate specificities of the human enzymes. We also analyze the Caenorhabditis collagen gene family and use MIPhy to predict genes that produce an observable phenotype when knocked down in C. elegans, and show that our predictions correlate well with existing knowledge. The software can be downloaded and installed from https://github.com/dave-the-scientist/miphy and is also available as an online web tool at http://www.miphy.wasmuthlab.org.
Affiliation(s)
- David M. Curran
- Department of Ecosystem and Public Health, Faculty of Veterinary Medicine, University of Calgary, Calgary, AB, Canada
| | - John S. Gilleard
- Department of Comparative Biology and Experimental Medicine, Faculty of Veterinary Medicine, University of Calgary, Calgary, AB, Canada
| | - James D. Wasmuth
- Department of Ecosystem and Public Health, Faculty of Veterinary Medicine, University of Calgary, Calgary, AB, Canada
|
25
|
Sanderson MJ, Nicolae M, McMahon MM. Homology-Aware Phylogenomics at Gigabase Scales. Syst Biol 2018; 66:590-603. [PMID: 28123115 PMCID: PMC5790135 DOI: 10.1093/sysbio/syw104] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 05/19/2016] [Accepted: 11/25/2016] [Indexed: 11/13/2022] Open
Abstract
Obstacles to inferring species trees from whole genome data sets range from algorithmic and data management challenges to the wholesale discordance in evolutionary history found in different parts of a genome. Recent work that builds trees directly from genomes by parsing them into sets of small $k$-mer strings holds promise to streamline and simplify these efforts, but existing approaches do not account well for gene tree discordance. We describe a "seed and extend" protocol that finds nearly exact matching sets of orthologous $k$-mers and extends them to construct data sets that can properly account for genomic heterogeneity. Exploiting an efficient suffix array data structure, sets of whole genomes can be parsed and converted into phylogenetic data matrices rapidly, with contiguous blocks of $k$-mers from the same chromosome, gene, or scaffold concatenated as needed. Phylogenetic trees constructed from highly curated rice genome data and a diverse set of six other eukaryotic whole genome, transcriptome, and organellar genome data sets recovered trees nearly identical to published phylogenomic analyses, in a small fraction of the time, and requiring many fewer parameter choices. Our method's ability to retain local homology information was demonstrated by using it to characterize gene tree discordance across the rice genome, and by its robustness to the high rate of interchromosomal gene transfer found in several rice species.
Affiliation(s)
- M J Sanderson
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA
| | - Marius Nicolae
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
| | - M M McMahon
- School of Plant Sciences, University of Arizona, Tucson, AZ 85721, USA
|
26
|
Abstract
BACKGROUND Building evolutionary trees for massive unaligned DNA sequences is challenging and crucial. However, reconstructing an evolutionary tree for ultra-large sequences is hard. Massive multiple sequence alignment is also challenging and costly in time and space. Hadoop and Spark, developed recently, offer new hope for these classical computational biology problems. In this paper, we tried to solve multiple sequence alignment and evolutionary reconstruction in parallel. RESULTS HPTree, which is developed in this paper, can deal with big DNA sequence files quickly. It works well on files larger than 1 GB and performs better than other evolutionary reconstruction tools. Users can use HPTree to reconstruct evolutionary trees on computer clusters or cloud platforms (e.g., Amazon Cloud). HPTree could help with population evolution research and metagenomics analysis. CONCLUSIONS In this paper, we employ the Hadoop and Spark platforms and design an evolutionary tree reconstruction software tool for unaligned massive DNA sequences. Clustering and multiple sequence alignment are done in parallel. The neighbour-joining model was employed for building the evolutionary tree. We released our software together with its source code via http://lab.malab.cn/soft/HPtree/.
Affiliation(s)
- Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, People's Republic of China
- Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen, China
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
| | - Shixiang Wan
- School of Computer Science and Technology, Tianjin University, Tianjin, People's Republic of China
| | - Xiangxiang Zeng
- Department of Computer Science, Xiamen University, Xiamen, China.
| | - Zhanshan Sam Ma
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China.
|
27
|
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017; 18:186. [PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7] [Citation(s) in RCA: 285] [Impact Index Per Article: 35.6] [Indexed: 01/30/2023] Open
Abstract
Alignment-free sequence analyses have been applied to problems ranging from whole-genome phylogeny to the classification of protein families, identification of horizontally transferred genes, and detection of recombined sequences. The strength of these methods makes them particularly useful for next-generation sequencing data processing and analysis. However, many researchers are unclear about how these methods work, how they compare to alignment-based methods, and what their potential is for use for their research. We address these questions and provide a guide to the currently available alignment-free sequence analysis tools.
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
| | - Susana Vinga
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
| | - Jonas Almeida
- Stony Brook University (SUNY), 101 Nicolls Road, Stony Brook, NY, 11794, USA
| | - Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland.
|
28
|
Nojoomi S, Koehl P. String kernels for protein sequence comparisons: improved fold recognition. BMC Bioinformatics 2017; 18:137. [PMID: 28245816 PMCID: PMC5331664 DOI: 10.1186/s12859-017-1560-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/21/2016] [Accepted: 02/23/2017] [Indexed: 11/28/2022] Open
Abstract
BACKGROUND The amino acid sequence of a protein is the blueprint from which its structure and ultimately function can be derived. Therefore, sequence comparison methods remain essential for the determination of similarity between proteins. Traditional approaches for comparing two protein sequences begin with strings of letters (amino acids) that represent the sequences, before generating textual alignments between these strings and providing scores for each alignment. When the similitude between the two protein sequences to be compared is low however, the quality of the corresponding sequence alignment is usually poor, leading to poor performance for the recognition of similarity. RESULTS In this study, we develop an alignment free alternative to these methods that is based on the concept of string kernels. Starting from recently proposed kernels on the discrete space of protein sequences (Shen et al, Found. Comput. Math., 2013,14:951-984), we introduce our own version, SeqKernel. Its implementation depends on two parameters, a coefficient that tunes the substitution matrix and the maximum length of k-mers that it includes. We provide an exhaustive analysis of the impacts of these two parameters on the performance of SeqKernel for fold recognition. We show that with the right choice of parameters, use of the SeqKernel similarity measure improves fold recognition compared to the use of traditional alignment-based methods. We illustrate the application of SeqKernel to inferring phylogeny on RNA polymerases and show that it performs as well as methods based on multiple sequence alignments. CONCLUSION We have presented and characterized a new alignment free method based on a mathematical kernel for scoring the similarity of protein sequences. We discuss possible improvements of this method, as well as an extension of its applications to other modeling methods that rely on sequence comparison.
Affiliation(s)
- Saghi Nojoomi
- Biotechnology program, University of California, Davis, 1, Shields Avenue, Davis, CA, 95616 USA
| | - Patrice Koehl
- Department of Computer Science and Genome Center, 1, Shields Avenue, Davis, CA, 95616 USA
|
29
|
Cong Y, Chan YB, Phillips CA, Langston MA, Ragan MA. Robust Inference of Genetic Exchange Communities from Microbial Genomes Using TF-IDF. Front Microbiol 2017; 8:21. [PMID: 28154557 PMCID: PMC5243798 DOI: 10.3389/fmicb.2017.00021] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 10/25/2016] [Accepted: 01/04/2017] [Indexed: 11/13/2022] Open
Abstract
Bacteria and archaea can exchange genetic material across lineages through processes of lateral genetic transfer (LGT). Collectively, these exchange relationships can be modeled as a network and analyzed using concepts from graph theory. In particular, densely connected regions within an LGT network have been defined as genetic exchange communities (GECs). However, it has been problematic to construct networks in which edges solely represent LGT. Here we apply term frequency-inverse document frequency (TF-IDF), an alignment-free method originating from document analysis, to infer regions of lateral origin in bacterial genomes. We examine four empirical datasets of different size (number of genomes) and phyletic breadth, varying a key parameter (word length k) within bounds established in previous work. We map the inferred lateral regions to genes in recipient genomes, and construct networks in which the nodes are groups of genomes, and the edges natively represent LGT. We then extract maximum and maximal cliques (i.e., GECs) from these graphs, and identify nodes that belong to GECs across a wide range of k. Most surviving lateral transfer has happened within these GECs. Using Gene Ontology enrichment tests we demonstrate that biological processes associated with metabolism, regulation and transport are often over-represented among the genes affected by LGT within these communities. These enrichments are largely robust to change of k.
Affiliation(s)
- Yingnan Cong
- Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, University of Queensland, St Lucia QLD, Australia
| | - Yao-Ban Chan
- School of Mathematics and Statistics, University of Melbourne, Parkville VIC, Australia
| | - Charles A Phillips
- Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville TN, USA
| | - Michael A Langston
- Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville TN, USA
| | - Mark A Ragan
- Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, University of Queensland, St Lucia QLD, Australia
|
30
|
Chan CX, Beiko RG, Ragan MA. Scaling Up the Phylogenetic Detection of Lateral Gene Transfer Events. Methods Mol Biol 2017; 1525:421-432. [PMID: 27896730 DOI: 10.1007/978-1-4939-6622-6_16] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/25/2023]
Abstract
Lateral genetic transfer (LGT) is the process by which genetic material moves between organisms (and viruses) in the biosphere. Among the many approaches developed for the inference of LGT events from DNA sequence data, methods based on the comparison of phylogenetic trees remain the gold standard for many types of problem. Identifying LGT events from sequenced genomes typically involves a series of steps in which homologous sequences are identified and aligned, phylogenetic trees are inferred, and their topologies are compared to identify unexpected or conflicting relationships. These types of approach have been used to elucidate the nature and extent of LGT and its physiological and ecological consequences throughout the Tree of Life. Advances in DNA sequencing technology have led to enormous increases in the number of sequenced genomes, including ultra-deep sampling of specific taxonomic groups and single cell-based sequencing of unculturable "microbial dark matter." Environmental shotgun sequencing enables the study of LGT among organisms that share the same habitat. This abundance of genomic data offers new opportunities for scientific discovery, but poses two key problems. As ever more genomes are generated, the assembly and annotation of each individual genome receives less scrutiny; and with so many genomes available it is tempting to include them all in a single analysis, but thousands of genomes and millions of genes can overwhelm key algorithms in the analysis pipeline. Identifying LGT events of interest therefore depends on choosing the right dataset, and on algorithms that appropriately balance speed and accuracy given the size and composition of the chosen set of genomes.
Affiliation(s)
- Cheong Xin Chan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, 4072, Australia
| | - Robert G Beiko
- Faculty of Computer Science, Dalhousie University, Halifax, NS, B3H 4R2, Canada
| | - Mark A Ragan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, 4072, Australia.
|
31
|
Abstract
Ernst Haeckel based his landmark Tree of Life on the supposed ontogenic recapitulation of phylogeny, i.e. that successive embryonic stages during the development of an organism re-trace the morphological forms of its ancestors over the course of evolution. Much of this idea has since been discredited. Today, phylogenies are often based on families of molecular sequences. The standard approach starts with a multiple sequence alignment, in which the sequences are arranged relative to each other in a way that maximises a measure of similarity position-by-position along their entire length. A tree (or sometimes a network) is then inferred. Rigorous multiple sequence alignment is computationally demanding, and evolutionary processes that shape the genomes of many microbes (bacteria, archaea and some morphologically simple eukaryotes) can add further complications. In particular, recombination, genome rearrangement and lateral genetic transfer undermine the assumptions that underlie multiple sequence alignment, and imply that a tree-like structure may be too simplistic. Here, using genome sequences of 143 bacterial and archaeal genomes, we construct a network of phylogenetic relatedness based on the number of shared k-mers (subsequences at fixed length k). Our findings suggest that the network captures not only key aspects of microbial genome evolution as inferred from a tree, but also features that are not treelike. The method is highly scalable, allowing for investigation of genome evolution across a large number of genomes. Instead of using specific regions or sequences from genome sequences, or indeed Haeckel’s idea of ontogeny, we argue that genome phylogenies can be inferred using k-mers from whole-genome sequences. Representing these networks dynamically allows biological questions of interest to be formulated and addressed quickly and in a visually intuitive manner.
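A network built from shared k-mers can be sketched in a few lines of Python. The sketch below counts the distinct k-mers two sequences have in common and keeps an edge when that count reaches a threshold. The names `kmer_set` and `shared_kmer_network`, the `min_shared` parameter, and the toy sequences are hypothetical, and the abstract does not say how the authors weight or filter edges, so this is only an illustration of the general idea.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_network(genomes, k, min_shared=1):
    """Map each genome pair to its number of shared distinct k-mers.

    Pairs sharing fewer than min_shared k-mers get no edge.
    """
    sets = {name: kmer_set(seq, k) for name, seq in genomes.items()}
    edges = {}
    for a, b in combinations(sorted(sets), 2):
        shared = len(sets[a] & sets[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges


# Toy sequences, purely hypothetical.
genomes = {"g1": "ACGTACGT", "g2": "ACGTTTTT", "g3": "GGGGGGGG"}
edges = shared_kmer_network(genomes, k=3)
```

Raising `min_shared` prunes weak edges, which is one simple way to explore how the network changes with the stringency of the relatedness criterion.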
Affiliation(s)
- Guillaume Bernard
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
| | - Mark A Ragan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
| | - Cheong Xin Chan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
|
32
|
CYP101J2, CYP101J3, and CYP101J4, 1,8-Cineole-Hydroxylating Cytochrome P450 Monooxygenases from Sphingobium yanoikuyae Strain B2. Appl Environ Microbiol 2016; 82:6507-6517. [PMID: 27590809 DOI: 10.1128/aem.02067-16] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 07/10/2016] [Accepted: 08/12/2016] [Indexed: 01/21/2023] Open
Abstract
We report the isolation and characterization of three new cytochrome P450 monooxygenases: CYP101J2, CYP101J3, and CYP101J4. These P450s were derived from Sphingobium yanoikuyae B2, a strain that was isolated from activated sludge based on its ability to fully mineralize 1,8-cineole. Genome sequencing of this strain in combination with purification of native 1,8-cineole-binding proteins enabled identification of 1,8-cineole-binding P450s. The P450 enzymes were cloned, heterologously expressed (N-terminally His6 tagged) in Escherichia coli BL21(DE3), purified, and spectroscopically characterized. Recombinant whole-cell biotransformation in E. coli demonstrated that all three P450s hydroxylate 1,8-cineole using electron transport partners from E. coli to yield a product putatively identified as (1S)-2α-hydroxy-1,8-cineole or (1R)-6α-hydroxy-1,8-cineole. The new P450s belong to the CYP101 family and share 47% and 44% identity with other 1,8-cineole-hydroxylating members found in Novosphingobium aromaticivorans and Pseudomonas putida. Compared to P450cin (CYP176A1), a 1,8-cineole-hydroxylating P450 from Citrobacter braakii, these enzymes share less than 30% amino acid sequence identity and hydroxylate 1,8-cineole in a different orientation. Expansion of the enzyme toolbox for modification of 1,8-cineole creates a starting point for use of hydroxylated derivatives in a range of industrial applications. IMPORTANCE CYP101J2, CYP101J3, and CYP101J4 are cytochrome P450 monooxygenases from S. yanoikuyae B2 that hydroxylate the monoterpenoid 1,8-cineole. These enzymes not only play an important role in microbial degradation of this plant-based chemical but also provide an interesting route to synthesize oxygenated 1,8-cineole derivatives for applications as natural flavor and fragrance precursors or incorporation into polymers. The P450 cytochromes also provide an interesting basis from which to compare other enzymes with a similar function and expand the CYP101 family. This could eventually provide enough bacterial parental enzymes with similar amino acid sequences to enable in vitro evolution via DNA shuffling.
|
33
|
Exploring lateral genetic transfer among microbial genomes using TF-IDF. Sci Rep 2016; 6:29319. [PMID: 27452976 PMCID: PMC4958990 DOI: 10.1038/srep29319] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 04/06/2016] [Accepted: 06/13/2016] [Indexed: 11/17/2022] Open
Abstract
Many microbes can acquire genetic material from their environment and incorporate it into their genome, a process known as lateral genetic transfer (LGT). Computational approaches have been developed to detect genomic regions of lateral origin, but typically lack sensitivity, ability to distinguish donor from recipient, and scalability to very large datasets. To address these issues we have introduced an alignment-free method based on ideas from document analysis, term frequency-inverse document frequency (TF-IDF). Here we examine the performance of TF-IDF on three empirical datasets: 27 genomes of Escherichia coli and Shigella, 110 genomes of enteric bacteria, and 143 genomes across 12 bacterial and three archaeal phyla. We investigate the effect of k-mer size, gap size and delineation of groups on the inference of genomic regions of lateral origin, finding an interplay among these parameters and sequence divergence. Because TF-IDF identifies donor groups and delineates regions of lateral origin within recipient genomes, aggregating these regions by gene enables us to explore, for the first time, the mosaic nature of lateral genes including the multiplicity of biological sources, ancestry of transfer and over-writing by subsequent transfers. We carry out Gene Ontology enrichment tests to investigate which biological processes are potentially affected by LGT.
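For readers unfamiliar with TF-IDF, the following Python sketch applies the textbook weighting to k-mers, treating each genome as a document. It is not the authors' LGT-detection pipeline, which the abstract describes as also involving gap size, group delineation, and donor and recipient inference. The function names and toy sequences are hypothetical, and the specific formula (relative term frequency times the natural log of the inverse document frequency) is an assumed, common variant.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k substring of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tf_idf(genomes, k):
    """Textbook TF-IDF weights for k-mers, one 'document' per genome."""
    counts = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    n_docs = len(counts)
    # Document frequency: in how many genomes does each k-mer occur?
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    weights = {}
    for name, c in counts.items():
        total = sum(c.values())
        weights[name] = {
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in c.items()
        }
    return weights


# Toy sequences, purely hypothetical.
genomes = {"a": "AAAC", "b": "AAAG"}
w = tf_idf(genomes, 2)
```

Under this weighting a k-mer present in every genome scores zero, while a k-mer confined to few genomes scores higher, which is the property that makes the idea attractive for flagging regions of unusual origin.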
|
34
|
Allman ES, Rhodes JA, Sullivant S. Statistically Consistent k-mer Methods for Phylogenetic Tree Reconstruction. J Comput Biol 2016; 24:153-171. [PMID: 27387364 DOI: 10.1089/cmb.2015.0216] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 11/13/2022] Open
Abstract
Frequencies of k-mers in sequences are sometimes used as a basis for inferring phylogenetic trees without first obtaining a multiple sequence alignment. We show that a standard approach of using the squared Euclidean distance between k-mer vectors to approximate a tree metric can be statistically inconsistent. To remedy this, we derive model-based distance corrections for orthologous sequences without gaps, which lead to consistent tree inference. The identifiability of model parameters from k-mer frequencies is also studied. Finally, we report simulations showing that the corrected distance outperforms many other k-mer methods, even when sequences are generated with an insertion and deletion process. These results have implications for multiple sequence alignment as well since k-mer methods are usually the first step in constructing a guide tree for such algorithms.
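The "standard approach" mentioned in this abstract, the squared Euclidean distance between k-mer frequency vectors, can be sketched as follows. This is the uncorrected distance that the authors report can be statistically inconsistent; their model-based correction is not given in the abstract and is not reproduced here. The function names and toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequency_vector(seq, k, alphabet="ACGT"):
    """Relative frequency of every possible k-mer, in lexicographic order."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def squared_euclidean(u, v):
    """Uncorrected squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Toy sequences, purely hypothetical.
x = kmer_frequency_vector("AAAA", 1)
y = kmer_frequency_vector("AACC", 1)
d = squared_euclidean(x, y)
```

A matrix of such pairwise distances is what a distance-based tree-building method would take as input, which is why inconsistency of the distance itself matters for the resulting tree.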
Affiliation(s)
- Elizabeth S Allman
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, Alaska
| | - John A Rhodes
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, Alaska
| | - Seth Sullivant
- Department of Mathematics, North Carolina State University, Raleigh, North Carolina
|
35
|
Bernard G, Chan CX, Ragan MA. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer. Sci Rep 2016; 6:28970. [PMID: 27363362 PMCID: PMC4929450 DOI: 10.1038/srep28970] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.7] [Received: 03/16/2016] [Accepted: 06/13/2016] [Indexed: 12/22/2022] Open
Abstract
Alignment-free (AF) approaches have recently been highlighted as alternatives to methods based on multiple sequence alignment in phylogenetic inference. However, the sensitivity of AF methods to genome-scale evolutionary scenarios is little known. Here, using simulated microbial genome data we systematically assess the sensitivity of nine AF methods to three important evolutionary scenarios: sequence divergence, lateral genetic transfer (LGT) and genome rearrangement. Among these, AF methods are most sensitive to the extent of sequence divergence, less sensitive to low and moderate frequencies of LGT, and most robust against genome rearrangement. We describe the application of AF methods to three well-studied empirical genome datasets, and introduce a new application of the jackknife to assess node support. Our results demonstrate that AF phylogenomics is computationally scalable to multi-genome data and can generate biologically meaningful phylogenies and insights into microbial evolution.
Affiliation(s)
- Guillaume Bernard
- Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane, QLD 4072, Australia
| | - Cheong Xin Chan
- Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane, QLD 4072, Australia
| | - Mark A. Ragan
- Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane, QLD 4072, Australia
|
36
|
Bromberg R, Grishin NV, Otwinowski Z. Phylogeny Reconstruction with Alignment-Free Method That Corrects for Horizontal Gene Transfer. PLoS Comput Biol 2016; 12:e1004985. [PMID: 27336403 PMCID: PMC4918981 DOI: 10.1371/journal.pcbi.1004985] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 10/22/2015] [Accepted: 05/10/2016] [Indexed: 01/20/2023] Open
Abstract
Advances in sequencing have generated a large number of complete genomes. Traditionally, phylogenetic analysis relies on alignments of orthologs, but defining orthologs and separating them from paralogs is a complex task that may not always be suited to the large datasets of the future. An alternative to traditional, alignment-based approaches are whole-genome, alignment-free methods. These methods are scalable and require minimal manual intervention. We developed SlopeTree, a new alignment-free method that estimates evolutionary distances by measuring the decay of exact substring matches as a function of match length. SlopeTree corrects for horizontal gene transfer, for composition variation and low complexity sequences, and for branch-length nonlinearity caused by multiple mutations at the same site. We tested SlopeTree on 495 bacteria, 73 archaea, and 72 strains of Escherichia coli and Shigella. We compared our trees to the NCBI taxonomy, to trees based on concatenated alignments, and to trees produced by other alignment-free methods. The results were consistent with current knowledge about prokaryotic evolution. We assessed differences in tree topology over different methods and settings and found that the majority of bacteria and archaea have a core set of proteins that evolves by descent. In trees built from complete genomes rather than sets of core genes, we observed some grouping by phenotype rather than phylogeny, for instance with a cluster of sulfur-reducing thermophilic bacteria coming together irrespective of their phyla. The source-code for SlopeTree is available at: http://prodata.swmed.edu/download/pub/slopetree_v1/slopetree.tar.gz.
Due to their lack of distinct morphological features, bacteria and archaea were extremely difficult to classify until technology was developed to obtain their DNA sequences; these sequences could then be compared to estimate evolutionary relationships. Now, due to technological advances, there is a flood of available sequences from a wide variety of organisms. These advances have spurred the development of algorithms which can estimate evolutionary relationships using whole genomes, in contrast to the more traditional methods which used single genes earlier and now typically use groups of conserved genes. However, there are many challenges when attempting to infer evolutionary relationships, in particular horizontal gene transfer, where DNA is transferred from one organism to another, resulting in an organism’s genome containing DNA that does not reflect its evolution by descent. We developed a new whole-genome method for estimating evolutionary distances which identifies and corrects for horizontal transfer. We found that for SlopeTree and all other whole-genome methods we applied, horizontal transfer causes some evolutionary distances to be grossly underestimated, and that our correction corrects for this.
Affiliation(s)
- Raquel Bromberg
- Department of Biophysics and Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America
- Nick V. Grishin
- Department of Biophysics and Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America
- Zbyszek Otwinowski
- Department of Biophysics and Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America
|
37
|
Wang D, Xu J, Yu J. KGCAK: a K-mer based database for genome-wide phylogeny and complexity evaluation. Biol Direct 2015; 10:53. [PMID: 26376976 PMCID: PMC4573299 DOI: 10.1186/s13062-015-0083-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 06/05/2015] [Accepted: 09/11/2015] [Indexed: 11/28/2022] Open
Abstract
Background: The K-mer approach, which treats genomic sequences as simple character strings and counts the relative abundance of each string for a fixed K, has been extensively applied to phylogeny inference for genome assembly, annotation, and comparison. Results: To meet increasing demands for comparing large genome sequences and to promote the use of the K-mer approach, we develop a versatile database, KGCAK (http://kgcak.big.ac.cn/KGCAK/), containing ~8,000 genomes that include genome sequences of diverse life forms (viruses, prokaryotes, protists, animals, and plants) and cellular organelles of eukaryotic lineages. It builds phylogeny based on genomic elements in an alignment-free fashion and provides in-depth data processing, enabling users to compare the complexity of genome sequences based on K-mer distribution. Conclusion: We hope that KGCAK becomes a powerful tool for exploring relationships within and among groups of species in a tree of life based on genomic data. Reviewers: This article was reviewed by Prof Mark Ragan and Dr Yuri Wolf.
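The counting step described in the Background, relative abundance of each string for a fixed K, can be sketched in a few lines of Python. This is an illustrative sketch only and does not reproduce KGCAK's pipeline: the function names, the restriction to the unambiguous DNA alphabet ACGT, and the use of Euclidean distance between frequency profiles are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Relative abundance of every DNA k-mer over overlapping windows."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Assumption: windows containing ambiguous bases (e.g. N) are ignored.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def euclidean_distance(p, q):
    """Distance between two frequency profiles built with the same k."""
    return sum((p[m] - q[m]) ** 2 for m in p) ** 0.5


# Toy usage: two made-up sequences of different lengths map to
# fixed-length profiles (4**k entries) that can be compared directly.
freqs = kmer_frequencies("ACGTACGT", 2)
other = kmer_frequencies("ACGTTTGCA", 2)
dist = euclidean_distance(freqs, other)
```

Because every sequence maps to a vector of the same length, a pairwise distance matrix built this way could feed a standard tree-building or clustering routine.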
Affiliation(s)
- Dapeng Wang
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, PR China; Stem Cell Laboratory, UCL Cancer Institute, University College London, London, WC1E 6BT, UK.
- Jiayue Xu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, PR China; University of Chinese Academy of Sciences, Beijing, 100049, China.
- Jun Yu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, PR China.
|
38
|
Ankarklev J, Franzén O, Peirasmaki D, Jerlström-Hultqvist J, Lebbad M, Andersson J, Andersson B, Svärd SG. Comparative genomic analyses of freshly isolated Giardia intestinalis assemblage A isolates. BMC Genomics 2015; 16:697. [PMID: 26370391 PMCID: PMC4570179 DOI: 10.1186/s12864-015-1893-6] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Received: 05/06/2015] [Accepted: 09/01/2015] [Indexed: 12/31/2022] Open
Abstract
Background: The diarrhea-causing protozoan Giardia intestinalis makes up a species complex of eight different assemblages (A-H), of which assemblages A and B infect humans. Comparative whole-genome analyses of three of these assemblages have shown that there is significant divergence at the inter-assemblage level; however, little is currently known regarding variation at the intra-assemblage level. We have performed whole-genome sequencing of two sub-assemblage AII isolates, recently axenized from symptomatic human patients, to study the biological and genetic diversity within assemblage A isolates. Results: Several biological differences between the new and earlier characterized assemblage A isolates were identified, including a difference in growth medium preference. The two AII isolates were of different sub-assemblage types (AII-1 [AS175] and AII-2 [AS98]) and showed size differences in the smallest chromosomes. The amount of genetic diversity was characterized in relation to the genome of the Giardia reference isolate WB, an assemblage AI isolate. Our analyses indicate that the divergence between AI and AII is approximately 1%, represented by ~100,000 single nucleotide polymorphisms (SNPs) distributed over the chromosomes with enrichment in variable genomic regions containing surface antigens. The level of allelic sequence heterozygosity (ASH) in the two AII isolates was found to be 0.25–0.35%, which is 25–30-fold higher than in the WB isolate and 10-fold higher than in the assemblage AII isolate DH (0.037%). Thirty-five protein-encoding genes not found in the WB genome were identified in the two AII genomes. The large gene families of variant-specific surface proteins (VSPs) and high cysteine membrane proteins (HCMPs) showed isolate-specific divergences of the gene repertoires. Certain genes, often in small gene families with 2 to 8 members, localize to the variable regions of the genomes and show high sequence diversity between the assemblage A isolates. One of the families, Bactericidal/Permeability Increasing-like protein (BPIL), with eight members, was characterized further, and the proteins were shown to localize to the ER in trophozoites. Conclusions: Giardia genomes are modular, with highly conserved core regions interspersed with variable regions containing high levels of ASH, SNPs and variable surface antigens. There are significant genomic variations in assemblage A isolates in terms of chromosome size, gene content, surface protein repertoire and gene polymorphisms, and these differences mainly localize to the variable regions of the genomes. The large genetic differences within one assemblage of G. intestinalis strengthen the argument that the assemblages represent different Giardia species. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-015-1893-6) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Johan Ankarklev
- Department of Cell and Molecular Biology, Science for Life Laboratory, BMC, Uppsala University, Box 596, SE-751 24, Uppsala, Sweden.
- Oscar Franzén
- Department of Cell and Molecular Biology, Karolinska Institutet, Box 285, SE-171 77, Stockholm, Sweden; Science for Life Laboratory, KISP, Tomtebodavägen 23A, 171 65, Solna, Sweden.
- Dimitra Peirasmaki
- Department of Cell and Molecular Biology, Science for Life Laboratory, BMC, Uppsala University, Box 596, SE-751 24, Uppsala, Sweden.
- Jon Jerlström-Hultqvist
- Department of Cell and Molecular Biology, Science for Life Laboratory, BMC, Uppsala University, Box 596, SE-751 24, Uppsala, Sweden.
- Marianne Lebbad
- Department of Microbiology, Public Health Agency of Sweden, SE-171 82, Solna, Sweden.
- Jan Andersson
- Department of Cell and Molecular Biology, Science for Life Laboratory, BMC, Uppsala University, Box 596, SE-751 24, Uppsala, Sweden.
- Björn Andersson
- Department of Cell and Molecular Biology, Karolinska Institutet, Box 285, SE-171 77, Stockholm, Sweden; Science for Life Laboratory, KISP, Tomtebodavägen 23A, 171 65, Solna, Sweden.
- Staffan G Svärd
- Department of Cell and Molecular Biology, Science for Life Laboratory, BMC, Uppsala University, Box 596, SE-751 24, Uppsala, Sweden.
|
39
|
Nguyen NPD, Mirarab S, Kumar K, Warnow T. Ultra-large alignments using phylogeny-aware profiles. Genome Biol 2015; 16:124. [PMID: 26076734 PMCID: PMC4492008 DOI: 10.1186/s13059-015-0688-z] [Citation(s) in RCA: 89] [Impact Index Per Article: 8.9] [Received: 10/01/2014] [Accepted: 05/29/2015] [Indexed: 01/07/2023] Open
Abstract
Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, a multiple sequence alignment method that uses a new machine learning technique, the ensemble of hidden Markov models, which we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at https://github.com/smirarab/sepp.
Affiliation(s)
- Nam-Phuong D Nguyen
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, 61801, Illinois, USA.
- Siavash Mirarab
- Department of Computer Science, University of Texas at Austin, 2505 Speedway, Austin, 78712, Texas, USA.
- Keerthana Kumar
- Department of Computer Science, University of Texas at Austin, 2505 Speedway, Austin, 78712, Texas, USA.
- Tandy Warnow
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, 61801, Illinois, USA; Department of Bioengineering, University of Illinois at Urbana-Champaign, 1270 Digital Computer Laboratory, Urbana, 61801, Illinois, USA; Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, 61801, Illinois, USA.
|