1
|
Rios-Willars E, Chirinos-Arias MC. Mfind: a tool for DNA barcode analysis in angiosperms and its relationship with microsatellites using a sliding window algorithm. PLANTA 2024; 259:134. [PMID: 38671234 DOI: 10.1007/s00425-024-04420-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/12/2023] [Accepted: 04/17/2024] [Indexed: 04/28/2024]
Abstract
MAIN CONCLUSION Mfind is a tool to analyze the impact of microsatellite presence on DNA barcode specificity. We found a significant correlation between barcode entropy and microsatellite count in angiosperm. Genetic barcodes and microsatellites are some of the identification methods in taxonomy and biodiversity research. It is important to establish a relationship between microsatellite quantification and genetic information in barcodes. In order to clarify the association between the genetic information in barcodes (expressed as Shannon's Measure of Information, SMI) and microsatellites count, a total of 330,809 DNA barcodes from the BOLD database (Barcode of Life Data System) were analyzed. A parallel sliding-window algorithm was developed to compute the Shannon entropy of the barcodes, and this was compared with the quantification of microsatellites like (AT)n, (AC)n, and (AG)n. The microsatellite search method utilized an algorithm developed in the Java programming language, which systematically examined the genetic barcodes from an angiosperm database. For this purpose, a computational tool named Mfind was developed, and its search methodology is detailed. This comprehensive study revealed a broad overview of microsatellites within barcodes, unveiling an inverse correlation between the sumz of microsatellites count and barcodes information. The utilization of the Mfind tool demonstrated that the presence of microsatellites impacts the barcode information when considering entropy as a metric. This effect might be attributed to the concise length of DNA barcodes and the repetitive nature of microsatellites, resulting in a direct influence on the entropy of the barcodes.
Collapse
Affiliation(s)
- Ernesto Rios-Willars
- Faculty of Systems, Autonomous University of Coahuila (UAdeC), 25350, Saltillo, Coahuila, México.
| | - Michelle C Chirinos-Arias
- Molecular Biology and Bioinformatics Area, Instituto de Genetica Barbara McClintock (IGBM), Lima, 15022, Peru
| |
Collapse
|
2
|
Caetano-Anollés G, Claverie JM, Nasir A. A critical analysis of the current state of virus taxonomy. Front Microbiol 2023; 14:1240993. [PMID: 37601376 PMCID: PMC10435761 DOI: 10.3389/fmicb.2023.1240993] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2023] [Accepted: 07/20/2023] [Indexed: 08/22/2023] Open
Abstract
Taxonomical classification has preceded evolutionary understanding. For that reason, taxonomy has become a battleground fueled by knowledge gaps, technical limitations, and a priorism. Here we assess the current state of the challenging field, focusing on fallacies that are common in viral classification. We emphasize that viruses are crucial contributors to the genomic and functional makeup of holobionts, organismal communities that behave as units of biological organization. Consequently, viruses cannot be considered taxonomic units because they challenge crucial concepts of organismality and individuality. Instead, they should be considered processes that integrate virions and their hosts into life cycles. Viruses harbor phylogenetic signatures of genetic transfer that compromise monophyly and the validity of deep taxonomic ranks. A focus on building phylogenetic networks using alignment-free methodologies and molecular structure can help mitigate the impasse, at least in part. Finally, structural phylogenomic analysis challenges the polyphyletic scenario of multiple viral origins adopted by virus taxonomy, defeating a polyphyletic origin and supporting instead an ancient cellular origin of viruses. We therefore, prompt abandoning deep ranks and urgently reevaluating the validity of taxonomic units and principles of virus classification.
Collapse
Affiliation(s)
- Gustavo Caetano-Anollés
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences and C.R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
| | - Jean-Michel Claverie
- Structural and Genomic Information Laboratory (UMR7256), Mediterranean Institute of Microbiology (FR3479), IM2B, IOM, Aix Marseille University, CNRS, Marseille, France
| | | |
Collapse
|
3
|
Riley AC, Ashlock DA, Graether SP. The difficulty of aligning intrinsically disordered protein sequences as assessed by conservation and phylogeny. PLoS One 2023; 18:e0288388. [PMID: 37440576 DOI: 10.1371/journal.pone.0288388] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2023] [Accepted: 06/26/2023] [Indexed: 07/15/2023] Open
Abstract
Intrinsically disordered proteins (IDPs) are proteins that lack a stable 3D structure but maintain a biological function. It has been frequently suggested that IDPs are difficult to align because they tend to have fewer conserved residues compared to ordered proteins, but to our knowledge this has never been directly tested. To compare the alignments of ordered proteins to IDPs, their multiple sequence alignments (MSAs) were assessed using two different methods. The first compared the similarity between MSAs produced using the same sequences but created with Clustal Omega, MAFFT, and MUSCLE. The second assessed MSAs based on how well they recapitulated the species tree. These two methods measure the "correctness" of an MSA with two different approaches; the first method measures consistency while the second measures the underlying phylogenetic signal. Proteins that contained both regions of disorder and order were analyzed along with proteins that were fully disordered and fully ordered, using nucleotide, codon and peptide sequence alignments. We observed that IDPs had less similar MSAs than ordered proteins, which is most likely linked to the lower sequence conservation in IDPs. However, comparisons of tree distances found that trees from the ordered sequence MSAs were not significantly closer to the species tree than those inferred from disordered sequence MSAs. Our results show that it is correct to say that IDPs are difficult to align on the basis of MSA consistency, but that this does not equate with alignments being of poor quality when assessed by their ability to correctly infer a species tree.
Collapse
Affiliation(s)
- Andrew C Riley
- Graduate Program in Bioinformatics, University of Guelph, Guelph, Ontario, Canada
- Department of Molecular and Cellular Biology, University of Guelph, Guelph, Ontario, Canada
| | - Daniel A Ashlock
- Graduate Program in Bioinformatics, University of Guelph, Guelph, Ontario, Canada
- Department of Mathematics & Statistics, University of Guelph, Guelph, Ontario, Canada
| | - Steffen P Graether
- Graduate Program in Bioinformatics, University of Guelph, Guelph, Ontario, Canada
- Department of Molecular and Cellular Biology, University of Guelph, Guelph, Ontario, Canada
| |
Collapse
|
4
|
Nord AJ, Wheeler TJ. Mirage2's high-quality spliced protein-to-genome mappings produce accurate multiple-sequence alignments of isoforms. PLoS One 2023; 18:e0285225. [PMID: 37155621 PMCID: PMC10166558 DOI: 10.1371/journal.pone.0285225] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2022] [Accepted: 04/18/2023] [Indexed: 05/10/2023] Open
Abstract
The organization of homologous protein sequences into multiple sequence alignments (MSAs) is a cornerstone of modern analysis of proteins. Recent focus on the importance of alternatively-spliced isoforms in disease and cell biology has highlighted the need for MSA software that can appropriately account for isoforms and the exon-length insertions or deletions that isoforms may have relative to each other. We previously developed Mirage, a software package for generating MSAs for isoforms spanning multiple species. Here, we present Mirage2, which retains the fundamental algorithms of the original Mirage implementation while providing substantially improved translated mapping and improving several aspects of usability. We demonstrate that Mirage2 is highly effective at mapping proteins to their encoding exons, and that these protein-genome mappings lead to extremely accurate intron-aware alignments. Additionally, Mirage2 implements a number of engineering improvements that simplify installation and use.
Collapse
Affiliation(s)
- Alexander J Nord
- Division of Biological Sciences, University of Montana, Missoula, Montana, United States of America
| | - Travis J Wheeler
- Department of Pharmacology & Toxicology, University of Arizona, Tucson, Arizona, United States of America
- Department of Computer Science, University of Montana, Missoula, Montana, United States of America
| |
Collapse
|
5
|
Yamaguchi K, Kadota M, Nishimura O, Ohishi Y, Naito Y, Kuraku S. Technical considerations in Hi-C scaffolding and evaluation of chromosome-scale genome assemblies. Mol Ecol 2021; 30:5923-5934. [PMID: 34432923 PMCID: PMC9292758 DOI: 10.1111/mec.16146] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2020] [Revised: 07/28/2021] [Accepted: 08/18/2021] [Indexed: 12/15/2022]
Abstract
The recent development of ecological studies has been fueled by the introduction of massive information based on chromosome-scale genome sequences, even for species for which genetic linkage is not accessible. This was enabled mainly by the application of Hi-C, a method for genome-wide chromosome conformation capture that was originally developed for investigating the long-range interaction of chromatins. Performing genomic scaffolding using Hi-C data is highly resource-demanding and employs elaborate laboratory steps for sample preparation. It starts with building a primary genome sequence assembly as an input, which is followed by computation for genome scaffolding using Hi-C data, requiring careful validation. This article presents technical considerations for obtaining optimal Hi-C scaffolding results and provides a test case of its application to a reptile species, the Madagascar ground gecko (Paroedura picta). Among the metrics that are frequently used for evaluating scaffolding results, we investigate the validity of the completeness assessment of chromosome-scale genome assemblies using single-copy reference orthologues.
Collapse
Affiliation(s)
- Kazuaki Yamaguchi
- Laboratory for PhyloinformaticsRIKEN Center for Biosystems Dynamics ResearchKobeJapan
| | - Mitsutaka Kadota
- Laboratory for PhyloinformaticsRIKEN Center for Biosystems Dynamics ResearchKobeJapan
| | - Osamu Nishimura
- Laboratory for PhyloinformaticsRIKEN Center for Biosystems Dynamics ResearchKobeJapan
| | - Yuta Ohishi
- Laboratory for PhyloinformaticsRIKEN Center for Biosystems Dynamics ResearchKobeJapan
| | - Yuki Naito
- Database Center for Life Science (DBCLS)MishimaJapan
| | - Shigehiro Kuraku
- Laboratory for PhyloinformaticsRIKEN Center for Biosystems Dynamics ResearchKobeJapan
- Molecular Life History LaboratoryNational Institute of GeneticsMishimaJapan
- Department of GeneticsSokendai (Graduate University for Advanced Studies)MishimaJapan
| |
Collapse
|
6
|
Maiolo M, Gatti L, Frei D, Leidi T, Gil M, Anisimova M. ProPIP: a tool for progressive multiple sequence alignment with Poisson Indel Process. BMC Bioinformatics 2021; 22:518. [PMID: 34689750 PMCID: PMC8543915 DOI: 10.1186/s12859-021-04442-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2020] [Accepted: 10/13/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Current alignment tools typically lack an explicit model of indel evolution, leading to artificially short inferred alignments (i.e., over-alignment) due to inconsistencies between the indel history and the phylogeny relating the input sequences. RESULTS We present a new progressive multiple sequence alignment tool ProPIP. The process of insertions and deletions is described using an explicit evolutionary model-the Poisson Indel Process or PIP. The method is based on dynamic programming and is implemented in a frequentist framework. The source code can be compiled on Linux, macOS and Microsoft Windows platforms. The algorithm is implemented in C++ as standalone program. The source code is freely available on GitHub at https://github.com/acg-team/ProPIP and is distributed under the terms of the GNU GPL v3 license. CONCLUSIONS The use of an explicit indel evolution model allows to avoid over-alignment, to infer gaps in a phylogenetically consistent way and to make inferences about the rates of insertions and deletions. Instead of the arbitrary gap penalties, the parameters used by ProPIP are the insertion and deletion rates, which have biological interpretation and are contextualized in a probabilistic environment. As a result, indel rate settings may be optimised in order to infer phylogenetically meaningful gap patterns.
Collapse
Affiliation(s)
- Massimo Maiolo
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Schloss 1, Postfach, 8820, Wädenswil, Switzerland.,Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland
| | - Lorenzo Gatti
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Schloss 1, Postfach, 8820, Wädenswil, Switzerland.,Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland
| | - Diego Frei
- Institute of Information Systems and Networking, University of Applied Sciences and Arts of Southern Switzerland, Galleria 2, Via Cantonale 2c, 6928, Manno, Switzerland
| | - Tiziano Leidi
- Institute of Information Systems and Networking, University of Applied Sciences and Arts of Southern Switzerland, Galleria 2, Via Cantonale 2c, 6928, Manno, Switzerland
| | - Manuel Gil
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Schloss 1, Postfach, 8820, Wädenswil, Switzerland.,Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland
| | - Maria Anisimova
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Schloss 1, Postfach, 8820, Wädenswil, Switzerland. .,Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland.
| |
Collapse
|
7
|
Abstract
Pangenomes are organized collections of the genomic information from related individuals or groups. Graphical pangenomics is the study of these pangenomes using graphical methods to identify and analyze genes, regions, and mutations of interest to an array of biological questions. This field has seen significant progress in recent years including the development of graph based models that better resolve biological phenomena, and an explosion of new tools for mapping reads, creating graphical genomes, and performing pangenome analysis. In this review, we discuss recent developments in models, algorithms associated with graphical genomes, and comparisons between similar tools. In addition we briefly discuss what these developments may mean for the future of genomics.
Collapse
|
8
|
Abstract
Multiple sequence alignment is a core first step in many bioinformatics analyses, and errors in these alignments can have negative consequences for scientific studies. In this article, we review some of the recent literature evaluating multiple sequence alignment methods and identify specific challenges that arise when performing these evaluations. In particular, we discuss the different trends observed in simulation studies and when using biological benchmarks. Overall, we find that multiple sequence alignment, far from being a "solved problem," would benefit from new attention.
Collapse
Affiliation(s)
- Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
| |
Collapse
|
9
|
Carpentier M, Chomilier J. Protein multiple alignments: sequence-based versus structure-based programs. Bioinformatics 2020; 35:3970-3980. [PMID: 30942864 DOI: 10.1093/bioinformatics/btz236] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2018] [Revised: 03/05/2019] [Accepted: 04/02/2019] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Multiple sequence alignment programs have proved to be very useful and have already been evaluated in the literature yet not alignment programs based on structure or both sequence and structure. In the present article we wish to evaluate the added value provided through considering structures. RESULTS We compared the multiple alignments resulting from 25 programs either based on sequence, structure or both, to reference alignments deposited in five databases (BALIBASE 2 and 3, HOMSTRAD, OXBENCH and SISYPHUS). On the whole, the structure-based methods compute more reliable alignments than the sequence-based ones, and even than the sequence+structure-based programs whatever the databases. Two programs lead, MAMMOTH and MATRAS, nevertheless the performances of MUSTANG, MATT, 3DCOMB, TCOFFEE+TM_ALIGN and TCOFFEE+SAP are better for some alignments. The advantage of structure-based methods increases at low levels of sequence identity, or for residues in regular secondary structures or buried ones. Concerning gap management, sequence-based programs set less gaps than structure-based programs. Concerning the databases, the alignments of the manually built databases are more challenging for the programs. AVAILABILITY AND IMPLEMENTATION All data and results presented in this study are available at: http://wwwabi.snv.jussieu.fr/people/mathilde/download/AliMulComp/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Mathilde Carpentier
- Institut Systématique Evolution Biodiversité (ISYEB), Sorbonne Université, MNHN, CNRS, EPHE, Paris, France
| | - Jacques Chomilier
- Sorbonne Université, MNHN, CNRS, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie (IMPMC), BiBiP, Paris, France
| |
Collapse
|
10
|
Wright ES. RNAconTest: comparing tools for noncoding RNA multiple sequence alignment based on structural consistency. RNA (NEW YORK, N.Y.) 2020; 26:531-540. [PMID: 32005745 PMCID: PMC7161358 DOI: 10.1261/rna.073015.119] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/20/2019] [Accepted: 01/28/2020] [Indexed: 05/05/2023]
Abstract
The importance of noncoding RNA sequences has become increasingly clear over the past decade. New RNA families are often detected and analyzed using comparative methods based on multiple sequence alignments. Accordingly, a number of programs have been developed for aligning and deriving secondary structures from sets of RNA sequences. Yet, the best tools for these tasks remain unclear because existing benchmarks contain too few sequences belonging to only a small number of RNA families. RNAconTest (RNA consistency test) is a new benchmarking approach relying on the observation that secondary structure is often conserved across highly divergent RNA sequences from the same family. RNAconTest scores multiple sequence alignments based on the level of consistency among known secondary structures belonging to reference sequences in their output alignment. Similarly, consensus secondary structure predictions are scored according to their agreement with one or more known structures in a family. Comparing the performance of 10 popular alignment programs using RNAconTest revealed that DAFS, DECIPHER, LocARNA, and MAFFT created the most structurally consistent alignments. The best consensus secondary structure predictions were generated by DAFS and LocARNA (via RNAalifold). Many of the methods specific to noncoding RNAs exhibited poor scalability as the number or length of input sequences increased, and several programs displayed substantial declines in score as more sequences were aligned. Overall, RNAconTest provides a means of testing and improving tools for comparative RNA analysis, as well as highlighting the best available approaches. RNAconTest is available from the DECIPHER website (http://DECIPHER.codes/Downloads.html).
Collapse
Affiliation(s)
- Erik S Wright
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania 15219, USA
| |
Collapse
|
11
|
Abstract
Summary PASTA is a multiple sequence method that uses divide-and-conquer plus iteration to enable base alignment methods to scale with high accuracy to large sequence datasets. By default, PASTA included MAFFT L-INS-i; our new extension of PASTA enables the use of MAFFT G-INS-i, MAFFT Homologs, CONTRAlign and ProbCons. We analyzed the performance of each base method and PASTA using these base methods on 224 datasets from BAliBASE 4 with at least 50 sequences. We show that PASTA enables the most accurate base methods to scale to larger datasets at reduced computational effort, and generally improves alignment and tree accuracy on the largest BAliBASE datasets. Availability and implementation PASTA is available at https://github.com/kodicollins/pasta and has also been integrated into the original PASTA repository at https://github.com/smirarab/pasta. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Kodi Collins
- Department of Computer Science, University of California, Los Angeles, CA, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois, Urbana, IL, USA
| |
Collapse
|
12
|
Nute M, Saleh E, Warnow T. Evaluating Statistical Multiple Sequence Alignment in Comparison to Other Alignment Methods on Protein Data Sets. Syst Biol 2019; 68:396-411. [PMID: 30329135 PMCID: PMC6472439 DOI: 10.1093/sysbio/syy068] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2018] [Revised: 09/27/2018] [Accepted: 10/11/2018] [Indexed: 01/15/2023] Open
Abstract
The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical coestimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical coestimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy has better precision and recall (with respect to the true alignments) than the other alignment methods on the simulated data sets but has consistently lower recall on the biological benchmarks (with respect to the reference alignments) than many of the other methods. In other words, we find that BAli-Phy systematically underaligns when operating on biological sequence data but shows no sign of this on simulated data. There are several potential causes for this change in performance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments, and future research is needed to determine the most likely explanation. We conclude with a discussion of the potential ramifications for each of these possibilities. [BAli-Phy; homology; multiple sequence alignment; protein sequences; structural alignment.]
Collapse
Affiliation(s)
- Michael Nute
- Department of Statistics, University of Illinois at Urbana-Champaign, 725 S Wright St #101, Champaign, IL 61820, USA
| | - Ehsan Saleh
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA.,Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1205 W. Clark St., Urbana, IL 61801, USA.,National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| |
Collapse
|
13
|
Maiolo M, Zhang X, Gil M, Anisimova M. Progressive multiple sequence alignment with indel evolution. BMC Bioinformatics 2018; 19:331. [PMID: 30241460 PMCID: PMC6151001 DOI: 10.1186/s12859-018-2357-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2018] [Accepted: 09/03/2018] [Indexed: 12/30/2022] Open
Abstract
Background Sequence alignment is crucial in genomics studies. However, optimal multiple sequence alignment (MSA) is NP-hard. Thus, modern MSA methods employ progressive heuristics, breaking the problem into a series of pairwise alignments guided by a phylogeny. Changes between homologous characters are typically modelled by a Markov substitution model. In contrast, the dynamics of indels are not modelled explicitly, because the computation of the marginal likelihood under such models has exponential time complexity in the number of taxa. But the failure to model indel evolution may lead to artificially short alignments due to biased indel placement, inconsistent with phylogenetic relationship. Results Recently, the classical indel model TKF91 was modified to describe indel evolution on a phylogeny via a Poisson process, termed PIP. PIP allows to compute the joint marginal probability of an MSA and a tree in linear time. We present a new dynamic programming algorithm to align two MSAs –represented by the underlying homology paths– by full maximum likelihood under PIP in polynomial time, and apply it progressively along a guide tree. We have corroborated the correctness of our method by simulation, and compared it with competitive methods on an illustrative real dataset. Conclusions Our MSA method is the first polynomial time progressive aligner with a rigorous mathematical formulation of indel evolution. The new method infers phylogenetically meaningful gap patterns alternative to the popular PRANK, while producing alignments of similar length. Moreover, the inferred gap patterns agree with what was predicted qualitatively by previous studies. The algorithm is implemented in a standalone C++ program: https://github.com/acg-team/ProPIP. Supplementary data are available at BMC Bioinformatics online. Electronic supplementary material The online version of this article (10.1186/s12859-018-2357-1) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Massimo Maiolo
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Grüentalstrasse 14, P.O. Box, Waedenswil, CH-8820, Switzerland.,Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, Zurich, CH-8057, Switzerland.,Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Bâtiment Génopode, Lausanne, CH-1015, Switzerland
| | - Xiaolei Zhang
- National Heart and Lung Institute, Imperial College London, South Kensington Campus, London, SW7 2AZ, UK
| | - Manuel Gil
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Grüentalstrasse 14, P.O. Box, Waedenswil, CH-8820, Switzerland.,Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Bâtiment Génopode, Lausanne, CH-1015, Switzerland
| | - Maria Anisimova
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Grüentalstrasse 14, P.O. Box, Waedenswil, CH-8820, Switzerland. .,Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Bâtiment Génopode, Lausanne, CH-1015, Switzerland.
| |
Collapse
|
14
|
Levy Karin E, Ashkenazy H, Hein J, Pupko T. A Simulation-Based Approach to Statistical Alignment. Syst Biol 2018; 68:252-266. [DOI: 10.1093/sysbio/syy059] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2018] [Accepted: 09/10/2018] [Indexed: 12/26/2022] Open
Affiliation(s)
- Eli Levy Karin
- School of Molecular Cell Biology & Biotechnology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel
| | - Haim Ashkenazy
- School of Molecular Cell Biology & Biotechnology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel
| | - Jotun Hein
- School of Molecular Cell Biology & Biotechnology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel
- Department of Statistics, University of Oxford, Oxford, UK
| | - Tal Pupko
- School of Molecular Cell Biology & Biotechnology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel
| |
Collapse
|
15
|
Nord A, Carey K, Hornbeck P, Wheeler T. Splice-Aware Multiple Sequence Alignment of Protein Isoforms. ACM-BCB ... ... : THE ... ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE. ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE 2018; 2018:200-210. [PMID: 31080963 PMCID: PMC6508070 DOI: 10.1145/3233547.3233592] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
Abstract
Multiple sequence alignment (MSA) is a classic problem in computational genomics. In typical use, MSA software is expected to align a collection of homologous genes, such as orthologs from multiple species or duplication-induced paralogs within a species. Recent focus on the importance of alternatively-spliced isoforms in disease and cell biology has highlighted the need to create MSAs that more effectively accommodate isoforms. MSAs are traditionally constructed using scoring criteria that prefer alignments with occasional mismatches over alignments with long gaps. Alternatively spliced protein isoforms effectively contain exon-length insertions or deletions (indels) relative to each other, and demand an alternative approach. Some improvements can be achieved by making indel penalties much smaller, but this is merely a patchwork solution. In this work we present Mirage, a novel MSA software package for the alignment of alternatively spliced protein isoforms. Mirage aligns isoforms to each other by first mapping each protein sequence to its encoding genomic sequence, and then aligning isoforms to one another based on the relative genomic coordinates of their constitutive codons. Mirage is highly effective at mapping proteins back to their encoding exons, and these protein-genome mappings lead to extremely accurate intra-species alignments; splice site information in these alignments is used to improve the accuracy of inter-species alignments of isoforms. Mirage alignments have also revealed the ubiquity of dual-coding exons, in which an exon conditionally encodes multiple open reading frames as overlapping spliced segments of frame-shifted genomic sequence.
Collapse
Affiliation(s)
- Alex Nord
- University of Montana Missoula, Montana,
| | | | | | | |
Collapse
|
16
|
Le Q, Sievers F, Higgins DG. Protein multiple sequence alignment benchmarking through secondary structure prediction. Bioinformatics 2018; 33:1331-1337. [PMID: 28093407 PMCID: PMC5408826 DOI: 10.1093/bioinformatics/btw840] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2016] [Accepted: 01/10/2017] [Indexed: 12/26/2022] Open
Abstract
Motivation Multiple sequence alignment (MSA) is commonly used to analyze sets of homologous protein or DNA sequences. This has lead to the development of many methods and packages for MSA over the past 30 years. Being able to compare different methods has been problematic and has relied on gold standard benchmark datasets of ‘true’ alignments or on MSA simulations. A number of protein benchmark datasets have been produced which rely on a combination of manual alignment and/or automated superposition of protein structures. These are either restricted to very small MSAs with few sequences or require manual alignment which can be subjective. In both cases, it remains very difficult to properly test MSAs of more than a few dozen sequences. PREFAB and HomFam both rely on using a small subset of sequences of known structure and do not fairly test the quality of a full MSA. Results In this paper we describe QuanTest, a fully automated and highly scalable test system for protein MSAs which is based on using secondary structure prediction accuracy (SSPA) to measure alignment quality. This is based on the assumption that better MSAs will give more accurate secondary structure predictions when we include sequences of known structure. SSPA measures the quality of an entire alignment however, not just the accuracy on a handful of selected sequences. It can be scaled to alignments of any size but here we demonstrate its use on alignments of either 200 or 1000 sequences. This allows the testing of slow accurate programs as well as faster, less accurate ones. We show that the scores from QuanTest are highly correlated with existing benchmark scores. We also validate the method by comparing a wide range of MSA alignment options and by including different levels of mis-alignment into MSA, and examining the effects on the scores. Availability and Implementation QuanTest is available from http://www.bioinf.ucd.ie/download/QuanTest.tgz Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Quan Le
- Conway Institute, UCD School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Dublin 4, Ireland
| | - Fabian Sievers
- Conway Institute, UCD School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Dublin 4, Ireland
| | - Desmond G Higgins
- Conway Institute, UCD School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Dublin 4, Ireland
| |
Collapse
|
17
|
Iwai S, Weinmaier T, Schmidt BL, Albertson DG, Poloso NJ, Dabbagh K, DeSantis TZ. Piphillin: Improved Prediction of Metagenomic Content by Direct Inference from Human Microbiomes. PLoS One 2016; 11:e0166104. [PMID: 27820856 PMCID: PMC5098786 DOI: 10.1371/journal.pone.0166104] [Citation(s) in RCA: 201] [Impact Index Per Article: 22.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2016] [Accepted: 10/07/2016] [Indexed: 01/30/2023] Open
Abstract
Functional analysis of a clinical microbiome facilitates the elucidation of mechanisms by which microbiome perturbation can cause a phenotypic change in the patient. The direct approach for the analysis of the functional capacity of the microbiome is via shotgun metagenomics. An inexpensive method to estimate the functional capacity of a microbial community is through collecting 16S rRNA gene profiles then indirectly inferring the abundance of functional genes. This inference approach has been implemented in the PICRUSt and Tax4Fun software tools. However, those tools have important limitations since they rely on outdated functional databases and uncertain phylogenetic trees and require very specific data pre-processing protocols. Here we introduce Piphillin, a straightforward algorithm independent of any proposed phylogenetic tree, leveraging contemporary functional databases and not obliged to any singular data pre-processing protocol. When all three inference tools were evaluated against actual shotgun metagenomics, Piphillin was superior in predicting gene composition in human clinical samples compared to both PICRUSt and Tax4Fun (p<0.01 and p<0.001, respectively) and Piphillin’s ability to predict disease associations with specific gene orthologs exhibited a 15% increase in balanced accuracy compared to PICRUSt. From laboratory animal samples, no performance advantage was observed for any one of the tools over the others and for environmental samples all produced unsatisfactory predictions. Our results demonstrate that functional inference using the direct method implemented in Piphillin is preferable for clinical biospecimens. Piphillin is publicly available for academic use at http://secondgenome.com/Piphillin.
Collapse
Affiliation(s)
- Shoko Iwai
- Informatics Department, Second Genome Inc., South San Francisco, California, United States of America
| | - Thomas Weinmaier
- Informatics Department, Second Genome Inc., South San Francisco, California, United States of America
| | - Brian L. Schmidt
- Bluestone Center for Clinical Research and the Department of Oral and Maxillofacial Surgery, New York University College of Dentistry, New York, New York, United States of America
| | - Donna G. Albertson
- Bluestone Center for Clinical Research and the Department of Oral and Maxillofacial Surgery, New York University College of Dentistry, New York, New York, United States of America
- Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, San Francisco, California, United States of America
| | - Neil J. Poloso
- Research and External Scientific Innovation Department, Allergan PLC, Irvine, California, United States of America
| | - Karim Dabbagh
- Informatics Department, Second Genome Inc., South San Francisco, California, United States of America
| | - Todd Z. DeSantis
- Informatics Department, Second Genome Inc., South San Francisco, California, United States of America
- * E-mail:
| |
Collapse
|
18
|
Barquist L, Burge SW, Gardner PP. Studying RNA Homology and Conservation with Infernal: From Single Sequences to RNA Families. CURRENT PROTOCOLS IN BIOINFORMATICS 2016; 54:12.13.1-12.13.25. [PMID: 27322404 PMCID: PMC5010141 DOI: 10.1002/cpbi.4] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Emerging high-throughput technologies have led to a deluge of putative non-coding RNA (ncRNA) sequences identified in a wide variety of organisms. Systematic characterization of these transcripts will be a tremendous challenge. Homology detection is critical to making maximal use of functional information gathered about ncRNAs: identifying homologous sequence allows us to transfer information gathered in one organism to another quickly and with a high degree of confidence. ncRNA presents a challenge for homology detection, as the primary sequence is often poorly conserved and de novo secondary structure prediction and search remain difficult. This unit introduces methods developed by the Rfam database for identifying "families" of homologous ncRNAs starting from single "seed" sequences, using manually curated sequence alignments to build powerful statistical models of sequence and structure conservation known as covariance models (CMs), implemented in the Infernal software package. We provide a step-by-step iterative protocol for identifying ncRNA homologs and then constructing an alignment and corresponding CM. We also work through an example for the bacterial small RNA MicA, discovering a previously unreported family of divergent MicA homologs in genus Xenorhabdus in the process. © 2016 by John Wiley & Sons, Inc.
Collapse
Affiliation(s)
- Lars Barquist
- Institute for Molecular Infection Biology, University of Würzburg, Würzburg, D-97080 Germany
- Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA United Kingdom; Fax: +44 (0)1223 494919
| | - Sarah W. Burge
- Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA United Kingdom; Fax: +44 (0)1223 494919
| | - Paul P. Gardner
- School of Biological Sciences, University of Canterbury, Private Bag 4800, Christchurch, New Zealand
- Biomolecular Interaction Centre, University of Canterbury, Private Bag 4800, Christchurch, New Zealand
| |
Collapse
|
19
|
Salomon M, Dassy B. Partitional Classification: A Complement to Phylogeny. Evol Bioinform Online 2016; 12:149-56. [PMID: 27346943 PMCID: PMC4912232 DOI: 10.4137/ebo.s38288] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2015] [Revised: 03/21/2016] [Accepted: 03/27/2016] [Indexed: 11/30/2022] Open
Abstract
The tree of life is currently an active object of research, though next to vertical gene transmission non vertical gene transfers proved to play a significant role in the evolutionary process. To overcome this difficulty, trees of life are now constructed from genes hypothesized vital, on the assumption that these are all transmitted vertically. This view has been challenged. As a frame for this discussion, we developed a partitional taxonomical system clustering taxa at a high taxonomical rank. Our analysis (1) selects RNase P RNA sequences of bacterial, archaeal, and eucaryal genera from genetic databases, (2) submits the sequences, aligned, to k-medoid analysis to obtain clusters, (3) establishes the correspondence between clusters and taxa, (4) constructs from the taxa a new type of taxon, the genetic community (GC), and (5) classifies the GCs: Archaea-Eukaryotes contrastingly different from the six others, all bacterial. The GCs would be the broadest frame to carry out the phylogenies.
Collapse
Affiliation(s)
- Marc Salomon
- Sorbonne Universités, UPMC Université Paris 06, Faculté de Biologie, Paris, France
| | - Bruno Dassy
- Sorbonne Universités, UPMC Université Paris 06, Faculté de Biologie, Paris, France
| |
Collapse
|
20
|
Potha N, Maragoudakis M, Lyras D. A biology-inspired, data mining framework for extracting patterns in sexual cyberbullying data. Knowl Based Syst 2016. [DOI: 10.1016/j.knosys.2015.12.021] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
|
21
|
Chatzou M, Magis C, Chang JM, Kemena C, Bussotti G, Erb I, Notredame C. Multiple sequence alignment modeling: methods and applications. Brief Bioinform 2015; 17:1009-1023. [PMID: 26615024 DOI: 10.1093/bib/bbv099] [Citation(s) in RCA: 98] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2015] [Revised: 10/16/2015] [Indexed: 12/20/2022] Open
Abstract
This review provides an overview on the development of Multiple sequence alignment (MSA) methods and their main applications. It is focused on progress made over the past decade. The three first sections review recent algorithmic developments for protein, RNA/DNA and genomic alignments. The fourth section deals with benchmarks and explores the relationship between empirical and simulated data, along with the impact on method developments. The last part of the review gives an overview on available MSA local reliability estimators and their dependence on various algorithmic properties of available methods.
Collapse
|
22
|
Lange J, Wyrwicz LS, Vriend G. KMAD: knowledge-based multiple sequence alignment for intrinsically disordered proteins. ACTA ACUST UNITED AC 2015; 32:932-6. [PMID: 26568635 PMCID: PMC4803389 DOI: 10.1093/bioinformatics/btv663] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2015] [Accepted: 11/09/2015] [Indexed: 12/16/2022]
Abstract
Summary: Intrinsically disordered proteins (IDPs) lack tertiary structure and thus differ from globular proteins in terms of their sequence–structure–function relations. IDPs have lower sequence conservation, different types of active sites and a different distribution of functionally important regions, which altogether make their multiple sequence alignment (MSA) difficult. The KMAD MSA software has been written specifically for the alignment and annotation of IDPs. It augments the substitution matrix with knowledge about post-translational modifications, functional domains and short linear motifs. Results: MSAs produced with KMAD describe well-conserved features among IDPs, tend to agree well with biological intuition, and are a good basis for designing new experiments to shed light on this large, understudied class of proteins. Availability and implementation: KMAD web server is accessible at http://www.cmbi.ru.nl/kmad/. A standalone version is freely available. Contact: vriend@cmbi.ru.nl
Collapse
Affiliation(s)
- Joanna Lange
- Laboratory of Bioinformatics and Biostatistics, M. Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland and Laboratory of Bioinformatics and Biostatistics, M. Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland and
| | - Lucjan S Wyrwicz
- Laboratory of Bioinformatics and Biostatistics, M. Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland and
| | - Gert Vriend
- CMBI Radboudumc, 6525 GA, Nijmegen, The Netherlands
| |
Collapse
|
23
|
Fox G, Sievers F, Higgins DG. Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments. ACTA ACUST UNITED AC 2015; 32:814-20. [PMID: 26568625 PMCID: PMC5939968 DOI: 10.1093/bioinformatics/btv592] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2015] [Accepted: 10/10/2015] [Indexed: 01/03/2023]
Abstract
Motivation: Multiple sequence alignments (MSAs) with large numbers of sequences are now commonplace. However, current multiple alignment benchmarks are ill-suited for testing these types of alignments, as test cases either contain a very small number of sequences or are based purely on simulation rather than empirical data. Results: We take advantage of recent developments in protein structure prediction methods to create a benchmark (ContTest) for protein MSAs containing many thousands of sequences in each test case and which is based on empirical biological data. We rank popular MSA methods using this benchmark and verify a recent result showing that chained guide trees increase the accuracy of progressive alignment packages on datasets with thousands of proteins. Availability and implementation: Benchmark data and scripts are available for download at http://www.bioinf.ucd.ie/download/ContTest.tar.gz. Contact:des.higgins@ucd.ie Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Gearóid Fox
- Conway Institute of Biomolecular and Biomedical Research, and UCD School of Medicine and Medical Science, University College Dublin, Dublin 4, Ireland
| | - Fabian Sievers
- Conway Institute of Biomolecular and Biomedical Research, and UCD School of Medicine and Medical Science, University College Dublin, Dublin 4, Ireland
| | - Desmond G Higgins
- Conway Institute of Biomolecular and Biomedical Research, and UCD School of Medicine and Medical Science, University College Dublin, Dublin 4, Ireland
| |
Collapse
|
24
|
Wright ES. DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment. BMC Bioinformatics 2015; 16:322. [PMID: 26445311 PMCID: PMC4595117 DOI: 10.1186/s12859-015-0749-z] [Citation(s) in RCA: 232] [Impact Index Per Article: 23.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2015] [Accepted: 09/23/2015] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments. RESULTS Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on large sequence sets. CONCLUSIONS Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins. Moreover, secondary structure predictions increase in accuracy as more sequences are used in the prediction. This enables the scalable generation of large sequence alignments that maintain high accuracy even on diverse sequence sets. The DECIPHER R package and source code are freely available for download at DECIPHER.cee.wisc.edu and from the Bioconductor repository.
Collapse
Affiliation(s)
- Erik S Wright
- Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53715, USA. .,Wisconsin Institute for Discovery, University of Wisconsin-Madison, 330 N. Orchard St., Madison, WI, 53715, USA.
| |
Collapse
|
25
|
Tan G, Muffato M, Ledergerber C, Herrero J, Goldman N, Gil M, Dessimoz C. Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference. Syst Biol 2015; 64:778-91. [PMID: 26031838 PMCID: PMC4538881 DOI: 10.1093/sysbio/syv033] [Citation(s) in RCA: 155] [Impact Index Per Article: 15.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2014] [Accepted: 05/26/2015] [Indexed: 01/09/2023] Open
Abstract
Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms.
Collapse
Affiliation(s)
- Ge Tan
- Department of Computer Science, ETH Zurich, Universitätstr. 6, 8092 Zurich, Switzerland, Department of Molecular Sciences, Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London, UK; MRC Clinical Sciences Centre, London W12 0NN, UK
| | - Matthieu Muffato
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
| | - Christian Ledergerber
- Department of Computer Science, ETH Zurich, Universitätstr. 6, 8092 Zurich, Switzerland
| | - Javier Herrero
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK; University College London, Gower St, London WC1E 6BT, UK; and
| | - Nick Goldman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
| | - Manuel Gil
- Institute of Molecular Life Sciences, University of Zurich, Winterthurerstr. 190 , 8057 Zurich, Switzerland; and Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zurich, Switzerland
| | - Christophe Dessimoz
- University College London, Gower St, London WC1E 6BT, UK; and European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK;
| |
Collapse
|
26
|
Abstract
Horizontal or Lateral Gene Transfer (HGT or LGT) is the transmission of portions of genomic DNA between organisms through a process decoupled from vertical inheritance. In the presence of HGT events, different fragments of the genome are the result of different evolutionary histories. This can therefore complicate the investigations of evolutionary relatedness of lineages and species. Also, as HGT can bring into genomes radically different genotypes from distant lineages, or even new genes bearing new functions, it is a major source of phenotypic innovation and a mechanism of niche adaptation. For example, of particular relevance to human health is the lateral transfer of antibiotic resistance and pathogenicity determinants, leading to the emergence of pathogenic lineages. Computational identification of HGT events relies upon the investigation of sequence composition or evolutionary history of genes. Sequence composition-based ("parametric") methods search for deviations from the genomic average, whereas evolutionary history-based ("phylogenetic") approaches identify genes whose evolutionary history significantly differs from that of the host species. The evaluation and benchmarking of HGT inference methods typically rely upon simulated genomes, for which the true history is known. On real data, different methods tend to infer different HGT events, and as a result it can be difficult to ascertain all but simple and clear-cut HGT events.
Collapse
Affiliation(s)
| | - Nives Škunca
- ETH Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
| | | | - Christophe Dessimoz
- University College London, London, United Kingdom
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
| |
Collapse
|
27
|
Sela I, Ashkenazy H, Katoh K, Pupko T. GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters. Nucleic Acids Res 2015; 43:W7-14. [PMID: 25883146 PMCID: PMC4489236 DOI: 10.1093/nar/gkv318] [Citation(s) in RCA: 540] [Impact Index Per Article: 54.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2015] [Accepted: 03/28/2015] [Indexed: 12/25/2022] Open
Abstract
Inference of multiple sequence alignments (MSAs) is a critical part of phylogenetic and comparative genomics studies. However, from the same set of sequences different MSAs are often inferred, depending on the methodologies used and the assumed parameters. Much effort has recently been devoted to improving the ability to identify unreliable alignment regions. Detecting such unreliable regions was previously shown to be important for downstream analyses relying on MSAs, such as the detection of positive selection. Here we developed GUIDANCE2, a new integrative methodology that accounts for: (i) uncertainty in the process of indel formation, (ii) uncertainty in the assumed guide tree and (iii) co-optimal solutions in the pairwise alignments, used as building blocks in progressive alignment algorithms. We compared GUIDANCE2 with seven methodologies to detect unreliable MSA regions using extensive simulations and empirical benchmarks. We show that GUIDANCE2 outperforms all previously developed methodologies. Furthermore, GUIDANCE2 also provides a set of alternative MSAs which can be useful for downstream analyses. The novel algorithm is implemented as a web-server, available at: http://guidance.tau.ac.il.
Collapse
Affiliation(s)
- Itamar Sela
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv 6997801, Israel
| | - Haim Ashkenazy
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv 6997801, Israel
| | - Kazutaka Katoh
- Immunology Frontier Research Center, Osaka University, Suita, Osaka 565-0871, Japan Computational Biology Research Center, The National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan
| | - Tal Pupko
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv 6997801, Israel
| |
Collapse
|
28
|
Uricaru R, Michotey C, Chiapello H, Rivals E. YOC, A new strategy for pairwise alignment of collinear genomes. BMC Bioinformatics 2015; 16:111. [PMID: 25885358 PMCID: PMC4411659 DOI: 10.1186/s12859-015-0530-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2014] [Accepted: 03/09/2015] [Indexed: 01/02/2023] Open
Abstract
Background Comparing and aligning genomes is a key step in analyzing closely related genomes. Despite the development of many genome aligners in the last 15 years, the problem is not yet fully resolved, even when aligning closely related bacterial genomes of the same species. In addition, no procedures are available to assess the quality of genome alignments or to compare genome aligners. Results We designed an original method for pairwise genome alignment, named YOC, which employs a highly sensitive similarity detection method together with a recent collinear chaining strategy that allows overlaps. YOC improves the reliability of collinear genome alignments, while preserving or even improving sensitivity. We also propose an original qualitative evaluation criterion for measuring the relevance of genome alignments. We used this criterion to compare and benchmark YOC with five recent genome aligners on large bacterial genome datasets, and showed it is suitable for identifying the specificities and the potential flaws of their underlying strategies. Conclusions The YOC prototype is available at https://github.com/ruricaru/YOC. It has several advantages over existing genome aligners: (1) it is based on a simplified two phase alignment strategy, (2) it is easy to parameterize, (3) it produces reliable genome alignments, which are easier to analyze and to use. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0530-3) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Raluca Uricaru
- University of Bordeaux, CNRS / LaBRI, F-33405, Talence, France. .,University of Bordeaux, CBiB, F-33000, Bordeaux, France. .,LIRMM, UMR 5506, Computational Biology Institute, CNRS, University of Montpellier 2, Montpellier, France.
| | | | - Hélène Chiapello
- MIG, UR 1077, INRA, 78026, Jouy-en-Josas cedex, France. .,MIA-T, UR 0875, INRA, BP 52627, 31326, Castanet-Tolosan cedex, France.
| | - Eric Rivals
- LIRMM, UMR 5506, Computational Biology Institute, CNRS, University of Montpellier 2, Montpellier, France.
| |
Collapse
|
29
|
Simple chained guide trees give poorer multiple sequence alignments than inferred trees in simulation and phylogenetic benchmarks. Proc Natl Acad Sci U S A 2015; 112:E99-100. [PMID: 25564672 DOI: 10.1073/pnas.1417526112] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open
|
30
|
Pervez MT, Babar ME, Nadeem A, Aslam M, Awan AR, Aslam N, Hussain T, Naveed N, Qadri S, Waheed U, Shoaib M. Evaluating the accuracy and efficiency of multiple sequence alignment methods. Evol Bioinform Online 2014; 10:205-17. [PMID: 25574120 PMCID: PMC4267518 DOI: 10.4137/ebo.s19199] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2014] [Revised: 10/09/2014] [Accepted: 10/14/2014] [Indexed: 12/21/2022] Open
Abstract
A comparison of 10 most popular Multiple Sequence Alignment (MSA) tools, namely, MUSCLE, MAFFT(L-INS-i), MAFFT (FFT-NS-2), T-Coffee, ProbCons, SATe, Clustal Omega, Kalign, Multalin, and Dialign-TX is presented. We also focused on the significance of some implementations embedded in algorithm of each tool. Based on 10 simulated trees of different number of taxa generated by R, 400 known alignments and sequence files were constructed using indel-Seq-Gen. A total of 4000 test alignments were generated to study the effect of sequence length, indel size, deletion rate, and insertion rate. Results showed that alignment quality was highly dependent on the number of deletions and insertions in the sequences and that the sequence length and indel size had a weaker effect. Overall, ProbCons was consistently on the top of list of the evaluated MSA tools. SATe, being little less accurate, was 529.10% faster than ProbCons and 236.72% faster than MAFFT(L-INS-i). Among other tools, Kalign and MUSCLE achieved the highest sum of pairs. We also considered BALiBASE benchmark datasets and the results relative to BAliBASE- and indel-Seq-Gen-generated alignments were consistent in the most cases.
Collapse
Affiliation(s)
- Muhammad Tariq Pervez
- Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan. ; Institute of Biochemistry and Biotechnology, University of Veterinary and Animal Sciences, Lahore, Pakistan
| | | | - Asif Nadeem
- Institute of Biochemistry and Biotechnology, University of Veterinary and Animal Sciences, Lahore, Pakistan
| | - Muhammad Aslam
- Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan
| | - Ali Raza Awan
- Institute of Biochemistry and Biotechnology, University of Veterinary and Animal Sciences, Lahore, Pakistan
| | - Naeem Aslam
- Institute of Biochemistry and Biotechnology, University of Veterinary and Animal Sciences, Lahore, Pakistan. ; Department of Computer Science, NFC Institute of Engineering and Technological Training, Multan, Pakistan
| | - Tanveer Hussain
- Institute of Biochemistry and Biotechnology, University of Veterinary and Animal Sciences, Lahore, Pakistan
| | | | - Salman Qadri
- Department of Computer Science, Islamia University of Bahawalpur, Pakistan
| | - Usman Waheed
- Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan
| | - Muhammad Shoaib
- Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan
| |
Collapse
|
31
|
Abstract
Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark data sets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole-genome alignment (WGA). Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments and then assessments were performed collectively after all the submissions were received. Three data sets were used: Two were simulated and based on primate and mammalian phylogenies, and one was comprised of 20 real fly genomes. In total, 35 submissions were assessed, submitted by 10 teams using 12 different alignment pipelines. We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable differences in the alignment quality of differently annotated regions and found that few tools aligned the duplications analyzed. We found that many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all data sets, submissions, and assessment programs for further study and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.
Collapse
|
32
|
Simple chained guide trees give high-quality protein multiple sequence alignments. Proc Natl Acad Sci U S A 2014; 111:10556-61. [PMID: 25002495 DOI: 10.1073/pnas.1405628111] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022] Open
Abstract
Guide trees are used to decide the order of sequence alignment in the progressive multiple sequence alignment heuristic. These guide trees are often the limiting factor in making large alignments, and considerable effort has been expended over the years in making these quickly or accurately. In this article we show that, at least for protein families with large numbers of sequences that can be benchmarked with known structures, simple chained guide trees give the most accurate alignments. These also happen to be the fastest and simplest guide trees to construct, computationally. Such guide trees have a striking effect on the accuracy of alignments produced by some of the most widely used alignment packages. There is a marked increase in accuracy and a marked decrease in computational time, once the number of sequences goes much above a few hundred. This is true, even if the order of sequences in the guide tree is random.
Collapse
|
33
|
Warnow T. Large-Scale Multiple Sequence Alignment and Phylogeny Estimation. MODELS AND ALGORITHMS FOR GENOME EVOLUTION 2013. [DOI: 10.1007/978-1-4471-5298-9_6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]
|