1
Manuel C, Sakalli E, Schmidt HA, Viñas C, von Haeseler A, Elgert C. When the Past Fades: Detecting Phylogenetic Signal with SatuTe. Mol Biol Evol 2025; 42:msaf090. [PMID: 40423578 PMCID: PMC12108095 DOI: 10.1093/molbev/msaf090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2024] [Revised: 03/10/2025] [Accepted: 03/26/2025] [Indexed: 05/28/2025]
Abstract
In phylogenetics, the phenomenon of saturation is well known, although its influence on tree reconstruction lacks a systematic and well-founded method. Here, we propose a new measure of the phylogenetic information shared between two subtrees connected by a branch in a phylogeny. This measure generalizes the concept of saturation between two sequences to a theory of saturation between subtrees, whose implementation we provide as the versatile program SatuTe. We describe different usages of SatuTe, identifying which branches in a tree are phylogenetically informative and which alignment regions support a given branch. As an example, we discuss the Tree of Life reconstruction from ribosomal proteins and the 16S rRNA gene, with emphasis on the two-domain versus three-domain hypotheses. For the branch leading to Eukaryota, we show that most ribosomal proteins contain a strong phylogenetic signal, whereas some regions of the 16S rRNA gene have lost phylogenetic information. Thus, SatuTe opens new insights into phylogenetic inference and complements standard phylogenetic analysis.
Affiliation(s)
- Cassius Manuel
  - Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna, Medical University of Vienna, Dr. Bohr Gasse 9, Vienna A-1030, Austria
- Enes Sakalli
  - Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna, Medical University of Vienna, Dr. Bohr Gasse 9, Vienna A-1030, Austria
  - Vienna BioCenter PhD Program, Doctoral School of the University of Vienna and Medical University of Vienna, Vienna A-1030, Austria
- Heiko A Schmidt
  - Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna, Medical University of Vienna, Dr. Bohr Gasse 9, Vienna A-1030, Austria
- Carme Viñas
  - Faculty of Mathematics and Statistics, Polytechnic University of Catalonia, Barcelona, Spain
- Arndt von Haeseler
  - Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna, Medical University of Vienna, Dr. Bohr Gasse 9, Vienna A-1030, Austria
  - Ludwig Boltzmann Institute for Network Medicine, University of Vienna, Augasse 2-6, Vienna A-1090, Austria
- Christiane Elgert
  - Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna, Medical University of Vienna, Dr. Bohr Gasse 9, Vienna A-1030, Austria
2
Lavin AA, Rivas-Santisteban J. Limitations of sequence dissimilarity as a predictor of prokaryotic lineage. Open Biol 2025; 15:240302. [PMID: 40101780 PMCID: PMC11919493 DOI: 10.1098/rsob.240302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2024] [Revised: 01/15/2025] [Accepted: 02/09/2025] [Indexed: 03/20/2025]
Abstract
The molecular clock rests upon the assumption that the observed changes among sequences capture the differentiation of lineages, or kinship, as dissimilarity increases with time. Although it has been questioned over the years, this paradigmatic principle continues to underlie the idea that the polymorphic space of a gene is so vast that it is unattainable in evolutionary time. Thus, the molecular clock has been used to obtain taxonomic annotations, proving to be very effective at delivering testable results. In this article, however, we ask how often this assumption leads to inaccuracies when inferring the lineage of prokaryotic genes. Thus, we open an interesting discussion by simulating, in realistic scenarios, the critical times in which specific 5S rRNA sequences of two distant lineages are exhausting the polymorphic space. We contend that certain genes in one lineage will become increasingly similar to those in another over time, as the space for new variants is finite, mimicking phylogenetic features by convergence or by chance, without implying true kinship.
Affiliation(s)
- Alvar A. Lavin
  - Department of Systems Biology, Centro Nacional de Biotecnología, Madrid, Spain
- Juan Rivas-Santisteban
  - Department of Systems Biology, Centro Nacional de Biotecnología, Madrid, Spain
  - Department of Biology and Biochemistry, University of Bath Milner Centre for Evolution, Bath, UK
3
Ren H, Wong TKF, Minh BQ, Lanfear R. MixtureFinder: Estimating DNA Mixture Models for Phylogenetic Analyses. Mol Biol Evol 2025; 42:msae264. [PMID: 39715360 PMCID: PMC11704958 DOI: 10.1093/molbev/msae264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 11/26/2024] [Accepted: 12/19/2024] [Indexed: 12/25/2024]
Abstract
In phylogenetic studies, both partitioned models and mixture models are used to account for heterogeneity in molecular evolution among the sites of DNA sequence alignments. Partitioned models require the user to specify the grouping of sites into subsets, and then assume that each subset of sites can be modeled by a single common process. Mixture models do not require users to prespecify subsets of sites, and instead calculate the likelihood of every site under every model, while co-estimating the model weights and parameters. While much research has gone into the optimization of partitioned models by merging user-specified subsets, there has been less attention paid to the optimization of mixture models for DNA sequence alignments. In this study, we first ask whether a key assumption of partitioned models (that each user-specified subset can be modeled by a single common process) is supported by the data. Having shown that this is not the case, we then design, implement, test, and apply an algorithm, MixtureFinder, to select the optimum number of classes for a mixture model of Q-matrices for the standard models of DNA sequence evolution. We show this algorithm performs well on simulated and empirical datasets and suggest that it may be useful for future empirical studies. MixtureFinder is available in IQ-TREE2, and a tutorial for using MixtureFinder can be found here: http://www.iqtree.org/doc/Complex-Models#mixture-models.
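The abstract describes a mixture model as computing the likelihood of every site under every model and combining them with class weights. The following is a minimal sketch of that combination step only, assuming the per-site, per-class likelihoods have already been computed elsewhere. The function name `mixture_log_likelihood` and its inputs are hypothetical and do not reflect the MixtureFinder or IQ-TREE2 implementation.

```python
import math


def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a mixture of model classes.

    site_likelihoods[i][k] is the (assumed precomputed) likelihood of
    site i under class k; weights[k] is the weight of class k.
    Hypothetical helper for illustration, not the MixtureFinder code.
    """
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("class weights must sum to 1")
    total = 0.0
    for per_class in site_likelihoods:
        # Each site's likelihood is the weighted sum over all classes.
        site_likelihood = sum(w * l for w, l in zip(weights, per_class))
        total += math.log(site_likelihood)
    return total


# Two sites, two equally weighted classes.
example = mixture_log_likelihood([[0.2, 0.4], [0.1, 0.3]], [0.5, 0.5])
```

In a partitioned model, by contrast, each site would contribute the likelihood of exactly one class, fixed in advance by the user's grouping.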
Affiliation(s)
- Huaiyan Ren
  - School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
  - Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
- Thomas K F Wong
  - School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
  - Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
- Bui Quang Minh
  - School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
- Robert Lanfear
  - Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
4
Höhna S, Hsiang AY. Sequential Bayesian Phylogenetic Inference. Syst Biol 2024; 73:704-721. [PMID: 38771253 DOI: 10.1093/sysbio/syae020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2023] [Revised: 04/15/2024] [Accepted: 05/04/2024] [Indexed: 05/22/2024]
Abstract
The ideal approach to Bayesian phylogenetic inference is to estimate all parameters of interest jointly in a single hierarchical model. However, this is often not feasible in practice due to the high computational cost. Instead, phylogenetic pipelines generally consist of sequential analyses, whereby a single point estimate from a given analysis is used as input for the next analysis (e.g., a single multiple sequence alignment is used to estimate a gene tree). In this framework, uncertainty is not propagated from step to step, which can lead to inaccurate or spuriously confident results. Here, we formally develop and test a sequential inference approach for Bayesian phylogenetic inference, which uses importance sampling to generate observations for the next step of an analysis pipeline from the posterior distribution produced in the previous step. The sequential inference approach presented here not only accounts for uncertainty between analysis steps but also allows for greater flexibility in software choice (and hence model availability) and can be computationally more efficient than the traditional joint inference approach when multiple models are being tested. We show that our sequential inference approach is identical in practice to the joint inference approach only if sufficient information in the data is present (a narrow posterior distribution) and/or sufficiently many importance samples are used. Conversely, we show that the common practice of using a single point estimate can be biased, for example, a single phylogeny estimate can transform an unrooted phylogeny into a time-calibrated phylogeny. We demonstrate the theory of sequential Bayesian inference using both a toy example and an empirical case study of divergence-time estimation in insects using a relaxed clock model from transcriptome data.
In the empirical example, we estimate 3 posterior distributions of branch lengths from the same data (DNA character matrix with a GTR+Γ+I substitution model, an amino acid data matrix with empirical substitution models, and an amino acid data matrix with the PhyloBayes CAT-GTR model). Finally, we apply 3 different node-calibration strategies and show that divergence time estimates are affected by both the data source and underlying substitution process to estimate branch lengths as well as the node-calibration strategies. Thus, our new sequential Bayesian phylogenetic inference provides the opportunity to efficiently test different approaches for divergence time estimation, including branch-length estimation from other software.
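The abstract relies on importance sampling to carry a posterior from one analysis step into the next. As a generic illustration of that technique only, the sketch below computes self-normalized importance weights for samples drawn from one density (the proposal) and uses them to estimate an expectation under another (the target). The function names and inputs are hypothetical, and this is not the authors' implementation.

```python
import math


def normalized_importance_weights(log_target, log_proposal):
    """Self-normalized importance weights from log densities.

    log_target[i] and log_proposal[i] are the log densities of sample i
    under the target and the proposal; both are assumed to be given.
    """
    log_w = [t - p for t, p in zip(log_target, log_proposal)]
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_w)
    unnormalized = [math.exp(x - shift) for x in log_w]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


def importance_estimate(values, weights):
    """Weighted average of per-sample values under normalized weights."""
    return sum(v * w for v, w in zip(values, weights))


# Two samples; the second is twice as likely under the proposal as the first.
weights = normalized_importance_weights([0.0, 0.0], [0.0, math.log(2)])
estimate = importance_estimate([3.0, 6.0], weights)
```

With few samples or a poorly matched proposal, a handful of weights dominate the estimate, which is consistent with the abstract's caveat that enough samples are needed for the sequential approach to match joint inference.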
Affiliation(s)
- Sebastian Höhna
  - GeoBio-Center LMU, Ludwig-Maximilians-Universität München, Richard-Wagner Str. 10, 80333 Munich, Germany
  - Department of Earth and Environmental Sciences, Paleontology & Geobiology, Ludwig-Maximilians-Universität München, Richard-Wagner Str. 10, 80333 Munich, Germany
- Allison Y Hsiang
  - GeoBio-Center LMU, Ludwig-Maximilians-Universität München, Richard-Wagner Str. 10, 80333 Munich, Germany
  - Department of Earth and Environmental Sciences, Paleontology & Geobiology, Ludwig-Maximilians-Universität München, Richard-Wagner Str. 10, 80333 Munich, Germany
5
Son A, Park J, Kim W, Yoon Y, Lee S, Park Y, Kim H. Revolutionizing Molecular Design for Innovative Therapeutic Applications through Artificial Intelligence. Molecules 2024; 29:4626. [PMID: 39407556 PMCID: PMC11477718 DOI: 10.3390/molecules29194626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2024] [Revised: 09/19/2024] [Accepted: 09/27/2024] [Indexed: 10/20/2024]
Abstract
The field of computational protein engineering has been transformed by recent advancements in machine learning, artificial intelligence, and molecular modeling, enabling the design of proteins with unprecedented precision and functionality. Computational methods now play a crucial role in enhancing the stability, activity, and specificity of proteins for diverse applications in biotechnology and medicine. Techniques such as deep learning, reinforcement learning, and transfer learning have dramatically improved protein structure prediction, optimization of binding affinities, and enzyme design. These innovations have streamlined the process of protein engineering by allowing the rapid generation of targeted libraries, reducing experimental sampling, and enabling the rational design of proteins with tailored properties. Furthermore, the integration of computational approaches with high-throughput experimental techniques has facilitated the development of multifunctional proteins and novel therapeutics. However, challenges remain in bridging the gap between computational predictions and experimental validation and in addressing ethical concerns related to AI-driven protein design. This review provides a comprehensive overview of the current state and future directions of computational methods in protein engineering, emphasizing their transformative potential in creating next-generation biologics and advancing synthetic biology.
Affiliation(s)
- Ahrum Son
  - Department of Molecular Medicine, Scripps Research, La Jolla, CA 92037, USA
- Jongham Park
  - Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Woojin Kim
  - Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Yoonki Yoon
  - Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Sangwoon Lee
  - Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Yongho Park
  - Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Hyunsoo Kim
  - Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
  - Department of Convergent Bioscience and Informatics, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
  - Protein AI Design Institute, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
  - SCICS, Prove beyond AI, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
6
Middlebrook EA, Katani R, Fair JM. OrthoPhyl: streamlining large-scale, orthology-based phylogenomic studies of bacteria at broad evolutionary scales. G3 (Bethesda) 2024; 14:jkae119. [PMID: 38839049 PMCID: PMC11304591 DOI: 10.1093/g3journal/jkae119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2024] [Revised: 05/15/2024] [Accepted: 05/29/2024] [Indexed: 06/07/2024]
Abstract
There are a staggering number of publicly available bacterial genome sequences (at writing, 2.0 million assemblies in NCBI's GenBank alone), and the deposition rate continues to increase. This wealth of data begs for phylogenetic analyses to place these sequences within an evolutionary context. A phylogenetic placement not only aids in taxonomic classification but informs the evolution of novel phenotypes, targets of selection, and horizontal gene transfer. Building trees from multi-gene codon alignments is a laborious task that requires bioinformatic expertise, rigorous curation of orthologs, and heavy computation. Compounding the problem is the lack of tools that can streamline these processes for building trees from large-scale genomic data. Here we present OrthoPhyl, which takes bacterial genome assemblies and reconstructs trees from whole genome codon alignments. The analysis pipeline can analyze an arbitrarily large number of input genomes (>1200 tested here) by identifying a diversity-spanning subset of assemblies and using these genomes to build gene models to infer orthologs in the full dataset. To illustrate the versatility of OrthoPhyl, we show three use cases: E. coli/Shigella, Brucella/Ochrobactrum and the order Rickettsiales. We compare trees generated with OrthoPhyl to trees generated with kSNP3 and GToTree along with published trees using alternative methods. We show that OrthoPhyl trees are consistent with other methods while incorporating more data, allowing for greater numbers of input genomes, and more flexibility of analysis.
Affiliation(s)
- Earl A Middlebrook
  - Genomics and Bioanalytics Group, Los Alamos National Laboratory, Mailstop M888, Los Alamos, NM 87545, USA
- Robab Katani
  - 401 Huck Life Sciences Building, Huck Institutes of Life Sciences, Pennsylvania State University, University Park, PA 16802, USA
- Jeanne M Fair
  - Genomics and Bioanalytics Group, Los Alamos National Laboratory, Mailstop M888, Los Alamos, NM 87545, USA
7
Tsuda K, Maeno A, Otake A, Kato K, Tanaka W, Hibara KI, Nonomura KI. YABBY and diverged KNOX1 genes shape nodes and internodes in the stem. Science 2024; 384:1241-1247. [PMID: 38870308 DOI: 10.1126/science.adn6748] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Received: 12/22/2023] [Accepted: 05/03/2024] [Indexed: 06/15/2024]
Abstract
Plant stems comprise nodes and internodes that specialize in solute exchange and elongation. However, their boundaries are not well defined, and how these basic units arise remains elusive. In rice with clear nodes and internodes, we found that one subclade of class I knotted1-like homeobox (KNOX1) genes for shoot meristem indeterminacy restricts node differentiation and allows internode formation by repressing YABBY genes for leaf development and genes from another node-specific KNOX1 subclade. YABBYs promote nodal vascular differentiation and limit stem elongation. YABBY and node-specific KNOX1 genes specify the pulvinus, which further elaborates the nodal structure for gravitropism. Notably, this KNOX1 subclade organization is specific to seed plants. We propose that nodes and internodes are distinct domains specified by YABBY-KNOX1 cross-regulation that diverged in early seed plants.
Affiliation(s)
- Katsutoshi Tsuda
  - Plant Cytogenetics Laboratory, Department of Gene Function and Phenomics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
  - Department of Genetics, School of Life Science, Graduate University for Advanced Studies, Mishima, Shizuoka 411-8540, Japan
- Akiteru Maeno
  - Plant Cytogenetics Laboratory, Department of Gene Function and Phenomics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Ayako Otake
  - Plant Cytogenetics Laboratory, Department of Gene Function and Phenomics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Kae Kato
  - Plant Cytogenetics Laboratory, Department of Gene Function and Phenomics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Wakana Tanaka
  - Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Hiroshima 739-8528, Japan
- Ken-Ichiro Hibara
  - Graduate School of Agricultural Regional Vitalization, Kibi International University, Minamiawaji, Hyogo 656-0484, Japan
- Ken-Ichi Nonomura
  - Plant Cytogenetics Laboratory, Department of Gene Function and Phenomics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
  - Department of Genetics, School of Life Science, Graduate University for Advanced Studies, Mishima, Shizuoka 411-8540, Japan
8
Wang W, Dong Z, Du Z, Wu P. Genome-scale approach to reconstructing the phylogenetic tree of psyllids (superfamily Psylloidea) with account of systematic bias. Mol Phylogenet Evol 2023; 189:107924. [PMID: 37699449 DOI: 10.1016/j.ympev.2023.107924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Revised: 09/05/2023] [Accepted: 09/09/2023] [Indexed: 09/14/2023]
Abstract
Psyllids (class Insecta: order Hemiptera: superfamily Psylloidea) are a taxonomically and phylogenetically challenging clade. Recent studies have largely advanced the phylogeny of this group, yet the family-level relationships among Aphalaridae, Carsidaridae, and others remain unresolved. Genome-scale phylogenetic analysis is known to provide a finer resolution for problems like that. However, such phylogenomics also introduces new problems: incorrect trees with high confidence yielded due to systematic error (bias). Here we addressed these issues using hundreds of single-copy orthologous (SCO) genes in psyllid transcriptomes and genomes. Our analyses revealed conflicts between the nucleotide-based and amino-acid-based phylogenetic trees. While the nucleotide-based phylogeny strongly supported the (Aphalaridae + Carsidaridae) + Others relationship, the amino-acid-based one recovered Aphalaridae + (Carsidaridae + Others) with 100% support. Further inspection revealed significant compositional heterogeneity in nucleotide sequences for 67% of SCO genes, but not in the corresponding translated amino acid sequences. We then used different strategies to combat this compositional bias, and found that using the RY-coding strategy (coding the standard nucleotides as purines and pyrimidines) the nucleotide-based phylogeny became consistent with the amino-acid-based one. We further applied RY-coding to a published concatenated nucleotide dataset and recovered the Aphalaridae monophyly (which is refuted by the original literature on non-recoded sequences) at the base of psyllid tree. Moreover, it was found that variations in evolutionary rate could lead to errors in nucleotide-based phylogeny. The fast-evolving Heteropsylla cubana (Psyllidae: Ciriacreminae) was incorrectly placed within the subfamily Psyllinae. This bias can be avoided by using data removal or RY-coding strategies. 
Together, our results strongly support the family relationship of Aphalaridae + (Carsidaridae + Others), and show that the amino-acid-based concatenation analysis is more robust than nucleotide-based one. Future phylogenomic analysis of psyllid nucleotide sequences should take into account methods such as the RY-coding scheme to address potential systematic biases arising from composition and rate heterogeneities.
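The RY-coding strategy mentioned in the abstract recodes the standard nucleotides as purines (R: A, G) and pyrimidines (Y: C, T). A minimal sketch of that recoding is shown below. The function name `ry_code` is hypothetical, and leaving gaps and ambiguity codes unchanged is an assumption made here for illustration, not a description of the authors' pipeline.

```python
# Purines (A, G) map to R; pyrimidines (C, T, and U for RNA) map to Y.
RY_MAP = {"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"}


def ry_code(sequence):
    """Recode a nucleotide sequence into the two-state R/Y alphabet.

    Characters outside A/C/G/T/U (gaps, N, other ambiguity codes) are
    passed through unchanged; this handling is an assumption.
    """
    return "".join(RY_MAP.get(base, base) for base in sequence.upper())


recoded = ry_code("ACGT-N")
```

Because transitions (A to G, C to T) become invisible after recoding, only transversions remain as observed changes, which is why the scheme can reduce the effect of base-composition differences among taxa.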
Affiliation(s)
- Wei Wang
  - State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Zequn Dong
  - State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Zhong Du
  - State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Pengxiang Wu
  - State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100049, China