1
Redelings BD, Holmes I, Lunter G, Pupko T, Anisimova M. Insertions and Deletions: Computational Methods, Evolutionary Dynamics, and Biological Applications. Mol Biol Evol 2024; 41:msae177. [PMID: 39172750 PMCID: PMC11385596 DOI: 10.1093/molbev/msae177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2024] [Revised: 07/02/2024] [Accepted: 07/09/2024] [Indexed: 08/24/2024] Open
Abstract
Insertions and deletions constitute the second most important source of natural genomic variation. They make up to 25% of genomic variants in humans and are involved in complex evolutionary processes including genomic rearrangements, adaptation, and speciation. Recent advances in long-read sequencing technologies allow detailed inference of insertion and deletion variation in species and populations. Yet, despite their importance, evolutionary studies have traditionally ignored or mishandled insertions and deletions due to a lack of comprehensive methodologies and statistical models of insertion and deletion dynamics. Here, we discuss methods for describing insertion and deletion variation and modeling insertions and deletions over evolutionary time. We provide practical advice for tackling insertions and deletions in genomic sequences and illustrate our discussion with examples of insertion- and deletion-induced effects in human and other natural populations and their contribution to evolutionary processes. We outline promising directions for future developments in statistical methodologies that would allow researchers to analyze insertion and deletion variation and its effects in large genomic data sets and to incorporate insertions and deletions in evolutionary inference.
Affiliation(s)
- Ian Holmes
- Department of Bioengineering, University of California, Berkeley, CA 94720, USA
- Calico Life Sciences LLC, South San Francisco, CA 94080, USA
- Gerton Lunter
- Department of Epidemiology, University Medical Center Groningen, University of Groningen, Groningen 9713 GZ, The Netherlands
- Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
- Maria Anisimova
- Institute of Computational Life Sciences, Zurich University of Applied Sciences, Wädenswil, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
2
Chen YR, Harel I, Singh PP, Ziv I, Moses E, Goshtchevsky U, Machado BE, Brunet A, Jarosz DF. Tissue-specific landscape of protein aggregation and quality control in an aging vertebrate. Dev Cell 2024; 59:1892-1911.e13. [PMID: 38810654 PMCID: PMC11265985 DOI: 10.1016/j.devcel.2024.04.014] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 08/30/2023] [Revised: 01/13/2024] [Accepted: 04/15/2024] [Indexed: 05/31/2024]
Abstract
Protein aggregation is a hallmark of age-related neurodegeneration. Yet, aggregation during normal aging and in tissues other than the brain is poorly understood. Here, we leverage the African turquoise killifish to systematically profile protein aggregates in seven tissues of an aging vertebrate. Age-dependent aggregation is strikingly tissue specific and not simply driven by protein expression differences. Experimental interrogation in killifish and yeast, combined with machine learning, indicates that this specificity is linked to protein-autonomous biophysical features and tissue-selective alterations in protein quality control. Co-aggregation of protein quality control machinery during aging may further reduce proteostasis capacity, exacerbating aggregate burden. A segmental progeria model with accelerated aging in specific tissues exhibits selectively increased aggregation in these same tissues. Intriguingly, many age-related protein aggregates arise in wild-type proteins that, when mutated, drive human diseases. Our data chart a comprehensive landscape of protein aggregation during vertebrate aging and identify strong, tissue-specific associations with dysfunction and disease.
Affiliation(s)
- Yiwen R Chen
- Department of Chemical and Systems Biology, Stanford University, Stanford, CA 94305, USA
- Itamar Harel
- The Silberman Institute, the Hebrew University of Jerusalem, Givat Ram, Jerusalem 91904, Israel; Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Param Priya Singh
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Inbal Ziv
- Department of Chemical and Systems Biology, Stanford University, Stanford, CA 94305, USA
- Eitan Moses
- The Silberman Institute, the Hebrew University of Jerusalem, Givat Ram, Jerusalem 91904, Israel
- Uri Goshtchevsky
- The Silberman Institute, the Hebrew University of Jerusalem, Givat Ram, Jerusalem 91904, Israel
- Ben E Machado
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Anne Brunet
- Department of Genetics, Stanford University, Stanford, CA 94305, USA; Glenn Center for the Biology of Aging, Stanford University, Stanford, CA 94305, USA
- Daniel F Jarosz
- Department of Chemical and Systems Biology, Stanford University, Stanford, CA 94305, USA; Department of Developmental Biology, Stanford University, Stanford, CA 94305, USA
3
Wygoda E, Loewenthal G, Moshe A, Alburquerque M, Mayrose I, Pupko T. Statistical framework to determine indel-length distribution. Bioinformatics 2024; 40:btae043. [PMID: 38269647 PMCID: PMC10868340 DOI: 10.1093/bioinformatics/btae043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Revised: 01/10/2024] [Accepted: 01/22/2024] [Indexed: 01/26/2024] Open
Abstract
MOTIVATION: Insertions and deletions (indels) of short DNA segments, along with substitutions, are the most frequent molecular evolutionary events. Indels were shown to affect numerous macro-evolutionary processes. Because indels may span multiple positions, their impact is a product of both their rate and their length distribution. An accurate inference of indel-length distribution is important for multiple evolutionary and bioinformatics applications, most notably for alignment software. Previous studies counted the number of continuous gap characters in alignments to determine the best-fitting length distribution. However, gap-counting methods are not statistically rigorous, as gap blocks are not synonymous with indels. Furthermore, such methods rely on alignments that regularly contain errors and are biased due to the assumption of alignment methods that indel lengths follow a geometric distribution. RESULTS: We aimed to determine which indel-length distribution best characterizes alignments using statistically rigorous methodologies. To this end, we reduced the alignment bias using a machine-learning algorithm and applied an Approximate Bayesian Computation methodology for model selection. Moreover, we developed a novel method to test if current indel models provide an adequate representation of the evolutionary process. We found that the best-fitting model varies among alignments, with a Zipf length distribution fitting the vast majority of them. AVAILABILITY AND IMPLEMENTATION: The data underlying this article are available on GitHub, at https://github.com/elyawy/SpartaSim and https://github.com/elyawy/SpartaPipeline.
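To make the criticized baseline concrete, the following is a minimal sketch of naive gap counting, together with the two candidate length distributions named in the abstract. It is not the authors' method (which relies on machine learning and Approximate Bayesian Computation); the function names, the toy alignment, the truncation length, and all parameter values are assumptions chosen for illustration.

```python
from collections import Counter
import re


def gap_run_lengths(alignment):
    """Tally the lengths of maximal runs of '-' across all aligned rows.

    This is the naive gap-counting baseline: every gap block is treated
    as one indel, which the abstract notes is not statistically rigorous.
    """
    counts = Counter()
    for row in alignment:
        for run in re.findall(r"-+", row):
            counts[len(run)] += 1
    return counts


def geometric_pmf(k, p):
    """P(length = k) for a geometric distribution on k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p


def truncated_zipf_pmf(k, a, max_len):
    """P(length = k) for a Zipf (power-law) distribution truncated at max_len."""
    norm = sum(j ** -a for j in range(1, max_len + 1))
    return k ** -a / norm


# Hypothetical toy alignment (three aligned rows of equal length).
rows = ["ACG--TA-C", "A---GTAAC", "ACGTTTA-C"]
observed = gap_run_lengths(rows)
print(sorted(observed.items()))

# Compare how much probability each model places on long indels.
for k in (1, 5, 20):
    print(k, geometric_pmf(k, p=0.5), truncated_zipf_pmf(k, a=1.5, max_len=50))
```

The comparison loop illustrates why the choice matters: a geometric distribution decays exponentially with length, whereas a Zipf distribution keeps a heavier tail for long indels.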
Affiliation(s)
- Elya Wygoda
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Gil Loewenthal
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Asher Moshe
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Michael Alburquerque
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Itay Mayrose
- School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
4
Hajihassan Z, Afsharian NP, Ansari-Pour N. In silico engineering a CD80 variant with increased affinity to CTLA-4 and decreased affinity to CD28 for optimized cancer immunotherapy. J Immunol Methods 2023; 513:113425. [PMID: 36638881 DOI: 10.1016/j.jim.2023.113425] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/25/2022] [Revised: 11/20/2022] [Accepted: 01/08/2023] [Indexed: 01/11/2023]
Abstract
CD80 (cluster of differentiation 80), also known as B7-1, is a member of the immunoglobulin superfamily that binds to the T cell receptors CTLA-4 and CD28 and induces inhibitory and inductive signals, respectively. Although the CTLA-4 and CD28 receptors belong to the same protein family, slight differences in their structures lead to CD80 having a higher binding affinity to CTLA-4 (-14.55 kcal/mol) than to CD28 (-12.51 kcal/mol). In this study, we constructed a variant of the CD80 protein with increased binding affinity to CTLA-4 and decreased binding affinity to CD28. This variant has no signaling capability and can act as a cap for these receptors to protect them from natural CD80 proteins existing in the body. The first step was the evolutionary and alanine scanning analysis of the CD80 protein to determine conserved regions in this protein. Next, the complex alanine scanning technique was employed to determine CD80 protein hotspots in the CD80-CTLA-4 and CD80-CD28 protein complexes. This information was fed into a computational model developed in R for in silico mutagenesis and CD80 variant library construction. The 3D structures of the variants were modeled using the Swiss model webserver. After modeling the 3D structures, the HADDOCK server was employed to build all protein-protein complexes, which comprise CTLA-4-CD80 variant complexes, wild-type CD80-CD28 complexes, and CD28-CD80 variant complexes. Protein-protein binding free energy was determined using FoldX, and variant number 316, with mutations at positions 29, 31, and 33, showed increased binding affinity to CTLA-4 (-21.43 kcal/mol) and decreased binding affinity to CD28 (-9.54 kcal/mol). Finally, molecular dynamics (MD) simulations confirmed the stability of variant 316. In conclusion, we designed a new CD80 protein variant with potential immunotherapeutic applications.
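The size of the reported effect is easier to judge as differences between the quoted binding free energies. The sketch below only does that arithmetic on the four values given in the abstract; the helper names `affinity_shift` and `selectivity_gap` are hypothetical, and the convention that a more negative energy means tighter binding is the usual one rather than something stated in the source.

```python
# Binding free energies (kcal/mol) as quoted in the abstract.
wild_type = {"CTLA-4": -14.55, "CD28": -12.51}
variant_316 = {"CTLA-4": -21.43, "CD28": -9.54}


def affinity_shift(variant, reference):
    """Change in binding free energy per receptor (variant minus reference).

    Negative values indicate tighter binding than the reference.
    """
    return {r: round(variant[r] - reference[r], 2) for r in reference}


def selectivity_gap(energies):
    """Energy by which CTLA-4 binding is favoured over CD28 binding."""
    return round(energies["CD28"] - energies["CTLA-4"], 2)


shift = affinity_shift(variant_316, wild_type)
print(shift)
print(selectivity_gap(wild_type), selectivity_gap(variant_316))
```

Under these numbers, variant 316 binds CTLA-4 more tightly by 6.88 kcal/mol and CD28 more weakly by 2.97 kcal/mol, which widens the preference for CTLA-4 from 2.04 to 11.89 kcal/mol.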
Affiliation(s)
- Zahra Hajihassan
- Department of Life Science Engineering, Faculty of New Sciences & Technologies, University of Tehran, Tehran, Iran
- Nessa Pesaran Afsharian
- Department of Life Science Engineering, Faculty of New Sciences & Technologies, University of Tehran, Tehran, Iran
- Naser Ansari-Pour
- Department of Life Science Engineering, Faculty of New Sciences & Technologies, University of Tehran, Tehran, Iran; MRC Molecular Haematology Unit, Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK
5
Zhang Y, Zhang Q, Liu Y, Lin M, Ding C. Multiple Sequence Alignment based on deep Q Network with negative feedback policy. Comput Biol Chem 2022; 101:107780. [DOI: 10.1016/j.compbiolchem.2022.107780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2022] [Revised: 09/27/2022] [Accepted: 10/18/2022] [Indexed: 11/28/2022]
6
Jowkar G, Pečerska J, Maiolo M, Gil M, Anisimova M. ARPIP: Ancestral sequence Reconstruction with insertions and deletions under the Poisson Indel Process. Syst Biol 2022:6648472. [PMID: 35866991 DOI: 10.1093/sysbio/syac050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2021] [Accepted: 07/06/2022] [Indexed: 11/12/2022] Open
Abstract
Modern phylogenetic methods allow inference of ancestral molecular sequences given an alignment and phylogeny relating present day sequences. This provides insight into the evolutionary history of molecules, helping to understand gene function and to study biological processes such as adaptation and convergent evolution across a variety of applications. Here we propose a dynamic programming algorithm for fast joint likelihood-based reconstruction of ancestral sequences under the Poisson Indel Process (PIP). Unlike previous approaches, our method, named ARPIP, enables the reconstruction with insertions and deletions based on an explicit indel model. Consequently, inferred indel events have an explicit biological interpretation. Likelihood computation is achieved in linear time with respect to the number of sequences. Our method consists of two steps, namely finding the most probable indel points and reconstructing ancestral sequences. First, we find the most likely indel points and prune the phylogeny to reflect the insertion and deletion events per site. Second, we infer the ancestral states on the pruned subtree in a manner similar to FastML. We applied ARPIP on simulated datasets and on real data from the Betacoronavirus genus. ARPIP reconstructs both the indel events and substitutions with a high degree of accuracy. Our method fares well when compared to established state-of-the-art methods such as FastML and PAML. Moreover, the method can be extended to explore both optimal and suboptimal reconstructions, include rate heterogeneity through time and more. We believe it will expand the range of novel applications of ancestral sequence reconstruction.
Affiliation(s)
- Gholamhossein Jowkar
- Zurich University of Applied Sciences, School of Life Sciences and Facility Management, CH-8820 Wädenswil, Switzerland; Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland; University of Neuchâtel, Institute of Biology, CH-2000 Neuchâtel, Switzerland
- Jūlija Pečerska
- Zurich University of Applied Sciences, School of Life Sciences and Facility Management, CH-8820 Wädenswil, Switzerland; Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland
- Massimo Maiolo
- Zurich University of Applied Sciences, School of Life Sciences and Facility Management, CH-8820 Wädenswil, Switzerland; Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland; University of Bern, Institute of Pathology, CH-3008 Bern, Switzerland
- Manuel Gil
- Zurich University of Applied Sciences, School of Life Sciences and Facility Management, CH-8820 Wädenswil, Switzerland; Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland
- Maria Anisimova
- Zurich University of Applied Sciences, School of Life Sciences and Facility Management, CH-8820 Wädenswil, Switzerland; Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland
7
Chao J, Tang F, Xu L. Developments in Algorithms for Sequence Alignment: A Review. Biomolecules 2022; 12:biom12040546. [PMID: 35454135 PMCID: PMC9024764 DOI: 10.3390/biom12040546] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 02/11/2022] [Revised: 03/29/2022] [Accepted: 03/31/2022] [Indexed: 01/27/2023] Open
Abstract
The continuous development of sequencing technologies has enabled researchers to obtain large amounts of biological sequence data, and this has resulted in increasing demands for software that can perform sequence alignment fast and accurately. A number of algorithms and tools for sequence alignment have been designed to meet the various needs of biologists. Here, the ideas that prevail in the research of sequence alignment and some quality estimation methods for multiple sequence alignment tools are summarized.
Affiliation(s)
- Jiannan Chao
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Furong Tang
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
8
Maiolo M, Gatti L, Frei D, Leidi T, Gil M, Anisimova M. ProPIP: a tool for progressive multiple sequence alignment with Poisson Indel Process. BMC Bioinformatics 2021; 22:518. [PMID: 34689750 PMCID: PMC8543915 DOI: 10.1186/s12859-021-04442-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/29/2020] [Accepted: 10/13/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Current alignment tools typically lack an explicit model of indel evolution, leading to artificially short inferred alignments (i.e., over-alignment) due to inconsistencies between the indel history and the phylogeny relating the input sequences. RESULTS: We present a new progressive multiple sequence alignment tool, ProPIP. The process of insertions and deletions is described using an explicit evolutionary model, the Poisson Indel Process (PIP). The method is based on dynamic programming and is implemented in a frequentist framework. The algorithm is implemented in C++ as a standalone program, and the source code can be compiled on Linux, macOS and Microsoft Windows platforms. The source code is freely available on GitHub at https://github.com/acg-team/ProPIP and is distributed under the terms of the GNU GPL v3 license. CONCLUSIONS: The use of an explicit indel evolution model makes it possible to avoid over-alignment, to infer gaps in a phylogenetically consistent way and to make inferences about the rates of insertions and deletions. Instead of arbitrary gap penalties, the parameters used by ProPIP are the insertion and deletion rates, which have a biological interpretation and are contextualized in a probabilistic environment. As a result, indel rate settings may be optimised in order to infer phylogenetically meaningful gap patterns.
Affiliation(s)
- Massimo Maiolo
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Schloss 1, Postfach, 8820, Wädenswil, Switzerland; Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland
- Lorenzo Gatti
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Schloss 1, Postfach, 8820, Wädenswil, Switzerland; Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland
- Diego Frei
- Institute of Information Systems and Networking, University of Applied Sciences and Arts of Southern Switzerland, Galleria 2, Via Cantonale 2c, 6928, Manno, Switzerland
- Tiziano Leidi
- Institute of Information Systems and Networking, University of Applied Sciences and Arts of Southern Switzerland, Galleria 2, Via Cantonale 2c, 6928, Manno, Switzerland
- Manuel Gil
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Schloss 1, Postfach, 8820, Wädenswil, Switzerland; Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland
- Maria Anisimova
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Schloss 1, Postfach, 8820, Wädenswil, Switzerland; Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland
9
Redelings BD. BAli-Phy version 3: model-based co-estimation of alignment and phylogeny. Bioinformatics 2021; 37:3032-3034. [PMID: 33677478 DOI: 10.1093/bioinformatics/btab129] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 03/21/2020] [Revised: 02/10/2021] [Accepted: 02/26/2021] [Indexed: 02/02/2023] Open
Abstract
SUMMARY: We describe improvements to BAli-Phy, a Markov chain Monte Carlo (MCMC) program that jointly estimates phylogeny, alignment and other parameters from unaligned sequence data. Version 3 is substantially faster for large trees, and implements covarion models, additional codon models and other new models. It implements ancestral state reconstruction, allows prior selection for all model parameters, and can also analyze multiple genes simultaneously. AVAILABILITY AND IMPLEMENTATION: Software is available for download at http://www.bali-phy.org. C++ source code is freely available on GitHub under the GPL2 License. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Benjamin D Redelings
- Biology Department, Duke University, Durham, NC 27708, USA; Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS 66045, USA; Ronin Institute, Durham, NC 27705, USA
10
Metaheuristics for multiple sequence alignment: A systematic review. Comput Biol Chem 2021; 94:107563. [PMID: 34425495 DOI: 10.1016/j.compbiolchem.2021.107563] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/07/2020] [Revised: 08/04/2021] [Accepted: 08/09/2021] [Indexed: 11/21/2022]
Abstract
Multiple Sequence Alignment (MSA) is a key task in bioinformatics because it is used in several important biological analyses, such as function and structure prediction of unknown proteins. There are several approaches to performing MSA, and metaheuristics stand out because of their search ability, which generally leads to good results in a reasonable amount of time. This paper presents a Systematic Literature Review (SLR) on metaheuristics for MSA, compiling relevant works published between 2014 and 2019. The results of our SLR show sustained interest in this subject, as evidenced by several recent publications that use different metaheuristics to obtain more accurate alignments. Moreover, the final results of our SLR show a trend toward multi-objective and hybrid approaches, which generally leads these methods to achieve even better results. Thus, we show in this work that the use of metaheuristics to perform MSA remains an important and promising open research field.
11
Delucchi M, Näf P, Bliven S, Anisimova M. TRAL 2.0: Tandem Repeat Detection With Circular Profile Hidden Markov Models and Evolutionary Aligner. Frontiers in Bioinformatics 2021; 1:691865. [PMID: 36303789 PMCID: PMC9581039 DOI: 10.3389/fbinf.2021.691865] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/07/2021] [Accepted: 06/11/2021] [Indexed: 11/13/2022] Open
Abstract
The Tandem Repeat Annotation Library (TRAL) focuses on analyzing tandem repeat units in genomic sequences. TRAL can integrate and harmonize tandem repeat annotations from a large number of external tools, and provides a statistical model for evaluating and filtering the detected repeats. TRAL version 2.0 includes new features such as a module for identifying repeats from circular profile hidden Markov models, a new repeat alignment method based on the progressive Poisson Indel Process, an improved installation procedure and a docker container. TRAL is an open-source Python 3 library and is available, together with documentation and tutorials, via vital-it.ch/software/tral.
Affiliation(s)
- Matteo Delucchi
- Institute of Applied Simulations, School of Life Sciences und Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Paulina Näf
- Institute of Applied Simulations, School of Life Sciences und Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Spencer Bliven
- Institute of Applied Simulations, School of Life Sciences und Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Laboratory for Scientific Computing and Modelling, Paul Scherrer Institute, Villigen PSI, Villigen, Switzerland
- Maria Anisimova
- Institute of Applied Simulations, School of Life Sciences und Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Correspondence: Maria Anisimova
12
Kaden M, Bohnsack KS, Weber M, Kudła M, Gutowska K, Blazewicz J, Villmann T. Learning vector quantization as an interpretable classifier for the detection of SARS-CoV-2 types based on their RNA sequences. Neural Comput Appl 2021; 34:67-78. [PMID: 33935376 PMCID: PMC8076884 DOI: 10.1007/s00521-021-06018-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/13/2020] [Accepted: 04/07/2021] [Indexed: 02/06/2023]
Abstract
We present an approach to discriminate SARS-CoV-2 virus types based on their RNA sequence descriptions while avoiding a sequence alignment. For that purpose, sequences are preprocessed by feature extraction and the resulting feature vectors are analyzed by prototype-based classification to remain interpretable. In particular, we propose to use variants of learning vector quantization (LVQ) based on dissimilarity measures for RNA sequence data. The respective matrix LVQ provides additional knowledge about the classification decisions, such as discriminant feature correlations, and can additionally be equipped with easy-to-realize reject options for uncertain data. Those options provide self-controlled evidence, i.e., the model refuses to make a classification decision if the model evidence for the presented data is not sufficient. This model is first trained using a GISAID dataset with given virus types detected according to the molecular differences in coronavirus populations by phylogenetic tree clustering. In a second step, we apply the trained model to another, unlabeled SARS-CoV-2 virus dataset. For these data, we can either assign a virus type to the sequences or reject atypical samples. The rejected sequences allow speculation about new virus types with respect to nucleotide base mutations in the viral sequences. Moreover, this rejection analysis improves model robustness. Last but not least, the presented approach has lower computational complexity compared to methods based on (multiple) sequence alignment. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00521-021-06018-2.
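The idea of alignment-free, prototype-based classification with a reject option can be sketched as follows. This is only an illustration of the inference step under stated assumptions, not the authors' matrix LVQ: the k-mer frequency features, the Euclidean distance, the fixed distance threshold `max_dist`, and the hand-made prototypes are all hypothetical choices, and no LVQ training is performed here.

```python
from itertools import product
import math


def kmer_profile(seq, k=2, alphabet="ACGU"):
    """Alignment-free feature vector: relative frequencies of all k-mers.

    The choice of k-mer frequencies as features is an assumption made
    for this sketch; windows containing other symbols are skipped.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in kmers]


def classify_with_reject(x, prototypes, max_dist):
    """Return the label of the nearest prototype, or None to reject.

    prototypes is a list of (label, vector) pairs. A sample is rejected
    when even the closest prototype is farther away than max_dist.
    """
    label, dist = min(
        ((lab, math.dist(x, vec)) for lab, vec in prototypes),
        key=lambda pair: pair[1],
    )
    return label if dist <= max_dist else None


# Hypothetical prototypes; in LVQ these would be learned from labeled data.
prototypes = [
    ("type_A", kmer_profile("AAAAAAAA")),
    ("type_G", kmer_profile("GGGGGGGG")),
]

typical = classify_with_reject(kmer_profile("AAAAAAAG"), prototypes, max_dist=0.5)
atypical = classify_with_reject(kmer_profile("CUCUCUCU"), prototypes, max_dist=0.5)
print(typical, atypical)
```

Because each sequence is reduced to a fixed-length vector and compared only with a handful of prototypes, the cost per sequence is linear in its length, which is consistent with the lower computational complexity claimed relative to alignment-based methods.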
Affiliation(s)
- Marika Kaden
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
- Katrin Sophie Bohnsack
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
- Mirko Weber
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
- Mateusz Kudła
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Kaja Gutowska
- Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland
- European Centre for Bioinformatics and Genomics, Piotrowo 2, 60-965 Poznan, Poland
- Jacek Blazewicz
- Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland
- European Centre for Bioinformatics and Genomics, Piotrowo 2, 60-965 Poznan, Poland
- Thomas Villmann
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
13
Scossa F, Fernie AR. Ancestral sequence reconstruction - An underused approach to understand the evolution of gene function in plants? Comput Struct Biotechnol J 2021; 19:1579-1594. [PMID: 33868595 PMCID: PMC8039532 DOI: 10.1016/j.csbj.2021.03.008] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 01/03/2021] [Revised: 03/04/2021] [Accepted: 03/06/2021] [Indexed: 02/06/2023] Open
Abstract
Whilst substantial research effort has been placed on understanding the interactions of plant proteins with their molecular partners, relatively few studies in plants, in contrast to work in other organisms, address how these interactions evolve. It is thought that ancestral proteins were more promiscuous than modern proteins and that specificity often evolved following gene duplication and subsequent functional refining. However, ancestral protein resurrection studies have found that some modern proteins have evolved de novo from ancestors lacking those functions. Intriguingly, the new interactions evolved as a consequence of just a few mutations and, as such, acquisition of new functions appears to be neither difficult nor rare; however, only a few of them are incorporated into biological processes before they are lost to subsequent mutations. Here, we detail the approach of ancestral sequence reconstruction (ASR), providing a primer to reconstruct the sequence of an ancestral gene. We will present case studies from a range of different eukaryotes before discussing the few instances where ancestral reconstructions have been used in plants. As ASR is used to dig into the remote evolutionary past, we will also present some alternative genetic approaches to investigate molecular evolution on shorter timescales. We argue that the study of plant secondary metabolism is particularly well suited for ancestral reconstruction studies. Indeed, its ancient evolutionary roots and highly diverse landscape provide an ideal context in which to address the focal issue around the emergence of evolutionary novelties and how this affects the chemical diversification of plant metabolism.
Key Words
- APR, ancestral protein resurrection
- ASR, ancestral sequence reconstruction
- Ancestral sequence reconstruction
- CDS, coding sequence
- Evolution
- GR, glucocorticoid receptor
- GWAS, genome wide association study
- Genomics
- InDel, insertion/deletion
- MCMC, Markov Chain Monte Carlo
- ML, maximum likelihood
- MP, maximum parsimony
- MR, mineralocorticoid receptor
- MSA, multiple sequence alignment
- Metabolism
- NJ, neighbor-joining
- Phylogenetics
- Plants
- SFS, site frequency spectrum
Affiliation(s)
- Federico Scossa
- Max-Planck-Institute of Molecular Plant Physiology (MPI-MP), 14476 Potsdam-Golm, Germany
- Council for Agricultural Research and Economics (CREA), Research Centre for Genomics and Bioinformatics (CREA-GB), Rome, Italy
- Alisdair R. Fernie
- Max-Planck-Institute of Molecular Plant Physiology (MPI-MP), 14476 Potsdam-Golm, Germany
- Center of Plant Systems Biology and Biotechnology (CPSBB), Plovdiv, Bulgaria
14
Maiolo M, Ulzega S, Gil M, Anisimova M. Accelerating phylogeny-aware alignment with indel evolution using short time Fourier transform. NAR Genom Bioinform 2021; 2:lqaa092. [PMID: 33575636 PMCID: PMC7671320 DOI: 10.1093/nargab/lqaa092] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/23/2020] [Revised: 10/15/2020] [Accepted: 10/22/2020] [Indexed: 11/14/2022] Open
Abstract
Recently we presented a frequentist dynamic programming (DP) approach for multiple sequence alignment based on an explicit model of indel evolution, the Poisson Indel Process (PIP). This phylogeny-aware approach produces evolutionarily meaningful gap patterns and is robust to the ‘over-alignment’ bias. Despite linear time complexity for the computation of marginal likelihoods, the overall method’s complexity is cubic in sequence length. Inspired by the popular aligner MAFFT, we propose a new technique to accelerate the evolutionary indel-based alignment. Amino acid sequences are converted to sequences representing their physicochemical properties, and homologous blocks are identified by multi-scale short-time Fourier transform. Three three-dimensional DP matrices are then created under PIP, with homologous blocks defining sparse structures where most cells are excluded from the calculations. The homologous blocks are connected through intermediate ‘linking blocks’. The homologous and linking blocks are aligned under PIP as independent DP sub-matrices and their tracebacks merged to yield the final alignment. The new algorithm can largely profit from parallel computing, yielding a theoretical speed-up estimated to be proportional to the cubic power of the number of sub-blocks in the DP matrices. We compare the new method to the original PIP approach and demonstrate it on real data.
Affiliation(s)
- Massimo Maiolo
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), CH-8820 Wädenswil, Switzerland
- Simone Ulzega
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), CH-8820 Wädenswil, Switzerland
- Manuel Gil
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), CH-8820 Wädenswil, Switzerland
- Maria Anisimova
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), CH-8820 Wädenswil, Switzerland