1
Dennler O, Ryan CJ. Evaluating sequence and structural similarity metrics for predicting shared paralog functions. NAR Genom Bioinform 2025; 7:lqaf051. [PMID: 40290317 PMCID: PMC12034104 DOI: 10.1093/nargab/lqaf051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2024] [Revised: 03/07/2025] [Accepted: 04/15/2025] [Indexed: 04/30/2025] [Open access]
Abstract
Gene duplication is the primary source of new genes, resulting in most genes having identifiable paralogs. Over time, paralog pairs may diverge in some respects but many retain the ability to perform the same functional role. Protein sequence identity is often used as a proxy for functional similarity and can predict shared functions between paralogs as revealed by synthetic lethal experiments. However, the advent of alternative protein representations, including embeddings from protein language models (PLMs) and predicted structures from AlphaFold, raises the possibility that alternative similarity metrics could better capture functional similarity between paralogs. Here, using two species (budding yeast and human) and two different definitions of shared functionality (shared protein-protein interactions and synthetic lethality), we evaluated a variety of alternative similarity metrics. For some tasks, predicted structural similarity or PLM similarity outperform sequence identity, but more importantly these similarity metrics are not redundant with sequence identity, i.e. combining them with sequence identity leads to improved predictions of shared functionality. By adding contextual features, representing similarity to homologous proteins within and across species, we can significantly enhance our predictions of shared paralog functionality. Overall, our results suggest that alternative similarity metrics capture complementary aspects of functional similarity beyond sequence identity alone.
Affiliation(s)
- Olivier Dennler
  - School of Medicine, University College Dublin, Dublin 4, D04 V1W8, Ireland
  - School of Computer Science, University College Dublin, Dublin 4, D04 V1W8, Ireland
  - Conway Institute, University College Dublin, Dublin 4, D04 V1W8, Ireland
- Colm J Ryan
  - School of Medicine, University College Dublin, Dublin 4, D04 V1W8, Ireland
  - School of Computer Science, University College Dublin, Dublin 4, D04 V1W8, Ireland
  - Conway Institute, University College Dublin, Dublin 4, D04 V1W8, Ireland
2
Allert MJ, Kumar S, Wang Y, Beese LS, Hellinga HW. Accurate Identification of Periplasmic Urea-binding Proteins by Structure- and Genome Context-assisted Functional Analysis. J Mol Biol 2024; 436:168780. [PMID: 39241982 DOI: 10.1016/j.jmb.2024.168780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Revised: 08/29/2024] [Accepted: 08/31/2024] [Indexed: 09/09/2024]
Abstract
ABC transporters are ancient and ubiquitous nutrient transport systems in bacteria and play a central role in defining lifestyles. Periplasmic solute-binding proteins (SBPs) are components that deliver ligands to their translocation machinery. SBPs have diversified to bind a wide range of ligands with high specificity and affinity. However, accurate assignment of cognate ligands remains a challenging problem in SBPs. Urea metabolism plays an important role in the nitrogen cycle; anthropogenic sources account for more than half of global nitrogen fertilizer. We report identification of urea-binding proteins within a large SBP sequence family that encodes diverse functions. By combining genetic linkage between SBPs, ABC transporter components, enzymes or transcription factors, we accurately identified cognate ligands, as we verified experimentally by biophysical characterization of ligand binding and crystallographic determination of the urea complex of a thermostable urea-binding homolog. Using three-dimensional structure information, these functional assignments were extrapolated to other members in the sequence family lacking genetic linkage information, which revealed that only a fraction bind urea. Using the same combined approaches, we also inferred that other family members bind various short-chain amides, aliphatic amino acids (leucine, isoleucine, valine), γ-aminobutyrate, and as yet unknown ligands. Comparative structural analysis revealed structural adaptations that encode diversification in these SBPs. Systematic assignment of ligands to SBP sequence families is key to understanding bacterial lifestyles, and also provides a rich source of biosensors for clinical and environmental analysis, such as the thermostable urea-binding protein identified here.
Affiliation(s)
- Malin J Allert
  - Department of Biochemistry, Duke University Medical Center, Durham, NC 27710, USA.
- Shivesh Kumar
  - Department of Biochemistry, Duke University Medical Center, Durham, NC 27710, USA; Department of Biochemistry and Molecular Biophysics, Washington University in St. Louis, MO 63110, USA.
- You Wang
  - Department of Biochemistry, Duke University Medical Center, Durham, NC 27710, USA.
- Lorena S Beese
  - Department of Biochemistry, Duke University Medical Center, Durham, NC 27710, USA.
- Homme W Hellinga
  - Department of Biochemistry, Duke University Medical Center, Durham, NC 27710, USA.
3
Barrios-Núñez I, Martínez-Redondo G, Medina-Burgos P, Cases I, Fernández R, Rojas A. Decoding functional proteome information in model organisms using protein language models. NAR Genom Bioinform 2024; 6:lqae078. [PMID: 38962255 PMCID: PMC11217674 DOI: 10.1093/nargab/lqae078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2024] [Revised: 05/31/2024] [Accepted: 06/26/2024] [Indexed: 07/05/2024] [Open access]
Abstract
Protein language models have been tested and proved to be reliable when used on curated datasets but have not yet been applied to full proteomes. Accordingly, we tested how two different machine learning-based methods performed when decoding functional information from the proteomes of selected model organisms. We found that protein language models are more precise and informative than deep learning methods for all the species tested and across the three gene ontologies studied, and that they better recover functional information from transcriptomic experiments. The results obtained indicate that these language models are likely to be suitable for large-scale annotation and downstream analyses, and we recommend a guide for their use.
Affiliation(s)
- Israel Barrios-Núñez
  - Computational Biology and Bioinformatics Group, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain
- Patricia Medina-Burgos
  - Computational Biology and Bioinformatics Group, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain
- Ildefonso Cases
  - Bioinformatics Unit, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain
- Rosa Fernández
  - Metazoa Phylogenomics Lab, Institute of Evolutionary Biology (CSIC-UPF), 08003 Barcelona, Spain
- Ana M Rojas
  - Computational Biology and Bioinformatics Group, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain
4
Hera MR, Liu S, Wei W, Rodriguez JS, Ma C, Koslicki D. Metagenomic functional profiling: to sketch or not to sketch? Bioinformatics 2024; 40:ii165-ii173. [PMID: 39230701 PMCID: PMC11373326 DOI: 10.1093/bioinformatics/btae397] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024] [Open access]
Abstract
MOTIVATION Functional profiling of metagenomic samples is essential to decipher the functional capabilities of microbial communities. Traditional and more widely used functional profilers in the context of metagenomics rely on aligning reads against a known reference database. However, aligning sequencing reads against a large and fast-growing database is computationally expensive. In general, k-mer-based sketching techniques have been successfully used in metagenomics to address this bottleneck, notably in taxonomic profiling. In this work, we describe leveraging FracMinHash (implemented in sourmash, a publicly available software), a k-mer-sketching algorithm, to obtain functional profiles of metagenome samples. RESULTS We show how pieces of the sourmash software (and the resulting FracMinHash sketches) can be put together in a pipeline to functionally profile a metagenomic sample. We named our pipeline fmh-funprofiler. We report that the functional profiles obtained using this pipeline demonstrate comparable completeness and better purity compared to the profiles obtained using other alignment-based methods when applied to simulated metagenomic data. We also report that fmh-funprofiler is 39-99× faster in wall-clock time, and consumes up to 40-55× less memory. Coupled with the KEGG database, this method not only replicates fundamental biological insights but also highlights novel signals from the Human Microbiome Project datasets. AVAILABILITY AND IMPLEMENTATION This fast and lightweight metagenomic functional profiler is freely available and can be accessed here: https://github.com/KoslickiLab/fmh-funprofiler. All scripts of the analyses we present in this manuscript can be found on GitHub.
Affiliation(s)
- Mahmudur Rahman Hera
  - School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, Pennsylvania 16802, United States
- Shaopeng Liu
  - Bioinformatics and Genomics, Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, United States
- Wei Wei
  - Bioinformatics and Genomics, Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, United States
- Judith S Rodriguez
  - Bioinformatics and Genomics, Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, United States
- Chunyu Ma
  - Bioinformatics and Genomics, Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, United States
- David Koslicki
  - School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, Pennsylvania 16802, United States
  - Bioinformatics and Genomics, Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, United States
  - Department of Biology, Pennsylvania State University, University Park, Pennsylvania 16802, United States
5
Beavan A, Domingo-Sananes MR, McInerney JO. Contingency, repeatability, and predictability in the evolution of a prokaryotic pangenome. Proc Natl Acad Sci U S A 2024; 121:e2304934120. [PMID: 38147560 PMCID: PMC10769857 DOI: 10.1073/pnas.2304934120] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Received: 03/27/2023] [Accepted: 11/05/2023] [Indexed: 12/28/2023] [Open access]
Abstract
Pangenomes exhibit remarkable variability in many prokaryotic species, much of which is maintained through the processes of horizontal gene transfer and gene loss. Repeated acquisitions of near-identical homologs can easily be observed across pangenomes, leading to the question of whether these parallel events potentiate similar evolutionary trajectories, or whether the remarkably different genetic backgrounds of the recipients mean that postacquisition evolutionary trajectories end up being quite different. In this study, we present a machine learning method that predicts the presence or absence of genes in the Escherichia coli pangenome based on complex patterns of the presence or absence of other accessory genes within a genome. Our analysis leverages the repeated transfer of genes through the E. coli pangenome to observe patterns of repeated evolution following similar events. We find that the presence or absence of a substantial set of genes is highly predictable from other genes alone, indicating that selection potentiates and maintains gene-gene co-occurrence and avoidance relationships deterministically over long-term bacterial evolution and is robust to differences in host evolutionary history. We propose that at least part of the pangenome can be understood as a set of genes with relationships that govern their likely cohabitants, analogous to an ecosystem's set of interacting organisms. Our findings indicate that intragenomic gene fitness effects may be key drivers of prokaryotic evolution, influencing the repeated emergence of complex gene-gene relationships across the pangenome.
Affiliation(s)
- Alan Beavan
  - School of Life Sciences, The University of Nottingham, Nottingham NG7 2UH, United Kingdom
- Maria Rosa Domingo-Sananes
  - School of Life Sciences, The University of Nottingham, Nottingham NG7 2UH, United Kingdom
  - School of Science and Technology, Nottingham Trent University, Nottingham NG1 4FQ, United Kingdom
- James O. McInerney
  - School of Life Sciences, The University of Nottingham, Nottingham NG7 2UH, United Kingdom
6
Hellmuth M, Stadler PF. The Theory of Gene Family Histories. Methods Mol Biol 2024; 2802:1-32. [PMID: 38819554 DOI: 10.1007/978-1-0716-3838-5_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Most genes are part of larger families of evolutionary-related genes. The history of gene families typically involves duplications and losses of genes as well as horizontal transfers into other organisms. The reconstruction of detailed gene family histories, i.e., the precise dating of evolutionary events relative to the phylogenetic tree of the underlying species, has remained a challenging topic despite their importance as a basis for detailed investigations into adaptation and functional evolution of individual members of the gene family. The identification of orthologs, moreover, is a particularly important subproblem of the more general setting considered here. In the last few years, an extensive body of mathematical results has appeared that tightly links orthology, a formal notion of best matches among genes, and horizontal gene transfer. The purpose of this chapter is to broadly outline some of the key mathematical insights and to discuss their implication for practical applications. In particular, we focus on tree-free methods, i.e., methods to infer orthology or horizontal gene transfer as well as gene trees, species trees, and reconciliations between them without using a priori knowledge of the underlying trees or statistical models for the inference of phylogenetic trees. Instead, the initial step aims to extract binary relations among genes.
Affiliation(s)
- Marc Hellmuth
  - Department of Mathematics, Faculty of Science, Stockholm University, Stockholm, Sweden
- Peter F Stadler
  - Bioinformatics Group, Department of Computer Science, Leipzig University, Leipzig, Germany.
  - Interdisciplinary Center for Bioinformatics, Leipzig University, Leipzig, Germany.
  - Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany.
  - Universidad Nacional de Colombia, Bogotá, Colombia.
  - Institute for Theoretical Chemistry, University of Vienna, Wien, Austria.
  - Center for non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark.
  - Santa Fe Institute, Santa Fe, NM, USA.
7
Song Y, Miao Z, Brazma A, Papatheodorou I. Benchmarking strategies for cross-species integration of single-cell RNA sequencing data. Nat Commun 2023; 14:6495. [PMID: 37838716 PMCID: PMC10576752 DOI: 10.1038/s41467-023-41855-w] [Citation(s) in RCA: 29] [Impact Index Per Article: 14.5] [Received: 11/11/2022] [Accepted: 09/21/2023] [Indexed: 10/16/2023] [Open access]
Abstract
The growing number of available single-cell gene expression datasets from different species creates opportunities to explore evolutionary relationships between cell types across species. Cross-species integration of single-cell RNA-sequencing data has been particularly informative in this context. However, in order to do so robustly it is essential to have rigorous benchmarking and appropriate guidelines to ensure that integration results truly reflect biology. Here, we benchmark 28 combinations of gene homology mapping methods and data integration algorithms in a variety of biological settings. We examine the capability of each strategy to perform species-mixing of known homologous cell types and to preserve biological heterogeneity using 9 established metrics. We also develop a new biology conservation metric to address the maintenance of cell type distinguishability. Overall, scANVI, scVI and SeuratV4 methods achieve a balance between species-mixing and biology conservation. For evolutionarily distant species, including in-paralogs is beneficial. SAMap outperforms when integrating whole-body atlases between species with challenging gene homology annotation. We provide our freely available cross-species integration and assessment pipeline to help analyse new data and develop new algorithms.
Affiliation(s)
- Yuyao Song
  - European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, United Kingdom.
- Zhichao Miao
  - European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, United Kingdom
  - Guangzhou Laboratory, Guangzhou International Bio Island, Guangzhou, 510005, China
- Alvis Brazma
  - European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, United Kingdom
- Irene Papatheodorou
  - European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SA, United Kingdom.
8
Ruperti F, Papadopoulos N, Musser JM, Mirdita M, Steinegger M, Arendt D. Cross-phyla protein annotation by structural prediction and alignment. Genome Biol 2023; 24:113. [PMID: 37173746 PMCID: PMC10176882 DOI: 10.1186/s13059-023-02942-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/02/2022] [Accepted: 04/18/2023] [Indexed: 05/15/2023] [Open access]
Abstract
BACKGROUND Protein annotation is a major goal in molecular biology, yet experimentally determined knowledge is typically limited to a few model organisms. In non-model species, the sequence-based prediction of gene orthology can be used to infer protein identity; however, this approach loses predictive power at longer evolutionary distances. Here we propose a workflow for protein annotation using structural similarity, exploiting the fact that similar protein structures often reflect homology and are more conserved than protein sequences. RESULTS We propose a workflow of openly available tools for the functional annotation of proteins via structural similarity (MorF: MorphologFinder) and use it to annotate the complete proteome of a sponge. Sponges are highly relevant for inferring the early history of animals, yet their proteomes remain sparsely annotated. MorF accurately predicts the functions of proteins with known homology in [Formula: see text] cases and annotates an additional [Formula: see text] of the proteome beyond standard sequence-based methods. We uncover new functions for sponge cell types, including extensive FGF, TGF, and Ephrin signaling in sponge epithelia, and redox metabolism and control in myopeptidocytes. Notably, we also annotate genes specific to the enigmatic sponge mesocytes, proposing they function to digest cell walls. CONCLUSIONS Our work demonstrates that structural similarity is a powerful approach that complements and extends sequence similarity searches to identify homologous proteins over long evolutionary distances. We anticipate this will be a powerful approach that boosts discovery in numerous -omics datasets, especially for non-model organisms.
Affiliation(s)
- Fabian Ruperti
  - Developmental Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
  - Faculty of Biosciences, Collaboration for joint Ph.D. degree between EMBL and Heidelberg University, Heidelberg, Germany
- Nikolaos Papadopoulos
  - Developmental Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
  - Department for Evolutionary Biology, University of Vienna, Vienna, Austria
- Jacob M Musser
  - Developmental Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Milot Mirdita
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
- Martin Steinegger
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
  - Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
- Detlev Arendt
  - Developmental Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany.
  - Centre for Organismal Studies, University of Heidelberg, Heidelberg, Germany.
9
Titus-McQuillan JE, Nanni AV, McIntyre LM, Rogers RL. Estimating transcriptome complexities across eukaryotes. BMC Genomics 2023; 24:254. [PMID: 37170194 PMCID: PMC10173493 DOI: 10.1186/s12864-023-09326-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/02/2023] [Accepted: 04/20/2023] [Indexed: 05/13/2023] [Open access]
Abstract
BACKGROUND Genomic complexity is a growing field of evolution, with case studies for comparative evolutionary analyses in model and emerging non-model systems. Understanding complexity and the functional components of the genome is an untapped wealth of knowledge ripe for exploration. With the "remarkable lack of correspondence" between genome size and complexity, there needs to be a way to quantify complexity across organisms. In this study, we use a set of complexity metrics that allow for evaluating changes in complexity using TranD. RESULTS We ascertain if complexity is increasing or decreasing across transcriptomes and at what structural level, as complexity varies. In this study, we define three metrics - TpG, EpT, and EpG - to quantify the transcriptome's complexity that encapsulates the dynamics of alternative splicing. Here we compare complexity metrics across 1) whole genome annotations, 2) a filtered subset of orthologs, and 3) novel genes to elucidate the impacts of orthologs and novel genes in transcript model analysis. Effective Exon Number (EEN) is used to compare the distribution of exon sizes within transcripts against random expectations of uniform exon placement. EEN accounts for differences in exon size, which is important because novel gene differences in complexity for orthologs and whole-transcriptome analyses are biased towards low-complexity genes with few exons and few alternative transcripts. CONCLUSIONS With our metric analyses, we are able to quantify changes in complexity across diverse lineages with greater precision and accuracy than previous cross-species comparisons under ortholog conditioning. These analyses represent a step toward whole-transcriptome analysis in the emerging field of non-model evolutionary genomics, with key insights for evolutionary inference of complexity changes on deep timescales across the tree of life. We suggest a means to quantify biases generated in ortholog calling and correct complexity analysis for lineage-specific effects. With these metrics, we directly assay the quantitative properties of newly formed lineage-specific genes as they lower complexity.
Affiliation(s)
- James E Titus-McQuillan
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, 28223, USA.
- Adalena V Nanni
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL, 32611, USA
  - University of Florida Genetics Institute, University of Florida, Gainesville, FL, 32611, USA
- Lauren M McIntyre
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL, 32611, USA
  - University of Florida Genetics Institute, University of Florida, Gainesville, FL, 32611, USA
- Rebekah L Rogers
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
10
Gamboa M, Kitamura N, Miura K, Noda S, Kaminuma O. Evolutionary mechanisms underlying the diversification of nuclear factor of activated T cells across vertebrates. Sci Rep 2023; 13:6468. [PMID: 37156933 PMCID: PMC10167247 DOI: 10.1038/s41598-023-33751-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2022] [Accepted: 04/18/2023] [Indexed: 05/10/2023] [Open access]
Abstract
The mechanisms of immunity linked to biological evolution are crucial for understanding animal morphogenesis, organogenesis, and biodiversity. The nuclear factor of activated T cells (NFAT) family consists of five members (NFATc1-c4, 5) with different functions in the immune system. However, the evolutionary dynamics of NFATs in vertebrates has not been explored. Herein, we investigated the origin and mechanisms underlying the diversification of NFATs by comparing the gene, transcript and protein sequences, and chromosome information. We defined an ancestral origin of NFATs during the bilaterian development, dated approximately 650 million years ago, where NFAT5 and NFATc1-c4 were derived independently. The conserved parallel evolution of NFATs in multiple species was probably attributed to their innate nature. Conversely, frequent gene duplications and chromosomal rearrangements in the recently evolved taxa have suggested their roles in the adaptive immune evolution. A significant correlation was observed between the chromosome rearrangements with gene duplications and the structural fixation changes in vertebrate NFATs, suggesting their role in NFAT diversification. Remarkably, a conserved gene structure around NFAT genes with vertebrate evolutionary-related breaking points indicated the inheritance of NFATs with their neighboring genes as a unit. The close relationship between NFAT diversification and vertebrate immune evolution was suggested.
Affiliation(s)
- Maribet Gamboa
  - Department of Disease Model, Research Institute for Radiation Biology and Medicine, Hiroshima University, Hiroshima, 734-8553, Japan.
  - Department of Ecology, Faculty of Sciences, Universidad Católica de la Santísima Concepción, 4090541, Concepción, Chile.
- Noriko Kitamura
  - Neurovirology Project, Tokyo Metropolitan Institute of Medical Science, Tokyo, 156-8506, Japan
- Kento Miura
  - Department of Disease Model, Research Institute for Radiation Biology and Medicine, Hiroshima University, Hiroshima, 734-8553, Japan
- Satoko Noda
  - Graduate School of Science and Engineering, Ibaraki University, Ibaraki, 310-8512, Japan
- Osamu Kaminuma
  - Department of Disease Model, Research Institute for Radiation Biology and Medicine, Hiroshima University, Hiroshima, 734-8553, Japan.
11
Duan G, Wu G, Chen X, Tian D, Li Z, Sun Y, Du Z, Hao L, Song S, Gao Y, Xiao J, Zhang Z, Bao Y, Tang B, Zhao W. HGD: an integrated homologous gene database across multiple species. Nucleic Acids Res 2022; 51:D994-D1002. [PMID: 36318261 PMCID: PMC9825607 DOI: 10.1093/nar/gkac970] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Revised: 09/28/2022] [Accepted: 10/17/2022] [Indexed: 11/06/2022] [Open access]
Abstract
Homology is fundamental to infer genes' evolutionary processes and relationships with shared ancestry. Existing homolog gene resources vary in terms of inferring methods, homologous relationship and identifiers, posing inevitable difficulties for choosing and mapping homology results from one to another. Here, we present HGD (Homologous Gene Database, https://ngdc.cncb.ac.cn/hgd), a comprehensive homologs resource integrating multi-species, multi-resources and multi-omics, as a complement to existing resources providing public and one-stop data service. Currently, HGD houses a total of 112 383 644 homologous pairs for 37 species, including 19 animals, 16 plants and 2 microorganisms. Meanwhile, HGD integrates various annotations from public resources, including 16 909 homologs with traits, 276 670 homologs with variants, 398 573 homologs with expression and 536 852 homologs with gene ontology (GO) annotations. HGD provides a wide range of omics gene function annotations to help users gain a deeper understanding of gene function.
Collapse
Affiliation(s)
| | | | - Xiaoning Chen
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China,University of Chinese Academy of Sciences, Beijing 100049, China
| | - Dongmei Tian
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Zhaohua Li
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Yanling Sun
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Zhenglin Du
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Lili Hao
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Shuhui Song
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Yuan Gao
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Jingfa Xiao
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Zhang Zhang
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Yiming Bao
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Bixia Tang
- Correspondence may also be addressed to Bixia Tang.
- Wenming Zhao
- To whom correspondence should be addressed. Tel: +86 1084097636; Fax: +86 1084097720.
12
Serna-Duque JA, Cuesta A, Esteban MÁ. Massive gene expansion of hepcidin, a host defense peptide, in gilthead seabream (Sparus aurata). FISH & SHELLFISH IMMUNOLOGY 2022; 124:563-571. [PMID: 35489593 DOI: 10.1016/j.fsi.2022.04.032] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 09/01/2021] [Revised: 03/09/2022] [Accepted: 04/21/2022] [Indexed: 06/14/2023]
Abstract
Host defense peptides (HDP) are among the most ancient immune molecules in animals and clearly reflect an ancestral evolutionary history involving pathogen-host interactions. Hepcidins are a very widespread family of HDPs among vertebrates and are especially diverse in teleosts. We have investigated the identification of new hepcidins in gilthead seabream (Sparus aurata), a fish farmed in the Mediterranean. Targeted gene predictions supported with expressed sequence tags (ESTs) derived from Hidden Markov Models were used to find the hamp genes in the seabream genome. The results revealed a massively clustered hamp duplication on chromosome 17. In fact, the seabream genome contains the largest number of hepcidin copies described in any vertebrate. The evolutionary history of hepcidins in seabream, and vertebrates generally, clearly indicates high adaptation in teleosts and novel subgroups within hepcidin type II. Furthermore, basal hepcidin gene expression analysis indicates specific-tissue expression profiles, while the presence and distribution of transcription factor binding sites (TFBS) in hamp promoters as well as their transcription profile upon bacterial challenge indicates different immune roles depending on the type of hepcidin and tissue. This massive duplication of HDP genes in a bony fish could point to a far more specific and adaptive innate immune system than assumed in the classic concept of immunity in mammals. Hence, a new world of knowledge regarding hepcidins in fish and vertebrates is being initiated.
Affiliation(s)
- Jhon A Serna-Duque
- Immunobiology for Aquaculture Group, Department of Cell Biology and Histology, Faculty of Biology, Campus of International Excellence, Campus Mare Nostrum, University of Murcia, 30100, Murcia, Spain
- Alberto Cuesta
- Immunobiology for Aquaculture Group, Department of Cell Biology and Histology, Faculty of Biology, Campus of International Excellence, Campus Mare Nostrum, University of Murcia, 30100, Murcia, Spain
- M Ángeles Esteban
- Immunobiology for Aquaculture Group, Department of Cell Biology and Histology, Faculty of Biology, Campus of International Excellence, Campus Mare Nostrum, University of Murcia, 30100, Murcia, Spain
13
Sánchez AL, Lafond M. Colorful orthology clustering in bounded-degree similarity graphs. J Bioinform Comput Biol 2021; 19:2140010. [PMID: 34775924 DOI: 10.1142/s0219720021400102] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Clustering genes in similarity graphs is a popular approach for orthology prediction. Most algorithms group genes without considering their species, which results in clusters that contain several paralogous genes. Moreover, clustering is known to be problematic when in-paralogs arise from ancient duplications. Recently, we proposed a two-step process that avoids these problems. First, we infer clusters of only orthologs (i.e. with only genes from distinct species), and second, we infer the missing inter-cluster orthologs. In this paper, we focus on the first step, which leads to a problem we call Colorful Clustering. In general, this is as hard as classical clustering. However, in similarity graphs, the number of species is usually small, as well as the neighborhood size of genes in other species. We therefore study the problem of clustering in which the number of colors is bounded by [Formula: see text], and each gene has at most [Formula: see text] neighbors in another species. We show that the well-known cluster editing formulation remains NP-hard even when [Formula: see text] and [Formula: see text]. We then propose a fixed-parameter algorithm in [Formula: see text] to find the single best cluster in the graph. We implemented this algorithm and included it in the aforementioned two-step approach. Experiments on simulated data show that this approach performs favorably to applying only an unconstrained clustering step.
Affiliation(s)
- Alitzel López Sánchez
- Computer Science Department, Université de Sherbrooke, 2500 Boulevard de l'Université, Sherbrooke, Québec J1K 2R1, Canada
- Manuel Lafond
- Computer Science Department, Université de Sherbrooke, 2500 Boulevard de l'Université, Sherbrooke, Québec J1K 2R1, Canada
14
Begum T, Serrano‐Serrano ML, Robinson‐Rechavi M. Performance of a phylogenetic independent contrast method and an improved pairwise comparison under different scenarios of trait evolution after speciation and duplication. Methods Ecol Evol 2021. [DOI: 10.1111/2041-210x.13680] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Affiliation(s)
- Tina Begum
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Martha Liliana Serrano‐Serrano
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Marc Robinson‐Rechavi
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
15
Plant geranylgeranyl diphosphate synthases: every (gene) family has a story. ABIOTECH 2021; 2:289-298. [PMID: 36303884 PMCID: PMC9590577 DOI: 10.1007/s42994-021-00050-5] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 02/25/2021] [Accepted: 05/18/2021] [Indexed: 12/20/2022]
Abstract
Plant isoprenoids (also known as terpenes or terpenoids) are a wide family of primary and secondary metabolites with multiple functions. In particular, most photosynthesis-related isoprenoids (including carotenoids and chlorophylls) as well as diterpenes and polyterpenes derive from geranylgeranyl diphosphate (GGPP) produced by GGPP synthase (GGPPS) enzymes in several cell compartments. Plant genomes typically harbor multiple copies of differentially expressed genes encoding GGPPS-like proteins. While sequence comparisons allow the identification of potential GGPPS candidates, experimental evidence is required to ascertain their enzymatic activity and biological function. In fact, functional analyses of the full set of potential GGPPS paralogs are only available for a handful of plant species. Here we review our current knowledge on the GGPPS families of the model plant Arabidopsis thaliana and the crop species rice (Oryza sativa), pepper (Capsicum annuum) and tomato (Solanum lycopersicum). The results indicate that a major determinant of the biological role of particular GGPPS paralogs is the expression profile of the corresponding genes, even though specific interactions with other proteins (including GGPP-consuming enzymes) might also contribute to subfunctionalization. In some species, however, a single GGPPS isoform appears to be responsible for the production of most if not all GGPP required for cell functions. Deciphering the mechanisms regulating GGPPS activity in particular cell compartments, tissues, organs and plant species will be very useful for future metabolic engineering approaches aimed at manipulating the accumulation of particular GGPP-derived products of interest without negatively impacting the levels of other isoprenoids required to sustain essential cell functions.
16
Begum T, Robinson-Rechavi M. Special Care Is Needed in Applying Phylogenetic Comparative Methods to Gene Trees with Speciation and Duplication Nodes. Mol Biol Evol 2021; 38:1614-1626. [PMID: 33169790 PMCID: PMC8042747 DOI: 10.1093/molbev/msaa288] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 12/03/2022] Open
Abstract
How gene function evolves is a central question of evolutionary biology. It can be investigated by comparing functional genomics results between species and between genes. Most comparative studies of functional genomics have used pairwise comparisons. Yet it has been shown that this can provide biased results, as genes, like species, are phylogenetically related. Phylogenetic comparative methods should be used to correct for this, but they depend on strong assumptions, including unbiased tree estimates relative to the hypothesis being tested. Such methods have recently been used to test the “ortholog conjecture,” the hypothesis that functional evolution is faster in paralogs than in orthologs. Although pairwise comparisons of tissue specificity (τ) provided support for the ortholog conjecture, phylogenetic independent contrasts did not. Our reanalysis on the same gene trees identified problems with the time calibration of duplication nodes. We find that the gene trees used suffer from important biases, due to the inclusion of trees with no duplication nodes, to the relative age of speciations and duplications, to systematic differences in branch lengths, and to non-Brownian motion of tissue specificity on many trees. We find that incorrect implementation of phylogenetic method in empirical gene trees with duplications can be problematic. Controlling for biases allows successful use of phylogenetic methods to study the evolution of gene function and provides some support for the ortholog conjecture using three different phylogenetic approaches.
Affiliation(s)
- Tina Begum
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Marc Robinson-Rechavi
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
17
Tarashansky AJ, Musser JM, Khariton M, Li P, Arendt D, Quake SR, Wang B. Mapping single-cell atlases throughout Metazoa unravels cell type evolution. eLife 2021; 10:e66747. [PMID: 33944782 PMCID: PMC8139856 DOI: 10.7554/elife.66747] [Citation(s) in RCA: 148] [Impact Index Per Article: 37.0] [Received: 01/26/2021] [Accepted: 04/30/2021] [Indexed: 12/11/2022] Open
Abstract
Comparing single-cell transcriptomic atlases from diverse organisms can elucidate the origins of cellular diversity and assist the annotation of new cell atlases. Yet, comparison between distant relatives is hindered by complex gene histories and diversifications in expression programs. Previously, we introduced the self-assembling manifold (SAM) algorithm to robustly reconstruct manifolds from single-cell data (Tarashansky et al., 2019). Here, we build on SAM to map cell atlas manifolds across species. This new method, SAMap, identifies homologous cell types with shared expression programs across distant species within phyla, even in complex examples where homologous tissues emerge from distinct germ layers. SAMap also finds many genes with more similar expression to their paralogs than their orthologs, suggesting paralog substitution may be more common in evolution than previously appreciated. Lastly, comparing species across animal phyla, spanning sponge to mouse, reveals ancient contractile and stem cell families, which may have arisen early in animal evolution.
Affiliation(s)
- Jacob M Musser
- European Molecular Biology Laboratory, Developmental Biology Unit, Heidelberg, Germany
- Pengyang Li
- Department of Bioengineering, Stanford University, Stanford, United States
- Detlev Arendt
- European Molecular Biology Laboratory, Developmental Biology Unit, Heidelberg, Germany
- Centre for Organismal Studies, University of Heidelberg, Heidelberg, Germany
- Stephen R Quake
- Department of Bioengineering, Stanford University, Stanford, United States
- Department of Applied Physics, Stanford University, Stanford, United States
- Chan Zuckerberg Biohub, San Francisco, United States
- Bo Wang
- Department of Bioengineering, Stanford University, Stanford, United States
- Department of Developmental Biology, Stanford University School of Medicine, Stanford, United States
18
Too CC, Ong KS, Yule CM, Keller A. Putative roles of bacteria in the carbon and nitrogen cycles in a tropical peat swamp forest. Basic Appl Ecol 2021. [DOI: 10.1016/j.baae.2020.10.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
19
Rosselli R, La Porta N, Muresu R, Stevanato P, Concheri G, Squartini A. Pangenomics of the Symbiotic Rhizobiales. Core and Accessory Functions Across a Group Endowed with High Levels of Genomic Plasticity. Microorganisms 2021; 9:microorganisms9020407. [PMID: 33669391 PMCID: PMC7920277 DOI: 10.3390/microorganisms9020407] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/04/2020] [Revised: 02/10/2021] [Accepted: 02/11/2021] [Indexed: 11/16/2022] Open
Abstract
Pangenome analyses reveal major clues on evolutionary instances and critical genome core conservation. The order Rhizobiales encompasses several families with rather disparate ecological attitudes. Among them, Rhizobiaceae, Bradyrhizobiaceae, Phyllobacteriaceae and Xanthobacteriaceae include members proficient in mutualistic symbioses with plants based on the bacterial conversion of N2 into ammonia (nitrogen-fixation). The pangenome of 12 nitrogen-fixing plant symbionts of the Rhizobiales was analyzed, yielding a total of 37,364 loci, with a core genome constituting 700 genes. The percentage of core genes averaged 10.2% over single genomes, and between 5% and 7% were found to be plasmid-associated. The comparison between a representative reference genome and the core genome subset showed the core genome to be highly enriched in genes for macromolecule metabolism, ribosomal constituents and overall translation machinery, while membrane/periplasm-associated genes and transport domains were under-represented. The analysis of protein functions revealed that between 1.7% and 4.9% of core proteins could putatively have different functions.
Affiliation(s)
- Riccardo Rosselli
- Department of Marine Microbiology and Biogeochemistry, NIOZ Royal Netherlands Institute of Sea Research, NL-1790 AB Den Burg, The Netherlands
- Departamento de Fisiología, Genética y Microbiología, Universidad de Alicante, 03690 Alicante, Spain
- Nicola La Porta
- Department of Sustainable Agrobiosystems and Bioresources, Research and Innovation Centre, Fondazione Edmund Mach, 38098 San Michele all’Adige, Italy
- MOUNTFOR Project Centre, European Forest Institute, 38098 San Michele all’Adige, Italy
- Rosella Muresu
- Institute of Animal Production Systems in Mediterranean Environments-National Research Council, 07040 Sassari, Italy
- Piergiorgio Stevanato
- Department of Agronomy, Food, Natural Resources, Animals and Environment, University of Padova, 35020 Legnaro, Italy
- Giuseppe Concheri
- Department of Agronomy, Food, Natural Resources, Animals and Environment, University of Padova, 35020 Legnaro, Italy
- Andrea Squartini
- Department of Agronomy, Food, Natural Resources, Animals and Environment, University of Padova, 35020 Legnaro, Italy
- Correspondence: Tel.: +39-049-8272-923
20
New Approaches for Inferring Phylogenies in the Presence of Paralogs. Trends Genet 2021; 37:174-187. [DOI: 10.1016/j.tig.2020.08.012] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 07/08/2020] [Revised: 08/13/2020] [Accepted: 08/19/2020] [Indexed: 12/18/2022]
21
Agarwal PR, Lahiri A. Comparative study of the SBP-box gene family in rice siblings. J Biosci 2020. [DOI: 10.1007/s12038-020-00048-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
22
Ahrens JB, Teufel AI, Siltberg-Liberles J. A Phylogenetic Rate Parameter Indicates Different Sequence Divergence Patterns in Orthologs and Paralogs. J Mol Evol 2020; 88:720-730. [PMID: 33118098 DOI: 10.1007/s00239-020-09969-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2020] [Accepted: 10/15/2020] [Indexed: 10/23/2022]
Abstract
Heterotachy, the change in sequence evolutionary rate over time, is a common feature of protein molecular evolution. Decades of studies have shed light on the conditions under which heterotachy occurs, and there is evidence that site-specific evolutionary rate shifts are correlated with changes in protein function. Here, we present a large-scale, computational analysis using thousands of protein sequence alignments from animal and plant proteomes, representing genes related either by orthology (speciation events) or paralogy (gene duplication), to compare sequence divergence patterns in orthologous vs. paralogous sequence alignments. We use sequence-based phylogenetic analyses to infer overall sequence divergence (tree length/number of sequences) and to fit site-specific rates to a discrete gamma distribution with a shape parameter α. This inference method is applied to real protein sequence alignments, as well as alignments simulated under various models of protein sequence evolution. Our simulations indicate that sequence divergence and the α parameter are positively correlated when sequences evolve with heterotachy, meaning that inferred site rate distributions appear more uniform as sequences diverge. Divergence and α are also positively correlated in both orthologous and paralogous genes, but the average increase in α (as a function of divergence) is significantly higher in paralogous protein alignments than in orthologous alignments. This result is consistent with the widely held view that recently duplicated proteins initially evolve under relaxed selective pressure, promoting functional divergence by accumulation of amino acid replacements, and hence experience more evolutionary rate fluctuations than orthologous proteins. We discuss these findings in the context of the ortholog conjecture, a long-standing assumption in molecular evolution, which posits that protein sequences related by orthology tend to be more functionally conserved than paralogous proteins.
Affiliation(s)
- Joseph B Ahrens
- Department of Biological Sciences, Biomolecular Sciences Institute, Florida International University, Miami, FL, USA; Department of Biochemistry and Molecular Genetics, Computational Bioscience Program, University of Colorado Denver, Aurora, CO, USA.
- Ashley I Teufel
- Department of Integrative Biology, The University of Texas at Austin, Austin, TX, USA; Santa Fe Institute, Santa Fe, NM, USA
- Jessica Siltberg-Liberles
- Department of Biological Sciences, Biomolecular Sciences Institute, Florida International University, Miami, FL, USA.
23
COMPOSITUM 1 contributes to the architectural simplification of barley inflorescence via meristem identity signals. Nat Commun 2020; 11:5138. [PMID: 33046693 PMCID: PMC7550572 DOI: 10.1038/s41467-020-18890-y] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Received: 02/01/2020] [Accepted: 09/15/2020] [Indexed: 11/23/2022] Open
Abstract
Grasses have varying inflorescence shapes; however, little is known about the genetic mechanisms specifying such shapes among tribes. Here, we identify the grass-specific TCP transcription factor COMPOSITUM 1 (COM1) expressing in inflorescence meristematic boundaries of different grasses. COM1 specifies branch-inhibition in barley (Triticeae) versus branch-formation in non-Triticeae grasses. Analyses of cell size, cell walls and transcripts reveal barley COM1 regulates cell growth, thereby affecting cell wall properties and signaling specifically in meristematic boundaries to establish identity of adjacent meristems. COM1 acts upstream of the boundary gene Liguleless1 and confers meristem identity partially independent of the COM2 pathway. Furthermore, COM1 is subject to purifying natural selection, thereby contributing to specification of the spike inflorescence shape. This meristem identity pathway has conceptual implications for both inflorescence evolution and molecular breeding in Triticeae. Grasses have diverse inflorescence morphologies, but the underlying genetic mechanisms are unclear. Here, the authors report a TCP transcription factor COM1 affects cell growth through regulation of cell wall properties and promotes branch formation in non-Triticeae grasses but branch inhibition in barley (Triticeae).
24
Stamboulian M, Guerrero RF, Hahn MW, Radivojac P. The ortholog conjecture revisited: the value of orthologs and paralogs in function prediction. Bioinformatics 2020; 36:i219-i226. [PMID: 32657391 PMCID: PMC7355290 DOI: 10.1093/bioinformatics/btaa468] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Indexed: 11/18/2022] Open
Abstract
Motivation: The computational prediction of gene function is a key step in making full use of newly sequenced genomes. Function is generally predicted by transferring annotations from homologous genes or proteins for which experimental evidence exists. The 'ortholog conjecture' proposes that orthologous genes should be preferred when making such predictions, as they evolve functions more slowly than paralogous genes. Previous research has provided little support for the ortholog conjecture, though the incomplete nature of the data cast doubt on the conclusions. Results: We use experimental annotations from over 40 000 proteins, drawn from over 80 000 publications, to revisit the ortholog conjecture in two pairs of species: (i) Homo sapiens and Mus musculus and (ii) Saccharomyces cerevisiae and Schizosaccharomyces pombe. By making a distinction between questions about the evolution of function versus questions about the prediction of function, we find strong evidence against the ortholog conjecture in the context of function prediction, though questions about the evolution of function remain difficult to address. In both pairs of species, we quantify the amount of information that would be ignored if paralogs are discarded, as well as the resulting loss in prediction accuracy. Taken as a whole, our results support the view that the types of homologs used for function transfer are largely irrelevant to the task of function prediction. Maximizing the amount of data used for this task, regardless of whether it comes from orthologs or paralogs, is most likely to lead to higher prediction accuracy. Availability and implementation: https://github.com/predragradivojac/oc. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Moses Stamboulian
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
- Rafael F Guerrero
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
- Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA
- Matthew W Hahn
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
- Department of Biology, Indiana University, Bloomington, IN 47405, USA
- Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
25
Laurent JM, Garge RK, Teufel AI, Wilke CO, Kachroo AH, Marcotte EM. Humanization of yeast genes with multiple human orthologs reveals functional divergence between paralogs. PLoS Biol 2020; 18:e3000627. [PMID: 32421706 PMCID: PMC7259792 DOI: 10.1371/journal.pbio.3000627] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 06/13/2019] [Revised: 05/29/2020] [Accepted: 04/14/2020] [Indexed: 01/17/2023] Open
Abstract
Despite over a billion years of evolutionary divergence, several thousand human genes possess clearly identifiable orthologs in yeast, and many have undergone lineage-specific duplications in one or both lineages. These duplicated genes may have been free to diverge in function since their expansion, and it is unclear how or at what rate ancestral functions are retained or partitioned among co-orthologs between species and within gene families. Thus, in order to investigate how ancestral functions are retained or lost post-duplication, we systematically replaced hundreds of essential yeast genes with their human orthologs from gene families that have undergone lineage-specific duplications, including those with single duplications (1 yeast gene to 2 human genes, 1:2) or higher-order expansions (1:>2) in the human lineage. We observe a variable pattern of replaceability across different ortholog classes, with an obvious trend toward differential replaceability inside gene families, and rarely observe replaceability by all members of a family. We quantify the ability of various properties of the orthologs to predict replaceability, showing that in the case of 1:2 orthologs, replaceability is predicted largely by the divergence and tissue-specific expression of the human co-orthologs, i.e., the human proteins that are less diverged from their yeast counterpart and more ubiquitously expressed across human tissues more often replace their single yeast ortholog. These trends were consistent with in silico simulations demonstrating that when only one ortholog can replace its corresponding yeast equivalent, it tends to be the least diverged of the pair. Replaceability of yeast genes having more than 2 human co-orthologs was marked by retention of orthologous interactions in functional or protein networks as well as by more ancestral subcellular localization. 
Overall, we performed >400 human gene replaceability assays, revealing 50 new human-yeast complementation pairs, thus opening up avenues to further functionally characterize these human genes in a simplified organismal context.
Affiliation(s)
- Jon M. Laurent
- Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, United States of America
- Institute for Systems Genetics, NYU Langone Health, New York, New York, United States of America
- Riddhiman K. Garge
- Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, United States of America
- Department of Molecular Biosciences, The University of Texas at Austin, Austin, Texas, United States of America
- Ashley I. Teufel
- Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, United States of America
- Department of Integrative Biology, The University of Texas at Austin, Austin, Texas, United States of America
- Santa Fe Institute, Santa Fe, New Mexico, United States of America
- Claus O. Wilke
- Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, United States of America
- Department of Integrative Biology, The University of Texas at Austin, Austin, Texas, United States of America
- Aashiq H. Kachroo
- The Department of Biology, Centre for Applied Synthetic Biology, Concordia University, Montreal, Quebec, Canada
- Edward M. Marcotte
- Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, United States of America
- Department of Molecular Biosciences, The University of Texas at Austin, Austin, Texas, United States of America
26
Geiß M, Laffitte MEG, Sánchez AL, Valdivia DI, Hellmuth M, Rosales MH, Stadler PF. Best match graphs and reconciliation of gene trees with species trees. J Math Biol 2020; 80:1459-1495. [PMID: 32002659 PMCID: PMC7052050 DOI: 10.1007/s00285-020-01469-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/19/2019] [Revised: 01/08/2020] [Indexed: 11/19/2022]
Abstract
A wide variety of problems in computational biology, most notably the assessment of orthology, are solved with the help of reciprocal best matches. Using an evolutionary definition of best matches that captures the intuition behind the concept, we clarify rigorously the relationships between reciprocal best matches, orthology, and evolutionary events under the assumption of duplication/loss scenarios. We show that the orthology graph is a subgraph of the reciprocal best match graph (RBMG). We furthermore give conditions under which an RBMG that is a cograph identifies the correct orthology relation. Using computer simulations we find that most false-positive orthology assignments can be identified as so-called good quartets, and thus corrected, in the absence of horizontal transfer. Horizontal transfer, however, may also introduce false-negative orthology assignments.
Affiliation(s)
- Manuela Geiß
- Bioinformatics Group, Department of Computer Science, Interdisciplinary Center of Bioinformatics, University of Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany
- Marcos E. González Laffitte
- CONACYT-Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO Mexico
- Alitzel López Sánchez
- CONACYT-Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO Mexico
- Dulce I. Valdivia
- Centro de Ciencias Básicas, Universidad Autónoma de Aguascalientes, Av. Universidad 940, 20131 Aguascalientes, AGS México
- Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO Mexico
- Marc Hellmuth
- Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany
- Center for Bioinformatics, Saarland University, Building E 2.1, P.O. Box 151150, 66041 Saarbrücken, Germany
- Maribel Hernández Rosales
- CONACYT-Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO Mexico
- Peter F. Stadler
- Bioinformatics Group, Department of Computer Science, Interdisciplinary Center of Bioinformatics, University of Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
- Competence Center for Scalable Data Services and Solutions, Leipzig Research Center for Civilization Diseases, Leipzig University, Härtelstraße 16-18, 04107 Leipzig, Germany
- Max-Planck-Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany
- Inst. f. Theoretical Chemistry, University of Vienna, Währingerstraße 17, 1090 Wien, Austria
- Facultad de Ciencias, Universidad National de Colombia, Bogotá, Colombia
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501 USA

27
David KT, Oaks JR, Halanych KM. Patterns of gene evolution following duplications and speciations in vertebrates. PeerJ 2020; 8:e8813. [PMID: 32266119 PMCID: PMC7120047 DOI: 10.7717/peerj.8813] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/10/2019] [Accepted: 02/27/2020] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND Eukaryotic genes typically form independent evolutionary lineages through either speciation or gene duplication events. Generally, gene copies resulting from speciation events (orthologs) are expected to maintain similarity over time with regard to sequence, structure and function. After a duplication event, however, resulting gene copies (paralogs) may experience a broader set of possible fates, including partial (subfunctionalization) or complete loss of function, as well as gain of new function (neofunctionalization). This assumption, known as the Ortholog Conjecture, is prevalent throughout molecular biology and notably plays an important role in many functional annotation methods. Unfortunately, studies that explicitly compare evolutionary processes between speciation and duplication events are rare and conflicting. METHODS To provide an empirical assessment of ortholog/paralog evolution, we estimated ratios of nonsynonymous to synonymous substitutions (ω = dN/dS) for 251,044 lineages in 6,244 gene trees across 77 vertebrate taxa. RESULTS Overall, we found ω to be more similar between lineages descended from speciation events (p < 0.001) than lineages descended from duplication events, providing strong support for the Ortholog Conjecture. The asymmetry in ω following duplication events appears to be largely driven by an increase along one of the paralogous lineages, while the other remains similar to the parent. This trend is commonly associated with neofunctionalization, suggesting that gene duplication is a significant mechanism for generating novel gene functions.
Affiliation(s)
- Kyle T. David
- Department of Biological Sciences, Auburn University, Auburn, AL, USA
- Jamie R. Oaks
- Department of Biological Sciences, Auburn University, Auburn, AL, USA

28
Bick JT, Zeng S, Robinson MD, Ulbrich SE, Bauersachs S. Mammalian Annotation Database for improved annotation and functional classification of Omics datasets from less well-annotated organisms. Database: The Journal of Biological Databases and Curation 2020; 2019:5539597. [PMID: 31353404 PMCID: PMC6661403 DOI: 10.1093/database/baz086] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/03/2018] [Revised: 03/08/2019] [Accepted: 06/05/2019] [Indexed: 02/02/2023]
Abstract
Next-generation sequencing technologies and the availability of an increasing number of mammalian and other genomes allow gene expression studies, particularly RNA sequencing, in many non-model organisms. However, incomplete genome annotation and assignments of genes to functional annotation databases can lead to a substantial loss of information in downstream data analysis. To overcome this, we developed Mammalian Annotation Database tool (MAdb, https://madb.ethz.ch) to conveniently provide homologous gene information for selected mammalian species. The assignment between species is performed in three steps: (i) matching official gene symbols, (ii) using ortholog information contained in Ensembl Compara and (iii) pairwise BLAST comparisons of all transcripts. In addition, we developed a new tool (AnnOverlappeR) for the reliable assignment of the National Center for Biotechnology Information (NCBI) and Ensembl gene IDs. The gene lists translated to gene IDs of well-annotated species such as a human can be used for improved functional annotation with relevant tools based on Gene Ontology and molecular pathway information. We tested the MAdb on a published RNA-seq data set for the pig and showed clearly improved overrepresentation analysis results based on the assigned human homologous gene identifiers. Using the MAdb revealed a similar list of human homologous genes and functional annotation results regardless of whether starting with gene IDs from NCBI or Ensembl. The MAdb database is accessible via a web interface and a Galaxy application.
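The three assignment steps named in the abstract can be pictured as a fallback cascade. The sketch below is an illustration only, not the MAdb implementation: whether the tool applies the steps as strict fallbacks is an assumption, and the function name, input dictionaries, and gene identifiers are all hypothetical.

```python
def assign_human_homolog(gene, human_symbols, compara_orthologs, blast_best_hits):
    """Return (human_gene, evidence) from the first step that yields a match.

    gene              -- dict with "id" and "symbol" keys (hypothetical format)
    human_symbols     -- set of official human gene symbols
    compara_orthologs -- dict: source gene id -> human gene (ortholog table)
    blast_best_hits   -- dict: source gene id -> human gene (best BLAST hit)
    """
    symbol = (gene.get("symbol") or "").upper()
    # Step (i): identical official gene symbol.
    if symbol and symbol in human_symbols:
        return symbol, "symbol"
    # Step (ii): annotated ortholog.
    if gene["id"] in compara_orthologs:
        return compara_orthologs[gene["id"]], "ortholog"
    # Step (iii): best pairwise BLAST hit.
    if gene["id"] in blast_best_hits:
        return blast_best_hits[gene["id"]], "blast"
    return None, "unassigned"


# Toy inputs with made-up identifiers.
human_symbols = {"TP53", "GAPDH"}
compara_orthologs = {"pig_gene_2": "ACTB"}
blast_best_hits = {"pig_gene_3": "MYC"}

results = [
    assign_human_homolog(g, human_symbols, compara_orthologs, blast_best_hits)
    for g in (
        {"id": "pig_gene_1", "symbol": "tp53"},
        {"id": "pig_gene_2", "symbol": None},
        {"id": "pig_gene_3", "symbol": "LOC0000"},
        {"id": "pig_gene_4", "symbol": None},
    )
]
```

Recording which step produced each assignment keeps the evidence level visible in downstream functional annotation.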
Affiliation(s)
- Jochen T Bick
- Animal Physiology, Institute of Agricultural Sciences, ETH Zurich, Zurich, Switzerland
- Shuqin Zeng
- Animal Physiology, Institute of Agricultural Sciences, ETH Zurich, Zurich, Switzerland
- Genetics and Functional Genomics, Vetsuisse Faculty Zurich, University of Zurich, Zurich, Switzerland
- Mark D Robinson
- Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
- Susanne E Ulbrich
- Animal Physiology, Institute of Agricultural Sciences, ETH Zurich, Zurich, Switzerland
- Stefan Bauersachs
- Animal Physiology, Institute of Agricultural Sciences, ETH Zurich, Zurich, Switzerland
- Genetics and Functional Genomics, Vetsuisse Faculty Zurich, University of Zurich, Zurich, Switzerland

29
Tran NV, Greshake Tzovaras B, Ebersberger I. PhyloProfile: dynamic visualization and exploration of multi-layered phylogenetic profiles. Bioinformatics 2019; 34:3041-3043. [PMID: 29659708 DOI: 10.1093/bioinformatics/bty225] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 10/10/2017] [Accepted: 04/05/2018] [Indexed: 01/16/2023] Open
Abstract
Summary Phylogenetic profiles form the basis for tracing proteins and their functions across species and through time. Novel genome sequences nowadays often represent species from the remotest corner of the tree of life. Thus, phylogenetic profiling becomes increasingly important for functionally annotating this data and to integrate it into a comprehensive view on organismal evolution. To strengthen the link between the sharing of a gene across species and of the corresponding function, it is meanwhile common to complement phylogenetic profiles with additional information, such as domain architecture similarities between orthologs, or pairwise similarities of other protein features. However, there are few visualization tools that facilitate an intuitive integration of these various information layers. Here, we present PhyloProfile, an R-based tool to visualize, explore and analyze multi-layered phylogenetic profiles. Availability and implementation PhyloProfile is available as open source code under the MIT license at https://github.com/BIONF/phyloprofile. An online version for testing PhyloProfile and for small to medium-scale analyses is available at http://applbio.biologie.uni-frankfurt.de/phyloprofile.
Affiliation(s)
- Ngoc-Vinh Tran
- Department for Applied Bioinformatics, Institute of Cell Biology and Neuroscience, Goethe University, Frankfurt am Main, Germany
- Bastian Greshake Tzovaras
- Department for Applied Bioinformatics, Institute of Cell Biology and Neuroscience, Goethe University, Frankfurt am Main, Germany
- Ingo Ebersberger
- Department for Applied Bioinformatics, Institute of Cell Biology and Neuroscience, Goethe University, Frankfurt am Main, Germany
- Senckenberg Biodiversity and Climate Research Centre (BiK-F), Frankfurt am Main, Germany

30
Lafond M, Meghdari Miardan M, Sankoff D. Accurate prediction of orthologs in the presence of divergence after duplication. Bioinformatics 2019; 34:i366-i375. [PMID: 29950018 PMCID: PMC6022570 DOI: 10.1093/bioinformatics/bty242] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 11/23/2022] Open
Abstract
Motivation When gene duplication occurs, one of the copies may become free of selective pressure and evolve at an accelerated pace. This has important consequences on the prediction of orthology relationships, since two orthologous genes separated by divergence after duplication may differ in both sequence and function. In this work, we make the distinction between the primary orthologs, which have not been affected by accelerated mutation rates on their evolutionary path, and the secondary orthologs, which have. Similarity-based prediction methods will tend to miss secondary orthologs, whereas phylogeny-based methods cannot separate primary and secondary orthologs. However, both types of orthology have applications in important areas such as gene function prediction and phylogenetic reconstruction, motivating the need for methods that can distinguish the two types. Results We formalize the notion of divergence after duplication and provide a theoretical basis for the inference of primary and secondary orthologs. We then put these ideas to practice with the Hybrid Prediction of Paralogs and Orthologs (HyPPO) framework, which combines ideas from both similarity and phylogeny approaches. We apply our method to simulated and empirical datasets and show that we achieve superior accuracy in predicting primary orthologs, secondary orthologs and paralogs. Availability and implementation HyPPO is a modular framework with a core developed in Python and is provided with a variety of C++ modules. The source code is available at https://github.com/manuellafond/HyPPO. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Manuel Lafond
- Department of Mathematics and Statistics, University of Ottawa, Ottawa, Canada
- Department of Computer Science, Université de Sherbrooke, Sherbrooke, Canada
- David Sankoff
- Department of Mathematics and Statistics, University of Ottawa, Ottawa, Canada

31
Torres Manno MA, Pizarro MD, Prunello M, Magni C, Daurelio LD, Espariz M. GeM-Pro: a tool for genome functional mining and microbial profiling. Appl Microbiol Biotechnol 2019; 103:3123-3134. [DOI: 10.1007/s00253-019-09648-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 11/07/2018] [Revised: 01/11/2019] [Accepted: 01/14/2019] [Indexed: 11/30/2022]

32
Abstract
The distinction between orthologs and paralogs, genes that started diverging by speciation versus duplication, is relevant in a wide range of contexts, most notably phylogenetic tree inference and protein function annotation. In this chapter, we provide an overview of the methods used to infer orthology and paralogy. We survey both graph-based approaches (and their various grouping strategies) and tree-based approaches, which solve the more general problem of gene/species tree reconciliation. We discuss conceptual differences among the various orthology inference methods and databases and examine the difficult issue of verifying and benchmarking orthology predictions. Finally, we review typical applications of orthologous genes, groups, and reconciled trees and conclude with thoughts on future methodological developments.

33
Mier P, Pérez-Pulido AJ, Andrade-Navarro MA. Automated selection of homologs to track the evolutionary history of proteins. BMC Bioinformatics 2018; 19:431. [PMID: 30453878 PMCID: PMC6245638 DOI: 10.1186/s12859-018-2457-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 06/25/2018] [Accepted: 10/31/2018] [Indexed: 11/26/2022] Open
Abstract
Background The selection of distant homologs of a query protein under study is a usual and useful application of protein sequence databases. Such sets of homologs are often applied to investigate the function of a protein and the degree to which experimental results can be transferred from one organism to another. In particular, a variety of databases facilitates static browsing for orthologs. However, these resources have a limited power when identifying orthologs between taxonomically distant species. In addition, in some situations, for a given query protein, it is advantageous to compare the sets of orthologs from different specific organisms: this recursive step-wise search might give an idea of the evolutionary path of the protein as a series of consecutive steps, for example gaining or losing domains. However, a step-wise orthology search is a time-consuming task if the number of steps is high. Results To illustrate a solution for this problem, we present the web tool ProteinPathTracker, which allows to track the evolutionary history of a query protein by locating homologs in selected proteomes along several evolutionary paths. Additional functionalities include locking a region of interest to follow its evolution in the discovered homologous sequences and the study of the protein function evolution by analysis of the annotations of the homologs. Conclusions ProteinPathTracker is an easy-to-use web tool that automatises the practice of looking for selected homologs in distant species in a straightforward way for non-expert users. Electronic supplementary material The online version of this article (10.1186/s12859-018-2457-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Pablo Mier
- Faculty of Biology, Johannes Gutenberg University Mainz, Hans-Dieter-Hüsch-Weg 15, 55128, Mainz, Germany
- Miguel A Andrade-Navarro
- Faculty of Biology, Johannes Gutenberg University Mainz, Hans-Dieter-Hüsch-Weg 15, 55128, Mainz, Germany

34
Johnson BR. Taxonomically Restricted Genes Are Fundamental to Biology and Evolution. Front Genet 2018; 9:407. [PMID: 30294344 PMCID: PMC6158316 DOI: 10.3389/fgene.2018.00407] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 06/18/2018] [Accepted: 09/04/2018] [Indexed: 12/26/2022] Open
Abstract
Genes limited to particular clades, taxonomically restricted genes (TRGs), are common in all sequenced genomes. TRGs have recently become associated with the evolution of novelty, as numerous studies across the tree of life have now linked expression of TRGs with novel phenotypes. However, TRGs that underlie ancient lineage specific traits have been largely omitted from discussions of the general importance of TRGs. Here it is argued that when all TRGs are considered, it is apparent that TRGs are fundamental to biology and evolution and likely play many complementary roles to the better understood toolkit genes. Genes underlying photosynthesis and skeletons, for example, are examples of commonplace fundamental TRGs. Essentially, although basic cell biology has a highly conserved genetic basis across the tree of life, most major clades also have lineage specific traits central to their biology and these traits are often based on TRGs. In short, toolkit genes underlie what is conserved across organisms, while TRGs define in many cases what is unique. An appreciation of the importance of TRGs will improve our understanding of evolution by triggering the study of neglected topics in which TRGs are of paramount importance.
Affiliation(s)
- Brian R Johnson
- Department of Entomology and Nematology, Center for Population Biology, University of California, Davis, Davis, CA, United States

35
Abstract
This chapter covers the theory and practice of ortholog gene set computation. In the theoretical part we give detailed and formal descriptions of the relevant concepts. We also cover the topic of graph-based clustering as a tool to compute ortholog gene sets. In the second part we provide an overview of practical considerations intended for researchers who need to determine orthologous genes from a collection of annotated genomes, briefly describing some of the most popular programs and resources currently available for this task.

36
Zinkgraf M, Gerttula S, Zhao S, Filkov V, Groover A. Transcriptional and temporal response of Populus stems to gravi-stimulation. Journal of Integrative Plant Biology 2018; 60:578-590. [PMID: 29480544 DOI: 10.1111/jipb.12645] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 01/26/2018] [Accepted: 02/24/2018] [Indexed: 05/12/2023]
Abstract
Plants modify development in response to external stimuli, to produce new growth that is appropriate for environmental conditions. For example, gravi-stimulation of leaning branches in angiosperm trees results in modifications of wood development, to produce tension wood that pulls leaning stems upright. Here, we use gravi-stimulation and tension wood response to dissect the temporal changes in gene expression underlying wood formation in Populus stems. Using time-series analysis of seven time points over a 14-d experiment, we identified 8,919 genes that were differentially expressed between tension wood (upper) and opposite wood (lower) sides of leaning stems. Clustering of differentially expressed genes showed four major transcriptional responses, including gene clusters whose transcript levels were associated with two types of tissue-specific impulse responses that peaked at about 24-48 h, and gene clusters with sustained changes in transcript levels that persisted until the end of the 14-d experiment. Functional enrichment analysis of those clusters suggests they reflect temporal changes in pathways associated with hormone regulation, protein localization, cell wall biosynthesis and epigenetic processes. Time-series analysis of gene expression is an underutilized approach for dissecting complex developmental responses in plants, and can reveal gene clusters and mechanisms influencing development.
Affiliation(s)
- Matthew Zinkgraf
- USDA Forest Service, Pacific Southwest Research Station, 1731 Research Park Drive, Davis, CA 95618, USA
- Department of Computer Science, University of California Davis, One Shields Avenue, Davis, CA 95618, USA
- Suzanne Gerttula
- USDA Forest Service, Pacific Southwest Research Station, 1731 Research Park Drive, Davis, CA 95618, USA
- Department of Computer Science, University of California Davis, One Shields Avenue, Davis, CA 95618, USA
- Shutang Zhao
- State Key Laboratory of Tree Genetics and Breeding, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China
- Vladimir Filkov
- Department of Computer Science, University of California Davis, One Shields Avenue, Davis, CA 95618, USA
- Andrew Groover
- USDA Forest Service, Pacific Southwest Research Station, 1731 Research Park Drive, Davis, CA 95618, USA
- Department of Plant Biology, University of California Davis, One Shields Avenue, Davis, CA 95618, USA

37
Raboanatahiry N, Wang B, Yu L, Li M. Functional and Structural Diversity of Acyl-coA Binding Proteins in Oil Crops. Front Genet 2018; 9:182. [PMID: 29872448 PMCID: PMC5972291 DOI: 10.3389/fgene.2018.00182] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 03/11/2018] [Accepted: 05/01/2018] [Indexed: 12/16/2022] Open
Abstract
Diversities in structure and function of ACBP were discussed in this review. ACBP are important proteins that could transport newly synthesized fatty acid, activated into -coA, from plastid to endoplasmic reticulum, where oil in the form of triacylglycerol occurs. ACBP were detected in various animal and plants species, which indicated their importance in biological function. In fact, involvement of ACBP in important process such as lipid metabolism, regulation of enzyme and gene expression, and in response to plant stresses has been proven in several studies. In this review, findings on ACBP of 11 well-known oil crops were reviewed to comprehend diversity, comparative analyses on ACBP structure were made, and link between structure and function, tissue expression and subcellular location of ACBP were also observed. Incomplete reports in some species were mentioned, which might be encouraging to start or to perform deeper studies. Similar characteristics were found in paralogs ACBP, and orthologs ACBP had different functions, despite the high identity in amino acid sequence. At the end, it is confirmed that ortholog proteins could not necessarily display the same function, even from closely related species.
Affiliation(s)
- Nadia Raboanatahiry
- Department of Biotechnology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Hubei Key Laboratory of Economic Forest Germplasm Improvement and Resources Comprehensive Utilization, Hubei Collaborative Innovation Center for the Characteristic Resources Exploitation of Dabie Mountains, Huanggang Normal University, Huanggang, China
- Baoshan Wang
- College of Life Science, Shandong Normal University, Jinan, China
- Longjiang Yu
- Department of Biotechnology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Maoteng Li
- Department of Biotechnology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Hubei Key Laboratory of Economic Forest Germplasm Improvement and Resources Comprehensive Utilization, Hubei Collaborative Innovation Center for the Characteristic Resources Exploitation of Dabie Mountains, Huanggang Normal University, Huanggang, China

38
Madagascar ground gecko genome analysis characterizes asymmetric fates of duplicated genes. BMC Biol 2018; 16:40. [PMID: 29661185 PMCID: PMC5901865 DOI: 10.1186/s12915-018-0509-4] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Received: 01/15/2018] [Accepted: 03/22/2018] [Indexed: 11/13/2022] Open
Abstract
Background Conventionally, comparison among amniotes – birds, mammals, and reptiles – has often been approached through analyses of mammals and, for comparison, birds. However, birds are morphologically and physiologically derived and, moreover, some parts of their genomes are recognized as difficult to sequence and/or assemble and are thus missing in genome assemblies. Therefore, sequencing the genomes of reptiles would aid comparative studies on amniotes by providing more comprehensive coverage to help understand the molecular mechanisms underpinning evolutionary changes. Results Herein, we present the whole genome sequences of the Madagascar ground gecko (Paroedura picta), a promising study system especially in developmental biology, and used it to identify changes in gene repertoire across amniotes. The genome-wide analysis of the Madagascar ground gecko allowed us to reconstruct a comprehensive set of gene phylogenies comprising 13,043 ortholog groups from diverse amniotes. Our study revealed 469 genes retained by some reptiles but absent from available genome-wide sequence data of both mammals and birds. Importantly, these genes, herein collectively designated as ‘elusive’ genes, exhibited high nucleotide substitution rates and uneven intra-genomic distribution. Furthermore, the genomic regions flanking these elusive genes exhibited distinct characteristics that tended to be associated with increased gene density, repeat element density, and GC content. Conclusion This highly continuous and nearly complete genome assembly of the Madagascar ground gecko will facilitate the use of this species as an experimental animal in diverse fields of biology. Gene repertoire comparisons across amniotes further demonstrated that the fate of a duplicated gene can be affected by the intrinsic properties of its genomic location, which can persist for hundreds of millions of years. 
Electronic supplementary material The online version of this article (10.1186/s12915-018-0509-4) contains supplementary material, which is available to authorized users.

39
Dunn CW, Zapata F, Munro C, Siebert S, Hejnol A. Pairwise comparisons across species are problematic when analyzing functional genomic data. Proc Natl Acad Sci U S A 2018; 115:E409-E417. [PMID: 29301966 PMCID: PMC5776959 DOI: 10.1073/pnas.1707515115] [Citation(s) in RCA: 61] [Impact Index Per Article: 8.7] [Indexed: 11/26/2022] Open
Abstract
There is considerable interest in comparing functional genomic data across species. One goal of such work is to provide an integrated understanding of genome and phenotype evolution. Most comparative functional genomic studies have relied on multiple pairwise comparisons between species, an approach that does not incorporate information about the evolutionary relationships among species. The statistical problems that arise from not considering these relationships can lead pairwise approaches to the wrong conclusions and are a missed opportunity to learn about biology that can only be understood in an explicit phylogenetic context. Here, we examine two recently published studies that compare gene expression across species with pairwise methods, and find reason to question the original conclusions of both. One study interpreted pairwise comparisons of gene expression as support for the ortholog conjecture, the hypothesis that orthologs tend to have more similar attributes (expression in this case) than paralogs. The other study interpreted pairwise comparisons of embryonic gene expression across distantly related animals as evidence for a distinct evolutionary process that gave rise to phyla. In each study, distinct patterns of pairwise similarity among species were originally interpreted as evidence of particular evolutionary processes, but instead, we find that they reflect species relationships. These reanalyses concretely show the inadequacy of pairwise comparisons for analyzing functional genomic data across species. It will be critical to adopt phylogenetic comparative methods in future functional genomic work. Fortunately, phylogenetic comparative biology is also a rapidly advancing field with many methods that can be directly applied to functional genomic data.
Affiliation(s)
- Casey W Dunn
- Department of Ecology and Evolutionary Biology, Brown University, Providence, RI 02912
- Felipe Zapata
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA 90095
- Catriona Munro
- Department of Ecology and Evolutionary Biology, Brown University, Providence, RI 02912
- Stefan Siebert
- Department of Molecular and Cellular Biology, University of California, Davis, CA 95616
- Andreas Hejnol
- Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen 5006, Norway

40
Jain A, Roustan V, Weckwerth W, Ebersberger I. Studying AMPK in an Evolutionary Context. Methods Mol Biol 2018; 1732:111-142. [PMID: 29480472 DOI: 10.1007/978-1-4939-7598-3_8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 01/17/2023]
Abstract
The AMPK protein kinase forms the heart of a complex network controlling the metabolic activities in a eukaryotic cell. Unraveling the steps by which this pathway evolved from its primordial roots in the last eukaryotic common ancestor to its present status in contemporary species has the potential to shed light on the evolution of eukaryotes. A homolog search for the proteins interacting in this pathway is considerably straightforward. However, interpreting the results, when reconstructing the evolutionary history of the pathway over larger evolutionary distances, bears a number of pitfalls. With this in mind, we present a protocol to trace a metabolic pathway across contemporary species and backward in evolutionary time. Alongside the individual analysis steps, we provide guidelines for data interpretation generalizing beyond the analysis of AMPK.
Affiliation(s)
- Arpit Jain
- Applied Bioinformatics Group, Institute of Cell Biology and Neuroscience, Goethe University Frankfurt, Frankfurt, Germany
- Valentin Roustan
- Department of Ecogenomics and Systems Biology, University of Vienna, Vienna, Austria
- Wolfram Weckwerth
- Department of Ecogenomics and Systems Biology, University of Vienna, Vienna, Austria
- Vienna Metabolomics Center (VIME), University of Vienna, Vienna, Austria
- Ingo Ebersberger
- Applied Bioinformatics Group, Institute of Cell Biology and Neuroscience, Goethe University Frankfurt, Frankfurt, Germany
- Senckenberg Biodiversity and Climate Research Centre (BIK-F), Frankfurt, Germany

41
A Comprehensive Computational Analysis of Mycobacterium Genomes Pinpoints the Genes Co-occurring with YczE, a Membrane Protein Coding Gene Under the Putative Control of a MocR, and Predicts its Function. Interdiscip Sci 2017; 10:111-125. [PMID: 29098594 DOI: 10.1007/s12539-017-0266-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 03/25/2017] [Revised: 09/08/2017] [Accepted: 10/11/2017] [Indexed: 10/18/2022]
Abstract
Bacterial proteins belonging to the YczE family are predicted to be membrane proteins of yet unknown function. In many bacterial species, the yczE gene coding for the YczE protein is divergently transcribed with respect to an adjacent transcriptional regulator of the MocR family. According to in silico predictions, proteins named YczR are supposed to regulate the expression of yczE genes. These regulators linked to the yczE genes are predicted to constitute a subfamily within the MocR family. To put forward hypotheses amenable to experimental testing about the possible function of the YczE proteins, a phylogenetic profile strategy was applied. This strategy consists in searching for those genes that, within a set of genomes, co-occur exclusively with a certain gene of interest. Co-occurrence can be suggestive of a functional link. A set of 30 mycobacterial complete proteomes were collected. Of these, only 16 contained YczE proteins. Interestingly, in all cases each yczE gene was divergently transcribed with respect to a yczR gene. Two orthology clustering procedures were applied to find proteins co-occurring exclusively with the YczE proteins. The reported results suggest that YczE may be involved in the membrane translocation and metabolism of sulfur-containing compounds mostly in rapidly growing, low pathogenicity mycobacterial species. These observations may hint at potential targets for therapies to treat the emerging opportunistic infections provoked by the widespread environmental mycobacterial species and may contribute to the delineation of the genomic and physiological differences between the pathogenic and non-pathogenic mycobacterial species.
|
42
|
Positive diversifying selection is a pervasive adaptive force throughout the Drosophila radiation. Mol Phylogenet Evol 2017; 112:230-243. [DOI: 10.1016/j.ympev.2017.04.023] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 11/25/2016] [Revised: 04/26/2017] [Accepted: 04/26/2017] [Indexed: 01/02/2023]
|
43
|
Martín-Durán JM, Ryan JF, Vellutini BC, Pang K, Hejnol A. Increased taxon sampling reveals thousands of hidden orthologs in flatworms. Genome Res 2017; 27:1263-1272. [PMID: 28400424 PMCID: PMC5495077 DOI: 10.1101/gr.216226.116] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 09/30/2016] [Accepted: 04/10/2017] [Indexed: 11/25/2022]
Abstract
Gains and losses shape the gene complement of animal lineages and are a fundamental aspect of genomic evolution. Acquiring a comprehensive view of the evolution of gene repertoires is limited by the intrinsic limitations of common sequence similarity searches and available databases. Thus, a subset of the gene complement of an organism consists of hidden orthologs, i.e., those with no apparent homology to sequenced animal lineages—mistakenly considered new genes—but actually representing rapidly evolving orthologs or undetected paralogs. Here, we describe Leapfrog, a simple automated BLAST pipeline that leverages increased taxon sampling to overcome long evolutionary distances and identify putative hidden orthologs in large transcriptomic databases by transitive homology. As a case study, we used 35 transcriptomes of 29 flatworm lineages to recover 3427 putative hidden orthologs, some unidentified by OrthoFinder and HaMStR, two common orthogroup inference algorithms. Unexpectedly, we do not observe a correlation between the number of putative hidden orthologs in a lineage and its “average” evolutionary rate. Hidden orthologs do not show unusual sequence composition biases that might account for systematic errors in sequence similarity searches. Instead, gene duplication with divergence of one paralog and weak positive selection appear to underlie hidden orthology in Platyhelminthes. By using Leapfrog, we identify key centrosome-related genes and homeodomain classes previously reported as absent in free-living flatworms, e.g., planarians. Altogether, our findings demonstrate that hidden orthologs comprise a significant proportion of the gene repertoire in flatworms, qualifying the impact of gene losses and gains in gene complement evolution.
Affiliation(s)
- José M Martín-Durán
  - Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen 5006, Norway
- Joseph F Ryan
  - Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen 5006, Norway; Whitney Laboratory for Marine Bioscience, University of Florida, St. Augustine, Florida 32080, USA
- Bruno C Vellutini
  - Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen 5006, Norway
- Kevin Pang
  - Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen 5006, Norway
- Andreas Hejnol
  - Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen 5006, Norway
|
44
|
Liu Y, Wei M, Hou C, Lu T, Liu L, Wei H, Cheng Y, Wei Z. Functional Characterization of Populus PsnSHN2 in Coordinated Regulation of Secondary Wall Components in Tobacco. Sci Rep 2017; 7:42. [PMID: 28246387 PMCID: PMC5428377 DOI: 10.1038/s41598-017-00093-z] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Received: 08/30/2016] [Accepted: 02/03/2017] [Indexed: 11/13/2022] Open
Abstract
Wood formation is a biological process during which the most abundant lignocellulosic biomass on earth is produced. Although a number of transcription factors have been linked to the regulation of the wood formation process, none of them has been demonstrated to be a higher hierarchical regulator that coordinately regulates secondary wall biosynthesis genes. Here, we identified a Populus gene, PsnSHN2, a counterpart of the Arabidopsis AP2/ERF type transcription factor, SHINE2. PsnSHN2 is predominantly expressed in xylem tissues and evidently acts as a high hierarchical transcriptional activator. Overexpression of PsnSHN2 in tobacco significantly altered the expression of both transcription factors and biosynthesis genes involved in secondary wall formation, leading to thickened secondary walls and altered cell wall composition. The most significant changes occurred in the contents of cellulose and hemicellulose, which increased by 37% and 28%, respectively, whereas the lignin content decreased by 34%. Furthermore, PsnSHN2 activated or repressed the promoter activities of transcription factors involved in secondary wall biosynthesis and bound to five cis-acting elements enriched in the promoter regions of these transcription factors. Taken together, our results suggest that PsnSHN2 coordinately regulates secondary wall formation through selective up/down-regulation of its downstream transcription factors that control secondary wall formation.
Affiliation(s)
- Yingying Liu
  - State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Heilongjiang Harbin, 150040, P.R. China
- Minjing Wei
  - State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Heilongjiang Harbin, 150040, P.R. China
- Cong Hou
  - State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Heilongjiang Harbin, 150040, P.R. China
- Tingting Lu
  - State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Heilongjiang Harbin, 150040, P.R. China
- Hairong Wei
  - School of Forest Resource and Environmental Science, Michigan Technological University, Houghton, MI, 49931, USA
- Yuxiang Cheng
  - State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Heilongjiang Harbin, 150040, P.R. China
- Zhigang Wei
  - State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Heilongjiang Harbin, 150040, P.R. China
|
45
|
Fused Regression for Multi-source Gene Regulatory Network Inference. PLoS Comput Biol 2016; 12:e1005157. [PMID: 27923054 PMCID: PMC5140053 DOI: 10.1371/journal.pcbi.1005157] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 04/22/2016] [Accepted: 09/20/2016] [Indexed: 12/03/2022] Open
Abstract
Understanding gene regulatory networks is critical to understanding cellular differentiation and response to external stimuli. Methods for global network inference have been developed and applied to a variety of species. Most approaches consider the problem of network inference independently in each species, despite evidence that gene regulation can be conserved even in distantly related species. Further, network inference is often confined to single data-types (single platforms) and single cell types. We introduce a method for multi-source network inference that allows simultaneous estimation of gene regulatory networks in multiple species or biological processes through the introduction of priors based on known gene relationships such as orthology incorporated using fused regression. This approach improves network inference performance even when orthology mapping and conservation are incomplete. We refine this method by presenting an algorithm that extracts the true conserved subnetwork from a larger set of potentially conserved interactions and demonstrate the utility of our method in cross species network inference. Last, we demonstrate our method’s utility in learning from data collected on different experimental platforms. Gene regulatory networks describing related biological processes are thought to share conserved interaction structure. This assumption motivates a great deal of work in model systems–where discovery of gene regulation may be more experimentally tractable–but is difficult to directly evaluate using existing methods. The presence of shared structure in a well studied model system or process should make the problem of network inference in a related process easier, but this information is not often applied to the discovery of global gene regulatory networks. Further, to be able to successfully translate findings between different organisms, it is important to be able to identify where regulatory structure is different. 
We provide a method based on penalized fused regression for inferring gene regulatory networks given prior knowledge about the similarity of interactions in each network. This method is demonstrated on synthetic data, and applied to the problem of inferring networks in distantly related bacterial organisms. We then introduce an extension of the method to deal with the condition of uncertainty over the degree of regulatory conservation by simultaneously inferring gene conservation and interaction weights.
|
46
|
Kryuchkova-Mostacci N, Robinson-Rechavi M. Tissue-Specificity of Gene Expression Diverges Slowly between Orthologs, and Rapidly between Paralogs. PLoS Comput Biol 2016; 12:e1005274. [PMID: 28030541 PMCID: PMC5193323 DOI: 10.1371/journal.pcbi.1005274] [Citation(s) in RCA: 48] [Impact Index Per Article: 5.3] [Received: 08/05/2016] [Accepted: 11/26/2016] [Indexed: 11/18/2022] Open
Abstract
The ortholog conjecture implies that functional similarity between orthologous genes is higher than between paralogs. It has been supported using levels of expression and Gene Ontology term analysis, although the evidence was rather weak and there were also conflicting reports. In this study on 12 species we provide strong evidence of high conservation in tissue-specificity between orthologs, in contrast to low conservation between within-species paralogs. This allows us to shed new light on the evolution of gene expression patterns. While there have been several studies of the correlation of expression between species, little is known about the evolution of tissue-specificity itself. Ortholog tissue-specificity is strongly conserved between all tetrapod species, with the lowest Pearson correlation between mouse and frog at r = 0.66. Tissue-specificity correlation decreases strongly with divergence time. Paralogs in human show much lower conservation, even for recent Primate-specific paralogs. When both paralogs from ancient whole genome duplication are tissue-specific, it is often to different tissues, while other tissue-specific paralogs are mostly specific to the same tissue. The same patterns are observed using human or mouse as focal species, and are robust to choices of datasets and of thresholds. Our results support the following model of evolution: in the absence of duplication, tissue-specificity evolves slowly, and tissue-specific genes do not change their main tissue of expression; after small-scale duplication the less expressed paralog loses the ancestral specificity, leading to an immediate difference between paralogs; over time, both paralogs become more broadly expressed, but remain poorly correlated. Finally, there is a small number of paralog pairs which stay tissue-specific with the same main tissue of expression, for at least 300 million years.
Affiliation(s)
- Nadezda Kryuchkova-Mostacci
  - Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Marc Robinson-Rechavi
  - Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
|
47
|
Xing L, An Y, Shi G, Yan J, Xie P, Qu Z, Zhang Z, Liu Z, Pan D, Xu Y. Correlated evolution between CK1δ Protein and the Serine-rich Motif Contributes to Regulating the Mammalian Circadian Clock. J Biol Chem 2016; 292:161-171. [PMID: 27879317 DOI: 10.1074/jbc.m116.751214] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/01/2016] [Revised: 11/21/2016] [Indexed: 11/06/2022] Open
Abstract
Understanding the mechanism underlying the physiological divergence of species is a long-standing issue in evolutionary biology. The circadian clock is a highly conserved system existing in almost all organisms that regulates a wide range of physiological and behavioral events to adapt to the day-night cycle. Here, the interactions between hCK1ϵ/δ/DBT (Drosophila ortholog of CK1δ/ϵ) and serine-rich (SR) motifs from hPER2 (ortholog of Drosophila per) were reconstructed in a Drosophila circadian system. The results indicated that in Drosophila, the SR mutant form hPER2S662G does not recapitulate the mouse or human mutant phenotype. However, introducing hCK1δ (but not DBT) shortened the circadian period and restored the SR motif function. We found that hCK1δ is catalytically more efficient than DBT in phosphorylating the SR motif, which demonstrates that the evolution of CK1δ activity is required for SR motif modulation. Moreover, an abundance of phosphorylatable SR motifs and the striking emergence of putative SR motifs in vertebrate proteins were observed, which provides further evidence that the correlated evolution between kinase activity and its substrates set the stage for functional diversity in vertebrates. It is possible that such correlated evolution may serve as a biomarker associated with the adaptive benefits of diverse organisms. These results also provide a concrete example of how functional synthesis can be achieved through introducing evolutionary partners in vivo.
Affiliation(s)
- Lijuan Xing
  - Cambridge-Suda Genomic Resource Center, Soochow University, 199 Renai Road, Suzhou 215123
- Yang An
  - MOE Key Laboratory of Model Animal for Disease Study, Model Animal Research Center, Nanjing University, 12 Xuefu Road, Pukou District, Nanjing 210061, China
- Guangsen Shi
  - Cambridge-Suda Genomic Resource Center, Soochow University, 199 Renai Road, Suzhou 215123
- Jie Yan
  - Cambridge-Suda Genomic Resource Center, Soochow University, 199 Renai Road, Suzhou 215123
- Pancheng Xie
  - MOE Key Laboratory of Model Animal for Disease Study, Model Animal Research Center, Nanjing University, 12 Xuefu Road, Pukou District, Nanjing 210061, China
- Zhipeng Qu
  - MOE Key Laboratory of Model Animal for Disease Study, Model Animal Research Center, Nanjing University, 12 Xuefu Road, Pukou District, Nanjing 210061, China
- Zhihui Zhang
  - MOE Key Laboratory of Model Animal for Disease Study, Model Animal Research Center, Nanjing University, 12 Xuefu Road, Pukou District, Nanjing 210061, China
- Zhiwei Liu
  - Cambridge-Suda Genomic Resource Center, Soochow University, 199 Renai Road, Suzhou 215123
- Dejing Pan
  - Cambridge-Suda Genomic Resource Center, Soochow University, 199 Renai Road, Suzhou 215123
- Ying Xu
  - Cambridge-Suda Genomic Resource Center, Soochow University, 199 Renai Road, Suzhou 215123; MOE Key Laboratory of Model Animal for Disease Study, Model Animal Research Center, Nanjing University, 12 Xuefu Road, Pukou District, Nanjing 210061, China
|
48
|
Zallot R, Harrison KJ, Kolaczkowski B, de Crécy-Lagard V. Functional Annotations of Paralogs: A Blessing and a Curse. Life (Basel) 2016; 6:life6030039. [PMID: 27618105 PMCID: PMC5041015 DOI: 10.3390/life6030039] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 07/01/2016] [Revised: 08/29/2016] [Accepted: 09/02/2016] [Indexed: 12/15/2022] Open
Abstract
Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines.
Affiliation(s)
- Rémi Zallot
  - Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA
- Katherine J Harrison
  - Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA
- Bryan Kolaczkowski
  - Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA
- Valérie de Crécy-Lagard
  - Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA
|
49
|
High-throughput proteome dynamics for discovery of key proteins in sentinel species: Unsuspected vitellogenins diversity in the crustacean Gammarus fossarum. J Proteomics 2016; 146:207-14. [DOI: 10.1016/j.jprot.2016.07.007] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 04/09/2016] [Revised: 07/01/2016] [Accepted: 07/04/2016] [Indexed: 01/10/2023]
|
50
|
Zaslavsky L, Ciufo S, Fedorov B, Tatusova T. Clustering analysis of proteins from microbial genomes at multiple levels of resolution. BMC Bioinformatics 2016; 17 Suppl 8:276. [PMID: 27586436 PMCID: PMC5009818 DOI: 10.1186/s12859-016-1112-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 11/27/2022] Open
Abstract
Background: Microbial genomes at the National Center for Biotechnology Information (NCBI) represent a large collection of more than 35,000 assemblies. There are several complexities associated with the data: a great variation in sampling density since human pathogens are densely sampled while other bacteria are less represented; different protein families occur in annotations with different frequencies; and the quality of genome annotation varies greatly. In order to extract useful information from these sophisticated data, the analysis needs to be performed at multiple levels of phylogenomic resolution and protein similarity, with an adequate sampling strategy.
Results: Protein clustering is used to construct meaningful and stable groups of similar proteins to be used for analysis and functional annotation. Our approach is to create protein clusters at three levels. First, tight clusters in groups of closely-related genomes (species-level clades) are constructed using a combined approach that takes into account both sequence similarity and genome context. Second, clustroids of conservative in-clade clusters are organized into seed global clusters. Finally, global protein clusters are built around the seed clusters. We propose filtering strategies that allow limiting the protein set included in global clustering. The in-clade clustering procedure, subsequent selection of clustroids and organization into seed global clusters provides a robust representation and high rate of compression. Seed protein clusters are further extended by adding related proteins. Extended seed clusters include a significant part of the data and represent all major known cell machinery. The remaining part, coming from either non-conservative (unique) or rapidly evolving proteins, from rare genomes, or resulting from low-quality annotation, does not group together well. Processing these proteins requires significant computational resources and results in a large number of questionable clusters.
Conclusion: The developed filtering strategies make it possible to identify and exclude such peripheral proteins, limiting the protein dataset in global clustering. Overall, the proposed methodology allows the relevant data at different levels of detail to be obtained and data redundancy to be eliminated while keeping biologically interesting variations.
Affiliation(s)
- Leonid Zaslavsky
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA
- Stacy Ciufo
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA
- Boris Fedorov
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA
- Tatiana Tatusova
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA
|