1
|
Canavati C, Sherill-Rofe D, Kamal L, Bloch I, Zahdeh F, Sharon E, Terespolsky B, Allan IA, Rabie G, Kawas M, Kassem H, Avraham KB, Renbaum P, Levy-Lahad E, Kanaan M, Tabach Y. Using multi-scale genomics to associate poorly annotated genes with rare diseases. Genome Med 2024; 16:4. [PMID: 38178268 PMCID: PMC10765705 DOI: 10.1186/s13073-023-01276-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Accepted: 12/15/2023] [Indexed: 01/06/2024] Open
Abstract
BACKGROUND Next-generation sequencing (NGS) has significantly transformed the landscape of identifying disease-causing genes associated with genetic disorders. However, a substantial portion of sequenced patients remains undiagnosed. This may be attributed not only to the challenges posed by harder-to-detect variants, such as non-coding and structural variations, but also to the existence of variants in genes not previously associated with the patient's clinical phenotype. This study introduces EvORanker, an algorithm that integrates unbiased data from 1,028 eukaryotic genomes to link mutated genes to clinical phenotypes. METHODS EvORanker utilizes clinical data, multi-scale phylogenetic profiling, and other omics data to prioritize disease-associated genes. It was evaluated on solved exomes and simulated genomes, compared with existing methods, and applied to 6,260 knockout genes with mouse phenotypes lacking human associations. Additionally, EvORanker was made accessible as a user-friendly web tool. RESULTS In the analyzed exomic cohort, EvORanker accurately identified the "true" disease gene as the top candidate in 69% of cases and within the top 5 candidates in 95% of cases, consistent with results from the simulated dataset. Notably, EvORanker outperformed existing methods, particularly for poorly annotated genes. In the case of the 6,260 knockout genes with mouse phenotypes, EvORanker linked 41% of these genes to observed human disease phenotypes. Furthermore, in two unsolved cases, EvORanker successfully identified DLGAP2 and LPCAT3 as disease candidates for previously uncharacterized genetic syndromes. CONCLUSIONS We highlight clade-based phylogenetic profiling as a powerful systematic approach for prioritizing potential disease genes. Our study showcases the efficacy of EvORanker in associating poorly annotated genes with disease phenotypes observed in patients. The EvORanker server is freely available at https://ccanavati.shinyapps.io/EvORanker/ .
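The top-1 and top-5 figures reported in the RESULTS are hit rates over a set of solved cases. A minimal sketch of how such a rate can be computed is shown here; the function name `top_k_rate` and the example ranks are hypothetical and are not taken from the EvORanker study.

```python
from typing import Iterable


def top_k_rate(true_gene_ranks: Iterable[int], k: int) -> float:
    """Fraction of cases whose known disease gene is ranked within the top k.

    Ranks are 1-based: rank 1 means the true gene was the top candidate.
    """
    ranks = list(true_gene_ranks)
    if not ranks:
        raise ValueError("need at least one case")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the causal gene in eight solved cases (illustration only).
example_ranks = [1, 1, 3, 1, 2, 7, 1, 5]
print(f"top-1 rate: {top_k_rate(example_ranks, 1):.3f}")
print(f"top-5 rate: {top_k_rate(example_ranks, 5):.3f}")
```

Reporting both k = 1 and k = 5 separates exact hits from cases where the true gene still lands on a short list for manual review.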
Affiliation(s)
- Christina Canavati
- Department of Developmental Biology and Cancer Research, Institute of Medical Research - Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, 9112102, Israel
- Molecular Genetics Lab, Istishari Arab Hospital, Ramallah, Palestine
| | - Dana Sherill-Rofe
- Department of Developmental Biology and Cancer Research, Institute of Medical Research - Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, 9112102, Israel
| | - Lara Kamal
- Molecular Genetics Lab, Istishari Arab Hospital, Ramallah, Palestine
- Department of Human Molecular Genetics and Biochemistry, Faculty of Medicine and Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, 6997801, Israel
| | - Idit Bloch
- Department of Developmental Biology and Cancer Research, Institute of Medical Research - Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, 9112102, Israel
| | - Fouad Zahdeh
- Medical Genetics Institute, Shaare Zedek Medical Center, Jerusalem, 91031, Israel
| | - Elad Sharon
- Department of Developmental Biology and Cancer Research, Institute of Medical Research - Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, 9112102, Israel
| | - Batel Terespolsky
- Department of Developmental Biology and Cancer Research, Institute of Medical Research - Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, 9112102, Israel
- Medical Genetics Institute, Shaare Zedek Medical Center, Jerusalem, 91031, Israel
| | - Islam Abu Allan
- Molecular Genetics Lab, Istishari Arab Hospital, Ramallah, Palestine
| | - Grace Rabie
- Hereditary Research Laboratory and Department of Life Sciences, Bethlehem University, Bethlehem, 72372, Palestine
| | - Mariana Kawas
- Hereditary Research Laboratory and Department of Life Sciences, Bethlehem University, Bethlehem, 72372, Palestine
| | - Hanin Kassem
- Molecular Genetics Lab, Istishari Arab Hospital, Ramallah, Palestine
| | - Karen B Avraham
- Department of Human Molecular Genetics and Biochemistry, Faculty of Medicine and Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, 6997801, Israel
| | - Paul Renbaum
- Medical Genetics Institute, Shaare Zedek Medical Center, Jerusalem, 91031, Israel
| | - Ephrat Levy-Lahad
- Medical Genetics Institute, Shaare Zedek Medical Center, Jerusalem, 91031, Israel
- Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem, 9112102, Israel
| | - Moien Kanaan
- Molecular Genetics Lab, Istishari Arab Hospital, Ramallah, Palestine
- Hereditary Research Laboratory and Department of Life Sciences, Bethlehem University, Bethlehem, 72372, Palestine
| | - Yuval Tabach
- Department of Developmental Biology and Cancer Research, Institute of Medical Research - Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, 9112102, Israel.
| |
|
2
|
Dobbelaere J, Su TY, Erdi B, Schleiffer A, Dammermann A. A phylogenetic profiling approach identifies novel ciliogenesis genes in Drosophila and C. elegans. EMBO J 2023; 42:e113616. [PMID: 37317646 PMCID: PMC10425847 DOI: 10.15252/embj.2023113616] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 01/26/2023] [Revised: 05/22/2023] [Accepted: 06/01/2023] [Indexed: 06/16/2023] Open
Abstract
Cilia are cellular projections that perform sensory and motile functions in eukaryotic cells. A defining feature of cilia is that they are evolutionarily ancient, yet not universally conserved. In this study, we have used the resulting presence and absence pattern in the genomes of diverse eukaryotes to identify a set of 386 human genes associated with cilium assembly or motility. Comprehensive tissue-specific RNAi in Drosophila and mutant analysis in C. elegans revealed signature ciliary defects for 70-80% of novel genes, a percentage similar to that for known genes within the cluster. Further characterization identified different phenotypic classes, including a set of genes related to the cartwheel component Bld10/CEP135 and two highly conserved regulators of cilium biogenesis. We propose this dataset defines the core set of genes required for cilium assembly and motility across eukaryotes and presents a valuable resource for future studies of cilium biology and associated disorders.
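The presence and absence pattern described above can be pictured as a binary vector per gene that is compared with a vector marking which species have cilia. The following is a minimal sketch under that assumption; the gene names, the six-species trait vector, and the mismatch tolerance are hypothetical, and the study's actual clustering procedure is not reproduced here.

```python
from typing import Dict, List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def genes_tracking_trait(
    profiles: Dict[str, Sequence[int]],
    trait: Sequence[int],
    max_mismatches: int = 1,
) -> List[str]:
    """Genes whose presence/absence profile follows the trait vector closely."""
    return sorted(g for g, p in profiles.items() if hamming(p, trait) <= max_mismatches)


# 1 = species has cilia, 0 = species lacks cilia (hypothetical six-species panel).
ciliated = [1, 1, 0, 1, 0, 1]
profiles = {
    "geneA": [1, 1, 0, 1, 0, 1],  # lost exactly where cilia were lost
    "geneB": [1, 1, 1, 1, 1, 1],  # universally conserved, not informative
    "geneC": [1, 1, 0, 1, 0, 0],  # one mismatch, still a candidate
}
print(genes_tracking_trait(profiles, ciliated))
```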
Affiliation(s)
- Jeroen Dobbelaere
- Max Perutz Labs, University of Vienna, Vienna Biocenter (VBC), Vienna, Austria
| | - Tiffany Y Su
- Max Perutz Labs, University of Vienna, Vienna Biocenter (VBC), Vienna, Austria
- Vienna BioCenter PhD Program, Doctoral School of the University of Vienna and Medical University of Vienna, Vienna, Austria
| | - Balazs Erdi
- Max Perutz Labs, University of Vienna, Vienna Biocenter (VBC), Vienna, Austria
| | - Alexander Schleiffer
- Research Institute of Molecular Pathology, Vienna Biocenter (VBC), Vienna, Austria
- Institute of Molecular Biotechnology of the Austrian Academy of Sciences, Vienna Biocenter (VBC), Vienna, Austria
| | | |
|
3
|
Dembech E, Malatesta M, De Rito C, Mori G, Cavazzini D, Secchi A, Morandin F, Percudani R. Identification of hidden associations among eukaryotic genes through statistical analysis of coevolutionary transitions. Proc Natl Acad Sci U S A 2023; 120:e2218329120. [PMID: 37043529 PMCID: PMC10120013 DOI: 10.1073/pnas.2218329120] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 10/28/2022] [Accepted: 03/10/2023] [Indexed: 04/13/2023] Open
Abstract
Coevolution at the gene level, as reflected by correlated events of gene loss or gain, can be revealed by phylogenetic profile analysis. The optimal method and metric for comparing phylogenetic profiles, especially in eukaryotic genomes, are not yet established. Here, we describe a procedure suitable for large-scale analysis, which can reveal coevolution based on the assessment of the statistical significance of correlated presence/absence transitions between gene pairs. This metric can identify coevolution in profiles with low overall similarities and is not affected by similarities lacking coevolutionary information. We applied the procedure to a large collection of 60,912 orthologous gene groups (orthogroups) in 1,264 eukaryotic genomes extracted from OrthoDB. We found significant cotransition scores for 7,825 orthogroups associated in 2,401 coevolving modules linking known and unknown genes in protein complexes and biological pathways. To demonstrate the ability of the method to predict hidden gene associations, we validated through experiments the involvement of vertebrate malate synthase-like genes in the conversion of (S)-ureidoglycolate into glyoxylate and urea, the last step of purine catabolism. This identification explains the presence of glyoxylate cycle genes in metazoa and suggests an anaplerotic role of purine degradation in early eukaryotes.
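The idea of correlated presence/absence transitions can be illustrated by counting where two profiles change state between neighbouring genomes. This is a simplified sketch only: it assumes the genomes are already ordered so that neighbours are closely related, and it stops at raw counts rather than the statistical significance assessment the authors describe. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def transitions(profile: Sequence[int]) -> List[int]:
    """Signed change between adjacent genomes: +1 gain, -1 loss, 0 no change."""
    return [b - a for a, b in zip(profile, profile[1:])]


def cotransition_counts(p: Sequence[int], q: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (concordant, discordant, transitions in p, transitions in q).

    Concordant: both profiles change in the same direction at the same step.
    Discordant: both change at the same step but in opposite directions.
    """
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    tp, tq = transitions(p), transitions(q)
    concordant = sum(1 for a, b in zip(tp, tq) if a != 0 and a == b)
    discordant = sum(1 for a, b in zip(tp, tq) if a != 0 and b != 0 and a != b)
    return concordant, discordant, sum(a != 0 for a in tp), sum(b != 0 for b in tq)


# Two hypothetical binary profiles over seven ordered genomes.
gene_p = [1, 1, 0, 0, 1, 1, 0]
gene_q = [1, 1, 0, 0, 1, 0, 0]
print(cotransition_counts(gene_p, gene_q))
```

Counting transitions rather than matching states is what lets two profiles with low overall similarity still register as coevolving, since long runs of shared presence contribute nothing to the counts.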
Affiliation(s)
- Elena Dembech
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
| | - Marco Malatesta
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
| | - Carlo De Rito
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
| | - Giulia Mori
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
| | - Davide Cavazzini
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
| | - Andrea Secchi
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
| | - Francesco Morandin
- Department of Mathematical, Physical and Computer Sciences, University of Parma, Parma 43124, Italy
| | - Riccardo Percudani
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
| |
|
4
|
Monem PC, Vidyasagar N, Piatt AL, Sehgal E, Arribere JA. Ubiquitination of stalled ribosomes enables mRNA decay via HBS-1 and NONU-1 in vivo. PLoS Genet 2023; 19:e1010577. [PMID: 36626369 PMCID: PMC9870110 DOI: 10.1371/journal.pgen.1010577] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2022] [Revised: 01/23/2023] [Accepted: 12/18/2022] [Indexed: 01/11/2023] Open
Abstract
As ribosomes translate the genetic code, they can encounter a variety of obstacles that hinder their progress. If ribosomes stall for prolonged times, cells suffer due to the loss of translating ribosomes and the accumulation of aberrant protein products. Thus to protect cells, stalled ribosomes experience a series of reactions to relieve the stall and degrade the offending mRNA, a process known as No-Go mRNA Decay (NGD). While much of the machinery for NGD is known, the precise ordering of events and factors along this pathway has not been tested. Here, we deploy C. elegans to unravel the coordinated events comprising NGD. Utilizing a novel reporter and forward and reverse genetics, we identify the machinery required for NGD. Our subsequent molecular analyses define a functional requirement for ubiquitination on at least two ribosomal proteins (eS10 and uS10), and we show that ribosomes lacking ubiquitination sites on eS10 and uS10 fail to perform NGD in vivo. We show that the nuclease NONU-1 acts after the ubiquitin ligase ZNF-598, and discover a novel requirement for the ribosome rescue factors HBS-1/PELO-1 in mRNA decay via NONU-1. Taken together, our work demonstrates mechanisms by which ribosomes signal to effectors of mRNA repression, and we delineate links between repressive factors working toward a well-defined NGD pathway.
Affiliation(s)
- Parissa C. Monem
- Department of Molecular, Cell, and Developmental Biology, University of California at Santa Cruz, Santa Cruz, California, United States of America
| | - Nitin Vidyasagar
- Department of Molecular, Cell, and Developmental Biology, University of California at Santa Cruz, Santa Cruz, California, United States of America
| | - Audrey L. Piatt
- Department of Molecular, Cell, and Developmental Biology, University of California at Santa Cruz, Santa Cruz, California, United States of America
| | - Enisha Sehgal
- Department of Molecular, Cell, and Developmental Biology, University of California at Santa Cruz, Santa Cruz, California, United States of America
| | - Joshua A. Arribere
- Department of Molecular, Cell, and Developmental Biology, University of California at Santa Cruz, Santa Cruz, California, United States of America
| |
|
5
|
Ji F, Bonilla G, Krykbaev R, Ruvkun G, Tabach Y, Sadreyev RI. DEPCOD: a tool to detect and visualize co-evolution of protein domains. Nucleic Acids Res 2022; 50:W246-W253. [PMID: 35536332 PMCID: PMC9252791 DOI: 10.1093/nar/gkac349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2022] [Revised: 04/13/2022] [Accepted: 04/26/2022] [Indexed: 11/14/2022] Open
Abstract
Proteins with similar phylogenetic patterns of conservation or loss across evolutionary taxa are strong candidates to work in the same cellular pathways or engage in physical or functional interactions. Our previously published tools implemented our method of normalized phylogenetic sequence profiling to detect functional associations between non-homologous proteins. However, many proteins consist of multiple protein domains subjected to different selective pressures, so using protein domain as the unit of analysis improves the detection of similar phylogenetic patterns. Here we analyze sequence conservation patterns across the whole tree of life for every protein domain from a set of widely studied organisms. The resulting new interactive webserver, DEPCOD (DEtection of Phylogenetically COrrelated Domains), performs searches with either a selected pre-defined protein domain or a user-supplied sequence as a query to detect other domains from the same organism that have similar conservation patterns. Top similarities on two evolutionary scales (the whole tree of life or eukaryotic genomes) are displayed along with known protein interactions and shared complexes, pathway enrichment among the hits, and detailed visualization of sources of detected similarities. DEPCOD reveals functional relationships between often non-homologous domains that could not be detected using whole-protein sequences. The web server is accessible at http://genetics.mgh.harvard.edu/DEPCOD.
Affiliation(s)
- Fei Ji
- Department of Molecular Biology, Massachusetts General Hospital, Boston, MA, USA; Department of Genetics, Harvard Medical School, Boston, MA, USA
| | - Gracia Bonilla
- Department of Molecular Biology, Massachusetts General Hospital, Boston, MA, USA; Department of Genetics, Harvard Medical School, Boston, MA, USA
| | - Rustem Krykbaev
- Department of Molecular Biology, Massachusetts General Hospital, Boston, MA, USA
| | - Gary Ruvkun
- Department of Molecular Biology, Massachusetts General Hospital, Boston, MA, USA; Department of Genetics, Harvard Medical School, Boston, MA, USA
| | - Yuval Tabach
- Department of Developmental Biology and Cancer Research, Faculty of Medicine, The Hebrew University of Jerusalem, Ein Kerem 9112102, Israel
| | - Ruslan I Sadreyev
- Department of Molecular Biology, Massachusetts General Hospital, Boston, MA, USA; Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
| |
|
6
|
Zhao R, Pei S, Yau SST. New Genome Sequence Detection via Natural Vector Convex Hull Method. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:1782-1793. [PMID: 33237867 DOI: 10.1109/tcbb.2020.3040706] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 06/11/2023]
Abstract
It remains challenging to find existing but undiscovered genome sequence mutations or to predict potential genome sequence mutations from real sequence data. Motivated by this, we develop approaches to detect new, undiscovered genome sequences. Because discovering new genome sequences through biological experiments is resource-intensive, we want to achieve the new genome sequence detection task mathematically. However, little literature describes how to detect new, undiscovered genome sequence mutations mathematically. We form a new framework based on the natural vector convex hull method, which conducts alignment-free sequence analysis. Our two newly developed approaches, Random-permutation Algorithm with Penalty (RAP) and Random-permutation Algorithm with Penalty and COnstrained Search (RAPCOS), use the geometric properties captured by natural vectors. In our experiment, we discover a mathematically new human immunodeficiency virus (HIV) genome sequence using some real HIV genome sequences. Significantly, the proposed methods are applicable to the new genome sequence detection challenge and have many good properties, such as robustness, rapid convergence, and fast computation.
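For readers unfamiliar with natural vectors, the sketch below computes the commonly used 12-dimensional form for a DNA sequence: for each nucleotide, its count, its mean position, and its normalized second central moment of position. It assumes 1-based positions and the normalization by (count × sequence length); it does not implement the convex hull step or the RAP and RAPCOS searches.

```python
from typing import List


def natural_vector(seq: str, alphabet: str = "ACGT") -> List[float]:
    """12-dimensional natural vector: counts, mean positions, second moments.

    Positions are 1-based. For a letter k with n_k occurrences in a sequence
    of length n, the second moment is sum((pos - mean_k)^2) / (n_k * n).
    Letters that never occur contribute zeros.
    """
    seq = seq.upper()
    n = len(seq)
    counts: List[float] = []
    means: List[float] = []
    moments: List[float] = []
    for letter in alphabet:
        positions = [i for i, ch in enumerate(seq, start=1) if ch == letter]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu = sum(positions) / n_k
        counts.append(float(n_k))
        means.append(mu)
        moments.append(sum((p - mu) ** 2 for p in positions) / (n_k * n))
    return counts + means + moments


print(natural_vector("ACGTA"))
```

Because every sequence maps to a fixed-length point, sets of sequences can be compared geometrically without alignment, which is the property the convex hull framework builds on.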
|
7
|
Li W, Yang L, Meng Z, Qiu Y, Wang PSP, Li X. Phylogenetic Analysis: A Novel Method of Protein Sequence Similarity Analysis. Int J Pattern Recogn 2022. [DOI: 10.1142/s0218001422580071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Protein sequence similarity analysis (PSSA) is a significant task in bioinformatics that can yield information about unknown sequences, such as protein structures and homology relationships. A protein sequence is the series of amino acids, each with rich physical and chemical properties, that forms the basic structure of a protein. However, sequence similarity analysis and phylogenetic analysis across species with complex amino acid sequences remain challenging. In this paper, nine properties of amino acids were considered and each sequence was converted into numerical values by principal component analysis (PCA); a new feature vector representing the sequence was then constructed with the Haar wavelet transform and the Higuchi fractal dimension (HFD); finally, the Spearman distance was used to calculate the distance matrix, from which the phylogenetic tree was constructed. Two representative protein sequence sets (9 ND5 (NADH dehydrogenase 5) and 8 ND6 (NADH dehydrogenase 6) sequences) were selected for similarity analysis and phylogenetic analysis, and the results were compared with MEGA software and other existing methods. The extensive results show that our method outperforms the compared methods and yields results consistent with known facts.
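The last step of the pipeline, turning feature vectors into a distance matrix, can be sketched as follows. The sketch assumes the Spearman distance is defined as one minus the Spearman rank correlation, which is a common convention but is not stated in the abstract; the PCA, wavelet, and HFD feature construction is not reproduced, and the feature vectors shown are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def spearman_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed definition: 1 - Spearman rank correlation (range 0 to 2)."""
    return 1.0 - pearson(average_ranks(u), average_ranks(v))


def distance_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of pairwise Spearman distances."""
    n = len(vectors)
    return [[spearman_distance(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


# Hypothetical feature vectors for three sequences.
features = [[0.1, 0.4, 0.9, 1.3], [0.2, 0.5, 0.8, 1.1], [1.3, 0.9, 0.4, 0.1]]
for row in distance_matrix(features):
    print([round(d, 3) for d in row])
```

A matrix of this kind is the usual input to distance-based tree builders such as neighbour joining or UPGMA.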
Affiliation(s)
- Wei Li
- School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
| | - Lina Yang
- School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
| | - Zuqiang Meng
- School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
| | - Yu Qiu
- School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
| | | | - Xichun Li
- Guangxi Normal University for Nationalities, Chongzuo 532200, China
| |
|
8
|
Redrado S, Esteban P, Domingo MP, Lopez C, Rezusta A, Ramirez-Labrada A, Arias M, Pardo J, Galvez EM. Integration of In Silico and In Vitro Analysis of Gliotoxin Production Reveals a Narrow Range of Producing Fungal Species. J Fungi (Basel) 2022; 8:jof8040361. [PMID: 35448592 PMCID: PMC9030297 DOI: 10.3390/jof8040361] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/25/2022] [Revised: 03/28/2022] [Accepted: 03/29/2022] [Indexed: 02/06/2023] Open
Abstract
Gliotoxin is a fungal secondary metabolite with an impact on health and agriculture, since it might act as a virulence factor and contaminate human and animal food. Homologous gliotoxin (GT) gene clusters are spread across a number of fungal species, although whether they produce GT or other related epipolythiodioxopiperazines (ETPs) remains unclear. Using bioinformatic tools, we have identified homologous gli gene clusters similar to the A. fumigatus GT gene cluster in several fungal species. The in silico study led to in vitro confirmation of GT and bisdethiobis(methylthio)gliotoxin (bmGT) production in fungal strain cultures by HPLC detection. Although we selected the most similar homologous gli gene clusters in 20 different species, GT and bmGT were only detected in section Fumigati species and in a Trichoderma virens Q strain. Our results suggest that in silico gli homology analyses in different fungal strains to predict GT production might only be informative when accompanied by analysis of mycotoxin production in cell cultures.
Affiliation(s)
- Sergio Redrado
- Instituto de Carboquímica ICB-CSIC, 50018 Zaragoza, Spain; (S.R.); (M.P.D.)
| | - Patricia Esteban
- Biomedical Research Centre of Aragon (CIBA), Fundacion Instituto de Investigacion Sanitaria Aragon (IIS Aragon), 50009 Zaragoza, Spain; (P.E.); (A.R.-L.); (M.A.); (J.P.)
| | | | - Concepción Lopez
- Department of Microbiology, Hospital Universitario Miguel Servet, IIS Aragón, 50009 Zaragoza, Spain; (C.L.); (A.R.)
| | - Antonio Rezusta
- Department of Microbiology, Hospital Universitario Miguel Servet, IIS Aragón, 50009 Zaragoza, Spain; (C.L.); (A.R.)
| | - Ariel Ramirez-Labrada
- Biomedical Research Centre of Aragon (CIBA), Fundacion Instituto de Investigacion Sanitaria Aragon (IIS Aragon), 50009 Zaragoza, Spain; (P.E.); (A.R.-L.); (M.A.); (J.P.)
| | - Maykel Arias
- Biomedical Research Centre of Aragon (CIBA), Fundacion Instituto de Investigacion Sanitaria Aragon (IIS Aragon), 50009 Zaragoza, Spain; (P.E.); (A.R.-L.); (M.A.); (J.P.)
| | - Julián Pardo
- Biomedical Research Centre of Aragon (CIBA), Fundacion Instituto de Investigacion Sanitaria Aragon (IIS Aragon), 50009 Zaragoza, Spain; (P.E.); (A.R.-L.); (M.A.); (J.P.)
- Department of Microbiology, Pediatrics, Radiology and Public Health, University of Zaragoza, 50009 Zaragoza, Spain
- Aragon I+D Foundation (ARAID), 50018 Zaragoza, Spain
| | - Eva M. Galvez
- Instituto de Carboquímica ICB-CSIC, 50018 Zaragoza, Spain; (S.R.); (M.P.D.)
- Correspondence:
| |
|
9
|
Csűrös M. Gain-loss-duplication models for copy number evolution on a phylogeny: Exact algorithms for computing the likelihood and its gradient. Theor Popul Biol 2022; 145:80-94. [DOI: 10.1016/j.tpb.2022.03.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2021] [Revised: 03/07/2022] [Accepted: 03/10/2022] [Indexed: 10/18/2022]
|
10
|
Stupp D, Sharon E, Bloch I, Zitnik M, Zuk O, Tabach Y. Co-evolution based machine-learning for predicting functional interactions between human genes. Nat Commun 2021; 12:6454. [PMID: 34753957 PMCID: PMC8578642 DOI: 10.1038/s41467-021-26792-w] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 08/26/2020] [Accepted: 10/09/2021] [Indexed: 12/20/2022] Open
Abstract
Over the next decade, more than a million eukaryotic species are expected to be fully sequenced. This has the potential to improve our understanding of genotype and phenotype crosstalk, gene function and interactions, and answer evolutionary questions. Here, we develop a machine-learning approach for utilizing phylogenetic profiles across 1154 eukaryotic species. This method integrates co-evolution across eukaryotic clades to predict functional interactions between human genes and the context for these interactions. We benchmark our approach, showing a 14% performance increase (auROC) compared to previous methods. Using this approach, we predict functional annotations for less studied genes. We focus on DNA repair and verify that 9 of the top 50 predicted genes have been identified elsewhere, with others previously prioritized by high-throughput screens. Overall, our approach enables better annotation of function and functional interactions and facilitates the understanding of evolutionary processes underlying co-evolution. The manuscript is accompanied by a webserver available at: https://mlpp.cs.huji.ac.il. With the rise in the number of eukaryotic species being fully sequenced, large-scale phylogenetic profiling can give insights into gene function. Here, the authors describe a machine-learning approach that integrates co-evolution across eukaryotic clades to predict gene function and functional interactions among human genes.
Affiliation(s)
- Doron Stupp
- Department of Developmental Biology and Cancer Research, The Institute for Medical Research Israel-Canada, The Hebrew University of Jerusalem, 9112001, Jerusalem, Israel
| | - Elad Sharon
- Department of Developmental Biology and Cancer Research, The Institute for Medical Research Israel-Canada, The Hebrew University of Jerusalem, 9112001, Jerusalem, Israel
| | - Idit Bloch
- Department of Developmental Biology and Cancer Research, The Institute for Medical Research Israel-Canada, The Hebrew University of Jerusalem, 9112001, Jerusalem, Israel
| | - Marinka Zitnik
- Department of Biomedical Informatics, Harvard University, Boston, MA, 02115, USA
| | - Or Zuk
- Department of Statistics and Data Science, The Hebrew University of Jerusalem, Jerusalem, 9190501, Israel.
| | - Yuval Tabach
- Department of Developmental Biology and Cancer Research, The Institute for Medical Research Israel-Canada, The Hebrew University of Jerusalem, 9112001, Jerusalem, Israel.
| |
|
11
|
Mu Z, Yu T, Liu X, Zheng H, Wei L, Liu J. FEGS: a novel feature extraction model for protein sequences and its applications. BMC Bioinformatics 2021; 22:297. [PMID: 34078264 PMCID: PMC8172329 DOI: 10.1186/s12859-021-04223-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/25/2021] [Accepted: 05/28/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Feature extraction of protein sequences is widely used in various research areas related to protein analysis, such as protein similarity analysis and prediction of protein functions or interactions. RESULTS In this study, we introduce FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model of protein sequences, by developing a new technique for graphical representation of protein sequences based on the physicochemical properties of amino acids and effectively employing the statistical features of protein sequences. By fusing the graphical and statistical features, FEGS transforms a protein sequence into a 578-dimensional numerical vector. When FEGS is applied to phylogenetic analysis on five protein sequence data sets, its performance is notably better than all of the other compared methods. CONCLUSION The FEGS method is carefully designed and practically powerful for extracting features of protein sequences. The current version of FEGS is developed to be user-friendly and is expected to play a crucial role in related studies of protein sequence analysis.
Affiliation(s)
- Zengchao Mu
- School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
| | - Ting Yu
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
| | - Xiaoping Liu
- Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Beijing, China
| | - Hongyu Zheng
- Department of Radiation Oncology, Qilu Hospital, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan, China.
| | - Juntao Liu
- School of Mathematics and Statistics, Shandong University, Weihai, 264209, China.
| |
|
12
|
Tsaban T, Stupp D, Sherill-Rofe D, Bloch I, Sharon E, Schueler-Furman O, Wiener R, Tabach Y. CladeOScope: functional interactions through the prism of clade-wise co-evolution. NAR Genom Bioinform 2021; 3:lqab024. [PMID: 33928243 PMCID: PMC8057497 DOI: 10.1093/nargab/lqab024] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 09/10/2020] [Revised: 03/12/2021] [Accepted: 03/18/2021] [Indexed: 12/11/2022] Open
Abstract
Mapping co-evolved genes via phylogenetic profiling (PP) is a powerful approach to uncover functional interactions between genes and to associate them with pathways. Despite many successful endeavors, the understanding of co-evolutionary signals in eukaryotes remains partial. Our hypothesis is that 'Clades', branches of the tree of life (e.g. primates and mammals), encompass signals that cannot be detected by PP using all eukaryotes. As such, integrating information from different clades should reveal local co-evolution signals and improve function prediction. Accordingly, we analyzed 1028 genomes in 66 clades and demonstrated that the co-evolutionary signal was scattered across clades. We showed that functionally related genes are frequently co-evolved in only parts of the eukaryotic tree and that clades are complementary in detecting functional interactions within pathways. We examined the non-homologous end joining pathway and the UFM1 ubiquitin-like protein pathway and showed that both demonstrated distinct co-evolution patterns in specific clades. Our research offers a different way to look at co-evolution across eukaryotes and points to the importance of modular co-evolution analysis. We developed the 'CladeOScope' PP method to integrate information from 16 clades across over 1000 eukaryotic genomes; it is accessible via an easy-to-use web server at http://cladeoscope.cs.huji.ac.il.
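The claim that a co-evolutionary signal can be present in one clade yet hidden in a whole-tree comparison can be illustrated by correlating two profiles separately within each clade's columns. This is a minimal sketch with hypothetical clade names, column assignments, and profile values; it uses plain Pearson correlation and does not reproduce how CladeOScope scores or integrates clades.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def clade_correlations(
    profile_a: Sequence[float],
    profile_b: Sequence[float],
    clades: Dict[str, List[int]],
) -> Dict[str, float]:
    """Correlate two profiles within each clade (clade name -> species column indices)."""
    return {
        name: pearson([profile_a[i] for i in cols], [profile_b[i] for i in cols])
        for name, cols in clades.items()
    }


# Hypothetical conservation scores for two genes across eight species.
gene_a = [1.0, 0.2, 0.9, 0.1, 0.5, 0.6, 0.4, 0.7]
gene_b = [0.9, 0.1, 0.8, 0.0, 0.7, 0.4, 0.6, 0.5]
clades = {"fungi": [0, 1, 2, 3], "mammals": [4, 5, 6, 7]}  # illustrative split

per_clade = clade_correlations(gene_a, gene_b, clades)
print({name: round(r, 3) for name, r in per_clade.items()})
print("all species:", round(pearson(gene_a, gene_b), 3))
```

In this toy example the two genes track each other in one clade and diverge in the other, so the all-species correlation understates the local signal.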
Affiliation(s)
- Tomer Tsaban
  - Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada and Hadassah Medical School, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
- Doron Stupp
  - Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada and Hadassah Medical School, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
- Dana Sherill-Rofe
  - Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada and Hadassah Medical School, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
- Idit Bloch
  - Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada and Hadassah Medical School, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
- Elad Sharon
  - Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada and Hadassah Medical School, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
- Ora Schueler-Furman
  - Department of Microbiology and Molecular Genetics, Institute for Medical Research Israel-Canada and Hadassah Medical School, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
- Reuven Wiener
  - Department of Biochemistry and Molecular Biology, Institute for Medical Research Israel-Canada and Hadassah Medical School, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
- Yuval Tabach
  - Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada and Hadassah Medical School, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
|
13
|
Bloch I, Sherill-Rofe D, Stupp D, Unterman I, Beer H, Sharon E, Tabach Y. Optimization of co-evolution analysis through phylogenetic profiling reveals pathway-specific signals. Bioinformatics 2021; 36:4116-4125. [PMID: 32353123 DOI: 10.1093/bioinformatics/btaa281] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 11/04/2019] [Revised: 04/17/2020] [Accepted: 04/23/2020] [Indexed: 12/11/2022] Open
Abstract
SUMMARY The exponential growth in available genomic data is expected to reach full sequencing of a million genomes in the coming decade. Improving and developing methods to analyze these genomes and to reveal their utility is of major interest in a wide variety of fields, such as comparative and functional genomics, evolution and bioinformatics. Phylogenetic profiling is an established method for predicting functional interactions between proteins based on similarities in their evolutionary patterns across species. Proteins that function together (i.e. generate complexes, interact in the same pathways or improve adaptation to environmental niches) tend to show coordinated evolution across the tree of life. The normalized phylogenetic profiling (NPP) method takes into account minute changes in proteins across species to identify protein co-evolution. Despite the success of this method, it is still not clear what set of parameters is required for optimal use of co-evolution in predicting functional interactions. Moreover, it is not clear if pathway evolution or function should direct parameter choice. Here, we create a reliable and usable NPP construction pipeline. We explore the effect of parameter selection on functional interaction prediction using NPP from 1028 genomes, both separately and in various value combinations. We identify several parameter sets that optimize performance for pathways with certain biological annotation. This work reveals the importance of choosing the right parameters for optimized function prediction based on a biological context. AVAILABILITY AND IMPLEMENTATION Source code and documentation are available on GitHub: https://github.com/iditam/CompareNPPs. CONTACT yuvaltab@ekmd.huji.ac.il. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
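A minimal sketch of how a normalized profile matrix might be built, assuming the commonly described NPP recipe: each best-hit bit score is divided by the query's self-hit score, log2-transformed, and then z-scored within each species column. The floor value, the toy scores, and the function name are hypothetical, and the parameter combinations explored in the paper are not reproduced here.

```python
from math import log2, sqrt

def normalize_profiles(bit_scores, self_scores, floor=0.01):
    """bit_scores[g][s]: best-hit score of gene g in species s.
    self_scores[g]: score of gene g aligned to itself.
    Returns per-species z-scores of log2(score / self_score)."""
    # Step 1: normalize by the self-hit and log-transform (floor avoids log2(0)).
    logged = [
        [log2(max(score / self_scores[g], floor)) for score in row]
        for g, row in enumerate(bit_scores)
    ]
    n_genes, n_species = len(logged), len(logged[0])
    # Step 2: z-score each species column across genes.
    result = [[0.0] * n_species for _ in range(n_genes)]
    for s in range(n_species):
        column = [logged[g][s] for g in range(n_genes)]
        mean = sum(column) / n_genes
        sd = sqrt(sum((v - mean) ** 2 for v in column) / n_genes)
        for g in range(n_genes):
            result[g][s] = (column[g] - mean) / sd if sd > 0 else 0.0
    return result

# Hypothetical toy input: 3 genes scored against 2 species.
bit_scores = [[200.0, 50.0], [400.0, 400.0], [100.0, 0.0]]
self_scores = [400.0, 800.0, 100.0]
npp = normalize_profiles(bit_scores, self_scores)
```

Rows of the resulting matrix can then be compared pairwise (for example by correlation) to score candidate functional interactions.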
Affiliation(s)
- Idit Bloch
  - Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada, Hebrew University of Jerusalem, Jerusalem 9112102, Israel
- Dana Sherill-Rofe
  - Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada, Hebrew University of Jerusalem, Jerusalem 9112102, Israel
- Doron Stupp
  - Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada, Hebrew University of Jerusalem, Jerusalem 9112102, Israel
- Irene Unterman
  - Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada, Hebrew University of Jerusalem, Jerusalem 9112102, Israel
- Hodaya Beer
  - Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada, Hebrew University of Jerusalem, Jerusalem 9112102, Israel
- Elad Sharon
  - Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada, Hebrew University of Jerusalem, Jerusalem 9112102, Israel
- Yuval Tabach
  - Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada, Hebrew University of Jerusalem, Jerusalem 9112102, Israel
|
14
|
Tremblay BJM, Lobb B, Doxey AC. PhyloCorrelate: inferring bacterial gene-gene functional associations through large-scale phylogenetic profiling. Bioinformatics 2021; 37:17-22. [PMID: 33416870 DOI: 10.1093/bioinformatics/btaa1105] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 08/23/2020] [Revised: 12/26/2020] [Accepted: 12/29/2020] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Statistical detection of co-occurring genes across genomes, known as "phylogenetic profiling", is a powerful bioinformatic technique for inferring gene-gene functional associations. However, this can be a challenging task given the size and complexity of phylogenomic databases, difficulty in accounting for phylogenetic structure, inconsistencies in genome annotation, and substantial computational requirements. RESULTS We introduce PhyloCorrelate, a computational framework for gene co-occurrence analysis across large phylogenomic datasets. PhyloCorrelate implements a variety of co-occurrence metrics including standard correlation metrics and model-based metrics that account for phylogenetic history. By combining multiple metrics, we developed an optimized score that exhibits a superior ability to link genes with overlapping GO terms and KEGG pathways, enabling gene function prediction. Using genomic and functional annotation data from the Genome Taxonomy Database and AnnoTree, we performed all-by-all comparisons of gene occurrence profiles across the bacterial tree of life, totaling 154,217,052 comparisons for 28,315 genes across 27,372 bacterial genomes. All predictions are available in an online database, which instantaneously returns the top correlated genes for any PFAM, TIGRFAM, or KEGG query. In total, PhyloCorrelate detected 29,762 high confidence associations between bacterial gene/protein pairs, and generated functional predictions for 834 DUFs and proteins of unknown function. AVAILABILITY PhyloCorrelate is available as a web-server at phylocorrelate.uwaterloo.ca as well as an R package for analysis of custom datasets. We anticipate that PhyloCorrelate will be broadly useful as a tool for predicting function and interactions for gene families. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
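As an illustration of what a co-occurrence metric over presence/absence profiles can look like, the sketch below computes two generic textbook measures: the Jaccard index and a hypergeometric tail probability. These are chosen for illustration only; the abstract does not say which metrics PhyloCorrelate combines into its optimized score, and the toy profiles are invented.

```python
from math import comb

def jaccard(a, b):
    """Shared presences divided by genomes in which either gene is present."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def hypergeom_tail(a, b):
    """P(overlap >= observed) if gene b's presences were placed at random."""
    n = len(a)
    ka, kb = sum(a), sum(b)
    observed = sum(1 for x, y in zip(a, b) if x and y)
    upper = min(ka, kb)
    favourable = sum(
        comb(ka, i) * comb(n - ka, kb - i) for i in range(observed, upper + 1)
    )
    return favourable / comb(n, kb)

# Hypothetical binary profiles over 6 genomes (1 = gene present).
gene_x = [1, 1, 1, 0, 0, 0]
gene_y = [1, 1, 1, 0, 0, 0]
gene_z = [1, 1, 0, 1, 0, 0]
```

Neither measure accounts for phylogenetic structure: closely related genomes are counted as independent observations, which is one of the difficulties the abstract mentions.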
Affiliation(s)
- Briallen Lobb
  - Department of Biology, 200 University Ave. West, Waterloo, ON, N2L 3G1, Canada
- Andrew C Doxey
  - Department of Biology, 200 University Ave. West, Waterloo, ON, N2L 3G1, Canada
|
15
|
Nagy LG, Merényi Z, Hegedüs B, Bálint B. Novel phylogenetic methods are needed for understanding gene function in the era of mega-scale genome sequencing. Nucleic Acids Res 2020; 48:2209-2219. [PMID: 31943056 PMCID: PMC7049691 DOI: 10.1093/nar/gkz1241] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 10/23/2019] [Revised: 12/15/2019] [Accepted: 12/31/2019] [Indexed: 12/21/2022] Open
Abstract
Ongoing large-scale genome sequencing projects are forecasting a data deluge that will almost certainly overwhelm current analytical capabilities of evolutionary genomics. In contrast to population genomics, there are no standardized methods in evolutionary genomics for extracting evolutionary and functional (e.g. gene-trait association) signal from genomic data. Here, we examine how current practices of multi-species comparative genomics perform in this aspect and point out that many genomic datasets are under-utilized due to the lack of powerful methodologies. As a result, many current analyses emphasize gene families for which some functional data is already available, resulting in a growing gap between functionally well-characterized genes/organisms and the universe of unknowns. This leaves unknown genes on the 'dark side' of genomes, a problem that will not be mitigated by sequencing more and more genomes, unless we develop tools to infer functional hypotheses for unknown genes in a systematic manner. We provide an inventory of recently developed methods capable of predicting gene-gene and gene-trait associations based on comparative data, then argue that realizing the full potential of whole genome datasets requires the integration of phylogenetic comparative methods into genomics, a rich but underutilized toolbox for looking into the past.
Affiliation(s)
- László G Nagy
  - Synthetic and Systems Biology Unit, Institute of Biochemistry, Biological Research Centre, Temesvari krt 62. Szeged 6726, Hungary
- Zsolt Merényi
  - Synthetic and Systems Biology Unit, Institute of Biochemistry, Biological Research Centre, Temesvari krt 62. Szeged 6726, Hungary
- Botond Hegedüs
  - Synthetic and Systems Biology Unit, Institute of Biochemistry, Biological Research Centre, Temesvari krt 62. Szeged 6726, Hungary
- Balázs Bálint
  - Synthetic and Systems Biology Unit, Institute of Biochemistry, Biological Research Centre, Temesvari krt 62. Szeged 6726, Hungary
|
16
|
Alcalá-Corona SA, Espinal-Enríquez J, de Anda-Jáuregui G, Hernández-Lemus E. The Hierarchical Modular Structure of HER2+ Breast Cancer Network. Front Physiol 2018; 9:1423. [PMID: 30364267 PMCID: PMC6193406 DOI: 10.3389/fphys.2018.01423] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 03/22/2018] [Accepted: 09/19/2018] [Indexed: 11/13/2022] Open
Abstract
HER2-enriched breast cancer is a complex disease characterized by the overexpression of the ERBB2 amplicon. While the effects of this genomic aberration on the pathology have been studied, genome-wide deregulation patterns in this subtype of cancer are also observed. A novel approach to the study of this malignant neoplasm is the use of transcriptional networks. These networks generally exhibit modular structures, which in turn may be associated with biological processes. This modular regulation of biological functions may also exhibit a hierarchical structure, with deeper levels of modular organization accounting for more specific functional regulation. In this work, we identified the most probable (maximum likelihood) model of the hierarchical modular structure of the HER2-enriched transcriptional network as reconstructed from gene expression data, and analyzed the statistical associations of modules and submodules with biological functions. We found modular structures, independent of direct ERBB2 amplicon regulation, involved in different biological functions such as signaling, immunity, and cellular morphology. Higher resolution submodules were identified in more specific functions, such as micro-RNA regulation and the activation of viral-like immune response. We propose the approach presented here as one that may help to unveil mechanisms involved in the development of the pathology.
Affiliation(s)
- Sergio Antonio Alcalá-Corona
  - Computational Genomics, National Institute of Genomic Medicine, Mexico City, Mexico; Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de Mexico, Ciudad de Mexico, Mexico
- Jesús Espinal-Enríquez
  - Computational Genomics, National Institute of Genomic Medicine, Mexico City, Mexico; Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de Mexico, Ciudad de Mexico, Mexico
- Enrique Hernández-Lemus
  - Computational Genomics, National Institute of Genomic Medicine, Mexico City, Mexico; Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de Mexico, Ciudad de Mexico, Mexico
|
17
|
Phylogenetic analysis of protein sequences based on a novel k-mer natural vector method. Genomics 2018; 111:1298-1305. [PMID: 30195069 DOI: 10.1016/j.ygeno.2018.08.010] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 07/25/2018] [Revised: 08/19/2018] [Accepted: 08/27/2018] [Indexed: 11/22/2022]
Abstract
Based on the k-mer model for protein sequences, a novel k-mer natural vector method is proposed to characterize the features of k-mers in a protein sequence, in which the numbers and distributions of k-mers are considered. It is proved that the relationship between a protein sequence and its k-mer natural vector is one-to-one. Phylogenetic analysis of protein sequences can therefore be performed easily without requiring evolutionary models or human intervention. However, there is no established criterion for choosing a suitable k, and k strongly influences both the results and the computational complexity. In this paper, a compound k-mer natural vector is utilized to quantify each protein sequence. The results obtained from phylogenetic analysis of three protein datasets demonstrate that the new method can precisely describe the evolutionary relationships of proteins and greatly improve computing efficiency.
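The abstract says the vector encodes the numbers and distributions of k-mers. One plausible reading, by analogy with the classical natural vector, records for each k-mer its count, its mean start position, and a scaled second central moment of its positions. The sketch below follows that reading; the moment normalization and the way several k values are merged in `compound_vector` are assumptions, not details taken from the paper.

```python
from collections import defaultdict

def kmer_natural_vector(seq, k):
    """Map each k-mer in seq to (count, mean position, scaled 2nd central moment)."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i + 1)  # 1-based start positions
    total = len(seq) - k + 1  # number of k-mer windows in the sequence
    vector = {}
    for kmer, pos in positions.items():
        n = len(pos)
        mu = sum(pos) / n
        # Assumed normalization: divide by count times number of windows.
        d2 = sum((p - mu) ** 2 for p in pos) / (n * total)
        vector[kmer] = (n, mu, d2)
    return vector

def compound_vector(seq, ks=(1, 2)):
    """Merge the descriptors for several k values (assumed combination rule)."""
    combined = {}
    for k in ks:
        combined.update(kmer_natural_vector(seq, k))
    return combined

# Toy peptide: windows of length 2 are MK (pos 1), KM (pos 2), MK (pos 3).
vec = kmer_natural_vector("MKMK", 2)
both = compound_vector("MKMK")
```

Sequences can then be compared by a distance between their vectors (treating absent k-mers as zeros), which is what makes an alignment-free tree construction possible.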
|
18
|
Single Cell Genetics and Epigenetics in Early Embryo: From Oocyte to Blastocyst. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2018; 1068:103-117. [DOI: 10.1007/978-981-13-0502-3_9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 12/26/2022]
|
19
|
Sferra G, Fratini F, Ponzi M, Pizzi E. Phylo_dCor: distance correlation as a novel metric for phylogenetic profiling. BMC Bioinformatics 2017; 18:396. [PMID: 28870256 PMCID: PMC5584357 DOI: 10.1186/s12859-017-1815-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/18/2017] [Accepted: 08/29/2017] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND The development of powerful methods to predict functional and/or physical protein-protein interactions from genome sequence is one of the main tasks in the post-genomic era. Phylogenetic profiling allows the prediction of protein-protein interactions at a whole genome level in both Prokaryotes and Eukaryotes. For this reason it is considered one of the most promising methods. RESULTS Here, we propose an improvement of phylogenetic profiling that enables the handling of large genomic datasets and the inference of global protein-protein interactions. This method uses the distance correlation as a new measure of phylogenetic profile similarity. We constructed robust reference sets and developed Phylo-dCor, a parallelized version of the algorithm for calculating the distance correlation that makes it applicable to large genomic data. Using Saccharomyces cerevisiae and Escherichia coli genome datasets, we showed that Phylo-dCor outperforms previously described phylogenetic profiling methods based on the mutual information and Pearson’s correlation as measures of profile similarity. CONCLUSIONS In this work, we constructed and assessed robust reference sets and propose the distance correlation as a measure for comparing phylogenetic profiles. To make it applicable to large genomic data, we developed Phylo-dCor, a parallelized version of the algorithm for calculating the distance correlation. Two R scripts that can be run on a wide range of machines are available upon request. ELECTRONIC SUPPLEMENTARY MATERIAL The online version of this article (10.1186/s12859-017-1815-5) contains supplementary material, which is available to authorized users.
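For readers unfamiliar with the metric, here is a minimal, unparallelized Python sketch of the sample distance correlation between two profiles, treating each genome as one observation. It follows the standard double-centering definition; it is not the authors' R implementation, and the function names and toy profiles are illustrative.

```python
from math import sqrt

def _centered_distances(v):
    """Pairwise |vi - vj| matrix with row, column, and grand means removed."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row = [sum(r) / n for r in d]  # equals the column means (d is symmetric)
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def distance_correlation(x, y):
    """Sample distance correlation of two equal-length numeric profiles (0 to 1)."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(a[i][j] * b[i][j] for i, j in pairs) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i, j in pairs) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i, j in pairs) / n ** 2
    if dvar_x == 0 or dvar_y == 0:
        return 0.0  # a constant profile carries no dependence information
    return sqrt(max(dcov2, 0.0) / sqrt(dvar_x * dvar_y))

# Toy profiles: y_quad depends on x only through x**2, so the two are
# linearly uncorrelated but clearly dependent.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_linear = [2.0 * v + 1.0 for v in x]
y_quad = [v * v for v in x]
```

Unlike Pearson's correlation, which is zero for `x` versus `y_quad`, the distance correlation stays clearly positive. The double loop costs O(n^2) per gene pair, which is why a parallelized implementation matters at genome scale.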
Affiliation(s)
- Gabriella Sferra
  - Dipartimento di Malattie Infettive, Parassitarie e Immunomediate, Istituto Superiore di Sanità, Viale Regina Elena 299, 00161, Rome, Italy
- Federica Fratini
  - Dipartimento di Malattie Infettive, Parassitarie e Immunomediate, Istituto Superiore di Sanità, Viale Regina Elena 299, 00161, Rome, Italy
- Marta Ponzi
  - Dipartimento di Malattie Infettive, Parassitarie e Immunomediate, Istituto Superiore di Sanità, Viale Regina Elena 299, 00161, Rome, Italy
- Elisabetta Pizzi
  - Dipartimento di Malattie Infettive, Parassitarie e Immunomediate, Istituto Superiore di Sanità, Viale Regina Elena 299, 00161, Rome, Italy
|