1. Bornstein K, Gryan G, Chang ES, Marchler-Bauer A, Schneider VA. The NIH Comparative Genomics Resource: addressing the promises and challenges of comparative genomics on human health. BMC Genomics 2023; 24:575. [PMID: 37759191 PMCID: PMC10523801 DOI: 10.1186/s12864-023-09643-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 05/20/2023] [Accepted: 08/31/2023] [Indexed: 09/29/2023] Open
Abstract
Comparative genomics is the comparison of genetic information within and across organisms to understand the evolution, structure, and function of genes, proteins, and non-coding regions (Sivashankari and Shanmughavel, Bioinformation 1:376-8, 2007). Advances in sequencing technology and assembly algorithms have resulted in the ability to sequence large genomes and provided a wealth of data that are being used in comparative genomic analyses. Comparative analysis can be leveraged to systematically explore and evaluate the biological relationships and evolution between species, aid in understanding the structure and function of genes, and gain a better understanding of disease and potential drug targets. As our knowledge of genetics expands, comparative genomics can help identify emerging model organisms among a broader span of the tree of life, positively impacting human health. This impact includes, but is not limited to, zoonotic disease research, therapeutics development, microbiome research, xenotransplantation, oncology, and toxicology. Despite advancements in comparative genomics, new challenges have arisen around the quantity, quality assurance, annotation, and interoperability of genomic data and metadata. New tools and approaches are required to meet these challenges and fulfill the needs of researchers. This paper focuses on how the National Institutes of Health (NIH) Comparative Genomics Resource (CGR) can address both the opportunities for comparative genomics to further impact human health and confront an increasingly complex set of challenges facing researchers.
Affiliation(s)
- Gary Gryan
  - The MITRE Corporation, 7525 Colshire Dr, McLean, VA, USA
- E Sally Chang
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Aron Marchler-Bauer
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Valerie A Schneider
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
2. Garzón W, Benavides L, Gaignard A, Redon R, Südholt M. A taxonomy of tools and approaches for distributed genomic analyses. Informatics in Medicine Unlocked 2022. [DOI: 10.1016/j.imu.2022.101024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022] Open
3. Guarino F, Cicatelli A, Castiglione S, Agius DR, Orhun GE, Fragkostefanakis S, Leclercq J, Dobránszki J, Kaiserli E, Lieberman-Lazarovich M, Sõmera M, Sarmiento C, Vettori C, Paffetti D, Poma AMG, Moschou PN, Gašparović M, Yousefi S, Vergata C, Berger MMJ, Gallusci P, Miladinović D, Martinelli F. An Epigenetic Alphabet of Crop Adaptation to Climate Change. Front Genet 2022; 13:818727. [PMID: 35251130 PMCID: PMC8888914 DOI: 10.3389/fgene.2022.818727] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 11/19/2021] [Accepted: 01/28/2022] [Indexed: 01/10/2023] Open
Abstract
Crop adaptation to climate change is in part attributed to epigenetic mechanisms that are related to the response to abiotic and biotic stresses. Although recent studies have increased our knowledge of the nature of these mechanisms, epigenetics remains under-investigated and still poorly understood in many plants, especially non-model ones. Epigenetic modifications are traditionally divided into two main groups, DNA methylation and histone modifications, that lead to chromatin remodeling and the regulation of genome functioning. In this review, we outline the most recent and interesting findings on crop epigenetic responses to the environmental cues that are most relevant to climate change. In addition, we discuss a speculative point of view, in which we try to decipher the “epigenetic alphabet” that underlies crop adaptation mechanisms to climate change. The understanding of these mechanisms will pave the way to new strategies to design and implement the next generation of cultivars with a broad range of tolerance/resistance to stresses as well as balanced agronomic traits, with a limited loss of (epi)genetic variability.
Affiliation(s)
- Francesco Guarino
  - Dipartimento di Chimica e Biologia “A. Zambelli”, Università Degli Studi di Salerno, Salerno, Italy
- Angela Cicatelli
  - Dipartimento di Chimica e Biologia “A. Zambelli”, Università Degli Studi di Salerno, Salerno, Italy
- Stefano Castiglione
  - Dipartimento di Chimica e Biologia “A. Zambelli”, Università Degli Studi di Salerno, Salerno, Italy
- Dolores R. Agius
  - Centre of Molecular Medicine and Biobanking, University of Malta, Msida, Malta
- Gul Ebru Orhun
  - Bayramic Vocational College, Canakkale Onsekiz Mart University, Canakkale, Turkey
- Julie Leclercq
  - CIRAD, UMR AGAP, Montpellier, France
  - AGAP, Univ Montpellier, CIRAD, INRA, Institut Agro, Montpellier, France
- Judit Dobránszki
  - Centre for Agricultural Genomics and Biotechnology, FAFSEM, University of Debrecen, Debrecen, Hungary
- Eirini Kaiserli
  - Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
- Merike Sõmera
  - Department of Chemistry and Biotechnology, Tallinn University of Technology, Tallinn, Estonia
- Cecilia Sarmiento
  - Department of Chemistry and Biotechnology, Tallinn University of Technology, Tallinn, Estonia
- Cristina Vettori
  - Institute of Biosciences and Bioresources (IBBR), National Research Council (CNR), Sesto Fiorentino, Italy
- Donatella Paffetti
  - Department of Agriculture, Food, Environment and Forestry (DAGRI), University of Florence, Florence, Italy
- Anna M. G. Poma
  - Department of Clinical Medicine, Public Health, Life and Environmental Sciences, University of L’Aquila, Aquila, Italy
- Panagiotis N. Moschou
  - Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology—Hellas, Heraklion, Greece
  - Department of Biology, University of Crete, Heraklion, Greece
  - Department of Plant Biology, Uppsala BioCenter, Swedish University of Agricultural Sciences and Linnean Center for Plant Biology, Uppsala, Sweden
- Mateo Gašparović
  - Chair of Photogrammetry and Remote Sensing, Faculty of Geodesy, University of Zagreb, Zagreb, Croatia
- Sanaz Yousefi
  - Department of Horticultural Science, Bu-Ali Sina University, Hamedan, Iran
- Chiara Vergata
  - Department of Biology, University of Florence, Sesto Fiorentino, Italy
- Margot M. J. Berger
  - UMR Ecophysiologie et Génomique Fonctionnelle de la Vigne, Université de Bordeaux, INRAE, Bordeaux Science Agro, Bordeaux, France
- Philippe Gallusci
  - UMR Ecophysiologie et Génomique Fonctionnelle de la Vigne, Université de Bordeaux, INRAE, Bordeaux Science Agro, Bordeaux, France
- Dragana Miladinović
  - Institute of Field and Vegetable Crops, National Institute of Republic of Serbia, Novi Sad, Serbia
  - *Correspondence: Dragana Miladinović; Federico Martinelli
- Federico Martinelli
  - Department of Biology, University of Florence, Sesto Fiorentino, Italy
  - *Correspondence: Dragana Miladinović; Federico Martinelli
4. Karim MR, Cochez M, Zappa A, Sahay R, Rebholz-Schuhmann D, Beyan O, Decker S. Convolutional Embedded Networks for Population Scale Clustering and Bio-Ancestry Inferencing. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:369-382. [PMID: 32750845 DOI: 10.1109/tcbb.2020.2994649] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
The study of genetic variants (GVs) can help find correlated population groups, identify cohorts that are predisposed to common diseases, and explain differences in disease susceptibility and in how patients react to drugs. Machine learning techniques are increasingly being applied to identify interacting GVs to understand their complex phenotypic traits. Since the performance of a learning algorithm depends not only on the size and nature of the data but also on the quality of the underlying representation, deep neural networks (DNNs) can learn non-linear mappings that transform GV data into representations that are friendlier to clustering and classification than manual feature selection. In this paper, we propose convolutional embedded networks (CEN), in which we combine two DNN architectures, convolutional embedded clustering (CEC) and a convolutional autoencoder (CAE) classifier, for clustering individuals and predicting geographic ethnicity based on GVs, respectively. We applied CAE-based representation learning to 95 million GVs from the '1000 genomes' (covering 2,504 individuals from 26 ethnic origins) and 'Simons genome diversity' (covering 279 individuals from 130 ethnic origins) projects. Quantitative and qualitative analyses with a focus on accuracy and scalability show that our approach outperforms state-of-the-art approaches such as VariantSpark and ADMIXTURE. In particular, CEC can cluster targeted population groups in 22 hours with an adjusted Rand index (ARI) of 0.915, a normalized mutual information (NMI) of 0.92, and a clustering accuracy (ACC) of 89 percent. The CAE classifier, in turn, can predict the geographic ethnicity of unknown samples with an F1 score and a Matthews correlation coefficient (MCC) of 0.9004 and 0.8245, respectively. Further, to provide interpretations of the predictions, we identify significant biomarkers using gradient boosted trees (GBT) and SHapley Additive exPlanations (SHAP). Overall, our approach is transparent and faster than the baseline methods, and scalable for 5 to 100 percent of the full human genome.
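For readers unfamiliar with the two classification scores quoted in this abstract, the following is a minimal sketch of how F1 and the Matthews correlation coefficient are computed from confusion-matrix counts. It assumes a binary setting for brevity, whereas the paper reports a multi-class task; the function names and the counts are hypothetical and are not taken from the paper.

```python
import math


def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall, from binary counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for illustration only (not from the paper).
tp, tn, fp, fn = 90, 85, 15, 10
f1_value = f1_score(tp, fp, fn)
mcc_value = mcc(tp, tn, fp, fn)
print(round(f1_value, 4), round(mcc_value, 4))
```

Unlike F1, MCC also uses the true-negative count, which is one reason the two scores can differ noticeably on the same predictions.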
5. Vesteghem C, Brøndum RF, Sønderkær M, Sommer M, Schmitz A, Bødker JS, Dybkær K, El-Galaly TC, Bøgsted M. Implementing the FAIR Data Principles in precision oncology: review of supporting initiatives. Brief Bioinform 2021; 21:936-945. [PMID: 31263868 PMCID: PMC7299292 DOI: 10.1093/bib/bbz044] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 02/11/2019] [Revised: 03/13/2019] [Accepted: 03/21/2019] [Indexed: 12/26/2022] Open
Abstract
Compelling research has recently shown that cancer is so heterogeneous that single research centres cannot produce enough data to fit prognostic and predictive models of sufficient accuracy. Data sharing in precision oncology is therefore of utmost importance. The Findable, Accessible, Interoperable and Reusable (FAIR) Data Principles have been developed to define good practices in data sharing. Motivated by the ambition of applying the FAIR Data Principles to our own clinical precision oncology implementations and research, we have performed a systematic literature review of potentially relevant initiatives. For clinical data, we suggest using the Genomic Data Commons model as a reference as it provides a field-tested and well-documented solution. Regarding classification of diagnosis, morphology and topography and drugs, we chose to follow the World Health Organization standards, i.e. ICD10, ICD-O-3 and Anatomical Therapeutic Chemical classifications, respectively. For the bioinformatics pipeline, the Genome Analysis ToolKit Best Practices using Docker containers offer a coherent solution and have therefore been selected. Regarding the naming of variants, we follow the Human Genome Variation Society's standard. For the IT infrastructure, we have built a centralized solution to participate in data sharing through federated solutions such as the Beacon Networks.
Affiliation(s)
- Charles Vesteghem
  - Department of Clinical Medicine, Aalborg University, Denmark
  - Department of Haematology, Aalborg University Hospital, Denmark
- Mads Sønderkær
  - Department of Haematology, Aalborg University Hospital, Denmark
- Mia Sommer
  - Department of Clinical Medicine, Aalborg University, Denmark
  - Department of Haematology, Aalborg University Hospital, Denmark
- Karen Dybkær
  - Department of Clinical Medicine, Aalborg University, Denmark
  - Department of Haematology, Aalborg University Hospital, Denmark
  - Clinical Cancer Research Center, Aalborg University Hospital, Denmark
- Tarec Christoffer El-Galaly
  - Department of Clinical Medicine, Aalborg University, Denmark
  - Department of Haematology, Aalborg University Hospital, Denmark
  - Clinical Cancer Research Center, Aalborg University Hospital, Denmark
- Martin Bøgsted
  - Department of Clinical Medicine, Aalborg University, Denmark
  - Department of Haematology, Aalborg University Hospital, Denmark
  - Clinical Cancer Research Center, Aalborg University Hospital, Denmark
6. Cesano A, Cannarile MA, Gnjatic S, Gomes B, Guinney J, Karanikas V, Karkada M, Kirkwood JM, Kotlan B, Masucci GV, Meeusen E, Monette A, Naing A, Thorsson V, Tschernia N, Wang E, Wells DK, Wyant TL, Rutella S. Society for Immunotherapy of Cancer clinical and biomarkers data sharing resource document: Volume II-practical challenges. J Immunother Cancer 2020; 8:e001472. [PMID: 33323463 PMCID: PMC7745522 DOI: 10.1136/jitc-2020-001472] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Accepted: 10/06/2020] [Indexed: 01/10/2023] Open
Abstract
The development of strongly predictive validated biomarkers is essential for the field of immuno-oncology (IO) to advance. The highly complex, multifactorial data sets required to develop these biomarkers necessitate effective, responsible data-sharing efforts in order to maximize the scientific knowledge and utility gained from their collection. While the sharing of clinical- and safety-related trial data has already been streamlined to a large extent, the sharing of biomarker-aimed clinical trial derived data and data sets has been met with a number of hurdles that have impaired the progression of biomarkers from hypothesis to clinical use. These hurdles include technical challenges associated with the infrastructure, technology, workforce, and sustainability required for clinical biomarker data sharing. To provide guidance and assist in the navigation of these challenges, the Society for Immunotherapy of Cancer (SITC) Biomarkers Committee convened to outline the challenges that researchers currently face, both at the conceptual level (Volume I) and at the technical level (Volume II). The committee also suggests possible solutions to these problems in the form of professional standards and harmonized requirements for data sharing, assisting in continued progress toward effective, clinically relevant biomarkers in the IO setting.
Affiliation(s)
- Michael A Cannarile
  - Roche Pharmaceutical Research and Early Development Oncology, Roche Innovation Center Munich, Penzberg, Germany
- Sacha Gnjatic
  - Department of Medicine, Tisch Cancer Institute, Icahn School of Medicine, New York, New York, USA
- Bruno Gomes
  - Roche Pharmaceutical Research and Early Development Oncology, Roche Innovation Center, Basel, Switzerland
- Vaios Karanikas
  - Roche Pharmaceutical Research and Early Development Oncology, Roche Innovation Center, Zürich, Switzerland
- Mohan Karkada
  - Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, Massachusetts, USA
- John M Kirkwood
  - Department of Medicine, Division of Hematology/Oncology, University of Pittsburgh School of Medicine and Melanoma Center at UPMC Hillman Cancer Center, Pittsburgh, Pennsylvania, USA
- Beatrix Kotlan
  - National Institute of Oncology, Budapest, Budapest, Hungary
- Els Meeusen
  - CancerProbe Pty Ltd, Prahran, Victoria, Australia
- Anne Monette
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Quebec, Canada
- Aung Naing
  - Department of Investigational Cancer Therapeutics, Division of Cancer Medicine, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Nicholas Tschernia
  - Department of Medicine, Division of Hematology/Oncology, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
- Ena Wang
  - Allogene Therapeutics, South San Francisco, California, USA
- Daniel K Wells
  - Parker Institute for Cancer Immunotherapy, San Francisco, California, USA
- Sergio Rutella
  - John van Geest Cancer Research Centre, Nottingham Trent University, Nottingham, Nottinghamshire, UK
  - Centre for Health, Ageing and Understanding Disease (CHAUD), Nottingham Trent University, Nottingham, Nottinghamshire, UK
7. Wang YD, Li Z, Li FS. Differences in key genes in human alveolar macrophages between phenotypically normal smokers and nonsmokers: diagnostic and prognostic value in lung cancer. International Journal of Clinical and Experimental Pathology 2020; 13:2788-2805. [PMID: 33284895 PMCID: PMC7716130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2020] [Accepted: 09/02/2020] [Indexed: 06/12/2023]
Abstract
OBJECTIVE To explore the effect of smoking on gene expression in human alveolar macrophages and the value of identified key genes in the early diagnosis and prognosis of lung cancers. METHODS We downloaded three data sets (GSE8823, GSE2125, and GSE3212) from the Gene Expression Omnibus (GEO) database, including 31 non-smoking and 33 smoking human alveolar macrophage samples. We identified common differentially expressed genes (DEGs), from which we obtained module genes and hub genes by using STRING and Cytoscape. Then we analyzed the protein-protein interaction (PPI) network of DEGs, hub genes, and module genes and used the DAVID online analysis tool to carry out functional enrichment analysis of DEGs and module genes. RESULTS A total of 85 differentially expressed genes were obtained, including 42 up-regulated genes and 43 down-regulated genes. The Human Protein Atlas and survival analysis showed that GBP1, ITGAM, CSF1, SPP1, COL1A1, LAMB1 and THBS1 may be closely associated with the carcinogenesis and prognosis of lung cancer. CONCLUSION DEGs, module, and hub genes identified in the present study help explain the effects of smoking on human alveolar macrophages and provide candidate targets for diagnosis and treatment of smoking-related lung cancer.
Affiliation(s)
- Yi-De Wang
  - Department of Integrated Pulmonology, Fourth Affiliated Hospital of Xinjiang Medical University, Urumqi 830000, China
- Zheng Li
  - Xinjiang National Clinical Research Base of Traditional Chinese Medicine, Xinjiang Medical University, Urumqi 830000, China
- Feng-Sen Li
  - Xinjiang National Clinical Research Base of Traditional Chinese Medicine, Xinjiang Medical University, Urumqi 830000, China
8. Egli A. Digitalization, clinical microbiology and infectious diseases. Clin Microbiol Infect 2020; 26:1289-1290. [PMID: 32622954 PMCID: PMC7330545 DOI: 10.1016/j.cmi.2020.06.031] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 06/15/2020] [Accepted: 06/20/2020] [Indexed: 01/11/2023]
Affiliation(s)
- A Egli
  - Clinical Bacteriology and Mycology, University Hospital Basel, Basel, Switzerland
  - Applied Microbiology Research, Department of Biomedicine, University of Basel, Basel, Switzerland
9. Tangaro MA, Donvito G, Antonacci M, Chiara M, Mandreoli P, Pesole G, Zambelli F. Laniakea: an open solution to provide Galaxy "on-demand" instances over heterogeneous cloud infrastructures. Gigascience 2020; 9:giaa033. [PMID: 32252069 PMCID: PMC7136032 DOI: 10.1093/gigascience/giaa033] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/10/2020] [Revised: 03/13/2020] [Accepted: 03/17/2020] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND While the popular workflow manager Galaxy is currently made available through several publicly accessible servers, there are scenarios where users can be better served by full administrative control over a private Galaxy instance, including, but not limited to, concerns about data privacy, customisation needs, prioritisation of particular job types, tools development, and training activities. In such cases, a cloud-based Galaxy virtual instance represents an alternative that equips the user with complete control over the Galaxy instance itself without the burden of the hardware and software infrastructure involved in running and maintaining a Galaxy server. RESULTS We present Laniakea, a complete software solution to set up a "Galaxy on-demand" platform as a service. Building on the INDIGO-DataCloud software stack, Laniakea can be deployed over common cloud architectures usually supported both by public and private e-infrastructures. The user interacts with a Laniakea-based service through a simple front-end that allows a general setup of a Galaxy instance, and then Laniakea takes care of the automatic deployment of the virtual hardware and the software components. At the end of the process, the user gains access with full administrative privileges to a private, production-grade, fully customisable, Galaxy virtual instance and to the underlying virtual machine (VM). Laniakea features deployment of single-server or cluster-backed Galaxy instances, sharing of reference data across multiple instances, data volume encryption, and support for VM image-based, Docker-based, and Ansible recipe-based Galaxy deployments. A Laniakea-based Galaxy on-demand service, named Laniakea@ReCaS, is currently hosted at the ELIXIR-IT ReCaS cloud facility. CONCLUSIONS Laniakea offers to scientific e-infrastructures a complete and easy-to-use software solution to provide a Galaxy on-demand service to their users. 
Laniakea-based cloud services will help in making Galaxy more accessible to a broader user base by removing most of the burdens involved in deploying and running a Galaxy service. In turn, this will facilitate the adoption of Galaxy in scenarios where classic public instances do not represent an optimal solution. Finally, the implementation of Laniakea can be easily adapted and expanded to support different services and platforms beyond Galaxy.
Affiliation(s)
- Marco Antonio Tangaro
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126 Bari, Italy
- Giacinto Donvito
  - National Institute for Nuclear Physics (INFN), Section of Bari, Via Orabona 4, 70126 Bari, Italy
- Marica Antonacci
  - National Institute for Nuclear Physics (INFN), Section of Bari, Via Orabona 4, 70126 Bari, Italy
- Matteo Chiara
  - Department of Biosciences, University of Milan, via Celoria 26, 20133 Milano, Italy
- Pietro Mandreoli
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126 Bari, Italy
  - Department of Biosciences, University of Milan, via Celoria 26, 20133 Milano, Italy
- Graziano Pesole
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126 Bari, Italy
  - Department of Biosciences, Biotechnologies and Biopharmaceutics, University of Bari, Via Orabona 4, 70126 Bari, Italy
- Federico Zambelli
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126 Bari, Italy
  - Department of Biosciences, University of Milan, via Celoria 26, 20133 Milano, Italy
10. Li F, Wang Y, Li C, Marquez-Lago TT, Leier A, Rawlings ND, Haffari G, Revote J, Akutsu T, Chou KC, Purcell AW, Pike RN, Webb GI, Ian Smith A, Lithgow T, Daly RJ, Whisstock JC, Song J. Twenty years of bioinformatics research for protease-specific substrate and cleavage site prediction: a comprehensive revisit and benchmarking of existing methods. Brief Bioinform 2019; 20:2150-2166. [PMID: 30184176 PMCID: PMC6954447 DOI: 10.1093/bib/bby077] [Citation(s) in RCA: 58] [Impact Index Per Article: 9.7] [Received: 06/14/2018] [Revised: 07/26/2018] [Accepted: 08/01/2018] [Indexed: 01/06/2023] Open
Abstract
The roles of proteolytic cleavage have been intensively investigated and discussed during the past two decades. This irreversible chemical process has been frequently reported to influence a number of crucial biological processes (BPs), such as cell cycle, protein regulation and inflammation. A number of advanced studies have been published aiming at deciphering the mechanisms of proteolytic cleavage. Given its significance and the large number of functionally enriched substrates targeted by specific proteases, many computational approaches have been established for accurate prediction of protease-specific substrates and their cleavage sites. Consequently, there is an urgent need to systematically assess the state-of-the-art computational approaches for protease-specific cleavage site prediction to further advance the existing methodologies and to improve the prediction performance. With this goal in mind, in this article, we carefully evaluated a total of 19 computational methods (including 8 scoring function-based methods and 11 machine learning-based methods) in terms of their underlying algorithm, calculated features, performance evaluation and software usability. Then, extensive independent tests were performed to assess the robustness and scalability of the reviewed methods using our carefully prepared independent test data sets with 3641 cleavage sites (specific to 10 proteases). The comparative experimental results demonstrate that PROSPERous is the most accurate generic method for predicting eight protease-specific cleavage sites, while GPS-CCD and LabCaS outperformed other predictors for calpain-specific cleavage sites. Based on our review, we then outlined some potential ways to improve the prediction performance and ease the computational burden by applying ensemble learning, deep learning, positive unlabeled learning and parallel and distributed computing techniques. 
We anticipate that our study will serve as a practical and useful guide for interested readers to further advance next-generation bioinformatics tools for protease-specific cleavage site prediction.
Affiliation(s)
- Fuyi Li
  - Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Yanan Wang
  - Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
- Chen Li
  - Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zürich, Zürich 8093, Switzerland
- Tatiana T Marquez-Lago
  - Department of Genetics and Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
- André Leier
  - Department of Genetics and Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
- Neil D Rawlings
  - EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Gholamreza Haffari
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Jerico Revote
  - Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
- Kuo-Chen Chou
  - Gordon Life Science Institute, Boston, MA 02478, USA
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Anthony W Purcell
  - Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Robert N Pike
  - La Trobe Institute for Molecular Science, La Trobe University, Melbourne, VIC 3086, Australia
  - ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
- Geoffrey I Webb
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- A Ian Smith
  - Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
- Trevor Lithgow
  - Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, Victoria 3800, Australia
- Roger J Daly
  - Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- James C Whisstock
  - Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
  - ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
11. Svensson D, Sjögren R, Sundell D, Sjödin A, Trygg J. doepipeline: a systematic approach to optimizing multi-level and multi-step data processing workflows. BMC Bioinformatics 2019; 20:498. [PMID: 31615395 PMCID: PMC6794737 DOI: 10.1186/s12859-019-3091-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/21/2018] [Accepted: 09/10/2019] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Selecting the proper parameter settings for bioinformatic software tools is challenging. Not only will each parameter have an individual effect on the outcome, but there are also potential interaction effects between parameters. Both of these effects may be difficult to predict. To make the situation even more complex, multiple tools may be run in a sequential pipeline where the final output depends on the parameter configuration for each tool in the pipeline. Because of the complexity and difficulty of predicting outcomes, in practice parameters are often left at default settings or set based on personal or peer experience obtained in a trial-and-error fashion. To allow for the reliable and efficient selection of parameters for bioinformatic pipelines, a systematic approach is needed. RESULTS We present doepipeline, a novel approach to optimizing bioinformatic software parameters, based on core concepts of the Design of Experiments methodology and recent advances in subset designs. Optimal parameter settings are first approximated in a screening phase using a subset design that efficiently spans the entire search space, then optimized in the subsequent phase using response surface designs and OLS modeling. Doepipeline was used to optimize parameters in four use cases: 1) de-novo assembly, 2) scaffolding of a fragmented genome assembly, 3) k-mer taxonomic classification of Oxford Nanopore Technologies MinION reads, and 4) genetic variant calling. In all four cases, doepipeline found parameter settings that produced a better outcome with respect to the characteristic measured when compared to using default values. Our approach is implemented and available in the Python package doepipeline. CONCLUSIONS Our proposed methodology provides a systematic and robust framework for optimizing software parameter settings, in contrast to labor- and time-intensive manual parameter tweaking.
Implementation in doepipeline makes our methodology accessible and user-friendly, and allows for automatic optimization of tools in a wide range of cases. The source code of doepipeline is available at https://github.com/clicumu/doepipeline and it can be installed through conda-forge.
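The optimization phase described in this abstract (response surface designs with OLS modeling) can be illustrated with a minimal, dependency-free sketch. It is not the doepipeline API: it assumes a single tool parameter, a quadratic response model, and synthetic noise-free responses, and the helper names `solve` and `fit_quadratic` are hypothetical. Real response surface designs typically cover several factors and their interactions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_quadratic(xs, ys):
    """OLS fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    design = [[1.0, x, x * x] for x in xs]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(3)]
    return solve(xtx, xty)


# Hypothetical responses measured at five settings of one tool parameter.
settings = [1.0, 2.0, 3.0, 4.0, 5.0]
responses = [-(x - 3.5) ** 2 + 10.0 for x in settings]  # synthetic, noise-free

b0, b1, b2 = fit_quadratic(settings, responses)
best_setting = -b1 / (2 * b2)  # stationary point of the fitted parabola
print(round(best_setting, 3))
```

Because the fitted curvature `b2` is negative here, the stationary point is a maximum; with noisy measurements one would also check the fit quality before trusting the predicted optimum.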
Affiliation(s)
- Daniel Svensson
  - Department of Chemistry, Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden
- Rickard Sjögren
  - Department of Chemistry, Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden
  - Corporate Research, Sartorius AG, Umeå, Sweden
- David Sundell
  - Division of CBRN Security and Defence, FOI - Swedish Defence Research Agency, Umeå, Sweden
- Andreas Sjödin
  - Division of CBRN Security and Defence, FOI - Swedish Defence Research Agency, Umeå, Sweden
- Johan Trygg
  - Department of Chemistry, Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden
  - Corporate Research, Sartorius AG, Umeå, Sweden