1
Sheehan K, Jeon H, Corr SC, Hayes JM, Mok KH. Antibody Aggregation: A Problem Within the Biopharmaceutical Industry and Its Role in AL Amyloidosis Disease. Protein J 2025; 44:1-20. [PMID: 39527351 DOI: 10.1007/s10930-024-10237-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/24/2024] [Indexed: 11/16/2024]
Abstract
Due to the large size and rapid growth of the global therapeutic antibody market, there is major interest in understanding the aggregation of protein products as it can compromise efficacy, concentration, and safety. Various production and storage conditions have been identified as capable of inducing aggregation of polyclonal and monoclonal antibody (mAb) therapies such as low pH, freezing, light exposure, lyophilisation and increased ionic strength. The addition of stabilising excipients to these therapeutics helps to combat the formation of aggregates with future aggregation inhibition mechanisms involving the introduction of point mutations and glycoengineering within aggregation prone regions (APRs). Antibody aggregation also plays an integral role in the pathogenesis of a condition known as amyloid light chain (AL) amyloidosis which is characterised by the production of improperly folded and amyloidogenic immunoglobulin light chains (LCs). Current diagnostic tools rely heavily on histological staining with their future moving towards amyloid component identification and proteomic analysis. For many years, treatment options designed for multiple myeloma (MM) have been applied to AL amyloidosis patients by depleting plasma cell numbers. More recently, treatment strategies more specific to this condition have been developed with many designed to recognize amyloid fibrils and trigger their degradation without causing systemic plasma cell cytotoxicity. Amyloid fibrils in AL disease and aggregates in antibody therapeutics are both formed through the oligomerisation of misfolded / modified proteins attempting to reach a thermodynamically stable, free energy minimum that is lower than the respective monomers themselves. 
Although the final morphologies are different, by understanding the principles underlying such aggregation, we expect to find common insights that may contribute to the development of new and effective methods of antibody aggregation and/or amyloidosis management. We envision that this area of research will continue to be very relevant in both industry and clinical settings.
Affiliation(s)
- Kate Sheehan
  - Trinity Biomedical Sciences Institute (TBSI), School of Biochemistry & Immunology, Trinity College Dublin, The University of Dublin, Dublin 2, Ireland
  - School of Genetics & Microbiology, Trinity College Dublin, The University of Dublin, Dublin 2, Ireland
- Hyesoo Jeon
  - Trinity Biomedical Sciences Institute (TBSI), School of Biochemistry & Immunology, Trinity College Dublin, The University of Dublin, Dublin 2, Ireland
  - Lonza Biologics Tuas Pte. Ltd., 35 Tuas South Ave 6, Singapore, 637377, Republic of Singapore
- Sinéad C Corr
  - School of Genetics & Microbiology, Trinity College Dublin, The University of Dublin, Dublin 2, Ireland
  - School of Microbiology and APC Microbiome Ireland, University College Cork, Cork, Ireland
- Jerrard M Hayes
  - Trinity Biomedical Sciences Institute (TBSI), School of Biochemistry & Immunology, Trinity College Dublin, The University of Dublin, Dublin 2, Ireland
- K H Mok
  - Trinity Biomedical Sciences Institute (TBSI), School of Biochemistry & Immunology, Trinity College Dublin, The University of Dublin, Dublin 2, Ireland
  - Centre for Research on Adaptive Nanostructures and Nanodevices (CRANN), Trinity College Dublin, The University of Dublin, Dublin 2, Ireland
2
Pir MS, Timucin E. AFFIPred: AlphaFold2 structure-based Functional Impact Prediction of missense variations. Protein Sci 2025; 34:e70030. [PMID: 39840793 PMCID: PMC11751861 DOI: 10.1002/pro.70030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2024] [Revised: 12/23/2024] [Accepted: 12/24/2024] [Indexed: 01/23/2025]
Abstract
Protein structure holds immense potential for pathogenicity prediction, although structure-based predictors are limited compared with their sequence-based counterparts due to the "structure knowledge gap" between the large number of available protein sequences and the relatively limited number of structures. Leveraging the highly accurate protein structures predicted by AlphaFold2 (AF2), we introduce AFFIPred, an ensemble machine learning classifier that combines sequence and AF2-based structural characteristics to predict missense variant pathogenicity. Based on the assessments on unseen datasets, AFFIPred reached a comparable level of performance with the state-of-the-art predictors such as AlphaMissense. We also showed that the recruitment of AF2 structures that are full-length and represent the unbound states ensures more precise SASA calculations compared to the recruitment of experimental structures. In line with the completeness of the AF2 structures, their use provides a more comprehensive view of the structural characteristics of the missense variation datasets by capturing all variants. AFFIPred maintains high-level accuracy without the limitations of PDB-based classifiers. AFFIPred has predicted over 210 million variations of the human proteome, which are accessible at https://affipred.timucinlab.com/.
Affiliation(s)
- Mustafa S Pir
  - Department of Biostatistics and Bioinformatics, Institute of Health Sciences, Acibadem University, Atasehir, Istanbul, Turkey
- Emel Timucin
  - Department of Biostatistics and Bioinformatics, Institute of Health Sciences, Acibadem University, Atasehir, Istanbul, Turkey
  - Department of Biostatistics and Medical Informatics, School of Medicine, Acibadem University, Atasehir, Istanbul, Turkey
3
Chernigovskaya M, Pavlović M, Kanduri C, Gielis S, Robert P, Scheffer L, Slabodkin A, Haff IH, Meysman P, Yaari G, Sandve GK, Greiff V. Simulation of adaptive immune receptors and repertoires with complex immune information to guide the development and benchmarking of AIRR machine learning. Nucleic Acids Res 2025; 53:gkaf025. [PMID: 39873270 PMCID: PMC11773363 DOI: 10.1093/nar/gkaf025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2023] [Accepted: 01/25/2025] [Indexed: 01/30/2025] Open
Abstract
Machine learning (ML) has shown great potential in the adaptive immune receptor repertoire (AIRR) field. However, there is a lack of large-scale ground-truth experimental AIRR data suitable for AIRR-ML-based disease diagnostics and therapeutics discovery. Simulated ground-truth AIRR data are required to complement the development and benchmarking of robust and interpretable AIRR-ML methods where experimental data is currently inaccessible or insufficient. The challenge for simulated data to be useful is incorporating key features observed in experimental repertoires. These features, such as antigen or disease-associated immune information, cause AIRR-ML problems to be challenging. Here, we introduce LIgO, a software suite, which simulates AIRR data for the development and benchmarking of AIRR-ML methods. LIgO incorporates different types of immune information both on the receptor and the repertoire level and preserves native-like generation probability distribution. Additionally, LIgO assists users in determining the computational feasibility of their simulations. We show two examples where LIgO supports the development and validation of AIRR-ML methods: (i) how individuals carrying out-of-distribution immune information impacts receptor-level prediction performance and (ii) how immune information co-occurring in the same AIRs impacts the performance of conventional receptor-level encoding and repertoire-level classification approaches. LIgO guides the advancement and assessment of interpretable AIRR-ML methods.
Affiliation(s)
- Maria Chernigovskaya
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, 0372, Norway
- Milena Pavlović
  - Department of Informatics, University of Oslo, Oslo, 0373, Norway
  - UiO:RealArt Convergence Environment, University of Oslo, Oslo, 0373, Norway
- Chakravarthi Kanduri
  - Department of Informatics, University of Oslo, Oslo, 0373, Norway
  - UiO:RealArt Convergence Environment, University of Oslo, Oslo, 0373, Norway
- Sofie Gielis
  - Department of Mathematics and Computer Science, University of Antwerp, Antwerp, 2020, Belgium
- Philippe A Robert
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, 0372, Norway
  - Department of Biomedicine, University of Basel, Basel, 4031, Switzerland
- Lonneke Scheffer
  - Department of Informatics, University of Oslo, Oslo, 0373, Norway
- Andrei Slabodkin
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, 0372, Norway
- Pieter Meysman
  - Department of Mathematics and Computer Science, University of Antwerp, Antwerp, 2020, Belgium
- Gur Yaari
  - Faculty of Engineering, Bar-Ilan University, Ramat Gan, 5290002, Israel
- Geir Kjetil Sandve
  - Department of Informatics, University of Oslo, Oslo, 0373, Norway
  - UiO:RealArt Convergence Environment, University of Oslo, Oslo, 0373, Norway
- Victor Greiff
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, 0372, Norway
4
Gillani M, Pollastri G. Protein subcellular localization prediction tools. Comput Struct Biotechnol J 2024; 23:1796-1807. [PMID: 38707539 PMCID: PMC11066471 DOI: 10.1016/j.csbj.2024.04.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Revised: 04/11/2024] [Accepted: 04/11/2024] [Indexed: 05/07/2024] Open
Abstract
Protein subcellular localization prediction is of great significance in bioinformatics and biological research. Because most proteins do not have experimentally determined localization information, computational prediction methods and tools have been an active research area for more than two decades. Knowledge of the subcellular location of a protein provides valuable information about its functionalities, the functioning of the cell, and other possible interactions with proteins. Fast, reliable, and accurate predictors provide platforms to harness the abundance of sequence data to predict subcellular locations accordingly. During the last decade, there has been a considerable amount of research effort aimed at developing subcellular localization predictors. This paper reviews recent subcellular localization prediction tools in the Eukaryotic, Prokaryotic, and Virus-based categories followed by a detailed analysis. Each predictor is discussed based on its main features, strengths, weaknesses, algorithms used, prediction techniques, and analysis. This review is supported by prediction tools taxonomies that highlight their relevant area and examples for uncomplicated categorization and ease of understanding. These taxonomies help users find suitable tools according to their needs. Furthermore, recent research gaps and challenges are discussed to cover areas that need the utmost attention. This survey provides an in-depth analysis of the most recent prediction tools to facilitate readers and can be considered a quick guide for researchers to identify and explore the recent literature advancements.
Affiliation(s)
- Maryam Gillani
  - School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
- Gianluca Pollastri
  - School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
5
Wossnig L, Furtmann N, Buchanan A, Kumar S, Greiff V. Best practices for machine learning in antibody discovery and development. Drug Discov Today 2024; 29:104025. [PMID: 38762089 DOI: 10.1016/j.drudis.2024.104025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 04/25/2024] [Accepted: 05/13/2024] [Indexed: 05/20/2024]
Abstract
In the past 40 years, therapeutic antibody discovery and development have advanced considerably, with machine learning (ML) offering a promising way to speed up the process by reducing costs and the number of experiments required. Recent progress in ML-guided antibody design and development (D&D) has been hindered by the diversity of data sets and evaluation methods, which makes it difficult to conduct comparisons and assess utility. Establishing standards and guidelines will be crucial for the wider adoption of ML and the advancement of the field. This perspective critically reviews current practices, highlights common pitfalls and proposes method development and evaluation guidelines for various ML-based techniques in therapeutic antibody D&D. Addressing challenges across the ML process, best practices are recommended for each stage to enhance reproducibility and progress.
Affiliation(s)
- Leonard Wossnig
  - LabGenius Ltd, The Biscuit Factory, 100 Drummond Road, London SE16 4DG, UK
  - Department of Computer Science, University College London, 66-72 Gower St, London WC1E 6EA, UK
- Norbert Furtmann
  - R&D Large Molecules Research Platform, Sanofi Deutschland GmbH, Industriepark Höchst, Frankfurt Am Main, Germany
- Andrew Buchanan
  - Biologics Engineering, R&D, AstraZeneca, Cambridge CB2 0AA, UK
- Sandeep Kumar
  - Computational Protein Design and Modeling Group, Computational Science, Moderna Therapeutics, 200 Technology Square, Cambridge, MA 02139, USA
- Victor Greiff
  - Department of Immunology and Oslo University Hospital, University of Oslo, Oslo, Norway
6
Ndochinwa OG, Wang QY, Amadi OC, Nwagu TN, Nnamchi CI, Okeke ES, Moneke AN. Current status and emerging frontiers in enzyme engineering: An industrial perspective. Heliyon 2024; 10:e32673. [PMID: 38912509 PMCID: PMC11193041 DOI: 10.1016/j.heliyon.2024.e32673] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 06/05/2024] [Accepted: 06/06/2024] [Indexed: 06/25/2024] Open
Abstract
Protein engineering mechanisms can be an efficient approach to enhance the biochemical properties of various biocatalysts. Immobilization of biocatalysts and the introduction of new-to-nature chemical reactivities are also possible through the same mechanism. Discovering new protocols that yield catalytically active proteins that are novel in being stable, active, and stereoselective can be identified as an essential area of concurrent bioorganic chemistry (the synergistic relationship between organic chemistry and biochemistry in the context of enzyme engineering). However, with our current level of knowledge about protein folding and its correlation with protein conformation and activities, it is almost impossible to design proteins with specific biological and physical properties. Hence, contemporary protein engineering typically involves reprogramming existing enzymes by mutagenesis to generate new phenotypes with desired properties. These processes ensure that limitations of naturally occurring enzymes are not encountered. For example, researchers have engineered cellulases and hemicellulases to withstand harsh conditions encountered during biomass pretreatment, such as high temperatures and acidic environments. By enhancing the activity and robustness of these enzymes, biofuel production becomes more economically viable and environmentally sustainable. Recent trends in enzyme engineering have enabled the development of tailored biocatalysts for pharmaceutical applications. For instance, researchers have engineered enzymes such as cytochrome P450s and amine oxidases to catalyze challenging reactions involved in drug synthesis. In addition to conventional methods, there has been an increasing application of machine learning techniques to identify patterns in data. 
These patterns are then used to predict protein structures, enhance enzyme solubility, stability, and function, forecast substrate specificity, and assist in rational protein design. In this review, we discussed recent trends in enzyme engineering to optimize the biochemical properties of various biocatalysts. Using examples relevant to biotechnology in engineering enzymes, we try to expatiate the significance of enzyme engineering with how these methods could be applied to optimize the biochemical properties of a naturally occurring enzyme.
Affiliation(s)
- Obinna Giles Ndochinwa
  - Department of Microbiology, Faculty of Biological Science, University of Nigeria, Nsukka, Nigeria
- Qing-Yan Wang
  - State Key Laboratory of Biomass Enzyme Technology, National Engineering Research Center for Non-Food Biorefinery, Guangxi Academy of Sciences, Nanning, Guangxi, China
- Oyetugo Chioma Amadi
  - Department of Microbiology, Faculty of Biological Science, University of Nigeria, Nsukka, Nigeria
- Tochukwu Nwamaka Nwagu
  - Department of Microbiology, Faculty of Biological Science, University of Nigeria, Nsukka, Nigeria
- Emmanuel Sunday Okeke
  - Department of Biochemistry, Faculty of Biological Sciences & Natural Science Unit, School of General Studies, University of Nigeria, Nsukka, Enugu State, 410001, Nigeria
  - Institute of Environmental Health and Ecological Security, School of the Environment and Safety, Jiangsu University, 301 Xuefu Rd., 212013, Zhenjiang, Jiangsu, China
- Anene Nwabu Moneke
  - Department of Microbiology, Faculty of Biological Science, University of Nigeria, Nsukka, Nigeria
7
Armah-Sekum RE, Szedmak S, Rousu J. Protein function prediction through multi-view multi-label latent tensor reconstruction. BMC Bioinformatics 2024; 25:174. [PMID: 38698340 PMCID: PMC11067221 DOI: 10.1186/s12859-024-05789-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Accepted: 04/17/2024] [Indexed: 05/05/2024] Open
Abstract
BACKGROUND In the last two decades, the use of high-throughput sequencing technologies has accelerated the pace of discovery of proteins. However, due to the time and resource limitations of rigorous experimental functional characterization, the functions of a vast majority of them remain unknown. As a result, computational methods offering accurate, fast and large-scale assignment of functions to new and previously unannotated proteins are sought after. Leveraging the underlying associations between the multiplicity of features that describe proteins could reveal functional insights into the diverse roles of proteins and improve performance on the automatic function prediction task. RESULTS We present GO-LTR, a multi-view multi-label prediction model that relies on a high-order tensor approximation of model weights combined with non-linear activation functions. The model is capable of learning high-order relationships between multiple input views representing the proteins and predicting high-dimensional multi-label output consisting of protein functional categories. We demonstrate the competitiveness of our method on various performance measures. Experiments show that GO-LTR learns polynomial combinations between different protein features, resulting in improved performance. Additional investigations establish GO-LTR's practical potential in assigning functions to proteins under diverse challenging scenarios: very low sequence similarity to previously observed sequences, rarely observed and highly specific terms in the gene ontology. IMPLEMENTATION The code and data used for training GO-LTR are available at https://github.com/aalto-ics-kepaco/GO-LTR-prediction.
Affiliation(s)
- Robert Ebo Armah-Sekum
  - Department of Computer Science, Aalto University, Konemiehentie 2, 02150, Espoo, Finland
- Sandor Szedmak
  - Department of Computer Science, Aalto University, Konemiehentie 2, 02150, Espoo, Finland
- Juho Rousu
  - Department of Computer Science, Aalto University, Konemiehentie 2, 02150, Espoo, Finland
8
Draizen EJ, Readey J, Mura C, Bourne PE. Prop3D: A flexible, Python-based platform for machine learning with protein structural properties and biophysical data. BMC Bioinformatics 2024; 25:11. [PMID: 38177985 PMCID: PMC10768222 DOI: 10.1186/s12859-023-05586-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Accepted: 11/27/2023] [Indexed: 01/06/2024] Open
Abstract
BACKGROUND Machine learning (ML) has a rich history in structural bioinformatics, and modern approaches, such as deep learning, are revolutionizing our knowledge of the subtle relationships between biomolecular sequence, structure, function, dynamics and evolution. As with any advance that rests upon statistical learning approaches, the recent progress in biomolecular sciences is enabled by the availability of vast volumes of sufficiently-variable data. To be useful, such data must be well-structured, machine-readable, intelligible and manipulable. These and related requirements pose challenges that become especially acute at the computational scales typical in ML. Furthermore, in structural bioinformatics such data generally relate to protein three-dimensional (3D) structures, which are inherently more complex than sequence-based data. A significant and recurring challenge concerns the creation of large, high-quality, openly-accessible datasets that can be used for specific training and benchmarking tasks in ML pipelines for predictive modeling projects, along with reproducible splits for training and testing. RESULTS Here, we report 'Prop3D', a platform that allows for the creation, sharing and extensible reuse of libraries of protein domains, featurized with biophysical and evolutionary properties that can range from detailed, atomically-resolved physicochemical quantities (e.g., electrostatics) to coarser, residue-level features (e.g., phylogenetic conservation). As a community resource, we also supply a 'Prop3D-20sf' protein dataset, obtained by applying our approach to CATH. We have developed and deployed the Prop3D framework, both in the cloud and on local HPC resources, to systematically and reproducibly create comprehensive datasets via the Highly Scalable Data Service (HSDS). Our datasets are freely accessible via a public HSDS instance, or they can be used with accompanying Python wrappers for popular ML frameworks.
CONCLUSION Prop3D and its associated Prop3D-20sf dataset can be of broad utility in at least three ways. Firstly, the Prop3D workflow code can be customized and deployed on various cloud-based compute platforms, with scalability achieved largely by saving the results to distributed HDF5 files via HSDS. Secondly, the linked Prop3D-20sf dataset provides a hand-crafted, already-featurized dataset of protein domains for 20 highly-populated CATH families; importantly, provision of this pre-computed resource can aid the more efficient development (and reproducible deployment) of ML pipelines. Thirdly, Prop3D-20sf's construction explicitly takes into account (in creating datasets and data-splits) the enigma of 'data leakage', stemming from the evolutionary relationships between proteins.
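The 'data leakage' concern in the abstract's third point can be illustrated with a group-aware split, in which all domains that share an evolutionary group land on the same side of the train/test boundary. The following is a minimal sketch under stated assumptions, not Prop3D's actual procedure: the function name `group_split`, the domain and group labels, and the hash-based assignment are all hypothetical and chosen only for illustration.

```python
import hashlib


def group_split(items, test_fraction=0.2):
    """Split (domain_id, group_id) pairs so that no group spans train and test.

    Each group is assigned by a stable hash of its id, so the split is
    reproducible across runs and machines without storing a random seed.
    """
    train, test = [], []
    for domain_id, group_id in items:
        digest = hashlib.sha256(group_id.encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number between 0 and 1.
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test if bucket < test_fraction else train).append((domain_id, group_id))
    return train, test


# Hypothetical domains, each labelled with a hypothetical evolutionary group.
domains = [("d1", "groupA"), ("d2", "groupA"), ("d3", "groupB"),
           ("d4", "groupC"), ("d5", "groupC"), ("d6", "groupD")]
train, test = group_split(domains)
shared = {g for _, g in train} & {g for _, g in test}
print("groups present in both train and test:", shared)
```

Because whole groups move together, the realized test fraction is only approximate; with few or very unevenly sized groups it can differ noticeably from `test_fraction`.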
Affiliation(s)
- Eli J Draizen
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
- Cameron Mura
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
- Philip E Bourne
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
9
Attafi OA, Clementel D, Kyritsis K, Capriotti E, Farrell G, Fragkouli SC, Castro LJ, Hatos A, Lenaerts T, Mazurenko S, Mozaffari S, Pradelli F, Ruch P, Savojardo C, Turina P, Zambelli F, Piovesan D, Monzon AM, Psomopoulos F, Tosatto SCE. DOME Registry: implementing community-wide recommendations for reporting supervised machine learning in biology. Gigascience 2024; 13:giae094. [PMID: 39661723 PMCID: PMC11633452 DOI: 10.1093/gigascience/giae094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2024] [Revised: 10/22/2024] [Accepted: 10/27/2024] [Indexed: 12/13/2024] Open
Abstract
Supervised machine learning (ML) is used extensively in biology and deserves closer scrutiny. The Data Optimization Model Evaluation (DOME) recommendations aim to enhance the validation and reproducibility of ML research by establishing standards for key aspects such as data handling and processing, optimization, evaluation, and model interpretability. The recommendations help to ensure that key details are reported transparently by providing a structured set of questions. Here, we introduce the DOME registry (URL: registry.dome-ml.org), a database that allows scientists to manage and access comprehensive DOME-related information on published ML studies. The registry uses external resources like ORCID, APICURON, and the Data Stewardship Wizard to streamline the annotation process and ensure comprehensive documentation. By assigning unique identifiers and DOME scores to publications, the registry fosters a standardized evaluation of ML methods. Future plans include continuing to grow the registry through community curation, improving the DOME score definition and encouraging publishers to adopt DOME standards, and promoting transparency and reproducibility of ML in the life sciences.
Affiliation(s)
- Damiano Clementel
  - Department of Biomedical Sciences, University of Padova, Padova 35131, Italy
- Konstantinos Kyritsis
  - Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki 570 01, Greece
- Emidio Capriotti
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40126, Italy
- Styliani-Christina Fragkouli
  - Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki 570 01, Greece
  - Department of Biology, National and Kapodistrian University of Athens, Athens 157 72, Greece
- András Hatos
  - Department of Oncology, Geneva University Hospitals, Geneva 1205, Switzerland
  - Department of Computational Biology, University of Lausanne, Lausanne 1015, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
  - Swiss Cancer Center Léman, Lausanne 1015, Switzerland
- Tom Lenaerts
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, Brussels 1050, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Brussels 1050, Belgium
  - Artificial Intelligence Laboratory, Vrije Universiteit Brussels, Brussels 1050, Belgium
- Stanislav Mazurenko
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Brno 62500, Czech Republic
  - Masaryk University, Czech Republic International Clinical Research Centre, St. Anne’s Hospital, Brno 65690, Czech Republic
- Soroush Mozaffari
  - Department of Biomedical Sciences, University of Padova, Padova 35131, Italy
- Franco Pradelli
  - Department of Biomedical Sciences, University of Padova, Padova 35131, Italy
- Patrick Ruch
  - HES-SO–HEG Geneva, Geneva 1227, Switzerland
  - SIB Swiss Institute of Bioinformatics, Geneva 1206, Switzerland
- Castrense Savojardo
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40126, Italy
- Paola Turina
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40126, Italy
- Federico Zambelli
  - Department of Biosciences, University of Milan, Milan 20133, Italy
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies (IBIOM), Bari 70126, Italy
- Damiano Piovesan
  - Department of Biomedical Sciences, University of Padova, Padova 35131, Italy
- Fotis Psomopoulos
  - Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki 570 01, Greece
- Silvio C E Tosatto
  - Department of Biomedical Sciences, University of Padova, Padova 35131, Italy
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies (IBIOM), Bari 70126, Italy
10
Zhao XJG, Cao H. Linking research of biomedical datasets. Brief Bioinform 2022; 23:6712704. [PMID: 36151775 DOI: 10.1093/bib/bbac373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2022] [Revised: 08/03/2022] [Accepted: 08/08/2022] [Indexed: 12/14/2022] Open
Abstract
Biomedical data preprocessing and efficient computing can be as important as the statistical methods used to fit the data; data processing needs to consider application scenarios, data acquisition and individual rights and interests. We review common principles, knowledge and methods of integrated research according to the whole-pipeline processing mechanism: diverse, coherent, sharing, auditable and ecological. First, neuromorphic and native algorithms integrate diverse datasets, providing linear scalability and high visualization. Second, the choice mechanism of different preprocessing, analysis and transaction methods from raw to neuromorphic was summarized on the node and coordinator platforms. Third, the combination of node, network, cloud, edge, swarm and graph builds an ecosystem of cohort integrated research and clinical diagnosis and treatment. Looking forward, it is vital to simultaneously combine deep computing, mass data storage and massively parallel communication.
Affiliation(s)
- Xiu-Ju George Zhao
  - Wuhan Institute of Physics and Mathematics (WIPM), China
  - Wuhan Polytechnic University, China
- Hui Cao
  - Wuhan Polytechnic University, China
11
Behrendt A, Golchin P, König F, Mulnaes D, Stalke A, Dröge C, Keitel V, Gohlke H. Vasor: Accurate prediction of variant effects for amino acid substitutions in multidrug resistance protein 3. Hepatol Commun 2022; 6:3098-3111. [PMID: 36111625 PMCID: PMC9592774 DOI: 10.1002/hep4.2088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2022] [Revised: 07/26/2022] [Accepted: 08/16/2022] [Indexed: 12/14/2022] Open
Abstract
The phosphatidylcholine floppase multidrug resistance protein 3 (MDR3) is an essential hepatobiliary transport protein. MDR3 dysfunction is associated with various liver diseases, ranging from severe progressive familial intrahepatic cholestasis to transient forms of intrahepatic cholestasis of pregnancy and familial gallstone disease. Single amino acid substitutions are often found as causative of dysfunction, but identifying the substitution effect in in vitro studies is time and cost intensive. We developed variant assessor of MDR3 (Vasor), a machine learning-based model to classify novel MDR3 missense variants into the categories benign or pathogenic. Vasor was trained on the largest data set to date that is specific for benign and pathogenic variants of MDR3 and uses general predictors, namely Evolutionary Models of Variant Effects (EVE), EVmutation, PolyPhen-2, I-Mutant2.0, MUpro, MAESTRO, and PON-P2 along with other variant properties, such as half-sphere exposure and posttranslational modification site, as input. Vasor consistently outperformed the integrated general predictors and the external prediction tool MutPred2, leading to the current best prediction performance for MDR3 single-site missense variants (on an external test set: F1-score, 0.90; Matthews correlation coefficient, 0.80). Furthermore, Vasor predictions cover the entire sequence space of MDR3. Vasor is accessible as a webserver at https://cpclab.uni-duesseldorf.de/mdr3_predictor/ for users to rapidly obtain prediction results and a visualization of the substitution site within the MDR3 structure. The MDR3-specific prediction tool Vasor can provide reliable predictions of single-site amino acid substitutions, giving users a fast way to initially assess whether a variant is benign or pathogenic.
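Both metrics quoted in the abstract, the F1-score and the Matthews correlation coefficient (MCC), are derived from a binary confusion matrix. The following is a minimal sketch of the standard formulas, not Vasor's evaluation code: the function names and the example counts are hypothetical, and treating "pathogenic" as the positive class is an assumption made for illustration.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Correlation between predicted and true binary labels, in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for illustration only (not taken from the paper).
tp, tn, fp, fn = 8, 7, 3, 2
print(f"F1  = {f1_score(tp, fp, fn):.3f}")
print(f"MCC = {matthews_corrcoef(tp, tn, fp, fn):.3f}")
```

Unlike F1, MCC uses all four cells of the confusion matrix, including true negatives, which is one reason the two are often reported together for benign-versus-pathogenic classifiers.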
Affiliation(s)
- Annika Behrendt
- Institute for Pharmaceutical and Medicinal Chemistry, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Pegah Golchin
- Department of Electrical Engineering and Information Technology, Technische Universität Darmstadt, Darmstadt, Germany
| | - Filip König
- Institute for Pharmaceutical and Medicinal Chemistry, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Daniel Mulnaes
- Institute for Pharmaceutical and Medicinal Chemistry, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Amelie Stalke
- Department of Human Genetics, Hannover Medical School, Hannover, Germany
- Division of Kidney, Department of Pediatric Gastroenterology and Hepatology, Liver, and Metabolic Diseases, Hannover Medical School, Hannover, Germany
| | - Carola Dröge
- Department for Gastroenterology, Hepatology, and Infectious Diseases, Medical Faculty, Otto von Guericke University, Magdeburg, Germany
- Department for Gastroenterology, Hepatology, and Infectious Diseases, University Hospital, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Verena Keitel
- Department for Gastroenterology, Hepatology, and Infectious Diseases, Medical Faculty, Otto von Guericke University, Magdeburg, Germany
- Department for Gastroenterology, Hepatology, and Infectious Diseases, University Hospital, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Holger Gohlke
- Institute for Pharmaceutical and Medicinal Chemistry, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- John‐von‐Neumann‐Institute for Computing, Jülich Supercomputing Center, Institute of Biological Information Processing (IBI‐7: Structural Biochemistry), and Institute of Bio‐ and Geosciences (IBG‐4: Bioinformatics), Forschungszentrum Jülich GmbH, Jülich, Germany
| |
|
12
|
Buś S, Jędrzejewski K, Guzik P. Using Minimum Redundancy Maximum Relevance Algorithm to Select Minimal Sets of Heart Rate Variability Parameters for Atrial Fibrillation Detection. J Clin Med 2022; 11:4004. [PMID: 35887768 PMCID: PMC9318370 DOI: 10.3390/jcm11144004] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/30/2022] [Revised: 07/08/2022] [Accepted: 07/09/2022] [Indexed: 02/06/2023] Open
Abstract
Heart rate is quite regular during sinus (normal) rhythm (SR) originating from the sinus node. In contrast, heart rate is usually irregular during atrial fibrillation (AF). Complete atrioventricular block with an escape rhythm, ventricular pacing, or ventricular tachycardia are the most common exceptions when heart rate may be regular in AF. Heart rate variability (HRV) is the variation in the duration of consecutive cardiac cycles (RR intervals). We investigated the utility of HRV parameters for automated detection of AF with machine learning (ML) classifiers. The minimum redundancy maximum relevance (MRMR) algorithm, one of the most effective algorithms for feature selection, helped select the HRV parameters (including five original ones) best suited for distinguishing AF from SR in a database of over 53,000 separate 60 s electrocardiogram (ECG) segments cut from longer (up to 24 h) ECG recordings. HRV parameters entered the ML-based classifiers as features. Seven different, commonly used classifiers were trained with one to six HRV-based features with the highest scores resulting from the MRMR algorithm and tested using 5-fold cross-validation and blindfold validation. The best ML classifier in the blindfold validation achieved an accuracy of 97.2% and a diagnostic odds ratio of 1566. Of all studied HRV features, the top three HRV parameters distinguishing AF from SR were: the percentage of successive RR intervals differing by at least 50 ms (pRR50); the ratio of the standard deviations of points along and across the identity line of the Poincaré plot, respectively (SD2/SD1); and the coefficient of variation, that is, the standard deviation of RR intervals divided by their mean duration (CV). The proposed methodology and the presented results of the selection of HRV parameters have the potential to support the development of practical solutions and devices for automatic AF detection with minimal sets of simple HRV parameters.
Using straightforward ML classifiers and the extremely small sets of simple HRV features, always with pRR50 included, the differentiation of AF from sinus rhythms in the 60 s ECGs is very effective.
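The three top-ranked parameters named in the abstract can be computed directly from a list of RR intervals. The sketch below follows the definitions as worded there (pRR50 counts differences of at least 50 ms; SD1 and SD2 are the spreads across and along the Poincaré identity line). The function name, the use of the sample standard deviation, and the toy RR series are assumptions made for illustration, not details from the paper.

```python
import math
import statistics


def hrv_features(rr_ms):
    """Return pRR50 (%), SD2/SD1 and CV for a sequence of RR intervals in ms."""
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")
    pairs = list(zip(rr_ms, rr_ms[1:]))
    diffs = [b - a for a, b in pairs]  # successive differences
    sums = [b + a for a, b in pairs]   # successive sums

    # Percentage of successive differences of at least 50 ms.
    prr50 = 100.0 * sum(abs(d) >= 50 for d in diffs) / len(diffs)

    # Poincaré descriptors: spread across (SD1) and along (SD2) the identity line.
    sd1 = statistics.stdev(diffs) / math.sqrt(2)
    sd2 = statistics.stdev(sums) / math.sqrt(2)

    # Coefficient of variation: standard deviation divided by the mean.
    cv = statistics.stdev(rr_ms) / statistics.mean(rr_ms)

    ratio = sd2 / sd1 if sd1 else math.inf
    return {"pRR50": prr50, "SD2/SD1": ratio, "CV": cv}


if __name__ == "__main__":
    # Toy series, invented for illustration: one steady, one erratic.
    print(hrv_features([800, 810, 805, 795, 800, 805]))
    print(hrv_features([600, 900, 650, 1000, 700, 850]))
```

On the toy data the erratic series yields a much higher pRR50 and CV than the steady one, which is the kind of separation the classifiers in the study rely on.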
Affiliation(s)
- Szymon Buś
- Institute of Electronic Systems, Faculty of Electronics and Information Technology, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland;
| | - Konrad Jędrzejewski
- Institute of Electronic Systems, Faculty of Electronics and Information Technology, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland;
| | - Przemysław Guzik
- Department of Cardiology-Intensive Therapy and Internal Disease, Poznan University of Medical Sciences, 60-355 Poznan, Poland;
| |
|
13
|
Sokhansanj BA, Rosen GL. Mapping Data to Deep Understanding: Making the Most of the Deluge of SARS-CoV-2 Genome Sequences. mSystems 2022; 7:e0003522. [PMID: 35311562 PMCID: PMC9040592 DOI: 10.1128/msystems.00035-22] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Accepted: 02/27/2022] [Indexed: 12/22/2022] Open
Abstract
Next-generation sequencing has been essential to the global response to the COVID-19 pandemic. As of January 2022, nearly 7 million severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences are available to researchers in public databases. Sequence databases are an abundant resource from which to extract biologically relevant and clinically actionable information. As the pandemic has gone on, SARS-CoV-2 has rapidly evolved, involving complex genomic changes that challenge current approaches to classifying SARS-CoV-2 variants. Deep sequence learning could be a powerful way to build complex sequence-to-phenotype models. Unfortunately, while it can be predictive, deep learning typically produces "black box" models that cannot directly provide biological and clinical insight. Researchers should therefore consider implementing emerging methods for visualizing and interpreting deep sequence models. Finally, researchers should address important data limitations, including (i) global sequencing disparities, (ii) insufficient sequence metadata, and (iii) screening artifacts due to poor sequence quality control.
Affiliation(s)
- Bahrad A. Sokhansanj
- Drexel University, Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical & Computer Engineering, College of Engineering, Philadelphia, Pennsylvania, USA
| | - Gail L. Rosen
- Drexel University, Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical & Computer Engineering, College of Engineering, Philadelphia, Pennsylvania, USA
| |
|
14
|
Lee BD, Gitter A, Greene CS, Raschka S, Maguire F, Titus AJ, Kessler MD, Lee AJ, Chevrette MG, Stewart PA, Britto-Borges T, Cofer EM, Yu KH, Carmona JJ, Fertig EJ, Kalinin AA, Signal B, Lengerich BJ, Triche TJ, Boca SM. Ten quick tips for deep learning in biology. PLoS Comput Biol 2022; 18:e1009803. [PMID: 35324884 PMCID: PMC8946751 DOI: 10.1371/journal.pcbi.1009803] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Indexed: 12/05/2022] Open
Affiliation(s)
- Benjamin D. Lee
- In-Q-Tel Labs, Arlington, Virginia, United States of America
- School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, United States of America
- Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America
| | - Anthony Gitter
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Morgridge Institute for Research, Madison, Wisconsin, United States of America
| | - Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado, United States of America
- Center for Health AI, University of Colorado School of Medicine, Aurora, Colorado, United States of America
| | - Sebastian Raschka
- Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
| | - Finlay Maguire
- Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
| | - Alexander J. Titus
- University of New Hampshire, Manchester, New Hampshire, United States of America
- Bioeconomy.XYZ, Manchester, New Hampshire, United States of America
| | - Michael D. Kessler
- Department of Oncology, Johns Hopkins University, Baltimore, Maryland, United States of America
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland, United States of America
| | - Alexandra J. Lee
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
| | - Marc G. Chevrette
- Wisconsin Institute for Discovery and Department of Plant Pathology, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
| | - Paul Allen Stewart
- Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, Florida, United States of America
| | - Thiago Britto-Borges
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, University Hospital Heidelberg, Heidelberg, Germany
- Department of Internal Medicine III (Cardiology, Angiology, and Pneumology), University Hospital Heidelberg, Heidelberg, Germany
| | - Evan M. Cofer
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Graduate Program in Quantitative and Computational Biology, Princeton University, Princeton, New Jersey, United States of America
| | - Kun-Hsing Yu
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America
- Department of Pathology, Brigham and Women’s Hospital, Boston, Massachusetts, United States of America
| | - Juan Jose Carmona
- Philips Healthcare, Cambridge, Massachusetts, United States of America
| | - Elana J. Fertig
- Department of Oncology, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Biomedical Engineering, Department of Applied Mathematics and Statistics, Convergence Institute, Johns Hopkins University, Baltimore, Maryland, United States of America
| | - Alexandr A. Kalinin
- Medical Big Data Group, Shenzhen Research Institute of Big Data, Shenzhen, China
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Brandon Signal
- School of Medicine, College of Health and Medicine, University of Tasmania, Hobart, Australia
| | - Benjamin J. Lengerich
- Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
| | - Timothy J. Triche
- Center for Epigenetics, Van Andel Research Institute, Grand Rapids, Michigan, United States of America
- Department of Pediatrics, College of Human Medicine, Michigan State University, East Lansing, Michigan, United States of America
- Department of Translational Genomics, Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America
| | - Simina M. Boca
- Innovation Center for Biomedical Informatics, Georgetown University Medical Center, District of Columbia, United States of America
- Department of Oncology, Georgetown University Medical Center, Washington, DC, United States of America
- Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center, Washington, DC, United States of America
- Cancer Prevention and Control Program, Lombardi Comprehensive Cancer Center, Washington, DC, United States of America
| |
|
15
|
Petti S, Eddy SR. Constructing benchmark test sets for biological sequence analysis using independent set algorithms. PLoS Comput Biol 2022; 18:e1009492. [PMID: 35255082 PMCID: PMC8929697 DOI: 10.1371/journal.pcbi.1009492] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 09/21/2021] [Revised: 03/17/2022] [Accepted: 02/10/2022] [Indexed: 11/18/2022] Open
Abstract
Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
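The constraint described in this abstract (every test sequence less than p% identical to every training sequence) can be illustrated with a simple greedy procedure. The sketch below is not either of the two algorithms from the paper. It is a simplified stand-in that assumes the sequences are already aligned to equal length and measures identity as the fraction of matching columns. The function names, the `test_fraction` parameter, and the toy sequences are all hypothetical.

```python
import random


def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two equal-length aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def dissimilar_split(seqs, p, test_fraction=0.3, seed=0):
    """Greedy split so that no test sequence is >= p% identical to a training one.

    Sequences that are similar to members of both sets are discarded.
    """
    rng = random.Random(seed)
    order = list(seqs)
    rng.shuffle(order)
    train, test = [], []
    for s in order:
        near_train = any(percent_identity(s, t) >= p for t in train)
        near_test = any(percent_identity(s, t) >= p for t in test)
        if near_train and near_test:
            continue  # cannot be placed without violating the constraint
        if near_train:
            train.append(s)
        elif near_test:
            test.append(s)
        elif len(test) < test_fraction * (len(train) + len(test) + 1):
            test.append(s)  # free choice: top up the test set first
        else:
            train.append(s)
    return train, test


if __name__ == "__main__":
    family = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC",
              "CCCCCCCCCG", "GGGGGGGGGG", "TTTTTTTTTT"]
    train, test = dissimilar_split(family, p=50.0)
    print("train:", train)
    print("test: ", test)
```

A plain random split of the toy family could place `AAAAAAAAAA` in training and `AAAAAAAAAC` in testing. The greedy rule keeps such near-duplicates on the same side, at the cost of possibly discarding sequences that bridge the two sets.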
Affiliation(s)
- Samantha Petti
- NSF-Simons Center for the Mathematical and Statistical Analysis of Biology, Harvard University, Cambridge, Massachusetts, United States of America
| | - Sean R. Eddy
- Howard Hughes Medical Institute; Department of Molecular & Cellular Biology; and John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, United States of America
| |
|
16
|
Palmblad M, Böcker S, Degroeve S, Kohlbacher O, Käll L, Noble WS, Wilhelm M. Interpretation of the DOME Recommendations for Machine Learning in Proteomics and Metabolomics. J Proteome Res 2022; 21:1204-1207. [PMID: 35119864 PMCID: PMC8981311 DOI: 10.1021/acs.jproteome.1c00900] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
Machine learning is increasingly applied in proteomics and metabolomics to predict molecular structure, function, and physicochemical properties, including behavior in chromatography, ion mobility, and tandem mass spectrometry. These must be described in sufficient detail to apply or evaluate the performance of trained models. Here we look at and interpret the recently published and general DOME (Data, Optimization, Model, Evaluation) recommendations for conducting and reporting on machine learning in the specific context of proteomics and metabolomics.
Affiliation(s)
- Magnus Palmblad
- Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300 RC, Leiden, The Netherlands
| | - Sebastian Böcker
- Faculty of Mathematics and Computer Science, Friedrich Schiller University, 07743 Jena, Germany
| | - Sven Degroeve
- VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium and Department of Biomolecular Medicine, Ghent University, 9052 Ghent, Belgium
| | - Oliver Kohlbacher
- Eberhard Karls Universität Tübingen, WSI/ZBIT, 72076 Tübingen, Germany
| | - Lukas Käll
- Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, Royal Institute of Technology (KTH), 171 21 Solna, Sweden
| | - William Stafford Noble
- Department of Genome Sciences and the Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195-5065, United States
| | - Mathias Wilhelm
- Computational Mass Spectrometry, Technical University of Munich (TUM), 85354 Freising, Germany
| |
|
17
|
Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 2022; 23:40-55. [PMID: 34518686 DOI: 10.1038/s41580-021-00407-0] [Citation(s) in RCA: 790] [Impact Index Per Article: 263.3] [Accepted: 07/23/2021] [Indexed: 02/08/2023]
Abstract
The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.
Affiliation(s)
- Joe G Greener
- Department of Computer Science, University College London, London, UK
| | - Shaun M Kandathil
- Department of Computer Science, University College London, London, UK
| | - Lewis Moffat
- Department of Computer Science, University College London, London, UK
| | - David T Jones
- Department of Computer Science, University College London, London, UK.
| |
|
18
|
|
19
|
Lam C, Tso CF, Green-Saxena A, Pellegrini E, Iqbal Z, Evans D, Hoffman J, Calvert J, Mao Q, Das R. Semi-supervised deep learning from time series clinical data for acute respiratory distress syndrome prediction: model development and validation study. JMIR Form Res 2021; 5:e28028. [PMID: 34398784 PMCID: PMC8447921 DOI: 10.2196/28028] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/17/2021] [Revised: 06/18/2021] [Accepted: 08/01/2021] [Indexed: 11/23/2022] Open
Abstract
Background: A high number of patients who are hospitalized with COVID-19 develop acute respiratory distress syndrome (ARDS). Objective: In response to the need for clinical decision support tools to help manage the next pandemic during the early stages (ie, when limited labeled data are present), we developed machine learning algorithms that use semisupervised learning (SSL) techniques to predict ARDS development in general and COVID-19 populations based on limited labeled data. Methods: SSL techniques were applied to 29,127 encounters with patients who were admitted to 7 US hospitals from May 1, 2019, to May 1, 2021. A recurrent neural network that used a time series of electronic health record data was applied to data that were collected when a patient’s peripheral oxygen saturation level fell below the normal range (<97%) to predict the subsequent development of ARDS during the remaining duration of patients’ hospital stay. Model performance was assessed with the area under the receiver operating characteristic curve and area under the precision-recall curve of an external hold-out test set. Results: For the whole data set, the median time between the first peripheral oxygen saturation measurement of <97% and subsequent respiratory failure was 21 hours. The area under the receiver operating characteristic curve for predicting subsequent ARDS development was 0.73 when the model was trained on a labeled data set of 6930 patients, 0.78 when the model was trained on the labeled data set that had been augmented with the unlabeled data set of 16,173 patients by using SSL techniques, and 0.84 when the model was trained on the entire training set of 23,103 labeled patients. Conclusions: In the context of using time-series inpatient data and a careful model training design, unlabeled data can be used to improve the performance of machine learning models when labeled data for predicting ARDS development are scarce or expensive.
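The area under the receiver operating characteristic curve reported above has a simple interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A minimal sketch of that pairwise definition follows. The function name and the toy labels and scores are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def auroc(labels, scores):
    """AUROC via the pairwise (Mann-Whitney) definition; labels are 0 or 1."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # A positive scoring above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy predictions, invented for illustration.
    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]
    print(auroc(y_true, y_score))
```

Under this reading, a value of 0.5 corresponds to scores that carry no ranking information, and 1.0 to a model that ranks every positive case above every negative one.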
Affiliation(s)
- Carson Lam
- Dascena, Inc., 12333 Sowden Rd Ste B PMB 65148, Houston, US
| | - Chak Foon Tso
- Dascena, Inc., 12333 Sowden Rd Ste B PMB 65148, Houston, US
| | | | | | - Zohora Iqbal
- Dascena, Inc., 12333 Sowden Rd Ste B PMB 65148, Houston, US
| | - Daniel Evans
- Dascena, Inc., 12333 Sowden Rd Ste B PMB 65148, Houston, US
| | - Jana Hoffman
- Dascena, Inc., 12333 Sowden Rd Ste B PMB 65148, Houston, US
| | - Jacob Calvert
- Dascena, Inc., 12333 Sowden Rd Ste B PMB 65148, Houston, US
| | - Qingqing Mao
- Dascena, Inc., 12333 Sowden Rd Ste B PMB 65148, Houston, US
| | - Ritankar Das
- Dascena, Inc., 12333 Sowden Rd Ste B PMB 65148, Houston, US
| |
|
20
|
Westerman EL, Bowman SEJ, Davidson B, Davis MC, Larson ER, Sanford CPJ. Deploying Big Data to Crack the Genotype to Phenotype Code. Integr Comp Biol 2021; 60:385-396. [PMID: 32492136 DOI: 10.1093/icb/icaa055] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/25/2022] Open
Abstract
Mechanistically connecting genotypes to phenotypes is a longstanding and central mission of biology. Deciphering these connections will unite questions and datasets across all scales from molecules to ecosystems. Although high-throughput sequencing has provided a rich platform on which to launch this effort, tools for deciphering mechanisms further along the genome to phenome pipeline remain limited. Machine learning approaches and other emerging computational tools hold the promise of augmenting human efforts to overcome these obstacles. This vision paper is the result of a Reintegrating Biology Workshop, bringing together the perspectives of integrative and comparative biologists to survey challenges and opportunities in cracking the genotype to phenotype code and thereby generating predictive frameworks across biological scales. Key recommendations include promoting the development of minimum "best practices" for the experimental design and collection of data; fostering sustained and long-term data repositories; promoting programs that recruit, train, and retain a diversity of talent; and providing funding to effectively support these highly cross-disciplinary efforts. We follow this discussion by highlighting a few specific transformative research opportunities that will be advanced by these efforts.
Affiliation(s)
- Erica L Westerman
- Department of Biological Sciences, University of Arkansas, Fayetteville, AR 72701, USA
| | - Sarah E J Bowman
- High-Throughput Crystallization Screening Center, Hauptman-Woodward Medical Research Institute, Buffalo, NY 14203, USA; Department of Biochemistry, Jacobs School of Medicine & Biomedical Sciences at the University at Buffalo, Buffalo, NY 14203, USA
| | - Bradley Davidson
- Department of Biology, Swarthmore College, Swarthmore, PA 19081, USA
| | - Marcus C Davis
- Department of Biology, James Madison University, Harrisonburg, VA 22807, USA
| | - Eric R Larson
- Department of Natural Resources and Environmental Sciences, University of Illinois, Urbana, IL 61801, USA
| | - Christopher P J Sanford
- Department of Ecology, Evolution and Organismal Biology, Kennesaw State University, Kennesaw, GA 30144, USA
| |
|
21
|
Wilson CJ, Chang M, Karttunen M, Choy WY. KEAP1 Cancer Mutants: A Large-Scale Molecular Dynamics Study of Protein Stability. Int J Mol Sci 2021; 22:5408. [PMID: 34065616 PMCID: PMC8161161 DOI: 10.3390/ijms22105408] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/21/2021] [Revised: 05/11/2021] [Accepted: 05/13/2021] [Indexed: 12/30/2022] Open
Abstract
We have performed 280 μs of unbiased molecular dynamics (MD) simulations to investigate the effects of 12 different cancer mutations on Kelch-like ECH-associated protein 1 (KEAP1) (G333C, G350S, G364C, G379D, R413L, R415G, A427V, G430C, R470C, R470H, R470S and G476R), one of the frequently mutated proteins in lung cancer. The aim was to provide structural insight into the effects of these mutants, including a new class of ANCHOR (additionally NRF2-complexed hypomorph) mutant variants. Our work provides additional insight into the structural dynamics of mutants that could not be analyzed experimentally, painting a more complete picture of their mutagenic effects. Notably, blade-wise analysis of the Kelch domain points to stability as a possible target of cancer in KEAP1. Interestingly, structural analysis of the R470C ANCHOR mutant, the most prevalent missense mutation in KEAP1, revealed no significant change in structural stability or NRF2 binding site dynamics, possibly indicating a covalent modification as this mutant's mode of action.
Affiliation(s)
- Carter J. Wilson
- Department of Biochemistry, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 5C1, Canada; (C.J.W.); (M.C.)
- Department of Applied Mathematics, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 5B7, Canada
| | - Megan Chang
- Department of Biochemistry, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 5C1, Canada; (C.J.W.); (M.C.)
| | - Mikko Karttunen
- Department of Applied Mathematics, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 5B7, Canada
- Department of Chemistry, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 3K7, Canada
- Centre for Advanced Materials and Biomaterials Research, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 5B7, Canada
| | - Wing-Yiu Choy
- Department of Biochemistry, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 5C1, Canada; (C.J.W.); (M.C.)
| |
|
22
|
Liu Z, Gong Y, Bao Y, Guo Y, Wang H, Lin GN. TMPSS: A Deep Learning-Based Predictor for Secondary Structure and Topology Structure Prediction of Alpha-Helical Transmembrane Proteins. Front Bioeng Biotechnol 2021; 8:629937. [PMID: 33569377 PMCID: PMC7869861 DOI: 10.3389/fbioe.2020.629937] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 11/16/2020] [Accepted: 12/10/2020] [Indexed: 11/13/2022] Open
Abstract
Alpha-helical transmembrane proteins (αTMPs) profoundly affect many critical biological processes and are major drug targets due to their pivotal protein functions. At present, even though the non-transmembrane secondary structures are, along with the transmembrane structures, highly relevant to the biological functions of αTMPs, the two have not yet been studied within a unified framework. In this study, we present a novel computational method, TMPSS, to predict the secondary structures in non-transmembrane parts and the topology structures in transmembrane parts of αTMPs. TMPSS applied a Convolutional Neural Network (CNN), combined with an attention-enhanced Bidirectional Long Short-Term Memory (BiLSTM) network, to extract the local contexts and long-distance interdependencies from primary sequences. In addition, a multi-task learning strategy was used to predict the secondary structures and the transmembrane helices. TMPSS was thoroughly trained and tested against a non-redundant independent dataset, where the Q3 secondary structure prediction accuracy reached 78% in the non-transmembrane region, and the accuracy of the transmembrane region prediction reached 90%. In sum, our method showcased a unified model for predicting the secondary structure and topology structure of αTMPs using only features generated from primary sequences and provided a steady and fast prediction, which promises to improve structural studies of αTMPs.
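The Q3 accuracy quoted above is the percentage of residues whose predicted three-state secondary structure label (helix H, strand E, coil C) matches the reference label. A minimal sketch of that metric follows. The function name and the example strings are hypothetical, and the code makes no attempt to reproduce how TMPSS restricts the evaluation to non-transmembrane regions.

```python
def q3_accuracy(true_ss: str, pred_ss: str) -> float:
    """Percent of positions where predicted and reference H/E/C labels agree."""
    if len(true_ss) != len(pred_ss) or not true_ss:
        raise ValueError("label strings must be non-empty and of equal length")
    allowed = set("HEC")
    if not (set(true_ss) <= allowed and set(pred_ss) <= allowed):
        raise ValueError("labels must be one of H, E or C")
    correct = sum(t == p for t, p in zip(true_ss, pred_ss))
    return 100.0 * correct / len(true_ss)


if __name__ == "__main__":
    # Invented labels for an eight-residue stretch; one residue is mispredicted.
    print(q3_accuracy("HHHEECCC", "HHHEEECC"))
```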
Affiliation(s)
- Zhe Liu
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China; Shanghai Key Laboratory of Psychotic Disorders, Shanghai, China
| | - Yingli Gong
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Yihang Bao
- School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun, China
| | - Yuanzhao Guo
- School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun, China
| | - Han Wang
- School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun, China
| | - Guan Ning Lin
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China; Shanghai Key Laboratory of Psychotic Disorders, Shanghai, China
| |
|
23
|
Sarkar A, Yang Y, Vihinen M. Variation benchmark datasets: update, criteria, quality and applications. Database: The Journal of Biological Databases and Curation 2020; 2020:5710862. [PMID: 32016318 PMCID: PMC6997940 DOI: 10.1093/database/baz117] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 04/12/2019] [Revised: 06/03/2019] [Accepted: 07/01/2019] [Indexed: 02/07/2023]
Abstract
Development of new computational methods and testing their performance has to be carried out using experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcome are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time consuming to generate. VariBench and VariSNP are the two existing databases for sharing variation benchmark datasets used mainly for variation interpretation. They have been used for training and benchmarking predictors for various types of variations and their effects. VariBench was updated with 419 new datasets from 109 papers containing altogether 329 014 152 variants; however, there is plenty of redundancy between the datasets. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. The contents of the datasets vary depending on information in the original source. The available datasets have been categorized into 20 groups and subgroups. There are datasets for insertions and deletions, substitutions in coding and non-coding regions, structure mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein property for aggregation, binding free energy, disorder and stability. Then there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level including relevant cross references and variant descriptions. The updated VariBench facilitates development and testing of new methods and comparison of obtained performances to previously published methods.
We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies, and show that such comparisons are feasible and useful; however, there may be limitations due to lack of provided details and shared data. Database URL: http://structure.bmc.lu.se/VariBench
Affiliation(s)
- Anasua Sarkar
- Department of Experimental Medical Science, BMC B13, Lund University, SE-22 184 Lund, Sweden
- Yang Yang
- School of Computer Science and Technology, Soochow University, No1. Shizi Street, Suzhou, 215006 Jiangsu, China; Provincial Key Laboratory for Computer Information Processing Technology, No1. Shizi Street, Soochow University, Suzhou, 215006 Jiangsu, China
- Mauno Vihinen
- Department of Experimental Medical Science, BMC B13, Lund University, SE-22 184 Lund, Sweden
|
24
|
Eitzinger S, Asif A, Watters KE, Iavarone AT, Knott GJ, Doudna JA, Minhas FUAA. Machine learning predicts new anti-CRISPR proteins. Nucleic Acids Res 2020; 48:4698-4708. [PMID: 32286628 PMCID: PMC7229843 DOI: 10.1093/nar/gkaa219] [Citation(s) in RCA: 68] [Impact Index Per Article: 13.6] [Received: 12/20/2019] [Revised: 03/23/2020] [Accepted: 03/25/2020] [Indexed: 01/30/2023] Open
Abstract
The increasing use of CRISPR–Cas9 in medicine, agriculture, and synthetic biology has accelerated the drive to discover new CRISPR–Cas inhibitors as potential mechanisms of control for gene editing applications. Many anti-CRISPRs have been found that inhibit the CRISPR–Cas adaptive immune system. However, comparing all currently known anti-CRISPRs does not reveal a shared set of properties for facile bioinformatic identification of new anti-CRISPR families. Here, we describe AcRanker, a machine learning based method to aid direct identification of new potential anti-CRISPRs using only protein sequence information. Using a training set of known anti-CRISPRs, we built a model based on XGBoost ranking. We then applied AcRanker to predict candidate anti-CRISPRs from predicted prophage regions within self-targeting bacterial genomes and discovered two previously unknown anti-CRISPRs: AcrIIA20 (ML1) and AcrIIA21 (ML8). We show that AcrIIA20 strongly inhibits Streptococcus iniae Cas9 (SinCas9) and weakly inhibits Streptococcus pyogenes Cas9 (SpyCas9). We also show that AcrIIA21 inhibits SpyCas9, Staphylococcus aureus Cas9 (SauCas9) and SinCas9 with low potency. The addition of AcRanker to the anti-CRISPR discovery toolkit allows researchers to directly rank potential anti-CRISPR candidate genes for increased speed in testing and validation of new anti-CRISPRs. A web server implementation for AcRanker is available online at http://acranker.pythonanywhere.com/.
Affiliation(s)
- Simon Eitzinger
- Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA 94720, USA
- Amina Asif
- Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences (PIEAS), PO Nilore, Islamabad, Pakistan; FAST School of Computing, National University of Computer and Emerging Sciences (NUCES), Islamabad, Pakistan
- Kyle E Watters
- Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA 94720, USA
- Anthony T Iavarone
- QB3/Chemistry Mass Spectrometry Facility, University of California, Berkeley, Berkeley, CA 94720, USA
- Gavin J Knott
- Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA 94720, USA
- Jennifer A Doudna
- Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA 94720, USA; Department of Chemistry, University of California Berkeley, Berkeley, CA 94720, USA; Innovative Genomics Institute, University of California Berkeley, Berkeley, CA 94720, USA; Gladstone Institute of Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA 94158; Howard Hughes Medical Institute, University of California Berkeley, Berkeley, CA 94720, USA; Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Fayyaz Ul Amir Afsar Minhas
- Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences (PIEAS), PO Nilore, Islamabad, Pakistan; Department of Computer Science, University of Warwick, Coventry, CV4 7AL, UK
|
25
|
Camargo G, Bugatti PH, Saito PTM. Active semi-supervised learning for biological data classification. PLoS One 2020; 15:e0237428. [PMID: 32813738 PMCID: PMC7437865 DOI: 10.1371/journal.pone.0237428] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 10/12/2019] [Accepted: 07/27/2020] [Indexed: 11/18/2022] Open
Abstract
Because datasets have grown continuously, efforts have been made to address the large amount of unlabeled data relative to the scarcity of labeled data. Another important issue is the trade-off between the difficulty of obtaining annotations from a specialist and the need for a significant amount of annotated data to obtain a robust classifier. In this context, active learning techniques combined with semi-supervised learning are of interest. A smaller number of more informative samples, previously selected by the active learning strategy and labeled by a specialist, can propagate their labels to a set of unlabeled data through the semi-supervised strategy. However, most works in the literature neglect the need for interactive response times that can be required by certain real applications. We propose a more effective and efficient active semi-supervised learning framework, including a new active learning method. An extensive experimental evaluation was performed in the biological context (using the ALL-AML, Escherichia coli and PlantLeaves II datasets), comparing our proposals with state-of-the-art literature works and different supervised (SVM, RF, OPF) and semi-supervised (YATSI-SVM, YATSI-RF and YATSI-OPF) classifiers. From the obtained results, we can observe the benefits of our framework, which allows the classifier to achieve higher accuracies more quickly with a reduced number of annotated samples. Moreover, the selection criterion adopted by our active learning method, based on diversity and uncertainty, enables the prioritization of the most informative boundary samples for the learning process. We obtained a gain of up to 20% against other learning techniques. The active semi-supervised learning approaches presented a better trade-off (accuracies and competitive and viable computational times) when compared with the active supervised learning ones.
Affiliation(s)
- Guilherme Camargo
- Department of Computing, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
- Pedro H. Bugatti
- Department of Computing, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
- Priscila T. M. Saito
- Department of Computing, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
- Institute of Computing, University of Campinas, Campinas, SP, Brazil
|
26
|
Piovesan D, Hatos A, Minervini G, Quaglia F, Monzon AM, Tosatto SCE. Assessing predictors for new post translational modification sites: A case study on hydroxylation. PLoS Comput Biol 2020; 16:e1007967. [PMID: 32569263 PMCID: PMC7332089 DOI: 10.1371/journal.pcbi.1007967] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 10/25/2019] [Revised: 07/02/2020] [Accepted: 05/19/2020] [Indexed: 12/15/2022] Open
Abstract
Post-translational modification (PTM) sites have become popular for predictor development. However, with the exception of phosphorylation and a handful of other examples, PTMs suffer from a limited number of available training examples and sparsity in protein sequences. Here, proline hydroxylation is taken as an example to compare different methods and evaluate their performance on new experimentally determined sites. As a guide for effective experimental design, predictors require both high specificity and sensitivity. However, the self-reported performance may often not be indicative of prediction quality and detection of new sites is not guaranteed. We have benchmarked seven published hydroxylation site predictors on two newly constructed independent datasets. The self-reported performance is found to widely overestimate the real accuracy measured on independent datasets. No predictor performs better than random on new examples, indicating the refined models do not sufficiently generalize to detect new sites. The number of false positives is high and precision low, in particular for non-collagen proteins whose motifs are not conserved. As hydroxylation site predictors do not generalize for new data, caution is advised when using PTM predictors in the absence of independent evaluations, in particular for highly specific sites involved in signalling. Machine learning methods are extensively used by biologists to design and interpret experiments. Predictors which take only the sequence as input are of particular interest due to the large amount of available sequence data and high self-reported performance. In this work, we evaluated post-translational modification (PTM) predictors for hydroxylation sites and found that they perform no better than random, in strong contrast to performances reported in their original publications.
PTMs are chemical amino acid alterations providing the cell with conditional mechanisms to fine tune protein function, regulating complex biological processes such as signalling and cell cycle. Hydroxylation sites are a good PTM test case due to the availability of a range of predictors and an abundance of newly experimentally detected modification sites. Poor performances in our results highlight the overlooked problem of predicting PTMs when best practices are not followed and training data are likely incomplete. Experimentalists should be careful when using PTM predictors blindly and more independent assessments are needed to establish their usefulness in practice.
Affiliation(s)
- Damiano Piovesan
- Department of Biomedical Sciences, University of Padua, Padua, Italy
- Andras Hatos
- Department of Biomedical Sciences, University of Padua, Padua, Italy
- Federica Quaglia
- Department of Biomedical Sciences, University of Padua, Padua, Italy
|
27
|
Affiliation(s)
- Stanislav Mazurenko
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
- Zbynek Prokop
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
- International Centre for Clinical Research, St. Ann’s Hospital, 602 00 Brno, Czech Republic
- Jiri Damborsky
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
- International Centre for Clinical Research, St. Ann’s Hospital, 602 00 Brno, Czech Republic
|
28
|
|
29
|
Torrisi M, Kaleel M, Pollastri G. Deeper Profiles and Cascaded Recurrent and Convolutional Neural Networks for state-of-the-art Protein Secondary Structure Prediction. Sci Rep 2019; 9:12374. [PMID: 31451723 PMCID: PMC6710256 DOI: 10.1038/s41598-019-48786-x] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 03/26/2019] [Accepted: 08/12/2019] [Indexed: 01/10/2023] Open
Abstract
Protein Secondary Structure prediction has been a central topic of research in Bioinformatics for decades. In spite of this, even the most sophisticated ab initio SS predictors are not able to reach the theoretical limit of three-state prediction accuracy (88–90%), while only a few predict more than the 3 traditional Helix, Strand and Coil classes. In this study we present tests on different models trained both on single sequence and evolutionary profile-based inputs and develop a new state-of-the-art system with Porter 5. Porter 5 is composed of ensembles of cascaded Bidirectional Recurrent Neural Networks and Convolutional Neural Networks, incorporates new input encoding techniques and is trained on a large set of protein structures. Porter 5 achieves 84% accuracy (81% SOV) when tested on 3 classes and 73% accuracy (70% SOV) on 8 classes on a large independent set. In our tests Porter 5 is 2% more accurate than its previous version and outperforms or matches the most recent predictors of secondary structure we tested. When Porter 5 is retrained on SCOPe based sets that eliminate homology between training/testing samples we obtain similar results. Porter is available as a web server and standalone program at http://distilldeep.ucd.ie/porter/ alongside all the datasets and alignments.
Affiliation(s)
- Mirko Torrisi
- School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland
- Manaz Kaleel
- School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland
- Gianluca Pollastri
- School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland
|
30
|
Latysheva NS, Babu MM. Molecular Signatures of Fusion Proteins in Cancer. ACS Pharmacol Transl Sci 2019; 2:122-133. [PMID: 32219217 PMCID: PMC7088938 DOI: 10.1021/acsptsci.9b00019] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 03/05/2019] [Indexed: 01/07/2023]
Abstract
Although gene fusions are recognized as driver mutations in a wide variety of cancers, the general molecular mechanisms underlying oncogenic fusion proteins are insufficiently understood. Here, we employ large-scale data integration and machine learning and (1) identify three functionally distinct subgroups of gene fusions and their molecular signatures; (2) characterize the cellular pathways rewired by fusion events across different cancers; and (3) analyze the relative importance of over 100 structural, functional, and regulatory features of ∼2200 gene fusions. We report subgroups of fusions that likely act as driver mutations and find that gene fusions disproportionately affect pathways regulating cellular shape and movement. Although fusion proteins are similar across different cancer types, they affect cancer type-specific pathways. Key indicators of fusion-forming proteins include high and nontissue specific expression, numerous splice sites, and higher centrality in protein-interaction networks. Together, these findings provide unifying and cancer type-specific trends across diverse oncogenic fusion proteins.
Affiliation(s)
- Natasha S Latysheva
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
- M Madan Babu
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
|
31
|
Niroula A, Vihinen M. How good are pathogenicity predictors in detecting benign variants? PLoS Comput Biol 2019; 15:e1006481. [PMID: 30742610 PMCID: PMC6386394 DOI: 10.1371/journal.pcbi.1006481] [Citation(s) in RCA: 64] [Impact Index Per Article: 10.7] [Received: 08/31/2018] [Revised: 02/22/2019] [Accepted: 12/19/2018] [Indexed: 01/07/2023] Open
Abstract
Computational tools are widely used for interpreting variants detected in sequencing projects. The choice of these tools is critical for reliable variant impact interpretation for precision medicine and should be based on systematic performance assessment. The performance of the methods varies widely in different performance assessments, for example due to the contents and sizes of test datasets. To address this issue, we obtained 63,160 common amino acid substitutions (allele frequency ≥1% and <25%) from the Exome Aggregation Consortium (ExAC) database, which contains variants from 60,706 genomes or exomes. We evaluated the specificity, the capability to detect benign variants, for 10 variant interpretation tools. In addition to overall specificity of the tools, we tested their performance for variants in six geographical populations. PON-P2 had the best performance (95.5%) followed by FATHMM (86.4%) and VEST (83.5%). While these tools had excellent performance, the poorest method predicted more than one third of the benign variants to be disease-causing. The results allow choosing reliable methods for benign variant interpretation, for both research and clinical purposes, as well as provide a benchmark for method developers. In precision/personalized medicine of many conditions it is essential to investigate the individual’s genome. Interpretation of the observed variation (mutation) sets is feasible only with computational approaches. We assessed the performance of variant pathogenicity/tolerance prediction programs on benign variants. Variants were obtained from the high-quality ExAC database and selected to have minor allele frequency between 1 and 25%. We obtained 63,160 such cases and investigated 10 widely used predictors. Specificities of the methods showed large differences, from 64 to 96%, thus users of these methods have to be careful when choosing the one(s) they will use.
We investigated further the performances on different populations, allele frequencies, separately for males and females, chromosome wise and for population unique and non-unique variants. The ranking of the tools remained the same in all these scenarios, i.e. the best methods were the best irrespective of how the data was filtered and grouped. This is to our knowledge the first large scale evaluation of method performance on benign variants.
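The specificity reported in this abstract is the fraction of known benign variants that a tool also labels as benign, i.e. TN / (TN + FP). The following is a minimal Python sketch of that calculation; the function name, the label strings and the toy predictions are illustrative assumptions and are not taken from the study or from any predictor's output format.

```python
def specificity(predictions, benign_label="benign"):
    """Specificity on a set in which every variant is truly benign.

    predictions: list of labels returned by a (hypothetical) predictor.
    A prediction equal to benign_label counts as a true negative (TN);
    any other label counts as a false positive (FP).
    """
    if not predictions:
        raise ValueError("no predictions supplied")
    true_negatives = sum(1 for p in predictions if p == benign_label)
    false_positives = len(predictions) - true_negatives
    return true_negatives / (true_negatives + false_positives)


# Toy data (assumed for illustration): 10 benign variants, 2 mislabelled.
toy_predictions = ["benign"] * 8 + ["pathogenic"] * 2
print(specificity(toy_predictions))  # 0.8
```

How predictions that a tool leaves unclassified should be counted is a design choice; this sketch treats them as false positives, which may differ from the protocol used in the paper.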
Affiliation(s)
- Abhishek Niroula
- Protein Structure and Bioinformatics, Department of Experimental Medical Science, Lund University, Lund, Sweden
- Mauno Vihinen
- Protein Structure and Bioinformatics, Department of Experimental Medical Science, Lund University, Lund, Sweden
|
32
|
Schaafsma GCP, Vihinen M. Representativeness of variation benchmark datasets. BMC Bioinformatics 2018; 19:461. [PMID: 30497376 PMCID: PMC6267811 DOI: 10.1186/s12859-018-2478-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 02/25/2018] [Accepted: 11/09/2018] [Indexed: 12/14/2022] Open
Abstract
Background: Benchmark datasets are essential for both method development and performance assessment. These datasets have numerous requirements, representativeness being one. In the case of variant tolerance/pathogenicity prediction, representativeness means that the dataset covers the space of variations and their effects. Results: We performed the first analysis of the representativeness of variation benchmark datasets. We used statistical approaches to investigate how proteins in the benchmark datasets were representative for the entire human protein universe. We investigated the distributions of variants in chromosomes, protein structures, CATH domains and classes, Pfam protein families, Enzyme Commission (EC) classifications and Gene Ontology annotations in 24 datasets that have been used for training and testing variant tolerance prediction methods. All the datasets were available in VariBench or VariSNP databases. We tested also whether the pathogenic variant datasets contained neutral variants defined as those that have high minor allele frequency in the ExAC database. The distributions of variants over the chromosomes and proteins varied greatly between the datasets. Conclusions: None of the datasets was found to be well representative. Many of the tested datasets had quite good coverage of the different protein characteristics. Dataset size correlates to representativeness but only weakly to the performance of methods trained on them. The results imply that dataset representativeness is an important factor and should be taken into account in predictor development and testing. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2478-6) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Gerard C P Schaafsma
- Protein Structure and Bioinformatics, Department of Experimental Medical Science, Lund University, BMC B13, SE-221 84, Lund, Sweden
- Mauno Vihinen
- Protein Structure and Bioinformatics, Department of Experimental Medical Science, Lund University, BMC B13, SE-221 84, Lund, Sweden
|
33
|
PON-tstab: Protein Variant Stability Predictor. Importance of Training Data Quality. Int J Mol Sci 2018; 19:ijms19041009. [PMID: 29597263 PMCID: PMC5979465 DOI: 10.3390/ijms19041009] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 03/05/2018] [Revised: 03/21/2018] [Accepted: 03/24/2018] [Indexed: 12/24/2022] Open
Abstract
Several methods have been developed to predict effects of amino acid substitutions on protein stability. Benchmark datasets are essential for method training and testing and have numerous requirements including that the data is representative for the investigated phenomenon. Available machine learning algorithms for variant stability have all been trained with ProTherm data. We noticed a number of issues with the contents, quality and relevance of the database. There were errors, but also features that had not been clearly communicated. Consequently, all machine learning variant stability predictors have been trained on biased and incorrect data. We obtained a corrected dataset and trained a random forests-based tool, PON-tstab, applicable to variants in any organism. Our results highlight the importance of the benchmark quality, suitability and appropriateness. Predictions are provided for three categories: stability decreasing, increasing and those not affecting stability.
|
34
|
Collaborative representation-based classification of microarray gene expression data. PLoS One 2017; 12:e0189533. [PMID: 29236759 PMCID: PMC5728509 DOI: 10.1371/journal.pone.0189533] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/23/2017] [Accepted: 11/27/2017] [Indexed: 11/19/2022] Open
Abstract
Microarray technology is important to simultaneously express multiple genes over a number of time points. Multiple classifier models, such as the sparse representation (SR)-based method, have been developed to classify microarray gene expression data. These methods allocate the gene data points to different clusters. In this paper, we propose a novel collaborative representation (CR)-based classification with regularized least squares to classify gene data. First, the CR codes a testing sample as a sparse linear combination of all training samples and then classifies the testing sample by evaluating which class leads to the minimum representation error. This CR-based classification approach is remarkably less complex than traditional classification methods but leads to very competitive classification results. In addition, a compressive sensing approach is adopted to project the high-dimensional gene expression dataset to a lower-dimensional space that retains nearly all of the information. This compression without loss is beneficial to reduce the computational load. Experiments to detect subtypes of diseases, such as leukemia and autism spectrum disorders, are performed by analyzing the gene expression. The results show that the proposed CR-based algorithm exhibits significantly higher stability and accuracy than the traditional classifiers, such as the support vector machine algorithm.
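The two steps named in this abstract (coding a test sample over all training samples with regularized least squares, then assigning the class with the smallest representation error) can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' implementation: the function name `crc_predict`, the regularization value `lam` and the toy data are hypothetical, and the compressive sensing projection mentioned in the abstract is omitted.

```python
import numpy as np


def crc_predict(X_train, y_train, x_test, lam=0.01):
    """Collaborative-representation classification of one test sample.

    X_train: (n_samples, n_features) training matrix.
    y_train: class label of each training sample.
    x_test:  (n_features,) sample to classify.
    lam:     ridge regularization strength (assumed value).
    """
    A = np.asarray(X_train, dtype=float).T   # columns are training samples
    labels = np.asarray(y_train)
    x = np.asarray(x_test, dtype=float)
    n = A.shape[1]
    # Regularized least squares: alpha = (A^T A + lam * I)^-1 A^T x
    alpha = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ x)
    best_label, best_err = None, np.inf
    for c in np.unique(labels):
        mask = labels == c
        # Reconstruct x using only the coefficients of class c.
        err = np.linalg.norm(x - A[:, mask] @ alpha[mask])
        if err < best_err:
            best_label, best_err = c, err
    return best_label


# Toy data (assumed): two classes separated along different features.
X_toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]
y_toy = ["A", "A", "B", "B"]
print(crc_predict(X_toy, y_toy, [0.95, 0.05, 0.0]))  # A
```

Because the coding step has a closed-form solution, the matrix `A.T @ A + lam * I` depends only on the training data and could be factorized once and reused for every test sample.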
|
35
|
Carraro M, Minervini G, Giollo M, Bromberg Y, Capriotti E, Casadio R, Dunbrack R, Elefanti L, Fariselli P, Ferrari C, Gough J, Katsonis P, Leonardi E, Lichtarge O, Menin C, Martelli PL, Niroula A, Pal LR, Repo S, Scaini MC, Vihinen M, Wei Q, Xu Q, Yang Y, Yin Y, Zaucha J, Zhao H, Zhou Y, Brenner SE, Moult J, Tosatto SCE. Performance of in silico tools for the evaluation of p16INK4a (CDKN2A) variants in CAGI. Hum Mutat 2017; 38:1042-1050. [PMID: 28440912 PMCID: PMC5561474 DOI: 10.1002/humu.23235] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 12/19/2016] [Revised: 04/17/2017] [Accepted: 04/19/2017] [Indexed: 12/31/2022]
Abstract
Correct phenotypic interpretation of variants of unknown significance for cancer-associated genes is a diagnostic challenge as genetic screenings gain in popularity in the next-generation sequencing era. The Critical Assessment of Genome Interpretation (CAGI) experiment aims to test and define the state of the art of genotype-phenotype interpretation. Here, we present the assessment of the CAGI p16INK4a challenge. Participants were asked to predict the effect on cellular proliferation of 10 variants for the p16INK4a tumor suppressor, a cyclin-dependent kinase inhibitor encoded by the CDKN2A gene. Twenty-two pathogenicity predictors were assessed with a variety of accuracy measures for reliability in a medical context. Different assessment measures were combined in an overall ranking to provide more robust results. The R scripts used for assessment are publicly available from a GitHub repository for future use in similar assessment exercises. Despite a limited test-set size, our findings show a variety of results, with some methods performing significantly better. Methods combining different strategies frequently outperform simpler approaches. The best predictor, Yang&Zhou lab, uses a machine learning method combining an empirical energy function measuring protein stability with an evolutionary conservation term. The p16INK4a challenge highlights how subtle structural effects can neutralize otherwise deleterious variants.
Affiliation(s)
- Marco Carraro
- Department of Biomedical Sciences, University of Padova, Padova, Italy
- Manuel Giollo
- Department of Biomedical Sciences, University of Padova, Padova, Italy
- Department of Information Engineering, University of Padova, Padova, Italy
- Yana Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey
- Department of Genetics, Rutgers University, Piscataway, New Jersey
- Technical University of Munich Institute for Advanced Study (TUM-IAS), Garching/Munich, Germany
- Emidio Capriotti
- BioFolD Unit, Department of Biological, Geological, and Environmental Sciences (BiGeA), University of Bologna, Bologna, Italy
- Rita Casadio
- Biocomputing Group, Department of Biological, Geological, and Environmental Sciences (BiGeA), University of Bologna, Bologna, Italy
- Roland Dunbrack
- Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia, Pennsylvania
- Lisa Elefanti
- Immunology and Molecular Oncology Unit, Veneto Institute of Oncology, Padua, Italy
- Pietro Fariselli
- Department of Comparative Biomedicine and Food Science, University of Padua, viale dell'Università 16, 35020, Legnaro (PD), Italy
- Carlo Ferrari
- Department of Information Engineering, University of Padova, Padova, Italy
- Julian Gough
- Department of Computer Science, University of Bristol, Bristol, UK
- Panagiotis Katsonis
- Department of Human and Molecular Genetics, Baylor College of Medicine, Houston, Texas
- Emanuela Leonardi
- Department of Woman and Child Health, University of Padova, Padova, Italy
- Olivier Lichtarge
- Department of Human and Molecular Genetics, Baylor College of Medicine, Houston, Texas
- Department of Biochemistry & Molecular Biology, Baylor College of Medicine, Houston, Texas
- Department of Pharmacology, Baylor College of Medicine, Houston, Texas
- Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, Texas
- Chiara Menin
- Immunology and Molecular Oncology Unit, Veneto Institute of Oncology, Padua, Italy
- Pier Luigi Martelli
- BioFolD Unit, Department of Biological, Geological, and Environmental Sciences (BiGeA), University of Bologna, Bologna, Italy
- Abhishek Niroula
- Protein Structure and Bioinformatics Group, Department of Experimental Medical Science, Lund University, Lund, Sweden
- Lipika R Pal
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland
- Susanna Repo
- EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
- Maria Chiara Scaini
- Immunology and Molecular Oncology Unit, Veneto Institute of Oncology, Padua, Italy
- Mauno Vihinen
- Protein Structure and Bioinformatics Group, Department of Experimental Medical Science, Lund University, Lund, Sweden
- Qiong Wei
- Biocomputing Group, Department of Biological, Geological, and Environmental Sciences (BiGeA), University of Bologna, Bologna, Italy
- Qifang Xu
- Biocomputing Group, Department of Biological, Geological, and Environmental Sciences (BiGeA), University of Bologna, Bologna, Italy
- Yuedong Yang
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Gold Coast, Queensland, Australia
- Yizhou Yin
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland
- Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, Maryland
- Jan Zaucha
- Department of Computer Science, University of Bristol, Bristol, UK
- Huiying Zhao
- Institute of Health and Biomedical Innovation, Queensland University of Technology, Queensland, Australia
- Yaoqi Zhou
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Gold Coast, Queensland, Australia
- Steven E Brenner
- Department of Plant and Microbial Biology, University of California, Berkeley, California
- John Moult
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland
- Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, Maryland
- Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padova, Padova, Italy
- CNR Institute of Neuroscience, Padova, Italy
|
36
|
Niroula A, Vihinen M. PON-P and PON-P2 predictor performance in CAGI challenges: Lessons learned. Hum Mutat 2017; 38:1085-1091. [PMID: 28224672 DOI: 10.1002/humu.23199] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 08/24/2016] [Revised: 01/25/2017] [Accepted: 02/17/2017] [Indexed: 01/14/2023]
Abstract
Computational tools are widely used for ranking and prioritizing variants for characterizing their disease relevance. Since numerous tools have been developed, they have to be properly assessed before being applied. Critical Assessment of Genome Interpretation (CAGI) experiments have significantly contributed toward the assessment of prediction methods for various tasks. Within and outside the CAGI, we have addressed several questions that facilitate development and assessment of variation interpretation tools. These areas include collection and distribution of benchmark datasets, their use for systematic large-scale method assessment, and the development of guidelines for reporting methods and their performance. For us, CAGI has provided a chance to experiment with new ideas, test the application areas of our methods, and network with other prediction method developers. In this article, we discuss our experiences and lessons learned from the various CAGI challenges. We describe our approaches, their performance, and impact of CAGI on our research. Finally, we discuss some of the possibilities that CAGI experiments have opened up and make some suggestions for future experiments.
Affiliation(s)
- Abhishek Niroula
- Protein Structure and Bioinformatics Group, Department of Experimental Medical Science, Lund University, Lund, Sweden
- Mauno Vihinen
- Protein Structure and Bioinformatics Group, Department of Experimental Medical Science, Lund University, Lund, Sweden
|
37
|
Niroula A, Vihinen M. Predicting Severity of Disease-Causing Variants. Hum Mutat 2017; 38:357-364. [PMID: 28070986 DOI: 10.1002/humu.23173] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 10/04/2016] [Revised: 12/07/2016] [Accepted: 01/06/2017] [Indexed: 12/22/2022]
Abstract
Most diseases, including those of genetic origin, express a continuum of severity. Clinical interventions for numerous diseases are based on the severity of the phenotype. Predicting severity due to genetic variants could facilitate diagnosis and choice of therapy. Although computational predictions have been used as evidence for classifying the disease relevance of genetic variants, special tools for predicting disease severity in large scale are missing. Here, we manually curated a dataset containing variants leading to severe and less severe phenotypes and studied the abilities of variation impact predictors to distinguish between them. We found that these tools cannot separate the two groups of variants. Then, we developed a novel machine-learning-based method, PON-PS (http://structure.bmc.lu.se/PON-PS), for the classification of amino acid substitutions associated with benign, severe, and less severe phenotypes. We tested the method using an independent test dataset and variants in four additional proteins. For distinguishing severe and nonsevere variants, PON-PS showed an accuracy of 61% in the test dataset, which is higher than for existing tolerance prediction methods. PON-PS is the first generic tool developed for this task. The tool can be used together with other evidence for improving diagnosis and prognosis and for prioritization of preventive interventions, clinical monitoring, and molecular tests.
Affiliation(s)
- Abhishek Niroula
- Department of Experimental Medical Science, Lund University, Lund, SE-22184, Sweden
- Mauno Vihinen
- Department of Experimental Medical Science, Lund University, Lund, SE-22184, Sweden
|
38
|
Richard FD, Alves R, Kajava AV. Tally: a scoring tool for boundary determination between repetitive and non-repetitive protein sequences. Bioinformatics 2016; 32:1952-8. [PMID: 27153701 DOI: 10.1093/bioinformatics/btw118] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/26/2015] [Accepted: 02/25/2016] [Indexed: 12/23/2022] Open
Abstract
MOTIVATION: Tandem Repeats (TRs) are abundant in proteins, having a variety of fundamental functions. In many cases, evolution has blurred their repetitive patterns. This leads to the problem of distinguishing between sequences that contain highly imperfect TRs, and the sequences without TRs. The 3D structure of proteins can be used as a benchmarking criterion for TR detection in sequences, because the vast majority of proteins having TRs in sequences are built of repetitive 3D structural blocks. According to our benchmark, none of the existing scoring methods are able to clearly distinguish, based on the sequence analysis, between structures with and without 3D TRs.
RESULTS: We developed a scoring tool called Tally, which is based on a machine learning approach. Tally is able to achieve a better separation between sequences with structural TRs and sequences of aperiodic structures, than existing scoring procedures. It performs at a level of 81% sensitivity, while achieving a high specificity of 74% and an Area Under the Receiver Operating Characteristic Curve of 86%. Tally can be used to select a set of structurally and functionally meaningful TRs from all TRs detected in proteomes. The generated dataset is available for benchmarking purposes.
AVAILABILITY AND IMPLEMENTATION: Source code is available upon request. Tool and dataset can be accessed through our website: http://bioinfo.montp.cnrs.fr/?r=Tally
CONTACT: andrey.kajava@crbm.cnrs.fr
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- François D Richard
- Centre de Recherche en Biologie cellulaire de Montpellier (CRBM), UMR 5237 CNRS, Université Montpellier, 1919 Route de Mende, Cedex 5, Montpellier 34293, France
- Institut de Biologie Computationnelle (IBC), Montpellier 34095, France
- Ronnie Alves
- Institut de Biologie Computationnelle (IBC), Montpellier 34095, France
- Pós-Graduação em Ciência da Computação (PPGCC), Universidade Federal do Pará, Belém, Brazil
- Andrey V Kajava
- Centre de Recherche en Biologie cellulaire de Montpellier (CRBM), UMR 5237 CNRS, Université Montpellier, 1919 Route de Mende, Cedex 5, Montpellier 34293, France
- Institut de Biologie Computationnelle (IBC), Montpellier 34095, France
- University ITMO, Institute of Bioengineering, St. Petersburg 197101, Russia
|
39
|
Bendl J, Musil M, Štourač J, Zendulka J, Damborský J, Brezovský J. PredictSNP2: A Unified Platform for Accurately Evaluating SNP Effects by Exploiting the Different Characteristics of Variants in Distinct Genomic Regions. PLoS Comput Biol 2016; 12:e1004962. [PMID: 27224906 PMCID: PMC4880439 DOI: 10.1371/journal.pcbi.1004962] [Citation(s) in RCA: 143] [Impact Index Per Article: 15.9] [Received: 12/16/2015] [Accepted: 05/05/2016] [Indexed: 12/20/2022] Open
Abstract
An important message taken from human genome sequencing projects is that the human population exhibits approximately 99.9% genetic similarity. Variations in the remaining parts of the genome determine our identity, trace our history and reveal our heritage. The precise delineation of phenotypically causal variants plays a key role in providing accurate personalized diagnosis, prognosis, and treatment of inherited diseases. Several computational methods for achieving such delineation have been reported recently. However, their ability to pinpoint potentially deleterious variants is limited by the fact that their mechanisms of prediction do not account for the existence of different categories of variants. Consequently, their output is biased towards the variant categories that are most strongly represented in the variant databases. Moreover, most such methods provide numeric scores but not binary predictions of the deleteriousness of variants or confidence scores that would be more easily understood by users. We have constructed three datasets covering different types of disease-related variants, which were divided across five categories: (i) regulatory, (ii) splicing, (iii) missense, (iv) synonymous, and (v) nonsense variants. These datasets were used to develop category-optimal decision thresholds and to evaluate six tools for variant prioritization: CADD, DANN, FATHMM, FitCons, FunSeq2 and GWAVA. This evaluation revealed some important advantages of the category-based approach. The results obtained with the five best-performing tools were then combined into a consensus score. Additional comparative analyses showed that in the case of missense variations, protein-based predictors perform better than DNA sequence-based predictors. A user-friendly web interface was developed that provides easy access to the five tools’ predictions, and their consensus scores, in a user-understandable format tailored to the specific features of different categories of variations. 
To enable comprehensive evaluation of variants, the predictions are complemented with annotations from eight databases. The web server is freely available to the community at http://loschmidt.chemi.muni.cz/predictsnp2.
Affiliation(s)
- Jaroslav Bendl
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
- Miloš Musil
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
- Jan Štourač
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
- Jaroslav Zendulka
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
- Jiří Damborský
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
- * E-mail: (JD); (JBr)
- Jan Brezovský
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
- * E-mail: (JD); (JBr)
|