Reference Citation Analysis

For: Walsh I, Pollastri G, Tosatto SCE. Correct machine learning on protein sequences: a peer-reviewing perspective. Brief Bioinform 2015;17:831-40. [PMID: 26411473 DOI: 10.1093/bib/bbv082] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 05/06/2015] [Indexed: 12/20/2022] Open

Cited by Other Article(s)

Sheehan K, Jeon H, Corr SC, Hayes JM, Mok KH. Antibody Aggregation: A Problem Within the Biopharmaceutical Industry and Its Role in AL Amyloidosis Disease. Protein J 2025;44:1-20. [PMID: 39527351 DOI: 10.1007/s10930-024-10237-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/24/2024] [Indexed: 11/16/2024]

Abstract

Due to the large size and rapid growth of the global therapeutic antibody market, there is major interest in understanding the aggregation of protein products as it can compromise efficacy, concentration, and safety. Various production and storage conditions have been identified as capable of inducing aggregation of polyclonal and monoclonal antibody (mAb) therapies such as low pH, freezing, light exposure, lyophilisation and increased ionic strength. The addition of stabilising excipients to these therapeutics helps to combat the formation of aggregates with future aggregation inhibition mechanisms involving the introduction of point mutations and glycoengineering within aggregation prone regions (APRs). Antibody aggregation also plays an integral role in the pathogenesis of a condition known as amyloid light chain (AL) amyloidosis which is characterised by the production of improperly folded and amyloidogenic immunoglobulin light chains (LCs). Current diagnostic tools rely heavily on histological staining with their future moving towards amyloid component identification and proteomic analysis. For many years, treatment options designed for multiple myeloma (MM) have been applied to AL amyloidosis patients by depleting plasma cell numbers. More recently, treatment strategies more specific to this condition have been developed with many designed to recognize amyloid fibrils and trigger their degradation without causing systemic plasma cell cytotoxicity. Amyloid fibrils in AL disease and aggregates in antibody therapeutics are both formed through the oligomerisation of misfolded / modified proteins attempting to reach a thermodynamically stable, free energy minimum that is lower than the respective monomers themselves. 
Although the final morphologies are different, by understanding the principles underlying such aggregation, we expect to find common insights that may contribute to the development of new and effective methods of antibody aggregation and/or amyloidosis management. We envision that this area of research will continue to be very relevant in both industry and clinical settings.

Pir MS, Timucin E. AFFIPred: AlphaFold2 structure-based Functional Impact Prediction of missense variations. Protein Sci 2025;34:e70030. [PMID: 39840793 PMCID: PMC11751861 DOI: 10.1002/pro.70030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2024] [Revised: 12/23/2024] [Accepted: 12/24/2024] [Indexed: 01/23/2025]

Chernigovskaya M, Pavlović M, Kanduri C, Gielis S, Robert P, Scheffer L, Slabodkin A, Haff IH, Meysman P, Yaari G, Sandve GK, Greiff V. Simulation of adaptive immune receptors and repertoires with complex immune information to guide the development and benchmarking of AIRR machine learning. Nucleic Acids Res 2025;53:gkaf025. [PMID: 39873270 PMCID: PMC11773363 DOI: 10.1093/nar/gkaf025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2023] [Accepted: 01/25/2025] [Indexed: 01/30/2025] Open

Gillani M, Pollastri G. Protein subcellular localization prediction tools. Comput Struct Biotechnol J 2024;23:1796-1807. [PMID: 38707539 PMCID: PMC11066471 DOI: 10.1016/j.csbj.2024.04.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Revised: 04/11/2024] [Accepted: 04/11/2024] [Indexed: 05/07/2024] Open

Wossnig L, Furtmann N, Buchanan A, Kumar S, Greiff V. Best practices for machine learning in antibody discovery and development. Drug Discov Today 2024;29:104025. [PMID: 38762089 DOI: 10.1016/j.drudis.2024.104025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 04/25/2024] [Accepted: 05/13/2024] [Indexed: 05/20/2024]

Ndochinwa OG, Wang QY, Amadi OC, Nwagu TN, Nnamchi CI, Okeke ES, Moneke AN. Current status and emerging frontiers in enzyme engineering: An industrial perspective. Heliyon 2024;10:e32673. [PMID: 38912509 PMCID: PMC11193041 DOI: 10.1016/j.heliyon.2024.e32673] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 06/05/2024] [Accepted: 06/06/2024] [Indexed: 06/25/2024] Open

Abstract

Protein engineering mechanisms can be an efficient approach to enhance the biochemical properties of various biocatalysts. Immobilization of biocatalysts and the introduction of new-to-nature chemical reactivities are also possible through the same mechanism. Discovering new protocols that enhance the catalytic active protein that possesses novelty in terms of being stable, active, and, stereoselectivity with functions could be identified as essential areas in terms of concurrent bioorganic chemistry (synergistic relationship between organic chemistry and biochemistry in the context of enzyme engineering). However, with our current level of knowledge about protein folding and its correlation with protein conformation and activities, it is almost impossible to design proteins with specific biological and physical properties. Hence, contemporary protein engineering typically involves reprogramming existing enzymes by mutagenesis to generate new phenotypes with desired properties. These processes ensure that limitations of naturally occurring enzymes are not encountered. For example, researchers have engineered cellulases and hemicellulases to withstand harsh conditions encountered during biomass pretreatment, such as high temperatures and acidic environments. By enhancing the activity and robustness of these enzymes, biofuel production becomes more economically viable and environmentally sustainable. Recent trends in enzyme engineering have enabled the development of tailored biocatalysts for pharmaceutical applications. For instance, researchers have engineered enzymes such as cytochrome P450s and amine oxidases to catalyze challenging reactions involved in drug synthesis. In addition to conventional methods, there has been an increasing application of machine learning techniques to identify patterns in data. 
These patterns are then used to predict protein structures, enhance enzyme solubility, stability, and function, forecast substrate specificity, and assist in rational protein design. In this review, we discuss recent trends in enzyme engineering to optimize the biochemical properties of various biocatalysts. Using examples relevant to biotechnology, we explain the significance of enzyme engineering and how these methods could be applied to optimize the biochemical properties of a naturally occurring enzyme.

Armah-Sekum RE, Szedmak S, Rousu J. Protein function prediction through multi-view multi-label latent tensor reconstruction. BMC Bioinformatics 2024;25:174. [PMID: 38698340 PMCID: PMC11067221 DOI: 10.1186/s12859-024-05789-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Accepted: 04/17/2024] [Indexed: 05/05/2024] Open

Draizen EJ, Readey J, Mura C, Bourne PE. Prop3D: A flexible, Python-based platform for machine learning with protein structural properties and biophysical data. BMC Bioinformatics 2024;25:11. [PMID: 38177985 PMCID: PMC10768222 DOI: 10.1186/s12859-023-05586-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Accepted: 11/27/2023] [Indexed: 01/06/2024] Open

Abstract

BACKGROUND

Machine learning (ML) has a rich history in structural bioinformatics, and modern approaches, such as deep learning, are revolutionizing our knowledge of the subtle relationships between biomolecular sequence, structure, function, dynamics and evolution. As with any advance that rests upon statistical learning approaches, the recent progress in biomolecular sciences is enabled by the availability of vast volumes of sufficiently-variable data. To be useful, such data must be well-structured, machine-readable, intelligible and manipulable. These and related requirements pose challenges that become especially acute at the computational scales typical in ML. Furthermore, in structural bioinformatics such data generally relate to protein three-dimensional (3D) structures, which are inherently more complex than sequence-based data. A significant and recurring challenge concerns the creation of large, high-quality, openly-accessible datasets that can be used for specific training and benchmarking tasks in ML pipelines for predictive modeling projects, along with reproducible splits for training and testing.

RESULTS

Here, we report 'Prop3D', a platform that allows for the creation, sharing and extensible reuse of libraries of protein domains, featurized with biophysical and evolutionary properties that can range from detailed, atomically-resolved physicochemical quantities (e.g., electrostatics) to coarser, residue-level features (e.g., phylogenetic conservation). As a community resource, we also supply a 'Prop3D-20sf' protein dataset, obtained by applying our approach to CATH. We have developed and deployed the Prop3D framework, both in the cloud and on local HPC resources, to systematically and reproducibly create comprehensive datasets via the Highly Scalable Data Service (HSDS). Our datasets are freely accessible via a public HSDS instance, or they can be used with accompanying Python wrappers for popular ML frameworks.

CONCLUSION

Prop3D and its associated Prop3D-20sf dataset can be of broad utility in at least three ways. Firstly, the Prop3D workflow code can be customized and deployed on various cloud-based compute platforms, with scalability achieved largely by saving the results to distributed HDF5 files via HSDS. Secondly, the linked Prop3D-20sf dataset provides a hand-crafted, already-featurized dataset of protein domains for 20 highly-populated CATH families; importantly, provision of this pre-computed resource can aid the more efficient development (and reproducible deployment) of ML pipelines. Thirdly, Prop3D-20sf's construction explicitly takes into account (in creating datasets and data-splits) the enigma of 'data leakage', stemming from the evolutionary relationships between proteins.
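The 'data leakage' concern mentioned in the abstract above can be illustrated with a group-aware split, in which evolutionarily related domains are kept on the same side of the train/test boundary. The following is a minimal sketch only, not the Prop3D procedure: the function name `group_split`, the domain identifiers and the family labels are all hypothetical, and it assumes each domain has already been assigned to a group (for example a superfamily or a sequence cluster).

```python
import random
from collections import defaultdict


def group_split(domain_to_group, test_fraction=0.2, seed=0):
    """Split domains so that no group contributes to both train and test.

    domain_to_group maps a domain identifier to a group label (hypothetical
    labels here; in practice a superfamily or sequence-cluster identifier).
    """
    # Collect the members of every group.
    groups = defaultdict(list)
    for domain, group in domain_to_group.items():
        groups[group].append(domain)

    # Shuffle whole groups, not individual domains, with a fixed seed
    # so the split is reproducible.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    # Fill the test set group by group until it reaches the target size,
    # then send all remaining groups to the training set.
    target = test_fraction * len(domain_to_group)
    train, test = [], []
    for gid in group_ids:
        bucket = test if len(test) < target else train
        bucket.extend(groups[gid])
    return train, test


# Hypothetical toy data: six domains in four families.
example = {"d1": "famA", "d2": "famA", "d3": "famB",
           "d4": "famC", "d5": "famC", "d6": "famD"}
train, test = group_split(example, test_fraction=0.3)
```

Because whole groups are assigned at once, the realised test fraction is only approximately the requested one; that is the usual trade-off for avoiding related proteins on both sides of the split.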

Attafi OA, Clementel D, Kyritsis K, Capriotti E, Farrell G, Fragkouli SC, Castro LJ, Hatos A, Lenaerts T, Mazurenko S, Mozaffari S, Pradelli F, Ruch P, Savojardo C, Turina P, Zambelli F, Piovesan D, Monzon AM, Psomopoulos F, Tosatto SCE. DOME Registry: implementing community-wide recommendations for reporting supervised machine learning in biology. Gigascience 2024;13:giae094. [PMID: 39661723 PMCID: PMC11633452 DOI: 10.1093/gigascience/giae094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2024] [Revised: 10/22/2024] [Accepted: 10/27/2024] [Indexed: 12/13/2024] Open

Affiliation(s)

Omar Abdelghani Attafi: Department of Biomedical Sciences, University of Padova, Padova 35131, Italy
Damiano Clementel: Department of Biomedical Sciences, University of Padova, Padova 35131, Italy
Konstantinos Kyritsis: Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki 570 01, Greece
Emidio Capriotti: Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40126, Italy
Gavin Farrell: ELIXIR Hub, Hinxton, Cambridge CB10 1SD, UK
Styliani-Christina Fragkouli: Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki 570 01, Greece; Department of Biology, National and Kapodistrian University of Athens, Athens 157 72, Greece
Leyla Jael Castro: ZB Med Information Centre for Life Sciences, Cologne 50931, Germany
András Hatos: Department of Oncology, Geneva University Hospitals, Geneva 1205, Switzerland; Department of Computational Biology, University of Lausanne, Lausanne 1015, Switzerland; Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland; Swiss Cancer Center Léman, Lausanne 1015, Switzerland
Tom Lenaerts: Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, Brussels 1050, Belgium; Machine Learning Group, Université Libre de Bruxelles, Brussels 1050, Belgium; Artificial Intelligence Laboratory, Vrije Universiteit Brussels, Brussels 1050, Belgium
Stanislav Mazurenko: Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Brno 62500, Czech Republic; Masaryk University, Czech Republic; International Clinical Research Centre, St. Anne’s Hospital, Brno 65690, Czech Republic
Soroush Mozaffari: Department of Biomedical Sciences, University of Padova, Padova 35131, Italy
Franco Pradelli: Department of Biomedical Sciences, University of Padova, Padova 35131, Italy
Patrick Ruch: HES-SO–HEG Geneva, Geneva 1227, Switzerland; SIB Swiss Institute of Bioinformatics, Geneva 1206, Switzerland
Castrense Savojardo: Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40126, Italy
Paola Turina: Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40126, Italy
Federico Zambelli: Department of Biosciences, University of Milan, Milan 20133, Italy; Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies (IBIOM), Bari 70126, Italy
Damiano Piovesan: Department of Biomedical Sciences, University of Padova, Padova 35131, Italy
Alexander Miguel Monzon: Department of Information Engineering, University of Padova, Padova 35131, Italy
Fotis Psomopoulos: Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki 570 01, Greece
Silvio C E Tosatto: Department of Biomedical Sciences, University of Padova, Padova 35131, Italy; Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies (IBIOM), Bari 70126, Italy

Zhao XJG, Cao H. Linking research of biomedical datasets. Brief Bioinform 2022;23:6712704. [PMID: 36151775 DOI: 10.1093/bib/bbac373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2022] [Revised: 08/03/2022] [Accepted: 08/08/2022] [Indexed: 12/14/2022] Open

Behrendt A, Golchin P, König F, Mulnaes D, Stalke A, Dröge C, Keitel V, Gohlke H. Vasor: Accurate prediction of variant effects for amino acid substitutions in multidrug resistance protein 3. Hepatol Commun 2022;6:3098-3111. [PMID: 36111625 PMCID: PMC9592774 DOI: 10.1002/hep4.2088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2022] [Revised: 07/26/2022] [Accepted: 08/16/2022] [Indexed: 12/14/2022] Open

Buś S, Jędrzejewski K, Guzik P. Using Minimum Redundancy Maximum Relevance Algorithm to Select Minimal Sets of Heart Rate Variability Parameters for Atrial Fibrillation Detection. J Clin Med 2022;11:4004. [PMID: 35887768 PMCID: PMC9318370 DOI: 10.3390/jcm11144004] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/30/2022] [Revised: 07/08/2022] [Accepted: 07/09/2022] [Indexed: 02/06/2023] Open

Abstract

Heart rate is quite regular during sinus (normal) rhythm (SR) originating from the sinus node. In contrast, heart rate is usually irregular during atrial fibrillation (AF). Complete atrioventricular block with an escape rhythm, ventricular pacing, or ventricular tachycardia are the most common exceptions when heart rate may be regular in AF. Heart rate variability (HRV) is the variation in the duration of consecutive cardiac cycles (RR intervals). We investigated the utility of HRV parameters for automated detection of AF with machine learning (ML) classifiers. The minimum redundancy maximum relevance (MRMR) algorithm, one of the most effective algorithms for feature selection, helped select the HRV parameters (including five original), best suited for distinguishing AF from SR in a database of over 53,000 60 s separate electrocardiogram (ECG) segments cut from longer (up to 24 h) ECG recordings. HRV parameters entered the ML-based classifiers as features. Seven different, commonly used classifiers were trained with one to six HRV-based features with the highest scores resulting from the MRMR algorithm and tested using the 5-fold cross-validation and blindfold validation. The best ML classifier in the blindfold validation achieved an accuracy of 97.2% and diagnostic odds ratio of 1566. From all studied HRV features, the top three HRV parameters distinguishing AF from SR were: the percentage of successive RR intervals differing by at least 50 ms (pRR50), the ratio of standard deviations of points along and across the identity line of the Poincare plots, respectively (SD2/SD1), and coefficient of variation-standard deviation of RR intervals divided by their mean duration (CV). The proposed methodology and the presented results of the selection of HRV parameters have the potential to develop practical solutions and devices for automatic AF detection with minimal sets of simple HRV parameters. 
Using straightforward ML classifiers and the extremely small sets of simple HRV features, always with pRR50 included, the differentiation of AF from sinus rhythms in the 60 s ECGs is very effective.
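The three top-ranked HRV parameters named in the abstract above follow directly from their stated definitions, so they can be computed from a list of RR intervals. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `hrv_features` is hypothetical, RR intervals are assumed to be in milliseconds, population standard deviations are used throughout, and SD1 and SD2 are taken as the usual Poincaré-plot dispersions across and along the identity line.

```python
import math
import statistics


def hrv_features(rr_ms):
    """Return pRR50 (%), SD2/SD1 and CV for RR intervals given in milliseconds."""
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")

    # Consecutive (RR_n, RR_n+1) pairs are the points of the Poincare plot.
    pairs = list(zip(rr_ms[:-1], rr_ms[1:]))
    diffs = [b - a for a, b in pairs]

    # pRR50: percentage of successive intervals differing by at least 50 ms.
    prr50 = 100.0 * sum(abs(d) >= 50 for d in diffs) / len(diffs)

    # SD1: dispersion across the identity line; SD2: dispersion along it.
    sd1 = statistics.pstdev([(b - a) / math.sqrt(2) for a, b in pairs])
    sd2 = statistics.pstdev([(a + b) / math.sqrt(2) for a, b in pairs])
    ratio = sd2 / sd1 if sd1 > 0 else math.inf

    # CV: standard deviation of RR intervals divided by their mean duration.
    cv = statistics.pstdev(rr_ms) / statistics.fmean(rr_ms)

    return {"pRR50": prr50, "SD2/SD1": ratio, "CV": cv}


# Hypothetical toy series: one irregular, one nearly regular.
irregular = hrv_features([800, 860, 790, 900, 810])
regular = hrv_features([800, 805, 800, 805, 800])
```

In this toy example every successive difference in the irregular series is at least 50 ms and none is in the regular series, which is the kind of contrast that pRR50 is meant to capture.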

Sokhansanj BA, Rosen GL. Mapping Data to Deep Understanding: Making the Most of the Deluge of SARS-CoV-2 Genome Sequences. mSystems 2022;7:e0003522. [PMID: 35311562 PMCID: PMC9040592 DOI: 10.1128/msystems.00035-22] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Accepted: 02/27/2022] [Indexed: 12/22/2022] Open

Lee BD, Gitter A, Greene CS, Raschka S, Maguire F, Titus AJ, Kessler MD, Lee AJ, Chevrette MG, Stewart PA, Britto-Borges T, Cofer EM, Yu KH, Carmona JJ, Fertig EJ, Kalinin AA, Signal B, Lengerich BJ, Triche TJ, Boca SM. Ten quick tips for deep learning in biology. PLoS Comput Biol 2022;18:e1009803. [PMID: 35324884 PMCID: PMC8946751 DOI: 10.1371/journal.pcbi.1009803] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Indexed: 12/05/2022] Open

Affiliation(s)

Benjamin D. Lee: In-Q-Tel Labs, Arlington, Virginia, United States of America; School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, United States of America; Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America
Anthony Gitter: Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America; Morgridge Institute for Research, Madison, Wisconsin, United States of America
Casey S. Greene: Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America; Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado, United States of America; Center for Health AI, University of Colorado School of Medicine, Aurora, Colorado, United States of America
Sebastian Raschka: Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
Finlay Maguire: Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
Alexander J. Titus: University of New Hampshire, Manchester, New Hampshire, United States of America; Bioeconomy.XYZ, Manchester, New Hampshire, United States of America
Michael D. Kessler: Department of Oncology, Johns Hopkins University, Baltimore, Maryland, United States of America; Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland, United States of America
Alexandra J. Lee: Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America; Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
Marc G. Chevrette: Wisconsin Institute for Discovery and Department of Plant Pathology, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
Paul Allen Stewart: Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, Florida, United States of America
Thiago Britto-Borges: Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, University Hospital Heidelberg, Heidelberg, Germany; Department of Internal Medicine III (Cardiology, Angiology, and Pneumology), University Hospital Heidelberg, Heidelberg, Germany
Evan M. Cofer: Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America; Graduate Program in Quantitative and Computational Biology, Princeton University, Princeton, New Jersey, United States of America
Kun-Hsing Yu: Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America; Department of Pathology, Brigham and Women’s Hospital, Boston, Massachusetts, United States of America
Juan Jose Carmona: Philips Healthcare, Cambridge, Massachusetts, United States of America
Elana J. Fertig: Department of Oncology, Johns Hopkins University, Baltimore, Maryland, United States of America; Department of Biomedical Engineering, Department of Applied Mathematics and Statistics, Convergence Institute, Johns Hopkins University, Baltimore, Maryland, United States of America
Alexandr A. Kalinin: Medical Big Data Group, Shenzhen Research Institute of Big Data, Shenzhen, China; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
Brandon Signal: School of Medicine, College of Health and Medicine, University of Tasmania, Hobart, Australia
Benjamin J. Lengerich: Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
Timothy J. Triche: Center for Epigenetics, Van Andel Research Institute, Grand Rapids, Michigan, United States of America; Department of Pediatrics, College of Human Medicine, Michigan State University, East Lansing, Michigan, United States of America; Department of Translational Genomics, Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America
Simina M. Boca: Innovation Center for Biomedical Informatics, Georgetown University Medical Center, District of Columbia, United States of America; Department of Oncology, Georgetown University Medical Center, Washington, DC, United States of America; Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center, Washington, DC, United States of America; Cancer Prevention and Control Program, Lombardi Comprehensive Cancer Center, Washington, DC, United States of America

Petti S, Eddy SR. Constructing benchmark test sets for biological sequence analysis using independent set algorithms. PLoS Comput Biol 2022;18:e1009492. [PMID: 35255082 PMCID: PMC8929697 DOI: 10.1371/journal.pcbi.1009492] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 09/21/2021] [Revised: 03/17/2022] [Accepted: 02/10/2022] [Indexed: 11/18/2022] Open

Palmblad M, Böcker S, Degroeve S, Kohlbacher O, Käll L, Noble WS, Wilhelm M. Interpretation of the DOME Recommendations for Machine Learning in Proteomics and Metabolomics. J Proteome Res 2022;21:1204-1207. [PMID: 35119864 PMCID: PMC8981311 DOI: 10.1021/acs.jproteome.1c00900] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]

Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 2022;23:40-55. [PMID: 34518686 DOI: 10.1038/s41580-021-00407-0] [Citation(s) in RCA: 790] [Impact Index Per Article: 263.3] [Accepted: 07/23/2021] [Indexed: 02/08/2023]

DOME: recommendations for supervised machine learning validation in biology. Nat Methods 2021;18:1122-1127. [PMID: 34316068 DOI: 10.1038/s41592-021-01205-4] [Citation(s) in RCA: 111] [Impact Index Per Article: 27.8] [Indexed: 02/08/2023]

Lam C, Tso CF, Green-Saxena A, Pellegrini E, Iqbal Z, Evans D, Hoffman J, Calvert J, Mao Q, Das R. Semi-supervised deep learning from time series clinical data for acute respiratory distress syndrome prediction: model development and validation study. JMIR Form Res 2021;5:e28028. [PMID: 34398784 PMCID: PMC8447921 DOI: 10.2196/28028] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/17/2021] [Revised: 06/18/2021] [Accepted: 08/01/2021] [Indexed: 11/23/2022] Open

Abstract

Background

A high number of patients who are hospitalized with COVID-19 develop acute respiratory distress syndrome (ARDS).

Objective

In response to the need for clinical decision support tools to help manage the next pandemic during the early stages (ie, when limited labeled data are present), we developed machine learning algorithms that use semisupervised learning (SSL) techniques to predict ARDS development in general and COVID-19 populations based on limited labeled data.

Methods

SSL techniques were applied to 29,127 encounters with patients who were admitted to 7 US hospitals from May 1, 2019, to May 1, 2021. A recurrent neural network that used a time series of electronic health record data was applied to data that were collected when a patient’s peripheral oxygen saturation level fell below the normal range (<97%) to predict the subsequent development of ARDS during the remaining duration of patients’ hospital stay. Model performance was assessed with the area under the receiver operating characteristic curve and area under the precision recall curve of an external hold-out test set.

Results

For the whole data set, the median time between the first peripheral oxygen saturation measurement of <97% and subsequent respiratory failure was 21 hours. The area under the receiver operating characteristic curve for predicting subsequent ARDS development was 0.73 when the model was trained on a labeled data set of 6930 patients, 0.78 when the model was trained on the labeled data set that had been augmented with the unlabeled data set of 16,173 patients by using SSL techniques, and 0.84 when the model was trained on the entire training set of 23,103 labeled patients.

Conclusions

In the context of using time-series inpatient data and a careful model training design, unlabeled data can be used to improve the performance of machine learning models when labeled data for predicting ARDS development are scarce or expensive.
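The area under the receiver operating characteristic curve reported in the abstract above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The following is a minimal sketch of that pairwise definition, not the study's evaluation code: the function name `auroc` and the toy labels and scores are hypothetical, and ties are counted as half a win.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels are 0/1 outcomes and scores are model outputs, where a higher
    score means the positive class is considered more likely.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    # Count positive-negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three of the four positive-negative pairs
# are ranked correctly.
toy_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

This quadratic-time form is meant for illustration; for large hold-out sets a rank-based computation or an established library routine would be the more practical choice.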

Westerman EL, Bowman SEJ, Davidson B, Davis MC, Larson ER, Sanford CPJ. Deploying Big Data to Crack the Genotype to Phenotype Code. Integr Comp Biol 2021;60:385-396. [PMID: 32492136 DOI: 10.1093/icb/icaa055] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/25/2022] Open

Wilson CJ, Chang M, Karttunen M, Choy WY. KEAP1 Cancer Mutants: A Large-Scale Molecular Dynamics Study of Protein Stability. Int J Mol Sci 2021;22:5408. [PMID: 34065616 PMCID: PMC8161161 DOI: 10.3390/ijms22105408] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/21/2021] [Revised: 05/11/2021] [Accepted: 05/13/2021] [Indexed: 12/30/2022] Open

Liu Z, Gong Y, Bao Y, Guo Y, Wang H, Lin GN. TMPSS: A Deep Learning-Based Predictor for Secondary Structure and Topology Structure Prediction of Alpha-Helical Transmembrane Proteins. Front Bioeng Biotechnol 2021;8:629937. [PMID: 33569377 PMCID: PMC7869861 DOI: 10.3389/fbioe.2020.629937] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 11/16/2020] [Accepted: 12/10/2020] [Indexed: 11/13/2022] Open

Sarkar A, Yang Y, Vihinen M. Variation benchmark datasets: update, criteria, quality and applications. Database: The Journal of Biological Databases and Curation 2020;2020:5710862. [PMID: 32016318 PMCID: PMC6997940 DOI: 10.1093/database/baz117] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 04/12/2019] [Revised: 06/03/2019] [Accepted: 07/01/2019] [Indexed: 02/07/2023]

Abstract

Development of new computational methods and testing their performance has to be carried out using experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcome are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time consuming to generate. VariBench and VariSNP are the two existing databases for sharing variation benchmark datasets used mainly for variation interpretation. They have been used for training and benchmarking predictors for various types of variations and their effects. VariBench was updated with 419 new datasets from 109 papers containing altogether 329 014 152 variants; however, there is plenty of redundancy between the datasets. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. The contents of the datasets vary depending on information in the original source. The available datasets have been categorized into 20 groups and subgroups. There are datasets for insertions and deletions, substitutions in coding and non-coding region, structure mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein property for aggregation, binding free energy, disorder and stability. Then there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level including relevant cross references and variant descriptions. The updated VariBench facilitates development and testing of new methods and comparison of obtained performances to previously published methods. 
We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies and show that such comparisons are feasible and useful; however, there may be limitations due to a lack of provided details and shared data.

Database URL: http://structure.bmc.lu.se/VariBench

Eitzinger S, Asif A, Watters KE, Iavarone AT, Knott GJ, Doudna JA, Minhas FUAA. Machine learning predicts new anti-CRISPR proteins. Nucleic Acids Res 2020;48:4698-4708. [PMID: 32286628 PMCID: PMC7229843 DOI: 10.1093/nar/gkaa219] [Citation(s) in RCA: 68] [Impact Index Per Article: 13.6] [Received: 12/20/2019] [Revised: 03/23/2020] [Accepted: 03/25/2020] [Indexed: 01/30/2023] Open

Camargo G, Bugatti PH, Saito PTM. Active semi-supervised learning for biological data classification. PLoS One 2020;15:e0237428. [PMID: 32813738 PMCID: PMC7437865 DOI: 10.1371/journal.pone.0237428] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 10/12/2019] [Accepted: 07/27/2020] [Indexed: 11/18/2022] Open

Abstract

As datasets have grown continuously, efforts have been made to address the large amount of unlabeled data relative to the scarcity of labeled data. Another important issue is the trade-off between the difficulty of obtaining annotations from a specialist and the need for a significant amount of annotated data to obtain a robust classifier. In this context, combining active learning techniques with semi-supervised learning is attractive. A small number of more informative samples, selected by the active learning strategy and labeled by a specialist, can propagate their labels to a set of unlabeled data through the semi-supervised strategy. However, most published works neglect the need for interactive response times that certain real applications require. We propose a more effective and efficient active semi-supervised learning framework, including a new active learning method. An extensive experimental evaluation was performed in the biological context (using the ALL-AML, Escherichia coli and PlantLeaves II datasets), comparing our proposals with state-of-the-art works from the literature and with different supervised (SVM, RF, OPF) and semi-supervised (YATSI-SVM, YATSI-RF and YATSI-OPF) classifiers. The results show the benefits of our framework, which allows the classifier to achieve higher accuracies more quickly with a reduced number of annotated samples. Moreover, the selection criterion adopted by our active learning method, based on diversity and uncertainty, enables the prioritization of the most informative boundary samples for the learning process. We obtained a gain of up to 20% against other learning techniques. The active semi-supervised learning approaches presented a better trade-off (between accuracy and competitive, viable computational times) when compared with the active supervised learning ones.

Piovesan D, Hatos A, Minervini G, Quaglia F, Monzon AM, Tosatto SCE. Assessing predictors for new post translational modification sites: A case study on hydroxylation. PLoS Comput Biol 2020;16:e1007967. [PMID: 32569263 PMCID: PMC7332089 DOI: 10.1371/journal.pcbi.1007967] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 10/25/2019] [Revised: 07/02/2020] [Accepted: 05/19/2020] [Indexed: 12/15/2022] Open

Abstract

Post-translational modification (PTM) sites have become popular for predictor development. However, with the exception of phosphorylation and a handful of other examples, PTMs suffer from a limited number of available training examples and sparsity in protein sequences. Here, proline hydroxylation is taken as an example to compare different methods and evaluate their performance on new experimentally determined sites. As a guide for effective experimental design, predictors require both high specificity and sensitivity. However, the self-reported performance may often not be indicative of prediction quality and detection of new sites is not guaranteed. We have benchmarked seven published hydroxylation site predictors on two newly constructed independent datasets. The self-reported performance is found to widely overestimate the real accuracy measured on independent datasets. No predictor performs better than random on new examples, indicating the refined models do not sufficiently generalize to detect new sites. The number of false positives is high and precision low, in particular for non-collagen proteins whose motifs are not conserved. As hydroxylation site predictors do not generalize for new data, caution is advised when using PTM predictors in the absence of independent evaluations, in particular for highly specific sites involved in signalling.

Machine learning methods are extensively used by biologists to design and interpret experiments. Predictors which take only the sequence as input are of particular interest due to the large amount of available sequence data and high self-reported performance. In this work, we evaluated post-translational modification (PTM) predictors for hydroxylation sites and found that they perform no better than random, in strong contrast to the performances reported in their original publications. PTMs are chemical amino acid alterations providing the cell with conditional mechanisms to fine-tune protein function, regulating complex biological processes such as signalling and the cell cycle. Hydroxylation sites are a good PTM test case due to the availability of a range of predictors and an abundance of newly experimentally detected modification sites. The poor performances in our results highlight the overlooked problem of predicting PTMs when best practices are not followed and training data are likely incomplete. Experimentalists should be careful when using PTM predictors blindly, and more independent assessments are needed to establish their usefulness in practice.

Mazurenko S, Prokop Z, Damborsky J. Machine Learning in Enzyme Engineering. ACS Catal 2019. [DOI: 10.1021/acscatal.9b04321] [Citation(s) in RCA: 236] [Impact Index Per Article: 39.3] [Indexed: 12/17/2022]

Setting the standards for machine learning in biology. Nat Rev Mol Cell Biol 2019;20:659-660. [DOI: 10.1038/s41580-019-0176-5] [Citation(s) in RCA: 46] [Impact Index Per Article: 7.7] [Indexed: 02/08/2023]

Torrisi M, Kaleel M, Pollastri G. Deeper Profiles and Cascaded Recurrent and Convolutional Neural Networks for state-of-the-art Protein Secondary Structure Prediction. Sci Rep 2019;9:12374. [PMID: 31451723 PMCID: PMC6710256 DOI: 10.1038/s41598-019-48786-x] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 03/26/2019] [Accepted: 08/12/2019] [Indexed: 01/10/2023] Open

Latysheva NS, Babu MM. Molecular Signatures of Fusion Proteins in Cancer. ACS Pharmacol Transl Sci 2019;2:122-133. [PMID: 32219217 PMCID: PMC7088938 DOI: 10.1021/acsptsci.9b00019] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 03/05/2019] [Indexed: 01/07/2023]

Niroula A, Vihinen M. How good are pathogenicity predictors in detecting benign variants? PLoS Comput Biol 2019;15:e1006481. [PMID: 30742610 PMCID: PMC6386394 DOI: 10.1371/journal.pcbi.1006481] [Citation(s) in RCA: 64] [Impact Index Per Article: 10.7] [Received: 08/31/2018] [Revised: 02/22/2019] [Accepted: 12/19/2018] [Indexed: 01/07/2023] Open

Abstract

Computational tools are widely used for interpreting variants detected in sequencing projects. The choice of these tools is critical for reliable variant impact interpretation for precision medicine and should be based on systematic performance assessment. The performance of the methods varies widely in different performance assessments, for example due to the contents and sizes of test datasets. To address this issue, we obtained 63,160 common amino acid substitutions (allele frequency ≥1% and <25%) from the Exome Aggregation Consortium (ExAC) database, which contains variants from 60,706 genomes or exomes. We evaluated the specificity, the capability to detect benign variants, for 10 variant interpretation tools. In addition to overall specificity of the tools, we tested their performance for variants in six geographical populations. PON-P2 had the best performance (95.5%) followed by FATHMM (86.4%) and VEST (83.5%). While these tools had excellent performance, the poorest method predicted more than one third of the benign variants to be disease-causing. The results allow choosing reliable methods for benign variant interpretation, for both research and clinical purposes, as well as provide a benchmark for method developers.

In precision/personalized medicine, it is essential for many conditions to investigate an individual's genome. Interpretation of the observed variation (mutation) sets is feasible only with computational approaches. We assessed the performance of variant pathogenicity/tolerance prediction programs on benign variants. Variants were obtained from the high-quality ExAC database and selected to have a minor allele frequency between 1 and 25%. We obtained 63,160 such cases and investigated 10 widely used predictors. Specificities of the methods showed large differences, from 64 to 96%; thus, users of these methods have to be careful when choosing the one(s) they will use. We further investigated the performances on different populations and allele frequencies, separately for males and females, chromosome-wise, and for population-unique and non-unique variants. The ranking of the tools remained the same in all these scenarios, i.e. the best methods were the best irrespective of how the data were filtered and grouped. This is, to our knowledge, the first large-scale evaluation of method performance on benign variants.

Schaafsma GCP, Vihinen M. Representativeness of variation benchmark datasets. BMC Bioinformatics 2018;19:461. [PMID: 30497376 PMCID: PMC6267811 DOI: 10.1186/s12859-018-2478-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 02/25/2018] [Accepted: 11/09/2018] [Indexed: 12/14/2022] Open

PON-tstab: Protein Variant Stability Predictor. Importance of Training Data Quality. Int J Mol Sci 2018;19:ijms19041009. [PMID: 29597263 PMCID: PMC5979465 DOI: 10.3390/ijms19041009] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 03/05/2018] [Revised: 03/21/2018] [Accepted: 03/24/2018] [Indexed: 12/24/2022] Open

Collaborative representation-based classification of microarray gene expression data. PLoS One 2017;12:e0189533. [PMID: 29236759 PMCID: PMC5728509 DOI: 10.1371/journal.pone.0189533] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/23/2017] [Accepted: 11/27/2017] [Indexed: 11/19/2022] Open

Carraro M, Minervini G, Giollo M, Bromberg Y, Capriotti E, Casadio R, Dunbrack R, Elefanti L, Fariselli P, Ferrari C, Gough J, Katsonis P, Leonardi E, Lichtarge O, Menin C, Martelli PL, Niroula A, Pal LR, Repo S, Scaini MC, Vihinen M, Wei Q, Xu Q, Yang Y, Yin Y, Zaucha J, Zhao H, Zhou Y, Brenner SE, Moult J, Tosatto SCE. Performance of in silico tools for the evaluation of p16INK4a (CDKN2A) variants in CAGI. Hum Mutat 2017;38:1042-1050. [PMID: 28440912 PMCID: PMC5561474 DOI: 10.1002/humu.23235] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 12/19/2016] [Revised: 04/17/2017] [Accepted: 04/19/2017] [Indexed: 12/31/2022]

Affiliation(s)

Marco Carraro: Department of Biomedical Sciences, University of Padova, Padova, Italy
Giovanni Minervini: Department of Biomedical Sciences, University of Padova, Padova, Italy
Manuel Giollo: Department of Biomedical Sciences, University of Padova, Padova, Italy; Department of Information Engineering, University of Padova, Padova, Italy
Yana Bromberg: Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey; Department of Genetics, Rutgers University, Piscataway, New Jersey; Technical University of Munich Institute for Advanced Study (TUM-IAS), Garching/Munich, Germany
Emidio Capriotti: BioFolD Unit, Department of Biological, Geological, and Environmental Sciences (BiGeA), University of Bologna, Bologna, Italy
Rita Casadio: Biocomputing Group, Department of Biological, Geological, and Environmental Sciences (BiGeA), University of Bologna, Bologna, Italy
Roland Dunbrack: Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia, Pennsylvania
Lisa Elefanti: Immunology and Molecular Oncology Unit, Veneto Institute of Oncology, Padua, Italy
Pietro Fariselli: Department of Comparative Biomedicine and Food Science, University of Padua, viale dell'Università 16, 35020, Legnaro (PD), Italy
Carlo Ferrari: Department of Information Engineering, University of Padova, Padova, Italy
Julian Gough: Department of Computer Science, University of Bristol, Bristol, UK
Panagiotis Katsonis: Department of Human and Molecular Genetics, Baylor College of Medicine, Houston, Texas
Emanuela Leonardi: Department of Woman and Child Health, University of Padova, Padova, Italy
Olivier Lichtarge: Department of Human and Molecular Genetics, Baylor College of Medicine, Houston, Texas; Department of Biochemistry & Molecular Biology, Baylor College of Medicine, Houston, Texas; Department of Pharmacology, Baylor College of Medicine, Houston, Texas; Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, Texas
Chiara Menin: Immunology and Molecular Oncology Unit, Veneto Institute of Oncology, Padua, Italy
Pier Luigi Martelli: BioFolD Unit, Department of Biological, Geological, and Environmental Sciences (BiGeA), University of Bologna, Bologna, Italy
Abhishek Niroula: Protein Structure and Bioinformatics Group, Department of Experimental Medical Science, Lund University, Lund, Sweden
Lipika R Pal: Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland
Susanna Repo: EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Maria Chiara Scaini: Immunology and Molecular Oncology Unit, Veneto Institute of Oncology, Padua, Italy
Mauno Vihinen: Protein Structure and Bioinformatics Group, Department of Experimental Medical Science, Lund University, Lund, Sweden
Qiong Wei: Biocomputing Group, Department of Biological, Geological, and Environmental Sciences (BiGeA), University of Bologna, Bologna, Italy
Qifang Xu: Biocomputing Group, Department of Biological, Geological, and Environmental Sciences (BiGeA), University of Bologna, Bologna, Italy
Yuedong Yang: Institute for Glycomics and School of Information and Communication Technology, Griffith University, Gold Coast, Queensland, Australia
Yizhou Yin: Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland; Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, Maryland
Jan Zaucha: Department of Computer Science, University of Bristol, Bristol, UK
Huiying Zhao: Institute of Health and Biomedical Innovation, Queensland University of Technology, Queensland, Australia
Yaoqi Zhou: Institute for Glycomics and School of Information and Communication Technology, Griffith University, Gold Coast, Queensland, Australia
Steven E Brenner: Department of Plant and Microbial Biology, University of California, Berkeley, California
John Moult: Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland; Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, Maryland
Silvio C E Tosatto: Department of Biomedical Sciences, University of Padova, Padova, Italy; CNR Institute of Neuroscience, Padova, Italy

Niroula A, Vihinen M. PON-P and PON-P2 predictor performance in CAGI challenges: Lessons learned. Hum Mutat 2017;38:1085-1091. [PMID: 28224672 DOI: 10.1002/humu.23199] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 08/24/2016] [Revised: 01/25/2017] [Accepted: 02/17/2017] [Indexed: 01/14/2023]

Niroula A, Vihinen M. Predicting Severity of Disease-Causing Variants. Hum Mutat 2017;38:357-364. [PMID: 28070986 DOI: 10.1002/humu.23173] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 10/04/2016] [Revised: 12/07/2016] [Accepted: 01/06/2017] [Indexed: 12/22/2022]

Richard FD, Alves R, Kajava AV. Tally: a scoring tool for boundary determination between repetitive and non-repetitive protein sequences. Bioinformatics 2016;32:1952-8. [PMID: 27153701 DOI: 10.1093/bioinformatics/btw118] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/26/2015] [Accepted: 02/25/2016] [Indexed: 12/23/2022] Open

Bendl J, Musil M, Štourač J, Zendulka J, Damborský J, Brezovský J. PredictSNP2: A Unified Platform for Accurately Evaluating SNP Effects by Exploiting the Different Characteristics of Variants in Distinct Genomic Regions. PLoS Comput Biol 2016;12:e1004962. [PMID: 27224906 PMCID: PMC4880439 DOI: 10.1371/journal.pcbi.1004962] [Citation(s) in RCA: 143] [Impact Index Per Article: 15.9] [Received: 12/16/2015] [Accepted: 05/05/2016] [Indexed: 12/20/2022] Open

Abstract

An important message taken from human genome sequencing projects is that the human population exhibits approximately 99.9% genetic similarity. Variations in the remaining parts of the genome determine our identity, trace our history and reveal our heritage. The precise delineation of phenotypically causal variants plays a key role in providing accurate personalized diagnosis, prognosis, and treatment of inherited diseases. Several computational methods for achieving such delineation have been reported recently. However, their ability to pinpoint potentially deleterious variants is limited by the fact that their mechanisms of prediction do not account for the existence of different categories of variants. Consequently, their output is biased towards the variant categories that are most strongly represented in the variant databases. Moreover, most such methods provide numeric scores but not binary predictions of the deleteriousness of variants or confidence scores that would be more easily understood by users. We have constructed three datasets covering different types of disease-related variants, which were divided across five categories: (i) regulatory, (ii) splicing, (iii) missense, (iv) synonymous, and (v) nonsense variants. These datasets were used to develop category-optimal decision thresholds and to evaluate six tools for variant prioritization: CADD, DANN, FATHMM, FitCons, FunSeq2 and GWAVA. This evaluation revealed some important advantages of the category-based approach. The results obtained with the five best-performing tools were then combined into a consensus score. Additional comparative analyses showed that in the case of missense variations, protein-based predictors perform better than DNA sequence-based predictors. A user-friendly web interface was developed that provides easy access to the five tools’ predictions, and their consensus scores, in a user-understandable format tailored to the specific features of different categories of variations. 
To enable comprehensive evaluation of variants, the predictions are complemented with annotations from eight databases. The web server is freely available to the community at http://loschmidt.chemi.muni.cz/predictsnp2.
