1
Wei Y, Zhang T, Wang B, Jiang X, Ling F, Fang M, Jin X, Bai Y. INDELpred: Improving the prediction and interpretation of indel pathogenicity within the clinical genome. HGG ADVANCES 2024; 5:100325. [PMID: 38993112 PMCID: PMC11321314 DOI: 10.1016/j.xhgg.2024.100325] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Revised: 07/04/2024] [Accepted: 07/04/2024] [Indexed: 07/13/2024] Open
Abstract
Small insertions and deletions (indels) are critical yet challenging genetic variations with significant clinical implications. However, the identification of pathogenic indels from neutral variants in clinical contexts remains an understudied problem. Here, we developed INDELpred, a machine-learning-based predictive model for discerning pathogenic from benign indels. INDELpred was established based on key features, including allele frequency, indel length, function-based features, and gene-based features. A set of comprehensive evaluation analyses demonstrated that INDELpred exhibited superior performance over competing methods in terms of computational efficiency and prediction accuracy. Importantly, INDELpred highlighted the crucial role of function-based features in identifying pathogenic indels, with a clear interpretability of the features in understanding the disease-causing variants. We envisage INDELpred as a desirable tool for the detection of pathogenic indels within large-scale genomic datasets, thereby enhancing the precision of genetic diagnoses in clinical settings.
Affiliation(s)
- Yilin Wei
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China; BGI Research, Shenzhen 518083, China
- Fei Ling
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Xin Jin
- BGI Research, Shenzhen 518083, China; The Innovation Centre of Ministry of Education for Development and Diseases, School of Medicine, South China University of Technology, Guangzhou 510006, China; Shanxi Medical University-BGI Collaborative Center for Future Medicine, Shanxi Medical University, Taiyuan 030001, China; Shenzhen Key Laboratory of Transomics Biotechnologies, BGI Research, Shenzhen, China.
- Yong Bai
- BGI Research, Shenzhen 518083, China.
2
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: trends from three decades of genetic variant impact predictors. Hum Genomics 2024; 18:90. [PMID: 39198917 PMCID: PMC11360829 DOI: 10.1186/s40246-024-00663-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2024] [Accepted: 08/19/2024] [Indexed: 09/01/2024] Open
Abstract
BACKGROUND Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). RESULTS The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past three decades, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 190 VIPs, resulting in a total of 407 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. CONCLUSIONS VIPdb version 2 summarizes 407 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. VIPdb is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
- Arul S Menon
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA
- Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA
- Illumina, Foster City, CA, 94404, USA
- Steven E Brenner
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA.
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA.
- College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA.
- Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA.
3
Zinski J, Chung H, Joshi P, Warrick F, Berg BD, Glova G, McGrail M, Balciunas D, Friedberg I, Mullins M. EpicTope: narrating protein sequence features to identify non-disruptive epitope tagging sites. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.03.03.583232. [PMID: 38559275 PMCID: PMC10979891 DOI: 10.1101/2024.03.03.583232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Epitope tagging is an invaluable technique enabling the identification, tracking, and purification of proteins in vivo. We developed a tool, EpicTope, to facilitate this method by identifying amino acid positions suitable for epitope insertion. Our method uses a scoring function that considers multiple protein sequence and structural features to determine locations least disruptive to the protein's function. We validated our approach on the zebrafish Smad5 protein, showing that multiple predicted internally tagged Smad5 proteins rescue zebrafish smad5 mutant embryos, while the N- and C-terminal tagged variants do not, also as predicted. We further show that the internally tagged Smad5 proteins are accessible to antibodies in wholemount zebrafish embryo immunohistochemistry and by western blot. Our work demonstrates that EpicTope is an accessible and effective tool for designing epitope tag insertion sites. EpicTope is available under a GPL-3 license from: https://github.com/FriedbergLab/Epictope.
Affiliation(s)
- Joseph Zinski
- Department of Cell and Development Biology, University of Pennsylvania Perelman School of Medicine
- Henri Chung
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University
- Program in Bioinformatics and Computational Biology, Iowa State University
- Parnal Joshi
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University
- Program in Bioinformatics and Computational Biology, Iowa State University
- Finn Warrick
- Department of Cell and Development Biology, University of Pennsylvania Perelman School of Medicine
- Greg Glova
- Department of Cell and Development Biology, University of Pennsylvania Perelman School of Medicine
- Maura McGrail
- Department of Genetics, Development and Cell Biology, Iowa State University
- Darius Balciunas
- Department of Biology, Temple University
- Institute of Biotechnology, Life Sciences Center, Vilnius University
- Iddo Friedberg
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University
- Mary Mullins
- Department of Cell and Development Biology, University of Pennsylvania Perelman School of Medicine
4
Shojaei M, Mohammadvand N, Doğan T, Alkan C, Çetin Atalay R, Acar AC. An integrative framework for clinical diagnosis and knowledge discovery from exome sequencing data. Comput Biol Med 2024; 169:107810. [PMID: 38134749 DOI: 10.1016/j.compbiomed.2023.107810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2023] [Revised: 11/06/2023] [Accepted: 12/03/2023] [Indexed: 12/24/2023]
Abstract
Non-silent single nucleotide genetic variants, like nonsense changes and insertion-deletion variants, that affect protein function and length substantially are prevalent and are frequently misclassified. The low sensitivity and specificity of existing variant effect predictors for nonsense and indel variations restrict their use in clinical applications. We propose the Pathogenic Mutation Prediction (PMPred) method to predict the pathogenicity of single nucleotide variations, which impair protein function by prematurely terminating a protein's elongation during its synthesis. The prediction starts by monitoring functional effects (Gene Ontology annotation changes) of the change in sequence, using an existing ensemble machine learning model (UniGOPred). This, in turn, reveals the mutations that significantly deviate functionally from the wild-type sequence. We have identified novel harmful mutations in patient data and present them as motivating case studies. We also show that our method has increased sensitivity and specificity compared to state-of-the-art, especially in single nucleotide variations that produce large functional changes in the final protein. As further validation, we have done a comparative docking study on such a variation that is misclassified by existing methods and, using the altered binding affinities, show how PMPred can correctly predict the pathogenicity when other tools miss it. PMPred is freely accessible as a web service at https://pmpred.kansil.org/, and the related code is available at https://github.com/kansil/PMPred.
Affiliation(s)
- Mona Shojaei
- Cancer Systems Biology Laboratory, Graduate School of Informatics, Middle East Technical University, Ankara 06800 Turkey
- Navid Mohammadvand
- Biological Data Science Lab, Dept. of Computer Engineering, Hacettepe University, Ankara 06800 Turkey
- Tunca Doğan
- Biological Data Science Lab, Dept. of Computer Engineering, Hacettepe University, Ankara 06800 Turkey; Dept. of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara 06800 Turkey
- Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara 06800 Turkey
- Rengül Çetin Atalay
- Department of Medicine, University of Chicago, Chicago, IL, USA; Section of Pulmonary and Critical Care Medicine, University of Chicago, 5841 S. Maryland Avenue, MC6026, Chicago, IL, 60637, USA
- Aybar C Acar
- Cancer Systems Biology Laboratory, Graduate School of Informatics, Middle East Technical University, Ankara 06800 Turkey.
5
Ge F, Arif M, Yan Z, Alahmadi H, Worachartcheewan A, Shoombuatong W. Review of Computational Methods and Database Sources for Predicting the Effects of Coding Frameshift Small Insertion and Deletion Variations. ACS OMEGA 2024; 9:2032-2047. [PMID: 38250421 PMCID: PMC10795160 DOI: 10.1021/acsomega.3c07662] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2023] [Revised: 11/30/2023] [Accepted: 12/04/2023] [Indexed: 01/23/2024]
Abstract
Genetic variations (including substitutions, insertions, and deletions) exert a profound influence on DNA sequences. These variations are systematically classified as synonymous, nonsynonymous, and nonsense, each manifesting distinct effects on proteins. The implementation of high-throughput sequencing has significantly augmented our comprehension of the intricate interplay between gene variations and protein structure and function, as well as their ramifications in the context of diseases. Frameshift variations, particularly small insertions and deletions (indels), disrupt protein coding and are instrumental in disease pathogenesis. This review presents a succinct review of computational methods, databases, current challenges, and future directions in predicting the consequences of coding frameshift small indels variations. We analyzed the predictive efficacy, reliability, and utilization of computational methods and variant account, reliability, and utilization of database. Besides, we also compared the prediction methodologies on GOF/LOF pathogenic variation data. Addressing the challenges pertaining to prediction accuracy and cross-species generalizability, nascent technologies such as AI and deep learning harbor immense potential to enhance predictive capabilities. The importance of interdisciplinary research and collaboration cannot be overstated for devising effective diagnosis, treatment, and prevention strategies concerning diseases associated with coding frameshift indels variations.
Affiliation(s)
- Fang Ge
- State Key Laboratory of Organic Electronics and Information Displays & Institute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, 9 Wenyuan Road, Nanjing 210023, China
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Muhammad Arif
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
- Zihao Yan
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
- Hanin Alahmadi
- College of Computer Science and Engineering, Taibah University, Madinah 344, Saudi Arabia
- Apilak Worachartcheewan
- Department of Community Medical Technology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Watshara Shoombuatong
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
6
Yue Z, Xiang Y, Chen G, Wang X, Li K, Zhang Y. PredinID: Predicting Pathogenic Inframe Indels in Human Through Graph Convolution Neural Network With Graph Sampling Technique. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3226-3233. [PMID: 37040252 DOI: 10.1109/tcbb.2023.3266232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Inframe insertion/deletion (indel) variants may alter protein sequence and function, which are closely related to an extensive variety of diseases. Although recent researches have paid attention to the associations between inframe indels and diseases, modeling indels in silico and interpreting their pathogenicity remain challenging, mainly due to the lack of experimental information and computational methodologies. In this article, we propose a novel computational method named PredinID (Predictor for inframe InDels) via graph convolutional network (GCN). PredinID leverages k-nearest neighbor algorithm to construct the feature graph for aggregating more informative representation, regarding the pathogenic inframe indel prediction as a node classification task. An edge-based sampling strategy is designed for extracting information from both the potential connections of feature space and the topological structure of subgraphs. Evaluated by 5-fold cross-validations, the PredinID method achieves satisfactory performance and is superior to four classic machine learning algorithms and two GCN methods. Comprehensive experiments show that PredinID has superior performances when compared with the state-of-the-art methods on the independent test set. Moreover, we also implement a web server at http://predinid.bio.aielab.cc/, to facilitate the use of the model.
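The abstract above mentions building a feature graph with a k-nearest neighbor algorithm. As a rough illustration only, and not the PredinID implementation, the sketch below builds a symmetric kNN adjacency matrix from a feature matrix. The function name `knn_graph`, the Euclidean distance, and the symmetrization step are all assumptions made for this example.

```python
import numpy as np

def knn_graph(features, k):
    """Return a symmetric 0/1 adjacency matrix linking each row to its k nearest rows.

    Hypothetical helper for illustration; distance metric (Euclidean) is an assumption.
    """
    x = np.asarray(features, dtype=float)
    n = x.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of samples")
    # Pairwise squared Euclidean distances between all rows.
    diff = x[:, None, :] - x[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    # A sample is never its own neighbor.
    np.fill_diagonal(dist, np.inf)
    # Indices of the k closest rows for every row.
    neighbors = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=int)
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbors.ravel()] = 1
    # Make the graph undirected: keep an edge if either endpoint selected it.
    return np.maximum(adj, adj.T)
```

A graph like this could then be passed to a node-classification model, with each node standing for one indel; how PredinID samples edges and subgraphs is not reproduced here.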
7
Banerjee A, Bahar I. Structural Dynamics Predominantly Determine the Adaptability of Proteins to Amino Acid Deletions. Int J Mol Sci 2023; 24:8450. [PMID: 37176156 PMCID: PMC10179678 DOI: 10.3390/ijms24098450] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Revised: 05/01/2023] [Accepted: 05/06/2023] [Indexed: 05/15/2023] Open
Abstract
The insertion or deletion (indel) of amino acids has a variety of effects on protein function, ranging from disease-forming changes to gaining new functions. Despite their importance, indels have not been systematically characterized towards protein engineering or modification goals. In the present work, we focus on deletions composed of multiple contiguous amino acids (mAA-dels) and their effects on the protein (mutant) folding ability. Our analysis reveals that the mutant retains the native fold when the mAA-del obeys well-defined structural dynamics properties: localization in intrinsically flexible regions, showing low resistance to mechanical stress, and separation from allosteric signaling paths. Motivated by the possibility of distinguishing the features that underlie the adaptability of proteins to mAA-dels, and by the rapid evaluation of these features using elastic network models, we developed a positive-unlabeled learning-based classifier that can be adopted for protein design purposes. Trained on a consolidated set of features, including those reflecting the intrinsic dynamics of the regions where the mAA-dels occur, the new classifier yields a high recall of 84.3% for identifying mAA-dels that are stably tolerated by the protein. The comparative examination of the relative contribution of different features to the prediction reveals the dominant role of structural dynamics in enabling the adaptation of the mutant to mAA-del without disrupting the native fold.
Affiliation(s)
- Anupam Banerjee
- Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY 11794, USA
- Ivet Bahar
- Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY 11794, USA
- Department of Biochemistry and Cell Biology, Stony Brook University, Stony Brook, NY 11794, USA
8
Katsonis P, Wilhelm K, Williams A, Lichtarge O. Genome interpretation using in silico predictors of variant impact. Hum Genet 2022; 141:1549-1577. [PMID: 35488922 PMCID: PMC9055222 DOI: 10.1007/s00439-022-02457-6] [Citation(s) in RCA: 39] [Impact Index Per Article: 13.0] [Received: 08/02/2021] [Accepted: 04/17/2022] [Indexed: 02/06/2023]
Abstract
Estimating the effects of variants found in disease driver genes opens the door to personalized therapeutic opportunities. Clinical associations and laboratory experiments can only characterize a tiny fraction of all the available variants, leaving the majority as variants of unknown significance (VUS). In silico methods bridge this gap by providing instant estimates on a large scale, most often based on the numerous genetic differences between species. Despite concerns that these methods may lack reliability in individual subjects, their numerous practical applications over cohorts suggest they are already helpful and have a role to play in genome interpretation when used at the proper scale and context. In this review, we aim to gain insights into the training and validation of these variant effect predicting methods and illustrate representative types of experimental and clinical applications. Objective performance assessments using various datasets that are not yet published indicate the strengths and limitations of each method. These show that cautious use of in silico variant impact predictors is essential for addressing genome interpretation challenges.
Affiliation(s)
- Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX, 77030, USA.
- Kevin Wilhelm
- Graduate School of Biomedical Sciences, Baylor College of Medicine, One Baylor Plaza, Houston, TX, 77030, USA
- Amanda Williams
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX, 77030, USA
- Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX, 77030, USA.
- Department of Biochemistry, Human Genetics and Molecular Biology, Baylor College of Medicine, One Baylor Plaza, Houston, TX, 77030, USA.
- Department of Pharmacology, Baylor College of Medicine, One Baylor Plaza, Houston, TX, 77030, USA.
- Computational and Integrative Biomedical Research Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, 77030, USA.
9
Nie L, Quan L, Wu T, He R, Lyu Q. TransPPMP: predicting pathogenicity of frameshift and non-sense mutations by a Transformer based on protein features. Bioinformatics 2022; 38:2705-2711. [PMID: 35561183 DOI: 10.1093/bioinformatics/btac188] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/05/2021] [Revised: 01/04/2022] [Accepted: 03/26/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Protein structure can be severely disrupted by frameshift and non-sense mutations at specific positions in the protein sequence. Frameshift and non-sense mutation cases can also be found in healthy individuals. A method to distinguish neutral and potentially disease-associated frameshift and non-sense mutations is of practical and fundamental importance. It would allow researchers to rapidly screen out the potentially pathogenic sites from a large number of mutated genes and then use these sites as drug targets to speed up diagnosis and improve access to treatment. The problem of how to distinguish between neutral and potentially disease-associated frameshift and non-sense mutations remains under-researched. RESULTS We built a Transformer-based neural network model to predict the pathogenicity of frameshift and non-sense mutations on protein features and named it TransPPMP. The feature matrix of contextual sequences computed by the ESM pre-training model, type of mutation residue and the auxiliary features, including structure and function information, are combined as input features, and the focal loss function is designed to solve the sample imbalance problem during the training. In 10-fold cross-validation and independent blind test set, TransPPMP showed good robust performance and absolute advantages in all evaluation metrics compared with four other advanced methods, namely, ENTPRISE-X, VEST-indel, DDIG-in and CADD. In addition, we demonstrate the usefulness of the multi-head attention mechanism in Transformer to predict the pathogenicity of mutations: not only can multiple self-attention heads learn local and global interactions but also functional sites with a large influence on the mutated residue can be captured by attention focus. These could offer useful clues to study the pathogenicity mechanism of human complex diseases for which traditional machine learning methods fall short.
AVAILABILITY AND IMPLEMENTATION TransPPMP is available at https://github.com/lennylv/TransPPMP. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
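The abstract above refers to a focal loss used against sample imbalance. For readers unfamiliar with it, the sketch below shows the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), in NumPy. It is a generic illustration rather than the TransPPMP code; the function name and the default values of `gamma` and `alpha` are assumptions and are not taken from the paper.

```python
import numpy as np

def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss over a batch.

    probs  : predicted probability of the positive class, one value per sample
    labels : 0/1 ground-truth labels
    gamma  : focusing parameter (assumed default); larger values down-weight easy samples
    alpha  : weight of the positive class (assumed default); negatives get 1 - alpha
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=float)
    # Probability the model assigns to the true class of each sample.
    p_t = np.where(labels == 1, probs, 1.0 - probs)
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified samples.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

With `gamma=0` the expression reduces to class-weighted cross-entropy, which is a convenient sanity check when tuning the focusing parameter.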
Affiliation(s)
- Liangpeng Nie
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Lijun Quan
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
- Tingfang Wu
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
- Ruji He
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Qiang Lyu
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
10
Spielmann M, Kircher M. Computational and experimental methods for classifying variants of unknown clinical significance. Cold Spring Harb Mol Case Stud 2022; 8:mcs.a006196. [PMID: 35483875 PMCID: PMC9059783 DOI: 10.1101/mcs.a006196] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/26/2022] Open
Abstract
The increase in sequencing capacity, reduction in costs, and national and international coordinated efforts have led to the widespread introduction of next-generation sequencing (NGS) technologies in patient care. More generally, human genetics and genomic medicine are gaining importance for more and more patients. Some communities are already discussing the prospect of sequencing each individual's genome at time of birth. Together with digital health records, this shall enable individualized treatments and preventive measures, so-called precision medicine. A central step in this process is the identification of disease causal mutations or variant combinations that make us more susceptible for diseases. Although various technological advances have improved the identification of genetic alterations, the interpretation and ranking of the identified variants remains a major challenge. Based on our knowledge of molecular processes or previously identified disease variants, we can identify potentially functional genetic variants and, using different lines of evidence, we are sometimes able to demonstrate their pathogenicity directly. However, the vast majority of variants are classified as variants of uncertain clinical significance (VUSs) with not enough experimental evidence to determine their pathogenicity. In these cases, computational methods may be used to improve the prioritization and an increasing toolbox of experimental methods is emerging that can be used to assay the molecular effects of VUSs. Here, we discuss how computational and experimental methods can be used to create catalogs of variant effects for a variety of molecular and cellular phenotypes. We discuss the prospects of integrating large-scale functional data with machine learning and clinical knowledge for the development of accurate pathogenicity predictions for clinical applications.
Affiliation(s)
- Malte Spielmann
- Institute of Human Genetics, University of Lübeck, 23562 Lübeck, Germany; Institute of Human Genetics, Christian-Albrechts-Universität, 24105 Kiel, Germany; Human Molecular Genomics Group, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany; DZHK (German Centre for Cardiovascular Research), partner site Hamburg/Lübeck/Kiel, 23562 Lübeck, Germany
- Martin Kircher
- Institute of Human Genetics, University of Lübeck, 23562 Lübeck, Germany; Berlin Institute of Health at Charité—Universitätsmedizin Berlin, 10117 Berlin, Germany; DZHK (German Centre for Cardiovascular Research), partner site Berlin, 10115 Berlin, Germany
11
Protein secondary structure prediction using a lightweight convolutional network and label distribution aware margin loss. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107771] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/14/2023]
12
Ho CT, Huang YW, Chen TR, Lo CH, Lo WC. Discovering the Ultimate Limits of Protein Secondary Structure Prediction. Biomolecules 2021; 11:1627. [PMID: 34827624 PMCID: PMC8615938 DOI: 10.3390/biom11111627] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 09/21/2021] [Revised: 10/25/2021] [Accepted: 10/28/2021] [Indexed: 12/29/2022] Open
Abstract
Secondary structure prediction (SSP) of proteins is an important structural biology technique with many applications. There have been ~300 algorithms published in the past seven decades with fierce competition in accuracy. In the first 60 years, the accuracy of three-state SSP rose from ~56% to 81%; after that, it has long stayed at 81-86%. In the 1990s, the theoretical limit of three-state SSP accuracy had been estimated to be 88%. Thus, SSP is now generally considered not challenging or too challenging to improve. However, we found that the limit of three-state SSP might be underestimated. Besides, there is still much room for improving segment-based and eight-state SSPs, but the limits of these emerging topics have not been determined. This work performs large-scale sequence and structural analyses to estimate SSP accuracy limits and assess state-of-the-art SSP methods. The limit of three-state SSP is re-estimated to be ~92%, 4-5% higher than previously expected, indicating that SSP is still challenging. The estimated limit of eight-state SSP is 84-87%. Several proposals for improving future SSP algorithms are made based on our results. We hope that these findings will help move forward the development of SSP and all its applications.
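The accuracy figures quoted above for three-state and eight-state SSP are per-residue agreement rates. As a hedged illustration of how such a figure is typically computed, the sketch below reduces eight-state labels to three states and scores a prediction. The mapping shown is one common convention (others exist), and the names `EIGHT_TO_THREE`, `reduce_to_three_state`, and `q3_accuracy` are hypothetical; this is not the evaluation code of the cited work.

```python
# One common (but not universal) reduction of eight-state labels to three states:
# helices (H, G, I) -> H, strands (E, B) -> E, everything else -> C (coil).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "-": "C",
}

def reduce_to_three_state(ss8):
    """Map an eight-state string to three states; unknown symbols become coil."""
    return "".join(EIGHT_TO_THREE.get(symbol, "C") for symbol in ss8)

def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted three-state label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)
```

For example, a prediction that agrees with the observed assignment at 7 of 8 residues scores 0.875 under this definition.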
Affiliation(s)
- Chia-Tzu Ho
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- Yu-Wei Huang
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- Teng-Ruei Chen
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- Chia-Hua Lo
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- Wei-Cheng Lo
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- The Center for Bioinformatics Research, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
13
Chen J, Guo JT. Structural and functional analysis of somatic coding and UTR indels in breast and lung cancer genomes. Sci Rep 2021; 11:21178. [PMID: 34707120 PMCID: PMC8551294 DOI: 10.1038/s41598-021-00583-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2021] [Accepted: 10/14/2021] [Indexed: 11/24/2022] Open
Abstract
Insertions and deletions (Indels) represent one of the major variation types in the human genome and have been implicated in diseases including cancer. To study the features of somatic indels in different cancer genomes, we investigated the indels from two large samples of cancer types: invasive breast carcinoma (BRCA) and lung adenocarcinoma (LUAD). Besides mapping somatic indels in both coding and untranslated regions (UTRs) from the cancer whole exome sequences, we investigated the overlap between these indels and transcription factor binding sites (TFBSs), the key elements for regulation of gene expression that have been found in both coding and non-coding sequences. Compared to the germline indels in healthy genomes, somatic indels contain more coding indels with higher than expected frame-shift (FS) indels in cancer genomes. LUAD has a higher ratio of deletions and higher coding and FS indel rates than BRCA. More importantly, these somatic indels in cancer genomes tend to locate in sequences with important functions, which can affect the core secondary structures of proteins and have a bigger overlap with predicted TFBSs in coding regions than the germline indels. The somatic CDS indels are also enriched in highly conserved nucleotides when compared with germline CDS indels.
Affiliation(s)
- Jing Chen
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
- Jun-Tao Guo
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
14
Kinoshita S, Ando M, Ando J, Ishii M, Furukawa Y, Tomita O, Azusawa Y, Shirane S, Kishita Y, Yatsuka Y, Eguchi H, Okazaki Y, Komatsu N. Trigenic ADH5/ALDH2/ADGRV1 mutations in myelodysplasia with Usher syndrome. Heliyon 2021; 7:e07804. [PMID: 34458631 PMCID: PMC8379464 DOI: 10.1016/j.heliyon.2021.e07804] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2021] [Revised: 07/19/2021] [Accepted: 08/12/2021] [Indexed: 11/27/2022] Open
Abstract
Trio next-generation sequencing is useful to identify undiagnosed inherited diseases. We attended a patient with trigenic ADH5/ALDH2/ADGRV1 pathogenic variants, which caused two distinct diseases, myelodysplastic syndrome and Usher syndrome. Whole genome sequencing of peripheral blood from the patient and his parents was applied to identify disease-causing genes. Sanger sequencing was performed to validate the identified ADH5/ALDH2/ADGRV1 variants. Our results identified disease-associated variants in ADGRV1 (disease inheritance autosomal recessive) and in ADH5 (disease inheritance also autosomal recessive) and a variant in ALDH2 (disease inheritance autosomal dominant). Although the variants identified in ADH5 and ALDH2 have been reported, their co-existence in association with disease-causing variation in a third gene has not. They broaden the spectrum of ADGRV1 in Usher syndrome. Findings on next-generation sequencing guided rapid and accurate diagnosis, resulting in patient-tailored therapeutic intervention. Trigenic ADH5/ALDH2/ADGRV1 variants in myelodysplastic syndrome with Usher syndrome were identified. Two novel pathogenic frameshift variants in ADGRV1 in compound heterozygous state with Usher syndrome type II were described. Findings on next-generation sequencing guided rapid and accurate diagnosis, resulting in patient-tailored therapy.
Affiliation(s)
- Shintaro Kinoshita
  - Department of Hematology, Juntendo University School of Medicine, Tokyo, Japan
- Miki Ando
  - Department of Hematology, Juntendo University School of Medicine, Tokyo, Japan
  - Division of Stem Cell Therapy, Distinguished Professor Unit, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Jun Ando
  - Department of Hematology, Juntendo University School of Medicine, Tokyo, Japan
  - Department of Transfusion Medicine and Stem Cell Regulation, Juntendo University School of Medicine, Tokyo, Japan
- Midori Ishii
  - Department of Hematology, Juntendo University School of Medicine, Tokyo, Japan
- Yoshiki Furukawa
  - Department of Hematology, Juntendo University School of Medicine, Tokyo, Japan
- Osamu Tomita
  - Department of Pediatrics, Juntendo University School of Medicine, Tokyo, Japan
- Yoko Azusawa
  - Department of Transfusion Medicine and Stem Cell Regulation, Juntendo University School of Medicine, Tokyo, Japan
- Shuichi Shirane
  - Department of Hematology, Juntendo University School of Medicine, Tokyo, Japan
- Yoshihito Kishita
  - Diagnostic and Therapeutics of Intractable Diseases, Graduate School of Medicine and Intractable Disease Research Center, Juntendo University, Tokyo, Japan
- Yukiko Yatsuka
  - Diagnostic and Therapeutics of Intractable Diseases, Graduate School of Medicine and Intractable Disease Research Center, Juntendo University, Tokyo, Japan
- Hidetaka Eguchi
  - Diagnostic and Therapeutics of Intractable Diseases, Graduate School of Medicine and Intractable Disease Research Center, Juntendo University, Tokyo, Japan
- Yasushi Okazaki
  - Diagnostic and Therapeutics of Intractable Diseases, Graduate School of Medicine and Intractable Disease Research Center, Juntendo University, Tokyo, Japan
- Norio Komatsu
  - Department of Hematology, Juntendo University School of Medicine, Tokyo, Japan
15
Ohara K, Kinoshita S, Ando J, Azusawa Y, Ishii M, Harada S, Mitsuishi Y, Asao T, Tajima K, Yamamoto T, Takahashi F, Komatsu N, Takahashi K, Ando M. SCLC-J1, a novel small cell lung cancer cell line. Biochem Biophys Rep 2021; 27:101089. [PMID: 34381882 PMCID: PMC8339127 DOI: 10.1016/j.bbrep.2021.101089] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/04/2021] [Revised: 07/16/2021] [Accepted: 07/22/2021] [Indexed: 11/24/2022] Open
Abstract
Small cell lung cancer (SCLC) is a type of high-grade neuroendocrine carcinoma. It is highly proliferative and initially responds to chemotherapy but rapidly becomes chemoresistant. The prognosis in SCLC is poor. We have established a novel SCLC cell line, SCLC-J1, from a malignant pleural effusion in a patient with advanced SCLC. SCLC-J1 cells express ganglioside GD2, CD276, and Delta-like protein 3. RB1 is lost. These features of the new SCLC cell line may be useful in understanding the cellular and molecular biology of SCLC and in designing better treatment. A novel small cell lung cancer cell line, SCLC-J1, was successfully established. SCLC-J1 cells express the tumor-specific antigens ganglioside GD2, CD276, and Delta-like protein 3. RB1 is lost. SCLC-J1 will provide insights into SCLC biology that may permit better therapeutic targeting.
Affiliation(s)
- Kazuo Ohara
  - Department of Hematology, Juntendo University School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, 113-8421, Japan
- Shintaro Kinoshita
  - Department of Hematology, Juntendo University School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, 113-8421, Japan
- Jun Ando
  - Department of Hematology, Juntendo University School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, 113-8421, Japan
  - Department of Transfusion Medicine and Stem Cell Regulation, Japan
- Yoko Azusawa
  - Department of Transfusion Medicine and Stem Cell Regulation, Japan
- Midori Ishii
  - Department of Hematology, Juntendo University School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, 113-8421, Japan
- Sakiko Harada
  - Department of Hematology, Juntendo University School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, 113-8421, Japan
- Yoichiro Mitsuishi
  - Department of Respiratory Medicine, Juntendo University School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, 113-8421, Japan
- Tetsuhiko Asao
  - Department of Respiratory Medicine, Juntendo University School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, 113-8421, Japan
- Ken Tajima
  - Department of Respiratory Medicine, Juntendo University School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, 113-8421, Japan
- Taketsugu Yamamoto
  - Department of Thoracic Surgery, Yokohama Rosai Hospital, 3211, Kozukue, Kohoku-ku, Yokohama, Kanagawa, Japan
- Fumiyuki Takahashi
  - Department of Respiratory Medicine, Juntendo University School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, 113-8421, Japan
- Norio Komatsu
  - Department of Hematology, Juntendo University School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, 113-8421, Japan
- Kazuhisa Takahashi
  - Department of Respiratory Medicine, Juntendo University School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, 113-8421, Japan
- Miki Ando
  - Department of Hematology, Juntendo University School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, 113-8421, Japan
  - Division of Stem Cell Therapy, Distinguished Professor Unit, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
16
Shen Y, Zhang Y, Xue W, Yue Z. dbMCS: A Database for Exploring the Mutation Markers of Anti-Cancer Drug Sensitivity. IEEE J Biomed Health Inform 2021; 25:4229-4237. [PMID: 34314366 DOI: 10.1109/jbhi.2021.3100424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
Abstract
The identification of mutation markers and the selection of appropriate treatment for patients with specific genome mutations are important steps in the development of targeted therapies and the realization of precision medicine for human cancers. To investigate the baseline characteristics of drug sensitivity markers and develop computational methods of mutation effect prediction, we present a manually curated online database of mutation Markers for anti-Cancer drug Sensitivity (dbMCS). Currently, dbMCS contains 1271 mutations and 4427 mutation-disease-drug associations (3151 and 1276 for sensitivity and resistance, respectively) with their PubMed-indexed articles. By comparing the mutations in dbMCS with putative neutral polymorphisms, we investigated the characteristics of drug sensitivity markers. We found that the mutation markers tend to affect highly conserved regions in both DNA sequences and protein domains. Some of them showed pleiotropic effects depending on the tumor context, appearing in both the sensitivity and resistance categories. In addition, we carried out a preliminary exploration of machine learning-based methods for identifying mutation markers of anti-cancer drug sensitivity and obtained promising results, which suggests that a reliable dataset may provide new insights and essential clues for future cancer pharmacogenomics studies. dbMCS is available at http://bioinfo.aielab.cc/dbMCS/.
17
Chen TR, Lo CH, Juan SH, Lo WC. The influence of dataset homology and a rigorous evaluation strategy on protein secondary structure prediction. PLoS One 2021; 16:e0254555. [PMID: 34260641 PMCID: PMC8279362 DOI: 10.1371/journal.pone.0254555] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/16/2020] [Accepted: 06/29/2021] [Indexed: 11/28/2022] Open
Abstract
The secondary structure prediction (SSP) of proteins has long been an essential structural biology technique with various applications. Despite its vital role in many research and industrial fields, in recent years, as the accuracy of state-of-the-art secondary structure predictors approaches the theoretical upper limit, SSP has been considered no longer challenging or too challenging to make advances. With the belief that the substantial improvement of SSP will move forward many fields depending on it, we conducted this study, which focused on three issues that have not been noticed or thoroughly examined yet but may have affected the reliability of the evaluation of previous SSP algorithms. These issues are all about the sequence homology between or within the developmental and evaluation datasets. We thus designed many different homology layouts of datasets to train and evaluate SSP prediction models. Multiple repeats were performed in each experiment by random sampling. The conclusions obtained with small experimental datasets were verified with large-scale datasets using state-of-the-art SSP algorithms. Very different from the long-established assumption, we discover that the sequence homology between query datasets for training, testing, and independent tests exerts little influence on SSP accuracy. Besides, the sequence homology redundancy between or within most datasets would make the accuracy of an SSP algorithm overestimated, while the redundancy within the reference dataset for extracting predictive features would make the accuracy underestimated. Since the overestimating effects are more significant than the underestimating effect, the accuracy of some SSP methods might have been overestimated. 
Based on the discoveries, we propose a rigorous procedure for developing SSP algorithms and making reliable evaluations, hoping to bring substantial improvements to future SSP methods and benefit all research and application fields relying on accurate prediction of protein secondary structures.
Affiliation(s)
- Teng-Ruei Chen
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
  - Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Chia-Hua Lo
  - Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
- Sheng-Hung Juan
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Wei-Cheng Lo
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
  - Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
  - Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
  - Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
  - The Center for Bioinformatics Research, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
18
Seaby EG, Ennis S. Challenges in the diagnosis and discovery of rare genetic disorders using contemporary sequencing technologies. Brief Funct Genomics 2021; 19:243-258. [PMID: 32393978 DOI: 10.1093/bfgp/elaa009] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 12/13/2022] Open
Abstract
Next generation sequencing (NGS) has revolutionised rare disease diagnostics. Concomitant with advancing technologies has been a rise in the number of new gene disorders discovered and diagnoses made for patients and their families. However, despite the trend towards whole exome and whole genome sequencing, diagnostic rates remain suboptimal. On average, only ~30% of patients receive a molecular diagnosis. National sequencing projects launched in the last 5 years are integrating clinical diagnostic testing with research avenues to widen the spectrum of known genetic disorders. Consequently, efforts to diagnose genetic disorders in a clinical setting are now often shared with efforts to prioritise candidate variants for the detection of new disease genes. Herein we discuss some of the biggest obstacles precluding molecular diagnosis and discovery of new gene disorders. We consider bioinformatic and analytical challenges faced when interpreting next generation sequencing data and showcase some of the newest tools available to mitigate these issues. We consider how incomplete penetrance, non-coding variation and structural variants are likely to impact diagnostic rates, and we further discuss methods for uplifting novel gene discovery by adopting a gene-to-patient-based approach.
19
Zhou Y, Lauschke VM. Computational Tools to Assess the Functional Consequences of Rare and Noncoding Pharmacogenetic Variability. Clin Pharmacol Ther 2021; 110:626-636. [PMID: 33998671 DOI: 10.1002/cpt.2289] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 04/01/2021] [Accepted: 05/07/2021] [Indexed: 12/19/2022]
Abstract
Interindividual differences in drug response are a common concern in both drug development and across layers of care. While genetics clearly influences drug response and toxicity of many drugs, a substantial fraction of the heritable pharmacological and toxicological variability remains unexplained by known genetic polymorphisms. In recent years, population-scale sequencing projects have unveiled tens of thousands of coding and noncoding pharmacogenetic variants with unclear functional effects that might explain at least part of this missing heritability. However, translating these personalized variant signatures into drug response predictions and actionable advice remains challenging and constitutes one of the most important frontiers of contemporary pharmacogenomics. Conventional prediction methods are primarily based on evolutionary conservation, which drastically reduces their predictive accuracy when applied to poorly conserved pharmacogenes. Here, we review the current state-of-the-art of computational variant effect predictors across variant classes and critically discuss their utility for pharmacogenomics. Besides missense variants, we discuss recent progress in the evaluation of synonymous, splice, and noncoding variations. Furthermore, we discuss emerging possibilities to assess haplotypes and structural variations. We advocate for the development of algorithms trained on pharmacogenomic instead of pathogenic data sets to improve the predictive accuracy in order to facilitate the utilization of next-generation sequencing data for personalized clinical decision support and precision pharmacogenomics.
Affiliation(s)
- Yitian Zhou
  - Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
- Volker M Lauschke
  - Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
20
Sarkar A, Yang Y, Vihinen M. Variation benchmark datasets: update, criteria, quality and applications. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020; 2020:5710862. [PMID: 32016318 PMCID: PMC6997940 DOI: 10.1093/database/baz117] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 04/12/2019] [Revised: 06/03/2019] [Accepted: 07/01/2019] [Indexed: 02/07/2023]
Abstract
Development of new computational methods and testing their performance has to be carried out using experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcome are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time consuming to generate. VariBench and VariSNP are the two existing databases for sharing variation benchmark datasets used mainly for variation interpretation. They have been used for training and benchmarking predictors for various types of variations and their effects. VariBench was updated with 419 new datasets from 109 papers containing altogether 329 014 152 variants; however, there is plenty of redundancy between the datasets. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. The contents of the datasets vary depending on information in the original source. The available datasets have been categorized into 20 groups and subgroups. There are datasets for insertions and deletions, substitutions in coding and non-coding region, structure mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein property for aggregation, binding free energy, disorder and stability. Then there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level including relevant cross references and variant descriptions. The updated VariBench facilitates development and testing of new methods and comparison of obtained performances to previously published methods. 
We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies and show that such comparisons are feasible and useful; however, there may be limitations due to a lack of provided details and shared data. Database URL: http://structure.bmc.lu.se/VariBench
Affiliation(s)
- Anasua Sarkar
  - Department of Experimental Medical Science, BMC B13, Lund University, SE-22 184 Lund, Sweden
- Yang Yang
  - School of Computer Science and Technology, Soochow University, No1. Shizi Street, Suzhou, 215006 Jiangsu, China
  - Provincial Key Laboratory for Computer Information Processing Technology, No1. Shizi Street, Soochow University, Suzhou, 215006 Jiangsu, China
- Mauno Vihinen
  - Department of Experimental Medical Science, BMC B13, Lund University, SE-22 184 Lund, Sweden
21
Chen J, Guo JT. Comparative assessments of indel annotations in healthy and cancer genomes with next-generation sequencing data. BMC Med Genomics 2020; 13:170. [PMID: 33167946 PMCID: PMC7653722 DOI: 10.1186/s12920-020-00818-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/18/2020] [Accepted: 10/29/2020] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND: Insertion and deletion (indel) is one of the major variation types in human genomes. Accurate annotation of indels is of paramount importance in genetic variation analysis and investigation of their roles in human diseases. Previous studies revealed a high number of false positives from existing indel calling methods, which limits downstream analyses of the effects of indels on both healthy and disease genomes. In this study, we evaluated seven commonly used general indel calling programs for germline indels and four somatic indel calling programs through comparative analysis to investigate their common features and differences and to explore ways to improve indel annotation accuracy.
METHODS: In our comparative analysis, we adopted a more stringent evaluation approach by considering both the indel positions and the indel types (insertion or deletion sequences) between the samples and the reference set. In addition, we applied an efficient way to use a benchmark for improved performance comparisons for the general indel calling programs.
RESULTS: We found that germline indels in healthy genomes derived by combining several indel calling tools could help remove a large number of false positive indels from individual programs without compromising the number of true positives. The performance comparisons of somatic indel calling programs are more complicated due to the lack of a reliable and comprehensive benchmark. Nevertheless, our results revealed large variations among the programs and among cancer types.
CONCLUSIONS: While more accurate indel calling programs are needed, we found that the performance for germline indel annotations can be improved by combining the results from several programs. In addition, well-designed benchmarks for both germline and somatic indels are key in program development and evaluations.
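The idea of combining several callers while matching indels on both position and type can be illustrated with a small Python sketch. This is an assumption-laden illustration, not the authors' pipeline: the abstract does not state how calls were combined, so the voting threshold `min_callers`, the `Indel` record, and the function `consensus_indels` are all hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set


class Indel(NamedTuple):
    """Hypothetical record: two calls match only if every field is identical."""
    chrom: str
    pos: int
    kind: str  # "INS" or "DEL"
    seq: str   # inserted or deleted bases


def consensus_indels(call_sets: Iterable[Iterable[Indel]],
                     min_callers: int = 2) -> Set[Indel]:
    """Keep indels reported by at least `min_callers` programs (assumed rule)."""
    votes = Counter()
    for calls in call_sets:
        # Deduplicate within a caller so each program votes once per indel.
        votes.update(set(calls))
    return {indel for indel, n in votes.items() if n >= min_callers}


# Made-up calls from three hypothetical programs.
caller_a = [Indel("chr1", 100, "DEL", "AT"), Indel("chr1", 200, "INS", "G")]
caller_b = [Indel("chr1", 100, "DEL", "AT"), Indel("chr1", 200, "INS", "GG")]
caller_c = [Indel("chr1", 100, "DEL", "AT")]

print(consensus_indels([caller_a, caller_b, caller_c]))
```

In this toy input, the two insertions at position 200 share a position but differ in sequence, so under the stricter position-plus-type matching they do not support each other and only the deletion at position 100 is retained.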
Affiliation(s)
- Jing Chen
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC, 28223, USA
- Jun-Tao Guo
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC, 28223, USA
22
Yang DP, Lu HP, Chen G, Yang J, Gao L, Song JH, Chen SW, Mo JX, Kong JL, Tang ZQ, Li CB, Zhou HF, Yang LJ. Integrated expression analysis revealed RUNX2 upregulation in lung squamous cell carcinoma tissues. IET Syst Biol 2020; 14:252-260. [PMID: 33095746 PMCID: PMC8687175 DOI: 10.1049/iet-syb.2020.0063] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 06/09/2020] [Revised: 08/01/2020] [Accepted: 08/10/2020] [Indexed: 12/14/2022] Open
Abstract
This study aimed to investigate the clinicopathological significance and prospective molecular mechanism of RUNX family transcription factor 2 (RUNX2) in lung squamous cell carcinoma (LUSC). The authors used immunohistochemistry (IHC), RNA-seq, and microarray data from multi-platforms to conduct a comprehensive analysis of the clinicopathological significance and molecular mechanism of RUNX2 in the occurrence and development of LUSC. RUNX2 expression was significantly higher in 16 LUSC tissues than in paired non-cancerous tissues detected by IHC (P < 0.05). RNA-seq data from the combination of TCGA and genotype-tissue expression (GTEx) revealed significantly higher expression of RUNX2 in 502 LUSC samples than in 476 non-cancer samples. The expression of RUNX2 protein was also significantly higher in pathologic T3-T4 than in T1-T2 samples (P = 0.031). The pooled standardised mean difference (SMD) for RUNX2 was 0.87 (95% CI, 0.58-1.16), including 29 microarrays from GEO and one from ArrayExpress. The co-expression network of RUNX2 revealed complicated connections between RUNX2 and 45 co-expressed genes, which were significantly clustered in pathways including ECM-receptor interaction, focal adhesion, protein digestion and absorption, human papillomavirus infection and PI3K-Akt signalling pathway. Overexpression of RUNX2 plays an essential role in the clinical progression of LUSC.
Affiliation(s)
- Da-Ping Yang
  - Department of Pathology, The Eighth Affiliated Hospital of Guangxi Medical University/Guigang People's Hospital, Guigang, Guangxi, People's Republic of China
- Hui-Ping Lu
  - Department of Pathology, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, People's Republic of China
- Gang Chen
  - Department of Pathology, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, People's Republic of China
- Jie Yang
  - Department of Pharmacology, School of Pharmacy, Guangxi Medical University, Nanning, Guangxi, People's Republic of China
- Li Gao
  - Department of Pathology, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, People's Republic of China
- Jian-Hua Song
  - Department of Pathology, The Eighth Affiliated Hospital of Guangxi Medical University/Guigang People's Hospital, Guigang, Guangxi, People's Republic of China
- Shang-Wei Chen
  - Department of Cardio-Thoracic Surgery, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, People's Republic of China
- Jun-Xian Mo
  - Department of Cardio-Thoracic Surgery, The Seventh Affiliated Hospital of Guangxi Medical University/Wuzhou Gongren Hospital, Wuzhou, Guangxi, People's Republic of China
- Jin-Liang Kong
  - Ward of Pulmonary and Critical Care Medicine, Department of Respiratory Medicine, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, People's Republic of China
- Zhong-Qing Tang
  - Department of Pathology, The Seventh Affiliated Hospital of Guangxi Medical University/Wuzhou Gongren Hospital, Wuzhou, Guangxi, People's Republic of China
- Chang-Bo Li
  - Department of Cardio-Thoracic Surgery, The Seventh Affiliated Hospital of Guangxi Medical University/Wuzhou Gongren Hospital, Wuzhou, Guangxi, People's Republic of China
- Hua-Fu Zhou
  - Department of Cardio-Thoracic Surgery, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, People's Republic of China
- Lin-Jie Yang
  - Department of Pathology, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, People's Republic of China
23
Juan SH, Chen TR, Lo WC. A simple strategy to enhance the speed of protein secondary structure prediction without sacrificing accuracy. PLoS One 2020; 15:e0235153. [PMID: 32603341 PMCID: PMC7326220 DOI: 10.1371/journal.pone.0235153] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 12/17/2019] [Accepted: 06/09/2020] [Indexed: 01/06/2023] Open
Abstract
The secondary structure prediction of proteins is a classic topic of computational structural biology with a variety of applications. During the past decade, the accuracy of prediction achieved by state-of-the-art algorithms has been >80%; meanwhile, the time cost of prediction increased rapidly because of the exponential growth of fundamental protein sequence data. Based on literature studies and preliminary observations on the relationships between the size/homology of the fundamental protein dataset and the speed/accuracy of predictions, we raised two hypotheses that might help determine the main factors influencing the efficiency of secondary structure prediction. Experimental results of size and homology reductions of the fundamental protein dataset supported those hypotheses. They revealed that shrinking the size of the dataset could substantially cut down the time cost of prediction with a slight decrease in accuracy, whereas homology reduction of the dataset could increase accuracy. Moreover, the Shannon information entropy could be applied to explain how accuracy was influenced by the size and homology of the dataset. Based on these findings, we proposed that a proper combination of size and homology reductions of the protein dataset could speed up the secondary structure prediction while preserving the high accuracy of state-of-the-art algorithms. Testing the proposed strategy with the fundamental protein dataset of the year 2018 provided by the Universal Protein Resource, the speed of prediction was enhanced more than 20-fold while all accuracy measures remained equivalently high. These findings should help improve the efficiency of research and applications that depend on the secondary structure prediction of proteins. To make future implementations of the proposed strategy easy, we have established a database of size and homology reduced protein datasets at http://10.life.nctu.edu.tw/UniRefNR.
Affiliation(s)
- Sheng-Hung Juan
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Teng-Ruei Chen
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Wei-Cheng Lo
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
- The Center for Bioinformatics Research, National Chiao Tung University, Hsinchu, Taiwan
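
The abstract above invokes Shannon information entropy without saying how it was computed. As a rough illustration only, the following sketch applies the standard definition, H = -Σ p·log2(p), to the residues in one column of a hypothetical sequence alignment. The function name and the example columns are assumptions made for illustration and are not taken from the cited paper.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the symbol frequencies in `symbols`.

    Standard definition H = -sum(p * log2(p)); this is NOT the cited
    authors' implementation, only the textbook formula.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # convention chosen here: an empty column carries no information
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical alignment columns (one residue per aligned sequence).
conserved_column = "AAAAAAAA"   # every homolog has the same residue
variable_column = "ACDEFGHI"    # every homolog has a different residue

print(shannon_entropy(conserved_column))
print(shannon_entropy(variable_column))
```

Under this definition, a fully conserved column has zero entropy and a column with many equally frequent residues has high entropy, which is one plausible way to relate dataset homology to the information available to a predictor.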
|
24
|
Yue Z, Chu X, Xia J. PredCID: prediction of driver frameshift indels in human cancer. Brief Bioinform 2020; 22:5860690. [PMID: 32591774 DOI: 10.1093/bib/bbaa119] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 03/21/2020] [Revised: 05/14/2020] [Accepted: 05/16/2020] [Indexed: 11/12/2022] Open
Abstract
The discrimination of driver from passenger mutations has been a hot topic in the field of cancer biology. Although recent advances have improved the identification of driver mutations in cancer genomic research, there is not yet a computational method specific to cancer frameshift indels (insertions and/or deletions). In addition, existing pathogenic frameshift indel predictors may suffer from plenty of missing values because of different choices of transcripts during the variant annotation processes. In this study, we proposed a computational model, called PredCID (Predictor for Cancer driver frameshift InDels), for accurately predicting cancer driver frameshift indels. Gene-, DNA-, transcript- and protein-level features are combined and selected for classification with an eXtreme Gradient Boosting classifier. Benchmarking results on the cross-validation dataset and independent dataset showed that PredCID achieves better and more robust performance than existing noncancer-specific methods in distinguishing cancer driver frameshift indels from passengers and is therefore a valuable method for deeper understanding of frameshift indels in human cancer. PredCID is freely available for academic research at http://bioinfo.ahu.edu.cn:8080/PredCID.
Affiliation(s)
- Xinlu Chu
- Institutes of Physical Science and Information Technology, Anhui University
- Junfeng Xia
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Institutes of Physical Science and Information Technology, Anhui University
|
25
|
Yue Z, Zhao L, Cheng N, Yan H, Xia J. dbCID: a manually curated resource for exploring the driver indels in human cancer. Brief Bioinform 2019; 20:1925-1933. [DOI: 10.1093/bib/bby059] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 04/10/2018] [Revised: 05/22/2018] [Indexed: 12/12/2022] Open
Abstract
While recent advances in next-generation sequencing technologies have enabled the creation of a multitude of databases in cancer genomic research, there is not yet a comprehensive database focusing on the annotation of driver indels (insertions and deletions). Therefore, we have developed the database of Cancer driver InDels (dbCID), which is a collection of known coding indels that are likely to be engaged in cancer development, progression or therapy. dbCID contains experimentally supported and putative driver indels derived from manual curation of literature and is freely available online at http://bioinfo.ahu.edu.cn:8080/dbCID. Using the data deposited in dbCID, we summarized features of driver indels at four levels (gene, DNA, transcript and protein) by comparison with putative neutral indels. We found that most of the genes containing driver indels in dbCID are known cancer genes playing a role in tumorigenesis. Contrary to expectation, the sequences affected by driver frameshift indels are not larger than those affected by neutral ones. In addition, the frameshift and inframe driver indels tend to disrupt highly conserved regions both in DNA sequences and protein domains. Finally, we developed a computational method for discriminating cancer driver from neutral frameshift indels based on the deposited data in dbCID. The proposed method outperformed other widely used non-cancer-specific predictors on an external test set, which demonstrated the usefulness of the data deposited in dbCID. We hope dbCID will be a benchmark for improving and evaluating prediction algorithms, and the characteristics summarized here may assist with investigating the mechanism of indel–cancer association.
Affiliation(s)
- Zhenyu Yue
- Institute of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
- Le Zhao
- Institute of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
- Na Cheng
- Institute of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
- Hua Yan
- School of Life Sciences, Anhui University, Hefei, Anhui, China
- Junfeng Xia
- Institute of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
|
26
|
Abstract
BACKGROUND: Numerous different types of variations can occur in DNA and have diverse effects and consequences. The Variation Ontology (VariO) was developed for systematic descriptions of variations and their effects at DNA, RNA and protein levels. RESULTS: VariO use and terms for DNA variations are described in depth. VariO provides systematic names for variation types and detailed descriptions for changes in DNA function, structure and properties. The principles of VariO are presented along with examples from published articles or databases, most often in relation to human diseases. VariO terms describe local DNA changes, chromosome number and structure variants, chromatin alterations, as well as genomic changes, whether of genetic or non-genetic origin. CONCLUSIONS: DNA variation systematics facilitates unambiguous descriptions of variations and their effects and further reuse and integration of data from different sources by both humans and computers.
Affiliation(s)
- Mauno Vihinen
- Department of Experimental Medical Science, Lund University, BMC B13, SE-22184, Lund, Sweden.
|
27
|
Zhou Y, Fujikura K, Mkrtchian S, Lauschke VM. Computational Methods for the Pharmacogenetic Interpretation of Next Generation Sequencing Data. Front Pharmacol 2018; 9:1437. [PMID: 30564131 PMCID: PMC6288784 DOI: 10.3389/fphar.2018.01437] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Received: 08/03/2018] [Accepted: 11/20/2018] [Indexed: 12/21/2022] Open
Abstract
Up to half of all patients do not respond to pharmacological treatment as intended. A substantial fraction of these inter-individual differences is due to heritable factors, and a growing number of associations between genetic variations and drug response phenotypes have been identified. Importantly, the rapid progress in Next Generation Sequencing technologies in recent years unveiled the true complexity of the genetic landscape in pharmacogenes with tens of thousands of rare genetic variants. As each individual was found to harbor numerous such rare variants, they are anticipated to be important contributors to the genetically encoded inter-individual variability in drug effects. The fundamental challenge, however, is their functional interpretation, because the sheer scale of the problem currently renders systematic experimental characterization of these variants unfeasible. Here, we review concepts and important progress in the development of computational prediction methods that allow evaluation of the effect of amino acid sequence alterations in drug metabolizing enzymes and transporters. In addition, we discuss recent advances in the interpretation of functional effects of non-coding variants, such as variations in splice sites, regulatory regions and miRNA binding sites. We anticipate that these methodologies will provide a useful toolkit to facilitate the integration of the vast extent of rare genetic variability into drug response predictions in a precision medicine framework.
Affiliation(s)
- Yitian Zhou
- Section of Pharmacogenetics, Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
- Kohei Fujikura
- Department of Diagnostic Pathology, Kobe University Graduate School of Medicine, Kobe, Japan
- Souren Mkrtchian
- Section of Pharmacogenetics, Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
- Volker M. Lauschke
- Section of Pharmacogenetics, Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
|
28
|
Sedaghat A, Zamani M, Jahanshahi A, Ghaderian SB, Shariati G, Saberi A, Hamid M, Aminzadeh M, Galehdari H. Frequent novel mutations are causative for maple syrup urine disease from Southwest Iran. Meta Gene 2018. [DOI: 10.1016/j.mgene.2018.01.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022] Open
|
29
|
Zhou H, Gao M, Skolnick J. ENTPRISE-X: Predicting disease-associated frameshift and nonsense mutations. PLoS One 2018; 13:e0196849. [PMID: 29723276 PMCID: PMC5933770 DOI: 10.1371/journal.pone.0196849] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 11/14/2017] [Accepted: 04/20/2018] [Indexed: 01/11/2023] Open
Abstract
To exploit the plethora of information provided by Next Generation Sequencing, the identification of the genetic mutations responsible for disease in general or cancer in particular, among the thousands of neutral germline or somatic variations, is a crucial task. Genome-wide association studies for the detection of disease-associated genes or cancer drivers can only identify common variations or driver genes in a cohort of patients. Thus, they cannot discover unique disease-associated mutations or cancer driver genes on a personal basis. Moreover, even when there are such common variations, their significance is unknown. Here, we extend the machine learning based approach ENTPRISE, developed for predicting the disease association of missense mutations, to frameshift and nonsense mutations. The new approach, ENTPRISE-X, is shown to outperform the state-of-the-art methods VEST-indel and DDIG-in for predicting the disease association of germline frameshift mutations in terms of a balanced measure, the Matthews correlation coefficient (MCC), with an MCC of 0.586 for ENTPRISE-X versus 0.412 for VEST-indel and 0.321 for DDIG-in. Large scale testing on the ExAC dataset shows that ENTPRISE-X classifies a much lower fraction of variations (16%) as disease-causing than VEST-indel (26%) or DDIG-in (65%). A web server for ENTPRISE-X is freely available for academic users at http://cssb2.biology.gatech.edu/entprise-x.
Affiliation(s)
- Hongyi Zhou
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, United States of America
- Mu Gao
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, United States of America
- Jeffrey Skolnick
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, United States of America
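
The ENTPRISE-X abstract above compares methods by Matthews correlation coefficient (MCC). For readers unfamiliar with the measure, here is a minimal sketch of the standard formula computed from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the cited paper; they do not reproduce its reported values.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention: report 0 when any marginal total is zero.
        return 0.0
    return numerator / denominator


# Hypothetical counts for a pathogenic-vs-neutral classifier.
print(mcc_from_counts(tp=80, tn=75, fp=25, fn=20))
```

Because the formula uses all four cells of the confusion matrix, MCC stays informative when pathogenic and neutral variants are present in unequal numbers, which is why it is often described as a balanced measure.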
|
30
|
Yang Y, Gao J, Wang J, Heffernan R, Hanson J, Paliwal K, Zhou Y. Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Brief Bioinform 2018; 19:482-494. [PMID: 28040746 PMCID: PMC5952956 DOI: 10.1093/bib/bbw129] [Citation(s) in RCA: 89] [Impact Index Per Article: 12.7] [Received: 10/10/2016] [Revised: 11/15/2016] [Indexed: 11/13/2022] Open
Abstract
Protein secondary structure prediction began in 1951 when Pauling and Corey predicted helical and sheet conformations for the protein polypeptide backbone even before the first protein structure was determined. Sixty-five years later, powerful new methods breathe new life into this field. The highest three-state accuracy without relying on structure templates is now at 82-84%, a number unthinkable just a few years ago. These improvements came from increasingly larger databases of protein sequences and structures for training, the use of template secondary structure information and more powerful deep learning techniques. As we approach the theoretical limit of three-state prediction (88-90%), alternatives to secondary structure prediction (prediction of backbone torsion angles and Cα-atom-based angles and torsion angles) not only have more room for further improvement but also allow direct prediction of three-dimensional fragment structures with constantly improved accuracy. About 20% of all 40-residue fragments in a database of 1199 non-redundant proteins have <6 Å root-mean-squared distance from the native conformations by SPIDER2. More powerful deep learning methods with improved capability of capturing long-range interactions begin to emerge as the next generation of techniques for secondary structure prediction. The time has come to finish off the final stretch of the long march towards protein secondary structure prediction.
Affiliation(s)
- Yuedong Yang
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, QLD, Australia
- Jianzhao Gao
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Jihua Wang
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, China
- Rhys Heffernan
- Signal Processing Laboratory, Griffith University, Brisbane, Australia
- Jack Hanson
- Signal Processing Laboratory, Griffith University, Brisbane, Australia
- Kuldip Paliwal
- Signal Processing Laboratory, Griffith University, Brisbane, Australia
- Yaoqi Zhou
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, QLD, Australia
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, China
|
31
|
Asim A, Agarwal S, Panigrahi I, Sarangi AN, Muthuswamy S, Kapoor A. CRELD1 gene variants and atrioventricular septal defects in Down syndrome. Gene 2018; 641:180-185. [DOI: 10.1016/j.gene.2017.10.044] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 05/19/2017] [Revised: 09/21/2017] [Accepted: 10/16/2017] [Indexed: 10/18/2022]
|
32
|
Ferlaino M, Rogers MF, Shihab HA, Mort M, Cooper DN, Gaunt TR, Campbell C. An integrative approach to predicting the functional effects of small indels in non-coding regions of the human genome. BMC Bioinformatics 2017; 18:442. [PMID: 28985712 PMCID: PMC5955213 DOI: 10.1186/s12859-017-1862-y] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 05/23/2017] [Accepted: 10/02/2017] [Indexed: 11/30/2022] Open
Abstract
Background: Small insertions and deletions (indels) have a significant influence in human disease and, in terms of frequency, they are second only to single nucleotide variants as pathogenic mutations. As the majority of mutations associated with complex traits are located outside the exome, it is crucial to investigate the potential pathogenic impact of indels in non-coding regions of the human genome. Results: We present FATHMM-indel, an integrative approach to predict the functional effect, pathogenic or neutral, of indels in non-coding regions of the human genome. Our method exploits various genomic annotations in addition to sequence data. When validated on benchmark data, FATHMM-indel significantly outperforms CADD and GAVIN, state of the art models in assessing the pathogenic impact of non-coding variants. FATHMM-indel is available via a web server at indels.biocompute.org.uk. Conclusions: FATHMM-indel can accurately predict the functional impact and prioritise small indels throughout the whole non-coding genome. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1862-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Michael Ferlaino
- Big Data Institute, University of Oxford, Oxford, OX3 7LF, UK; Nuffield Department of Obstetrics and Gynaecology, University of Oxford, Oxford, OX3 9DU, UK
- Mark F Rogers
- Intelligent Systems Laboratory, University of Bristol, Bristol, BS8 1UB, UK
- Hashem A Shihab
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, BS8 2BN, UK
- Matthew Mort
- Institute of Medical Genetics, Cardiff University, Cardiff, CF14 4XN, UK
- David N Cooper
- Institute of Medical Genetics, Cardiff University, Cardiff, CF14 4XN, UK
- Tom R Gaunt
- MRC Integrative Epidemiology Unit, University of Bristol, Bristol, BS8 2BN, UK
- Colin Campbell
- Intelligent Systems Laboratory, University of Bristol, Bristol, BS8 1UB, UK
|
33
|
Bousfiha A, Bakhchane A, Charoute H, Detsouli M, Rouba H, Charif M, Lenaers G, Barakat A. Novel compound heterozygous mutations in the GPR98 (USH2C) gene identified by whole exome sequencing in a Moroccan deaf family. Mol Biol Rep 2017; 44:429-434. [PMID: 28951997 DOI: 10.1007/s11033-017-4129-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 05/19/2016] [Accepted: 09/19/2017] [Indexed: 01/26/2023]
Abstract
In the present work, we identified two novel compound heterozygous mutations in the GPR98 (G protein-coupled receptor 98) gene causing Usher syndrome. Whole-exome sequencing was performed to study the genetic causes of Usher syndrome in a Moroccan family with three affected siblings. We identified two novel compound heterozygous mutations (c.1054C>A, c.16544delT) in the GPR98 gene in the three affected siblings, who had post-lingual bilateral moderate hearing loss with normal vestibular function before the onset of visual disturbances. This is the first time that mutations in the GPR98 gene have been described in Moroccan deaf patients.
Affiliation(s)
- Amale Bousfiha
- Human Molecular Genetics Laboratory, Institut Pasteur du Maroc, 1, Place Louis Pasteur, 20360, Casablanca, Morocco; Laboratoire des Sciences Biologiques, Filière Technique de Santé, Institution Supérieure des Professions Infirmières et Techniques de Santé (ISPITS), Casablanca, Morocco
- Amina Bakhchane
- Human Molecular Genetics Laboratory, Institut Pasteur du Maroc, 1, Place Louis Pasteur, 20360, Casablanca, Morocco
- Hicham Charoute
- Human Molecular Genetics Laboratory, Institut Pasteur du Maroc, 1, Place Louis Pasteur, 20360, Casablanca, Morocco
- Mustapha Detsouli
- Human Molecular Genetics Laboratory, Institut Pasteur du Maroc, 1, Place Louis Pasteur, 20360, Casablanca, Morocco
- Hassan Rouba
- Human Molecular Genetics Laboratory, Institut Pasteur du Maroc, 1, Place Louis Pasteur, 20360, Casablanca, Morocco
- Majida Charif
- PREMMI, Mitochondrial Medicine Research Centre, Université d'Angers, CHU Bât IRIS/IBS, Rue des Capucins, 49933, Angers Cedex 9, France
- Guy Lenaers
- PREMMI, Mitochondrial Medicine Research Centre, Université d'Angers, CHU Bât IRIS/IBS, Rue des Capucins, 49933, Angers Cedex 9, France
- Abdelhamid Barakat
- Human Molecular Genetics Laboratory, Institut Pasteur du Maroc, 1, Place Louis Pasteur, 20360, Casablanca, Morocco
|
34
|
Chen L, Zhang YH, Huang G, Pan X, Wang S, Huang T, Cai YD. Discriminating cirRNAs from other lncRNAs using a hierarchical extreme learning machine (H-ELM) algorithm with feature selection. Mol Genet Genomics 2017; 293:137-149. [DOI: 10.1007/s00438-017-1372-7] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Received: 04/09/2017] [Accepted: 09/07/2017] [Indexed: 12/15/2022]
|
35
|
Lin M, Whitmire S, Chen J, Farrel A, Shi X, Guo JT. Effects of short indels on protein structure and function in human genomes. Sci Rep 2017; 7:9313. [PMID: 28839204 PMCID: PMC5570956 DOI: 10.1038/s41598-017-09287-x] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Received: 05/18/2017] [Accepted: 07/24/2017] [Indexed: 01/20/2023] Open
Abstract
Insertions and deletions (indels) represent the second most common type of genetic variations in human genomes. Indels can be deleterious and contribute to disease susceptibility as recent genome sequencing projects revealed a large number of indels in various cancer types. In this study, we investigated the possible effects of small coding indels on protein structure and function, and the baseline characteristics of indels in 2504 individuals of 26 populations from the 1000 Genomes Project. We found that each population has a distinct pattern in genes with small indels. Frameshift (FS) indels are enriched in olfactory receptor activity while non-frameshift (NFS) indels are enriched in transcription-related proteins. Structural analysis of NFS indels revealed that they predominantly adopt coil or disordered conformations, especially in proteins with transcription-related NFS indels. These results suggest that the annotated coding indels from the 1000 Genomes Project, while contributing to genetic variations and phenotypic diversity, generally do not affect the core protein structures and have no deleterious effect on essential biological processes. In addition, we found that a number of reference genome annotations might need to be updated due to the high prevalence of annotated homozygous indels in the general population.
Affiliation(s)
- Maoxuan Lin
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
- Sarah Whitmire
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
- Jing Chen
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
- Alvin Farrel
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
- Xinghua Shi
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
- Jun-Tao Guo
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
|
36
|
Pagel KA, Pejaver V, Lin GN, Nam HJ, Mort M, Cooper DN, Sebat J, Iakoucheva LM, Mooney SD, Radivojac P. When loss-of-function is loss of function: assessing mutational signatures and impact of loss-of-function genetic variants. Bioinformatics 2017; 33:i389-i398. [PMID: 28882004 PMCID: PMC5870554 DOI: 10.1093/bioinformatics/btx272] [Citation(s) in RCA: 52] [Impact Index Per Article: 6.5] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION: Loss-of-function genetic variants are frequently associated with severe clinical phenotypes, yet many are present in the genomes of healthy individuals. The available methods to assess the impact of these variants rely primarily upon evolutionary conservation with little to no consideration of the structural and functional implications for the protein. They further do not provide information to the user regarding specific molecular alterations potentially causative of disease. RESULTS: To address this, we investigate protein features underlying loss-of-function genetic variation and develop a machine learning method, MutPred-LOF, for the discrimination of pathogenic and tolerated variants that can also generate hypotheses on specific molecular events disrupted by the variant. We investigate a large set of human variants derived from the Human Gene Mutation Database, ClinVar and the Exome Aggregation Consortium. Our prediction method shows an area under the Receiver Operating Characteristic curve of 0.85 for all loss-of-function variants and 0.75 for proteins in which both pathogenic and neutral variants have been observed. We applied MutPred-LOF to a set of 1142 de novo variants from neurodevelopmental disorders and found enrichment of pathogenic variants in affected individuals. Overall, our results highlight the potential of computational tools to elucidate causal mechanisms underlying loss of protein function in loss-of-function variants. AVAILABILITY AND IMPLEMENTATION: http://mutpred.mutdb.org. CONTACT: predrag@indiana.edu.
Affiliation(s)
- Kymberleigh A Pagel
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA
- Vikas Pejaver
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA
- Guan Ning Lin
- Department of Psychiatry, University of California San Diego, La Jolla, CA, USA
- Hyun-Jun Nam
- Department of Psychiatry, University of California San Diego, La Jolla, CA, USA
- Matthew Mort
- Institute of Medical Genetics, Cardiff University, Cardiff, UK
- David N Cooper
- Institute of Medical Genetics, Cardiff University, Cardiff, UK
- Jonathan Sebat
- Department of Psychiatry, University of California San Diego, La Jolla, CA, USA
- Beyster Center for Psychiatric Genomics, Department of Psychiatry, University of California San Diego, La Jolla, CA, USA
- Lilia M Iakoucheva
- Department of Psychiatry, University of California San Diego, La Jolla, CA, USA
- Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
- Predrag Radivojac
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA
|
37
|
Livingstone M, Folkman L, Yang Y, Zhang P, Mort M, Cooper DN, Liu Y, Stantic B, Zhou Y. Investigating DNA-, RNA-, and protein-based features as a means to discriminate pathogenic synonymous variants. Hum Mutat 2017. [DOI: 10.1002/humu.23283] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Indexed: 12/22/2022]
Affiliation(s)
- Mark Livingstone
- School of Information and Communication Technology; Griffith University; Southport Queensland 4222 Australia
- Lukas Folkman
- School of Information and Communication Technology; Griffith University; Southport Queensland 4222 Australia
- Yuedong Yang
- School of Information and Communication Technology; Griffith University; Southport Queensland 4222 Australia
- Institute for Glycomics; Griffith University; Southport Queensland 4222 Australia
- Ping Zhang
- Menzies Health Institute; Griffith University; Southport Queensland 4222 Australia
- Matthew Mort
- Institute of Medical Genetics; Cardiff University; Cardiff CF144XN United Kingdom
- David N. Cooper
- Institute of Medical Genetics; Cardiff University; Cardiff CF144XN United Kingdom
- Yunlong Liu
- Department of Medical and Molecular Genetics; Indiana University; Indianapolis Indiana 46202
- Bela Stantic
- School of Information and Communication Technology; Griffith University; Southport Queensland 4222 Australia
- Yaoqi Zhou
- School of Information and Communication Technology; Griffith University; Southport Queensland 4222 Australia
- Institute for Glycomics; Griffith University; Southport Queensland 4222 Australia
|
38
|
Wu M, Chen T, Jiang R. Leveraging multiple genomic data to prioritize disease-causing indels from exome sequencing data. Sci Rep 2017; 7:1804. [PMID: 28496131 PMCID: PMC5431795 DOI: 10.1038/s41598-017-01834-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 12/09/2016] [Accepted: 04/05/2017] [Indexed: 01/26/2023] Open
Abstract
The emergence of exome sequencing in recent years has enabled rapid and cost-effective detection of genetic variants in coding regions and offers a great opportunity to combine sequencing experiments with subsequent computational analysis for dissecting the genetic basis of human inherited diseases. However, this strategy, though successful in practice, still faces such challenges as limited sample size and substantial number or diversity of candidate variants. To overcome these obstacles, researchers have concentrated on developing advanced computational methods and have recently achieved great progress in analysing single nucleotide variants. Nevertheless, it remains unclear how to analyse indels, another type of genetic variant that accounts for a substantial proportion of known disease-causing variants. In this paper, we proposed an integrative method to effectively identify disease-causing indels from exome sequencing data. Specifically, we put forward a statistical method to combine five functional prediction scores, four genic association scores and a genic intolerance score to produce an integrated p-value, which could then be used for prioritizing candidate indels. We performed extensive simulation studies and demonstrated that our method achieved high accuracy in uncovering disease-causing indels. Our software is available at http://bioinfo.au.tsinghua.edu.cn/jianglab/IndelPrioritizer/.
Affiliation(s)
- Mengmeng Wu
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic and Systems Biology, TNLIST, Tsinghua University, Beijing, 100084, China; Department of Computer Science, Tsinghua University, Beijing, 100084, China
- Ting Chen
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic and Systems Biology, TNLIST, Tsinghua University, Beijing, 100084, China; Department of Computer Science, Tsinghua University, Beijing, 100084, China
- Rui Jiang
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic and Systems Biology, TNLIST, Tsinghua University, Beijing, 100084, China; Department of Automation, Tsinghua University, Beijing, 100084, China
|
39
|
regSNPs-splicing: a tool for prioritizing synonymous single-nucleotide substitution. Hum Genet 2017; 136:1279-1289. [PMID: 28391525 PMCID: PMC5602096 DOI: 10.1007/s00439-017-1783-x] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 08/30/2016] [Accepted: 02/27/2017] [Indexed: 02/06/2023]
Abstract
Although synonymous single-nucleotide variants (sSNVs) have largely gone unstudied because they do not alter protein sequence, mounting evidence suggests that they may affect RNA conformation, splicing, and the stability of nascent mRNAs to promote various diseases. Accurately prioritizing deleterious sSNVs from a pool of neutral ones can significantly improve our ability to select functional genetic variants identified by genome-sequencing projects and, therefore, advance our understanding of disease etiology. In this study, we develop a computational algorithm to prioritize sSNVs based on their impact on mRNA splicing and protein function. In addition to genomic features that potentially affect splicing regulation, our proposed algorithm also includes dozens of structural features that characterize the effects of alternatively spliced exons on protein function. Our systematic evaluation on thousands of sSNVs suggests that several structural features, including intrinsic disorder protein scores, solvent accessible surface areas, protein secondary structures, and known and predicted protein family domains, show significant differences between disease-causing and neutral sSNVs. Our results suggest that protein structure features offer an added dimension of information for distinguishing disease-causing from neutral synonymous variants. The inclusion of structural features increases the predictive accuracy of functional sSNV prioritization.
|
40
|
Stenson PD, Mort M, Ball EV, Evans K, Hayden M, Heywood S, Hussain M, Phillips AD, Cooper DN. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum Genet 2017. [PMID: 28349240 DOI: 10.1007/s00439-017-1779-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 12/01/2022]
Abstract
The Human Gene Mutation Database (HGMD®) constitutes a comprehensive collection of published germline mutations in nuclear genes that underlie, or are closely associated with human inherited disease. At the time of writing (March 2017), the database contained in excess of 203,000 different gene lesions identified in over 8000 genes manually curated from over 2600 journals. With new mutation entries currently accumulating at a rate exceeding 17,000 per annum, HGMD represents de facto the central unified gene/disease-oriented repository of heritable mutations causing human genetic disease used worldwide by researchers, clinicians, diagnostic laboratories and genetic counsellors, and is an essential tool for the annotation of next-generation sequencing data. The public version of HGMD (http://www.hgmd.org) is freely available to registered users from academic institutions and non-profit organisations whilst the subscription version (HGMD Professional) is available to academic, clinical and commercial users under license via QIAGEN Inc.
Affiliation(s)
- Peter D Stenson
- School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK.
| | - Matthew Mort
- School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
| | - Edward V Ball
- School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
| | - Katy Evans
- School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
| | - Matthew Hayden
- School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
| | - Sally Heywood
- School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
| | - Michelle Hussain
- School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
| | - Andrew D Phillips
- School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
| | - David N Cooper
- School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK.
|
41
|
Stenson PD, Mort M, Ball EV, Evans K, Hayden M, Heywood S, Hussain M, Phillips AD, Cooper DN. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum Genet 2017; 136:665-677. [PMID: 28349240 PMCID: PMC5429360 DOI: 10.1007/s00439-017-1779-6] [Citation(s) in RCA: 969] [Impact Index Per Article: 121.1] [Received: 02/24/2017] [Accepted: 03/14/2017] [Indexed: 02/06/2023]
|
42
|
Yang Y, Heffernan R, Paliwal K, Lyons J, Dehzangi A, Sharma A, Wang J, Sattar A, Zhou Y. SPIDER2: A Package to Predict Secondary Structure, Accessible Surface Area, and Main-Chain Torsional Angles by Deep Neural Networks. Methods Mol Biol 2017; 1484:55-63. [PMID: 27787820 DOI: 10.1007/978-1-4939-6406-2_6] [Citation(s) in RCA: 105] [Impact Index Per Article: 13.1] [Indexed: 05/13/2023]
Abstract
Predicting one-dimensional structure properties has played an important role in improving the prediction of protein three-dimensional structures and functions. The most commonly predicted properties are secondary structure and accessible surface area (ASA), representing local and nonlocal structural characteristics, respectively. Secondary structure prediction is further complemented by prediction of continuous main-chain torsional angles. Here we describe a newly developed method, SPIDER2, that utilizes three iterations of deep learning neural networks to improve the prediction accuracy of several structural properties simultaneously. For an independent test set of 1199 proteins, SPIDER2 achieves 82% accuracy for secondary structure prediction, 0.76 for the correlation coefficient between predicted and actual solvent accessible surface area, 19° and 30° for mean absolute errors of backbone φ and ψ angles, respectively, and 8° and 32° for mean absolute errors of Cα-based θ and τ angles, respectively. The method provides state-of-the-art, all-in-one accurate prediction of local structure and solvent accessible surface area. The method is implemented as a webserver along with a standalone package, both available on our website: http://sparks-lab.org.
Affiliation(s)
- Yuedong Yang
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Gold Coast Campus, Science 1 (G24) 2.10, Parklands Drive, Southport, QLD, 4222, Australia
| | - Rhys Heffernan
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, QLD, Australia
| | - Kuldip Paliwal
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, QLD, Australia
| | - James Lyons
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, QLD, Australia
| | - Abdollah Dehzangi
- Department of Psychiatry, Medical Research Center, University of Iowa, Iowa City, IA, USA
| | - Alok Sharma
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD, Australia
- School of Engineering and Physics, University of the South Pacific, Private Mail Bag, Laucala Campus, Suva, Fiji
| | - Jihua Wang
- Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Dezhou University, Dezhou, Shandong, China
| | - Abdul Sattar
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD, Australia
- National ICT Australia (NICTA), Brisbane, QLD, Australia
| | - Yaoqi Zhou
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Gold Coast Campus, Science 1 (G24) 2.10, Parklands Drive, Southport, QLD, 4222, Australia.
|
43
|
Li M, Feng W, Zhang X, Yang Y, Wang K, Mort M, Cooper DN, Wang Y, Zhou Y, Liu Y. ExonImpact: Prioritizing Pathogenic Alternative Splicing Events. Hum Mutat 2016; 38:16-24. [PMID: 27604408 DOI: 10.1002/humu.23111] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 06/06/2016] [Revised: 08/23/2016] [Accepted: 08/30/2016] [Indexed: 11/11/2022]
Abstract
Alternative splicing (AS) is a closely regulated process that allows a single gene to encode multiple protein isoforms, thereby contributing to the diversity of the proteome. Dysregulation of the splicing process has been found to be associated with many inherited diseases. However, among the pathogenic AS events, there are numerous "passenger" events whose inclusion or exclusion does not lead to significant changes with respect to protein function. In this study, we evaluate the secondary and tertiary structural features of proteins associated with disease-causing and neutral AS events, and show that several structural features are strongly associated with the pathological impact of exon inclusion. We further develop a machine-learning-based computational model, ExonImpact, for prioritizing and evaluating the functional consequences of hitherto uncharacterized AS events. We evaluated our model using several strategies including cross-validation, and data from the Genotype-Tissue Expression (GTEx) and ClinVar databases. ExonImpact is freely available at http://watson.compbio.iupui.edu/ExonImpact.
Affiliation(s)
- Meng Li
- Institute of Intelligent System and Bioinformatics, College of Automation, Harbin Engineering University, Harbin, Heilongjiang, 150001, China.,Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, 46202, USA
| | - Weixing Feng
- Institute of Intelligent System and Bioinformatics, College of Automation, Harbin Engineering University, Harbin, Heilongjiang, 150001, China
| | - Xinjun Zhang
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, 46202, USA
| | - Yuedong Yang
- Institute for Glycomics and School of Informatics and Communication Technology, Griffith University, Parklands Dr. Southport QLD 4215, Australia
| | - Kejun Wang
- Institute of Intelligent System and Bioinformatics, College of Automation, Harbin Engineering University, Harbin, Heilongjiang, 150001, China
| | - Matthew Mort
- Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff CF14 4XN, UK
| | - David N Cooper
- Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff CF14 4XN, UK
| | - Yue Wang
- Departments of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, Indiana, 46202, USA
| | - Yaoqi Zhou
- Institute for Glycomics and School of Informatics and Communication Technology, Griffith University, Parklands Dr. Southport QLD 4215, Australia
| | - Yunlong Liu
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, 46202, USA.,Departments of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, Indiana, 46202, USA.,Center for Medical Genomics, Indiana University School of Medicine, Indianapolis, Indiana, 46202, USA
|
44
|
Taherzadeh G, Zhou Y, Liew AWC, Yang Y. Sequence-Based Prediction of Protein-Carbohydrate Binding Sites Using Support Vector Machines. J Chem Inf Model 2016; 56:2115-2122. [PMID: 27623166 DOI: 10.1021/acs.jcim.6b00320] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.1] [Indexed: 01/09/2023]
Abstract
Carbohydrate-binding proteins play significant roles in many diseases including cancer. Here, we established a machine-learning-based method (called sequence-based prediction of residue-level interaction sites of carbohydrates, SPRINT-CBH) to predict carbohydrate-binding sites in proteins using support vector machines (SVMs). We found that integrating evolution-derived sequence profiles with additional information on sequence and predicted solvent accessible surface area leads to a reasonably accurate, robust, and predictive method, with an area under the receiver operating characteristic curve (AUC) of 0.78 and 0.77 and a Matthews correlation coefficient of 0.34 and 0.29 for 10-fold cross validation and an independent test, respectively, without balancing binding and nonbinding residues. The quality of the method is further demonstrated by having statistically significantly more binding residues predicted for carbohydrate-binding proteins than presumptive nonbinding proteins in the human proteome, and by the bias of rare alleles toward predicted carbohydrate-binding sites for nonsynonymous mutations from the 1000 Genomes Project. SPRINT-CBH is available as an online server at http://sparks-lab.org/server/SPRINT-CBH.
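For readers unfamiliar with the Matthews correlation coefficient reported above, the following minimal sketch computes it from confusion-matrix counts using the standard definition. The counts in the example are made up to mimic an unbalanced binding versus nonbinding residue set; they are not taken from the paper.

```python
import math

def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: report 0 when any marginal total is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical, heavily unbalanced counts (few binding residues, many nonbinding).
tp, fn, fp, tn = 30, 70, 100, 9800
accuracy = (tp + tn) / (tp + fn + fp + tn)
print(f"accuracy={accuracy:.3f} mcc={matthews_corrcoef(tp, fp, tn, fn):.3f}")
```

On unbalanced data such as this, accuracy is dominated by the majority class, whereas the coefficient stays modest, which is one reason it is often reported alongside AUC for residue-level predictors.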
Affiliation(s)
- Ghazaleh Taherzadeh
- School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
| | - Yaoqi Zhou
- School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
| | - Alan Wee-Chung Liew
- School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
| | - Yuedong Yang
- School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
|
45
|
Piva F, Giulietti M, Occhipinti G, Santoni M, Massari F, Sotte V, Iacovelli R, Burattini L, Santini D, Montironi R, Cascinu S, Principato G. Computational analysis of the mutations in BAP1, PBRM1 and SETD2 genes reveals the impaired molecular processes in renal cell carcinoma. Oncotarget 2016; 6:32161-8. [PMID: 26452128 PMCID: PMC4741666 DOI: 10.18632/oncotarget.5147] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 07/15/2015] [Accepted: 09/25/2015] [Indexed: 01/19/2023] Open
Abstract
Clear cell renal cell carcinoma (ccRCC) is caused by loss of the von Hippel-Lindau (VHL) gene and of at least one of three chromatin-regulating genes: BRCA1-associated protein-1 (BAP1), Polybromo-1 (PBRM1) and SET domain-containing 2 (SETD2). More than 350, 700 and 500 mutations are known for the BAP1, PBRM1 and SETD2 genes, respectively. Each variation damages these genes with a different level of severity. Unfortunately, the molecular effect of most of these mutations is unknown, which precludes a severity classification. Moreover, the sheer number of these mutations makes it impractical to perform experimental assays for each of them. Using bioinformatic tools, we predicted the molecular effects of all mutations lying in the BAP1, PBRM1 and SETD2 genes. Our results make it possible to distinguish whether a mutation alters protein function directly or by disrupting the splicing pattern, and how severely. This classification could be useful to reveal correlations with patient outcome, to guide experiments, to select the variations worth including in translational/association studies, and to direct gene therapies.
Affiliation(s)
- Francesco Piva
- Department of Specialistic Clinical and Odontostomatological Sciences, Polytechnic University of Marche Region, Ancona, Italy
| | - Matteo Giulietti
- Department of Specialistic Clinical and Odontostomatological Sciences, Polytechnic University of Marche Region, Ancona, Italy
| | - Giulia Occhipinti
- Department of Specialistic Clinical and Odontostomatological Sciences, Polytechnic University of Marche Region, Ancona, Italy
| | - Matteo Santoni
- Department of Medical Oncology, AOU Ospedali Riuniti - Polytechnic University of the Marche Region, Ancona, Italy
| | | | - Valeria Sotte
- Department of Medical Oncology, AOU Ospedali Riuniti - Polytechnic University of the Marche Region, Ancona, Italy
| | - Roberto Iacovelli
- Medical Oncology Unit of Urogenital and Head & Neck Tumors, European Institute of Oncology, Milan, Italy
| | - Luciano Burattini
- Department of Medical Oncology, AOU Ospedali Riuniti - Polytechnic University of the Marche Region, Ancona, Italy
| | - Daniele Santini
- Department of Medical Oncology, Campus Bio-Medico University of Rome, Rome, Italy
| | - Rodolfo Montironi
- Pathological Anatomy, Polytechnic University of the Marche Region School of Medicine United Hospitals, Ancona, Italy
| | - Stefano Cascinu
- Department of Medical Oncology, AOU Ospedali Riuniti - Polytechnic University of the Marche Region, Ancona, Italy
| | - Giovanni Principato
- Department of Specialistic Clinical and Odontostomatological Sciences, Polytechnic University of Marche Region, Ancona, Italy
|
46
|
Folkman L, Stantic B, Sattar A, Zhou Y. EASE-MM: Sequence-Based Prediction of Mutation-Induced Stability Changes with Feature-Based Multiple Models. J Mol Biol 2016; 428:1394-1405. [DOI: 10.1016/j.jmb.2016.01.012] [Citation(s) in RCA: 67] [Impact Index Per Article: 7.4] [Received: 07/27/2015] [Revised: 01/12/2016] [Accepted: 01/13/2016] [Indexed: 10/22/2022]
|
47
|
Zhang G, Shao M, Li Z, Gu Y, Du X, Wang X, Li M. Genetic spectrum of dyschromatosis symmetrica hereditaria in Chinese patients including a novel nonstop mutation in ADAR1 gene. BMC Medical Genetics 2016; 17:14. [PMID: 26892242 PMCID: PMC4759768 DOI: 10.1186/s12881-015-0255-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 07/27/2015] [Accepted: 11/24/2015] [Indexed: 12/04/2022]
Abstract
Background: Dyschromatosis symmetrica hereditaria (DSH) is a rare autosomal dominant cutaneous disorder caused by mutations in the adenosine deaminase acting on RNA 1 (ADAR1) gene. We present a clinical and genetic study of seven unrelated families and two sporadic cases with DSH for mutations in the full coding sequence of the ADAR1 gene. Methods: The ADAR1 gene was sequenced in seven unrelated families and two sporadic cases with DSH and in 120 controls. The functional significance of the observed ADAR1 mutations was analyzed using PolyPhen 2, SIFT and DDIG-in. Results: We describe six novel mutations of the ADAR1 gene in Chinese patients with DSH, including a nonstop mutation, p.Stop1227R, reported for the first time in the ADAR1 gene. In silico analysis proves that all the mutations reported here are pathogenic. Conclusion: This study is useful for functional studies of the protein and for defining a diagnostic strategy for mutation screening of the ADAR1 gene. A three-generation family exhibiting phenotypic variability with a single germline ADAR1 mutation suggests that chilblain might aggravate the clinical phenotypes of DSH.
Affiliation(s)
- Guolong Zhang
- Department of Phototherapy at Shanghai Skin Disease Hospital & Institute of Photomedicine, Tongji University School of Medicine, 1278, Baode Road, Shanghai, 200443, China.
| | - Minhua Shao
- Department of Dermatology, Nanjing Medical University, Affiliated Wuxi People's Hospital, Wuxi, 214023, China.
| | - Zhixiu Li
- University of Queensland Diamantina Institute, Translational Research Institute, Brisbane, Queensland, Australia.
| | - Yong Gu
- Department of Dermatology, Nanjing Medical University, Affiliated Wuxi People's Hospital, Wuxi, 214023, China.
| | - Xufeng Du
- Department of Dermatology, Nanjing Medical University, Affiliated Wuxi People's Hospital, Wuxi, 214023, China.
| | - Xiuli Wang
- Department of Phototherapy at Shanghai Skin Disease Hospital & Institute of Photomedicine, Tongji University School of Medicine, 1278, Baode Road, Shanghai, 200443, China.
| | - Ming Li
- Department of Dermatology, Xinhua Hospital, Shanghai Jiaotong University School of Medicine, 1665, Kongjiang Road, Shanghai, 200092, China.
|
48
|
Douville C, Masica DL, Stenson PD, Cooper DN, Gygax DM, Kim R, Ryan M, Karchin R. Assessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool (VEST-Indel). Hum Mutat 2016; 37:28-35. [PMID: 26442818 PMCID: PMC5057310 DOI: 10.1002/humu.22911] [Citation(s) in RCA: 102] [Impact Index Per Article: 11.3] [Received: 07/09/2015] [Accepted: 09/14/2015] [Indexed: 12/11/2022]
Abstract
Insertion/deletion variants (indels) alter protein sequence and length, yet are highly prevalent in healthy populations, presenting a challenge to bioinformatics classifiers. Commonly used features--DNA and protein sequence conservation, indel length, and occurrence in repeat regions--are useful for inference of protein damage. However, these features can cause false positives when predicting the impact of indels on disease. Existing methods for indel classification suffer from low specificities, severely limiting clinical utility. Here, we further develop our variant effect scoring tool (VEST) to include the classification of in-frame and frameshift indels (VEST-indel) as pathogenic or benign. We apply 24 features, including a new "PubMed" feature, to estimate a gene's importance in human disease. When compared with four existing indel classifiers, our method achieves a drastically reduced false-positive rate, improving specificity by as much as 90%. This approach of estimating gene importance might be generally applicable to missense and other bioinformatics pathogenicity predictors, which often fail to achieve high specificity. Finally, we tested all possible meta-predictors that can be obtained from combining the four different indel classifiers using Boolean conjunctions and disjunctions, and derived a meta-predictor with improved performance over any individual method.
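The meta-predictor search mentioned at the end of this abstract can be sketched in simplified form. The code below is an assumption-laden illustration, not the authors' procedure: it only enumerates pure AND and pure OR combinations over subsets of classifiers (the paper's search over Boolean conjunctions and disjunctions may also have included mixed expressions), and the classifier names, calls and labels are hypothetical toy data.

```python
from itertools import combinations

def sensitivity_specificity(predicted, truth):
    """Return (sensitivity, specificity) for boolean calls against boolean labels."""
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    tn = sum(1 for p, t in zip(predicted, truth) if not p and not t)
    positives = sum(1 for t in truth if t)
    negatives = len(truth) - positives
    return tp / positives, tn / negatives

def enumerate_meta_predictors(calls):
    """Yield (label, combined_calls) for AND-only and OR-only subsets of classifiers."""
    names = sorted(calls)
    for size in range(2, len(names) + 1):
        for subset in combinations(names, size):
            columns = [calls[name] for name in subset]
            # Conjunction: pathogenic only if every classifier in the subset agrees.
            yield " AND ".join(subset), [all(row) for row in zip(*columns)]
            # Disjunction: pathogenic if any classifier in the subset says so.
            yield " OR ".join(subset), [any(row) for row in zip(*columns)]

# Hypothetical toy data: True means "pathogenic".
truth = [True, True, True, False, False, False]
calls = {
    "A": [True, True, False, True, False, False],
    "B": [True, False, True, False, True, False],
    "C": [True, True, True, True, True, False],
    "D": [False, True, True, False, False, False],
}
for label, combined in enumerate_meta_predictors(calls):
    sens, spec = sensitivity_specificity(combined, truth)
    print(f"{label}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

Conjunctions tend to trade sensitivity for specificity and disjunctions do the reverse, which is why an exhaustive comparison of combinations can surface a meta-predictor that improves on any single classifier.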
Affiliation(s)
- Christopher Douville
- Department of Biomedical Engineering and Institute for Computational Medicine, The Johns Hopkins University, Baltimore, Maryland
| | - David L. Masica
- Department of Biomedical Engineering and Institute for Computational Medicine, The Johns Hopkins University, Baltimore, Maryland
| | - Peter D. Stenson
- Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff, UK
| | - David N. Cooper
- Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff, UK
| | | | - Rick Kim
- In Silico Solutions, Fairfax, Virginia
| | | | - Rachel Karchin
- Department of Biomedical Engineering and Institute for Computational Medicine, The Johns Hopkins University, Baltimore, Maryland
- Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, Maryland
|
49
|
Computational approaches to study the effects of small genomic variations. J Mol Model 2015; 21:251. [PMID: 26350246 DOI: 10.1007/s00894-015-2794-y] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 04/08/2015] [Accepted: 08/23/2015] [Indexed: 10/23/2022]
Abstract
Advances in DNA sequencing technologies have led to an avalanche-like increase in the number of gene sequences deposited in public databases over the last decade as well as the detection of an enormous number of previously unseen nucleotide variants therein. Given the size and complex nature of the genome-wide sequence variation data, as well as the rate of data generation, experimental characterization of the disease association of each of these variations or their effects on protein structure/function would be costly, laborious, time-consuming, and essentially impossible. Thus, in silico methods to predict the functional effects of sequence variations are constantly being developed. In this review, we summarize the major computational approaches and tools that are aimed at the prediction of the functional effect of mutations, and describe the state-of-the-art databases that can be used to obtain information about mutation significance. We also discuss future directions in this highly competitive field.
|
50
|
Disorder Prediction Methods, Their Applicability to Different Protein Targets and Their Usefulness for Guiding Experimental Studies. Int J Mol Sci 2015; 16:19040-54. [PMID: 26287166 PMCID: PMC4581285 DOI: 10.3390/ijms160819040] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.9] [Received: 05/21/2015] [Revised: 07/15/2015] [Accepted: 08/04/2015] [Indexed: 12/13/2022] Open
Abstract
The role and function of a given protein is dependent on its structure. In recent years, however, numerous studies have highlighted the importance of unstructured, or disordered, regions in governing a protein’s function. Disordered proteins have been found to play important roles in pivotal cellular functions, such as DNA binding and signalling cascades. Studying proteins with extended disordered regions is often problematic as they can be challenging to express, purify and crystallise. This means that interpretable experimental data on protein disorder is hard to generate. As a result, predictive computational tools have been developed with the aim of predicting the level and location of disorder within a protein. Currently, over 60 prediction servers exist, utilizing different methods for classifying disorder and different training sets. Here we review several well-performing, publicly available prediction methods, comparing their application and discussing how disorder prediction servers can be used to aid the experimental solution of protein structure. The use of disorder prediction methods allows us to adopt a more targeted approach to experimental studies by accurately identifying the boundaries of ordered protein domains so that they may be investigated separately, thereby increasing the likelihood of their successful experimental solution.
|