1
Kumar R, Duzan J, Drake E, Dawson J. A novel heterozygous frameshift c.277del p.Thr93Leufs*21 mutation in SERPINC1 associated with type 1 antithrombin deficiency. Pediatr Blood Cancer 2024; 71:e30986. [PMID: 38563157 DOI: 10.1002/pbc.30986] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 03/11/2024] [Accepted: 03/18/2024] [Indexed: 04/02/2024]
Affiliation(s)
- Riten Kumar
  - Dana Farber/Boston Children's Cancer and Blood Disorders Center, Boston, Massachusetts, USA
  - Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA
- Juliann Duzan
  - Dana Farber/Boston Children's Cancer and Blood Disorders Center, Boston, Massachusetts, USA
- Emily Drake
  - Dana Farber/Boston Children's Cancer and Blood Disorders Center, Boston, Massachusetts, USA
- Jennifer Dawson
  - Genomic Medicine, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio, USA
2
Wang S, Li Y, Zhong L, Wu K, Zhang R, Kang T, Wu S, Wu Y. Efficient gene editing through an intronic selection marker in cells. Cell Mol Life Sci 2022; 79:111. [PMID: 35098362 PMCID: PMC8801403 DOI: 10.1007/s00018-022-04152-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/21/2021] [Revised: 01/10/2022] [Accepted: 01/14/2022] [Indexed: 02/05/2023]
Abstract
BACKGROUND Gene editing technology has provided researchers with the ability to modify genome sequences in almost all eukaryotes. Gene-edited cell lines are being used with increasing frequency in both bench research and targeted therapy. However, despite the great importance and universality of gene editing, the efficiency of homology-directed DNA repair (HDR) is too low, and base editors (BEs) cannot accomplish desired indel editing tasks. RESULTS AND DISCUSSION Our group has improved HDR gene editing technology to indicate DNA variation with an independent selection marker using an HDR strategy, which we named Gene Editing through an Intronic Selection marker (GEIS). GEIS uses a simple process to avoid nonhomologous end joining (NHEJ)-mediated false-positive effects and achieves a DsRed positive rate as high as 87.5% after two rounds of fluorescence-activated cell sorter (FACS) selection without disturbing endogenous gene splicing and expression. We re-examined the correlation of the conversion tract and efficiency, and our data suggest that GEIS has the potential to edit approximately 97% of gene editing targets in human and mouse cells. The results of further comprehensive analysis suggest that the strategy may be useful for introducing multiple DNA variations in cells.
Affiliation(s)
- Shang Wang
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Department of Experimental Research, Sun Yat-Sen University Cancer Center, Guangzhou, 510060, China
  - Institute of Urology, The Third Affiliated Hospital of Shenzhen University, Shenzhen, 518000, China
- Yuqing Li
  - Institute of Urology, The Third Affiliated Hospital of Shenzhen University, Shenzhen, 518000, China
  - Teaching Center of Shenzhen Luohu Hospital, Shantou University Medical College, Shantou, 515000, China
- Li Zhong
  - Center of Digestive Diseases, The Seventh Affiliated Hospital of Sun Yat-Sen University, Shenzhen, 518107, China
  - Scientific Research Center, The Seventh Affiliated Hospital of Sun Yat-Sen University, Shenzhen, 518107, China
- Kai Wu
  - Institute of Urology, The Third Affiliated Hospital of Shenzhen University, Shenzhen, 518000, China
- Ruhua Zhang
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Department of Experimental Research, Sun Yat-Sen University Cancer Center, Guangzhou, 510060, China
- Tiebang Kang
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Department of Experimental Research, Sun Yat-Sen University Cancer Center, Guangzhou, 510060, China
- Song Wu
  - Institute of Urology, The Third Affiliated Hospital of Shenzhen University, Shenzhen, 518000, China
  - Teaching Center of Shenzhen Luohu Hospital, Shantou University Medical College, Shantou, 515000, China
  - Department of Urology, South China Hospital of Shenzhen University, Shenzhen, 518000, China
- Yuanzhong Wu
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Department of Experimental Research, Sun Yat-Sen University Cancer Center, Guangzhou, 510060, China
3
Rodriguez JM, Pozo F, Cerdán-Vélez D, Di Domenico T, Vázquez J, Tress M. APPRIS: selecting functionally important isoforms. Nucleic Acids Res 2022; 50:D54-D59. [PMID: 34755885 PMCID: PMC8728124 DOI: 10.1093/nar/gkab1058] [Citation(s) in RCA: 42] [Impact Index Per Article: 14.0] [Received: 09/15/2021] [Revised: 10/14/2021] [Accepted: 10/20/2021] [Indexed: 12/20/2022]
Abstract
APPRIS (https://appris.bioinfo.cnio.es) is a well-established database housing annotations for protein isoforms for a range of species. APPRIS selects principal isoforms based on protein structure and function features and on cross-species conservation. Most coding genes produce a single main protein isoform and the principal isoforms chosen by the APPRIS database best represent this main cellular isoform. Human genetic data, experimental protein evidence and the distribution of clinical variants all support the relevance of APPRIS principal isoforms. APPRIS annotations and principal isoforms have now been expanded to 10 model organisms. In this paper we highlight the most recent updates to the database. APPRIS annotations have been generated for two new species, cow and chicken, the protein structural information has been augmented with reliable models from the EMBL-EBI AlphaFold database, and we have substantially expanded the confirmatory proteomics evidence available for the human genome. The most significant change in APPRIS has been the implementation of TRIFID functional isoform scores. TRIFID functional scores are assigned to all splice isoforms, and APPRIS uses the TRIFID functional scores and proteomics evidence to determine principal isoforms when core methods cannot.
Affiliation(s)
- Jose Manuel Rodriguez
  - Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), 28029 Madrid, Spain
- Fernando Pozo
  - Bioinformatics Institute, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
- Daniel Cerdán-Vélez
  - Bioinformatics Institute, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
- Tomás Di Domenico
  - Bioinformatics Institute, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
- Jesús Vázquez
  - Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), 28029 Madrid, Spain
  - CIBER de Enfermedades Cardiovasculares (CIBERCV), 28029 Madrid, Spain
- Michael L Tress
  - Bioinformatics Institute, Spanish National Cancer Research Centre (CNIO), Madrid, 28029, Spain
4
Pozo F, Martinez-Gomez L, Walsh TA, Rodriguez JM, Di Domenico T, Abascal F, Vazquez J, Tress ML. Assessing the functional relevance of splice isoforms. NAR Genom Bioinform 2021; 3:lqab044. [PMID: 34046593 PMCID: PMC8140736 DOI: 10.1093/nargab/lqab044] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 11/27/2020] [Revised: 04/22/2021] [Accepted: 05/17/2021] [Indexed: 12/20/2022]
Abstract
Alternative splicing of messenger RNA can generate an array of mature transcripts, but it is not clear how many go on to produce functionally relevant protein isoforms. There is only limited evidence for alternative proteins in proteomics analyses and data from population genetic variation studies indicate that most alternative exons are evolving neutrally. Determining which transcripts produce biologically important isoforms is key to understanding isoform function and to interpreting the real impact of somatic mutations and germline variations. Here we have developed a method, TRIFID, to classify the functional importance of splice isoforms. TRIFID was trained on isoforms detected in large-scale proteomics analyses and distinguishes these biologically important splice isoforms with high confidence. Isoforms predicted as functionally important by the algorithm had measurable cross species conservation and significantly fewer broken functional domains. Additionally, exons that code for these functionally important protein isoforms are under purifying selection, while exons from low scoring transcripts largely appear to be evolving neutrally. TRIFID has been developed for the human genome, but it could in principle be applied to other well-annotated species. We believe that this method will generate valuable insights into the cellular importance of alternative splicing.
Affiliation(s)
- Fernando Pozo
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Laura Martinez-Gomez
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Thomas A Walsh
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- José Manuel Rodriguez
  - Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, Spain
- Tomas Di Domenico
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Federico Abascal
  - Somatic Evolution Group, Wellcome Sanger Institute, Hinxton CB10 1SA, UK
- Jesús Vazquez
  - Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, Spain
- Michael L Tress
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
5
Kuang S, Wei Y, Wang L. Expression-based prediction of human essential genes and candidate lncRNAs in cancer cells. Bioinformatics 2021; 37:396-403. [PMID: 32790840 DOI: 10.1093/bioinformatics/btaa717] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 05/06/2020] [Revised: 07/21/2020] [Accepted: 08/06/2020] [Indexed: 01/12/2023]
Abstract
MOTIVATION Essential genes are required for the reproductive success at either cellular or organismal level. The identification of essential genes is important for understanding the core biological processes and identifying effective therapeutic drug targets. However, experimental identification of essential genes is costly, time consuming and labor intensive. Although several machine learning models have been developed to predict essential genes, these models are not readily applicable to lncRNAs. Moreover, the currently available models cannot be used to predict essential genes in a specific cancer type. RESULTS In this study, we have developed a new machine learning approach, XGEP (eXpression-based Gene Essentiality Prediction), to predict essential genes and candidate lncRNAs in cancer cells. The novelty of XGEP lies in the utilization of relevant features derived from the TCGA transcriptome dataset through collaborative embedding. When evaluated on the pan-cancer dataset, XGEP was able to accurately predict human essential genes and achieve significantly higher performance than previous models. Notably, several candidate lncRNAs selected by XGEP are reported to promote cell proliferation and inhibit cell apoptosis. Moreover, XGEP also demonstrated superior performance on cancer-type-specific datasets to identify essential genes. The comprehensive lists of candidate essential genes in specific cancer types may be used to guide experimental characterization and facilitate the discovery of drug targets for cancer therapy. AVAILABILITY AND IMPLEMENTATION The source code and datasets used in this study are freely available at https://github.com/BioDataLearning/XGEP. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shuzhen Kuang
  - Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, USA
  - Department of Biological Sciences, Clemson University, Clemson, SC 29634, USA
- Yanzhang Wei
  - Department of Biological Sciences, Clemson University, Clemson, SC 29634, USA
- Liangjiang Wang
  - Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, USA
  - Center for Human Genetics, Clemson University, Clemson, SC 29634, USA
6
Quality of whole genome sequencing from blood versus saliva derived DNA in cardiac patients. BMC Med Genomics 2020; 13:11. [PMID: 31996208 PMCID: PMC6988365 DOI: 10.1186/s12920-020-0664-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 08/07/2019] [Accepted: 01/20/2020] [Indexed: 01/03/2023]
Abstract
Background Whole-genome sequencing (WGS) is becoming an increasingly important tool for detecting genomic variation. Blood-derived DNA is the current standard for WGS for research or clinical purposes, but blood may not always be feasible to acquire. The usability of DNA from saliva for WGS is not known. We compared the quality of WGS between blood-derived and saliva-derived DNA. Methods WGS was performed on DNA from 531 blood and 502 saliva samples (including 5 paired samples) from participants enrolled in a heart disease biorepository. We compared the proportion of sequencing reads that mapped to non-human sources (microbiome), the sequencing coverage, and the yield and concordance of single nucleotide variant (SNV) and copy number variant (CNV) calls between blood and saliva genomes. Results Of 531 blood and 502 saliva samples, 46% of saliva DNA samples failed quality control (QC) requirements for WGS compared to 6% of blood DNA samples. An average of 10.7% of WGS reads in the saliva samples mapped to the human oral microbiome compared to 0.09% of WGS reads in blood samples. However, these reads were readily removed by discarding reads that did not map to the human reference genome. Sequencing coverage met or exceeded the target sequencing depth of 30x in all the blood samples and 4 of the 5 saliva samples; the fifth saliva sample had an average sequencing depth of 22.6x. Over 95% of SNVs identified in saliva were concordant with those identified in blood across the genome, within all gene coding regions, and within cardiovascular disease-related gene coding regions. Rare SNVs, defined as those with a minor allele frequency of less than 1% in the Genome Aggregation Database, had a lower concordance of 90% between blood and saliva genomes. CNVs had only 76% concordance between blood and saliva samples. Conclusions High quality saliva samples that meet stringent QC criteria can be used for WGS when blood-derived DNA is not available or is not suitable. Saliva DNA provides an acceptable yield of SNV calls but has a lower yield for CNV calls compared to blood DNA.
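The concordance figures in this abstract can be read as set overlaps between paired call sets. The following is a minimal sketch of one plausible definition (the fraction of saliva calls also present in blood), not the authors' pipeline; the variant key, the function names `concordance` and `rare_only`, the treatment of variants missing from the frequency table, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

# A variant is keyed by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def concordance(saliva: Set[Variant], blood: Set[Variant]) -> float:
    """Fraction of saliva calls that are also present in the blood call set."""
    if not saliva:
        raise ValueError("saliva call set is empty")
    return len(saliva & blood) / len(saliva)


def rare_only(calls: Set[Variant], maf: Dict[Variant, float],
              cutoff: float = 0.01) -> Set[Variant]:
    """Keep variants whose minor allele frequency is below the cutoff.

    Variants missing from the frequency table are treated as rare (MAF 0.0),
    which is an assumption of this sketch.
    """
    return {v for v in calls if maf.get(v, 0.0) < cutoff}


# Toy call sets and frequencies, invented for illustration only.
blood_calls = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 300, "G", "A")}
saliva_calls = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 400, "T", "C")}
maf_table = {("1", 100, "A", "G"): 0.25, ("1", 200, "C", "T"): 0.004}

overall = concordance(saliva_calls, blood_calls)
rare = concordance(rare_only(saliva_calls, maf_table),
                   rare_only(blood_calls, maf_table))
print(f"overall: {overall:.2f}, rare: {rare:.2f}")
```

Restricting both call sets to rare variants before comparing mirrors the abstract's separate figure for SNVs with a minor allele frequency below 1%; other definitions (for example, overlap relative to the union of both sets) would give different numbers.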
7
Zhu M, Gribskov M. MiPepid: MicroPeptide identification tool using machine learning. BMC Bioinformatics 2019; 20:559. [PMID: 31703551 PMCID: PMC6842143 DOI: 10.1186/s12859-019-3033-9] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Received: 05/10/2019] [Accepted: 08/16/2019] [Indexed: 12/13/2022]
Abstract
Background Micropeptides are small proteins with length ≤ 100 amino acids. Short open reading frames that could produce micropeptides were traditionally ignored due to technical difficulties, as few small peptides had been experimentally confirmed. In the past decade, a growing number of micropeptides have been shown to play significant roles in vital biological activities. Despite the increased amount of data, we still lack bioinformatics tools for specifically identifying micropeptides from DNA sequences. Indeed, most existing tools for classifying coding and noncoding ORFs were built on datasets in which “normal-sized” proteins were considered to be positives and short ORFs were generally considered to be noncoding. Since the functional and biophysical constraints on small peptides are likely to be different from those on “normal” proteins, methods for predicting short translated ORFs must be trained independently from those for longer proteins. Results In this study, we have developed MiPepid, a machine-learning tool specifically for the identification of micropeptides. We trained MiPepid using carefully cleaned data from existing databases and used logistic regression with 4-mer features. With only the sequence information of an ORF, MiPepid is able to predict whether it encodes a micropeptide with 96% accuracy on a blind dataset of high-confidence micropeptides, and to correctly classify newly discovered micropeptides not included in either the training or the blind test data. Compared with state-of-the-art coding potential prediction methods, MiPepid performs exceptionally well, as other methods incorrectly classify most bona fide micropeptides as noncoding. MiPepid is alignment-free and runs sufficiently fast for genome-scale analyses. It is easy to use and is available at https://github.com/MindAI/MiPepid. Conclusions MiPepid was developed to specifically predict micropeptides, a category of proteins with increasing significance, from DNA sequences. It shows evident advantages over existing coding potential prediction methods on micropeptide identification. It is ready to use and runs fast. Electronic supplementary material The online version of this article (10.1186/s12859-019-3033-9) contains supplementary material, which is available to authorized users.
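To make "4-mer features" concrete, the sketch below turns an ORF into a 256-dimensional vector of 4-mer frequencies using overlapping windows. It is an illustration of the general idea only: the windowing scheme, the normalization, and the names `KMERS` and `kmer_features` are assumptions, and MiPepid's own encoding may differ.

```python
from collections import Counter
from itertools import product
from typing import List

K = 4
# All 4**4 = 256 DNA 4-mers in a fixed order, so every ORF maps to the same layout.
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_features(orf: str) -> List[float]:
    """Return the relative frequency of each 4-mer over overlapping windows.

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = orf.upper()
    windows = (seq[i:i + K] for i in range(len(seq) - K + 1))
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[kmer] / total for kmer in KMERS]


# Toy ORF, invented for illustration: six windows, "ATGA" occurs twice.
features = kmer_features("ATGAAATGA")
print(len(features), round(features[KMERS.index("ATGA")], 3))
```

A vector like this could be passed to any logistic regression implementation as one training row; keeping the 4-mer order fixed is what lets the fitted coefficients be compared across sequences.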
Affiliation(s)
- Mengmeng Zhu
  - Department of Statistics, Purdue University, West Lafayette, IN, 47907, USA
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, 47907, USA
- Michael Gribskov
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, 47907, USA
8
Pujar S, O'Leary NA, Farrell CM, Loveland JE, Mudge JM, Wallin C, Girón CG, Diekhans M, Barnes I, Bennett R, Berry AE, Cox E, Davidson C, Goldfarb T, Gonzalez JM, Hunt T, Jackson J, Joardar V, Kay MP, Kodali VK, Martin FJ, McAndrews M, McGarvey KM, Murphy M, Rajput B, Rangwala SH, Riddick LD, Seal RL, Suner MM, Webb D, Zhu S, Aken BL, Bruford EA, Bult CJ, Frankish A, Murphy T, Pruitt KD. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation. Nucleic Acids Res 2018; 46:D221-D228. [PMID: 29126148 PMCID: PMC5753299 DOI: 10.1093/nar/gkx1031] [Citation(s) in RCA: 83] [Impact Index Per Article: 13.8] [Received: 09/20/2017] [Accepted: 10/20/2017] [Indexed: 01/29/2023]
Abstract
The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.
Affiliation(s)
- Shashikant Pujar
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Nuala A O'Leary
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Catherine M Farrell
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Jane E Loveland
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Jonathan M Mudge
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Craig Wallin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Carlos G Girón
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Mark Diekhans
  - University of California Santa Cruz Genomics Institute, Santa Cruz, CA 95064, USA
- If Barnes
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Ruth Bennett
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Andrew E Berry
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Eric Cox
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Claire Davidson
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Tamara Goldfarb
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Jose M Gonzalez
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Toby Hunt
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- John Jackson
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Vinita Joardar
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Mike P Kay
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Vamsi K Kodali
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Fergal J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Monica McAndrews
  - Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, ME 04609, USA
- Kelly M McGarvey
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Michael Murphy
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Bhanu Rajput
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Sanjida H Rangwala
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Lillian D Riddick
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Ruth L Seal
  - HUGO Gene Nomenclature Committee, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Marie-Marthe Suner
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- David Webb
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Sophia Zhu
  - Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, ME 04609, USA
- Bronwen L Aken
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Elspeth A Bruford
  - HUGO Gene Nomenclature Committee, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Carol J Bult
  - Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, ME 04609, USA
- Adam Frankish
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Terence Murphy
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Kim D Pruitt
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
9
Khan W, Ghani A, Azmi MB, Razzak SA. Beyond sequencing: re-visiting annotations for PJL as a test case. BMC Res Notes 2019; 12:467. [PMID: 31366397 PMCID: PMC6670121 DOI: 10.1186/s13104-019-4508-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/12/2018] [Accepted: 07/23/2019] [Indexed: 11/10/2022]
Abstract
OBJECTIVES Current developments in sequencing techniques have enabled rapid and high-throughput generation of sequence data. However, there is a growing gap between the generation of raw sequencing data and the extraction of meaningful biological information. Variant annotation is a crucial step in the analysis of genome sequencing data. Incorrect or incomplete annotations can cause researchers to dilute interesting variants in a pool of false positives. We require consistent, accurate and reliable annotation of variants for making diagnostic and treatment decisions. Current annotation depends on the set of transcripts and the software used, which can be managed, with sufficient care, in the research context. Careful thought needs to be given to the choice of transcript sets and software packages for variant annotation in sequencing studies. In this project, the main objective is to analyze the genetic variants observed in Pakistani population data within the 1000 Genomes Project (1KGP). RESULTS We characterized only SNV and InDel variants, ~1.4 million variants in total. We also annotated the genetic variants with multiple annotation tools, ANNOVAR and SnpEff, and compared the differing results. Our population-specific catalogue will enhance future studies on the functional impact at the protein level.
Affiliation(s)
- Waqasuddin Khan
  - Jamil-ur-Rahman Center for Genome Research, Dr. Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi, 75270 Pakistan
- Aisha Ghani
  - Jamil-ur-Rahman Center for Genome Research, Dr. Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi, 75270 Pakistan
- Muhammad Bilal Azmi
  - Department of Biochemistry, Dow Medical College, Dow University of Health Sciences, Baba-E-Urdu Road, Karachi, 74200 Pakistan
- Safina Abdul Razzak
  - Jamil-ur-Rahman Center for Genome Research, Dr. Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi, 75270 Pakistan
10
Petersen SD, Zhang J, Lee JS, Jakociunas T, Grav LM, Kildegaard HF, Keasling JD, Jensen MK. Modular 5'-UTR hexamers for context-independent tuning of protein expression in eukaryotes. Nucleic Acids Res 2018; 46:e127. [PMID: 30124898 PMCID: PMC6265478 DOI: 10.1093/nar/gky734] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 03/09/2018] [Accepted: 08/01/2018] [Indexed: 11/25/2022]
Abstract
Functional characterization of regulatory DNA elements in broad genetic contexts is a prerequisite for forward engineering of biological systems. Translation initiation site (TIS) sequences are attractive to use for regulating gene activity and metabolic pathway fluxes because the genetic changes are minimal. However, limited knowledge is available on tuning gene outputs by varying TISs in different genetic and environmental contexts. Here, we created TIS hexamer libraries in baker’s yeast Saccharomyces cerevisiae directly 5′ of a reporter gene in various promoter contexts and measured gene activity distributions for each library. Next, selected TIS sequences, which resulted in almost 10-fold changes in reporter outputs, were experimentally characterized in various environmental and genetic contexts in both yeast and mammalian cells. From our analyses, we observed strong linear correlations (R² = 0.75–0.98) between all pairwise combinations of TIS order and gene activity. Finally, our analysis enabled the identification of a TIS with almost 50% stronger output than a commonly used TIS for protein expression in mammalian cells, and selected TISs were also used to tune gene activities in yeast at a metabolic branch point in order to prototype fitness and carotenoid production landscapes. Taken together, the characterized TISs support reliable context-independent forward engineering of translation initiation in eukaryotes.
Affiliation(s)
- Søren D Petersen
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
| | - Jie Zhang
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
| | - Jae S Lee
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
| | - Tadas Jakociunas
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
| | - Lise M Grav
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
| | - Helene F Kildegaard
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
| | - Jay D Keasling
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark; Joint BioEnergy Institute, Emeryville, CA 94608, USA; Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA; Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA 94720, USA; Department of Bioengineering, University of California, Berkeley, CA 94720, USA; Center for Synthetic Biochemistry, Institute for Synthetic Biology, Shenzhen Institutes of Advanced Technologies, Shenzhen 518055, China
| | - Michael K Jensen
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
| |
|
11
|
Hao Y, Zhang L, Niu Y, Cai T, Luo J, He S, Zhang B, Zhang D, Qin Y, Yang F, Chen R. SmProt: a database of small proteins encoded by annotated coding and non-coding RNA loci. Brief Bioinform 2019; 19:636-643. [PMID: 28137767 DOI: 10.1093/bib/bbx005] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Received: 11/22/2016] [Indexed: 11/12/2022] Open
Abstract
Small proteins is the general term for proteins with length shorter than 100 amino acids. Identification and functional studies of small proteins have advanced rapidly in recent years, and several studies have shown that small proteins play important roles in diverse functions including development, muscle contraction and DNA repair. Identification and characterization of previously unrecognized small proteins may contribute in important ways to cell biology and human health. Current databases are generally somewhat deficient in that they have either not collected small proteins systematically, or contain only predictions of small proteins in a limited number of tissues and species. Here, we present a specifically designed web-accessible database, small proteins database (SmProt, http://bioinfo.ibp.ac.cn/SmProt), which is a database documenting small proteins. The current release of SmProt incorporates 255 010 small proteins computationally or experimentally identified in 291 cell lines/tissues derived from eight popular species. The database provides a variety of data including basic information (sequence, location, gene name, organism, etc.) as well as specific information (experiment, function, disease type, etc.). To facilitate data extraction, SmProt supports multiple search options, including species, genome location, gene name and their aliases, cell lines/tissues, ORF type, gene type, PubMed ID and SmProt ID. SmProt also incorporates a service for the BLAST alignment search and provides a local UCSC Genome Browser. Additionally, SmProt defines a high-confidence set of small proteins and predicts the functions of the small proteins.
Affiliation(s)
- Yajing Hao
- Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| | - Lili Zhang
- Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| | - Yiwei Niu
- Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| | - Tanxi Cai
- Key Laboratory of Protein and Peptide Pharmaceuticals and Laboratory of Proteomics, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| | - Jianjun Luo
- Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| | - Shunmin He
- Key Laboratory of the Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
| | - Bao Zhang
- Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| | - Dejiu Zhang
- Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| | - Yan Qin
- Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| | - Fuquan Yang
- Key Laboratory of Protein and Peptide Pharmaceuticals and Laboratory of Proteomics, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| | - Runsheng Chen
- Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| |
|
12
|
Gong Y, Wang K, Xiao SP, Mi P, Li W, Shang Y, Dou F. Overexpressed TTC3 Protein Tends to be Cleaved into Fragments and Form Aggregates in the Nucleus. Neuromolecular Med 2019; 21:85-96. [PMID: 30203323 DOI: 10.1007/s12017-018-8509-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 04/19/2018] [Accepted: 08/31/2018] [Indexed: 12/01/2022]
Abstract
Human tetratricopeptide repeat domain 3 (TTC3) is a gene on 21q22.2 within the Down syndrome critical region (DSCR). Earlier studies suggest that TTC3 may be an important regulator in individual development, especially in neural development. As an E3 ligase, TTC3 binds to phosphorylated Akt and silences its activity via the proteasomal cascade. Several groups have also recently reported the involvement of TTC3 in familial Alzheimer's disease. In addition, our previous work shows that TTC3 also regulates the degradation of DNA polymerase gamma and that over-expressed TTC3 protein tends to form insoluble aggregates in cells. In this study, we focus on the solubility and intracellular localization of TTC3 protein. Over-expressed TTC3 tends to form insoluble aggregates over time. Treatment with the proteasome inhibitor MG132 resulted in more TTC3 aggregates in a short period of time. We fused a fluorescent protein to either terminus of the TTC3 protein and found that the intracellular localization of the fluorescent signals differs between the N-terminally tagged and C-terminally tagged proteins. Western blotting revealed that the TTC3 protein is cleaved into fragments of different sizes at multiple sites. The N-terminal sub-fragments of TTC3 are prone to form nuclear aggregates, and TTC3 nuclear import is mediated by signals within the N-terminal residues 1 to 650. Moreover, over-expressed TTC3 induced a considerable degree of cytotoxicity, and its N-terminal sub-fragments are more potent inhibitors of cell proliferation than the full-length protein. Considering the prevalent proteostasis dysregulation in neurodegenerative diseases, these findings may relate to the pathology of such diseases.
Affiliation(s)
- Yueqing Gong
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, College of Life Sciences, Beijing Normal University, Beijing, China
- Key Laboratory of Cell Proliferation and Regulation Biology, Ministry of Education, College of Life Sciences, Beijing Normal University, Beijing, China
- Center for Collaboration and Innovation in Brain and Learning Sciences, Beijing Normal University, Beijing, China
| | - Kun Wang
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, College of Life Sciences, Beijing Normal University, Beijing, China
- Key Laboratory of Cell Proliferation and Regulation Biology, Ministry of Education, College of Life Sciences, Beijing Normal University, Beijing, China
- Center for Collaboration and Innovation in Brain and Learning Sciences, Beijing Normal University, Beijing, China
| | - Sheng-Ping Xiao
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, College of Life Sciences, Beijing Normal University, Beijing, China
- Key Laboratory of Cell Proliferation and Regulation Biology, Ministry of Education, College of Life Sciences, Beijing Normal University, Beijing, China
- Center for Collaboration and Innovation in Brain and Learning Sciences, Beijing Normal University, Beijing, China
| | - Panying Mi
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, College of Life Sciences, Beijing Normal University, Beijing, China
- Key Laboratory of Cell Proliferation and Regulation Biology, Ministry of Education, College of Life Sciences, Beijing Normal University, Beijing, China
- Center for Collaboration and Innovation in Brain and Learning Sciences, Beijing Normal University, Beijing, China
| | - Wanjie Li
- Key Laboratory of Cell Proliferation and Regulation Biology, Ministry of Education, College of Life Sciences, Beijing Normal University, Beijing, China
| | - Yu Shang
- Key Laboratory of Cell Proliferation and Regulation Biology, Ministry of Education, College of Life Sciences, Beijing Normal University, Beijing, China
| | - Fei Dou
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, College of Life Sciences, Beijing Normal University, Beijing, China.
- Key Laboratory of Cell Proliferation and Regulation Biology, Ministry of Education, College of Life Sciences, Beijing Normal University, Beijing, China.
- Center for Collaboration and Innovation in Brain and Learning Sciences, Beijing Normal University, Beijing, China.
| |
|
13
|
Guo FB, Dong C, Hua HL, Liu S, Luo H, Zhang HW, Jin YT, Zhang KY. Accurate prediction of human essential genes using only nucleotide composition and association information. Bioinformatics 2018; 33:1758-1764. [PMID: 28158612 PMCID: PMC7110051 DOI: 10.1093/bioinformatics/btx055] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 11/22/2016] [Accepted: 01/25/2017] [Indexed: 12/20/2022] Open
Abstract
Motivation Previously constructed classifiers for predicting eukaryotic essential genes integrated a variety of features, including experimental ones. Obtaining satisfactory predictions using only nucleotide (sequence) information would be more promising. Three groups recently identified essential genes in human cancer cell lines using wet-lab experiments, which provided an excellent opportunity to test this idea. Here we extended the Z curve method into the λ-interval form to denote nucleotide composition and association information and used it to construct an SVM classification model. Results Our model accurately predicted human gene essentiality with an AUC higher than 0.88 for both 5-fold cross-validation and jackknife tests. These results demonstrate that the essentiality of human genes can be reliably reflected by sequence information alone. We re-predicted the negative dataset with our Pheg server, and 118 genes were additionally predicted as essential. Among them, 20 were found to be homologues of mouse essential genes, indicating that some of the 118 genes are indeed essential but were overlooked by previous experiments. As the first available server, Pheg can predict essentiality for anonymous human gene sequences. It is also hoped that the λ-interval Z curve method can be effectively extended to classification problems for other DNA elements. Availability and Implementation http://cefg.uestc.edu.cn/Pheg. Contact fbguo@uestc.edu.cn. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Feng-Biao Guo
- School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Chuan Dong
- School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Hong-Li Hua
- School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Shuo Liu
- School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Hao Luo
- Department of Physics, Tianjin University, Tianjin, China
| | - Hong-Wan Zhang
- School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Yan-Ting Jin
- School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Kai-Yue Zhang
- School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| |
|
14
|
Target-Cell-Directed Bioengineering Approaches for Gene Therapy of Hemophilia A. MOLECULAR THERAPY-METHODS & CLINICAL DEVELOPMENT 2018; 9:57-69. [PMID: 29552578 PMCID: PMC5852392 DOI: 10.1016/j.omtm.2018.01.004] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 11/03/2017] [Accepted: 01/09/2018] [Indexed: 01/08/2023]
Abstract
Potency is a key optimization parameter for hemophilia A gene therapy product candidates. Optimization strategies include promoter engineering to increase transcription, codon optimization of mRNA to improve translation, and amino-acid substitution to promote secretion. Herein, we describe both rational and empirical design approaches to the development of a minimally sized, highly potent AAV-fVIII vector that incorporates three unique elements: a liver-directed 146-nt transcription regulatory module, a target-cell-specific codon optimization algorithm, and a high-expression bioengineered fVIII variant. The minimal synthetic promoter allows for the smallest AAV-fVIII vector genome known at 4,832 nt, while the tissue-directed codon optimization strategy facilitates increased fVIII transgene product expression in target cell types, e.g., hepatocytes, over traditional genome-level codon optimization strategies. As a tertiary approach, we incorporated ancient and orthologous fVIII sequence elements previously shown to facilitate improved biosynthesis through post-translational mechanisms. Together, these technologies contribute to an AAV-fVIII vector that confers sustained, curative levels of fVIII at a minimal dose in hemophilia A mice. Moreover, the first two technologies should be generalizable to all liver-directed gene therapy vector designs.
|
15
|
Darling AL, Liu Y, Oldfield CJ, Uversky VN. Intrinsically Disordered Proteome of Human Membrane-Less Organelles. Proteomics 2017; 18:e1700193. [PMID: 29068531 DOI: 10.1002/pmic.201700193] [Citation(s) in RCA: 145] [Impact Index Per Article: 18.1] [Received: 08/29/2017] [Revised: 02/10/2017] [Indexed: 11/10/2022]
Abstract
It is recognized now that various proteinaceous membrane-less organelles (PMLOs) are commonly found in cytoplasm, nucleus, and mitochondria of various eukaryotic cells (as well as in the chloroplasts of plant cells). Being different from the "traditional" membrane-encapsulated organelles, such as chloroplasts, endoplasmic reticulum, Golgi apparatus, lysosomes, mitochondria, nucleus, and vacuoles, PMLOs solve the cellular need to facilitate and regulate molecular interactions via reversible and controllable isolation of target molecules in specialized compartments. PMLOs possess liquid-like behavior and are believed to be formed as a result of biological liquid-liquid phase transitions (LLPTs, also known as liquid-liquid phase separation), where an intricate interplay between RNA and intrinsically disordered proteins (IDPs) or hybrid proteins containing ordered domains and intrinsically disordered protein regions (IDPRs) may play an important role. This review analyzes the prevalence of intrinsic disorder in proteins associated with various PMLOs found in human cells and considers some of the functional roles of IDPs/IDPRs in biogenesis of these organelles.
Affiliation(s)
- April L Darling
- Department of Molecular Medicine and USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL, USA
| | - Yun Liu
- Guangdong Provincial Key Laboratory for Plant Epigenetics, Shenzhen Key Laboratory of Microbial Genetic Engineering, College of Life Sciences and Oceanography, Shenzhen University, Shenzhen, Guangdong, P. R. China
| | | | - Vladimir N Uversky
- Department of Molecular Medicine and USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL, USA; Institute for Biological Instrumentation, Russian Academy of Sciences, Moscow Region, Russia
| |
|
16
|
Briones-Orta MA, Avendaño-Vázquez SE, Aparicio-Bautista DI, Coombes JD, Weber GF, Syn WK. Osteopontin splice variants and polymorphisms in cancer progression and prognosis. Biochim Biophys Acta Rev Cancer 2017; 1868:93-108.A. [PMID: 28254527 DOI: 10.1016/j.bbcan.2017.02.005] [Citation(s) in RCA: 66] [Impact Index Per Article: 8.3] [Received: 01/10/2017] [Revised: 02/24/2017] [Accepted: 02/25/2017] [Indexed: 12/12/2022]
Abstract
Osteopontin (OPN) is an extracellular matrix protein that is overexpressed in various cancers and promotes oncogenic features including cell proliferation, survival, migration, and angiogenesis, among others. OPN can participate in the regulation of the tumor microenvironment, affecting both cancer and neighboring cells. Here, we review the roles of OPN splice variants (a, b, c) in cancer development, progression, and prognosis, and also discuss the identities of isoforms 4 and 5. We also discuss how single-nucleotide polymorphisms (SNPs) of the OPN gene are an additional factor influencing the level of OPN in individuals, modulating the risks of cancer development and outcome.
Affiliation(s)
| | | | | | - Jason D Coombes
- Regeneration and Repair, Institute of Hepatology, Foundation for Liver Research, London, United Kingdom
| | - Georg F Weber
- James L. Winkle College of Pharmacy, University of Cincinnati Academic Health Center, Cincinnati, OH, United States
| | - Wing-Kin Syn
- Regeneration and Repair, Institute of Hepatology, Foundation for Liver Research, London, United Kingdom; Division of Gastroenterology and Hepatology, Department of Medicine, Medical University of South Carolina, Charleston, SC, United States; Section of Gastroenterology, Ralph H Johnson Veteran Affairs Medical Center, Charleston, SC, United States
| |
|
17
|
Tress ML, Abascal F, Valencia A. Alternative Splicing May Not Be the Key to Proteome Complexity. Trends Biochem Sci 2016; 42:98-110. [PMID: 27712956 DOI: 10.1016/j.tibs.2016.08.008] [Citation(s) in RCA: 231] [Impact Index Per Article: 25.7] [Received: 03/16/2016] [Revised: 05/19/2016] [Accepted: 08/15/2016] [Indexed: 12/21/2022]
Abstract
Alternative splicing is commonly believed to be a major source of cellular protein diversity. However, although many thousands of alternatively spliced transcripts are routinely detected in RNA-seq studies, reliable large-scale mass spectrometry-based proteomics analyses identify only a small fraction of annotated alternative isoforms. The clearest finding from proteomics experiments is that most human genes have a single main protein isoform, while those alternative isoforms that are identified tend to be the most biologically plausible: those with the most cross-species conservation and those that do not compromise functional domains. Indeed, most alternative exons do not seem to be under selective pressure, suggesting that a large majority of predicted alternative transcripts may not even be translated into proteins.
Affiliation(s)
- Michael L Tress
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029 Madrid, Spain
| | - Federico Abascal
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029 Madrid, Spain; Human Genetics Department, Sandhu Group, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Alfonso Valencia
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029 Madrid, Spain; National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029 Madrid, Spain.
| |
|
18
|
Shirota M, Kinoshita K. Discrepancies between human DNA, mRNA and protein reference sequences and their relation to single nucleotide variants in the human population. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw124. [PMID: 27589963 PMCID: PMC5009343 DOI: 10.1093/database/baw124] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/05/2016] [Accepted: 08/04/2016] [Indexed: 01/24/2023]
Abstract
The protein coding sequences of the human reference genome GRCh38, RefSeq mRNA and UniProt protein databases are sometimes inconsistent with each other, due to polymorphisms in the human population, but the overall landscape of the discordant sequences has not been clarified. In this study, we comprehensively listed the discordant bases and regions between the GRCh38, RefSeq and UniProt reference sequences, based on the genomic coordinates of GRCh38. We observed that the RefSeq sequences are more likely to represent the major alleles than GRCh38 and UniProt, by assigning the alternative allele frequencies of the discordant bases. Since some reference sequences have minor alleles, functional and structural annotations may be performed based on rare alleles in the human population, thereby biasing these analyses. Some of the differences between the RefSeq and GRCh38 account for biological differences due to known RNA-editing sites. The definitions of the coding regions are frequently complicated by possible micro-exons within introns and by SNVs with large alternative allele frequencies near exon–intron boundaries. The mRNA or protein regions missing from GRCh38 were mainly due to small deletions, and these sequences need to be identified. Taken together, our results clarify overall consistency and remaining inconsistency between the reference sequences.
Affiliation(s)
- Matsuyuki Shirota
- Graduate School of Medicine, Tohoku University, Sendai, Miyagi 9808575, Japan; Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi 9808575, Japan; Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi 9808579, Japan
| | - Kengo Kinoshita
- Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi 9808575, Japan; Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi 9808579, Japan; Institute for Development, Aging and Cancer, Tohoku University, Sendai, Miyagi 9808575, Japan
| |
|
19
|
Du C, Pusey BN, Adams CJ, Lau CC, Bone WP, Gahl WA, Markello TC, Adams DR. Explorations to improve the completeness of exome sequencing. BMC Med Genomics 2016; 9:56. [PMID: 27568008 PMCID: PMC5002202 DOI: 10.1186/s12920-016-0216-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 11/30/2015] [Accepted: 08/05/2016] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Exome sequencing has advanced to clinical practice and proven useful for obtaining molecular diagnoses in rare diseases. In approximately 75% of cases, however, a clinical exome study does not produce a definitive molecular diagnosis. These residual cases comprise a new diagnostic challenge for the genetics community. The Undiagnosed Diseases Program of the National Institutes of Health routinely utilizes exome sequencing for refractory clinical cases. Our preliminary data suggest that disease-causing variants may be missed by current standard-of-care clinical exome analysis. Such false negatives reflect limitations in experimental design, technical performance, and data analysis. RESULTS We present examples from our datasets to quantify the analytical performance associated with current practices, and explore strategies to improve the completeness of data analysis. In particular, we focus on patient ascertainment, exome capture, inclusion of intronic variants, and evaluation of medium-sized structural variants. CONCLUSIONS The strategies we present may recover previously missed disease-causing variants in second-pass exome analysis. Understanding the limitations of the current clinical exome search space provides a rational basis to improve methods for disease variant detection using genome-scale sequencing techniques.
Affiliation(s)
- Chen Du
- NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
| | - Barbara N Pusey
- NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
| | - Christopher J Adams
- NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
| | - C Christopher Lau
- NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
| | - William P Bone
- NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
| | - William A Gahl
- NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
| | - Thomas C Markello
- NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA
| | - David R Adams
- NIH Undiagnosed Diseases Program, Common Fund, National Institutes of Health, National Human Genome Research Institute, Bethesda, MD, USA.
| |
|
20
|
Aken BL, Ayling S, Barrell D, Clarke L, Curwen V, Fairley S, Fernandez Banet J, Billis K, García Girón C, Hourlier T, Howe K, Kähäri A, Kokocinski F, Martin FJ, Murphy DN, Nag R, Ruffier M, Schuster M, Tang YA, Vogel JH, White S, Zadissa A, Flicek P, Searle SMJ. The Ensembl gene annotation system. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw093. [PMID: 27337980 PMCID: PMC4919035 DOI: 10.1093/database/baw093] [Citation(s) in RCA: 769] [Impact Index Per Article: 85.4] [Received: 01/11/2016] [Accepted: 05/09/2016] [Indexed: 12/12/2022]
Abstract
The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail. Database URL: http://www.ensembl.org/index.html.
Affiliation(s)
- Bronwen L Aken
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Sarah Ayling
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; Present addresses: The Genome Analysis Centre, Norwich Research Park, Norwich NR4 7UH, UK
| | - Daniel Barrell
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; Eagle Genomics Ltd, Babraham Research Campus, Cambridge CB22 3AT, UK
| | - Laura Clarke
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Valery Curwen
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Susan Fairley
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Julio Fernandez Banet
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; Pfizer Inc, 10646 Science Center Dr, San Diego, CA 92121, USA
| | - Konstantinos Billis
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Carlos García Girón
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Kevin Howe
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Andreas Kähäri
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; Institutionen för cell-och molekylärbiologi, Uppsala University, Husargatan 3, Uppsala 752 37, Sweden
| | - Felix Kokocinski
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Daniel N Murphy
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Rishi Nag
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Magali Ruffier
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Michael Schuster
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna a-1090, Austria
| | - Y Amy Tang
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Jan-Hinnerk Vogel
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK Genentech Inc, 1 DNA Way, South San Francisco, CA 94080, USA
| | - Simon White
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK The Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Amonida Zadissa
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Stephen M J Searle
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| |
|
21
|
Kim H, Kang N, An K, Koo J, Kim MS. MRPrimerW: a tool for rapid design of valid high-quality primers for multiple target qPCR experiments. Nucleic Acids Res 2016; 44:W259-66. [PMID: 27154272 PMCID: PMC4987926 DOI: 10.1093/nar/gkw380] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 02/13/2016] [Accepted: 04/25/2016] [Indexed: 01/25/2023] Open
Abstract
Design of high-quality primers for multiple target sequences is essential for qPCR experiments, but is challenging due to the need to consider both homology tests on off-target sequences and the same stringent filtering constraints on the primers. Existing web servers for primer design have major drawbacks, including requiring the use of BLAST-like tools for homology tests, lack of support for ranking of primers, TaqMan probes and simultaneous design of primers against multiple targets. Due to the large-scale computational overhead, the few web servers supporting homology tests use heuristic approaches or perform homology tests within a limited scope. Here, we describe the MRPrimerW, which performs complete homology testing, supports batch design of primers for multi-target qPCR experiments, supports design of TaqMan probes and ranks the resulting primers to return the top-1 best primers to the user. To ensure high accuracy, we adopted the core algorithm of a previously reported MapReduce-based method, MRPrimer, but completely redesigned it to allow users to receive query results quickly in a web interface, without requiring a MapReduce cluster or a long computation. MRPrimerW provides primer design services and a complete set of 341 963 135 in silico validated primers covering 99% of human and mouse genes. Free access: http://MRPrimerW.com.
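The two kinds of check named in the abstract above (filtering constraints on each primer and a homology test against off-target sequences) can be illustrated with a minimal sketch. This is not MRPrimerW's algorithm: the function names, the thresholds, the Wallace-rule melting temperature estimate, and the exact-substring homology test are all assumptions chosen for illustration only.

```python
# Illustrative sketch only; names and thresholds are hypothetical,
# not taken from MRPrimerW or MRPrimer.

def gc_fraction(primer: str) -> float:
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature via the Wallace rule: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence containing only A, C, G, T."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def passes_filters(primer, off_targets, min_len=18, max_len=25,
                   gc_range=(0.4, 0.6), tm_range=(52, 66)):
    """Apply single-primer constraints, then a naive homology test.

    The homology test here is an exact-substring check on both strands,
    a deliberate simplification of a real mismatch-tolerant search.
    """
    if not min_len <= len(primer) <= max_len:
        return False
    if not gc_range[0] <= gc_fraction(primer) <= gc_range[1]:
        return False
    if not tm_range[0] <= wallace_tm(primer) <= tm_range[1]:
        return False
    forward = primer.upper()
    reverse = reverse_complement(primer)
    return not any(forward in t.upper() or reverse in t.upper()
                   for t in off_targets)
```

A real design tool would tolerate mismatches in the homology test and would use a nearest-neighbor thermodynamic model rather than the Wallace rule; the sketch only shows how the two stages compose.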
Affiliation(s)
- Hyerin Kim
- Department of Information and Communication Engineering, DGIST, Daegu 42988, South Korea
| | - NaNa Kang
- Department of Brain and Cognitive Sciences, DGIST, Daegu 42988, South Korea
| | - KyuHyeon An
- Department of Information and Communication Engineering, DGIST, Daegu 42988, South Korea
| | - JaeHyung Koo
- Department of Brain and Cognitive Sciences, DGIST, Daegu 42988, South Korea
| | - Min-Soo Kim
- Department of Information and Communication Engineering, DGIST, Daegu 42988, South Korea
| |
|
22
|
Agarwal R, Kumar B, Jayadev M, Raghav D, Singh A. CoReCG: a comprehensive database of genes associated with colon-rectal cancer. Database: The Journal of Biological Databases and Curation 2016; 2016:baw059. [PMID: 27114494 PMCID: PMC4843536 DOI: 10.1093/database/baw059] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 06/24/2015] [Accepted: 03/23/2016] [Indexed: 12/19/2022]
Abstract
Cancer of the large intestine is commonly referred to as colorectal cancer, which is also the third most frequently occurring neoplasm across the globe. Although much work is being carried out to understand the mechanism of carcinogenesis and the advancement of this disease, fewer studies have been performed to collate the scattered information on alterations in tumorigenic cells, such as genes, mutations, expression changes, epigenetic alterations, post-translational modifications and genetic heterogeneity. Earlier findings mostly focused on understanding the etiology of colorectal carcinogenesis, but less emphasis was given to a comprehensive review of the existing findings of individual studies, which could provide better diagnostics based on the markers suggested in discrete studies. The Colon Rectal Cancer Gene Database (CoReCG) contains information on 2056 colon-rectal cancer genes involved in distinct colorectal cancer stages, sourced from published literature, with an effective knowledge-based information retrieval system. Additionally, an interactive web interface enriched with various browsing sections and augmented with an advanced search facility for querying the database is provided for user-friendly browsing; online tools for sequence similarity searches and a knowledge-based schema ensure a researcher-friendly information retrieval mechanism. The Colorectal cancer gene database (CoReCG) is expected to be a single-point source for identification of colorectal cancer-related genes, thereby helping with the improvement of classification, diagnosis and treatment of human cancers. Database URL: lms.snu.edu.in/corecg
Affiliation(s)
- Rahul Agarwal
- Department of Life Science, Shiv Nadar University, Greater Noida, India
| | - Binayak Kumar
- Department of Life Science, Shiv Nadar University, Greater Noida, India
| | - Msk Jayadev
- Department of Life Science, Shiv Nadar University, Greater Noida, India
| | - Dhwani Raghav
- Department of Health Research (Ministry of Health & Family Welfare), Division of Epidemiology and Communicable Diseases, Indian Council of Medical Research, Ansari Nagar, New Delhi, India
| | - Ashutosh Singh
- Department of Life Science, Shiv Nadar University, Greater Noida, India
| |
|
23
|
Dalgleish R. LSDBs and How They Have Evolved. Hum Mutat 2016; 37:532-9. [DOI: 10.1002/humu.22979] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 12/23/2015] [Accepted: 02/18/2016] [Indexed: 01/10/2023]
Affiliation(s)
- Raymond Dalgleish
- Department of Genetics, University of Leicester, Leicester, United Kingdom
| |
|
24
|
Breuza L, Poux S, Estreicher A, Famiglietti ML, Magrane M, Tognolli M, Bridge A, Baratin D, Redaschi N. The UniProtKB guide to the human proteome. Database: The Journal of Biological Databases and Curation 2016; 2016:bav120. [PMID: 26896845 PMCID: PMC4761109 DOI: 10.1093/database/bav120] [Citation(s) in RCA: 131] [Impact Index Per Article: 14.6] [Received: 10/08/2015] [Accepted: 11/24/2015] [Indexed: 11/23/2022]
Abstract
Advances in high-throughput and advanced technologies allow researchers to routinely perform whole genome and proteome analysis. For this purpose, they need high-quality resources providing comprehensive gene and protein sets for their organisms of interest. Using the example of the human proteome, we will describe the content of a complete proteome in the UniProt Knowledgebase (UniProtKB). We will show how manual expert curation of UniProtKB/Swiss-Prot is complemented by expert-driven automatic annotation to build a comprehensive, high-quality and traceable resource. We will also illustrate how the complexity of the human proteome is captured and structured in UniProtKB. Database URL: www.uniprot.org
Affiliation(s)
- Lionel Breuza
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Rue Michel Servet, Geneva 4, 1211, Switzerland
| | - Sylvain Poux
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Rue Michel Servet, Geneva 4, 1211, Switzerland
| | - Anne Estreicher
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Rue Michel Servet, Geneva 4, 1211, Switzerland
| | - Maria Livia Famiglietti
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Rue Michel Servet, Geneva 4, 1211, Switzerland
| | - Michele Magrane
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Michael Tognolli
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Rue Michel Servet, Geneva 4, 1211, Switzerland
| | - Alan Bridge
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Rue Michel Servet, Geneva 4, 1211, Switzerland
| | - Delphine Baratin
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Rue Michel Servet, Geneva 4, 1211, Switzerland
| | - Nicole Redaschi
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Rue Michel Servet, Geneva 4, 1211, Switzerland
| |
|
25
|
Horvatovich P, Lundberg EK, Chen YJ, Sung TY, He F, Nice EC, Goode RJ, Yu S, Ranganathan S, Baker MS, Domont GB, Velasquez E, Li D, Liu S, Wang Q, He QY, Menon R, Guan Y, Corrales FJ, Segura V, Casal JI, Pascual-Montano A, Albar JP, Fuentes M, Gonzalez-Gonzalez M, Diez P, Ibarrola N, Degano RM, Mohammed Y, Borchers CH, Urbani A, Soggiu A, Yamamoto T, Salekdeh GH, Archakov A, Ponomarenko E, Lisitsa A, Lichti CF, Mostovenko E, Kroes RA, Rezeli M, Végvári Á, Fehniger TE, Bischoff R, Vizcaíno JA, Deutsch EW, Lane L, Nilsson CL, Marko-Varga G, Omenn GS, Jeong SK, Lim JS, Paik YK, Hancock WS. Quest for Missing Proteins: Update 2015 on Chromosome-Centric Human Proteome Project. J Proteome Res 2015; 14:3415-3431. [PMID: 26076068 DOI: 10.1021/pr5013009] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.2] [Indexed: 12/22/2022]
Abstract
This paper summarizes the recent activities of the Chromosome-Centric Human Proteome Project (C-HPP) consortium, which develops new technologies to identify yet-to-be annotated proteins (termed "missing proteins") in biological samples that lack sufficient experimental evidence at the protein level for confident protein identification. The C-HPP also aims to identify new protein forms that may be caused by genetic variability, post-translational modifications, and alternative splicing. Proteogenomic data integration forms the basis of the C-HPP's activities; therefore, we have summarized some of the key approaches and their roles in the project. We present new analytical technologies that improve the chemical space and lower detection limits coupled to bioinformatics tools and some publicly available resources that can be used to improve data analysis or support the development of analytical assays. Most of this paper's content has been compiled from posters, slides, and discussions presented in the series of C-HPP workshops held during 2014. All data (posters, presentations) used are available at the C-HPP Wiki (http://c-hpp.webhosting.rug.nl/) and in the Supporting Information.
Affiliation(s)
- Péter Horvatovich
- Analytical Biochemistry, Department of Pharmacy, University of Groningen, A. Deusinglaan 1, 9713 AV Groningen, The Netherlands
| | - Emma K Lundberg
- Science for Life Laboratory, KTH - Royal Institute of Technology, SE-171 21 Stockholm, Sweden
| | - Yu-Ju Chen
- Institute of Chemistry, Academia Sinica, 128 Academia Road Sec. 2, Taipei 115, Taiwan
| | - Ting-Yi Sung
- Institute of Information Science, Academia Sinica, 128 Academia Road Sec. 2, Taipei 115, Taiwan
| | - Fuchu He
- The State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, No. 27 Taiping Road, Haidian District, Beijing 100850, China
| | - Edouard C Nice
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Victoria 3800, Australia
| | - Robert J Goode
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Victoria 3800, Australia
| | - Simon Yu
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Victoria 3800, Australia
| | - Shoba Ranganathan
- Department of Chemistry and Biomolecular Sciences and ARC Centre of Excellence in Bioinformatics, Macquarie University, Sydney, New South Wales 2109, Australia
| | - Mark S Baker
- Australian School of Advanced Medicine, Macquarie University, Sydney, NSW 2109, Australia
| | - Gilberto B Domont
- Proteomics Unit, Institute of Chemistry, Federal University of Rio de Janeiro, Cidade Universitária, Av Athos da Silveira Ramos 149, CT-A542, 21941-909 Rio de Janeiro, RJ, Brazil
| | - Erika Velasquez
- Proteomics Unit, Institute of Chemistry, Federal University of Rio de Janeiro, Cidade Universitária, Av Athos da Silveira Ramos 149, CT-A542, 21941-909 Rio de Janeiro, RJ, Brazil
| | - Dong Li
- The State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, No. 27 Taiping Road, Haidian District, Beijing 100850, China
| | - Siqi Liu
- Beijing Institute of Genomics and BGI Shenzhen, No. 1 Beichen West Road, Chaoyang District, Beijing 100101, China
- BGI Shenzhen, Beishan Road, Yantian District, Shenzhen, 518083, China
| | - Quanhui Wang
- Beijing Institute of Genomics and BGI Shenzhen, No. 1 Beichen West Road, Chaoyang District, Beijing 100101, China
| | - Qing-Yu He
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
| | - Rajasree Menon
- Department of Computational Medicine & Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, Michigan 48109-2218, United States
| | - Yuanfang Guan
- Departments of Computational Medicine & Bioinformatics and Computer Sciences, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, Michigan 48109-2218, United States
| | - Fernando J Corrales
- ProteoRed-ISCIII, Biomolecular and Bioinformatics Resources Platform (PRB2), Spanish Consortium of C-HPP (Chr-16), CIMA, University of Navarra, 31008 Pamplona, Spain
- Chr16 SpHPP Consortium, CIMA, University of Navarra, 31008 Pamplona, Spain
| | - Victor Segura
- ProteoRed-ISCIII, Biomolecular and Bioinformatics Resources Platform (PRB2), Spanish Consortium of C-HPP (Chr-16), CIMA, University of Navarra, 31008 Pamplona, Spain
- Chr16 SpHPP Consortium, CIMA, University of Navarra, 31008 Pamplona, Spain
| | - J Ignacio Casal
- Department of Cellular and Molecular Medicine, Centro de Investigaciones Biológicas (CIB-CSIC), 28040 Madrid, Spain
| | | | - Juan P Albar
- Centro Nacional de Biotecnologia (CNB-CSIC), Cantoblanco, 28049 Madrid, Spain
| | - Manuel Fuentes
- Cancer Research Center. Proteomics Unit and General Service of Cytometry, Department of Medicine, University of Salamanca-CSIC, IBSAL, Campus Miguel de Unamuno s/n, 37007 Salamanca, Spain
| | - Maria Gonzalez-Gonzalez
- Cancer Research Center. Proteomics Unit and General Service of Cytometry, Department of Medicine, University of Salamanca-CSIC, IBSAL, Campus Miguel de Unamuno s/n, 37007 Salamanca, Spain
| | - Paula Diez
- Cancer Research Center. Proteomics Unit and General Service of Cytometry, Department of Medicine, University of Salamanca-CSIC, IBSAL, Campus Miguel de Unamuno s/n, 37007 Salamanca, Spain
| | - Nieves Ibarrola
- Cancer Research Center. Proteomics Unit and General Service of Cytometry, Department of Medicine, University of Salamanca-CSIC, IBSAL, Campus Miguel de Unamuno s/n, 37007 Salamanca, Spain
| | - Rosa M Degano
- Cancer Research Center. Proteomics Unit and General Service of Cytometry, Department of Medicine, University of Salamanca-CSIC, IBSAL, Campus Miguel de Unamuno s/n, 37007 Salamanca, Spain
| | - Yassene Mohammed
- University of Victoria - Genome British Columbia Proteomics Centre, Vancouver Island Technology Park, #3101-4464 Markham Street, Victoria, British Columbia V8Z 7X8, Canada
- Center for Proteomics and Metabolomics, Leiden University Medical Center, 2333 ZA Leiden, The Netherlands
| | - Christoph H Borchers
- University of Victoria - Genome British Columbia Proteomics Centre, Vancouver Island Technology Park, #3101-4464 Markham Street, Victoria, British Columbia V8Z 7X8, Canada
| | - Andrea Urbani
- Proteomics and Metabonomics Laboratory, Fondazione Santa Lucia, Rome, Italy
- Department of Experimental Medicine and Surgery, University of Rome "Tor Vergata", Rome, Italy
| | - Alessio Soggiu
- Department of Veterinary Science and Public Health (DIVET), University of Milano, via Celoria 10, 20133 Milano, Italy
| | - Tadashi Yamamoto
- Institute of Nephrology, Graduate School of Medical and Dental Sciences, Niigata University, Niigata, Japan
| | - Ghasem Hosseini Salekdeh
- Department of Molecular Systems Biology at Cell Science Research Center, Royan Institute for Stem Cell Biology and Technology, ACECR, Tehran, Iran
- Department of Systems Biology, Agricultural Biotechnology Research Institute of Iran, Karaj, Iran
| | | | | | - Andrey Lisitsa
- Orechovich Institute of Biomedical Chemistry, Moscow, Russia
| | - Cheryl F Lichti
- Department of Pharmacology and Toxicology, The University of Texas Medical Branch, Galveston, Texas 77555-0617, United States
| | - Ekaterina Mostovenko
- Department of Pharmacology and Toxicology, The University of Texas Medical Branch, Galveston, Texas 77555-0617, United States
| | - Roger A Kroes
- Falk Center for Molecular Therapeutics, Department of Biomedical Engineering, Northwestern University, 1801 Maple Ave., Suite 4300, Evanston, Illinois 60201, United States
| | - Melinda Rezeli
- Clinical Protein Science & Imaging, Department of Biomedical Engineering, Lund University, BMC D13, 221 84 Lund, Sweden
| | - Ákos Végvári
- Clinical Protein Science & Imaging, Department of Biomedical Engineering, Lund University, BMC D13, 221 84 Lund, Sweden
| | - Thomas E Fehniger
- Clinical Protein Science & Imaging, Department of Biomedical Engineering, Lund University, BMC D13, 221 84 Lund, Sweden
| | - Rainer Bischoff
- Analytical Biochemistry, Department of Pharmacy, University of Groningen, A. Deusinglaan 1, 9713 AV Groningen, The Netherlands
| | - Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, CB10 1SD, Hinxton, Cambridge, United Kingdom
| | - Eric W Deutsch
- Institute for Systems Biology, 401 Terry Avenue North, Seattle, Washington 98109, United States
| | - Lydie Lane
- SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Department of Human Protein Science, Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | - Carol L Nilsson
- Department of Pharmacology and Toxicology, The University of Texas Medical Branch, Galveston, Texas 77555-0617, United States
| | - György Marko-Varga
- Clinical Protein Science & Imaging, Department of Biomedical Engineering, Lund University, BMC D13, 221 84 Lund, Sweden
| | - Gilbert S Omenn
- Departments of Computational Medicine & Bioinformatics, Internal Medicine, Human Genetics and School of Public Health, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, Michigan 48109-2218, United States
| | - Seul-Ki Jeong
- Departments of Integrated Omics for Biomedical Science & Biochemistry, College of Life Science and Technology, Yonsei Proteome Research Center, Yonsei University, Seoul, 120-749, Korea
| | - Jong-Sun Lim
- Departments of Integrated Omics for Biomedical Science & Biochemistry, College of Life Science and Technology, Yonsei Proteome Research Center, Yonsei University, Seoul, 120-749, Korea
| | - Young-Ki Paik
- Departments of Integrated Omics for Biomedical Science & Biochemistry, College of Life Science and Technology, Yonsei Proteome Research Center, Yonsei University, Seoul, 120-749, Korea
| | - William S Hancock
- The Barnett Institute of Chemical and Biological Analysis, Northeastern University, 140 The Fenway, Boston, Massachusetts 02115, United States
| |
|
26
|
Ronke C, Dannemann M, Halbwax M, Fischer A, Helmschrodt C, Brügel M, André C, Atencia R, Mugisha L, Scholz M, Ceglarek U, Thiery J, Pääbo S, Prüfer K, Kelso J. Lineage-Specific Changes in Biomarkers in Great Apes and Humans. PLoS One 2015; 10:e0134548. [PMID: 26247603 PMCID: PMC4527672 DOI: 10.1371/journal.pone.0134548] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 12/16/2014] [Accepted: 07/10/2015] [Indexed: 12/15/2022] Open
Abstract
Although human biomedical and physiological information is readily available, such information for great apes is limited. We analyzed clinical chemical biomarkers in serum samples from 277 wild- and captive-born great apes and from 312 healthy human volunteers as well as from 20 rhesus macaques. For each individual, we determined a maximum of 33 markers of heart, liver, kidney, thyroid and pancreas function, hemoglobin and lipid metabolism and one marker of inflammation. We identified biomarkers that show differences between humans and the great apes in their average level or activity. Using the rhesus macaques as an outgroup, we identified human-specific differences in the levels of bilirubin, cholinesterase and lactate dehydrogenase, and bonobo-specific differences in the level of apolipoprotein A-I. For the remaining twenty-nine biomarkers there was no evidence for lineage-specific differences. In fact, we find that many biomarkers show differences between individuals of the same species in different environments. Of the four lineage-specific biomarkers, only bilirubin showed no differences between wild- and captive-born great apes. We show that the major factor explaining the human-specific difference in bilirubin levels may be genetic. There are human-specific changes in the sequence of the promoter and the protein-coding sequence of uridine diphosphoglucuronosyltransferase 1 (UGT1A1), the enzyme that transforms bilirubin and toxic plant compounds into water-soluble, excretable metabolites. Experimental evidence that UGT1A1 is down-regulated in the human liver suggests that changes in the promoter may be responsible for the human-specific increase in bilirubin. We speculate that since cooking reduces toxic plant compounds, consumption of cooked foods, which is specific to humans, may have resulted in relaxed constraint on UGT1A1 which has in turn led to higher serum levels of bilirubin in humans.
Affiliation(s)
- Claudius Ronke
- Institute of Laboratory Medicine, Clinical Chemistry and Molecular Diagnostics, University Hospital Leipzig, Leipzig, Germany
| | - Michael Dannemann
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
| | - Michel Halbwax
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
| | - Anne Fischer
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
| | - Christin Helmschrodt
- Institute of Laboratory Medicine, Clinical Chemistry and Molecular Diagnostics, University Hospital Leipzig, Leipzig, Germany
| | - Mathias Brügel
- Institute of Laboratory Medicine, Clinical Chemistry and Molecular Diagnostics, University Hospital Leipzig, Leipzig, Germany
| | - Claudine André
- Lola Ya Bonobo Sanctuary, “Petites Chutes de la Lukaya,” Kinshasa, Democratic Republic of Congo
| | - Rebeca Atencia
- Réserve Naturelle Sanctuaire à Chimpanzés de Tchimpounga, Jane Goodall Institute, Pointe-Noire, Republic of Congo
| | - Lawrence Mugisha
- Conservation & Ecosystem Health Alliance (CEHA), Kampala, Uganda
- College of Veterinary Medicine, Animal Resources & Biosecurity, Makerere University, Kampala, Uganda
| | - Markus Scholz
- Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Leipzig, Germany
| | - Uta Ceglarek
- Institute of Laboratory Medicine, Clinical Chemistry and Molecular Diagnostics, University Hospital Leipzig, Leipzig, Germany
| | - Joachim Thiery
- Institute of Laboratory Medicine, Clinical Chemistry and Molecular Diagnostics, University Hospital Leipzig, Leipzig, Germany
| | - Svante Pääbo
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
| | - Kay Prüfer
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
| | - Janet Kelso
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
| |
|
27
|
Abascal F, Ezkurdia I, Rodriguez-Rivas J, Rodriguez JM, del Pozo A, Vázquez J, Valencia A, Tress ML. Alternatively Spliced Homologous Exons Have Ancient Origins and Are Highly Expressed at the Protein Level. PLoS Comput Biol 2015; 11:e1004325. [PMID: 26061177 PMCID: PMC4465641 DOI: 10.1371/journal.pcbi.1004325] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.8] [Received: 12/16/2014] [Accepted: 05/08/2015] [Indexed: 11/19/2022] Open
Abstract
Alternative splicing of messenger RNA can generate a wide variety of mature RNA transcripts, and these transcripts may produce protein isoforms with diverse cellular functions. While there is much supporting evidence for the expression of alternative transcripts, the same is not true for the alternatively spliced protein products. Large-scale mass spectroscopy experiments have identified evidence of alternative splicing at the protein level, but with conflicting results. Here we carried out a rigorous analysis of the peptide evidence from eight large-scale proteomics experiments to assess the scale of alternative splicing that is detectable by high-resolution mass spectroscopy. We find fewer splice events than would be expected: we identified peptides for almost 64% of human protein coding genes, but detected just 282 splice events. This data suggests that most genes have a single dominant isoform at the protein level. Many of the alternative isoforms that we could identify were only subtly different from the main splice isoform. Very few of the splice events identified at the protein level disrupted functional domains, in stark contrast to the two thirds of splice events annotated in the human genome that would lead to the loss or damage of functional domains. The most striking result was that more than 20% of the splice isoforms we identified were generated by substituting one homologous exon for another. This is significantly more than would be expected from the frequency of these events in the genome. These homologous exon substitution events were remarkably conserved—all the homologous exons we identified evolved over 460 million years ago—and eight of the fourteen tissue-specific splice isoforms we identified were generated from homologous exons. The combination of proteomics evidence, ancient origin and tissue-specific splicing indicates that isoforms generated from homologous exons may have important cellular roles. 
Alternative splicing is thought to be one means for generating the protein diversity necessary for the whole range of cellular functions. While the presence of alternatively spliced transcripts in the cell has been amply demonstrated, the same cannot be said for alternatively spliced proteins. The quest for alternative protein isoforms has focused primarily on the analysis of peptides from large-scale mass spectroscopy experiments, but evidence for alternative isoforms has been patchy and contradictory. A careful analysis of the peptide evidence is needed to fully understand the scale of alternative splicing detectable at the protein level. Here we analysed peptides from eight large-scale data sets, identifying just 282 splice events among 12,716 genes. This suggests that most genes have a single dominant isoform. Many of the alternative isoforms that we identified were only subtly different from the main splice variant, and one in five was generated by substitution of homologous exons by swapping one related exon for another. Remarkably, the alternative isoforms generated from homologous exons were highly conserved, first appearing 460 million years ago, and several appear to have tissue-specific roles in the brain and heart. Our results suggest that these particular isoforms are likely to have important cellular roles.
Affiliation(s)
- Federico Abascal
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | - Iakes Ezkurdia
- Unidad de Proteómica, Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, Spain
| | - Juan Rodriguez-Rivas
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | - Jose Manuel Rodriguez
- National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | - Angela del Pozo
- Instituto de Genetica Medica y Molecular, Hospital Universitario La Paz, Madrid, Spain
| | - Jesús Vázquez
- Laboratorio de Proteómica Cardiovascular, Centro Nacional de Investigaciones Cardiovasculares (CNIC) Madrid, Spain
| | - Alfonso Valencia
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | - Michael L. Tress
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| |
|
28
|
Jung S, Lee S, Kim S, Nam H. Identification of genomic features in the classification of loss- and gain-of-function mutation. BMC Med Inform Decis Mak 2015; 15 Suppl 1:S6. [PMID: 26043747 PMCID: PMC4460711 DOI: 10.1186/1472-6947-15-s1-s6] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND Alterations of a genome can lead to changes in protein functions. Through these genetic mutations, a protein can lose its native function (loss-of-function, LoF), or it can confer a new function (gain-of-function, GoF). However, when a mutation occurs, it is difficult to determine whether it will result in a LoF or a GoF. Therefore, in this paper, we propose a study that analyzes the genomic features of LoF and GoF instances to find features that can be used to classify LoF and GoF mutations. METHODS In order to collect experimentally verified LoF and GoF mutational information, we obtained 816 LoF mutations and 474 GoF mutations from a literature text-mining process. Next, with data-preprocessing steps, 258 LoF and 129 GoF mutations remained for a further analysis. We analyzed the properties of these LoF and GoF mutations. Among the properties, we selected features which show different tendencies between the two groups and implemented classifications using support vector machine, random forest, and linear logistic regression methods to confirm whether or not these features can identify LoF and GoF mutations. RESULTS We analyzed the properties of the LoF and GoF mutations and identified six features which have discriminative power between LoF and GoF conditions: the reference allele, the substituted allele, mutation type, mutation impact, subcellular location, and protein domain. When using the six selected features with the random forest, support vector machine, and linear logistic regression classifiers, the result showed accuracy levels of 72.23%, 71.28%, and 70.19%, respectively. CONCLUSIONS We analyzed LoF and GoF mutations and selected several properties which were different between the two classes. By implementing classifications with the selected features, it is demonstrated that the selected features have good discriminative power.
|
29
|
Ezkurdia I, Rodriguez JM, Carrillo-de Santa Pau E, Vázquez J, Valencia A, Tress ML. Most highly expressed protein-coding genes have a single dominant isoform. J Proteome Res 2015; 14:1880-7. [PMID: 25732134 DOI: 10.1021/pr501286b] [Citation(s) in RCA: 84] [Impact Index Per Article: 8.4] [Indexed: 12/28/2022]
Abstract
Although eukaryotic cells express a wide range of alternatively spliced transcripts, it is not clear whether genes tend to express a range of transcripts simultaneously across cells, or produce dominant isoforms in a manner that is either tissue-specific or regardless of tissue. To date, large-scale investigations into the pattern of transcript expression across distinct tissues have produced contradictory results. Here, we attempt to determine whether genes express a dominant splice variant at the protein level. We interrogate peptides from eight large-scale human proteomics experiments and databases and find that there is a single dominant protein isoform, irrespective of tissue or cell type, for the vast majority of the protein-coding genes in these experiments, in partial agreement with the conclusions from the most recent large-scale RNAseq study. Remarkably, the dominant isoforms from the experimental proteomics analyses coincided overwhelmingly with the reference isoforms selected by two completely orthogonal sources, the consensus coding sequence variants, which are agreed upon by separate manual genome curation teams, and the principal isoforms from the APPRIS database, predicted automatically from the conservation of protein sequence, structure, and function.
Affiliation(s)
- Iakes Ezkurdia
- †Unidad de Proteómica and ‡Laboratorio de Proteómica Cardiovascular, Centro Nacional de Investigaciones Cardiovasculares, 28029 Madrid, Spain; §National Bioinformatics Institute and ∥Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre, 28029 Madrid, Spain
| | - Jose Manuel Rodriguez
- †Unidad de Proteómica and ‡Laboratorio de Proteómica Cardiovascular, Centro Nacional de Investigaciones Cardiovasculares, 28029 Madrid, Spain; §National Bioinformatics Institute and ∥Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre, 28029 Madrid, Spain
| | - Enrique Carrillo-de Santa Pau
- †Unidad de Proteómica and ‡Laboratorio de Proteómica Cardiovascular, Centro Nacional de Investigaciones Cardiovasculares, 28029 Madrid, Spain; §National Bioinformatics Institute and ∥Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre, 28029 Madrid, Spain
| | - Jesús Vázquez
- †Unidad de Proteómica and ‡Laboratorio de Proteómica Cardiovascular, Centro Nacional de Investigaciones Cardiovasculares, 28029 Madrid, Spain; §National Bioinformatics Institute and ∥Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre, 28029 Madrid, Spain
| | - Alfonso Valencia
- †Unidad de Proteómica and ‡Laboratorio de Proteómica Cardiovascular, Centro Nacional de Investigaciones Cardiovasculares, 28029 Madrid, Spain; §National Bioinformatics Institute and ∥Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre, 28029 Madrid, Spain
| | - Michael L Tress
- †Unidad de Proteómica and ‡Laboratorio de Proteómica Cardiovascular, Centro Nacional de Investigaciones Cardiovasculares, 28029 Madrid, Spain; §National Bioinformatics Institute and ∥Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre, 28029 Madrid, Spain
| |
|
30
|
Besenbacher S, Liu S, Izarzugaza JMG, Grove J, Belling K, Bork-Jensen J, Huang S, Als TD, Li S, Yadav R, Rubio-García A, Lescai F, Demontis D, Rao J, Ye W, Mailund T, Friborg RM, Pedersen CNS, Xu R, Sun J, Liu H, Wang O, Cheng X, Flores D, Rydza E, Rapacki K, Damm Sørensen J, Chmura P, Westergaard D, Dworzynski P, Sørensen TIA, Lund O, Hansen T, Xu X, Li N, Bolund L, Pedersen O, Eiberg H, Krogh A, Børglum AD, Brunak S, Kristiansen K, Schierup MH, Wang J, Gupta R, Villesen P, Rasmussen S. Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios. Nat Commun 2015; 6:5969. [PMID: 25597990 PMCID: PMC4309431 DOI: 10.1038/ncomms6969] [Citation(s) in RCA: 123] [Impact Index Per Article: 12.3] [Received: 07/29/2014] [Accepted: 11/25/2014] [Indexed: 02/06/2023] Open
Abstract
Building a population-specific catalogue of single nucleotide variants (SNVs), indels and structural variants (SVs) with frequencies, termed a national pan-genome, is critical for further advancing clinical and public health genetics in large cohorts. Here we report a Danish pan-genome obtained from sequencing 10 trios to high depth (50×). We report 536k novel SNVs and 283k novel short indels from mapping approaches and develop a population-wide de novo assembly approach to identify 132k novel indels larger than 10 nucleotides with low false discovery rates. We identify a higher proportion of indels and SVs than previous efforts, showing the merits of high coverage and de novo assembly approaches. In addition, we use trio information to identify de novo mutations and use a probabilistic method to provide direct estimates of 1.27e−8 and 1.5e−9 per nucleotide per generation for SNVs and indels, respectively. The generation of a national pan-genome, a population-specific catalogue of genetic variation, may advance the impact of clinical genetics studies. Here, Besenbacher et al. carry out deep sequencing and de novo assembly of 10 parent–child trios to generate a Danish pan-genome that provides insight into structural variation, de novo mutation rates and variant calling.
Affiliation(s)
- Søren Besenbacher
- Bioinformatics Research Center, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus, Denmark
| | - Siyang Liu
- [1] BGI Europe, Ole Maaløes Vej 3, DK-2200 Copenhagen, Denmark [2] Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, DK-2200 Copenhagen, Denmark
| | - José M G Izarzugaza
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - Jakob Grove
- [1] Bioinformatics Research Center, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus, Denmark [2] Centre for Integrative Sequencing, iSEQ, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark [3] The Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, DK-8000 Aarhus, Denmark [4] Department of Biomedicine, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark
| | - Kirstine Belling
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - Jette Bork-Jensen
- The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Universitetsparken 1-3, DK-2100 Copenhagen, Denmark
| | - Shujia Huang
- [1] BGI Europe, Ole Maaløes Vej 3, DK-2200 Copenhagen, Denmark [2] School of Bioscience and Biotechnology, South China University of Technology, Guangzhou 510006, China
| | - Thomas D Als
- [1] Centre for Integrative Sequencing, iSEQ, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark [2] The Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, DK-8000 Aarhus, Denmark [3] Department of Biomedicine, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark
| | - Shengting Li
- [1] BGI Europe, Ole Maaløes Vej 3, DK-2200 Copenhagen, Denmark [2] Centre for Integrative Sequencing, iSEQ, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark [3] The Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, DK-8000 Aarhus, Denmark [4] Department of Biomedicine, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark
| | - Rachita Yadav
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - Arcadio Rubio-García
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - Francesco Lescai
- [1] Centre for Integrative Sequencing, iSEQ, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark [2] The Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, DK-8000 Aarhus, Denmark [3] Department of Biomedicine, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark
| | - Ditte Demontis
- [1] Centre for Integrative Sequencing, iSEQ, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark [2] The Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, DK-8000 Aarhus, Denmark [3] Department of Biomedicine, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark
| | - Junhua Rao
- BGI Europe, Ole Maaløes Vej 3, DK-2200 Copenhagen, Denmark
| | - Weijian Ye
- BGI Europe, Ole Maaløes Vej 3, DK-2200 Copenhagen, Denmark
| | - Thomas Mailund
- [1] Bioinformatics Research Center, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus, Denmark [2] Centre for Integrative Sequencing, iSEQ, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark
| | - Rune M Friborg
- [1] Bioinformatics Research Center, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus, Denmark [2] Centre for Integrative Sequencing, iSEQ, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark
| | - Christian N S Pedersen
- Bioinformatics Research Center, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus, Denmark
| | - Ruiqi Xu
- BGI Europe, Ole Maaløes Vej 3, DK-2200 Copenhagen, Denmark
| | - Jihua Sun
- BGI Europe, Ole Maaløes Vej 3, DK-2200 Copenhagen, Denmark
| | - Hao Liu
- BGI Europe, Ole Maaløes Vej 3, DK-2200 Copenhagen, Denmark
| | - Ou Wang
- BGI Europe, Ole Maaløes Vej 3, DK-2200 Copenhagen, Denmark
| | - Xiaofang Cheng
- BGI Europe, Ole Maaløes Vej 3, DK-2200 Copenhagen, Denmark
| | - David Flores
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - Emil Rydza
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - Kristoffer Rapacki
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - John Damm Sørensen
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - Piotr Chmura
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - David Westergaard
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - Piotr Dworzynski
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - Thorkild I A Sørensen
- [1] The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Universitetsparken 1-3, DK-2100 Copenhagen, Denmark [2] Institute of Preventive Medicine, Bispebjerg and Frederiksberg Hospitals, The Capital Region, Nordre Fasanvej 57, Hovedvejen 5, DK-2000 Copenhagen, Denmark
| | - Ole Lund
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - Torben Hansen
- [1] The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Universitetsparken 1-3, DK-2100 Copenhagen, Denmark [2] Faculty of Health Sciences, University of Southern Denmark, DK-5000 Odense, Denmark
| | - Xun Xu
- BGI Europe, Ole Maaløes Vej 3, DK-2200 Copenhagen, Denmark
| | - Ning Li
- BGI Europe, Ole Maaløes Vej 3, DK-2200 Copenhagen, Denmark
| | - Lars Bolund
- [1] Centre for Integrative Sequencing, iSEQ, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark [2] Department of Biomedicine, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark
| | - Oluf Pedersen
- The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Universitetsparken 1-3, DK-2100 Copenhagen, Denmark
| | - Hans Eiberg
- Department of Cellular and Molecular Medicine, Panum Institute, University of Copenhagen, Blegdamsvej 3, DK-2200 Copenhagen, Denmark
| | - Anders Krogh
- [1] Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, DK-2200 Copenhagen, Denmark [2] Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5-7, DK-1350 Copenhagen, Denmark
| | - Anders D Børglum
- [1] Centre for Integrative Sequencing, iSEQ, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark [2] The Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, DK-8000 Aarhus, Denmark [3] Department of Biomedicine, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark
| | - Søren Brunak
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - Karsten Kristiansen
- Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, DK-2200 Copenhagen, Denmark
| | - Mikkel H Schierup
- [1] Bioinformatics Research Center, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus, Denmark [2] Centre for Integrative Sequencing, iSEQ, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark
| | - Jun Wang
- [1] BGI Europe, Ole Maaløes Vej 3, DK-2200 Copenhagen, Denmark [2] Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, DK-2200 Copenhagen, Denmark [3] Centre for Integrative Sequencing, iSEQ, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark
| | - Ramneek Gupta
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| | - Palle Villesen
- [1] Bioinformatics Research Center, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus, Denmark [2] Centre for Integrative Sequencing, iSEQ, Aarhus University, Bartholins Allé 6, building 1242, DK-8000 Aarhus, Denmark
| | - Simon Rasmussen
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 208, DK-2800 Kgs Lyngby, Denmark
| |
|
31
|
Manase D, D'Alessandro LCA, Manickaraj AK, Al Turki S, Hurles ME, Mital S. High throughput exome coverage of clinically relevant cardiac genes. BMC Med Genomics 2014; 7:67. [PMID: 25496018 PMCID: PMC4272796 DOI: 10.1186/s12920-014-0067-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 08/12/2014] [Accepted: 11/26/2014] [Indexed: 01/27/2023] Open
Abstract
Background Given the growing use of whole-exome sequencing (WES) for clinical diagnostics of complex human disorders, we evaluated coverage of clinically relevant cardiac genes on WES and factors influencing uniformity and depth of coverage of exonic regions. Methods Two hundred and thirteen human DNA samples were exome sequenced via Illumina HiSeq using different versions of the Agilent SureSelect capture kit. 50 cardiac genes were further analyzed including 31 genes from the American College of Medical Genetics (ACMG) list for reporting of incidental findings and 19 genes associated with congenital heart disease for which clinical testing is available. Gene coordinates were obtained from two databases, CCDS and Known Gene and compared. Read depth for each region was extracted from the exomes and used to assess capture variability between kits for individual genes, and for overall coverage. GC content, gene size, and inter-sample variability were also tested as potential contributors to variability in gene coverage. Results All versions of capture kits (designed based on Consensus coding sequence) included only 55% of known genomic regions for the cardiac genes. Although newer versions of each Agilent kit showed improvement in capture of CCDS regions to 99%, only 64% of Known Gene regions were captured even with newer capture kits. There was considerable variability in coverage of the cardiac genes. 10 of the 50 genes including 6 on the ACMG list had less than the optimal coverage of 30X. Within each gene, only 32 of the 50 genes had the majority of their bases covered at an interquartile range ≥30X. Heterogeneity in gene coverage was modestly associated with gene size and significantly associated with GC content. Conclusions Despite improvement in overall coverage across the exome with newer capture kit versions and higher sequencing depths, only 50% of known genomic regions of clinical cardiac genes are targeted and individual gene coverage is non-uniform. This may contribute to a bias with greater attribution of disease causation to mutations in well-represented and well-covered genes. Improvements in WES technology are needed before widespread clinical application.
Affiliation(s)
- Dorin Manase
- Division of Cardiology, Department of Pediatrics, Hospital for Sick Children, University of Toronto, Toronto, Ontario, Canada; Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada.
| | - Lisa C A D'Alessandro
- Division of Cardiology, Department of Pediatrics, Hospital for Sick Children, University of Toronto, Toronto, Ontario, Canada; Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada.
| | - Ashok Kumar Manickaraj
- Division of Cardiology, Department of Pediatrics, Hospital for Sick Children, University of Toronto, Toronto, Ontario, Canada; Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada.
| | - Seema Mital
- Division of Cardiology, Department of Pediatrics, Hospital for Sick Children, University of Toronto, Toronto, Ontario, Canada; Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada.
| |
|
32
|
Bonner TI. Should pharmacologists care about alternative splicing? IUPHAR Review 4. Br J Pharmacol 2014; 171:1231-40. [PMID: 24670145 DOI: 10.1111/bph.12526] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 08/26/2013] [Revised: 10/27/2013] [Accepted: 11/13/2013] [Indexed: 11/29/2022] Open
Abstract
Alternative splicing of mRNAs occurs in the majority of human genes, and most differential splicing results in different protein isoforms with possibly different functional properties. However, there are many reported splicing variations that may be quite rare, and not all combinatorially possible variants of a given gene are expressed at significant levels. Genes of interest to pharmacologists are frequently expressed at such low levels that they are not adequately represented in genome-wide studies of transcription. In single-gene studies, data are commonly available on the relative abundance and functional significance of individual alternatively spliced exons, but there are rarely data that quantitate the relative abundance of full-length transcripts and define which combinations of exons are significant. A number of criteria for judging the significance of splice variants and suggestions for their nomenclature are discussed.
Affiliation(s)
- T I Bonner
- National Institute of Mental Health, Bethesda, MD, USA
| |
|
33
|
Snider NT, Altshuler PJ, Wan S, Welling TH, Cavalcoli J, Omary MB. Alternative splicing of human NT5E in cirrhosis and hepatocellular carcinoma produces a negative regulator of ecto-5'-nucleotidase (CD73). Mol Biol Cell 2014; 25:4024-33. [PMID: 25298403 PMCID: PMC4263446 DOI: 10.1091/mbc.e14-06-1167] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Indexed: 12/12/2022] Open
Abstract
Alternative splicing of human NT5E generates CD73S, an endoplasmic reticulum–associated and dimerization-deficient glycoprotein that lacks enzymatic activity. CD73S functions as a negative regulator of canonical CD73 by promoting its proteasomal degradation, which may have significance in chronic liver disease and liver cancer. Ecto-5′-nucleotidase (CD73), encoded by NT5E, is the major enzymatic source of extracellular adenosine. CD73 controls numerous pathophysiological responses and is a potential disease target, but its regulation is poorly understood. We examined NT5E regulation by alternative splicing. Genomic database analysis of human transcripts led us to identify NT5E-2, a novel splice variant that was expressed at low abundance in normal human tissues but was significantly up-regulated in cirrhosis and hepatocellular carcinoma (HCC). NT5E-2 encodes a shorter CD73 isoform we named CD73S. The presence of CD73S protein, which lacks 50 amino acids, was detected in HCC using an isoform-specific antibody. A noncanonical mouse mRNA, similar to human CD73S, was observed, but the corresponding protein was undetectable. The two human isoforms exhibited functional differences, such that ectopic expression of canonical CD73 (CD73L) in human HepG2 cells was associated with decreased expression of the proliferation marker Ki67, whereas CD73S expression did not have an effect on Ki67 expression. CD73S was glycosylated, catalytically inactive, unable to dimerize, and complexed intracellularly with the endoplasmic reticulum chaperone calnexin. Furthermore, CD73S complexed with CD73L and promoted proteasome-dependent CD73L degradation. The findings reveal species-specific CD73 regulation, with potential significance to cancer, fibrosis, and other diseases characterized by changes in CD73 expression and function.
Affiliation(s)
- Natasha T Snider
- Departments of Molecular and Integrative Physiology, University of Michigan, Ann Arbor, MI 48109
| | - Peter J Altshuler
- Departments of Molecular and Integrative Physiology, University of Michigan, Ann Arbor, MI 48109
| | - Shanshan Wan
- Departments of Surgery, University of Michigan, Ann Arbor, MI 48109
| | - James Cavalcoli
- Departments of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
| | - M Bishr Omary
- Departments of Molecular and Integrative Physiology, University of Michigan, Ann Arbor, MI 48109; Departments of Medicine, University of Michigan, Ann Arbor, MI 48109
| |
|
34
|
Noderer WL, Flockhart RJ, Bhaduri A, Diaz de Arce AJ, Zhang J, Khavari PA, Wang CL. Quantitative analysis of mammalian translation initiation sites by FACS-seq. Mol Syst Biol 2014; 10:748. [PMID: 25170020 PMCID: PMC4299517 DOI: 10.15252/msb.20145136] [Citation(s) in RCA: 137] [Impact Index Per Article: 12.5] [Indexed: 01/03/2023] Open
Abstract
An approach combining fluorescence-activated cell sorting and high-throughput DNA sequencing (FACS-seq) was employed to determine the efficiency of start codon recognition for all possible translation initiation sites (TIS) utilizing AUG start codons. Using FACS-seq, we measured translation from a genetic reporter library representing all 65,536 possible TIS sequences spanning the −6 to +5 positions. We found that the motif RYMRMVAUGGC enhanced start codon recognition and translation efficiency. However, dinucleotide interactions, which cannot be conveyed by a single motif, were also important for modeling TIS efficiency. Our dataset combined with modeling allowed us to predict genome-wide translation initiation efficiency for all mRNA transcripts. Additionally, we screened somatic TIS mutations associated with tumorigenesis to identify candidate driver mutations consistent with known tumor expression patterns. Finally, we implemented a quantitative leaky scanning model to predict alternative initiation sites that produce truncated protein isoforms and compared predictions with ribosome footprint profiling data. The comprehensive analysis of the TIS sequence space enables quantitative predictions of translation initiation based on genome sequence.
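The library size and the motif quoted in this abstract can be checked with a short sketch. It is an illustration only, not the authors' code: the function names are hypothetical, and it assumes the 11-nucleotide window spans −6 to +5 with AUG fixed at +1 to +3, which leaves 8 variable positions. The motif letters are read as standard IUPAC ambiguity codes.

```python
import re

# IUPAC nucleotide codes that appear in the motif (RNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG",   # purine
    "Y": "CU",   # pyrimidine
    "M": "AC",   # amino
    "V": "ACG",  # any base except U
}

MOTIF = "RYMRMVAUGGC"  # motif reported in the abstract (-6 to +5)


def motif_to_regex(motif: str) -> re.Pattern:
    """Translate an IUPAC motif into a compiled regular expression."""
    parts = []
    for code in motif:
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


def matches_motif(tis: str) -> bool:
    """Return True if an 11-nt TIS window (DNA or RNA) fits the motif."""
    rna = tis.upper().replace("T", "U")
    return motif_to_regex(MOTIF).fullmatch(rna) is not None


# Assumption: 6 variable positions upstream (-6..-1) plus 2 downstream
# (+4, +5); the AUG at +1..+3 is fixed.
library_size = 4 ** 8
```

For example, `matches_motif("GCCACCAUGGC")` evaluates to true under these assumptions, while a window with U at every variable position does not match.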
Affiliation(s)
- William L Noderer
- Department of Chemical Engineering, Stanford University, Stanford, CA, USA
| | - Ross J Flockhart
- The Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA
| | - Aparna Bhaduri
- The Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA; The Program in Cancer Biology, Stanford University School of Medicine, Stanford, CA, USA
| | - Jiajing Zhang
- The Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA
| | - Paul A Khavari
- The Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA; Veterans Affairs Palo Alto Healthcare System, Palo Alto, CA, USA
| | - Clifford L Wang
- Department of Chemical Engineering, Stanford University, Stanford, CA, USA
| |
|
35
|
McCarthy DJ, Humburg P, Kanapin A, Rivas MA, Gaulton K, Cazier JB, Donnelly P. Choice of transcripts and software has a large effect on variant annotation. Genome Med 2014; 6:26. [PMID: 24944579 PMCID: PMC4062061 DOI: 10.1186/gm543] [Citation(s) in RCA: 131] [Impact Index Per Article: 11.9] [Received: 10/07/2013] [Accepted: 03/20/2014] [Indexed: 12/19/2022] Open
Abstract
Background Variant annotation is a crucial step in the analysis of genome sequencing data. Functional annotation results can have a strong influence on the ultimate conclusions of disease studies. Incorrect or incomplete annotations can cause researchers both to overlook potentially disease-relevant DNA variants and to dilute interesting variants in a pool of false positives. Researchers are aware of these issues in general, but the extent of the dependency of final results on the choice of transcripts and software used for annotation has not been quantified in detail. Methods This paper quantifies the extent of differences in annotation of 80 million variants from a whole-genome sequencing study. We compare results using the RefSeq and Ensembl transcript sets as the basis for variant annotation with the software Annovar, and also compare the results from two annotation software packages, Annovar and VEP (Ensembl’s Variant Effect Predictor), when using Ensembl transcripts. Results We found only 44% agreement in annotations for putative loss-of-function variants when using the RefSeq and Ensembl transcript sets as the basis for annotation with Annovar. The rate of matching annotations for loss-of-function and nonsynonymous variants combined was 79% and for all exonic variants it was 83%. When comparing results from Annovar and VEP using Ensembl transcripts, matching annotations were seen for only 65% of loss-of-function variants and 87% of all exonic variants, with splicing variants revealed as the category with the greatest discrepancy. Using these comparisons, we characterised the types of apparent errors made by Annovar and VEP and discuss their impact on the analysis of DNA variants in genome sequencing studies. Conclusions Variant annotation is not yet a solved problem. Choice of transcript set can have a large effect on the ultimate variant annotations obtained in a whole-genome sequencing study. Choice of annotation software can also have a substantial effect. The annotation step in the analysis of a genome sequencing study must therefore be considered carefully, and a conscious choice made as to which transcript set and software are used for annotation.
Affiliation(s)
- Davis J McCarthy
- Department of Statistics, University of Oxford, South Parks Road, Oxford, UK; Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, UK
| | - Peter Humburg
- Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, UK
| | - Alexander Kanapin
- Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, UK
| | - Manuel A Rivas
- Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, UK
| | - Kyle Gaulton
- Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, UK
| | - Peter Donnelly
- Department of Statistics, University of Oxford, South Parks Road, Oxford, UK; Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, UK
| |
|
36
|
Willsey AJ, Sanders SJ, Li M, Dong S, Tebbenkamp AT, Muhle RA, Reilly SK, Lin L, Fertuzinhos S, Miller JA, Murtha MT, Bichsel C, Niu W, Cotney J, Ercan-Sencicek AG, Gockley J, Gupta AR, Han W, He X, Hoffman EJ, Klei L, Lei J, Liu W, Liu L, Lu C, Xu X, Zhu Y, Mane SM, Lein ES, Wei L, Noonan JP, Roeder K, Devlin B, Sestan N, State MW. Coexpression networks implicate human midfetal deep cortical projection neurons in the pathogenesis of autism. Cell 2013; 155:997-1007. [PMID: 24267886 DOI: 10.1016/j.cell.2013.10.020] [Citation(s) in RCA: 677] [Impact Index Per Article: 61.5] [Received: 05/17/2013] [Revised: 08/02/2013] [Accepted: 10/07/2013] [Indexed: 01/08/2023]
Abstract
Autism spectrum disorder (ASD) is a complex developmental syndrome of unknown etiology. Recent studies employing exome- and genome-wide sequencing have identified nine high-confidence ASD (hcASD) genes. Working from the hypothesis that ASD-associated mutations in these biologically pleiotropic genes will disrupt intersecting developmental processes to contribute to a common phenotype, we have attempted to identify time periods, brain regions, and cell types in which these genes converge. We have constructed coexpression networks based on the hcASD "seed" genes, leveraging a rich expression data set encompassing multiple human brain regions across human development and into adulthood. By assessing enrichment of an independent set of probable ASD (pASD) genes, derived from the same sequencing studies, we demonstrate a key point of convergence in midfetal layer 5/6 cortical projection neurons. This approach informs when, where, and in what cell types mutations in these specific genes may be productively studied to clarify ASD pathophysiology.
Affiliation(s)
- A Jeremy Willsey
- Department of Genetics, Yale School of Medicine, New Haven, CT 06510, USA; Department of Psychiatry, University of California, San Francisco, San Francisco, CA 94143, USA
|
37
|
Improving mRNA 5' coding sequence determination in the mouse genome. Mamm Genome 2014; 25:149-59. [PMID: 24504701 DOI: 10.1007/s00335-013-9498-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/26/2013] [Accepted: 12/09/2013] [Indexed: 10/25/2022]
Abstract
The incomplete determination of the mRNA 5' end sequence may lead to the incorrect assignment of the first AUG codon and to errors in the prediction of the encoded protein product. Due to the significance of the mouse as a model organism in biomedical research, we performed a systematic identification of coding regions at the 5' end of all known mouse mRNAs, using an automated expressed sequence tag (EST)-based approach which we have previously described. By parsing almost 4 million BLAT alignments we found 351 mouse loci, out of 20,221 analyzed, in which an extension of the mRNA 5' coding region was identified. Proof-of-concept confirmation was obtained by in vitro cloning and sequencing for Apc2 and Mknk2 cDNAs. We also generated a list of 16,330 mouse mRNAs where the presence of an in-frame stop codon upstream of the known start codon indicates completeness of the coding sequence at 5' end in the current form. Systematic searches in the main mouse genome databases and genome browsers showed that 82% of our results are original and have not been identified by their annotation pipelines. Moreover, the same information is not easily derivable from RNA-Seq data, due to short sequence length and laboriousness in building full-length transcript structures. In conclusion, our results improve the determination of full-length 5' coding sequences and might be useful in order to reduce errors when studying mouse gene structure and function in biomedical research.
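The completeness criterion mentioned in this abstract (an in-frame stop codon upstream of the known start codon) can be illustrated with a minimal Python sketch. This is not the authors' EST-based pipeline: the function name `has_upstream_inframe_stop`, the 0-based `start` index, and the example sequences are hypothetical, and the check deliberately ignores any in-frame ATG that may lie between the stop codon and the annotated start.

```python
# Hypothetical helper, for illustration only (not the published pipeline).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_upstream_inframe_stop(mrna: str, start: int) -> bool:
    """Return True if a stop codon lies upstream of, and in frame with,
    the start codon located at 0-based index `start`."""
    seq = mrna.upper().replace("U", "T")  # accept RNA or DNA letters
    if seq[start:start + 3] != "ATG":
        raise ValueError("no ATG start codon at the given position")
    pos = start - 3
    # Step back one codon at a time, staying in the start codon's frame.
    while pos >= 0:
        if seq[pos:pos + 3] in STOP_CODONS:
            return True
        pos -= 3
    return False


# In-frame TAA two codons upstream of the ATG: 5' coding end looks complete.
print(has_upstream_inframe_stop("TAAGGGATGAAA", 6))
# The only TAA is out of frame, so the 5' end may be incomplete.
print(has_upstream_inframe_stop("CTAAGGATGAAA", 6))
```

A `False` result only means that the available 5' sequence does not rule out a longer coding region; it does not by itself demonstrate that an extension exists.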
|
38
|
Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, Gil L, Girón CG, Gordon L, Hourlier T, Hunt S, Johnson N, Juettemann T, Kähäri AK, Keenan S, Kulesha E, Martin FJ, Maurel T, McLaren WM, Murphy DN, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, Ruffier M, Sheppard D, Taylor K, Thormann A, Trevanion SJ, Vullo A, Wilder SP, Wilson M, Zadissa A, Aken BL, Birney E, Cunningham F, Harrow J, Herrero J, Hubbard TJ, Kinsella R, Muffato M, Parker A, Spudich G, Yates A, Zerbino DR, Searle SM. Ensembl 2014. Nucleic Acids Res 2013; 42:D749-55. [PMID: 24316576 PMCID: PMC3964975 DOI: 10.1093/nar/gkt1196] [Citation(s) in RCA: 1083] [Impact Index Per Article: 90.3] [Indexed: 02/06/2023] Open
Abstract
Ensembl (http://www.ensembl.org) creates tools and data resources to facilitate genomic analysis in chordate species with an emphasis on human, major vertebrate model organisms and farm animals. Over the past year we have increased the number of species that we support to 77 and expanded our genome browser with a new scrollable overview and improved variation and phenotype views. We also report updates to our core datasets and improvements to our gene homology relationships from the addition of new species. Our REST service has been extended with additional support for comparative genomics and ontology information. Finally, we provide updated information about our methods for data access and resources for user training.
Affiliation(s)
- Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- *To whom correspondence should be addressed. Tel: +44 1223 492 581; Fax: +44 1223 494 494;
- M. Ridwan Amode, Daniel Barrell, Kathryn Beal, Konstantinos Billis, Simon Brent, Denise Carvalho-Silva, Peter Clapham, Guy Coates, Stephen Fitzgerald, Laurent Gil, Carlos García Girón, Leo Gordon, Thibaut Hourlier, Sarah Hunt, Nathan Johnson, Thomas Juettemann, Andreas K. Kähäri, Stephen Keenan, Eugene Kulesha, Fergal J. Martin, Thomas Maurel, William M. McLaren, Daniel N. Murphy, Rishi Nag, Bert Overduin, Miguel Pignatelli, Bethan Pritchard, Emily Pritchard, Harpreet S. Riat, Magali Ruffier, Daniel Sheppard, Kieron Taylor, Anja Thormann, Stephen J. Trevanion, Alessandro Vullo, Steven P. Wilder, Mark Wilson, Amonida Zadissa, Bronwen L. Aken, Ewan Birney, Fiona Cunningham, Jennifer Harrow, Javier Herrero, Tim J.P. Hubbard, Rhoda Kinsella, Matthieu Muffato, Anne Parker, Giulietta Spudich, Andy Yates, Daniel R. Zerbino, Stephen M.J. Searle
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
|
39
|
MacArthur JAL, Morales J, Tully RE, Astashyn A, Gil L, Bruford EA, Larsson P, Flicek P, Dalgleish R, Maglott DR, Cunningham F. Locus Reference Genomic: reference sequences for the reporting of clinically relevant sequence variants. Nucleic Acids Res 2013; 42:D873-8. [PMID: 24285302 PMCID: PMC3965024 DOI: 10.1093/nar/gkt1198] [Citation(s) in RCA: 66] [Impact Index Per Article: 5.5] [Indexed: 02/06/2023] Open
Abstract
Locus Reference Genomic (LRG; http://www.lrg-sequence.org/) records contain internationally recognized stable reference sequences designed specifically for reporting clinically relevant sequence variants. Each LRG is contained within a single file consisting of a stable ‘fixed’ section and a regularly updated ‘updatable’ section. The fixed section contains stable genomic DNA sequence for a genomic region, essential transcripts and proteins for variant reporting and an exon numbering system. The updatable section contains mapping information, annotation of all transcripts and overlapping genes in the region and legacy exon and amino acid numbering systems. LRGs provide a stable framework that is vital for reporting variants, according to Human Genome Variation Society (HGVS) conventions, in genomic DNA, transcript or protein coordinates. To enable translation of information between LRG and genomic coordinates, LRGs include mapping to the human genome assembly. LRGs are compiled and maintained by the National Center for Biotechnology Information (NCBI) and European Bioinformatics Institute (EBI). LRG reference sequences are selected in collaboration with the diagnostic and research communities, locus-specific database curators and mutation consortia. Currently >700 LRGs have been created, of which >400 are publicly available. The aim is to create an LRG for every locus with clinical implications.
Affiliation(s)
- Jacqueline A L MacArthur
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, National Center for Biotechnology Information, Bethesda, MD 20894, USA, and Department of Genetics, University of Leicester, Leicester LE1 7RH, UK
|
40
|
Farrell CM, O'Leary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, Diekhans M, Barrell D, Searle SMJ, Aken B, Hiatt SM, Frankish A, Suner MM, Rajput B, Steward CA, Brown GR, Bennett R, Murphy M, Wu W, Kay MP, Hart J, Rajan J, Weber J, Snow C, Riddick LD, Hunt T, Webb D, Thomas M, Tamez P, Rangwala SH, McGarvey KM, Pujar S, Shkeda A, Mudge JM, Gonzalez JM, Gilbert JGR, Trevanion SJ, Baertsch R, Harrow JL, Hubbard T, Ostell JM, Haussler D, Pruitt KD. Current status and new features of the Consensus Coding Sequence database. Nucleic Acids Res 2013; 42:D865-72. [PMID: 24217909 PMCID: PMC3965069 DOI: 10.1093/nar/gkt1059] [Citation(s) in RCA: 115] [Impact Index Per Article: 9.6] [Indexed: 12/31/2022] Open
Abstract
The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.
Affiliation(s)
- Catherine M Farrell
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA, Center for Biomolecular Science and Engineering, University of California Santa Cruz (UCSC), Santa Cruz, CA 95064, USA, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK and Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
|
41
|
Michel AM, Fox G, M Kiran A, De Bo C, O'Connor PBF, Heaphy SM, Mullan JPA, Donohue CA, Higgins DG, Baranov PV. GWIPS-viz: development of a ribo-seq genome browser. Nucleic Acids Res 2013; 42:D859-64. [PMID: 24185699 PMCID: PMC3965066 DOI: 10.1093/nar/gkt1035] [Citation(s) in RCA: 189] [Impact Index Per Article: 15.8] [Indexed: 01/03/2023] Open
Abstract
We describe the development of GWIPS-viz (http://gwips.ucc.ie), an online genome browser for viewing ribosome profiling data. Ribosome profiling (ribo-seq) is a recently developed technique that provides genome-wide information on protein synthesis (GWIPS) in vivo. It is based on the deep sequencing of ribosome-protected messenger RNA (mRNA) fragments, which allows the ribosome density along all mRNA transcripts present in the cell to be quantified. Since its inception, ribo-seq has been carried out in a number of eukaryotic and prokaryotic organisms. Owing to the increasing interest in ribo-seq, there is a pertinent demand for a dedicated ribo-seq genome browser. GWIPS-viz is based on The University of California Santa Cruz (UCSC) Genome Browser. Ribo-seq tracks, coupled with mRNA-seq tracks, are currently available for several genomes: human, mouse, zebrafish, nematode, yeast, bacteria (Escherichia coli K12, Bacillus subtilis), human cytomegalovirus and bacteriophage lambda. Our objective is to continue incorporating published ribo-seq data sets so that the wider community can readily view ribosome profiling information from multiple studies without the need to carry out computational processing.
Affiliation(s)
- Audrey M Michel
- School of Biochemistry and Cell Biology, University College Cork, Cork, Ireland, School of Medicine & Medical Science, Conway Institute, University College Dublin, Dublin 4, Ireland and Howest, University College West Flanders, Rijselstraat 5, 8200 Bruges, Belgium
|
42
|
Meynert AM, Bicknell LS, Hurles ME, Jackson AP, Taylor MS. Quantifying single nucleotide variant detection sensitivity in exome sequencing. BMC Bioinformatics 2013; 14:195. [PMID: 23773188 PMCID: PMC3695811 DOI: 10.1186/1471-2105-14-195] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.0] [Received: 01/15/2013] [Accepted: 06/10/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The targeted capture and sequencing of genomic regions has rapidly demonstrated its utility in genetic studies. Inherent in this technology is considerable heterogeneity of target coverage and this is expected to systematically impact our sensitivity to detect genuine polymorphisms. To fully interpret the polymorphisms identified in a genetic study it is often essential to both detect polymorphisms and to understand where and with what probability real polymorphisms may have been missed. RESULTS Using down-sampling of 30 deeply sequenced exomes and a set of gold-standard single nucleotide variant (SNV) genotype calls for each sample, we developed an empirical model relating the read depth at a polymorphic site to the probability of calling the correct genotype at that site. We find that measured sensitivity in SNV detection is substantially worse than that predicted from the naive expectation of sampling from a binomial. This calibrated model allows us to produce single nucleotide resolution SNV sensitivity estimates which can be merged to give summary sensitivity measures for any arbitrary partition of the target sequences (nucleotide, exon, gene, pathway, exome). These metrics are directly comparable between platforms and can be combined between samples to give "power estimates" for an entire study. We estimate a local read depth of 13X is required to detect the alleles and genotype of a heterozygous SNV 95% of the time, but only 3X for a homozygous SNV. At a mean on-target read depth of 20X, commonly used for rare disease exome sequencing studies, we predict 5-15% of heterozygous and 1-4% of homozygous SNVs in the targeted regions will be missed. CONCLUSIONS Non-reference alleles in the heterozygote state have a high chance of being missed when commonly applied read coverage thresholds are used despite the widely held assumption that there is good polymorphism detection at these coverage levels. Such alleles are likely to be of functional importance in population based studies of rare diseases, somatic mutations in cancer and explaining the "missing heritability" of quantitative traits.
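The "naive expectation of sampling from a binomial" referred to in this abstract can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the empirical model from the paper: each read at a heterozygous site is assumed to carry the alternate allele independently with probability 0.5, and a variant counts as detected once at least `min_alt_reads` alternate reads are seen. The function names and the default threshold of 2 are hypothetical choices.

```python
from math import comb


def naive_het_detection_prob(depth: int, min_alt_reads: int = 2,
                             alt_fraction: float = 0.5) -> float:
    """P(X >= min_alt_reads) for X ~ Binomial(depth, alt_fraction).

    Assumption: reads are independent and error-free, which is the
    idealised binomial picture, not a calibrated sensitivity model.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    # Sum the probabilities of seeing too few alternate reads.
    p_too_few = sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(min(min_alt_reads, depth + 1))
    )
    return 1.0 - p_too_few


def naive_min_depth(target: float, **kwargs) -> int:
    """Smallest depth at which the naive model reaches `target` sensitivity
    (target must be below 1.0, otherwise the loop never terminates)."""
    depth = 0
    while naive_het_detection_prob(depth, **kwargs) < target:
        depth += 1
    return depth


for depth in (3, 13, 20):
    print(depth, round(naive_het_detection_prob(depth), 4))
```

Comparing `naive_min_depth(0.95)` with the 13X figure estimated empirically in the abstract gives a sense of how optimistic the idealised binomial picture is under these assumptions.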
Affiliation(s)
- Alison M Meynert
- MRC Human Genetics Unit, MRC Institute for Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, UK.
|
43
|
Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, García-Girón C, Gordon L, Hourlier T, Hunt S, Juettemann T, Kähäri AK, Keenan S, Komorowska M, Kulesha E, Longden I, Maurel T, McLaren WM, Muffato M, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, Ritchie GRS, Ruffier M, Schuster M, Sheppard D, Sobral D, Taylor K, Thormann A, Trevanion S, White S, Wilder SP, Aken BL, Birney E, Cunningham F, Dunham I, Harrow J, Herrero J, Hubbard TJP, Johnson N, Kinsella R, Parker A, Spudich G, Yates A, Zadissa A, Searle SMJ. Ensembl 2013. Nucleic Acids Res 2013; 41:D48-55. [PMID: 23203987 PMCID: PMC3531136 DOI: 10.1093/nar/gks1236] [Citation(s) in RCA: 802] [Impact Index Per Article: 66.8] [Received: 10/11/2012] [Revised: 10/31/2012] [Accepted: 11/01/2012] [Indexed: 11/12/2022] Open
Abstract
The Ensembl project (http://www.ensembl.org) provides genome information for sequenced chordate genomes with a particular focus on human, mouse, zebrafish and rat. Our resources include evidence-based gene sets for all supported species; large-scale whole genome multiple species alignments across vertebrates and clade-specific alignments for eutherian mammals, primates, birds and fish; variation data resources for 17 species and regulation annotations based on ENCODE and other data sets. Ensembl data are accessible through the genome browser at http://www.ensembl.org and through other tools and programmatic interfaces.
Affiliation(s)
- Paul Flicek
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SD, UK.
|
44
|
Gaudet P, Arighi C, Bastian F, Bateman A, Blake JA, Cherry MJ, D'Eustachio P, Finn R, Giglio M, Hirschman L, Kania R, Klimke W, Martin MJ, Karsch-Mizrachi I, Munoz-Torres M, Natale D, O'Donovan C, Ouellette F, Pruitt KD, Robinson-Rechavi M, Sansone SA, Schofield P, Sutton G, Van Auken K, Vasudevan S, Wu C, Young J, Mazumder R. Recent advances in biocuration: meeting report from the Fifth International Biocuration Conference. Database (Oxford) 2012; 2012:bas036. [PMID: 23110974 PMCID: PMC3483532 DOI: 10.1093/database/bas036] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 01/17/2023]
Abstract
The 5th International Biocuration Conference brought together over 300 scientists to exchange ideas about their work and to discuss issues relevant to the mission of the International Society for Biocuration (ISB). Recurring themes this year included the creation and promotion of gold standards, the need for more ontologies, and more formal interactions with journals. The conference is an essential part of the ISB's goal to support exchanges among members of the biocuration community. Next year's conference will be held in Cambridge, UK, from 7 to 10 April 2013. In the meantime, the ISB website (http://biocurator.org) provides information about the society's activities, as well as related events of interest.
Affiliation(s)
- Pascale Gaudet
- International Society for Biocuration and CALIPHO Group, Swiss Institute of Bioinformatics, 1 Rue Michel Servet, Geneva, Switzerland.
|