1
Walter NG. Are non-protein coding RNAs junk or treasure?: An attempt to explain and reconcile opposing viewpoints of whether the human genome is mostly transcribed into non-functional or functional RNAs. Bioessays 2024; 46:e2300201. [PMID: 38351661 DOI: 10.1002/bies.202300201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2023] [Revised: 01/18/2024] [Accepted: 01/19/2024] [Indexed: 03/28/2024]
Abstract
The human genome project's lasting legacies are the emerging insights into human physiology and disease, and the ascendance of biology as the dominant science of the 21st century. Sequencing revealed that >90% of the human genome is not coding for proteins, as originally thought, but rather is overwhelmingly transcribed into non-protein coding, or non-coding, RNAs (ncRNAs). This discovery initially led to the hypothesis that most genomic DNA is "junk", a term still championed by some geneticists and evolutionary biologists. In contrast, molecular biologists and biochemists studying the vast number of transcripts produced from most of this genome "junk" often surmise that these ncRNAs have biological significance. What gives? This essay contrasts the two opposing, extant viewpoints, aiming to explain their bases, which arise from distinct reference frames of the underlying scientific disciplines. Finally, it aims to reconcile these divergent mindsets in hopes of stimulating synergy between scientific fields.
Affiliation(s)
- Nils G Walter
- Center for RNA Biomedicine, Single Molecule Analysis Group, Department of Chemistry, University of Michigan, Ann Arbor, Michigan, USA
2
Carrion SA, Michal JJ, Jiang Z. Alternative Transcripts Diversify Genome Function for Phenome Relevance to Health and Diseases. Genes (Basel) 2023; 14:2051. [PMID: 38002994 PMCID: PMC10671453 DOI: 10.3390/genes14112051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 11/06/2023] [Accepted: 11/07/2023] [Indexed: 11/26/2023] Open
Abstract
Manipulation using alternative exon splicing (AES), alternative transcription start (ATS), and alternative polyadenylation (APA) sites are key to transcript diversity underlying health and disease. All three are pervasive in organisms, present in at least 50% of human protein-coding genes. In fact, ATS and APA site use has the highest impact on protein identity, with their ability to alter which first and last exons are utilized as well as impacting stability and translation efficiency. These RNA variants have been shown to be highly specific, both in tissue type and stage, with demonstrated importance to cell proliferation, differentiation and the transition from fetal to adult cells. While alternative exon splicing has a limited effect on protein identity, its ubiquity highlights the importance of these minor alterations, which can alter other features such as localization. The three processes are also highly interwoven, with overlapping, complementary, and competing factors, RNA polymerase II and its CTD (C-terminal domain) chief among them. Their role in development means dysregulation leads to a wide variety of disorders and cancers, with some forms of disease disproportionately affected by specific mechanisms (AES, ATS, or APA). Challenges associated with the genome-wide profiling of RNA variants and their potential solutions are also discussed in this review.
Affiliation(s)
- Zhihua Jiang
- Department of Animal Sciences and Center for Reproductive Biology, Washington State University, Pullman, WA 99164-7620, USA
3
Amaral P, Carbonell-Sala S, De La Vega FM, Faial T, Frankish A, Gingeras T, Guigo R, Harrow JL, Hatzigeorgiou AG, Johnson R, Murphy TD, Pertea M, Pruitt KD, Pujar S, Takahashi H, Ulitsky I, Varabyou A, Wells CA, Yandell M, Carninci P, Salzberg SL. The status of the human gene catalogue. Nature 2023; 622:41-47. [PMID: 37794265 PMCID: PMC10575709 DOI: 10.1038/s41586-023-06490-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 03/09/2023] [Accepted: 07/27/2023] [Indexed: 10/06/2023]
Abstract
Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Beside the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.
Affiliation(s)
- Paulo Amaral
- INSPER Institute of Education and Research, Sao Paulo, Brazil
- Francisco M De La Vega
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- Tempus Labs, Chicago, IL, USA
- Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
- Thomas Gingeras
- Department of Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Roderic Guigo
- Centre for Genomic Regulation (CRG), Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Jennifer L Harrow
- Centre for Genomics Research, Discovery Sciences, AstraZeneca, Royston, UK
- Artemis G Hatzigeorgiou
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Hellenic Pasteur Institute, Athens, Greece
- Rory Johnson
- School of Biology and Environmental Science, University College Dublin, Dublin, Ireland
- Conway Institute of Biomedical and Biomolecular Research, University College Dublin, Dublin, Ireland
- Department of Medical Oncology, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
- Department for BioMedical Research, University of Bern, Bern, Switzerland
- Terence D Murphy
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Kim D Pruitt
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Shashikant Pujar
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Hazuki Takahashi
- Laboratory for Transcriptome Technology, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Igor Ulitsky
- Department of Immunology and Regenerative Biology, Weizmann Institute of Science, Rehovot, Israel
- Department of Molecular Neuroscience, Weizmann Institute of Science, Rehovot, Israel
- Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Christine A Wells
- Stem Cell Systems, Department of Anatomy and Physiology, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Parkville, Victoria, Australia
- Mark Yandell
- Department of Human Genetics, Utah Center for Genetic Discovery, University of Utah, Salt Lake City, UT, USA
- Piero Carninci
- Laboratory for Transcriptome Technology, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Human Technopole, Milan, Italy
- Steven L Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
4
Barbagallo C, Stella M, Ferrara C, Caponnetto A, Battaglia R, Barbagallo D, Di Pietro C, Ragusa M. RNA-RNA competitive interactions: a molecular civil war ruling cell physiology and diseases. EXPLORATION OF MEDICINE 2023:504-540. [DOI: 10.37349/emed.2023.00159] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 06/02/2023] [Indexed: 09/02/2023] Open
Abstract
The idea that proteins are the main determining factors in the functioning of cells and organisms, and their dysfunctions are the first cause of pathologies, has been predominant in biology and biomedicine until recently. This protein-centered view was too simplistic and failed to explain the physiological and pathological complexity of the cell. About 80% of the human genome is dynamically and pervasively transcribed, mostly as non-protein-coding RNAs (ncRNAs), which competitively interact with each other and with coding RNAs generating a complex RNA network regulating RNA processing, stability, and translation and, accordingly, fine-tuning the gene expression of the cells. Qualitative and quantitative dysregulations of RNA-RNA interaction networks are strongly involved in the onset and progression of many pathologies, including cancers and degenerative diseases. This review will summarize the RNA species involved in the competitive endogenous RNA network, their mechanisms of action, and involvement in pathological phenotypes. Moreover, it will give an overview of the most advanced experimental and computational methods to dissect and rebuild RNA networks.
Affiliation(s)
- Cristina Barbagallo
- Section of Biology and Genetics, Department of Biomedical and Biotechnological Sciences, University of Catania, 95123 Catania, Italy
- Michele Stella
- Section of Biology and Genetics, Department of Biomedical and Biotechnological Sciences, University of Catania, 95123 Catania, Italy
- Angela Caponnetto
- Section of Biology and Genetics, Department of Biomedical and Biotechnological Sciences, University of Catania, 95123 Catania, Italy
- Rosalia Battaglia
- Section of Biology and Genetics, Department of Biomedical and Biotechnological Sciences, University of Catania, 95123 Catania, Italy
- Davide Barbagallo
- Section of Biology and Genetics, Department of Biomedical and Biotechnological Sciences, University of Catania, 95123 Catania, Italy
- Cinzia Di Pietro
- Section of Biology and Genetics, Department of Biomedical and Biotechnological Sciences, University of Catania, 95123 Catania, Italy
- Marco Ragusa
- Section of Biology and Genetics, Department of Biomedical and Biotechnological Sciences, University of Catania, 95123 Catania, Italy
5
Chu W, Chu X, Wang J. Uncovering the Quantitative Relationships Among Chromosome Fluctuations, Epigenetics, and Gene Expressions of Transdifferentiation on Waddington Landscape. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2022; 9:e2103617. [PMID: 35104056 PMCID: PMC8981899 DOI: 10.1002/advs.202103617] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2021] [Revised: 12/21/2021] [Indexed: 06/14/2023]
Abstract
The 3D spatial organization of the chromosomes appears to be linked to the gene function, which is cell type-specific. The chromosome structural ensemble switching model (CSESM) is developed by employing a heteropolymer model on different cell types and the important quantitative relationships among the chromosome ensemble, the epigenetic marks, and the gene expressions are uncovered, that both chromosome fluctuation and epigenetic marks have strong linear correlations with the gene expressions. The results support that the two compartments have different behaviors, corresponding to the relatively sparse and fluctuating phase (compartment A) and the relatively dense and stable phase (compartment B). Importantly, through the investigation of the transdifferentiation processes between the peripheral blood mononuclear cell (PBMC) and the bipolar neuron (BN), a quantitative description for the transdifferentiation is provided, which can be linked to the Waddington landscape. In addition, compared to the direct transdifferentiation between PBMC and BN, the transdifferentiation via the intermediate state neural progenitor cell (NPC) follows a different path (an "uphill" followed by a "downhill"). These theoretical studies bridge the gap among the chromosome fluctuations/ensembles, the epigenetics, and gene expressions in determining the cell fate.
Affiliation(s)
- Wen‐Ting Chu
- State Key Laboratory of Electroanalytical Chemistry, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun, Jilin 130022, China
- Xiakun Chu
- Department of Chemistry & Physics, State University of New York at Stony Brook, Stony Brook, NY 11794, USA
- Jin Wang
- Department of Chemistry & Physics, State University of New York at Stony Brook, Stony Brook, NY 11794, USA
6
A Cascade Flexible Neural Forest Model for Cancer Subtypes Classification on Gene Expression Data. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2021; 2021:6480456. [PMID: 34650605 PMCID: PMC8510824 DOI: 10.1155/2021/6480456] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/26/2021] [Revised: 07/22/2021] [Accepted: 09/20/2021] [Indexed: 12/15/2022]
Abstract
The correct classification of cancer subtypes is of great significance for the in-depth study of cancer pathogenesis and the realization of accurate treatment for cancer patients. In recent years, the classification of cancer subtypes using deep neural networks and gene expression data has become a hot topic. However, most classifiers may face the challenges of overfitting and low classification accuracy when dealing with small sample size and high-dimensional biological data. In this paper, the Cascade Flexible Neural Forest (CFNForest) Model was proposed to accomplish cancer subtype classification. CFNForest extended the traditional flexible neural tree structure to FNT Group Forest exploiting a bagging ensemble strategy and could automatically generate the model's structure and parameters. In order to deepen the FNT Group Forest without introducing new hyperparameters, the multilayer cascade framework was exploited to design the FNT Group Forest model, which transformed features between levels and improved the performance of the model. The proposed CFNForest model also improved the operational efficiency and the robustness of the model by sample selection mechanism between layers and setting different weights for the output of each layer. To accomplish cancer subtype classification, FNT Group Forest with different feature sets was used to enrich the structural diversity of the model, which makes it more suitable for processing small sample size datasets. The experiments on RNA-seq gene expression data showed that CFNForest effectively improves the accuracy of cancer subtype classification. The classification results have good robustness.
7
Zhong L, Meng Q, Chen Y, Du L, Wu P. A laminar augmented cascading flexible neural forest model for classification of cancer subtypes based on gene expression data. BMC Bioinformatics 2021; 22:475. [PMID: 34600466 PMCID: PMC8487515 DOI: 10.1186/s12859-021-04391-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/29/2021] [Accepted: 09/22/2021] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND Correctly classifying the subtypes of cancer is of great significance for the in-depth study of cancer pathogenesis and the realization of personalized treatment for cancer patients. In recent years, classification of cancer subtypes using deep neural networks and gene expression data has gradually become a research hotspot. However, most classifiers may face overfitting and low classification accuracy when dealing with small sample size and high-dimensional biology data. RESULTS In this paper, a laminar augmented cascading flexible neural forest (LACFNForest) model was proposed to complete the classification of cancer subtypes. This model is a cascading flexible neural forest using deep flexible neural forest (DFNForest) as the base classifier. A hierarchical broadening ensemble method was proposed, which ensures the robustness of classification results and avoids the waste of model structure and function as much as possible. We also introduced an output judgment mechanism to each layer of the forest to reduce the computational complexity of the model. The deep neural forest was extended to the densely connected deep neural forest to improve the prediction results. The experiments on RNA-seq gene expression data showed that LACFNForest has better performance in the classification of cancer subtypes compared to the conventional methods. CONCLUSION The LACFNForest model effectively improves the accuracy of cancer subtype classification with good robustness. It provides a new approach for the ensemble learning of classifiers in terms of structural design.
Affiliation(s)
- Lianxin Zhong
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China
- Qingfang Meng
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China
- Yuehui Chen
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China
- Lei Du
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China
- Peng Wu
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China
8
Patil C, Sylvester JB, Abdilleh K, Norsworthy MW, Pottin K, Malinsky M, Bloomquist RF, Johnson ZV, McGrath PT, Streelman JT. Genome-enabled discovery of evolutionary divergence in brains and behavior. Sci Rep 2021; 11:13016. [PMID: 34155279 PMCID: PMC8217251 DOI: 10.1038/s41598-021-92385-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/24/2021] [Accepted: 06/08/2021] [Indexed: 02/05/2023] Open
Abstract
Lake Malawi cichlid fishes exhibit extensive divergence in form and function built from a relatively small number of genetic changes. We compared the genomes of rock- and sand-dwelling species and asked which genetic variants differed among the groups. We found that 96% of differentiated variants reside in non-coding sequence but these non-coding diverged variants are evolutionarily conserved. Genome regions near differentiated variants are enriched for craniofacial, neural and behavioral categories. Following leads from genome sequence, we used rock- vs. sand-species and their hybrids to (i) delineate the push-pull roles of BMP signaling and irx1b in the specification of forebrain territories during gastrulation and (ii) reveal striking context-dependent brain gene expression during adult social behavior. Our results demonstrate how divergent genome sequences can predict differences in key evolutionary traits. We highlight the promise of evolutionary reverse genetics-the inference of phenotypic divergence from unbiased genome sequencing and then empirical validation in natural populations.
Affiliation(s)
- Chinar Patil
- School of Biological Sciences and Petit Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, GA, USA
- Jonathan B Sylvester
- School of Biological Sciences and Petit Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, GA, USA
- Department of Biology, Georgia State University, Atlanta, GA, USA
- Kawther Abdilleh
- School of Biological Sciences and Petit Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, GA, USA
- Michael W Norsworthy
- Catalog Technologies Inc., Boston, MA, USA
- Freedom of Form Foundation, Inc., Cambridge, MA, USA
- Karen Pottin
- Laboratoire de Biologie du Développement (IBPS-LBD, UMR7622), CNRS, Institut de Biologie Paris Seine, Sorbonne Université, Paris, France
- Milan Malinsky
- Department of Environmental Sciences, Zoological Institute, University of Basel, Basel, Switzerland
- Wellcome Trust Sanger Institute, Cambridge, UK
- Ryan F Bloomquist
- School of Biological Sciences and Petit Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, GA, USA
- Department of Oral Biology and Diagnostic Sciences, Department of Restorative Sciences, Dental College of Georgia, Augusta University, Augusta, GA, USA
- Zachary V Johnson
- School of Biological Sciences and Petit Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, GA, USA
- Patrick T McGrath
- School of Biological Sciences and Petit Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, GA, USA
- Jeffrey T Streelman
- School of Biological Sciences and Petit Institute of Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, GA, USA
9
Prensner JR, Enache OM, Luria V, Krug K, Clauser KR, Dempster JM, Karger A, Wang L, Stumbraite K, Wang VM, Botta G, Lyons NJ, Goodale A, Kalani Z, Fritchman B, Brown A, Alan D, Green T, Yang X, Jaffe JD, Roth JA, Piccioni F, Kirschner MW, Ji Z, Root DE, Golub TR. Noncanonical open reading frames encode functional proteins essential for cancer cell survival. Nat Biotechnol 2021; 39:697-704. [PMID: 33510483 PMCID: PMC8195866 DOI: 10.1038/s41587-020-00806-2] [Citation(s) in RCA: 72] [Impact Index Per Article: 24.0] [Received: 02/18/2020] [Accepted: 12/16/2020] [Indexed: 01/30/2023]
Abstract
Although genomic analyses predict many noncanonical open reading frames (ORFs) in the human genome, it is unclear whether they encode biologically active proteins. Here we experimentally interrogated 553 candidates selected from noncanonical ORF datasets. Of these, 57 induced viability defects when knocked out in human cancer cell lines. Following ectopic expression, 257 showed evidence of protein expression and 401 induced gene expression changes. Clustered regularly interspaced short palindromic repeat (CRISPR) tiling and start codon mutagenesis indicated that their biological effects required translation as opposed to RNA-mediated effects. We found that one of these ORFs, G029442-renamed glycine-rich extracellular protein-1 (GREP1)-encodes a secreted protein highly expressed in breast cancer, and its knockout in 263 cancer cell lines showed preferential essentiality in breast cancer-derived lines. The secretome of GREP1-expressing cells has an increased abundance of the oncogenic cytokine GDF15, and GDF15 supplementation mitigated the growth-inhibitory effect of GREP1 knockout. Our experiments suggest that noncanonical ORFs can express biologically active proteins that are potential therapeutic targets.
Affiliation(s)
- John R. Prensner
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Department of Pediatric Oncology, Dana-Farber Cancer Institute, Boston, MA 02215
- Division of Pediatric Hematology/Oncology, Boston Children’s Hospital, Boston, MA, 02115
- Oana M. Enache
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Victor Luria
- Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
- Karsten Krug
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Karl R. Clauser
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Amir Karger
- IT-Research Computing, Harvard Medical School, Boston, MA, 02115, USA
- Li Wang
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Vickie M. Wang
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Ginevra Botta
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Amy Goodale
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Zohra Kalani
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Adam Brown
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Douglas Alan
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Thomas Green
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Xiaoping Yang
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Jacob D. Jaffe
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Present address: Inzen Therapeutics, Cambridge, MA, 02139, USA
- Federica Piccioni
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Present address: Merck Research Laboratories, Boston, MA, 02115, USA
- Marc W. Kirschner
- Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
- Zhe Ji
- Department of Pharmacology, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611
- Department of Biomedical Engineering, McCormick School of Engineering, Northwestern University, Evanston, IL 60628
- David E. Root
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Todd R. Golub
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Department of Pediatric Oncology, Dana-Farber Cancer Institute, Boston, MA 02215
- Division of Pediatric Hematology/Oncology, Boston Children’s Hospital, Boston, MA, 02115
- Corresponding author: Todd R. Golub, MD, Chief Scientific Officer, Broad Institute of Harvard and MIT, Room 4013, 415 Main Street, Cambridge, MA, 02142. Phone: 617-714-7050
10
Zhang L, Karimzadeh M, Welch M, McIntosh C, Wang B. Analytics methods and tools for integration of biomedical data in medicine. Artif Intell Med 2021. [DOI: 10.1016/b978-0-12-821259-2.00007-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
11
Abstract
Our understanding of the human genome has continuously expanded since its draft publication in 2001. Over the years, novel assays have allowed us to progressively overlay layers of knowledge above the raw sequence of A's, T's, G's, and C's. The reference human genome sequence is now a complex knowledge base maintained under the shared stewardship of multiple specialist communities. Its complexity stems from the fact that it is simultaneously a template for transcription, a record of evolution, a vehicle for genetics, and a functional molecule. In short, the human genome serves as a frame of reference at the intersection of a diversity of scientific fields. In recent years, the progressive fall in sequencing costs has given increasing importance to the quality of the human reference genome, as hundreds of thousands of individuals are being sequenced yearly, often for clinical applications. Also, novel sequencing-based assays shed light on novel functions of the genome, especially with respect to gene expression regulation. Keeping the human genome annotation up to date and accurate is therefore an ongoing partnership between reference annotation projects and the greater community worldwide.
Affiliation(s)
- Daniel R Zerbino
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton CB10 1SD, United Kingdom
- Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton CB10 1SD, United Kingdom
- Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton CB10 1SD, United Kingdom
12
Poverennaya E, Kiseleva O, Romanova A, Pyatnitskiy M. Predicting Functions of Uncharacterized Human Proteins: From Canonical to Proteoforms. Genes (Basel) 2020; 11:E677. [PMID: 32575886 PMCID: PMC7350264 DOI: 10.3390/genes11060677] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/30/2020] [Revised: 06/09/2020] [Accepted: 06/19/2020] [Indexed: 01/22/2023] Open
Abstract
Despite tremendous efforts in genomics, transcriptomics, and proteomics communities, there is still no comprehensive data about the exact number of protein-coding genes, translated proteoforms, and their function. In addition, by now, we lack functional annotation for 1193 genes, where expression was confirmed at the proteomic level (uPE1 proteins). We re-analyzed results of AP-MS experiments from the BioPlex 2.0 database to predict functions of uPE1 proteins and their splice forms. By building a protein-protein interaction network for 12 thousand identified proteins encoded by 11 thousand genes, we were able to predict Gene Ontology categories for a total of 387 uPE1 genes. We predicted different functions for canonical and alternatively spliced forms for four uPE1 genes. In total, functional differences were revealed for 62 proteoforms encoded by 31 genes. Based on these results, it can be carefully concluded that the dynamics and versatility of the interactome is ensured by changing the dominant splice form. Overall, we propose that analysis of large-scale AP-MS experiments performed for various cell lines and under various conditions is a key to understanding the full potential of genes' role in cellular processes.
Affiliation(s)
- Ekaterina Poverennaya
- Department of Bioinformatics, Institute of Biomedical Chemistry, 119121 Moscow, Russia; (O.K.); (A.R.); (M.P.)
- Institute of Environmental and Agricultural Biology (X-BIO),Tyumen State University, 625003 Tyumen, Russia
| | - Olga Kiseleva
- Department of Bioinformatics, Institute of Biomedical Chemistry, 119121 Moscow, Russia; (O.K.); (A.R.); (M.P.)
| | - Anastasia Romanova
- Department of Bioinformatics, Institute of Biomedical Chemistry, 119121 Moscow, Russia
- Faculty of Biological and Medical Physics, Moscow Institute of Physics and Technology, Dolgoprudny, 141701 Moscow, Russia
| | - Mikhail Pyatnitskiy
- Department of Bioinformatics, Institute of Biomedical Chemistry, 119121 Moscow, Russia
- Department of Molecular Biology and Genetics, Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia
| |
|
13
|
Lukiw WJ. Gastrointestinal (GI) Tract Microbiome-Derived Neurotoxins-Potent Neuro-Inflammatory Signals From the GI Tract via the Systemic Circulation Into the Brain. Front Cell Infect Microbiol 2020; 10:22. [PMID: 32117799 PMCID: PMC7028696 DOI: 10.3389/fcimb.2020.00022] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 08/20/2019] [Accepted: 01/15/2020] [Indexed: 12/17/2022] Open
Abstract
The microbiome of the human gastrointestinal (GI)-tract is a rich and dynamic source of microorganisms that together possess a staggering complexity and diversity. Collectively these microbes are capable of secreting what are amongst the most neurotoxic and pro-inflammatory biopolymers known. These include lipopolysaccharide (LPS), enterotoxins, microbial-derived amyloids and small non-coding RNA (sncRNA). One of the major microbial species in the human GI-tract microbiome, about 100-fold more abundant than Escherichia coli, is Bacteroides fragilis, an anaerobic, rod-shaped Gram-negative bacterium that secretes: (i) a particularly potent, pro-inflammatory LPS glycolipid subtype (BF-LPS); and (ii) a hydrolytic, extracellular zinc metalloproteinase known as B. fragilis toxin (BFT) or fragilysin. Ongoing studies support multiple observations that BF-LPS and BFT (fragilysin) disrupt paracellular barriers by cleavage of intercellular proteins, such as E-cadherin, between epithelial cells, resulting in 'leaky' barriers. These defective barriers, which also become more penetrable with age, in turn permit entry of microbiome-derived neurotoxic biopolymers into the systemic circulation from which they can next transit the blood-brain barrier (BBB) and gain access into the brain. This short communication will highlight some recent advances in this extraordinary research area that links the pro-inflammatory exudates of the GI-tract microbiome with innate-immune disturbances and inflammatory signaling within the human central nervous system (CNS) with reference to Alzheimer's disease (AD) wherever possible.
Affiliation(s)
- Walter J. Lukiw
- LSU Neuroscience Center, Louisiana State University Health Sciences Center, New Orleans, LA, United States
- Department of Ophthalmology, Louisiana State University Health Sciences Center, New Orleans, LA, United States
- Department of Neurology, Louisiana State University Health Sciences Center, New Orleans, LA, United States
| |
|
14
|
Lukiw WJ, Li W, Bond T, Zhao Y. Facilitation of Gastrointestinal (GI) Tract Microbiome-Derived Lipopolysaccharide (LPS) Entry Into Human Neurons by Amyloid Beta-42 (Aβ42) Peptide. Front Cell Neurosci 2019; 13:545. [PMID: 31866832 PMCID: PMC6908466 DOI: 10.3389/fncel.2019.00545] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/29/2019] [Accepted: 11/22/2019] [Indexed: 01/01/2023] Open
Abstract
Human gastrointestinal (GI)-tract microbiome-derived lipopolysaccharide (LPS): (i) has been recently shown to target, accumulate within, and eventually encapsulate neuronal nuclei of the human central nervous system (CNS) in Alzheimer’s disease (AD) brain; and (ii) this action appears to impede and restrict the outward flow of genetic information from neuronal nuclei. It has previously been shown that in LPS-encased neuronal nuclei in AD brain there is a specific disruption in the output and expression of two AD-relevant, neuron-specific markers encoding the cytoskeletal neurofilament light (NF-L) chain protein and the synaptic phosphoprotein synapsin-1 (SYN1) involved in the regulation of neurotransmitter release. The biophysical mechanisms involved in the facilitation of the targeting of LPS to neuronal cells and nuclei and eventual nuclear envelopment and functional disruption are not entirely clear. In this “Perspectives article” we discuss current advances, and consider future directions in this research area, and provide novel evidence in human neuronal-glial (HNG) cells in primary culture that the co-incubation of LPS with amyloid-beta 42 (Aβ42) peptide facilitates the association of LPS with neuronal cells. These findings: (i) support a novel pathogenic role for Aβ42 peptides in neurons via the formation of pores across the nuclear membrane and/or a significant biophysical disruption of the neuronal nuclear envelope; and (ii) advance the concept that the Aβ42 peptide-facilitated entry of LPS into brain neurons, accession of neuronal nuclei, and down-regulation of neuron-specific components such as NF-L and SYN1 may contribute significantly to neuropathological deficits as are characteristically observed in AD-affected brain.
Affiliation(s)
- Walter J Lukiw
- LSU Neuroscience Center, Louisiana State University Health Sciences Center, New Orleans, LA, United States
- Department of Ophthalmology, Louisiana State University Health Sciences Center, New Orleans, LA, United States
- Department of Neurology, Louisiana State University Health Sciences Center, New Orleans, LA, United States
| | - Wenhong Li
- LSU Neuroscience Center, Louisiana State University Health Sciences Center, New Orleans, LA, United States
- Department of Pharmacology, School of Pharmacy, Jiangxi University of Traditional Chinese Medicine (TCM), Nanchang, China
| | - Taylor Bond
- LSU Neuroscience Center, Louisiana State University Health Sciences Center, New Orleans, LA, United States
| | - Yuhai Zhao
- LSU Neuroscience Center, Louisiana State University Health Sciences Center, New Orleans, LA, United States
- Department of Anatomy and Cell Biology, Louisiana State University Health Sciences Center, New Orleans, LA, United States
| |
|
15
|
Hatje K, Mühlhausen S, Simm D, Kollmar M. The Protein-Coding Human Genome: Annotating High-Hanging Fruits. Bioessays 2019; 41:e1900066. [PMID: 31544971 DOI: 10.1002/bies.201900066] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 04/15/2019] [Revised: 08/07/2019] [Indexed: 12/19/2022]
Abstract
The major transcript variants of human protein-coding genes are annotated to a certain degree of accuracy by combining manual curation, transcript data, and proteomics evidence. However, there is considerable disagreement on the annotation of about 2000 genes (they can be protein-coding, noncoding, or pseudogenes) and on the annotation of most of the predicted alternative transcripts. Pure transcriptome mapping approaches seem to be limited in discriminating functional expression from noise. These limitations have partially been overcome by dedicated algorithms to detect alternatively spliced micro-exons and wobble splice variants. Recently, knowledge about splicing mechanisms and protein structure has been incorporated into an algorithm to predict neighboring homologous exons, often spliced in a mutually exclusive manner. Predicted exons are evaluated by transcript data, structural compatibility, and evolutionary conservation, revealing hundreds of novel coding exons and splice mechanism re-assignments. The emerging human pan-genome is necessitating distinctive annotations incorporating differences between individuals and between populations.
Affiliation(s)
- Klas Hatje
- Roche Pharmaceutical Research and Early Development, Pharmaceutical Sciences, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstr. 124, 4070, Basel, Switzerland
| | - Stefanie Mühlhausen
- Group Systems Biology of Motor Proteins, Department of NMR-based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Am Fassberg 11, 37077, Göttingen, Germany
| | - Dominic Simm
- Group Systems Biology of Motor Proteins, Department of NMR-based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Am Fassberg 11, 37077, Göttingen, Germany
- Theoretical Computer Science and Algorithmic Methods, Institute of Computer Science, Georg-August-University Göttingen, Goldschmidtstr. 7, 37077, Göttingen, Germany
| | - Martin Kollmar
- Group Systems Biology of Motor Proteins, Department of NMR-based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Am Fassberg 11, 37077, Göttingen, Germany
| |
|
16
|
Pertea M, Shumate A, Pertea G, Varabyou A, Breitwieser FP, Chang YC, Madugundu AK, Pandey A, Salzberg SL. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol 2018; 19:208. [PMID: 30486838 PMCID: PMC6260756 DOI: 10.1186/s13059-018-1590-2] [Citation(s) in RCA: 162] [Impact Index Per Article: 27.0] [Received: 06/02/2018] [Accepted: 11/16/2018] [Indexed: 01/06/2023] Open
Abstract
We assembled the sequences from deep RNA sequencing experiments by the Genotype-Tissue Expression (GTEx) project, to create a new catalog of human genes and transcripts, called CHESS. The new database contains 42,611 genes, of which 20,352 are potentially protein-coding and 22,259 are noncoding, and a total of 323,258 transcripts. These include 224 novel protein-coding genes and 116,156 novel transcripts. We detected over 30 million additional transcripts at more than 650,000 genomic loci, nearly all of which are likely nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells. The CHESS database is available at http://ccb.jhu.edu/chess.
Affiliation(s)
- Mihaela Pertea
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Alaina Shumate
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Geo Pertea
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
| | - Ales Varabyou
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Florian P Breitwieser
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
| | - Yu-Chi Chang
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Anil K Madugundu
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Institute of Bioinformatics, International Technology Park, Bangalore, India
- Manipal Academy of Higher Education (MAHE), Manipal, Karnataka, India
- Present address: Center for Individualized Medicine and Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA
| | - Akhilesh Pandey
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Departments of Biological Chemistry, Pathology, Neurology, and Oncology, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Present address: Center for Individualized Medicine and Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA
| | - Steven L Salzberg
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA.
| |
|
17
|
Iñiguez LP, Hernández G. The Evolutionary Relationship between Alternative Splicing and Gene Duplication. Front Genet 2017; 8:14. [PMID: 28261262 PMCID: PMC5306129 DOI: 10.3389/fgene.2017.00014] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Received: 11/23/2016] [Accepted: 02/02/2017] [Indexed: 01/23/2023] Open
Abstract
The protein diversity that exists today has resulted from various evolutionary processes. It is well known that gene duplication (GD), along with the accumulation of mutations, is responsible, among other factors, for an increase in the number of different proteins. The gene structure in eukaryotes requires the removal of non-coding sequences, introns, to produce mature mRNAs. This process, known as cis-splicing and referred to here as splicing, is regulated by several factors, which can lead to numerous splicing arrangements, commonly designated as alternative splicing (AS). AS, which produces several transcript isoforms from a single gene, also increases protein diversity. However, the evolution and manner of increasing protein variation differ between AS and GD. An important question is how patterns of AS are affected after a GD event. Here, we review the current knowledge of AS and GD, focusing on their evolutionary relationship. These two processes are now considered the main contributors to increasing protein diversity, and therefore their relationship is a relevant, yet understudied, area of evolutionary study.
Affiliation(s)
- Luis P Iñiguez
- Programa de Genómica Funcional de Eucariotes, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México Cuernavaca, México
| | - Georgina Hernández
- Programa de Genómica Funcional de Eucariotes, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México Cuernavaca, México
| |
|
18
|
Milligan MJ, Lipovich L. Pseudogene-derived lncRNAs: emerging regulators of gene expression. Front Genet 2015; 5:476. [PMID: 25699073 PMCID: PMC4316772 DOI: 10.3389/fgene.2014.00476] [Citation(s) in RCA: 82] [Impact Index Per Article: 9.1] [Received: 09/11/2014] [Accepted: 12/25/2014] [Indexed: 01/11/2023] Open
Abstract
In the more than one decade since the completion of the Human Genome Project, the prevalence of non-protein-coding functional elements in the human genome has emerged as a key revelation in post-genomic biology. Highlighted by the ENCODE (Encyclopedia of DNA Elements) and FANTOM (Functional Annotation of Mammals) consortia, these elements include tens of thousands of pseudogenes, as well as comparably numerous long non-coding RNA (lncRNA) genes. Pseudogene transcription and function remain insufficiently understood. However, the field is of great importance for human disease due to the high sequence similarity between pseudogenes and their parental protein-coding genes, which generates the potential for sequence-specific regulation. Recent case studies have established essential and coordinated roles of both pseudogenes and lncRNAs in development and disease in metazoan systems, including functional impacts of lncRNA transcription at pseudogene loci on the regulation of the pseudogenes’ parental genes. This review synthesizes the nascent evidence for regulatory modalities jointly exerted by lncRNAs and pseudogenes in human disease, and for recent evolutionary origins of these systems.
Affiliation(s)
- Michael J Milligan
- Center for Molecular Medicine and Genetics, Wayne State University School of Medicine, Detroit, MI, USA
| | - Leonard Lipovich
- Center for Molecular Medicine and Genetics, Wayne State University School of Medicine, Detroit, MI, USA
| |
|
19
|
Chen L, Bush SJ, Tovar-Corona JM, Castillo-Morales A, Urrutia AO. Correcting for differential transcript coverage reveals a strong relationship between alternative splicing and organism complexity. Mol Biol Evol 2014; 31:1402-13. [PMID: 24682283 PMCID: PMC4032128 DOI: 10.1093/molbev/msu083] [Citation(s) in RCA: 93] [Impact Index Per Article: 9.3] [Indexed: 12/11/2022] Open
Abstract
What at the genomic level underlies organism complexity? Although several genomic features have been associated with organism complexity, in the case of alternative splicing, which has long been proposed to explain the variation in complexity, no such link has been established. Here, we analyzed over 39 million expressed sequence tags available for 47 eukaryotic species with fully sequenced genomes to obtain a comparable index of alternative splicing estimates, which corrects for the distorting effect of a variable number of transcripts per species—an important obstacle for comparative studies of alternative splicing. We find that alternative splicing has steadily increased over the last 1,400 My of eukaryotic evolution and is strongly associated with organism complexity, assayed as the number of cell types. Importantly, this association is not explained as a by-product of covariance between alternative splicing with other variables previously linked to complexity including gene content, protein length, proteome disorder, and protein interactivity. In addition, we found no evidence to suggest that the relationship of alternative splicing to cell type number is explained by drift due to reduced Ne in more complex species. Taken together, our results firmly establish alternative splicing as a significant predictor of organism complexity and are, in principle, consistent with an important role of transcript diversification through alternative splicing as a means of determining a genome’s functional information capacity.
Affiliation(s)
- Lu Chen
- Department of Biology and Biochemistry, University of Bath, Bath, United Kingdom
| | - Stephen J Bush
- Department of Biology and Biochemistry, University of Bath, Bath, United Kingdom
| | - Jaime M Tovar-Corona
- Department of Biology and Biochemistry, University of Bath, Bath, United Kingdom
| | | | - Araxi O Urrutia
- Department of Biology and Biochemistry, University of Bath, Bath, United Kingdom
| |
|
20
|
Abstract
Genetics has fascinated societies since ancient times, and references to traits or behaviors that appear to be shared or different among related individuals have permeated legends, literature, and popular culture. Biomedical advances from the past century, and particularly the discovery of the DNA double helix, the increasing numbers of links that were established between mutations and medical conditions or phenotypes, and technological advances that facilitated the sequencing of the human genome, catalyzed the development of genetic testing. Genetic tests were initially performed in health care facilities, interpreted by health care providers, and included the availability of counseling. Recent years have seen an increased availability of genetic tests that are offered by companies directly to consumers, a phenomenon that became known as direct-to-consumer genetic testing. Tests offered in this setting range from the ones that are also provided in health care establishments to tests known as ‘recreational genomics,’ and consumers directly receive the test results. In addition, testing in this context often does not involve the availability of counseling and, when this is provided, it frequently occurs on-line or over the phone. As a field situated at the interface between biotechnology, biomedical research, and social sciences, direct-to-consumer genetic testing opens multiple challenges that can be appropriately addressed only by developing a complex, inter-disciplinary framework.
|
21
|
Abstract
Many people expected the question 'How many genes in the human genome?' to be resolved with the publication of the genome sequence in 2001, but estimates continue to fluctuate.
Affiliation(s)
- Mihaela Pertea
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
| | - Steven L Salzberg
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
| |
|
22
|
Hagmann P, Cammoun L, Gigandet X, Gerhard S, Grant PE, Wedeen V, Meuli R, Thiran JP, Honey CJ, Sporns O. MR connectomics: Principles and challenges. J Neurosci Methods 2010; 194:34-45. [PMID: 20096730 DOI: 10.1016/j.jneumeth.2010.01.014] [Citation(s) in RCA: 202] [Impact Index Per Article: 14.4] [Received: 08/31/2009] [Revised: 01/02/2010] [Accepted: 01/13/2010] [Indexed: 11/16/2022]
Abstract
MR connectomics is an emerging framework in neuroscience that combines diffusion MRI and whole brain tractography methodologies with the analytical tools of network science. In the present work we review the current methods enabling structural connectivity mapping with MRI and show how such data can be used to infer new information about both brain structure and function. We also list the technical challenges that should be addressed in the future to achieve high-resolution maps of structural connectivity. Given the tremendous amount of data that is going to be accumulated soon, we discuss what new challenges must be tackled in terms of methods for advanced network analysis and visualization, as well as data organization and distribution. This new framework is well suited to investigate key questions on brain complexity and we try to foresee what fields will most benefit from these approaches.
Affiliation(s)
- Patric Hagmann
- Department of Radiology, University Hospital Center and University of Lausanne (CHUV-UNIL), Switzerland; Signal Processing Laboratory (LTS5), Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland.
| |
|
23
|
Yang X, Xie L, Li Y, Wei C. More than 9,000,000 unique genes in human gut bacterial community: estimating gene numbers inside a human body. PLoS One 2009; 4:e6074. [PMID: 19562079 PMCID: PMC2699651 DOI: 10.1371/journal.pone.0006074] [Citation(s) in RCA: 97] [Impact Index Per Article: 6.5] [Received: 05/08/2009] [Accepted: 05/29/2009] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND: Estimating the number of genes in the human genome has long been an important problem in computational biology. With the new conception of considering the human as a super-organism, it is also interesting to estimate the number of genes in this human super-organism. PRINCIPAL FINDINGS: We presented our estimation of gene numbers in the human gut bacterial community, the largest microbial community inside the human super-organism. We obtained 552,700 unique genes from 202 complete human gut bacteria genomes. Then, a novel gene counting model was built to check the total number of genes by combining culture-independent sequence data and those complete genomes. 16S rRNAs were used to construct a three-level tree and different counting methods were introduced for the three levels: strain-to-species, species-to-genus, and genus-and-up. The model estimates that the total number of genes is about 9,000,000 after those with identity percentage of 97% or up were merged. CONCLUSION: By combining completed genomes currently available and culture-independent sequencing data, we built a model to estimate the number of genes in the human gut bacterial community. The total number of genes is estimated to be about 9 million. Although this number is huge, we believe it is an underestimate. This is an initial step to tackle this gene counting problem for the human super-organism. It will still be an open problem in the near future. The list of genomes used in this paper can be found in the supplementary table.
Affiliation(s)
- Xing Yang
- Shanghai Center for Bioinformation Technology, Shanghai, China
- School of Life Science and Technology, Tongji University, Shanghai, China
| | - Lu Xie
- Shanghai Center for Bioinformation Technology, Shanghai, China
| | - Yixue Li
- Shanghai Center for Bioinformation Technology, Shanghai, China
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Bioinformation Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- * E-mail: (YXL); (CCW)
| | - Chaochun Wei
- Shanghai Center for Bioinformation Technology, Shanghai, China
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Lab of Molecular Microbial Ecology and Ecogenomics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- * E-mail: (YXL); (CCW)
| |
|
24
|
Nordström KJV, Mirza MAI, Almén MS, Gloriam DE, Fredriksson R, Schiöth HB. Critical evaluation of the FANTOM3 non-coding RNA transcripts. Genomics 2009; 94:169-76. [PMID: 19505569 DOI: 10.1016/j.ygeno.2009.05.012] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 11/06/2007] [Revised: 05/25/2009] [Accepted: 05/26/2009] [Indexed: 01/15/2023]
Abstract
We studied the genomic positions of 38,129 putative ncRNAs from the RIKEN dataset in relation to protein-coding genes. We found that the dataset has 41% sense, 6% antisense, 24% intronic and 29% intergenic transcripts. Interestingly, 17,678 (47%) of the FANTOM3 transcripts were found to potentially be internally primed from longer transcripts. The highest fraction of these transcripts was found among the intronic transcripts and as many as 77% or 6929 intronic transcripts were both internally primed and unspliced. We defined a filtered subset of 8535 transcripts that did not overlap with protein-coding genes, did not contain ORFs longer than 100 residues and were not internally primed. This dataset contains 53% of the FANTOM3 transcripts associated with known ncRNAs in RNAdb and expands previous similar efforts with 6523 novel transcripts. This bioinformatic filtering of the FANTOM3 non-coding dataset has generated a lead dataset of transcripts without signs of being artefacts, providing a suitable dataset for investigation with hybridization-based techniques.
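The three exclusion criteria named in this abstract (no overlap with protein-coding genes, no ORF longer than 100 residues, not internally primed) can be illustrated with a minimal Python sketch. The `Transcript` record and its field names are hypothetical and are not taken from the FANTOM3 data or from the authors' pipeline; the sketch assumes each property has already been computed upstream.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record; field names are illustrative assumptions."""
    transcript_id: str
    overlaps_protein_coding_gene: bool
    longest_orf_residues: int
    internally_primed: bool


# Threshold stated in the abstract: no ORFs longer than 100 residues.
MAX_ORF_RESIDUES = 100


def passes_filter(t: Transcript) -> bool:
    """Return True only if the transcript meets all three criteria."""
    return (
        not t.overlaps_protein_coding_gene
        and t.longest_orf_residues <= MAX_ORF_RESIDUES
        and not t.internally_primed
    )


def filter_transcripts(transcripts: Iterable[Transcript]) -> List[Transcript]:
    """Keep the transcripts that pass every criterion, preserving order."""
    return [t for t in transcripts if passes_filter(t)]


if __name__ == "__main__":
    demo = [
        Transcript("tx_a", False, 50, False),   # passes all criteria
        Transcript("tx_b", True, 50, False),    # overlaps a coding gene
        Transcript("tx_c", False, 150, False),  # ORF too long
        Transcript("tx_d", False, 80, True),    # internally primed
    ]
    print([t.transcript_id for t in filter_transcripts(demo)])
```

How overlap, ORF length, and internal priming are actually determined is not described in the abstract, so those steps are left out of the sketch.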
|
25
|
Venkatesan K, Rual JF, Vazquez A, Stelzl U, Lemmens I, Hirozane-Kishikawa T, Hao T, Zenkner M, Xin X, Goh KI, Yildirim MA, Simonis N, Heinzmann K, Gebreab F, Sahalie JM, Cevik S, Simon C, de Smet AS, Dann E, Smolyar A, Vinayagam A, Yu H, Szeto D, Borick H, Dricot A, Klitgord N, Murray RR, Lin C, Lalowski M, Timm J, Rau K, Boone C, Braun P, Cusick ME, Roth FP, Hill DE, Tavernier J, Wanker EE, Barabási AL, Vidal M. An empirical framework for binary interactome mapping. Nat Methods 2008; 6:83-90. [PMID: 19060904 PMCID: PMC2872561 DOI: 10.1038/nmeth.1280] [Citation(s) in RCA: 609] [Impact Index Per Article: 38.1] [Received: 06/04/2008] [Accepted: 11/10/2008] [Indexed: 01/05/2023]
Abstract
Several attempts have been made at systematically mapping protein-protein interaction, or “interactome” networks. However, it remains difficult to assess the quality and coverage of existing datasets. We describe a framework that uses an empirically-based approach to rigorously dissect quality parameters of currently available human interactome maps. Our results indicate that high-throughput yeast two-hybrid (HT-Y2H) interactions for human are superior in precision to literature-curated interactions supported by only a single publication, suggesting that HT-Y2H is suitable to map a significant portion of the human interactome. We estimate that the human interactome contains ~130,000 binary interactions, most of which remain to be mapped. Similar to estimates of DNA sequence data quality and genome size early in the human genome project, estimates of protein interaction data quality and interactome size are critical to establish the magnitude of the task of comprehensive human interactome mapping and to illuminate a path towards this goal.
Affiliation(s)
- Kavitha Venkatesan
- Center for Cancer Systems Biology and Department of Cancer Biology, Dana-Farber Cancer Institute, 1 Jimmy Fund Way, Boston, MA 02115, USA
| |
|
26
|
Copley RR. The animal in the genome: comparative genomics and evolution. Philos Trans R Soc Lond B Biol Sci 2008; 363:1453-61. [PMID: 18192189 PMCID: PMC2614226 DOI: 10.1098/rstb.2007.2235] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Indexed: 12/17/2022] Open
Abstract
Comparisons between completely sequenced metazoan genomes have generally emphasized how similar their encoded protein content is, even when the comparison is between phyla. Given the manifest differences between phyla and, in particular, intuitive notions that some animals are more complex than others, this creates something of a paradox. Simplistic explanations have included arguments such as increased numbers of genes; greater numbers of protein products produced through alternative splicing; increased numbers of regulatory non-coding RNAs and increased complexity of the cis-regulatory code. An obvious value of complete genome sequences lies in their ability to provide us with inventories of such components. I examine progress being made in linking genome content to the pattern of animal evolution, and argue that the gap between genomic and phenotypic complexity can only be understood through the totality of interacting components.
|
27
|
Uberbacher EC, Hyatt D, Shah M. GrailEXP and Genome Analysis Pipeline for genome annotation. Curr Protoc Hum Genet 2008; Chapter 6:Unit 6.5. [PMID: 18428363 DOI: 10.1002/0471142905.hg0605s39] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
Abstract
The Gene Recognition and Analysis Internet Link (GRAIL) is one of the most widely used systems for evaluating the protein-coding potential of anonymous DNA sequences. This unit describes the use of the XGRAIL and genQuest client-server applications to locate exons in DNA sequences, to develop gene models, and to search databases for homologs. A support protocol describes how to obtain the GRAIL and genQuest client software by anonymous FTP.
|
28
|
Bianchetti L, Wu Y, Guerin E, Plewniak F, Poch O. SAGETTARIUS: a program to reduce the number of tags mapped to multiple transcripts and to plan SAGE sequencing stages. Nucleic Acids Res 2007; 35:e122. [PMID: 17884916 PMCID: PMC2094080 DOI: 10.1093/nar/gkm648] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022]
Abstract
SAGE (Serial Analysis of Gene Expression) experiments generate short nucleotide sequences called ‘tags’ which are assumed to map unambiguously to their original transcripts (1 tag to 1 transcript mapping). Nevertheless, many tags are generated that do not map to any transcript or map to multiple transcripts. Current bioinformatics resources, such as SAGEmap and TAGmapper, have focused on reducing the number of unmapped tags. Here, we describe SAGETTARIUS, a new high-throughput program that performs successive precise Nla3 and Sau3A tag-to-transcript mapping, based on specifically designed Virtual Tag (VT) libraries. First, SAGETTARIUS decreases the number of tags mapped to multiple transcripts. Among the various mapping resources compared, SAGETTARIUS performed the best in this respect, decreasing the number of multiply mapped tags by up to 11%. Second, SAGETTARIUS allows the establishment of a guideline for SAGE experiment sequencing efforts through efficient mapping of the CRT (Cytoplasmic Ribosomal protein Transcripts)-specific tags. Using all publicly available human and mouse Nla3 SAGE experiments, we show that sequencing 100 000 tags is sufficient to map almost all CRT-specific tags and that four sequencing stages can be identified when carrying out a human or mouse SAGE project. SAGETTARIUS is web-interfaced and freely accessible to academic users.
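To make the tag-to-transcript mapping problem in this abstract concrete, here is a minimal Python sketch of a toy virtual tag index. It is not SAGETTARIUS itself. The function names, the 10-base tag length, and the use of the 3'-most CATG (Nla3) site are illustrative assumptions based on the conventional SAGE protocol, and the transcript sequences are invented.

```python
from collections import defaultdict

def virtual_tag(transcript, site="CATG", tag_len=10):
    """Return the tag_len bases that follow the 3'-most anchoring site.

    Assumption: the tag is taken immediately downstream of the last
    occurrence of the site. Returns None when no full-length tag exists.
    """
    seq = transcript.upper()
    idx = seq.rfind(site)
    if idx == -1:
        return None
    start = idx + len(site)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None

def build_tag_index(transcripts):
    """Map each virtual tag to the set of transcript names that produce it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        tag = virtual_tag(seq)
        if tag is not None:
            index[tag].add(name)
    return index

def multiply_mapped(index):
    """Keep only tags shared by more than one transcript."""
    return {tag: names for tag, names in index.items() if len(names) > 1}

# Hypothetical toy transcripts: t1 and t2 share the same virtual tag.
transcripts = {
    "t1": "AAACATGACGTACGTACGG",
    "t2": "TTTTCATGACGTACGTACTT",
    "t3": "GGCATGTTTTTTTTTTAA",
}
ambiguous = multiply_mapped(build_tag_index(transcripts))
```

In this toy data set the tag `ACGTACGTAC` is produced by both `t1` and `t2`, which is the kind of ambiguity the abstract says a second anchoring enzyme (Sau3A) is used to reduce.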
Affiliation(s)
- Laurent Bianchetti
- Plate-forme Bioinformatique de Strasbourg, Institut de Génétique et de Biologie Moléculaire et Cellulaire (CNRS/INSERM/ULP) BP 163, 67404 Illkirch Cedex, France.
|
29
|
Nordström KJV, Mirza MAI, Larsson TP, Gloriam DEI, Fredriksson R, Schiöth HB. Comprehensive comparisons of the current human, mouse, and rat RefSeq, Ensembl, EST, and FANTOM3 datasets: Identification of new human genes with specific tissue expression profile. Biochem Biophys Res Commun 2006; 348:1063-74. [PMID: 16904064 DOI: 10.1016/j.bbrc.2006.07.153] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 07/21/2006] [Accepted: 07/25/2006] [Indexed: 10/24/2022]
Abstract
Our understanding of functional genetic elements in genomes is continuously growing and new entries are added to various databases on a regular basis. We have here merged the genetic elements in RefSeq, Ensembl, FANTOM3, HINV, and NCBI's ESTdb using the genome assemblies in order to achieve a comprehensive picture of the current status of the identity and gene number in human, mouse, and rat. The number of human protein coding genes has not increased (25,043), while the increased sequencing of mouse transcripts has yielded a considerably higher number of protein coding genes (31,578) in mouse. The results indicate large discrepancies between the datasets, as considerable numbers of unique transcripts can be found in each dataset. Despite the high number of ncRNAs (38,129 in mouse), there are also almost 20,000 EST clusters in both mouse and human with more than one EST that do not overlap any transcript, suggesting that several new genetic elements are still to be found. We also demonstrated the presence of new genes by identifying new human ones that have specific tissue profiles, using RT-PCR on rat tissues.
Affiliation(s)
- Karl J V Nordström
- Department of Neuroscience, Uppsala University, BMC Box 593, 751 24 Uppsala, Sweden
|
30
|
Sgaramella V, Salamini F. Gene paucity, genome instability, clonal development: has an individual genome the potential to encode for more than one brain? DNA Repair (Amst) 2006; 5:531-3. [PMID: 16621729 DOI: 10.1016/j.dnarep.2006.03.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/07/2006] [Accepted: 03/07/2006] [Indexed: 11/26/2022]
|
31
|
Abstract
Recent years have brought a dramatic change in our understanding of the role of ribonucleic acids (RNAs) within the cell. In addition to the already well-known classes of RNAs that take part in the transmission of genetic information from DNA to proteins, a new highly heterogeneous group of RNA molecules has emerged. The regulatory nonprotein-coding RNAs (npcRNAs) have been shown to be involved in modulation of gene expression on both the transcriptional and post-transcriptional level. They participate in mechanisms of chromatin modification, regulation of transcription factor activity, and influencing mRNA stability, processing, and translation. npcRNAs are key factors in genetic imprinting, dosage compensation of X-chromosome-linked genes, and many processes of differentiation and development.
Affiliation(s)
- M Szymański
- Institute of Bioorganic Chemistry of the Polish Academy of Sciences, Noskowskiego 12, 61-704 Poznan, Poland.
|
32
|
Tsongalis GJ, Coleman WB. Clinical genotyping: The need for interrogation of single nucleotide polymorphisms and mutations in the clinical laboratory. Clin Chim Acta 2006; 363:127-37. [PMID: 16102741 DOI: 10.1016/j.cccn.2005.05.043] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Received: 04/20/2005] [Revised: 05/21/2005] [Accepted: 05/21/2005] [Indexed: 11/15/2022]
Abstract
BACKGROUND Detection of single nucleotide polymorphisms (SNPs) and gene mutations is becoming more routine in the clinical laboratory. METHODS Completion of the Human Genome Project has led to new scientific knowledge of human disease processes that has revealed the most fundamental of abnormalities in nucleic acids while at the same time bringing some of the most sophisticated diagnostic tools to the clinical laboratory. In addition, public awareness (among both lay persons and healthcare providers) of and sensitivity to human genetics have increased tremendously. Together, this rapidly evolving science and increased public education have led to an increasing demand for genotypic testing. CONCLUSIONS There are several clinical applications of human genotyping that are available using these newer technologies.
Affiliation(s)
- Gregory J Tsongalis
- Department of Pathology, Dartmouth Medical School and Dartmouth Hitchcock Medical Center Lebanon, NH 03756, USA.
|
33
|
Wang JPZ, Lindsay BG, Cui L, Wall PK, Marion J, Zhang J, dePamphilis CW. Gene capture prediction and overlap estimation in EST sequencing from one or multiple libraries. BMC Bioinformatics 2005; 6:300. [PMID: 16351717 PMCID: PMC1369009 DOI: 10.1186/1471-2105-6-300] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Received: 12/03/2004] [Accepted: 12/13/2005] [Indexed: 11/10/2022]
Abstract
BACKGROUND In expressed sequence tag (EST) sequencing, we are often interested in how many genes we can capture in an EST sample of a targeted size. This information provides insights to sequencing efficiency in experimental design, as well as clues to the diversity of expressed genes in the tissue from which the library was constructed. RESULTS We propose a compound Poisson process model that can accurately predict the gene capture in a future EST sample based on an initial EST sample. It also allows estimation of the number of expressed genes in one cDNA library or co-expressed in two cDNA libraries. The superior performance of the new prediction method over an existing approach is established by a simulation study. Our analysis of four Arabidopsis thaliana EST sets suggests that the number of expressed genes present in four different cDNA libraries of Arabidopsis thaliana varies from 9155 (root) to 12005 (silique). An observed fraction of co-expressed genes in two different EST sets as low as 25% can correspond to an actual overlap fraction greater than 65%. CONCLUSION The proposed method provides a convenient tool for gene capture prediction and cDNA library property diagnosis in EST sequencing.
Affiliation(s)
- Ji-Ping Z Wang
- Department of Statistics, Northwestern University, Evanston, IL 60208, USA
- Bruce G Lindsay
- Department of Statistics, Penn State University, University Park 16802, USA
- Liying Cui
- Department of Biology, Penn State University, University Park 16802, USA
- P Kerr Wall
- Department of Biology, Penn State University, University Park 16802, USA
- Josh Marion
- Department of Computer Science, Penn State University, University Park 16802, USA
- Jiaxuan Zhang
- College of Software, Tsinghua University, Beijing, 100086, PR China
|
34
|
Zhang J, Zhang L, Coombes KR. Gene sequence signatures revealed by mining the UniGene affiliation network. Bioinformatics 2005; 22:385-91. [PMID: 16339286 DOI: 10.1093/bioinformatics/bti796] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022]
Abstract
BACKGROUND In the post-genomic era, developing tools to decode biological information from genomic sequences is important. Inspired by affiliation network theory, we investigated gene sequences of two kinds of UniGene clusters (UCs): narrowly expressed transcripts (NETs), whose expression is confined to a few tissues; and prevalently expressed transcripts (PETs) that are expressed in many tissues. RESULTS We explored the human and the mouse UniGene databases to compare NETs and PETs from different perspectives. We found that NETs were associated with smaller cluster size, shorter sequence length, a lower likelihood of having LocusLink annotations, and lower and more sporadic levels of expression. Significantly, the dinucleotide frequencies of NETs are similar to those of intergenic sequences in the genome, and they differ from those of PETs. We used these differences in dinucleotide frequencies to develop a discriminant analysis model to distinguish PETs from intergenic sequences. CONCLUSIONS Our results show that most NETs resemble intergenic sequences, casting doubts on the quality of such UniGene clusters. However, we also noted that a fraction of NETs resemble PETs in terms of dinucleotide frequencies and other features. Such NETs may have fewer quality problems. This work may be helpful in the studies of non-coding RNAs and in the validation of gene sequence databases.
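The dinucleotide frequencies this abstract relies on can be computed along the following lines. This is only a hedged Python sketch of the feature extraction step, with a hypothetical function name. It does not reproduce the authors' discriminant analysis model, and it assumes overlapping dinucleotides and that any pair containing a non-ACGT character is skipped.

```python
from collections import Counter
from itertools import product

def dinucleotide_frequencies(seq):
    """Return relative frequencies of the 16 dinucleotides in seq.

    Assumptions: overlapping windows of width 2 are counted, and windows
    containing characters other than A, C, G, T (for example N) are ignored.
    """
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [p for p in pairs if set(p) <= set("ACGT")]
    counts = Counter(valid)
    total = len(valid)
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a, b in product("ACGT", repeat=2)
    }

# Example: a 16-value profile for a short made-up sequence.
profile = dinucleotide_frequencies("ACGTTGCAACGT")
```

Each sequence then becomes a 16-value vector, which is the kind of input a classifier separating transcript-like from intergenic-like sequences would take.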
Affiliation(s)
- Jiexin Zhang
- Department of Biostatistics and Applied Mathematics, The University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Boulevard, Box 447, Houston, TX 77030-4009, USA
|
35
|
Larsson TP, Murray CG, Hill T, Fredriksson R, Schiöth HB. Comparison of the current RefSeq, Ensembl and EST databases for counting genes and gene discovery. FEBS Lett 2005; 579:690-8. [PMID: 15670830 DOI: 10.1016/j.febslet.2004.12.046] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Received: 11/15/2004] [Revised: 12/13/2004] [Accepted: 12/13/2004] [Indexed: 11/25/2022]
Abstract
Large amounts of refined sequence material in the form of predicted, curated and annotated genes and expressed sequence tags (ESTs) have recently been added to the NCBI databases. We matched the transcript sequences of RefSeq, Ensembl and dbEST in an attempt to provide an updated overview of how many unique human genes can be found. The results indicate that there are about 25,000 unique genes in the union of RefSeq and Ensembl with 12-18% and 8-13% of the genes in each set unique to the other set, respectively. About 20% of all genes had splice variants. There are a considerable number of ESTs (2,200,000) that do not match the identified genes and we used an in-house pipeline to identify 22 novel genes from Genscan predictions that have considerable EST coverage. The study provides an insight into the current status of human gene catalogues and shows that considerable refinement of methods and datasets is needed to come to a conclusive gene count.
Affiliation(s)
- Thomas P Larsson
- Department of Neuroscience, Uppsala University, BMC Box 593, 751 24 Uppsala, Sweden.
|
36
|
Abstract
The completion of the human genome has shifted the attention from deciphering the sequence to the identification and characterization of the encoded components. The identification and functional annotation of the proteome is here of special interest and starts with the identification of genes and transcripts as a prerequisite of proteome annotation. Gene predictions are very powerful in predicting most of the exons in a genome, but reliable gene structure predictions of both known and novel genes are dependent on existing transcript and protein information. An enormous amount of data already exists on the function of many human proteins, but this is scattered over many resources. Public domain databases are required to manage and collate this information and present it to the user community in both a human and machine readable manner.
Affiliation(s)
- Sandra Orchard
- EMBL-The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
|
37
|
Abstract
The existence of a well conserved linear relationship between GC levels of genes' second and third codon positions (GC2, GC3) prompted us to focus on the landscape, or joint distribution, spanned by these two variables. In human, well curated coding sequences now cover at least 15%-30% of the estimated total gene set. Our analysis of the landscape defined by this gene set revealed not only the well documented linear crest, but also the presence of several peaks and valleys along that crest, a property that was also indicated in two other warm-blooded vertebrates represented by large gene databases, that is, mouse and chicken. GC2 is the sum of eight amino acid frequencies, whereas GC3 is linearly related to the GC level of the chromosomal region containing the gene. The landscapes therefore portray relations between proteins and the DNA environments of the genes that encode them.
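For readers unfamiliar with the two variables, GC2 and GC3 are the G+C fractions at the second and third codon positions of a coding sequence. The minimal Python sketch below shows one way to compute them. The function name and the handling of trailing partial codons are assumptions made for illustration, not the authors' pipeline.

```python
def gc_by_codon_position(cds):
    """Return G+C fractions at codon positions 1, 2 and 3 of a coding sequence.

    Assumptions: the sequence starts in frame, and any trailing bases that
    do not form a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop an incomplete final codon
    result = {}
    for pos in (1, 2, 3):
        bases = cds[pos - 1:usable:3]  # every third base, starting at this position
        gc = sum(base in "GC" for base in bases)
        result[f"GC{pos}"] = gc / len(bases) if bases else 0.0
    return result

# Example with a made-up three-codon sequence (ATG GCC TAA).
levels = gc_by_codon_position("ATGGCCTAA")
```

Plotting GC2 against GC3 for every gene in a curated set gives the joint distribution, or landscape, that the abstract analyses.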
Affiliation(s)
- Stéphane Cruveiller
- Laboratorio di Evoluzione Molecolare, Stazione Zoologica Anton Dohrn, 80121 Napoli, Italy
|
38
|
Abstract
Many non-coding sequences transcribed from the mammalian genome are proving to have important regulatory roles, but the functions of the majority remain mysterious. For decades, researchers have focused most of their attention on protein-coding genes and proteins. With the completion of the human and mouse genomes and the accumulation of data on the mammalian transcriptome, the focus now shifts to non-coding DNA sequences, RNA-coding genes and their transcripts. Many non-coding transcribed sequences are proving to have important regulatory roles, but the functions of the majority remain mysterious.
Affiliation(s)
- Svetlana A Shabalina
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
|
39
|
Uberbacher EC, Hyatt D, Shah M. GrailEXP and Genome Analysis Pipeline for genome annotation. Curr Protoc Bioinformatics 2004; Chapter 4:Unit 4.9. [PMID: 18428726 DOI: 10.1002/0471250953.bi0409s04] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2023]
Abstract
The Basic Protocol describes the use of GrailEXP, the latest version of the gene finding system from Oak Ridge National Laboratory. GrailEXP provides gene models, by making use of sequence similarity with Expressed Sequence Tags (ESTs) and known genes. GrailEXP also provides alternatively spliced constructs for each gene based on the available EST evidence. The Support Protocol describes the use of the Genome Analysis Pipeline, a web application which allows users to perform comprehensive sequence analysis by offering a selection from a wide choice of supported gene finders, other biological feature finders, and database searches.
|
40
|
Risk JM, Evans KE, Jones J, Langan JE, Rowbottom L, McRonald FE, Mills HS, Ellis A, Shaw JM, Leigh IM, Kelsell DP, Field JK. Characterization of a 500 kb region on 17q25 and the exclusion of candidate genes as the familial Tylosis Oesophageal Cancer (TOC) locus. Oncogene 2002; 21:6395-402. [PMID: 12214281 DOI: 10.1038/sj.onc.1205768] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.2] [Received: 05/08/2002] [Revised: 06/10/2002] [Accepted: 06/14/2002] [Indexed: 11/09/2022]
Abstract
The locus for a syndrome of focal palmoplantar keratoderma (Tylosis) associated with squamous cell oesophageal cancer (TOC) has been mapped to chromosome 17q25, a region frequently deleted in sporadic squamous cell oesophageal tumours. Further haplotype analysis described here, based on revised maps of marker order, has reduced the TOC minimal region to a genetic interval of 2 cM limited by the microsatellite markers D17S785 and D17S751. Partial sequence data and complete physical maps estimate the actual size of this region to be only 0.5 Mb. This analysis allowed the exclusion of proposed candidate tumour suppressor genes including MLL septin-like fusion (MSF), survivin, and deleted in multiple human cancer (DMC1). Computer analysis of sequence data from the minimal region identified 13 candidate genes and the presence of 50-70 other 'gene fragments' as ESTs and/or predicted exons and genes. Ten of the characterized genes were assayed for mutations but no disease-specific alterations were identified in the coding and promoter sequences. This region of chromosome 17q25 is, therefore, relatively gene-rich, containing 13 known and possibly as many as 50 predicted genes. Further mutation analysis of these predicted genes, and others possibly residing in the region, is required in order to identify the elusive TOC locus.
Affiliation(s)
- Janet M Risk
- Molecular Genetics and Oncology Group, Department of Clinical Dental Sciences, The University of Liverpool, Liverpool L69 3GN, UK.
|
41
|
Reymond A, Camargo AA, Deutsch S, Stevenson BJ, Parmigiani RB, Ucla C, Bettoni F, Rossier C, Lyle R, Guipponi M, de Souza S, Iseli C, Jongeneel CV, Bucher P, Simpson AJG, Antonarakis SE. Nineteen additional unpredicted transcripts from human chromosome 21. Genomics 2002; 79:824-32. [PMID: 12036297 DOI: 10.1006/geno.2002.6781] [Citation(s) in RCA: 41] [Impact Index Per Article: 1.9] [Indexed: 01/01/2023]
Abstract
The identification of all human chromosome 21 (HC21) genes is a necessary step in understanding the molecular pathogenesis of trisomy 21 (Down syndrome). The first analysis of the sequence of 21q included 127 previously characterized genes and predicted an additional 98 novel anonymous genes. Recently we evaluated the quality of this annotation by characterizing a set of HC21 open reading frames (C21orfs) identified by mapping spliced expressed sequence tags (ESTs) and predicted genes (PREDs), identified only in silico. This study underscored the limitations of in silico-only gene prediction, as many PREDs were incorrectly predicted. To refine the HC21 annotation, we have developed a reliable algorithm to extract and stringently map sequences that contain bona fide 3' transcript ends to the genome. We then created a specific 21q graphical display allowing an integrated view of the data that incorporates new ESTs as well as features such as CpG islands, repeats, and gene predictions. Using these tools we identified 27 new putative genes. To validate these, we sequenced previously cloned cDNAs and carried out RT-PCR, 5'- and 3'-RACE procedures, and comparative mapping. These approaches substantiated 19 new transcripts, thus increasing the HC21 gene count by 9.5%. These transcripts were likely not previously identified because they are small and encode small proteins. We also identified four transcriptional units that are spliced but contain no obvious open reading frame. The HC21 data presented here further emphasize that current gene prediction algorithms miss a substantial number of transcripts that nevertheless can be identified using a combination of experimental approaches and multiple refined algorithms.
Affiliation(s)
- Alexandre Reymond
- Division of Medical Genetics, University of Geneva Medical School, 1211 Geneva, Switzerland
|
42
|
Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW, Velculescu VE. Using the transcriptome to annotate the genome. Nat Biotechnol 2002; 20:508-12. [PMID: 11981567 DOI: 10.1038/nbt0502-508] [Citation(s) in RCA: 502] [Impact Index Per Article: 22.8] [Indexed: 11/09/2022]
Abstract
A remaining challenge for the human genome project involves the identification and annotation of expressed genes. The public and private sequencing efforts have identified approximately 15,000 sequences that meet stringent criteria for genes, such as correspondence with known genes from humans or other species, and have made another approximately 10,000-20,000 gene predictions of lower confidence, supported by various types of in silico evidence, including homology studies, domain searches, and ab initio gene predictions. These computational methods have limitations, both because they are unable to identify a significant fraction of genes and exons and because they are unable to provide definitive evidence about whether a hypothetical gene is actually expressed. As the in silico approaches identified a smaller number of genes than anticipated, we wondered whether high-throughput experimental analyses could be used to provide evidence for the expression of hypothetical genes and to reveal previously undiscovered genes. We describe here the development of such a method, called long serial analysis of gene expression (LongSAGE), an adaptation of the original SAGE approach that can be used to rapidly identify novel genes and exons.
Affiliation(s)
- Saurabh Saha
- Howard Hughes Medical Institute and the Sidney Kimmel Comprehensive Cancer Center, Baltimore, MD 21231, USA
|
43
|
Korke R, Rink A, Seow TK, Chung MCM, Beattie CW, Hu WS. Genomic and proteomic perspectives in cell culture engineering. J Biotechnol 2002; 94:73-92. [PMID: 11792453 DOI: 10.1016/s0168-1656(01)00420-5] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.4] [Indexed: 01/16/2023]
Abstract
In the last few years, the number of biologics produced by mammalian cells has been steadily increasing. The advances in cell culture engineering science have contributed significantly to this increase. A common path of product and process development has emerged in the last decade and the host cell lines frequently used have converged to only a few. Selection of cell clones, their adaptation to a desired growth environment, and improving their productivity have been key to developing a new process. However, the fundamental understanding of changes during the selection and adaptation process is still lacking. Some cells may undergo irreversible alteration at the genome level, some may exhibit changes in their gene expression pattern, while others may incur neither genetic reconstruction nor gene expression changes, but only modulation of various fluxes by changing nutrient/metabolite concentrations and enzyme activities. It is likely that the selection of cell clones and their adaptation to various culture conditions may involve alterations not only in cellular machinery directly related to the selected marker or adapted behavior, but also those which may or may not be essential for selection or adaptation. The genomic and proteomic research tools enable one to globally survey the alterations at mRNA and protein levels and to unveil their regulation. Undoubtedly, a better understanding of these cellular processes at the molecular level will lead to a better strategy for 'designing' producing cells. Herein the genomic and proteomic tools are briefly reviewed and their impact on cell culture engineering is discussed.
Affiliation(s)
- Rashmi Korke
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN, USA
|
44
|
Affiliation(s)
- Matthew W Hahn
- Department of Biology, Duke University, Durham, North Carolina 27708, USA.
|
45
|
Abstract
DNA microarray technology provides a means to examine large numbers of molecular changes related to a biological process in a high throughput manner. This review discusses plausible utilities of this technology in prostate cancer research, including definition of prostate cancer predisposition, global profiling of gene expression patterns associated with cancer initiation and progression, identification of new diagnostic and prognostic markers, and discovery of novel patient classification schemes. The technology, at present, has only been explored in a limited fashion in prostate cancer research. Some hurdles to be overcome are the high cost of the technology, insufficient sample size and repeated experiments, and the inadequate use of bioinformatics. With the completion of the Human Genome Project and the advance of several highly complementary technologies, such as laser capture microdissection, unbiased RNA amplification, customized functional arrays (eg, single-nucleotide polymorphism chips), and amenable bioinformatics software, this technology will become widely used by investigators in the field. The large amount of novel, unbiased hypotheses and insights generated by this technology is expected to have a significant impact on the diagnosis, treatment, and prevention of prostate cancer. Finally, this review emphasizes existing, but currently underutilized, data-mining tools, such as multivariate statistical analyses, neural networking, and machine learning techniques, to stimulate wider usage.
Affiliation(s)
- Shuk-Mei Ho
- Department of Surgery, University of Massachusetts Medical School, Room S4-746, 55 Lake Avenue North, Worcester, MA 01655, USA.
|
46
|
Affiliation(s)
- G K Wong
- University of Washington Genome Center, Department of Medicine, Fluke Hall, M/C 352145, Seattle, Washington 98195, USA.
|
47
|
Hogenesch JB, Ching KA, Batalov S, Su AI, Walker JR, Zhou Y, Kay SA, Schultz PG, Cooke MP. A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes. Cell 2001; 106:413-5. [PMID: 11534548 DOI: 10.1016/s0092-8674(01)00467-6] [Citation(s) in RCA: 146] [Impact Index Per Article: 6.3] [Indexed: 11/16/2022]
|
48
|
Wright FA, Lemon WJ, Zhao WD, Sears R, Zhuo D, Wang JP, Yang HY, Baer T, Stredney D, Spitzner J, Stutz A, Krahe R, Yuan B. A draft annotation and overview of the human genome. Genome Biol 2001; 2:RESEARCH0025. [PMID: 11516338 PMCID: PMC55322 DOI: 10.1186/gb-2001-2-7-research0025] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.4] [Received: 02/12/2001] [Revised: 04/04/2001] [Accepted: 06/01/2001] [Indexed: 11/28/2022]
Abstract
BACKGROUND The recent draft assembly of the human genome provides a unified basis for describing genomic structure and function. The draft is sufficiently accurate to provide useful annotation, enabling direct observations of previously inferred biological phenomena. RESULTS We report here a functionally annotated human gene index placed directly on the genome. The index is based on the integration of public transcript, protein, and mapping information, supplemented with computational prediction. We describe numerous global features of the genome and examine the relationship of various genetic maps with the assembly. In addition, initial sequence analysis reveals highly ordered chromosomal landscapes associated with paralogous gene clusters and distinct functional compartments. Finally, these annotation data were synthesized to produce observations of gene density and number that accord well with historical estimates. Such a global approach had previously been described only for chromosomes 21 and 22, which together account for 2.2% of the genome. CONCLUSIONS We estimate that the genome contains 65,000-75,000 transcriptional units, with exon sequences comprising 4%. The creation of a comprehensive gene index requires the synthesis of all available computational and experimental evidence.
Affiliation(s)
- Fred A Wright
- Division of Human Cancer Genetics, The Ohio State University, 420 W. 12th Avenue, Columbus, OH 43210, USA
| | - William J Lemon
- Division of Human Cancer Genetics, The Ohio State University, 420 W. 12th Avenue, Columbus, OH 43210, USA
| | - Wei D Zhao
- Division of Human Cancer Genetics, The Ohio State University, 420 W. 12th Avenue, Columbus, OH 43210, USA
| | - Russell Sears
- Division of Human Cancer Genetics, The Ohio State University, 420 W. 12th Avenue, Columbus, OH 43210, USA
| | - Degen Zhuo
- Division of Human Cancer Genetics, The Ohio State University, 420 W. 12th Avenue, Columbus, OH 43210, USA
| | - Jian-Ping Wang
- Division of Human Cancer Genetics, The Ohio State University, 420 W. 12th Avenue, Columbus, OH 43210, USA
| | - Hee-Yung Yang
- LabBook.com, Busch Boulevard, Columbus, OH 43229, USA
| | - Troy Baer
- Ohio Supercomputer Center (OSC), Kinnear Road, Columbus, OH 43212, USA
| | - Don Stredney
- Ohio Supercomputer Center (OSC), Kinnear Road, Columbus, OH 43212, USA
- Department of Computer and Information Science, The Ohio State University, Neil Avenue, Columbus, OH 43210, USA
| | - Joe Spitzner
- LabBook.com, Busch Boulevard, Columbus, OH 43229, USA
| | - Al Stutz
- Ohio Supercomputer Center (OSC), Kinnear Road, Columbus, OH 43212, USA
- Department of Computer and Information Science, The Ohio State University, Neil Avenue, Columbus, OH 43210, USA
| | - Ralf Krahe
- Division of Human Cancer Genetics, The Ohio State University, 420 W. 12th Avenue, Columbus, OH 43210, USA
| | - Bo Yuan
- Division of Human Cancer Genetics, The Ohio State University, 420 W. 12th Avenue, Columbus, OH 43210, USA
|
49
|
Wiemann S, Weil B, Wellenreuther R, Gassenhuber J, Glassl S, Ansorge W, Böcher M, Blöcker H, Bauersachs S, Blum H, Lauber J, Düsterhöft A, Beyer A, Köhrer K, Strack N, Mewes HW, Ottenwälder B, Obermaier B, Tampe J, Heubner D, Wambutt R, Korn B, Klein M, Poustka A. Toward a catalog of human genes and proteins: sequencing and analysis of 500 novel complete protein coding human cDNAs. Genome Res 2001. [PMID: 11230166 DOI: 10.1101/gr.154701] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.6] [Indexed: 11/24/2022]
Abstract
With the complete human genomic sequence being unraveled, the focus will shift to gene identification and to the functional analysis of gene products. The generation of a set of cDNAs, both sequences and physical clones, which contains the complete and noninterrupted protein coding regions of all human genes will provide the indispensable tools for the systematic and comprehensive analysis of protein function to eventually understand the molecular basis of man. Here we report the sequencing and analysis of 500 novel human cDNAs containing the complete protein coding frame. Assignment to functional categories was possible for 52% (259) of the encoded proteins, the remaining fraction having no similarities with known proteins. By aligning the cDNA sequences with the sequences of the finished chromosomes 21 and 22 we identified a number of genes that either had been completely missed in the analysis of the genomic sequences or had been wrongly predicted. Three of these genes appear to be present in several copies. We conclude that full-length cDNA sequencing continues to be crucial also for the accurate identification of genes. The set of 500 novel cDNAs, and another 1000 full-coding cDNAs of known transcripts we have identified, adds up to cDNA representations covering 2%-5% of all human genes. We thus substantially contribute to the generation of a gene catalog, consisting of both full-coding cDNA sequences and clones, which should be made freely available and will become an invaluable tool for detailed functional studies.
Affiliation(s)
- S Wiemann
- Molecular Genome Analysis, German Cancer Research Center, 69120 Heidelberg, Germany.
|
50
|
Stevanovic S, Bohley P. Proteome analysis by three-dimensional protein separation: turnover of cytosolic proteins in hepatocytes. Biol Chem 2001; 382:677-82. [PMID: 11405231 DOI: 10.1515/bc.2001.080] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Abstract
We performed a three-dimensional separation of pulse-chase dual-labelled rat liver cytosolic proteins using hydrophobic interaction chromatography, isoelectric focusing, and SDS gel electrophoresis. Due to very different expression rates but similar size and pI of rat liver cytosolic proteins, we demonstrate the impossibility of successful two-dimensional separations of such complex protein mixtures. A pre-fractionation of proteins by hydrophobic interaction chromatography is therefore recommended prior to two-dimensional gel electrophoresis. Our studies confirmed the correlation between protein turnover rates and surface hydrophobicity.
Affiliation(s)
- S Stevanovic
- Institut für Zellbiologie, Abteilung Immunologie, Eberhard-Karls-Universität, Tübingen, Germany
|