1
Trost B, Loureiro LO, Scherer SW. Discovery of genomic variation across a generation. Hum Mol Genet 2021; 30:R174-R186. [PMID: 34296264 PMCID: PMC8490016 DOI: 10.1093/hmg/ddab209] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/28/2021] [Revised: 07/09/2021] [Accepted: 07/19/2021] [Indexed: 11/12/2022]
Abstract
Over the past 30 years (the timespan of a generation), advances in genomics technologies have revealed tremendous and unexpected variation in the human genome and have provided increasingly accurate answers to long-standing questions of how much genetic variation exists in human populations and to what degree the DNA complement changes between parents and offspring. Tracking the characteristics of these inherited and spontaneous (or de novo) variations has been the basis of the study of human genetic disease. From genome-wide microarray and next-generation sequencing scans, we now know that each human genome contains over 3 million single nucleotide variants when compared with the ~ 3 billion base pairs in the human reference genome, along with roughly an order of magnitude more DNA—approximately 30 megabase pairs (Mb)—being ‘structurally variable’, mostly in the form of indels and copy number changes. Additional large-scale variations include balanced inversions (average of 18 Mb) and complex, difficult-to-resolve alterations. Collectively, ~1% of an individual’s genome will differ from the human reference sequence. When comparing across a generation, fewer than 100 new genetic variants are typically detected in the euchromatic portion of a child’s genome. Driven by increasingly higher-resolution and higher-throughput sequencing technologies, newer and more accurate databases of genetic variation (for instance, more comprehensive structural variation data and phasing of combinations of variants along chromosomes) of worldwide populations will emerge to underpin the next era of discovery in human molecular genetics.
Affiliation(s)
- Brett Trost
- The Centre for Applied Genomics and Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Livia O Loureiro
- The Centre for Applied Genomics and Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Stephen W Scherer
- The Centre for Applied Genomics and Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada; McLaughlin Centre and Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
2
Prior FW, Clark K, Commean P, Freymann J, Jaffe C, Kirby J, Moore S, Smith K, Tarbox L, Vendt B, Marquez G. TCIA: An information resource to enable open science. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2015; 2013:1282-5. [PMID: 24109929 DOI: 10.1109/embc.2013.6609742] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Indexed: 11/10/2022]
Abstract
Reusable, publicly available data is a pillar of open science. The Cancer Imaging Archive (TCIA) is an open image archive service supporting cancer research. TCIA collects, de-identifies, curates and manages rich collections of oncology image data. Image data sets have been contributed by 28 institutions and additional image collections are underway. Since June of 2011, more than 2,000 users have registered to search and access data from this freely available resource. TCIA encourages and supports cancer-related open science communities by hosting and managing the image archive, providing project wiki space and searchable metadata repositories. The success of TCIA is measured by the number of active research projects it enables (>40) and the number of scientific publications and presentations that are produced using data from TCIA collections (39).
3
Sankaranarayanan K, Nikjoo H. Genome-based, mechanism-driven computational modeling of risks of ionizing radiation: The next frontier in genetic risk estimation? MUTATION RESEARCH-REVIEWS IN MUTATION RESEARCH 2014; 764:1-15. [PMID: 26041262 DOI: 10.1016/j.mrrev.2014.12.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Accepted: 12/18/2014] [Indexed: 10/24/2022]
Abstract
Research activity in the field of estimation of genetic risks of ionizing radiation to human populations started in the late 1940s and now appears to be passing through a plateau phase. This paper provides a background to the concepts, findings and methods of risk estimation that guided the field through the period of its growth to the beginning of the 21st century. It draws attention to several key facts: (a) thus far, genetic risk estimates have been made indirectly using mutation data collected in mouse radiation studies; (b) important uncertainties and unsolved problems remain, one notable example being that we still do not know the sensitivity of human female germ cells to radiation-induced mutations; and (c) the concept that dominated the field thus far, namely, that radiation exposures to germ cells can result in single gene diseases in the descendants of those exposed has been replaced by the concept that radiation exposure can cause DNA deletions, often involving more than one gene. Genetic risk estimation now encompasses work devoted to studies on DNA deletions induced in human germ cells, their expected frequencies, and phenotypes and associated clinical consequences in the progeny. We argue that the time is ripe to embark on a human genome-based, mechanism-driven, computational modeling of genetic risks of ionizing radiation, and we present a provisional framework for catalyzing research in the field in the 21st century.
Affiliation(s)
- K Sankaranarayanan
- Radiation Biophysics Group, Department of Oncology-Pathology, Karolinska Institutet, Box 260, P9-02, Stockholm SE 17176, Sweden
- H Nikjoo
- Radiation Biophysics Group, Department of Oncology-Pathology, Karolinska Institutet, Box 260, P9-02, Stockholm SE 17176, Sweden.
4
Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J Digit Imaging 2014; 26:1045-57. [PMID: 23884657 DOI: 10.1007/s10278-013-9622-7] [Citation(s) in RCA: 1768] [Impact Index Per Article: 176.8] [Indexed: 02/07/2023]
Abstract
The National Institutes of Health have placed significant emphasis on sharing of research data to support secondary research. Investigators have been encouraged to publish their clinical and imaging data as part of fulfilling their grant obligations. Realizing it was not sufficient to merely ask investigators to publish their collection of imaging and clinical data, the National Cancer Institute (NCI) created the open source National Biomedical Image Archive software package as a mechanism for centralized hosting of cancer related imaging. NCI has contracted with Washington University in Saint Louis to create The Cancer Imaging Archive (TCIA)-an open-source, open-access information resource to support research, development, and educational initiatives utilizing advanced medical imaging of cancer. In its first year of operation, TCIA accumulated 23 collections (3.3 million images). Operating and maintaining a high-availability image archive is a complex challenge involving varied archive-specific resources and driven by the needs of both image submitters and image consumers. Quality archives of any type (traditional library, PubMed, refereed journals) require management and customer service. This paper describes the management tasks and user support model for TCIA.
Affiliation(s)
- Kenneth Clark
- Mallinckrodt Institute of Radiology, Washington University School of Medicine, ERL 510 South Kingshighway Boulevard, St. Louis, MO, 63110, USA,
5
Kay HY, Wu H, Lee SI, Kim SG. Applications of genetically modified tools to safety assessment in drug development. Toxicol Res 2010; 26:1-8. [PMID: 24278499 PMCID: PMC3834461 DOI: 10.5487/tr.2010.26.1.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 01/10/2010] [Revised: 01/26/2010] [Accepted: 01/26/2010] [Indexed: 02/01/2023]
Abstract
The process of new drug development consists of several stages; after potential candidate compounds are identified, preclinical studies using animal models link the laboratory and human clinical trials. Among the many steps in preclinical studies, toxicology and safety assessments help identify potential adverse events and provide a rationale for setting the initial doses in clinical trials. Gene modulation is one of the important tools of modern biology, and is commonly employed to examine the function of genes of interest. Advances in new drug development have been achieved through the rapid growth of information on target selection and validation using genetically modified animal models as well as cells. In this review, recent trends in genetic modification methods are discussed with reference to safety assessments, and exemplary applications of gene-modulating tools to tests in new drug development are summarized.
Affiliation(s)
- Hee Yeon Kay
- College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University
6
Abstract
As our ability to generate sequencing data continues to increase, data analysis is replacing data generation as the rate-limiting step in genomics studies. Here we provide a guide to genomic data visualization tools that facilitate analysis tasks by enabling researchers to explore, interpret and manipulate their data, and in some cases perform on-the-fly computations. We will discuss graphical methods designed for the analysis of de novo sequencing assemblies and read alignments, genome browsing, and comparative genomics, highlighting the strengths and limitations of these approaches and the challenges ahead.
7
Owner controlled data exchange in nutrigenomic collaborations: the NuGO information network. GENES AND NUTRITION 2009; 4:113-22. [PMID: 19408032 DOI: 10.1007/s12263-009-0123-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/09/2008] [Accepted: 04/16/2009] [Indexed: 10/20/2022]
Abstract
New 'omics' technologies are changing nutritional sciences research. They make it possible to tackle increasingly complex questions but also increase the need for collaboration between research groups. An important challenge for successful collaboration is the management and structured exchange of information that accompanies data-intense technologies. NuGO, the European Nutrigenomics Organization, the major collaborating network in molecular nutritional sciences, is supporting the application of modern information technologies in this area. We have developed and implemented a concept for data management and computing infrastructure that supports collaboration between nutrigenomics researchers. The system fills the gap between "private" storage with occasional file sharing by email and the use of centralized databases. It provides flexible tools to share data, including during experiments, while preserving ownership. The NuGO Information Network is a decentralized, distributed system for data exchange based on standard web technology. Secure access to data, maintained by the individual researcher, is enabled by web services based on the BioMoby framework. A central directory provides information about available web services. The flexibility of the infrastructure allows a wide variety of services for data processing and integration by combining several web services, including public services. Therefore, this integrated information system is suited for other research collaborations.
8
Erinjeri JP, Picus D, Prior FW, Rubin DA, Koppel P. Development of a Google-based search engine for data mining radiology reports. J Digit Imaging 2008; 22:348-56. [PMID: 18392657 DOI: 10.1007/s10278-008-9110-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 10/26/2007] [Revised: 01/17/2008] [Accepted: 02/07/2008] [Indexed: 10/22/2022]
Abstract
The aim of this study is to develop a secure, Google-based data-mining tool for radiology reports using free and open source technologies and to explore its use within an academic radiology department. A Health Insurance Portability and Accountability Act (HIPAA)-compliant data repository, search engine and user interface were created to facilitate treatment, operations, and reviews preparatory to research. The Institutional Review Board waived review of the project, and informed consent was not required. Comprising 7.9 GB of disk space, 2.9 million text reports were downloaded from our radiology information system to a fileserver. Extensible markup language (XML) representations of the reports were indexed using Google Desktop Enterprise search engine software. A hypertext markup language (HTML) form allowed users to submit queries to Google Desktop, and Google's XML response was interpreted by a practical extraction and report language (PERL) script, presenting ranked results in a web browser window. The query, reason for search, results, and documents visited were logged to maintain HIPAA compliance. Indexing averaged approximately 25,000 reports per hour. Keyword search of a common term like "pneumothorax" yielded the first ten most relevant results of 705,550 total results in 1.36 s. Keyword search of a rare term like "hemangioendothelioma" yielded the first ten most relevant results of 167 total results in 0.23 s; retrieval of all 167 results took 0.26 s. Data mining tools for radiology reports will improve the productivity of academic radiologists in clinical, educational, research, and administrative tasks. By leveraging existing knowledge of Google's interface, radiologists can quickly perform useful searches.
Affiliation(s)
- Joseph P Erinjeri
- Mallinckrodt Institute of Radiology, Washington University School of Medicine, 510 South Kingshighway Boulevard, Campus Box 8131, Saint Louis, MO 63110, USA.
9
Scheuermann MO, Tajbakhsh J, Kurz A, Saracoglu K, Eils R, Lichter P. Topology of genes and nontranscribed sequences in human interphase nuclei. Exp Cell Res 2005; 301:266-79. [PMID: 15530862 DOI: 10.1016/j.yexcr.2004.08.031] [Citation(s) in RCA: 76] [Impact Index Per Article: 4.0] [Received: 03/31/2004] [Revised: 07/03/2004] [Indexed: 01/29/2023]
Abstract
Knowledge about the functional impact of the topological organization of DNA sequences within interphase chromosome territories is still sparse. Of the few analyzed single copy genomic DNA sequences, the majority had been found to localize preferentially at the chromosome periphery or to loop out from chromosome territories. By means of dual-color fluorescence in situ hybridization (FISH), immunolabeling, confocal microscopy, and three-dimensional (3D) image analysis, we analyzed the intraterritorial and nuclear localization of 10 genomic fragments of different sequence classes in four different human cell types. The localization of three muscle-specific genes FLNA, NEB, and TTN, the oncogene BCL2, the tumor suppressor gene MADH4, and five putatively nontranscribed genomic sequences was predominantly in the periphery of the respective chromosome territories, independent from transcriptional status and from GC content. In interphase nuclei, the noncoding sequences were only rarely found associated with heterochromatic sites marked by the satellite III DNA D1Z1 or clusters of mammalian heterochromatin proteins (HP1alpha, HP1beta, HP1gamma). However, the nontranscribed sequences were found predominantly at the nuclear periphery or at the nucleoli, whereas genes tended to localize on chromosome surfaces exposed to the nuclear interior.
Affiliation(s)
- Markus O Scheuermann
- Division of Molecular Genetics, Deutsches Krebsforschungszentrum, D-69120 Heidelberg, Germany
10
Zhang Z, Carriero N, Gerstein M. Comparative analysis of processed pseudogenes in the mouse and human genomes. Trends Genet 2004; 20:62-7. [PMID: 14746985 DOI: 10.1016/j.tig.2003.12.005] [Citation(s) in RCA: 162] [Impact Index Per Article: 8.1] [Indexed: 11/26/2022]
Abstract
Pseudogenes are important resources in evolutionary and comparative genomics because they provide molecular records of the ancient genes that existed in the genome millions of years ago. We have systematically identified approximately 5000 processed pseudogenes in the mouse genome, and estimated that approximately 60% are lineage specific, created after the mouse and human diverged. In both mouse and human genomes, similar types of genes give rise to many processed pseudogenes. These tend to be housekeeping genes, which are highly expressed in the germ line. Ribosomal-protein genes, in particular, form the largest sub-group. The processed pseudogenes in the mouse occur with a distinctly different chromosomal distribution than LINEs or SINEs - preferentially in GC-poor regions. Finally, the age distribution of mouse-processed pseudogenes closely resembles that of LINEs, in contrast to human, where the age distribution closely follows Alus (SINEs).
Affiliation(s)
- Zhaolei Zhang
- Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520, USA
11
Holden S, Raymond FL. The human gene CXorf17 encodes a member of a novel family of putative transmembrane proteins: cDNA cloning and characterization of CXorf17 and its mouse ortholog orf34. Gene 2004; 318:149-61. [PMID: 14585507 DOI: 10.1016/s0378-1119(03)00770-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 01/16/2023]
Abstract
We report the identification and cloning of a novel human gene, CXorf17, together with its mouse ortholog, orf34. The human and mouse transcripts were cloned from brain cDNA and encode deduced proteins of 1096 and 1091 amino acids, respectively. These proteins are 92% identical and 95% similar at the protein level. CXorf17 appears to be expressed at low levels and could be detected by RT-PCR in several adult and fetal human tissues. Analysis of the deduced amino acid sequence identified five putative transmembrane domains but no significant homology to previously described protein domains or sequence motifs. The CXorf17 protein has homology to two other non-annotated human proteins, C9orf10 and BC012177, the sequence similarity between them being strongest across two discrete domains of 250-270 amino acids in the N- and C-terminal parts of their sequences. We propose that these proteins belong to a previously undescribed family of putative transmembrane proteins. The identification of ESTs coding for similar proteins in other chordates but not lower eukaryotes suggests that these proteins may have first evolved during early chordate evolution. CXorf17 consists of 16 coding exons and maps to Xp11.22, approximately 14 kb telomeric to PRKWNK3 and 27 kb centromeric to KIAA1111. Its identification contributes to the annotation of expressed genes in the proximal part of the X chromosome.
Affiliation(s)
- Simon Holden
- Department of Medical Genetics, Cambridge Institute for Medical Research, Addenbrooke's Hospital Box 139, Hills Road, Cambridge CB2 2XY, UK
12
Zhang Z, Harrison PM, Liu Y, Gerstein M. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res 2004; 13:2541-58. [PMID: 14656962 PMCID: PMC403796 DOI: 10.1101/gr.1429003] [Citation(s) in RCA: 313] [Impact Index Per Article: 15.7] [Indexed: 11/25/2022]
Abstract
Processed pseudogenes were created by reverse-transcription of mRNAs; they provide snapshots of ancient genes existing millions of years ago in the genome. To find them in the present-day human, we developed a pipeline using features such as intron-absence, frame-disruption, polyadenylation, and truncation. This has enabled us to identify in recent genome drafts approximately 8000 processed pseudogenes (distributed from http://pseudogene.org). Overall, processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides. Their chromosomal distribution appears random and dispersed, with the numbers on chromosomes proportional to length, suggesting sustained "bombardment" over evolution. However, it does vary with GC-content: Processed pseudogenes occur mostly in intermediate GC-content regions. This is similar to Alus but contrasts with functional genes and L1-repeats. Pseudogenes, moreover, have age profiles similar to Alus. The number of pseudogenes associated with a given gene follows a power-law relationship, with a few genes giving rise to many pseudogenes and most giving rise to few. The prevalence of processed pseudogenes agrees well with germ-line gene expression. Highly expressed ribosomal proteins account for approximately 20% of the total. Other notables include cyclophilin-A, keratin, GAPDH, and cytochrome c.
Affiliation(s)
- Zhaolei Zhang
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
13
Kim SW. Body Changes with Aging and GH Replacement as Antiaging Therapy. JOURNAL OF THE KOREAN MEDICAL ASSOCIATION 2004. [DOI: 10.5124/jkma.2004.47.4.342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Affiliation(s)
- Sung-Woon Kim
- Department of Internal Medicine / GH Clinic, Kyunghee University College of Medicine & Hospital, Korea.
14
Enright AJ, Kunin V, Ouzounis CA. Protein families and TRIBES in genome sequence space. Nucleic Acids Res 2003; 31:4632-8. [PMID: 12888524 PMCID: PMC169885 DOI: 10.1093/nar/gkg495] [Citation(s) in RCA: 105] [Impact Index Per Article: 5.0] [Indexed: 11/14/2022]
Abstract
Accurate detection of protein families allows assignment of protein function and the analysis of functional diversity in complete genomes. Recently, we presented a novel algorithm called TribeMCL for the detection of protein families that is both accurate and efficient. This method allows family analysis to be carried out on a very large scale. Using TribeMCL, we have generated a resource called TRIBES that contains protein family information, comprising annotations, protein sequence alignments and phylogenetic distributions describing 311 257 proteins from 83 completely sequenced genomes. The analysis of at least 60 934 detected protein families reveals that, with the essential families excluded, paralogy levels are similar between prokaryotes, irrespective of genome size. The number of essential families is estimated to be between 366 and 426. We also show that the currently known space of protein families is scale free and discuss the implications of this distribution. In addition, we show that smaller families are often formed by shorter proteins and discuss the reasons for this intriguing pattern. Finally, we analyse the functional diversity of protein families in entire genome sequences. The TRIBES protein family resource is accessible at http://www.ebi.ac.uk/research/cgg/tribes/.
Affiliation(s)
- Anton J Enright
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK
15
Bernards A. GAPs galore! A survey of putative Ras superfamily GTPase activating proteins in man and Drosophila. BIOCHIMICA ET BIOPHYSICA ACTA 2003; 1603:47-82. [PMID: 12618308 DOI: 10.1016/s0304-419x(02)00082-3] [Citation(s) in RCA: 154] [Impact Index Per Article: 7.3] [Indexed: 10/27/2022]
Abstract
Typical members of the Ras superfamily of small monomeric GTP-binding proteins function as regulators of diverse processes by cycling between biologically active GTP- and inactive GDP-bound conformations. Proteins that control this cycling include guanine nucleotide exchange factors or GEFs, which activate Ras superfamily members by catalyzing GTP for GDP exchange, and GTPase activating proteins or GAPs, which accelerate the low intrinsic GTP hydrolysis rate of typical Ras superfamily members, thus causing their inactivation. Two among the latter class of proteins have been implicated in common genetic disorders associated with an increased cancer risk, neurofibromatosis-1, and tuberous sclerosis. To facilitate genetic analysis, I surveyed Drosophila and human sequence databases for genes predicting proteins related to GAPs for Ras superfamily members. Remarkably, close to 0.5% of genes in both species (173 human and 64 Drosophila genes) predict proteins related to GAPs for Arf, Rab, Ran, Rap, Ras, Rho, and Sar family GTPases. Information on these genes has been entered into a pair of relational databases, which can be used to identify evolutionary conserved proteins that are likely to serve basic biological functions, and which can be updated when definitive information on the coding potential of both genomes becomes available.
Affiliation(s)
- André Bernards
- Massachusetts General Hospital Cancer Center, Building 149, 13th Street, Charlestown, MA 02129-2000, USA.
16
Abstract
New directions in computational methods for the prediction of protein function are discussed. THEMATICS, a method for the location and characterization of the active sites of enzymes, is featured. THEMATICS, for Theoretical Microscopic Titration Curves, is based on well-established finite-difference Poisson-Boltzmann methods for computing the electric field function of a protein. THEMATICS requires only the structure of the subject protein and thus may be applied to proteins that bear no similarity in structure or sequence to any previously characterized protein. The unique features of catalytic sites in proteins are discussed. Discussion of the chemical basis for the predictive powers of THEMATICS is featured in this paper. Some results are given for three illustrative examples: HIV-1 protease, human apurinic/apyrimidinic endonuclease, and human adenosine kinase.
Affiliation(s)
- Ihsan A Shehadi
- Department of Chemistry, United Arab Emirates University, Al-Ain, United Arab Emirates
17
Zhang Z, Harrison P, Gerstein M. Identification and analysis of over 2000 ribosomal protein pseudogenes in the human genome. Genome Res 2002; 12:1466-82. [PMID: 12368239 PMCID: PMC187539 DOI: 10.1101/gr.331902] [Citation(s) in RCA: 146] [Impact Index Per Article: 6.6] [Received: 04/03/2002] [Accepted: 08/12/2002] [Indexed: 11/24/2022]
Abstract
Mammals have 79 ribosomal proteins (RP). Using a systematic procedure based on sequence-homology, we have comprehensively identified pseudogenes of these proteins in the human genome. Our assignments are available at http://www.pseudogene.org or http://bioinfo.mbb.yale.edu/genome/pseudogene. In total, we found 2090 processed pseudogenes and 16 duplications of RP genes. In relation to the matching parent protein, each of the processed pseudogenes has an average relative sequence length of 97% and an average sequence identity of 76%. A small number (258) of them do not contain obvious disablements (stop codons or frameshifts) and, therefore, could be mistaken as functional genes, and 178 are disrupted by one or more repetitive elements. On average, processed pseudogenes have a longer truncation at the 5' end than the 3' end, consistent with the target-primed-reverse-transcription (TPRT) mechanism. Interestingly, on chromosome 16, an RPL26 processed pseudogene was found in the intron region of a functional RPS2 gene. The large-scale distribution of RP pseudogenes throughout the genome appears to result, chiefly, from random insertions with the numbers on each chromosome, consequently, proportional to its size. In contrast to RP genes, the RP pseudogenes have the highest density in GC-intermediate regions (41%-46%) of the genome, with the density pattern being between that of LINEs and Alus. This can be explained by a negative selection theory as we observed that GC-rich RP pseudogenes decay faster in GC-poor regions. Also, we observed a correlation between the number of processed pseudogenes and the GC content of the associated functional gene, i.e., relatively GC-poor RPs have more processed pseudogenes. This ranges from 145 pseudogenes for RPL21 down to 3 pseudogenes for RPL14. We were able to date the RP pseudogenes based on their sequence divergence from present-day RP genes, finding an age distribution similar to that for Alus. 
The distribution is consistent with a decline in retrotransposition activity in the hominid lineage during the last 40 Myr. We discuss the implications for retrotransposon stability and genome dynamics based on these new findings.
Affiliation(s)
- Zhaolei Zhang
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
18
Dai H, Meyer M, Stepaniants S, Ziman M, Stoughton R. Use of hybridization kinetics for differentiating specific from non-specific binding to oligonucleotide microarrays. Nucleic Acids Res 2002; 30:e86. [PMID: 12177314 PMCID: PMC134259 DOI: 10.1093/nar/gnf085] [Citation(s) in RCA: 114] [Impact Index Per Article: 5.2] [Indexed: 11/14/2022]
Abstract
Hybridization kinetics were found to be significantly different for specific and non-specific binding of labeled cRNA to surface-bound oligonucleotides on microarrays. We show direct evidence that, in a complex sample, specific binding takes longer to reach hybridization equilibrium than non-specific binding. We find that this property can be used to estimate and to correct for the hybridization contributed by non-specific binding. Useful applications are illustrated, including the selection of superior oligonucleotides and the reduction of false positives in exon identification.
Affiliation(s)
- Hongyue Dai
- Rosetta Inpharmatics, 12040 115th Avenue NE, Kirkland, WA 98034, USA
|
19
|
Tolle R. Information technology tools for efficient SNP studies. American Journal of Pharmacogenomics: Genomics-Related Research in Drug Development and Clinical Practice 2002; 1:303-14. [PMID: 12083962 DOI: 10.2165/00129785-200101040-00007] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 12/31/2022]
Abstract
We are currently facing a new era of studies involving single nucleotide polymorphisms (SNPs). This increased attention is stimulated by interest in individual differences in disease susceptibility as well as individual responses to drug treatment and the falling cost of genotyping. This review is a guide to the numerous public data repositories and Information Technology (IT) tools that may aid planning, preparation, running and analysis of studies involving SNPs. I will also highlight areas where researchers will have to resort to home-made IT solutions. Unfortunately, both information and IT tools are scattered throughout the internet and a lack of data exchange conventions can hamper the efficient use of these existing resources. This can lead to situations where the planning, preparation and analysis of a SNP study can actually cost more than the actual genotyping. We propose that only a customizable backbone IT infrastructure for SNP studies can help reduce costs associated with SNP data handling and tool launching.
Affiliation(s)
- R Tolle
- LION bioscience AG, Heidelberg, Germany.
|
20
|
Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Staerfeldt HH, Rapacki K, Workman C, Andersen CAF, Knudsen S, Krogh A, Valencia A, Brunak S. Prediction of human protein function from post-translational modifications and localization features. J Mol Biol 2002; 319:1257-65. [PMID: 12079362 DOI: 10.1016/s0022-2836(02)00379-0] [Citation(s) in RCA: 242] [Impact Index Per Article: 11.0] [Indexed: 10/27/2022]
Abstract
We have developed an entirely sequence-based method that identifies and integrates relevant features that can be used to assign proteins of unknown function to functional classes, and enzyme categories for enzymes. We show that strategies for the elucidation of protein function may benefit from a number of functional attributes that are more directly related to the linear sequence of amino acids, and hence easier to predict, than protein structure. These attributes include features associated with post-translational modifications and protein sorting, but also much simpler aspects such as the length, isoelectric point and composition of the polypeptide chain.
Affiliation(s)
- L J Jensen
- Center for Biological Sequence Analysis, Biocentrum-DTU, Building 208, The Technical University of Denmark, DK-2800 Lyngby, Denmark
|
21
|
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res 2002; 12:996-1006. [PMID: 12045153 PMCID: PMC186604 DOI: 10.1101/gr.229102] [Citation(s) in RCA: 6677] [Impact Index Per Article: 303.5] [Indexed: 02/06/2023]
Abstract
As vertebrate genome sequences near completion and research refocuses to their analysis, the issue of effective genome annotation display becomes critical. A mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with several dozen aligned annotation tracks, is provided at http://genome.ucsc.edu. This browser displays assembly contigs and gaps, mRNA and expressed sequence tag alignments, multiple gene predictions, cross-species homologies, single nucleotide polymorphisms, sequence-tagged sites, radiation hybrid data, transposon repeats, and more as a stack of coregistered tracks. Text and sequence-based searches provide quick and precise access to any region of specific interest. Secondary links from individual features lead to sequence details and supplementary off-site databases. One-half of the annotation tracks are computed at the University of California, Santa Cruz from publicly available sequence data; collaborators worldwide provide the rest. Users can stably add their own custom tracks to the browser for educational or research purposes. The conceptual and technical framework of the browser, its underlying MYSQL database, and overall use are described. The web site currently serves over 50,000 pages per day to over 3000 different users.
Affiliation(s)
- W James Kent
- Department of Molecular, Cellular, and Developmental Biology, University of California, Santa Cruz, CA 95064, USA.
|
23
|
Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002; 30:1575-84. [PMID: 11917018 PMCID: PMC101833 DOI: 10.1093/nar/30.7.1575] [Citation(s) in RCA: 2316] [Impact Index Per Article: 105.3] [Indexed: 01/18/2023] Open
Abstract
Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.
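The Markov cluster (MCL) step that this abstract relies on can be sketched as alternating expansion (matrix squaring) and inflation (element-wise powering followed by column normalization) on a similarity graph. The sketch below is a generic, minimal MCL in pure Python and not the TRIBE-MCL implementation; the inflation value, the iteration count, the added self-loops, and the toy similarity matrix are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / total if total else 0.0
    return out


def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, inflation=2.0, iterations=50):
    """Cluster nodes of a symmetric similarity matrix with basic MCL."""
    n = len(similarity)
    # Add self-loops (a common MCL convention), then make columns stochastic.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)  # expansion: flow spreads along longer walks
        m = normalize_columns([[x ** inflation for x in row] for row in m])  # inflation
    # Each non-empty row lists the nodes attracted to that row's node.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


# Toy input: proteins 0-2 are mutually similar, as are proteins 3-4.
toy_similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
families = mcl(toy_similarity)
```

In a real setting the matrix entries would come from precomputed sequence similarity scores, and a sparse matrix library would replace the dense lists used here.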
Affiliation(s)
- A J Enright
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK.
|
24
|
Abstract
The review begins by providing a brief typology of biological databases on the Internet, illustrated by examples of the most influential resources of each kind. We then take an insider look at one typical on-line genomic resource -- the yeast genome database hosted at the Munich Information Center for Protein Sequences (MIPS) -- and explain how and why it has evolved from a basic sequence repository to a multidomain knowledge base. The role of community efforts in curating and annotating genome data is discussed. The crucial role of data integration and interoperability in developing next-generation genomic facilities is underscored.
Affiliation(s)
- Dmitrij Frishman
- Institute for Bioinformatics, GSF - National Research Center for Environment and Health, Ingolstädter Landstrasse 1, 85764 Neuherberg, Germany.
|
25
|
Harrison PM, Hegyi H, Balasubramanian S, Luscombe NM, Bertone P, Echols N, Johnson T, Gerstein M. Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22. Genome Res 2002; 12:272-80. [PMID: 11827946 PMCID: PMC155275 DOI: 10.1101/gr.207102] [Citation(s) in RCA: 151] [Impact Index Per Article: 6.9] [Indexed: 11/25/2022]
Abstract
We have developed an initial approach for annotating and surveying pseudogenes in the human genome. We search human genomic DNA for regions that are similar to known protein sequences and contain obvious disablements (i.e., mid-sequence stop codons or frameshifts), while ensuring minimal overlap with annotations of known genes. Pseudogenes can be divided into "processed" and "nonprocessed"; the former are reverse transcribed from mRNA (and therefore have no intron structure), whereas the latter presumably arise from genomic duplications. We annotate putative processed pseudogenes based on whether there is a continuous span of homology that is >70% of the length of the closest matching human protein (i.e., with introns removed), or whether there is evidence of polyadenylation. We have applied our approach to chromosomes 21 and 22, the first parts of the human genome completely sequenced, finding 190 new pseudogene annotations beyond the 264 reported by the sequencing centers. In total, on chromosomes 21 and 22, there are 189 processed pseudogenes, 195 nonprocessed pseudogenes, and, additionally, 70 pseudogenic immunoglobulin gene segments. (Detailed assignments are available at http://bioinfo.mbb.yale.edu/genome/pseudogene or http://genecensus.org/pseudogene.) By extrapolation, we predict that there could be up to approximately 20,000 pseudogenes in the whole human genome, with a little more than half of them processed. We have determined the main populations and clusters of pseudogenes on chromosomes 21 and 22. There are notable excesses of pseudogenes relative to genes near the centromeres of both chromosomes, indicating the existence of pseudogenic "hot-spots" in the genome. We have looked at the distribution of InterPro families and Gene Ontology (GO) functional categories in our pseudogenes. 
Overall, the families in both processed and nonprocessed pseudogene populations occur according to a similar power-law distribution as that found for the occurrence of gene families, with a few big families and many small ones. The processed population is, in particular, enriched in highly expressed ribosomal-protein sequences (approximately 20%), which appear fairly evenly distributed across the chromosomes. We compared processed pseudogenes of different evolutionary ages, observing a high degree of similarity between "ancient" and "modern" subpopulations. This may be attributable to the consistently high expression of ribosomal proteins over evolutionary time. Finally, we find that chromosome 22 pseudogene population is dominated by immunoglobulin segments, which have a greater rate of disablement per amino acid than the other pseudogene populations and are also substantially more diverged.
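The annotation criteria stated in this abstract (an obvious disablement, then "processed" if the continuous homology span exceeds 70% of the parent protein length or there is evidence of polyadenylation) can be written as a small decision rule. This is a hedged sketch: the `Hit` record, its field names, and the fallback to "nonprocessed" for everything else are assumptions for illustration, not the authors' code.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one genomic match to a known protein."""
    span_length: int        # longest continuous homologous span, in residues
    protein_length: int     # length of the closest matching human protein
    has_stop_codon: bool    # mid-sequence stop codon observed
    has_frameshift: bool    # frameshift observed
    has_polya: bool         # evidence of polyadenylation


def classify(hit: Hit, span_threshold: float = 0.70) -> str:
    """Apply the abstract's criteria; the 0.70 threshold is from the text."""
    if not (hit.has_stop_codon or hit.has_frameshift):
        return "no obvious disablement"
    coverage = hit.span_length / hit.protein_length
    if coverage > span_threshold or hit.has_polya:
        return "processed"
    # Assumption: remaining disabled hits are treated as nonprocessed.
    return "nonprocessed"
```

The overlap check against annotations of known genes, which the abstract also mentions, is left out of this sketch.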
MESH Headings
- Chromosome Mapping/methods
- Chromosomes, Human, Pair 21/genetics
- Chromosomes, Human, Pair 22/genetics
- Evolution, Molecular
- Fossils
- Genes, Immunoglobulin
- Genes, Overlapping
- Genome, Human
- Humans
- Multigene Family
- Pseudogenes
- RNA Processing, Post-Transcriptional/genetics
- Sequence Analysis, DNA/statistics & numerical data
Affiliation(s)
- Paul M Harrison
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
|
26
|
Mousses S, Kallioniemi A, Kauraniemi P, Elkahloun A, Kallioniemi OP. Clinical and functional target validation using tissue and cell microarrays. Curr Opin Chem Biol 2002; 6:97-101. [PMID: 11827831 DOI: 10.1016/s1367-5931(01)00283-6] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.4] [Indexed: 10/18/2022]
Abstract
Expression levels of thousands of genes or proteins can be readily determined using microarray techniques. However, this represents only the first step in understanding the biological and medical significance of these molecules. New high-throughput techniques, such as tissue and cell microarrays, will facilitate clinical and functional analysis of molecular targets.
Affiliation(s)
- Spyro Mousses
- Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892-8000, USA
|
27
|
Ma D. Applications of yeast in drug discovery. Progress in Drug Research. Fortschritte der Arzneimittelforschung. Progres des Recherches Pharmaceutiques 2002; 57:117-62. [PMID: 11728000 DOI: 10.1007/978-3-0348-8308-5_3] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 02/09/2023]
Abstract
The yeast Saccharomyces cerevisiae is perhaps the best-studied eukaryotic organism. Its experimental tractability, combined with the remarkable conservation of gene function throughout evolution, makes yeast the ideal model genetic organism. Yeast is a non-pathogenic model of fungal pathogens used to identify antifungal targets suitable for drug development and to elucidate mechanisms of action of antifungal agents. As a model of fundamental cellular processes and metabolic pathways of the human, yeast has improved our understanding and facilitated the molecular analysis of many disease genes. The completion of the Saccharomyces genome sequence helped launch the post-genomic era, focusing on functional analyses of whole genomes. Yeast paved the way for the systematic analysis of large and complex genomes by serving as a test bed for novel experimental approaches and technologies, tools that are fast becoming the standard in drug discovery research.
Affiliation(s)
- D Ma
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285, USA.
|
28
|
Ondrechen MJ, Clifton JG, Ringe D. THEMATICS: a simple computational predictor of enzyme function from structure. Proc Natl Acad Sci U S A 2001; 98:12473-8. [PMID: 11606719 PMCID: PMC60078 DOI: 10.1073/pnas.211436698] [Citation(s) in RCA: 183] [Impact Index Per Article: 8.0] [Received: 06/19/2001] [Accepted: 08/18/2001] [Indexed: 11/18/2022] Open
Abstract
We show that theoretical microscopic titration curves (THEMATICS) can be used to identify active-site residues in proteins of known structure. Results are featured for three enzymes: triosephosphate isomerase (TIM), aldose reductase (AR), and phosphomannose isomerase (PMI). We note that TIM and AR have similar structures but catalyze different kinds of reactions, whereas TIM and PMI have different structures but catalyze similar reactions. Analysis of the theoretical microscopic titration curves for all of the ionizable residues of these proteins shows that a small fraction (3-7%) of the curves possess a flat region where the residue is partially protonated over a wide pH range. The preponderance of residues with such perturbed curves occur in the active site. Additional results are given in summary form to show the success of the method for proteins with a variety of different chemistries and structures.
Affiliation(s)
- M J Ondrechen
- Department of Chemistry, Northeastern University, Boston, MA 02115, USA.
|
29
|
Abstract
Type 2 diabetes mellitus is not a single disease but a genetically heterogeneous group of metabolic disorders sharing glucose intolerance. The precise underlying biochemical defects are unknown and almost certainly include impairments of both insulin secretion and action. The rapidly increasing prevalence of T2D worldwide makes it a major cause of morbidity and mortality. Understanding the genetic aetiology of T2D will facilitate its diagnosis, treatment and prevention. The results of linkage and association studies to date demonstrate that, as with other common diseases, multiple genes are involved in the susceptibility to T2D, each making a modest contribution to the overall risk. The completion of the draft human genome sequence and a brace of novel tools for genomic analysis promise to accelerate progress towards a more complete molecular description of T2D.
Affiliation(s)
- A L Gloyn
- Centre for Molecular Genetics, Institute of Clinical Science, School of Postgraduate Medicine and Healthcare Sciences, University of Exeter, Barrack Road, Exeter, EX2 5AX, UK
|
30
|
Weinberg RA. A question of strategy. Trends Biochem Sci 2001; 26:207-8. [PMID: 11295537 DOI: 10.1016/s0968-0004(01)01823-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
32
|
Semple CA. Bases and spaces: resources on the web for accessing the draft human genome - II - after publication of the draft. Genome Biol 2001; 2:REVIEWS2001. [PMID: 11423014 PMCID: PMC138945 DOI: 10.1186/gb-2001-2-6-reviews2001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 11/10/2022] Open
Abstract
The volume of human genome sequence and the variety of web-based tools to access it continue to grow at an impressive rate, but a working knowledge of certain key resources can be sufficient to get the most from your genome. This article provides an update to Genome Biology 2000, 1(4):reviews2001.1-2001.5.
Affiliation(s)
- C A Semple
- Medical Genetics Section, Department of Medical Sciences, The University of Edinburgh, Molecular Medicine Centre, Western General Hospital, Edinburgh EH4 2XU, UK.
|