1
Rood JE, Wynne S, Robson L, Hupalowska A, Randell J, Teichmann SA, Regev A. The Human Cell Atlas from a cell census to a unified foundation model. Nature 2025; 637:1065-1071. [PMID: 39566552 DOI: 10.1038/s41586-024-08338-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2024] [Accepted: 11/01/2024] [Indexed: 11/22/2024]
Abstract
With the convergence of notable advances in molecular and spatial profiling methods and new computational approaches taking advantage of artificial intelligence and machine learning, the construction of cell atlases is progressing from data collection to atlas integration and beyond. Here, we explore five ways in which cell atlases, including the Human Cell Atlas, are already revealing valuable biological insights, and how they are poised to provide even greater benefits in the coming years. In particular, we discuss cell atlases as censuses of cells; as 3D maps of cells in the body, across modalities and scales; as maps connecting genotype causes to phenotype effects; as 4D maps of development; and, ultimately, as foundation models of biology unifying all of these aspects and helping to transform medicine.
Affiliation(s)
- Jennifer E Rood
- Human Cell Atlas, Cambridge, MA, USA
- Genentech, South San Francisco, CA, USA
- Anna Hupalowska
- Human Cell Atlas, Cambridge, MA, USA
- Genentech, South San Francisco, CA, USA
- Sarah A Teichmann
- Human Cell Atlas, Cambridge, MA, USA
- Cambridge Stem Cell Institute and Department of Medicine, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, Cambridge, UK
- Aviv Regev
- Human Cell Atlas, Cambridge, MA, USA
- Genentech, South San Francisco, CA, USA
2
Orliaguet L, Ejlalmanesh T, Humbert A, Ballaire R, Diedisheim M, Julla JB, Chokr D, Cuenco J, Michieletto J, Charbit J, Lindén D, Boucher J, Potier C, Hamimi A, Lemoine S, Blugeon C, Legoix P, Lameiras S, Baudrin LG, Baulande S, Soprani A, Castelli FA, Fenaille F, Riveline JP, Dalmas E, Rieusset J, Gautier JF, Venteclef N, Alzaid F. Early macrophage response to obesity encompasses Interferon Regulatory Factor 5 regulated mitochondrial architecture remodelling. Nat Commun 2022; 13:5089. [PMID: 36042203 PMCID: PMC9427774 DOI: 10.1038/s41467-022-32813-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2021] [Accepted: 08/16/2022] [Indexed: 11/29/2022] [Open Access]
Abstract
Adipose tissue macrophages (ATMs) adapt to changes in their energetic microenvironment. Caloric excess, in a range from transient to diet-induced obesity, could result in the transition of ATMs from highly oxidative and protective to highly inflammatory and metabolically deleterious. Here, we demonstrate that Interferon Regulatory Factor 5 (IRF5) is a key regulator of macrophage oxidative capacity in response to caloric excess. ATMs from mice with genetic deficiency of Irf5 are characterised by increased oxidative respiration and mitochondrial membrane potential. Transient inhibition of IRF5 activity leads to a respiratory phenotype similar to that of genomic deletion, and is reversible by reconstitution of IRF5 expression. We find that the highly oxidative nature of Irf5-deficient macrophages results from transcriptional de-repression of the mitochondrial matrix component Growth Hormone Inducible Transmembrane Protein (GHITM) gene. The high oxygen consumption associated with Irf5 deficiency could be alleviated by experimental suppression of Ghitm expression. ATMs and monocytes from patients with obesity or with type-2 diabetes retain the reciprocal regulatory relationship between Irf5 and Ghitm. Thus, our study provides insights into the mechanism of how the inflammatory transcription factor IRF5 controls physiological adaptation to diet-induced obesity via regulating mitochondrial architecture in macrophages.
Editor's summary: Interferon Regulatory Factor 5 levels have been shown to increase in adipose tissue macrophages in diet-induced obesity. Here the authors show that IRF5 transcriptionally represses the Growth Hormone Inducible Transmembrane Protein gene encoding a mitochondrial protein important for oxidative respiration in macrophages, thus driving the detrimental metabolic changes observed in obesity.
Affiliation(s)
- L Orliaguet
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France
- T Ejlalmanesh
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France
- A Humbert
- CarMeN Laboratory, UMR INSERM U1060/INRA U1397, Lyon 1 University, F-69310, Pierre Bénite, France
- R Ballaire
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France
- M Diedisheim
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France.,Department of Diabetes, Cochin Hospital, Assistance Publique - Hôpitaux de Paris, Université Paris Cité, Paris, France
- J B Julla
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France.,Department of Diabetes, Lariboisière Hospital, Assistance Publique - Hôpitaux de Paris, Université Paris Cité, Paris, France
- D Chokr
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France
- J Cuenco
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France
- J Michieletto
- Université Paris-Saclay, CEA, INRAE, Département Médicaments et Technologies pour la Santé (DMTS), MetaboHUB, F-91191, Gif sur Yvette, France
- J Charbit
- Service d'endocrinologie, diabétologie, maladies métaboliques, Hôpital Avicenne, 127 Rte de Stalingrad, 93 009, Bobigny, France
- D Lindén
- Bioscience Metabolism, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
- J Boucher
- Bioscience Metabolism, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
- C Potier
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France
- A Hamimi
- INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France
- S Lemoine
- GenomiqueENS, Institut de Biologie de l'ENS (IBENS), Département de biologie, École normale supérieure, CNRS, INSERM, Université PSL, 75005, Paris, France
- C Blugeon
- GenomiqueENS, Institut de Biologie de l'ENS (IBENS), Département de biologie, École normale supérieure, CNRS, INSERM, Université PSL, 75005, Paris, France
- P Legoix
- Institut Curie Genomics of Excellence Platform, Institut Curie Research Center, PSL University, Paris, France
- S Lameiras
- Institut Curie Genomics of Excellence Platform, Institut Curie Research Center, PSL University, Paris, France
- L G Baudrin
- Institut Curie Genomics of Excellence Platform, Institut Curie Research Center, PSL University, Paris, France
- S Baulande
- Institut Curie Genomics of Excellence Platform, Institut Curie Research Center, PSL University, Paris, France
- A Soprani
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France.,Department of Digestive Surgery, Générale de Santé (GDS), Geoffroy Saint Hilaire Clinic, 75005, Paris, France
- F A Castelli
- Université Paris-Saclay, CEA, INRAE, Département Médicaments et Technologies pour la Santé (DMTS), MetaboHUB, F-91191, Gif sur Yvette, France
- F Fenaille
- Université Paris-Saclay, CEA, INRAE, Département Médicaments et Technologies pour la Santé (DMTS), MetaboHUB, F-91191, Gif sur Yvette, France
- J P Riveline
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France.,Department of Diabetes, Lariboisière Hospital, Assistance Publique - Hôpitaux de Paris, Université Paris Cité, Paris, France
- E Dalmas
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France
- J Rieusset
- CarMeN Laboratory, UMR INSERM U1060/INRA U1397, Lyon 1 University, F-69310, Pierre Bénite, France
- J F Gautier
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France.,Department of Diabetes, Lariboisière Hospital, Assistance Publique - Hôpitaux de Paris, Université Paris Cité, Paris, France
- N Venteclef
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France
- F Alzaid
- INSERM UMR-S1151, CNRS UMR-S8253, Université Paris Cité, Institut Necker Enfants Malades, F-75015, Paris, France.,INSERM UMR-S1138, Université Paris Cité, Sorbonne Université, Centre de Recherche des Cordeliers, IMMEDIAB Laboratory, Paris, France.,Dasman Diabetes Institute, Kuwait, Kuwait
3
Badowski C, He B, Garmire LX. Blood-derived lncRNAs as biomarkers for cancer diagnosis: the Good, the Bad and the Beauty. NPJ Precis Oncol 2022; 6:40. [PMID: 35729321 PMCID: PMC9213432 DOI: 10.1038/s41698-022-00283-7] [Citation(s) in RCA: 86] [Impact Index Per Article: 28.7] [Received: 10/20/2021] [Accepted: 05/13/2022] [Indexed: 11/24/2022] [Open Access]
Abstract
Cancer ranks as one of the deadliest diseases worldwide. The high mortality rate associated with cancer is partially due to the lack of reliable early detection methods and/or inaccurate diagnostic tools such as certain protein biomarkers. Cell-free nucleic acids (cfNA) such as circulating long noncoding RNAs (lncRNAs) have been proposed as a new class of potential biomarkers for cancer diagnosis. The reported correlation between the presence of tumors and abnormal levels of lncRNAs in the blood of cancer patients has triggered worldwide interest among clinicians and oncologists, who have been actively investigating their potential as reliable cancer biomarkers. In this report, we review the progress achieved ("the Good") and challenges encountered ("the Bad") in the development of circulating lncRNAs as potential biomarkers for early cancer diagnosis. We report and discuss the diagnostic performance of more than 50 different circulating lncRNAs and emphasize their numerous potential clinical applications ("the Beauty"), including as therapeutic targets and agents, on top of diagnostic and prognostic capabilities. This review also summarizes the best methods of investigation and provides useful guidelines for clinicians and scientists who wish to conduct their own clinical studies on circulating lncRNAs in cancer patients via RT-qPCR or Next Generation Sequencing (NGS).
Affiliation(s)
- Cedric Badowski
- University of Hawaii Cancer Center, Epidemiology, 701 Ilalo Street, Honolulu, HI, 96813, USA
- Bing He
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48105, USA
- Lana X Garmire
- University of Hawaii Cancer Center, Epidemiology, 701 Ilalo Street, Honolulu, HI, 96813, USA
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48105, USA
4
Eslami Rasekh M, Hernández Y, Drinan SD, Fuxman Bass J, Benson G. Genome-wide characterization of human minisatellite VNTRs: population-specific alleles and gene expression differences. Nucleic Acids Res 2021; 49:4308-4324. [PMID: 33849068 PMCID: PMC8096271 DOI: 10.1093/nar/gkab224] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 11/03/2020] [Revised: 03/06/2021] [Accepted: 03/18/2021] [Indexed: 11/12/2022] [Open Access]
Abstract
Variable Number Tandem Repeats (VNTRs) are tandem repeat (TR) loci that vary in copy number across a population. Using our program, VNTRseek, we analyzed human whole genome sequencing datasets from 2770 individuals in order to detect minisatellite VNTRs, i.e., those with pattern sizes ≥7 bp. We detected 35 638 VNTR loci and classified 5676 as commonly polymorphic (i.e. with non-reference alleles occurring in >5% of the population). Commonly polymorphic VNTR loci were found to be enriched in genomic regions with regulatory function, i.e. transcription start sites and enhancers. Investigation of the commonly polymorphic VNTRs in the context of population ancestry revealed that 1096 loci contained population-specific alleles and that those could be used to classify individuals into super-populations with near-perfect accuracy. A search for expression quantitative trait loci (eQTLs) among the VNTRs proximal to genes indicated that, for 187 genes, expression differences correlated with VNTR genotype. We validated our predictions in several ways, including experimentally, through the identification of predicted alleles in long reads, and by comparisons showing consistency between sequencing platforms. This study is the most comprehensive analysis of minisatellite VNTRs in the human population to date.
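The "commonly polymorphic" criterion in this abstract (non-reference alleles in >5% of the population) can be illustrated with a minimal sketch. This is not VNTRseek code. The function names, the genotype representation (one tuple of copy numbers per individual) and the reading of "5% of the population" as the fraction of individuals carrying a non-reference allele are all assumptions made for illustration.

```python
def non_reference_frequency(genotypes, reference_copies):
    """Fraction of individuals carrying at least one non-reference allele.

    genotypes: list of tuples, one per individual, holding the copy number
    of each allele at one VNTR locus (hypothetical representation).
    """
    if not genotypes:
        raise ValueError("need at least one genotyped individual")
    carriers = sum(
        1
        for alleles in genotypes
        if any(copies != reference_copies for copies in alleles)
    )
    return carriers / len(genotypes)


def is_commonly_polymorphic(genotypes, reference_copies, threshold=0.05):
    """Apply the >5% cutoff quoted in the abstract (strictly greater than)."""
    return non_reference_frequency(genotypes, reference_copies) > threshold
```

For example, a locus where 2 of 20 individuals carry a 4-copy allele against a 3-copy reference has a carrier frequency of 0.1 and would pass the cutoff under these assumptions.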
Affiliation(s)
- Yözen Hernández
- Graduate Program in Bioinformatics, Boston University, Boston, MA 02215, USA
- Juan I Fuxman Bass
- Graduate Program in Bioinformatics, Boston University, Boston, MA 02215, USA
- Department of Biology, Boston University, Boston, MA 02215, USA
- Gary Benson
- Graduate Program in Bioinformatics, Boston University, Boston, MA 02215, USA
- Department of Biology, Boston University, Boston, MA 02215, USA
- Department of Computer Science, Boston University, Boston, MA 02215, USA
5
Gavrielatos M, Kyriakidis K, Spandidos DA, Michalopoulos I. Benchmarking of next and third generation sequencing technologies and their associated algorithms for de novo genome assembly. Mol Med Rep 2021; 23:251. [PMID: 33537807 PMCID: PMC7893683 DOI: 10.3892/mmr.2021.11890] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 11/04/2020] [Accepted: 01/21/2021] [Indexed: 12/30/2022] [Open Access]
Abstract
Genome assemblers are computational tools for de novo genome assembly, based on a plenitude of primary sequencing data. The quality of genome assemblies is estimated by their contiguity and the occurrences of misassemblies (duplications, deletions, translocations or inversions). The rapid development of sequencing technologies has enabled the rise of novel de novo genome assembly strategies. The ultimate goal of such strategies is to utilise the features of each sequencing platform in order to address the existing weaknesses of each sequencing type and compose a complete and correct genome map. In the present study, the hybrid strategy, which is based on Illumina short paired‑end reads and Nanopore long reads, was benchmarked using the MaSuRCA and Wengan assemblers. Moreover, the long‑read assembly strategy was benchmarked using Canu for Nanopore reads, and using Hifiasm and HiCanu for PacBio HiFi reads. The assemblies were performed on a computational cluster with limited computational resources. Their outputs were evaluated in terms of accuracy and computational performance. The PacBio HiFi assembly strategy outperformed the others, while Hi‑C scaffolding, which is based on chromatin 3D structure, is required in order to increase continuity, accuracy and completeness when large and complex genomes, such as the human one, are assembled. The use of Hi‑C data is also necessary when using the hybrid assembly strategy. The results revealed that HiFi sequencing enabled the rise of novel algorithms which require less genome coverage than the other strategies, making assembly a less computationally demanding task. Taken together, these developments may lead to the democratisation of genome assembly projects, which are now approachable by smaller labs with limited technical and financial resources.
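Contiguity, which this abstract names as one measure of assembly quality, is often summarised by N50. The abstract does not say which metrics the authors computed, so the sketch below is only an illustration of that common statistic: N50 is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly size. The function name `n50` is chosen here for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the cumulative length first covers at least half of
    the total assembly length.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")
```

A higher N50 means that more of the assembly sits in long contigs, but it says nothing about misassemblies, which is why the abstract lists them as a separate quality criterion.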
Affiliation(s)
- Marios Gavrielatos
- Centre of Systems Biology, Biomedical Research Foundation, Academy of Athens, 11527 Athens, Greece
- Department of Cell Biology and Biophysics, Faculty of Biology, University of Athens, 15701 Athens, Greece
- Konstantinos Kyriakidis
- School of Pharmacy, Aristotle University of Thessaloniki (AUTh), 54124 Thessaloniki, Greece
- Genomics and Epigenomics Translational Research (GENeTres), Centre for Interdisciplinary Research and Innovation, 57001 Thessaloniki, Greece
- Demetrios A. Spandidos
- Laboratory of Clinical Virology, Medical School, University of Crete, 71003 Heraklion, Greece
- Ioannis Michalopoulos
- Centre of Systems Biology, Biomedical Research Foundation, Academy of Athens, 11527 Athens, Greece
6
Kaye AM, Wasserman WW. The genome atlas: navigating a new era of reference genomes. Trends Genet 2021; 37:807-818. [PMID: 33419587 DOI: 10.1016/j.tig.2020.12.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 10/09/2020] [Revised: 12/03/2020] [Accepted: 12/07/2020] [Indexed: 10/22/2022]
Abstract
The reference genome serves two distinct purposes within the field of genomics. First, it provides a persistent structure against which findings can be reported, allowing for universal knowledge exchange between users. Second, it reduces the computational costs and time required to process genomic data by creating a scaffold that can be relied upon by analysis software. Here, we posit that current efforts to extend the linear reference to a graph-based structure while trying to fulfil both of these purposes concurrently will face a trade-off between comprehensiveness and computational efficiency. In this article, we explore how the reference genome is used and suggest an alternative structure, The Genome Atlas (TGA), to fulfil the bipartite role of the reference genome.
Affiliation(s)
- Alice M Kaye
- Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
- Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
7
Kumar PS, Dabdoub SM, Ganesan SM. Probing periodontal microbial dark matter using metataxonomics and metagenomics. Periodontol 2000 2020; 85:12-27. [PMID: 33226714 DOI: 10.1111/prd.12349] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Indexed: 12/12/2022]
Abstract
Our view of the periodontal microbial community has been shaped by a century or more of cultivation-based and microscopic investigations. While these studies firmly established the infection-mediated etiology of periodontal diseases, it was apparent from the very early days that periodontal microbiology suffered from what Staley and Konopka described as the "great plate count anomaly", in that these culturable bacteria were only a minor part of what was visible under the microscope. For nearly a century, much effort has been devoted to finding the right tools to investigate this uncultivated majority, also known as "microbial dark matter". The discovery that DNA was an effective tool to "see" microbial dark matter was a significant breakthrough in environmental microbiology, and oral microbiologists were among the earliest to capitalize on these advances. By identifying the order in which nucleotides are arranged in a stretch of DNA (DNA sequencing) and creating a repository of these sequences, sequence databases were created. Computational tools that used probability-driven analysis of these sequences enabled the discovery of new and unsuspected species and ascribed novel functions to these species. This review will trace the development of DNA sequencing as a quantitative, open-ended, comprehensive approach to characterize microbial communities in their native environments, and explore how this technology has shifted traditional dogmas on how the oral microbiome promotes health and its role in disease causation and perpetuation.
Affiliation(s)
- Purnima S Kumar
- Department of Periodontology, College of Dentistry, The Ohio State University, Columbus, Ohio, USA
- Shareef M Dabdoub
- Department of Periodontology, College of Dentistry, The Ohio State University, Columbus, Ohio, USA
- Sukirth M Ganesan
- Department of Periodontics, College of Dentistry and Dental Clinics, The University of Iowa, Iowa City, Iowa, USA
8
SCOP: a novel scaffolding algorithm based on contig classification and optimization. Bioinformatics 2018; 35:1142-1150. [DOI: 10.1093/bioinformatics/bty773] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 12/20/2017] [Revised: 08/10/2018] [Accepted: 09/01/2018] [Indexed: 12/20/2022] [Open Access]
9
Li M, Tang L, Liao Z, Luo J, Wu F, Pan Y, Wang J. A novel scaffolding algorithm based on contig error correction and path extension. IEEE/ACM Trans Comput Biol Bioinform 2018; 16:764-773. [PMID: 30040649 DOI: 10.1109/tcbb.2018.2858267] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 06/08/2023]
Abstract
The sequence assembly process can be divided into three stages: contig extension, scaffolding, and gap filling. Scaffolding is an essential step in the process, inferring the direction and sequence relationships between the contigs. However, scaffolding still faces the challenges of uneven sequencing depth, repetitive genome regions, and sequencing errors, which often lead to many false relationships between contigs. The performance of scaffolding can be improved by removing potential false conjunctions between contigs. In this study, we present iLSLS, a novel scaffolding algorithm based on a Loose-Strict-Loose path extension strategy and contig error correction. iLSLS helps reduce the false relationships between contigs and improves the accuracy of subsequent steps. iLSLS utilizes a scoring function, which estimates the correctness of candidate paths from the distribution of paired reads, and tries to conduct the extension along the highest-scoring path. In addition, iLSLS can precisely estimate the gap size. We conducted experiments on two real datasets, and the results show that the LSLS strategy is effective at increasing the correctness of scaffolds and that iLSLS performs better than other scaffolding methods.
10
Abstract
The process of assembling a species’ reference genome may be performed in a number of iterations, with subsequent genome assemblies differing in the coordinates of mapped elements. The conversion of genome coordinates between different assemblies is required for many integrative and comparative studies. While currently a number of bioinformatics tools are available to accomplish this task, most of them are tailored towards the conversion of single genome coordinates. When converting the boundary positions of segments spanning larger genome regions, segments may be mapped into smaller sub-segments if the original segment’s continuity is disrupted in the target assembly. Such a conversion may lead to a relevant degree of data loss in some circumstances such as copy number variation (CNV) analysis, where the quantitative representation of a genomic region takes precedence over base-specific accuracy.
segment_liftover aims at continuity-preserving remapping of genome segments between assemblies and provides features such as approximate locus conversion, automated batch processing and comprehensive logging to facilitate processing of datasets containing large numbers of structural genome variation data.
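The idea of continuity-preserving remapping with approximate locus conversion can be sketched as follows. This is not the segment_liftover implementation, and none of the names below belong to that tool. The toy block map, the helper names and the `max_shift` parameter are hypothetical, chosen only to show how lifting a segment by its two boundaries keeps it in one piece even when part of its interior is unmapped in the target assembly.

```python
# Hypothetical toy mapping between two assemblies on one chromosome:
# (source_start, source_end, target_start), half-open source intervals.
BLOCKS = [(0, 100, 1000), (150, 300, 1200)]


def lift_position(pos, blocks):
    """Return the target coordinate of pos, or None if pos is unmapped."""
    for src_start, src_end, dst_start in blocks:
        if src_start <= pos < src_end:
            return dst_start + (pos - src_start)
    return None


def nearest_mapped(pos, step, blocks, max_shift):
    """Walk from pos in direction step (+1 or -1) to the first mappable base."""
    for shift in range(max_shift + 1):
        lifted = lift_position(pos + step * shift, blocks)
        if lifted is not None:
            return lifted
    return None


def lift_segment(start, end, blocks, max_shift=50):
    """Lift a segment by its boundaries only, so it stays one segment.

    An unmapped boundary is moved inward by at most max_shift bases
    (an approximate conversion). Returns None if no usable boundary exists.
    """
    new_start = nearest_mapped(start, +1, blocks, max_shift)
    new_end = nearest_mapped(end, -1, blocks, max_shift)
    if new_start is None or new_end is None or new_start > new_end:
        return None
    return new_start, new_end
```

With this toy map, a segment spanning the unmapped gap between the two blocks is returned as a single interval instead of two sub-segments, which matches the motivation given in the abstract for copy number variation data.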
Affiliation(s)
- Bo Gao
- Institute of molecular Life Sciences, University of Zürich, Zürich, CH-8057, Switzerland.,Swiss Institute of Bioinformatics, University of Zürich, Zürich, CH-8057, Switzerland
- Qingyao Huang
- Institute of molecular Life Sciences, University of Zürich, Zürich, CH-8057, Switzerland.,Swiss Institute of Bioinformatics, University of Zürich, Zürich, CH-8057, Switzerland
- Michael Baudis
- Institute of molecular Life Sciences, University of Zürich, Zürich, CH-8057, Switzerland.,Swiss Institute of Bioinformatics, University of Zürich, Zürich, CH-8057, Switzerland
11
Jäger M, Schubach M, Zemojtel T, Reinert K, Church DM, Robinson PN. Alternate-locus aware variant calling in whole genome sequencing. Genome Med 2016; 8:130. [PMID: 27964746 PMCID: PMC5155401 DOI: 10.1186/s13073-016-0383-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 08/08/2016] [Accepted: 11/23/2016] [Indexed: 01/09/2023] [Open Access]
Abstract
Background: The last two human genome assemblies have extended the previous linear golden-path paradigm of the human genome to a graph-like model to better represent regions with a high degree of structural variability. The new model offers opportunities to improve the technical validity of variant calling in whole-genome sequencing (WGS).
Methods: We developed an algorithm that analyzes the patterns of variant calls in the 178 structurally variable regions of the GRCh38 genome assembly, and infers whether a given sample is most likely to contain sequences from the primary assembly, an alternate locus, or their heterozygous combination at each of these 178 regions. We investigate 121 in-house WGS datasets that have been aligned to the GRCh37 and GRCh38 assemblies.
Results: We show that stretches of sequences that are largely but not entirely identical between the primary assembly and an alternate locus can result in multiple variant calls against regions of the primary assembly. In WGS analysis, this results in characteristic and recognizable patterns of variant calls at positions that we term alignable scaffold-discrepant positions (ASDPs). In 121 in-house genomes, on average 51.8±3.8 of the 178 regions were found to correspond best to an alternate locus rather than the primary assembly sequence, and filtering these genomes with our algorithm led to the identification of 7863 variant calls per genome that colocalized with ASDPs. Additionally, we found that 437 of 791 genome-wide association study hits located within one of the regions corresponded to ASDPs.
Conclusions: Our algorithm uses the information contained in the 178 structurally variable regions of the GRCh38 genome assembly to avoid spurious variant calls in cases where samples contain an alternate locus rather than the corresponding segment of the primary assembly. These results suggest the great potential of fully incorporating the resources of graph-like genome assemblies into variant calling, but also underscore the importance of developing computational resources that will allow a full reconstruction of the genotype in personal genomes. Our algorithm is freely available at https://github.com/charite/asdpex.
Affiliation(s)
- Marten Jäger
- Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin, 13353, Germany.,Berlin Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin, 13353, Germany
- Max Schubach
- Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin, 13353, Germany
- Tomasz Zemojtel
- Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin, 13353, Germany
- Knut Reinert
- Institute for Bioinformatics, Department of Mathematics and Computer Science, Freie Universität Berlin, Arnimallee 14, Berlin, 14195, Germany
- Deanna M Church
- 10x Genomics, 7068 Koll Center Parkway, Suite 401, Pleasanton, 94566, CA, USA
- Peter N Robinson
- Institute for Medical and Human Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin, 13353, Germany.,Berlin Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin, 13353, Germany.,Institute for Bioinformatics, Department of Mathematics and Computer Science, Freie Universität Berlin, Arnimallee 14, Berlin, 14195, Germany.,The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, 06032, CT, USA.,Institute for Systems Genomics, University of Connecticut, Farmington, 06032, CT, USA
12
Kryukov K, Imanishi T. Human Contamination in Public Genome Assemblies. PLoS One 2016; 11:e0162424. [PMID: 27611326 PMCID: PMC5017631 DOI: 10.1371/journal.pone.0162424] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 04/28/2016] [Accepted: 07/31/2016] [Indexed: 01/29/2023] [Open Access]
Abstract
Contamination in genome assembly can lead to wrong or confusing results when using such genome as reference in sequence comparison. Although bacterial contamination is well known, the problem of human-originated contamination received little attention. In this study we surveyed 45,735 available genome assemblies for evidence of human contamination. We used lineage specificity to distinguish between contamination and conservation. We found that 154 genome assemblies contain fragments that with high confidence originate as contamination from human DNA. Majority of contaminating human sequences were present in the reference human genome assembly for over a decade. We recommend that existing contaminated genomes should be revised to remove contaminated sequence, and that new assemblies should be thoroughly checked for presence of human DNA before submitting them to public databases.
Affiliation(s)
- Kirill Kryukov
  - Department of Molecular Life Science, School of Medicine, Tokai University, Isehara, Kanagawa, Japan
- Tadashi Imanishi
  - Department of Molecular Life Science, School of Medicine, Tokai University, Isehara, Kanagawa, Japan
|
13
|
Beier S, Himmelbach A, Schmutzer T, Felder M, Taudien S, Mayer KFX, Platzer M, Stein N, Scholz U, Mascher M. Multiplex sequencing of bacterial artificial chromosomes for assembling complex plant genomes. Plant Biotechnology Journal 2016; 14:1511-22. [PMID: 26801048 PMCID: PMC5066668 DOI: 10.1111/pbi.12511] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 07/14/2015] [Revised: 11/11/2015] [Accepted: 11/13/2015] [Indexed: 05/02/2023]
Abstract
Hierarchical shotgun sequencing remains the method of choice for assembling high-quality reference sequences of complex plant genomes. The efficient exploitation of current high-throughput technologies and powerful computational facilities for large-insert clone sequencing necessitates the sequencing and assembly of a large number of clones in parallel. We developed a multiplexed pipeline for shotgun sequencing and assembling individual bacterial artificial chromosomes (BACs) using the Illumina sequencing platform. We illustrate our approach by sequencing 668 barley BACs (Hordeum vulgare L.) in a single Illumina HiSeq 2000 lane. Using a newly designed parallelized computational pipeline, we obtained sequence assemblies of individual BACs that consist, on average, of eight sequence scaffolds and represent >98% of the genomic inserts. Our BAC assemblies are clearly superior to a whole-genome shotgun assembly regarding contiguity, completeness and the representation of the gene space. Our methods may be employed to rapidly obtain high-quality assemblies of a large number of clones to assemble map-based reference sequences of plant and animal species with complex genomes by sequencing along a minimum tiling path.
Affiliation(s)
- Sebastian Beier
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Stadt Seeland, Germany
- Axel Himmelbach
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Stadt Seeland, Germany
- Thomas Schmutzer
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Stadt Seeland, Germany
- Marius Felder
  - Leibniz Institute on Aging-Fritz Lipmann Institute (FLI), Jena, Germany
- Stefan Taudien
  - Leibniz Institute on Aging-Fritz Lipmann Institute (FLI), Jena, Germany
- Klaus F X Mayer
  - Plant Genome and System Biology (PGSB), Helmholtz Center Munich, German Research Center for Environmental Health (GmbH), Neuherberg, Germany
- Matthias Platzer
  - Leibniz Institute on Aging-Fritz Lipmann Institute (FLI), Jena, Germany
- Nils Stein
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Stadt Seeland, Germany
- Uwe Scholz
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Stadt Seeland, Germany
- Martin Mascher
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Stadt Seeland, Germany
|
14
|
Papanicolaou A. The life cycle of a genome project: perspectives and guidelines inspired by insect genome projects. F1000Res 2016; 5:18. [PMID: 27006757 PMCID: PMC4798206 DOI: 10.12688/f1000research.7559.1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Accepted: 12/10/2015] [Indexed: 11/20/2022] Open
Abstract
Many research programs on non-model species biology have been empowered by genomics. In turn, genomics is underpinned by a reference sequence and ancillary information created by so-called “genome projects”. The most reliable genome projects are the ones created as part of an active research program and designed to address specific questions but their life extends past publication. In this opinion paper I outline four key insights that have facilitated maintaining genomic communities: the key role of computational capability, the iterative process of building genomic resources, the value of community participation and the importance of manual curation. Taken together, these ideas can and do ensure the longevity of genome projects and the growing non-model species community can use them to focus a discussion with regards to its future genomic infrastructure.
Affiliation(s)
- Alexie Papanicolaou
  - Hawkesbury Institute for the Environment, University of Western Sydney, Richmond, NSW 2753, Australia
|
15
|
Deming Y, Xia J, Cai Y, Lord J, Holmans P, Bertelsen S, Holtzman D, Morris JC, Bales K, Pickering EH, Kauwe J, Goate A, Cruchaga C. A potential endophenotype for Alzheimer's disease: cerebrospinal fluid clusterin. Neurobiol Aging 2016; 37:208.e1-208.e9. [PMID: 26545630 PMCID: PMC5118651 DOI: 10.1016/j.neurobiolaging.2015.09.009] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.8] [Received: 07/13/2015] [Accepted: 09/12/2015] [Indexed: 12/30/2022]
Abstract
Genome-wide association studies have associated clusterin (CLU) variants with Alzheimer's disease (AD). However, the role of CLU on AD pathogenesis is not totally understood. We used cerebrospinal fluid (CSF) and plasma CLU levels as endophenotypes for genetic studies to understand the role of CLU in AD. CSF, but not plasma, CLU levels were significantly associated with AD status and CSF tau/amyloid-beta ratio, and highly correlated with CSF apolipoprotein E (APOE) levels. Several loci showed almost genome-wide significant associations including LINC00917 (p = 3.98 × 10⁻⁷) and interleukin 6 (IL6, p = 9.94 × 10⁻⁶, in the entire data set and in the APOE ε4- individuals p = 7.40 × 10⁻⁸). Gene ontology analyses suggest that CSF CLU levels may be associated with wound healing and immune response which supports previous functional studies that demonstrated an association between CLU and IL6. CLU may play a role in AD by influencing immune system changes that have been observed in AD or by disrupting healing after neurodegeneration.
Affiliation(s)
- Yuetiva Deming
  - Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
- Jian Xia
  - Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
- Yefei Cai
  - Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
- Jenny Lord
  - Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
- Peter Holmans
  - Institute of Psychological Medicine and Clinical Neurosciences, MRC Center for Neuropsychiatric Genetics and Genomics, Cardiff University School of Medicine, Cardiff, UK
- Sarah Bertelsen
  - Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
- David Holtzman
  - Department of Neurology, Washington University School of Medicine, St. Louis, MO, USA
  - Department of Developmental Biology, Washington University School of Medicine, St. Louis, MO, USA
  - Knight Alzheimer's Disease Research Center, Washington University School of Medicine, St. Louis, MO, USA
  - Hope Center for Neurological Disorders, Washington University School of Medicine, St. Louis, MO, USA
- John C Morris
  - Department of Neurology, Washington University School of Medicine, St. Louis, MO, USA
  - Department of Developmental Biology, Washington University School of Medicine, St. Louis, MO, USA
  - Knight Alzheimer's Disease Research Center, Washington University School of Medicine, St. Louis, MO, USA
  - Hope Center for Neurological Disorders, Washington University School of Medicine, St. Louis, MO, USA
- Kelly Bales
  - Neuroscience Research Unit, Worldwide Research and Development, Pfizer, Inc., Groton, CT, USA
- Eve H Pickering
  - Neuroscience Research Unit, Worldwide Research and Development, Pfizer, Inc., Groton, CT, USA
- John Kauwe
  - Department of Biology, Brigham Young University, Provo, UT, USA
- Alison Goate
  - Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
  - Knight Alzheimer's Disease Research Center, Washington University School of Medicine, St. Louis, MO, USA
  - Hope Center for Neurological Disorders, Washington University School of Medicine, St. Louis, MO, USA
- Carlos Cruchaga
  - Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
  - Hope Center for Neurological Disorders, Washington University School of Medicine, St. Louis, MO, USA
|
16
|
Manel S, Perrier C, Pratlong M, Abi-Rached L, Paganini J, Pontarotti P, Aurelle D. Genomic resources and their influence on the detection of the signal of positive selection in genome scans. Mol Ecol 2015; 25:170-84. [DOI: 10.1111/mec.13468] [Citation(s) in RCA: 64] [Impact Index Per Article: 6.4] [Received: 06/01/2015] [Revised: 11/06/2015] [Accepted: 11/07/2015] [Indexed: 12/16/2022]
Affiliation(s)
- S. Manel
  - CEFE UMR 5175; CNRS - Université de Montpellier - Université Paul-Valéry Montpellier - EPHE; laboratoire Biogéographie et écologie des vertébrés; 1919 route de Mende, 34293 Montpellier Cedex 5, France
- C. Perrier
  - CEFE UMR 5175; CNRS - Université de Montpellier - Université Paul-Valéry Montpellier - EPHE; laboratoire Biogéographie et écologie des vertébrés; 1919 route de Mende, 34293 Montpellier Cedex 5, France
- M. Pratlong
  - Aix Marseille Université; CNRS; IRD; Avignon Université, IMBE UMR 7263; Station Marine d'Endoume, 13007 Marseille, France
  - Aix Marseille Université; CNRS; Centrale Marseille; I2M UMR 7373; Evolution Biologique Modélisation; 3 Place Victor Hugo, 13331 Marseille Cedex Case 19, France
- L. Abi-Rached
  - Equipe ATIP; URMITE UM 63 CNRS 7278 IRD 198 Inserm U1095; IHU Méditerranée Infection; Aix-Marseille Université; 27 Boulevard Jean Moulin, 13385 Marseille Cedex 05, France
- J. Paganini
  - XEGEN SAS; 15 Rue de la République, 13420 Gemenos, France
- P. Pontarotti
  - Aix Marseille Université; CNRS; Centrale Marseille; I2M UMR 7373; Evolution Biologique Modélisation; 3 Place Victor Hugo, 13331 Marseille Cedex Case 19, France
- D. Aurelle
  - Aix Marseille Université; CNRS; IRD; Avignon Université, IMBE UMR 7263; Station Marine d'Endoume, 13007 Marseille, France
|
17
|
Abstract
The current genomic revolution was made possible by joint advances in genome sequencing technologies and computational approaches for analyzing sequence data. The close interaction between biologists and computational scientists is perhaps most apparent in the development of approaches for sequencing entire genomes, a feat that would not be possible without sophisticated computational tools called genome assemblers (short for genome sequence assemblers). Here, we survey the key developments in algorithms for assembling genome sequences since the development of the first DNA sequencing methods more than 35 years ago.
Affiliation(s)
- Jared T Simpson
  - Ontario Institute for Cancer Research, Toronto, Ontario M5G 0A3, Canada
|
18
|
Wilks C, Cline MS, Weiler E, Diehkans M, Craft B, Martin C, Murphy D, Pierce H, Black J, Nelson D, Litzinger B, Hatton T, Maltbie L, Ainsworth M, Allen P, Rosewood L, Mitchell E, Smith B, Warner J, Groboske J, Telc H, Wilson D, Sanford B, Schmidt H, Haussler D, Maltbie D. The Cancer Genomics Hub (CGHub): overcoming cancer through the power of torrential data. Database: The Journal of Biological Databases and Curation 2014; 2014:bau093. [PMID: 25267794 PMCID: PMC4178372 DOI: 10.1093/database/bau093] [Citation(s) in RCA: 127] [Impact Index Per Article: 11.5] [Indexed: 11/29/2022]
Abstract
The Cancer Genomics Hub (CGHub) is the online repository of the sequencing programs of the National Cancer Institute (NCI), including The Cancer Genomics Atlas (TCGA), the Cancer Cell Line Encyclopedia (CCLE) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) projects, with data from 25 different types of cancer. The CGHub currently contains >1.4 PB of data, has grown at an average rate of 50 TB a month and serves >100 TB per week. The architecture of CGHub is designed to support bulk searching and downloading through a Web-accessible application programming interface, enforce patient genome confidentiality in data storage and transmission and optimize for efficiency in access and transfer. In this article, we describe the design of these three components, present performance results for our transfer protocol, GeneTorrent, and finally report on the growth of the system in terms of data stored and transferred, including estimated limits on the current architecture. Our experienced-based estimates suggest that centralizing storage and computational resources is more efficient than wide distribution across many satellite labs. Database URL: https://cghub.ucsc.edu
Affiliation(s)
- Christopher Wilks, Melissa S Cline, Erich Weiler, Mark Diehkans, Brian Craft, Christy Martin, Daniel Murphy, Howdy Pierce, John Black, Donavan Nelson, Brian Litzinger, Thomas Hatton, Lori Maltbie, Michael Ainsworth, Patrick Allen, Linda Rosewood, Elizabeth Mitchell, Bradley Smith, Jim Warner, John Groboske, Haifang Telc, Daniel Wilson, Brian Sanford, Hannes Schmidt, David Haussler, Daniel Maltbie
  - Biomolecular Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz (UCSC), Santa Cruz, CA 95064, USA
  - Annai Systems Inc., 2100 Palomar Airport Road, Suite 210 Carlsbad, California 92011, USA
  - Cardinal Peak, LLC, 1380 Forest Park Circle, Suite 202 Lafayette, CO 80026, USA
  - Information Technology Services, University of California Santa Cruz (UCSC), Santa Cruz, CA 95064, USA
|
19
|
Xue W, Li JT, Zhu YP, Hou GY, Kong XF, Kuang YY, Sun XW. L_RNA_scaffolder: scaffolding genomes with transcripts. BMC Genomics 2013; 14:604. [PMID: 24010822 PMCID: PMC3846640 DOI: 10.1186/1471-2164-14-604] [Citation(s) in RCA: 83] [Impact Index Per Article: 6.9] [Received: 08/28/2013] [Accepted: 09/03/2013] [Indexed: 11/25/2022] Open
Abstract
Background: Generation of large mate-pair libraries is necessary for de novo genome assembly but the procedure is complex and time-consuming. Furthermore, in some complex genomes, it is hard to increase the N50 length even with large mate-pair libraries, which leads to low transcript coverage. Thus, it is necessary to develop other simple scaffolding approaches, to at least solve the elongation of transcribed fragments. Results: We describe L_RNA_scaffolder, a novel genome scaffolding method that uses long transcriptome reads to order, orient and combine genomic fragments into larger sequences. To demonstrate the accuracy of the method, the zebrafish genome was scaffolded. With expanded human transcriptome data, the N50 of human genome was doubled and L_RNA_scaffolder out-performed most scaffolding results by existing scaffolders which employ mate-pair libraries. In these two examples, the transcript coverage was almost complete, especially for long transcripts. We applied L_RNA_scaffolder to the highly polymorphic pearl oyster draft genome and the gene model length significantly increased. Conclusions: The simplicity and high-throughput of RNA-seq data makes this approach suitable for genome scaffolding. L_RNA_scaffolder is available at http://www.fishbrowser.org/software/L_RNA_scaffolder.
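The N50 length mentioned in this abstract is the standard contiguity metric: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The following is a minimal sketch of that calculation, not code from L_RNA_scaffolder; the function name `n50` and the scaffold lengths are hypothetical and chosen only for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    summed length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for "at least half"
            return length


# Hypothetical scaffold lengths (arbitrary units), for illustration only.
before = [80, 70, 50, 40, 30, 20]
# Joining the 50 and 40 fragments into one scaffold, ignoring any gap length.
after = [90, 80, 70, 30, 20]

print(n50(before), n50(after))
```

Joining fragments leaves the total length unchanged in this toy case but shifts the half-way point onto a longer sequence, which is why scaffolding tends to raise N50.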
Affiliation(s)
- Wei Xue
  - The Centre for Applied Aquatic Genomics, Chinese Academy of Fishery Sciences, Beijing 100141, China
|
20
|
Jorge NAN, Ferreira CG, Passetti F. Bioinformatics of Cancer ncRNA in High Throughput Sequencing: Present State and Challenges. Front Genet 2012; 3:287. [PMID: 23251139 PMCID: PMC3523245 DOI: 10.3389/fgene.2012.00287] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 09/04/2012] [Accepted: 11/22/2012] [Indexed: 12/24/2022] Open
Abstract
The numerous genome sequencing projects produced unprecedented amount of data providing significant information to the discovery of novel non-coding RNA (ncRNA). Several ncRNAs have been described to control gene expression and display important role during cell differentiation and homeostasis. In the last decade, high throughput methods in conjunction with approaches in bioinformatics have been used to identify, classify, and evaluate the expression of hundreds of ncRNA in normal and pathological states, such as cancer. Patient outcomes have been already associated with differential expression of ncRNAs in normal and tumoral tissues, providing new insights in the development of innovative therapeutic strategies in oncology. In this review, we present and discuss bioinformatics advances in the development of computational approaches to analyze and discover ncRNA data in oncology using high throughput sequencing technologies.
|
21
|
Abstract
The UCSC Genome Browser (http://genome.ucsc.edu) is a graphical viewer for genomic data now in its 13th year. Since the early days of the Human Genome Project, it has presented an integrated view of genomic data of many kinds. Now home to assemblies for 58 organisms, the Browser presents visualization of annotations mapped to genomic coordinates. The ability to juxtapose annotations of many types facilitates inquiry-driven data mining. Gene predictions, mRNA alignments, epigenomic data from the ENCODE project, conservation scores from vertebrate whole-genome alignments and variation data may be viewed at any scale from a single base to an entire chromosome. The Browser also includes many other widely used tools, including BLAT, which is useful for alignments from high-throughput sequencing experiments. Private data uploaded as Custom Tracks and Data Hubs in many formats may be displayed alongside the rich compendium of precomputed data in the UCSC database. The Table Browser is a full-featured graphical interface, which allows querying, filtering and intersection of data tables. The Saved Session feature allows users to store and share customized views, enhancing the utility of the system for organizing multiple trains of thought. Binary Alignment/Map (BAM), Variant Call Format and the Personal Genome Single Nucleotide Polymorphisms (SNPs) data formats are useful for visualizing a large sequencing experiment (whole-genome or whole-exome), where the differences between the data set and the reference assembly may be displayed graphically. Support for high-throughput sequencing extends to compact, indexed data formats, such as BAM, bigBed and bigWig, allowing rapid visualization of large datasets from RNA-seq and ChIP-seq experiments via local hosting.
Affiliation(s)
- Robert M Kuhn
- Center for Biomolecular Science and Engineering, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA 95064, USA.
|
22
|
Gritsenko AA, Nijkamp JF, Reinders MJT, Ridder DD. GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies. Bioinformatics 2012; 28:1429-37. [DOI: 10.1093/bioinformatics/bts175] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
|
23
|
Gao S, Sung WK, Nagarajan N. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J Comput Biol 2011; 18:1681-91. [PMID: 21929371 DOI: 10.1089/cmb.2011.0170] [Citation(s) in RCA: 158] [Impact Index Per Article: 11.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Scaffolding, the problem of ordering and orienting contigs, typically using paired-end reads, is a crucial step in the assembly of high-quality draft genomes. Even as sequencing technologies and mate-pair protocols have improved significantly, scaffolding programs still rely on heuristics, with no guarantees on the quality of the solution. In this work, we explored the feasibility of an exact solution for scaffolding and present a first tractable solution for this problem (Opera). We also describe a graph contraction procedure that allows the solution to scale to large scaffolding problems and demonstrate this by scaffolding several large real and synthetic datasets. In comparisons with existing scaffolders, Opera simultaneously produced longer and more accurate scaffolds demonstrating the utility of an exact approach. Opera also incorporates an exact quadratic programming formulation to precisely compute gap sizes (Availability: http://sourceforge.net/projects/operasf/ ).
Affiliation(s)
- Song Gao
- NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore, Singapore
|
24
|
Church DM, Schneider VA, Graves T, Auger K, Cunningham F, Bouk N, Chen HC, Agarwala R, McLaren WM, Ritchie GRS, Albracht D, Kremitzki M, Rock S, Kotkiewicz H, Kremitzki C, Wollam A, Trani L, Fulton L, Fulton R, Matthews L, Whitehead S, Chow W, Torrance J, Dunn M, Harden G, Threadgold G, Wood J, Collins J, Heath P, Griffiths G, Pelan S, Grafham D, Eichler EE, Weinstock G, Mardis ER, Wilson RK, Howe K, Flicek P, Hubbard T. Modernizing reference genome assemblies. PLoS Biol 2011; 9:e1001091. [PMID: 21750661 PMCID: PMC3130012 DOI: 10.1371/journal.pbio.1001091] [Citation(s) in RCA: 378] [Impact Index Per Article: 27.0] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open
Affiliation(s)
- Deanna M Church
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America.
|
25
|
Mortazavi A, Schwarz EM, Williams B, Schaeffer L, Antoshechkin I, Wold BJ, Sternberg PW. Scaffolding a Caenorhabditis nematode genome with RNA-seq. Genome Res 2010; 20:1740-7. [PMID: 20980554 PMCID: PMC2990000 DOI: 10.1101/gr.111021.110] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2010] [Accepted: 08/24/2010] [Indexed: 11/25/2022]
Abstract
Efficient sequencing of animal and plant genomes by next-generation technology should allow many neglected organisms of biological and medical importance to be better understood. As a test case, we have assembled a draft genome of Caenorhabditis sp. 3 PS1010 through a combination of direct sequencing and scaffolding with RNA-seq. We first sequenced genomic DNA and mixed-stage cDNA using paired 75-nt reads from an Illumina GAII. A set of 230 million genomic reads yielded an 80-Mb assembly, with a supercontig N50 of 5.0 kb, covering 90% of 429 kb from previously published genomic contigs. Mixed-stage poly(A)+ cDNA gave 47.3 million mappable 75-mers (including 5.1 million spliced reads), which separately assembled into 17.8 Mb of cDNA, with an N50 of 1.06 kb. By further scaffolding our genomic supercontigs with cDNA, we increased their N50 to 9.4 kb, nearly double the average gene size in C. elegans. We predicted 22,851 protein-coding genes, and detected expression in 78% of them. Multigenome alignment and data filtering identified 2672 DNA elements conserved between PS1010 and C. elegans that are likely to encode regulatory sequences or previously unknown ncRNAs. Genomic and cDNA sequencing followed by joint assembly is a rapid and useful strategy for biological analysis.
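For readers unfamiliar with the N50 statistic quoted in this abstract, the following is a minimal sketch of one common way to compute it: sort the sequence lengths in descending order and report the length at which the running total first reaches half of the total assembly size. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical supercontig lengths in kb, for illustration only.
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))
```

Because the statistic depends only on the length distribution, joining two sequences into one scaffold can raise N50 without adding any new bases, which is why scaffolding results are often summarized this way.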
Affiliation(s)
- Ali Mortazavi
- Division of Biology, California Institute of Technology, Pasadena, California 91125, USA
- Howard Hughes Medical Institute, Pasadena, California 91125, USA
- Erich M. Schwarz
- Division of Biology, California Institute of Technology, Pasadena, California 91125, USA
- Howard Hughes Medical Institute, Pasadena, California 91125, USA
- Brian Williams
- Division of Biology, California Institute of Technology, Pasadena, California 91125, USA
- Lorian Schaeffer
- Division of Biology, California Institute of Technology, Pasadena, California 91125, USA
- Igor Antoshechkin
- Division of Biology, California Institute of Technology, Pasadena, California 91125, USA
- Barbara J. Wold
- Division of Biology, California Institute of Technology, Pasadena, California 91125, USA
- Paul W. Sternberg
- Division of Biology, California Institute of Technology, Pasadena, California 91125, USA
- Howard Hughes Medical Institute, Pasadena, California 91125, USA
|
26
|
Narzisi G, Mishra B. Scoring-and-unfolding trimmed tree assembler: concepts, constructs and comparisons. Bioinformatics 2010; 27:153-60. [PMID: 21088026 DOI: 10.1093/bioinformatics/btq646] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Mired by its connection to a well-known NP-complete combinatorial optimization problem, namely the Shortest Common Superstring Problem (SCSP), the whole-genome sequence assembly (WGSA) problem has historically been assumed to be amenable only to greedy and heuristic methods. By placing efficiency as their first priority, these methods opted to rely only on local searches, and are thus inherently approximate, ambiguous or error prone, especially for genomes with complex structures. Furthermore, since the choice of the best heuristics depended critically on the properties of (e.g. errors in) the input data and the available long-range information, these approaches hindered designing an error-free WGSA pipeline. RESULTS: We dispense with the idea of limiting the solutions to just the approximated ones, and instead favor an approach that could potentially lead to an exhaustive (exponential-time) search of all possible layouts. Its computational complexity thus must be tamed through a constrained search (Branch-and-Bound) and quick identification and pruning of implausible overlays. For this purpose, such a method necessarily relies on a set of score functions (oracles) that can combine different structural properties (e.g. transitivity, coverage, physical maps, etc.). We give a detailed description of this novel assembly framework, referred to as Scoring-and-Unfolding Trimmed Tree Assembler (SUTTA), and present experimental results on several bacterial genomes using next-generation sequencing technology data. We also report experimental evidence that the assembly quality strongly depends on the choice of the minimum overlap parameter k. AVAILABILITY AND IMPLEMENTATION: SUTTA's binaries are freely available to non-profit institutions for research and educational purposes at http://www.bioinformatics.nyu.edu.
Affiliation(s)
- Giuseppe Narzisi
- Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA.
|
27
|
Homer N, Nelson SF. Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA. Genome Biol 2010; 11:R99. [PMID: 20932289 PMCID: PMC3218665 DOI: 10.1186/gb-2010-11-10-r99] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2010] [Revised: 08/25/2010] [Accepted: 10/08/2010] [Indexed: 11/10/2022] Open
Abstract
A primary component of next-generation sequencing analysis is to align short reads to a reference genome, with each read aligned independently. However, reads that observe the same non-reference DNA sequence are highly correlated and can be used to better model the true variation in the target genome. A novel short-read micro realigner, SRMA, that leverages this correlation to better resolve a consensus of the underlying DNA sequence of the targeted genome is described here.
Affiliation(s)
- Nils Homer
- Department of Computer Science, University of California, Los Angeles, Boelter Hall, Los Angeles, CA 90095, USA.
|
28
|
Abstract
There is a growing gap between the generation of massively parallel sequencing output and the ability to process and analyze the resulting data. New users are left to navigate a bewildering maze of base calling, alignment, assembly and analysis tools with often incomplete documentation and no idea how to compare and validate their outputs. Bridging this gap is essential, or the coveted $1,000 genome will come with a $20,000 analysis price tag.
|
29
|
Abstract
Recent advances in both clone fingerprinting and draft sequencing technology have made it increasingly common for species to have a bacterial artificial chromosome (BAC) fingerprint map, BAC end sequences (BESs) and draft genomic sequence. The FPC (fingerprinted contigs) software package contains three modules that maximize the value of these resources. The BSS (blast some sequence) module provides a way to easily view the results of aligning draft sequence to the BESs, and integrates the results with the following two modules. The MTP (minimal tiling path) module uses sequence and fingerprints to determine a minimal tiling path of clones. The DSI (draft sequence integration) module aligns draft sequences to FPC contigs, displays them alongside the contigs and identifies potential discrepancies; the alignment can be based on either individual BES alignments to the draft, or on the locations of BESs that have been assembled into the draft. FPC also supports high-throughput fingerprint map generation as its time-intensive functions have been parallelized for Unix-based desktops or servers with multiple CPUs. Simulation results are provided for the MTP, DSI and parallelization. These features are in the FPC V9.3 software package, which is freely available.
Affiliation(s)
- William Nelson
- Arizona Genomics Computational Laboratory, BIO5 Institute, University of Arizona, Tucson, AZ, USA
|
30
|
Scheibye-Alsing K, Hoffmann S, Frankel A, Jensen P, Stadler PF, Mang Y, Tommerup N, Gilchrist MJ, Nygård AB, Cirera S, Jørgensen CB, Fredholm M, Gorodkin J. Sequence assembly. Comput Biol Chem 2008; 33:121-36. [PMID: 19152793 DOI: 10.1016/j.compbiolchem.2008.11.003] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2008] [Revised: 11/28/2008] [Accepted: 11/28/2008] [Indexed: 01/20/2023]
Abstract
Despite the rapidly increasing number of sequenced and re-sequenced genomes, many issues regarding the computational assembly of large-scale sequencing data remain unresolved. Computational assembly is crucial in large genome projects as well as for the evolving high-throughput technologies and plays an important role in processing the information generated by these methods. Here, we provide a comprehensive overview of the current publicly available sequence assembly programs. We describe the basic principles of computational assembly along with the main concerns, such as repetitive sequences in genomic DNA, highly expressed genes and alternative transcripts in EST sequences. We summarize existing comparisons of different assemblers and provide detailed descriptions and directions for download of assembly programs at: http://genome.ku.dk/resources/assembly/methods.html.
Affiliation(s)
- K Scheibye-Alsing
- Division of Genetics and Bioinformatics, IBHV, University of Copenhagen, Grønnegårdsvej 3, 1870 Frederiksberg C, Denmark
|
31
|
Profile of David Haussler. Proc Natl Acad Sci U S A 2008; 105:14251-3. [DOI: 10.1073/pnas.0808284105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
|
32
|
Rouchka EC, Phatak AW, Singh AV. Effect of single nucleotide polymorphisms on Affymetrix match-mismatch probe pairs. Bioinformation 2008; 2:405-11. [PMID: 18795114 PMCID: PMC2533060 DOI: 10.6026/97320630002405] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2008] [Revised: 06/17/2008] [Accepted: 07/03/2008] [Indexed: 11/23/2022] Open
Abstract
Microarrays provide a means of studying the expression levels of tens of thousands of genes by providing one or more oligonucleotide probe(s) for each transcript studied. Affymetrix® GeneChip™ platforms historically pair each 25-base perfect match (PM) probe with a mismatch (MM) probe differing by a complementary base located in the 13th position to quantify and deflate effects of cross-hybridization. Analytical routines for analyzing these arrays take into account the difference in expression levels of MM and PM probes to determine which ones are useful for further study. If a single nucleotide polymorphism (SNP) occurs at the 13th base, a probe with a higher MM expression level may be incorrectly omitted. In order to examine SNP effects on PM and MM expression levels, known human SNPs from dbSNP were mapped to probe sets within the Affymetrix® HG-U133A platform. Probe sets containing one or more probe pairs with a single SNP at the 13th position were extracted. A set of twelve microarray experiments was analyzed for the PM and MM expression levels for these probe sets. Over 6,000,000 human SNPs and their flanking regions were extracted from dbSNP. These sequences were aligned against each of the 247,965 probe pair sequences from the Affymetrix® HG-U133A platform. A total of 915 probe sets containing a single probe sequence with a SNP mapped to the 13th base were extracted. A subset containing 166 probe sets results in complementary base SNPs. Comparison of gene expression levels for the SNP to non-SNP PM and MM probes does not yield a significant difference using χ² analysis. Thus, omission of probes with MM expression levels higher than PM expression levels does not appear to result in a loss of information concerning SNPs for these regions.
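The PM/MM pairing described in this abstract can be illustrated with a short sketch: the mismatch probe is the 25-base perfect-match probe with the base at position 13 replaced by its complement. The function name `mismatch_probe` and the example sequence are hypothetical and serve only as illustration; they are not drawn from the HG-U133A design files.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm_probe, position=13):
    """Return the MM probe for a 25-base PM probe.

    `position` is 1-based; the base there is swapped for its complement.
    """
    if len(pm_probe) != 25:
        raise ValueError("expected a 25-base probe")
    idx = position - 1
    return pm_probe[:idx] + COMPLEMENT[pm_probe[idx]] + pm_probe[idx + 1:]


# Made-up probe sequence, for illustration only.
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

Under this construction, a SNP at position 13 whose alternate allele is the complementary base would make the MM probe the perfect match for carriers of that allele, which is the scenario the study examines.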
Affiliation(s)
- Eric Christian Rouchka
- Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY, USA.
|
33
|
A scaffold analysis tool using mate-pair information in genome sequencing. J Biomed Biotechnol 2008; 2008:675741. [PMID: 18414585 PMCID: PMC2291285 DOI: 10.1155/2008/675741] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2007] [Accepted: 12/24/2007] [Indexed: 11/27/2022] Open
Abstract
We have developed a Windows-based program, ConPath, as a scaffold analyzer. ConPath constructs scaffolds by ordering and orienting separate sequence contigs by exploiting the mate-pair information between contig-pairs. Our algorithm builds directed graphs from link information and traverses them to find the longest acyclic graphs. Using end read pairs of fixed-sized mate-pair libraries, ConPath determines relative orientations of all contigs, estimates the gap size of each adjacent contig pair, and reports wrong assembly information by validating orientations and gap sizes. We have utilized ConPath in more than 10 microbial genome projects, including Mannheimia succiniciproducens and Vibrio vulnificus, where we verified contig assembly and identified several erroneous contigs using the four types of error defined in ConPath. Also, ConPath supports some convenient features and viewers that permit investigation of each contig in detail; these include contig viewer, scaffold viewer, edge information list, mate-pair list, and the printing of complex scaffold structures.
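The graph traversal mentioned in this abstract can be pictured with a small sketch. It is not ConPath's implementation; it only assumes that mate-pair links have already been reduced to a directed acyclic graph in which an edge means "this contig is followed by that contig", and it returns the chain containing the most contigs. The name `longest_contig_path`, the link dictionary, and the choice to measure length in contigs rather than bases are all assumptions made for illustration.

```python
from functools import lru_cache


def longest_contig_path(links):
    """Return the longest chain of contigs in an acyclic link graph.

    `links` maps each contig to the contigs that may follow it.
    """

    @lru_cache(maxsize=None)
    def best_from(contig):
        # Longest chain that starts at `contig`, memoized per contig.
        best = (contig,)
        for nxt in links.get(contig, ()):
            candidate = (contig,) + best_from(nxt)
            if len(candidate) > len(best):
                best = candidate
        return best

    contigs = set(links) | {n for followers in links.values() for n in followers}
    return list(max((best_from(c) for c in sorted(contigs)), key=len))


# Hypothetical links inferred from mate pairs.
toy_links = {"c1": ["c2", "c3"], "c2": ["c4"], "c3": [], "c4": ["c5"]}
print(longest_contig_path(toy_links))
```

A real scaffolder would also track contig orientation and estimated gap sizes on each edge, as the abstract describes; those are omitted here to keep the traversal itself visible.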
|
34
|
Denisov G, Walenz B, Halpern AL, Miller J, Axelrod N, Levy S, Sutton G. Consensus generation and variant detection by Celera Assembler. Bioinformatics 2008; 24:1035-40. [PMID: 18321888 DOI: 10.1093/bioinformatics/btn074] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: We present an algorithm to identify allelic variation given a Whole Genome Shotgun (WGS) assembly of haploid sequences, and to produce a set of haploid consensus sequences rather than a single consensus sequence. Existing WGS assemblers take a column-by-column approach to consensus generation, and produce a single consensus sequence which can be inconsistent with the underlying haploid alleles, and inconsistent with any of the aligned sequence reads. Our new algorithm uses a dynamic windowing approach. It detects alleles by simultaneously processing the portions of aligned reads spanning a region of sequence variation, assigns reads to their respective alleles, phases adjacent variant alleles and generates a consensus sequence corresponding to each confirmed allele. This algorithm was used to produce the first diploid genome sequence of an individual human. It can also be applied to assemblies of multiple diploid individuals and hybrid assemblies of multiple haploid organisms. RESULTS: Applied to the individual human genome assembly, the new algorithm detects exactly two confirmed alleles and reports two consensus sequences in 98.98% of the 2,033,311 detected regions of sequence variation. In 33,269 out of 460,373 detected regions of size >1 bp, it fixes the constructed errors of a mosaic haploid representation of a diploid locus as produced by the original Celera Assembler consensus algorithm. Using an optimized procedure calibrated against 1,506,344 known SNPs, it detects 438,814 new heterozygous SNPs with a false-positive rate of 12%. AVAILABILITY: The open source code is available at: http://wgs-assembler.cvs.sourceforge.net/wgs-assembler/
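The limitation of column-by-column consensus that motivates this work can be shown with a toy example. The sketch below implements only a simple per-column majority vote, not the dynamic windowing algorithm of the paper, and the reads are invented: two haplotypes, ACGTA and ATGCA, differ at two sites, and `.` marks positions a read does not cover. Every column is assumed to be covered by at least one read.

```python
from collections import Counter


def column_consensus(aligned_reads, uncovered="."):
    """Call each column independently by majority vote over covering reads."""
    width = len(aligned_reads[0])
    consensus = []
    for col in range(width):
        bases = [read[col] for read in aligned_reads if read[col] != uncovered]
        consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


# Invented reads: three from haplotype ACGTA, two from haplotype ATGCA.
reads = [
    "ACG..",  # haplotype ACGTA
    "ACG..",  # haplotype ACGTA
    "ATGCA",  # haplotype ATGCA
    "..GCA",  # haplotype ATGCA
    "..GTA",  # haplotype ACGTA
]
print(column_consensus(reads))
```

Here the vote picks C at the second column and C at the fourth, giving ACGCA, a mosaic that matches neither haplotype and that no read spanning both sites supports. Processing the reads that span a variant region together, as the abstract describes, is one way to avoid this.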
Affiliation(s)
- Gennady Denisov
- J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA.
|
35
|
Affiliation(s)
- Haixu Tang
- School of Informatics, Center for Genomics and Bioinformatics, Indiana University, Bloomington, Indiana 47408, USA.
|
36
|
Sugino H. Comparative genomic analysis of the mouse and rat amylase multigene family. FEBS Lett 2007; 581:355-60. [PMID: 17223109 DOI: 10.1016/j.febslet.2006.12.039] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2006] [Revised: 12/21/2006] [Accepted: 12/22/2006] [Indexed: 11/21/2022]
Abstract
The rat and mouse amylase gene families were characterized using sequence data from the UCSC genome assembly. We found that the rat genome contains one amylase-1 and two amylase-2 genes, lying close to one another on the same chromosome. Detailed analysis revealed at least six additional amylase pseudogenes in the rat genome in the region adjacent to the amylase-2 genes. In contrast, the mouse has one amylase-1 gene and five amylase-2 genes; the latter are tandemly and systematically arranged on the same chromosome and were generated by segmental duplication. Detailed analysis revealed that the mouse has two amylase pseudogenes, located 5' to the five amylase-2 segments. Thus, the amylase genes of mouse and rat tend to be amplified; the sequences of some of them are fixed while others have become pseudogenes during evolution. This is the second report of amylase genomic organization in mammals and the first in the rodents.
Affiliation(s)
- Hidehiko Sugino
- Laboratories for Integrated Biology, Graduate School of Frontier Biosciences, Osaka University, 1-3 Suita, Osaka 565-0871, Japan.
|
37
|
Warren RL, Varabei D, Platt D, Huang X, Messina D, Yang SP, Kronstad JW, Krzywinski M, Warren WC, Wallis JW, Hillier LW, Chinwalla AT, Schein JE, Siddiqui AS, Marra MA, Wilson RK, Jones SJ. Physical map-assisted whole-genome shotgun sequence assemblies. Genome Res 2006; 16:768-75. [PMID: 16741162 PMCID: PMC1473187 DOI: 10.1101/gr.5090606] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2005] [Accepted: 04/11/2006] [Indexed: 01/15/2023]
Abstract
We describe a targeted approach to improve the contiguity of whole-genome shotgun sequence (WGS) assemblies at run-time, using information from Bacterial Artificial Chromosome (BAC)-based physical maps. Clone sizes and overlaps derived from clone fingerprints are used for the calculation of length constraints between any two BAC neighbors sharing 40% of their size. These constraints are used to promote the linkage and guide the arrangement of sequence contigs within a sequence scaffold at the layout phase of WGS assemblies. This process is facilitated by FASSI, a stand-alone application that calculates BAC end and BAC overlap length constraints from clone fingerprint map contigs created by the FPC package. FASSI is designed to work with the assembly tool PCAP, but its output can be formatted to work with other WGS assembly algorithms able to use length constraints for individual clones. The FASSI method is simple to implement, potentially cost-effective, and has resulted in the increase of scaffold contiguity for both the Drosophila melanogaster and Cryptococcus gattii genomes when compared to a control assembly without map-derived constraints. A 6.5-fold coverage draft DNA sequence of the Pan troglodytes (chimpanzee) genome was assembled using map-derived constraints and resulted in a 26.1% increase in scaffold contiguity.
Affiliation(s)
- René L. Warren
- British Columbia Cancer Agency, Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6, Canada
- Dmitry Varabei
- British Columbia Cancer Agency, Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6, Canada
- Darren Platt
- U.S. Department of Energy, Joint Genome Institute, Walnut Creek, California 94598, USA
- Xiaoqiu Huang
- Department of Computer Science, Iowa State University, Ames, Iowa 50011-1040, USA
- David Messina
- Washington University School of Medicine, Genome Sequencing Center, St. Louis, Missouri 63108, USA
- Shiaw-Pyng Yang
- Washington University School of Medicine, Genome Sequencing Center, St. Louis, Missouri 63108, USA
- James W. Kronstad
- The Michael Smith Laboratories, Department of Microbiology and Immunology, The University of British Columbia, Vancouver, British Columbia V6T 2Z4, Canada
- Martin Krzywinski
- British Columbia Cancer Agency, Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6, Canada
- Wesley C. Warren
- Washington University School of Medicine, Genome Sequencing Center, St. Louis, Missouri 63108, USA
- John W. Wallis
- Washington University School of Medicine, Genome Sequencing Center, St. Louis, Missouri 63108, USA
- LaDeana W. Hillier
- Washington University School of Medicine, Genome Sequencing Center, St. Louis, Missouri 63108, USA
- Asif T. Chinwalla
- Washington University School of Medicine, Genome Sequencing Center, St. Louis, Missouri 63108, USA
- Jacqueline E. Schein
- British Columbia Cancer Agency, Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6, Canada
- Asim S. Siddiqui
- British Columbia Cancer Agency, Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6, Canada
- Marco A. Marra
- British Columbia Cancer Agency, Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6, Canada
- Richard K. Wilson
- Washington University School of Medicine, Genome Sequencing Center, St. Louis, Missouri 63108, USA
- Steven J.M. Jones
- British Columbia Cancer Agency, Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6, Canada
|
38
|
Loraine AE, Helt GA, Cline MS, Siani-Rose MA. Exploring alternative transcript structure in the human genome using blocks and InterPro. J Bioinform Comput Biol 2005; 1:289-306. [PMID: 15290774 DOI: 10.1142/s0219720003000113] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2002] [Revised: 12/07/2002] [Accepted: 01/15/2003] [Indexed: 11/18/2022]
Abstract
Understanding how alternative splicing affects gene function is an important challenge facing modern-day molecular biology. Using homology-based, protein sequence analysis methods, it should be possible to investigate how transcript diversity impacts protein function. To test this, high-quality exon-intron structures were deduced for over 8000 human genes, including over 1300 (17 percent) that produce multiple transcript variants. A data mining technique (DiffMotif) was developed to identify genes in which transcript variation coincides with changes in conserved motifs between variants. Applying this method, we found that 30 percent of the multi-variant genes in our test set exhibited a differential profile of conserved InterPro and/or BLOCKS motifs across different mRNA variants. To investigate these, a visualization tool (ProtAnnot) that displays amino acid motifs in the context of genomic sequence was developed. Using this tool, genes revealed by the DiffMotif method were analyzed, and when possible, hypotheses regarding the potential role of alternative transcript structure in modulating gene function were developed. Examples of these, including: MEOX1, a homeobox-containing protein; AIRE, involved in auto-immune disease; PLAT, tissue type plasminogen activator; and CD79b, a component of the B-cell receptor complex, are presented. These results demonstrate that amino acid motif databases like BLOCKS and InterPro are useful tools for investigating how alternative transcript structure affects gene function.
Affiliation(s)
- Ann E Loraine
- Bioinformatics Department, Affymetrix, 6550 Vallejo St, Emeryville, CA 94530 USA.
|
39
|
Abdel-Rahman MH, Craig EL, Davidorf FH, Eng C. Expression of Vascular Endothelial Growth Factor in Uveal Melanoma Is Independent of 6p21-Region Copy Number. Clin Cancer Res 2005. [DOI: 10.1158/1078-0432.73.11.1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Purpose: Overexpression of vascular endothelial growth factor (VEGF) and overrepresentation of the 6p region have been reported with a wide variation in uveal melanoma. The aim of the current study is to identify the frequency of copy number alteration in the 6p21 region and its correlation with the expression of VEGF in uveal melanoma.
Experimental Design: We studied 88 uveal melanomas for copy number change in the 6p region by comparative genomic hybridization and/or chromogenic in situ hybridization. Expression of VEGF protein was estimated by immunohistochemistry. In 15 tumors, VEGF mRNA expression was also studied by quantitative reverse transcription–PCR (RT-PCR) and VEGF splice variants were detected by RT-PCR.
Results: Copy number of the 6p21 region was successfully estimated in 37 tumors. In 10 (27%) of those, overrepresentation of the 6p21 region was detected. There was no statistically significant difference in VEGF expression between tumors with and without gain of 6p21 (P = 0.82). VEGF expression was not confined to the tumors and was also detected in the surrounding normal tissue. Expression of VEGF, detected by quantitative RT-PCR, was concordant with expression of VEGF protein. Different VEGF isoforms were expressed in different tumors with no obvious correlation with disease status.
Conclusion: VEGF is overexpressed in a significant number of uveal melanomas. It should be noted that VEGF is not a candidate oncogene in uveal melanoma with 6p gain/amplification. VEGF overexpression other than structural amplification is probably significant in the pathogenesis of uveal melanomas, and its mechanism must be sought.
Affiliation(s)
- Mohamed H. Abdel-Rahman
- 1Department of Ophthalmology and
- 2Clinical Cancer Genetics Program, Human Cancer Genetics Program, Comprehensive Cancer Center; Division of Human Genetics, Department of Internal Medicine, The Ohio State University, Columbus, Ohio
| | | | | | - Charis Eng
- 2Clinical Cancer Genetics Program, Human Cancer Genetics Program, Comprehensive Cancer Center; Division of Human Genetics, Department of Internal Medicine, The Ohio State University, Columbus, Ohio
40
Gladyshev VN, Kryukov GV, Fomenko DE, Hatfield DL. Identification of trace element–containing proteins in genomic databases. Annu Rev Nutr 2004; 24:579-96. [PMID: 15189132 DOI: 10.1146/annurev.nutr.24.012003.132241] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.4] [Indexed: 11/09/2022]
Abstract
Development of bioinformatics tools provided researchers with the ability to identify full sets of trace element-containing proteins in organisms for which complete genomic sequences are available. Recently, independent bioinformatics methods were used to identify all, or almost all, genes encoding selenocysteine-containing proteins in human, mouse, and Drosophila genomes, characterizing entire selenoproteomes in these organisms. It also should be possible to search for entire sets of other trace element-associated proteins, such as metal-containing proteins, although methods for their identification are still in development.
Affiliation(s)
- Vadim N Gladyshev
- Department of Biochemistry, University of Nebraska, Lincoln, Nebraska 68588-0664, USA.
41
Krzywinski M, Bosdet I, Smailus D, Chiu R, Mathewson C, Wye N, Barber S, Brown-John M, Chan S, Chand S, Cloutier A, Girn N, Lee D, Masson A, Mayo M, Olson T, Pandoh P, Prabhu AL, Schoenmakers E, Tsai M, Albertson D, Lam W, Choy CO, Osoegawa K, Zhao S, de Jong PJ, Schein J, Jones S, Marra MA. A set of BAC clones spanning the human genome. Nucleic Acids Res 2004; 32:3651-60. [PMID: 15247347 PMCID: PMC484185 DOI: 10.1093/nar/gkh700] [Citation(s) in RCA: 101] [Impact Index Per Article: 4.8] [Received: 05/21/2004] [Revised: 06/22/2004] [Accepted: 06/22/2004] [Indexed: 11/15/2022] Open
Abstract
Using the human bacterial artificial chromosome (BAC) fingerprint-based physical map, genome sequence assembly and BAC end sequences, we have generated a fingerprint-validated set of 32 855 BAC clones spanning the human genome. The clone set provides coverage for at least 98% of the human fingerprint map, 99% of the current assembled sequence and has an effective resolving power of 79 kb. We have made the clone set publicly available, anticipating that it will generally facilitate FISH or array-CGH-based identification and characterization of chromosomal alterations relevant to disease.
Affiliation(s)
- Martin Krzywinski
- BC Cancer Agency Genome Sciences Center and BC Cancer Agency, Vancouver, BC V5Z 4E6, Canada
42
Havlak P, Chen R, Durbin KJ, Egan A, Ren Y, Song XZ, Weinstock GM, Gibbs RA. The Atlas genome assembly system. Genome Res 2004; 14:721-32. [PMID: 15060016 PMCID: PMC383319 DOI: 10.1101/gr.2264004] [Citation(s) in RCA: 101] [Impact Index Per Article: 4.8] [Indexed: 11/24/2022]
Abstract
Atlas is a suite of programs developed for assembly of genomes by a "combined approach" that uses DNA sequence reads from both BACs and whole-genome shotgun (WGS) libraries. The BAC clones afford advantages of localized assembly with reduced computational load, and provide a robust method for dealing with repeated sequences. Inclusion of WGS sequences facilitates use of different clone insert sizes and reduces data production costs. A core function of Atlas software is recruitment of WGS sequences into appropriate BACs based on sequence overlaps. Because construction of consensus sequences is from local assembly of these reads, only small (<0.1%) units of the genome are assembled at a time. Once assembled, each BAC is used to derive a genomic layout. This "sequence-based" growth of the genome map has greater precision than with non-sequence-based methods. Use of BACs allows correction of artifacts due to repeats at each stage of the process. This is aided by ancillary data such as BAC fingerprint, other genomic maps, and syntenic relations with other genomes. Atlas was used to assemble a draft DNA sequence of the rat genome; its major components including overlapper and split-scaffold are also being used in pure WGS projects.
Affiliation(s)
- Paul Havlak
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
43
Imanishi T, Itoh T, Suzuki Y, O'Donovan C, Fukuchi S, Koyanagi KO, Barrero RA, Tamura T, Yamaguchi-Kabata Y, Tanino M, Yura K, Miyazaki S, Ikeo K, Homma K, Kasprzyk A, Nishikawa T, Hirakawa M, Thierry-Mieg J, Thierry-Mieg D, Ashurst J, Jia L, Nakao M, Thomas MA, Mulder N, Karavidopoulou Y, Jin L, Kim S, Yasuda T, Lenhard B, Eveno E, Suzuki Y, Yamasaki C, Takeda JI, Gough C, Hilton P, Fujii Y, Sakai H, Tanaka S, Amid C, Bellgard M, Bonaldo MDF, Bono H, Bromberg SK, Brookes AJ, Bruford E, Carninci P, Chelala C, Couillault C, de Souza SJ, Debily MA, Devignes MD, Dubchak I, Endo T, Estreicher A, Eyras E, Fukami-Kobayashi K, Gopinath GR, Graudens E, Hahn Y, Han M, Han ZG, Hanada K, Hanaoka H, Harada E, Hashimoto K, Hinz U, Hirai M, Hishiki T, Hopkinson I, Imbeaud S, Inoko H, Kanapin A, Kaneko Y, Kasukawa T, Kelso J, Kersey P, Kikuno R, Kimura K, Korn B, Kuryshev V, Makalowska I, Makino T, Mano S, Mariage-Samson R, Mashima J, Matsuda H, Mewes HW, Minoshima S, Nagai K, Nagasaki H, Nagata N, Nigam R, Ogasawara O, Ohara O, Ohtsubo M, Okada N, Okido T, Oota S, Ota M, Ota T, Otsuki T, Piatier-Tonneau D, Poustka A, Ren SX, Saitou N, Sakai K, Sakamoto S, Sakate R, Schupp I, Servant F, Sherry S, Shiba R, Shimizu N, Shimoyama M, Simpson AJ, Soares B, Steward C, Suwa M, Suzuki M, Takahashi A, Tamiya G, Tanaka H, Taylor T, Terwilliger JD, Unneberg P, Veeramachaneni V, Watanabe S, Wilming L, Yasuda N, Yoo HS, Stodolsky M, Makalowski W, Go M, Nakai K, Takagi T, Kanehisa M, Sakaki Y, Quackenbush J, Okazaki Y, Hayashizaki Y, Hide W, Chakraborty R, Nishikawa K, Sugawara H, Tateno Y, Chen Z, Oishi M, Tonellato P, Apweiler R, Okubo K, Wagner L, Wiemann S, Strausberg RL, Isogai T, Auffray C, Nomura N, Gojobori T, Sugano S. Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol 2004; 2:e162. [PMID: 15103394 PMCID: PMC393292 DOI: 10.1371/journal.pbio.0020162] [Citation(s) in RCA: 234] [Impact Index Per Article: 11.1] [Received: 12/19/2003] [Accepted: 04/01/2004] [Indexed: 01/08/2023] Open
Abstract
The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. 
In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.
Affiliation(s)
- Tadashi Imanishi
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
| | - Takeshi Itoh
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 2Bioinformatics Laboratory, Genome Research Department, National Institute of Agrobiological SciencesIbarakiJapan
| | - Yutaka Suzuki
- 3Human Genome Center, The Institute of Medical Science, The University of TokyoTokyoJapan
- 68Department of Medical Genome Sciences, Graduate School of Frontier Sciences, University of TokyoTokyoJapan
| | - Claire O'Donovan
- 4EMBL Outstation—European Bioinformatics Institute, Wellcome Trust Genome CampusCambridgeUnited Kingdom
| | - Satoshi Fukuchi
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | | | - Roberto A Barrero
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | - Takuro Tamura
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
- 8BITS CompanyShizuokaJapan
| | - Yumi Yamaguchi-Kabata
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
| | - Motohiko Tanino
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
| | - Kei Yura
- 9Quantum Bioinformatics Group, Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research InstituteKyotoJapan
| | - Satoru Miyazaki
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | - Kazuho Ikeo
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | - Keiichi Homma
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | - Arek Kasprzyk
- 4EMBL Outstation—European Bioinformatics Institute, Wellcome Trust Genome CampusCambridgeUnited Kingdom
| | - Tetsuo Nishikawa
- 10Reverse Proteomics Research InstituteChibaJapan
- 11Central Research Laboratory, HitachiTokyoJapan
| | - Mika Hirakawa
- 12Bioinformatics Center, Institute for Chemical Research, Kyoto UniversityKyotoJapan
| | - Jean Thierry-Mieg
- 13National Center for Biotechnology Information, National Library of Medicine, National Institutes of HealthBethesda, MarylandUnited States of America
- 14Centre National de la Recherche Scientifique (CNRS), Laboratoire de Physique MathematiqueMontpellierFrance
| | - Danielle Thierry-Mieg
- 13National Center for Biotechnology Information, National Library of Medicine, National Institutes of HealthBethesda, MarylandUnited States of America
- 14Centre National de la Recherche Scientifique (CNRS), Laboratoire de Physique MathematiqueMontpellierFrance
| | - Jennifer Ashurst
- 15The Wellcome Trust Sanger Institute, Wellcome Trust Genome CampusCambridgeUnited Kingdom
| | - Libin Jia
- 16National Cancer Institute, National Institutes of HealthBethesda, MarylandUnited States of America
| | - Mitsuteru Nakao
- 3Human Genome Center, The Institute of Medical Science, The University of TokyoTokyoJapan
| | - Michael A Thomas
- 17Department of Biological Sciences, Idaho State UniversityPocatello, IdahoUnited States of America
| | - Nicola Mulder
- 4EMBL Outstation—European Bioinformatics Institute, Wellcome Trust Genome CampusCambridgeUnited Kingdom
| | - Youla Karavidopoulou
- 4EMBL Outstation—European Bioinformatics Institute, Wellcome Trust Genome CampusCambridgeUnited Kingdom
| | - Lihua Jin
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | - Sangsoo Kim
- 18Korea Research Institute of Bioscience and BiotechnologyTaejeonKorea
| | | | - Boris Lenhard
- 19Center for Genomics and Bioinformatics, Karolinska InstitutetStockholmSweden
| | - Eric Eveno
- 20Genexpress—CNRS—Functional Genomics and Systemic Biology for HealthVillejuif CedexFrance
- 21Sino-French Laboratory in Life Sciences and GenomicsShanghaiChina
| | - Yoshiyuki Suzuki
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | - Chisato Yamasaki
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
| | - Jun-ichi Takeda
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
| | - Craig Gough
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
| | - Phillip Hilton
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
| | - Yasuyuki Fujii
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
| | - Hiroaki Sakai
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
- 22Tokyo Research Laboratories, Kyowa Hakko Kogyo CompanyTokyoJapan
| | - Susumu Tanaka
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
| | - Clara Amid
- 23MIPS—Institute for Bioinformatics, GSF—National Research Center for Environment and HealthNeuherbergGermany
| | - Matthew Bellgard
- 24Centre for Bioinformatics and Biological Computing, School of Information Technology, Murdoch UniversityMurdoch, Western AustraliaAustralia
| | - Maria de Fatima Bonaldo
- 25Medical Education and Biomedical Research Facility, University of IowaIowa City, IowaUnited States of America
| | - Hidemasa Bono
- 26Genome Exploration Research Group, RIKEN Genomic Sciences Center, RIKEN Yokohama InstituteKanagawaJapan
| | - Susan K Bromberg
- 27Medical College of Wisconsin, MilwaukeeWisconsinUnited States of America
| | - Anthony J Brookes
- 19Center for Genomics and Bioinformatics, Karolinska InstitutetStockholmSweden
| | - Elspeth Bruford
- 28HUGO Gene Nomenclature Committee, University College LondonLondonUnited Kingdom
| | | | - Claude Chelala
- 20Genexpress—CNRS—Functional Genomics and Systemic Biology for HealthVillejuif CedexFrance
| | - Christine Couillault
- 20Genexpress—CNRS—Functional Genomics and Systemic Biology for HealthVillejuif CedexFrance
- 21Sino-French Laboratory in Life Sciences and GenomicsShanghaiChina
| | | | - Marie-Anne Debily
- 20Genexpress—CNRS—Functional Genomics and Systemic Biology for HealthVillejuif CedexFrance
| | | | - Inna Dubchak
- 32Lawrence Berkeley National Laboratory, BerkeleyCaliforniaUnited States of America
| | - Toshinori Endo
- 33Department of Bioinformatics, Medical Research Institute, Tokyo Medical and Dental UniversityTokyoJapan
| | | | - Eduardo Eyras
- 15The Wellcome Trust Sanger Institute, Wellcome Trust Genome CampusCambridgeUnited Kingdom
| | - Kaoru Fukami-Kobayashi
- 35Bioresource Information Division, RIKEN BioResource Center, RIKEN Tsukuba InstituteIbarakiJapan
| | - Gopal R. Gopinath
- 36Genome Knowledgebase, Cold Spring Harbor LaboratoryCold Spring Harbor, New YorkUnited States of America
| | - Esther Graudens
- 20Genexpress—CNRS—Functional Genomics and Systemic Biology for HealthVillejuif CedexFrance
- 21Sino-French Laboratory in Life Sciences and GenomicsShanghaiChina
| | - Yoonsoo Hahn
- 18Korea Research Institute of Bioscience and BiotechnologyTaejeonKorea
| | - Michael Han
- 23MIPS—Institute for Bioinformatics, GSF—National Research Center for Environment and HealthNeuherbergGermany
| | - Ze-Guang Han
- 21Sino-French Laboratory in Life Sciences and GenomicsShanghaiChina
- 37Chinese National Human Genome Center at ShanghaiShanghaiChina
| | - Kousuke Hanada
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | - Hideki Hanaoka
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
| | - Erimi Harada
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
| | - Katsuyuki Hashimoto
- 38Division of Genetic Resources, National Institute of Infectious DiseasesTokyoJapan
| | - Ursula Hinz
- 34Swiss Institute of BioinformaticsGenevaSwitzerland
| | - Momoki Hirai
- 39Graduate School of Frontier Sciences, Department of Integrated Biosciences, University of TokyoChibaJapan
| | - Teruyoshi Hishiki
- 40Functional Genomics Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
| | - Ian Hopkinson
- 41Department of Primary Care and Population Sciences, Royal Free University College Medical School, University College LondonLondonUnited Kingdom
- 42Clinical and Molecular Genetics Unit, The Institute of Child HealthLondonUnited Kingdom
| | - Sandrine Imbeaud
- 20Genexpress—CNRS—Functional Genomics and Systemic Biology for HealthVillejuif CedexFrance
- 21Sino-French Laboratory in Life Sciences and GenomicsShanghaiChina
| | - Hidetoshi Inoko
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
- 43Department of Genetic Information, Division of Molecular Life Science, School of Medicine, Tokai UniversityKanagawaJapan
| | - Alexander Kanapin
- 4EMBL Outstation—European Bioinformatics Institute, Wellcome Trust Genome CampusCambridgeUnited Kingdom
| | - Yayoi Kaneko
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
| | - Takeya Kasukawa
- 26Genome Exploration Research Group, RIKEN Genomic Sciences Center, RIKEN Yokohama InstituteKanagawaJapan
| | - Janet Kelso
- 44South African National Bioinformatics Institute, University of the Western CapeBellvilleSouth Africa
| | - Paul Kersey
- 4EMBL Outstation—European Bioinformatics Institute, Wellcome Trust Genome CampusCambridgeUnited Kingdom
| | | | | | - Bernhard Korn
- 46RZPD Resource Center for Genome ResearchHeidelbergGermany
| | - Vladimir Kuryshev
- 47Molecular Genome Analysis, German Cancer Research Center-DKFZHeidelbergGermany
| | - Izabela Makalowska
- 48Pennsylvania State UniversityUniversity Park, PennsylvaniaUnited States of America
| | - Takashi Makino
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | - Shuhei Mano
- 43Department of Genetic Information, Division of Molecular Life Science, School of Medicine, Tokai UniversityKanagawaJapan
| | - Regine Mariage-Samson
- 20Genexpress—CNRS—Functional Genomics and Systemic Biology for HealthVillejuif CedexFrance
| | - Jun Mashima
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | - Hideo Matsuda
- 49Department of Bioinformatic Engineering, Graduate School of Information Science and Technology, Osaka UniversityOsakaJapan
| | - Hans-Werner Mewes
- 23MIPS—Institute for Bioinformatics, GSF—National Research Center for Environment and HealthNeuherbergGermany
| | - Shinsei Minoshima
- 50Medical Photobiology Department, Photon Medical Research Center, Hamamatsu University School of MedicineShizuokaJapan
- 52Department of Molecular Biology, Keio University School of MedicineTokyoJapan
| | | | - Hideki Nagasaki
- 51Computational Biology Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
| | - Naoki Nagata
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
| | - Rajni Nigam
- 27Medical College of Wisconsin, MilwaukeeWisconsinUnited States of America
| | - Osamu Ogasawara
- 3Human Genome Center, The Institute of Medical Science, The University of TokyoTokyoJapan
| | | | - Masafumi Ohtsubo
- 52Department of Molecular Biology, Keio University School of MedicineTokyoJapan
| | - Norihiro Okada
- 53Department of Biological Sciences, Graduate School of Bioscience and Biotechnology, Tokyo Institute of TechnologyKanagawaJapan
| | - Toshihisa Okido
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | - Satoshi Oota
- 35Bioresource Information Division, RIKEN BioResource Center, RIKEN Tsukuba InstituteIbarakiJapan
| | - Motonori Ota
- 54Global Scientific Information and Computing Center, Tokyo Institute of TechnologyTokyoJapan
| | - Toshio Ota
- 22Tokyo Research Laboratories, Kyowa Hakko Kogyo CompanyTokyoJapan
| | - Tetsuji Otsuki
- 55Molecular Biology Laboratory, Medicinal Research Laboratories, Taisho Pharmaceutical CompanySaitamaJapan
| | | | - Annemarie Poustka
- 47Molecular Genome Analysis, German Cancer Research Center-DKFZHeidelbergGermany
| | - Shuang-Xi Ren
- 21Sino-French Laboratory in Life Sciences and GenomicsShanghaiChina
- 37Chinese National Human Genome Center at ShanghaiShanghaiChina
| | - Naruya Saitou
- 56Department of Population Genetics, National Institute of GeneticsShizuokaJapan
| | - Katsunaga Sakai
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | - Shigetaka Sakamoto
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | - Ryuichi Sakate
- 39Graduate School of Frontier Sciences, Department of Integrated Biosciences, University of TokyoChibaJapan
| | - Ingo Schupp
- 47Molecular Genome Analysis, German Cancer Research Center-DKFZHeidelbergGermany
| | - Florence Servant
- 4EMBL Outstation—European Bioinformatics Institute, Wellcome Trust Genome CampusCambridgeUnited Kingdom
| | - Stephen Sherry
- 13National Center for Biotechnology Information, National Library of Medicine, National Institutes of HealthBethesda, MarylandUnited States of America
| | - Rie Shiba
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
| | - Nobuyoshi Shimizu
- 52Department of Molecular Biology, Keio University School of MedicineTokyoJapan
| | - Mary Shimoyama
- 27Medical College of Wisconsin, MilwaukeeWisconsinUnited States of America
| | | | - Bento Soares
- 25Medical Education and Biomedical Research Facility, University of IowaIowa City, IowaUnited States of America
| | - Charles Steward
- 15The Wellcome Trust Sanger Institute, Wellcome Trust Genome CampusCambridgeUnited Kingdom
| | - Makiko Suwa
- 51Computational Biology Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
| | - Mami Suzuki
- 5Center for Information Biology and DNA Data Bank of Japan, National Institute of GeneticsShizuokaJapan
| | - Aiko Takahashi
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
| | - Gen Tamiya
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
- 43Department of Genetic Information, Division of Molecular Life Science, School of Medicine, Tokai UniversityKanagawaJapan
| | - Hiroshi Tanaka
- 33Department of Bioinformatics, Medical Research Institute, Tokyo Medical and Dental UniversityTokyoJapan
| | - Todd Taylor
- 57Human Genome Research Group, Genomic Sciences Center, RIKEN Yokohama InstituteKanagawaJapan
| | - Joseph D Terwilliger
- 58Columbia University and Columbia Genome CenterNew York, New YorkUnited States of America
| | - Per Unneberg
- 59Department of Biotechnology, Royal Institute of TechnologyStockholmSweden
| | - Vamsi Veeramachaneni
- 48Pennsylvania State UniversityUniversity Park, PennsylvaniaUnited States of America
| | - Shinya Watanabe
- 3Human Genome Center, The Institute of Medical Science, The University of TokyoTokyoJapan
| | - Laurens Wilming
- 15The Wellcome Trust Sanger Institute, Wellcome Trust Genome CampusCambridgeUnited Kingdom
| | - Norikazu Yasuda
- 1Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and TechnologyTokyoJapan
- 7Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics ConsortiumTokyoJapan
| | - Hyang-Sook Yoo
- 18Korea Research Institute of Bioscience and BiotechnologyTaejeonKorea
| | - Marvin Stodolsky
- 60Biology Division and Genome Task Group, Office of Biological and Environmental Research, United States Department of EnergyWashington, D.CUnited States of America
| | - Wojciech Makalowski
- 48Pennsylvania State UniversityUniversity Park, PennsylvaniaUnited States of America
| | - Mitiko Go
- 61Faculty of Bio-Science, Nagahama Institute of Bio-Science and TechnologyShigaJapan
| | - Kenta Nakai
- 3Human Genome Center, The Institute of Medical Science, The University of TokyoTokyoJapan
| | - Toshihisa Takagi
- 3Human Genome Center, The Institute of Medical Science, The University of TokyoTokyoJapan
| | - Minoru Kanehisa
- 12Bioinformatics Center, Institute for Chemical Research, Kyoto UniversityKyotoJapan
| | - Yoshiyuki Sakaki
- 3Human Genome Center, The Institute of Medical Science, The University of TokyoTokyoJapan
  - [57] Human Genome Research Group, Genomic Sciences Center, RIKEN Yokohama Institute, Kanagawa, Japan
- John Quackenbush
  - [62] Institute for Genomic Research, Rockville, Maryland, United States of America
- Yasushi Okazaki
  - [26] Genome Exploration Research Group, RIKEN Genomic Sciences Center, RIKEN Yokohama Institute, Kanagawa, Japan
- Yoshihide Hayashizaki
  - [26] Genome Exploration Research Group, RIKEN Genomic Sciences Center, RIKEN Yokohama Institute, Kanagawa, Japan
- Winston Hide
  - [44] South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa
- Ranajit Chakraborty
  - [63] Center for Genome Information, Department of Environmental Health, University of Cincinnati, Cincinnati, Ohio, United States of America
- Ken Nishikawa
  - [5] Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Shizuoka, Japan
- Hideaki Sugawara
  - [5] Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Shizuoka, Japan
- Yoshio Tateno
  - [5] Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Shizuoka, Japan
- Zhu Chen
  - [21] Sino-French Laboratory in Life Sciences and Genomics, Shanghai, China
  - [37] Chinese National Human Genome Center at Shanghai, Shanghai, China
  - [64] State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, Rui-Jin Hospital, Shanghai Second Medical University, Shanghai, China
- Peter Tonellato
  - [65] PointOne Systems, Wauwatosa, Wisconsin, United States of America
- Rolf Apweiler
  - [4] EMBL Outstation—European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, United Kingdom
- Kousaku Okubo
  - [5] Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Shizuoka, Japan
  - [40] Functional Genomics Group, Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
- Lukas Wagner
  - [13] National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Stefan Wiemann
  - [47] Molecular Genome Analysis, German Cancer Research Center-DKFZ, Heidelberg, Germany
- Robert L Strausberg
  - [16] National Cancer Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Takao Isogai
  - [10] Reverse Proteomics Research Institute, Chiba, Japan
  - [66] Graduate School of Life and Environmental Sciences, University of Tsukuba, Ibaraki, Japan
- Charles Auffray
  - [20] Genexpress—CNRS—Functional Genomics and Systemic Biology for Health, Villejuif Cedex, France
  - [21] Sino-French Laboratory in Life Sciences and Genomics, Shanghai, China
- Nobuo Nomura
  - [40] Functional Genomics Group, Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
- Takashi Gojobori
  - [1] Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
  - [5] Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Shizuoka, Japan
  - [67] Department of Genetics, Graduate University for Advanced Studies, Shizuoka, Japan
- Sumio Sugano
  - [3] Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
  - [40] Functional Genomics Group, Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
  - [68] Department of Medical Genome Sciences, Graduate School of Frontier Sciences, University of Tokyo, Tokyo, Japan
|
44
|
Abstract
The output of a genome assembler generally comprises a collection of contiguous DNA sequences (contigs) whose relative placement along the genome is not defined. A procedure called scaffolding is commonly used to order and orient these contigs using paired read information. This ordering of contigs is an essential step when finishing and analyzing the data from a whole-genome shotgun project. Most recent assemblers include a scaffolding module; however, users have little control over the scaffolding algorithm or the information produced. We thus developed a general-purpose scaffolder, called Bambus, which affords users significant flexibility in controlling the scaffolding parameters. Bambus was used recently to scaffold the low-coverage draft dog genome data. Most significantly, Bambus enables the use of linking data other than that inferred from mate-pair information. For example, the sequence of a completed genome can be used to guide the scaffolding of a related organism. We present several applications of Bambus: support for finishing, comparative genomics, analysis of the haplotype structure of genomes, and scaffolding of a mammalian genome at low coverage. Bambus is available as an open-source package from our Web site.
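To make the scaffolding idea in this abstract concrete, here is a minimal Python sketch. It is not Bambus's actual algorithm. It assumes the mate pairs have already been reduced to pairs of contig names, bundles them into links, drops links supported by fewer than a hypothetical `min_support` pairs, and walks the remaining chain. The function names, the threshold, and the toy data are illustrative assumptions; orientation and gap-size estimation, which a real scaffolder must handle, are ignored.

```python
from collections import Counter

def bundle_links(mate_pairs, min_support=2):
    """Count mate pairs joining each contig pair; keep well-supported links.

    `min_support` is a hypothetical threshold chosen for illustration.
    """
    counts = Counter()
    for contig_a, contig_b in mate_pairs:
        if contig_a != contig_b:  # pairs inside one contig carry no ordering information
            counts[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

def chain_contigs(links):
    """Walk a simple path through the link graph, starting from an end contig."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    # An end contig has exactly one neighbour; sort for a deterministic start.
    ends = sorted(c for c, ns in neighbours.items() if len(ns) == 1)
    if not ends:
        return []
    order, seen = [ends[0]], {ends[0]}
    while True:
        candidates = sorted(neighbours[order[-1]] - seen)
        if not candidates:
            return order
        order.append(candidates[0])
        seen.add(candidates[0])

# Toy input: each tuple names the contigs hit by the two reads of one mate pair.
mate_pairs = [
    ("c1", "c2"), ("c1", "c2"),
    ("c2", "c3"), ("c2", "c3"), ("c2", "c3"),
    ("c1", "c3"),  # a single unsupported pair, filtered out below
    ("c1", "c1"),  # both reads in the same contig, ignored
]
links = bundle_links(mate_pairs)
scaffold = chain_contigs(links)
print(links)
print(scaffold)
```

Exposing the support threshold as a parameter mirrors the abstract's point that users benefit from control over scaffolding parameters, although the specific parameter here is invented for the example.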
Affiliation(s)
- Mihai Pop
  - The Institute for Genomic Research (TIGR), Rockville, Maryland 20850, USA
|
45
|
Zou A, Lin Z, Humble M, Creech CD, Wagoner PK, Krafte D, Jegla TJ, Wickenden AD. Distribution and functional properties of human KCNH8 (Elk1) potassium channels. Am J Physiol Cell Physiol 2003; 285:C1356-66. [PMID: 12890647 DOI: 10.1152/ajpcell.00179.2003] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.3] [Indexed: 11/22/2022]
Abstract
The Elk subfamily of the Eag K+ channel gene family is represented in mammals by three genes that are highly conserved between humans and rodents. Here we report the distribution and functional properties of a member of the human Elk K+ channel gene family, KCNH8. Quantitative RT-PCR analysis of mRNA expression patterns showed that KCNH8, along with the other Elk family genes, KCNH3 and KCNH4, is primarily expressed in the human nervous system. KCNH8 was expressed at high levels, and the distribution showed substantial overlap with KCNH3. In Xenopus oocytes, KCNH8 gives rise to slowly activating, voltage-dependent K+ currents that open at hyperpolarized potentials (half-maximal activation at -62 mV). Coexpression of KCNH8 with dominant-negative KCNH8, KCNH3, and KCNH4 subunits led to suppression of the KCNH8 currents, suggesting that Elk channels can form heteromultimers. Similar experiments imply that KCNH8 subunits are not able to form heteromultimers with Eag, Erg, or Kv family K+ channels.
Affiliation(s)
- Anruo Zou
  - Icagen, Inc., 4222 Emperor Blvd., Durham, NC 27703, USA
|
46
|
Abstract
We propose an assembly algorithm, Barnacle, for sequences generated by the clone-based approach. We illustrate our approach by assembling the human genome. Our novel method abandons the original physical-mapping-first framework. As we show, Barnacle more effectively resolves conflicts due to repeated sequences, which are the main difficulty of the sequence assembly problem. In addition, we are able to detect inconsistencies in the underlying data. We present our results on the December 2001 freeze of the public working draft of the human genome and compare them with NCBI's assembly (Build 28). The assembly of the December 2001 freeze of the public working draft generated by Barnacle and the source code of Barnacle are available at http://www.cs.rutgers.edu/~vchoi.
Affiliation(s)
- Vicky Choi
  - Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA
|
47
|
Ohshima K, Hattori M, Yada T, Gojobori T, Sakaki Y, Okada N. Whole-genome screening indicates a possible burst of formation of processed pseudogenes and Alu repeats by particular L1 subfamilies in ancestral primates. Genome Biol 2003; 4:R74. [PMID: 14611660 PMCID: PMC329124 DOI: 10.1186/gb-2003-4-11-r74] [Citation(s) in RCA: 135] [Impact Index Per Article: 6.1] [Received: 07/22/2003] [Revised: 09/02/2003] [Accepted: 09/25/2003] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND: Abundant pseudogenes are a feature of mammalian genomes. Processed pseudogenes (PPs) are reverse transcribed from mRNAs. Recent molecular biological studies show that mammalian long interspersed element 1 (L1)-encoded proteins may have been involved in PP reverse transcription. Here, we present the first comprehensive analysis of human PPs using all known human genes as queries.
RESULTS: The human genome was queried and 3,664 candidate PPs were identified. The most abundant were copies of genes encoding keratin 18, glyceraldehyde-3-phosphate dehydrogenase and ribosomal protein L21. A simple method was developed to estimate the level of nucleotide substitutions (and therefore the age) of PPs. A Poisson-like age distribution was obtained with a mean age close to that of the Alu repeats, the predominant human short interspersed elements. These data suggest a nearly simultaneous burst of PP and Alu formation in the genomes of ancestral primates. The peak period of amplification of these two distinct retrotransposons was estimated to be 40-50 million years ago. Concordant amplification of certain L1 subfamilies with PPs and Alus was observed.
CONCLUSIONS: We suggest that a burst of formation of PPs and Alus occurred in the genome of ancestral primates. One possible mechanism is that proteins encoded by members of particular L1 subfamilies acquired an enhanced ability to recognize cytosolic RNAs in trans.
Affiliation(s)
- Kazuhiko Ohshima
  - School and Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 4259 Nagatsuta-cho, Midori-ku, Yokohama, Kanagawa 226-8501, Japan
- Masahira Hattori
  - RIKEN Genomic Sciences Center, 1-7-22, Suehiro Tsurumi, Yokohama, Kanagawa 230-0045, Japan
  - Laboratory of Genome Information, Kitasato Institute for Life Science, Kitasato University, 1-15-1, Kitasato, Sagamihara, Kanagawa 228-8555, Japan
- Tetsusi Yada
  - Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
- Takashi Gojobori
  - Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Yata 1111, Mishima, Shizuoka 411-8540, Japan
- Yoshiyuki Sakaki
  - RIKEN Genomic Sciences Center, 1-7-22, Suehiro Tsurumi, Yokohama, Kanagawa 230-0045, Japan
  - Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
- Norihiro Okada
  - School and Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 4259 Nagatsuta-cho, Midori-ku, Yokohama, Kanagawa 226-8501, Japan
|
48
|
Kota R, Rudd S, Facius A, Kolesov G, Thiel T, Zhang H, Stein N, Mayer K, Graner A. Snipping polymorphisms from large EST collections in barley (Hordeum vulgare L.). Mol Genet Genomics 2003; 270:24-33. [PMID: 12938038 DOI: 10.1007/s00438-003-0891-6] [Citation(s) in RCA: 99] [Impact Index Per Article: 4.5] [Received: 04/23/2003] [Accepted: 06/01/2003] [Indexed: 11/27/2022]
Abstract
The public EST (expressed sequence tag) databases represent an enormous but heterogeneous repository of sequences, including many from a broad selection of plant species and a wide range of distinct varieties. The significant redundancy within large EST collections makes them an attractive resource for rapid pre-selection of candidate sequence polymorphisms. Here we present a strategy that allows rapid identification of candidate SNPs in barley (Hordeum vulgare L.) using publicly available EST databases. Analysis of 271,630 EST sequences from different cDNA libraries, representing 23 different barley varieties, resulted in the generation of 56,302 tentative consensus sequences. In all, 8171 of these unigene sequences are members of clusters with six or more ESTs. By applying a novel SNP detection algorithm (SNiPpER) to these sequences, we identified 3069 candidate inter-varietal SNPs. In order to verify these candidate SNPs, we selected a small subset of 63 present in 36 ESTs. Of the 63 SNPs selected, we were able to validate 54 (86%) using a direct sequencing approach. For further verification, 28 ESTs were mapped to distinct loci within the barley genome. The polymorphism information content (PIC) and nucleotide diversity (pi) values of the SNPs identified by the SNiPpER algorithm are significantly higher than those that were obtained by random sequencing. This demonstrates the efficiency of our strategy for SNP identification and the cost-efficient development of EST-based SNP-markers.
Affiliation(s)
- R Kota
  - Institute for Plant Genetics and Crop Plant Research (IPK), Corrensstr. 3, 06466 Gatersleben, Germany
|
49
|
Engler FW, Hatfield J, Nelson W, Soderlund CA. Locating sequence on FPC maps and selecting a minimal tiling path. Genome Res 2003; 13:2152-63. [PMID: 12915486 PMCID: PMC403717 DOI: 10.1101/gr.1068603] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.5] [Indexed: 11/24/2022]
Abstract
This study discusses three software tools: the first two aid in integrating sequence with an FPC physical map, and the third automatically selects a minimal tiling path given genomic draft sequence and BAC end sequences. The first tool, FSD (FPC Simulated Digest), takes a sequenced clone and adds it back to the map based on a fingerprint generated by an in silico digest of the clone. This allows verification of sequenced clone positions and the integration of sequenced clones that were not originally part of the FPC map. The second tool, BSS (Blast Some Sequence), takes a query sequence and positions it on the map based on sequence associated with the clones in the map. BSS has multiple uses as follows: (1) When the query is a file of marker sequences, they can be added as electronic markers. (2) When the query is draft sequence, the results of BSS can be used to close gaps in a sequenced clone or the physical map. (3) When the query is a sequenced clone and the target is BAC end sequences, one may select the next clone for sequencing using both sequence comparison results and map location. (4) When the query is whole-genome draft sequence and the target is BAC end sequences, the results can be used to select many clones for a minimal tiling path at once. The third tool, pickMTP, automates the majority of this last usage of BSS. Results are presented using the rice FPC map, BAC end sequences, and whole-genome shotgun from Syngenta.
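The in silico digest that FSD relies on can be illustrated with a short Python sketch. This is not the FSD implementation. It assumes exact matching of a single recognition site on one strand and returns the sorted fragment lengths as a stand-in for a fingerprint. The function name, the `cut_offset` parameter, and the toy clone sequence are hypothetical. The HindIII site (AAGCTT, cut after the first A) is used only as an example; the abstract does not name an enzyme.

```python
def in_silico_digest(sequence, site, cut_offset=0):
    """Return sorted fragment lengths from cutting `sequence` at each `site`.

    `cut_offset` is the position of the cut within the recognition site,
    counted from the site's first base (illustrative parameter).
    """
    sequence, site = sequence.upper(), site.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)  # step by one so overlapping sites are found
    bounds = [0] + cuts + [len(sequence)]
    # Differences between consecutive cut positions are the fragment lengths.
    return sorted(b - a for a, b in zip(bounds, bounds[1:]) if b > a)

# Toy clone sequence containing two AAGCTT sites (made up for this example).
clone = "AAAAGCTTCCCCCCAAGCTTGG"
fragments = in_silico_digest(clone, "AAGCTT", cut_offset=1)
print(fragments)
```

A simulated fingerprint of this kind could then be compared with the experimentally measured fragment sizes of clones already in the map, which is the role the abstract assigns to FSD.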
Affiliation(s)
- Friedrich W Engler
  - Arizona Genomics Computational Laboratory, University of Arizona, Tucson, Arizona 85721, USA
|
50
|
Huntley D, Hummerich H, Smedley D, Kittivoravitkul S, McCarthy M, Little P, Sergot M. GANESH: software for customized annotation of genome regions. Genome Res 2003; 13:2195-202. [PMID: 12952886 PMCID: PMC403729 DOI: 10.1101/gr.698103] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/25/2022]
Abstract
GANESH is a software package designed to support the genetic analysis of regions of human and other genomes. It provides a set of components that may be assembled to construct a self-updating database of DNA sequence, mapping data, and annotations of possible genome features. Once one or more remote sources of data for the target region have been identified, all sequences for that region are downloaded, assimilated, and subjected to a (configurable) set of standard database-searching and genome-analysis packages. The results are stored in compressed form in a relational database, and are updated automatically on a regular schedule so that they are always immediately available in their most up-to-date versions. A Java front-end, executed as a stand-alone application or web applet, provides a graphical interface for navigating the database and for viewing the annotations. There are facilities for importing and exporting data in the format of the Distributed Annotation System (DAS), enabling a GANESH database to be used as a component of a DAS configuration. The system has been used to construct databases for about a dozen regions of human chromosomes and for three regions of mouse chromosomes.
Affiliation(s)
- Derek Huntley
  - Department of Computing, Imperial College, London SW7 2AZ, UK
|