1
Fernández-Orth D, Rueda M, Singh B, Moldes M, Jene A, Ferri M, Vasallo C, Fromont LA, Navarro A, Rambla J. A quality control portal for sequencing data deposited at the European genome-phenome archive. Brief Bioinform 2022; 23:6570012. [PMID: 35438138 PMCID: PMC9116225 DOI: 10.1093/bib/bbac136] [Received: 12/21/2021] [Revised: 03/01/2022] [Accepted: 03/23/2022] [Indexed: 11/15/2022]
Abstract
Since its launch in 2008, the European Genome-Phenome Archive (EGA) has been leading the archiving and distribution of human identifiable genomic data. One community concern is the usability of the stored data, because data submitters are currently not required to perform any quality control (QC) before uploading their data and associated metadata. Here, we present a new File QC Portal developed at EGA, along with QC reports created for 1 694 442 files [Fastq, sequence alignment map (SAM)/binary alignment map (BAM)/CRAM and variant call format (VCF)] submitted to EGA. QC reports allow anonymous EGA users to view summary-level information about the files within a specific dataset, such as quality of reads, alignment quality, number and type of variants and other features. Researchers benefit from being able to assess the quality of data before the data access decision, which increases the reusability of the data (https://ega-archive.org/blog/data-upcycling-powered-by-ega/).
Affiliation(s)
- Dietmar Fernández-Orth
- European Genome-phenome Archive (EGA) in the Centre for Genomic Regulation (CRG), the Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona, 08003 Spain
- Manuel Rueda
- European Genome-phenome Archive (EGA) in the Centre for Genomic Regulation (CRG), the Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona, 08003 Spain
- Babita Singh
- European Genome-phenome Archive (EGA) in the Centre for Genomic Regulation (CRG), the Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona, 08003 Spain
- Mauricio Moldes
- European Genome-phenome Archive (EGA) in the Centre for Genomic Regulation (CRG), the Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona, 08003 Spain
- Aina Jene
- European Genome-phenome Archive (EGA) in the Centre for Genomic Regulation (CRG), the Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona, 08003 Spain
- Marta Ferri
- European Genome-phenome Archive (EGA) in the Centre for Genomic Regulation (CRG), the Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona, 08003 Spain
- Claudia Vasallo
- European Genome-phenome Archive (EGA) in the Centre for Genomic Regulation (CRG), the Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona, 08003 Spain
- Lauren A Fromont
- European Genome-phenome Archive (EGA) in the Centre for Genomic Regulation (CRG), the Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona, 08003 Spain
- Arcadi Navarro
- European Genome-phenome Archive (EGA) in the Centre for Genomic Regulation (CRG), the Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona, 08003 Spain
- Jordi Rambla
- European Genome-phenome Archive (EGA) in the Centre for Genomic Regulation (CRG), the Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona, 08003 Spain
2
Prunier J, Lemaçon A, Bastien A, Jafarikia M, Porth I, Robert C, Droit A. LD-annot: A Bioinformatics Tool to Automatically Provide Candidate SNPs With Annotations for Genetically Linked Genes. Front Genet 2019; 10:1192. [PMID: 31850063 PMCID: PMC6889475 DOI: 10.3389/fgene.2019.01192] [Received: 07/02/2019] [Accepted: 10/28/2019] [Indexed: 11/24/2022]
Abstract
A multitude of model and non-model species studies have now taken full advantage of powerful high-throughput genotyping advances such as SNP arrays and genotyping-by-sequencing (GBS) technology to investigate the genetic basis of trait variation. However, due to incomplete genome coverage by these technologies, the identified SNPs are likely in linkage disequilibrium (LD) with the causal polymorphisms, rather than being causal themselves. In addition, researchers could benefit from annotations for the identified candidate SNPs and, simultaneously, for all neighboring genes in genetic linkage. In such cases, LD extent estimation surrounding the candidate SNPs is required to determine the regions encompassing genes of interest. We describe here an automated pipeline, “LD-annot,” designed to delineate specific regions of interest for a given experiment and candidate polymorphisms on the basis of LD extent, and furthermore, provide annotations for all genes within such regions. LD-annot uses standard file formats, bioinformatics tools, and languages to provide identifiers, coordinates, and annotations for genes in genetic linkage with each candidate polymorphism. Although the focus lies upon SNP arrays and GBS data as they are being routinely deployed, this pipeline can be applied to a variety of datasets as long as genotypic data are available for a high number of polymorphisms and formatted into a VCF file. A checkpoint procedure in the pipeline allows users to test several threshold values for linkage without having to rerun the entire pipeline, thus saving computational time and resources. We applied this new pipeline to four different sample sets: two breeding-population GBS datasets, one within-pedigree SNP set coming from whole genome sequencing (WGS), and a very large multi-variety SNP dataset obtained from WGS, representing variable sample sizes and numbers of polymorphisms. LD-annot ran within minutes, even when very high numbers of polymorphisms were investigated, and thus will efficiently assist research efforts aimed at identifying biologically meaningful genetic polymorphisms underlying phenotypic variation. The LD-annot tool is available under a GPL license from https://github.com/ArnaudDroitLab/LD-annot.
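The region-delineation idea in this abstract can be illustrated with a minimal Python sketch. It is not LD-annot's own code and does not reproduce its method. All function names (`r_squared`, `ld_with_candidate`, `region_for_threshold`, `genes_in_region`) and the tuple layouts are hypothetical. LD is approximated here as the squared Pearson correlation of 0/1/2 genotype counts, which is an assumption made for illustration. Computing the LD values once and reusing them for several thresholds loosely mirrors the checkpoint idea mentioned above.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic site: treated as unlinked (assumption)
    return cov * cov / (var_x * var_y)


def ld_with_candidate(snps, candidate_index):
    """Compute r^2 once between the candidate and every SNP on its chromosome.

    snps is a list of (chrom, position, genotypes) tuples (hypothetical layout).
    Returns a list of (position, r2) pairs that can be reused for any threshold.
    """
    chrom, _, cand_geno = snps[candidate_index]
    return [(pos, r_squared(cand_geno, geno))
            for c, pos, geno in snps if c == chrom]


def region_for_threshold(ld_values, candidate_pos, threshold):
    """Span of positions whose r^2 with the candidate meets the threshold."""
    linked = [candidate_pos] + [pos for pos, r2 in ld_values if r2 >= threshold]
    return min(linked), max(linked)


def genes_in_region(genes, chrom, start, end):
    """IDs of genes, given as (gene_id, chrom, start, end), overlapping the region."""
    return [gid for gid, c, gs, ge in genes
            if c == chrom and gs <= end and ge >= start]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    snps = [
        ("chr1", 100, [0, 1, 2, 1, 0, 2]),
        ("chr1", 200, [0, 1, 2, 1, 0, 2]),  # candidate SNP
        ("chr1", 300, [0, 1, 2, 1, 1, 2]),
        ("chr1", 900, [2, 0, 1, 1, 2, 0]),
        ("chr2", 150, [0, 1, 2, 1, 0, 2]),
    ]
    genes = [
        ("geneA", "chr1", 50, 120),
        ("geneB", "chr1", 280, 400),
        ("geneC", "chr1", 800, 950),
        ("geneD", "chr2", 100, 300),
    ]
    ld = ld_with_candidate(snps, 1)      # computed once
    for threshold in (0.7, 0.8):         # reused for each threshold
        start, end = region_for_threshold(ld, 200, threshold)
        print(threshold, (start, end), genes_in_region(genes, "chr1", start, end))
```

In this toy layout, raising the threshold can only shrink the region, so fewer genes are reported as linked to the candidate SNP.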
Affiliation(s)
- Julien Prunier
- Genomics Center, Centre Hospitalier Universitaire de Québec-Université Laval Research Center, Quebec, QC, Canada; Forestry Research Centre, Forestry Department, Université Laval, Quebec, QC, Canada
- Audrey Lemaçon
- Genomics Center, Centre Hospitalier Universitaire de Québec-Université Laval Research Center, Quebec, QC, Canada
- Alexandre Bastien
- Faculty of Agricultural and Food Science, Université Laval, Quebec, QC, Canada
- Mohsen Jafarikia
- Canadian Centre for Swine Improvement, Ottawa, ON, Canada; Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada
- Ilga Porth
- Forestry Research Centre, Forestry Department, Université Laval, Quebec, QC, Canada
- Claude Robert
- Forestry Research Centre, Forestry Department, Université Laval, Quebec, QC, Canada
- Arnaud Droit
- Genomics Center, Centre Hospitalier Universitaire de Québec-Université Laval Research Center, Quebec, QC, Canada