Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Liu X, Han S, Wang Z, Gelernter J, Yang BZ. Variant callers for next-generation sequencing data: a comparison study. PLoS One 2013;8:e75619. [PMID: 24086590 PMCID: PMC3785481 DOI: 10.1371/journal.pone.0075619] [Citation(s) in RCA: 117] [Impact Index Per Article: 10.6] [Received: 12/20/2012] [Accepted: 08/15/2013] [Indexed: 11/19/2022]



Cited by Other Article(s)

Zhang L, Li H, Shi M, Ren K, Zhang W, Cheng Y, Wang Y, Xia XQ. FishSNP: a high quality cross-species SNP database of fishes. Sci Data 2024;11:286. [PMID: 38461307 PMCID: PMC10924876 DOI: 10.1038/s41597-024-03111-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Accepted: 03/04/2024] [Indexed: 03/11/2024]

Affiliation(s)

Lei Zhang State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China; College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
Heng Li State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China; College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
Mijuan Shi State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China; College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
Keyi Ren State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China; College of Fisheries and Life Science, Dalian Ocean University, Dalian, 116023, China
Wanting Zhang State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
Yingyin Cheng State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
Yaping Wang State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China; College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
Xiao-Qin Xia State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China; College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China


Xiang X, Lu B, Song D, Li J, Shu K, Pu D. Evaluating the performance of low-frequency variant calling tools for the detection of variants from short-read deep sequencing data. Sci Rep 2023;13:20444. [PMID: 37993475 PMCID: PMC10665316 DOI: 10.1038/s41598-023-47135-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Accepted: 11/09/2023] [Indexed: 11/24/2023]

Abstract

Detection of low-frequency variants with high accuracy plays an important role in biomedical research and clinical practice. However, it is challenging to do so with next-generation sequencing (NGS) approaches due to the high error rates of NGS. To accurately distinguish low-level true variants from these errors, many statistical variant-calling tools for low-frequency variants have been proposed, but a systematic performance comparison of these tools has not yet been performed. Here, we evaluated four raw-reads-based variant callers (SiNVICT, outLyzer, Pisces, and LoFreq) and four UMI-based variant callers (DeepSNVMiner, MAGERI, smCounter2, and UMI-VarCal) considering their capability to call single nucleotide variants (SNVs) with allelic frequency as low as 0.025% in deep sequencing data. We analyzed a total of 54 simulated datasets with various sequencing depths and variant allele frequencies (VAFs), two reference datasets, and Horizon Tru-Q sample data. The results showed that the UMI-based callers, except smCounter2, outperformed the raw-reads-based callers regarding detection limit. Sequencing depth had almost no effect on the UMI-based callers but significantly influenced the raw-reads-based callers. Regardless of the sequencing depth, MAGERI showed the fastest analysis, while smCounter2 consistently took the longest to finish the variant calling process. Overall, DeepSNVMiner and UMI-VarCal performed the best, with sensitivity and precision of 88% and 100% for DeepSNVMiner and 84% and 100% for UMI-VarCal. In conclusion, the UMI-based callers, except smCounter2, outperformed the raw-reads-based callers in terms of sensitivity and precision. We recommend using DeepSNVMiner and UMI-VarCal for low-frequency variant detection. The results provide important information regarding future directions for reliable low-frequency variant detection and algorithm development, which is critical in genetics-based medical research and clinical applications.
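The sensitivity and precision figures quoted in the abstract above are standard benchmark metrics. As a rough illustration only, and not the evaluation code used by the cited study, the sketch below computes both from a truth set and a call set. The function name, the variant key layout (chromosome, position, ref, alt), and the toy data are all assumptions made for this example.

```python
def sensitivity_precision(truth, calls):
    """Return (sensitivity, precision) for a call set against a truth set."""
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # true variants that were called
    fn = len(truth - calls)   # true variants that were missed
    fp = len(calls - truth)   # calls that are absent from the truth set
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, precision


# Hypothetical toy data: variants keyed as (chromosome, position, ref, alt).
truth = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"),
         ("chr2", 75, "C", "T"), ("chr2", 900, "T", "G")}
calls = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"),
         ("chr2", 75, "C", "T")}

sens, prec = sensitivity_precision(truth, calls)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
# prints: sensitivity=0.75 precision=1.00
```

In this toy case three of four true variants are recovered with no false calls, so sensitivity is 0.75 and precision is 1.0.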


Sancha-Velasco A, Uceda-Heras A, García-Cabezas MÁ. Cortical type: a conceptual tool for meaningful biological interpretation of high-throughput gene expression data in the human cerebral cortex. Front Neuroanat 2023;17:1187280. [PMID: 37426901 PMCID: PMC10323436 DOI: 10.3389/fnana.2023.1187280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2023] [Accepted: 05/31/2023] [Indexed: 07/11/2023]

Abstract

The interpretation of massive high-throughput gene expression data requires computational and biological analyses to identify statistically and biologically significant differences, respectively. There are abundant sources that describe computational tools for statistical analysis of massive gene expression data, but few address data analysis for biological significance. In the present article we exemplify the importance of selecting the proper biological context in the human brain for gene expression data analysis and interpretation. For this purpose, we use cortical type as a conceptual tool to make predictions about gene expression in areas of the human temporal cortex. We predict that the expression of genes related to glutamatergic transmission would be higher in areas of simpler cortical type, the expression of genes related to GABAergic transmission would be higher in areas of more complex cortical type, and the expression of genes related to epigenetic regulation would be higher in areas of simpler cortical type. Then, we test these predictions with gene expression data from several regions of the human temporal cortex obtained from the Allen Human Brain Atlas. We find that the expression of several genes shows statistically significant differences in agreement with the predicted gradual expression along the laminar complexity gradient of the human cortex, suggesting that simpler cortical types may have greater glutamatergic excitability and epigenetic turnover compared to more complex types; on the other hand, complex cortical types seem to have greater GABAergic inhibitory control compared to simpler types. Our results show that cortical type is a good predictor of synaptic plasticity, epigenetic turnover, and selective vulnerability in human cortical areas. Thus, cortical type can provide a meaningful context for interpreting high-throughput gene expression data in the human cerebral cortex.


Zhong ZQ, Li R, Wang Z, Tian SS, Xie XF, Wang ZY, Na W, Wang QS, Pan YC, Xiao Q. Genome-wide scans for selection signatures in indigenous pigs revealed candidate genes relating to heat tolerance. Animal 2023;17:100882. [PMID: 37406393 DOI: 10.1016/j.animal.2023.100882] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/26/2022] [Revised: 06/04/2023] [Accepted: 06/06/2023] [Indexed: 07/07/2023]

Craven KE, Fischer CG, Jiang L, Pallavajjala A, Lin MT, Eshleman JR. Optimizing Insertion and Deletion Detection Using Next-Generation Sequencing in the Clinical Laboratory. J Mol Diagn 2022;24:1217-1231. [PMID: 36162758 PMCID: PMC9808503 DOI: 10.1016/j.jmoldx.2022.08.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2022] [Revised: 07/18/2022] [Accepted: 08/31/2022] [Indexed: 01/13/2023]

Wang H, Wen J, Li H, Zhu T, Zhao X, Zhang J, Zhang X, Tang C, Qu L, Gemingguli M. Candidate pigmentation genes related to feather color variation in an indigenous chicken breed revealed by whole genome data. Front Genet 2022;13:985228. [PMID: 36479242 PMCID: PMC9720402 DOI: 10.3389/fgene.2022.985228] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/03/2022] [Accepted: 10/10/2022] [Indexed: 08/27/2023]

Affiliation(s)

Huie Wang Xinjiang Production and Construction Corps, Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin, Tarim University, Alar, China; College of Life Science and Technology, College of Animal Science and Technology, Tarim University, Alar, China
Junhui Wen National Engineering Laboratory for Animal Breeding, Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
Haiying Li College of Animal Science, Xinjiang Agricultural University, Urumchi, China
Tao Zhu National Engineering Laboratory for Animal Breeding, Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
Xiurong Zhao National Engineering Laboratory for Animal Breeding, Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
Jinxin Zhang National Engineering Laboratory for Animal Breeding, Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
Xinye Zhang National Engineering Laboratory for Animal Breeding, Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
Chi Tang Xinjiang Production and Construction Corps, Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin, Tarim University, Alar, China
Lujiang Qu Xinjiang Production and Construction Corps, Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin, Tarim University, Alar, China; National Engineering Laboratory for Animal Breeding, Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
M. Gemingguli Xinjiang Production and Construction Corps, Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin, Tarim University, Alar, China; College of Life Science and Technology, College of Animal Science and Technology, Tarim University, Alar, China


Feldmann D, Bope CD, Patricios J, Chimusa ER, Collins M, September AV. A whole genome sequencing approach to anterior cruciate ligament rupture-a twin study in two unrelated families. PLoS One 2022;17:e0274354. [PMID: 36201451 PMCID: PMC9536556 DOI: 10.1371/journal.pone.0274354] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2021] [Accepted: 08/25/2022] [Indexed: 11/06/2022]

Abstract

Predisposition to anterior cruciate ligament (ACL) rupture is multi-factorial, with variation in the genome considered a key intrinsic risk factor. Most implicated loci have been identified from candidate gene-based approaches using case-control association settings. Here, we leverage hypothesis-free whole genome sequencing in two unrelated families (Family A and B), each with twins with a history of recurrent ACL ruptures acquired playing rugby as their primary sport, aiming to elucidate biologically relevant function-altering variants and genetic modifiers in ACL rupture. Family A monozygotic twin males (Twin 1 and Twin 2) both sustained two unilateral non-contact ACL ruptures of the right limb while playing club level touch rugby. Their male sibling, who sustained a bilateral non-contact ACL rupture while playing rugby union, was also recruited. The father, who had sustained a unilateral non-contact ACL rupture on the right limb while playing professional amateur level football, and the mother, who had participated in dancing for over 10 years at a social level with no previous ligament or tendon injuries, were both recruited. Family B monozygotic twin males (Twin 3 and Twin 4) were recruited; Twin 3 had sustained a unilateral non-contact ACL rupture of the right limb and Twin 4 had sustained three non-contact ACL ruptures (two in the right limb and one in the left limb), both while playing provincial level rugby union. Their female sibling, who participated in karate and swimming, and their mother, who participated in hockey (4 years), horse riding (15 years) and swimming, both reported no previous history of ligament or tendon injury. Variants with potential deleterious, loss-of-function and pathogenic effects were prioritised. Identity by descent, molecular dynamic simulation and functional partner analyses were conducted.
We identified, in all nine affected individuals, including twin sets, non-synonymous SNPs in three genes, COL12A1, CATSPER2, and KCNJ12, that are commonly enriched for deleterious, loss-of-function mutations; their dysfunctions are known to be involved in the development of chronic pain, and they represent key therapeutic targets. Notably, using identity by descent (IBD) analyses, a long shared identical sequence interval, which included the LINC01250 gene around the telomeric region of chromosome 2p25.3, was common between affected twins in both families and an affected brother. Overall, gene sets were enriched in pathways relevant to ACL pathophysiology, including complement/coagulation cascades (p = 3.0e-7), purine metabolism (p = 6.0e-7) and mismatch repair (p = 6.9e-5) pathways. This study fills an important gap in knowledge by using a WGS approach focused on potentially deleterious variants in two unrelated families with a historical record of ACL rupture, and provides new insights into the pathophysiology of ACL by identifying gene sets that contribute to variability in ACL risk.


Lefouili M, Nam K. The evaluation of Bcftools mpileup and GATK HaplotypeCaller for variant calling in non-human species. Sci Rep 2022;12:11331. [PMID: 35790846 PMCID: PMC9256665 DOI: 10.1038/s41598-022-15563-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 09/22/2021] [Accepted: 06/27/2022] [Indexed: 11/09/2022]

Saremi B, Gusmag F, Distl O, Schaarschmidt F, Metzger J, Becker S, Jung K. A comparison of strategies for generating artificial replicates in RNA-seq experiments. Sci Rep 2022;12:7170. [PMID: 35505053 PMCID: PMC9065086 DOI: 10.1038/s41598-022-11302-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/15/2021] [Accepted: 04/04/2022] [Indexed: 11/21/2022]

Abstract

Due to the overall high costs, technical replicates are usually omitted in RNA-seq experiments, but several methods exist to generate them artificially. Bootstrapping reads from FASTQ-files has recently been used in the context of other NGS analyses and can be used to generate artificial technical replicates. Bootstrapping samples from the columns of the expression matrix has already been used for DNA microarray data and generates a new artificial replicate of the whole experiment. Mixing data of individual samples has been used for data augmentation in machine learning. The aim of this comparison is to evaluate which of these strategies are best suited to study the reproducibility of differential expression and gene-set enrichment analysis in an RNA-seq experiment. To study the approaches under controlled conditions, we performed a new RNA-seq experiment on gene expression changes upon virus infection compared to untreated control samples. In order to compare the approaches for artificial replicates, each of the samples was sequenced twice, i.e. as true technical replicates, and differential expression analysis and GO term enrichment analysis were conducted separately for the two resulting data sets. Although we observed a high correlation between the results from the two replicates, there are still many genes and GO terms that would be selected from one replicate but not from the other. Cluster analyses showed that artificial replicates generated by bootstrapping reads produce p values and fold changes that are close to those obtained from the true data sets. Results generated from artificial replicates with the approaches of column bootstrap or mixing observations were less similar to the results from the true replicates. Furthermore, the overlap of results among replicates generated by column bootstrap or mixing observations was much stronger than among the true replicates.
Artificial technical replicates generated by bootstrapping sequencing reads from FASTQ-files are better suited to study the reproducibility of results from differential expression and GO term enrichment analysis in RNA-seq experiments than column bootstrap or mixing observations. However, FASTQ-bootstrapping is computationally more expensive than the other two approaches. The FASTQ-bootstrapping may be applicable to other applications of high-throughput sequencing.
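The read-level bootstrap described in this abstract amounts to resampling FASTQ records with replacement. The sketch below is a minimal illustration of that idea and not the cited authors' implementation. It assumes unwrapped four-line FASTQ records and single-end reads, and the function names and example reads are hypothetical. For paired-end data, the same sampled indices would have to be applied to both mate files.

```python
import random


def parse_fastq(lines):
    """Group FASTQ lines into 4-line records (header, sequence, '+', qualities)."""
    stripped = [line.rstrip("\n") for line in lines]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    return [tuple(stripped[i:i + 4]) for i in range(0, len(stripped), 4)]


def bootstrap_reads(records, seed=None):
    """Draw len(records) reads with replacement to form one artificial replicate."""
    rng = random.Random(seed)  # a fixed seed makes the replicate reproducible
    return rng.choices(records, k=len(records))


# Hypothetical three-read input, given as a list of lines.
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "IIII",
    "@read3", "GGCC", "+", "IIII",
]
records = parse_fastq(example)
replicate = bootstrap_reads(records, seed=1)
```

Each call with a different seed yields another artificial replicate of the same size as the original read set. Some reads appear more than once and others not at all.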


Akoniyon OP, Adewumi TS, Maharaj L, Oyegoke OO, Roux A, Adeleke MA, Maharaj R, Okpeku M. Whole Genome Sequencing Contributions and Challenges in Disease Reduction Focused on Malaria. BIOLOGY 2022;11:587. [PMID: 35453786 PMCID: PMC9027812 DOI: 10.3390/biology11040587] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/13/2022] [Revised: 03/31/2022] [Accepted: 04/01/2022] [Indexed: 12/11/2022]

Liu J, Shen Q, Bao H. Comparison of seven SNP calling pipelines for the next-generation sequencing data of chickens. PLoS One 2022;17:e0262574. [PMID: 35100292 PMCID: PMC8803190 DOI: 10.1371/journal.pone.0262574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2021] [Accepted: 12/29/2021] [Indexed: 11/18/2022]

Abstract

Single nucleotide polymorphisms (SNPs) are widely used in genome-wide association studies and population genetics analyses. Next-generation sequencing (NGS) has become convenient, and many SNP-calling pipelines have been developed for human NGS data. However, there is a knowledge gap in selecting an appropriate SNP calling pipeline for high-throughput NGS data. To fill this gap, we studied and compared seven SNP calling pipelines, which include 16GT, genome analysis toolkit (GATK), Bcftools-single (Bcftools single sample mode), Bcftools-multiple (Bcftools multiple sample mode), VarScan2-single (VarScan2 single sample mode), VarScan2-multiple (VarScan2 multiple sample mode) and Freebayes pipelines, using 96 NGS datasets with depth gradients of approximately 5X, 10X, 20X, 30X, 40X, and 50X coverage from 16 Rhode Island Red chickens. The sixteen chickens were also genotyped with a 50K SNP array, and the sensitivity and specificity of each pipeline were assessed by comparison to the results of SNP arrays. For each pipeline, except Freebayes, the number of detected SNPs increased as the input read depth increased. In comparison with other pipelines, 16GT, followed by Bcftools-multiple, obtained the most SNPs when the input coverage exceeded 10X, and Bcftools-multiple obtained the most when the input was 5X and 10X. The sensitivity and specificity of each pipeline increased with increasing input. Bcftools-multiple had the highest sensitivity numerically when the input ranged from 5X to 30X, and 16GT showed the highest sensitivity when the input was 40X and 50X. Bcftools-multiple also had the highest specificity, followed by GATK, at almost all input levels. For most calling pipelines, there were no obvious changes in SNP numbers, sensitivities or specificities beyond 20X.
In conclusion, (1) if only SNPs were detected, the sequencing depth did not need to exceed 20X; (2) the Bcftools-multiple may be the best choice for detecting SNPs from chicken NGS data, but for a single sample or sequencing depth greater than 20X, 16GT was recommended. Our findings provide a reference for researchers to select suitable pipelines to obtain SNPs from the NGS data of chickens or nonhuman animals.


Casellas J, Martín de Hijas-Villalba M, Vázquez-Gómez M, Id-Lahoucine S. Low-coverage whole-genome sequencing in livestock species for individual traceability and parentage testing. Livest Sci 2021. [DOI: 10.1016/j.livsci.2021.104629] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]

Ahmed Z, Renart EG, Zeeshan S. Genomics pipelines to investigate susceptibility in whole genome and exome sequenced data for variant discovery, annotation, prediction and genotyping. PeerJ 2021;9:e11724. [PMID: 34395068 PMCID: PMC8320519 DOI: 10.7717/peerj.11724] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/29/2021] [Accepted: 06/14/2021] [Indexed: 12/12/2022]

Zanti M, Michailidou K, Loizidou MA, Machattou C, Pirpa P, Christodoulou K, Spyrou GM, Kyriacou K, Hadjisavvas A. Performance evaluation of pipelines for mapping, variant calling and interval padding, for the analysis of NGS germline panels. BMC Bioinformatics 2021;22:218. [PMID: 33910496 PMCID: PMC8080428 DOI: 10.1186/s12859-021-04144-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2021] [Accepted: 04/15/2021] [Indexed: 11/10/2022]

Abstract

Background

Next-generation sequencing (NGS) represents a significant advancement in clinical genetics. However, its use creates several technical, data interpretation and management challenges. It is essential to follow a consistent data analysis pipeline to achieve the highest possible accuracy and avoid false variant calls. Herein, we aimed to compare the performance of twenty-eight combinations of NGS data analysis pipeline compartments, including short-read mapping (BWA-MEM, Bowtie2, Stampy), variant calling (GATK-HaplotypeCaller, GATK-UnifiedGenotyper, SAMtools) and interval padding (null, 50 bp, 100 bp) methods, along with a commercially available pipeline (BWA Enrichment, Illumina®). Fourteen germline DNA samples from breast cancer patients were sequenced using a targeted NGS panel approach and subjected to data analysis.

Results

We highlight that interval padding is required for the accurate detection of intronic variants, including spliceogenic pathogenic variants (PVs). In addition, using nearly default parameters, the BWA Enrichment algorithm failed to detect these spliceogenic PVs and a missense PV in the TP53 gene. We also recommend the BWA-MEM algorithm for sequence alignment, whereas variant calling should be performed using a combination of variant calling algorithms: GATK-HaplotypeCaller and SAMtools for the accurate detection of insertions/deletions, and GATK-UnifiedGenotyper for the efficient detection of single nucleotide variant calls.

Conclusions

These findings have important implications towards the identification of clinically actionable variants through panel testing in a clinical laboratory setting, when dedicated bioinformatics personnel might not always be available. The results also reveal the necessity of improving the existing tools and/or at the same time developing new pipelines to generate more reliable and more consistent data.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12859-021-04144-1.


Next Generation Sequencing Technology in the Clinic and Its Challenges. Cancers (Basel) 2021;13:cancers13081751. [PMID: 33916923 PMCID: PMC8067551 DOI: 10.3390/cancers13081751] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 03/09/2021] [Revised: 03/30/2021] [Accepted: 04/05/2021] [Indexed: 12/12/2022]

Hynst J, Navrkalova V, Pal K, Pospisilova S. Bioinformatic strategies for the analysis of genomic aberrations detected by targeted NGS panels with clinical application. PeerJ 2021;9:e10897. [PMID: 33850640 PMCID: PMC8019320 DOI: 10.7717/peerj.10897] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/09/2020] [Accepted: 01/13/2021] [Indexed: 01/21/2023]

Investigating the importance of individual mitochondrial genotype in susceptibility to drug-induced toxicity. Biochem Soc Trans 2021;48:787-797. [PMID: 32453388 PMCID: PMC7329340 DOI: 10.1042/bst20190233] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 01/24/2020] [Revised: 04/30/2020] [Accepted: 05/01/2020] [Indexed: 12/13/2022]

Chen H, Yin Y, Li X, Li S, Gao H, Wang X, Zhang Y, Liu Y, Wang H. Whole-Genome Analysis of Livestock-Associated Methicillin-Resistant Staphylococcus aureus Sequence Type 398 Strains Isolated From Patients With Bacteremia in China. J Infect Dis 2021;221:S220-S228. [PMID: 32176793 DOI: 10.1093/infdis/jiz575] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 01/04/2023]

Valiente-Mullor C, Beamud B, Ansari I, Francés-Cuesta C, García-González N, Mejía L, Ruiz-Hueso P, González-Candelas F. One is not enough: On the effects of reference genome for the mapping and subsequent analyses of short-reads. PLoS Comput Biol 2021;17:e1008678. [PMID: 33503026 PMCID: PMC7870062 DOI: 10.1371/journal.pcbi.1008678] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 04/27/2020] [Revised: 02/08/2021] [Accepted: 01/05/2021] [Indexed: 12/17/2022]

Abstract

Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of a reference may represent a source of errors that may affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legionella pneumophila, Neisseria gonorrhoeae, Pseudomonas aeruginosa and Serratia marcescens). Publicly available whole-genome assemblies encompassing the genomic diversity of these species were selected as reference sequences, and read alignment statistics, SNP calling, recombination rates, dN/dS ratios, and phylogenetic trees were evaluated depending on the mapping reference. The choice of different reference genomes proved to have an impact on almost all the parameters considered in the five species. In addition, these biases had potential epidemiological implications such as including/excluding isolates of particular clades and the estimation of genetic distances. These findings suggest that the single reference approach might introduce systematic errors during mapping that affect subsequent analyses, particularly for data sets with isolates from genetically diverse backgrounds. In any case, exploring the effects of different references on the final conclusions is highly recommended.

Mapping consists in the alignment of reads (i.e., DNA fragments) obtained through high-throughput genome sequencing to a previously assembled reference sequence. It is a common practice in genomic studies to use a single reference for mapping, usually the ‘reference genome’ of a species—a high-quality assembly. However, the selection of an optimal reference is hindered by intrinsic intra-species genetic variability, particularly in bacteria. It is known that genetic differences between the reference genome and the read sequences may produce incorrect alignments during mapping. Eventually, these errors could lead to misidentification of variants and biased reconstruction of phylogenetic trees (which reflect ancestry between different bacterial lineages). To our knowledge, this is the first work to systematically examine the effect of different references for mapping on the inference of tree topology as well as the impact on recombination and natural selection inferences. Furthermore, the novelty of this work relies on a procedure that guarantees that we are evaluating only the effect of the reference. This effect has proved to be pervasive in the five bacterial species that we have studied and, in some cases, alterations in phylogenetic trees could lead to incorrect epidemiological inferences. Hence, the use of different reference genomes may be prescriptive to assess the potential biases of mapping.

Alosaimi S, van Biljon N, Awany D, Thami PK, Defo J, Mugo JW, Bope CD, Mazandu GK, Mulder NJ, Chimusa ER. Simulation of African and non-African low and high coverage whole genome sequence data to assess variant calling approaches. Brief Bioinform 2020;22:6042242. [PMID: 33341897 DOI: 10.1093/bib/bbaa366] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/25/2020] [Revised: 11/14/2020] [Accepted: 01/08/2020] [Indexed: 12/15/2022] Open

Affiliation(s)

Shatha Alosaimi Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
Noëlle van Biljon Department of Statistical Sciences, University of Cape Town, Cape Town, South Africa
Denis Awany Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
Prisca K Thami Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
Joel Defo Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
Jacquiline W Mugo Faculty of Health Sciences, Division of Computational Biology, Department of Biomedical Sciences, University of Cape Town, Cape Town, South Africa
Christian D Bope Faculty of Sciences, Department of Mathematics and Computer Science, University of Kinshasa, Kinshasa, DRC
Gaston K Mazandu Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa.,Faculty of Health Sciences, Division of Computational Biology, Department of Biomedical Sciences, University of Cape Town, Cape Town, South Africa
Nicola J Mulder Faculty of Health Sciences, Division of Computational Biology, Department of Biomedical Sciences, University of Cape Town, Cape Town, South Africa.,Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Anzio Road, Observatory, Cape Town 7925, South Africa
Emile R Chimusa Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa.,Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Anzio Road, Observatory, Cape Town 7925, South Africa

Castrignanò T, Gioiosa S, Flati T, Cestari M, Picardi E, Chiara M, Fratelli M, Amente S, Cirilli M, Tangaro MA, Chillemi G, Pesole G, Zambelli F. ELIXIR-IT HPC@CINECA: high performance computing resources for the bioinformatics community. BMC Bioinformatics 2020;21:352. [PMID: 32838759 PMCID: PMC7446135 DOI: 10.1186/s12859-020-03565-8] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Indexed: 01/11/2023] Open

Abstract

BACKGROUND

The advent of Next Generation Sequencing (NGS) technologies and the concomitant reduction in sequencing costs allow unprecedented high-throughput profiling of biological systems in a cost-efficient manner. Modern biological experiments are becoming increasingly data- and computation-intensive, and the wealth of publicly available biological data is bringing bioinformatics into the "Big Data" era. For these reasons, the value of High Performance Computing (HPC) architectures is increasingly recognized by bioinformaticians as well. Here we describe pilot programs that provision HPC resources to bioinformaticians, run by the Italian Node of ELIXIR (ELIXIR-IT) in collaboration with CINECA, the main Italian supercomputing center.

RESULTS

Starting in April 2016, CINECA and ELIXIR-IT launched the pilot call "ELIXIR-IT HPC@CINECA", offering streamlined access to HPC resources for bioinformatics. Resources are made available either through web front-ends to dedicated workflows developed at CINECA or through direct access to the High Performance Computing systems via a standard command-line interface tailored for bioinformatics data analysis. This gives the biomedical research community a production-scale environment, continuously updated with the latest versions of publicly available reference datasets and bioinformatic tools. Currently, 63 research projects have gained access to the HPC@CINECA program, for a total allocation of ~8 million CPU hours and, for data storage, ~100 TB of permanent and ~300 TB of temporary space.

CONCLUSIONS

Three years after the beginning of the ELIXIR-IT HPC@CINECA program, we can assess its impact on the Italian bioinformatics community and draw some conclusions. Several Italian researchers who applied to the program have gained access to one of the top-ranking public scientific supercomputing facilities in Europe. Those investigators had the opportunity to substantially reduce computational turnaround times in their research projects and to process massive amounts of data, pursuing research approaches that would otherwise have been difficult or impossible to undertake. Moreover, by taking advantage of the wealth of documentation and training material provided by CINECA, participants had the opportunity to improve their skills in using HPC systems and to be better positioned to apply to similar EU programs of greater scale, such as PRACE. To illustrate the effective usage and impact of the resources awarded by the program in different research applications, we report five successful use cases whose findings have already been published in peer-reviewed journals.

Affiliation(s)

Tiziana Castrignanò Department of Ecological and Biological Sciences (DEB), University of Tuscia, Viterbo, Italy.
Silvia Gioiosa CINECA, SuperComputing Applications and Innovation Department, Rome, Italy.,Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy
Tiziano Flati CINECA, SuperComputing Applications and Innovation Department, Rome, Italy.,Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy
Mirko Cestari CINECA, SuperComputing Applications and Innovation Department, Rome, Italy
Ernesto Picardi Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy.,Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari "A. Moro", Bari, Italy
Matteo Chiara Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy.,Department of Biosciences, University of Milan, Milan, Italy
Maddalena Fratelli IRCCS-Istituto di Ricerche Farmacologiche "Mario Negri", Milano, Milan, Italy
Stefano Amente Department of Molecular Medicine and Medical Biotechnologies, University of Naples 'Federico II', Naples, Italy
Marco Cirilli Department of Agricultural and Environmental Sciences - Production, Landscape, Agroenergy (DISAA), University of Milan, Milan, Italy
Marco Antonio Tangaro Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy
Giovanni Chillemi Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy.,Department for Innovation in Biological, Agro-food and Forest systems (DIBAF), University of Tuscia, Viterbo, Italy
Graziano Pesole Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy.,Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari "A. Moro", Bari, Italy.
Federico Zambelli Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy.,Department of Biosciences, University of Milan, Milan, Italy.

Daw Elbait G, Henschel A, Tay GK, Al Safar HS. Whole Genome Sequencing of Four Representatives From the Admixed Population of the United Arab Emirates. Front Genet 2020;11:681. [PMID: 32754195 PMCID: PMC7367215 DOI: 10.3389/fgene.2020.00681] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/02/2020] [Accepted: 06/03/2020] [Indexed: 01/21/2023] Open

Venkataraman GR, Rivas MA. Rare and common variant discovery in complex disease: the IBD case study. Hum Mol Genet 2020;28:R162-R169. [PMID: 31363759 DOI: 10.1093/hmg/ddz189] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/19/2019] [Revised: 07/24/2019] [Accepted: 07/25/2019] [Indexed: 12/15/2022] Open

Hendrix MM, Cuthbert CD, Cordovado SK. Assessing the Performance of Dried-Blood-Spot DNA Extraction Methods in Next Generation Sequencing. Int J Neonatal Screen 2020;6:36. [PMID: 32514487 PMCID: PMC7278269 DOI: 10.3390/ijns6020036] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/26/2020] [Accepted: 04/27/2020] [Indexed: 12/31/2022] Open

Alqahtani A, Skelton A, Eley L, Annavarapu S, Henderson DJ, Chaudhry B. Isolation and next generation sequencing of archival formalin-fixed DNA. J Anat 2020;237:587-600. [PMID: 32426881 PMCID: PMC7476199 DOI: 10.1111/joa.13209] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/19/2020] [Revised: 03/31/2020] [Accepted: 04/07/2020] [Indexed: 11/29/2022] Open

Schilbert HM, Rempel A, Pucker B. Comparison of Read Mapping and Variant Calling Tools for the Analysis of Plant NGS Data. Plants (Basel) 2020;9:E439. [PMID: 32252268 PMCID: PMC7238416 DOI: 10.3390/plants9040439] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 03/15/2020] [Revised: 03/28/2020] [Accepted: 03/30/2020] [Indexed: 12/30/2022]

Bush SJ, Foster D, Eyre DW, Clark EL, De Maio N, Shaw LP, Stoesser N, Peto TEA, Crook DW, Walker AS. Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism-calling pipelines. Gigascience 2020;9:giaa007. [PMID: 32025702 PMCID: PMC7002876 DOI: 10.1093/gigascience/giaa007] [Citation(s) in RCA: 61] [Impact Index Per Article: 15.3] [Received: 05/28/2019] [Revised: 12/02/2019] [Accepted: 01/15/2020] [Indexed: 02/06/2023] Open

Abstract

BACKGROUND

Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella.

RESULTS

We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis.
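Sensitivity and precision, as used above, can be illustrated with a minimal Python sketch. It assumes each SNP call is reduced to a genomic position and compared against a known truth set; the function name and the toy positions are hypothetical and are not taken from the study, whose own evaluation may differ in detail.

```python
def precision_and_sensitivity(called, truth):
    """Compare a set of called SNP positions with a truth set.

    precision   = TP / (TP + FP)
    sensitivity = TP / (TP + FN)   (also called recall)
    """
    tp = len(called & truth)   # called and real
    fp = len(called - truth)   # called but not real
    fn = len(truth - called)   # real but missed
    precision = tp / (tp + fp) if called else float("nan")
    sensitivity = tp / (tp + fn) if truth else float("nan")
    return precision, sensitivity


# Hypothetical toy data: four true SNPs, three calls, two of them correct.
truth = {101, 250, 377, 512}
called = {101, 250, 600}
precision, sensitivity = precision_and_sensitivity(called, truth)
print(f"precision={precision:.3f} sensitivity={sensitivity:.3f}")
```

A more divergent reference would be expected to lower both values, which is the inverse relationship with Mash distance that the abstract reports.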

CONCLUSIONS

The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.

Affiliation(s)

Stephen J Bush Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK.,National Institute for Health Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
Dona Foster Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK.,National Institute for Health Research Oxford Biomedical Research Centre, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
David W Eyre Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
Emily L Clark The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
Nicola De Maio European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SH, UK
Liam P Shaw Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
Nicole Stoesser Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
Tim E A Peto Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK.,National Institute for Health Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK.,National Institute for Health Research Oxford Biomedical Research Centre, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
Derrick W Crook Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK.,National Institute for Health Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK.,National Institute for Health Research Oxford Biomedical Research Centre, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
A Sarah Walker Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK.,National Institute for Health Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK.,National Institute for Health Research Oxford Biomedical Research Centre, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK

Role of Bioinformatics in Molecular Medicine. Genomic Med 2020. [DOI: 10.1007/978-3-030-22922-1_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022] Open

Variant Calling Using Whole Genome Resequencing and Sequence Capture for Population and Evolutionary Genomic Inferences in Norway Spruce (Picea Abies). Compendium of Plant Genomes 2020. [DOI: 10.1007/978-3-030-21001-4_2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/16/2022]

Lewin AC, Coghill LM, McLellan GJ, Bentley E, Kousoulas KG. Genomic analysis for virulence determinants in feline herpesvirus type-1 isolates. Virus Genes 2019;56:49-57. [PMID: 31776852 DOI: 10.1007/s11262-019-01718-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/06/2019] [Accepted: 11/21/2019] [Indexed: 12/27/2022]

Jiang Y, Jiang Y, Wang S, Zhang Q, Ding X. Optimal sequencing depth design for whole genome re-sequencing in pigs. BMC Bioinformatics 2019;20:556. [PMID: 31703550 PMCID: PMC6839175 DOI: 10.1186/s12859-019-3164-z] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 04/25/2019] [Accepted: 10/16/2019] [Indexed: 12/30/2022] Open

Abstract

BACKGROUND

As whole-genome sequencing is becoming a routine technique, it is important to identify a cost-effective depth of sequencing for such studies. However, the relationship between sequencing depth and biological results from the aspects of whole-genome coverage, variant discovery power and the quality of variants is unclear, especially in pigs. We sequenced the genomes of three Yorkshire boars at an approximately 20X depth on the Illumina HiSeq X Ten platform and downloaded whole-genome sequencing data for three Duroc and three Landrace pigs with an approximately 20X depth for each individual. Then, we downsampled the deep genome data by extracting twelve different proportions of 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0 paired reads from the original bam files to mimic the sequence data of the same individuals at sequencing depths of 1.09X, 2.18X, 3.26X, 4.35X, 6.53X, 8.70X, 10.88X, 13.05X, 15.22X, 17.40X, 19.57X and 21.75X, and evaluated genome coverage, the variant discovery rate and genotyping accuracy as a function of sequencing depth. In addition, SNP chip data for Yorkshire pigs were used as a validation for the comparison of single-sample calling and multisample calling algorithms.
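The relationship between the retained proportion of read pairs and the resulting depth is simple arithmetic, sketched below in Python. The sketch assumes that the final depth of 21.75X corresponds to the full, undownsampled data (a proportion of 1.0); the function and variable names are illustrative only and do not come from the paper.

```python
# Mean depth of the full data, as reported in the abstract (21.75X).
FULL_DEPTH = 21.75

# Retained proportions of read pairs; 1.0 is assumed to mean "no downsampling".
PROPORTIONS = [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]


def expected_depths(full_depth, proportions):
    """Expected mean depth after keeping each proportion of read pairs."""
    return [full_depth * p for p in proportions]


depths = expected_depths(FULL_DEPTH, PROPORTIONS)
for proportion, depth in zip(PROPORTIONS, depths):
    print(f"{proportion:>4}: {depth:.3f}X")
```

Under this assumption the computed values agree with the twelve depths listed in the abstract to within rounding.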

RESULTS

Our results indicated that 10X is an ideal practical depth for achieving plateau coverage and discovering accurate variants, as it achieved greater than 99% genome coverage. The number of false-positive variants increased dramatically at depths of less than 4X, which covered 95% of the whole genome. In addition, the comparison of multi- and single-sample calling showed that multisample calling was more sensitive than single-sample calling, especially at lower depths. The number of variants discovered under multisample calling was 13-fold and 2-fold higher than that under single-sample calling at 1X and 22X, respectively. A large difference was observed when the depth was less than 4.38X. However, more false-positive variants were detected under multisample calling.

CONCLUSIONS

Our research will inform important study design decisions regarding whole-genome sequencing depth. Our results will be helpful for choosing the appropriate depth to achieve the same power for studies performed under limited budgets.

AlSafar HS, Al-Ali M, Elbait GD, Al-Maini MH, Ruta D, Peramo B, Henschel A, Tay GK. Introducing the first whole genomes of nationals from the United Arab Emirates. Sci Rep 2019;9:14725. [PMID: 31604968 PMCID: PMC6789106 DOI: 10.1038/s41598-019-50876-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 10/16/2018] [Accepted: 09/20/2019] [Indexed: 12/30/2022] Open

Caspar SM, Dubacher N, Kopps AM, Meienberg J, Henggeler C, Matyas G. Clinical sequencing: From raw data to diagnosis with lifetime value. Clin Genet 2019;93:508-519. [PMID: 29206278 DOI: 10.1111/cge.13190] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 10/06/2017] [Revised: 11/28/2017] [Accepted: 11/30/2017] [Indexed: 12/22/2022]

Wu X, Heffelfinger C, Zhao H, Dellaporta SL. Benchmarking variant identification tools for plant diversity discovery. BMC Genomics 2019;20:701. [PMID: 31500583 PMCID: PMC6734213 DOI: 10.1186/s12864-019-6057-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 04/29/2019] [Accepted: 08/22/2019] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

The ability to accurately and comprehensively identify genomic variations is critical for plant studies utilizing high-throughput sequencing. Most bioinformatics tools for processing next-generation sequencing data were originally developed and tested in human studies, raising questions as to their efficacy for plant research. A detailed evaluation of the entire variant calling pipeline, including alignment, variant calling, variant filtering, and imputation was performed on different programs using both simulated and real plant genomic datasets.

RESULTS

A comparison of SOAP2, Bowtie2, and BWA-MEM found that BWA-MEM was consistently able to align the most reads with high accuracy, whereas Bowtie2 had the highest overall accuracy. Comparative results of GATK HaplotypeCaller versus SAMtools mpileup indicated that the choice of variant caller affected precision and recall differentially depending on the levels of diversity, sequence coverage and genome complexity. A cross-reference experiment with the S. lycopersicum and S. pennellii reference genomes revealed the inadequacy of a single reference genome for variant discovery that includes distantly related plant individuals. A machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff strategy, resulting in a higher number of true positive variants and fewer false positive variants. A 2-step imputation method, which utilized a set of high-confidence SNPs as the reference panel, showed up to 60% higher accuracy than direct LD-based imputation.

CONCLUSIONS

Programs in the variant discovery pipeline perform differently on plant genomic datasets. The choice of programs depends on the goal of the study and the available resources. This study provides important guidance for plant biologists utilizing next-generation sequencing data for diversity characterization and crop improvement.

Li D, Kim W, Wang L, Yoon KA, Park B, Park C, Kong SY, Hwang Y, Baek D, Lee ES, Won S. Comparison of INDEL Calling Tools with Simulation Data and Real Short-Read Data. IEEE/ACM Trans Comput Biol Bioinform 2019;16:1635-1644. [PMID: 30004886 DOI: 10.1109/tcbb.2018.2854793] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 06/08/2023]

Variant calling and quality control of large-scale human genome sequencing data. Emerg Top Life Sci 2019;3:399-409. [DOI: 10.1042/etls20190007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/24/2019] [Revised: 06/28/2019] [Accepted: 07/16/2019] [Indexed: 12/12/2022]

Batcha AMN, Bamopoulos SA, Kerbs P, Kumar A, Jurinovic V, Rothenberg-Thurley M, Ksienzyk B, Philippou-Massier J, Krebs S, Blum H, Schneider S, Konstandin N, Bohlander SK, Heckman C, Kontro M, Hiddemann W, Spiekermann K, Braess J, Metzeler KH, Greif PA, Mansmann U, Herold T. Allelic Imbalance of Recurrently Mutated Genes in Acute Myeloid Leukaemia. Sci Rep 2019;9:11796. [PMID: 31409822 PMCID: PMC6692371 DOI: 10.1038/s41598-019-48167-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/03/2019] [Accepted: 07/29/2019] [Indexed: 12/24/2022] Open

Affiliation(s)

Aarif M N Batcha Institute of Medical Data Processing, Biometrics and Epidemiology (IBE), Faculty of Medicine, LMU Munich, Munich, Germany.,Data Integration for Future Medicine (DiFuture, www.difuture.de), LMU Munich, Munich, Germany.
Stefanos A Bamopoulos Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
Paul Kerbs Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
Ashwini Kumar Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
Vindi Jurinovic Institute of Medical Data Processing, Biometrics and Epidemiology (IBE), Faculty of Medicine, LMU Munich, Munich, Germany.,Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
Maja Rothenberg-Thurley Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
Bianka Ksienzyk Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
Julia Philippou-Massier Laboratory for Functional Genome Analysis (LAFUGA), Gene Center, University of Munich, Munich, Germany
Stefan Krebs Laboratory for Functional Genome Analysis (LAFUGA), Gene Center, University of Munich, Munich, Germany
Helmut Blum Laboratory for Functional Genome Analysis (LAFUGA), Gene Center, University of Munich, Munich, Germany
Stephanie Schneider Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany.,Institute of Human Genetics, University Hospital, LMU Munich, Munich, Germany
Nikola Konstandin Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
Stefan K Bohlander Leukaemia and Blood Cancer Research Unit, Department of Molecular Medicine and Pathology, University of Auckland, Auckland, New Zealand
Caroline Heckman Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
Mika Kontro Department of Haematology, Helsinki University Hospital Comprehensive Cancer Center, Helsinki, Finland
Wolfgang Hiddemann Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany.,German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany.,German Cancer Research Center (DKFZ), Heidelberg, Germany
Karsten Spiekermann Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany.,German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany.,German Cancer Research Center (DKFZ), Heidelberg, Germany
Jan Braess Department of Oncology and Hematology, Hospital Barmherzige Brüder, Regensburg, Germany
Klaus H Metzeler Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany.,German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany.,German Cancer Research Center (DKFZ), Heidelberg, Germany
Philipp A Greif Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany.,German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany.,German Cancer Research Center (DKFZ), Heidelberg, Germany
Ulrich Mansmann Institute of Medical Data Processing, Biometrics and Epidemiology (IBE), Faculty of Medicine, LMU Munich, Munich, Germany.,Data Integration for Future Medicine (DiFuture, www.difuture.de), LMU Munich, Munich, Germany.,German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany.,German Cancer Research Center (DKFZ), Heidelberg, Germany
Tobias Herold Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany.,German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany.,German Cancer Research Center (DKFZ), Heidelberg, Germany.,Research Unit Apoptosis in Hematopoietic Stem Cells, Helmholtz Zentrum München, German Research Center for Environmental Health (HMGU), Munich, Germany.

Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software. Nat Commun 2019;10:3240. [PMID: 31324872 PMCID: PMC6642177 DOI: 10.1038/s41467-019-11146-4] [Citation(s) in RCA: 137] [Impact Index Per Article: 27.4] [Received: 09/14/2018] [Accepted: 06/26/2019] [Indexed: 01/12/2023] Open

Brouard JS, Schenkel F, Marete A, Bissonnette N. The GATK joint genotyping workflow is appropriate for calling variants in RNA-seq experiments. J Anim Sci Biotechnol 2019;10:44. [PMID: 31249686 PMCID: PMC6587293 DOI: 10.1186/s40104-019-0359-0] [Citation(s) in RCA: 68] [Impact Index Per Article: 13.6] [Received: 12/26/2018] [Accepted: 04/28/2019] [Indexed: 12/30/2022] Open

Abstract

The Genome Analysis Toolkit (GATK) is a popular set of programs for discovering and genotyping variants from next-generation sequencing data. The current GATK recommendation for RNA sequencing (RNA-seq) is to perform variant calling from individual samples, with the drawback that only variable positions are reported. Versions 3.0 and above of GATK offer the possibility of calling DNA variants on cohorts of samples using the HaplotypeCaller algorithm in Genomic Variant Call Format (GVCF) mode. Using this approach, variants are called individually on each sample, generating one GVCF file per sample that lists genotype likelihoods and their genome annotations. In a second step, variants are called from the GVCF files through a joint genotyping analysis. This strategy is more flexible and reduces computational challenges in comparison to the traditional joint discovery workflow. Using a GVCF workflow for mining SNP in RNA-seq data provides substantial advantages, including reporting homozygous genotypes for the reference allele as well as missing data. Taking advantage of RNA-seq data derived from primary macrophages isolated from 50 cows, the GATK joint genotyping method for calling variants on RNA-seq data was validated by comparing this approach to a so-called “per-sample” method. In addition, pair-wise comparisons of the two methods were performed to evaluate their respective sensitivity, precision and accuracy using DNA genotypes from a companion study including the same 50 cows genotyped using either genotyping-by-sequencing or with the Bovine SNP50 Beadchip (imputed to the Bovine high density). Results indicate that both approaches are very close in their capacity of detecting reference variants and that the joint genotyping method is more sensitive than the per-sample method. Given that the joint genotyping method is more flexible and technically easier, we recommend this approach for variant calling in RNA-seq experiments.

Kumaran M, Subramanian U, Devarajan B. Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data. BMC Bioinformatics 2019;20:342. [PMID: 31208315 PMCID: PMC6580603 DOI: 10.1186/s12859-019-2928-9] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 07/14/2018] [Accepted: 05/31/2019] [Indexed: 12/30/2022]

Crysnanto D, Wurmser C, Pausch H. Accurate sequence variant genotyping in cattle using variation-aware genome graphs. Genet Sel Evol 2019;51:21. [PMID: 31092189 PMCID: PMC6521551 DOI: 10.1186/s12711-019-0462-x] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 11/02/2018] [Accepted: 05/03/2019] [Indexed: 12/22/2022]

Abstract

BACKGROUND

Genotyping of sequence variants typically involves, as a first step, the alignment of sequencing reads to a linear reference genome. Because a linear reference genome represents only a small fraction of all the DNA sequence variation within a species, reference allele bias may occur at highly polymorphic or divergent regions of the genome. Graph-based methods facilitate the comparison of sequencing reads to a variation-aware genome graph, which incorporates a collection of non-redundant DNA sequences that segregate within a species. We compared the accuracy and sensitivity of graph-based sequence variant genotyping using the Graphtyper software to two widely-used methods, i.e., GATK and SAMtools, which rely on linear reference genomes using whole-genome sequencing data from 49 Original Braunvieh cattle.

RESULTS

We discovered 21,140,196, 20,262,913, and 20,668,459 polymorphic sites using GATK, Graphtyper, and SAMtools, respectively. Comparisons between sequence variant genotypes and microarray-derived genotypes showed that Graphtyper outperformed both GATK and SAMtools in terms of genotype concordance, non-reference sensitivity, and non-reference discrepancy. The sequence variant genotypes that were obtained using Graphtyper had the smallest number of Mendelian inconsistencies between sequence-derived single nucleotide polymorphisms and indels in nine sire-son pairs. Genotype phasing and imputation using the Beagle software improved the quality of the sequence variant genotypes for all the tools evaluated, particularly for animals that were sequenced at low coverage. Following imputation, the concordance between sequence- and microarray-derived genotypes was almost identical for the three methods evaluated, i.e., 99.32, 99.46, and 99.24% for GATK, Graphtyper, and SAMtools, respectively. Variant filtration based on commonly used criteria improved genotype concordance slightly but it also decreased sensitivity. Graphtyper required considerably more computing resources than SAMtools but less than GATK.
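The three comparison metrics named above can be illustrated with a small sketch. The definitions used here are an assumption, following the conventions commonly used for genotype concordance reports (genotypes coded as alternate-allele counts 0, 1, 2, with the microarray treated as truth); the abstract itself does not spell them out, and the function name is hypothetical.

```python
from typing import Dict, Optional, Sequence


def concordance_metrics(array_gt: Sequence[Optional[int]],
                        seq_gt: Sequence[Optional[int]]) -> Dict[str, float]:
    # Keep only sites genotyped by both the array and the sequencing pipeline.
    pairs = [(a, s) for a, s in zip(array_gt, seq_gt)
             if a is not None and s is not None]
    if not pairs:
        raise ValueError("no sites genotyped by both methods")

    n = len(pairs)
    matches = sum(1 for a, s in pairs if a == s)
    homref_matches = sum(1 for a, s in pairs if a == 0 and s == 0)
    truth_nonref = sum(1 for a, _ in pairs if a > 0)
    both_nonref = sum(1 for a, s in pairs if a > 0 and s > 0)

    # Genotype concordance: fraction of identical genotypes.
    concordance = matches / n
    # Non-reference sensitivity: array non-reference sites also called
    # non-reference from sequence data.
    nrs = both_nonref / truth_nonref if truth_nonref else float("nan")
    # Non-reference discrepancy: discordant genotypes among all pairs,
    # ignoring concordant homozygous-reference pairs.
    denom = n - homref_matches
    nrd = (n - matches) / denom if denom else float("nan")
    return {"concordance": concordance, "nrs": nrs, "nrd": nrd}


if __name__ == "__main__":
    array = [0, 0, 1, 1, 2, 2, 1, 0]
    sequence = [0, 0, 1, 2, 2, 2, 0, 1]
    print(concordance_metrics(array, sequence))
```

Excluding concordant homozygous-reference pairs from the discrepancy denominator keeps the large number of trivially matching reference genotypes from masking errors at variant sites.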

CONCLUSIONS

Sequence variant genotyping using Graphtyper is accurate, sensitive and computationally feasible in cattle. Graph-based methods enable sequence variant genotyping from variation-aware reference genomes that may incorporate cohort-specific sequence variants, which is not possible with the current implementation of state-of-the-art methods that rely on linear reference genomes.

Veeckman E, Van Glabeke S, Haegeman A, Muylle H, van Parijs FRD, Byrne SL, Asp T, Studer B, Rohde A, Roldán-Ruiz I, Vandepoele K, Ruttink T. Overcoming challenges in variant calling: exploring sequence diversity in candidate genes for plant development in perennial ryegrass (Lolium perenne). DNA Res 2019;26:1-12. [PMID: 30325414 PMCID: PMC6379033 DOI: 10.1093/dnares/dsy033] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 06/26/2018] [Accepted: 09/06/2018] [Indexed: 11/13/2022]

Ali H, Al-Mulla F, Hussain N, Naim M, Asbeutah AM, AlSahow A, Abu-Farha M, Abubaker J, Al Madhoun A, Ahmad S, Harris PC. PKD1 Duplicated regions limit clinical Utility of Whole Exome Sequencing for Genetic Diagnosis of Autosomal Dominant Polycystic Kidney Disease. Sci Rep 2019;9:4141. [PMID: 30858458 PMCID: PMC6412018 DOI: 10.1038/s41598-019-40761-w] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 04/26/2018] [Accepted: 02/21/2019] [Indexed: 12/18/2022]

Vo NS, Phan V. Leveraging known genomic variants to improve detection of variants, especially close-by Indels. Bioinformatics 2018;34:2918-2926. [PMID: 29590294 DOI: 10.1093/bioinformatics/bty183] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/15/2017] [Accepted: 03/23/2018] [Indexed: 12/30/2022]

Abstract

Motivation

The detection of genomic variants has great significance in genomics, bioinformatics, biomedical research and its applications. However, despite a lot of effort, Indels and structural variants are still under-characterized compared to SNPs. Current approaches based on next-generation sequencing data usually require large numbers of reads (high coverage) to be able to detect such types of variants accurately. However, Indels, especially those close to each other, are still hard to detect accurately.

Results

We introduce a novel approach that leverages known variant information, e.g. provided by dbSNP, dbVar, ExAC or the 1000 Genomes Project, to improve sensitivity of detecting variants, especially close-by Indels. In our approach, the standard reference genome and the known variants are combined to build a meta-reference, which is expected to be probabilistically closer to the subject genomes than the standard reference. An alignment algorithm, which can take into account known variant information, is developed to accurately align reads to the meta-reference. This strategy resulted in accurate alignment and variant calling even with low coverage data. We showed that compared to popular methods such as GATK and SAMtools, our method significantly improves the sensitivity of detecting variants, especially Indels that are close to each other. In particular, our method was able to call these close-by Indels at a 15-20% higher sensitivity than other methods at low coverage, and still get 1-5% higher sensitivity at high coverage, at competitive precision. These results were validated using simulated data with variant profiles extracted from the 1000 Genomes Project data, and real data from the Illumina Platinum Genomes Project and ExAC database. Our finding suggests that by incorporating known variant information in an appropriate manner, sensitive variant calling is possible at a low cost.

Availability and implementation

Implementation can be found in our public code repository https://github.com/namsyvo/IVC.

Supplementary information

Supplementary data are available at Bioinformatics online.

Hadigol M, Khiabanian H. MERIT reveals the impact of genomic context on sequencing error rate in ultra-deep applications. BMC Bioinformatics 2018;19:219. [PMID: 29884116 PMCID: PMC5994075 DOI: 10.1186/s12859-018-2223-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 12/11/2017] [Accepted: 05/29/2018] [Indexed: 02/07/2023]

Abstract

BACKGROUND

Rapid progress in high-throughput sequencing (HTS) and the development of novel library preparation methods have improved the sensitivity of detecting mutations in heterogeneous samples, specifically in high-depth (> 500×) clinical applications. However, HTS methods are bounded by their technical and theoretical limitations and sequencing errors cannot be completely eliminated. Comprehensive quantification of the background noise can highlight both the efficiency and the limitations of any HTS methodology, and help differentiate true mutations at low abundance from artifacts.

RESULTS

We introduce MERIT (Mutation Error Rate Inference Toolkit), designed for in-depth quantification of erroneous substitutions and small insertions and deletions. MERIT incorporates an all-inclusive variant caller and considers genomic context, including the nucleotides immediately at 5′ and 3′, thereby establishing error rates for 96 possible substitutions as well as four single-base and 16 double-base indels. We applied MERIT to ultra-deep sequencing data (1,300,000×) obtained from the amplification of multiple clinically relevant loci, and showed a significant relationship between error rates and genomic contexts. In addition to observing significant difference between transversion and transition rates, we identified variations of more than 100-fold within each error type at high sequencing depths. For instance, T>G transversions in trinucleotide GTCs occurred 133.5 ± 65.9 more often than those in ATAs. Similarly, C>T transitions in GCGs were observed at 73.8 ± 10.5 higher rate than those in TCTs. We also devised an in silico approach to determine the optimal sequencing depth, where errors occur at rates similar to those of expected true mutations. Our analyses showed that increasing sequencing depth might improve sensitivity for detecting some mutations based on their genomic context. For example, T>G rate of error in GTCs did not change when sequenced beyond 10,000×; in contrast, T>G rate in TTAs consistently improved even at above 500,000×.
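The counts quoted above (96 substitutions, four single-base and 16 double-base indels) can be reproduced by a short enumeration. This sketch is not MERIT's code. It assumes the conventional pyrimidine-centered folding, in which each substitution is reported with C or T as the reference base, and it assumes indel categories are defined by the inserted or deleted sequence; the abstract gives only the counts, and the bracketed label format is hypothetical.

```python
from itertools import product
from typing import List

BASES = "ACGT"


def substitution_contexts() -> List[str]:
    # 2 pyrimidine reference bases x 3 alternate bases x 4 five-prime
    # neighbors x 4 three-prime neighbors = 96 trinucleotide contexts.
    contexts = []
    for ref in "CT":
        for alt in BASES:
            if alt == ref:
                continue
            for five_prime, three_prime in product(BASES, repeat=2):
                contexts.append(f"{five_prime}[{ref}>{alt}]{three_prime}")
    return contexts


def single_base_indels() -> List[str]:
    # One category per inserted or deleted base.
    return list(BASES)


def double_base_indels() -> List[str]:
    # One category per inserted or deleted dinucleotide.
    return ["".join(pair) for pair in product(BASES, repeat=2)]


if __name__ == "__main__":
    print(len(substitution_contexts()),
          len(single_base_indels()),
          len(double_base_indels()))
```

Under this labeling, the T>G transversion in GTC mentioned in the abstract corresponds to `G[T>G]C`, and the C>T transition in GCG to `G[C>T]G`.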

CONCLUSIONS

Our results demonstrate significant variation in nucleotide misincorporation rates, and suggest that genomic context should be considered for comprehensive profiling of specimen-specific and sequencing artifacts in high-depth assays. These data provide strong evidence against assigning a single allele frequency threshold to call mutations, for it can result in substantial false positive as well as false negative variants, with important clinical consequences.

Tuzov N. A framework for the estimation of the proportion of true discoveries in single nucleotide variant detection studies for human data. PLoS One 2018;13:e0196058. [PMID: 29694377 PMCID: PMC5918994 DOI: 10.1371/journal.pone.0196058] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/23/2017] [Accepted: 04/05/2018] [Indexed: 12/30/2022]

Ren Y, Reddy JS, Pottier C, Sarangi V, Tian S, Sinnwell JP, McDonnell SK, Biernacka JM, Carrasquillo MM, Ross OA, Ertekin-Taner N, Rademakers R, Hudson M, Mainzer LS, Asmann YW. Identification of missing variants by combining multiple analytic pipelines. BMC Bioinformatics 2018;19:139. [PMID: 29661148 PMCID: PMC5902939 DOI: 10.1186/s12859-018-2151-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 01/02/2018] [Accepted: 04/09/2018] [Indexed: 02/02/2023]

Abstract

Background

After decades of identifying risk factors using array-based genome-wide association studies (GWAS), genetic research of complex diseases has shifted to sequencing-based rare variants discovery. This requires large sample sizes for statistical power and has brought up questions about whether the current variant calling practices are adequate for large cohorts. It is well-known that there are discrepancies between variants called by different pipelines, and that using a single pipeline always misses true variants exclusively identifiable by other pipelines. Nonetheless, it is common practice today to call variants by one pipeline due to computational cost and assume that false negative calls are a small percent of total.

Results

We analyzed 10,000 exomes from the Alzheimer’s Disease Sequencing Project (ADSP) using multiple analytic pipelines consisting of different read aligners and variant calling strategies. We compared variants identified by using two aligners in 50, 100, 200, 500, 1000, and 1952 samples; and compared variants identified by adding single-sample genotyping to the default multi-sample joint genotyping in 50, 100, 500, 2000, 5000 and 10,000 samples. We found that using a single pipeline missed increasing numbers of high-quality variants correlated with sample sizes. By combining two read aligners and two variant calling strategies, we rescued 30% of pass-QC variants at sample size of 2000, and 56% at 10,000 samples. The rescued variants had higher proportions of low frequency (minor allele frequency [MAF] 1–5%) and rare (MAF < 1%) variants, which are the very type of variants of interest. In 660 Alzheimer’s disease cases with earlier onset ages of ≤65, 4 out of 13 (31%) previously-published rare pathogenic and protective mutations in APP, PSEN1, and PSEN2 genes were undetected by the default one-pipeline approach but recovered by the multi-pipeline approach.
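The idea of "rescuing" variants by combining pipelines can be illustrated with plain set operations. This is a minimal sketch under stated assumptions, not the study's method: variants are keyed by a hypothetical (chromosome, position, ref, alt) tuple, quality filtering is assumed to have been applied already, and the rescued fraction is computed relative to the combined call set, which may differ from the denominator used in the abstract.

```python
from typing import Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def rescued_variants(default_calls: Set[Variant],
                     other_pipelines: Iterable[Set[Variant]]
                     ) -> Tuple[Set[Variant], float]:
    # Union of the default pipeline with every additional pipeline.
    combined = set(default_calls).union(*other_pipelines)
    # Variants found only by the additional pipelines.
    rescued = combined - set(default_calls)
    fraction = len(rescued) / len(combined) if combined else 0.0
    return rescued, fraction


if __name__ == "__main__":
    default = {("1", 100, "A", "G"), ("1", 200, "C", "T")}
    second_aligner = {("1", 200, "C", "T"), ("1", 300, "G", "A")}
    single_sample = {("2", 50, "T", "C")}
    rescued, fraction = rescued_variants(default, [second_aligner, single_sample])
    print(sorted(rescued), fraction)
```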

Conclusions

Identification of the complete variant set from sequencing data is the prerequisite of genetic association analyses. The current analytic practice of calling genetic variants from sequencing data using a single bioinformatics pipeline is no longer adequate with the increasingly large projects. The number and percentage of quality variants that passed quality filters but are missed by the one-pipeline approach rapidly increased with sample size.

Electronic supplementary material

The online version of this article (10.1186/s12859-018-2151-0) contains supplementary material, which is available to authorized users.

Affiliation(s)

Yingxue Ren Department of Health Sciences Research, Mayo Clinic, Jacksonville, FL, 32224, USA
Joseph S Reddy Department of Health Sciences Research, Mayo Clinic, Jacksonville, FL, 32224, USA
Cyril Pottier Department of Neuroscience, Mayo Clinic, Jacksonville, FL, 32224, USA
Vivekananda Sarangi Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, 55905, USA
Shulan Tian Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, 55905, USA
Jason P Sinnwell Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, 55905, USA
Shannon K McDonnell Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, 55905, USA
Joanna M Biernacka Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, 55905, USA
Minerva M Carrasquillo Department of Neuroscience, Mayo Clinic, Jacksonville, FL, 32224, USA
Owen A Ross Department of Neuroscience, Mayo Clinic, Jacksonville, FL, 32224, USA; Department of Clinical Genomics, Mayo Clinic, Jacksonville, FL, 32224, USA
Nilüfer Ertekin-Taner Department of Neuroscience, Mayo Clinic, Jacksonville, FL, 32224, USA; Department of Neurology, Mayo Clinic, Jacksonville, FL, 32224, USA
Rosa Rademakers Department of Neuroscience, Mayo Clinic, Jacksonville, FL, 32224, USA
Matthew Hudson National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA; Carl R Woese Institute for Genomic Biology, Carver Biotechnology Center and Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
Liudmila Sergeevna Mainzer National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
Yan W Asmann Department of Health Sciences Research, Mayo Clinic, Jacksonville, FL, 32224, USA

Shringarpure SS, Mathias RA, Hernandez RD, O'Connor TD, Szpiech ZA, Torres R, De La Vega FM, Bustamante CD, Barnes KC, Taub MA. Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data. Bioinformatics 2018;33:1147-1153. [PMID: 28035032 PMCID: PMC5408850 DOI: 10.1093/bioinformatics/btw786] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/15/2016] [Accepted: 12/07/2016] [Indexed: 12/30/2022]

Abstract

Motivation

Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than public data such as HapMap or dbSNP, to train an accurate classifier using Random Forests. We demonstrate our method on a set of variant calls obtained from 642 African-ancestry genomes from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), sequenced to high depth (30X).

Results

We have applied our classifier to compare call sets generated with different calling methods, including both single-sample and multi-sample callers. At a False Positive Rate of 5%, our method determines true positive rates of 97.5%, 95% and 99% on variant calls obtained using Illumina's single-sample caller CASAVA, Real Time Genomics multisample variant caller, and the GATK UnifiedGenotyper, respectively. Since NGS sequencing data may be accompanied by genotype data for the same samples, either collected concurrent to sequencing or from a previous study, our method can be trained on each dataset to provide a more accurate computational validation of site calls compared to generic methods. Moreover, our method allows for adjustment based on allele frequency (e.g. a different set of criteria to determine quality for rare versus common variants) and thereby provides insight into sequencing characteristics that indicate call quality for variants of different frequencies.
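Reporting a true positive rate at a fixed false positive rate amounts to choosing a score threshold on labeled calls. The sketch below shows only that evaluation step, not the Random Forest training described in the abstract. It assumes each call already has a classifier score (higher meaning more likely true) and a truth label derived from array genotypes; the function name and toy data are hypothetical.

```python
from typing import Sequence, Tuple


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[bool],
               max_fpr: float = 0.05) -> Tuple[float, float]:
    # Split scores by truth label.
    pos = [s for s, is_true in zip(scores, labels) if is_true]
    neg = [s for s, is_true in zip(scores, labels) if not is_true]
    if not pos or not neg:
        raise ValueError("need at least one true and one false call")

    best_tpr, best_threshold = 0.0, float("inf")
    # Lower the threshold step by step; the false positive rate can only
    # grow, so stop at the first threshold that exceeds the allowed rate.
    for threshold in sorted(set(scores), reverse=True):
        fpr = sum(1 for s in neg if s >= threshold) / len(neg)
        if fpr > max_fpr:
            break
        best_tpr = sum(1 for s in pos if s >= threshold) / len(pos)
        best_threshold = threshold
    return best_tpr, best_threshold


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    labels = [True, True, True, False, True, False, False, False]
    print(tpr_at_fpr(scores, labels, max_fpr=0.25))
```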

Availability and Implementation

Code is available on Github at: https://github.com/suyashss/variant_validation.

Contacts

suyashs@stanford.edu or mtaub@jhsph.edu.

Supplementary information

Supplementary data are available at Bioinformatics online.

Zomnir MG, Lipkin L, Pacula M, Dominguez Meneses E, MacLeay A, Duraisamy S, Nadhamuni N, Al Turki SH, Zheng Z, Rivera M, Nardi V, Dias-Santagata D, Iafrate AJ, Le LP, Lennerz JK. Artificial Intelligence Approach for Variant Reporting. JCO Clin Cancer Inform 2018;2:CCI.16.00079. [PMID: 30364844 PMCID: PMC6198661 DOI: 10.1200/cci.16.00079] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 01/12/2023]

Abstract

Purpose

Next-generation sequencing technologies are actively applied in clinical oncology. Bioinformatics pipeline analysis is an integral part of this process; however, humans cannot yet realize the full potential of the highly complex pipeline output. As a result, the decision to include a variant in the final report during routine clinical sign-out remains challenging.

Methods

We used an artificial intelligence approach to capture the collective clinical sign-out experience of six board-certified molecular pathologists to build and validate a decision support tool for variant reporting. We extracted all reviewed and reported variants from our clinical database and tested several machine learning models. We used 10-fold cross-validation for our variant call prediction model, which derives a contiguous prediction score from 0 to 1 (no to yes) for clinical reporting.

Results

For each of the 19,594 initial training variants, our pipeline generates approximately 500 features, which results in a matrix of > 9 million data points. From a comparison of naive Bayes, decision trees, random forests, and logistic regression models, we selected models that allow human interpretability of the prediction score. The logistic regression model demonstrated 1% false negativity and 2% false positivity. The final models' Youden indices were 0.87 and 0.77 for screening and confirmatory cutoffs, respectively. Retraining on a new assay and performance assessment in 16,123 independent variants validated our approach (Youden index, 0.93). We also derived individual pathologist-centric models (virtual consensus conference function), and a visual drill-down functionality allows assessment of how underlying features contributed to a particular score or decision branch for clinical implementation.
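The Youden index quoted in these results is defined as sensitivity plus specificity minus one. The following sketch computes it from confusion-matrix counts; the counts in the example are made up for illustration and are not taken from the study.

```python
def youden_index(tp: int, fn: int, tn: int, fp: int) -> float:
    # Sensitivity: fraction of truly reportable variants that were flagged.
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / (tp + fn)
    # Specificity: fraction of non-reportable variants correctly rejected.
    specificity = tn / (tn + fp)
    # J ranges from -1 to 1; 0 means no better than chance, 1 is perfect.
    return sensitivity + specificity - 1.0


if __name__ == "__main__":
    # Hypothetical counts: sensitivity 0.9, specificity 0.8.
    print(youden_index(tp=90, fn=10, tn=80, fp=20))
```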

Conclusion

Our decision support tool for variant reporting is a practically relevant artificial intelligence approach to harness the next-generation sequencing bioinformatics pipeline output when the complexity of data interpretation exceeds human capabilities.

Ye S, Yuan X, Lin X, Gao N, Luo Y, Chen Z, Li J, Zhang X, Zhang Z. Imputation from SNP chip to sequence: a case study in a Chinese indigenous chicken population. J Anim Sci Biotechnol 2018;9:30. [PMID: 29581880 PMCID: PMC5861640 DOI: 10.1186/s40104-018-0241-5] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 08/06/2017] [Accepted: 01/26/2018] [Indexed: 11/24/2022]

Abstract

Background

Genome-wide association studies and genomic predictions are thought to be optimized by using whole-genome sequence (WGS) data. However, sequencing thousands of individuals of interest is expensive. Imputation from SNP panels to WGS data is an attractive and less expensive approach to obtain WGS data. The aims of this study were to investigate the accuracy of imputation and to provide insight into the design and execution of genotype imputation.

Results

We genotyped 450 chickens with a 600 K SNP array, and sequenced 24 key individuals by whole genome re-sequencing. Accuracy of imputation from putative 60 K and 600 K array data to WGS data was 0.620 and 0.812 for Beagle, and 0.810 and 0.914 for FImpute, respectively. By increasing the sequencing cost from 24X to 144X, the imputation accuracy increased from 0.525 to 0.698 for Beagle and from 0.654 to 0.823 for FImpute. With fixed sequence depth (12X), increasing the number of sequenced animals from 1 to 24 improved accuracy from 0.421 to 0.897 for FImpute and from 0.396 to 0.777 for Beagle. Using optimally selected key individuals resulted in a higher imputation accuracy compared with using randomly selected individuals as a reference population for re-sequencing. With fixed reference population size (24), imputation accuracy increased from 0.654 to 0.875 for FImpute and from 0.512 to 0.762 for Beagle as the sequencing depth increased from 1X to 12X. With a given total cost of genotyping, accuracy increased with the size of the reference population for FImpute, but the pattern was not valid for Beagle, which showed the highest accuracy at six-fold coverage for the scenarios used in this study.
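The abstract does not state how imputation accuracy was measured. One common choice in livestock genomics is the Pearson correlation between true and imputed genotypes coded as allele counts (0, 1, 2); the sketch below assumes that definition purely for illustration, and the toy genotype vectors are hypothetical.

```python
from math import sqrt
from typing import Sequence


def imputation_accuracy(true_gt: Sequence[float],
                        imputed_gt: Sequence[float]) -> float:
    # Pearson correlation between true and imputed allele counts.
    n = len(true_gt)
    if n != len(imputed_gt) or n < 2:
        raise ValueError("need two equally long vectors with at least 2 values")
    mean_true = sum(true_gt) / n
    mean_imp = sum(imputed_gt) / n
    cov = sum((t - mean_true) * (i - mean_imp)
              for t, i in zip(true_gt, imputed_gt))
    var_true = sum((t - mean_true) ** 2 for t in true_gt)
    var_imp = sum((i - mean_imp) ** 2 for i in imputed_gt)
    if var_true == 0 or var_imp == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_true * var_imp)


if __name__ == "__main__":
    sequenced = [0, 1, 2, 1, 0, 2]
    imputed = [0, 1, 2, 2, 0, 1]
    print(round(imputation_accuracy(sequenced, imputed), 3))
```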

Conclusions

In conclusion, we comprehensively investigated the impacts of several key factors on genotype imputation. Generally, increasing sequencing cost gave a higher imputation accuracy. But with a fixed sequencing cost, an optimal imputation strategy enhances the performance of WGP and GWAS. An optimal imputation strategy should comprehensively take into consideration the size of the reference population, imputation algorithms, marker density, the population structure of the target population, and the methods used to select key individuals. This work sheds additional light on how to design and execute genotype imputation for livestock populations.

Electronic supplementary material

The online version of this article (10.1186/s40104-018-0241-5) contains supplementary material, which is available to authorized users.

Affiliation(s)

Shaopan Ye Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, Guangdong China
Xiaolong Yuan Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, Guangdong China
Xiran Lin Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, Guangdong China
Ning Gao Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, Guangdong China
Yuanyu Luo Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, Guangdong China
Zanmou Chen Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, Guangdong China
Jiaqi Li Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, Guangdong China
Xiquan Zhang Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, Guangdong China
Zhe Zhang Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, Guangdong China
