1
Zhang L, Li H, Shi M, Ren K, Zhang W, Cheng Y, Wang Y, Xia XQ. FishSNP: a high quality cross-species SNP database of fishes. Sci Data 2024; 11:286. [PMID: 38461307 PMCID: PMC10924876 DOI: 10.1038/s41597-024-03111-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Accepted: 03/04/2024] [Indexed: 03/11/2024]
Abstract
The progress of aquaculture heavily depends on the efficient utilization of diverse genetic resources to enhance production efficiency and maximize profitability. Single nucleotide polymorphisms (SNPs) have been widely used in aquaculture genomics, genetics, and breeding research because they are the most prevalent molecular markers in the genome. Currently, a large number of SNP markers from cultured fish species are scattered across individual studies, making querying complicated and data reuse problematic. We compiled relevant SNP data from the literature and public databases to create a fish SNP database, FishSNP ( http://bioinfo.ihb.ac.cn/fishsnp ). For raw data on which the original authors had not performed SNP calling, we used a unified analysis pipeline to obtain SNPs with high reliability. The database presently contains 45,690,243 (45 million) nonredundant SNPs for 13 fish species, of which 30,288,958 (30 million) are high-quality SNPs. The main functions of FishSNP are searching, browsing, annotating, and downloading SNPs, which provide researchers with varied and comprehensive associated information.
Affiliation(s)
- Lei Zhang
- State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
- College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
- Heng Li
- State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
- College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
- Mijuan Shi
- State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China.
- College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China.
- Keyi Ren
- State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
- College of Fisheries and Life Science, Dalian Ocean University, Dalian, 116023, China
- Wanting Zhang
- State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
- Yingyin Cheng
- State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
- Yaping Wang
- State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
- College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
- Xiao-Qin Xia
- State Key Laboratory of Freshwater Ecology and Biotechnology, Hubei Hongshan Laboratory, Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture and Rural Affairs, The Innovation Academy of Seed Design, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China.
- College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China.
2
Xiang X, Lu B, Song D, Li J, Shu K, Pu D. Evaluating the performance of low-frequency variant calling tools for the detection of variants from short-read deep sequencing data. Sci Rep 2023; 13:20444. [PMID: 37993475 PMCID: PMC10665316 DOI: 10.1038/s41598-023-47135-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Accepted: 11/09/2023] [Indexed: 11/24/2023]
Abstract
Detection of low-frequency variants with high accuracy plays an important role in biomedical research and clinical practice. However, it is challenging to do so with next-generation sequencing (NGS) approaches due to the high error rates of NGS. To accurately distinguish low-level true variants from these errors, many statistical variant calling tools for low-frequency variants have been proposed, but a systematic performance comparison of these tools has not yet been performed. Here, we evaluated four raw-reads-based variant callers (SiNVICT, outLyzer, Pisces, and LoFreq) and four UMI-based variant callers (DeepSNVMiner, MAGERI, smCounter2, and UMI-VarCal) considering their capability to call single nucleotide variants (SNVs) with allelic frequency as low as 0.025% in deep sequencing data. We analyzed a total of 54 simulated datasets with various sequencing depths and variant allele frequencies (VAFs), two reference datasets, and Horizon Tru-Q sample data. The results showed that the UMI-based callers, except smCounter2, outperformed the raw-reads-based callers regarding detection limit. Sequencing depth had almost no effect on the UMI-based callers but significantly influenced the raw-reads-based callers. Regardless of the sequencing depth, MAGERI showed the fastest analysis, while smCounter2 consistently took the longest to finish the variant calling process. Overall, DeepSNVMiner and UMI-VarCal performed the best, with sensitivity and precision of 88% and 100%, and 84% and 100%, respectively. In conclusion, the UMI-based callers, except smCounter2, outperformed the raw-reads-based callers in terms of sensitivity and precision. We recommend using DeepSNVMiner and UMI-VarCal for low-frequency variant detection. The results provide important information regarding future directions for reliable low-frequency variant detection and algorithm development, which is critical in genetics-based medical research and clinical applications.
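The abstract above ranks callers by sensitivity and precision. As a minimal sketch of how those two metrics are commonly computed from a truth set and a call set (this is not the authors' evaluation code; the function name and the toy variants below are hypothetical and chosen only for illustration):

```python
def sensitivity_precision(truth, called):
    """Return (sensitivity, precision) for two collections of variant keys.

    Each variant is assumed to be a hashable key such as
    (chrom, pos, ref, alt); this key layout is an illustrative assumption.
    """
    truth, called = set(truth), set(called)
    tp = len(truth & called)   # true variants that were called
    fn = len(truth - called)   # true variants that were missed
    fp = len(called - truth)   # calls that are not in the truth set
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


# Hypothetical toy data: four true variants, four calls, three shared.
truth = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"),
         ("chr2", 50, "C", "T"), ("chr2", 75, "T", "G")}
called = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"),
          ("chr2", 50, "C", "T"), ("chr3", 10, "A", "G")}

sens, prec = sensitivity_precision(truth, called)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

Matching on exact (chrom, pos, ref, alt) keys is the simplest possible comparison; a real benchmark would typically also normalize variant representation before comparing.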
Affiliation(s)
- Xudong Xiang
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Bowen Lu
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Dongyang Song
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Jie Li
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Kunxian Shu
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China.
- Dan Pu
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China.
3
Sancha-Velasco A, Uceda-Heras A, García-Cabezas MÁ. Cortical type: a conceptual tool for meaningful biological interpretation of high-throughput gene expression data in the human cerebral cortex. Front Neuroanat 2023; 17:1187280. [PMID: 37426901 PMCID: PMC10323436 DOI: 10.3389/fnana.2023.1187280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2023] [Accepted: 05/31/2023] [Indexed: 07/11/2023]
Abstract
The interpretation of massive high-throughput gene expression data requires computational and biological analyses to identify statistically and biologically significant differences, respectively. There are abundant sources that describe computational tools for statistical analysis of massive gene expression data, but few address data analysis for biological significance. In the present article we exemplify the importance of selecting the proper biological context in the human brain for gene expression data analysis and interpretation. For this purpose, we use cortical type as a conceptual tool to make predictions about gene expression in areas of the human temporal cortex. We predict that the expression of genes related to glutamatergic transmission would be higher in areas of simpler cortical type, the expression of genes related to GABAergic transmission would be higher in areas of more complex cortical type, and the expression of genes related to epigenetic regulation would be higher in areas of simpler cortical type. Then, we test these predictions with gene expression data from several regions of the human temporal cortex obtained from the Allen Human Brain Atlas. We find that the expression of several genes shows statistically significant differences in agreement with the predicted gradual expression along the laminar complexity gradient of the human cortex, suggesting that simpler cortical types may have greater glutamatergic excitability and epigenetic turnover compared to more complex types; on the other hand, complex cortical types seem to have greater GABAergic inhibitory control compared to simpler types. Our results show that cortical type is a good predictor of synaptic plasticity, epigenetic turnover, and selective vulnerability in human cortical areas. Thus, cortical type can provide a meaningful context for interpreting high-throughput gene expression data in the human cerebral cortex.
Affiliation(s)
- Ariadna Sancha-Velasco
- Department of Anatomy, Histology and Neuroscience, School of Medicine, Autonomous University of Madrid, Madrid, Spain
- Master Program in Neuroscience, Autonomous University of Madrid, Madrid, Spain
- Alicia Uceda-Heras
- Master Program in Neuroscience, Autonomous University of Madrid, Madrid, Spain
- Ph.D. Program in Neuroscience UAM-Cajal, Autonomous University of Madrid, Madrid, Spain
- Miguel Ángel García-Cabezas
- Department of Anatomy, Histology and Neuroscience, School of Medicine, Autonomous University of Madrid, Madrid, Spain
- Master Program in Neuroscience, Autonomous University of Madrid, Madrid, Spain
- Ph.D. Program in Neuroscience UAM-Cajal, Autonomous University of Madrid, Madrid, Spain
- Neural Systems Laboratory, Department of Health Sciences, Boston University, Boston, MA, United States
4
Zhong ZQ, Li R, Wang Z, Tian SS, Xie XF, Wang ZY, Na W, Wang QS, Pan YC, Xiao Q. Genome-wide scans for selection signatures in indigenous pigs revealed candidate genes relating to heat tolerance. Animal 2023; 17:100882. [PMID: 37406393 DOI: 10.1016/j.animal.2023.100882] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/26/2022] [Revised: 06/04/2023] [Accepted: 06/06/2023] [Indexed: 07/07/2023]
Abstract
Heat stress is a major problem that constrains pig productivity. Understanding and identifying adaptation to heat stress has been the focus of recent studies, and the identification of genome-wide selection signatures can provide insights into the mechanisms of environmental adaptation. Here, we generated whole-genome re-sequencing data from six Chinese indigenous pig populations to identify genomic regions with selection signatures related to heat tolerance using multiple methods: three methods for intra-population analyses (Integrated Haplotype Score, Runs of Homozygosity and Nucleotide diversity Analysis) and three methods for inter-population analyses (Fixation index (FST), Cross-population Composite Likelihood Ratio and Cross-population Extended Haplotype Homozygosity). In total, 1 966 796 single nucleotide polymorphisms were identified in this study. Genetic structure analyses and FST indicated differentiation among these breeds. Based on information on the location environment, the six breeds were divided into heat and cold groups. By combining two or more approaches for selection signatures, outlier signals in overlapping regions were identified as candidate selection regions. A total of 163 candidate genes were identified, of which, 29 were associated with heat stress injury and anti-inflammatory effects. These candidate genes were further associated with 78 Gene Ontology functional terms and 30 Kyoto Encyclopedia of Genes and Genomes pathways in enrichment analysis (P < 0.05). Some of these have clear relevance to heat resistance, such as the AMPK signalling pathway and the mTOR signalling pathway. The results improve our understanding of the selection mechanisms responsible for heat resistance in pigs and provide new insights of introgression in heat adaptation.
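The fixation index (FST) named in the abstract above measures allele-frequency differentiation between populations. A minimal sketch of the textbook heterozygosity-based form, FST = (H_T - H_S) / H_T, for one biallelic SNP and two equally weighted populations is given here. This is an illustrative assumption, not the estimator or software used in the study; the function names are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_populations(p1, p2):
    """Heterozygosity-based FST for one biallelic SNP.

    Assumes two populations of equal weight (an illustrative simplification).
    p1, p2: frequency of the same allele in population 1 and population 2.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    p_bar = (p1 + p2) / 2.0
    h_t = heterozygosity(p_bar)                             # pooled population
    if h_t == 0.0:
        return 0.0                                          # site is monomorphic overall
    h_s = (heterozygosity(p1) + heterozygosity(p2)) / 2.0   # mean within populations
    return (h_t - h_s) / h_t


print(fst_two_populations(0.8, 0.2))  # strongly differentiated site
print(fst_two_populations(0.3, 0.3))  # identical frequencies
```

Genome scans of this kind usually average such per-SNP values over sliding windows and use sample-size-corrected estimators; the single-site form here only shows the quantity being estimated.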
Affiliation(s)
- Z Q Zhong
- Hainan Key Laboratory of Tropical Animal Reproduction & Breeding and Epidemic Disease Research, College of Animal Science and Technology, Hainan University, Haikou 570228, China
- R Li
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China
- Z Wang
- Department of Animal Science, College of Animal Science, Zhejiang University, Hangzhou 310058, China
- S S Tian
- Hainan Key Laboratory of Tropical Animal Reproduction & Breeding and Epidemic Disease Research, College of Animal Science and Technology, Hainan University, Haikou 570228, China
- X F Xie
- Hainan Key Laboratory of Tropical Animal Reproduction & Breeding and Epidemic Disease Research, College of Animal Science and Technology, Hainan University, Haikou 570228, China
- Z Y Wang
- Hainan Key Laboratory of Tropical Animal Reproduction & Breeding and Epidemic Disease Research, College of Animal Science and Technology, Hainan University, Haikou 570228, China
- W Na
- Hainan Key Laboratory of Tropical Animal Reproduction & Breeding and Epidemic Disease Research, College of Animal Science and Technology, Hainan University, Haikou 570228, China
- Q S Wang
- Hainan Yazhou Bay Seed Laboratory, Yongyou Industrial Park, Yazhou Bay Sci-Tech City, Sanya 572025, China; Department of Animal Science, College of Animal Science, Zhejiang University, Hangzhou 310058, China
- Y C Pan
- Hainan Yazhou Bay Seed Laboratory, Yongyou Industrial Park, Yazhou Bay Sci-Tech City, Sanya 572025, China; Department of Animal Science, College of Animal Science, Zhejiang University, Hangzhou 310058, China
- Q Xiao
- Hainan Key Laboratory of Tropical Animal Reproduction & Breeding and Epidemic Disease Research, College of Animal Science and Technology, Hainan University, Haikou 570228, China.
5
Craven KE, Fischer CG, Jiang L, Pallavajjala A, Lin MT, Eshleman JR. Optimizing Insertion and Deletion Detection Using Next-Generation Sequencing in the Clinical Laboratory. J Mol Diagn 2022; 24:1217-1231. [PMID: 36162758 PMCID: PMC9808503 DOI: 10.1016/j.jmoldx.2022.08.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2022] [Revised: 07/18/2022] [Accepted: 08/31/2022] [Indexed: 01/13/2023]
Abstract
Detection of insertions and deletions (InDels) by short-read next-generation sequencing (NGS) technology can be challenging because of frequent misaligned reads. A systematic analysis of short InDels (1 to 30 bases) and fms-related receptor tyrosine kinase 3 (FLT3) internal tandem duplications (ITDs; 6 to 183 bases) from 46 clinical cases of solid or hematologic malignancy processed with a clinical NGS assay identified misaligned reads in every case, ranging from 3% to 100% of reads with the InDel showing mismapped bases. Mismaps also increased with InDel size. As a consequence, the clinical NGS bioinformatics pipeline undercalled the variant allele frequency by 1% to 84%, incorrectly called simultaneous single-base substitutions along with InDels, or did not report an FLT3 ITD that had been detected by capillary electrophoresis. To improve the ability of the pipeline to better detect and quantify InDels, we utilized a software program called Assembly-Based ReAligner (ABRA2) to more accurately remap reads. ABRA2 was able to correct 41% to 100% of the reads with mismapped bases and led to absolute increases in the variant allele frequency from 1% to 61% along with correction of all of the single-base substitutions except for two cases. ABRA2 could also detect multiple FLT3 ITD clones except for one 183-base ITD. Our analysis has found that ABRA2 performs well on short InDels as well as FLT3 ITDs that are <100 bases.
Affiliation(s)
- Kelly E Craven
- Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland
- Catherine G Fischer
- Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland; Division of Cancer Prevention, National Cancer Institute, Rockville, Maryland
- LiQun Jiang
- Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland
- Aparna Pallavajjala
- Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland
- Ming-Tseh Lin
- Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland
- James R Eshleman
- Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland; Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, Maryland; The Sol Goldman Pancreatic Cancer Research Center, Johns Hopkins University School of Medicine, Baltimore, Maryland.
6
Wang H, Wen J, Li H, Zhu T, Zhao X, Zhang J, Zhang X, Tang C, Qu L, Gemingguli M. Candidate pigmentation genes related to feather color variation in an indigenous chicken breed revealed by whole genome data. Front Genet 2022; 13:985228. [PMID: 36479242 PMCID: PMC9720402 DOI: 10.3389/fgene.2022.985228] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/03/2022] [Accepted: 10/10/2022] [Indexed: 08/27/2023]
Abstract
Chicken plumage color is an inheritable phenotype that was naturally and artificially selected for during domestication. The Baicheng You chicken is an indigenous Chinese chicken breed presenting three main plumage colors: lavender, black, and yellow. To explore the genetic mechanisms underlying pigmentation in Baicheng You chickens, we re-sequenced the whole genomes of Baicheng You chickens with each of the three plumage colors. By analyzing the divergent regions of the genome among chickens with different feather colors, we identified candidate genomic regions associated with feather color in Baicheng You chickens. We found that EGR1, MLPH, RAB17, SOX5, and GRM5 were potential genes for black, lavender, and yellow feathers. MLPH, GRM5, and SOX5 have previously been found to be related to plumage color in birds. Our results showed that EGR1 is the most plausible candidate gene for black plumage, RAB17, MLPH, and SOX5 for lavender plumage, and GRM5 for yellow plumage in Baicheng You chickens.
Affiliation(s)
- Huie Wang
- Xinjiang Production and Construction Corps, Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin, Tarim University, Alar, China
- College of Life Science and Technology, College of Animal Science and Technology, Tarim University, Alar, China
- Junhui Wen
- National Engineering Laboratory for Animal Breeding, Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Haiying Li
- College of Animal Science, Xinjiang Agricultural University, Urumchi, China
- Tao Zhu
- National Engineering Laboratory for Animal Breeding, Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Xiurong Zhao
- National Engineering Laboratory for Animal Breeding, Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Jinxin Zhang
- National Engineering Laboratory for Animal Breeding, Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Xinye Zhang
- National Engineering Laboratory for Animal Breeding, Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Chi Tang
- Xinjiang Production and Construction Corps, Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin, Tarim University, Alar, China
- Lujiang Qu
- Xinjiang Production and Construction Corps, Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin, Tarim University, Alar, China
- National Engineering Laboratory for Animal Breeding, Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- M. Gemingguli
- Xinjiang Production and Construction Corps, Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin, Tarim University, Alar, China
- College of Life Science and Technology, College of Animal Science and Technology, Tarim University, Alar, China
7
Feldmann D, Bope CD, Patricios J, Chimusa ER, Collins M, September AV. A whole genome sequencing approach to anterior cruciate ligament rupture-a twin study in two unrelated families. PLoS One 2022; 17:e0274354. [PMID: 36201451 PMCID: PMC9536556 DOI: 10.1371/journal.pone.0274354] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2021] [Accepted: 08/25/2022] [Indexed: 11/06/2022]
Abstract
Predisposition to anterior cruciate ligament (ACL) rupture is multi-factorial, with variation in the genome considered a key intrinsic risk factor. Most implicated loci have been identified from candidate gene-based approaches using case-control association settings. Here, we leverage a hypothesis-free whole genome sequencing approach in two unrelated families (Family A and B), each with twins with a history of recurrent ACL ruptures acquired playing rugby as their primary sport, aiming to elucidate biologically relevant function-altering variants and genetic modifiers in ACL rupture. Family A monozygotic twin males (Twin 1 and Twin 2) both sustained two unilateral non-contact ACL ruptures of the right limb while playing club level touch rugby. Their male sibling, who sustained a bilateral non-contact ACL rupture while playing rugby union, was also recruited. The father, who had sustained a unilateral non-contact ACL rupture on the right limb while playing professional amateur level football, and the mother, who had participated in dancing for over 10 years at a social level with no previous ligament or tendon injuries, were both recruited. Family B monozygotic twin males (Twin 3 and Twin 4) were recruited; Twin 3 had sustained a unilateral non-contact ACL rupture of the right limb and Twin 4 had sustained three non-contact ACL ruptures (two in the right limb and one in the left limb), both while playing provincial level rugby union. Their female sibling, who participated in karate and swimming, and their mother, who participated in hockey (4 years), horse riding (15 years) and swimming, both reported no previous history of ligament or tendon injury. Variants with potential deleterious, loss-of-function and pathogenic effects were prioritised. Identity by descent, molecular dynamic simulation and functional partner analyses were conducted.
We identified, in all nine affected individuals, including twin sets, non-synonymous SNPs in three genes, COL12A1, CATSPER2, and KCNJ12, that are commonly enriched for deleterious, loss-of-function mutations; their dysfunctions are known to be involved in the development of chronic pain, and they represent key therapeutic targets. Notably, using identity by descent (IBD) analyses, a long shared identical sequence interval that included the LINC01250 gene, around the telomeric region of chromosome 2p25.3, was common between affected twins in both families and an affected brother. Overall, gene sets were enriched in pathways relevant to ACL pathophysiology, including complement/coagulation cascades (p = 3.0e-7), purine metabolism (p = 6.0e-7) and mismatch repair (p = 6.9e-5) pathways. This study fills an important gap in knowledge by using a WGS approach focused on potentially deleterious variants in two unrelated families with a historical record of ACL rupture, and provides new insights into ACL pathophysiology by identifying gene sets that contribute to variability in ACL risk.
Affiliation(s)
- Daneil Feldmann
- Division of Physiological Sciences, Department of Human Biology, University of Cape Town, Cape Town, South Africa
- Christian D. Bope
- Department of Mathematics and Computer Science, Faculty of Sciences, University of Kinshasa, Kinshasa, Democratic Republic of Congo
- Division of Human Genetics, Department of Pathology, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Centre for Bioinformatics, Department of Informatics, University of Oslo, Oslo, Norway
- Jon Patricios
- Wits Sport and Health (WiSH), School of Clinical Medicine, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
- Emile R. Chimusa
- Department of Applied Sciences, Faculty of Health and Life Sciences, Northumbria University, Newcastle, Tyne and Wear, United Kingdom
- Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Malcolm Collins
- Division of Physiological Sciences, Department of Human Biology, University of Cape Town, Cape Town, South Africa
- UCT Research Centre for Health Through Physical Activity, Lifestyle and Sport (HPALS), Cape Town, South Africa
- International Federation of Sports Medicine (FIMS) Collaborative Centre of Sports Medicine, Cape Town, South Africa
- Alison V. September
- Division of Physiological Sciences, Department of Human Biology, University of Cape Town, Cape Town, South Africa
- UCT Research Centre for Health Through Physical Activity, Lifestyle and Sport (HPALS), Cape Town, South Africa
- International Federation of Sports Medicine (FIMS) Collaborative Centre of Sports Medicine, Cape Town, South Africa
8
Lefouili M, Nam K. The evaluation of Bcftools mpileup and GATK HaplotypeCaller for variant calling in non-human species. Sci Rep 2022; 12:11331. [PMID: 35790846 PMCID: PMC9256665 DOI: 10.1038/s41598-022-15563-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 09/22/2021] [Accepted: 06/27/2022] [Indexed: 11/09/2022]
Abstract
Identification of genetic variations is a central part of population and quantitative genomics studies based on high-throughput sequencing data. Even though popular variant callers such as Bcftools mpileup and GATK HaplotypeCaller were developed nearly 10 years ago, their performance is still largely unknown for non-human species. Here, we showed by benchmark analyses with a simulated insect population that Bcftools mpileup performs better than GATK HaplotypeCaller in terms of recovery rate and accuracy regardless of mapping software. The vast majority of false positives were observed from repeats, especially for GATK HaplotypeCaller. Variant scores calculated by GATK did not clearly distinguish true positives from false positives in the vast majority of cases, implying that hard-filtering with GATK could be challenging. These results suggest that Bcftools mpileup may be the first choice for non-human studies and that variants within repeats might have to be excluded for downstream analyses.
Affiliation(s)
- Kiwoong Nam
- DGIMI, Univ Montpellier, INRAE, Montpellier, France.
9
Saremi B, Gusmag F, Distl O, Schaarschmidt F, Metzger J, Becker S, Jung K. A comparison of strategies for generating artificial replicates in RNA-seq experiments. Sci Rep 2022; 12:7170. [PMID: 35505053 PMCID: PMC9065086 DOI: 10.1038/s41598-022-11302-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/15/2021] [Accepted: 04/04/2022] [Indexed: 11/21/2022]
Abstract
Due to the overall high costs, technical replicates are usually omitted in RNA-seq experiments, but several methods exist to generate them artificially. Bootstrapping reads from FASTQ-files has recently been used in the context of other NGS analyses and can be used to generate artificial technical replicates. Bootstrapping samples from the columns of the expression matrix has already been used for DNA microarray data and generates a new artificial replicate of the whole experiment. Mixing data of individual samples has been used for data augmentation in machine learning. The aim of this comparison is to evaluate which of these strategies is best suited to study the reproducibility of differential expression and gene-set enrichment analysis in an RNA-seq experiment. To study the approaches under controlled conditions, we performed a new RNA-seq experiment on gene expression changes upon virus infection compared to untreated control samples. In order to compare the approaches for artificial replicates, each of the samples was sequenced twice, i.e. as true technical replicates, and differential expression analysis and GO term enrichment analysis were conducted separately for the two resulting data sets. Although we observed a high correlation between the results from the two replicates, there are still many genes and GO terms that would be selected from one replicate but not from the other. Cluster analyses showed that artificial replicates generated by bootstrapping reads produce p values and fold changes that are close to those obtained from the true data sets. Results generated from artificial replicates with the approaches of column bootstrap or mixing observations were less similar to the results from the true replicates. Furthermore, the overlap of results among replicates generated by column bootstrap or mixing observations was much stronger than among the true replicates.
Artificial technical replicates generated by bootstrapping sequencing reads from FASTQ-files are better suited to study the reproducibility of results from differential expression and GO term enrichment analysis in RNA-seq experiments than column bootstrap or mixing observations. However, FASTQ-bootstrapping is computationally more expensive than the other two approaches. The FASTQ-bootstrapping may be applicable to other applications of high-throughput sequencing.
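As an illustration of the FASTQ-bootstrapping idea described in this abstract, the following is a minimal sketch in Python, not the authors' implementation. It assumes single-end reads stored as unwrapped four-line FASTQ records; the function names `parse_fastq`, `bootstrap_reads`, and `to_fastq` are hypothetical and chosen only for this example.

```python
import random
from typing import List, Tuple

# One FASTQ record: header, sequence, separator, quality string.
Record = Tuple[str, str, str, str]


def parse_fastq(text: str) -> List[Record]:
    """Split FASTQ text into 4-line records (assumes unwrapped, non-empty lines)."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    return [
        (lines[i], lines[i + 1], lines[i + 2], lines[i + 3])
        for i in range(0, len(lines), 4)
    ]


def bootstrap_reads(records: List[Record], seed: int) -> List[Record]:
    """Draw len(records) reads with replacement to form one artificial replicate."""
    if not records:
        return []
    rng = random.Random(seed)  # seeded so each replicate is reproducible
    return rng.choices(records, k=len(records))


def to_fastq(records: List[Record]) -> str:
    """Serialize records back to FASTQ text."""
    return "".join("\n".join(rec) + "\n" for rec in records)
```

Each call with a different seed yields a different artificial replicate of the same sequencing run, which can then be passed through the usual alignment and quantification steps. For paired-end data, one would presumably draw read indices once and apply them to both mate files so that pairs stay together.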
Affiliation(s)
- Babak Saremi
- Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Hannover, Germany
- Frederic Gusmag
- Institute for Parasitology, University of Veterinary Medicine Hannover, Foundation, Hannover, Germany
- Ottmar Distl
- Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Hannover, Germany
- Frank Schaarschmidt
- Biostatistics Department, Institute for Cell Biology, Leibniz University Hannover, Hannover, Germany
- Julia Metzger
- Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Hannover, Germany; RG Development and Disease, Veterinary Functional Genomics, Max-Planck-Institute for Molecular Genetics, Berlin, Germany
- Stefanie Becker
- Institute for Parasitology, University of Veterinary Medicine Hannover, Foundation, Hannover, Germany
- Klaus Jung
- Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Hannover, Germany
|
10
|
Akoniyon OP, Adewumi TS, Maharaj L, Oyegoke OO, Roux A, Adeleke MA, Maharaj R, Okpeku M. Whole Genome Sequencing Contributions and Challenges in Disease Reduction Focused on Malaria. Biology 2022; 11:587. [PMID: 35453786 PMCID: PMC9027812 DOI: 10.3390/biology11040587] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/13/2022] [Revised: 03/31/2022] [Accepted: 04/01/2022] [Indexed: 12/11/2022]
Abstract
Malaria elimination remains an important goal that requires the adoption of sophisticated science and management strategies in the era of the COVID-19 pandemic. The advent of next generation sequencing (NGS) is making whole genome sequencing (WGS) a standard today in the field of life sciences, as PCR genotyping and targeted sequencing provide insufficient information compared to the whole genome. Thus, adapting WGS approaches to malaria parasites is pertinent to studying the epidemiology of the disease, as different regions are at different phases in their malaria elimination agenda. Therefore, this review highlights the applications of WGS in disease management, the challenges of WGS in controlling malaria parasites, and the roles of WGS in the pursuit of malaria reduction and elimination. WGS has had an invaluable impact on malaria research and has helped countries reach the elimination phase rapidly by providing the information needed to thwart transmission, pathology, and drug resistance. However, to eliminate malaria in sub-Saharan Africa (SSA), where malaria transmission is high, we recommend that WGS machines be readily available and affordable in the region.
Affiliation(s)
- Olusegun Philip Akoniyon
- Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
- Taiye Samson Adewumi
- Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
- Leah Maharaj
- Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
- Olukunle Olugbenle Oyegoke
- Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
- Alexandra Roux
- Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
- Matthew A. Adeleke
- Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
- Rajendra Maharaj
- Office of Malaria Research, South African Medical Research Council, Cape Town 7505, South Africa
- Moses Okpeku
- Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
|
11
|
Liu J, Shen Q, Bao H. Comparison of seven SNP calling pipelines for the next-generation sequencing data of chickens. PLoS One 2022; 17:e0262574. [PMID: 35100292 PMCID: PMC8803190 DOI: 10.1371/journal.pone.0262574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2021] [Accepted: 12/29/2021] [Indexed: 11/18/2022]
Abstract
Single nucleotide polymorphisms (SNPs) are widely used in genome-wide association studies and population genetics analyses. Next-generation sequencing (NGS) has become convenient, and many SNP-calling pipelines have been developed for human NGS data. However, there is a knowledge gap in selecting an appropriate SNP calling pipeline for handling high-throughput NGS data. To fill this gap, we studied and compared seven SNP calling pipelines, namely the 16GT, genome analysis toolkit (GATK), Bcftools-single (Bcftools single sample mode), Bcftools-multiple (Bcftools multiple sample mode), VarScan2-single (VarScan2 single sample mode), VarScan2-multiple (VarScan2 multiple sample mode) and Freebayes pipelines, using 96 NGS data sets with depth gradients of approximately 5X, 10X, 20X, 30X, 40X, and 50X coverage from 16 Rhode Island Red chickens. The sixteen chickens were also genotyped with a 50K SNP array, and the sensitivity and specificity of each pipeline were assessed by comparison to the results of SNP arrays. For each pipeline, except Freebayes, the number of detected SNPs increased as the input read depth increased. In comparison with other pipelines, 16GT, followed by Bcftools-multiple, obtained the most SNPs when the input coverage exceeded 10X, and Bcftools-multiple obtained the most when the input was 5X and 10X. The sensitivity and specificity of each pipeline increased with increasing input. Bcftools-multiple had the highest sensitivity numerically when the input ranged from 5X to 30X, and 16GT showed the highest sensitivity when the input was 40X and 50X. Bcftools-multiple also had the highest specificity, followed by GATK, at almost all input levels. For most calling pipelines, there were no obvious changes in SNP numbers, sensitivities or specificities beyond 20X.
In conclusion, (1) if only SNPs are to be detected, the sequencing depth does not need to exceed 20X; (2) Bcftools-multiple may be the best choice for detecting SNPs from chicken NGS data, but for a single sample or a sequencing depth greater than 20X, 16GT is recommended. Our findings provide a reference for researchers to select suitable pipelines to obtain SNPs from the NGS data of chickens or nonhuman animals.
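The comparison of pipeline calls against SNP-array genotypes can be sketched as follows. This is an illustrative Python example under stated assumptions, not the study's evaluation code: the abstract does not give the exact definitions used, so the sketch assumes that array sites carrying a non-reference genotype count as positives and array sites that are homozygous reference count as negatives. The function name `sensitivity_specificity` and the site representation as `(chromosome, position)` tuples are hypothetical.

```python
import math
from typing import Set, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def sensitivity_specificity(
    array_variant: Set[Site],
    array_reference: Set[Site],
    called: Set[Site],
) -> Tuple[float, float]:
    """Compare pipeline SNP calls with array genotypes at array sites only.

    array_variant:   array sites where the sample is non-reference (assumed positives)
    array_reference: array sites where the sample is homozygous reference (assumed negatives)
    called:          sites reported as SNPs by the calling pipeline
    """
    tp = len(array_variant & called)    # variant on array, called
    fn = len(array_variant - called)    # variant on array, missed
    fp = len(array_reference & called)  # reference on array, but called
    tn = len(array_reference - called)  # reference on array, not called
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return sensitivity, specificity
```

Calls outside the array's sites are ignored here because the array provides no truth for them; that is a design choice of this sketch rather than something stated in the abstract.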
Affiliation(s)
- Jing Liu
- National Engineering Laboratory for Animal Breeding, Beijing Key Laboratory for Animal Genetic Improvement, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Qingmiao Shen
- National Engineering Laboratory for Animal Breeding, Beijing Key Laboratory for Animal Genetic Improvement, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Haigang Bao
- National Engineering Laboratory for Animal Breeding, Beijing Key Laboratory for Animal Genetic Improvement, College of Animal Science and Technology, China Agricultural University, Beijing, China
|
12
|
Casellas J, Martín de Hijas-Villalba M, Vázquez-Gómez M, Id-Lahoucine S. Low-coverage whole-genome sequencing in livestock species for individual traceability and parentage testing. Livest Sci 2021. [DOI: 10.1016/j.livsci.2021.104629] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
13
|
Ahmed Z, Renart EG, Zeeshan S. Genomics pipelines to investigate susceptibility in whole genome and exome sequenced data for variant discovery, annotation, prediction and genotyping. PeerJ 2021; 9:e11724. [PMID: 34395068 PMCID: PMC8320519 DOI: 10.7717/peerj.11724] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/29/2021] [Accepted: 06/14/2021] [Indexed: 12/12/2022]
Abstract
Over the last few decades, genomics has been leading toward an audacious future and changing our views about conducting biomedical research, studying diseases, and understanding diversity in our society across the human species. Whole genome and exome sequencing (WGS/WES) are two of the most popular next-generation sequencing (NGS) methodologies currently being used to detect genetic variations of clinical significance. Investigating WGS/WES data for variant discovery and genotyping is based on the nexus of different data analytic applications. Although several bioinformatics applications have been developed, and many of those are freely available and published, finding and interpreting genetic variants in a timely manner are still challenging tasks for diagnostic laboratories and clinicians. In this study, we are interested in understanding, evaluating, and reporting the current state of solutions available to process NGS data of variable lengths and types for the identification of variants, alleles, and haplotypes. Within this scope, we consulted high-quality peer-reviewed literature published in the last 10 years. We focused on the standalone and networked bioinformatics applications proposed to efficiently process WGS and WES data, and support downstream analysis for gene-variant discovery, annotation, prediction, and interpretation. We discuss our findings in this manuscript, which include but are not limited to the set of operations, workflow, data handling, involved tools, technologies and algorithms, and limitations of the assessed applications.
Affiliation(s)
- Zeeshan Ahmed
- Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA; Department of Medicine, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Eduard Gibert Renart
- Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Saman Zeeshan
- Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
|
14
|
Zanti M, Michailidou K, Loizidou MA, Machattou C, Pirpa P, Christodoulou K, Spyrou GM, Kyriacou K, Hadjisavvas A. Performance evaluation of pipelines for mapping, variant calling and interval padding, for the analysis of NGS germline panels. BMC Bioinformatics 2021; 22:218. [PMID: 33910496 PMCID: PMC8080428 DOI: 10.1186/s12859-021-04144-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2021] [Accepted: 04/15/2021] [Indexed: 11/10/2022]
Abstract
Background: Next-generation sequencing (NGS) represents a significant advancement in clinical genetics. However, its use creates several technical, data interpretation and management challenges. It is essential to follow a consistent data analysis pipeline to achieve the highest possible accuracy and avoid false variant calls. Herein, we aimed to compare the performance of twenty-eight combinations of NGS data analysis pipeline compartments, including short-read mapping (BWA-MEM, Bowtie2, Stampy), variant calling (GATK-HaplotypeCaller, GATK-UnifiedGenotyper, SAMtools) and interval padding (null, 50 bp, 100 bp) methods, along with a commercially available pipeline (BWA Enrichment, Illumina®). Fourteen germline DNA samples from breast cancer patients were sequenced using a targeted NGS panel approach and subjected to data analysis. Results: We highlight that interval padding is required for the accurate detection of intronic variants including spliceogenic pathogenic variants (PVs). In addition, using nearly default parameters, the BWA Enrichment algorithm failed to detect these spliceogenic PVs and a missense PV in the TP53 gene. We also recommend the BWA-MEM algorithm for sequence alignment, whereas variant calling should be performed using a combination of variant calling algorithms: GATK-HaplotypeCaller and SAMtools for the accurate detection of insertions/deletions and GATK-UnifiedGenotyper for the efficient detection of single nucleotide variant calls. Conclusions: These findings have important implications for the identification of clinically actionable variants through panel testing in a clinical laboratory setting, where dedicated bioinformatics personnel might not always be available. The results also reveal the necessity of improving the existing tools and/or developing new pipelines to generate more reliable and more consistent data.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04144-1.
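To make the role of interval padding concrete, here is a small Python sketch, offered as an illustration rather than as any tool's actual behaviour. It assumes target regions are given as half-open `(chromosome, start, end)` tuples; the helper names `pad_intervals` and `covers` and the coordinates used below are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open [start, end)


def pad_intervals(intervals: List[Interval], padding: int) -> List[Interval]:
    """Extend each target by `padding` bp on both sides, then merge overlaps."""
    padded = sorted((c, max(0, s - padding), e + padding) for c, s, e in intervals)
    merged: List[Interval] = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def covers(intervals: List[Interval], chrom: str, pos: int) -> bool:
    """Return True if the position falls inside any interval."""
    return any(c == chrom and s <= pos < e for c, s, e in intervals)


# Two hypothetical exon targets and an intronic position 5 bp past the first exon.
targets = [("chr17", 1000, 1100), ("chr17", 1150, 1250)]
unpadded = pad_intervals(targets, 0)
padded_50 = pad_intervals(targets, 50)
intronic_in_unpadded = covers(unpadded, "chr17", 1105)
intronic_in_padded = covers(padded_50, "chr17", 1105)
```

With no padding the intronic position lies outside every target and would not be evaluated by a caller restricted to those intervals, whereas 50 bp of padding brings it into scope, which is consistent with the abstract's point about intronic and spliceogenic variants.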
Affiliation(s)
- Maria Zanti
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus; Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus; Bioinformatics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyriaki Michailidou
- Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus; Biostatistics Unit, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Maria A Loizidou
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus; Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
- Christina Machattou
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Panagiota Pirpa
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyproula Christodoulou
- Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus; Neurogenetics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- George M Spyrou
- Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus; Bioinformatics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyriacos Kyriacou
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus; Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
- Andreas Hadjisavvas
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus; Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
|
15
|
Next Generation Sequencing Technology in the Clinic and Its Challenges. Cancers (Basel) 2021; 13:cancers13081751. [PMID: 33916923 PMCID: PMC8067551 DOI: 10.3390/cancers13081751] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 03/09/2021] [Revised: 03/30/2021] [Accepted: 04/05/2021] [Indexed: 12/12/2022]
Abstract
Simple Summary: Precise identification and annotation of mutations are of utmost importance in clinical oncology. Insights of the DNA sequence can provide meaningful knowledge to unravel the underlying genetics of disease. Hence, tailoring of personalized medicine often relies on specific genomic alteration for treatment efficacy. The aim of this review is to highlight that sequencing harbors much more than just four nucleotides. Moreover, the gradual transition from first to second generation sequencing technologies has led to awareness for choosing the most appropriate bioinformatic analytic tools based on the aim, quality and demand for a specific purpose. Thus, the same raw data can lead to various results reflecting the intrinsic features of different datamining pipelines.
Abstract: Data analysis has become a crucial aspect in clinical oncology to interpret output from next-generation sequencing-based testing. NGS being able to resolve billions of sequencing reactions in a few days has consequently increased the demand for tools to handle and analyze such large data sets. Many tools have been developed since the advent of NGS, featuring their own peculiarities. Increased awareness when interpreting alterations in the genome is therefore of utmost importance, as the same data using different tools can provide diverse outcomes. Hence, it is crucial to evaluate and validate bioinformatic pipelines in clinical settings. Moreover, personalized medicine implies treatment targeting efficacy of biological drugs for specific genomic alterations. Here, we focused on different sequencing technologies, features underlying the genome complexity, and bioinformatic tools that can impact the final annotation. Additionally, we discuss the clinical demand and design for implementing NGS.
|
16
|
Hynst J, Navrkalova V, Pal K, Pospisilova S. Bioinformatic strategies for the analysis of genomic aberrations detected by targeted NGS panels with clinical application. PeerJ 2021; 9:e10897. [PMID: 33850640 PMCID: PMC8019320 DOI: 10.7717/peerj.10897] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/09/2020] [Accepted: 01/13/2021] [Indexed: 01/21/2023]
Abstract
Molecular profiling of tumor samples has acquired importance in cancer research, but currently also plays an important role in the clinical management of cancer patients. Rapid identification of genomic aberrations improves diagnosis, prognosis and effective therapy selection. This can be attributed mainly to the development of next-generation sequencing (NGS) methods, especially targeted DNA panels. Such panels enable a relatively inexpensive and rapid analysis of various aberrations with clinical impact specific to particular diagnoses. In this review, we discuss the experimental approaches and bioinformatic strategies available for the development of an NGS panel for a reliable analysis of selected biomarkers. Compliance with defined analytical steps is crucial to ensure accurate and reproducible results. In addition, a careful validation procedure has to be performed before the application of NGS targeted assays in routine clinical practice. With more focus on bioinformatics, we emphasize the need for thorough pipeline validation and management in relation to the particular experimental setting as an integral part of the NGS method establishment. A robust and reproducible bioinformatic analysis running on powerful machines is essential for proper detection of genomic variants in clinical settings since distinguishing between experimental noise and real biological variants is fundamental. This review summarizes state-of-the-art bioinformatic solutions for careful detection of the SNV/Indels and CNVs for targeted sequencing resulting in translation of sequencing data into clinically relevant information. Finally, we share our experience with the development of a custom targeted NGS panel for an integrated analysis of biomarkers in lymphoproliferative disorders.
Affiliation(s)
- Jakub Hynst
- Center of Molecular Medicine, Central European Institute of Technology, Masaryk University, Brno, Czech Republic; Department of Internal Medicine-Hematology and Oncology, Faculty of Medicine and University Hospital Brno, Masaryk University, Brno, Czech Republic; Department of Medical Genetics and Genomics, Faculty of Medicine and University Hospital Brno, Masaryk University, Brno, Czech Republic
- Veronika Navrkalova
- Center of Molecular Medicine, Central European Institute of Technology, Masaryk University, Brno, Czech Republic; Department of Internal Medicine-Hematology and Oncology, Faculty of Medicine and University Hospital Brno, Masaryk University, Brno, Czech Republic
- Karol Pal
- Center of Molecular Medicine, Central European Institute of Technology, Masaryk University, Brno, Czech Republic; Department of Hematology, University Hospital Schleswig-Holstein, Kiel, Germany
- Sarka Pospisilova
- Center of Molecular Medicine, Central European Institute of Technology, Masaryk University, Brno, Czech Republic; Department of Internal Medicine-Hematology and Oncology, Faculty of Medicine and University Hospital Brno, Masaryk University, Brno, Czech Republic; Department of Medical Genetics and Genomics, Faculty of Medicine and University Hospital Brno, Masaryk University, Brno, Czech Republic
|
17
|
Investigating the importance of individual mitochondrial genotype in susceptibility to drug-induced toxicity. Biochem Soc Trans 2021; 48:787-797. [PMID: 32453388 PMCID: PMC7329340 DOI: 10.1042/bst20190233] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 01/24/2020] [Revised: 04/30/2020] [Accepted: 05/01/2020] [Indexed: 12/13/2022]
Abstract
The mitochondrion is an essential organelle responsible for generating cellular energy. Additionally, mitochondria are a source of inter-individual variation as they contain their own genome. Evidence has revealed that mitochondrial DNA (mtDNA) variation can confer differences in mitochondrial function and importantly, these differences may be a factor underlying the idiosyncrasies associated with unpredictable drug-induced toxicities. Thus far, preclinical and clinical data are limited but have revealed evidence in support of an association between mitochondrial haplogroup and susceptibility to specific adverse drug reactions. In particular, clinical studies have reported associations between mitochondrial haplogroup and antiretroviral therapy, chemotherapy and antibiotic-induced toxicity, although study limitations and conflicting findings mean that the importance of mtDNA variation to toxicity remains unclear. Several studies have used transmitochondrial cybrid cells as personalised models with which to study the impact of mitochondrial genetic variation. Cybrids allow the effects of mtDNA to be assessed against a stable nuclear background and thus the in vitro elucidation of the fundamental mechanistic basis of such differences. Overall, the current evidence supports the tenet that mitochondrial genetics represent an exciting area within the field of personalised medicine and drug toxicity. However, further research effort is required to confirm its importance. In particular, efforts should focus upon translational research to connect preclinical and clinical data that can inform whether mitochondrial genetics can be useful to identify at risk individuals or inform risk assessment during drug development.
|
18
|
Chen H, Yin Y, Li X, Li S, Gao H, Wang X, Zhang Y, Liu Y, Wang H. Whole-Genome Analysis of Livestock-Associated Methicillin-Resistant Staphylococcus aureus Sequence Type 398 Strains Isolated From Patients With Bacteremia in China. J Infect Dis 2021; 221:S220-S228. [PMID: 32176793 DOI: 10.1093/infdis/jiz575] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 01/04/2023]
Abstract
Sequence type (ST) 398 is the most prevalent clone of livestock-associated methicillin-resistant Staphylococcus aureus (MRSA). To evaluate the molecular characteristics and phylogeny of Chinese ST398 isolates, 4 MRSA ST398 strains and 4 methicillin-susceptible S. aureus (MSSA) ST398 strains were collected from patients with bacteremia at 6 teaching hospitals in China between 1999 and 2016. Moreover, 689 ST398 genome sequences were downloaded from the GenBank database for comparison. The 4 MRSA ST398 strains were resistant to β-lactam antibiotics, and 2 strains were also resistant to erythromycin. Among the 4 MSSA ST398 strains, 2 strains displayed multidrug resistance (MDR) and were resistant to penicillin, erythromycin, tetracycline, and gentamicin. The accessory genome of MSSA ST398 was more diverse than that of MRSA ST398. All 4 MRSA ST398 strains carried type V staphylococcal cassette chromosome mec elements; however, MSSA ST398 carried more resistance genes than MRSA ST398. These 4 MRSA ST398 strains carried hemolysin, along with virulence genes associated with immune invasion and protease. Phylogenetic analysis showed that the 4 MRSA ST398 strains clustered in 1 clade. The global ST398 phylogeny showed that ST398 was divided into an animal clade and a human clade, and the ST398 strains of this study clustered in the human clade. A small number of human strains were also present in the animal clade and vice versa, suggesting transmission of ST398 between animals and humans. In conclusion, livestock-associated MRSA ST398 has caused severe infections in Chinese hospitals, and it therefore warrants closer attention and monitoring.
Affiliation(s)
- Hongbin Chen
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, China
- Yuyao Yin
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, China
- Xiaohua Li
- Department of Clinical Laboratory, Ordos Central Hospital, Nei Mongol, China
- Shuguang Li
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, China
- Hua Gao
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, China
- Xiaojuan Wang
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, China
- Yawei Zhang
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, China
- Yudong Liu
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, China
- Hui Wang
- Department of Clinical Laboratory, Peking University People's Hospital, Beijing, China
|
19
|
Valiente-Mullor C, Beamud B, Ansari I, Francés-Cuesta C, García-González N, Mejía L, Ruiz-Hueso P, González-Candelas F. One is not enough: On the effects of reference genome for the mapping and subsequent analyses of short-reads. PLoS Comput Biol 2021; 17:e1008678. [PMID: 33503026 PMCID: PMC7870062 DOI: 10.1371/journal.pcbi.1008678] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 04/27/2020] [Revised: 02/08/2021] [Accepted: 01/05/2021] [Indexed: 12/17/2022]
Abstract
Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of a reference may represent a source of errors that may affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legionella pneumophila, Neisseria gonorrhoeae, Pseudomonas aeruginosa and Serratia marcescens). Publicly available whole-genome assemblies encompassing the genomic diversity of these species were selected as reference sequences, and read alignment statistics, SNP calling, recombination rates, dN/dS ratios, and phylogenetic trees were evaluated depending on the mapping reference. The choice of different reference genomes proved to have an impact on almost all the parameters considered in the five species. In addition, these biases had potential epidemiological implications such as including/excluding isolates of particular clades and the estimation of genetic distances. These findings suggest that the single reference approach might introduce systematic errors during mapping that affect subsequent analyses, particularly for data sets with isolates from genetically diverse backgrounds. In any case, exploring the effects of different references on the final conclusions is highly recommended.
Mapping consists of the alignment of reads (i.e., DNA fragments) obtained through high-throughput genome sequencing to a previously assembled reference sequence. It is a common practice in genomic studies to use a single reference for mapping, usually the ‘reference genome’ of a species, a high-quality assembly. However, the selection of an optimal reference is hindered by intrinsic intra-species genetic variability, particularly in bacteria. It is known that genetic differences between the reference genome and the read sequences may produce incorrect alignments during mapping. Eventually, these errors could lead to misidentification of variants and biased reconstruction of phylogenetic trees (which reflect ancestry between different bacterial lineages). To our knowledge, this is the first work to systematically examine the effect of different references for mapping on the inference of tree topology as well as the impact on recombination and natural selection inferences. Furthermore, the novelty of this work relies on a procedure that guarantees that we are evaluating only the effect of the reference. This effect has proved to be pervasive in the five bacterial species that we have studied and, in some cases, alterations in phylogenetic trees could lead to incorrect epidemiological inferences. Hence, the use of different reference genomes may be prescriptive to assess the potential biases of mapping.
Collapse
Affiliation(s)
- Carlos Valiente-Mullor
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Beatriz Beamud
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
  - * E-mail: (BB); (FG-C)
- Iván Ansari
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Carlos Francés-Cuesta
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Neris García-González
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Lorena Mejía
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
  - Instituto de Microbiología, Colegio de Ciencias Biológicas y Ambientales, Universidad San Francisco de Quito, Quito, Ecuador
- Paula Ruiz-Hueso
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Fernando González-Candelas
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
  - CIBER in Epidemiology and Public Health, Valencia, Spain
  - * E-mail: (BB); (FG-C)
20
Alosaimi S, van Biljon N, Awany D, Thami PK, Defo J, Mugo JW, Bope CD, Mazandu GK, Mulder NJ, Chimusa ER. Simulation of African and non-African low and high coverage whole genome sequence data to assess variant calling approaches. Brief Bioinform 2020; 22:6042242. [PMID: 33341897 DOI: 10.1093/bib/bbaa366] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/25/2020] [Revised: 11/14/2020] [Accepted: 01/08/2020] [Indexed: 12/15/2022]
Abstract
Current variant calling (VC) approaches have been designed to leverage populations of long-range haplotypes and were benchmarked using populations of European descent, whereas most genetic diversity is found in non-European populations, such as those of Africa. Working with these genetically diverse populations, VC tools may produce false positive and false negative results, which may lead to misleading conclusions in the prioritization of mutations and in assessing the clinical relevance and actionability of genes. The most prominent question is which tool or pipeline has high sensitivity and precision when analysing African data with either low or high sequence coverage, given the high genetic diversity and heterogeneity of these data. Here, a total of 100 synthetic Whole Genome Sequencing (WGS) samples, mimicking the genetic profile of African and European subjects at specific coverage levels (high/low), were generated to assess the performance of nine different VC tools on these contrasting datasets. The performance of these tools was assessed in terms of false positive and false negative call rates by comparing the simulated golden variants to the variants identified by each VC tool. Combining our results on sensitivity and positive predictive value (PPV), VarDict [PPV = 0.999 and Matthews correlation coefficient (MCC) = 0.832] and BCFtools (PPV = 0.999 and MCC = 0.813) perform best on African population data at both high and low coverage. Overall, current VC tools produce higher false positive and false negative rates when analysing African compared with European data. This highlights the need for development of VC approaches with high sensitivity and precision tailored for populations characterized by high genetic variation and low linkage disequilibrium.
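For readers unfamiliar with the metrics named in this abstract, the following is a minimal sketch of how sensitivity, PPV and MCC are conventionally computed from a confusion matrix of variant calls against a truth set. The function name `confusion_metrics` and the example counts are hypothetical and are not taken from the study; the study's own evaluation code is not described here.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Return sensitivity, PPV and MCC for one caller against a truth set.

    tp/fp/fn/tn are counts of true positive, false positive, false negative
    and true negative calls. All names here are illustrative assumptions.
    """
    # Sensitivity (recall): fraction of true variants that were called.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Positive predictive value (precision): fraction of calls that are true.
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "ppv": ppv, "mcc": mcc}


if __name__ == "__main__":
    # Made-up counts, purely to show the call shape.
    print(confusion_metrics(tp=90, fp=10, fn=10, tn=890))
```

A high PPV combined with a lower MCC, as reported for both tools, is what one would expect when very few calls are false positives but a noticeable share of true variants is missed.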
Affiliation(s)
- Shatha Alosaimi
  - Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
- Noëlle van Biljon
  - Department of Statistical Sciences, University of Cape Town, Cape Town, South Africa
- Denis Awany
  - Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
- Prisca K Thami
  - Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
- Joel Defo
  - Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
- Jacquiline W Mugo
  - Faculty of Health Sciences, Division of Computational Biology, Department of Biomedical Sciences, University of Cape Town, Cape Town, South Africa
- Christian D Bope
  - Faculty of Sciences, Department of Mathematics and Computer Science, University of Kinshasa, Kinshasa, DRC
- Gaston K Mazandu
  - Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
  - Faculty of Health Sciences, Division of Computational Biology, Department of Biomedical Sciences, University of Cape Town, Cape Town, South Africa
- Nicola J Mulder
  - Faculty of Health Sciences, Division of Computational Biology, Department of Biomedical Sciences, University of Cape Town, Cape Town, South Africa
  - Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Anzio Road, Observatory, Cape Town 7925, South Africa
- Emile R Chimusa
  - Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
  - Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Anzio Road, Observatory, Cape Town 7925, South Africa
21
Castrignanò T, Gioiosa S, Flati T, Cestari M, Picardi E, Chiara M, Fratelli M, Amente S, Cirilli M, Tangaro MA, Chillemi G, Pesole G, Zambelli F. ELIXIR-IT HPC@CINECA: high performance computing resources for the bioinformatics community. BMC Bioinformatics 2020; 21:352. [PMID: 32838759 PMCID: PMC7446135 DOI: 10.1186/s12859-020-03565-8] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Indexed: 01/11/2023]
Abstract
BACKGROUND The advent of Next Generation Sequencing (NGS) technologies and the concomitant reduction in sequencing costs allow unprecedented high-throughput profiling of biological systems in a cost-efficient manner. Modern biological experiments are becoming increasingly data and computationally intensive, and the wealth of publicly available biological data is bringing bioinformatics into the "Big Data" era. For these reasons, the effective application of High Performance Computing (HPC) architectures is increasingly recognized by bioinformaticians as well. Here we describe pilot programs for provisioning HPC resources to bioinformaticians, run by the Italian Node of ELIXIR (ELIXIR-IT) in collaboration with CINECA, the main Italian supercomputing center. RESULTS Starting in April 2016, CINECA and ELIXIR-IT launched the pilot Call "ELIXIR-IT HPC@CINECA", offering streamlined access to HPC resources for bioinformatics. Resources are made available either through web front-ends to dedicated workflows developed at CINECA or by providing direct access to the High Performance Computing systems through a standard command-line interface tailored for bioinformatics data analysis. This makes it possible to offer the biomedical research community a production-scale environment, continuously updated with the latest versions of publicly available reference datasets and bioinformatic tools. Currently, 63 research projects have gained access to the HPC@CINECA program, for a total allocation of ~8 million CPU hours and, for data storage, ~100 TB of permanent and ~300 TB of temporary space. CONCLUSIONS Three years after the beginning of the ELIXIR-IT HPC@CINECA program, we can appreciate its impact on the Italian bioinformatics community and draw some conclusions. Several Italian researchers who applied to the program have gained access to one of the top-ranking public scientific supercomputing facilities in Europe. Those investigators had the opportunity to substantially reduce computational turnaround times in their research projects and to process massive amounts of data, pursuing research approaches that would otherwise have been difficult or impossible to undertake. Moreover, by taking advantage of the wealth of documentation and training material provided by CINECA, participants had the opportunity to improve their skills in the use of HPC systems and to be better positioned to apply to similar EU programs of greater scale, such as PRACE. To illustrate the effective usage and impact of the resources awarded by the program in different research applications, we report five successful use cases whose findings have already been published in peer-reviewed journals.
Affiliation(s)
- Tiziana Castrignanò
  - Department of Ecological and Biological Sciences (DEB), University of Tuscia, Viterbo, Italy
- Silvia Gioiosa
  - CINECA, SuperComputing Applications and Innovation Department, Rome, Italy
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy
- Tiziano Flati
  - CINECA, SuperComputing Applications and Innovation Department, Rome, Italy
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy
- Mirko Cestari
  - CINECA, SuperComputing Applications and Innovation Department, Rome, Italy
- Ernesto Picardi
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy
  - Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari "A. Moro", Bari, Italy
- Matteo Chiara
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy
  - Department of Biosciences, University of Milan, Milan, Italy
- Maddalena Fratelli
  - IRCCS-Istituto di Ricerche Farmacologiche "Mario Negri", Milano, Milan, Italy
- Stefano Amente
  - Department of Molecular Medicine and Medical Biotechnologies, University of Naples 'Federico II', Naples, Italy
- Marco Cirilli
  - Department of Agricultural and Environmental Sciences - Production, Landscape, Agroenergy (DISAA), University of Milan, Milan, Italy
- Marco Antonio Tangaro
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy
- Giovanni Chillemi
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy
  - Department for Innovation in Biological, Agro-food and Forest systems (DIBAF), University of Tuscia, Viterbo, Italy
- Graziano Pesole
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy
  - Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari "A. Moro", Bari, Italy
- Federico Zambelli
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (IBIOM-CNR), Bari, Italy
  - Department of Biosciences, University of Milan, Milan, Italy
22
Daw Elbait G, Henschel A, Tay GK, Al Safar HS. Whole Genome Sequencing of Four Representatives From the Admixed Population of the United Arab Emirates. Front Genet 2020; 11:681. [PMID: 32754195 PMCID: PMC7367215 DOI: 10.3389/fgene.2020.00681] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/02/2020] [Accepted: 06/03/2020] [Indexed: 01/21/2023]
Abstract
Whole genome sequences (WGS) of four nationals of the United Arab Emirates (UAE) at an average coverage of 33X have been completed and described. The selection of suitable subpopulation representatives was informed by a preceding comprehensive population structure analysis. Representatives were chosen based on their central location within the subpopulation on a principal component analysis (PCA) and the degree to which they were admixed. Novel genomic variations among the different subgroups of the UAE population are reported here. Specifically, the WGS analysis identified 4,161,067-4,798,806 variants in the four individual samples, where approximately 80% were single nucleotide polymorphisms (SNPs) and 20% were insertions or deletions (indels). An average of 2.75% was found to be novel variants according to dbSNP (build 151). This is the first report of structural variants (SV) from WGS data from UAE nationals. There were 15,677-20,339 called SVs, of which around 13.5% were novel. The four samples shared 1,399,178 variants, each with distinct variants as follows: 1,085,524 (for the individual denoted as UAE S011), 1,228,559 (UAE S012), 791,072 (UAE S013), and 906,818 (UAE S014). These results show a previously unappreciated population diversity in the region. The synergy of WGS and genotype array data was demonstrated through variant annotation of the former using 2.3 million allele frequencies for the local population derived from the latter technology platform. This novel approach of combining breadth and depth of array and WGS technologies has guided the choice of population genetic representatives and provides complementary, regionalized allele frequency annotation to new genomes comprising millions of loci.
Affiliation(s)
- Gihan Daw Elbait
  - Center for Biotechnology, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
- Andreas Henschel
  - Center for Biotechnology, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
  - Department of Computer Science, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
- Guan K Tay
  - Center for Biotechnology, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
  - Department of Biomedical Engineering, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
  - Division of Psychiatry, Faculty of Health and Medical Sciences, The University of Western Australia, Crawley, WA, Australia
  - School of Medical and Health Sciences, Edith Cowan University, Joondalup, WA, Australia
- Habiba S Al Safar
  - Center for Biotechnology, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
  - Department of Biomedical Engineering, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
  - Department of Genetics and Molecular Biology, College of Medicine and Health Sciences, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
23
Venkataraman GR, Rivas MA. Rare and common variant discovery in complex disease: the IBD case study. Hum Mol Genet 2020; 28:R162-R169. [PMID: 31363759 DOI: 10.1093/hmg/ddz189] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/19/2019] [Revised: 07/24/2019] [Accepted: 07/25/2019] [Indexed: 12/15/2022]
Abstract
Complex diseases such as inflammatory bowel disease (IBD), which consists of ulcerative colitis and Crohn's disease, are a significant medical burden: 70 000 new cases of IBD are diagnosed in the United States annually. In this review, we examine the history of genetic variant discovery in complex disease with a focus on IBD. We cover methods that have been applied to microsatellite, common variant, targeted resequencing and whole-exome and -genome data, specifically focusing on the progression of technologies towards rare-variant discovery. The inception of these methods combined with better availability of population level variation data has led to rapid discovery of IBD-causative and/or -associated variants at over 200 loci; over time, these methods have grown exponentially in both power and ascertainment to detect rare variation. We highlight rare-variant discoveries critical to the elucidation of the pathogenesis of IBD, including those in NOD2, IL23R, CARD9, RNF186 and ADCY7. We additionally identify the major areas of rare-variant discovery that will evolve in the coming years. A better understanding of the genetic basis of IBD and other complex diseases will lead to improved diagnosis, prognosis, treatment and surveillance.
Affiliation(s)
- Guhan R Venkataraman
  - Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, USA
- Manuel A Rivas
  - Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, USA
24
Hendrix MM, Cuthbert CD, Cordovado SK. Assessing the Performance of Dried-Blood-Spot DNA Extraction Methods in Next Generation Sequencing. Int J Neonatal Screen 2020; 6:36. [PMID: 32514487 PMCID: PMC7278269 DOI: 10.3390/ijns6020036] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/26/2020] [Accepted: 04/27/2020] [Indexed: 12/31/2022]
Abstract
An increasing number of newborn screening laboratories in the United States and abroad are moving towards incorporating next-generation sequencing technology, or NGS, into routine screening, particularly for cystic fibrosis. As more programs utilize this technology for both cystic fibrosis and beyond, it is critical to identify appropriate DNA extraction methods that can be used with dried blood spots that will result in consistent, high-quality sequencing results. To provide comprehensive quality assurance and technical assistance to newborn screening laboratories wishing to incorporate NGS assays, CDC's Newborn Screening and Molecular Biology Branch designed a study to evaluate the performance of nine commercial or laboratory-developed DNA extraction methods that range from a highly purified column extraction to a crude detergent-based no-wash boil prep. The DNA from these nine methods was used in two NGS library preparations that interrogate the CFTR gene. All DNA extraction methods including the cruder preps performed reasonably well with both library preps. One lower-concentration, older sample was excluded from one of the assay evaluations due to poor performance across all DNA extraction methods. When 84 samples, versus eight, were run on a flow cell, the DNA quality and quantity were more significant variables.
Affiliation(s)
- Suzanne K. Cordovado
  - Centers for Disease Control and Prevention; 4770 Buford Hwy, NE, Atlanta, GA 30341, USA; (M.M.H.); (C.D.C.)
25
Alqahtani A, Skelton A, Eley L, Annavarapu S, Henderson DJ, Chaudhry B. Isolation and next generation sequencing of archival formalin-fixed DNA. J Anat 2020; 237:587-600. [PMID: 32426881 PMCID: PMC7476199 DOI: 10.1111/joa.13209] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/19/2020] [Revised: 03/31/2020] [Accepted: 04/07/2020] [Indexed: 11/29/2022]
Abstract
DNA from archived organs is presumed unsuitable for genomic studies because of excessive formalin‐fixation. As next generation sequencing (NGS) requires short DNA fragments, and Uracil‐N‐glycosylase (UNG) can be used to overcome deamination, there has been renewed interest in the possibility of genomic studies using these collections. We describe a novel method of DNA extraction capable of providing PCR amplicons of at least 400 bp length from such excessively formalin‐fixed human tissues. When compared with a leading commercial formalin‐fixed DNA extraction kit, our method produced greater yields of DNA and reduced sequence variations. Analysis of PCR products using bacterial sub‐cloning and Sanger sequencing from UNG‐treated DNA unexpectedly revealed increased sequence variations, compared with untreated samples. Finally, whole exome NGS was performed on a myocardial sample fixed in formalin for 2 years and compared with lymphocyte‐derived DNA (as a gold standard) from the same patient. Despite the reduction in the number and quality of reads in the formalin‐fixed DNA, we were able to show that bioinformatic processing by joint calling and variant quality score recalibration (VQSR) increased the sensitivity four‐fold to 56% and doubled specificity to 68% when compared with a standard hard‐filtering approach. Thus, high‐quality DNA can be extracted from excessively formalin‐fixed tissues and bioinformatic processing can optimise sensitivity and specificity of results. Sequencing of several sub‐cloned amplicons is an important methodological step in assessing DNA quality.
Affiliation(s)
- Ahlam Alqahtani
  - Bioscience Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
- Andrew Skelton
  - Bioinformatic Support Unit, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
- Lorraine Eley
  - Bioscience Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
- Srinivas Annavarapu
  - Bioscience Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
  - Department of Cellular Pathology, Newcastle Hospitals NHS Foundation Trust, Newcastle upon Tyne, UK
- Deborah J Henderson
  - Bioscience Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
- Bill Chaudhry
  - Bioscience Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
26
Schilbert HM, Rempel A, Pucker B. Comparison of Read Mapping and Variant Calling Tools for the Analysis of Plant NGS Data. Plants (Basel, Switzerland) 2020; 9:E439. [PMID: 32252268 PMCID: PMC7238416 DOI: 10.3390/plants9040439] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 03/15/2020] [Revised: 03/28/2020] [Accepted: 03/30/2020] [Indexed: 12/30/2022]
Abstract
High-throughput sequencing technologies have rapidly developed during the past years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research. When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.
Affiliation(s)
- Hanna Marie Schilbert
  - Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany
- Andreas Rempel
  - Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany
  - Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany
- Boas Pucker
  - Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany
  - Molecular Genetics and Physiology of Plants, Faculty of Biology and Biotechnology, Ruhr-University Bochum, 44801 Bochum, Germany
27
Bush SJ, Foster D, Eyre DW, Clark EL, De Maio N, Shaw LP, Stoesser N, Peto TEA, Crook DW, Walker AS. Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism-calling pipelines. Gigascience 2020; 9:giaa007. [PMID: 32025702 PMCID: PMC7002876 DOI: 10.1093/gigascience/giaa007] [Citation(s) in RCA: 61] [Impact Index Per Article: 15.3] [Received: 05/28/2019] [Revised: 12/02/2019] [Accepted: 01/15/2020] [Indexed: 02/06/2023]
Abstract
BACKGROUND Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella. RESULTS We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis. CONCLUSIONS The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.
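The abstract uses the Mash distance as a proxy for nucleotide divergence between reads and reference. As a rough illustration, the sketch below applies the published Mash distance formula, D = -(1/k) ln(2j / (1 + j)), to a Jaccard index j of k-mer sets. It computes the exact Jaccard index on full k-mer sets rather than the MinHash sketch that the Mash tool uses, and the helper names (`kmer_set`, `jaccard`, `mash_distance`) and toy sequences are hypothetical; k = 21 is assumed here as a commonly used k-mer size.

```python
import math


def kmer_set(seq, k):
    """All distinct k-mers of a sequence (no canonicalisation; illustrative only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard index of two k-mer sets (Mash estimates this via MinHash)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(j, k=21):
    """Mash distance for Jaccard index j and k-mer size k."""
    if not 0.0 <= j <= 1.0:
        raise ValueError("Jaccard index must lie in [0, 1]")
    if j == 0.0:
        return 1.0  # no shared k-mers: distance is capped at 1
    return -math.log(2.0 * j / (1.0 + j)) / k


if __name__ == "__main__":
    # Toy sequences, far shorter than real genomes; k is reduced accordingly.
    ref = "ACGTACGTTGCAACGTTAGC"
    qry = "ACGTACGTTGCTACGTTAGC"
    k = 5
    j = jaccard(kmer_set(ref, k), kmer_set(qry, k))
    print(f"Jaccard={j:.3f}  distance={mash_distance(j, k):.4f}")
```

Identical k-mer sets give a distance of 0, and the distance grows as fewer k-mers are shared, which is the direction of the relationship with pipeline accuracy reported in the abstract.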
Affiliation(s)
- Stephen J Bush
  - Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
  - National Institute for Health Research Health Research Protection Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
- Dona Foster
  - Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
  - National Institute for Health Research Oxford Biomedical Research Centre, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
- David W Eyre
  - Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
- Emily L Clark
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
- Nicola De Maio
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SH, UK
- Liam P Shaw
  - Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
- Nicole Stoesser
  - Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
- Tim E A Peto
  - Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
  - National Institute for Health Research Health Research Protection Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
  - National Institute for Health Research Oxford Biomedical Research Centre, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
- Derrick W Crook
  - Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
  - National Institute for Health Research Health Research Protection Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
  - National Institute for Health Research Oxford Biomedical Research Centre, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
- A Sarah Walker
  - Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
  - National Institute for Health Research Health Research Protection Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
  - National Institute for Health Research Oxford Biomedical Research Centre, Oxford, John Radcliffe Hospital, Headington, Oxford, OX3 9DU, UK
28
Role of Bioinformatics in Molecular Medicine. Genomic Med 2020. [DOI: 10.1007/978-3-030-22922-1_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
29
Variant Calling Using Whole Genome Resequencing and Sequence Capture for Population and Evolutionary Genomic Inferences in Norway Spruce (Picea abies). Compendium of Plant Genomes 2020. [DOI: 10.1007/978-3-030-21001-4_2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/16/2022]
30
Lewin AC, Coghill LM, McLellan GJ, Bentley E, Kousoulas KG. Genomic analysis for virulence determinants in feline herpesvirus type-1 isolates. Virus Genes 2019; 56:49-57. [PMID: 31776852 DOI: 10.1007/s11262-019-01718-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/06/2019] [Accepted: 11/21/2019] [Indexed: 12/27/2022]
Abstract
Feline herpesvirus type 1 (FHV-1) is a widespread cause of respiratory and ocular disease in domestic cats. A spectrum of disease severity is observed in host animals, but there has been limited prior investigation into viral genome factors which could be responsible. Stocks of FHV-1 were established from oropharyngeal swabs obtained from twenty-five cats with signs of infection housed in eight animal shelters around the USA. A standardized numerical host clinical disease severity scoring scheme was used for each cat from which an isolate was obtained. Illumina MiSeq was used to sequence the genome of each isolate. Genomic homogeneity among isolates was relatively high. A general linear model for fixed effects determined that only two synonymous single nucleotide polymorphisms across two genes (UL37/39) in the same isolate (from one host animal with a low disease severity score) were significantly associated (p ≤ 0.05) with assigned host respiratory and total disease severity score. No variants in any isolate were found to be significantly associated with assigned host ocular disease severity score. A concurrent analysis of missense mutations among the viral isolates identified three genes as being primarily involved in the observed genomic variation, but none were significantly associated with host disease severity scores. An ancestral state likelihood reconstruction was performed and determined that there was no evidence of a connection between host disease severity score and viral evolutionary state. We conclude from our results that the spectrum of host disease severity observed with FHV-1 is unlikely to be primarily related to viral genomic variations, and is instead due to host response and/or other factors.
Affiliation(s)
- Andrew C Lewin
- Department of Veterinary Clinical Sciences, School of Veterinary Medicine, Louisiana State University, Skip Bertman Drive, Baton Rouge, LA, 70803, USA.
| | - Lyndon M Coghill
- Center for Computation and Technology, Louisiana State University, 340 E Parker Boulevard, Baton Rouge, LA, 70808, USA.,Department of Pathobiological Sciences, School of Veterinary Medicine, Louisiana State University, Skip Bertman Drive, Baton Rouge, LA, 70803, USA
| | - Gillian J McLellan
- Department of Surgical Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, 2015 Linden Drive, Madison, WI, 53706, USA.,Department of Ophthalmology and Visual Sciences, School of Medicine and Public Health, University of Wisconsin-Madison, 1300 University Avenue, Madison, WI, 53706, USA
| | - Ellison Bentley
- Department of Surgical Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, 2015 Linden Drive, Madison, WI, 53706, USA
| | - Konstantin G Kousoulas
- Department of Pathobiological Sciences, School of Veterinary Medicine, Louisiana State University, Skip Bertman Drive, Baton Rouge, LA, 70803, USA
|
31
|
Jiang Y, Jiang Y, Wang S, Zhang Q, Ding X. Optimal sequencing depth design for whole genome re-sequencing in pigs. BMC Bioinformatics 2019; 20:556. [PMID: 31703550 PMCID: PMC6839175 DOI: 10.1186/s12859-019-3164-z] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2019] [Accepted: 10/16/2019] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND As whole-genome sequencing is becoming a routine technique, it is important to identify a cost-effective depth of sequencing for such studies. However, the relationship between sequencing depth and biological results from the aspects of whole-genome coverage, variant discovery power and the quality of variants is unclear, especially in pigs. We sequenced the genomes of three Yorkshire boars at an approximately 20X depth on the Illumina HiSeq X Ten platform and downloaded whole-genome sequencing data for three Duroc and three Landrace pigs with an approximately 20X depth for each individual. Then, we downsampled the deep genome data by extracting twelve different proportions of 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9 paired reads from the original bam files to mimic the sequence data of the same individuals at sequencing depths of 1.09X, 2.18X, 3.26X, 4.35X, 6.53X, 8.70X, 10.88X, 13.05X, 15.22X, 17.40X, 19.57X and 21.75X to evaluate the influence of genome coverage, the variant discovery rate and genotyping accuracy as a function of sequencing depth. In addition, SNP chip data for Yorkshire pigs were used as a validation for the comparison of single-sample calling and multisample calling algorithms. RESULTS Our results indicated that 10X is an ideal practical depth for achieving plateau coverage and discovering accurate variants, which achieved greater than 99% genome coverage. The number of false-positive variants was increased dramatically at a depth of less than 4X, which covered 95% of the whole genome. In addition, the comparison of multi- and single-sample calling showed that multisample calling was more sensitive than single-sample calling, especially at lower depths. The number of variants discovered under multisample calling was 13-fold and 2-fold higher than that under single-sample calling at 1X and 22X, respectively. A large difference was observed when the depth was less than 4.38X. 
However, more false-positive variants were detected under multisample calling. CONCLUSIONS Our research will inform important study design decisions regarding whole-genome sequencing depth. Our results will be helpful for choosing the appropriate depth to achieve the same power for studies performed under limited budgets.
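The relationship between the downsampling proportions and the resulting depths in this abstract can be checked with a short calculation. The following is a minimal sketch, not the authors' code: it assumes the full data correspond to the reported 21.75X and that the twelfth proportion (the abstract lists only eleven) is 1.0, i.e. the undownsampled data. The names `FULL_DEPTH`, `FRACTIONS`, and `expected_depth` are hypothetical.

```python
# Sketch under stated assumptions: expected depth after keeping a fraction of read pairs.
FULL_DEPTH = 21.75  # depth of the full data, as reported in the abstract

# Eleven proportions listed in the abstract; 1.0 is an assumption for the twelfth.
FRACTIONS = [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# Depths reported in the abstract, in the same order.
REPORTED = [1.09, 2.18, 3.26, 4.35, 6.53, 8.70, 10.88, 13.05, 15.22, 17.40, 19.57, 21.75]


def expected_depth(fraction, full_depth=FULL_DEPTH):
    """Expected mean depth when a given fraction of read pairs is retained."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return fraction * full_depth


depths = [expected_depth(f) for f in FRACTIONS]

for fraction, depth, reported in zip(FRACTIONS, depths, REPORTED):
    # The computed and reported values agree to within rounding.
    print(f"{fraction:.2f} -> {depth:.3f}X (reported {reported:.2f}X)")
```

Under these assumptions each computed depth matches the reported value to within about 0.01X, which is consistent with simple proportional subsampling of read pairs.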
Affiliation(s)
- Yifan Jiang
- National Engineering Laboratory for Animal Breeding, Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, 100193 China
| | - Yao Jiang
- National Engineering Laboratory for Animal Breeding, Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, 100193 China
| | - Sheng Wang
- National Engineering Laboratory for Animal Breeding, Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, 100193 China
| | - Qin Zhang
- Shandong Provincial Key Laboratory of Animal Biotechnology and Disease Control and Prevention, College of Animal Science and Technology, Shandong Agricultural University, Taian, 271001 China
| | - Xiangdong Ding
- National Engineering Laboratory for Animal Breeding, Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, 100193 China
|
32
|
AlSafar HS, Al-Ali M, Elbait GD, Al-Maini MH, Ruta D, Peramo B, Henschel A, Tay GK. Introducing the first whole genomes of nationals from the United Arab Emirates. Sci Rep 2019; 9:14725. [PMID: 31604968 PMCID: PMC6789106 DOI: 10.1038/s41598-019-50876-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2018] [Accepted: 09/20/2019] [Indexed: 12/30/2022] Open
Abstract
Whole Genome Sequencing (WGS) provides an in-depth description of genome variation. In the era of large-scale population genome projects, the assembly of ethnic-specific genomes combined with mapping human reference genomes of underrepresented populations has improved the understanding of human diversity and disease associations. In this study, for the first time, whole genome sequences of two nationals of the United Arab Emirates (UAE) at >27X coverage are reported. The two Emirati individuals were predominantly of Central/South Asian ancestry. An in-house customized pipeline using BWA and Picard, followed by the GATK tools, was used to map the raw data from the whole genome sequences of both individuals. A total of 3,994,521 variants (3,350,574 Single Nucleotide Polymorphisms (SNPs) and 643,947 indels) were identified for the first individual, the UAE S001 sample. A similar number of variants, 4,031,580 (3,373,501 SNPs and 658,079 indels), were identified for UAE S002. Variants that are associated with diabetes, hypertension, increased cholesterol levels, and obesity were also identified in these individuals. These whole genome sequences have provided a starting point for constructing a UAE reference panel which will lead to improvements in the delivery of precision medicine, quality of life for affected individuals and a reduction in healthcare costs. The information compiled will likely lead to the identification of target genes that could potentially lead to the development of novel therapeutic modalities.
Affiliation(s)
- Habiba S AlSafar
- Center of Biotechnology, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates.,Department of Biomedical Engineering, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates.,College of Medicine and Health Sciences, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
| | - Mariam Al-Ali
- Center of Biotechnology, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates.,Department of Biomedical Engineering, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
| | - Gihan Daw Elbait
- Center of Biotechnology, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
| | | | - Dymitr Ruta
- Etisalat-British Telecom Innovation Center, Abu Dhabi, United Arab Emirates
| | | | - Andreas Henschel
- Center of Biotechnology, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates.,Department of Computer Science, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
| | - Guan K Tay
- Center of Biotechnology, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates.,Department of Biomedical Engineering, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates.,College of Medicine and Health Sciences, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates.,School of Psychiatry and Clinical Neurosciences, University of Western Australia, Nedlands, Australia.,School of Medical and Health Sciences, Edith Cowan University, Joondalup, Australia.
|
33
|
Caspar SM, Dubacher N, Kopps AM, Meienberg J, Henggeler C, Matyas G. Clinical sequencing: From raw data to diagnosis with lifetime value. Clin Genet 2019; 93:508-519. [PMID: 29206278 DOI: 10.1111/cge.13190] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2017] [Revised: 11/28/2017] [Accepted: 11/30/2017] [Indexed: 12/22/2022]
Abstract
High-throughput sequencing (HTS) has revolutionized genetics by enabling the detection of sequence variants at hitherto unprecedented large scale. Despite these advances, however, there are still remaining challenges in the complete coverage of targeted regions (genes, exome or genome) as well as in HTS data analysis and interpretation. Moreover, it is easy to get overwhelmed by the plethora of available methods and tools for HTS. Here, we review the step-by-step process from the generation of sequence data to molecular diagnosis of Mendelian diseases. Highlighting advantages and limitations, this review addresses the current state of (1) HTS technologies, considering targeted, whole-exome, and whole-genome sequencing on short- and long-read platforms; (2) read alignment, variant calling and interpretation; as well as (3) regulatory issues related to genetic counseling, reimbursement, and data storage.
Affiliation(s)
- S M Caspar
- Center for Cardiovascular Genetics and Gene Diagnostics, Foundation for People with Rare Diseases, Schlieren-Zurich, Switzerland
| | - N Dubacher
- Center for Cardiovascular Genetics and Gene Diagnostics, Foundation for People with Rare Diseases, Schlieren-Zurich, Switzerland
| | - A M Kopps
- Center for Cardiovascular Genetics and Gene Diagnostics, Foundation for People with Rare Diseases, Schlieren-Zurich, Switzerland
| | - J Meienberg
- Center for Cardiovascular Genetics and Gene Diagnostics, Foundation for People with Rare Diseases, Schlieren-Zurich, Switzerland
| | - C Henggeler
- Center for Cardiovascular Genetics and Gene Diagnostics, Foundation for People with Rare Diseases, Schlieren-Zurich, Switzerland
| | - G Matyas
- Center for Cardiovascular Genetics and Gene Diagnostics, Foundation for People with Rare Diseases, Schlieren-Zurich, Switzerland.,Zurich Center for Integrative Human Physiology, University of Zurich, Zurich, Switzerland
|
34
|
Wu X, Heffelfinger C, Zhao H, Dellaporta SL. Benchmarking variant identification tools for plant diversity discovery. BMC Genomics 2019; 20:701. [PMID: 31500583 PMCID: PMC6734213 DOI: 10.1186/s12864-019-6057-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2019] [Accepted: 08/22/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The ability to accurately and comprehensively identify genomic variations is critical for plant studies utilizing high-throughput sequencing. Most bioinformatics tools for processing next-generation sequencing data were originally developed and tested in human studies, raising questions as to their efficacy for plant research. A detailed evaluation of the entire variant calling pipeline, including alignment, variant calling, variant filtering, and imputation, was performed on different programs using both simulated and real plant genomic datasets. RESULTS A comparison of SOAP2, Bowtie2, and BWA-MEM found that BWA-MEM was consistently able to align the most reads with high accuracy, whereas Bowtie2 had the highest overall accuracy. Comparative results of GATK HaplotypeCaller versus SAMtools mpileup indicated that the choice of variant caller affected precision and recall differentially depending on the levels of diversity, sequence coverage and genome complexity. A cross-reference experiment of S. lycopersicum and S. pennellii reference genomes revealed the inadequacy of a single reference genome for variant discovery that includes distantly related plant individuals. A machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff strategy, resulting in a higher number of true positive variants and fewer false positive variants. A 2-step imputation method, which utilized a set of high-confidence SNPs as the reference panel, showed up to 60% higher accuracy than direct LD-based imputation. CONCLUSIONS Programs in the variant discovery pipeline perform differently on plant genomic datasets. The choice of programs depends on the goal of the study and the available resources. This study provides important guidance for plant biologists utilizing next-generation sequencing data for diversity characterization and crop improvement.
Affiliation(s)
- Xing Wu
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT, 06520-8104, USA
| | - Christopher Heffelfinger
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT, 06520-8104, USA
| | - Hongyu Zhao
- Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, 06520-8034, USA
| | - Stephen L Dellaporta
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT, 06520-8104, USA.
|
35
|
Li D, Kim W, Wang L, Yoon KA, Park B, Park C, Kong SY, Hwang Y, Baek D, Lee ES, Won S. Comparison of INDEL Calling Tools with Simulation Data and Real Short-Read Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1635-1644. [PMID: 30004886 DOI: 10.1109/tcbb.2018.2854793] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Insertions and deletions (INDELs) comprise a significant proportion of human genetic variation, and recent papers have revealed that many human diseases may be attributable to INDELs. With the development of next-generation sequencing (NGS) technology, many statistical/computational tools have been developed for calling INDELs. However, there are differences among those tools, and comparisons among them have been limited. In order to better understand these inter-tool differences, five popular and publicly available INDEL calling tools (GATK HaplotypeCaller, Platypus, VarScan2, Scalpel, and GotCloud) were evaluated using simulation data, 1000 Genomes Project data, and family-based sequencing data. The accuracy of INDEL calling by each tool was mainly evaluated by concordance rates. Family-based sequencing data, which consisted of 49 individuals from eight Korean families, were used to calculate Mendelian error rates. Our comparison results show that GATK HaplotypeCaller usually performs the best and that joint calling with Platypus can lead to additional improvements in accuracy. The results of this study provide important information regarding future directions for variant detection and algorithm development.
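A Mendelian error rate of the kind mentioned in this abstract can be illustrated with a small calculation over parent-child trios. This is a minimal sketch for diploid autosomal genotypes, not the authors' implementation; the function names and the genotype encoding (a pair of allele strings, `None` for a missing call) are assumptions made for illustration.

```python
def is_mendelian_consistent(child, father, mother):
    """True if the child could have inherited one allele from each parent.

    Each genotype is a pair of alleles, e.g. ("A", "AT") for a heterozygous insertion.
    """
    c1, c2 = child
    return (c1 in father and c2 in mother) or (c2 in father and c1 in mother)


def mendelian_error_rate(trios):
    """Fraction of fully genotyped trios whose child genotype violates Mendelian inheritance."""
    checked = 0
    errors = 0
    for child, father, mother in trios:
        if child is None or father is None or mother is None:
            continue  # skip sites with a missing genotype
        checked += 1
        if not is_mendelian_consistent(child, father, mother):
            errors += 1
    return errors / checked if checked else float("nan")


# Hypothetical genotypes at four INDEL sites: (child, father, mother).
example_trios = [
    (("A", "AT"), ("A", "A"), ("A", "AT")),    # consistent
    (("AT", "AT"), ("A", "A"), ("A", "AT")),   # error: father carries no "AT" allele
    (("A", "A"), ("A", "AT"), ("A", "AT")),    # consistent
    (None, ("A", "A"), ("A", "A")),            # missing child call, skipped
]

rate = mendelian_error_rate(example_trios)
print(f"Mendelian error rate: {rate:.3f}")
```

A lower rate suggests more accurate genotype calls, although some true de novo events also appear as Mendelian errors.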
|
36
|
Variant calling and quality control of large-scale human genome sequencing data. Emerg Top Life Sci 2019; 3:399-409. [DOI: 10.1042/etls20190007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2019] [Revised: 06/28/2019] [Accepted: 07/16/2019] [Indexed: 12/12/2022]
Abstract
Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.
|
37
|
Batcha AMN, Bamopoulos SA, Kerbs P, Kumar A, Jurinovic V, Rothenberg-Thurley M, Ksienzyk B, Philippou-Massier J, Krebs S, Blum H, Schneider S, Konstandin N, Bohlander SK, Heckman C, Kontro M, Hiddemann W, Spiekermann K, Braess J, Metzeler KH, Greif PA, Mansmann U, Herold T. Allelic Imbalance of Recurrently Mutated Genes in Acute Myeloid Leukaemia. Sci Rep 2019; 9:11796. [PMID: 31409822 PMCID: PMC6692371 DOI: 10.1038/s41598-019-48167-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2019] [Accepted: 07/29/2019] [Indexed: 12/24/2022] Open
Abstract
The patho-mechanism of somatic driver mutations in cancer usually involves transcription, but the proportion of mutations and wild-type alleles transcribed from DNA to RNA is largely unknown. We systematically compared the variant allele frequencies of recurrently mutated genes in DNA and RNA sequencing data of 246 acute myeloid leukaemia (AML) patients. We observed that 95% of all detected variants were transcribed while the rest were not detectable in RNA sequencing with a minimum read-depth cut-off (10x). Our analysis focusing on 11 genes harbouring recurring mutations demonstrated allelic imbalance (AI) in most patients. GATA2, RUNX1, TET2, SRSF2, IDH2, PTPN11, WT1, NPM1 and CEBPA showed significant AIs. While the effect size was small in general, GATA2 exhibited the largest allelic imbalance. By pooling heterogeneous data from three independent AML cohorts with paired DNA and RNA sequencing (N = 253), we could validate the preferential transcription of GATA2-mutated alleles. Differential expression analysis of the genes with significant AI showed no significant differential gene and isoform expression for the mutated genes, between mutated and wild-type patients. In conclusion, our analyses identified AI in nine out of eleven recurrently mutated genes. AI might be a common phenomenon in AML which potentially contributes to leukaemogenesis.
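The comparison of variant allele frequencies (VAFs) between DNA and RNA described above can be sketched as follows. This is an illustrative calculation only: the function names are hypothetical, the 10x RNA read-depth cut-off is taken from the abstract, and the statistical test the authors used to call an imbalance significant is not reproduced here.

```python
def vaf(alt_reads, ref_reads):
    """Variant allele frequency: mutant reads divided by total reads at the site."""
    depth = alt_reads + ref_reads
    if depth == 0:
        return None
    return alt_reads / depth


def compare_dna_rna(dna_alt, dna_ref, rna_alt, rna_ref, min_rna_depth=10):
    """Compare DNA and RNA VAFs for one variant.

    Returns None when the variant cannot be assessed (no DNA coverage, or RNA
    depth below the cut-off); otherwise a dict with both VAFs and their difference.
    """
    if rna_alt + rna_ref < min_rna_depth:
        return None
    dna_vaf = vaf(dna_alt, dna_ref)
    if dna_vaf is None:
        return None
    rna_vaf = vaf(rna_alt, rna_ref)
    return {
        "dna_vaf": dna_vaf,
        "rna_vaf": rna_vaf,
        "delta": rna_vaf - dna_vaf,   # > 0: mutant allele over-represented in RNA
        "transcribed": rna_alt > 0,   # mutant allele observed in RNA at all
    }


# Hypothetical read counts for one variant: 40/100 mutant reads in DNA, 70/100 in RNA.
result = compare_dna_rna(dna_alt=40, dna_ref=60, rna_alt=70, rna_ref=30)
print(result)
```

A positive `delta` here would point towards preferential transcription of the mutated allele, but deciding whether it is significant requires a proper test across patients.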
Affiliation(s)
- Aarif M N Batcha
- Institute of Medical Data Processing, Biometrics and Epidemiology (IBE), Faculty of Medicine, LMU Munich, Munich, Germany.,Data Integration for Future Medicine (DiFuture, www.difuture.de), LMU Munich, Munich, Germany.
| | - Stefanos A Bamopoulos
- Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
| | - Paul Kerbs
- Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
| | - Ashwini Kumar
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
| | - Vindi Jurinovic
- Institute of Medical Data Processing, Biometrics and Epidemiology (IBE), Faculty of Medicine, LMU Munich, Munich, Germany.,Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
| | - Maja Rothenberg-Thurley
- Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
| | - Bianka Ksienzyk
- Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
| | - Julia Philippou-Massier
- Laboratory for Functional Genome Analysis (LAFUGA), Gene Center, University of Munich, Munich, Germany
| | - Stefan Krebs
- Laboratory for Functional Genome Analysis (LAFUGA), Gene Center, University of Munich, Munich, Germany
| | - Helmut Blum
- Laboratory for Functional Genome Analysis (LAFUGA), Gene Center, University of Munich, Munich, Germany
| | - Stephanie Schneider
- Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany.,Institute of Human Genetics, University Hospital, LMU Munich, Munich, Germany
| | - Nikola Konstandin
- Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
| | - Stefan K Bohlander
- Leukaemia and Blood Cancer Research Unit, Department of Molecular Medicine and Pathology, University of Auckland, Auckland, New Zealand
| | - Caroline Heckman
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
| | - Mika Kontro
- Department of Haematology, Helsinki University Hospital Comprehensive Cancer Center, Helsinki, Finland
| | - Wolfgang Hiddemann
- Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany.,German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany.,German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Karsten Spiekermann
- Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany.,German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany.,German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Jan Braess
- Department of Oncology and Hematology, Hospital Barmherzige Brüder, Regensburg, Germany
| | - Klaus H Metzeler
- Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany.,German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany.,German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Philipp A Greif
- Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany.,German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany.,German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Ulrich Mansmann
- Institute of Medical Data Processing, Biometrics and Epidemiology (IBE), Faculty of Medicine, LMU Munich, Munich, Germany.,Data Integration for Future Medicine (DiFuture, www.difuture.de), LMU Munich, Munich, Germany.,German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany.,German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Tobias Herold
- Laboratory for Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany.,German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany.,German Cancer Research Center (DKFZ), Heidelberg, Germany.,Research Unit Apoptosis in Hematopoietic Stem Cells, Helmholtz Zentrum München, German Research Center for Environmental Health (HMGU), Munich, Germany.
|
38
|
Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software. Nat Commun 2019; 10:3240. [PMID: 31324872 PMCID: PMC6642177 DOI: 10.1038/s41467-019-11146-4] [Citation(s) in RCA: 137] [Impact Index Per Article: 27.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2018] [Accepted: 06/26/2019] [Indexed: 01/12/2023] Open
Abstract
In recent years, many software packages for identifying structural variants (SVs) using whole-genome sequencing data have been released. When published, a new method is commonly compared with those already available, but this tends to be selective and incomplete. The lack of comprehensive benchmarking of methods presents challenges for users in selecting methods and for developers in understanding algorithm behaviours and limitations. Here we report the comprehensive evaluation of 10 SV callers, selected following a rigorous process and spanning the breadth of detection approaches, using high-quality reference cell lines, as well as simulations. Due to the nature of available truth sets, our focus is on general-purpose rather than somatic callers. We characterise the impact on performance of event size and type, sequencing characteristics, and genomic context, and analyse the efficacy of ensemble calling and calibration of variant quality scores. Finally, we provide recommendations for both users and methods developers. A number of computational methods have been developed for calling structural variants (SVs) using short read sequencing data. Here, the authors perform a comprehensive benchmarking analysis comparing 10 general-purpose callers and provide recommendations for both users and methods developers.
|
39
|
Brouard JS, Schenkel F, Marete A, Bissonnette N. The GATK joint genotyping workflow is appropriate for calling variants in RNA-seq experiments. J Anim Sci Biotechnol 2019; 10:44. [PMID: 31249686 PMCID: PMC6587293 DOI: 10.1186/s40104-019-0359-0] [Citation(s) in RCA: 68] [Impact Index Per Article: 13.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2018] [Accepted: 04/28/2019] [Indexed: 12/30/2022] Open
Abstract
The Genome Analysis Toolkit (GATK) is a popular set of programs for discovering and genotyping variants from next-generation sequencing data. The current GATK recommendation for RNA sequencing (RNA-seq) is to perform variant calling from individual samples, with the drawback that only variable positions are reported. Versions 3.0 and above of GATK offer the possibility of calling DNA variants on cohorts of samples using the HaplotypeCaller algorithm in Genomic Variant Call Format (GVCF) mode. Using this approach, variants are called individually on each sample, generating one GVCF file per sample that lists genotype likelihoods and their genome annotations. In a second step, variants are called from the GVCF files through a joint genotyping analysis. This strategy is more flexible and reduces computational challenges in comparison to the traditional joint discovery workflow. Using a GVCF workflow for mining SNP in RNA-seq data provides substantial advantages, including reporting homozygous genotypes for the reference allele as well as missing data. Taking advantage of RNA-seq data derived from primary macrophages isolated from 50 cows, the GATK joint genotyping method for calling variants on RNA-seq data was validated by comparing this approach to a so-called “per-sample” method. In addition, pair-wise comparisons of the two methods were performed to evaluate their respective sensitivity, precision and accuracy using DNA genotypes from a companion study including the same 50 cows genotyped using either genotyping-by-sequencing or with the Bovine SNP50 Beadchip (imputed to the Bovine high density). Results indicate that both approaches are very close in their capacity of detecting reference variants and that the joint genotyping method is more sensitive than the per-sample method. Given that the joint genotyping method is more flexible and technically easier, we recommend this approach for variant calling in RNA-seq experiments.
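The two-step GVCF workflow described in this abstract (per-sample calling in GVCF mode, then joint genotyping) can be outlined as a list of command lines. The sketch below only builds the commands and does not run them. It uses GATK4-style tool names and arguments (`HaplotypeCaller -ERC GVCF`, `CombineGVCFs`, `GenotypeGVCFs`), which is an assumption: the study refers to GATK 3.0 and above, whose invocation syntax differs, and RNA-seq-specific preprocessing is omitted. File names and the function name are hypothetical.

```python
def gvcf_workflow_commands(reference, bams, cohort_vcf, combined_gvcf="cohort.g.vcf.gz"):
    """Return GATK4-style command lines for a per-sample GVCF + joint genotyping run.

    bams maps a sample name to its BAM path. Nothing is executed here.
    """
    commands = []
    gvcfs = []

    # Step 1: call each sample separately, emitting one GVCF per sample.
    for sample, bam in bams.items():
        gvcf = f"{sample}.g.vcf.gz"
        gvcfs.append(gvcf)
        commands.append(
            ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam,
             "-O", gvcf, "-ERC", "GVCF"]
        )

    # Step 2a: merge the per-sample GVCFs into one cohort GVCF.
    combine = ["gatk", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:
        combine += ["-V", gvcf]
    combine += ["-O", combined_gvcf]
    commands.append(combine)

    # Step 2b: joint genotyping across the cohort.
    commands.append(
        ["gatk", "GenotypeGVCFs", "-R", reference, "-V", combined_gvcf, "-O", cohort_vcf]
    )
    return commands


# Hypothetical inputs for two samples.
example_commands = gvcf_workflow_commands(
    "ref.fa", {"cow01": "cow01.bam", "cow02": "cow02.bam"}, "cohort.vcf.gz"
)
for command in example_commands:
    print(" ".join(command))
```

Because every sample contributes a GVCF, the joint step can report homozygous-reference genotypes and missing data at sites that vary in other samples, which is the advantage the abstract highlights.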
Affiliation(s)
- Jean-Simon Brouard
- Sherbrooke Research and Development Centre, Agriculture and Agri-Food Canada, Sherbrooke, QC J1M 0C8 Canada
| | - Flavio Schenkel
- Center of Genetic Improvement of Livestock, University of Guelph, Guelph, ON N1G 2W1 Canada
| | - Andrew Marete
- Sherbrooke Research and Development Centre, Agriculture and Agri-Food Canada, Sherbrooke, QC J1M 0C8 Canada
| | - Nathalie Bissonnette
- Sherbrooke Research and Development Centre, Agriculture and Agri-Food Canada, Sherbrooke, QC J1M 0C8 Canada
|
40
|
Kumaran M, Subramanian U, Devarajan B. Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data. BMC Bioinformatics 2019; 20:342. [PMID: 31208315 PMCID: PMC6580603 DOI: 10.1186/s12859-019-2928-9] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2018] [Accepted: 05/31/2019] [Indexed: 12/30/2022] Open
Abstract
Background Whole exome sequencing (WES) is a cost-effective method that identifies clinical variants but it demands accurate variant caller tools. Currently available tools have variable accuracy in predicting specific clinical variants. But it may be possible to find the best combination of aligner-variant caller tools for detecting accurate single nucleotide variants (SNVs) and small insertion and deletion (InDels) separately. Moreover, many important aspects of InDel detection are overlooked while comparing the performance of tools, particularly its base pair length. Results We assessed the performance of variant calling pipelines using the combinations of four variant callers and five aligners on human NA12878 and simulated exome data. We used high confidence variant calls from Genome in a Bottle (GiaB) consortium for validation, and GRCh37 and GRCh38 as the human reference genome. Based on the performance metrics, both BWA and Novoalign aligners performed better with DeepVariant and SAMtools callers for detecting SNVs, and with DeepVariant and GATK for InDels. Furthermore, we obtained similar results on human NA24385 and NA24631 exome data from GiaB. Conclusion In this study, DeepVariant with BWA and Novoalign performed best for detecting accurate SNVs and InDels. The accuracy of variant calling was improved by merging the top performing pipelines. The results of our study provide useful recommendations for analysis of WES data in clinical genomics.
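Scoring a pipeline's calls against a high-confidence truth set, separately for SNVs and InDels, can be sketched as below. The abstract does not list the exact performance metrics, so precision, recall and F1 are assumed here as common choices. Variants are compared by exact match on (chromosome, position, ref, alt), which is a simplification: dedicated benchmarking tools also reconcile different representations of the same variant. All names and example variants are hypothetical.

```python
def is_snv(variant):
    """A variant is treated as an SNV when both alleles are a single base."""
    _, _, ref, alt = variant
    return len(ref) == 1 and len(alt) == 1


def score(calls, truth):
    """Return (precision, recall, f1) for a set of calls against a truth set."""
    tp = len(calls & truth)   # called and in the truth set
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def score_by_type(calls, truth):
    """Score SNVs and InDels separately."""
    snv_calls = {v for v in calls if is_snv(v)}
    snv_truth = {v for v in truth if is_snv(v)}
    return {
        "SNV": score(snv_calls, snv_truth),
        "InDel": score(calls - snv_calls, truth - snv_truth),
    }


# Hypothetical truth set and pipeline output.
truth_set = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("1", 300, "AT", "A")}
call_set = {("1", 100, "A", "G"), ("1", 250, "G", "C"),
            ("1", 300, "AT", "A"), ("1", 400, "G", "GA")}

metrics = score_by_type(call_set, truth_set)
for variant_type, (precision, recall, f1) in metrics.items():
    print(f"{variant_type}: precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Keeping the two variant classes apart matters because, as the abstract notes, the best aligner and caller combination can differ between SNVs and InDels.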
Affiliation(s)
- Manojkumar Kumaran
- Department of Bioinformatics, Aravind Medical Research Foundation, Madurai, Tamil Nadu, 625020, India.,School of Chemical and Biotechnology, SASTRA (Deemed to be University), Thanjavur, Tamil Nadu, 613401, India
| | - Umadevi Subramanian
- Department of Bioinformatics, Aravind Medical Research Foundation, Madurai, Tamil Nadu, 625020, India
| | - Bharanidharan Devarajan
- Department of Bioinformatics, Aravind Medical Research Foundation, Madurai, Tamil Nadu, 625020, India.
|
41
|
Crysnanto D, Wurmser C, Pausch H. Accurate sequence variant genotyping in cattle using variation-aware genome graphs. Genet Sel Evol 2019; 51:21. [PMID: 31092189 PMCID: PMC6521551 DOI: 10.1186/s12711-019-0462-x] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2018] [Accepted: 05/03/2019] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Genotyping of sequence variants typically involves, as a first step, the alignment of sequencing reads to a linear reference genome. Because a linear reference genome represents only a small fraction of all the DNA sequence variation within a species, reference allele bias may occur at highly polymorphic or divergent regions of the genome. Graph-based methods facilitate the comparison of sequencing reads to a variation-aware genome graph, which incorporates a collection of non-redundant DNA sequences that segregate within a species. We compared the accuracy and sensitivity of graph-based sequence variant genotyping using the Graphtyper software to two widely-used methods, i.e., GATK and SAMtools, which rely on linear reference genomes using whole-genome sequencing data from 49 Original Braunvieh cattle. RESULTS We discovered 21,140,196, 20,262,913, and 20,668,459 polymorphic sites using GATK, Graphtyper, and SAMtools, respectively. Comparisons between sequence variant genotypes and microarray-derived genotypes showed that Graphtyper outperformed both GATK and SAMtools in terms of genotype concordance, non-reference sensitivity, and non-reference discrepancy. The sequence variant genotypes that were obtained using Graphtyper had the smallest number of Mendelian inconsistencies between sequence-derived single nucleotide polymorphisms and indels in nine sire-son pairs. Genotype phasing and imputation using the Beagle software improved the quality of the sequence variant genotypes for all the tools evaluated, particularly for animals that were sequenced at low coverage. Following imputation, the concordance between sequence- and microarray-derived genotypes was almost identical for the three methods evaluated, i.e., 99.32, 99.46, and 99.24% for GATK, Graphtyper, and SAMtools, respectively. Variant filtration based on commonly used criteria improved genotype concordance slightly but it also decreased sensitivity. 
Graphtyper required considerably more computing resources than SAMtools but less than GATK. CONCLUSIONS Sequence variant genotyping using Graphtyper is accurate, sensitive and computationally feasible in cattle. Graph-based methods enable sequence variant genotyping from variation-aware reference genomes that may incorporate cohort-specific sequence variants, which is not possible with the current implementation of state-of-the-art methods that rely on linear reference genomes.
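The three comparison metrics named in this abstract (genotype concordance, non-reference sensitivity, non-reference discrepancy) can be illustrated with a minimal Python sketch. It assumes commonly used definitions, which may differ in detail from those applied in the paper; the function name `concordance_metrics` and the encoding of genotypes as alternate-allele counts (0, 1, 2) are hypothetical choices made for illustration.

```python
from collections import Counter


def _ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def concordance_metrics(array_gt, seq_gt):
    """Compare sequence-derived genotypes against array genotypes (treated as truth).

    Both inputs are equal-length lists of alternate-allele counts (0, 1 or 2),
    one entry per site for a single animal. This encoding is an assumption.
    """
    if len(array_gt) != len(seq_gt):
        raise ValueError("genotype lists must have the same length")

    pairs = Counter(zip(array_gt, seq_gt))
    total = sum(pairs.values())
    matching = sum(n for (a, s), n in pairs.items() if a == s)
    both_hom_ref = pairs[(0, 0)]
    truth_non_ref = sum(n for (a, _), n in pairs.items() if a > 0)
    found_non_ref = sum(n for (a, s), n in pairs.items() if a > 0 and s > 0)

    return {
        # Share of sites where both platforms report the same genotype.
        "concordance": _ratio(matching, total),
        # Share of array non-reference sites also called non-reference from sequence.
        "non_ref_sensitivity": _ratio(found_non_ref, truth_non_ref),
        # Share of mismatches among sites that are not concordant homozygous reference.
        "non_ref_discrepancy": _ratio(total - matching, total - both_hom_ref),
    }


metrics = concordance_metrics([0, 0, 1, 1, 2, 2, 1, 0], [0, 0, 1, 2, 2, 2, 0, 1])
print(metrics)
```

Excluding concordant homozygous-reference sites from the discrepancy denominator keeps the large number of trivially matching reference calls from masking errors at variant sites.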
|
42
|
Veeckman E, Van Glabeke S, Haegeman A, Muylle H, van Parijs FRD, Byrne SL, Asp T, Studer B, Rohde A, Roldán-Ruiz I, Vandepoele K, Ruttink T. Overcoming challenges in variant calling: exploring sequence diversity in candidate genes for plant development in perennial ryegrass (Lolium perenne). DNA Res 2019; 26:1-12. [PMID: 30325414 PMCID: PMC6379033 DOI: 10.1093/dnares/dsy033] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 06/26/2018] [Accepted: 09/06/2018] [Indexed: 11/13/2022]
Abstract
Revealing DNA sequence variation within the Lolium perenne genepool is important for genetic analysis and development of breeding applications. We reviewed current literature on plant development to select candidate genes in pathways that control agronomic traits, and identified 503 orthologues in L. perenne. Using targeted resequencing, we constructed a comprehensive catalogue of genomic variation for a L. perenne germplasm collection of 736 genotypes derived from current cultivars, breeding material and wild accessions. To overcome challenges of variant calling in heterogeneous outbreeding species, we used two complementary strategies to explore sequence diversity. First, four variant calling pipelines were integrated with the VariantMetaCaller to reach maximal sensitivity. Additional multiplex amplicon sequencing was used to empirically estimate an appropriate precision threshold. Second, a de novo assembly strategy was used to reconstruct divergent alleles for each gene. The advantage of this approach was illustrated by discovery of 28 novel alleles of LpSDUF247, a polymorphic gene co-segregating with the S-locus of the grass self-incompatibility system. Our approach is applicable to other genetically diverse outbreeding species. The resulting collection of functionally annotated variants can be mined for variants causing phenotypic variation, either through genetic association studies, or by selecting carriers of rare defective alleles for physiological analyses.
Affiliation(s)
- Elisabeth Veeckman
- ILVO, Plant Sciences Unit, B Melle, Belgium; Bioinformatics Institute Ghent, Ghent University, B Ghent, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, B Ghent, Belgium
- Torben Asp
- Department of Molecular Biology and Genetics, Faculty of Science and Technology, Research Center Flakkebjerg Aarhus University, DK Slagelse, Denmark
- Bruno Studer
- Molecular Plant Breeding, Institute of Agricultural Sciences, ETH Zurich, CH Zurich, Switzerland
- Isabel Roldán-Ruiz
- ILVO, Plant Sciences Unit, B Melle, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, B Ghent, Belgium
- Klaas Vandepoele
- Bioinformatics Institute Ghent, Ghent University, B Ghent, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, B Ghent, Belgium; Center for Plant Systems Biology, VIB, B Ghent, Belgium
- Tom Ruttink
- ILVO, Plant Sciences Unit, B Melle, Belgium; Bioinformatics Institute Ghent, Ghent University, B Ghent, Belgium
|
43
|
Ali H, Al-Mulla F, Hussain N, Naim M, Asbeutah AM, AlSahow A, Abu-Farha M, Abubaker J, Al Madhoun A, Ahmad S, Harris PC. PKD1 Duplicated regions limit clinical Utility of Whole Exome Sequencing for Genetic Diagnosis of Autosomal Dominant Polycystic Kidney Disease. Sci Rep 2019; 9:4141. [PMID: 30858458 PMCID: PMC6412018 DOI: 10.1038/s41598-019-40761-w] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 04/26/2018] [Accepted: 02/21/2019] [Indexed: 12/18/2022]
Abstract
Autosomal dominant polycystic kidney disease (ADPKD) is an inherited monogenic renal disease characterised by the accumulation of clusters of fluid-filled cysts in the kidneys and is caused by mutations in PKD1 or PKD2 genes. ADPKD genetic diagnosis is complicated by PKD1 pseudogenes located proximal to the original gene with a high degree of homology. Next-generation sequencing (NGS) technology, including whole exome sequencing (WES) and whole genome sequencing (WGS), is becoming more affordable, and its use in the detection of ADPKD mutations for diagnostic and research purposes is becoming more widespread. However, how well NGS technology compares with the gold standard (Sanger sequencing) in the detection of ADPKD mutations remains to be answered. We have evaluated the efficacy of WES, WGS and targeted enrichment methodologies in detecting ADPKD mutations in the PKD1 and PKD2 genes in patients who were clinically evaluated by ultrasonography and renal function tests. Our results showed that WES detected PKD1 mutations in ADPKD patients with 50% sensitivity, as the reading depth and sequencing quality were low in the duplicated regions of PKD1 (exons 1–32) compared with those of WGS and target enrichment arrays. Our investigation highlights major limitations of WES in ADPKD genetic diagnosis. Enhancing reading depth, quality and sensitivity of WES in the PKD1 duplicated regions (exons 1–32) is crucial for its potential diagnostic or research applications.
Affiliation(s)
- Hamad Ali
- Department of Medical Laboratory Sciences, Faculty of Allied Health Sciences, Health Sciences Center, Kuwait University, Jabriya, Kuwait; Department of Genetics and Bioinformatics, Dasman Diabetes Institute (DDI), Dasman, Kuwait; Division of Nephrology, Mubarak Al-Kabeer Hospital, Ministry of Health, Jabriya, Kuwait
- Fahd Al-Mulla
- Department of Genetics and Bioinformatics, Dasman Diabetes Institute (DDI), Dasman, Kuwait
- Naser Hussain
- Division of Nephrology, Mubarak Al-Kabeer Hospital, Ministry of Health, Jabriya, Kuwait
- Medhat Naim
- Division of Nephrology, Mubarak Al-Kabeer Hospital, Ministry of Health, Jabriya, Kuwait
- Akram M Asbeutah
- Department of Radiological Sciences, Faculty of Allied Health Sciences, Health Sciences Center, Kuwait University, Jabriya, Kuwait
- Ali AlSahow
- Division of Nephrology, Al-Jahra Hospital, Ministry of Health, Al-Jahra, Kuwait
- Mohamed Abu-Farha
- Department of Biochemistry and Molecular Biology, Dasman Diabetes Institute (DDI), Dasman, Kuwait
- Jehad Abubaker
- Department of Biochemistry and Molecular Biology, Dasman Diabetes Institute (DDI), Dasman, Kuwait
- Ashraf Al Madhoun
- Department of Genetics and Bioinformatics, Dasman Diabetes Institute (DDI), Dasman, Kuwait
- Sajjad Ahmad
- Department of Cornea and External Diseases, Moorfields Eye Hospital-NHS Foundation Trust, London, United Kingdom; Institute of Ophthalmology, University College London (UCL), London, United Kingdom
- Peter C Harris
- Division of Nephrology and Hypertension, Mayo Clinic, Rochester, USA
|
44
|
Vo NS, Phan V. Leveraging known genomic variants to improve detection of variants, especially close-by Indels. Bioinformatics 2018; 34:2918-2926. [PMID: 29590294 DOI: 10.1093/bioinformatics/bty183] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/15/2017] [Accepted: 03/23/2018] [Indexed: 12/30/2022]
Abstract
Motivation The detection of genomic variants has great significance in genomics, bioinformatics, biomedical research and its applications. However, despite a lot of effort, Indels and structural variants are still under-characterized compared to SNPs. Current approaches based on next-generation sequencing data usually require large numbers of reads (high coverage) to be able to detect such types of variants accurately. However Indels, especially those close to each other, are still hard to detect accurately. Results We introduce a novel approach that leverages known variant information, e.g. provided by dbSNP, dbVar, ExAC or the 1000 Genomes Project, to improve sensitivity of detecting variants, especially close-by Indels. In our approach, the standard reference genome and the known variants are combined to build a meta-reference, which is expected to be probabilistically closer to the subject genomes than the standard reference. An alignment algorithm, which can take into account known variant information, is developed to accurately align reads to the meta-reference. This strategy resulted in accurate alignment and variant calling even with low coverage data. We showed that compared to popular methods such as GATK and SAMtools, our method significantly improves the sensitivity of detecting variants, especially Indels that are close to each other. In particular, our method was able to call these close-by Indels at a 15-20% higher sensitivity than other methods at low coverage, and still get 1-5% higher sensitivity at high coverage, at competitive precision. These results were validated using simulated data with variant profiles extracted from the 1000 Genomes Project data, and real data from the Illumina Platinum Genomes Project and ExAC database. Our finding suggests that by incorporating known variant information in an appropriate manner, sensitive variant calling is possible at a low cost. 
Availability and implementation Implementation can be found in our public code repository https://github.com/namsyvo/IVC. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nam S Vo
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Vinhthuy Phan
- Department of Computer Science, The University of Memphis, Memphis, TN, USA
|
45
|
Hadigol M, Khiabanian H. MERIT reveals the impact of genomic context on sequencing error rate in ultra-deep applications. BMC Bioinformatics 2018; 19:219. [PMID: 29884116 PMCID: PMC5994075 DOI: 10.1186/s12859-018-2223-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 12/11/2017] [Accepted: 05/29/2018] [Indexed: 02/07/2023]
Abstract
BACKGROUND Rapid progress in high-throughput sequencing (HTS) and the development of novel library preparation methods have improved the sensitivity of detecting mutations in heterogeneous samples, specifically in high-depth (> 500×) clinical applications. However, HTS methods are bounded by their technical and theoretical limitations and sequencing errors cannot be completely eliminated. Comprehensive quantification of the background noise can highlight both the efficiency and the limitations of any HTS methodology, and help differentiate true mutations at low abundance from artifacts. RESULTS We introduce MERIT (Mutation Error Rate Inference Toolkit), designed for in-depth quantification of erroneous substitutions and small insertions and deletions. MERIT incorporates an all-inclusive variant caller and considers genomic context, including the nucleotides immediately at 5′ and 3′, thereby establishing error rates for 96 possible substitutions as well as four single-base and 16 double-base indels. We applied MERIT to ultra-deep sequencing data (1,300,000×) obtained from the amplification of multiple clinically relevant loci, and showed a significant relationship between error rates and genomic contexts. In addition to observing a significant difference between transversion and transition rates, we identified variations of more than 100-fold within each error type at high sequencing depths. For instance, T>G transversions in trinucleotide GTCs occurred 133.5 ± 65.9 more often than those in ATAs. Similarly, C>T transitions in GCGs were observed at 73.8 ± 10.5 higher rate than those in TCTs. We also devised an in silico approach to determine the optimal sequencing depth, where errors occur at rates similar to those of expected true mutations. Our analyses showed that increasing sequencing depth might improve sensitivity for detecting some mutations based on their genomic context.
For example, the T>G error rate in GTCs did not change when sequenced beyond 10,000×; in contrast, the T>G rate in TTAs consistently improved even above 500,000×. CONCLUSIONS Our results demonstrate significant variation in nucleotide misincorporation rates, and suggest that genomic context should be considered for comprehensive profiling of specimen-specific and sequencing artifacts in high-depth assays. These data provide strong evidence against assigning a single allele frequency threshold to call mutations, for it can result in substantial false positive as well as false negative variants, with important clinical consequences.
Affiliation(s)
- Mohammad Hadigol
- Center for Systems and Computational Biology, Rutgers Cancer Institute of New Jersey, Rutgers University, New Brunswick, NJ USA
- Hossein Khiabanian
- Center for Systems and Computational Biology, Rutgers Cancer Institute of New Jersey, Rutgers University, New Brunswick, NJ USA
- Department of Pathology and Laboratory Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers University, New Brunswick, NJ USA
|
46
|
Tuzov N. A framework for the estimation of the proportion of true discoveries in single nucleotide variant detection studies for human data. PLoS One 2018; 13:e0196058. [PMID: 29694377 PMCID: PMC5918994 DOI: 10.1371/journal.pone.0196058] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/23/2017] [Accepted: 04/05/2018] [Indexed: 12/30/2022]
Abstract
Any single nucleotide variant detection study could benefit from a fast and cheap method of measuring the quality of variant call list. It is advantageous to be able to see how the call list quality is affected by different variant filtering thresholds and other adjustments to the study parameters. Here we look into a possibility of estimating the proportion of true positives in a single nucleotide variant call list for human data. Using whole-exome and whole-genome gold standard data sets for training, we focus on building a generic model that only relies on information available from any variant caller. We assess and compare the performance of different candidate models based on their practical accuracy. We find that the generic model delivers decent accuracy most of the time. Further, we conclude that its performance could be improved substantially by leveraging the variant quality metrics that are specific to each variant calling tool.
Affiliation(s)
- Nik Tuzov
- Partek Incorporated, Saint Louis, Missouri, United States of America
|
47
|
Ren Y, Reddy JS, Pottier C, Sarangi V, Tian S, Sinnwell JP, McDonnell SK, Biernacka JM, Carrasquillo MM, Ross OA, Ertekin-Taner N, Rademakers R, Hudson M, Mainzer LS, Asmann YW. Identification of missing variants by combining multiple analytic pipelines. BMC Bioinformatics 2018; 19:139. [PMID: 29661148 PMCID: PMC5902939 DOI: 10.1186/s12859-018-2151-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 01/02/2018] [Accepted: 04/09/2018] [Indexed: 02/02/2023]
Abstract
Background After decades of identifying risk factors using array-based genome-wide association studies (GWAS), genetic research of complex diseases has shifted to sequencing-based rare variants discovery. This requires large sample sizes for statistical power and has brought up questions about whether the current variant calling practices are adequate for large cohorts. It is well-known that there are discrepancies between variants called by different pipelines, and that using a single pipeline always misses true variants exclusively identifiable by other pipelines. Nonetheless, it is common practice today to call variants by one pipeline due to computational cost and assume that false negative calls are a small percent of total. Results We analyzed 10,000 exomes from the Alzheimer’s Disease Sequencing Project (ADSP) using multiple analytic pipelines consisting of different read aligners and variant calling strategies. We compared variants identified by using two aligners in 50, 100, 200, 500, 1000, and 1952 samples; and compared variants identified by adding single-sample genotyping to the default multi-sample joint genotyping in 50, 100, 500, 2000, 5000 and 10,000 samples. We found that using a single pipeline missed increasing numbers of high-quality variants correlated with sample sizes. By combining two read aligners and two variant calling strategies, we rescued 30% of pass-QC variants at sample size of 2000, and 56% at 10,000 samples. The rescued variants had higher proportions of low frequency (minor allele frequency [MAF] 1–5%) and rare (MAF < 1%) variants, which are the very type of variants of interest. In 660 Alzheimer’s disease cases with earlier onset ages of ≤65, 4 out of 13 (31%) previously-published rare pathogenic and protective mutations in APP, PSEN1, and PSEN2 genes were undetected by the default one-pipeline approach but recovered by the multi-pipeline approach.
Conclusions Identification of the complete variant set from sequencing data is the prerequisite of genetic association analyses. The current analytic practice of calling genetic variants from sequencing data using a single bioinformatics pipeline is no longer adequate with the increasingly large projects. The number and percentage of quality variants that passed quality filters but are missed by the one-pipeline approach rapidly increased with sample size. Electronic supplementary material The online version of this article (10.1186/s12859-018-2151-0) contains supplementary material, which is available to authorized users.
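The frequency classes mentioned in this abstract (rare: MAF < 1%; low frequency: MAF 1–5%) can be written as a small Python sketch. The function names, the "common" label for everything above 5%, and the treatment of exactly 5% as low frequency are assumptions made for illustration; the abstract does not specify boundary handling.

```python
def minor_allele_frequency(alt_allele_count, total_alleles):
    """Minor allele frequency from an alternate-allele count.

    total_alleles is the number of called alleles (2 x samples for diploid data).
    """
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    alt_freq = alt_allele_count / total_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def frequency_class(maf):
    """Bin a variant by MAF using the thresholds quoted in the abstract."""
    if maf < 0.01:
        return "rare"
    if maf <= 0.05:  # assumption: the 5% boundary counts as low frequency
        return "low_frequency"
    return "common"  # assumption: label for variants above 5%


# 3 alternate alleles among 2000 called alleles (1000 diploid samples).
print(frequency_class(minor_allele_frequency(3, 2000)))
```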
Affiliation(s)
- Yingxue Ren
- Department of Health Sciences Research, Mayo Clinic, Jacksonville, FL, 32224, USA
- Joseph S Reddy
- Department of Health Sciences Research, Mayo Clinic, Jacksonville, FL, 32224, USA
- Cyril Pottier
- Department of Neuroscience, Mayo Clinic, Jacksonville, FL, 32224, USA
- Vivekananda Sarangi
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, 55905, USA
- Shulan Tian
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, 55905, USA
- Jason P Sinnwell
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, 55905, USA
- Shannon K McDonnell
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, 55905, USA
- Joanna M Biernacka
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, 55905, USA
- Owen A Ross
- Department of Neuroscience, Mayo Clinic, Jacksonville, FL, 32224, USA; Department of Clinical Genomics, Mayo Clinic, Jacksonville, FL, 32224, USA
- Nilüfer Ertekin-Taner
- Department of Neuroscience, Mayo Clinic, Jacksonville, FL, 32224, USA; Department of Neurology, Mayo Clinic, Jacksonville, FL, 32224, USA
- Rosa Rademakers
- Department of Neuroscience, Mayo Clinic, Jacksonville, FL, 32224, USA
- Matthew Hudson
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA; Carl R Woese Institute for Genomic Biology, Carver Biotechnology Center and Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Liudmila Sergeevna Mainzer
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Yan W Asmann
- Department of Health Sciences Research, Mayo Clinic, Jacksonville, FL, 32224, USA
|
48
|
Shringarpure SS, Mathias RA, Hernandez RD, O'Connor TD, Szpiech ZA, Torres R, De La Vega FM, Bustamante CD, Barnes KC, Taub MA. Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data. Bioinformatics 2018; 33:1147-1153. [PMID: 28035032 PMCID: PMC5408850 DOI: 10.1093/bioinformatics/btw786] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/15/2016] [Accepted: 12/07/2016] [Indexed: 12/30/2022]
Abstract
Motivation Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than public data such as HapMap or dbSNP, to train an accurate classifier using Random Forests. We demonstrate our method on a set of variant calls obtained from 642 African-ancestry genomes from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), sequenced to high depth (30X). Results We have applied our classifier to compare call sets generated with different calling methods, including both single-sample and multi-sample callers. At a False Positive Rate of 5%, our method determines true positive rates of 97.5%, 95% and 99% on variant calls obtained using Illumina's single-sample caller CASAVA, Real Time Genomics multisample variant caller, and the GATK UnifiedGenotyper, respectively. Since NGS sequencing data may be accompanied by genotype data for the same samples, either collected concurrent to sequencing or from a previous study, our method can be trained on each dataset to provide a more accurate computational validation of site calls compared to generic methods. Moreover, our method allows for adjustment based on allele frequency (e.g. a different set of criteria to determine quality for rare versus common variants) and thereby provides insight into sequencing characteristics that indicate call quality for variants of different frequencies. Availability and Implementation Code is available on GitHub at: https://github.com/suyashss/variant_validation. Contacts suyashs@stanford.edu or mtaub@jhsph.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Suyash S Shringarpure
- Departments of Genetics and Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- Rasika A Mathias
- 23 and Me Inc, Mountain View, CA, USA; Department of Medicine, Johns Hopkins University, Baltimore, MD, USA
- Ryan D Hernandez
- Department of Epidemiology, Bloomberg School of Public Health, JHU, Baltimore, MD, USA; Department of Bioengineering and Therapeutic Sciences; Institute for Human Genetics
- Timothy D O'Connor
- Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA; Institute for Genome Sciences; Program in Personalized and Genomic Medicine
- Zachary A Szpiech
- Department of Epidemiology, Bloomberg School of Public Health, JHU, Baltimore, MD, USA
- Raul Torres
- Department of Medicine, University of Maryland School of Medicine, Baltimore, MD, USA
- Francisco M De La Vega
- Departments of Genetics and Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- Carlos D Bustamante
- Departments of Genetics and Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- Kathleen C Barnes
- 23 and Me Inc, Mountain View, CA, USA; Department of Medicine, Johns Hopkins University, Baltimore, MD, USA
- Margaret A Taub
- Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, CA, USA
|
49
|
Zomnir MG, Lipkin L, Pacula M, Dominguez Meneses E, MacLeay A, Duraisamy S, Nadhamuni N, Al Turki SH, Zheng Z, Rivera M, Nardi V, Dias-Santagata D, Iafrate AJ, Le LP, Lennerz JK. Artificial Intelligence Approach for Variant Reporting. JCO Clin Cancer Inform 2018; 2:CCI.16.00079. [PMID: 30364844 PMCID: PMC6198661 DOI: 10.1200/cci.16.00079] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 01/12/2023]
Abstract
Purpose Next-generation sequencing technologies are actively applied in clinical oncology. Bioinformatics pipeline analysis is an integral part of this process; however, humans cannot yet realize the full potential of the highly complex pipeline output. As a result, the decision to include a variant in the final report during routine clinical sign-out remains challenging. Methods We used an artificial intelligence approach to capture the collective clinical sign-out experience of six board-certified molecular pathologists to build and validate a decision support tool for variant reporting. We extracted all reviewed and reported variants from our clinical database and tested several machine learning models. We used 10-fold cross-validation for our variant call prediction model, which derives a contiguous prediction score from 0 to 1 (no to yes) for clinical reporting. Results For each of the 19,594 initial training variants, our pipeline generates approximately 500 features, which results in a matrix of > 9 million data points. From a comparison of naive Bayes, decision trees, random forests, and logistic regression models, we selected models that allow human interpretability of the prediction score. The logistic regression model demonstrated 1% false negativity and 2% false positivity. The final models' Youden indices were 0.87 and 0.77 for screening and confirmatory cutoffs, respectively. Retraining on a new assay and performance assessment in 16,123 independent variants validated our approach (Youden index, 0.93). We also derived individual pathologist-centric models (virtual consensus conference function), and a visual drill-down functionality allows assessment of how underlying features contributed to a particular score or decision branch for clinical implementation. 
Conclusion Our decision support tool for variant reporting is a practically relevant artificial intelligence approach to harness the next-generation sequencing bioinformatics pipeline output when the complexity of data interpretation exceeds human capabilities.
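The Youden index reported in this abstract is conventionally defined as sensitivity + specificity − 1. The Python sketch below computes it from confusion counts and scans score cutoffs for the one that maximizes it. The function names, the `score >= cutoff` decision rule, and the toy data are hypothetical; this is not the authors' pipeline.

```python
def youden_index(tp, fn, tn, fp):
    """Youden's J = sensitivity + specificity - 1, from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def best_cutoff(scores, labels):
    """Return (cutoff, J) maximizing Youden's J, predicting 'report' when score >= cutoff.

    scores: prediction scores in [0, 1]; labels: 1 = reported, 0 = not reported.
    Requires at least one positive and one negative label.
    """
    def j_at(cutoff):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        return youden_index(tp, fn, tn, fp)

    # Only observed scores can change the classification, so test each once;
    # on ties, max() keeps the lowest cutoff because candidates are sorted.
    candidates = sorted(set(scores))
    cutoff = max(candidates, key=j_at)
    return cutoff, j_at(cutoff)


print(best_cutoff([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

A lower cutoff favors sensitivity and a higher one favors specificity, which is one way to arrive at separate screening and confirmatory thresholds of the kind the abstract mentions.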
Affiliation(s)
- Lev Lipkin
- All authors: Massachusetts General Hospital, Boston, MA
- Maciej Pacula
- All authors: Massachusetts General Hospital, Boston, MA
- Zongli Zheng
- All authors: Massachusetts General Hospital, Boston, MA
- Miguel Rivera
- All authors: Massachusetts General Hospital, Boston, MA
- Long P. Le
- All authors: Massachusetts General Hospital, Boston, MA
|
50
|
Ye S, Yuan X, Lin X, Gao N, Luo Y, Chen Z, Li J, Zhang X, Zhang Z. Imputation from SNP chip to sequence: a case study in a Chinese indigenous chicken population. J Anim Sci Biotechnol 2018; 9:30. [PMID: 29581880 PMCID: PMC5861640 DOI: 10.1186/s40104-018-0241-5] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 08/06/2017] [Accepted: 01/26/2018] [Indexed: 11/24/2022]
Abstract
Background Genome-wide association studies and genomic predictions are thought to be optimized by using whole-genome sequence (WGS) data. However, sequencing thousands of individuals of interest is expensive. Imputation from SNP panels to WGS data is an attractive and less expensive approach to obtain WGS data. The aims of this study were to investigate the accuracy of imputation and to provide insight into the design and execution of genotype imputation. Results We genotyped 450 chickens with a 600 K SNP array, and sequenced 24 key individuals by whole genome re-sequencing. Accuracy of imputation from putative 60 K and 600 K array data to WGS data was 0.620 and 0.812 for Beagle, and 0.810 and 0.914 for FImpute, respectively. By increasing the sequencing cost from 24X to 144X, the imputation accuracy increased from 0.525 to 0.698 for Beagle and from 0.654 to 0.823 for FImpute. With fixed sequence depth (12X), increasing the number of sequenced animals from 1 to 24 improved accuracy from 0.421 to 0.897 for FImpute and from 0.396 to 0.777 for Beagle. Using optimally selected key individuals resulted in a higher imputation accuracy compared with using randomly selected individuals as a reference population for re-sequencing. With fixed reference population size (24), imputation accuracy increased from 0.654 to 0.875 for FImpute and from 0.512 to 0.762 for Beagle as the sequencing depth increased from 1X to 12X. With a given total cost of genotyping, accuracy increased with the size of the reference population for FImpute, but the pattern was not valid for Beagle, which showed the highest accuracy at sixfold coverage for the scenarios used in this study. Conclusions In conclusion, we comprehensively investigated the impacts of several key factors on genotype imputation. Generally, increasing sequencing cost gave a higher imputation accuracy. But with a fixed sequencing cost, the optimal imputation enhances the performance of WGP and GWAS.
An optimal imputation strategy should take size of reference population, imputation algorithms, marker density, and population structure of the target population and methods to select key individuals into consideration comprehensively. This work sheds additional light on how to design and execute genotype imputation for livestock populations. Electronic supplementary material The online version of this article (10.1186/s40104-018-0241-5) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Shaopan Ye, Xiaolong Yuan, Xiran Lin, Ning Gao, Yuanyu Luo, Zanmou Chen, Jiaqi Li, Xiquan Zhang, Zhe Zhang
- Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, Guangdong, China
|