1
Thanh Nguyen D, Hoang Nguyen Q, Thuy Duong N, Vo NS. LmTag: functional-enrichment and imputation-aware tag SNP selection for population-specific genotyping arrays. Brief Bioinform 2022; 23:6627269. [PMID: 35780383 DOI: 10.1093/bib/bbac252] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/06/2022] [Revised: 05/02/2022] [Accepted: 05/31/2022] [Indexed: 12/16/2022] Open
Abstract
Despite the rapid development of sequencing technology, single-nucleotide polymorphism (SNP) arrays are still the most cost-effective genotyping solutions for large-scale genomic research and applications. Recent years have witnessed the rapid development of numerous genotyping platforms of different sizes and designs, but population-specific platforms are still lacking, especially for those in developing countries. SNP arrays designed for these countries should be cost-effective (small size), yet incorporate key information needed to associate genotypes with traits. A key design principle for most current platforms is to improve genome-wide imputation so that more SNPs not included in the array (imputed SNPs) can be predicted. However, current tag SNP selection methods mostly focus on imputation accuracy and coverage, but not the functional content of the array. It is those functional SNPs that are most likely associated with traits. Here, we propose LmTag, a novel method for tag SNP selection that not only improves imputation performance but also prioritizes highly functional SNP markers. We apply LmTag to a wide range of populations using both public and in-house whole-genome sequencing databases. Our results show that LmTag improved both functional marker prioritization and genome-wide imputation accuracy compared to existing methods. This novel approach could contribute to the next-generation genotyping arrays that provide excellent imputation capability as well as facilitate array-based functional genetic studies. Such arrays are particularly suitable for under-represented populations in developing countries or non-model species, where little genomics data are available while investment in genome sequencing or high-density SNP arrays is limited. LmTag is available at: https://github.com/datngu/LmTag.
Affiliation(s)
- Dat Thanh Nguyen
- Center for Biomedical Informatics, Vingroup Big Data Institute, 458 Minh Khai, 10000, Hanoi, Vietnam
- Quan Hoang Nguyen
- Institute for Molecular Bioscience, University of Queensland, St Lucia, QLD 4067, Brisbane, Australia
- Nguyen Thuy Duong
- Center for Biomedical Informatics, Vingroup Big Data Institute, 458 Minh Khai, 10000, Hanoi, Vietnam; Institute of Genome Research, Vietnam Academy of Science and Technology, 18 Hoang Quoc Viet, 10000, Hanoi, Vietnam
- Nam S Vo
- Center for Biomedical Informatics, Vingroup Big Data Institute, 458 Minh Khai, 10000, Hanoi, Vietnam; College of Engineering and Computer Science, VinUniversity, Vinhomes Ocean Park, 10000, Hanoi, Vietnam
2
Discovering Genome-Wide Tag SNPs Based on the Mutual Information of the Variants. PLoS One 2016; 11:e0167994. [PMID: 27992465 PMCID: PMC5161470 DOI: 10.1371/journal.pone.0167994] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 07/12/2016] [Accepted: 11/23/2016] [Indexed: 01/01/2023] Open
Abstract
Exploring linkage disequilibrium (LD) patterns among single nucleotide polymorphism (SNP) sites can improve the accuracy and cost-effectiveness of genomic association studies, whereby representative (tag) SNPs are identified to sufficiently represent the genomic diversity in populations. There has been a considerable amount of effort in developing efficient algorithms to select tag SNPs from the growing large-scale data sets. Methods using the classical pairwise-LD and multi-locus LD measures have been proposed that aim to reduce the computational complexity and to increase the accuracy, respectively. The present work solves the tag SNP selection problem by efficiently balancing the computational complexity and accuracy, and improves the coverage in genomic diversity in a cost-effective manner. The employed algorithm makes use of mutual information to explore the multi-locus association between SNPs and can handle different data types and conditions. Experiments with benchmark HapMap data sets show comparable or better performance against the state-of-the-art algorithms. In particular, as a novel application, genome-wide SNP tagging was performed on the 1000 Genomes Project data sets and produced a well-annotated database of tagging variants that capture the common genotype diversity in 2,504 samples from 26 human populations. Compared to conventional methods, the algorithm requires as input only the genotype (or haplotype) sequences, can scale up to genome-wide analyses, and produces accurate solutions with more information-rich output, providing an improved platform for researchers towards the subsequent association studies.
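The mutual-information measure mentioned in this abstract can be illustrated for the simplest case, a single pair of SNPs. The sketch below is not the authors' algorithm, which explores multi-locus association. It only computes the empirical mutual information between two genotype vectors. The function name `mutual_information` and the toy genotype vectors are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information (in bits) between two genotype vectors.

    x and y hold one genotype code per sample (e.g. 0, 1, 2 minor-allele counts).
    """
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    count_x = Counter(x)            # marginal counts for the first SNP
    count_y = Counter(y)            # marginal counts for the second SNP
    count_xy = Counter(zip(x, y))   # joint counts over samples
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[a] * count_y[b]))
    return mi


# Toy data (hypothetical): six samples, three SNPs.
snp_a = [0, 0, 1, 1, 2, 2]
snp_b = [0, 0, 1, 1, 2, 2]   # identical to snp_a, so fully informative about it
snp_c = [0, 1, 0, 1, 0, 1]   # statistically independent of snp_a in this sample

mi_ab = mutual_information(snp_a, snp_b)
mi_ac = mutual_information(snp_a, snp_c)
```

In this toy case `mi_ab` equals the entropy of `snp_a` (log2 of 3 bits), so one SNP fully represents the other, while `mi_ac` is zero, so neither SNP can stand in for the other.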
3
Budhathoki S, Yamaji T, Iwasaki M, Sawada N, Shimazu T, Sasazuki S, Yoshida T, Tsugane S. Vitamin D Receptor Gene Polymorphism and the Risk of Colorectal Cancer: A Nested Case-Control Study. PLoS One 2016; 11:e0164648. [PMID: 27736940 PMCID: PMC5063384 DOI: 10.1371/journal.pone.0164648] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 05/16/2016] [Accepted: 09/28/2016] [Indexed: 12/31/2022] Open
Abstract
Epidemiological and experimental evidence suggests that vitamin D is protective against the risk of colorectal cancer. Polymorphisms in the gene encoding vitamin D receptor (VDR), which mediates most of the known cellular effects of vitamin D, have been suggested to alter this association. Here, using a tag SNP approach, we comprehensively evaluated the role of common genetic variants in VDR and their interaction with plasma vitamin D levels in relation to colorectal cancer risk in Japanese populations. A total of 356 colorectal cancer cases and 709 matched control subjects were selected from the participants of the Japan Public Health Center-based Prospective Cohort Study. Among these subjects, 29 VDR single nucleotide polymorphisms (SNPs) were selected and genotyped, and plasma vitamin D concentrations were measured. Conditional logistic regression models were used to estimate odds ratios (ORs) and 95% confidence intervals (CIs) of colorectal cancer, with adjustment for potential confounding factors. Among the results, eight VDR SNPs, namely rs2254210, rs1540339, rs2107301, rs11168267, rs11574113, rs731236, rs3847987 and rs11574143, the latter five of which were located in the 3′ region, were nominally associated with the risk of colorectal cancer (P = 0.01–0.048). Furthermore, of these five 3′ region SNPs, the inverse associations for three SNPs (rs11574113, rs3847987 and rs11574143) appeared to be evident only in those with high plasma vitamin D concentration. However, neither the direct associations nor the suggestive interactions were significant after multiple testing adjustment. Overall, the findings of this study provide only limited support for an association between common genetic variations in VDR and colorectal cancer risk in the Japanese population.
Affiliation(s)
- Sanjeev Budhathoki
- Epidemiology and Prevention Group, Center for Public Health Sciences, National Cancer Center, Tokyo, Japan
- Taiki Yamaji
- Epidemiology and Prevention Group, Center for Public Health Sciences, National Cancer Center, Tokyo, Japan
- Motoki Iwasaki
- Epidemiology and Prevention Group, Center for Public Health Sciences, National Cancer Center, Tokyo, Japan
- Norie Sawada
- Epidemiology and Prevention Group, Center for Public Health Sciences, National Cancer Center, Tokyo, Japan
- Taichi Shimazu
- Epidemiology and Prevention Group, Center for Public Health Sciences, National Cancer Center, Tokyo, Japan
- Shizuka Sasazuki
- Epidemiology and Prevention Group, Center for Public Health Sciences, National Cancer Center, Tokyo, Japan
- Teruhiko Yoshida
- Division of Genetics, National Cancer Center Research Institute, Tokyo, Japan
- Shoichiro Tsugane
- Epidemiology and Prevention Group, Center for Public Health Sciences, National Cancer Center, Tokyo, Japan
4
Liao B, Li X, Cai L, Cao Z, Chen H. A Hierarchical Clustering Method of Selecting Kernel SNP to Unify Informative SNP and Tag SNP. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2015; 12:113-122. [PMID: 26357082 DOI: 10.1109/tcbb.2014.2351797] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 06/05/2023]
Abstract
Various strategies can be used to select representative single nucleotide polymorphisms (SNPs) from a large number of SNPs, such as tag SNP for haplotype coverage and informative SNP for haplotype reconstruction, respectively. Representative SNPs are not only instrumental in reducing the cost of genotyping, but also serve an important function in narrowing the combinatorial space in epistasis analysis. The capacity of kernel SNPs to unify informative SNP and tag SNP is explored, and inconsistencies are minimized in further studies. The correlation between multiple SNPs is formalized using multi-information measures. In extending the correlation, a distance formula for measuring the similarity between clusters is first designed to conduct hierarchical clustering. Hierarchical clustering consists of both information gain and haplotype diversity, so that the proposed approach can achieve unification. The kernel SNPs are then selected from every cluster through the top rank or backward elimination scheme. Using these kernel SNPs, extensive experimental comparisons are conducted between informative SNPs on haplotype reconstruction accuracy and tag SNPs on haplotype coverage. Results indicate that the kernel SNP can practically unify informative SNP and tag SNP and is therefore adaptable to various applications.
5
Srivastava AK, Chopra R, Ali S, Aggarwal S, Vig L, Bamezai RNK. Inferring population structure and relationship using minimal independent evolutionary markers in Y-chromosome: a hybrid approach of recursive feature selection for hierarchical clustering. Nucleic Acids Res 2014; 42:e122. [PMID: 25030906 PMCID: PMC4150763 DOI: 10.1093/nar/gku585] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/29/2022] Open
Abstract
Inundation of evolutionary markers expedited in Human Genome Project and 1000 Genome Consortium has necessitated pruning of redundant and dependent variables. Various computational tools based on machine-learning and data-mining methods like feature selection/extraction have been proposed to escape the curse of dimensionality in large datasets. Incidentally, evolutionary studies, primarily based on sequentially evolved variations have remained un-facilitated by such advances till date. Here, we present a novel approach of recursive feature selection for hierarchical clustering of Y-chromosomal SNPs/haplogroups to select a minimal set of independent markers, sufficient to infer population structure as precisely as deduced by a larger number of evolutionary markers. To validate the applicability of our approach, we optimally designed MALDI-TOF mass spectrometry-based multiplex to accommodate independent Y-chromosomal markers in a single multiplex and genotyped two geographically distinct Indian populations. An analysis of 105 world-wide populations reflected that 15 independent variations/markers were optimal in defining population structure parameters, such as FST, molecular variance and correlation-based relationship. A subsequent addition of randomly selected markers had a negligible effect (close to zero, i.e. 1 × 10⁻³) on these parameters. The study proves efficient in tracing complex population structures and deriving relationships among world-wide populations in a cost-effective and expedient manner.
Affiliation(s)
- Amit Kumar Srivastava
- National Centre of Applied Human Genetics, School of Life Sciences, Jawaharlal Nehru University, New Delhi 110067, India
- Rupali Chopra
- National Centre of Applied Human Genetics, School of Life Sciences, Jawaharlal Nehru University, New Delhi 110067, India
- Shafat Ali
- National Centre of Applied Human Genetics, School of Life Sciences, Jawaharlal Nehru University, New Delhi 110067, India
- Shweta Aggarwal
- National Centre of Applied Human Genetics, School of Life Sciences, Jawaharlal Nehru University, New Delhi 110067, India
- Lovekesh Vig
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India
- Rameshwar Nath Koul Bamezai
- National Centre of Applied Human Genetics, School of Life Sciences, and School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India
6
Demir HD, Ortak H, Şahin Ş, Ateş Ö, Benli İ, İnanır A. VKORC1 C1173T and VKORC1 G-1639A Gene Polymorphisms in Turkish Behçet’s Patients with Ocular and Non-ocular Involvement. Ophthalmic Genet 2014; 35:7-11. [DOI: 10.3109/13816810.2013.763994] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
7
İlhan İ, Tezel G. How to Select Tag SNPs in Genetic Association Studies? The CLONTagger Method with Parameter Optimization. OMICS: A Journal of Integrative Biology 2013; 17:368-83. [DOI: 10.1089/omi.2012.0100] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 12/31/2022]
Affiliation(s)
- İlhan İlhan
- Akören Vocational School, Selçuk University, Konya, Turkey
- Gülay Tezel
- Department of Computer Engineering, Faculty of Engineering and Architecture, Selçuk University, Konya, Turkey
8
İlhan İ, Tezel G. A genetic algorithm–support vector machine method with parameter optimization for selecting the tag SNPs. J Biomed Inform 2013; 46:328-40. [DOI: 10.1016/j.jbi.2012.12.002] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 01/10/2012] [Revised: 10/13/2012] [Accepted: 12/11/2012] [Indexed: 01/06/2023]
9
Liao B, Li X, Zhu W, Cao Z. A novel method to select informative SNPs and their application in genetic association studies. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1529-1534. [PMID: 22585142 DOI: 10.1109/tcbb.2012.70] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 05/31/2023]
Abstract
Association studies between complex diseases and single nucleotide polymorphisms (SNPs) or haplotypes have recently received great attention. However, these studies are limited by the cost of genotyping all SNPs. Therefore, it is essential to find a small subset of tag SNPs representing the rest of the SNPs. The presence of linkage disequilibrium between tag SNPs and the disease variant (genotyped or not) may allow fine-mapping studies. In this paper, we combine a nearest-means classifier (NMC) and an ant colony algorithm to select tags. Results show that our method (ACO/NMC) achieves prediction accuracy similar to that of BPSO/SVM and is better than BPSO/STAMPA for small data sets. For large data sets, although the prediction accuracy of our method is lower than that of BPSO/SVM, ACO/NMC can reach a high accuracy (>99 percent) in a relatively short time. As the number of tags increases, the running time of NMC grows nearly linearly. To assess the ability of the tags to locate a disease locus, we simulate a case-control study and use two-locus haplotype analysis to quantitatively assess the power. The results show that tags selected by NMC, comprising 20 percent of all SNPs, have on average about 10 percent higher power than random tags.
Affiliation(s)
- Bo Liao
- College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China.
10
Weersma RK, Crusius JBA, Roberts RL, Koeleman BPC, Palomino-Morales R, Wolfkamp S, Hollis-Moffatt JE, Festen EAM, Meisneris S, Heijmans R, Noble CL, Gearry RB, Barclay ML, Gómez-Garcia M, Lopez-Nevot MA, Nieto A, Rodrigo L, Radstake TRDJ, van Bodegraven AA, Wijmenga C, Merriman TR, Stokkers PCF, Peña AS, Martín J, Alizadeh BZ. Association of FcgR2a, but not FcgR3a, with inflammatory bowel diseases across three Caucasian populations. Inflamm Bowel Dis 2010; 16:2080-9. [PMID: 20848524 DOI: 10.1002/ibd.21342] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 12/09/2022]
Abstract
BACKGROUND The Fc receptors II and III (FcgR2a and FcgR3a) play a crucial role in the regulation of the immune response. The FcgR2a*519GG and FcgR3a*559CC genotypes have been associated with several autoimmune diseases including systemic lupus erythematosus, rheumatoid arthritis, nephritis, and possibly type I diabetes and celiac disease. In a large multicenter, two-stage study of 6570 people, we tested whether the FcgR2a and FcgR3a genes were also involved in inflammatory bowel disease (IBD), which includes Crohn's disease (CD) and ulcerative colitis (UC). METHODS We genotyped the FcgR2a*A519G and FcgR3a*A559C functional variants in 4205 IBD patients in six well-phenotyped Caucasian IBD cohorts and 2365 ethnically matched controls recruited from the Netherlands, Spain, and New Zealand. RESULTS In the initial Dutch study we found a significant association of FcgR2a genotypes with IBD (P-genotype = 0.02); the FcgR2a*519GG genotype was more common in controls (23%) than in IBD patients (18%; odds ratio [OR] = 0.75; 95% confidence interval [CI] 0.61-0.92; P = 0.004). This association was corroborated by a combined analysis across all the study populations (Mantel-Haenszel [MH] OR = 0.84; 0.74-0.95; P = 0.005) in the next stage. The FcgR2a*GG genotype was associated with both UC (MH-OR = 0.84; 0.72-0.97; P = 0.01) and CD (MH-OR = 0.84; 0.73-0.97; P = 0.01), suggesting that this genotype confers a protective effect against IBD. There was no association of FcgR3a*A559C genotypes with IBD, CD, or UC in any of the three studied populations. CONCLUSIONS The FcgR2a*519G functional variant was associated with IBD and reduced susceptibility to UC and to CD in Caucasians. There was no association between FcgR3a*A559C and IBD, CD or UC.
Affiliation(s)
- Rinse K Weersma
- Department of Gastroenterology and Hepatology, University Medical Center Groningen, Groningen, The Netherlands
11
Liu G, Wang Y, Wong L. FastTagger: an efficient algorithm for genome-wide tag SNP selection using multi-marker linkage disequilibrium. BMC Bioinformatics 2010; 11:66. [PMID: 20113476 PMCID: PMC3098109 DOI: 10.1186/1471-2105-11-66] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Received: 08/18/2009] [Accepted: 01/29/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The human genome contains millions of common single nucleotide polymorphisms (SNPs), and these SNPs play an important role in understanding the association between genetic variations and human diseases. Many SNPs show correlated genotypes, or linkage disequilibrium (LD), so it is not necessary to genotype all SNPs for an association study. Many algorithms have been developed to find a small subset of SNPs, called tag SNPs, that are sufficient to infer all the other SNPs. Algorithms based on the r² LD statistic have gained popularity because r² is directly related to statistical power to detect disease associations. Most existing r²-based algorithms use pairwise LD. Recent studies show that multi-marker LD can help further reduce the number of tag SNPs. However, existing tag SNP selection algorithms based on multi-marker LD are both time-consuming and memory-consuming. They cannot work on chromosomes containing more than 100 k SNPs using length-3 tagging rules. RESULTS We propose an efficient algorithm called FastTagger to calculate multi-marker tagging rules and select tag SNPs based on multi-marker LD. FastTagger uses several techniques to reduce running time and memory consumption. Our experimental results show that FastTagger is several times faster than existing multi-marker based tag SNP selection algorithms, and it consumes much less memory at the same time. As a result, FastTagger can work on chromosomes containing more than 100 k SNPs using length-3 tagging rules. FastTagger also produces smaller sets of tag SNPs than existing multi-marker based algorithms, and the reduction ratio ranges from 3% to 9% when length-3 tagging rules are used. The generated tagging rules can also be used for genotype imputation. We studied the prediction accuracy of individual rules, and the average accuracy is above 96% when r² ≥ 0.9.
CONCLUSIONS Generating multi-marker tagging rules is a computation intensive task, and it is the bottleneck of existing multi-marker based tag SNP selection methods. FastTagger is a practical and scalable algorithm to solve this problem.
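For orientation, the pairwise r² baseline that this abstract contrasts with multi-marker tagging can be sketched as a greedy set cover. This is not FastTagger, which derives multi-marker tagging rules. It is a minimal pairwise illustration in which a SNP counts as tagged when its r² with a chosen tag meets a threshold. The function names, the default threshold of 0.8, and the toy genotypes are all assumptions made for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (allele counts)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)


def greedy_tag_snps(genotypes, threshold=0.8):
    """Greedy pairwise tagging: repeatedly pick the SNP that tags the most
    still-untagged SNPs, until every SNP is tagged by some chosen tag."""
    names = list(genotypes)
    # covers[s] = SNPs that s can tag at the given r^2 threshold (itself included)
    covers = {
        s: {t for t in names
            if t == s or r_squared(genotypes[s], genotypes[t]) >= threshold}
        for s in names
    }
    untagged = set(names)
    tags = []
    while untagged:
        best = max(names, key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Toy data (hypothetical): four SNPs genotyped in six samples.
toy = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],  # perfectly correlated with snp1
    "snp3": [2, 1, 0, 2, 1, 0],  # perfectly anti-correlated, r^2 is still 1
    "snp4": [0, 0, 0, 1, 1, 2],  # weakly correlated with the others
}
tags = greedy_tag_snps(toy)
```

On the toy data, `snp1` tags `snp2` and `snp3`, and `snp4` must be kept as its own tag. Computing all pairwise r² values is quadratic in the number of SNPs, which is one reason genome-wide tools restrict comparisons to nearby markers.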
Affiliation(s)
- Guimei Liu
- Department of Computer Science, National University of Singapore, Singapore.