1
Nguyen QH, Nguyen H, Oh EC, Nguyen T. Current approaches and outstanding challenges of functional annotation of metabolites: a comprehensive review. Brief Bioinform 2024; 25:bbae498. [PMID: 39397425; PMCID: PMC11471905; DOI: 10.1093/bib/bbae498] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Revised: 09/03/2024] [Accepted: 10/02/2024] [Indexed: 10/15/2024] Open
Abstract
Metabolite profiling is a powerful approach for the clinical diagnosis of complex diseases, ranging from cardiometabolic diseases, cancer, and cognitive disorders to respiratory pathologies and conditions that involve dysregulated metabolism. Because of the importance of systems-level interpretation, many methods have been developed to identify biologically significant pathways using metabolomics data. In this review, we first describe a complete metabolomics workflow (sample preparation, data acquisition, pre-processing, downstream analysis, etc.). We then comprehensively review 24 approaches capable of performing functional analysis, including those that combine metabolomics data with other types of data to investigate the disease-relevant changes at multiple omics layers. We discuss their availability, implementation, capability for pre-processing and quality control, supported omics types, embedded databases, pathway analysis methodologies, and integration techniques. We also provide a rating and evaluation of each software, focusing on their key technique, software accessibility, documentation, and user-friendliness. Following our guideline, life scientists can easily choose a suitable method depending on method rating, available data, input format, and method category. More importantly, we highlight outstanding challenges and potential solutions that need to be addressed by future research. To further assist users in executing the reviewed methods, we provide wrappers of the software packages at https://github.com/tinnlab/metabolite-pathway-review-docker.
Affiliation(s)
- Quang-Huy Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL 36849, United States
- Ha Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL 36849, United States
- Edwin C Oh
  - Department of Internal Medicine, UNLV School of Medicine, University of Nevada, Las Vegas, NV 89154, United States
- Tin Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL 36849, United States
2
Zhang W, Xu C, Zhou M, Liu L, Ni Z, Su S, Wang C. Copy number variants selected during pig domestication inferred from whole genome resequencing. Front Vet Sci 2024; 11:1364267. [PMID: 38505001; PMCID: PMC10950068; DOI: 10.3389/fvets.2024.1364267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Accepted: 02/19/2024] [Indexed: 03/21/2024] Open
Abstract
Over extended periods of natural and artificial selection, China has developed numerous exceptional pig breeds. Deciphering the germplasm characteristics of these breeds is crucial for their preservation and utilization. While many studies have employed single nucleotide polymorphism (SNP) analysis to investigate the local pig germplasm characteristics, copy number variation (CNV), another significant type of genetic variation, has been less explored in understanding pig resources. In this study, we examined the CNVs of 18 Wanbei pigs (WBP) using whole genome resequencing data with an average depth of 12.61. We identified a total of 8,783 CNVs (~30.07 Mb, 1.20% of the pig genome) in WBP, including 8,427 deletions and 356 duplications. Utilizing fixation index (Fst), we determined that 164 CNVs were within the top 1% of the Fst value and defined as under selection. Functional enrichment analyses of the genes associated with these selected CNVs revealed genes linked to reproduction (SPATA6, CFAP43, CFTR, BPTF), growth and development (NR6A1, SMYD3, VIPR2), and immunity (PARD3, FYB2). This study enhances our understanding of the genomic characteristics of the Wanbei pig and offers a theoretical foundation for the future breeding of this breed.
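The top-1% Fst cutoff mentioned in the abstract can be illustrated with a minimal sketch. The Fst values below are randomly generated placeholders rather than data from the study, and the function name `select_top_fst` is hypothetical; the authors' actual pipeline is not described here.

```python
import numpy as np

def select_top_fst(fst, top_fraction=0.01):
    """Return indices whose Fst is at or above the (1 - top_fraction) quantile, plus the cutoff."""
    fst = np.asarray(fst, dtype=float)
    cutoff = np.quantile(fst, 1.0 - top_fraction)
    return np.flatnonzero(fst >= cutoff), cutoff

# Placeholder per-CNV Fst values (assumption: one Fst value per CNV)
rng = np.random.default_rng(0)
fst_values = rng.beta(1.0, 20.0, size=10_000)

selected, cutoff = select_top_fst(fst_values)
print(len(selected), "CNVs at or above the cutoff", round(float(cutoff), 4))
```

An empirical quantile cutoff of this kind ranks loci relative to the genome-wide distribution; it does not by itself provide a significance test.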
Affiliation(s)
- Wei Zhang
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Anhui Provincial Breeding Pig Genetic Evaluation Center, Key Laboratory of Pig Molecular Quantitative Genetics of Anhui Academy of Agricultural Sciences, Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Hefei, China
- Chengliang Xu
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Anhui Provincial Breeding Pig Genetic Evaluation Center, Key Laboratory of Pig Molecular Quantitative Genetics of Anhui Academy of Agricultural Sciences, Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Hefei, China
- Mei Zhou
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Anhui Provincial Breeding Pig Genetic Evaluation Center, Key Laboratory of Pig Molecular Quantitative Genetics of Anhui Academy of Agricultural Sciences, Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Hefei, China
- Linqing Liu
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Anhui Provincial Breeding Pig Genetic Evaluation Center, Key Laboratory of Pig Molecular Quantitative Genetics of Anhui Academy of Agricultural Sciences, Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Hefei, China
- Zelan Ni
  - Anhui Provincial Livestock and Poultry Genetic Resources Conservation Center, Hefei, China
- Shiguang Su
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Anhui Provincial Breeding Pig Genetic Evaluation Center, Key Laboratory of Pig Molecular Quantitative Genetics of Anhui Academy of Agricultural Sciences, Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Hefei, China
- Chonglong Wang
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Anhui Provincial Breeding Pig Genetic Evaluation Center, Key Laboratory of Pig Molecular Quantitative Genetics of Anhui Academy of Agricultural Sciences, Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Hefei, China
3
Javaid A, Frost HR. SPECK: an unsupervised learning approach for cell surface receptor abundance estimation for single-cell RNA-sequencing data. Bioinform Adv 2023; 3:vbad073. [PMID: 37359727; PMCID: PMC10290233; DOI: 10.1093/bioadv/vbad073] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/15/2022] [Revised: 05/23/2023] [Accepted: 06/12/2023] [Indexed: 06/28/2023]
Abstract
Summary: The rapid development of single-cell transcriptomics has revolutionized the study of complex tissues. Single-cell RNA-sequencing (scRNA-seq) can profile tens of thousands of dissociated cells from a tissue sample, enabling researchers to identify cell types, phenotypes and interactions that control tissue structure and function. A key requirement of these applications is the accurate estimation of cell surface protein abundance. Although technologies to directly quantify surface proteins are available, these data are uncommon and limited to proteins with available antibodies. While supervised methods that are trained on Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) data can provide the best performance, these training data are limited by available antibodies and may not exist for the tissue under investigation. In the absence of protein measurements, researchers must estimate receptor abundance from scRNA-seq data. Therefore, we developed a new unsupervised method for receptor abundance estimation using scRNA-seq data, called SPECK (Surface Protein abundance Estimation using CKmeans-based clustered thresholding), and primarily evaluated its performance against unsupervised approaches for at least 25 human receptors and multiple tissue types. This analysis reveals that techniques based on a thresholded reduced rank reconstruction of scRNA-seq data are effective for receptor abundance estimation, with SPECK providing the best overall performance.
Availability and implementation: SPECK is freely available at https://CRAN.R-project.org/package=SPECK.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Azka Javaid
  - Department of Biomedical Data Science, Dartmouth College, Hanover, NH 03755, USA
- H Robert Frost
  - Department of Biomedical Data Science, Dartmouth College, Hanover, NH 03755, USA
4
Zhang J, Luo Q, Hou J, Xiao W, Long P, Hu Y, Chen X, Wang H. Fatty acids and risk of dilated cardiomyopathy: A two-sample Mendelian randomization study. Front Nutr 2023; 10:1068050. [PMID: 36875854; PMCID: PMC9980906; DOI: 10.3389/fnut.2023.1068050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Accepted: 01/30/2023] [Indexed: 02/18/2023] Open
Abstract
BACKGROUND: Previous observational studies have shown close associations between fatty acids (FAs) and dilated cardiomyopathy (DCM). However, due to the confounding factors and reverse causal association found in observational epidemiological studies, the etiological explanation is not credible.
OBJECTIVE: To exclude possible confounding factors and reverse causal associations found in observational epidemiological studies, we used two-sample Mendelian randomization (MR) analysis to verify the causal relationship between FAs and DCM risk.
METHOD: All data for 54 FAs were downloaded from the genome-wide association studies (GWAS) catalog, and the summary statistics of DCM were extracted from the HF Molecular Epidemiology for Therapeutic Targets Consortium GWAS. Two-sample MR analysis was conducted to evaluate the causal effect of FAs on DCM risk through several analytical methods, including MR-Egger, inverse variance weighting (IVW), maximum likelihood, weighted median estimator (WME), and the MR pleiotropy residual sum and outlier test (MR-PRESSO). Directionality tests using MR-Steiger were performed to assess the possibility of reverse causation.
RESULTS: Our analysis identified two FAs, oleic acid and fatty acid (18:1)-OH, that may have a significant causal effect on DCM. MR analyses indicated that oleic acid was suggestively associated with a heightened risk of DCM (OR = 1.291, 95%CI: 1.044-1.595, P = 0.018). As a probable metabolite of oleic acid, fatty acid (18:1)-OH has a suggestive association with a lower risk of DCM (OR = 0.402, 95%CI: 0.167-0.966, P = 0.041). The results of the directionality test suggested that there was no reverse causality between exposure and outcome (P < 0.001). In contrast, the other 52 available FAs were found to have no significant causal relationships with DCM (P > 0.05).
CONCLUSION: Our findings suggest that oleic acid and fatty acid (18:1)-OH may have causal relationships with DCM, indicating that the risk of DCM from oleic acid may be decreased by encouraging the conversion of oleic acid to fatty acid (18:1)-OH.
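As a rough arithmetic check on the reported effect sizes, the sketch below recovers a log-odds estimate, its standard error, and a two-sided P value from each odds ratio and 95% confidence interval. It assumes a symmetric Wald interval on the log scale and a normal reference distribution; the study may have computed its P values differently, and the function name `wald_from_or_ci` is hypothetical.

```python
import math

def wald_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover (beta, se, two-sided p) from an OR and its 95% CI, assuming a Wald interval."""
    beta = math.log(odds_ratio)
    # The CI spans 2 * z_crit standard errors on the log-odds scale
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = beta / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p

reported = {
    "oleic acid": (1.291, 1.044, 1.595),
    "fatty acid (18:1)-OH": (0.402, 0.167, 0.966),
}
results = {name: wald_from_or_ci(*vals) for name, vals in reported.items()}
for name, (beta, se, p) in results.items():
    print(f"{name}: beta={beta:.3f}, se={se:.3f}, p={p:.3f}")
```

Under these assumptions the recovered P values fall close to the reported 0.018 and 0.041, so the intervals and P values in the abstract are mutually consistent.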
Affiliation(s)
- Jiexin Zhang
  - Department of Laboratory Medicine, Affiliated Hospital of Southwest Jiaotong University, The Third People's Hospital of Chengdu, Chengdu, Sichuan, China
  - Central Laboratory, The General Hospital of Western Theater Command, Chengdu, Sichuan, China
- Qiang Luo
  - Department of Laboratory Medicine, Affiliated Hospital of Southwest Jiaotong University, The Third People's Hospital of Chengdu, Chengdu, Sichuan, China
- Jun Hou
  - Department of Laboratory Medicine, Affiliated Hospital of Southwest Jiaotong University, The Third People's Hospital of Chengdu, Chengdu, Sichuan, China
- Wenjing Xiao
  - Central Laboratory, The General Hospital of Western Theater Command, Chengdu, Sichuan, China
- Pan Long
  - Central Laboratory, The General Hospital of Western Theater Command, Chengdu, Sichuan, China
- Yonghe Hu
  - Central Laboratory, The General Hospital of Western Theater Command, Chengdu, Sichuan, China
- Xin Chen
  - Department of Laboratory Medicine, Affiliated Hospital of Southwest Jiaotong University, The Third People's Hospital of Chengdu, Chengdu, Sichuan, China
- Han Wang
  - Department of Cardiology, Affiliated Hospital of Southwest Jiaotong University, The Third People's Hospital of Chengdu, Chengdu, Sichuan, China
5
Learning high-order interactions for polygenic risk prediction. PLoS One 2023; 18:e0281618. [PMID: 36763605; PMCID: PMC9916647; DOI: 10.1371/journal.pone.0281618] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/29/2022] [Accepted: 01/27/2023] [Indexed: 02/11/2023] Open
Abstract
Within the framework of precision medicine, the stratification of individual genetic susceptibility based on inherited DNA variation has paramount relevance. However, one of the most relevant pitfalls of traditional Polygenic Risk Score (PRS) approaches is their inability to model complex high-order non-linear SNP-SNP interactions and their effect on the phenotype (e.g. epistasis). Indeed, they incur a computational challenge, as the number of possible interactions grows exponentially with the number of SNPs considered, which also affects the statistical reliability of the model parameters. In this work, we address this issue by proposing a novel PRS approach, called High-order Interactions-aware Polygenic Risk Score (hiPRS), that incorporates high-order interactions in modeling polygenic risk. hiPRS combines an interaction search routine based on frequent itemsets mining and a novel interaction selection algorithm based on Mutual Information to construct a simple and interpretable weighted model of user-specified dimensionality that can predict a given binary phenotype. Compared to traditional PRS methods, hiPRS relies on neither GWAS summary statistics nor any external information. Moreover, hiPRS differs from Machine Learning-based approaches that can include complex interactions in that it provides a readable and interpretable model and is able to control overfitting, even on small samples. In the present work we demonstrate through a comprehensive simulation study the superior performance of hiPRS with respect to state-of-the-art methods, both in terms of scoring performance and interpretability of the resulting model. We also test hiPRS against small sample size, class imbalance and the presence of noise, showcasing its robustness to extreme experimental settings. Finally, we apply hiPRS to a case study on real data from the DACHS cohort, defining an interaction-aware scoring model to predict mortality of stage II-III Colon-Rectal Cancer patients treated with oxaliplatin.
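To make the role of Mutual Information concrete, the sketch below computes the empirical mutual information between a binary interaction indicator and a binary phenotype. This is only the underlying quantity, not the hiPRS selection algorithm itself; the toy genotype indicators and the function name `mutual_information_binary` are assumptions for illustration.

```python
import numpy as np

def mutual_information_binary(x, y):
    """Empirical mutual information (in nats) between two 0/1 arrays."""
    x = np.asarray(x, dtype=int)
    y = np.asarray(y, dtype=int)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))
            p_a = np.mean(x == a)
            p_b = np.mean(y == b)
            if p_ab > 0:  # zero-probability cells contribute nothing
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return float(mi)

# Hypothetical toy data: two SNP-level indicators and a phenotype driven by their interaction
snp1 = np.array([1, 1, 0, 0, 1, 0, 1, 0])
snp2 = np.array([1, 0, 1, 0, 1, 0, 0, 1])
interaction = snp1 & snp2
phenotype = interaction.copy()

mi_interaction = mutual_information_binary(interaction, phenotype)
mi_single = mutual_information_binary(snp1, phenotype)
print(mi_interaction, mi_single)
```

In this toy setting the interaction term carries more information about the phenotype than either SNP alone, which is the kind of signal an interaction-aware score is meant to capture.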
6
Zhang W, Liu L, Zhou M, Su S, Dong L, Meng X, Li X, Wang C. Assessing Population Structure and Signatures of Selection in Wanbei Pigs Using Whole Genome Resequencing Data. Animals (Basel) 2022; 13:ani13010013. [PMID: 36611624; PMCID: PMC9817800; DOI: 10.3390/ani13010013] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/14/2022] [Revised: 12/10/2022] [Accepted: 12/18/2022] [Indexed: 12/24/2022] Open
Abstract
Wanbei pig (WBP) is one of the indigenous pig resources in China and has many germplasm characteristics. However, research on its genome is lacking. To assess the genomic variation, population structure, and selection signatures, we resequenced 18 WBP for the first time and performed a comprehensive analysis with resequenced data of 10 Asian wild boars. In total, 590.03 Gb of data and approximately 41 million variants were obtained. Polymorphism level (θπ) ratio and genetic differentiation (fixation index)-based cross approaches were applied, and 539 regions, which harbored 176 genes, were selected. Functional analysis of the selected genes revealed that they were associated with lipid metabolism (SCP2, APOA1, APOA4, APOC3, CD36, BCL6, ADCY8), backfat thickness (PLAG1, CACNA2D1), muscle (MYOG), and reproduction (CABS1). Overall, our results provide a valuable resource for characterizing the uniqueness of WBP and a basis for future breeding.
Affiliation(s)
- Wei Zhang
  - Key Laboratory of Pig Molecular Quantitative Genetics, Anhui Academy of Agricultural Sciences, Hefei 230031, China
  - Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Hefei 230031, China
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Hefei 230031, China
- Linqing Liu
  - Key Laboratory of Pig Molecular Quantitative Genetics, Anhui Academy of Agricultural Sciences, Hefei 230031, China
  - Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Hefei 230031, China
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Hefei 230031, China
- Mei Zhou
  - Key Laboratory of Pig Molecular Quantitative Genetics, Anhui Academy of Agricultural Sciences, Hefei 230031, China
  - Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Hefei 230031, China
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Hefei 230031, China
- Shiguang Su
  - Key Laboratory of Pig Molecular Quantitative Genetics, Anhui Academy of Agricultural Sciences, Hefei 230031, China
  - Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Hefei 230031, China
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Hefei 230031, China
- Lin Dong
  - Key Laboratory of Pig Molecular Quantitative Genetics, Anhui Academy of Agricultural Sciences, Hefei 230031, China
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Hefei 230031, China
- Xinxin Meng
  - Key Laboratory of Pig Molecular Quantitative Genetics, Anhui Academy of Agricultural Sciences, Hefei 230031, China
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Hefei 230031, China
- Xueting Li
  - Key Laboratory of Pig Molecular Quantitative Genetics, Anhui Academy of Agricultural Sciences, Hefei 230031, China
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Hefei 230031, China
- Chonglong Wang
  - Key Laboratory of Pig Molecular Quantitative Genetics, Anhui Academy of Agricultural Sciences, Hefei 230031, China
  - Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety Engineering, Hefei 230031, China
  - Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Hefei 230031, China
  - Correspondence:
7
Cai TT, Zhang AR, Zhou Y. Sparse Group Lasso: Optimal Sample Complexity, Convergence Rate, and Statistical Inference. IEEE Trans Inf Theory 2022; 68:5975-6002. [PMID: 36865503; PMCID: PMC9974176; DOI: 10.1109/tit.2022.3175455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
We study sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse. This problem is an important instance of the simultaneously structured model - an actively studied topic in statistics and machine learning. In the noiseless case, matching upper and lower bounds on sample complexity are established for the exact recovery of sparse vectors and for stable estimation of approximately sparse vectors, respectively. In the noisy case, upper and matching minimax lower bounds for estimation error are obtained. We also consider the debiased sparse group Lasso and investigate its asymptotic property for the purpose of statistical inference. Finally, numerical studies are provided to support the theoretical results.
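For readers unfamiliar with the estimator, the sketch below writes out a sparse group Lasso penalty and its proximal operator for non-overlapping groups. It assumes the common unweighted form, lam * ||beta||_1 + lam_g * sum over groups of ||beta_g||_2; the paper's exact weighting of the two terms is not stated in the abstract, and the function names are hypothetical.

```python
import numpy as np

def sparse_group_lasso_penalty(beta, groups, lam, lam_g):
    """Element-wise l1 term plus group-wise l2 term (unweighted form, an assumption)."""
    beta = np.asarray(beta, dtype=float)
    l1 = lam * np.sum(np.abs(beta))
    l2 = lam_g * sum(np.linalg.norm(beta[idx]) for idx in groups)
    return float(l1 + l2)

def prox_sparse_group_lasso(v, groups, lam, lam_g):
    """Proximal operator: soft-threshold each entry, then shrink each group as a block."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    for idx in groups:
        u = np.sign(v[idx]) * np.maximum(np.abs(v[idx]) - lam, 0.0)
        norm = np.linalg.norm(u)
        if norm > lam_g:  # otherwise the whole group is set to zero
            out[idx] = (1.0 - lam_g / norm) * u
    return out

groups = [[0, 1, 2], [3, 4]]  # non-overlapping index sets
v = np.array([3.0, -0.5, 0.2, 0.4, -0.3])
shrunk = prox_sparse_group_lasso(v, groups, lam=1.0, lam_g=1.0)
print(shrunk, sparse_group_lasso_penalty(shrunk, groups, 1.0, 1.0))
```

The output is sparse at both levels: the second group is removed entirely, and within the surviving group only one coordinate remains nonzero, which matches the "double sparse" structure the abstract describes.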
Affiliation(s)
- T Tony Cai
  - Department of Statistics & Data Science, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104
- Anru R Zhang
  - Departments of Biostatistics & Bioinformatics, Computer Science, Mathematics, and Statistical Science, Duke University, Durham, NC 27710
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706
- Yuchen Zhou
  - Department of Statistics & Data Science, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706
8
Wu H, Wang X, Chu M, Xiang R, Zhou K. FRMC: a fast and robust method for the imputation of scRNA-seq data. RNA Biol 2021; 18:172-181. [PMID: 34459719; DOI: 10.1080/15476286.2021.1960688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022] Open
Abstract
The high resolution of single-cell transcriptome sequencing technology allows researchers to observe gene expression profiles at the single-cell level, offering numerous possibilities for subsequent biomedical investigation. However, the high proportion of missing values in gene-cell expression matrices, an unavoidable technical consequence of insufficient RNA input, severely hampers the accuracy of downstream analysis. To address this problem, it is essential to develop a more rapid and stable imputation method with greater accuracy, which should not only recover the missing data but also effectively facilitate subsequent analysis of biological mechanisms. The existing imputation methods all have drawbacks and limitations: some require a pre-assumed data distribution, some cannot distinguish between technical and biological zeros, and some have poor computational performance. In this paper, we present FRMC, a novel imputation tool for single-cell RNA-Seq data that introduces a fast and accurate approximation method for singular value thresholding. The experiments demonstrated that FRMC can not only precisely distinguish 'true zeros' from dropout events and correctly impute missing values attributed to technical noise, but also effectively enhance intracellular and intergenic connections and achieve accurate clustering of cells in biological applications. In summary, FRMC can be a powerful tool for analysing single-cell data because it ensures biological significance, accuracy, and rapidity simultaneously. FRMC is implemented in Python and is freely accessible to non-commercial users on GitHub: https://github.com/HUST-DataMan/FRMC.
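For context, the generic singular value thresholding operator that such methods build on can be sketched as follows. This is the textbook exact operator computed with a full SVD, not FRMC's fast approximation, and the toy matrix, noise level, and threshold `tau` are arbitrary assumptions.

```python
import numpy as np

def singular_value_threshold(matrix, tau):
    """Soft-threshold the singular values of `matrix` by `tau` and rebuild the matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    # Scale the columns of u by the shrunken singular values, then recombine
    return (u * s_shrunk) @ vt

# Toy example: a rank-1 "expression" pattern plus small noise (placeholder data)
rng = np.random.default_rng(1)
signal = np.outer(rng.normal(size=20), rng.normal(size=10))
noisy = signal + 0.01 * rng.normal(size=signal.shape)

denoised = singular_value_threshold(noisy, tau=0.5)
print(np.linalg.svd(denoised, compute_uv=False)[:3])
```

Shrinking singular values favors a low-rank reconstruction, which is why the operator is a common building block in matrix completion; real imputation pipelines additionally need a rule for which zeros to treat as missing.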
Affiliation(s)
- Honglong Wu
  - Wuhan National Laboratory for Optoelectronics, Huazhong University of Science & Technology, Wuhan, Hubei, China
  - BGI PathoGenesis Pharmaceutical Technology, BGI-Shenzhen, Shenzhen 518083, China
- Xuebin Wang
  - BGI PathoGenesis Pharmaceutical Technology, BGI-Shenzhen, Shenzhen 518083, China
- Mengtian Chu
  - BGI PathoGenesis Pharmaceutical Technology, BGI-Shenzhen, Shenzhen 518083, China
- Ruizhi Xiang
  - BGI PathoGenesis Pharmaceutical Technology, BGI-Shenzhen, Shenzhen 518083, China
- Ke Zhou
  - Wuhan National Laboratory for Optoelectronics, Huazhong University of Science & Technology, Wuhan, Hubei, China
9
Demetci P, Cheng W, Darnell G, Zhou X, Ramachandran S, Crawford L. Multi-scale inference of genetic trait architecture using biologically annotated neural networks. PLoS Genet 2021; 17:e1009754. [PMID: 34411094; PMCID: PMC8407593; DOI: 10.1371/journal.pgen.1009754] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 11/23/2020] [Revised: 08/31/2021] [Accepted: 07/31/2021] [Indexed: 01/01/2023] Open
Abstract
In this article, we present Biologically Annotated Neural Networks (BANNs), a nonlinear probabilistic framework for association mapping in genome-wide association (GWA) studies. BANNs are feedforward models with partially connected architectures that are based on biological annotations. This setup yields a fully interpretable neural network where the input layer encodes SNP-level effects, and the hidden layer models the aggregated effects among SNP-sets. We treat the weights and connections of the network as random variables with prior distributions that reflect how genetic effects manifest at different genomic scales. The BANNs software uses variational inference to provide posterior summaries which allow researchers to simultaneously perform (i) mapping with SNPs and (ii) enrichment analyses with SNP-sets on complex traits. Through simulations, we show that our method improves upon state-of-the-art association mapping and enrichment approaches across a wide range of genetic architectures. We then further illustrate the benefits of BANNs by analyzing real GWA data assayed in approximately 2,000 heterogeneous stock mice from the Wellcome Trust Centre for Human Genetics and approximately 7,000 individuals from the Framingham Heart Study. Lastly, using a random subset of individuals of European ancestry from the UK Biobank, we show that BANNs is able to replicate known associations in high and low-density lipoprotein cholesterol content.
Affiliation(s)
- Pinar Demetci
  - Department of Computer Science, Brown University, Providence, Rhode Island, United States of America
  - Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
- Wei Cheng
  - Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
  - Department of Ecology and Evolutionary Biology, Brown University, Providence, Rhode Island, United States of America
- Gregory Darnell
  - Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
- Xiang Zhou
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of America
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, United States of America
- Sohini Ramachandran
  - Department of Computer Science, Brown University, Providence, Rhode Island, United States of America
  - Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
  - Department of Ecology and Evolutionary Biology, Brown University, Providence, Rhode Island, United States of America
- Lorin Crawford
  - Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
  - Microsoft Research New England, Cambridge, Massachusetts, United States of America
  - Department of Biostatistics, Brown University, Providence, Rhode Island, United States of America
10
Deng L, Ma L, Cheng KK, Xu X, Raftery D, Dong J. Sparse PLS-Based Method for Overlapping Metabolite Set Enrichment Analysis. J Proteome Res 2021; 20:3204-3213. [PMID: 34002606; DOI: 10.1021/acs.jproteome.1c00064] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/29/2022]
Abstract
Metabolite set enrichment analysis (MSEA) has gained increasing research interest for identification of perturbed metabolic pathways in metabolomics. The method incorporates predefined metabolic pathway information in the analysis, where metabolite sets are typically assumed to be mutually exclusive. However, metabolic pathways are known to contain common metabolites and intermediates. This situation, along with limitations in metabolite detection or coverage, leads to overlapping, incomplete metabolite sets in pathway analysis. For overlapping metabolite sets, MSEA tends to produce many false positives due to improper weights allocated to the overlapping metabolites. Here, we propose an extended partial least squares (PLS) model with a new sparse scheme for overlapping metabolite set enrichment analysis, named overlapping group PLS (ogPLS) analysis. The weight vector of the ogPLS model was decomposed into pathway-specific subvectors, and then a group lasso penalty was imposed on these subvectors to achieve a proper weight allocation for the overlapping metabolites. Two strategies were adopted in the proposed ogPLS model to identify the perturbed metabolic pathways. The first strategy involves debiasing regularization, which was used to reduce inequalities amongst the predefined metabolic pathways. The second strategy is stable selection, which was used to rank pathways while avoiding the nuisance problems of model parameter optimization. Both simulated and real-world metabolomic datasets were used to evaluate the proposed method and compare it with two other MSEA methods, the Global-test and the multiblock PLS (MB-PLS)-based pathway importance in projection (PIP) methods. Using a simulated dataset with known perturbed pathways, the average true discovery rate for the ogPLS method was found to be higher than that of the Global-test and the MB-PLS-based PIP methods. Analysis with a real-world metabolomics dataset also indicated that the developed method was less prone to select pathways with highly overlapped detected metabolite sets. Compared with the two other methods, the proposed method offers higher accuracy and a lower false-positive rate, and is more robust when applied to overlapping metabolite set analysis. The developed ogPLS method may serve as an alternative MSEA method to facilitate biological interpretation of metabolomics data for overlapping metabolite sets.
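The decomposition of a weight vector into pathway-specific subvectors can be illustrated with the column-duplication device commonly used for overlapping group penalties. Whether ogPLS is implemented this way is an assumption; the pathway index sets, toy matrix, and function names below are hypothetical.

```python
import numpy as np

def expand_overlapping_groups(X, groups):
    """Duplicate columns so each pathway gets its own copy of any shared metabolite."""
    col_index = np.concatenate([np.asarray(g, dtype=int) for g in groups])
    return X[:, col_index], col_index

def collapse_weights(sub_weights, col_index, n_features):
    """Sum pathway-specific sub-weights back onto the original metabolites."""
    w = np.zeros(n_features)
    np.add.at(w, col_index, sub_weights)  # accumulates correctly for repeated indices
    return w

# Toy data: 3 metabolites, two pathways that share metabolite 1
X = np.arange(6, dtype=float).reshape(2, 3)
pathways = [[0, 1], [1, 2]]

X_expanded, col_index = expand_overlapping_groups(X, pathways)
sub_weights = np.array([1.0, 2.0, 3.0, 4.0])  # one weight per (pathway, metabolite) pair
w = collapse_weights(sub_weights, col_index, X.shape[1])
print(X_expanded.shape, w)
```

After expansion the groups no longer overlap, so an ordinary group lasso penalty can be applied to the sub-weights, and the expanded and collapsed models give identical predictions.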
Affiliation(s)
- Lingli Deng
  - Jiangxi Engineering Technology Research Center of Nuclear Geoscience Data Science and System, East China University of Technology, Nanchang 330013, China
  - Department of Information Engineering, East China University of Technology, Nanchang 330013, China
- Lei Ma
  - Department of Information Engineering, East China University of Technology, Nanchang 330013, China
- Kian-Kai Cheng
  - Innovation Centre in Agritechnology, Universiti Teknologi Malaysia, Muar 84600, Johor, Malaysia
- Xiangnan Xu
  - School of Mathematics and Statistics, The University of Sydney, Camperdown, NSW 2006, Australia
- Daniel Raftery
  - Northwest Metabolomics Research Center, Department of Anesthesiology and Pain Medicine, University of Washington, Seattle, Washington 98109, United States
- Jiyang Dong
  - Department of Electronic Science, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
11
Huang M, Chen X, Yu Y, Lai H, Feng Q. Imaging Genetics Study Based on a Temporal Group Sparse Regression and Additive Model for Biomarker Detection of Alzheimer's Disease. IEEE Trans Med Imaging 2021; 40:1461-1473. [PMID: 33556003; DOI: 10.1109/tmi.2021.3057660] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 06/12/2023]
Abstract
Imaging genetics is an effective tool used to detect potential biomarkers of Alzheimer's disease (AD) in imaging and genetic data. Most existing imaging genetics methods analyze the association between brain imaging quantitative traits (QTs) and genetic data [e.g., single nucleotide polymorphism (SNP)] by using a linear model, ignoring correlations between a set of QTs and SNP groups, and disregarding the varied associations between longitudinal imaging QTs and SNPs. To solve these problems, we propose a novel temporal group sparsity regression and additive model (T-GSRAM) to identify associations between longitudinal imaging QTs and SNPs for detection of potential AD biomarkers. We first construct a nonparametric regression model to analyze the nonlinear association between QTs and SNPs, which can accurately model the complex influence of SNPs on QTs. We then use longitudinal QTs to identify the trajectory of imaging genetic patterns over time. Moreover, the SNP information of group and individual levels are incorporated into the proposed method to boost the power of biomarker detection. Finally, we propose an efficient algorithm to solve the whole T-GSRAM model. We evaluated our method using simulation data and real data obtained from AD neuroimaging initiative. Experimental results show that our proposed method outperforms several state-of-the-art methods in terms of the receiver operating characteristic curves and area under the curve. Moreover, the detection of AD-related genes and QTs has been confirmed in previous studies, thereby further verifying the effectiveness of our approach and helping understand the genetic basis over time during disease progression.
12
Tozzo V, Azencott CA, Fiorini S, Fava E, Trucco A, Barla A. Where Do We Stand in Regularization for Life Science Studies? J Comput Biol 2021; 29:213-232. [PMID: 33926217 PMCID: PMC8968832 DOI: 10.1089/cmb.2019.0371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
Abstract
More and more biologists and bioinformaticians turn to machine learning to analyze large amounts of data. In this context, it is crucial to understand which is the most suitable data analysis pipeline for achieving reliable results. This process may be challenging, due to a variety of factors, the most crucial ones being the data type and the general goal of the analysis (e.g., explorative or predictive). Life science data sets require further consideration as they often contain measures with a low signal-to-noise ratio, high-dimensional observations, and relatively few samples. In this complex setting, regularization, which can be defined as the introduction of additional information to solve an ill-posed problem, is the tool of choice to obtain robust models. Different regularization practices may be used depending both on characteristics of the data and of the question asked, and different choices may lead to different results. In this article, we provide a comprehensive description of the impact and importance of regularization techniques in life science studies. In particular, we provide an intuition of what regularization is and of the different ways it can be implemented and exploited. We propose four general life sciences problems in which regularization is fundamental and should be exploited for robustness. For each of these large families of problems, we enumerate different techniques as well as examples and case studies. Lastly, we provide a unified view of how to approach each data type with various regularization techniques.
Affiliation(s)
- Veronica Tozzo: Department of Informatics, Bioengineering, Robotics and System Engineering-DIBRIS, University of Genoa, Genoa, Italy
- Chloé-Agathe Azencott: Centre for Computational Biology-CBIO, MINES ParisTech, PSL Research University, Paris, France; Institut Curie, PSL Research University, Paris, France; INSERM, U900, Paris, France
- Emanuele Fava: Department of Electrical, Electronic, Telecommunications Engineering, and Naval Architecture (DITEN), University of Genoa, Genoa, Italy
- Andrea Trucco: Department of Electrical, Electronic, Telecommunications Engineering, and Naval Architecture (DITEN), University of Genoa, Genoa, Italy
- Annalisa Barla: Department of Informatics, Bioengineering, Robotics and System Engineering-DIBRIS, University of Genoa, Genoa, Italy
13
Zhou J, Qiu Y, Chen S, Liu L, Liao H, Chen H, Lv S, Li X. A Novel Three-Stage Framework for Association Analysis Between SNPs and Brain Regions. Front Genet 2020; 11:572350. [PMID: 33193677 PMCID: PMC7542238 DOI: 10.3389/fgene.2020.572350] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/13/2020] [Accepted: 08/17/2020] [Indexed: 12/17/2022]
Abstract
Motivation: At present, a number of methods for correlation analysis between SNPs and brain regions of interest (ROIs) have been devised to explore the pathogenic mechanism of Alzheimer's disease. However, these methods have inherent deficiencies, including a lack of statistical efficacy and of biological meaning. This study aims to address two issues: the insufficient correlation achieved by previous methods (relatively high regression error) and the lack of biological meaning in association analysis. Results: In this paper, a novel three-stage framework for SNP-ROI correlation analysis is proposed. First, a clustering algorithm is applied to remove the potential linkage disequilibrium structure between SNPs. Then, a group sparse model is used to introduce prior information, such as gene structure and linkage disequilibrium structure, to select feature SNPs. After these steps, each SNP has a weight vector corresponding to each ROI, and the importance of SNPs can be judged from the weights in the feature vector, so that the feature SNPs can be selected. Finally, for the selected feature SNPs, a support vector machine regression model is used to predict the ROI phenotype values. The experimental results under multiple performance measures show that the proposed method is more accurate than other methods.
Affiliation(s)
- Juan Zhou: School of Software, East China Jiaotong University, Nanchang, China
- Yangping Qiu: School of Software, East China Jiaotong University, Nanchang, China
- Shuo Chen: School of Software, East China Jiaotong University, Nanchang, China
- Liyue Liu: School of Software, East China Jiaotong University, Nanchang, China
- Huifa Liao: School of Software, East China Jiaotong University, Nanchang, China
- Hongli Chen: School of Software, East China Jiaotong University, Nanchang, China
- Shanguo Lv: School of Software, East China Jiaotong University, Nanchang, China
- Xiong Li: School of Software, East China Jiaotong University, Nanchang, China
14
Huang EW, Bhope A, Lim J, Sinha S, Emad A. Tissue-guided LASSO for prediction of clinical drug response using preclinical samples. PLoS Comput Biol 2020; 16:e1007607. [PMID: 31967990 PMCID: PMC6975549 DOI: 10.1371/journal.pcbi.1007607] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 08/06/2019] [Accepted: 12/15/2019] [Indexed: 12/12/2022]
Abstract
Prediction of clinical drug response (CDR) of cancer patients, based on their clinical and molecular profiles obtained prior to administration of the drug, can play a significant role in individualized medicine. Machine learning models have the potential to address this issue but training them requires data from a large number of patients treated with each drug, limiting their feasibility. While large databases of drug response and molecular profiles of preclinical in-vitro cancer cell lines (CCLs) exist for many drugs, it is unclear whether preclinical samples can be used to predict CDR of real patients. We designed a systematic approach to evaluate how well different algorithms, trained on gene expression and drug response of CCLs, can predict CDR of patients. Using data from two large databases, we evaluated various linear and non-linear algorithms, some of which utilized information on gene interactions. Then, we developed a new algorithm called TG-LASSO that explicitly integrates information on samples' tissue of origin with gene expression profiles to improve prediction performance. Our results showed that regularized regression methods provide better prediction performance. However, including the network information or common methods of including information on the tissue of origin did not improve the results. On the other hand, TG-LASSO improved the predictions and distinguished resistant and sensitive patients for 7 out of 13 drugs. Additionally, TG-LASSO identified genes associated with the drug response, including known targets and pathways involved in the drugs' mechanism of action. Moreover, genes identified by TG-LASSO for multiple drugs in a tissue were associated with patient survival. In summary, our analysis suggests that preclinical samples can be used to predict CDR of patients and identify biomarkers of drug sensitivity and survival.
Affiliation(s)
- Edward W. Huang: Department of Computer Science, University of Illinois at Urbana-Champaign, Illinois, United States of America
- Ameya Bhope: Department of Electrical and Computer Engineering, McGill University, Canada
- Jing Lim: Department of Computer Science, University of Illinois at Urbana-Champaign, Illinois, United States of America
- Saurabh Sinha: Department of Computer Science, University of Illinois at Urbana-Champaign, Illinois, United States of America; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Illinois, United States of America; Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Illinois, United States of America
- Amin Emad: Department of Electrical and Computer Engineering, McGill University, Canada
15
Bovo S, Mazzoni G, Bertolini F, Schiavo G, Galimberti G, Gallo M, Dall'Olio S, Fontanesi L. Genome-wide association studies for 30 haematological and blood clinical-biochemical traits in Large White pigs reveal genomic regions affecting intermediate phenotypes. Sci Rep 2019; 9:7003. [PMID: 31065004 PMCID: PMC6504931 DOI: 10.1038/s41598-019-43297-1] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Received: 10/03/2018] [Accepted: 04/16/2019] [Indexed: 12/20/2022]
Abstract
Haematological and clinical-biochemical parameters are considered indicators of the physiological/health status of animals and might serve as intermediate phenotypes to link physiological aspects to production and disease resistance traits. The dissection of the genetic variability affecting these phenotypes might be useful to describe the resilience of the animals and to support the usefulness of the pig as animal model. Here, we analysed 15 haematological and 15 clinical-biochemical traits in 843 Italian Large White pigs, via three genome-wide association scan approaches (single-trait, multi-trait and Bayesian). We identified 52 quantitative trait loci (QTLs) associated with 29 out of 30 analysed blood parameters, with the most significant QTL identified on porcine chromosome 14 for basophil count. Some QTL regions harbour genes that may be the obvious candidates: QTLs for cholesterol parameters identified genes (ADCY8, APOB, ATG5, CDKAL1, PCSK5, PRL and SOX6) that are directly involved in cholesterol metabolism; other QTLs highlighted genes encoding the enzymes being measured [ALT (known also as GPT) and AST (known also as GOT)]. Moreover, the multivariate approach strengthened the association results for several candidate genes. The obtained results can contribute to define new measurable phenotypes that could be applied in breeding programs as proxies for more complex traits.
Affiliation(s)
- Samuele Bovo: Department of Agricultural and Food Sciences, Division of Animal Sciences, University of Bologna, Viale G. Fanin 46, 40127, Bologna, Italy
- Gianluca Mazzoni: Department of Health Technology, Technical University of Denmark (DTU), Lyngby, 2800, Denmark
- Francesca Bertolini: National Institute of Aquatic Resources, Technical University of Denmark (DTU), Lyngby, 2800, Denmark
- Giuseppina Schiavo: Department of Agricultural and Food Sciences, Division of Animal Sciences, University of Bologna, Viale G. Fanin 46, 40127, Bologna, Italy
- Giuliano Galimberti: Department of Statistical Sciences "Paolo Fortunati", University of Bologna, Via delle Belle Arti 41, 40126, Bologna, Italy
- Maurizio Gallo: Associazione Nazionale Allevatori Suini (ANAS), Via Nizza 53, 00198, Roma, Italy
- Stefania Dall'Olio: Department of Agricultural and Food Sciences, Division of Animal Sciences, University of Bologna, Viale G. Fanin 46, 40127, Bologna, Italy
- Luca Fontanesi: Department of Agricultural and Food Sciences, Division of Animal Sciences, University of Bologna, Viale G. Fanin 46, 40127, Bologna, Italy
16
Ho DSW, Schierding W, Wake M, Saffery R, O’Sullivan J. Machine Learning SNP Based Prediction for Precision Medicine. Front Genet 2019; 10:267. [PMID: 30972108 PMCID: PMC6445847 DOI: 10.3389/fgene.2019.00267] [Citation(s) in RCA: 113] [Impact Index Per Article: 18.8] [Received: 10/15/2018] [Accepted: 03/11/2019] [Indexed: 12/17/2022]
Abstract
In the past decade, precision genomics based medicine has emerged to provide tailored and effective healthcare for patients depending upon their genetic features. Genome Wide Association Studies have also identified population based risk genetic variants for common and complex diseases. In order to meet the full promise of precision medicine, research is attempting to leverage our increasing genomic understanding and further develop personalized medical healthcare through ever more accurate disease risk prediction models. Polygenic risk scoring and machine learning are two primary approaches for disease risk prediction. Despite recent improvements, the results of polygenic risk scoring remain limited due to the approaches that are currently used. By contrast, machine learning algorithms have increased predictive abilities for complex disease risk. This increase in predictive abilities results from the ability of machine learning algorithms to handle multi-dimensional data. Here, we provide an overview of polygenic risk scoring and machine learning in complex disease risk prediction. We highlight recent machine learning application developments and describe how machine learning approaches can lead to improved complex disease prediction, which will help to incorporate genetic features into future personalized healthcare. Finally, we discuss how the future application of machine learning prediction models might help manage complex disease by providing tissue-specific targets for customized, preventive interventions.
Affiliation(s)
- Melissa Wake: Murdoch Children Research Institute, Melbourne, VIC, Australia
- Richard Saffery: Murdoch Children Research Institute, Melbourne, VIC, Australia
17
Tang Z, Lei S, Zhang X, Yi Z, Guo B, Chen JY, Shen Y, Yi N. Gsslasso Cox: a Bayesian hierarchical model for predicting survival and detecting associated genes by incorporating pathway information. BMC Bioinformatics 2019; 20:94. [PMID: 30813883 PMCID: PMC6391807 DOI: 10.1186/s12859-019-2656-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 05/21/2018] [Accepted: 01/28/2019] [Indexed: 12/13/2022]
Abstract
BACKGROUND Group structures among genes encoded in functional relationships or biological pathways are valuable and unique features in large-scale molecular data for survival analysis. However, most previous approaches for molecular data analysis ignore such group structures. It is desirable to develop powerful analytic methods that incorporate valuable pathway information for predicting disease survival outcomes and detecting associated genes. RESULTS We here propose a Bayesian hierarchical Cox survival model, called the group spike-and-slab lasso Cox (gsslasso Cox), for predicting disease survival outcomes and detecting associated genes by incorporating group structures of biological pathways. Our hierarchical model employs a novel prior on the coefficients of genes, i.e., the group spike-and-slab double-exponential distribution, to integrate group structures and to adaptively shrink the effects of genes. We have developed a fast and stable deterministic algorithm to fit the proposed models. We performed extensive simulation studies to assess the model fitting properties and the prognostic performance of the proposed method, and also applied our method to analyze three cancer data sets. CONCLUSIONS Both the theoretical and empirical studies show that the proposed method can induce weaker shrinkage on predictors in an active pathway, thereby incorporating the biological similarity of genes within the same pathway into the hierarchical modeling. Compared with several existing methods, the proposed method can more accurately estimate gene effects and can better predict survival outcomes. For the three cancer data sets, the results show that the proposed method generates more powerful models for survival prediction and detecting associated genes. The method has been implemented in a freely available R package BhGLM at https://github.com/nyiuab/BhGLM.
Affiliation(s)
- Zaixiang Tang: Department of Biostatistics, School of Public Health, Medical College of Soochow University, University of Alabama at Birmingham, Suzhou, 215123 China; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, 215123 China; Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294-0022 USA
- Shufeng Lei: Department of Biostatistics, School of Public Health, Medical College of Soochow University, University of Alabama at Birmingham, Suzhou, 215123 China; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, 215123 China
- Xinyan Zhang: Department of Biostatistics, Jiann-Ping Hsu College of Public Health, Georgia Southern University, Statesboro, GA 30458 USA
- Zixuan Yi: Eastern Virginia Medical School, Norfolk, VA 23507 USA
- Boyi Guo: Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294-0022 USA
- Jake Y. Chen: Informatics Institute, School of Medicine, University of Alabama at Birmingham, Birmingham, AL 35294 USA
- Yueping Shen: Department of Biostatistics, School of Public Health, Medical College of Soochow University, University of Alabama at Birmingham, Suzhou, 215123 China; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, 215123 China
- Nengjun Yi: Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294-0022 USA
18
Mongia A, Sengupta D, Majumdar A. McImpute: Matrix Completion Based Imputation for Single Cell RNA-seq Data. Front Genet 2019; 10:9. [PMID: 30761179 PMCID: PMC6361810 DOI: 10.3389/fgene.2019.00009] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Received: 08/08/2018] [Accepted: 01/10/2019] [Indexed: 01/10/2023]
Abstract
Motivation: Single-cell RNA sequencing has been proved to be revolutionary for its potential of zooming into complex biological systems. Genome-wide expression analysis at single-cell resolution provides a window into dynamics of cellular phenotypes. This facilitates the characterization of transcriptional heterogeneity in normal and diseased tissues under various conditions. It also sheds light on the development or emergence of specific cell populations and phenotypes. However, owing to the paucity of input RNA, a typical single cell RNA sequencing data features a high number of dropout events where transcripts fail to get amplified. Results: We introduce mcImpute, a low-rank matrix completion based technique to impute dropouts in single cell expression data. On a number of real datasets, application of mcImpute yields significant improvements in the separation of true zeros from dropouts, cell-clustering, differential expression analysis, cell type separability, the performance of dimensionality reduction techniques for cell visualization, and gene distribution. Availability and Implementation: https://github.com/aanchalMongia/McImpute_scRNAseq.
Affiliation(s)
- Aanchal Mongia: Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India
- Debarka Sengupta: Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India; Center for Computational Biology, Indraprastha Institute of Information Technology Delhi, New Delhi, India
- Angshul Majumdar: Department of Electronics and Communications Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India
19
Tang Z, Shen Y, Li Y, Zhang X, Wen J, Qian C, Zhuang W, Shi X, Yi N. Group spike-and-slab lasso generalized linear models for disease prediction and associated genes detection by incorporating pathway information. Bioinformatics 2018; 34:901-910. [PMID: 29077795 PMCID: PMC5860634 DOI: 10.1093/bioinformatics/btx684] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 06/15/2017] [Revised: 10/05/2017] [Accepted: 10/24/2017] [Indexed: 01/10/2023]
Abstract
Motivation: Large-scale molecular data have been increasingly used as an important resource for prognostic prediction of diseases and detection of associated genes. However, standard approaches for omics data analysis ignore the group structure among genes encoded in functional relationships or pathway information. Results: We propose new Bayesian hierarchical generalized linear models, called group spike-and-slab lasso GLMs, for predicting disease outcomes and detecting associated genes by incorporating large-scale molecular data and group structures. The proposed model employs a mixture double-exponential prior for coefficients that induces a self-adaptive amount of shrinkage on different coefficients. The group information is incorporated into the model by setting group-specific parameters. We have developed a fast and stable deterministic algorithm to fit the proposed hierarchical GLMs, which can perform variable selection within groups. We assess the performance of the proposed method on several simulated scenarios, by varying the overlap among groups, group size, number of non-null groups, and the correlation within groups. Compared with existing methods, the proposed method provides not only more accurate estimates of the parameters but also better prediction. We further demonstrate the application of the proposed procedure on three cancer datasets by utilizing pathway structures of genes. Our results show that the proposed method generates powerful models for predicting disease outcomes and detecting associated genes. Availability and implementation: The methods have been implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/). Contact: nyi@uab.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zaixiang Tang: Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China; Center for Genetic Epidemiology and Genomics, Medical College of Soochow University, Suzhou, China; Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Yueping Shen: Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China
- Yan Li: Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Xinyan Zhang: Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Jia Wen: Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, USA
- Chen’ao Qian: Department of Bioinformatics, School of Biology & Basic Medical Science, Soochow University, Suzhou, China
- Wenzhuo Zhuang: Department of Cell Biology, School of Biology & Basic Medical Science, Soochow University, Suzhou, China
- Xinghua Shi: Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, USA
- Nengjun Yi: Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
20
Krishnan ML, Wang Z, Silver M, Boardman JP, Ball G, Counsell SJ, Walley AJ, Montana G, Edwards AD. Possible relationship between common genetic variation and white matter development in a pilot study of preterm infants. Brain Behav 2016; 6:e00434. [PMID: 27110435 PMCID: PMC4821839 DOI: 10.1002/brb3.434] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 09/08/2015] [Revised: 12/16/2015] [Accepted: 12/19/2015] [Indexed: 12/19/2022]
Abstract
BACKGROUND The consequences of preterm birth are a major public health concern with high rates of ensuing multisystem morbidity, and uncertain biological mechanisms. Common genetic variation may mediate vulnerability to the insult of prematurity and provide opportunities to predict and modify risk. OBJECTIVE To gain novel biological and therapeutic insights from the integrated analysis of magnetic resonance imaging and genetic data, informed by prior knowledge. METHODS We apply our previously validated pathway-based statistical method and a novel network-based method to discover sources of common genetic variation associated with imaging features indicative of structural brain damage. RESULTS Lipid pathways were highly ranked by Pathways Sparse Reduced Rank Regression in a model examining the effect of prematurity, and PPAR (peroxisome proliferator-activated receptor) signaling was the highest ranked pathway once degree of prematurity was accounted for. Within the PPAR pathway, five genes were found by Graph Guided Group Lasso to be highly associated with the phenotype: aquaporin 7 (AQP7), malic enzyme 1, NADP(+)-dependent, cytosolic (ME1), perilipin 1 (PLIN1), solute carrier family 27 (fatty acid transporter), member 1 (SLC27A1), and acetyl-CoA acyltransferase 1 (ACAA1). Expression of four of these (ACAA1, AQP7, ME1, and SLC27A1) is controlled by a common transcription factor, early growth response 4 (EGR-4). CONCLUSIONS This suggests an important role for lipid pathways in influencing development of white matter in preterm infants, and in particular a significant role for interindividual genetic variation in PPAR signaling.
Affiliation(s)
- Michelle L Krishnan: Centre for the Developing Brain, King's College London, St Thomas' Hospital, London SE1 7EH, UK
- Zi Wang: Department of Biomedical Engineering, King's College London, St Thomas' Hospital, London SE1 7EH, UK
- Matt Silver: Department of Population Health, London School of Hygiene and Tropical Medicine, London WC1E 7HT, UK
- James P Boardman: MRC Centre for Reproductive Health, University of Edinburgh, Edinburgh EH16 4TJ, UK
- Gareth Ball: Centre for the Developing Brain, King's College London, St Thomas' Hospital, London SE1 7EH, UK
- Serena J Counsell: Centre for the Developing Brain, King's College London, St Thomas' Hospital, London SE1 7EH, UK
- Andrew J Walley: School of Public Health, Faculty of Medicine, Imperial College London, Norfolk Place, London W2 1PG, UK
- Giovanni Montana: Department of Biomedical Engineering, King's College London, St Thomas' Hospital, London SE1 7EH, UK
- Anthony David Edwards: Centre for the Developing Brain, King's College London, St Thomas' Hospital, London SE1 7EH, UK
21
Kapur A, Marwah K, Alterovitz G. Gene expression prediction using low-rank matrix completion. BMC Bioinformatics 2016; 17:243. [PMID: 27317252 PMCID: PMC4912738 DOI: 10.1186/s12859-016-1106-6] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 11/20/2015] [Accepted: 05/28/2016] [Indexed: 11/25/2022]
Abstract
BACKGROUND An exponential growth of high-throughput biological information and data has occurred in the past decade, supported by technologies, such as microarrays and RNA-Seq. Most data generated using such methods are used to encode large amounts of rich information, and determine diagnostic and prognostic biomarkers. Although data storage costs have reduced, process of capturing data using aforementioned technologies is still expensive. Moreover, the time required for the assay, from sample preparation to raw value measurement is excessive (in the order of days). There is an opportunity to reduce both the cost and time for generating such expression datasets. RESULTS We propose a framework in which complete gene expression values can be reliably predicted in-silico from partial measurements. This is achieved by modelling expression data as a low-rank matrix and then applying recently discovered techniques of matrix completion by using nonlinear convex optimisation. We evaluated prediction of gene expression data based on 133 studies, sourced from a combined total of 10,921 samples. It is shown that such datasets can be constructed with a low relative error even at high missing value rates (>50 %), and that such predicted datasets can be reliably used as surrogates for further analysis. CONCLUSION This method has potentially far-reaching applications including how bio-medical data is sourced and generated, and transcriptomic prediction by optimisation. We show that gene expression data can be computationally constructed, thereby potentially reducing the costs of gene expression profiling. In conclusion, this method shows great promise of opening new avenues in research on low-rank matrix completion in biological sciences.
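The idea of predicting a full expression matrix from partial measurements by treating it as low-rank can be sketched with a generic iterative soft-thresholded SVD scheme. This is a minimal illustration on synthetic data, not the solver used in the paper: the function name `soft_impute`, the threshold `lam`, the iteration count, and the random rank-2 matrix are all assumptions.

```python
import numpy as np

def soft_impute(X, mask, lam=0.1, n_iter=200):
    """Fill unobserved entries of X with a low-rank estimate.

    mask is True where X was measured. Each iteration keeps the
    observed values, fills the rest with the current estimate, and
    shrinks the singular values by lam (a convex surrogate for
    limiting the rank).
    """
    Z = np.where(mask, X, 0.0)
    for _ in range(n_iter):
        filled = np.where(mask, X, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)   # soft-threshold singular values
        Z = (U * s) @ Vt
    return Z

# Synthetic stand-in for a genes-by-samples matrix of rank 2.
rng = np.random.default_rng(0)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
mask = rng.random(truth.shape) < 0.7   # roughly 70% of entries observed
estimate = soft_impute(truth, mask, lam=0.1)

# Relative error on the entries that were never observed.
rel_err = (np.linalg.norm((estimate - truth)[~mask])
           / np.linalg.norm(truth[~mask]))
```

Evaluating the error only on the held-out entries mirrors the setting in the abstract, where the predicted values are meant to stand in for measurements that were not taken.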
Affiliation(s)
- Arnav Kapur
  - Biomedical Cybernetics Laboratory, Harvard Medical School, Boston, MA 02115, USA
- Kshitij Marwah
  - Biomedical Cybernetics Laboratory, Harvard Medical School, Boston, MA 02115, USA
- Gil Alterovitz
  - Biomedical Cybernetics Laboratory, Harvard Medical School, Boston, MA 02115, USA
  - Department of Health Science and Technology, Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
|
22
|
Sokolov A, Carlin DE, Paull EO, Baertsch R, Stuart JM. Pathway-Based Genomics Prediction using Generalized Elastic Net. PLoS Comput Biol 2016; 12:e1004790. [PMID: 26960204 PMCID: PMC4784899 DOI: 10.1371/journal.pcbi.1004790] [Citation(s) in RCA: 69] [Impact Index Per Article: 7.7] [Received: 05/11/2015] [Accepted: 02/04/2016] [Indexed: 11/19/2022] Open
Abstract
We present a novel regularization scheme called the Generalized Elastic Net (GELnet) that incorporates gene pathway information into feature selection. The proposed formulation is applicable to a wide variety of problems in which the interpretation of predictive features using known molecular interactions is desired. The method naturally steers solutions toward sets of mechanistically interlinked genes. Using experiments on synthetic data, we demonstrate that pathway-guided results maintain, and often improve, the accuracy of predictors even in cases where the full gene network is unknown. We apply the method to predict the drug response of breast cancer cell lines. GELnet is able to reveal genetic determinants of sensitivity and resistance for several compounds. In particular, for an EGFR/HER2 inhibitor, it finds a possible trans-differentiation resistance mechanism missed by the corresponding pathway-agnostic approach.
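How a network-aware penalty can steer a model toward interlinked genes is easiest to see on a toy example. The sketch below assumes a penalty of the form lam1 * sum(d_j * |w_j|) + (lam2 / 2) * w' P w, with P taken to be the Laplacian of a gene network; the abstract does not state the formula, so this form, the function names, and the four-gene network are all illustrative assumptions rather than the published method.

```python
import numpy as np


def graph_laplacian(adjacency):
    """Unnormalised Laplacian (degree matrix minus adjacency) of an undirected graph."""
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


def network_elastic_net_penalty(w, d, P, lam1, lam2):
    """Assumed penalty: weighted L1 term plus a quadratic term shaped by P."""
    l1_part = lam1 * np.sum(d * np.abs(w))
    l2_part = 0.5 * lam2 * (w @ P @ w)
    return l1_part + l2_part


# Hypothetical network of four genes with two interactions: 0-1 and 2-3.
adjacency = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
P = graph_laplacian(adjacency)
d = np.ones(4)  # equal per-gene L1 weights

w_linked = np.array([1.0, 1.0, 0.0, 0.0])    # selects two interacting genes
w_unlinked = np.array([1.0, 0.0, 1.0, 0.0])  # selects two non-interacting genes

penalty_linked = network_elastic_net_penalty(w_linked, d, P, lam1=0.1, lam2=1.0)
penalty_unlinked = network_elastic_net_penalty(w_unlinked, d, P, lam1=0.1, lam2=1.0)

if __name__ == "__main__":
    print(penalty_linked, penalty_unlinked)
```

Both weight vectors select two genes and so have the same L1 term, but the quadratic term is zero when the selected genes are neighbours with equal weights and positive otherwise. Under this assumed form, a fitted model would therefore favour the linked pair when the two choices predict equally well.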
Affiliation(s)
- Artem Sokolov
  - Department of Biomolecular Engineering and Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Daniel E. Carlin
  - Department of Biomolecular Engineering and Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Evan O. Paull
  - Department of Biomolecular Engineering and Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Robert Baertsch
  - Department of Biomolecular Engineering and Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Joshua M. Stuart
  - Department of Biomolecular Engineering and Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
|
23
|
Mooney MA, Wilmot B. Gene set analysis: A step-by-step guide. Am J Med Genet B Neuropsychiatr Genet 2015; 168:517-27. [PMID: 26059482 PMCID: PMC4638147 DOI: 10.1002/ajmg.b.32328] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Received: 03/16/2015] [Accepted: 05/20/2015] [Indexed: 12/21/2022]
Abstract
To maximize the potential of genome-wide association studies, many researchers are performing secondary analyses to identify sets of genes jointly associated with the trait of interest. Although methods for gene-set analyses (GSA), also called pathway analyses, have been around for more than a decade, the field is still evolving. There are numerous algorithms available for testing the cumulative effect of multiple SNPs, yet no real consensus in the field about the best way to perform a GSA. This paper provides an overview of the factors that can affect the results of a GSA, the lessons learned from past studies, and suggestions for how to make analysis choices that are most appropriate for different types of data. © 2015 Wiley Periodicals, Inc.
Affiliation(s)
- Michael A. Mooney
  - Department of Medical Informatics & Clinical Epidemiology, Division of Bioinformatics & Computational Biology, Oregon Health & Science University, Portland, Oregon; OHSU Knight Cancer Institute, Portland, Oregon
- Beth Wilmot
  - Department of Medical Informatics & Clinical Epidemiology, Division of Bioinformatics & Computational Biology, Oregon Health & Science University, Portland, Oregon; OHSU Knight Cancer Institute, Portland, Oregon; Oregon Clinical and Translational Research Institute, Portland, Oregon
  - Correspondence to: Beth Wilmot, Department of Medical Informatics & Clinical Epidemiology, Division of Bioinformatics & Computational Biology, Oregon Health & Science University, Portland, OR 97239.
|
24
|
Begun J, Lassen KG, Jijon HB, Baxt LA, Goel G, Heath RJ, Ng A, Tam JM, Kuo SY, Villablanca EJ, Fagbami L, Oosting M, Kumar V, Schenone M, Carr SA, Joosten LAB, Vyas JM, Daly MJ, Netea MG, Brown GD, Wijmenga C, Xavier RJ. Integrated Genomics of Crohn's Disease Risk Variant Identifies a Role for CLEC12A in Antibacterial Autophagy. Cell Rep 2015; 11:1905-18. [PMID: 26095365 PMCID: PMC4507440 DOI: 10.1016/j.celrep.2015.05.045] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 01/12/2015] [Revised: 05/04/2015] [Accepted: 05/26/2015] [Indexed: 01/07/2023] Open
Abstract
The polymorphism ATG16L1 T300A, associated with increased risk of Crohn’s disease, impairs pathogen defense mechanisms including selective autophagy, but specific pathway interactions altered by the risk allele remain unknown. Here, we use perturbational profiling of human peripheral blood cells to reveal that CLEC12A is regulated in an ATG16L1-T300A-dependent manner. Antibacterial autophagy is impaired in CLEC12A-deficient cells, and this effect is exacerbated in the presence of the ATG16L1*300A risk allele. Clec12a−/− mice are more susceptible to Salmonella infection, supporting a role for CLEC12A in antibacterial defense pathways in vivo. CLEC12A is recruited to sites of bacterial entry, bacteria-autophagosome complexes, and sites of sterile membrane damage. Integrated genomics identified a functional interaction between CLEC12A and an E3-ubiquitin ligase complex that functions in antibacterial autophagy. These data identify CLEC12A as an early adaptor molecule for antibacterial autophagy and highlight perturbational profiling as a method to elucidate defense pathways in complex genetic disease.
Highlights:
- Integrated genomics reveals risk-allele-specific autophagy pathway interactions
- CLEC12A is important for antibacterial autophagy in epithelial and immune cells
- CLEC12A knockdown amplifies antibacterial autophagy defects in ATG16L1*300A cells
- Clec12a−/− mice are more susceptible to Salmonella infection in vivo
Affiliation(s)
- Jakob Begun
  - Gastrointestinal Unit and Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Boston, MA 02114, USA; Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA 02114, USA; Mater Research Institute, University of Queensland, Brisbane, QLD 4101, Australia
- Kara G Lassen
  - Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA 02114, USA; The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Humberto B Jijon
  - Gastrointestinal Unit and Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Boston, MA 02114, USA; Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA 02114, USA; Division of Gastroenterology, Department of Medicine, University of Calgary, Calgary, AB T2N 4N1, Canada
- Leigh A Baxt
  - Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA 02114, USA
- Gautam Goel
  - Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA 02114, USA
- Robert J Heath
  - Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA 02114, USA; The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Aylwin Ng
  - Gastrointestinal Unit and Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Boston, MA 02114, USA; Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA 02114, USA; The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Jenny M Tam
  - Division of Infectious Diseases, Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
- Szu-Yu Kuo
  - The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Eduardo J Villablanca
  - Gastrointestinal Unit and Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Boston, MA 02114, USA; Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA 02114, USA; The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Lola Fagbami
  - The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Marije Oosting
  - Department of Internal Medicine and Radboud Center for Infectious Diseases, Radboud University Nijmegen Medical Center, Nijmegen 6525 GA, the Netherlands
- Vinod Kumar
  - Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen 9700 RB, the Netherlands
- Monica Schenone
  - The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Steven A Carr
  - The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Leo A B Joosten
  - Department of Internal Medicine and Radboud Center for Infectious Diseases, Radboud University Nijmegen Medical Center, Nijmegen 6525 GA, the Netherlands
- Jatin M Vyas
  - Division of Infectious Diseases, Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
- Mark J Daly
  - The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
- Mihai G Netea
  - Department of Internal Medicine and Radboud Center for Infectious Diseases, Radboud University Nijmegen Medical Center, Nijmegen 6525 GA, the Netherlands
- Gordon D Brown
  - Aberdeen Fungal Group, Division of Applied Medicine, CLSM, Institute of Medical Sciences, University of Aberdeen, Foresterhill, Aberdeen AB25 2ZD, UK
- Cisca Wijmenga
  - Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen 9700 RB, the Netherlands
- Ramnik J Xavier
  - Gastrointestinal Unit and Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Boston, MA 02114, USA; Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA 02114, USA; The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
|
25
|
Okser S, Pahikkala T, Airola A, Salakoski T, Ripatti S, Aittokallio T. Regularized machine learning in the genetic prediction of complex traits. PLoS Genet 2014; 10:e1004754. [PMID: 25393026 PMCID: PMC4230844 DOI: 10.1371/journal.pgen.1004754] [Citation(s) in RCA: 94] [Impact Index Per Article: 8.5] [Indexed: 01/14/2023] Open
Affiliation(s)
- Sebastian Okser
  - Department of Information Technology, University of Turku, Turku, Finland
  - Turku Centre for Computer Science (TUCS), University of Turku and Åbo Akademi University, Turku, Finland
- Tapio Pahikkala
  - Department of Information Technology, University of Turku, Turku, Finland
  - Turku Centre for Computer Science (TUCS), University of Turku and Åbo Akademi University, Turku, Finland
- Antti Airola
  - Department of Information Technology, University of Turku, Turku, Finland
  - Turku Centre for Computer Science (TUCS), University of Turku and Åbo Akademi University, Turku, Finland
- Tapio Salakoski
  - Department of Information Technology, University of Turku, Turku, Finland
  - Turku Centre for Computer Science (TUCS), University of Turku and Åbo Akademi University, Turku, Finland
- Samuli Ripatti
  - Hjelt Institute, University of Helsinki, Helsinki, Finland
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
  - Wellcome Trust Sanger Institute, Hinxton, United Kingdom
- Tero Aittokallio
  - Turku Centre for Computer Science (TUCS), University of Turku and Åbo Akademi University, Turku, Finland
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
|
26
|
Kabamba AT, Bakari SA, Longanga AO, Lukumwena ZK. [Decrease in HDL-cholesterol indicator of oxidative stress in type 2 diabetes]. Pan Afr Med J 2014; 19:140. [PMID: 25767660 PMCID: PMC4345204 DOI: 10.11604/pamj.2014.19.140.5279] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/22/2014] [Accepted: 09/18/2014] [Indexed: 12/12/2022] Open
Affiliation(s)
- Salvius Amuri Bakari
  - Faculté des Sciences Pharmaceutiques, Université de Lubumbashi, République Démocratique du Congo
- Albert Otshudi Longanga
  - Faculté des Sciences Pharmaceutiques, Université de Lubumbashi, République Démocratique du Congo; Université Libre de Bruxelles (ULB), Belgique
- Zet Kalala Lukumwena
  - Faculté de Médecine Vétérinaire, Université de Lubumbashi, République Démocratique du Congo
|
27
|
Lin D, Cao H, Calhoun VD, Wang YP. Sparse models for correlative and integrative analysis of imaging and genetic data. J Neurosci Methods 2014; 237:69-78. [PMID: 25218561 DOI: 10.1016/j.jneumeth.2014.09.001] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Received: 05/05/2014] [Revised: 08/27/2014] [Accepted: 09/01/2014] [Indexed: 11/29/2022]
Abstract
The development of advanced medical imaging technologies and high-throughput genomic measurements has enhanced our ability to understand their interplay, as well as their relationship with human behavior, by integrating these two types of datasets. However, the high dimensionality and heterogeneity of these datasets present a challenge to conventional statistical methods; there is a high demand for the development of both correlative and integrative analysis approaches. Here, we review our recent work on developing sparse-representation-based approaches to address this challenge. We show how sparse models are applied to the correlation and integration of imaging and genetic data for biomarker identification. We present examples of how these approaches are used for the detection of risk genes and the classification of complex diseases such as schizophrenia. Finally, we discuss future directions for the integration of multiple imaging and genomic datasets, including their interactions such as epistasis.
Affiliation(s)
- Dongdong Lin
  - Department of Biomedical Engineering, Tulane University, New Orleans, LA 70118, USA; Center of Genomics and Bioinformatics, Tulane University, New Orleans, LA 70112, USA
- Hongbao Cao
  - Unit on Statistical Genomics, Intramural Program of Research, National Institute of Mental Health, NIH, Bethesda 20852, USA
- Vince D Calhoun
  - The Mind Research Network & LBERI, Albuquerque, NM 87106, USA; Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131, USA
- Yu-Ping Wang
  - Department of Biomedical Engineering, Tulane University, New Orleans, LA 70118, USA; Center of Genomics and Bioinformatics, Tulane University, New Orleans, LA 70112, USA
|
28
|
Mooney MA, Nigg JT, McWeeney SK, Wilmot B. Functional and genomic context in pathway analysis of GWAS data. Trends Genet 2014; 30:390-400. [PMID: 25154796 DOI: 10.1016/j.tig.2014.07.004] [Citation(s) in RCA: 86] [Impact Index Per Article: 7.8] [Received: 06/12/2014] [Revised: 07/18/2014] [Accepted: 07/18/2014] [Indexed: 02/07/2023]
Abstract
Gene set analysis (GSA) is a promising tool for uncovering the polygenic effects associated with complex diseases. However, the available techniques reflect a wide variety of hypotheses about how genetic effects interact to contribute to disease susceptibility. The lack of consensus about the best way to perform GSA has led to confusion in the field and has made it difficult to compare results across methods. A clear understanding of the various choices made during GSA - such as how gene sets are defined, how single-nucleotide polymorphisms (SNPs) are assigned to genes, and how individual SNP-level effects are aggregated to produce gene- or pathway-level effects - will improve the interpretability and comparability of results across methods and studies. In this review we provide an overview of the various data sources used to construct gene sets and the statistical methods used to test for gene set association, as well as provide guidelines for ensuring the comparability of results.
Affiliation(s)
- Michael A Mooney
  - Division of Bioinformatics and Computational Biology, Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA; OHSU Knight Cancer Institute, Portland, OR, USA
- Joel T Nigg
  - Division of Psychology, Department of Psychiatry, Oregon Health & Science University, Portland, OR, USA; Department of Behavioral Neuroscience, Oregon Health & Science University, Portland, OR, USA
- Shannon K McWeeney
  - Division of Bioinformatics and Computational Biology, Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA; Oregon Clinical and Translational Research Institute, Portland, OR, USA; OHSU Knight Cancer Institute, Portland, OR, USA
- Beth Wilmot
  - Division of Bioinformatics and Computational Biology, Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA; Oregon Clinical and Translational Research Institute, Portland, OR, USA; OHSU Knight Cancer Institute, Portland, OR, USA
|