1
|
Guo X, Han J, Song Y, Yin Z, Liu S, Shang X. Using expression quantitative trait loci data and graph-embedded neural networks to uncover genotype–phenotype interactions. Front Genet 2022; 13:921775. [PMID: 36046233 PMCID: PMC9421127 DOI: 10.3389/fgene.2022.921775] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2022] [Accepted: 07/04/2022] [Indexed: 11/13/2022] Open
Abstract
Motivation: A central goal of current biology is to establish a complete functional link between the genotype and phenotype, known as the so-called genotype–phenotype map. With the continuous development of high-throughput technology and the decline in sequencing costs, multi-omics analysis has become more widely employed. While this gives us new opportunities to uncover the correlation mechanisms between single-nucleotide polymorphism (SNP), genes, and phenotypes, multi-omics still faces certain challenges, specifically: 1) When the sample size is large enough, the number of omics types is often not large enough to meet the requirements of multi-omics analysis; 2) each omics’ internal correlations are often unclear, such as the correlation between genes in genomics; 3) when analyzing a large number of traits (p), the sample size (n) is often smaller than p, n << p, hindering the application of machine learning methods in the classification of disease outcomes.Results: To solve these issues with multi-omics and build a robust classification model, we propose a graph-embedded deep neural network (G-EDNN) based on expression quantitative trait loci (eQTL) data, which achieves sparse connectivity between network layers to prevent overfitting. The correlation within each omics is also considered such that the model more closely resembles biological reality. To verify the capabilities of this method, we conducted experimental analysis using the GSE28127 and GSE95496 data sets from the Gene Expression Omnibus (GEO) database, tested various neural network architectures, and used prior data for feature selection and graph embedding. Results show that the proposed method could achieve a high classification accuracy and easy-to-interpret feature selection. This method represents an extended application of genotype–phenotype association analysis in deep learning networks.
Collapse
Affiliation(s)
- Xinpeng Guo
- School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, China
- School of Air and Missile Defense, Air Force Engineering University, Xi’an, China
| | - Jinyu Han
- School of Economics and Management, Chang ‘an University, Xi’an, China
| | - Yafei Song
- School of Air and Missile Defense, Air Force Engineering University, Xi’an, China
| | - Zhilei Yin
- School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, China
| | - Shuaichen Liu
- School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China
| | - Xuequn Shang
- School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, China
- *Correspondence: Xuequn Shang,
| |
Collapse
|
2
|
Burns JJR, Shealy BT, Greer MS, Hadish JA, McGowan MT, Biggs T, Smith MC, Feltus FA, Ficklin SP. Addressing noise in co-expression network construction. Brief Bioinform 2021; 23:6446269. [PMID: 34850822 PMCID: PMC8769892 DOI: 10.1093/bib/bbab495] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2021] [Revised: 10/25/2021] [Accepted: 10/28/2021] [Indexed: 11/13/2022] Open
Abstract
Gene co-expression networks (GCNs) provide multiple benefits to molecular research including hypothesis generation and biomarker discovery. Transcriptome profiles serve as input for GCN construction and are derived from increasingly larger studies with samples across multiple experimental conditions, treatments, time points, genotypes, etc. Such experiments with larger numbers of variables confound discovery of true network edges, exclude edges and inhibit discovery of context (or condition) specific network edges. To demonstrate this problem, a 475-sample dataset is used to show that up to 97% of GCN edges can be misleading because correlations are false or incorrect. False and incorrect correlations can occur when tests are applied without ensuring assumptions are met, and pairwise gene expression may not meet test assumptions if the expression of at least one gene in the pairwise comparison is a function of multiple confounding variables. The ‘one-size-fits-all’ approach to GCN construction is therefore problematic for large, multivariable datasets. Recently, the Knowledge Independent Network Construction toolkit has been used in multiple studies to provide a dynamic approach to GCN construction that ensures statistical tests meet assumptions and confounding variables are addressed. Additionally, it can associate experimental context for each edge of the network resulting in context-specific GCNs (csGCNs). To help researchers recognize such challenges in GCN construction, and the creation of csGCNs, we provide a review of the workflow.
Collapse
Affiliation(s)
- Joshua J R Burns
- Department of Horticulture, 149 Johnson Hall. Washington State University, Pullman, WA 99164. USA
| | - Benjamin T Shealy
- Department of Electrical & Computer Engineering, 105 Riggs Hall. Clemson University, Clemson, SC 29631. USA
| | - Mitchell S Greer
- School of Electrical Engineering and Computer Science, EME 102. Washington State University, Pullman, WA 99164. USA
| | - John A Hadish
- Molecular Plant Sciences Program, French Ad 324g. Washington State University, Pullman, WA 99164. USA
| | - Matthew T McGowan
- Molecular Plant Sciences Program, French Ad 324g. Washington State University, Pullman, WA 99164. USA
| | - Tyler Biggs
- Department of Horticulture, 149 Johnson Hall. Washington State University, Pullman, WA 99164. USA
| | - Melissa C Smith
- Department of Electrical & Computer Engineering, 105 Riggs Hall. Clemson University, Clemson, SC 29631. USA
| | - F Alex Feltus
- Department of Genetics and Biochemistry, 130 McGinty Court. Clemson University, Clemson, SC 29634. USA.,Biomedical Data Science & Informatics Program, 100 McAdams Hall. Clemson University, Clemson, SC 29634. USA.,Clemson Center for Human Genetics, 114 Gregor Mendel Circle, Greenwood, SC 29646. USA
| | - Stephen P Ficklin
- Department of Horticulture, 149 Johnson Hall. Washington State University, Pullman, WA 99164. USA.,School of Electrical Engineering and Computer Science, EME 102. Washington State University, Pullman, WA 99164. USA
| |
Collapse
|
3
|
Guo X, Song Y, Liu S, Gao M, Qi Y, Shang X. Linking genotype to phenotype in multi-omics data of small sample. BMC Genomics 2021; 22:537. [PMID: 34256701 PMCID: PMC8278664 DOI: 10.1186/s12864-021-07867-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2021] [Accepted: 06/30/2021] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND Genome-wide association studies (GWAS) that link genotype to phenotype represent an effective means to associate an individual genetic background with a disease or trait. However, single-omics data only provide limited information on biological mechanisms, and it is necessary to improve the accuracy for predicting the biological association between genotype and phenotype by integrating multi-omics data. Typically, gene expression data are integrated to analyze the effect of single nucleotide polymorphisms (SNPs) on phenotype. Such multi-omics data integration mainly follows two approaches: multi-staged analysis and meta-dimensional analysis, which respectively ignore intra-omics and inter-omics associations. Moreover, both approaches require omics data from a single sample set, and the large feature set of SNPs necessitates a large sample size for model establishment, but it is difficult to obtain multi-omics data from a single, large sample set. RESULTS To address this problem, we propose a method of genotype-phenotype association based on multi-omics data from small samples. The workflow of this method includes clustering genes using a protein-protein interaction network and gene expression data, screening gene clusters with group lasso, obtaining SNP clusters corresponding to the selected gene clusters through expression quantitative trait locus data, integrating SNP clusters and corresponding gene clusters and phenotypes into three-layer network blocks, analyzing and predicting based on each block, and obtaining the final prediction by taking the average. CONCLUSIONS We compare this method to others using two datasets and find that our method shows better results in both cases. Our method can effectively solve the prediction problem in multi-omics data of small sample, and provide valuable resources for further studies on the fusion of more omics data.
Collapse
Affiliation(s)
- Xinpeng Guo
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, People's Republic of China
- School of Air and Missile Defense, Air Force Engineering University, Xi'an, 710051, People's Republic of China
| | - Yafei Song
- School of Air and Missile Defense, Air Force Engineering University, Xi'an, 710051, People's Republic of China
| | - Shuhui Liu
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, People's Republic of China
| | - Meihong Gao
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, People's Republic of China
| | - Yang Qi
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, People's Republic of China
| | - Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, People's Republic of China.
| |
Collapse
|
4
|
Krawczyk PS, Lipinski L, Dziembowski A. PlasFlow: predicting plasmid sequences in metagenomic data using genome signatures. Nucleic Acids Res 2019; 46:e35. [PMID: 29346586 PMCID: PMC5887522 DOI: 10.1093/nar/gkx1321] [Citation(s) in RCA: 345] [Impact Index Per Article: 57.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2017] [Accepted: 12/28/2017] [Indexed: 12/14/2022] Open
Abstract
Plasmids are mobile genetics elements that play an important role in the environmental adaptation of microorganisms. Although plasmids are usually analyzed in cultured microorganisms, there is a need for methods that allow for the analysis of pools of plasmids (plasmidomes) in environmental samples. To that end, several molecular biology and bioinformatics methods have been developed; however, they are limited to environments with low diversity and cannot recover large plasmids. Here, we present PlasFlow, a novel tool based on genomic signatures that employs a neural network approach for identification of bacterial plasmid sequences in environmental samples. PlasFlow can recover plasmid sequences from assembled metagenomes without any prior knowledge of the taxonomical or functional composition of samples with an accuracy up to 96%. It can also recover sequences of both circular and linear plasmids and can perform initial taxonomical classification of sequences. Compared to other currently available tools, PlasFlow demonstrated significantly better performance on test datasets. Analysis of two samples from heavy metal-contaminated microbial mats revealed that plasmids may constitute an important fraction of their metagenomes and carry genes involved in heavy-metal homeostasis, proving the pivotal role of plasmids in microorganism adaptation to environmental conditions.
Collapse
Affiliation(s)
- Pawel S Krawczyk
- Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warsaw, Poland.,Department of Genetics and Biotechnology, Faculty of Biology, University of Warsaw, Pawinskiego 5a, 02-106 Warsaw, Poland
| | - Leszek Lipinski
- Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warsaw, Poland
| | - Andrzej Dziembowski
- Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warsaw, Poland.,Department of Genetics and Biotechnology, Faculty of Biology, University of Warsaw, Pawinskiego 5a, 02-106 Warsaw, Poland
| |
Collapse
|
5
|
Yuan L, Huang DS. A Network-guided Association Mapping Approach from DNA Methylation to Disease. Sci Rep 2019; 9:5601. [PMID: 30944378 PMCID: PMC6447594 DOI: 10.1038/s41598-019-42010-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2018] [Accepted: 03/12/2019] [Indexed: 01/11/2023] Open
Abstract
Aberrant DNA methylation may contribute to development of cancer. However, understanding the associations between DNA methylation and cancer remains a challenge because of the complex mechanisms involved in the associations and insufficient sample sizes. The unprecedented wealth of DNA methylation, gene expression and disease status data give us a new opportunity to design machine learning methods to investigate the underlying associated mechanisms. In this paper, we propose a network-guided association mapping approach from DNA methylation to disease (NAMDD). Compared with existing methods, NAMDD finds methylation-disease path associations by integrating analysis of multiple data combined with a stability selection strategy, thereby mining more information in the datasets and improving the quality of resultant methylation sites. The experimental results on both synthetic and real ovarian cancer data show that NAMDD substantially outperforms former disease-related methylation site research methods (including NsRRR and PCLOGIT) under false positive control. Furthermore, we applied NAMDD to ovarian cancer data, identified significant path associations and provided hypothetical biological path associations to explain our findings.
Collapse
Affiliation(s)
- Lin Yuan
- Institute of Machine Learning and Systems Biology, College of Electronic and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, College of Electronic and Information Engineering, Tongji University, Shanghai, 201804, P.R. China.
| |
Collapse
|
6
|
Lee S, Gornitz N, Xing EP, Heckerman D, Lippert C. Ensembles of Lasso Screening Rules. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2018; 40:2841-2852. [PMID: 29989981 PMCID: PMC6925025 DOI: 10.1109/tpami.2017.2765321] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
In order to solve large-scale lasso problems, screening algorithms have been developed that discard features with zero coefficients based on a computationally efficient screening rule. Most existing screening rules were developed from a spherical constraint and half-space constraints on a dual optimal solution. However, existing rules admit at most two half-space constraints due to the computational cost incurred by the half-spaces, even though additional constraints may be useful to discard more features. In this paper, we present AdaScreen, an adaptive lasso screening rule ensemble, which allows to combine any one sphere with multiple half-space constraints on a dual optimal solution. Thanks to geometrical considerations that lead to a simple closed form solution for AdaScreen, we can incorporate multiple half-space constraints at small computational cost. In our experiments, we show that AdaScreen with multiple half-space constraints simultaneously improves screening performance and speeds up lasso solvers.
Collapse
|
7
|
Machine learning identifies interacting genetic variants contributing to breast cancer risk: A case study in Finnish cases and controls. Sci Rep 2018; 8:13149. [PMID: 30177847 PMCID: PMC6120908 DOI: 10.1038/s41598-018-31573-5] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2018] [Accepted: 08/22/2018] [Indexed: 01/01/2023] Open
Abstract
We propose an effective machine learning approach to identify group of interacting single nucleotide polymorphisms (SNPs), which contribute most to the breast cancer (BC) risk by assuming dependencies among BCAC iCOGS SNPs. We adopt a gradient tree boosting method followed by an adaptive iterative SNP search to capture complex non-linear SNP-SNP interactions and consequently, obtain group of interacting SNPs with high BC risk-predictive potential. We also propose a support vector machine formed by the identified SNPs to classify BC cases and controls. Our approach achieves mean average precision (mAP) of 72.66, 67.24 and 69.25 in discriminating BC cases and controls in KBCP, OBCS and merged KBCP-OBCS sample sets, respectively. These results are better than the mAP of 70.08, 63.61 and 66.41 obtained by using a polygenic risk score model derived from 51 known BC-associated SNPs, respectively, in KBCP, OBCS and merged KBCP-OBCS sample sets. BC subtype analysis further reveals that the 200 identified KBCP SNPs from the proposed method performs favorably in classifying estrogen receptor positive (ER+) and negative (ER-) BC cases both in KBCP and OBCS data. Further, a biological analysis of the identified SNPs reveals genes related to important BC-related mechanisms, estrogen metabolism and apoptosis.
Collapse
|
8
|
Backward genotype-transcript-phenotype association mapping. Methods 2017; 129:18-23. [PMID: 28917724 DOI: 10.1016/j.ymeth.2017.09.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2017] [Revised: 08/24/2017] [Accepted: 09/05/2017] [Indexed: 01/16/2023] Open
Abstract
Genome-wide association studies have discovered a large number of genetic variants associated with complex diseases such as Alzheimer's disease. However, the genetic background of such diseases is largely unknown due to the complex mechanisms underlying genetic effects on traits, as well as a small sample size (e.g., 1000) and a large number of genetic variants (e.g., 1 million). Fortunately, datasets that contain genotypes, transcripts, and phenotypes are becoming more readily available, creating new opportunities for detecting disease-associated genetic variants. In this paper, we present a novel approach called "Backward Three-way Association Mapping" (BTAM) for detecting three-way associations among genotypes, transcripts, and phenotypes. Assuming that genotypes affect transcript levels, which in turn affect phenotypes, we first find transcripts associated with the phenotypes, and then find genotypes associated with the chosen transcripts. The backward ordering of association mappings allows us to avoid a large number of association testings between all genotypes and all transcripts, making it possible to identify three-way associations with a small computational cost. In our simulation study, we demonstrate that BTAM significantly improves the statistical power over "forward" three-way association mapping that finds genotypes associated with both transcripts and phenotypes and genotype-phenotype association mapping. Furthermore, we apply BTAM on an Alzheimer's disease dataset and report top 10 genotype-transcript-phenotype associations.
Collapse
|