1
Ao C, Jiao S, Wang Y, Yu L, Zou Q. Biological Sequence Classification: A Review on Data and General Methods. RESEARCH (WASHINGTON, D.C.) 2022; 2022:0011. [PMID: 39285948 PMCID: PMC11404319 DOI: 10.34133/research.0011] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 07/25/2022] [Accepted: 10/25/2022] [Indexed: 09/19/2024]
Abstract
With the rapid development of biotechnology, the number of biological sequences has grown exponentially. The continuous expansion of biological sequence data promotes the application of machine learning to biological sequences, where predictive models are constructed to mine sequence information. There are many branches of biological sequence classification research. In this review, we mainly focus on the function and modification classification of biological sequences based on machine learning. Sequence-based prediction and analysis are the basic tasks for understanding the biological functions of DNA, RNA, proteins, and peptides. However, hundreds of classification models have been developed for biological sequences, and the variety of specific methods can be overwhelming at first glance. Here, we aim to establish a long-term support website (http://lab.malab.cn/~acy/BioseqData/home.html), which provides readers with detailed information on the classification methods and download links to relevant datasets. We briefly introduce the steps to build an effective model framework for biological sequence data. In addition, a brief introduction to single-cell sequencing data analysis methods and applications in biology is also included. Finally, we discuss the current challenges and future perspectives of biological sequence classification research.
Affiliation(s)
- Chunyan Ao
  - School of Computer Science and Technology, Xidian University, Xi'an, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Shihu Jiao
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Yansu Wang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Liang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
2
Cancer classification based on multiple dimensions: SNV patterns. Comput Biol Med 2022; 151:106270. [PMID: 36395594 DOI: 10.1016/j.compbiomed.2022.106270] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/10/2022] [Revised: 10/09/2022] [Accepted: 10/30/2022] [Indexed: 11/13/2022]
Abstract
BACKGROUND The occurrence of cancer is closely related to single nucleotide variants (SNVs). However, in DNA samples collected from patients with distinct cancers, SNVs are detected in different patterns. Therefore, it is an important task to select an appropriate method for classifying cancer that exploits SNV patterns as fully as possible, which will aid in cancer diagnosis and treatment. In traditional studies, researchers combined each SNV with its neighboring nucleotides to form a trinucleotide. Mutation signatures for cancer classification were extracted from the patterns of the trinucleotides, but SNV feature extraction in a single dimension may result in partial information loss and poor model performance. RESULTS In this study, we defined multidimensional SNV (M-SNV) features to classify cancer. M-SNV features considered first- and second-order neighboring nucleotides of one-dimensional SNVs and included six types of features. We validated the feasibility of M-SNV features using a dataset obtained from The Cancer Genome Atlas (TCGA) consisting of 2761 samples from 12 cancers. We performed preliminary screening of 562,321 DNA mutation sites in these samples. The remaining mutation sites were characterized by cancer type in six signatures. We found that the extracted features showed a similar distribution in the cluster center of the cancer type of the samples. After the preprocessing of raw data, samples were more focused on the cancer subtype distributions at the SNV level. We used KNN (k-nearest neighbors) to classify the extracted features and employed leave-one-out cross-validation to verify them. The classification accuracy is stable at approximately 97% and reaches 97.43% in the best case. Furthermore, we found that the validated oncogenes in the loci of the features had the highest importance among the 8 cancers. CONCLUSIONS It is feasible to classify cancers by the distribution of the features we defined. Moreover, our methodology has potential implications for the discovery of oncogenes.
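The evaluation scheme named in this abstract, a k-nearest-neighbors classifier scored by leave-one-out cross-validation, can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the two-dimensional points, the labels "A" and "B", the choice of k = 3, and the function names `knn_predict` and `leave_one_out_accuracy` are all assumptions made for the example, and none of the data come from TCGA or from the M-SNV features.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(xs, ys, k=3):
    """Hold out each sample once, train on the rest, and average the hits."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical toy feature vectors: two well-separated groups of four samples.
toy_x = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
         (5.0, 5.1), (5.2, 4.9), (4.8, 5.3), (5.1, 5.0)]
toy_y = ["A"] * 4 + ["B"] * 4

accuracy = leave_one_out_accuracy(toy_x, toy_y, k=3)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Leave-one-out validation uses every sample as a test case exactly once, which suits small cohorts but costs one model fit per sample.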
3
Zhang P, Zhai J, Wang K, Wu Y. IKBKE and BANK1 Polymorphisms and Clinical Characteristics in Chinese Women with Systemic Lupus Erythematosus. Immunol Invest 2022; 51:2097-2107. [PMID: 35930382 DOI: 10.1080/08820139.2022.2108325] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/08/2023]
Abstract
BACKGROUND Defects in apoptotic cell clearance are a pathogenic factor in systemic lupus erythematosus (SLE). This study screened potential pathogenic single nucleotide polymorphisms (SNPs) related to anti-apoptosis from an SLE family and explored their contribution to SLE susceptibility in Chinese women. METHODS Four SNPs (IKBKE rs15672, BANK1 rs12640056, BANK1 rs6842661, and NFKBIA rs1957106) with potential SLE susceptibility were analyzed for clinical characteristics between 567 patients with SLE and 345 healthy control subjects. RESULTS IKBKE rs15672 G/A and BANK1 rs12640056 C/T polymorphisms were associated with SLE susceptibility (rs15672 A vs G, P = 0.028, OR = 1.25, 95% CI = 1.02-1.52; rs12640056 T vs C, P = 0.015, OR = 0.78, 95% CI = 0.64-0.95, respectively). In addition, patients with AA+GA genotypes of IKBKE rs15672 had higher positive rates of anti-SSB antibodies (q = 0.008) and lower positive rates of anti-RIB antibodies (q = 0.024) than those with the GG genotype. No significant differences in clinical characteristics were observed between BANK1 rs12640056 genotypes. CONCLUSION IKBKE rs15672 G/A and BANK1 rs12640056 C/T polymorphisms are associated with susceptibility to SLE in Chinese women. This highlights the important role of these two SNPs in this disease and suggests that multiple genes from these pathways are candidates for functional studies and therapeutic targets.
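For readers unfamiliar with the allele-level statistics reported above, the sketch below shows one common way to obtain an odds ratio and its 95% confidence interval from a 2x2 table of allele counts, using the log-based (Woolf) interval. The abstract does not state which interval method the authors used, so this is an assumption; the function name `odds_ratio_ci` and the counts in the example are hypothetical and are not taken from the study.

```python
import math


def odds_ratio_ci(case_a, case_b, ctrl_a, ctrl_b, z=1.96):
    """Odds ratio of allele A versus allele B, with a Woolf (log-scale) interval.

    case_a, case_b: counts of allele A and allele B among cases.
    ctrl_a, ctrl_b: counts of allele A and allele B among controls.
    z: normal quantile; 1.96 gives an approximate 95% interval.
    """
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / case_a + 1 / case_b + 1 / ctrl_a + 1 / ctrl_b)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts, chosen only to make the arithmetic easy to follow.
or_value, ci_low, ci_high = odds_ratio_ci(case_a=40, case_b=60, ctrl_a=25, ctrl_b=75)
print(f"OR = {or_value:.2f}, 95% CI = {ci_low:.2f}-{ci_high:.2f}")
```

An interval that excludes 1, as both intervals in the abstract do, corresponds to an association that is significant at roughly the 0.05 level.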
Affiliation(s)
- Ping Zhang
  - West China School of Medicine/Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China
- Jianzhao Zhai
  - West China School of Medicine/Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China
- Kefen Wang
  - West China School of Medicine/Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China
- Yongkang Wu
  - West China School of Medicine/Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China
  - Outpatient Department, West China Hospital, Sichuan University, Chengdu, China
4
Zhai J, Zhang P, Zhang N, Luo Y, Wu Y. Analysis of WDFY4 rs7097397 and PHLDB1 rs7389 polymorphisms in Chinese patients with systemic lupus erythematosus. Clin Rheumatol 2022; 41:2035-2042. [PMID: 35188604 DOI: 10.1007/s10067-022-06103-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2021] [Revised: 02/10/2022] [Accepted: 02/15/2022] [Indexed: 02/05/2023]
Abstract
OBJECTIVES To explore the relationship among patient-specific SNPs from one SLE family, lupus susceptibility, and laboratory indicators in a western Chinese population. METHODS We previously performed whole exome sequencing in one SLE family and screened 5 SLE candidate SNPs. In this study, we verified them in 634 SLE patients and 400 healthy controls and analyzed the relationship between SNPs and laboratory indicators. RESULTS Among the 5 candidate SNPs, PHLDB1 rs7389 T/G (dominant model, OR = 0.627, 95% CI = 0.480-0.820, P = 0.001) and WDFY4 rs7097397 G/A (dominant model, OR = 0.653, 95% CI = 0.438-0.973, P = 0.035) were associated with SLE susceptibility. In addition, the G allele of rs7389 was related to an increased level of TNF-α (q = 0.013). The A allele of rs7097397 was related to reduced levels of IL-1β (q = 0.033) and IL-6 (q = 0.039) and high positive rate of antinuclear antibodies (q = 0.021). CONCLUSIONS Our study indicated that both the rs7389 T/G and rs7097397 G/A polymorphisms were related to SLE susceptibility in western China. rs7389 T/G was related to increased TNF-α content, while rs7097397 G/A was associated with reduced IL-1β and IL-6 content and increased antinuclear antibody positive rate.
Key Points
- The G allele of rs7389 was related to reduced susceptibility to SLE.
- The A allele of rs7097397 was associated with reduced susceptibility to SLE.
- The G allele of rs7389 was related to increased levels of TNF-α.
- The A allele of rs7097397 was related to decreased concentrations of IL-1β and IL-6, as well as an increased positive rate of antinuclear antibody.
Affiliation(s)
- Jianzhao Zhai
  - West China School of Medicine/Department of Laboratory Medicine, West China Hospital of Sichuan University, Chengdu, China
- Ping Zhang
  - West China School of Medicine/Department of Laboratory Medicine, West China Hospital of Sichuan University, Chengdu, China
- Naidan Zhang
  - West China School of Medicine/Department of Laboratory Medicine, West China Hospital of Sichuan University, Chengdu, China
- Yubin Luo
  - Department of Rheumatology & Immunology, West China Hospital of Sichuan University, Chengdu, China
- Yongkang Wu
  - Outpatient Department, West China Hospital of Sichuan University, Chengdu, China
5
Guo Y, Cheng H, Yuan Z, Liang Z, Wang Y, Du D. Testing Gene-Gene Interactions Based on a Neighborhood Perspective in Genome-wide Association Studies. Front Genet 2021; 12:801261. [PMID: 34956337 PMCID: PMC8693929 DOI: 10.3389/fgene.2021.801261] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/25/2021] [Accepted: 11/15/2021] [Indexed: 12/21/2022] Open
Abstract
Unexplained genetic variation that causes complex diseases is often induced by gene-gene interactions (GGIs). Gene-based methods are one of the current statistical methodologies for discovering GGIs in case-control genome-wide association studies that are not only powerful statistically, but also interpretable biologically. However, most approaches include assumptions about the form of GGIs, which results in poor statistical performance. As a result, we propose gene-based testing based on the maximal neighborhood coefficient (MNC) called gene-based gene-gene interaction through a maximal neighborhood coefficient (GBMNC). MNC is a metric for capturing a wide range of relationships between two random vectors with arbitrary, but not necessarily equal, dimensions. We established a statistic that leverages the difference in MNC in case and in control samples as an indication of the existence of GGIs, based on the assumption that the joint distribution of two genes in cases and controls should not be substantially different if there is no interaction between them. We then used a permutation-based statistical test to evaluate this statistic and calculate a statistical p-value to represent the significance of the interaction. Experimental results using both simulation and real data showed that our approach outperformed earlier methods for detecting GGIs.
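The permutation step described in this abstract can be sketched as follows, under loudly stated assumptions. The maximal neighborhood coefficient is not reproduced here: the absolute difference in Pearson correlation between cases and controls stands in for the MNC-based statistic, each "gene" is reduced to a single simulated genotype coded 0, 1, or 2, and the names `pearson`, `dependence_gap`, and `permutation_p_value` are hypothetical. Only the overall logic (compare a dependence measure in cases and controls, then shuffle the case/control labels to build a null distribution) follows the text.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def dependence_gap(pairs, labels):
    """Absolute difference in dependence between cases (1) and controls (0)."""
    cases = [p for p, lab in zip(pairs, labels) if lab == 1]
    controls = [p for p, lab in zip(pairs, labels) if lab == 0]
    r_cases = pearson([a for a, _ in cases], [b for _, b in cases])
    r_controls = pearson([a for a, _ in controls], [b for _, b in controls])
    return abs(r_cases - r_controls)


def permutation_p_value(pairs, labels, n_perm=1000, seed=0):
    """Share of label shufflings whose gap is at least the observed gap."""
    rng = random.Random(seed)
    observed = dependence_gap(pairs, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if dependence_gap(pairs, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Simulated data (not real genotypes): coupled in cases, independent in controls.
data_rng = random.Random(42)
pairs, labels = [], []
for _ in range(100):
    g = data_rng.choice([0, 1, 2])
    pairs.append((g, g))
    labels.append(1)
for _ in range(100):
    pairs.append((data_rng.choice([0, 1, 2]), data_rng.choice([0, 1, 2])))
    labels.append(0)

p_value = permutation_p_value(pairs, labels, n_perm=200, seed=1)
print(f"permutation p-value: {p_value:.4f}")
```

Shuffling the labels preserves the joint distribution of the two genotypes while breaking any link to disease status, which is why the shuffled gaps approximate the no-interaction null.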
Affiliation(s)
- Yingjie Guo
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Honghong Cheng
  - School of Information, Shanxi University of Finance and Economics, Taiyuan, China
- Zhian Yuan
  - Research Institute of Big Data Science and Industry, Shanxi University, Taiyuan, China
- Zhen Liang
  - School of Life Science, Shanxi University, Taiyuan, China
- Yang Wang
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Debing Du
  - Beidahuang Industry Group General Hospital, Harbin, China
6
Guo Y, Wu C, Yuan Z, Wang Y, Liang Z, Wang Y, Zhang Y, Xu L. Gene-Based Testing of Interactions Using XGBoost in Genome-Wide Association Studies. Front Cell Dev Biol 2021; 9:801113. [PMID: 34977040 PMCID: PMC8716787 DOI: 10.3389/fcell.2021.801113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2021] [Accepted: 11/23/2021] [Indexed: 11/30/2022] Open
Abstract
Among the many statistical methods for identifying gene–gene interactions in qualitative genome-wide association studies, gene-based methods are not only statistically powerful but also biologically interpretable. However, their detection power is limited by the assumptions they make about the association between traits and single nucleotide polymorphisms. Thus, a gene-based method derived from XGBoost (GGInt-XGBoost) is proposed in this article. Assuming that the log odds ratio of disease traits is additive when a pair of genes does not interact, the difference in error between the XGBoost models with and without the additive constraint can indicate a gene–gene interaction; we then used a permutation-based statistical test to assess this difference and to provide a p-value representing the significance of the interaction. Experimental results on both simulated and real data showed that our approach outperformed previous methods in detecting gene–gene interactions.
Affiliation(s)
- Yingjie Guo
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Chenxi Wu
  - Department of Mathematics, University of Wisconsin-Madison, Madison, WI, United States
- Zhian Yuan
  - Research Institute of Big Data Science and Industry, Shanxi University, Taiyuan, China
- Yansu Wang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Zhen Liang
  - School of Life Science, Shanxi University, Taiyuan, China
- Yang Wang
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Yi Zhang
  - Beidahuang Industry Group General Hospital, Harbin, China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China

*Correspondence: Yi Zhang; Lei Xu