1
Zhai Y, Chao J, Wang Y, Zhang P, Tang F, Zou Q. TPMA: A two pointers meta-alignment tool to ensemble different multiple nucleic acid sequence alignments. PLoS Comput Biol 2024; 20:e1011988. [PMID: 38557416 PMCID: PMC11008887 DOI: 10.1371/journal.pcbi.1011988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 04/11/2024] [Accepted: 03/11/2024] [Indexed: 04/04/2024]
Abstract
Accurate multiple sequence alignment (MSA) is imperative for the comprehensive analysis of biological sequences. However, no single MSA tool consistently outperforms its counterparts across diverse datasets. Users often have to try multiple MSA tools to achieve optimal alignment results, which can be time-consuming and memory-intensive. Even when the overall accuracy of a given MSA result is lower, it may contain local regions with the highest alignment scores, prompting researchers to seek a tool capable of merging these locally optimal results from multiple initial alignments into a globally optimal alignment. In this study, we introduce Two Pointers Meta-Alignment (TPMA), a novel tool designed for the integration of nucleic acid sequence alignments. TPMA employs two pointers to partition the initial alignments into blocks containing identical sequence fragments. It then selects the blocks with high sum-of-pairs (SP) scores and concatenates them into an alignment with an overall SP score superior to that of the initial alignments. In tests on simulated and real datasets, TPMA outperforms M-Coffee in terms of aSP, Q, and total column (TC) scores across most datasets. Even in cases where TPMA's scores are comparable to those of M-Coffee, TPMA exhibits significantly lower running time and memory consumption. Furthermore, we comprehensively assessed all the MSA tools used in the experiments, considering accuracy, time, and memory consumption. We propose accurate and fast combination strategies for small and large datasets, which streamline the user tool selection process and facilitate large-scale dataset integration. The dataset and source code of TPMA are available on GitHub (https://github.com/malabz/TPMA).
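The block-selection idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the TPMA implementation: the scoring values (`match`, `mismatch`, `gap`), the function names `sp_score` and `pick_best_blocks`, and the assumption that both alignments have already been partitioned into corresponding blocks are all hypothetical choices made here for clarity.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length strings.

    The scoring values are illustrative assumptions, not TPMA's parameters.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # a gap paired with a gap contributes nothing
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


def pick_best_blocks(blocks_a, blocks_b):
    """Keep the higher-scoring block at each position and concatenate rows.

    Assumes (hypothetically) that blocks_a[i] and blocks_b[i] align the same
    sequence fragments, so swapping one for the other keeps every sequence
    intact. On a tie, the block from blocks_a is kept.
    """
    merged = [max(a, b, key=sp_score) for a, b in zip(blocks_a, blocks_b)]
    n_rows = len(merged[0])
    return ["".join(block[i] for block in merged) for i in range(n_rows)]


blocks_a = [["AC", "AC"], ["G-", "-G"]]
blocks_b = [["AC", "AC"], ["G", "G"]]
merged = pick_best_blocks(blocks_a, blocks_b)
```

Here the second block of `blocks_b` aligns the two `G` residues in one column and therefore scores higher than the gapped version in `blocks_a`, so the merged alignment takes it.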
Affiliation(s)
- Yixiao Zhai
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Quzhou People’s Hospital, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Jiannan Chao
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Yizheng Wang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Pinglu Zhang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Furong Tang
  - Quzhou People’s Hospital, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou, China
  - Department of Basic Medical Sciences, School of Medicine, Tsinghua University, Beijing, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China

2
Chen M, Sun M, Su X, Tiwari P, Ding Y. Fuzzy kernel evidence Random Forest for identifying pseudouridine sites. Brief Bioinform 2024; 25:bbae169. [PMID: 38622357 PMCID: PMC11018548 DOI: 10.1093/bib/bbae169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Revised: 03/27/2024] [Accepted: 03/31/2024] [Indexed: 04/17/2024]
Abstract
Pseudouridine is an RNA modification that is widely distributed in both prokaryotes and eukaryotes, and plays a critical role in numerous biological activities. Despite its importance, the precise identification of pseudouridine sites through experimental approaches poses significant challenges, requiring substantial time and resources. Therefore, there is a growing need for computational techniques that can reliably and quickly identify pseudouridine sites from vast amounts of RNA sequencing data. In this study, we propose fuzzy kernel evidence Random Forest (FKeERF) to identify pseudouridine sites. The resulting method, called PseU-FKeERF, demonstrates high accuracy in identifying pseudouridine sites from RNA sequencing data. The PseU-FKeERF model selects four RNA feature coding schemes with relatively good performance for feature combination, and then inputs them into the newly proposed FKeERF method for category prediction. FKeERF not only uses fuzzy logic to expand the original feature space, but also incorporates kernel methods, which are generally easy to interpret, for category prediction. Both cross-validation tests and independent tests on benchmark datasets have shown that PseU-FKeERF has better predictive performance than several state-of-the-art methods. This new method not only improves the accuracy of pseudouridine site identification, but also provides a reference for future disease control and related drug development.
Affiliation(s)
- Mingshuai Chen
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
- Mingai Sun
  - Beidahuang Industry Group General Hospital, Harbin 150001, China
- Xi Su
  - Foshan Women and Children Hospital, Foshan 528000, China
- Prayag Tiwari
  - School of Information Technology, Halmstad University, Sweden
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China

3
Jiang J, Pei H, Li J, Li M, Zou Q, Lv Z. FEOpti-ACVP: identification of novel anti-coronavirus peptide sequences based on feature engineering and optimization. Brief Bioinform 2024; 25:bbae037. [PMID: 38366802 PMCID: PMC10939380 DOI: 10.1093/bib/bbae037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 12/27/2023] [Accepted: 01/17/2024] [Indexed: 02/18/2024]
Abstract
Anti-coronavirus peptides (ACVPs) represent a relatively novel approach to inhibiting the adsorption and fusion of the virus with human cells. Several peptide-based inhibitors have shown promise as potential therapeutic drug candidates. However, identifying such peptides in laboratory experiments is both costly and time-consuming. Therefore, there is growing interest in using computational methods to predict ACVPs. Here, we describe a model for the prediction of ACVPs that is based on the combination of feature engineering (FE) optimization and deep representation learning. FEOpti-ACVP was pre-trained using two feature extraction frameworks. In the next step, several machine learning approaches were tested to construct the final algorithm. The final version of FEOpti-ACVP outperformed existing methods used for ACVP prediction, and it has the potential to become a valuable tool in ACVP drug design. A user-friendly webserver for FEOpti-ACVP can be accessed at http://servers.aibiochem.net/soft/FEOpti-ACVP/.
Affiliation(s)
- Jici Jiang
  - College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
- Hongdi Pei
  - College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
- Jiayu Li
  - College of Life Science, Sichuan University, Chengdu 610065, China
- Mingxin Li
  - College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Zhibin Lv
  - College of Biomedical Engineering, Sichuan University, Chengdu 610065, China

4
Zhang P, Liu H, Wei Y, Zhai Y, Tian Q, Zou Q. FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets. Bioinformatics 2024; 40:btae014. [PMID: 38200554 PMCID: PMC10809904 DOI: 10.1093/bioinformatics/btae014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2023] [Revised: 12/27/2023] [Accepted: 01/09/2024] [Indexed: 01/12/2024]
Abstract
MOTIVATION: In bioinformatics, multiple sequence alignment (MSA) is a crucial task. However, conventional methods often struggle with aligning ultralong sequences. To address this issue, researchers have designed MSA methods rooted in a vertical division strategy, which segments sequence data for parallel alignment. A prime example of this approach is FMAlign, which utilizes the FM-index to extract common seeds and segment the sequences accordingly.

RESULTS: FMAlign2 leverages the suffix array to identify maximal exact matches, redefining the approach of FMAlign from searching for global chains to partial chains. By using a vertical division strategy, a large-scale problem is deconstructed into manageable tasks, enabling parallel execution of subMSA. Furthermore, sequence-profile alignment and refinement are incorporated to concatenate subsets, yielding the final result seamlessly. Compared to FMAlign, FMAlign2 markedly improves the segmentation of sequences and significantly reduces running time while maintaining accuracy, especially on ultralong datasets. Importantly, FMAlign2 enhances existing MSA methods by conferring the capability to handle sequences reaching billions in length within an acceptable time frame.

AVAILABILITY AND IMPLEMENTATION: Source code and datasets are available at https://github.com/malabz/FMAlign2 and https://zenodo.org/records/10435770.
Affiliation(s)
- Pinglu Zhang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, Sichuan, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, Zhejiang, China
- Huan Liu
  - School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang 621010, Sichuan, China
- Yanming Wei
  - School of Computer Science and Technology, Xidian University, Xi’an 710071, Shaanxi, China
- Yixiao Zhai
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, Sichuan, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, Zhejiang, China
- Qinzhong Tian
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, Sichuan, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, Zhejiang, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, Sichuan, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, Zhejiang, China

5
Cao C, Wang C, Yang S, Zou Q. CircSI-SSL: circRNA-binding site identification based on self-supervised learning. Bioinformatics 2024; 40:btae004. [PMID: 38180876 PMCID: PMC10789309 DOI: 10.1093/bioinformatics/btae004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2023] [Revised: 11/13/2023] [Accepted: 01/03/2024] [Indexed: 01/07/2024]
Abstract
MOTIVATION: In recent years, circular RNAs (circRNAs), a form of RNA with a closed-loop structure, have attracted widespread attention due to their physiological significance (they can directly bind proteins), leading to the development of numerous protein site identification algorithms. Unfortunately, these studies are supervised and require a large proportion of labeled samples in training to produce superior performance. Sample labels, however, are difficult to obtain because they require a large number of biological experiments.

RESULTS: To reduce the number of labels needed for training in the circRNA-binding site prediction task, this article proposes a self-supervised learning binding site identification algorithm named CircSI-SSL. According to our survey, this is unprecedented in the research field. Specifically, CircSI-SSL first combines multiple feature coding schemes and employs RNA_Transformer for cross-view sequence prediction (the self-supervised task) to learn mutual information from the multi-view data, and is then fine-tuned with only a few sample labels. Comprehensive experiments on six widely used circRNA datasets indicate that CircSI-SSL achieves excellent performance in comparison to previous algorithms, even in the extreme case where the ratio of training data to test data is 1:9. In addition, a transplantation experiment on six linRNA datasets, run without network modification or hyperparameter adjustment, shows that CircSI-SSL has good scalability. In summary, the prediction algorithm based on self-supervised learning proposed in this article is expected to replace previous supervised algorithms and has more extensive application value.

AVAILABILITY AND IMPLEMENTATION: The source code and data are available at https://github.com/cc646201081/CircSI-SSL.
Affiliation(s)
- Chao Cao
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324003, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China
- Chunyu Wang
  - Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Shuhong Yang
  - Faculty of Mathematics and Computer Science, Guangdong Ocean University, Zhanjiang, Guangdong 524088, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324003, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China

6
Chen Y, Wang J, Wang C, Zou Q. AutoEdge-CCP: A novel approach for predicting cancer-associated circRNAs and drugs based on automated edge embedding. PLoS Comput Biol 2024; 20:e1011851. [PMID: 38289973 PMCID: PMC10857569 DOI: 10.1371/journal.pcbi.1011851] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Revised: 02/09/2024] [Accepted: 01/22/2024] [Indexed: 02/01/2024]
Abstract
The unique expression patterns of circRNAs linked to the advancement and prognosis of cancer underscore their considerable potential as valuable biomarkers. Repurposing existing drugs for new indications can significantly reduce the cost of cancer treatment. Computational prediction of circRNA-cancer and drug-cancer relationships is crucial for precise cancer therapy. However, prior computational methods fail to analyze the interactions between circRNAs, drugs, and cancer at the systematic level. It is essential to propose a method that uncovers more valuable information for achieving cancer-centered multi-association prediction. In this paper, we present a novel computational method, AutoEdge-CCP, to unveil cancer-associated circRNAs and drugs. We abstract the complex relationships between circRNAs, drugs, and cancer into a multi-source heterogeneous network. In this network, each molecule is represented by two types of information: the intrinsic attribute information of molecular features, and the link information explicitly modeled by autoGNN, which searches information from both the intra-layer and inter-layer of a message passing neural network. The significant performance on multi-scenario applications and case studies establishes AutoEdge-CCP as a potent and promising association prediction tool.
Affiliation(s)
- Yaojia Chen
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Jiacheng Wang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Chunyu Wang
  - Faculty of Computing, Harbin Institute of Technology, Harbin, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China

7
Luo X, Li Q, Tang Y, Liu Y, Zou Q, Zheng J, Zhang Y, Xu L. Predicting active enhancers with DNA methylation and histone modification. BMC Bioinformatics 2023; 24:414. [PMID: 37919681 PMCID: PMC10621108 DOI: 10.1186/s12859-023-05547-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Accepted: 10/27/2023] [Indexed: 11/04/2023]
Abstract
BACKGROUND: Enhancers play a crucial role in gene regulation, and some active enhancers produce noncoding RNAs known as enhancer RNAs (eRNAs) bi-directionally. The most commonly used method for detecting eRNAs is CAGE-seq, but the instability of eRNAs in vivo leads to data noise in sequencing results. Unfortunately, there is currently a lack of research focused on the noise inherent in CAGE-seq data, and few approaches have been developed for predicting eRNAs. Bridging this gap and developing widely applicable eRNA prediction models is of utmost importance.

RESULTS: In this study, we proposed a method to reduce false positives in the identification of eRNAs by adjusting the statistical distribution of expression levels. We also developed eRNA prediction models using joint gene expressions, DNA methylation, and histone modification. These models achieved impressive performance with an AUC value of approximately 0.95 for intra-cell prediction and 0.9 for cross-cell prediction.

CONCLUSIONS: Our method effectively attenuates the noise generated by stochastic RNA production, resulting in more accurate detection of eRNAs. Furthermore, our eRNA prediction model exhibited significant accuracy in both intra-cell and cross-cell validation, highlighting its robustness and potential application in various cellular contexts.
Affiliation(s)
- Ximei Luo
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic University, Shenzhen, Guangdong, China
- Qun Li
  - Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou, Sichuan, China
- Yifan Tang
  - Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou, Sichuan, China
- Yan Liu
  - Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou, Sichuan, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Jie Zheng
  - Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou, Sichuan, China
- Ying Zhang
  - Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou, Sichuan, China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic University, Shenzhen, Guangdong, China

8
Jiao S, Ye X, Ao C, Sakurai T, Zou Q, Xu L. Adaptive learning embedding features to improve the predictive performance of SARS-CoV-2 phosphorylation sites. Bioinformatics 2023; 39:btad627. [PMID: 37847658 PMCID: PMC10628388 DOI: 10.1093/bioinformatics/btad627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Revised: 09/11/2023] [Accepted: 10/16/2023] [Indexed: 10/19/2023]
Abstract
MOTIVATION: The rapid and extensive transmission of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has led to an unprecedented global health emergency, affecting millions of people and causing an immense socioeconomic impact. The identification of SARS-CoV-2 phosphorylation sites plays an important role in unraveling the complex molecular mechanisms behind infection and the resulting alterations in host cell pathways. However, currently available prediction tools for identifying these sites lack accuracy and efficiency.

RESULTS: In this study, we present a comprehensive biological function analysis of SARS-CoV-2 infection in a clonal human lung epithelial A549 cell, revealing dramatic changes in protein phosphorylation pathways in host cells. Moreover, a novel deep learning predictor called PSPred-ALE is specifically designed to identify phosphorylation sites in human host cells that are infected with SARS-CoV-2. The key idea of PSPred-ALE lies in the use of a self-adaptive learning embedding algorithm, which enables the automatic extraction of contextual sequential features from protein sequences. In addition, the tool uses a multihead attention module that captures global information, further improving the accuracy of predictions. Comparative analysis of features demonstrated that the self-adaptive learning embedding features are superior to hand-crafted statistical features in capturing discriminative sequence information. Benchmarking comparison shows that PSPred-ALE outperforms the state-of-the-art prediction tools and achieves robust performance. Therefore, the proposed model can effectively identify phosphorylation sites and assist biomedical scientists in understanding the mechanism of phosphorylation in SARS-CoV-2 infection.

AVAILABILITY AND IMPLEMENTATION: PSPred-ALE is available at https://github.com/jiaoshihu/PSPred-ALE and Zenodo (https://doi.org/10.5281/zenodo.8330277).
Affiliation(s)
- Shihu Jiao
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Xiucai Ye
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Chunyan Ao
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Tetsuya Sakurai
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, No. 4089 Shahexi Road, Shenzhen 518000, China

9
Ru X, Zou Q, Lin C. Optimization of drug-target affinity prediction methods through feature processing schemes. Bioinformatics 2023; 39:btad615. [PMID: 37812388 PMCID: PMC10636279 DOI: 10.1093/bioinformatics/btad615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Revised: 09/19/2023] [Accepted: 10/07/2023] [Indexed: 10/10/2023]
Abstract
MOTIVATION: Numerous high-accuracy drug-target affinity (DTA) prediction models, whose performance is heavily reliant on the drug and target feature information, are developed at the cost of increased complexity and reduced interpretability. Feature extraction and optimization constitute a critical step that significantly influences model performance, robustness, and interpretability. Many existing studies aim to comprehensively characterize drugs and targets by extracting features from multiple perspectives; however, this approach has drawbacks: (i) an abundance of redundant or noisy features; and (ii) feature sets that often suffer from high dimensionality.

RESULTS: In this study, to obtain a model with high accuracy and strong interpretability, we utilize various traditional and cutting-edge feature selection and dimensionality reduction techniques to process self-associated features and adjacent associated features. These optimized features are then fed into learning to rank to achieve efficient DTA prediction. Extensive experimental results on two commonly used datasets indicate that, among various feature optimization methods, the regression tree-based feature selection method is most beneficial for constructing models with good performance and strong robustness. Then, by utilizing Shapley Additive Explanations values and the incremental feature selection approach, we find that the high-quality feature subset consists of the top 150D features and that the top 20D features have a breakthrough impact on DTA prediction. In conclusion, our study thoroughly validates the importance of feature optimization in DTA prediction and serves as inspiration for constructing high-performance, highly interpretable models.

AVAILABILITY AND IMPLEMENTATION: https://github.com/RUXIAOQING964914140/FS_DTA.
Affiliation(s)
- Xiaoqing Ru
  - Department of Computer Science, University of Tsukuba, Tsukuba, Japan
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Chen Lin
  - Department of Computer Science and Technology, School of Informatics, Xiamen University, Xiamen, Fujian, 361005, China

10
Tian M, Wang H, Liu X, Ye Y, Ouyang G, Shen Y, Li Z, Wang X, Wu S. Delineation of clinical target volume and organs at risk in cervical cancer radiotherapy by deep learning networks. Med Phys 2023; 50:6354-6365. [PMID: 37246619 DOI: 10.1002/mp.16468] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2022] [Revised: 04/17/2023] [Accepted: 04/28/2023] [Indexed: 05/30/2023]
Abstract
PURPOSE: Delineation of the clinical target volume (CTV) and organs-at-risk (OARs) is important in cervical cancer radiotherapy. However, it is generally labor-intensive, time-consuming, and subjective. This paper proposes a parallel-path attention fusion network (PPAF-net) to overcome these disadvantages in the delineation task.

METHODS: The PPAF-net utilizes both the texture and structure information of the CTV and OARs by employing a U-Net network to capture the high-level texture information, and an up-sampling and down-sampling (USDS) network to capture the low-level structure information to accentuate the boundaries of the CTV and OARs. Multi-level features extracted from both networks are then fused together through an attention module to generate the delineation result.

RESULTS: The dataset contains 276 computed tomography (CT) scans of patients with stage IB-IIA cervical cancer. The images were provided by the West China Hospital of Sichuan University. Simulation results demonstrate that PPAF-net performs favorably on the delineation of the CTV and OARs (e.g., rectum and bladder) and achieves state-of-the-art delineation accuracy for both the CTV and OARs. In terms of the Dice Similarity Coefficient (DSC) and the Hausdorff Distance (HD), the results are 88.61% and 2.25 cm for the CTV, 92.27% and 0.73 cm for the rectum, 96.74% and 0.68 cm for the bladder, 96.38% and 0.65 cm for the left kidney, 96.79% and 0.63 cm for the right kidney, 93.42% and 0.52 cm for the left femoral head, 93.69% and 0.51 cm for the right femoral head, 87.53% and 1.07 cm for the small intestine, and 91.50% and 0.84 cm for the spinal cord.

CONCLUSIONS: The proposed automatic delineation network PPAF-net performs well on CTV and OARs segmentation tasks, and has great potential for reducing the burden on radiation oncologists and increasing the accuracy of delineation. In the future, radiation oncologists from the West China Hospital of Sichuan University will further evaluate the results of network delineation, making this method helpful in clinical practice.
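For readers unfamiliar with the two metrics reported in this abstract, the following is a minimal sketch of how the Dice Similarity Coefficient and the (symmetric) Hausdorff Distance can be computed for two segmentations. It is not the authors' evaluation code: the function names, the representation of a segmentation as a set of integer pixel coordinates, and the unit pixel spacing are assumptions made here for illustration. Converting the distance to centimetres would require the scan's real pixel spacing, which the abstract does not give.

```python
import math


def dice_coefficient(region_a, region_b):
    """Dice Similarity Coefficient of two sets of pixel coordinates.

    Returns 2*|A ∩ B| / (|A| + |B|); two empty regions are treated as a
    perfect match (an assumed convention).
    """
    if not region_a and not region_b:
        return 1.0
    overlap = len(region_a & region_b)
    return 2.0 * overlap / (len(region_a) + len(region_b))


def hausdorff_distance(region_a, region_b):
    """Symmetric Hausdorff distance between two non-empty point sets.

    Brute-force O(|A|*|B|) version, in pixel units (spacing assumed to be 1).
    """
    if not region_a or not region_b:
        raise ValueError("both regions must be non-empty")

    def directed(src, dst):
        # Largest distance from any point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(region_a, region_b), directed(region_b, region_a))


# Two 2x2 squares that overlap in one column of pixels.
predicted = {(0, 0), (0, 1), (1, 0), (1, 1)}
reference = {(0, 1), (1, 1), (0, 2), (1, 2)}
dsc = dice_coefficient(predicted, reference)
hd = hausdorff_distance(predicted, reference)
```

In this toy case half of each region overlaps the other, and the farthest any pixel lies from the other region is one pixel.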
Affiliation(s)
- Miao Tian
  - School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
- Hongqiu Wang
  - School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
- Xingang Liu
  - School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
- Yuyun Ye
  - Department of Electrical and Computer Engineering, University of Tulsa, Tulsa, USA
- Ganlu Ouyang
  - Department of Radiation Oncology, Cancer Center, the West China Hospital of Sichuan University, Chengdu, China
- Yali Shen
  - Department of Radiation Oncology, Cancer Center, the West China Hospital of Sichuan University, Chengdu, China
- Zhiping Li
  - Department of Radiation Oncology, Cancer Center, the West China Hospital of Sichuan University, Chengdu, China
- Xin Wang
  - Department of Radiation Oncology, Cancer Center, the West China Hospital of Sichuan University, Chengdu, China
- Shaozhi Wu
  - School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China

11
Wang J, Chen Y, Zou Q. Inferring gene regulatory network from single-cell transcriptomes with graph autoencoder model. PLoS Genet 2023; 19:e1010942. [PMID: 37703293 PMCID: PMC10519590 DOI: 10.1371/journal.pgen.1010942] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 04/11/2023] [Revised: 09/25/2023] [Accepted: 08/29/2023] [Indexed: 09/15/2023]
Abstract
The gene regulatory structure of cells involves not only the regulatory relationship between two genes, but also the cooperative associations of multiple genes. However, most gene regulatory network inference methods for single-cell data focus only on inferring the regulatory relationships between pairs of genes, ignoring the global regulatory structure, which is crucial for identifying regulation in complex biological systems. Here, we propose a graph-based Deep learning model for Regulatory networks Inference among Genes (DeepRIG) from single-cell RNA-seq data. To learn the global regulatory structure, DeepRIG builds a prior regulatory graph by transforming the gene expression data into a co-expression mode. It then uses a graph autoencoder model to embed the global regulatory information contained in the graph into gene latent embeddings and to reconstruct the gene regulatory network. Extensive benchmarking results demonstrate that DeepRIG can accurately reconstruct gene regulatory networks and outperforms existing methods on multiple simulated networks and real-cell regulatory networks. Additionally, we applied DeepRIG to samples of human peripheral blood mononuclear cells and triple-negative breast cancer, and showed that DeepRIG can provide accurate cell-type-specific gene regulatory network inference and identify novel regulators of progression and inhibition.
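As a rough illustration of what turning expression data into a co-expression graph can look like, the sketch below links two genes when the absolute Pearson correlation of their expression profiles reaches a threshold. The correlation measure, the 0.8 threshold, the toy data, and the helper names `pearson` and `coexpression_graph` are all assumptions made for illustration. The abstract does not specify how DeepRIG builds its prior graph, and the graph autoencoder step is not shown.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # A constant profile has no defined correlation; treat it as 0.
    return cov / (sx * sy) if sx and sy else 0.0

def coexpression_graph(expr, threshold=0.8):
    """Return undirected edges (g1, g2) with |correlation| >= threshold."""
    edges = set()
    for g1, g2 in combinations(sorted(expr), 2):
        if abs(pearson(expr[g1], expr[g2])) >= threshold:
            edges.add((g1, g2))
    return edges

# Toy expression matrix (assumed data): gene -> expression across 4 cells.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
graph = coexpression_graph(expr)
print(graph)
```

Here geneA and geneB are perfectly correlated and become connected, while geneC correlates only weakly with either and stays isolated. A graph like this would serve as the input structure that an autoencoder embeds and reconstructs.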
Affiliation(s)
- Jiacheng Wang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Yaojia Chen
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
|
12
|
Hu RS, Zhang X, Wei Y. ZooPathWeb: a comprehensive web resource for zoonotic pathogens. Bioinform Adv 2023; 3:vbad094. [PMID: 37465397 PMCID: PMC10351968 DOI: 10.1093/bioadv/vbad094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Revised: 06/15/2023] [Accepted: 07/08/2023] [Indexed: 07/20/2023]
Abstract
Motivation: Zoonotic pathogens, such as viruses, bacteria, fungi and parasites, can be transmitted from animals to humans, causing a wide range of diseases that can vary from mild to life-threatening. These pathogens typically exhibit a broad host range, infecting domestic and/or wild animals, which serve as reservoirs of infection. Human infection can occur through direct contact with infected animals or their body fluids, consumption of contaminated food or water, or via bites from infected arthropod vectors. Understanding the epidemiological characteristics and population structure of zoonotic pathogens is of paramount importance for preventing and controlling the spread of zoonotic diseases. Results: Here, we present ZooPathWeb, a comprehensive online resource for zoonotic pathogens. ZooPathWeb provides essential information on pathogens that are particularly relevant to public health and includes a literature collection organized by pathogen classification, such as lineage, host, country or region and publication year. Moreover, we have developed four web-based utility tools for this release: SeqNHandle, PaPhy-ML, TreeView and BLAST. These tools are specifically designed to facilitate the identification of population structure and adaptive evolution in relation to zoonotic pathogens. Availability and implementation: The ZooPathWeb website can be accessed at http://lab.malab.cn/~hrs/zoopathweb/. The source code for AKINND, which is used for collecting pathogen-related literature, can be found at https://github.com/RuiSiHu/AKINND. Additionally, the source code for PaPhy-ML, utilized for phylogenetic analysis, can be found at https://github.com/RuiSiHu/PaPhy-ML. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Rui-Si Hu
- To whom correspondence should be addressed.
- Xin Zhang
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324003, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China
- Yanming Wei
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324003, China
- School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi 710071, China
|
13
|
Cui Y, Liu H, Ming Y, Zhang Z, Liu L, Liu R. Prediction of strand-specific and cell-type-specific G-quadruplexes based on high-resolution CUT&Tag data. Brief Funct Genomics 2023:elad024. [PMID: 37357985 DOI: 10.1093/bfgp/elad024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2023] [Revised: 05/20/2023] [Accepted: 06/01/2023] [Indexed: 06/27/2023]
Abstract
G-quadruplex (G4), a non-classical deoxyribonucleic acid structure, is widely distributed in the genome and involved in various biological processes. In vivo high-throughput sequencing has indicated that G4s are significantly enriched at functional regions in a cell-type-specific manner. Computational prediction of G4s is therefore needed as an alternative to time-consuming and laborious experimental methods. Recently, G4 CUT&Tag has been developed to generate higher-resolution sequencing data than ChIP-seq, which provides more accurate training samples for model construction. In this paper, we present a new dataset construction method based on G4 CUT&Tag sequencing data and a prediction model based on XGBoost, a boosting machine learning method. The results show that our model performs well within and across cell types. Furthermore, sequence analysis indicates that the formation of the G4 structure is greatly affected by the flanking sequences, and that the GC content of G4 flanking sequences is higher than that of non-G4 flanking sequences. We also identified G4 motifs in the high-resolution dataset, among which we found several motifs for known transcription factors (TFs), such as SP2 and BPC. These TFs may directly or indirectly affect the formation of the G4 structure.
Affiliation(s)
- Yizhi Cui
- School of Computer Science and Engineering, Beijing Technology and Business University, Beijing, 100048, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324003, Zhejiang, China
- Hongzhi Liu
- School of Computer Science and Engineering, Beijing Technology and Business University, Beijing, 100048, China
- Yutong Ming
- School of Computer Science and Engineering, Beijing Technology and Business University, Beijing, 100048, China
- Zheng Zhang
- Department of Computer Science and Software Engineering, Auburn University, Auburn, 36830, Alabama, USA
- Li Liu
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324003, Zhejiang, China
- Ruijun Liu
- School of Computer Science and Engineering, Beijing Technology and Business University, Beijing, 100048, China
|
14
|
Chen J, Chao J, Liu H, Yang F, Zou Q, Tang F. WMSA 2: a multiple DNA/RNA sequence alignment tool implemented with accurate progressive mode and a fast win-win mode combining the center star and progressive strategies. Brief Bioinform 2023:7169135. [PMID: 37200156 DOI: 10.1093/bib/bbad190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 04/09/2023] [Accepted: 04/26/2023] [Indexed: 05/20/2023]
Abstract
Multiple sequence alignment is widely used for sequence analysis, such as identifying important sites and phylogenetic analysis. Traditional methods, such as progressive alignment, are time-consuming. To address this issue, we introduce StarTree, a novel method that quickly constructs a guide tree by combining sequence clustering and hierarchical clustering. Furthermore, we develop a new heuristic similar-region detection algorithm using the FM-index and apply k-banded dynamic programming to the profile alignment. We also introduce a win-win alignment algorithm that applies the central star strategy within the clusters to speed up the alignment process, then uses the progressive strategy to align the central-aligned profiles, guaranteeing the final alignment's accuracy. We present WMSA 2 based on these improvements and compare its speed and accuracy with those of other popular methods. The results show that the guide tree made by the StarTree clustering method can lead to better accuracy than that of PartTree while consuming less time and memory than the UPGMA and mBed methods on datasets with thousands of sequences. When aligning simulated datasets, WMSA 2 consumes less time and memory while ranking at the top in Q and TC scores. On the real datasets, WMSA 2 remains better in time and memory efficiency and ranks at the top in average sum of pairs score. For the alignment of 1 million SARS-CoV-2 genomes, the win-win mode of WMSA 2 significantly reduced the running time compared with the former version. The source code and data are available at https://github.com/malabz/WMSA2.
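The k-banded dynamic programming idea mentioned in this abstract can be sketched on a simpler problem. The example below computes a unit-cost edit distance between two plain sequences while filling only the cells within k of the main diagonal. It is an illustrative simplification, not WMSA 2's implementation, which applies banding to profile alignment with its own scoring. The function name `banded_edit_distance` is hypothetical.

```python
def banded_edit_distance(a: str, b: str, k: int):
    """Edit distance using only DP cells with |i - j| <= k.

    Returns None when the length difference exceeds k, because the
    end cell then lies outside the band. The result equals the true
    edit distance whenever the true distance is at most k.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    inf = float("inf")
    # Row 0: j insertions, restricted to the band.
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i  # i deletions
                continue
            substitute = prev[j - 1] + (a[i - 1] != b[j - 1])
            delete = prev[j] + 1      # inf if (i-1, j) is outside the band
            insert = cur[j - 1] + 1   # inf if (i, j-1) is outside the band
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]

print(banded_edit_distance("ACGTACGT", "ACGTTCGT", 2))
print(banded_edit_distance("kitten", "sitting", 3))
```

Each row visits at most 2k + 1 cells, so the number of scored cells drops from n·m to roughly n·(2k + 1). For simplicity this sketch still allocates a full row per step. A production version would store only the band.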
Affiliation(s)
- Juntao Chen
- Quzhou People's Hospital, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou, China, Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China, and the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Jiannan Chao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China, and the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Huan Liu
- School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, China
- Fenglong Yang
- Department of Bioinformatics, Fujian Key Laboratory of Medical Bioinformatics, School of Medical Technology and Engineering, Fujian Medical University, Fuzhou, China
- Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China, and the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Furong Tang
- Quzhou People's Hospital, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou, China, and Department of Basic Medical Sciences, School of Medicine, Tsinghua University, Beijing, China
|
15
|
Liu L, Han K, Sun H, Han L, Gao D, Xi Q, Zhang L, Lin H. A comprehensive review of bioinformatics tools for chromatin loop calling. Brief Bioinform 2023; 24:7071576. [PMID: 36882016 DOI: 10.1093/bib/bbad072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2022] [Revised: 12/19/2022] [Accepted: 02/08/2023] [Indexed: 03/09/2023]
Abstract
Precisely calling chromatin loops has profound implications for further analysis of gene regulation and disease mechanisms. Technological advances in chromatin conformation capture (3C) assays make it possible to identify chromatin loops in the genome. However, the variety of experimental protocols has resulted in different levels of bias, which require distinct methods to call true loops from the background. Although many bioinformatics tools have been developed to address this problem, a dedicated introduction to loop-calling algorithms is still lacking. This review provides an overview of the loop-calling tools for various 3C-based techniques. We first discuss the background biases produced by different experimental techniques and the denoising algorithms. Then, the completeness and priority of each tool are categorized and summarized according to the data source of application. The summary of these works can help researchers select the most appropriate method to call loops and perform downstream analysis. This survey is also useful for bioinformatics scientists aiming to develop new loop-calling algorithms.
Affiliation(s)
- Li Liu
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
- Kaiyuan Han
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Huimin Sun
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Lu Han
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Dong Gao
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Qilemuge Xi
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
- Lirong Zhang
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Hao Lin
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
|