1
|
Sharma A, López Y, Jia S, Lysenko A, Boroevich KA, Tsunoda T. Enhanced analysis of tabular data through Multi-representation DeepInsight. Sci Rep 2024; 14:12851. [PMID: 38834670 DOI: 10.1038/s41598-024-63630-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2023] [Accepted: 05/30/2024] [Indexed: 06/06/2024] Open
Abstract
Tabular data analysis is a critical task in various domains, enabling us to uncover valuable insights from structured datasets. While traditional machine learning methods can be used for feature engineering and dimensionality reduction, they often struggle to capture the intricate relationships and dependencies within real-world datasets. In this paper, we present Multi-representation DeepInsight (MRep-DeepInsight), a novel extension of the DeepInsight method designed to enhance the analysis of tabular data. By generating multiple representations of samples using diverse feature extraction techniques, our approach is able to capture a broader range of features and reveal deeper insights. We demonstrate the effectiveness of MRep-DeepInsight on single-cell datasets, Alzheimer's data, and artificial data, showcasing an improved accuracy over the original DeepInsight approach and machine learning methods like random forest, XGBoost, LightGBM, FT-Transformer and L2-regularized logistic regression. Our results highlight the value of incorporating multiple representations for robust and accurate tabular data analysis. By leveraging the power of diverse representations, MRep-DeepInsight offers a promising new avenue for advancing decision-making and scientific discovery across a wide range of fields.
Collapse
Affiliation(s)
- Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan.
| | | | - Shangru Jia
- Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
| | - Artem Lysenko
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan
| | - Keith A Boroevich
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Tatsuhiko Tsunoda
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan.
- Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan.
| |
Collapse
|
2
|
Chuah CW, He W, Huang DS. GMean-a semi-supervised GRU and K-mean model for predicting the TF binding site. Sci Rep 2024; 14:2539. [PMID: 38291225 PMCID: PMC10827707 DOI: 10.1038/s41598-024-52933-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2023] [Accepted: 01/25/2024] [Indexed: 02/01/2024] Open
Abstract
The transcription factor binding site is a deoxyribonucleic acid sequence that binds to transcription factors. Transcription factors are proteins that regulate the transcription gene. Abnormal turnover of transcription factors can lead to uncontrolled cell growth. Therefore, discovering the relationships between transcription factors and deoxyribonucleic acid sequences is an important component of bioinformatics research. Numerous deep learning and machine learning language models have been developed to accomplish these tasks. Our goal in this work is to propose a GMean model for predicting unlabelled deoxyribonucleic acid sequences. The GMean model is a hybrid model with a combination of gated recurrent unit and K-mean clustering. The GMean model is developed in three phases. The labelled and unlabelled data are processed based on k-mers and tokenization. The labelled data is used for training. The unlabelled data are used for testing and prediction. The experimental data consists of deoxyribonucleic acid experimental of GM12878, K562 and HepG2. The experimental results show that GMean is feasible and effective in predicting deoxyribonucleic acid sequences, as the highest accuracy is 91.85% in predicting K562 and HepG2. This is followed by the prediction of the sequence between GM12878 and K562 with an accuracy of 89.13%. The lowest accuracy is the prediction of the sequence between HepG2 and GM12828, which is 88.80%.
Collapse
Affiliation(s)
- Chai Wen Chuah
- Guangdong University of Science and Technology, Songsan Hu, Dongguang, 523070, Guangdong, China.
| | - Wanxian He
- Guangxi Academy of Sciences, 98 Daling Road, Nanning, 530007, Guangxi, China
| | - De-Shuang Huang
- Eastern Institute for Advanced Study, Eastern Institute of Technology, Tongxin Road No. 568, Ningbo, 315201, China
| |
Collapse
|
3
|
Zhang J, Basu S, Kurgan L. HybridDBRpred: improved sequence-based prediction of DNA-binding amino acids using annotations from structured complexes and disordered proteins. Nucleic Acids Res 2024; 52:e10. [PMID: 38048333 PMCID: PMC10810184 DOI: 10.1093/nar/gkad1131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2023] [Accepted: 11/10/2023] [Indexed: 12/06/2023] Open
Abstract
Current predictors of DNA-binding residues (DBRs) from protein sequences belong to two distinct groups, those trained on binding annotations extracted from structured protein-DNA complexes (structure-trained) vs. intrinsically disordered proteins (disorder-trained). We complete the first empirical analysis of predictive performance across the structure- and disorder-annotated proteins for a representative collection of ten predictors. Majority of the structure-trained tools perform well on the structure-annotated proteins while doing relatively poorly on the disorder-annotated proteins, and vice versa. Several methods make accurate predictions for the structure-annotated proteins or the disorder-annotated proteins, but none performs highly accurately for both annotation types. Moreover, most predictors make excessive cross-predictions for the disorder-annotated proteins, where residues that interact with non-DNA ligand types are predicted as DBRs. Motivated by these results, we design, validate and deploy an innovative meta-model, hybridDBRpred, that uses deep transformer network to combine predictions generated by three best current predictors. HybridDBRpred provides accurate predictions and low levels of cross-predictions across the two annotation types, and is statistically more accurate than each of the ten tools and baseline meta-predictors that rely on averaging and logistic regression. We deploy hybridDBRpred as a convenient web server at http://biomine.cs.vcu.edu/servers/hybridDBRpred/ and provide the corresponding source code at https://github.com/jianzhang-xynu/hybridDBRpred.
Collapse
Affiliation(s)
- Jian Zhang
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, PR China
| | - Sushmita Basu
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| |
Collapse
|
4
|
Liu S, Liang Y, Li J, Yang S, Liu M, Liu C, Yang D, Zuo Y. Integrating reduced amino acid composition into PSSM for improving copper ion-binding protein prediction. Int J Biol Macromol 2023:124993. [PMID: 37307968 DOI: 10.1016/j.ijbiomac.2023.124993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2023] [Revised: 05/12/2023] [Accepted: 05/19/2023] [Indexed: 06/14/2023]
Abstract
Copper ion-binding proteins play an essential role in metabolic processes and are critical factors in many diseases, such as breast cancer, lung cancer, and Menkes disease. Many algorithms have been developed for predicting metal ion classification and binding sites, but none have been applied to copper ion-binding proteins. In this study, we developed a copper ion-bound protein classifier, RPCIBP, which integrating the reduced amino acid composition into position-specific score matrix (PSSM). The reduced amino acid composition filters out a large number of useless evolutionary features, improving the operational efficiency and predictive ability of the model (feature dimension from 2900 to 200, ACC from 83 % to 85.1 %). Compared with the basic model using only three sequence feature extraction methods (ACC in training set between 73.8 %-86.2 %, ACC in test set between 69.3 %-87.5 %), the model integrating the evolutionary features of the reduced amino acid composition showed higher accuracy and robustness (ACC in training set between 83.1 %-90.8 %, ACC in test set between 79.1 %-91.9 %). Best copper ion-binding protein classifiers filtered by feature selection progress were deployed in a user-friendly web server (http://bioinfor.imu.edu.cn/RPCIBP). RPCIBP can accurately predict copper ion-binding proteins, which is convenient for further structural and functional studies, and conducive to mechanism exploration and target drug development.
Collapse
Affiliation(s)
- Shanghua Liu
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot 010021, China; Inner Mongolia International Mongolian Hospital, Hohhot 010065, China
| | - Yuchao Liang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot 010021, China; Digital College, Inner Mongolia Intelligent Union Big Data Academy, Hohhot 010010, China
| | - Jinzhao Li
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot 010021, China
| | - Siqi Yang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot 010021, China
| | - Ming Liu
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot 010021, China
| | - Chengfang Liu
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot 010021, China
| | - Dezhi Yang
- Inner Mongolia International Mongolian Hospital, Hohhot 010065, China.
| | - Yongchun Zuo
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot 010021, China; Inner Mongolia International Mongolian Hospital, Hohhot 010065, China; Digital College, Inner Mongolia Intelligent Union Big Data Academy, Hohhot 010010, China.
| |
Collapse
|
5
|
Sharma A, Lysenko A, Boroevich KA, Tsunoda T. DeepInsight-3D architecture for anti-cancer drug response prediction with deep-learning on multi-omics. Sci Rep 2023; 13:2483. [PMID: 36774402 PMCID: PMC9922304 DOI: 10.1038/s41598-023-29644-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2022] [Accepted: 02/08/2023] [Indexed: 02/13/2023] Open
Abstract
Modern oncology offers a wide range of treatments and therefore choosing the best option for particular patient is very important for optimal outcome. Multi-omics profiling in combination with AI-based predictive models have great potential for streamlining these treatment decisions. However, these encouraging developments continue to be hampered by very high dimensionality of the datasets in combination with insufficiently large numbers of annotated samples. Here we proposed a novel deep learning-based method to predict patient-specific anticancer drug response from three types of multi-omics data. The proposed DeepInsight-3D approach relies on structured data-to-image conversion that then allows use of convolutional neural networks, which are particularly robust to high dimensionality of the inputs while retaining capabilities to model highly complex relationships between variables. Of particular note, we demonstrate that in this formalism additional channels of an image can be effectively used to accommodate data from different omics layers while implicitly encoding the connection between them. DeepInsight-3D was able to outperform other state-of-the-art methods applied to this task. The proposed improvements can facilitate the development of better personalized treatment strategies for different cancers in the future.
Collapse
Affiliation(s)
- Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.
| | - Artem Lysenko
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan.
| | - Keith A Boroevich
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Tatsuhiko Tsunoda
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan.
- Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan.
| |
Collapse
|