1
Bao Y, Ma Q, Chen L, Feng K, Guo W, Huang T, Cai YD. Recognizing SARS-CoV-2 infection of nasopharyngeal tissue at the single-cell level by machine learning method. Mol Immunol 2025; 177:44-61. [PMID: 39700903 DOI: 10.1016/j.molimm.2024.12.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2024] [Revised: 11/27/2024] [Accepted: 12/13/2024] [Indexed: 12/21/2024]
Abstract
SARS-CoV-2 has posed serious global health challenges not only because of the high degree of virus transmissibility but also due to its severe effects on the respiratory system, such as inducing changes in multiple organs through the ACE2 receptor. This virus makes changes to gene expression at the single-cell level and thus to cellular functions and immune responses in a variety of cell types. Previous studies have not been able to resolve these mechanisms fully, and so our study tries to bridge knowledge gaps about the cellular responses under conditions of infection. We performed single-cell RNA-sequencing of nasopharyngeal swabs from COVID-19 patients and healthy controls. We assembled a dataset of 32,588 cells for 58 subjects for analysis. The data were sorted into eight cell types: ciliated, basal, deuterosomal, goblet, myeloid, secretory, squamous, and T cells. Using machine learning, including nine feature ranking algorithms and two classification algorithms, we classified the infection status of single cells and analyzed gene expression to pinpoint critical markers of SARS-CoV-2 infection. Our findings show distinct gene expression profiles between infected and uninfected cells across diverse cell types, with key indicators such as FKBP4, IFITM1, SLC35E1, CD200R1, MT-ATP6, KRT13, RBM15, and FTH1 illuminating unique immune responses and potential pathways for viral spread and immune evasion. The machine learning methods effectively differentiated between infected and non-infected cells, shedding light on the cellular heterogeneity of SARS-CoV-2 infection. The findings will improve our knowledge of the cellular dynamics of SARS-CoV-2.
Affiliation(s)
- YuSheng Bao
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
- QingLan Ma
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
- KaiYan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510507, China.
- Wei Guo
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China.
- Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China; CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
2
Gajate-Arenas M, García-Pérez O, Domínguez-De-Barros A, Sirvent-Blanco C, Dorta-Guerra R, García-Ramos A, Piñero JE, Lorenzo-Morales J, Córdoba-Lanús E. Differential Inflammatory and Immune Response to Viral Infection in the Upper-Airway and Peripheral Blood of Mild COVID-19 Cases. J Pers Med 2024; 14:1099. [PMID: 39590591 PMCID: PMC11595938 DOI: 10.3390/jpm14111099] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2024] [Revised: 11/05/2024] [Accepted: 11/07/2024] [Indexed: 11/28/2024] Open
Abstract
BACKGROUND/OBJECTIVES: COVID-19 is characterised by a wide variety of clinical manifestations, and clinical tests and genetic analysis might help to predict patient outcomes. METHODS: In the current study, the expression of genes related to immune response (CCL5, IFI6, OAS1, IRF9, IL1B, and TGFB1) was analysed in the upper airway and paired blood samples from 25 subjects infected with SARS-CoV-2. Relative gene expression was determined by RT-qPCR. RESULTS: CCL5 expression was higher in the blood than in the upper airway (p < 0.001). In addition, a negative correlation was found between IFI6 and viral load (p = 0.033) in the upper airway, suggesting that IFI6 expression inhibits the viral infection. Concerning sex, women expressed IL1B and IRF9 in a higher proportion than men at a systemic level (p = 0.008 and p = 0.049, respectively). However, an increased expression of IRF9 was found in men compared to women in the upper airway (p = 0.046), which could be due to the protective effect of IRF9, especially in men. CONCLUSIONS: The higher expression of CCL5 in blood might be due to the key role of this gene in the migration and recruitment of immune cells from the systemic circulation to the lungs. Our findings confirm the existence of sex differences in the immune response to early stages of the infection. Further studies in a larger cohort are necessary to corroborate the current findings.
Affiliation(s)
- Malena Gajate-Arenas
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 La Laguna, Tenerife, Spain; (M.G.-A.); (O.G.-P.); (A.D.-D.-B.); (C.S.-B.); (R.D.-G.); (A.G.-R.); (J.E.P.)
- Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, 28029 Madrid, Spain
- Omar García-Pérez
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 La Laguna, Tenerife, Spain; (M.G.-A.); (O.G.-P.); (A.D.-D.-B.); (C.S.-B.); (R.D.-G.); (A.G.-R.); (J.E.P.)
- Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, 28029 Madrid, Spain
- Angélica Domínguez-De-Barros
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 La Laguna, Tenerife, Spain; (M.G.-A.); (O.G.-P.); (A.D.-D.-B.); (C.S.-B.); (R.D.-G.); (A.G.-R.); (J.E.P.)
- Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, 28029 Madrid, Spain
- Candela Sirvent-Blanco
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 La Laguna, Tenerife, Spain; (M.G.-A.); (O.G.-P.); (A.D.-D.-B.); (C.S.-B.); (R.D.-G.); (A.G.-R.); (J.E.P.)
- Roberto Dorta-Guerra
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 La Laguna, Tenerife, Spain; (M.G.-A.); (O.G.-P.); (A.D.-D.-B.); (C.S.-B.); (R.D.-G.); (A.G.-R.); (J.E.P.)
- Departamento de Matemáticas, Estadística e Investigación Operativa, Facultad de Ciencias, Sección de Matemáticas, Universidad de La Laguna, 38200 La Laguna, Tenerife, Spain
- Alma García-Ramos
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 La Laguna, Tenerife, Spain; (M.G.-A.); (O.G.-P.); (A.D.-D.-B.); (C.S.-B.); (R.D.-G.); (A.G.-R.); (J.E.P.)
- José E. Piñero
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 La Laguna, Tenerife, Spain; (M.G.-A.); (O.G.-P.); (A.D.-D.-B.); (C.S.-B.); (R.D.-G.); (A.G.-R.); (J.E.P.)
- Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, 28029 Madrid, Spain
- Departamento de Obstetricia y Ginecología, Pediatría, Medicina Preventiva y Salud Pública, Toxicología, Medicina Legal y Forense y Parasitología, Facultad de Ciencias de la Salud, Universidad de La Laguna, 38200 La Laguna, Tenerife, Spain
- Jacob Lorenzo-Morales
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 La Laguna, Tenerife, Spain; (M.G.-A.); (O.G.-P.); (A.D.-D.-B.); (C.S.-B.); (R.D.-G.); (A.G.-R.); (J.E.P.)
- Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, 28029 Madrid, Spain
- Departamento de Obstetricia y Ginecología, Pediatría, Medicina Preventiva y Salud Pública, Toxicología, Medicina Legal y Forense y Parasitología, Facultad de Ciencias de la Salud, Universidad de La Laguna, 38200 La Laguna, Tenerife, Spain
- Elizabeth Córdoba-Lanús
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 La Laguna, Tenerife, Spain; (M.G.-A.); (O.G.-P.); (A.D.-D.-B.); (C.S.-B.); (R.D.-G.); (A.G.-R.); (J.E.P.)
- Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, 28029 Madrid, Spain
3
Zhao K, So HC, Lin Z. scParser: sparse representation learning for scalable single-cell RNA sequencing data analysis. Genome Biol 2024; 25:223. [PMID: 39152499 PMCID: PMC11328435 DOI: 10.1186/s13059-024-03345-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Accepted: 07/23/2024] [Indexed: 08/19/2024] Open
Abstract
The rapid rise in the availability and scale of scRNA-seq data needs scalable methods for integrative analysis. Though many methods for data integration have been developed, few focus on understanding the heterogeneous effects of biological conditions across different cell populations in integrative analysis. Our proposed scalable approach, scParser, models the heterogeneous effects from biological conditions, which unveils the key mechanisms by which gene expression contributes to phenotypes. Notably, the extended scParser pinpoints biological processes in cell subpopulations that contribute to disease pathogenesis. scParser achieves favorable performance in cell clustering compared to state-of-the-art methods and has a broad and diverse applicability.
Affiliation(s)
- Kai Zhao
- Department of Statistics, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
- Hon-Cheong So
- School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
- KIZ-CUHK Joint Laboratory of Bioresources and Molecular Research of Common Diseases, Kunming Institute of Zoology and The Chinese University of Hong Kong, Hong Kong SAR, China.
- Department of Psychiatry, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
- Margaret K.L. Cheung Research Centre for Management of Parkinsonism, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
- Brain and Mind Institute, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
- Hong Kong Branch of the Chinese Academy of Sciences Center for Excellence in Animal Evolution and Genetics, The Chinese University of Hong Kong, Hong Kong SAR, China.
- Zhixiang Lin
- Department of Statistics, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
4
Alamin MH, Rahaman MM, Ferdousi F, Sarker A, Ali MA, Hossen MB, Sarker B, Kumar N, Mollah MNH. In-silico discovery of common molecular signatures for which SARS-CoV-2 infections and lung diseases stimulate each other, and drug repurposing. PLoS One 2024; 19:e0304425. [PMID: 39024368 PMCID: PMC11257407 DOI: 10.1371/journal.pone.0304425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Accepted: 05/12/2024] [Indexed: 07/20/2024] Open
Abstract
COVID-19, caused by SARS-CoV-2, is a global health issue. It remains a severe risk for patients who also suffer from one or more chronic diseases, including different lung diseases. In this study, we explored common molecular signatures through which SARS-CoV-2 infections and different lung diseases stimulate each other, and associated candidate drug molecules. We identified 11 top-ranked genes shared between SARS-CoV-2 infections and different lung diseases (Asthma, Tuberculosis, Cystic Fibrosis, Pneumonia, Emphysema, Bronchitis, IPF, ILD, and COPD), namely STAT1, TLR4, CXCL10, CCL2, JUN, DDX58, IRF7, ICAM1, MX2, IRF9 and ISG15, as the hubs of the shared differentially expressed genes (hub-sDEGs). The gene ontology (GO) and pathway enrichment analyses of the hub-sDEGs revealed some crucial common pathogenetic processes of SARS-CoV-2 infections and different lung diseases. The regulatory network analysis of the hub-sDEGs detected the 6 top-ranked transcription factor (TF) proteins and 6 microRNAs as the key transcriptional and post-transcriptional regulatory factors of the hub-sDEGs, respectively. We then proposed three top-ranked, hub-sDEG-guided repurposable drug molecules (Entrectinib, Imatinib, and Nilotinib) for the treatment of COVID-19 with different lung diseases. This recommendation is based on the results obtained from molecular docking analysis using AutoDock Vina and the GLIDE module of Schrödinger. The selected drug molecules were optimized through density functional theory (DFT) and showed good chemical stability. Finally, we explored the binding stability of the highest-ranked receptor protein RELA with the three top-ordered drugs (Entrectinib, Imatinib, and Nilotinib) through 100 ns molecular dynamics (MD) simulations with YASARA and the Desmond module of Schrödinger and observed their consistent performance. Therefore, the findings of this study might be useful resources for the diagnosis and therapies of COVID-19 patients who are also suffering from one or more lung diseases.
Affiliation(s)
- Muhammad Habibulla Alamin
- Faculty of Science, Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh
- Md. Matiur Rahaman
- Faculty of Science, Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh
- Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Zhejiang University, Haining, P. R. China
- Farzana Ferdousi
- Faculty of Science, Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh
- Arnob Sarker
- Faculty of Science, Department of Biochemistry and Molecular Biology, University of Rajshahi, Rajshahi, Bangladesh
- Faculty of Science, Department of Statistics, Bioinformatics Laboratory (Dry), University of Rajshahi, Rajshahi, Bangladesh
- Md. Ahad Ali
- Faculty of Science, Department of Statistics, Bioinformatics Laboratory (Dry), University of Rajshahi, Rajshahi, Bangladesh
- Faculty of Science, Department of Chemistry, University of Rajshahi, Rajshahi, Bangladesh
- Md. Bayazid Hossen
- Faculty of Science, Department of Statistics, Bioinformatics Laboratory (Dry), University of Rajshahi, Rajshahi, Bangladesh
- Department of Agricultural and Applied Statistics, Bangladesh Agricultural University, Mymensingh, Bangladesh
- Bandhan Sarker
- Faculty of Science, Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh
- Nishith Kumar
- Faculty of Science, Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh
- Md. Nurul Haque Mollah
- Faculty of Science, Department of Statistics, Bioinformatics Laboratory (Dry), University of Rajshahi, Rajshahi, Bangladesh
5
Gajate-Arenas M, Fricke-Galindo I, García-Pérez O, Domínguez-de-Barros A, Pérez-Rubio G, Dorta-Guerra R, Buendía-Roldán I, Chávez-Galán L, Lorenzo-Morales J, Falfán-Valencia R, Córdoba-Lanús E. The Immune Response of OAS1, IRF9, and IFI6 Genes in the Pathogenesis of COVID-19. Int J Mol Sci 2024; 25:4632. [PMID: 38731851 PMCID: PMC11083791 DOI: 10.3390/ijms25094632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 04/17/2024] [Accepted: 04/23/2024] [Indexed: 05/13/2024] Open
Abstract
COVID-19 is characterized by a wide range of clinical manifestations, where aging, underlying diseases, and genetic background are related to worse outcomes. In the present study, the differential expression of seven genes related to immunity, IRF9, CCL5, IFI6, TGFB1, IL1B, OAS1, and TFRC, was analyzed in individuals with COVID-19 diagnoses of different disease severities. Two-step RT-qPCR was performed to determine the relative gene expression in whole-blood samples from 160 individuals. The expression of OAS1 (p < 0.05) and IFI6 (p < 0.05) was higher in moderate hospitalized cases than in severe ones. Increased gene expression of OAS1 (OR = 0.64, CI = 0.52-0.79; p = 0.001), IRF9 (OR = 0.581, CI = 0.43-0.79; p = 0.001), and IFI6 (OR = 0.544, CI = 0.39-0.69; p < 0.001) was associated with a lower risk of requiring IMV. Moreover, TGFB1 (OR = 0.646, CI = 0.50-0.83; p = 0.001), CCL5 (OR = 0.57, CI = 0.39-0.83; p = 0.003), IRF9 (OR = 0.80, CI = 0.653-0.979; p = 0.03), and IFI6 (OR = 0.827, CI = 0.69-0.991; p = 0.039) expression was associated with patient survival. In conclusion, the relevance of OAS1, IRF9, and IFI6 in controlling the viral infection was confirmed.
Affiliation(s)
- Malena Gajate-Arenas
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 San Cristóbal de La Laguna, Spain; (M.G.-A.); (O.G.-P.); (A.D.-d.-B.); (R.D.-G.)
- Ingrid Fricke-Galindo
- HLA Laboratory, Instituto Nacional de Enfermedades Respiratorias Ismael Cosío Villegas, Mexico City 14080, Mexico; (I.F.-G.); (G.P.-R.); (R.F.-V.)
- Omar García-Pérez
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 San Cristóbal de La Laguna, Spain; (M.G.-A.); (O.G.-P.); (A.D.-d.-B.); (R.D.-G.)
- Angélica Domínguez-de-Barros
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 San Cristóbal de La Laguna, Spain; (M.G.-A.); (O.G.-P.); (A.D.-d.-B.); (R.D.-G.)
- Gloria Pérez-Rubio
- HLA Laboratory, Instituto Nacional de Enfermedades Respiratorias Ismael Cosío Villegas, Mexico City 14080, Mexico; (I.F.-G.); (G.P.-R.); (R.F.-V.)
- Roberto Dorta-Guerra
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 San Cristóbal de La Laguna, Spain; (M.G.-A.); (O.G.-P.); (A.D.-d.-B.); (R.D.-G.)
- Department of Mathematics, Statistics and Operations Research, Faculty of Sciences, Mathematics Section, Universidad de La Laguna, 38200 San Cristóbal de La Laguna, Spain
- Ivette Buendía-Roldán
- Translational Research Laboratory on Aging and Pulmonary Fibrosis, Instituto Nacional de Enfermedades Respiratorias Ismael Cosio Villegas, Mexico City 14080, Mexico;
- Leslie Chávez-Galán
- Laboratory of Integrative Immunology, Instituto Nacional de Enfermedades Respiratorias Ismael Cosio Villegas, Mexico City 14080, Mexico;
- Jacob Lorenzo-Morales
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 San Cristóbal de La Laguna, Spain; (M.G.-A.); (O.G.-P.); (A.D.-d.-B.); (R.D.-G.)
- Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, 28029 Madrid, Spain
- Department of Obstetrics and Gynecology, Pediatrics, Preventive Medicine and Public Health, Toxicology, Legal and Forensic Medicine and Parasitology, Faculty of Health Sciences, Universidad de La Laguna, 38200 San Cristóbal de La Laguna, Spain
- Ramcés Falfán-Valencia
- HLA Laboratory, Instituto Nacional de Enfermedades Respiratorias Ismael Cosío Villegas, Mexico City 14080, Mexico; (I.F.-G.); (G.P.-R.); (R.F.-V.)
- Elizabeth Córdoba-Lanús
- Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, 38029 San Cristóbal de La Laguna, Spain; (M.G.-A.); (O.G.-P.); (A.D.-d.-B.); (R.D.-G.)
- Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, 28029 Madrid, Spain
6
Luo H, Yan J, Gong R, Zhang D, Zhou X, Wang X. Identification of biomarkers and pathways for the SARS-CoV-2 infections in obstructive sleep apnea patients based on machine learning and proteomic analysis. BMC Pulm Med 2024; 24:112. [PMID: 38443855 PMCID: PMC10913609 DOI: 10.1186/s12890-024-02921-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Accepted: 02/22/2024] [Indexed: 03/07/2024] Open
Abstract
BACKGROUND: The prevalence of obstructive sleep apnea (OSA) was found to be higher in individuals following COVID-19 infection. However, the mechanisms that underlie this concomitance remain only partially elucidated. The aim of this study was to delve deeper into the molecular mechanisms that underpin this comorbidity. METHODS: We acquired gene expression profiles for COVID-19 (GSE157103) and OSA (GSE75097) from the Gene Expression Omnibus (GEO) database. Upon identifying shared feature genes between OSA and COVID-19 using the LASSO, random forest, and support vector machine algorithms, we advanced to functional annotation, analysis of protein-protein interaction networks, module construction, and identification of pivotal genes. Furthermore, we established regulatory networks encompassing transcription factor (TF)-gene and TF-miRNA interactions, and searched for promising drug targets. Subsequently, the expression levels of pivotal genes were validated through proteomics data from COVID-19 cases. RESULTS: Fourteen feature genes shared between OSA and COVID-19 were selected for further investigation. Functional annotation indicated that metabolic pathways play a role in the pathogenesis of both disorders. Subsequently, employing the cytoHubba plugin, ten hub genes were recognized, namely TP53, CCND1, MDM2, RB1, HIF1A, EP300, STAT3, CDK2, HSP90AA1, and PPARG. The proteomics findings revealed a substantial increase in the expression level of HSP90AA1 in COVID-19 patient samples, especially in severe conditions. CONCLUSIONS: Our investigation illuminates a mutual pathogenic mechanism that underlies both OSA and COVID-19, which may provide novel perspectives for future investigations into the underlying mechanisms.
Affiliation(s)
- Hong Luo
- Department of Tuberculosis and Respiratory, Wuhan Jinyintan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Jisong Yan
- Department of Tuberculosis and Respiratory, Wuhan Jinyintan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Rui Gong
- Division of Life Sciences and Medicine, The First Affiliated Hospital of USTC, University of Science and Technology of China (USTC), Hefei, Anhui, China
- Dingyu Zhang
- Division of Life Sciences and Medicine, The First Affiliated Hospital of USTC, University of Science and Technology of China (USTC), Hefei, Anhui, China
- Center for Translational Medicine, Wuhan Jinyintan Hospital, Tongji Medical College, Huazhong University of Science and Technology (HUST), Wuhan, Hubei, China
- Department of Critical Care Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology (HUST), Wuhan, Hubei, China
- Xia Zhou
- Department of Tuberculosis and Respiratory, Wuhan Jinyintan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.
- Hubei Clinical Research Center for Infectious Diseases, Wuhan, China.
- Wuhan Research Center for Communicable Disease Diagnosis and Treatment, Chinese Academy of Medical Sciences, Wuhan, China.
- Joint Laboratory of Infectious Diseases and Health, Wuhan Institute of Virology and Wuhan Jinyintan Hospital, Chinese Academy of Sciences, Wuhan, China.
- Xianguang Wang
- Department of Tuberculosis and Respiratory, Wuhan Jinyintan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.
- Hubei Clinical Research Center for Infectious Diseases, Wuhan, China.
- Wuhan Research Center for Communicable Disease Diagnosis and Treatment, Chinese Academy of Medical Sciences, Wuhan, China.
- Joint Laboratory of Infectious Diseases and Health, Wuhan Institute of Virology and Wuhan Jinyintan Hospital, Chinese Academy of Sciences, Wuhan, China.
7
Ebrahimi A, Roshani F. Systems biology approaches to identify driver genes and drug combinations for treating COVID-19. Sci Rep 2024; 14:2257. [PMID: 38278931 PMCID: PMC10817985 DOI: 10.1038/s41598-024-52484-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 01/19/2024] [Indexed: 01/28/2024] Open
Abstract
Coronavirus disease 2019 (COVID-19) has caused many problems in public health, the economy, and even cultural and social life since the beginning of the epidemic. Although much research has been conducted and various omics data have been published in the search for therapeutic solutions, there is still no early diagnosis method or comprehensive treatment. In this manuscript, by collecting important genes related to COVID-19 and applying centrality and controllability analysis to PPI networks and signaling pathways related to the disease, hub and driver genes involved in the formation and progression of the disease have been identified. Next, the obtained genes have been evaluated by analyzing expression data. The results show that, in addition to the significant difference in the expression of most of these genes, their expression correlation pattern also differs between the COVID-19 and control groups. Finally, based on drug-gene interactions, drugs affecting the identified genes are presented in the form of a bipartite graph, which can be used to identify potential drug combinations.
Affiliation(s)
- Ali Ebrahimi
- Department of Physics, Alzahra University, Tehran, Iran
8
Wang Z, Sun L, Xu Y, Liang P, Xu K, Huang J. Discovery of novel JAK1 inhibitors through combining machine learning, structure-based pharmacophore modeling and bio-evaluation. J Transl Med 2023; 21:579. [PMID: 37641144 PMCID: PMC10464202 DOI: 10.1186/s12967-023-04443-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Accepted: 08/16/2023] [Indexed: 08/31/2023] Open
Abstract
BACKGROUND: Janus kinase 1 (JAK1) plays a critical role in most cytokine-mediated inflammatory and autoimmune responses and in various cancers via the JAK/STAT signaling pathway. Inhibition of JAK1 is therefore an attractive therapeutic strategy for several diseases. Recently, high-performance machine learning techniques have been increasingly applied in virtual screening to develop new kinase inhibitors. Our study aimed to develop a novel layered virtual screening method based on machine learning (ML) and pharmacophore models to identify potential JAK1 inhibitors. METHODS: First, we constructed a high-quality dataset comprising 3834 JAK1 inhibitors and 12,230 decoys, followed by establishing a series of classification models based on a combination of three molecular descriptors and six ML algorithms. To further screen potential compounds, we constructed several pharmacophore models based on Hiphop and receptor-ligand algorithms. We then used molecular docking to filter the recognized compounds. Finally, the binding stability and enzyme inhibition activity of the identified compounds were assessed by molecular dynamics (MD) simulations and in vitro enzyme activity tests. RESULTS: The best-performing ML model, DNN-ECFP4, and two pharmacophore models, Hiphop3 and 6TPF 08, were used to screen the ZINC database. A total of 13 potentially active compounds were identified, and the MD results demonstrated that all of these molecules could bind stably with JAK1 under dynamic conditions. Among the shortlisted compounds, the four purchasable compounds demonstrated significant kinase inhibition activity, with Z-10 being the most active (IC50 = 194.9 nM). CONCLUSION: The current study provides an efficient and accurate integrated model. The hit compounds are promising candidates for the further development of novel JAK1 inhibitors.
Affiliation(s)
- Zixiao Wang
- Department of Pharmacy, Honghui Hospital, Xi'an Jiaotong University, Xi'an, 710054, China.
- Lili Sun
- Department of Pharmacy, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, China
- Yu Xu
- State Key Laboratory of Natural Medicines, Jiangsu Key Laboratory of Drug Discovery for Metabolic Diseases, Center of Drug Discovery, China Pharmaceutical University, Nanjing, 210009, China
- Peida Liang
- Department of Pharmacy, Honghui Hospital, Xi'an Jiaotong University, Xi'an, 710054, China
- Kaiyan Xu
- School of Pharmacy, Lanzhou University, Lanzhou, 730000, China
- Jing Huang
- Department of Pharmacy, Honghui Hospital, Xi'an Jiaotong University, Xi'an, 710054, China.
9
Gajate-Arenas M, García-Pérez O, Chao-Pellicer J, Domínguez-De-Barros A, Dorta-Guerra R, Lorenzo-Morales J, Córdoba-Lanus E. Differential expression of antiviral and immune-related genes in individuals with COVID-19 asymptomatic or with mild symptoms. Front Cell Infect Microbiol 2023; 13:1173213. [PMID: 37389217 PMCID: PMC10302728 DOI: 10.3389/fcimb.2023.1173213] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/28/2023] [Accepted: 05/25/2023] [Indexed: 07/01/2023] Open
Abstract
COVID-19 is characterized by a wide range of symptoms where the genetic background plays a key role in SARS-CoV-2 infection. In this study, the relative expression of IRF9, CCL5, IFI6, TGFB1, IL1B, OAS1, and TFRC genes (related to immunity and antiviral activity) was analyzed in upper airway samples from 127 individuals (97 COVID-19 positive and 30 controls) by using a two-step RT-PCR. All genes except IL1B (p=0.878) showed a significantly higher expression (p<0.005) in COVID-19 cases than in the samples from the control group, suggesting that in asymptomatic-mild cases the expression of antiviral genes and of genes involved in immune cell recruitment is promoted. Moreover, IFI6 (p=0.002) and OAS1 (p=0.044) were upregulated in cases with high viral loads, which could be related to protection against severe forms of this viral infection. In addition, a higher frequency (68.7%) of individuals infected with the Omicron variant presented higher viral load values when compared to individuals infected with other variants (p<0.001). Furthermore, an increased expression of IRF9 (p<0.001), IFI6 (p<0.001), OAS1 (p=0.011), CCL5 (p=0.003), and TGFB1 (p<0.001) genes was observed in individuals infected with SARS-CoV-2 wildtype virus, which might be due to immune response evasion of the viral variants and/or vaccination. The obtained results indicate a protective role of IFI6, OAS1 and IRF9 in asymptomatic-mild cases of SARS-CoV-2 infection, while the role of TGFB1 and CCL5 in the pathogenesis of the disease is still unclear. This study highlights the importance of studying the dysregulation of immune genes in relation to the infective variant.
Affiliation(s)
- Malena Gajate-Arenas
  - Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, La Laguna, Spain
- Omar García-Pérez
  - Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, La Laguna, Spain
- Javier Chao-Pellicer
  - Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, La Laguna, Spain
  - Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, Madrid, Spain
- Angélica Domínguez-De-Barros
  - Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, La Laguna, Spain
- Roberto Dorta-Guerra
  - Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, La Laguna, Spain
  - Departamento de Matemáticas, Estadística e Investigación Operativa, Facultad de Ciencias, Sección de Matemáticas, Universidad de La Laguna, La Laguna, Spain
- Jacob Lorenzo-Morales
  - Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, La Laguna, Spain
  - Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, Madrid, Spain
  - Departamento de Obstetricia y Ginecología, Pediatría, Medicina Preventiva y Salud Pública, Toxicología, Medicina Legal y Forense y Parasitología, Facultad de Ciencias de la Salud, Universidad de La Laguna, La Laguna, Spain
- Elizabeth Córdoba-Lanus
  - Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC), Universidad de La Laguna, La Laguna, Spain
  - Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, Madrid, Spain
|
10
|
Kim H, Ahn HS, Hwang N, Huh Y, Bu S, Seo KJ, Kwon SH, Lee HK, Kim JW, Yoon BK, Fang S. Epigenomic landscape exhibits interferon signaling suppression in the patient of myocarditis after BNT162b2 vaccination. Sci Rep 2023; 13:8926. [PMID: 37264110 DOI: 10.1038/s41598-023-36070-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/14/2023] [Accepted: 05/29/2023] [Indexed: 06/03/2023] Open
Abstract
After the outbreak of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic, a novel mRNA vaccine (BNT162b2) was developed at an unprecedented speed. Although most countries have achieved widespread immunity from vaccines and infections, people, even those who have recovered from SARS-CoV-2 infection, are still recommended to receive vaccination because of its effectiveness in lowering the risk of recurrent infection. However, the BNT162b2 vaccine has been reported to increase the risk of myocarditis. To our knowledge, this study is the first to track changes in the chromatin dynamics of peripheral blood mononuclear cells (PBMCs) in a patient who developed myocarditis after BNT162b2 vaccination. A longitudinal study of chromatin accessibility, using concurrent analysis of single-cell assays for transposase-accessible chromatin with sequencing and single-cell RNA sequencing, showed downregulation of interferon signaling and upregulation of RUNX2/3 activity in PBMCs. Considering that BNT162b2 vaccination increases the level of interferon-α/γ in serum, our data highlight immune responses that differ from the conventional responses to vaccination, which may be key to understanding the side effects of BNT162b2 vaccination.
Affiliation(s)
- Hyeonhui Kim
  - Graduate School of Medical Science, Brain Korea 21 Project, Yonsei University College of Medicine, Seoul, 03722, Korea
  - Severance Biomedical Science Institute, Gangnam Severance Hospital, Yonsei University College of Medicine, Seoul, 03722, Korea
- Hyo-Suk Ahn
  - Division of Cardiology, Department of Internal Medicine, The Catholic University of Korea, Uijeongbu St. Mary's Hospital, Seoul, 06591, Korea
  - Catholic Research Institute for Intractable Cardiovascular Disease (CRID), College of Medicine, The Catholic University of Korea, Seoul, 06591, Korea
- Nahee Hwang
  - Graduate School of Medical Science, Brain Korea 21 Project, Yonsei University College of Medicine, Seoul, 03722, Korea
  - Department of Biochemistry and Molecular Biology, Yonsei University College of Medicine, Seoul, 03722, Korea
- Yune Huh
  - Department of Medicine, Yonsei University College of Medicine, Seoul, South Korea
- Seonghyeon Bu
  - Division of Cardiology, Department of Internal Medicine, The Catholic University of Korea, Uijeongbu St. Mary's Hospital, Seoul, 06591, Korea
  - Catholic Research Institute for Intractable Cardiovascular Disease (CRID), College of Medicine, The Catholic University of Korea, Seoul, 06591, Korea
- Kyung Jin Seo
  - Department of Hospital Pathology, College of Medicine, The Catholic University of Korea, Uijeongbu St. Mary's Hospital, Seoul, South Korea
- Se Hwan Kwon
  - Department of Radiology, Kyung Hee University Medical Center, Seoul, South Korea
- Hae-Kyung Lee
  - Severance Biomedical Science Institute, Gangnam Severance Hospital, Yonsei University College of Medicine, Seoul, 03722, Korea
- Jae-Woo Kim
  - Department of Biochemistry and Molecular Biology, Yonsei University College of Medicine, Seoul, 03722, Korea
- Bo Kyung Yoon
  - Department of Biochemistry and Molecular Biology, Yonsei University College of Medicine, Seoul, 03722, Korea
- Sungsoon Fang
  - Graduate School of Medical Science, Brain Korea 21 Project, Yonsei University College of Medicine, Seoul, 03722, Korea
  - Severance Biomedical Science Institute, Gangnam Severance Hospital, Yonsei University College of Medicine, Seoul, 03722, Korea
|
11
|
Li H, Ma Q, Ren J, Guo W, Feng K, Li Z, Huang T, Cai YD. Immune responses of different COVID-19 vaccination strategies by analyzing single-cell RNA sequencing data from multiple tissues using machine learning methods. Front Genet 2023; 14:1157305. [PMID: 37007947 PMCID: PMC10065150 DOI: 10.3389/fgene.2023.1157305] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 02/02/2023] [Accepted: 03/07/2023] [Indexed: 03/19/2023] Open
Abstract
Multiple types of COVID-19 vaccines have been shown to be highly effective in preventing SARS-CoV-2 infection and in reducing post-infection symptoms. Almost all of these vaccines induce systemic immune responses, but differences in immune responses induced by different vaccination regimens are evident. This study aimed to reveal the differences in immune gene expression levels of different target cells under different vaccine strategies after SARS-CoV-2 infection in hamsters. A machine learning based process was designed to analyze single-cell transcriptomic data of different cell types from the blood, lung, and nasal mucosa of hamsters infected with SARS-CoV-2, including B and T cells from the blood and nasal cavity, macrophages from the lung and nasal cavity, and alveolar epithelial and lung endothelial cells. The cohort was divided into five groups: non-vaccinated (control), 2*adenovirus (two doses of adenovirus vaccine), 2*attenuated (two doses of attenuated virus vaccine), 2*mRNA (two doses of mRNA vaccine), and mRNA/attenuated (primed by mRNA vaccine, boosted by attenuated vaccine). All genes were ranked using five signature ranking methods (LASSO, LightGBM, Monte Carlo feature selection, mRMR, and permutation feature importance). Some key genes that contributed to the analysis of immune changes, such as RPS23, DDX5, and PFN1 in immune cells, and IRF9 and MX1 in tissue cells, were screened. Afterward, the five ranked feature lists were fed into the incremental feature selection framework, which contained two classification algorithms (decision tree [DT] and random forest [RF]), to construct optimal classifiers and generate quantitative rules. Results showed that random forest classifiers provided relatively higher performance than decision tree classifiers, whereas the DT classifiers provided quantitative rules that indicated specific gene expression levels under different vaccine strategies. These findings may help us to develop better protective vaccination programs and new vaccines.
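The incremental selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, the ranked list is supplied by hand, and a nearest-centroid classifier stands in for the decision tree and random forest classifiers named in the abstract. The function names `nearest_centroid_accuracy` and `incremental_feature_selection` are hypothetical.

```python
import numpy as np

def nearest_centroid_accuracy(X, y, n_folds=5, seed=0):
    """Cross-validated accuracy of a nearest-centroid classifier.

    Assumption: this simple classifier is a stand-in for the DT/RF
    classifiers mentioned in the abstract.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    correct = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        classes = np.unique(y[train])
        centroids = np.stack(
            [X[train][y[train] == c].mean(axis=0) for c in classes]
        )
        # Squared distance of every test sample to every class centroid.
        d = ((X[test][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        correct += int((classes[d.argmin(axis=1)] == y[test]).sum())
    return correct / len(y)

def incremental_feature_selection(X, y, ranked_features, step=1):
    """Score the top-k ranked features for growing k and keep the best k."""
    curve = []
    for k in range(step, len(ranked_features) + 1, step):
        cols = ranked_features[:k]
        curve.append((k, nearest_centroid_accuracy(X[:, cols], y)))
    # max() returns the first maximum, i.e. the smallest k among ties.
    best_k, best_score = max(curve, key=lambda t: t[1])
    return best_k, best_score, curve

# Synthetic demo: features 0 and 1 carry the class signal, the rest are noise.
rng = np.random.default_rng(42)
n = 200
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 10))
X[:, 0] += 3.0 * y
X[:, 1] -= 3.0 * y
ranked = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical ranking, best first

best_k, best_score, curve = incremental_feature_selection(X, y, ranked)
print(best_k, round(best_score, 3))
```

In practice one such curve would be computed for each ranked list and each classifier, and the subset with the best cross-validated score retained.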
Affiliation(s)
- Hao Li
  - College of Food Engineering, Jilin Engineering Normal University, Changchun, China
- Qinglan Ma
  - School of Life Sciences, Shanghai University, Shanghai, China
- Jingxin Ren
  - School of Life Sciences, Shanghai University, Shanghai, China
- Wei Guo
  - Key Laboratory of Stem Cell Biology, Shanghai Institutes for Biological Sciences (SIBS), Shanghai Jiao Tong University School of Medicine (SJTUSM), Chinese Academy of Sciences (CAS), Shanghai, China
- Kaiyan Feng
  - Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, China
- Zhandong Li
  - College of Food Engineering, Jilin Engineering Normal University, Changchun, China
- Tao Huang
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai, China
|
12
|
Yagin FH, Cicek İB, Alkhateeb A, Yagin B, Colak C, Azzeh M, Akbulut S. Explainable artificial intelligence model for identifying COVID-19 gene biomarkers. Comput Biol Med 2023; 154:106619. [PMID: 36738712 PMCID: PMC9889119 DOI: 10.1016/j.compbiomed.2023.106619] [Citation(s) in RCA: 51] [Impact Index Per Article: 25.5] [Received: 11/17/2022] [Revised: 01/11/2023] [Accepted: 01/28/2023] [Indexed: 02/04/2023]
Abstract
AIM: COVID-19 has revealed the need for fast and reliable methods to assist clinicians in diagnosing the disease. This article presents a model that applies explainable artificial intelligence (XAI) methods based on machine learning techniques to COVID-19 metagenomic next-generation sequencing (mNGS) samples. METHODS: The data set comprises expression values for 15,979 genes from 234 patients, of whom 141 (60.3%) were COVID-19 negative and 93 (39.7%) were COVID-19 positive. The least absolute shrinkage and selection operator (LASSO) method was applied to select genes associated with COVID-19. The Support Vector Machine - Synthetic Minority Oversampling Technique (SVM-SMOTE) method was used to handle the class imbalance problem. Logistic regression (LR), SVM, random forest (RF), and extreme gradient boosting (XGBoost) models were constructed to predict COVID-19. An explainable approach based on local interpretable model-agnostic explanations (LIME) and SHapley Additive exPlanations (SHAP) methods was applied to determine COVID-19-associated biomarker candidate genes and improve the final model's interpretability. RESULTS: For the diagnosis of COVID-19, the XGBoost model (accuracy: 0.930) outperformed the RF (accuracy: 0.912), SVM (accuracy: 0.877), and LR (accuracy: 0.912) models. According to SHAP, the three most important genes associated with COVID-19 were IFI27, LGR6, and FAM83A. The LIME results showed that a high level of IFI27 gene expression in particular increased the probability of the positive class. CONCLUSIONS: The proposed model (XGBoost) was able to predict COVID-19 successfully. The results show that machine learning combined with LIME and SHAP can explain the biomarker prediction for COVID-19 and provide clinicians with an intuitive understanding and interpretability of the impact of risk factors in the model.
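The class-imbalance step can be made concrete with a small sketch. It checks the class proportions quoted in the abstract and then oversamples the minority class with plain SMOTE-style interpolation. This is an illustration only: the abstract names the SVM-SMOTE variant, which concentrates on samples near an SVM decision boundary, whereas the hypothetical `smote_oversample` below interpolates between random minority samples and their nearest minority neighbours. The feature values are synthetic.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation.

    Assumption: basic SMOTE, not the SVM-SMOTE variant from the abstract.
    """
    rng = np.random.default_rng(seed)
    # Pairwise squared distances between minority samples.
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# Class counts taken from the abstract: 141 negative, 93 positive.
n_neg, n_pos = 141, 93
total = n_neg + n_pos
pct_neg = round(100 * n_neg / total, 1)
pct_pos = round(100 * n_pos / total, 1)

# Synthetic stand-in for the minority-class expression matrix (4 genes).
rng = np.random.default_rng(1)
X_pos = rng.normal(size=(n_pos, 4))
synthetic = smote_oversample(X_pos, n_new=n_neg - n_pos)
print(pct_neg, pct_pos, synthetic.shape)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the range already spanned by the minority class.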
Affiliation(s)
- Fatma Hilal Yagin
  - Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, 44280, Malatya, Turkey
- İpek Balikci Cicek
  - Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, 44280, Malatya, Turkey
- Abedalrhman Alkhateeb
  - Software Engineering Department, King Hussein School for Computing Sciences, Amman, Jordan
- Burak Yagin
  - Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, 44280, Malatya, Turkey
- Cemil Colak
  - Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, 44280, Malatya, Turkey
- Mohammad Azzeh
  - Data Science Department, King Hussein School for Computing Sciences, Amman, Jordan
- Sami Akbulut
  - Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, 44280, Malatya, Turkey
  - Inonu University, Faculty of Medicine, Department of Surgery, 44280, Malatya, Turkey
  - Inonu University, Faculty of Medicine, Department of Public Health, 44280, Malatya, Turkey
|
13
|
Bajo-Morales J, Castillo-Secilla D, Herrera LJ, Caba O, Prados JC, Rojas I. Predicting COVID-19 Severity Integrating RNA-Seq Data Using Machine Learning Techniques. Curr Bioinform 2023; 18:221-231. [DOI: 10.2174/1574893617666220718110053] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/13/2022] [Revised: 05/21/2022] [Accepted: 05/31/2022] [Indexed: 11/22/2022]
Abstract
Abstract:
A fundamental challenge in the fight against COVID-19 is the development of reliable and accurate tools to predict disease progression in a patient. This information can be extremely useful in distinguishing hospitalized patients at higher risk of needing ICU care from patients with low severity. How SARS-CoV-2 infection will evolve is still unclear.
Methods:
A novel pipeline was developed that can integrate RNA-Seq data from different databases to obtain a genetic biomarker index of COVID-19 severity using an artificial intelligence algorithm. Our pipeline ensures robustness through multiple cross-validation processes in different steps.
Results:
CD93, RPS24, PSCA, and CD300E were identified as a COVID-19 severity gene signature. Furthermore, using the obtained gene signature, an effective multi-class classifier capable of discriminating between control, outpatient, inpatient, and ICU COVID-19 patients was optimized, achieving an accuracy of 97.5%.
Conclusion:
In summary, during this research, a new intelligent pipeline was implemented with the goal of developing a specific gene signature that can detect the severity of patients suffering from COVID-19. Our approach to clinical decision support systems achieved excellent results, even when processing unseen samples. Our system can be of great clinical utility for planning, organizing, and managing human and material resources, as well as for automatically classifying the severity of patients affected by COVID-19.
Affiliation(s)
- Javier Bajo-Morales
  - Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
  - Deuser Tech Group, Calle Islandia, 182-NAV 24A, Córdoba, 14014, Córdoba, Spain
- Daniel Castillo-Secilla
  - Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
  - Fujitsu Technology Solutions S.A, CoE Data Intelligence, Camino del Cerro de los Gamos, 1, Pozuelo de Alarcón, 28224, Madrid, Spain
- Luis Javier Herrera
  - Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
- Octavio Caba
  - Nuclear Medicine Department, IMIBIC, University Hospital Reina Sofia, Menéndez Pidal Avenue, 14004, Córdoba, Spain
- Jose Carlos Prados
  - Nuclear Medicine Department, IMIBIC, University Hospital Reina Sofia, Menéndez Pidal Avenue, 14004, Córdoba, Spain
- Ignacio Rojas
  - Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
|
14
|
Identification of Smoking-Associated Transcriptome Aberration in Blood with Machine Learning Methods. BIOMED RESEARCH INTERNATIONAL 2023; 2023:5333361. [PMID: 36644165 PMCID: PMC9833906 DOI: 10.1155/2023/5333361] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 09/03/2022] [Revised: 12/15/2022] [Accepted: 12/15/2022] [Indexed: 01/06/2023]
Abstract
Long-term cigarette smoking causes various human diseases, including respiratory disease, cancer, and gastrointestinal (GI) disorders. Alterations in gene expression and variable splicing processes induced by smoking are associated with the development of diseases. This study applied advanced machine learning methods to identify the isoforms with important roles in distinguishing current smokers from former smokers, based on the expression profiles of isoforms from current and former smokers collected in a previous study. These isoforms were treated as features, which were first analyzed with the Boruta method to select features highly correlated with the target variables. Then, the selected features were evaluated by four feature ranking algorithms, resulting in four feature lists. The incremental feature selection method was applied to each list to obtain the optimal feature subsets and build high-performance classification models. Furthermore, a series of classification rules was extracted from the decision tree with the highest performance. Finally, the rationality of the mined isoforms (features) and classification rules was verified by reviewing previous research. Features such as isoforms ENST00000464835 (expressed by LRRN3), ENST00000622663 (expressed by SASH1), and ENST00000284311 (expressed by GPR15), and pathways revealed by the enrichment analysis (cytotoxicity mediated by natural killer cells and cytokine-cytokine receptor interaction), were highly relevant to smoking response, suggesting the robustness of our analysis pipeline.
|
15
|
Jeyananthan P. SARS-CoV-2 Diagnosis Using Transcriptome Data: A Machine Learning Approach. SN COMPUTER SCIENCE 2023; 4:218. [PMID: 36844504 PMCID: PMC9936926 DOI: 10.1007/s42979-023-01703-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2022] [Accepted: 01/24/2023] [Indexed: 05/02/2023]
Abstract
The SARS-CoV-2 pandemic is currently a major issue for the whole world. The health community is struggling to protect the public and countries from its spread, which resurges from time to time in successive waves. Even vaccination does not seem to prevent this spread. Timely and accurate identification of infected people is essential to control the spread. So far, polymerase chain reaction (PCR) and rapid antigen tests have been widely used for this identification, despite their drawbacks. False negative cases are a particular threat in this scenario. To avoid these problems, this study uses machine learning techniques to build a classification model with higher accuracy to separate COVID-19 cases from non-COVID individuals. Transcriptome data of SARS-CoV-2 patients and controls are used in this stratification, with three different feature selection algorithms and seven classification models. Differentially expressed genes (DEGs) between these two groups were also studied and used in this classification. Results show that mutual information (or DEGs) combined with naïve Bayes (or SVM) gives the best accuracy (0.98 ± 0.04) among these methods. Supplementary Information: The online version contains supplementary material available at 10.1007/s42979-023-01703-6.
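Mutual-information feature ranking, one of the selection methods named in this abstract, can be sketched as follows. The sketch covers only the scoring step, not the classifiers, and rests on assumptions that are not from the paper: synthetic data, quantile binning of each feature into four bins, and a plug-in estimate of the mutual information. The function name `mutual_information` is hypothetical.

```python
import numpy as np

def mutual_information(x, y, n_bins=4):
    """Plug-in estimate of I(x; y) in nats for a continuous feature x
    and discrete labels y, after quantile binning of x (an assumption)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    xb = np.digitize(x, edges)  # bin index in 0..n_bins-1
    classes, yb = np.unique(y, return_inverse=True)
    joint = np.zeros((n_bins, len(classes)))
    np.add.at(joint, (xb, yb), 1)  # contingency counts
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # skip empty cells, where p * log(p) is taken as 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

# Synthetic demo: only feature 3 depends on the class label.
rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 6))
X[:, 3] += 2.0 * y

scores = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]  # feature indices, most informative first
print(ranking)
```

The top-ranked features would then be passed to a classifier such as naïve Bayes or an SVM, as the abstract describes.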
|
16
|
Das B. An implementation of a hybrid method based on machine learning to identify biomarkers in the Covid-19 diagnosis using DNA sequences. CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS : AN INTERNATIONAL JOURNAL SPONSORED BY THE CHEMOMETRICS SOCIETY 2022; 230:104680. [PMID: 36213553 PMCID: PMC9528020 DOI: 10.1016/j.chemolab.2022.104680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Revised: 09/20/2022] [Accepted: 09/27/2022] [Indexed: 06/16/2023]
Abstract
Although some people do not have any chronic disease or are not in an at-risk age group for Covid-19, they are more vulnerable to the coronavirus. To explain this, some experts focus on the person's immune system, while others think that the patient's genetic history may play a role. It is critical to detect the coronavirus from DNA signals as early as possible to determine the relationship between Covid-19 and genes. Doing so would reveal how variations in genes associated with the disease affect its severe course. In this study, a novel intelligent computational approach is proposed to identify coronavirus from nucleotide signals for the first time. The proposed method presents a multilayered feature extraction structure that extracts the most effective features by combining an entropy-based mapping technique, Discrete Wavelet Transform (DWT), a statistical feature extractor, and Singular Value Decomposition (SVD). Then 94 distinctive features are selected by the ReliefF technique. Support vector machine (SVM) and k-nearest neighbors (k-NN) are chosen as classifiers. The method achieved its highest classification accuracy, 98.84%, with the SVM classifier when detecting Covid-19 from DNA signals. The proposed method is ready to be tested with a different database in the diagnosis of Covid-19 using RNA or other signals.
Affiliation(s)
- Bihter Das
  - Department of Software Engineering, Technology Faculty, Firat University, 23119, Elazig, Turkey
|
17
|
Network-Based Data Analysis Reveals Ion Channel-Related Gene Features in COVID-19: A Bioinformatic Approach. Biochem Genet 2022; 61:471-505. [PMID: 36104591 PMCID: PMC9473477 DOI: 10.1007/s10528-022-10280-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2022] [Accepted: 09/01/2022] [Indexed: 11/02/2022]
Abstract
Coronavirus disease 2019 (COVID-19) seriously threatens human health and has been disseminated worldwide. Although there are several treatments for COVID-19, its control is currently suboptimal. Therefore, the development of novel strategies to treat COVID-19 is necessary. Ion channels are located on the membranes of all excitable cells and many intracellular organelles and are key components involved in various biological processes. They are a target of interest when searching for drug targets. This study aimed to reveal the relevant molecular features of ion channel genes in COVID-19 based on bioinformatic analyses. The RNA-sequencing data of patients with COVID-19 and healthy subjects (GSE152418 and GSE171110 datasets) were obtained from the Gene Expression Omnibus (GEO) database. Ion channel genes were selected from the Hugo Gene Nomenclature Committee (HGNC) database. The RStudio software was used to process the data based on the corresponding R language package to identify ion channel-associated differentially expressed genes (DEGs). Based on the DEGs, Gene Ontology (GO) functional and pathway enrichment analyses were performed using the Enrichr web tool. The STRING database was used to generate a protein-protein interaction (PPI) network, and the Cytoscape software was used to screen for hub genes in the PPI network based on the cytoHubba plug-in. Transcription factors (TF)-DEG, DEG-microRNA (miRNA) and DEG-disease association networks were constructed using the NetworkAnalyst web tool. Finally, the screened hub genes as drug targets were subjected to enrichment analysis based on the DSigDB using the Enrichr web tool to identify potential therapeutic agents for COVID-19. A total of 29 ion channel-associated DEGs were identified. GO functional analysis showed that the DEGs were integral components of the plasma membrane and were mainly involved in inorganic cation transmembrane transport and ion channel activity functions. 
Pathway analysis showed that the DEGs were mainly involved in nicotine addiction, calcium regulation in the cardiac cell and neuronal system pathways. The top 10 hub genes screened based on the PPI network included KCNA2, KCNJ4, CACNA1A, CACNA1E, NALCN, KCNA5, CACNA2D1, TRPC1, TRPM3 and KCNN3. The TF-DEG and DEG-miRNA networks revealed significant TFs (FOXC1, GATA2, HINFP, USF2, JUN and NFKB1) and miRNAs (hsa-mir-146a-5p, hsa-mir-27a-3p, hsa-mir-335-5p, hsa-let-7b-5p and hsa-mir-129-2-3p). Gene-disease association network analysis revealed that the DEGs were closely associated with intellectual disability and cerebellar ataxia. Drug-target enrichment analysis showed that the relevant drugs targeting the hub genes CACNA2D1, CACNA1A, CACNA1E, KCNA2 and KCNA5 were gabapentin, gabapentin enacarbil, pregabalin, guanidine hydrochloride and 4-aminopyridine. The results of this study provide a valuable basis for exploring the mechanisms of ion channel genes in COVID-19 and clues for developing therapeutic strategies for COVID-19.
|
18
|
Jeyananthan P. Prolonged viral shedding prediction on non-hospitalized, uncomplicated SARS-CoV-2 patients using their transcriptome data. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE UPDATE 2022; 2:100070. [PMID: 36090806 PMCID: PMC9444307 DOI: 10.1016/j.cmpbup.2022.100070] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/08/2022] [Revised: 05/24/2022] [Accepted: 09/04/2022] [Indexed: 06/15/2023]
Abstract
Severe acute respiratory syndrome coronavirus type 2 (SARS-CoV-2) is a highly transmissible coronavirus that has caused a deadly worldwide pandemic. The WHO reported that it spreads through contact, droplet, airborne, fomite, fecal-oral, bloodborne, mother-to-child, and animal-to-human transmission. Hence, viral shedding has a huge impact on this pandemic. This study uses transcriptome data of coronavirus disease 2019 (COVID-19) patients to predict prolonged viral shedding in the corresponding patients. Using transcriptome features alone, the lowest root mean squared error, 16.3±3.3, was obtained with the top 25 features selected by a forward feature selection algorithm combined with linear regression. Then, to see the impact of a few non-molecular features on this prediction, they were added to the model one by one along with the selected transcriptome features. However, this study shows that those features do not have any impact on prolonged viral shedding prediction. Further, this study predicts the days since onset in the same way. Here, too, the top 25 transcriptome features selected by forward feature selection give comparably good accuracy (0.74±0.1). However, the best accuracy (0.78±0.1) was obtained using the best 20 features ranked by SVM feature importance. Moreover, adding non-molecular features has a great impact on the mutual-information-selected features in this prediction.
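Forward feature selection with a linear regression model, as mentioned in this abstract, can be sketched in a few lines. The sketch is illustrative rather than a reproduction of the study: the data are synthetic, the error is a pooled 5-fold cross-validated RMSE, and the function names `cv_rmse` and `forward_selection` are hypothetical.

```python
import numpy as np

def cv_rmse(X, y, n_folds=5, seed=0):
    """Pooled cross-validated RMSE of ordinary least squares with intercept."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    sq_err = 0.0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        A = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.column_stack([np.ones(len(test)), X[test]]) @ coef
        sq_err += float(((pred - y[test]) ** 2).sum())
    return (sq_err / len(y)) ** 0.5

def forward_selection(X, y, n_select):
    """Greedily add the feature that lowers the cross-validated RMSE most."""
    selected, remaining, history = [], list(range(X.shape[1])), []
    for _ in range(n_select):
        scores = [(cv_rmse(X[:, selected + [j]], y), j) for j in remaining]
        best_rmse, best_j = min(scores)
        selected.append(best_j)
        remaining.remove(best_j)
        history.append(best_rmse)
    return selected, history

# Synthetic demo: the target depends only on features 2 and 5.
rng = np.random.default_rng(7)
X = rng.normal(size=(120, 8))
y = 4.0 * X[:, 2] - 3.0 * X[:, 5] + rng.normal(scale=0.5, size=120)

selected, history = forward_selection(X, y, n_select=3)
print(selected, [round(h, 2) for h in history])
```

The error history shows where additional features stop helping, which is one way to choose how many features to keep.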
|
19
|
Khalid Z, Huan M, Sohail Raza M, Abbas M, Naz Z, Kombe Kombe AJ, Zeng W, He H, Jin T. Identification of Novel Therapeutic Candidates Against SARS-CoV-2 Infections: An Application of RNA Sequencing Toward mRNA Based Nanotherapeutics. Front Microbiol 2022; 13:901848. [PMID: 35983322 PMCID: PMC9378778 DOI: 10.3389/fmicb.2022.901848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2022] [Accepted: 06/10/2022] [Indexed: 12/15/2022] Open
Abstract
Owing to rapid transmission and the variety of circulating SARS-CoV-2 variants, a significant increase in coronavirus disease 2019 cases with acute respiratory symptoms has prompted concerns about the effectiveness of current vaccines. The possible evasion of vaccine-induced immunity has urged scientists to identify novel therapeutic targets for developing improved vaccines to manage worldwide COVID-19 infections. Our study sequenced pooled peripheral blood mononuclear cell transcriptomes of SARS-CoV-2 patients with moderate and critical clinical outcomes to identify novel potential host receptors and biomarkers that can assist in developing new translational nanomedicines and vaccine therapies. The dysregulated signatures were associated with humoral immune responses in moderate and critical patients, including B-cell activation, cell cycle perturbations, plasmablast antibody processing, adaptive immune responses, cytokinesis, and the interleukin signaling pathway. The comparative and longitudinal analysis of moderately and critically infected groups elucidated diversity in regulatory pathways and biological processes. Several immunoglobulin genes (IGLV9-49, IGHV7-4, IGHV3-64, IGHV1-24, IGKV1D-12, and IGKV2-29), ribosomal proteins (RPL29, RPL4P2, RPL5, and RPL14), inflammatory response-related cytokines including Tumor Necrosis Factor (TNF, TNFRSF17, and TNFRSF13B), C-C motif chemokine ligands (CCL3, CCL25, CCL4L2, CCL22, and CCL4), C-X-C motif chemokine ligands (CXCL2, CXCL10, and CXCL11), and genes related to cell cycle process and DNA proliferation (MYBL2, CDC20, KIFC1, and UHCL1) were significantly upregulated among SARS-CoV-2 infected patients. 60S ribosomal protein L29 (RPL29) was a highly expressed gene among all COVID-19 infected groups.
Our study suggested that identifying differentially expressed genes (DEGs) based on disease severity and onset can be a powerful approach for identifying potential therapeutic targets to develop effective drug delivery systems against SARS-CoV-2 infections. As a result, potential therapeutic targets, such as the RPL29 protein, can be tested in vivo and in vitro to develop future mRNA-based translational nanomedicines and therapies to combat SARS-CoV-2 infections.
Affiliation(s)
- Zunera Khalid
  - Department of Obstetrics and Gynecology, The First Affiliated Hospital of University of Science and Technology of China (USTC), Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Ma Huan
  - Department of Obstetrics and Gynecology, The First Affiliated Hospital of University of Science and Technology of China (USTC), Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Muhammad Sohail Raza
  - CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
  - China National Center for Bioinformation, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Misbah Abbas
  - CAS Key Laboratory of Innate Immunity and Chronic Disease, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Zara Naz
  - CAS Key Laboratory of Innate Immunity and Chronic Disease, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Arnaud John Kombe Kombe
  - CAS Key Laboratory of Innate Immunity and Chronic Disease, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Weihong Zeng
  - CAS Key Laboratory of Innate Immunity and Chronic Disease, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Hongliang He
  - Department of Infectious Diseases, The First Affiliated Hospital of University of Science and Technology of China (USTC), Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Tengchuan Jin
  - Department of Obstetrics and Gynecology, The First Affiliated Hospital of University of Science and Technology of China (USTC), Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
  - CAS Key Laboratory of Innate Immunity and Chronic Disease, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
  - CAS Center for Excellence in Molecular Cell Science, Shanghai, China
  - *Correspondence: Tengchuan Jin,
|
20
|
Li H, Huang F, Liao H, Li Z, Feng K, Huang T, Cai YD. Identification of COVID-19-Specific Immune Markers Using a Machine Learning Method. Front Mol Biosci 2022; 9:952626. [PMID: 35928229 PMCID: PMC9344575 DOI: 10.3389/fmolb.2022.952626] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 05/25/2022] [Accepted: 06/21/2022] [Indexed: 01/08/2023] Open
Abstract
Notably, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a tight relationship with the immune system. Human resistance to COVID-19 infection comprises two stages. The first stage is immune defense, while the second stage is extensive inflammation. This process is further divided into innate and adaptive immunity during the immune defense phase. These two stages involve various immune cells, including CD4+ T cells, CD8+ T cells, monocytes, dendritic cells, B cells, and natural killer cells. Various immune cells are involved and make up the complex and unique immune system response to COVID-19, providing characteristics that set it apart from other respiratory infectious diseases. In the present study, we identified cell markers for differentiating COVID-19 from common inflammatory responses, non-COVID-19 severe respiratory diseases, and healthy populations based on single-cell profiling of the gene expression of six immune cell types by using Boruta and mRMR feature selection methods. Some features such as IFI44L in B cells, S100A8 in monocytes, and NCR2 in natural killer cells are involved in the innate immune response of COVID-19. Other features such as ZFP36L2 in CD4+ T cells can regulate the inflammatory process of COVID-19. Subsequently, the IFS method was used to determine the best feature subsets and classifiers in the six immune cell types for two classification algorithms. Furthermore, we established the quantitative rules used to distinguish the disease status. The results of this study can provide theoretical support for a more in-depth investigation of COVID-19 pathogenesis and intervention strategies.
Affiliation(s)
- Hao Li
  - College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
- Feiming Huang
  - School of Life Sciences, Shanghai University, Shanghai, China
- Huiping Liao
  - Ophthalmology and Optometry Medical School, Shandong University of Traditional Chinese Medicine, Jinan, China
- Zhandong Li
  - College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
- Kaiyan Feng
  - Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, China
- Tao Huang
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
  - *Correspondence: Tao Huang; Yu-Dong Cai
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai, China
  - *Correspondence: Tao Huang; Yu-Dong Cai
|
21
|
Li Z, Wang D, Guo W, Zhang S, Chen L, Zhang YH, Lu L, Pan X, Huang T, Cai YD. Identification of cortical interneuron cell markers in mouse embryos based on machine learning analysis of single-cell transcriptomics. Front Neurosci 2022; 16:841145. [PMID: 35911980 PMCID: PMC9337837 DOI: 10.3389/fnins.2022.841145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/24/2021] [Accepted: 06/28/2022] [Indexed: 11/13/2022] Open
Abstract
Mammalian cortical interneurons (CINs) could be classified into more than two dozen cell types that possess diverse electrophysiological and molecular characteristics, and participate in various essential biological processes in the human neural system. However, the mechanism to generate diversity in CINs remains controversial. This study aims to predict CIN diversity in mouse embryo by using single-cell transcriptomics and the machine learning methods. Data of 2,669 single-cell transcriptome sequencing results are employed. The 2,669 cells are classified into three categories, caudal ganglionic eminence (CGE) cells, dorsal medial ganglionic eminence (dMGE) cells, and ventral medial ganglionic eminence (vMGE) cells, corresponding to the three regions in the mouse subpallium where the cells are collected. Such transcriptomic profiles were first analyzed by the minimum redundancy and maximum relevance method. A feature list was obtained, which was further fed into the incremental feature selection, incorporating two classification algorithms (random forest and repeated incremental pruning to produce error reduction), to extract key genes and construct powerful classifiers and classification rules. The optimal classifier could achieve an MCC of 0.725, and category-specified prediction accuracies of 0.958, 0.760, and 0.737 for the CGE, dMGE, and vMGE cells, respectively. The related genes and rules may provide helpful information for deepening the understanding of CIN diversity.
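The two figures of merit quoted in this abstract, the Matthews correlation coefficient (MCC) and the per-category accuracies, can both be read off a confusion matrix. The following is a minimal sketch rather than the authors' code: it uses the standard multiclass generalization of MCC and assumes that "category-specified prediction accuracy" means the fraction of each true class that is predicted correctly. The function names are illustrative.

```python
from math import sqrt


def multiclass_mcc(confusion):
    """MCC from a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n_classes))
    true_counts = [sum(confusion[k]) for k in range(n_classes)]
    pred_counts = [
        sum(confusion[i][k] for i in range(n_classes)) for k in range(n_classes)
    ]
    numerator = correct * total - sum(
        t * p for t, p in zip(true_counts, pred_counts)
    )
    denominator = sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # By convention, return 0 when the coefficient is undefined
    # (for example, when every sample is assigned to a single class).
    return numerator / denominator if denominator else 0.0


def per_class_accuracy(confusion):
    """Fraction of each true class that was predicted correctly."""
    return [
        row[k] / sum(row) if sum(row) else 0.0
        for k, row in enumerate(confusion)
    ]
```

Unlike overall accuracy, the MCC is not inflated when one category dominates the sample, which is likely why it is reported next to the per-category values.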
Affiliation(s)
- Zhandong Li
  - College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
- Deling Wang
  - State Key Laboratory of Oncology in South China, Department of Radiology, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
- Wei Guo
  - Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Shiqi Zhang
  - Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai, China
- Yu-Hang Zhang
  - Channing Division of Network Medicine, Harvard Medical School, Brigham and Women’s Hospital, Boston, MA, United States
- Lin Lu
  - Department of Radiology, Columbia University Irving Medical Center, New York, NY, United States
- XiaoYong Pan
  - Key Laboratory of System Control and Information Processing, Ministry of Education of China, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
- Tao Huang
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
  - *Correspondence: Tao Huang
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai, China
  - *Correspondence: Yu-Dong Cai
|
22
|
Gao K, Wang R, Chen J, Cheng L, Frishcosy J, Huzumi Y, Qiu Y, Schluckbier T, Wei X, Wei GW. Methodology-Centered Review of Molecular Modeling, Simulation, and Prediction of SARS-CoV-2. Chem Rev 2022; 122:11287-11368. [PMID: 35594413 PMCID: PMC9159519 DOI: 10.1021/acs.chemrev.1c00965] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Indexed: 02/06/2023]
Abstract
Despite tremendous efforts in the past two years, our understanding of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), virus-host interactions, immune response, virulence, transmission, and evolution is still very limited. This limitation calls for further in-depth investigation. Computational studies have become an indispensable component in combating coronavirus disease 2019 (COVID-19) due to their low cost, their efficiency, and the fact that they are free from safety and ethical constraints. Additionally, the mechanism that governs the global evolution and transmission of SARS-CoV-2 cannot be revealed from individual experiments and was discovered by integrating genotyping of massive viral sequences, biophysical modeling of protein-protein interactions, deep mutational data, deep learning, and advanced mathematics. There exists a tsunami of literature on the molecular modeling, simulations, and predictions of SARS-CoV-2 and related developments of drugs, vaccines, antibodies, and diagnostics. To provide readers with a quick update about this literature, we present a comprehensive and systematic methodology-centered review. Aspects such as molecular biophysics, bioinformatics, cheminformatics, machine learning, and mathematics are discussed. This review will be beneficial to researchers who are looking for ways to contribute to SARS-CoV-2 studies and those who are interested in the status of the field.
Affiliation(s)
- Kaifu Gao
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Rui Wang
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Jiahui Chen
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Limei Cheng
  - Clinical Pharmacology and Pharmacometrics, Bristol Myers Squibb, Princeton, New Jersey 08536, United States
- Jaclyn Frishcosy
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Yuta Huzumi
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Yuchi Qiu
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Tom Schluckbier
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Xiaoqi Wei
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
|
23
|
Zhang YH, Li ZD, Zeng T, Chen L, Huang T, Cai YD. Screening gene signatures for clinical response subtypes of lung transplantation. Mol Genet Genomics 2022; 297:1301-1313. [PMID: 35780439 DOI: 10.1007/s00438-022-01918-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/21/2021] [Accepted: 06/12/2022] [Indexed: 11/30/2022]
Abstract
The lung is the most important organ in the human respiratory system, and its normal function is essential. Under certain pathological conditions, normal lung function can no longer be maintained, and lung transplantation is generally applied to ease patients' breathing and prolong their lives. However, several risk factors exist during and after lung transplantation, including bleeding, infection, and transplant rejection. In particular, transplant rejection is difficult to predict or prevent and leads to the most dangerous complications in patients undergoing lung transplantation. Given that the most common monitoring and validation methods for lung transplant rejection may take a long time and have low reproducibility, new technologies and methods are required to improve the efficacy and accuracy of rejection monitoring after lung transplantation. A previous study established gene expression profiles of patients who underwent lung transplantation. However, it did not provide a tool to predict lung transplantation responses. Here, a further in-depth investigation was conducted on these profiling data. A computational framework incorporating several machine learning algorithms, including feature selection methods and classification algorithms, was built to establish an effective prediction model that assigns patients to clinical subgroups corresponding to different rejection responses after lung transplantation. Furthermore, the framework screened essential genes with functional enrichments and created quantitative rules for distinguishing patients with different rejection responses to lung transplantation. The outcome of this contribution could provide guidelines for clinical treatment of each rejection subtype and help reveal the complicated rejection mechanisms of lung transplantation.
Affiliation(s)
- Yu-Hang Zhang
  - School of Life Sciences, Shanghai University, Shanghai, 200444, China
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Zhan Dong Li
  - College of Food Engineering, Jilin Engineering Normal University, Changchun, 130052, China
- Tao Zeng
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, China
- Tao Huang
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai, 200444, China
|
24
|
Rodrigues P, Costa RS, Henriques R. Enrichment analysis on regulatory subspaces: A novel direction for the superior description of cellular responses to SARS-CoV-2. Comput Biol Med 2022; 146:105443. [PMID: 35533463 PMCID: PMC9040465 DOI: 10.1016/j.compbiomed.2022.105443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2021] [Revised: 03/13/2022] [Accepted: 03/20/2022] [Indexed: 12/16/2022]
Abstract
STATEMENT: Enrichment analysis of cell transcriptional responses to SARS-CoV-2 infection from biclustering solutions yields broader coverage and superior enrichment of GO terms and KEGG pathways against alternative state-of-the-art machine learning solutions, thus aiding knowledge extraction. MOTIVATION AND METHODS: The understanding of the impacts of the SARS-CoV-2 virus on infected cells is still incomplete. This work aims at comparing the role of state-of-the-art machine learning approaches in the study of cell regulatory processes affected and induced by the SARS-CoV-2 virus, using transcriptomic data from both infectable cell lines available in public databases and in vivo samples. In particular, we assess the relevance of clustering, biclustering, and predictive modeling methods for functional enrichment. Statistical principles to handle scarcity of observations, high data dimensionality, and complex gene interactions are further discussed. Without loss of generalization ability, the proposed methods are applied to study the differential regulatory response of lung cell lines to SARS-CoV-2 (α-variant) against RSV, IAV (H1N1), and HPIV3 viruses. RESULTS: Gathered results show that, although clustering and predictive algorithms aid classic stances to functional enrichment analysis, more recent pattern-based biclustering algorithms significantly improve the number and quality of enriched GO terms and KEGG pathways with controlled false positive risks. Additionally, a comparative analysis of these results is performed to identify potential pathophysiological characteristics of COVID-19. These are further compared to those identified by other authors for the same virus as well as related ones such as SARS-CoV-1. The findings are particularly relevant given the lack of other works utilizing more complex machine learning algorithms within this context.
Affiliation(s)
- Pedro Rodrigues
  - IDMEC, Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal
  - INESC-ID and Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal
- Rafael S Costa
  - IDMEC, Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal
  - LAQV-REQUIMTE, DQ, NOVA School of Science and Technology, Caparica, Portugal
- Rui Henriques
  - INESC-ID and Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal
|
25
|
Odacı H, Kaya F, Aydın F. Does educational stress mediate the relationship between intolerance of uncertainty and academic life satisfaction in teenagers during the COVID-19 pandemic? Psychology in the Schools 2022; 60:PITS22766. [PMID: 35942391 PMCID: PMC9350207 DOI: 10.1002/pits.22766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2022] [Revised: 04/03/2022] [Accepted: 06/15/2022] [Indexed: 11/11/2022]
Abstract
The present study aims to investigate the mediating role of educational stress in the relationship between intolerance of uncertainty and academic life satisfaction among teenagers. The sample consisted of 257 female and 202 male high school students with an average age of 16.03 (SD = 1.21) continuing their education in the spring semester of the 2020-2021 academic year in Turkey. The data were collected via an online survey. Analyses revealed that intolerance of uncertainty affects the academic life satisfaction of teenagers both directly and indirectly via educational stress. Educational stress partially mediates the relationship. It was also found that the full mediation model has a good fit with the data. The academic life satisfaction of teenagers was harmed by their tendencies in tolerating the uncertainties they have been facing during the COVID-19 pandemic and by elevated levels of educational stress.
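The distinction between a direct and an indirect (mediated) effect can be made concrete with the classical product-of-coefficients decomposition. The sketch below is an illustration only: the abstract does not state which estimation procedure the study used, and the function and variable names are assumptions. It fits two least-squares regressions, M on X and Y on X and M, and reports the indirect effect as the product of path a and path b.

```python
def centered(values):
    """Subtract the mean so that intercepts drop out of the regressions."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def simple_mediation(x, m, y):
    """Product-of-coefficients mediation for one predictor and one mediator."""
    xc, mc, yc = centered(x), centered(m), centered(y)
    sxx, smm, sxm = dot(xc, xc), dot(mc, mc), dot(xc, mc)
    sxy, smy = dot(xc, yc), dot(mc, yc)
    a = sxm / sxx                            # path a: slope of M on X
    det = sxx * smm - sxm ** 2               # zero if X and M are collinear
    b = (sxx * smy - sxm * sxy) / det        # path b: slope of M in Y ~ X + M
    direct = (smm * sxy - sxm * smy) / det   # path c': slope of X in Y ~ X + M
    indirect = a * b
    return {
        "a": a,
        "b": b,
        "direct": direct,
        "indirect": indirect,
        "total": direct + indirect,
    }
```

In this framing, partial mediation corresponds to both `direct` and `indirect` being non-zero, while full mediation corresponds to a `direct` effect near zero. A real analysis would also need standard errors or bootstrap intervals for the indirect effect, which this sketch omits.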
Affiliation(s)
- Hatice Odacı
  - Department of Social Psychology, Karadeniz Technical University, Trabzon, Turkey
- Feridun Kaya
  - Department of Psychometrics, Atatürk University, Erzurum, Turkey
- Fatih Aydın
  - Department of Counseling and Guidance, Sivas Cumhuriyet University, Sivas, Turkey
|
26
|
Li Z, Pan X, Cai YD. Identification of Type 2 Diabetes Biomarkers From Mixed Single-Cell Sequencing Data With Feature Selection Methods. Front Bioeng Biotechnol 2022; 10:890901. [PMID: 35721855 PMCID: PMC9201257 DOI: 10.3389/fbioe.2022.890901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2022] [Accepted: 04/04/2022] [Indexed: 11/18/2022] Open
Abstract
Diabetes is the most common disease and a major threat to human health. Type 2 diabetes (T2D) makes up about 90% of all cases. With the development of high-throughput sequencing technologies, more and more of the fundamental pathogenesis of T2D at the genetic and transcriptomic levels has been revealed. Recent single-cell sequencing can further reveal the cellular heterogeneity of complex diseases in an unprecedented way. To explore the molecular essence of T2D across multiple cell types, we investigated the expression profiling of more than 1,600 single cells (949 cells from T2D patients and 651 cells from normal controls) and identified the differential expression profiles and characteristics at the transcriptomic level that can distinguish these two groups of cells at the single-cell level. The expression profile was analyzed by several machine learning algorithms, including Monte Carlo feature selection, support vector machine, and repeated incremental pruning to produce error reduction (RIPPER). On the one hand, some T2D-associated genes (MTND4P24, MTND2P28, and LOC100128906) were discovered. On the other hand, we revealed novel potential pathogenic mechanisms in the form of rules. They are induced by newly recognized genes and are neglected by traditional bulk sequencing techniques. In particular, the newly identified T2D genes were shown to follow specific quantitative rules with diabetes prediction potential, and these rules further indicated several potential functional crosstalks involved in T2D.
Affiliation(s)
- Zhandong Li
  - College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
- Xiaoyong Pan
  - Key Laboratory of System Control and Information Processing, Institute of Image Processing and Pattern Recognition, Ministry of Education of China, Shanghai Jiao Tong University, Shanghai, China
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai, China
  - *Correspondence: Yu-Dong Cai
|
27
|
Li Z, Mei Z, Ding S, Chen L, Li H, Feng K, Huang T, Cai YD. Identifying Methylation Signatures and Rules for COVID-19 With Machine Learning Methods. Front Mol Biosci 2022; 9:908080. [PMID: 35620480 PMCID: PMC9127386 DOI: 10.3389/fmolb.2022.908080] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/30/2022] [Accepted: 04/27/2022] [Indexed: 11/13/2022] Open
Abstract
The occurrence of coronavirus disease 2019 (COVID-19) has become a serious challenge to global public health. Definitive and effective treatments for COVID-19 are still lacking, and targeted antiviral drugs are not available. In addition, viruses can regulate host innate immunity and antiviral processes through the epigenome to promote viral self-replication and disease progression. In this study, we first analyzed the methylation dataset of COVID-19 using the Monte Carlo feature selection method to obtain a feature list. This feature list was subjected to the incremental feature selection method combined with a decision tree algorithm to extract key biomarkers and to build effective classification models and classification rules that can clearly distinguish patients with and without COVID-19. The essential features EPSTI1, NACAP1, SHROOM3, C19ORF35, and MX1 play important roles in the infection and immune response to the novel coronavirus. The six significant rules extracted from the optimal classifier quantitatively explained the expression pattern of COVID-19. Therefore, these findings validated that our method can distinguish COVID-19 at the methylation level and provide guidance for the diagnosis and treatment of COVID-19.
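Classification rules of the kind extracted from a decision tree are typically ordered IF-THEN statements over feature thresholds, applied so that the first matching rule decides the label. The sketch below shows only that mechanism. The feature names (`site_a`, `site_b`, `site_c`), thresholds, and labels are hypothetical placeholders and are not the rules or values reported in the study.

```python
import operator

# Comparison operators that may appear in a rule condition.
OPS = {"<=": operator.le, ">": operator.gt}

# Hypothetical rules for illustration only: feature names and thresholds
# are placeholders, not values taken from any publication.
RULES = [
    ([("site_a", ">", 0.6), ("site_b", "<=", 0.3)], "positive"),
    ([("site_c", "<=", 0.2)], "negative"),
]
DEFAULT_LABEL = "unassigned"


def classify(sample, rules=RULES, default=DEFAULT_LABEL):
    """Return the label of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(
            OPS[op](sample[feature], threshold)
            for feature, op, threshold in conditions
        ):
            return label
    return default
```

Because each rule names explicit features and cut-offs, a prediction can be traced back to the measurements that triggered it, which is the interpretability advantage the abstract attributes to rule-based output.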
Affiliation(s)
- Zhandong Li
  - College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
- Zi Mei
  - Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
- Shijian Ding
  - School of Life Sciences, Shanghai University, Shanghai, China
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai, China
- Hao Li
  - College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
- Kaiyan Feng
  - Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, China
- Tao Huang
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
  - *Correspondence: Tao Huang; Yu-Dong Cai
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai, China
  - *Correspondence: Tao Huang; Yu-Dong Cai
|
28
|
Li Z, Guo W, Zeng T, Yin J, Feng K, Huang T, Cai YD. Detecting Brain Structure-Specific Methylation Signatures and Rules for Alzheimer's Disease. Front Neurosci 2022; 16:895181. [PMID: 35585924 PMCID: PMC9108872 DOI: 10.3389/fnins.2022.895181] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/13/2022] [Accepted: 04/11/2022] [Indexed: 01/01/2023] Open
Abstract
Alzheimer's disease (AD) is a progressive disease that leads to irreversible behavioral changes, erratic emotions, and loss of motor skills. These conditions make caring for people with AD difficult or almost impossible. Multiple internal and external pathological factors may affect or even trigger the initiation and progression of AD. DNA methylation plays one of the most effective regulatory roles during AD pathogenesis, and pathological methylation alterations may differ among the various brain structures of people with AD. Although multiple loci associated with AD initiation and progression have been identified, the spatial distribution patterns of AD-associated DNA methylation in the brain have not been clarified. We applied multiple machine learning algorithms to systematic methylation profiles of different structural brain regions. First, the profile of each brain region was analyzed by the Boruta feature filtering method. Some important methylation features were extracted and further analyzed by the max-relevance and min-redundancy method, resulting in a feature list. Then, the incremental feature selection method, incorporating some classification algorithms, used this list to identify candidate AD-associated methylation loci with structural specificity and to establish a group of quantitative rules for revealing the effects of DNA methylation in various brain regions (i.e., four brain structures) on AD pathogenesis. Furthermore, some efficient classifiers based on essential methylation sites were proposed to identify AD samples. Results revealed that methylation alterations in different brain structures contribute differently to AD pathogenesis. This study further illustrates the complex pathological mechanisms of AD.
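The max-relevance and min-redundancy (mRMR) step named in this abstract orders features greedily: at each step it picks the feature that is most related to the class label while being least related to the features already chosen. The sketch below illustrates that greedy loop under one stated simplification: absolute Pearson correlation stands in for the mutual information that mRMR normally uses, so that the example needs no external libraries. The function names are illustrative.

```python
from math import sqrt


def abs_pearson(a, b):
    """Absolute Pearson correlation; a stand-in for mutual information."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector carries no information
    return abs(cov) / sqrt(var_a * var_b)


def mrmr_rank(features, target):
    """Order feature names by greedy relevance-minus-redundancy.

    features maps a feature name to its list of values across samples;
    target holds the class label (or outcome) of each sample.
    """
    relevance = {
        name: abs_pearson(values, target) for name, values in features.items()
    }
    ranked = []
    remaining = list(features)
    while remaining:
        def score(name):
            if not ranked:
                return relevance[name]
            # Mean similarity to the features that were already selected.
            redundancy = sum(
                abs_pearson(features[name], features[chosen])
                for chosen in ranked
            ) / len(ranked)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

The practical effect is that a near-duplicate of an already selected feature is pushed down the list, even if it is strongly related to the label on its own, in favor of a weaker but independent feature.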
Affiliation(s)
- ZhanDong Li
  - College of Food Engineering, Jilin Engineering Normal University, Changchun, China
- Wei Guo
  - Key Laboratory of Stem Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Tao Zeng
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
- Jie Yin
  - Cancer Institute, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
  - Department of Human Genetics, Institute of Genetics, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, China
- KaiYan Feng
  - Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, China
- Tao Huang
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai, China
|
29
|
Li ZD, Yu X, Mei Z, Zeng T, Chen L, Xu XL, Li H, Huang T, Cai YD. Identifying luminal and basal mammary cell specific genes and their expression patterns during pregnancy. PLoS One 2022; 17:e0267211. [PMID: 35486595 PMCID: PMC9053804 DOI: 10.1371/journal.pone.0267211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2021] [Accepted: 04/05/2022] [Indexed: 11/25/2022] Open
Abstract
The mammary gland is present in all mammals and usually functions in producing milk to feed young offspring. Mammogenesis refers to the growth and development of the mammary gland, which begins at puberty and ends after lactation. Pregnancy is regulated by various cytokines, which further contribute to mammary gland development. Epithelial cells, including basal and luminal cells, are one of the major components of the mammary gland. The development of basal and luminal cells has been observed to differ significantly at different stages. However, the mechanisms underlying the differences between basal and luminal cells have not been fully studied. To explore the mechanisms underlying the differentiation of mammary progenitors or their offspring into luminal and myoepithelial cells, single-cell sequencing data on mammary epithelial cells of virgin and pregnant mice were investigated in depth in this work. We evaluated features by using Monte Carlo feature selection and plotted the incremental feature selection curve with support vector machine or RIPPER to find the optimal gene features and rules that can divide epithelial cells into four clusters with different cell subtypes, such as basal and luminal cells, and different phases, such as pregnancy and virginity. As representative examples, the feature genes Cldn7, Gjb6, Sparc, Cldn3, Cited1, Krt17, Spp1, Cldn4, Gjb2, and Cldn19 might play an important role in classifying the epithelial mammary cells. Notably, the seven most important rules, based on the combination of cell-specific and tissue-specific expression of feature genes, effectively classify the epithelial mammary cells in a quantitative and interpretable manner.
Affiliation(s)
- Zhan Dong Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Xiangtian Yu
- Shanghai Jiao Tong University Affiliated Sixth People’s Hospital, Shanghai, China
| | - Zi Mei
- Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
| | - Tao Zeng
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Xian Ling Xu
- Guangdong AIB Polytechnic College, Guangzhou, China
| | - Hao Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- * E-mail: (TH); (YDC)
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
|
30
|
Chen L, Mei Z, Guo W, Ding S, Huang T, Cai YD. Recognition of Immune Cell Markers of COVID-19 Severity with Machine Learning Methods. BIOMED RESEARCH INTERNATIONAL 2022; 2022:6089242. [PMID: 35528178 PMCID: PMC9073549 DOI: 10.1155/2022/6089242] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/06/2022] [Accepted: 04/11/2022] [Indexed: 01/08/2023]
Abstract
COVID-19 is hypothesized to be linked to the host's excessive inflammatory immunological response to SARS-CoV-2 infection, which is regarded as a major factor in disease severity and mortality. Numerous immune cells play a key role in immune response regulation, and gene expression analysis in these cells could be a useful method for studying disease states, assessing immunological responses, and detecting biomarkers. Here, we developed a machine learning procedure to find biomarkers that discriminate disease severity in individual immune cells (B cell, CD4+ cell, CD8+ cell, monocyte, and NK cell) using single-cell gene expression profiles of COVID-19. The gene features of each profile were first filtered and ranked using the Boruta feature selection method and mRMR, and the resulting ranked feature lists were then fed into the incremental feature selection method to determine the optimal number of features with decision tree and random forest algorithms. Meanwhile, we extracted the classification rules in each cell type from the optimal decision tree classifiers. The best gene sets discovered in this study were analyzed by GO and KEGG pathway enrichment, and some important biomarkers like TLR2, ITK, CX3CR1, IL1B, and PRDM1 were validated by recent literature. The findings reveal that the optimal gene sets for each cell type can accurately classify COVID-19 disease severity and provide insight into the molecular mechanisms involved in disease progression.
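The incremental feature selection step described in this abstract can be illustrated with a minimal sketch. It assumes a feature ranking is already available (for example from Boruta followed by mRMR) and, to stay dependency-free, scores each candidate subset with a leave-one-out 1-nearest-neighbour classifier rather than the decision tree and random forest used by the authors. The function names and the toy data in the usage note are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]


def loo_1nn_accuracy(X: List[Row], y: List[int], cols: List[int]) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the feature columns in `cols` (a stand-in scorer)."""
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_d = -1, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue
            d = sum((xi[c] - xj[c]) ** 2 for c in cols)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def incremental_feature_selection(
    X: List[Row],
    y: List[int],
    ranked: List[int],
    score: Callable[[List[Row], List[int], List[int]], float] = loo_1nn_accuracy,
    step: int = 1,
) -> Tuple[int, float, List[Tuple[int, float]]]:
    """Score the top-k ranked features for k = step, 2*step, ... and
    return (best_k, best_score, curve). Ties favour the smaller subset."""
    curve = [(k, score(X, y, ranked[:k]))
             for k in range(step, len(ranked) + 1, step)]
    best_k, best_score = max(curve, key=lambda t: (t[1], -t[0]))
    return best_k, best_score, curve
```

As a usage illustration, calling `incremental_feature_selection(X, y, ranked=[0, 1])` on a toy matrix whose first column separates the classes and whose second column is noise would select the single-feature subset; the returned `curve` is what an IFS plot would be drawn from.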
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai 200444, China
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Zi Mei
- Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai 200031, China
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) & Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai 200031, China
| | - ShiJian Ding
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China
|
31
|
Li Z, Guo W, Ding S, Feng K, Lu L, Huang T, Cai Y. Detecting Blood Methylation Signatures in Response to Childhood Cancer Radiotherapy via Machine Learning Methods. BIOLOGY 2022; 11:biology11040607. [PMID: 35453806 PMCID: PMC9030135 DOI: 10.3390/biology11040607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2022] [Revised: 04/09/2022] [Accepted: 04/14/2022] [Indexed: 11/16/2022]
Abstract
Radiotherapy is a helpful treatment for cancer, but it can also potentially cause changes in many molecules, resulting in adverse effects. Among these changes, the occurrence of abnormal DNA methylation patterns has alarmed scientists. To explore the influence of region-specific radiotherapy on blood DNA methylation, we designed a computational workflow by using machine learning methods that can identify crucial methylation alterations related to treatment exposure. Irrelevant methylation features from the DNA methylation profiles of 2052 childhood cancer survivors were excluded via the Boruta method, and the remaining features were ranked using the minimum redundancy maximum relevance method to generate feature lists. These feature lists were then fed into the incremental feature selection method, which uses a combination of deep forest, k-nearest neighbor, random forest, and decision tree to find the most important methylation signatures and build the best classifiers and classification rules. Several methylation signatures and rules have been discovered and confirmed, allowing for a better understanding of methylation patterns in response to different treatment exposures.
Affiliation(s)
- Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun 130052, China;
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) & Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai 200025, China;
| | - Shijian Ding
- School of Life Sciences, Shanghai University, Shanghai 200444, China;
| | - Kaiyan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510507, China;
| | - Lin Lu
- Department of Radiology, Columbia University Medical Center, New York, NY 10032, USA
- Correspondence: (L.L.); (T.H.); or (Y.C.); Tel.: +86-21-54923269 (T.H.); +86-21-66136132 (Y.C.)
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Yudong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China;
|
32
|
Zhou X, Ding S, Wang D, Chen L, Feng K, Huang T, Li Z, Cai Y. Identification of Cell Markers and Their Expression Patterns in Skin Based on Single-Cell RNA-Sequencing Profiles. Life (Basel) 2022; 12:life12040550. [PMID: 35455041 PMCID: PMC9025372 DOI: 10.3390/life12040550] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 02/10/2022] [Revised: 03/27/2022] [Accepted: 04/04/2022] [Indexed: 12/19/2022]
Abstract
Atopic dermatitis and psoriasis are members of a family of inflammatory skin disorders. Cellular immune responses in skin tissues contribute to the development of these diseases. However, their underlying immune mechanisms remain to be fully elucidated. We developed a computational pipeline for analyzing the single-cell RNA-sequencing profiles of the Human Cell Atlas skin dataset to investigate the pathological mechanisms of skin diseases. First, we applied the maximum relevance criterion and the Boruta feature selection method to exclude irrelevant gene features from the single-cell gene expression profiles of inflammatory skin disease samples and healthy controls. The retained gene features were ranked by using the Monte Carlo feature selection method on the basis of their importance, and a feature list was compiled. This list was then introduced into the incremental feature selection method that combined the decision tree and random forest algorithms to extract important cell markers and thus build excellent classifiers and decision rules. These cell markers and their expression patterns have been analyzed and validated in recent studies and are potential therapeutic and diagnostic targets for skin diseases because their expression affects the pathogenesis of inflammatory skin diseases.
Affiliation(s)
- Xianchao Zhou
- School of Life Sciences, Shanghai University, Shanghai 200444, China; (X.Z.); (S.D.)
- Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
| | - Shijian Ding
- School of Life Sciences, Shanghai University, Shanghai 200444, China; (X.Z.); (S.D.)
| | - Deling Wang
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Department of Medical Imaging, Sun Yat-sen University Cancer Center, Guangzhou 510060, China;
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China;
| | - Kaiyan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510507, China;
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Correspondence: (T.H.); (Z.L.); (Y.C.); Tel.: +86-21-54923269 (T.H.); +86-21-66136132 (Y.C.)
| | - Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun 130052, China
| | - Yudong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China; (X.Z.); (S.D.)
|
33
|
Similarity-Based Method with Multiple-Feature Sampling for Predicting Drug Side Effects. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:9547317. [PMID: 35401786 PMCID: PMC8993545 DOI: 10.1155/2022/9547317] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 07/20/2021] [Revised: 09/18/2021] [Accepted: 03/15/2022] [Indexed: 12/23/2022]
Abstract
Drugs can treat different diseases but also bring side effects. Undetected and unaccepted side effects of approved drugs can greatly harm the human body and bring huge risks for pharmaceutical companies. Traditional experimental methods used to determine side effects have several drawbacks, such as low efficiency and high cost. One alternative is to design computational methods. Previous studies modeled a binary classification problem by pairing drugs and side effects; however, their classifiers can only extract one feature from each type of drug association. The present work proposes a novel multiple-feature sampling scheme that can extract several features from one type of drug association. Thirteen classification algorithms were employed to construct classifiers with the features yielded by this scheme. Their performance was greatly improved compared with that of the classifiers that use the features yielded by the original scheme. The best performance was observed for the classifier based on random forest, with an MCC of 0.8661, an AUROC of 0.969, and an AUPR of 0.977. Finally, one key parameter in the multiple-feature sampling scheme was analyzed.
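For readers unfamiliar with the MCC reported above, the following minimal sketch computes the Matthews correlation coefficient from binary confusion-matrix counts. The function name is hypothetical, and returning 0.0 when the denominator is zero is a common convention assumed here, not something stated in the abstract.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Assumed convention: an empty row or column gives MCC = 0.
    return numerator / denominator if denominator else 0.0
```

Unlike plain accuracy, this score uses all four cells of the confusion matrix, which is why it is often reported alongside AUROC and AUPR for paired drug and side-effect datasets where class balance depends on how negatives are sampled.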
|
34
|
Li Z, Wang D, Liao H, Zhang S, Guo W, Chen L, Lu L, Huang T, Cai YD. Exploring the Genomic Patterns in Human and Mouse Cerebellums Via Single-Cell Sequencing and Machine Learning Method. Front Genet 2022; 13:857851. [PMID: 35309141 PMCID: PMC8930846 DOI: 10.3389/fgene.2022.857851] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 01/19/2022] [Accepted: 02/09/2022] [Indexed: 12/29/2022]
Abstract
In mammals, the cerebellum plays an important role in movement control. Cellular research reveals that the cerebellum contains a variety of cell subtypes, including Golgi, granule, interneuron, and unipolar brush cells. The functional characteristics of cerebellar cells differ considerably among mammalian species, potentially reflecting the development and evolution of the nervous system. In this study, we aimed to recognize the transcriptional differences between human and mouse cerebellum in four cerebellar cell subtypes by using single-cell sequencing data and machine learning methods. Sequencing data from a total of 321,387 single cells were used. The 321,387 cells included 4 cell types, i.e., Golgi (5,048, 1.57%), granule (250,307, 77.88%), interneuron (60,526, 18.83%), and unipolar brush (5,506, 1.72%) cells. Our results showed that, by using gene expression profiles as features, the optimal classification models could achieve very high, even perfect, performance for Golgi, granule, interneuron, and unipolar brush cells, suggesting a remarkable difference between the genomic profiles of human and mouse. Furthermore, a group of related genes and rules contributing to the classification was identified, which might provide helpful information for deepening the understanding of cerebellar cell heterogeneity and evolution.
Affiliation(s)
- ZhanDong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Deling Wang
- Department of Radiology, State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
| | - HuiPing Liao
- Eye Institute of Shandong University of Traditional Chinese Medicine, Jinan, China
| | - ShiQi Zhang
- Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) & Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Lin Lu
- Department of Radiology, Columbia University Medical Center, New York, NY, United States
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
|
35
|
Meng C, Ju Y, Shi H. TMPpred: A support vector machine-based thermophilic protein identifier. Anal Biochem 2022; 645:114625. [PMID: 35218736 DOI: 10.1016/j.ab.2022.114625] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/26/2021] [Revised: 02/18/2022] [Accepted: 02/21/2022] [Indexed: 11/13/2022]
Abstract
MOTIVATION Thermostability allows proteins to overcome temperature constraints and perform more functions. Using machine learning, we explored the mechanisms of and reasons for protein thermostability. RESULTS Unlike other methods that pursue only model performance, we aim to find important features so as to provide a useful reference for in vitro experiments. We cast the task as a binary classification problem, that is, distinguishing thermophilic proteins from nonthermophilic proteins. Using support vector machine-based model construction and analysis, we inferred that Gly, Ala, Ser and Thr may be the most important components at the residue level that determine the thermal stability of proteins. It is also noteworthy that our proposed model obtains an Sn of 0.892, an Sp of 0.857, an ACC of 0.87566 and an AUC of 0.874. To facilitate use by other researchers, we wrapped our model and deployed it as a web server, which is accessible at http://112.124.26.17:7000/TMPpred/index.html.
Affiliation(s)
- Chaolu Meng
- College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China; Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry, Hohhot, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China.
| | - Hua Shi
- School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China.
|
36
|
Li X, Lu L, Chen L. Identification of protein functions in mouse with a label space partition method. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:3820-3842. [PMID: 35341276 DOI: 10.3934/mbe.2022176] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Indexed: 06/14/2023]
Abstract
Proteins are very important for almost all living creatures because they participate in most complicated and essential biological processes. Determining the functions of given proteins is one of the most essential problems in protein science. Such determination can be conducted through traditional experiments. However, experimental methods are time-consuming and costly. In recent years, computational methods have provided useful aids for the identification of protein functions. This study presents a new multi-label classifier for identifying the functions of mouse proteins. Owing to the number of functional types, which were termed labels in the classification procedure, a label space partition method was employed to divide the labels into several partitions. On each partition, a multi-label classifier was constructed. The classifiers based on all partitions were integrated into the proposed classifier. The cross-validation results showed that the proposed classifier performed well. Classifiers with label partition were superior to those without label partition or with random label partition.
Affiliation(s)
- Xuan Li
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Lin Lu
- Department of Radiology, Columbia University Medical Center, New York 10032, USA
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
|
37
|
Predicting Heart Cell Types by Using Transcriptome Profiles and a Machine Learning Method. Life (Basel) 2022; 12:life12020228. [PMID: 35207515 PMCID: PMC8877019 DOI: 10.3390/life12020228] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 01/14/2022] [Revised: 01/29/2022] [Accepted: 01/29/2022] [Indexed: 11/17/2022]
Abstract
The heart is an essential organ in the human body. It contains various types of cells, such as cardiomyocytes, mesothelial cells, endothelial cells, and fibroblasts. The interactions between these cells determine the vital functions of the heart. Therefore, identifying the different cell types and revealing the expression rules in these cell types are crucial. In this study, multiple machine learning methods were used to analyze heart single-cell profiles covering 11 different heart cell types. The single-cell profiles were first analyzed via the light gradient boosting machine method to evaluate the importance of gene features in the profiling dataset, and a ranked feature list was produced. This feature list was then passed to the incremental feature selection method to identify the best features and build the optimal classifiers. The results suggested that the best decision tree (DT) and random forest classification models achieved the highest weighted F1 scores of 0.957 and 0.981, respectively. The selected features, such as NPPA, LAMA2, and DLC1, and the classification rules extracted from the optimal DT classifier were shown by recent research and enrichment analysis to play crucial roles in cardiac structure and function. In particular, some lncRNAs (LINC02019, NEAT1) were found to be quite important for the recognition of different cardiac cell types. In summary, these findings provide a solid academic foundation for the development of molecular diagnostics and biomarker discovery for cardiac diseases.
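The weighted F1 score used above to compare the classifiers can be sketched as follows: a per-class F1 is computed one-vs-rest and then averaged with weights proportional to each class's share of the true labels. This is a minimal pure-Python illustration with a hypothetical function name, not the authors' evaluation code.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total        # weight by class frequency
    return score
```

Because the weights follow class frequency, abundant cell types dominate the score; with 11 cell types of unequal size, a macro-averaged F1 (equal weights) would be the usual complement if rare types matter.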
|
38
|
Predicting RNA 5-Methylcytosine Sites by Using Essential Sequence Features and Distributions. BIOMED RESEARCH INTERNATIONAL 2022; 2022:4035462. [PMID: 35071593 PMCID: PMC8776474 DOI: 10.1155/2022/4035462] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 09/15/2021] [Revised: 12/07/2021] [Accepted: 12/22/2021] [Indexed: 12/15/2022]
Abstract
Methylation is one of the most common and significant modifications in biological systems and is mediated by multiple enzymes. Recent studies have shown that methylation is widely present in different RNA molecules. RNA methylation modifications are of various kinds, such as 5-methylcytosine (m5C). However, the functions of individual methylation sites remain to be elucidated. Testing all methylation sites relies heavily on high-throughput sequencing technology, which is expensive and labor-intensive. Thus, computational prediction approaches could serve as a substitute. In this study, multiple machine learning models were used to predict possible RNA m5C sites on the basis of mRNA sequences in human and mouse. Each site was represented by several features derived from k-mers of an RNA subsequence containing the site at its center. The powerful max-relevance and min-redundancy (mRMR) feature selection method was employed to analyze these features. The resulting feature list was fed into the incremental feature selection method, incorporating four classification algorithms, to build efficient models. Furthermore, the sites related to the features used in the models were also investigated.
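The idea of representing a site by k-mers of a subsequence centered on it can be sketched as below. The abstract does not specify the window length, the values of k, or how k-mers are turned into features, so the flank size, the use of normalized k-mer frequencies, and both function names are assumptions made for illustration only.

```python
from itertools import product
from typing import List


def window_around(seq: str, center: int, flank: int) -> str:
    """Return the subsequence with `flank` bases on each side of `center`."""
    start, end = center - flank, center + flank + 1
    if start < 0 or end > len(seq):
        raise ValueError("window extends past the sequence ends")
    return seq[start:end]


def kmer_frequencies(seq: str, k: int, alphabet: str = "ACGU") -> List[float]:
    """Normalized counts of every possible k-mer, in lexicographic order."""
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:             # skip k-mers with unknown bases
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in sorted(counts)]
```

A candidate cytosine at index `i` of a transcript would then be encoded by concatenating `kmer_frequencies(window_around(seq, i, flank), k)` for a few small values of k, and the resulting vector is what a ranking method such as mRMR would operate on.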
|
39
|
Yang Y, Chen L. Identification of Drug-Disease Associations by Using Multiple Drug and Disease Networks. Curr Bioinform 2022. [DOI: 10.2174/1574893616666210825115406] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Indexed: 11/22/2022]
Abstract
Background: Drug repositioning is a new research area in drug development. It aims to discover novel therapeutic uses of existing drugs. It could accelerate the process of designing novel drugs for some diseases and considerably decrease the cost. The traditional method to determine novel therapeutic uses of an existing drug is quite laborious. Designing computational methods is an alternative that overcomes this drawback.
Objective: This study aims to propose a novel model for the identification of drug–disease associations.
Method: Twelve drug networks and three disease networks were built and fed into a powerful network-embedding algorithm called Mashup to produce informative drug and disease features. These features were combined to represent each drug–disease association. A classic classification algorithm, random forest, was used to build the model.
Results: Tenfold cross-validation results indicated that the MCC, AUROC, and AUPR were 0.7156, 0.9280, and 0.9191, respectively.
Conclusion: The proposed model showed good performance. Some tests indicated that a small dimension of drug features and a large dimension of disease features were beneficial for constructing the model. Moreover, the model was quite robust even if some drug or disease properties were not available.
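The tenfold cross-validation protocol mentioned in the results can be sketched as an index splitter: samples are shuffled once, dealt into ten folds, and each fold serves as the test set exactly once. The function name and the fixed seed are hypothetical; the abstract does not say how the folds were drawn or whether they were stratified.

```python
import random
from typing import Iterator, List, Tuple


def kfold_indices(
    n: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)           # reproducible shuffle
    folds = [idx[i::k] for i in range(k)]      # fold sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Metrics such as MCC, AUROC, and AUPR would be computed on each held-out fold and then averaged; for paired data like drug and disease associations, one design question left open here is whether pairs sharing a drug may appear in both the training and test folds.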
Affiliation(s)
- Ying Yang
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
|
40
|
Ding S, Li H, Zhang YH, Zhou X, Feng K, Li Z, Chen L, Huang T, Cai YD. Identification of Pan-Cancer Biomarkers Based on the Gene Expression Profiles of Cancer Cell Lines. Front Cell Dev Biol 2021; 9:781285. [PMID: 34917619 PMCID: PMC8669964 DOI: 10.3389/fcell.2021.781285] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 09/22/2021] [Accepted: 11/16/2021] [Indexed: 12/12/2022]
Abstract
There are many types of cancer. Although they share some hallmarks, such as proliferation and metastasis, they still differ in many respects. They grow in different organs or tissues. Does each cancer have a unique gene expression pattern that distinguishes it from other cancer types? Since the Cancer Genome Atlas (TCGA) project, pan-cancer studies have become increasingly common. Researchers want to obtain robust gene expression signatures from pan-cancer patients, but heterogeneity causes large variance among cancer patients. The sample size needed for robust results would be too large to recruit. In this study, we tried another approach to obtain robust pan-cancer biomarkers: using cell line data to reduce the variance. We applied several advanced computational methods to analyze the Cancer Cell Line Encyclopedia (CCLE) gene expression profiles, which included 988 cell lines from 20 cancer types. Two feature selection methods, Boruta and max-relevance and min-redundancy, were applied to the cell line gene expression data one after the other, generating a feature list. This list was fed into the incremental feature selection method, incorporating one classification algorithm, to extract biomarkers and construct optimal classifiers and decision rules. The optimal classifiers provided good performance and can be useful tools for identifying cell lines from different cancer types, whereas the biomarkers (e.g., NCKAP1, TNFRSF12A, LAMB2, FKBP9, PFN2, TOM1L1) and rules identified in this work may provide a meaningful and precise reference for differentiating multiple types of cancer and contribute to the personalized treatment of tumors.
Affiliation(s)
- ShiJian Ding
- School of Life Sciences, Shanghai University, Shanghai, China
| | - Hao Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Yu-Hang Zhang
- Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, United States
| | - XianChao Zhou
- Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - KaiYan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, China
| | - ZhanDong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Tao Huang
- CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China.,CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
|
41
|
Chen L, Li Z, Zeng T, Zhang YH, Zhang S, Huang T, Cai YD. Predicting Human Protein Subcellular Locations by Using a Combination of Network and Function Features. Front Genet 2021; 12:783128. [PMID: 34804131 PMCID: PMC8603309 DOI: 10.3389/fgene.2021.783128] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 09/25/2021] [Accepted: 10/22/2021] [Indexed: 12/12/2022]
Abstract
Given the limitation of technologies, the subcellular localizations of proteins are difficult to identify. Predicting the subcellular localization and the intercellular distribution patterns of proteins in accordance with their specific biological roles, including validated functions, relationships with other proteins, and even their specific sequence characteristics, is necessary. The computational prediction of protein subcellular localizations can be performed on the basis of the sequence and the functional characteristics. In this study, the protein-protein interaction network, functional annotation of proteins and a group of direct proteins with known subcellular localization were used to construct models. To build efficient models, several powerful machine learning algorithms, including two feature selection methods, four classification algorithms, were employed. Some key proteins and functional terms were discovered, which may provide important contributions for determining protein subcellular locations. Furthermore, some quantitative rules were established to identify the potential subcellular localizations of proteins. As the first prediction model that uses direct protein annotation information (i.e., functional features) and STRING-based protein-protein interaction network (i.e., network features), our computational model can help promote the development of predictive technologies on subcellular localizations and provide a new approach for exploring the protein subcellular localization patterns and their potential biological importance.
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai, China
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - ZhanDong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Zeng
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Hang Zhang
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, United States
| | - ShiQi Zhang
- Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
Collapse
|
42
|
iMPT-FDNPL: Identification of Membrane Protein Types with Functional Domains and a Natural Language Processing Approach. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:7681497. [PMID: 34671418 PMCID: PMC8523280 DOI: 10.1155/2021/7681497] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/22/2021] [Revised: 09/15/2021] [Accepted: 09/27/2021] [Indexed: 12/20/2022]
Abstract
Membrane proteins are an important class of proteins that play essential roles in several cellular processes. Based on their intramolecular arrangements and positions in a cell, membrane proteins can be divided into several types. The type of a membrane protein is reported to be highly related to its functions. Determining membrane protein types has been a hot topic in recent years, and plenty of computational methods have been proposed so far. Some of them used functional domain information to encode proteins. However, this procedure was still crude. In this study, we designed a novel feature extraction scheme to obtain informative features of proteins from their functional domain information. The scheme treats domains as words and proteins, each represented by its domains, as sentences. The natural language processing approach, word2vector, was applied to obtain the features of domains, which were further refined to protein features. Based on these features, RAndom k-labELsets with random forest as the base classifier was employed to build the multilabel classifier, namely, iMPT-FDNPL. The tenfold cross-validation results indicated the good performance of this classifier. Furthermore, it was superior to other classifiers based on features derived from functional domains via a one-hot scheme or derived from other properties of proteins, suggesting the effectiveness of the protein features generated by the proposed scheme.
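The "domains as words, proteins as sentences" idea can be illustrated with a small sketch. Everything here is an assumption for illustration: the domain names and two-dimensional vectors are made up (in practice the vectors would come from a trained word-embedding model), and averaging domain vectors is only one simple way to turn domain features into a protein feature; the abstract does not say how the refinement was actually done.

```python
from typing import Dict, List


def protein_vector(domains: List[str],
                   domain_vectors: Dict[str, List[float]]) -> List[float]:
    """Average the embedding vectors of a protein's domains.

    `domains` plays the role of a sentence and each domain is a word.
    Averaging is an illustrative assumption, not the published method.
    """
    known = [domain_vectors[d] for d in domains if d in domain_vectors]
    if not known:
        raise ValueError("no domain of this protein has an embedding")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical toy embeddings; real ones would be learned from a corpus
# of "sentences" (one list of domains per protein).
toy_vectors = {"DOM_A": [1.0, 0.0], "DOM_B": [0.0, 1.0]}
example = protein_vector(["DOM_A", "DOM_B"], toy_vectors)
print(example)
```

The resulting fixed-length vector could then be passed to any multilabel classifier.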
|
43
|
Chen L, Zhou X, Zeng T, Pan X, Zhang YH, Huang T, Fang Z, Cai YD. Recognizing Pattern and Rule of Mutation Signatures Corresponding to Cancer Types. Front Cell Dev Biol 2021; 9:712931. [PMID: 34513841 PMCID: PMC8427289 DOI: 10.3389/fcell.2021.712931] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2021] [Accepted: 07/02/2021] [Indexed: 11/20/2022] Open
Abstract
Cancer has been generally defined as a cluster of systematic malignant pathogenesis involving abnormal cell growth. Genetic mutations derived from environmental factors and inherited genetics trigger the initiation and progression of cancers. Although several well-known factors affect cancer, mutation features and rules that affect cancers are relatively unknown due to limited related studies. In this study, a computational investigation of the mutation profiles of cancer samples from 27 cancer types was conducted. These profiles were first analyzed by the Monte Carlo Feature Selection (MCFS) method, which produced a ranked feature list. Then, the incremental feature selection (IFS) method used this list to extract essential mutation features related to the 27 cancer types, derive 207 mutation rules, and construct efficient classifiers. The top 37 mutation features corresponding to different cancer types were discussed. All the qualitatively analyzed gene mutation features contribute to the distinction of different types of cancers, and most of the mutation rules are supported by recent literature. Therefore, our computational investigation could identify potential biomarkers and prediction rules for cancers at the mutation signature level.
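Incremental feature selection can be sketched roughly as follows: given a ranked feature list, a classifier is evaluated on the top 1, top 2, ... features and the best-scoring prefix is kept. This is a minimal sketch under stated assumptions: the ranking is supplied by hand rather than computed by MCFS, the classifier is a 1-nearest-neighbour stand-in scored by leave-one-out accuracy, and the data are toy values.

```python
from typing import Callable, List, Tuple


def loo_accuracy(X: List[List[float]], y: List[int]) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(xi, X[j])),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def incremental_feature_selection(
    X: List[List[float]],
    y: List[int],
    ranked: List[int],
    score: Callable[[List[List[float]], List[int]], float] = loo_accuracy,
) -> Tuple[List[int], float]:
    """Return the best-scoring prefix of `ranked` and its score."""
    best_k, best_score = 0, -1.0
    for k in range(1, len(ranked) + 1):
        cols = ranked[:k]
        subset = [[row[c] for c in cols] for row in X]
        s = score(subset, y)
        if s > best_score:  # ties keep the smaller feature set
            best_k, best_score = k, s
    return ranked[:best_k], best_score


# Toy data: feature 0 separates the classes, feature 1 is large noise.
X = [[0.0, 5.0], [0.1, -5.0], [0.2, 4.0],
     [1.0, 5.1], [1.1, -5.1], [1.2, 4.1]]
y = [0, 0, 0, 1, 1, 1]
selected, best = incremental_feature_selection(X, y, ranked=[0, 1])
print(selected, best)
```

In this toy case adding the noisy second feature lowers the score, so only the first feature is retained.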
Affiliation(s)
- Lei Chen
  - School of Life Sciences, Shanghai University, Shanghai, China
  - College of Information Engineering, Shanghai Maritime University, Shanghai, China
- Xianchao Zhou
  - School of Life Sciences and Technology, ShanghaiTech University, Shanghai, China
  - Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Tao Zeng
  - CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- Xiaoyong Pan
  - Key Laboratory of System Control and Information Processing, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Ministry of Education of China, Shanghai, China
- Yu-Hang Zhang
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, United States
- Tao Huang
  - CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
  - Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
- Zhaoyuan Fang
  - Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Haining, China
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai, China
|
44
|
Huang GH, Zhang YH, Chen L, Li Y, Huang T, Cai YD. Identifying Lung Cancer Cell Markers with Machine Learning Methods and Single-Cell RNA-Seq Data. Life (Basel) 2021; 11:life11090940. [PMID: 34575089 PMCID: PMC8467493 DOI: 10.3390/life11090940] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2021] [Revised: 09/03/2021] [Accepted: 09/06/2021] [Indexed: 11/21/2022] Open
Abstract
Non-small cell lung cancer is a major lethal subtype of epithelial lung cancer, with high morbidity and mortality. The single-cell sequencing technique plays a key role in exploring the pathogenesis of non-small cell lung cancer. We proposed a computational method for distinguishing cell subtypes from the different pathological regions of non-small cell lung cancer on the basis of transcriptomic profiles, including a group of qualitative classification criteria (biomarkers) and various rules. The random forest classifier reached a Matthews correlation coefficient (MCC) of 0.922 by using 720 features, and the decision tree reached an MCC of 0.786 by using 1880 features. The obtained biomarkers and rules were analyzed at the end of this study.
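For readers unfamiliar with the MCC, the binary form can be computed directly from a confusion matrix. The sketch below covers only the two-class case with made-up counts; the study distinguishes several cell subtypes, so its reported values may come from a multi-class generalization that is not shown here.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0


perfect = mcc(tp=50, tn=50, fp=0, fn=0)    # every prediction correct
chance = mcc(tp=25, tn=25, fp=25, fn=25)   # no better than guessing
print(perfect, chance)
```

The MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it more informative than plain accuracy on imbalanced data.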
Affiliation(s)
- Guo-Hua Huang
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
  - Department of Mechanical and Energy Engineering, Shaoyang University, Shaoyang 422000, China
- Yu-Hang Zhang
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
- Lei Chen
  - Department of College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- You Li
  - Department of Mechanical and Energy Engineering, Shaoyang University, Shaoyang 422000, China
- Tao Huang
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai 200031, China
  - Correspondence: (T.H.); (Y.-D.C.); Tel.: +86-21-54923269 (T.H.); +86-21-66136132 (Y.-D.C.)
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
  - Correspondence: (T.H.); (Y.-D.C.); Tel.: +86-21-54923269 (T.H.); +86-21-66136132 (Y.-D.C.)
|
45
|
Zhang YH, Guo W, Zeng T, Zhang S, Chen L, Gamarra M, Mansour RF, Escorcia-Gutierrez J, Huang T, Cai YD. Identification of Microbiota Biomarkers With Orthologous Gene Annotation for Type 2 Diabetes. Front Microbiol 2021; 12:711244. [PMID: 34305880 PMCID: PMC8299781 DOI: 10.3389/fmicb.2021.711244] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2021] [Accepted: 06/21/2021] [Indexed: 01/03/2023] Open
Abstract
Type 2 diabetes (T2D) is a systematic chronic metabolic condition with abnormal sugar metabolism dysfunction, and its complications are the most harmful to human beings and may be life-threatening after long-term durations. Considering the high incidence and severity at late stage, researchers have been focusing on the identification of specific biomarkers and potential drug targets for T2D at the genomic, epigenomic, and transcriptomic levels. Microbes participate in the pathogenesis of multiple metabolic diseases including diabetes. However, the related studies are still non-systematic and lack the functional exploration on identified microbes. To fill this gap between gut microbiome and diabetes study, we first introduced eggNOG database and KEGG ORTHOLOGY (KO) database for orthologous (protein/gene) annotation of microbiota. Two datasets with these annotations were employed, which were analyzed by multiple machine-learning models for identifying significant microbiota biomarkers of T2D. The powerful feature selection method, Max-Relevance and Min-Redundancy (mRMR), was first applied to the datasets, resulting in a feature list for each dataset. Then, the list was fed into the incremental feature selection (IFS), incorporating support vector machine (SVM) as the classification algorithm, to extract essential annotations and build efficient classifiers. This study not only revealed potential pathological factors for diabetes at the microbiome level but also provided us new candidates for drug development against diabetes.
Affiliation(s)
- Yu-Hang Zhang
  - School of Life Sciences, Shanghai University, Shanghai, China
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, United States
- Wei Guo
  - Key Laboratory of Stem Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences (CAS) and Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Tao Zeng
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- ShiQi Zhang
  - Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai, China
- Margarita Gamarra
  - Department of Computational Science and Electronic, Universidad de la Costa, CUC, Barranquilla, Colombia
- Romany F Mansour
  - Department of Mathematics, Faculty of Science, New Valley University, El-Kharga, Egypt
- José Escorcia-Gutierrez
  - Electronic and Telecommunications Engineering Program, Universidad Autónoma del Caribe, Barranquilla, Colombia
- Tao Huang
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai, China
|
46
|
Chen L, Li Z, Zeng T, Zhang YH, Feng K, Huang T, Cai YD. Identifying COVID-19-Specific Transcriptomic Biomarkers with Machine Learning Methods. BIOMED RESEARCH INTERNATIONAL 2021; 2021:9939134. [PMID: 34307679 PMCID: PMC8272456 DOI: 10.1155/2021/9939134] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/17/2021] [Revised: 06/03/2021] [Accepted: 06/24/2021] [Indexed: 12/11/2022]
Abstract
COVID-19, a severe respiratory disease caused by a new type of coronavirus, SARS-CoV-2, has been spreading all over the world. Patients infected with SARS-CoV-2 may have no pathogenic symptoms, i.e., presymptomatic patients and asymptomatic patients. Both kinds of patients could further spread the virus to other susceptible people, thereby making the control of COVID-19 difficult. The two major challenges for COVID-19 diagnosis at present are as follows: (1) patients could share similar symptoms with other respiratory infections, and (2) patients may not have any symptoms but could still spread the virus. Therefore, new biomarkers at different omics levels are required for the large-scale screening and diagnosis of COVID-19. Although some initial analyses could identify a group of candidate gene biomarkers for COVID-19, the previous work still could not identify biomarkers suitable for clinical use in COVID-19, which requires disease-specific diagnosis that distinguishes it from multiple other infectious diseases. As an extension of the previous study, optimized machine learning models were applied in the present study to identify some specific qualitative host biomarkers associated with COVID-19 infection on the basis of a publicly released transcriptomic dataset, which included healthy controls and patients with bacterial infection, influenza, COVID-19, and other kinds of coronavirus. This dataset was first analysed by the Boruta and Max-Relevance and Min-Redundancy feature selection methods one by one, resulting in a feature list. This list was fed into the incremental feature selection method, incorporating one of the classification algorithms, to extract essential biomarkers and build efficient classifiers and classification rules. The capacity of these findings to distinguish COVID-19 from other similar respiratory infectious diseases at the transcriptomic level was also validated, which may improve the efficacy and accuracy of COVID-19 diagnosis.
Affiliation(s)
- Lei Chen
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Zhandong Li
  - College of Food Engineering, Jilin Engineering Normal University, Changchun 130052, China
- Tao Zeng
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Yu-Hang Zhang
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- KaiYan Feng
  - Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510507, China
- Tao Huang
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
|
47
|
Zhu W, Guo Y, Zou Q. Prediction of presynaptic and postsynaptic neurotoxins based on feature extraction. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:5943-5958. [PMID: 34517517 DOI: 10.3934/mbe.2021297] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
A neurotoxin is essentially a protein that mainly acts on the nervous system; it has a selective toxic effect on the central nervous system and neuromuscular nodes, can cause muscle paralysis and respiratory paralysis, and has strong lethality. According to their principle of action, neurotoxins are divided into presynaptic neurotoxins and postsynaptic neurotoxins. Correctly identifying presynaptic and postsynaptic neurotoxins provides important clues for future drug development and the discovery of drug targets. Therefore, a predictive model, Neu_LR, was constructed in this paper. The monoMonokGap method was used to extract the frequency characteristics of presynaptic and postsynaptic neurotoxin sequences and carry out feature selection. Then, based on the important features obtained after dimensionality reduction, the prediction model Neu_LR was constructed using a logistic regression algorithm and evaluated with ten-fold cross-validation and an independent test set. The final accuracy rates were 99.6078% and 94.1176%, respectively, which indicated that the Neu_LR model had good predictive performance and robustness and could meet the prediction requirements of presynaptic and postsynaptic neurotoxins. The data and source code of the model can be freely downloaded from https://github.com/gyx123681/.
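The ten-fold cross-validation mentioned in the abstract can be sketched as an index-splitting routine. This is a generic illustration, not the code from the authors' repository: the function name, the seed, and the sample count of 25 are all assumptions, and no stratification by class is performed.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Each sample lands in exactly one test fold across the ten rounds.
splits = list(k_fold_indices(25, k=10))
print(len(splits), [len(test) for _, test in splits])
```

A model would be trained on each `train` list and scored on the matching `test` list, and the ten scores would then be averaged.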
Affiliation(s)
- Wen Zhu
  - Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
  - Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Yuxin Guo
  - Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
  - Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
|
48
|
CWLy-RF: A novel approach for identifying cell wall lyases based on random forest classifier. Genomics 2021; 113:2919-2924. [PMID: 34186189 DOI: 10.1016/j.ygeno.2021.06.038] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2021] [Revised: 06/20/2021] [Accepted: 06/25/2021] [Indexed: 02/05/2023]
Abstract
Drug resistance of pathogenic bacteria has become increasingly serious due to the abuse of antibiotics in recent years. Researchers have found that cell wall lyases are effective antibacterial agents that can specifically recognize target bacteria and degrade bacterial peptidoglycan. Traditional wet experiments are usually expensive, time-consuming and laborious for the identification of lyases. Therefore, there is an urgent need to develop prediction tools based on computer methods to identify lyases quickly and accurately. In this paper, a new predictor, CWLy-RF, is proposed based on the random forest (RF) algorithm to identify cell wall lyases. In this method, we combined three features, namely, 400D, 188D and the composition of k-spaced amino acid group pairs, using mixed-feature representation methods. Afterward, we improved the feature representation ability with the selected top 100 features by using the information gain method and trained a predictive model using RF. The constructed prediction model is evaluated by using 10-fold cross-validation. The accuracy obtained was 96.09%, the AUC was 0.993, the MCC was 0.922, the sensitivity was 94.92%, and the specificity was 97.32%. We have proved that the proposed predictor CWLy-RF is superior to other latest models, and it will hopefully become an effective and useful tool for identifying lyases.
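Ranking features by information gain, as mentioned in the abstract, can be sketched for discrete features as below. The feature names and values are toy data, and treating features as already discretized is an assumption; the abstract does not say how the continuous sequence descriptors were handled.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values: Sequence[Hashable],
                     labels: Sequence[Hashable]) -> float:
    """Reduction in label entropy after splitting on a discrete feature."""
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / len(labels) * entropy(g)
                      for g in groups.values())
    return entropy(labels) - conditional


# Toy example: "a" mirrors the label exactly, "b" is unrelated to it.
labels = [1, 1, 0, 0]
features = {"a": [1, 1, 0, 0], "b": [1, 0, 1, 0]}
gains = {name: information_gain(vals, labels) for name, vals in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(gains, ranking)
```

Keeping the top-ranked entries of such a list (the top 100 in the abstract) gives the reduced feature set used for training.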
|
49
|
Analysis of the Sequence Characteristics of Antifreeze Protein. Life (Basel) 2021; 11:life11060520. [PMID: 34204983 PMCID: PMC8226703 DOI: 10.3390/life11060520] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2021] [Revised: 05/27/2021] [Accepted: 05/31/2021] [Indexed: 12/31/2022] Open
Abstract
Antifreeze protein (AFP) is a proteinaceous compound with improved antifreeze ability and binding ability to ice to prevent its growth. As a surface-active material, a small number of AFPs have a tremendous influence on the growth of ice. Therefore, identifying novel AFPs is important to understand protein–ice interactions and create novel ice-binding domains. To date, predicting AFPs is difficult due to their low sequence similarity for the ice-binding domain and the lack of common features among different AFPs. Here, a computational engine was developed to predict the features of AFPs and reveal the most important 39 features for AFP identification, such as antifreeze-like/N-acetylneuraminic acid synthase C-terminal, insect AFP motif, C-type lectin-like, and EGF-like domain. With this newly presented computational method, a group of previously confirmed functional AFP motifs was screened out. This study has identified some potential new AFP motifs and contributes to understanding biological antifreeze mechanisms.
|
50
|
Chen L, Li Z, Zeng T, Zhang YH, Li H, Huang T, Cai YD. Predicting gene phenotype by multi-label multi-class model based on essential functional features. Mol Genet Genomics 2021; 296:905-918. [PMID: 33914130 DOI: 10.1007/s00438-021-01789-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2021] [Accepted: 04/13/2021] [Indexed: 12/19/2022]
Abstract
Phenotype is one of the most significant concepts in genetics and is used to describe all the observable characteristics of a research object. Considering that phenotype reflects the integrated features of genotype and environmental factors, it is hard to define phenotype characteristics and even more difficult to predict unknown phenotypes. Restricted by current biological techniques, it is still quite expensive and time-consuming to obtain sufficient structural information on large-scale phenotype-associated genes/proteins. Various bioinformatics methods have been presented to solve this problem, and researchers have confirmed the efficacy and prediction accuracy of functional network-based prediction. However, general functional descriptions have highly complicated inner structures for phenotype prediction. To further address this issue and improve the efficacy of phenotype prediction on more than ten kinds of phenotypes, we first extract functional enrichment features from GO and KEGG, and then use node2vec to learn functional embedding features of genes from a gene-gene network. All these features are analyzed by some feature selection methods (Boruta, minimum redundancy maximum relevance) to generate a feature list. This list is fed into the incremental feature selection, incorporating some multi-label classifiers built by RAkEL and some classic base classifiers, to build an optimum multi-label multi-class classification model for phenotype prediction. Consistent with recent research, our method has identified many literature-supported genes/proteins and their associated phenotypes, and even some candidate genes with newly assigned phenotypes, providing a new computational tool for accurate and effective phenotypic prediction.
Affiliation(s)
- Lei Chen
  - School of Life Sciences, Shanghai University, Shanghai, 200444, People's Republic of China
  - College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China
- Zhandong Li
  - College of Food Engineering, Jilin Engineering Normal University, Changchun, 130052, People's Republic of China
- Tao Zeng
  - CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, People's Republic of China
- Yu-Hang Zhang
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Hao Li
  - College of Food Engineering, Jilin Engineering Normal University, Changchun, 130052, People's Republic of China
- Tao Huang
  - Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, 200031, People's Republic of China
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai, 200444, People's Republic of China
|