1. Guimarães PAS, Carvalho MGR, Ruiz JC. A computational framework for extracting biological insights from SRA cancer data. Sci Rep 2025; 15:8117. [PMID: 40057525 PMCID: PMC11890766 DOI: 10.1038/s41598-025-91781-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2024] [Accepted: 02/24/2025] [Indexed: 05/13/2025]
Abstract
The integration of sequenced samples and clinical data from independent yet related studies in public domain databases, such as the Sequence Read Archive (SRA), has the potential to increase sample sizes and enhance the statistical power needed for more precise bioinformatic analysis. Data mining and sample grouping are the starting points in this process and still present several challenges, including the presence of structured and unstructured data, missing deposited data, and varying experimental conditions and techniques applied across the studies. Designed to address the main challenges of data mining and sample grouping for biomarker research, the proposed methodology employs a computational approach that integrates relational database construction, text and data mining, natural language processing, network analysis, and PubMed publication searches, and combines the MeSH, TTD, and WordNet databases to identify groups of samples with the same characteristics. As a result, it identifies and illustrates relationships among sample collections, aiming to discover potential cancer biomarkers. In colorectal cancer (CRC) and acute lymphoblastic leukemia (ALL) case studies, this methodology effectively navigates SRA metadata, retrieving, extracting, and integrating data. It highlights significant connections between samples and patient clinical data, revealing important biological insights. The study grouped 2,737 (CRC) and 3,655 (ALL) samples into potential comparison groups, demonstrating the method's power in identifying relationships and aiding biomarker discovery.
Affiliation(s)
- Paul Anderson Souza Guimarães
  - Grupo Informática de Biossistemas, Bioengenharia e Genômica, Instituto René Rachou, Fiocruz Minas, Av. Augusto de Lima, 1715, Barro Preto, Belo Horizonte, MG, Brazil
  - Biologia Computacional e Sistemas (BCS), Instituto Oswaldo Cruz (IOC), Fiocruz, Rio de Janeiro, Brazil
- Maria Gabriela Reis Carvalho
  - Grupo Informática de Biossistemas, Bioengenharia e Genômica, Instituto René Rachou, Fiocruz Minas, Av. Augusto de Lima, 1715, Barro Preto, Belo Horizonte, MG, Brazil
  - Biologia Computacional e Sistemas (BCS), Instituto Oswaldo Cruz (IOC), Fiocruz, Rio de Janeiro, Brazil
- Jeronimo Conceição Ruiz
  - Grupo Informática de Biossistemas, Bioengenharia e Genômica, Instituto René Rachou, Fiocruz Minas, Av. Augusto de Lima, 1715, Barro Preto, Belo Horizonte, MG, Brazil
  - Biologia Computacional e Sistemas (BCS), Instituto Oswaldo Cruz (IOC), Fiocruz, Rio de Janeiro, Brazil
2. Luo L, Luo F, Wu C, Zhang H, Jiang Q, He S, Li W, Zhang W, Cheng Y, Yang P, Li Z, Li M, Bao Y, Jiang F. Identification of potential biomarkers in the peripheral blood of neonates with bronchopulmonary dysplasia using WGCNA and machine learning algorithms. Medicine (Baltimore) 2024; 103:e37083. [PMID: 38277517 PMCID: PMC10817126 DOI: 10.1097/md.0000000000037083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Accepted: 01/05/2024] [Indexed: 01/28/2024]
Abstract
Bronchopulmonary dysplasia (BPD) is often seen as a pulmonary complication of extreme preterm birth, resulting in persistent respiratory symptoms and diminished lung function. Unfortunately, current diagnostic and treatment options for this condition are insufficient. Hence, this study aimed to identify potential biomarkers in the peripheral blood of neonates affected by BPD. The expression dataset GSE32472 for BPD was obtained from the Gene Expression Omnibus. Initially, we identified differentially expressed genes (DEGs) in GSE32472. Subsequently, we conducted gene set enrichment analysis on the DEGs and employed weighted gene co-expression network analysis (WGCNA) to screen the most relevant modules for BPD. We then mapped the DEGs to the WGCNA module genes, resulting in a gene intersection. We conducted detailed functional enrichment analyses on these overlapping genes. To identify hub genes, we used three machine learning algorithms: SVM-RFE, LASSO, and Random Forest. We constructed a diagnostic nomogram model for predicting BPD based on the hub genes. Additionally, we carried out transcription factor analysis to predict the regulatory mechanisms and identify drugs associated with these biomarkers. Differential analysis yielded 470 DEGs, and WGCNA identified 1351 significant genes. The intersection of these two approaches yielded 273 common genes. Using machine learning algorithms, we identified CYYR1, GALNT14, and OLAH as potential biomarkers for BPD. Moreover, we predicted flunisolide, budesonide, and beclomethasone as potential anti-BPD drugs. The genes CYYR1, GALNT14, and OLAH have the potential to serve as diagnostic biomarkers for BPD. This may prove beneficial in clinical diagnosis and prevention of BPD.
Affiliation(s)
- Liyan Luo
  - Department of Neonatology, Dali Bai Autonomous Prefecture Maternal and Child Health Care Hospital, Dali, China
- Fei Luo
  - Department of Neonatology, Obstetrics and Gynecology Hospital of Fudan University, Shanghai, China
- Chuyan Wu
  - Department of Rehabilitation Medicine, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
- Hong Zhang
  - Department of Neonatology, Dali Bai Autonomous Prefecture Maternal and Child Health Care Hospital, Dali, China
- Qiaozhi Jiang
  - Department of Neonatology, Dali Bai Autonomous Prefecture Maternal and Child Health Care Hospital, Dali, China
- Sixiang He
  - Department of Neonatology, Dali Bai Autonomous Prefecture Maternal and Child Health Care Hospital, Dali, China
- Weibi Li
  - Department of Neonatology, Dali Bai Autonomous Prefecture Maternal and Child Health Care Hospital, Dali, China
- Wenlong Zhang
  - Department of Neonatology, Dali Bai Autonomous Prefecture Maternal and Child Health Care Hospital, Dali, China
- Yurong Cheng
  - Department of Neonatology, Dali Bai Autonomous Prefecture Maternal and Child Health Care Hospital, Dali, China
- Pengcheng Yang
  - Department of Neonatology, Dali Bai Autonomous Prefecture Maternal and Child Health Care Hospital, Dali, China
- Zhenghu Li
  - Department of Neonatology, Dali Bai Autonomous Prefecture Maternal and Child Health Care Hospital, Dali, China
- Min Li
  - Department of Neonatology, Dali Bai Autonomous Prefecture Maternal and Child Health Care Hospital, Dali, China
- Yunlei Bao
  - Department of Neonatology, Obstetrics and Gynecology Hospital of Fudan University, Shanghai, China
- Feng Jiang
  - Department of Neonatology, Obstetrics and Gynecology Hospital of Fudan University, Shanghai, China
3. Sora V, Tiberti M, Beltrame L, Dogan D, Robbani SM, Rubin J, Papaleo E. PyInteraph2 and PyInKnife2 to Analyze Networks in Protein Structural Ensembles. J Chem Inf Model 2023; 63:4237-4245. [PMID: 37437128 DOI: 10.1021/acs.jcim.3c00574] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 07/14/2023]
Abstract
Due to the complex nature of noncovalent interactions and their long-range effects, analyzing protein conformations using network theory can be enlightening. Protein Structure Networks (PSNs) provide a convenient formalism to study protein structures in relation to essential properties such as key residues for structural stability, allosteric communication, and the effects of modifications of the protein. PSNs can be defined according to very different principles, and the available tools have limitations in input formats, supported models, and version control. Other outstanding problems are related to the definition of network cutoffs and the assessment of the stability of the network properties. The protein science community could benefit from a common framework to carry out these analyses and make them easier to reproduce, reuse, and evaluate. We here provide two open-source software packages, PyInteraph2 and PyInKnife2, to implement and analyze PSNs in a reproducible and documented manner. PyInteraph2 interfaces with multiple formats for protein ensembles and incorporates different network models, with the possibility of integrating them into a macronetwork and performing various downstream analyses, including hubs, connected components, and several other centrality measures. The networks can be visualized or further analyzed thanks to compatibility with Cytoscape. PyInKnife2 supports the network models implemented in PyInteraph2. It employs a jackknife resampling approach to estimate the convergence of network properties and streamline the selection of distance cutoffs. We foresee that the modular structure of the code and the supported version control system will promote the transition to a community-driven effort, boost reproducibility, and establish common protocols in the PSN field. As developers, we will guarantee the introduction of new functionalities and maintenance, assistance, and training of new contributors.
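The jackknife resampling idea mentioned in this abstract can be illustrated with a generic delete-one sketch. This is not PyInKnife2's implementation or API: the `jackknife` function, the choice of statistic, and the per-block component sizes are hypothetical and only show how leaving out one block at a time gives an estimate of how stable a network property is.

```python
import math
from statistics import mean


def jackknife(samples, statistic):
    """Delete-one jackknife: return (estimate, standard_error) of `statistic`."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    # Recompute the statistic n times, each time leaving one sample out.
    leave_one_out = [statistic(samples[:i] + samples[i + 1:]) for i in range(n)]
    center = mean(leave_one_out)
    # Standard jackknife variance: (n - 1) / n * sum of squared deviations.
    variance = (n - 1) / n * sum((v - center) ** 2 for v in leave_one_out)
    return center, math.sqrt(variance)


# Hypothetical data: size of the largest connected component, computed
# separately on five blocks of a structural ensemble.
component_sizes = [120, 118, 125, 119, 121]
estimate, stderr = jackknife(component_sizes, mean)
print(f"estimate={estimate:.2f} stderr={stderr:.2f}")
```

A small standard error relative to the estimate suggests the property has converged across blocks; repeating the calculation for several distance cutoffs is one way such a measure could inform cutoff selection.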
Affiliation(s)
- Valentina Sora
  - Cancer Structural Biology, Danish Cancer Institute, Strandboulevarden 49, 2100 Copenhagen, Denmark
  - Cancer Systems Biology, Section of Bioinformatics, Department of Health and Technology, Technical University of Denmark, 2800 Lyngby, Denmark
- Matteo Tiberti
  - Cancer Structural Biology, Danish Cancer Institute, Strandboulevarden 49, 2100 Copenhagen, Denmark
- Ludovica Beltrame
  - Cancer Systems Biology, Section of Bioinformatics, Department of Health and Technology, Technical University of Denmark, 2800 Lyngby, Denmark
- Deniz Dogan
  - Cancer Structural Biology, Danish Cancer Institute, Strandboulevarden 49, 2100 Copenhagen, Denmark
- Shahriyar Mahdi Robbani
  - Cancer Structural Biology, Danish Cancer Institute, Strandboulevarden 49, 2100 Copenhagen, Denmark
- Joshua Rubin
  - Cancer Structural Biology, Danish Cancer Institute, Strandboulevarden 49, 2100 Copenhagen, Denmark
- Elena Papaleo
  - Cancer Structural Biology, Danish Cancer Institute, Strandboulevarden 49, 2100 Copenhagen, Denmark
  - Cancer Systems Biology, Section of Bioinformatics, Department of Health and Technology, Technical University of Denmark, 2800 Lyngby, Denmark
4. Agapito G, Milano M, Cannataro M. A Python Clustering Analysis Protocol of Genes Expression Data Sets. Genes (Basel) 2022; 13:1839. [PMID: 36292724 PMCID: PMC9601308 DOI: 10.3390/genes13101839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Revised: 10/05/2022] [Accepted: 10/08/2022] [Indexed: 11/16/2022]
Abstract
Gene expression and SNP data hold great potential for a new understanding of disease prognosis, drug sensitivity, and toxicity evaluations. Cluster analysis is used to analyze data that do not contain any predefined subgroups. The goal is to use the data itself to recognize meaningful and informative subgroups. In addition, cluster analysis supports data reduction, exposes hidden patterns, and generates hypotheses regarding the relationship between genes and phenotypes. Cluster analysis could also be used to identify biomarkers and yield computational predictive models. The methods used to analyze microarray data can profoundly influence the interpretation of the results. Therefore, a basic understanding of these computational tools is necessary for optimal experimental design and meaningful data analysis. This manuscript provides an analysis protocol to effectively analyze gene expression data sets through the K-means and DBSCAN algorithms. The general protocol enables analyzing omics data to identify subsets of features with low redundancy and high robustness, speeding up the identification of new biomarkers through pathway enrichment analysis. In addition, to demonstrate the effectiveness of our clustering analysis protocol, we analyze a real data set from the GEO database. Finally, the manuscript provides some best practices and tips to overcome some issues in the analysis of omics data sets through unsupervised learning.
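As a rough illustration of the K-means step named in this abstract, the sketch below clusters toy expression profiles using only the Python standard library. It is not the authors' protocol: the two-condition profiles, the Euclidean distance, and the deterministic farthest-first initialization are assumptions chosen to keep the example small and reproducible, and DBSCAN is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    centroids = farthest_first_init(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each profile goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical expression profiles (two conditions per gene).
profiles = [
    (1.0, 1.2), (0.9, 1.0), (1.1, 0.8),   # low-expression genes
    (5.0, 5.2), (4.8, 5.1), (5.2, 4.9),   # high-expression genes
]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

On real data the number of clusters and the preprocessing (normalization, filtering) matter far more than the iteration itself, which is why a documented protocol is useful.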
Affiliation(s)
- Giuseppe Agapito
  - Department of Law, Economics and Social Sciences, University Magna Græcia of Catanzaro, 88100 Catanzaro, Italy
  - Data Analytics Research Center, University Magna Græcia of Catanzaro, 88100 Catanzaro, Italy
- Marianna Milano
  - Data Analytics Research Center, University Magna Græcia of Catanzaro, 88100 Catanzaro, Italy
  - Department of Medical and Clinical Surgery, University Magna Græcia of Catanzaro, 88100 Catanzaro, Italy
- Mario Cannataro
  - Data Analytics Research Center, University Magna Græcia of Catanzaro, 88100 Catanzaro, Italy
  - Department of Medical and Clinical Surgery, University Magna Græcia of Catanzaro, 88100 Catanzaro, Italy
5. Maudsley S, Leysen H, van Gastel J, Martin B. Systems Pharmacology: Enabling Multidimensional Therapeutics. Comprehensive Pharmacology 2022:725-769. [DOI: 10.1016/b978-0-12-820472-6.00017-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/03/2025]
6. Wang Z, Meng Z, Chen C. Screening of potential biomarkers in peripheral blood of patients with depression based on weighted gene co-expression network analysis and machine learning algorithms. Front Psychiatry 2022; 13:1009911. [PMID: 36325528 PMCID: PMC9621316 DOI: 10.3389/fpsyt.2022.1009911] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 08/02/2022] [Accepted: 09/23/2022] [Indexed: 11/20/2022]
Abstract
BACKGROUND: The prevalence of depression has been increasing worldwide in recent years, posing a heavy burden on patients and society. However, the diagnostic and therapeutic tools available for this disease are inadequate. Therefore, this research focused on the identification of potential biomarkers in the peripheral blood of patients with depression. METHODS: The expression dataset GSE98793 of depression was provided by the Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/gds). Initially, differentially expressed genes (DEGs) were detected in GSE98793. Subsequently, the most relevant modules for depression were screened according to weighted gene co-expression network analysis (WGCNA). Finally, the identified DEGs were mapped to the WGCNA module genes to obtain the intersection genes. In addition, Gene Ontology (GO), Disease Ontology (DO), and Kyoto Encyclopedia of Genes and Genomes (KEGG) functional enrichment analyses were conducted on these genes. Moreover, biomarker screening was carried out by protein-protein interaction (PPI) network construction of intersection genes on the basis of various machine learning algorithms. Furthermore, gene set enrichment analysis (GSEA), immune function analysis, transcription factor (TF) analysis, and prediction of the regulatory mechanism were performed on the identified biomarkers. We also estimated the clinical diagnostic ability of the obtained biomarkers, and performed Mfuzz expression pattern clustering and functional enrichment of the most promising biomarkers to explore their regulatory mechanisms. Finally, we performed biomarker-related drug prediction. RESULTS: Differential analysis yielded a total of 550 DEGs, and WGCNA yielded 1,194 significant genes. Intersection analysis of the two yielded 140 intersection genes. Biological functional analysis indicated that these genes had a major role in inflammation-related bacterial infection pathways and cardiovascular diseases such as atherosclerosis. Subsequently, the genes S100A12, SERPINB2, TIGIT, GRB10, and LHFPL2 in peripheral serum were identified as depression biomarkers by using machine learning algorithms. Among them, S100A12 is the most valuable biomarker for clinical diagnosis. Finally, antidepressants, including disodium selenite and eplerenone, were predicted. CONCLUSION: The genes S100A12, TIGIT, SERPINB2, GRB10, and LHFPL2 in peripheral serum are viable diagnostic biomarkers for depression and may contribute to the diagnosis and prevention of depression in clinical practice.
Affiliation(s)
- Zhe Wang
  - School of Chinese Medicine, Ningxia Medical University, Yinchuan, China
- Zhe Meng
  - School of Chinese Medicine, Ningxia Medical University, Yinchuan, China
- Che Chen
  - School of Chinese Medicine, Ningxia Medical University, Yinchuan, China
7. Emmert-Streib F, Manjang K, Dehmer M, Yli-Harja O, Auvinen A. Are There Limits in Explainability of Prognostic Biomarkers? Scrutinizing Biological Utility of Established Signatures. Cancers (Basel) 2021; 13:5087. [PMID: 34680236 PMCID: PMC8533990 DOI: 10.3390/cancers13205087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2021] [Revised: 10/01/2021] [Accepted: 10/05/2021] [Indexed: 11/30/2022]
Abstract
Prognostic biomarkers can play an important role in clinical practice because they allow stratification of patients in terms of predicting the outcome of a disorder. Obstacles to developing such markers include lack of robustness when using different data sets and limited concordance among similar signatures. In this paper, we highlight a new problem that relates to the biological meaning of already established prognostic gene expression signatures. Specifically, it is commonly assumed that prognostic markers provide sensible biological information and molecular explanations about the underlying disorder. However, recent studies on prognostic biomarkers investigating 80 established signatures of breast and prostate cancer demonstrated that this is not the case. We will show that this surprising result is related to the distinction between causal models and predictive models and the obfuscating usage of these models in the biomedical literature. Furthermore, we suggest a falsification procedure for studies aiming to establish a prognostic signature to safeguard against false expectations with respect to biological utility.
Affiliation(s)
- Frank Emmert-Streib
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, 33720 Tampere, Finland
- Kalifa Manjang
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, 33720 Tampere, Finland
- Matthias Dehmer
  - Department of Computer Science, Swiss Distance University of Applied Sciences, 3900 Brig, Switzerland
  - Department of Mechatronics and Biomedical Computer Science, UMIT, 6060 Hall in Tyrol, Austria
  - College of Artificial Intelligence, Nankai University, Tianjin 300350, China
- Olli Yli-Harja
  - Computational Systems Biology, Faculty of Medicine and Health Technology, Tampere University, 33720 Tampere, Finland
  - Institute for Systems Biology, Seattle, WA 98195, USA
  - Institute of Biosciences and Medical Technology, 33720 Tampere, Finland
- Anssi Auvinen
  - Unit of Health Sciences, Faculty of Social Sciences, Tampere University, 33720 Tampere, Finland