1
|
Pirmoradi S, Hosseiniyan Khatibi SM, Zununi Vahed S, Homaei Rad H, Khamaneh AM, Akbarpour Z, Seyedrezazadeh E, Teshnehlab M, Chapman KR, Ansarin K. Unraveling the link between PTBP1 and severe asthma through machine learning and association rule mining method. Sci Rep 2023; 13:15399. [PMID: 37717070 PMCID: PMC10505163 DOI: 10.1038/s41598-023-42581-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Accepted: 09/12/2023] [Indexed: 09/18/2023] Open
Abstract
Severe asthma is a chronic inflammatory airway disease that poses great therapeutic challenges. Understanding the genetic and molecular mechanisms of severe asthma may help identify therapeutic strategies for this complex condition. RNA expression data were analyzed using a combination of artificial intelligence methods to identify novel genes related to severe asthma. Through the ANOVA feature selection approach, 100 candidate genes were selected among 54,715 mRNAs in blood samples from patients with severe asthma and from healthy controls. A deep learning model was used to validate the significance of the candidate genes. The accuracy, F1-score, AUC-ROC, and precision of the 100 genes were 83%, 0.86, 0.89, and 0.9, respectively. To discover hidden associations among the selected genes, association rule mining was applied. The top 20 genes, including PTBP1, RAB11FIP3, APH1A, and MYD88, were recognized as the most frequent items among severe asthma association rules, and PTBP1 was the gene most frequently associated with severe asthma among them. Identification of master genes involved in the initiation and development of asthma can offer novel targets for its diagnosis, prognosis, and targeted-signaling therapy.
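The ANOVA feature selection step mentioned in this abstract can be illustrated with a small sketch. This is not the authors' pipeline: the gene names, expression values, and the helper names `anova_f` and `top_k_genes` are hypothetical, and the sketch only assumes that each gene is scored by a one-way ANOVA F statistic across the sample groups and that the highest-scoring genes are kept.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups (each a list of numbers)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        # Degenerate case: no variation inside the groups.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def top_k_genes(expr, labels, k):
    """Rank genes by F statistic and return the k best.

    expr   : dict mapping gene name -> list of expression values, one per sample
    labels : list of class labels, one per sample (same order as the values)
    """
    classes = sorted(set(labels))
    scores = {}
    for gene, values in expr.items():
        groups = [[v for v, lab in zip(values, labels) if lab == c] for c in classes]
        scores[gene] = anova_f(groups)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data: three genes, three asthma and three healthy samples.
labels = ["asthma"] * 3 + ["healthy"] * 3
expr = {
    "g1": [1, 2, 3, 7, 8, 9],               # strongly separated groups
    "g2": [1, 2, 3, 1, 2, 3.5],             # weakly separated groups
    "g3": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],   # essentially identical group means
}
selected = top_k_genes(expr, labels, 2)
print(selected)
```

In the study the same idea would be applied to 54,715 mRNAs with 100 genes retained; a library implementation would normally be preferred at that scale.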
Affiliation(s)
- Saeed Pirmoradi
  - Clinical Research Development Unit of Tabriz Valiasr Hospital, Tabriz University of Medical Sciences, Tabriz, Iran
- Seyed Mahdi Hosseiniyan Khatibi
  - Kidney Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
  - Rahat Breath and Sleep Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
- Hamed Homaei Rad
  - Rahat Breath and Sleep Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
- Amir Mahdi Khamaneh
  - Faculty of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran
- Zahra Akbarpour
  - Rahat Breath and Sleep Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
- Ensiyeh Seyedrezazadeh
  - Tuberculosis and Lung Disease Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
- Mohammad Teshnehlab
  - Department of Electric and Computer Engineering, K.N. Toosi University of Technology, Tehran, Iran
- Kenneth R Chapman
  - Division of Respiratory Medicine, Department of Medicine, University of Toronto, Toronto, ON, Canada
- Khalil Ansarin
  - Rahat Breath and Sleep Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
|
2
|
Mallik S, Seth S, Si A, Bhadra T, Zhao Z. Optimal ranking and directional signature classification using the integral strategy of multi-objective optimization-based association rule mining of multi-omics data. Frontiers in Bioinformatics 2023; 3:1182176. [PMID: 37576714 PMCID: PMC10415913 DOI: 10.3389/fbinf.2023.1182176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 06/19/2023] [Indexed: 08/15/2023] Open
Abstract
Introduction: Association rule mining (ARM) is a powerful tool for exploring the informative relationships among multiple items (genes) in any dataset. The main problem of ARM is that it generates many rules containing different rule-informative values, which makes it a challenge for the user to choose the effective rules. In addition, few works have been performed on the integration of multiple biological datasets and variable cutoff values in ARM. Methods: To solve all these problems, in this article, we developed a novel framework MOOVARM (multi-objective optimized variable cutoff-based association rule mining) for multi-omics profiles. Results: In this regard, we identified the positive ideal solution (PIS), which maximized the profit and minimized the loss, and the negative ideal solution (NIS), which minimized the profit and maximized the loss for all gene sets (item sets) belonging to each extracted rule. Thereafter, we computed the distance (d+) from PIS and the distance (d-) from NIS for each gene set or product. These two distances played an important role in determining the optimized associations among various pairs of genes in the multi-omics dataset. We then globally estimated the relative closeness to PIS for ranking the gene sets. When the relative closeness score of the rule is greater than or equal to the pre-defined threshold value, the rule can be considered a final resultant rule. Moreover, MOOVARM evaluated the relative score of the rule based on the status of all genes instead of individual genes. Conclusions: MOOVARM produced the final rank of the extracted (multi-objective optimized) rules of correlated genes, which had better disease classification than the state-of-the-art algorithms on gene signature identification.
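The ranking by distances to PIS and NIS described in this abstract resembles the TOPSIS scheme, in which relative closeness is commonly defined as d- / (d+ + d-). The sketch below assumes that definition, Euclidean distances, and already-normalized criterion values; none of these details are stated in the abstract, and the rule names, criterion values, and threshold are hypothetical.

```python
import math

def relative_closeness(scores, benefit):
    """Relative closeness to the positive ideal solution for each rule.

    scores  : dict mapping rule name -> list of criterion values
    benefit : list of bools, True if the criterion is a "profit" (larger is
              better), False if it is a "loss" (smaller is better)
    """
    columns = list(zip(*scores.values()))
    pis = [max(col) if b else min(col) for col, b in zip(columns, benefit)]
    nis = [min(col) if b else max(col) for col, b in zip(columns, benefit)]
    closeness = {}
    for rule, row in scores.items():
        d_plus = math.dist(row, pis)    # distance to the positive ideal solution
        d_minus = math.dist(row, nis)   # distance to the negative ideal solution
        total = d_plus + d_minus
        closeness[rule] = d_minus / total if total else 0.0
    return closeness

# Hypothetical rules scored on one profit criterion and one loss criterion.
scores = {"r1": [0.9, 0.1], "r2": [0.5, 0.5], "r3": [0.2, 0.8]}
closeness = relative_closeness(scores, benefit=[True, False])

threshold = 0.4  # assumed pre-defined cutoff
kept = sorted(rule for rule, c in closeness.items() if c >= threshold)
print(closeness, kept)
```

Rules whose closeness meets the threshold are kept as final rules and can be ranked by the same score.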
Affiliation(s)
- Saurav Mallik
  - Environmental Health, Harvard T. H. Chan School of Public Health, Boston, MA, United States
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Soumita Seth
  - Department of Computer Science and Engineering, Brainware University, Kolkata, India
  - Department of Computer Science and Engineering, Aliah University, Kolkata, India
- Amalendu Si
  - School of Information Technology, Maulana Abul Kalam Azad University of Technology, Haringhata, India
- Tapas Bhadra
  - Department of Computer Science and Engineering, Aliah University, Kolkata, India
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States
|
3
|
Flora J, Khan W, Jin J, Jin D, Hussain A, Dajani K, Khan B. Usefulness of Vaccine Adverse Event Reporting System for Machine-Learning Based Vaccine Research: A Case Study for COVID-19 Vaccines. Int J Mol Sci 2022; 23:8235. [PMID: 35897804 PMCID: PMC9368306 DOI: 10.3390/ijms23158235] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/14/2022] [Revised: 07/06/2022] [Accepted: 07/21/2022] [Indexed: 02/04/2023] Open
Abstract
Usefulness of Vaccine-Adverse Event-Reporting System (VAERS) data and protocols required for statistical analyses were pinpointed with a set of recommendations for the application of machine learning modeling or exploratory analyses on VAERS data with a case study of COVID-19 vaccines (Pfizer-BioNTech, Moderna, Janssen). A total of 262,454 duplicate reports (29%) from 905,976 reports were identified, which were merged into a total of 643,522 distinct reports. A customized online survey was also conducted providing 211 reports. A total of 20 highest reported adverse events were first identified. Differences in results after applying various machine learning algorithms (association rule mining, self-organizing maps, hierarchical clustering, bipartite graphs) on VAERS data were noticed. Moderna reports showed injection-site-related AEs of higher frequencies by 15.2%, consistent with the online survey (12% higher reporting rate for pain in the muscle for Moderna compared to Pfizer-BioNTech). AEs {headache, pyrexia, fatigue, chills, pain, dizziness} constituted >50% of the total reports. Chest pain in male children reports was 295% higher than in female children reports. Penicillin and sulfa were of the highest frequencies (22%, and 19%, respectively). Analysis of uncleaned VAERS data demonstrated major differences from the above (7% variations). Spelling/grammatical mistakes in allergies were discovered (e.g., ~14% reports with incorrect spellings for penicillin).
Affiliation(s)
- James Flora
  - Department of Computer Science and Engineering, California State University San Bernardino, 5500 University Parkway, San Bernardino, CA 92407, USA
- Wasiq Khan
  - School of Computer Science and Mathematics, Liverpool John Moores University, Liverpool L3 3AF, UK
- Jennifer Jin
  - Department of Computer Science and Engineering, California State University San Bernardino, 5500 University Parkway, San Bernardino, CA 92407, USA
- Daniel Jin
  - Division of Vascular & Interventional Radiology, Department of Radiology, Loma Linda University Medical Center, Loma Linda, CA 92354, USA
- Abir Hussain
  - Department of Electrical Engineering, University of Sharjah, Sharjah P.O. Box 27272, United Arab Emirates
- Khalil Dajani
  - Department of Computer Science and Engineering, California State University San Bernardino, 5500 University Parkway, San Bernardino, CA 92407, USA
- Bilal Khan
  - Department of Computer Science and Engineering, California State University San Bernardino, 5500 University Parkway, San Bernardino, CA 92407, USA
  - Institute of the Environment and Sustainability, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Correspondence: Tel.: +1-(909)-537-5428
|
4
|
Sauvé JF, Emili A, Mater G. Application of Pattern Mining Methods to Assess Exposures to Multiple Airborne Chemical Agents in Two Large Occupational Exposure Databases from France. International Journal of Environmental Research and Public Health 2022; 19:1746. [PMID: 35162769 PMCID: PMC8835122 DOI: 10.3390/ijerph19031746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2021] [Revised: 01/31/2022] [Accepted: 02/01/2022] [Indexed: 12/10/2022]
Abstract
Surveys of the French working population estimate that approximately 15% of all workers may be exposed to at least three different chemical agents, but the most prevalent coexposure situations and their associated health risks remain relatively understudied. To characterize occupational coexposure situations in France, we extracted personal measurement data from COLCHIC and SCOLA, two large administrative occupational exposure databases. We selected 118 chemical agents that had ≥100 measurements with detected concentrations over the period 2010-2019, including 31 carcinogens (IARC groups 1, 2A, and 2B). We grouped measurements by work situations (WS, combination of sector, occupation, task, and year). We characterized the mixtures across WS using frequent itemset mining and association rule mining. The 275,213 measurements extracted came from 32,670 WS and encompassed 4692 unique mixtures. Workers in 32% of all WS were exposed to ≥2 agents (median 3 agents/WS) and 13% of all WS contained ≥2 carcinogens (median 2 carcinogens/WS). The most frequent coexposures were ethylbenzene-xylene (1550 WS), quartz-cristobalite (1417 WS), and toluene-xylene (1305 WS). Prevalent combinations of carcinogens also included hexavalent chromium-lead (368 WS) and benzene-ethylbenzene (314 WS). Wood dust (6% of WS exposed to at least one other agent) and asbestos (8%) had the fewest WS coexposed with other agents. Tasks with the highest proportions of coexposure to carcinogens included electric arc welding (37% of WS with coexposure), polymerization and distillation (34%), and construction drilling and excavating (34%). Overall, coexposure to multiple chemical agents, including carcinogens, was highly prevalent in the databases and should be taken into account when assessing exposure risks in the workplace.
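Counting which agents occur together across work situations is the core of the frequent itemset step described in this abstract. The sketch below is restricted to pairs and uses made-up work situations; the function names `frequent_pairs` and `rule_confidence` are hypothetical, and a real analysis would use a full frequent itemset algorithm over all itemset sizes.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(work_situations, min_count):
    """Count agent pairs that co-occur in at least min_count work situations."""
    counts = Counter()
    for agents in work_situations:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(agents)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

def rule_confidence(work_situations, antecedent, consequent):
    """Share of work situations containing the antecedent that also contain the consequent."""
    with_antecedent = [ws for ws in work_situations if antecedent in ws]
    if not with_antecedent:
        return 0.0
    return sum(consequent in ws for ws in with_antecedent) / len(with_antecedent)

# Hypothetical work situations, each given as the set of agents measured.
work_situations = [
    {"ethylbenzene", "xylene"},
    {"ethylbenzene", "xylene", "toluene"},
    {"toluene", "xylene"},
    {"quartz", "cristobalite"},
    {"wood dust"},
]
pairs = frequent_pairs(work_situations, min_count=2)
conf = rule_confidence(work_situations, "xylene", "ethylbenzene")
print(pairs, conf)
```

The confidence is directional: in this toy data every ethylbenzene situation also involves xylene, but only two of the three xylene situations involve ethylbenzene.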
|
5
|
Giulia A, Anna S, Antonia B, Dario P, Maurizio C. Extending Association Rule Mining to Microbiome Pattern Analysis: Tools and Guidelines to Support Real Applications. Frontiers in Bioinformatics 2022; 1:794547. [PMID: 36303759 PMCID: PMC9580939 DOI: 10.3389/fbinf.2021.794547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2021] [Accepted: 12/07/2021] [Indexed: 11/24/2022] Open
Abstract
Boosted by the exponential growth of microbiome-based studies, analyzing microbiome patterns is now a hot-topic, finding different fields of application. In particular, the use of machine learning techniques is increasing in microbiome studies, providing deep insights into microbial community composition. In this context, in order to investigate microbial patterns from 16S rRNA metabarcoding data, we explored the effectiveness of Association Rule Mining (ARM) technique, a supervised-machine learning procedure, to extract patterns (in this work, intended as groups of species or taxa) from microbiome data. ARM can generate huge amounts of data, making spurious information removal and visualizing results challenging. Our work sheds light on the strengths and weaknesses of pattern mining strategy into the study of microbial patterns, in particular from 16S rRNA microbiome datasets, applying ARM on real case studies and providing guidelines for future usage. Our results highlighted issues related to the type of input and the use of metadata in microbial pattern extraction, identifying the key steps that must be considered to apply ARM consciously on 16S rRNA microbiome data. To promote the use of ARM and the visualization of microbiome patterns, specifically, we developed microFIM (microbial Frequent Itemset Mining), a versatile Python tool that facilitates the use of ARM integrating common microbiome outputs, such as taxa tables. microFIM implements interest measures to remove spurious information and merges the results of ARM analysis with the common microbiome outputs, providing similar microbiome strategies that help scientists to integrate ARM in microbiome applications. With this work, we aimed at creating a bridge between microbial ecology researchers and ARM technique, making researchers aware about the strength and weaknesses of association rule mining approach.
Affiliation(s)
- Agostinetto Giulia
  - Department of Biotechnology and Biosciences, University of Milano-Bicocca, Milan, Italy
  - *Correspondence: Agostinetto Giulia
- Bruno Antonia
  - Department of Biotechnology and Biosciences, University of Milano-Bicocca, Milan, Italy
- Pescini Dario
  - Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy
- Casiraghi Maurizio
  - Department of Biotechnology and Biosciences, University of Milano-Bicocca, Milan, Italy
|
6
|
Ledmi M, Zidat S, Hamdi-Cherif A. GrAFCI+ A fast generator-based algorithm for mining frequent closed itemsets. Knowl Inf Syst 2021. [DOI: 10.1007/s10115-021-01575-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
7
|
Efficient representations of tumor diversity with paired DNA-RNA aberrations. PLoS Comput Biol 2021; 17:e1008944. [PMID: 34115745 PMCID: PMC8221796 DOI: 10.1371/journal.pcbi.1008944] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/28/2020] [Revised: 06/23/2021] [Accepted: 04/07/2021] [Indexed: 12/13/2022] Open
Abstract
Cancer cells display massive dysregulation of key regulatory pathways due to now well-catalogued mutations and other DNA-related aberrations. Moreover, enormous heterogeneity has been commonly observed in the identity, frequency and location of these aberrations across individuals with the same cancer type or subtype, and this variation naturally propagates to the transcriptome, resulting in myriad types of dysregulated gene expression programs. Many have argued that a more integrative and quantitative analysis of heterogeneity of DNA and RNA molecular profiles may be necessary for designing more systematic explorations of alternative therapies and improving predictive accuracy. We introduce a representation of multi-omics profiles which is sufficiently rich to account for observed heterogeneity and support the construction of quantitative, integrated metrics of variation. Starting from the network of interactions existing in Reactome, we build a library of “paired DNA-RNA aberrations” that represent prototypical and recurrent patterns of dysregulation in cancer; each two-gene “Source-Target Pair” (STP) consists of a “source” regulatory gene and a “target” gene whose expression is plausibly “controlled” by the source gene. The STP is then “aberrant” in a joint DNA-RNA profile if the source gene is DNA-aberrant (e.g., mutated, deleted, or duplicated), and the downstream target gene is “RNA-aberrant”, meaning its expression level is outside the normal, baseline range. With M STPs, each sample profile has exactly one of the 2^M possible configurations. We concentrate on subsets of STPs, and the corresponding reduced configurations, by selecting tissue-dependent minimal coverings, defined as the smallest family of STPs with the property that every sample in the considered population displays at least one aberrant STP within that family. These minimal coverings can be computed with integer programming.
Given such a covering, a natural measure of cross-sample diversity is the extent to which the particular aberrant STPs composing a covering vary from sample to sample; this variability is captured by the entropy of the distribution over configurations. We apply this program to data from TCGA for six distinct tumor types (breast, prostate, lung, colon, liver, and kidney cancer). This enables an efficient simplification of the complex landscape observed in cancer populations, resulting in the identification of novel signatures of molecular alterations which are not detected with frequency-based criteria. Estimates of cancer heterogeneity across tumor phenotypes reveal a stable pattern: entropy increases with disease severity. This framework is then well-suited to accommodate the expanding complexity of cancer genomes and epigenomes emerging from large consortia projects.
A large variety of genomic and transcriptomic aberrations are observed in cancer cells, and their identity, location, and frequency can be highly indicative of the particular subtype or molecular phenotype, and thereby inform treatment options. However, elucidating this association between sets of aberrations and subtypes of cancer is severely impeded by considerable diversity in the set of aberrations across samples from the same population. Most attempts at analyzing tumor heterogeneity have dealt with either the genome or transcriptome in isolation. Here we present a novel, multi-omics approach for quantifying heterogeneity by determining a small set of paired DNA-RNA aberrations that incorporates potential downstream effects on gene expression. We apply integer programming to identify a small set of paired aberrations such that at least one among them is present in every sample of a given cancer population. The resulting “coverings” are analyzed for six cancer cohorts from the Cancer Genome Atlas, and facilitate introducing an information-theoretic measure of heterogeneity. Our results identify many known facets of tumorigenesis as well as suggest potential novel genes and interactions of interest.
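The two quantities in this abstract, a minimal covering and the entropy of the distribution over configurations, can be made concrete on a toy example. The sketch below is not the authors' method: it finds a smallest covering by brute-force enumeration rather than integer programming, which is only feasible for a handful of STPs, and the sample and STP names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def minimal_covering(aberrant):
    """Smallest family of STPs such that every sample has at least one aberrant STP in it.

    aberrant : dict mapping sample name -> set of STPs that are aberrant in that sample
    Returns a tuple of STP names, or None if no covering exists.
    """
    stps = sorted(set().union(*aberrant.values()))
    for size in range(1, len(stps) + 1):
        for family in combinations(stps, size):
            if all(aberrant[s] & set(family) for s in aberrant):
                return family
    return None

def configuration_entropy(aberrant, family):
    """Shannon entropy (in bits) of the per-sample aberration patterns over the family."""
    configs = Counter(
        tuple(stp in aberrant[s] for stp in family) for s in aberrant
    )
    n = sum(configs.values())
    return -sum(c / n * math.log2(c / n) for c in configs.values())

# Hypothetical cohort of four samples and three STPs.
aberrant = {
    "s1": {"stp1"},
    "s2": {"stp1", "stp2"},
    "s3": {"stp2"},
    "s4": {"stp2", "stp3"},
}
family = minimal_covering(aberrant)
entropy = configuration_entropy(aberrant, family)
print(family, entropy)
```

With a covering of size two there are 2^2 = 4 possible configurations; three of them occur in this toy cohort, and a cohort in which every sample shared one configuration would have entropy zero.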
|
8
|
Anguita-Ruiz A, Segura-Delgado A, Alcalá R, Aguilera CM, Alcalá-Fdez J. eXplainable Artificial Intelligence (XAI) for the identification of biologically relevant gene expression patterns in longitudinal human studies, insights from obesity research. PLoS Comput Biol 2020; 16:e1007792. [PMID: 32275707 PMCID: PMC7176286 DOI: 10.1371/journal.pcbi.1007792] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 08/26/2019] [Revised: 04/22/2020] [Accepted: 03/17/2020] [Indexed: 12/18/2022] Open
Abstract
To date, several machine learning approaches have been proposed for the dynamic modeling of temporal omics data. Although they have yielded impressive results in terms of model accuracy and predictive ability, most of these applications are based on "black-box" algorithms, and the research community has called for more interpretable models. The recent eXplainable Artificial Intelligence (XAI) revolution offers a solution to this issue, where rule-based approaches are highly suitable for explanatory purposes. The further integration of the data mining process with functional-annotation and pathway analyses is an additional way towards more explanatory and biologically sound models. In this paper, we present a novel rule-based XAI strategy (including pre-processing, knowledge-extraction and functional validation) for finding biologically relevant sequential patterns from longitudinal human gene expression data (GED). To illustrate the performance of our pipeline, we work on in vivo temporal GED collected within the course of a long-term dietary intervention in 57 subjects with obesity (GSE77962). As validation populations, we employ three independent datasets following the same experimental design. As a result, we validate primarily extracted gene patterns and prove the goodness of our strategy for the mining of biologically relevant gene-gene temporal relations. Our whole pipeline has been gathered under open-source software and could be easily extended to other human temporal GED applications.
Affiliation(s)
- Augusto Anguita-Ruiz
  - Department of Biochemistry and Molecular Biology II, Institute of Nutrition and Food Technology "José Mataix", Center of Biomedical Research, University of Granada, Granada, Spain
  - Instituto de Investigación Biosanitaria ibs.GRANADA, Granada, Spain
  - CIBEROBN (Physiopathology of Obesity and Nutrition), Instituto de Salud Carlos III (ISCIII), Madrid, Spain
- Alberto Segura-Delgado
  - Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Rafael Alcalá
  - Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Concepción M. Aguilera
  - Department of Biochemistry and Molecular Biology II, Institute of Nutrition and Food Technology "José Mataix", Center of Biomedical Research, University of Granada, Granada, Spain
  - Instituto de Investigación Biosanitaria ibs.GRANADA, Granada, Spain
  - CIBEROBN (Physiopathology of Obesity and Nutrition), Instituto de Salud Carlos III (ISCIII), Madrid, Spain
- Jesús Alcalá-Fdez
  - Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
|
9
|
Nassiri I, McCall MN. Systematic exploration of cell morphological phenotypes associated with a transcriptomic query. Nucleic Acids Res 2019; 46:e116. [PMID: 30011038 PMCID: PMC6212779 DOI: 10.1093/nar/gky626] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 06/13/2017] [Accepted: 07/10/2018] [Indexed: 12/23/2022] Open
Abstract
Cell morphological phenotypes, including shape, size, intensity, and texture of cellular compartments have been shown to change in response to perturbation with small molecule compounds. Image-based cell profiling or cell morphological profiling has been used to associate changes of cell morphological features with alterations in cellular function and to infer molecular mechanisms of action. Recently, the Library of Integrated Network-based Cellular Signatures (LINCS) Project has measured gene expression and performed image-based cell profiling on cell lines treated with 9515 unique compounds. These data provide an opportunity to study the interdependence between transcription and cell morphology. Previous methods to investigate cell phenotypes have focused on targeting candidate genes as components of known pathways, RNAi morphological profiling, and cataloging morphological defects; however, these methods do not provide an explicit model to link transcriptomic changes with corresponding alterations in morphology. To address this, we propose a cell morphology enrichment analysis to assess the association between transcriptomic alterations and changes in cell morphology. Additionally, for a new transcriptomic query, our approach can be used to predict associated changes in cellular morphology. We demonstrate the utility of our method by applying it to cell morphological changes in a human bone osteosarcoma cell line.
Affiliation(s)
- Isar Nassiri
  - Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY, USA
  - Department of Oncology, Weatherall Institute for Molecular Medicine, University of Oxford, UK
- Matthew N McCall
  - Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY, USA
  - Department of Biomedical Genetics, University of Rochester Medical Center, Rochester, NY, USA
|
10
|
Díaz-Montaña JJ, Díaz-Díaz N, Barranco CD, Ponzoni I. Development and use of a Cytoscape app for GRNCOP2. Computer Methods and Programs in Biomedicine 2019; 177:211-218. [PMID: 31319950 DOI: 10.1016/j.cmpb.2019.05.030] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/17/2018] [Revised: 05/05/2019] [Accepted: 05/29/2019] [Indexed: 06/10/2023]
Abstract
BACKGROUND AND OBJECTIVE: Gene regulatory networks (GRNs) are essential for understanding most molecular processes. In this context, the so-called model-free approaches have an advantage in modeling the complex topologies behind these dynamic molecular networks, since most GRNs are difficult to map correctly by any other mathematical model. Abstract model-free approaches, also known as rule-based extraction methods, offer valuable benefits when performing data-driven analysis, such as requiring the least amount of data and simplifying the inference of large models at a faster analysis speed. In particular, GRNCOP2 is a combinatorial optimization method with an adaptive criterion for the discretization of gene expression data and high performance, in contrast to other rule-based extraction methods for discovering GRNs. However, the analysis of the large relational structures of the networks inferred by GRNCOP2 requires the support of effective tools for interactive network visualization and topological analysis of the extracted associations. This need motivated the integration of GRNCOP2 into the Cytoscape ecosystem in order to benefit from Cytoscape's core functionality, as well as all the other apps in its ecosystem. METHODS: In this paper, we introduce the implementation of a GRNCOP2 Cytoscape app. This incorporation into the Cytoscape platform includes new functionality for GRN visualizations, dynamic user interaction and integration with other apps for topological analysis of the networks. RESULTS: In order to demonstrate the usefulness of integrating GRNCOP2 in Cytoscape, the new app was used to tackle a novel use case for GRNCOP2: the analysis of crosstalk between pathways. In this regard, datasets associated with Alzheimer's disease (AD) were analyzed using the GRNCOP2 app and other apps of the Cytoscape ecosystem by performing a topological analysis of the AD progression and its synchronization with the Ubiquitin Mediated Proteolysis pathway. Finally, the biological relevance of the findings achieved by this new app was evaluated by searching for evidence in the literature. CONCLUSIONS: The proposed crosstalk analysis with the new GRNCOP2 app focused on assessing the phase of Alzheimer's disease progression in which the coordination with the Ubiquitin Mediated Proteolysis pathway increases, and on identifying the genes that explain the signalling between these cellular processes. Both questions were explored by topological contrastive analysis of the GRNs generated by the GRNCOP2 app, where several facilities of Cytoscape were exploited. The topological patterns inferred by this new app have been consistent with biological evidence reported in the scientific literature, illustrating the effectiveness of using the new GRNCOP2 app in pathway analysis. AVAILABILITY: The GRNCOP2 app is freely available at the official Cytoscape app store: http://apps.cytoscape.org/apps/grncop2.
Affiliation(s)
- Juan J Díaz-Montaña
  - Intelligent Data Analysis (DATAi), Division of Computer Science, Pablo de Olavide University, Seville ES-41013, Spain
- Norberto Díaz-Díaz
  - Intelligent Data Analysis (DATAi), Division of Computer Science, Pablo de Olavide University, Seville ES-41013, Spain
- Carlos D Barranco
  - Intelligent Data Analysis (DATAi), Division of Computer Science, Pablo de Olavide University, Seville ES-41013, Spain
- Ignacio Ponzoni
  - Instituto de Ciencias e Ingeniería de la Computación (UNS, CONICET), Departamento de Ciencias e Ingeniería de la Computación, Universidad Nacional del Sur (UNS), Bahía Blanca, Argentina
|
11
|
Xu J, Yang P, Xue S, Sharma B, Sanchez-Martin M, Wang F, Beaty KA, Dehan E, Parikh B. Translating cancer genomics into precision medicine with artificial intelligence: applications, challenges and future perspectives. Hum Genet 2019; 138:109-124. [PMID: 30671672 PMCID: PMC6373233 DOI: 10.1007/s00439-019-01970-5] [Citation(s) in RCA: 94] [Impact Index Per Article: 18.8] [Received: 09/27/2018] [Accepted: 01/02/2019] [Indexed: 02/07/2023]
Abstract
In the field of cancer genomics, the broad availability of genetic information offered by next-generation sequencing technologies and rapid growth in biomedical publication has led to the advent of the big-data era. Integration of artificial intelligence (AI) approaches such as machine learning, deep learning, and natural language processing (NLP) to tackle the challenges of scalability and high dimensionality of data and to transform big data into clinically actionable knowledge is expanding and becoming the foundation of precision medicine. In this paper, we review the current status and future directions of AI application in cancer genomics within the context of workflows to integrate genomic analysis for precision cancer care. The existing solutions of AI and their limitations in cancer genetic testing and diagnostics such as variant calling and interpretation are critically analyzed. Publicly available tools or algorithms for key NLP technologies in the literature mining for evidence-based clinical recommendations are reviewed and compared. In addition, the present paper highlights the challenges to AI adoption in digital healthcare with regard to data requirements, algorithmic transparency, reproducibility, and real-world assessment, and discusses the importance of preparing patients and physicians for modern digitized healthcare. We believe that AI will remain the main driver to healthcare transformation toward precision medicine, yet the unprecedented challenges posed should be addressed to ensure safety and beneficial impact to healthcare.
Affiliation(s)
- Jia Xu
- IBM Watson Health, Cambridge, MA, USA.
- Shang Xue
- IBM Watson Health, Cambridge, MA, USA
- Fang Wang
- IBM Watson Health, Cambridge, MA, USA
|
12
|
Nguyen D, Luo W, Phung D, Venkatesh S. LTARM: A novel temporal association rule mining method to understand toxicities in a routine cancer treatment. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.07.031] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 12/24/2022]
|
13
|
|
14
|
Mallik S, Bhadra T, Mukherji A. DTFP-Growth: Dynamic Threshold-Based FP-Growth Rule Mining Algorithm Through Integrating Gene Expression, Methylation, and Protein-Protein Interaction Profiles. IEEE Trans Nanobioscience 2018; 17:117-125. [PMID: 29870335 DOI: 10.1109/tnb.2018.2803021] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022]
Abstract
Association rule mining is an important technique for identifying interesting relationships between gene pairs in a biological data set. Earlier methods basically work for a single biological data set, and, in most cases, a single minimum support cutoff can be applied globally, i.e., across all genesets/itemsets. To overcome this limitation, in this paper, we propose a dynamic threshold-based FP-growth rule mining algorithm that integrates gene expression, methylation and protein-protein interaction profiles based on weighted shortest distance to find novel associations among different pairs of genes in multi-view data sets. For this purpose, we introduce three new thresholds, namely, Distance-based Variable/Dynamic Supports (DVS), Distance-based Variable Confidences (DVC), and Distance-based Variable Lifts (DVL) for each rule by integrating co-expression, co-methylation, and protein-protein interactions existing in the multi-omics data set. We develop the proposed algorithm utilizing these three novel multiple threshold measures. In the proposed algorithm, the values of DVS, DVC, and DVL are computed for each rule separately, and subsequently it is verified whether the support, confidence, and lift of each evolved rule are greater than or equal to the corresponding individual DVS, DVC, and DVL values, respectively. If all three conditions hold for a rule, the rule is treated as a resultant rule. One of the major advantages of the proposed method compared with other related state-of-the-art methods is that it considers both the quantitative and interactive significance among all pairwise genes belonging to each rule. Moreover, the proposed method generates fewer rules, takes less running time, and provides greater biological significance for the resultant top-ranking rules compared to previous methods.
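The acceptance test described in this abstract (support, confidence, and lift each compared against a rule-specific threshold) can be illustrated with a minimal sketch. This is not the DTFP-Growth implementation: the function names `rule_metrics` and `passes_thresholds`, the toy transactions, and the threshold values are all hypothetical, and the distance-based derivation of the thresholds is not reproduced here.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    n_a = sum(1 for t in transactions if a <= t)         # transactions containing A
    n_c = sum(1 for t in transactions if c <= t)         # transactions containing C
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # transactions containing both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


def passes_thresholds(metrics, thresholds):
    """True when every metric is >= its own (per-rule) threshold."""
    return all(m >= t for m, t in zip(metrics, thresholds))


# Toy data: each transaction is the set of genes flagged in one sample (hypothetical).
transactions = [
    {"G1", "G2", "G3"},
    {"G1", "G2"},
    {"G2", "G3"},
    {"G1", "G2", "G4"},
    {"G3", "G4"},
]

metrics = rule_metrics(transactions, {"G1"}, {"G2"})
# Hypothetical per-rule thresholds standing in for the rule's own cutoffs.
accepted = passes_thresholds(metrics, (0.5, 0.9, 1.2))
rejected = passes_thresholds(metrics, (0.7, 0.9, 1.2))
```

The point of the sketch is only that each rule carries its own triple of cutoffs, rather than one global minimum support shared by all itemsets.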
|
15
|
|
16
|
GeRNet: a gene regulatory network tool. Biosystems 2017; 162:1-11. [PMID: 28860069 DOI: 10.1016/j.biosystems.2017.08.006] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 12/30/2016] [Revised: 03/27/2017] [Accepted: 08/17/2017] [Indexed: 11/21/2022]
Abstract
Gene regulatory networks (GRNs) are crucial in every process of life since they govern the majority of the molecular processes. Therefore, the task of assembling these networks is highly important. In particular, the so-called model-free approaches have an advantage in modeling the complexities of dynamic molecular networks, since most gene networks are hard to map accurately with any other mathematical model. A highly abstract model-free approach, called the rule-based approach, offers several advantages for data-driven analysis, such as requiring the least amount of data. Rule-based approaches also have an important ability to perform inferences: their simplicity allows the inference of large models with a higher speed of analysis. However, with these techniques the reconstruction of the relational structure of the network is partial, hence incomplete, for an effective biological analysis. This situation motivated us to explore the possibility of hybridizing with other approaches, such as biclustering techniques. This led us to incorporate a biclustering tool that finds new relations between the nodes of the GRN. In this work we present new software, called GeRNet, that integrates the algorithms of GRNCOP2 and BiHEA along with a set of tools for interactive visualization, statistical analysis and ontological enrichment of the resulting GRNs. In this regard, results associated with Alzheimer disease datasets are presented that show the usefulness of integrating both bioinformatics tools.
|
17
|
Henriques R, Madeira SC. BiC2PAM: constraint-guided biclustering for biological data analysis with domain knowledge. Algorithms Mol Biol 2016; 11:23. [PMID: 27651825 PMCID: PMC5024481 DOI: 10.1186/s13015-016-0085-5] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 02/02/2016] [Accepted: 08/16/2016] [Indexed: 11/10/2022] Open
Abstract
Background Biclustering has been largely used in biological data analysis, enabling the discovery of putative functional modules from omic and network data. Despite the recognized importance of incorporating domain knowledge to guide biclustering and guarantee a focus on relevant and non-trivial biclusters, this possibility has not yet been comprehensively addressed. This results from the fact that the majority of existing algorithms are only able to deliver sub-optimal solutions with restrictive assumptions on the structure, coherency and quality of biclustering solutions, thus preventing the up-front satisfaction of knowledge-driven constraints. Interestingly, in recent years, a clearer understanding of the synergies between pattern mining and biclustering gave rise to a new class of algorithms, termed as pattern-based biclustering algorithms. These algorithms, able to efficiently discover flexible biclustering solutions with optimality guarantees, are thus positioned as good candidates for knowledge incorporation. In this context, this work aims to bridge the current lack of solid views on the use of background knowledge to guide (pattern-based) biclustering tasks. Methods This work extends (pattern-based) biclustering algorithms to guarantee the satisfiability of constraints derived from background knowledge and to effectively explore efficiency gains from their incorporation. In this context, we first show the relevance of constraints with succinct, (anti-)monotone and convertible properties for the analysis of expression data and biological networks. We further show how pattern-based biclustering algorithms can be adapted to effectively prune of the search space in the presence of such constraints, as well as be guided in the presence of biological annotations. Relying on these contributions, we propose BiClustering with Constraints using PAttern Mining (BiC2PAM), an extension of BicPAM and BicNET biclustering algorithms. 
Results Experimental results on biological data demonstrate the importance of incorporating knowledge within biclustering to foster efficiency and enable the discovery of non-trivial biclusters with heightened biological relevance. Conclusions This work provides the first comprehensive view and sound algorithm for biclustering biological data with constraints derived from user expectations, knowledge repositories and/or literature.
|
18
|
Granular Computing Techniques for Classification and Semantic Characterization of Structured Data. Cognit Comput 2015. [DOI: 10.1007/s12559-015-9369-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 10/22/2022]
|
19
|
Gallo CA, Cecchini RL, Carballido JA, Micheletto S, Ponzoni I. Discretization of gene expression data revised. Brief Bioinform 2015; 17:758-70. [PMID: 26438418 DOI: 10.1093/bib/bbv074] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 05/22/2015] [Indexed: 01/22/2023] Open
Abstract
Gene expression measurements represent the most important source of biological data used to unveil the interaction and functionality of genes. In this regard, several data mining and machine learning algorithms have been proposed that require, in a number of cases, some kind of data discretization to perform the inference. Selection of an appropriate discretization process has a major impact on the design and outcome of the inference algorithms, as there are a number of relevant issues that need to be considered. This study presents a revision of the current state-of-the-art discretization techniques, together with the key subjects that need to be considered when designing or selecting a discretization approach for gene expression data.
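As a concrete illustration of what "discretization" means for expression values, the following sketch shows two simple schemes: binarization around the median and equal-width binning. These are generic textbook techniques chosen for illustration, not necessarily the ones the review recommends; the function names and the sample values are hypothetical.

```python
from statistics import median


def binarize_by_median(values):
    """Map each value to 1 if it is above the median of the profile, else 0."""
    m = median(values)
    return [1 if v > m else 0 for v in values]


def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins, labeled 0..k-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant profile carries no information; put everything in bin 0.
        return [0] * len(values)
    width = (hi - lo) / k
    # min(..., k - 1) keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


# Hypothetical expression profile of one gene across six samples.
expression = [2.0, 3.5, 8.0, 9.5, 5.0, 1.0]

binary_profile = binarize_by_median(expression)
three_level_profile = equal_width_bins(expression, 3)
```

The choice between such schemes changes which samples count as "expressed", which in turn changes the itemsets and rules that a downstream mining algorithm can find.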
|
20
|
An Association Rule Mining Approach to Discover lncRNAs Expression Patterns in Cancer Datasets. Biomed Res Int 2015; 2015:146250. [PMID: 26273587 PMCID: PMC4530207 DOI: 10.1155/2015/146250] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 01/30/2015] [Revised: 04/20/2015] [Accepted: 04/21/2015] [Indexed: 01/17/2023]
Abstract
In the past few years, the role of long noncoding RNAs (lncRNAs) in tumor development and progression has been disclosed although their mechanisms of action remain to be elucidated. An important contribution to the comprehension of lncRNA biology in cancer could be obtained through the integrated analysis of multiple expression datasets. However, the growing availability of public datasets requires new data mining techniques to integrate and describe relationships among data. In this perspective, we explored the power of the Association Rule Mining (ARM) approach in gene expression data analysis. By the ARM method, we performed a meta-analysis of cancer-related microarray data which allowed us to identify and characterize a set of ten lncRNAs simultaneously altered in different brain tumor datasets. The expression profiles of the ten lncRNAs appeared to be sufficient to distinguish between cancer and normal tissues. A further characterization of this lncRNA signature through a comodulation expression analysis suggested that biological processes specific to the nervous system could be compromised.
|
21
|
Henriques R, Madeira SC. Biclustering with Flexible Plaid Models to Unravel Interactions between Biological Processes. IEEE/ACM Trans Comput Biol Bioinform 2015; 12:738-752. [PMID: 26357312 DOI: 10.1109/tcbb.2014.2388206] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 06/05/2023]
Abstract
Genes can participate in multiple biological processes at a time and thus their expression can be seen as a composition of the contributions from the active processes. Biclustering under a plaid assumption allows the modeling of interactions between transcriptional modules or biclusters (subsets of genes with coherence across subsets of conditions) by assuming an additive composition of contributions in their overlapping areas. Despite the biological interest of plaid models, few biclustering algorithms consider plaid effects and, when they do, they place restrictions on the allowed types and structures of biclusters, and suffer from robustness problems by seizing exact additive matchings. We propose BiP (Biclustering using Plaid models), a biclustering algorithm with relaxations to allow expression levels to change in overlapping areas according to biologically meaningful assumptions (weighted and noise-tolerant composition of contributions). BiP can be used over existing biclustering solutions (seizing their benefits) as it is able to recover excluded areas due to unaccounted plaid effects and detect noisy areas non-explained by a plaid assumption, thus producing an explanatory model of overlapping transcriptional activity. Experiments on synthetic data support BiP's efficiency and effectiveness. The learned models from expression data unravel meaningful and non-trivial functional interactions between biological processes associated with putative regulatory modules.
|
22
|
Henriques R, Madeira SC. BicPAM: Pattern-based biclustering for biomedical data analysis. Algorithms Mol Biol 2014; 9:27. [PMID: 25649207 PMCID: PMC4302537 DOI: 10.1186/s13015-014-0027-z] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 01/18/2014] [Accepted: 11/12/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Biclustering, the discovery of sets of objects with a coherent pattern across a subset of conditions, is a critical task to study a wide-set of biomedical problems, where molecular units or patients are meaningfully related with a set of properties. The challenging combinatorial nature of this task led to the development of approaches with restrictions on the allowed type, number and quality of biclusters. Contrasting, recent biclustering approaches relying on pattern mining methods can exhaustively discover flexible structures of robust biclusters. However, these approaches are only prepared to discover constant biclusters and their underlying contributions remain dispersed. METHODS The proposed BicPAM biclustering approach integrates existing principles made available by state-of-the-art pattern-based approaches with two new contributions. First, BicPAM is the first efficient attempt to exhaustively mine non-constant types of biclusters, including additive and multiplicative coherencies in the presence or absence of symmetries. Second, BicPAM provides strategies to effectively compose different biclustering structures and to handle arbitrary levels of noise inherent to data and with discretization procedures. RESULTS Results show BicPAM's superiority against its peers and its ability to retrieve unique types of biclusters of interest, to efficiently deliver exhaustive solutions and to successfully recover planted biclusters in datasets with varying levels of missing values and noise. Its application over gene expression data leads to unique solutions with heightened biological relevance. CONCLUSIONS BicPAM approaches integrate existing disperse efforts towards pattern-based biclustering and provides the first critical strategies to efficiently discover exhaustive solutions of biclusters with shifting, scaling and symmetric assumptions with varying quality and underlying structures. 
Additionally, BicPAM dynamically adapts its behavior to mine data with different levels of missing values and noise.
Affiliation(s)
- Rui Henriques
- INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
- Sara C Madeira
- INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
|
23
|
Cremaschi P, Rovida S, Sacchi L, Lisa A, Calvi F, Montecucco A, Biamonti G, Bione S, Sacchi G. CorrelaGenes: a new tool for the interpretation of the human transcriptome. BMC Bioinformatics 2014; 15 Suppl 1:S6. [PMID: 24564370 PMCID: PMC4016313 DOI: 10.1186/1471-2105-15-s1-s6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/22/2022] Open
Abstract
Background The amount of gene expression data available in public repositories has grown exponentially in recent years, now requiring new data mining tools to transform them into information easily accessible to biologists. Results By exploiting expression data publicly available in the Gene Expression Omnibus (GEO) database, we developed a new bioinformatics tool aimed at the identification of genes whose expression appeared simultaneously altered in different experimental conditions, thus suggesting co-regulation or coordinated action in the same biological process. To accomplish this task, we used the 978 human GEO Curated DataSets and we manually performed the selection of 2,109 pair-wise comparisons based on their biological rationale. The lists of differentially expressed genes, obtained from the selected comparisons, were stored in a PostgreSQL database and used as data source for the CorrelaGenes tool. Our application uses a customized Association Rule Mining (ARM) algorithm to identify sets of genes showing expression profiles correlated with a gene of interest. The significance of the correlation is measured by coupling the Lift, a well-known standard ARM index, with the χ² p-value. The manually curated selection of the comparisons and the developed algorithm constitute a new approach in the field of gene expression profiling studies. A simulation performed on 100 randomly selected target genes allowed us to evaluate the efficiency of the procedure and to obtain preliminary data demonstrating the consistency of the results. Conclusions The preliminary results of the simulation showed how CorrelaGenes could contribute to the characterization of molecular pathways and biological processes by integrating data obtained from other applications and available in public repositories.
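The pairing of lift with a chi-square p-value mentioned in this abstract can be sketched for a single gene pair as follows. This is an assumption-laden illustration, not the CorrelaGenes algorithm: it uses the plain Pearson chi-square statistic on a 2x2 table (one degree of freedom, no continuity correction), and the function name and counts are hypothetical.

```python
import math


def lift_and_chi2(n_both, n_a_only, n_b_only, n_neither):
    """Lift and Pearson chi-square (1 d.o.f.) for two binary events A and B.

    The four counts are the cells of a 2x2 contingency table, e.g. the number
    of comparisons in which gene A and gene B are both altered, only A is
    altered, only B is altered, or neither is. All margins must be non-zero.
    """
    n = n_both + n_a_only + n_b_only + n_neither
    n_a = n_both + n_a_only
    n_b = n_both + n_b_only

    # Lift: observed co-occurrence rate divided by the rate expected under independence.
    lift = (n_both / n) / ((n_a / n) * (n_b / n))

    observed = [n_both, n_a_only, n_b_only, n_neither]
    expected = [
        n_a * n_b / n,
        n_a * (n - n_b) / n,
        (n - n_a) * n_b / n,
        (n - n_a) * (n - n_b) / n,
    ]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # For one degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return lift, chi2, p_value


# Hypothetical counts over 100 comparisons.
lift, chi2, p_value = lift_and_chi2(30, 10, 10, 50)
# A table that matches independence exactly gives lift 1 and p-value 1.
lift_ind, chi2_ind, p_ind = lift_and_chi2(16, 24, 24, 36)
```

Lift alone says how much more often two genes co-occur than chance would predict; the p-value adds a check that the excess is unlikely to come from small counts.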
|
24
|
Naulaerts S, Meysman P, Bittremieux W, Vu TN, Vanden Berghe W, Goethals B, Laukens K. A primer to frequent itemset mining for bioinformatics. Brief Bioinform 2013; 16:216-31. [PMID: 24162173 PMCID: PMC4364064 DOI: 10.1093/bib/bbt074] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.6] [Indexed: 01/17/2023] Open
Abstract
Over the past two decades, pattern mining techniques have become an integral part of many bioinformatics solutions. Frequent itemset mining is a popular group of pattern mining techniques designed to identify elements that frequently co-occur. An archetypical example is the identification of products that often end up together in the same shopping basket in supermarket transactions. A number of algorithms have been developed to address variations of this computationally non-trivial problem. Frequent itemset mining techniques are able to efficiently capture the characteristics of (complex) data and succinctly summarize it. Owing to these and other interesting properties, these techniques have proven their value in biological data analysis. Nevertheless, information about the bioinformatics applications of these techniques remains scattered. In this primer, we introduce frequent itemset mining and their derived association rules for life scientists. We give an overview of various algorithms, and illustrate how they can be used in several real-life bioinformatics application domains. We end with a discussion of the future potential and open challenges for frequent itemset mining in the life sciences.
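The shopping-basket example in this abstract can be made concrete with a small level-wise (Apriori-style) search. This is a minimal sketch for illustration only, not code from the primer; the function name `frequent_itemsets`, the baskets, and the 0.5 support cutoff are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # size-1 candidates
    result = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        result.update(frequent)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return result


# Hypothetical supermarket baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

found = frequent_itemsets(baskets, min_support=0.5)
```

In a bioinformatics setting the baskets would be samples or experiments and the items would be, for example, genes flagged as over-expressed in each one.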
|
25
|
Chan WC, Ho MR, Li SC, Tsai KW, Lai CH, Hsu CN, Lin WC. MetaMirClust: discovery of miRNA cluster patterns using a data-mining approach. Genomics 2012; 100:141-8. [PMID: 22735742 DOI: 10.1016/j.ygeno.2012.06.007] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Received: 03/05/2012] [Revised: 06/15/2012] [Accepted: 06/15/2012] [Indexed: 10/28/2022]
Abstract
Recent genome-wide surveys on ncRNA have revealed that a substantial fraction of miRNA genes is likely to form clusters. However, the evolutionary and biological function implications of clustered miRNAs are still elusive. After identifying clustered miRNA genes under different maximum inter-miRNA distances (MIDs), this study intended to reveal evolution conservation patterns among these clustered miRNA genes in metazoan species using a computation algorithm. As examples, a total of 15-35% of known and predicted miRNA genes in nine selected species constitute clusters under the MIDs ranging from 1kb to 50kb. Intriguingly, 33 out of 37 metazoan miRNA clusters in 56 metazoan genomes are co-conserved with their up/down-stream adjacent protein-coding genes. Meanwhile, a co-expression pattern of miR-1 and miR-133a in the mir-133-1 cluster has been experimentally demonstrated. Therefore, the MetaMirClust database provides a useful bioinformatic resource for biologists to facilitate the advanced interrogations on the composition of miRNA clusters and their evolution patterns.
Affiliation(s)
- Wen-Ching Chan
- Institute of Biomedical Informatics, National Yang-Ming University, Academia Sinica, Taipei, Taiwan, ROC.
|
26
|
|
27
|
Liu YC, Cheng CP, Tseng VS. Discovering relational-based association rules with multiple minimum supports on microarray datasets. Bioinformatics 2011; 27:3142-8. [PMID: 21926125 DOI: 10.1093/bioinformatics/btr526] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Indexed: 11/12/2022]
Abstract
MOTIVATION Association rule analysis methods are important techniques applied to gene expression data for finding expression relationships between genes. However, previous methods implicitly assume that all genes have similar importance, or they ignore the individual importance of each gene. The relation intensity between any two items has never been taken into consideration. Therefore, we proposed a technique named the REMMAR (RElational-based Multiple Minimum supports Association Rules) algorithm to tackle this problem. This method adjusts the minimum relation support (MRS) for each gene pair depending on the regulatory relation intensity to discover more important association rules with stronger biological meaning. RESULTS In the actual case study of this research, REMMAR utilized the shortest distance between any two genes in the Saccharomyces cerevisiae gene regulatory network (GRN) as the relation intensity to discover the association rules from two S. cerevisiae gene expression datasets. Under experimental evaluation, REMMAR can generate more rules with stronger relation intensity, and filter out rules without biological meaning in the protein-protein interaction network (PPIN). Furthermore, when the discovered rules are checked using a literature survey, the proposed method has a higher precision (100%) than the reference Apriori method (87.5%). Therefore, the proposed REMMAR algorithm can discover stronger association rules in biological relationships dissimilated by traditional methods to assist biologists in complicated genetic exploration.
Affiliation(s)
- Yu-Cheng Liu
- Department of Computer Science and Information Engineering and Institute of Medical Informatics, National Cheng Kung University, Taiwan
|
28
|
Fontanillo C, Nogales-Cadenas R, Pascual-Montano A, De Las Rivas J. Functional analysis beyond enrichment: non-redundant reciprocal linkage of genes and biological terms. PLoS One 2011; 6:e24289. [PMID: 21949701 PMCID: PMC3174934 DOI: 10.1371/journal.pone.0024289] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Received: 05/18/2011] [Accepted: 08/03/2011] [Indexed: 11/30/2022] Open
Abstract
Functional analysis of large sets of genes and proteins is becoming more and more necessary with the increase of experimental biomolecular data at omic-scale. Enrichment analysis is by far the most popular available methodology to derive functional implications of sets of cooperating genes. The problem with these techniques lies in the redundancy of the resulting information, which in most cases generates many trivial results with a high risk of masking key biological events. We present and describe a computational method, called GeneTerm Linker, that filters and links enriched output data identifying sets of associated genes and terms, producing metagroups of coherent biological significance. The method uses fuzzy reciprocal linkage between genes and terms to unravel their functional convergence and associations. The algorithm is tested with a small set of well known interacting proteins from yeast and with a large collection of reference sets from three heterogeneous resources: multiprotein complexes (CORUM), cellular pathways (SGD) and human diseases (OMIM). Statistical Precision, Recall and balanced F-score are calculated showing robust results, even when different levels of random noise are included in the test sets. Although we could not find an equivalent method, we present a comparative analysis with a widely used method that combines enrichment and functional annotation clustering. A web application to use the method here proposed is provided at http://gtlinker.cnb.csic.es.
Affiliation(s)
- Celia Fontanillo
- Cancer Research Center (CiC-IBMCC, CSIC/USAL), Campus Miguel de Unamuno, Salamanca, Spain
- Ruben Nogales-Cadenas
- National Center of Biotechnology (CNB, CSIC), Campus de Cantoblanco UAM, Madrid, Spain
- Javier De Las Rivas
- Cancer Research Center (CiC-IBMCC, CSIC/USAL), Campus Miguel de Unamuno, Salamanca, Spain
|
29
|
Gallo CA, Carballido JA, Ponzoni I. Discovering time-lagged rules from microarray data using gene profile classifiers. BMC Bioinformatics 2011; 12:123. [PMID: 21524308 PMCID: PMC3111372 DOI: 10.1186/1471-2105-12-123] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 10/08/2010] [Accepted: 04/27/2011] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND Gene regulatory networks have an essential role in every process of life. In this regard, the amount of genome-wide time series data is becoming increasingly available, providing the opportunity to discover the time-delayed gene regulatory networks that govern the majority of these molecular processes. RESULTS This paper aims at reconstructing gene regulatory networks from multiple genome-wide microarray time series datasets. In this sense, a new model-free algorithm called GRNCOP2 (Gene Regulatory Network inference by Combinatorial OPtimization 2), which is a significant evolution of the GRNCOP algorithm, was developed using combinatorial optimization of gene profile classifiers. The method is capable of inferring potential time-delay relationships with any span of time between genes from various time series datasets given as input. The proposed algorithm was applied to time series data composed of twenty yeast genes that are highly relevant for the cell-cycle study, and the results were compared against several related approaches. The outcomes have shown that GRNCOP2 outperforms the contrasted methods in terms of the proposed metrics, and that the results are consistent with previous biological knowledge. Additionally, a genome-wide study on multiple publicly available time series data was performed. In this case, the experimentation has exhibited the soundness and scalability of the new method which inferred highly-related statistically-significant gene associations. CONCLUSIONS A novel method for inferring time-delayed gene regulatory networks from genome-wide time series datasets is proposed in this paper. The method was carefully validated with several publicly available data sets. The results have demonstrated that the algorithm constitutes a usable model-free approach capable of predicting meaningful relationships between genes, revealing the time-trends of gene regulation.
Affiliation(s)
- Cristian A Gallo
- Laboratorio de Investigación y Desarrollo en Computación Científica (LIDeCC), Departamento de Ciencias e Ingeniería de la Computación, Universidad Nacional del Sur, Av, Alem 1253, 8000, Bahía Blanca, Argentina
|