1
|
Machine learning for hand pose classification from phasic and tonic EMG signals during bimanual activities in virtual reality. Front Neurosci 2024; 18:1329411. [PMID: 38737097 PMCID: PMC11082314 DOI: 10.3389/fnins.2024.1329411] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2023] [Accepted: 04/12/2024] [Indexed: 05/14/2024]
Abstract
Myoelectric prostheses have recently shown significant promise for restoring hand function in individuals with upper limb loss or deficiencies, driven by advances in machine learning and increasingly accessible bioelectrical signal acquisition devices. Here, we first introduce and validate a novel experimental paradigm using a virtual reality headset equipped with hand-tracking capabilities to facilitate the recordings of synchronized EMG signals and hand pose estimation. Using both the phasic and tonic EMG components of data acquired through the proposed paradigm, we compare hand gesture classification pipelines based on standard signal processing features, convolutional neural networks, and covariance matrices with Riemannian geometry computed from raw or xDAWN-filtered EMG signals. We demonstrate the performance of the latter for gesture classification using EMG signals. We further hypothesize that introducing physiological knowledge in machine learning models will enhance their performances, leading to better myoelectric prosthesis control. We demonstrate the potential of this approach by using the neurophysiological integration of the "move command" to better separate the phasic and tonic components of the EMG signals, significantly improving the performance of sustained posture recognition. These results pave the way for the development of new cutting-edge machine learning techniques, likely refined by neurophysiology, that will further improve the decoding of real-time natural gestures and, ultimately, the control of myoelectric prostheses.
|
2
|
Improving Infinium MethylationEPIC data processing: re-annotation of enhancers and long noncoding RNA genes and benchmarking of normalization methods. Epigenetics 2022; 17:2434-2454. [DOI: 10.1080/15592294.2022.2135201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
|
3
|
The role of diversity and ensemble learning in credit card fraud detection. ADV DATA ANAL CLASSI 2022:1-25. [PMID: 36188101 PMCID: PMC9516537 DOI: 10.1007/s11634-022-00515-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2021] [Revised: 07/18/2022] [Accepted: 08/08/2022] [Indexed: 10/24/2022]
Abstract
The number of daily credit card transactions is growing inexorably: the expansion of the e-commerce market and the recent constraints imposed by the COVID-19 pandemic have significantly increased the use of electronic payments. The ability to precisely detect fraudulent transactions is increasingly important, and machine learning models are now a key component of the detection process. Standard machine learning techniques are widely employed, but they are inadequate for the evolving nature of customers' behavior, which entails continuous changes in the underlying data distribution. This problem is often tackled by discarding past knowledge, despite its potential relevance in the case of recurrent concepts. Appropriate exploitation of historical knowledge is necessary: we propose a learning strategy that relies on diversity-based ensemble learning and makes it possible to preserve past concepts and reuse them for faster adaptation to changes. In our experiments, we adopt several state-of-the-art diversity measures and perform comparisons with various other learning approaches. We assess the effectiveness of our proposed learning strategy on extracts of two real datasets from two European countries, containing more than 30 M and 50 M transactions, provided by our industrial partner, Worldline, a leading company in the field.
|
4
|
Abstract 6283: Reannotation and normalisation of Infinium 850k data for improved analysis of methylation at enhancers and non-coding RNAs. Cancer Res 2022. [DOI: 10.1158/1538-7445.am2022-6283] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
In this study, we evaluated the MethylationEPIC BeadChip (850k) technology for enhancer methylation analysis with respect to RRBS, both high-throughput technologies commonly used to screen large patient cohorts. We developed and applied a new approach to re-annotate the 850k data, which greatly improved the association of probes to enhancers and show that 850k targets more enhancers than RRBS. We further investigated the reproducibility of the two technologies and applied various existing normalization methods to 850k data that reduce variability between replicates and variability with other technologies. We thereby showed that normalization methods developed for 450k greatly reduced variability in 850k data to a level below that of highly variable RRBS data. We finally performed differential methylation analysis with 850k data from breast cancer samples applying our new re-annotation method and highlight that the majority of differentially methylated cytosines were detected with probes specific to the 850k mapping to enhancers, confirming the deregulation of enhancer methylation in breast cancer. In summary, our study provides a new annotation for the 850k array, which greatly increases the number of cytosines mapping to enhancers and therefore allows for the improved analysis of enhancer methylation. Overall, we conclude that the 850k array allows for detection of methylation changes in regions not covered by previous Infinium arrays and is the best choice, as compared to RRBS, for high-throughput analysis of enhancer methylation in large clinical cohorts.
Citation Format: Martin Bizet, Matthieu Defrance, Emilie Calonne, Gianluca Bontempi, François Fuks, Jana Jeschke. Reannotation and normalisation of Infinium 850k data for improved analysis of methylation at enhancers and non-coding RNAs [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 6283.
|
5
|
Riemannian classification of single-trial surface EEG and sources during checkerboard and navigational images in humans. PLoS One 2022; 17:e0262417. [PMID: 35030232 PMCID: PMC8759639 DOI: 10.1371/journal.pone.0262417] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2021] [Accepted: 12/23/2021] [Indexed: 11/19/2022]
Abstract
OBJECTIVE Different visual stimuli are classically used for triggering visual evoked potentials comprising well-defined components linked to the content of the displayed image. These evoked components result from the average of ongoing EEG signals in which additive and oscillatory mechanisms contribute to the component morphology. The evoked related potentials often resulted from a mixed situation (power variation and phase-locking) making basic and clinical interpretations difficult. Besides, the grand average methodology produced artificial constructs that do not reflect individual peculiarities. This motivated new approaches based on single-trial analysis as recently used in the brain-computer interface field. APPROACH We hypothesize that EEG signals may include specific information about the visual features of the displayed image and that such distinctive traits can be identified by state-of-the-art classification algorithms based on Riemannian geometry. The same classification algorithms are also applied to the dipole sources estimated by sLORETA. MAIN RESULTS AND SIGNIFICANCE We show that our classification pipeline can effectively discriminate between the display of different visual items (Checkerboard versus 3D navigational image) in single EEG trials throughout multiple subjects. The present methodology reaches a single-trial classification accuracy of about 84% and 93% for inter-subject and intra-subject classification respectively using surface EEG. Interestingly, we note that the classification algorithms trained on sLORETA sources estimation fail to generalize among multiple subjects (63%), which may be due to either the average head model used by sLORETA or the subsequent spatial filtering failing to extract discriminative information, but reach an intra-subject classification accuracy of 82%.
|
6
|
Factor-Based Framework for Multivariate and Multi-step-ahead Forecasting of Large Scale Time Series. Front Big Data 2021; 4:690267. [PMID: 34568817 PMCID: PMC8460934 DOI: 10.3389/fdata.2021.690267] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/02/2021] [Accepted: 08/10/2021] [Indexed: 11/23/2022]
Abstract
State-of-the-art multivariate forecasting methods are restricted to low dimensional tasks, linear dependencies and short horizons. The technological advances (notably the Big data revolution) are instead shifting the focus to problems characterized by a large number of variables, non-linear dependencies and long forecasting horizons. In the last few years, the majority of the best performing techniques for multivariate forecasting have been based on deep-learning models. However, such models are characterized by high requirements in terms of data availability and computational resources and suffer from a lack of interpretability. To cope with the limitations of these methods, we propose an extension to the DFML framework, a hybrid forecasting technique inspired by the Dynamic Factor Model (DFM) approach, a successful forecasting methodology in econometrics. This extension improves the capabilities of the DFM approach, by implementing and assessing both linear and non-linear factor estimation techniques as well as model-driven and data-driven factor forecasting techniques. We assess several method integrations within the DFML, and we show that the proposed technique provides competitive results both in terms of forecasting accuracy and computational efficiency on multiple very large-scale (>10^2 variables and >10^3 samples) real forecasting tasks.
|
7
|
|
8
|
The CLAIRE COVID-19 initiative: approach, experiences and recommendations. ETHICS AND INFORMATION TECHNOLOGY 2021; 23:127-133. [PMID: 33584129 PMCID: PMC7871022 DOI: 10.1007/s10676-020-09567-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
A volunteer effort by Artificial Intelligence (AI) researchers has shown it can deliver significant research outcomes rapidly to help tackle COVID-19. Within two months, CLAIRE's self-organising volunteers delivered the world's first comprehensive curated repository of COVID-19-related datasets useful for drug-repurposing, drafted review papers on the role CT/X-ray scan analysis and robotics could play, and progressed research in other areas. Given the pace required and the nature of voluntary efforts, the teams faced a number of challenges. These offer insights into how to better prepare for future volunteer scientific efforts and large-scale, data-dependent AI collaborations in general. We offer seven recommendations on how to best leverage such efforts and collaborations in the context of managing future crises.
|
9
|
Hyperscanning EEG and Classification Based on Riemannian Geometry for Festive and Violent Mental State Discrimination. Front Neurosci 2020; 14:588357. [PMID: 33424535 PMCID: PMC7793677 DOI: 10.3389/fnins.2020.588357] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/28/2020] [Accepted: 11/04/2020] [Indexed: 12/14/2022]
Abstract
Interactions between two brains constitute the essence of social communication. Daily movements are commonly executed during social interactions and are determined by different mental states that may express different positive or negative behavioral intent. In this context, the effective recognition of festive or violent intent before the action execution remains crucial for survival. Here, we hypothesize that the EEG signals contain the distinctive features characterizing movement intent already expressed before movement execution and that such distinctive information can be identified by state-of-the-art classification algorithms based on Riemannian geometry. We demonstrated for the first time that a classifier based on covariance matrices and Riemannian geometry can effectively discriminate between neutral, festive, and violent mental states only on the basis of non-invasive EEG signals in both the actor and observer participants. These results pave the way for new electrophysiological discrimination of mental states based on non-invasive EEG recordings and cutting-edge machine learning techniques.
|
10
|
Abstract
Cancer driver gene alterations influence cancer development, occurring in oncogenes, tumor suppressors, and dual role genes. Discovering dual role cancer genes is difficult because of their elusive context-dependent behavior. We define oncogenic mediators as genes controlling biological processes. With them, we classify cancer driver genes, unveiling their roles in cancer mechanisms. To this end, we present Moonlight, a tool that incorporates multiple -omics data to identify critical cancer driver genes. With Moonlight, we analyze 8000+ tumor samples from 18 cancer types, discovering 3310 oncogenic mediators, 151 having dual roles. By incorporating additional data (amplification, mutation, DNA methylation, chromatin accessibility), we reveal 1000+ cancer driver genes, corroborating known molecular mechanisms. Additionally, we confirm critical cancer driver genes by analysing cell-line datasets. We discover inactivation of tumor suppressors in intron regions and that tissue type and subtype indicate dual role status. These findings help explain tumor heterogeneity and could guide therapeutic decisions.
|
11
|
Batch and incremental dynamic factor machine learning for multivariate and multi-step-ahead forecasting. INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS 2018. [DOI: 10.1007/s41060-018-0150-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 11/28/2022]
|
12
|
Credit Card Fraud Detection: A Realistic Modeling and a Novel Learning Strategy. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:3784-3797. [PMID: 28920909 DOI: 10.1109/tnnls.2017.2736643] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Indexed: 06/07/2023]
Abstract
Detecting frauds in credit card transactions is perhaps one of the best testbeds for computational intelligence algorithms. In fact, this problem involves a number of relevant challenges, namely: concept drift (customers' habits evolve and fraudsters change their strategies over time), class imbalance (genuine transactions far outnumber frauds), and verification latency (only a small set of transactions are checked by investigators in a timely manner). However, the vast majority of learning algorithms that have been proposed for fraud detection rely on assumptions that hardly hold in a real-world fraud-detection system (FDS). This lack of realism concerns two main aspects: 1) the way and timing with which supervised information is provided and 2) the measures used to assess fraud-detection performance. This paper has three major contributions. First, we propose, with the help of our industrial partner, a formalization of the fraud-detection problem that realistically describes the operating conditions of FDSs that analyze massive streams of credit card transactions every day. We also illustrate the most appropriate performance measures to be used for fraud-detection purposes. Second, we design and assess a novel learning strategy that effectively addresses class imbalance, concept drift, and verification latency. Third, in our experiments, we demonstrate the impact of class imbalance and concept drift in a real-world data stream containing more than 75 million transactions, authorized over a time window of three years.
|
13
|
CancerSubtypes: an R/Bioconductor package for molecular cancer subtype identification, validation and visualization. Bioinformatics 2018; 33:3131-3133. [PMID: 28605519 DOI: 10.1093/bioinformatics/btx378] [Citation(s) in RCA: 139] [Impact Index Per Article: 23.2] [Received: 02/27/2017] [Accepted: 06/08/2017] [Indexed: 01/28/2023]
Abstract
Summary Identifying molecular cancer subtypes from multi-omics data is an important step in personalized medicine. We introduce CancerSubtypes, an R package for identifying cancer subtypes using multi-omics data, including gene expression, miRNA expression and DNA methylation data. CancerSubtypes integrates four main computational methods which are highly cited for cancer subtype identification and provides a standardized framework for data pre-processing, feature selection, and result follow-up analyses, including results computing, biology validation and visualization. The input and output of each step in the framework are packaged in the same data format, making it convenient to compare different methods. The package is useful for inferring cancer subtypes from an input genomic dataset, comparing the predictions from different well-known methods and testing new subtype discovery methods, as shown with different application scenarios in the Supplementary Material. Availability and implementation The package is implemented in R and available under GPL-2 license from the Bioconductor website (http://bioconductor.org/packages/CancerSubtypes/). Contact thuc.le@unisa.edu.au or jiuyong.li@unisa.edu.au. Supplementary information Supplementary data are available at Bioinformatics online.
|
14
|
Comprehensive identification of long noncoding RNAs in colorectal cancer. Oncotarget 2018; 9:27605-27629. [PMID: 29963224 PMCID: PMC6021240 DOI: 10.18632/oncotarget.25218] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 08/09/2017] [Accepted: 04/06/2018] [Indexed: 12/29/2022]
Abstract
Colorectal cancer (CRC) is one of the most common cancers in humans and a leading cause of cancer-related deaths worldwide. As in the case of other cancers, CRC heterogeneity leads to a wide range of clinical outcomes and complicates therapy. Over the years, multiple factors have emerged as markers of CRC heterogeneity, improving tumor classification and selection of therapeutic strategies. Understanding the molecular mechanisms underlying this heterogeneity remains a major challenge. A considerable research effort is therefore devoted to identifying additional features of colorectal tumors, in order to better understand CRC etiology and to multiply therapeutic avenues. Recently, long noncoding RNAs (lncRNAs) have emerged as important players in physiological and pathological processes, including CRC. Here we looked for lncRNAs that might contribute to the various colorectal tumor phenotypes. We thus monitored the expression of 4898 lncRNA genes across 566 CRC samples and identified 282 lncRNAs reflecting CRC heterogeneity. We then inferred potential functions of these lncRNAs. Our results highlight lncRNAs that may participate in the major processes altered in distinct CRC cases, such as WNT/β-catenin and TGF-β signaling, immunity, the epithelial-to-mesenchymal transition (EMT), and angiogenesis. For several candidates, we provide experimental evidence supporting our functional predictions that they may be involved in the cell cycle or the EMT. Overall, our work identifies lncRNAs associated with key CRC characteristics and provides insights into their respective functions. Our findings constitute a further step towards understanding the contribution of lncRNAs to CRC heterogeneity. They may open new therapeutic opportunities.
|
15
|
Abstract
The GDC (Genomic Data Commons) data portal provides users with data from cancer genomics studies. Recently, we developed the R/Bioconductor TCGAbiolinks package, which allows users to search, download and prepare cancer genomics data for integrative data analysis. Using this package requires advanced knowledge of R, which limits the number of potential users. To overcome this obstacle and make the package accessible to a wider range of users, we developed a graphical user interface (GUI) using Shiny, available through the package TCGAbiolinksGUI. The TCGAbiolinksGUI package is freely available within the Bioconductor project at http://bioconductor.org/packages/TCGAbiolinksGUI/. Links to the GitHub repository, a demo version of the tool, a docker image and PDF/video tutorials are available from the TCGAbiolinksGUI site.
|
16
|
Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS 2018. [DOI: 10.1007/s41060-018-0116-z] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Indexed: 11/25/2022]
|
17
|
Combination of Gene Expression Signature and Model for End-Stage Liver Disease Score Predicts Survival of Patients With Severe Alcoholic Hepatitis. Gastroenterology 2018; 154:965-975. [PMID: 29158192 PMCID: PMC5847453 DOI: 10.1053/j.gastro.2017.10.048] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 05/11/2017] [Revised: 10/29/2017] [Accepted: 10/30/2017] [Indexed: 12/14/2022]
Abstract
BACKGROUND & AIMS Patients with severe alcoholic hepatitis (AH) have a high risk of death within 90 days. Corticosteroids, which can cause severe adverse events, are the only treatment that increases short-term survival. It is a challenge to predict outcomes of patients with severe AH. Therefore, we developed a scoring system to predict patient survival, integrating baseline molecular and clinical variables. METHODS We obtained fixed liver biopsy samples from 71 consecutive patients diagnosed with severe AH and treated with corticosteroids from July 2006 through December 2013 in Brussels, Belgium (derivation cohort). Gene expression patterns were analyzed by microarrays and clinical data were collected for 180 days. We identified gene expression signatures and clinical data that are associated with survival without liver transplantation at 90 and 180 days after initiation of corticosteroid therapy. Findings were validated using liver biopsies from 48 consecutive patients with severe AH treated with corticosteroids, collected from March 2010 through February 2015 at hospitals in Belgium and Switzerland (validation cohort 1) and in liver biopsies from 20 patients (9 received corticosteroid treatment), collected from January 2012 through May 2015 in the United States (validation cohort 2). RESULTS We integrated data on expression patterns of 123 genes and the model for end-stage liver disease (MELD) scores to assign patients to groups with poor survival (29% survived 90 days and 26% survived 180 days) and good survival (76% survived 90 days and 65% survived 180 days) (P < .001) in the derivation cohort. We named this assignment system the gene signature-MELD (gs-MELD) score. In validation cohort 1, the gs-MELD score discriminated patients with poor survival (43% survived 90 days) from those with good survival (96% survived 90 days) (P < .001). 
The gs-MELD score also discriminated between patients with a poor survival at 180 days (34% survived) and a good survival at 180 days (84% survived) (P < .001). The time-dependent area under the receiver operator characteristic curve for the score was 0.86 (95% confidence interval 0.73-0.99) for survival at 90 days, and 0.83 (95% confidence interval 0.71-0.96) for survival at 180 days. This score outperformed other clinical models to predict survival of patients with severe AH in validation cohort 1. In validation cohort 2, the gs-MELD discriminated patients with a poor survival at 90 days (12% survived) from those with a good survival at 90 days (100%) (P < .001). CONCLUSIONS We integrated data on baseline liver gene expression pattern and the MELD score to create the gs-MELD scoring system, which identifies patients with severe AH, treated or not with corticosteroids, most and least likely to survive for 90 and 180 days.
|
18
|
Integration of multiple networks and pathways identifies cancer driver genes in pan-cancer analysis. BMC Genomics 2018; 19:25. [PMID: 29304754 PMCID: PMC5756345 DOI: 10.1186/s12864-017-4423-x] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 05/18/2017] [Accepted: 12/27/2017] [Indexed: 12/12/2022]
Abstract
BACKGROUND Modern high-throughput genomic technologies represent a comprehensive hallmark of molecular changes in pan-cancer studies. Although different cancer gene signatures have been revealed, the mechanism of tumourigenesis has yet to be completely understood. Pathways and networks are important tools to explain the role of genes in functional genomic studies. However, few methods consider the functionally unequal roles of genes in pathways and the complex gene-gene interactions in a network. RESULTS We present a novel method in pan-cancer analysis that identifies de-regulated genes with a functional role by integrating pathway and network data. A pan-cancer analysis of 7158 tumour/normal samples from 16 cancer types identified 895 genes that have a central role in pathways and are de-regulated in cancer. Comparing our approach with 15 current tools that identify cancer driver genes, we found that 35.6% of the 895 genes identified by our method have been reported as cancer driver genes by at least 2 of the 15 tools. Finally, we applied a machine learning algorithm on 16 independent GEO cancer datasets to validate the diagnostic role of cancer driver genes for each cancer. We obtained a list of the top-ten cancer driver genes for each cancer considered in this study. CONCLUSIONS Our analysis 1) confirmed that there are several known cancer driver genes in common among different types of cancer, and 2) highlighted that cancer driver genes are able to regulate crucial pathways.
|
19
|
Novel promoters and coding first exons in DLG2 linked to developmental disorders and intellectual disability. Genome Med 2017; 9:67. [PMID: 28724449 PMCID: PMC5518101 DOI: 10.1186/s13073-017-0452-y] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 01/16/2017] [Accepted: 06/20/2017] [Indexed: 01/04/2023]
Abstract
BACKGROUND Tissue-specific integrative omics has the potential to reveal new genic elements important for developmental disorders. METHODS Two pediatric patients with global developmental delay and intellectual disability phenotype underwent array-CGH genetic testing, both showing a partial deletion of the DLG2 gene. From independent human and murine omics datasets, we combined copy number variations, histone modifications, developmental tissue-specific regulation, and protein data to explore the molecular mechanism at play. RESULTS Integrating genomics, transcriptomics, and epigenomics data, we describe two novel DLG2 promoters and coding first exons expressed in human fetal brain. Their murine conservation and protein-level evidence allowed us to produce new DLG2 gene models for human and mouse. These new genic elements are deleted in 90% of 29 patients (public and in-house) showing partial deletion of the DLG2 gene. The patients' clinical characteristics expand the neurodevelopmental phenotypic spectrum linked to DLG2 gene disruption to cognitive and behavioral categories. CONCLUSIONS While protein-coding genes are regarded as well known, our work shows that integration of multiple omics datasets can unveil novel coding elements. From a clinical perspective, our work demonstrates that two new DLG2 promoters and exons are crucial for the neurodevelopmental phenotypes associated with this gene. In addition, our work brings evidence for the lack of cross-annotation in human versus mouse reference genomes and nucleotide versus protein databases.
|
20
|
DNA methylation-based immune response signature improves patient diagnosis in multiple cancers. J Clin Invest 2017; 127:3090-3102. [PMID: 28714863 DOI: 10.1172/jci91095] [Citation(s) in RCA: 95] [Impact Index Per Article: 13.6] [Received: 10/06/2016] [Accepted: 05/26/2017] [Indexed: 12/21/2022]
Abstract
BACKGROUND The tumor immune response is increasingly associated with better clinical outcomes in breast and other cancers. However, the evaluation of tumor-infiltrating lymphocytes (TILs) relies on histopathological measurements with limited accuracy and reproducibility. Here, we profiled DNA methylation markers to identify a methylation of TIL (MeTIL) signature that recapitulates TIL evaluations and their prognostic value for long-term outcomes in breast cancer (BC). METHODS MeTIL signature scores were correlated with clinical endpoints reflecting overall or disease-free survival and a pathologic complete response to preoperative anthracycline therapy in 3 BC cohorts from the Jules Bordet Institute in Brussels and in other cancer types from The Cancer Genome Atlas. RESULTS The MeTIL signature measured TIL distributions in a sensitive manner and predicted survival and response to chemotherapy in BC better than did histopathological assessment of TILs or gene expression-based immune markers, respectively. The MeTIL signature also improved the prediction of survival in other malignancies, including melanoma and lung cancer. Furthermore, the MeTIL signature predicted differences in survival for malignancies in which TILs were not known to have a prognostic value. Finally, we showed that MeTIL markers can be determined by bisulfite pyrosequencing of small amounts of DNA from formalin-fixed, paraffin-embedded tumor tissue, supporting clinical applications for this methodology. CONCLUSIONS This study highlights the power of DNA methylation to evaluate tumor immune responses and the potential of this approach to improve the diagnosis and treatment of breast and other cancers. FUNDING This work was funded by the Fonds National de la Recherche Scientifique (FNRS) and Télévie, the INNOVIRIS Brussels Region BRUBREAST Project, the IUAP P7/03 program, the Belgian "Foundation against Cancer," the Breast Cancer Research Foundation (BCRF), and the Fonds Gaston Ithier.
|
21
|
Study of Meta-analysis strategies for network inference using information-theoretic approaches. BioData Min 2017; 10:15. [PMID: 28484519 PMCID: PMC5420410 DOI: 10.1186/s13040-017-0136-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/03/2017] [Accepted: 04/20/2017] [Indexed: 11/10/2022] Open
Abstract
Background Reverse engineering of gene regulatory networks (GRNs) from gene expression data is a classical challenge in systems biology. Thanks to high-throughput technologies, a massive amount of gene expression data has accumulated in public repositories. Modelling GRNs from multiple experiments (also called integrative analysis) has therefore naturally become a standard procedure in modern computational biology. Indeed, such analysis is usually more robust than traditional approaches that analyse individual datasets and therefore suffer from experimental biases and low sample numbers. To date, there are two main strategies for this problem: the first (“data merging”) merges all datasets together and then infers a GRN, whereas the other (“networks ensemble”) infers a GRN from each dataset separately and then aggregates them using ensemble rules (such as ranksum or weightsum). Unfortunately, a thorough comparison of these two approaches is lacking. Results In this work, we present another meta-analysis approach for inferring GRNs from multiple studies. Our proposed approach, adapted to methods based on pairwise measures such as correlation or mutual information, consists of two steps: aggregating the matrices of pairwise measures from every dataset, then extracting the network from the resulting meta-matrix. We then evaluate the performance of the two commonly used approaches mentioned above and of our approach with a systematic set of experiments based on in silico benchmarks. Conclusions We propose a first systematic evaluation of different strategies for reverse engineering GRNs from multiple datasets. Experimental results strongly suggest that assembling matrices of pairwise dependencies is a better strategy for network inference than the two commonly used ones.
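The two-step idea described in this abstract (aggregate the per-dataset pairwise matrices, then extract a network from the meta-matrix) can be illustrated with a minimal sketch. The abstract does not state the aggregation rule or the extraction method, so the choices below are assumptions made only for illustration: absolute Pearson correlation as the pairwise measure, an unweighted mean as the aggregation, and a simple top-k edge selection. The function names `aggregate_pairwise` and `top_edges` and the synthetic data are hypothetical.

```python
import numpy as np

def aggregate_pairwise(datasets):
    """Average the absolute Pearson correlation matrices of several datasets.

    Each dataset is a (samples x genes) array; all datasets are assumed to
    share the same genes in the same column order.
    """
    mats = [np.abs(np.corrcoef(d, rowvar=False)) for d in datasets]
    meta = np.mean(mats, axis=0)   # assumed aggregation rule: plain mean
    np.fill_diagonal(meta, 0.0)    # ignore each gene's dependence on itself
    return meta

def top_edges(meta, k):
    """Return the k highest-scoring undirected gene pairs as (i, j) tuples."""
    i, j = np.triu_indices_from(meta, k=1)
    order = np.argsort(meta[i, j])[::-1][:k]
    return [(int(i[o]), int(j[o])) for o in order]

# Synthetic example: gene 1 closely tracks gene 0; genes 2 and 3 are noise.
rng = np.random.default_rng(0)

def make_dataset(n_samples):
    g0 = rng.normal(size=n_samples)
    g1 = g0 + 0.1 * rng.normal(size=n_samples)
    g2 = rng.normal(size=n_samples)
    g3 = rng.normal(size=n_samples)
    return np.column_stack([g0, g1, g2, g3])

datasets = [make_dataset(n) for n in (30, 40, 50)]
meta = aggregate_pairwise(datasets)
edges = top_edges(meta, k=1)
print(edges)
```

Aggregating at the matrix level means the datasets never need to be merged or batch-corrected sample by sample; only the gene order has to match. A weighted mean (for example by sample size) or a mutual-information estimator could be substituted without changing the structure of the sketch.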
|
22
|
SpidermiR: An R/Bioconductor Package for Integrative Analysis with miRNA Data. Int J Mol Sci 2017; 18:ijms18020274. [PMID: 28134831 PMCID: PMC5343810 DOI: 10.3390/ijms18020274] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Received: 12/01/2016] [Accepted: 01/24/2017] [Indexed: 02/08/2023] Open
Abstract
Gene Regulatory Networks (GRNs) control many biological systems, but how such network coordination is shaped is still unknown. GRNs can be subdivided into basic connections that describe how the network members interact, e.g., co-expression, physical interaction, co-localization, genetic influence, pathways, and shared protein domains. The important regulatory mechanisms of these networks involve miRNAs. We developed an R/Bioconductor package, SpidermiR, which offers the end user easy access to both GRNs and miRNAs and integrates this information with differentially expressed genes obtained from The Cancer Genome Atlas. Specifically, SpidermiR allows users to: (i) query and download GRNs and miRNAs from validated and predicted repositories; (ii) integrate miRNAs with GRNs in order to obtain miRNA-gene-gene and miRNA-protein-protein interactions, and to analyze miRNA GRNs in order to identify miRNA-gene communities; and (iii) graphically visualize the results of the analyses. These analyses can be performed through a single interface and without the need for any downloads. The full data sets are then rapidly integrated and processed locally.
|
23
|
Abstract
BACKGROUND An important challenge in cancer biology is to understand the complex aspects of the disease. It is increasingly evident that genes are not isolated from each other, and understanding how different genes are related to each other could explain the biological mechanisms that cause disease. Biological pathways are important tools to reveal gene interactions and to reduce the large number of genes to be studied by partitioning them into smaller paths. Furthermore, recent scientific evidence indicates that a combination of pathways, rather than a single element of a pathway or a single pathway, could be responsible for pathological changes in a cell. RESULTS In this paper we develop a new method that can reveal miRNAs able to regulate, in a coordinated way, networks of gene pathways. We applied the method to subtypes of breast cancer. The basic idea is to identify pathways significantly enriched with genes differentially expressed among the different breast cancer subtypes and normal tissue. Looking at the pairs of pathways that were found to be functionally related, we created a network of dependent pathways and focused on identifying miRNAs that could act as drivers in a coordinated regulation process. CONCLUSIONS Our approach enables the identification of miRNAs that could have an important role in the development of breast cancer.
|
24
|
Portraying breast cancers with long noncoding RNAs. SCIENCE ADVANCES 2016; 2:e1600220. [PMID: 27617288 PMCID: PMC5010371 DOI: 10.1126/sciadv.1600220] [Citation(s) in RCA: 86] [Impact Index Per Article: 10.8] [Received: 02/04/2016] [Accepted: 08/05/2016] [Indexed: 05/24/2023]
Abstract
Evidence is emerging that long noncoding RNAs (lncRNAs) may play a role in cancer development, but this role is not yet clear. We performed a genome-wide transcriptional survey to explore the lncRNA landscape across 995 breast tissue samples. We identified 215 lncRNAs whose genes are aberrantly expressed in breast tumors, as compared to normal samples. Unsupervised hierarchical clustering of breast tumors on the basis of their lncRNAs revealed four breast cancer subgroups that correlate tightly with PAM50-defined mRNA-based subtypes. Using multivariate analysis, we identified no less than 210 lncRNAs prognostic of clinical outcome. By analyzing the coexpression of lncRNA genes and protein-coding genes, we inferred potential functions of the 215 dysregulated lncRNAs. We then associated subtype-specific lncRNAs with key molecular processes involved in cancer. A correlation was observed, on the one hand, between luminal A-specific lncRNAs and the activation of phosphatidylinositol 3-kinase, fibroblast growth factor, and transforming growth factor-β pathways and, on the other hand, between basal-like-specific lncRNAs and the activation of epidermal growth factor receptor (EGFR)-dependent pathways and of the epithelial-to-mesenchymal transition. Finally, we showed that a specific lncRNA, which we called CYTOR, plays a role in breast cancer. We confirmed its predicted functions, showing that it regulates genes involved in the EGFR/mammalian target of rapamycin pathway and is required for cell proliferation, cell migration, and cytoskeleton organization. Overall, our work provides the most comprehensive analyses for lncRNA in breast cancers. Our findings suggest a wide range of biological functions associated with lncRNAs in breast cancer and provide a foundation for functional investigations that could lead to new therapeutic approaches.
|
25
|
Abstract
Biotechnological advances in sequencing have led to an explosion of publicly available data via large international consortia such as The Cancer Genome Atlas (TCGA), The Encyclopedia of DNA Elements (ENCODE), and The NIH Roadmap Epigenomics Mapping Consortium (Roadmap). These projects have provided unprecedented opportunities to interrogate the epigenome of cultured cancer cell lines as well as normal and tumor tissues with high genomic resolution. The Bioconductor project offers more than 1,000 open-source software and statistical packages to analyze high-throughput genomic data. However, most packages are designed for specific data types (e.g. expression, epigenetics, genomics) and there is no one comprehensive tool that provides a complete integrative analysis of the resources and data provided by all three public projects. A need to create an integration of these different analyses was recently proposed. In this workflow, we provide a series of biologically focused integrative analyses of different molecular data. We describe how to download, process and prepare TCGA data and, by harnessing several key Bioconductor packages, how to extract biologically meaningful genomic and epigenomic data. Using Roadmap and ENCODE data, we provide a work plan to identify biologically relevant functional epigenomic elements associated with cancer. To illustrate our workflow, we analyzed two types of brain tumors: low-grade glioma (LGG) versus high-grade glioma (glioblastoma multiforme or GBM). This workflow introduces the following Bioconductor packages: AnnotationHub, ChIPSeeker, ComplexHeatmap, pathview, ELMER, GAIA, MINET, RTCGAToolbox, TCGAbiolinks.
|
27
|
TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res 2016; 44:e71. [PMID: 26704973 PMCID: PMC4856967 DOI: 10.1093/nar/gkv1507] [Citation(s) in RCA: 2007] [Impact Index Per Article: 250.9] [Received: 10/22/2015] [Revised: 12/06/2015] [Accepted: 12/10/2015] [Indexed: 12/18/2022] Open
Abstract
The Cancer Genome Atlas (TCGA) research network has made public a large collection of clinical and molecular phenotypes of more than 10 000 tumor patients across 33 different tumor types. Using this cohort, TCGA has published over 20 marker papers detailing the genomic and epigenomic alterations associated with these tumor types. Although many important discoveries have been made by TCGA's research network, opportunities still exist to implement novel methods, thereby elucidating new biological pathways and diagnostic markers. However, mining the TCGA data presents several bioinformatics challenges, such as data retrieval and integration with clinical data and other molecular data types (e.g. RNA and DNA methylation). We developed an R/Bioconductor package called TCGAbiolinks to address these challenges and offer bioinformatics solutions by using a guided workflow to allow users to query, download and perform integrative analyses of TCGA data. We combined methods from computer science and statistics into the pipeline and incorporated methodologies developed in previous TCGA marker studies and in our own group. Using four different TCGA tumor types (Kidney, Brain, Breast and Colon) as examples, we provide case studies to illustrate examples of reproducibility, integrative analysis and utilization of different Bioconductor packages to advance and accelerate novel discoveries.
|
28
|
Abstract
Species interaction networks are shaped by abiotic and biotic factors. Here, as part of the Tara Oceans project, we studied the photic zone interactome using environmental factors and organismal abundance profiles and found that environmental factors are incomplete predictors of community structure. We found associations across plankton functional types and phylogenetic groups to be nonrandomly distributed on the network and driven by both local and global patterns. We identified interactions among grazers, primary producers, viruses, and (mainly parasitic) symbionts and validated network-generated hypotheses using microscopy to confirm symbiotic relationships. We have thus provided a resource to support further research on ocean food webs and integrating biological components into ocean models.
|
29
|
Using shRNA experiments to validate gene regulatory networks. GENOMICS DATA 2015; 4:123-6. [PMID: 26484195 PMCID: PMC4535466 DOI: 10.1016/j.gdata.2015.03.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2015] [Revised: 03/23/2015] [Accepted: 03/23/2015] [Indexed: 11/26/2022]
Abstract
Quantitative validation of gene regulatory networks (GRNs) inferred from observational expression data is a difficult task, usually involving time-intensive and costly laboratory experiments. We were able to show that gene knock-down experiments can be used to quantitatively assess the quality of large-scale GRNs via a purely data-driven approach (Olsen et al. 2014). Our new validation framework also enables the statistical comparison of multiple network inference techniques, which was a long-standing challenge in the field. In this Data in Brief we detail the contents and quality controls for the gene expression data (available from the NCBI Gene Expression Omnibus repository under accession number GSE53091) associated with our study published in Genomics (Olsen et al. 2014). We also provide R code to access the data and reproduce the analysis presented in this article.
|
30
|
Relevance of different prior knowledge sources for inferring gene interaction networks. Front Genet 2014; 5:177. [PMID: 25009552 PMCID: PMC4067568 DOI: 10.3389/fgene.2014.00177] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 01/10/2014] [Accepted: 05/26/2014] [Indexed: 11/13/2022] Open
Abstract
When inferring networks from high-throughput genomic data, one of the main challenges is the subsequent validation of these networks. In the best-case scenario, the true network is partially known from previous research results published in structured databases or research articles. Traditionally, inferred networks are validated against these known interactions. Whenever the recovery rate is judged to be high enough, high-scoring but unknown inferred interactions are deemed good candidates for further experimental validation. Such a validation framework therefore depends strongly on the quantity and quality of published interactions and presents serious pitfalls: (1) known interactions for the studied problem might be sparse; (2) quantitatively comparing different inference algorithms is not trivial; and (3) the use of these known interactions for validation prevents their integration in the inference procedure. The last point is particularly relevant, as it has recently been shown that integrating priors during network inference significantly improves the quality of inferred networks. To overcome these problems when validating inferred networks, we recently proposed a data-driven validation framework based on single-gene knock-down experiments. Using this framework, we were able to demonstrate the benefits of integrating prior knowledge and expression data. In this paper we used this framework to assess the quality of different sources of prior knowledge on their own and in combination with different genomic data sets in colorectal cancer. We observed that most prior sources lead to significant F-scores. Furthermore, their integration with genomic data leads to a significant increase in F-scores, especially for priors extracted from full-text PubMed articles, known co-expression modules and genetic interactions. Lastly, we observed that the results are consistent for three different data sets: experimental knock-down data and two human tumor data sets.
|
31
|
Association between the PNPLA3 (rs738409 C>G) variant and hepatocellular carcinoma: Evidence from a meta-analysis of individual participant data. Hepatology 2014; 59:2170-7. [PMID: 24114809 DOI: 10.1002/hep.26767] [Citation(s) in RCA: 175] [Impact Index Per Article: 17.5] [Received: 05/16/2013] [Accepted: 09/19/2013] [Indexed: 12/12/2022]
Abstract
The incidence of hepatocellular carcinoma (HCC) is increasing in Western countries. Although several clinical factors have been identified, many individuals never develop HCC, suggesting a genetic susceptibility. However, to date, only a few single-nucleotide polymorphisms have been reproducibly shown to be linked to HCC onset. A variant (rs738409 C>G, encoding p.I148M) in the PNPLA3 gene is associated with liver damage in chronic liver diseases. Interestingly, several studies have reported that the minor rs738409[G] allele is overrepresented in HCC cases in chronic hepatitis C (CHC) and alcoholic liver disease (ALD). However, a significant association with HCC related to CHC has not been consistently observed, and the strength of the association between rs738409 and HCC remains unclear. We performed a meta-analysis of individual participant data including 2,503 European patients with cirrhosis to assess the association between rs738409 and HCC, particularly in ALD and CHC. We found that rs738409 was strongly associated with overall HCC (odds ratio [OR] per G allele, additive model = 1.77; 95% confidence interval [CI]: 1.42-2.19; P = 2.78 × 10⁻⁷). This association was more pronounced in ALD (OR = 2.20; 95% CI: 1.80-2.67; P = 4.71 × 10⁻¹⁵) than in CHC patients (OR = 1.55; 95% CI: 1.03-2.34; P = 3.52 × 10⁻²). After adjustment for age, sex, and body mass index, the variant remained strongly associated with HCC. CONCLUSION Overall, these results suggest that rs738409 exerts a marked influence on hepatocarcinogenesis in patients with cirrhosis of European descent and provide a strong argument for performing further mechanistic studies to better understand the role of PNPLA3 in HCC development.
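As a reading aid for the odds ratios above, the following sketch shows how a two-sided p-value can be approximated from a reported OR and its 95% confidence interval using a Wald (normal) approximation on the log scale. This is an assumption made for illustration, not the test used in the study, so the result should only be expected to land near the reported value. The function name `p_from_or_ci` is hypothetical.

```python
import math

def p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Approximate SE, z statistic and two-sided p-value from an OR and its 95% CI.

    Assumes the interval is symmetric on the log scale (Wald interval).
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return se, z, p

# Overall HCC association quoted in the abstract: OR 1.77, 95% CI 1.42-2.19.
se, z, p = p_from_or_ci(1.77, 1.42, 2.19)
print(f"SE(log OR) = {se:.3f}, z = {z:.2f}, p = {p:.2e}")
```

For the overall estimate this gives a p-value on the order of 10⁻⁷, the same order of magnitude as the reported one; small differences are expected because the published interval is rounded and the original model may differ from a simple Wald test.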
|
32
|
Temporal profiling of cytokine-induced genes in pancreatic β-cells by meta-analysis and network inference. Genomics 2014; 103:264-75. [DOI: 10.1016/j.ygeno.2013.12.007] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.4] [Received: 09/10/2013] [Revised: 12/17/2013] [Accepted: 12/18/2013] [Indexed: 01/12/2023]
|
33
|
Inference and validation of predictive gene networks from biomedical literature and gene expression data. Genomics 2014; 103:329-36. [PMID: 24691108 DOI: 10.1016/j.ygeno.2014.03.004] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 09/16/2013] [Revised: 01/23/2014] [Accepted: 03/15/2014] [Indexed: 02/04/2023]
Abstract
Although many methods have been developed for inference of biological networks, the validation of the resulting models has largely remained an unsolved problem. Here we present a framework for quantitative assessment of inferred gene interaction networks using knock-down data from cell line experiments. Using this framework we are able to show that network inference based on integration of prior knowledge derived from the biomedical literature with genomic data significantly improves the quality of inferred networks relative to other approaches. Our results also suggest that cell line experiments can be used to quantitatively assess the quality of networks inferred from tumor samples.
|
34
|
Experimental assessment of static and dynamic algorithms for gene regulation inference from time series expression data. Front Genet 2013; 4:303. [PMID: 24400020 PMCID: PMC3872039 DOI: 10.3389/fgene.2013.00303] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 07/30/2013] [Accepted: 12/10/2013] [Indexed: 11/13/2022] Open
Abstract
Accurate inference of causal gene regulatory networks from gene expression data is an open bioinformatics challenge. Gene interactions are dynamical processes, and consequently we can expect the effect of any regulatory action to occur after a certain temporal lag. However, this lag is unknown a priori, and temporal aspects require specific inference algorithms. In this paper we aim to assess the impact of taking temporal aspects into consideration on the final accuracy of the inference procedure. In particular, we compare the accuracy of static algorithms, where no dynamic aspect is considered, to that of fixed-lag and adaptive-lag algorithms in three inference tasks from microarray expression data. Experimental results show that network inference algorithms that take dynamics into account perform consistently better than static ones, provided the considered lags are properly chosen. However, no individual algorithm stands out in all three inference tasks, and the challenging nature of network inference is evident, as a large number of the assessed algorithms do not perform better than random.
|
35
|
Genome-wide gene expression profiling to predict resistance to anthracyclines in breast cancer patients. GENOMICS DATA 2013; 1:7-10. [PMID: 26484051 PMCID: PMC4608867 DOI: 10.1016/j.gdata.2013.09.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/12/2013] [Accepted: 09/12/2013] [Indexed: 11/19/2022]
Abstract
Validated biomarkers predictive of response/resistance to anthracyclines in breast cancer are currently lacking. The neoadjuvant Trial of Principle (TOP) study, in which patients with estrogen receptor (ER)–negative tumors were treated with anthracycline (epirubicin) monotherapy, was specifically designed to evaluate the predictive value of topoisomerase II-alpha (TOP2A) and to develop a gene expression signature to identify those patients who do not benefit from anthracyclines. Here we describe in detail the contents and quality controls for the gene expression and clinical data associated with the study published by Desmedt and colleagues in the Journal of Clinical Oncology (Desmedt et al., 2011). We also provide R code to easily access the data and perform the quality controls and basic analyses relevant to this dataset.
|
36
|
Abstract
Infinium HumanMethylation450 beadarray is a popular technology to explore DNA methylomes in health and disease, and there is a current explosion in the use of this technique. Despite experience acquired from gene expression microarrays, analyzing Infinium Methylation arrays appeared more complex than initially thought and several difficulties have been encountered, as those arrays display specific features that need to be taken into consideration during data processing. Here, we review several issues that have been highlighted by the scientific community, and we present an overview of the general data processing scheme and an evaluation of the different normalization methods available to date to guide the 450K users in their analysis and data interpretation.
|
37
|
|
38
|
|
39
|
Abstract
BACKGROUND An enduring challenge in personalized medicine lies in selecting the right drug for each individual patient. While testing of drugs on patients in large trials is the only way to assess their clinical efficacy and toxicity, we dramatically lack resources to test the hundreds of drugs currently under development. Therefore the use of preclinical model systems has been intensively investigated as this approach enables response to hundreds of drugs to be tested in multiple cell lines in parallel. METHODS Two large-scale pharmacogenomic studies recently screened multiple anticancer drugs on over 1000 cell lines. We propose to combine these datasets to build and robustly validate genomic predictors of drug response. We compared five different approaches for building predictors of increasing complexity. We assessed their performance in cross-validation and in two large validation sets, one containing the same cell lines present in the training set and another dataset composed of cell lines that have never been used during the training phase. RESULTS Sixteen drugs were found in common between the datasets. We were able to validate multivariate predictors for three out of the 16 tested drugs, namely irinotecan, PD-0325901, and PLX4720. Moreover, we observed that response to 17-AAG, an inhibitor of Hsp90, could be efficiently predicted by the expression level of a single gene, NQO1. CONCLUSION These results suggest that genomic predictors could be robustly validated for specific drugs. If successfully validated in patients' tumor cells, and subsequently in clinical trials, they could act as companion tests for the corresponding drugs and play an important role in personalized medicine.
|
40
|
Innate cytokines production in the geriatric population: Frailty is associated with low responsiveness to lipopolysaccharide. Eur Geriatr Med 2012. [DOI: 10.1016/j.eurger.2012.07.142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
|
41
|
Abstract
ChIP-sequencing is a method of choice for localizing protein binding sites on DNA at the whole-genome scale. Deciphering the sequencing data produced by this novel technique is challenging and requires rigorous interpretation using dedicated tools and adapted visualization programs. Here, we present a bioinformatics tool (D-peaks) that adds several features (including user-friendliness, high-quality output, and relative position with respect to genomic features) to the well-known visualization browsers and databases that already exist. D-peaks is directly available through its web interface http://rsat.ulb.ac.be/dpeaks/ as well as a command line tool.
|
42
|
A three-gene model to robustly identify breast cancer molecular subtypes. J Natl Cancer Inst 2012; 104:311-25. [PMID: 22262870 DOI: 10.1093/jnci/djr545] [Citation(s) in RCA: 222] [Impact Index Per Article: 18.5] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Single sample predictors (SSPs) and subtype classification models (SCMs) are gene expression-based classifiers used to identify the four primary molecular subtypes of breast cancer (basal-like, HER2-enriched, luminal A, and luminal B). SSPs use hierarchical clustering, followed by nearest centroid classification, based on large sets of tumor-intrinsic genes. SCMs use a mixture of Gaussian distributions based on sets of genes with expression specifically correlated with three key breast cancer genes (estrogen receptor [ER], HER2, and aurora kinase A [AURKA]). The aim of this study was to compare the robustness, classification concordance, and prognostic value of these classifiers with those of a simplified three-gene SCM in a large compendium of microarray datasets. METHODS Thirty-six publicly available breast cancer datasets (n = 5715) were subjected to molecular subtyping using five published classifiers (three SSPs and two SCMs) and SCMGENE, the new three-gene (ER, HER2, and AURKA) SCM. We used the prediction strength statistic to estimate robustness of the classification models, defined as the capacity of a classifier to assign the same tumors to the same subtypes independently of the dataset used to fit it. We used Cohen's κ and Cramér's V coefficients to assess concordance between the subtype classifiers and association with clinical variables, respectively. We used Kaplan-Meier survival curves and cross-validated partial likelihood to compare prognostic value of the resulting classifications. All statistical tests were two-sided. RESULTS SCMs were statistically significantly more robust than SSPs, with SCMGENE being the most robust because of its simplicity. SCMGENE was statistically significantly concordant with published SCMs (κ = 0.65-0.70) and SSPs (κ = 0.34-0.59), statistically significantly associated with ER (V = 0.64), HER2 (V = 0.52) status, and histological grade (V = 0.55), and yielded similarly strong prognostic value. CONCLUSION Our results suggest that adequate classification of the major and clinically relevant molecular subtypes of breast cancer can be robustly achieved with quantitative measurements of three key genes.
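The two agreement statistics named in this abstract can be made concrete with a short sketch using their standard textbook definitions: Cohen's κ compares observed agreement with the agreement expected by chance, and Cramér's V rescales the chi-squared statistic of a contingency table to the range 0 to 1. The helper names and the toy labels below are hypothetical and are not taken from the study; the sketch also assumes each variable has at least two observed categories.

```python
import numpy as np

def contingency(a, b, row_labels, col_labels):
    """Cross-tabulate two equal-length label sequences."""
    table = np.zeros((len(row_labels), len(col_labels)))
    for x, y in zip(a, b):
        table[row_labels.index(x), col_labels.index(y)] += 1
    return table

def cohen_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two classifiers."""
    labels = sorted(set(a) | set(b))
    table = contingency(a, b, labels, labels)
    n = table.sum()
    p_obs = np.trace(table) / n
    p_exp = (table.sum(axis=1) @ table.sum(axis=0)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

def cramers_v(a, b):
    """Cramer's V: strength of association between two categorical variables."""
    table = contingency(a, b, sorted(set(a)), sorted(set(b)))
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Toy subtype calls from two hypothetical classifiers on eight tumors.
calls_a = ["basal", "basal", "her2", "her2", "lumA", "lumA", "lumB", "lumB"]
calls_b = ["basal", "basal", "her2", "her2", "lumA", "lumB", "lumB", "lumB"]
print(cohen_kappa(calls_a, calls_b), cramers_v(calls_a, calls_b))
```

κ requires both classifiers to use the same label set, which is why the contingency table is built over the union of labels, whereas V applies to any pair of categorical variables (for example, subtype versus histological grade).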
|
43
|
Predictive networks: a flexible, open source, web application for integration and analysis of human gene networks. Nucleic Acids Res 2012; 40:D866-75. [PMID: 22096235 PMCID: PMC3245161 DOI: 10.1093/nar/gkr1050] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 08/15/2011] [Revised: 10/09/2011] [Accepted: 10/23/2011] [Indexed: 12/03/2022] Open
Abstract
Genomics provided us with an unprecedented quantity of data on the genes that are activated or repressed in a wide range of phenotypes. We have increasingly come to recognize that defining the networks and pathways underlying these phenotypes requires both the integration of multiple data types and the development of advanced computational methods to infer relationships between the genes and to estimate the predictive power of the networks through which they interact. To address these issues we have developed Predictive Networks (PN), a flexible, open-source, web-based application and data services framework that enables the integration, navigation, visualization and analysis of gene interaction networks. The primary goal of PN is to allow biomedical researchers to evaluate experimentally derived gene lists in the context of large-scale gene interaction networks. The PN analytical pipeline involves two key steps. The first is the collection of a comprehensive set of known gene interactions derived from a variety of publicly available sources. The second is to use these 'known' interactions together with gene expression data to infer robust gene networks. The PN web application is accessible from http://predictivenetworks.org. The PN code base is freely available at https://sourceforge.net/projects/predictivenets/.
|
44
|
Machine Learning for Automated Polyp Detection in Computed Tomography Colonography. Mach Learn 2012. [DOI: 10.4018/978-1-60960-818-7.ch407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
Abstract
This chapter presents a comprehensive scheme for automated detection of colorectal polyps in computed tomography colonography (CTC), with particular emphasis on robust learning algorithms that differentiate polyps from non-polyp shapes. The authors’ automated CTC scheme introduces two orientation-independent features that encode the shape characteristics used to classify polyps and non-polyps with high accuracy, a low false positive rate, and low computational cost, making the scheme suitable for colorectal cancer screening initiatives. Experiments using state-of-the-art machine learning algorithms, viz., lazy learning, support vector machines, and naïve Bayes classifiers, reveal the robustness of the two features in detecting polyps at 100% sensitivity for polyps with diameter greater than 10 mm, while attaining low total false positive rates of 3.05, 3.47 and 0.71 per CTC dataset, respectively, at specificities above 99% when tested on 58 CTC datasets. The results were validated using colonoscopy reports provided by expert radiologists.
|
45
|
Multiple-input multiple-output causal strategies for gene selection. BMC Bioinformatics 2011; 12:458. [PMID: 22118187 PMCID: PMC3323860 DOI: 10.1186/1471-2105-12-458] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 05/27/2011] [Accepted: 11/25/2011] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Traditional strategies for selecting variables in high-dimensional classification problems aim to find sets of maximally relevant variables able to explain the target variations. Although these techniques may be effective in terms of generalization accuracy, they often do not reveal direct causes. This is essentially because high correlation (or relevance) does not imply causation. In this study, we show how to efficiently incorporate causal information into gene selection by moving from a single-input single-output to a multiple-input multiple-output setting. RESULTS We show in a synthetic case study that a better prioritization of causal variables can be obtained by considering a relevance score which incorporates a causal term. In addition, we show, in a meta-analysis of six publicly available breast cancer microarray datasets, that the improvement also holds in terms of accuracy. The biological interpretation of the results confirms the potential of a causal approach to gene selection. CONCLUSIONS Integrating causal information into gene selection algorithms is effective both in terms of prediction accuracy and biological interpretation.
|
46
|
Abstract
PURPOSE Validated biomarkers predictive of response/resistance to anthracyclines in breast cancer are currently lacking. The neoadjuvant Trial of Principle (TOP) study, in which patients with estrogen receptor (ER) -negative tumors were treated with anthracycline (epirubicin) monotherapy, was specifically designed to evaluate the predictive value of topoisomerase II-α (TOP2A) and develop a gene expression signature to identify those patients who do not benefit from anthracyclines. PATIENTS AND METHODS The TOP trial included 149 patients, 139 of whom were evaluable for response prediction analyses. The primary end point was pathologic complete response (pCR). TOP2A and gene expression profiles were evaluated using pre-epirubicin biopsies. Gene expression data from ER-negative samples of the EORTC (European Organisation for Research and Treatment of Cancer) 10994/BIG (Breast International Group) 00-01 and MDACC (MD Anderson Cancer Center) 2003-0321 neoadjuvant trials were used for validation purposes. RESULTS A pCR was obtained in 14% of the evaluable patients in the TOP trial. TOP2A amplification, but not protein overexpression, was significantly associated with pCR (P ≤ .001 v P ≤ .33). We developed an anthracycline-based score (A-Score) combining three signatures: a TOP2A gene signature and two previously published signatures related to tumor invasion and immune response. The A-Score was characterized by a high negative predictive value ([NPV]; NPV, 0.98; 95% CI, 0.90 to 1.00) overall and in the human epidermal growth factor receptor 2 (HER2) -negative and HER2-positive subpopulations. Its performance was independently confirmed in the anthracycline-based arms of the two validation trials (BIG 00-01: NPV, 0.83; 95% CI, 0.64 to 0.94 and MDACC 2003-0321: NPV, 1.00; 95% CI, 0.80 to 1.00). 
CONCLUSION Given its high NPV, the A-Score could become, if further validated, a useful clinical tool to identify those patients who do not benefit from anthracyclines and could therefore be spared the non-negligible adverse effects.
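The negative predictive value reported above is the fraction of patients predicted not to respond who indeed did not respond. A minimal sketch of that calculation follows. The counts are hypothetical and are not taken from the TOP trial, and the Wilson score interval is used here only as an illustrative choice, since the abstract does not state which interval method the authors applied.

```python
import math

def negative_predictive_value(true_neg, false_neg):
    """NPV = TN / (TN + FN): share of negative predictions that are correct."""
    return true_neg / (true_neg + false_neg)

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (assumed method)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts: 50 patients predicted as non-responders,
# 45 of whom truly did not achieve a pCR.
true_neg, false_neg = 45, 5
npv = negative_predictive_value(true_neg, false_neg)
low, high = wilson_interval(true_neg, true_neg + false_neg)
print(f"NPV = {npv:.2f}, 95% CI {low:.2f} to {high:.2f}")
```

A high NPV with a narrow interval is what would justify sparing predicted non-responders a treatment; with small subgroups the interval widens, as in the validation arms quoted above.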
|
47
|
STAT1 is a master regulator of pancreatic β-cell apoptosis and islet inflammation. J Biol Chem 2010; 286:929-41. [PMID: 20980260 DOI: 10.1074/jbc.m110.162131] [Citation(s) in RCA: 135] [Impact Index Per Article: 9.6] [Indexed: 12/21/2022] Open
Abstract
Cytokines produced by islet-infiltrating immune cells induce β-cell apoptosis in type 1 diabetes. The IFN-γ-regulated transcription factors STAT1/IRF-1 have apparently divergent effects on β-cells. Thus, STAT1 promotes apoptosis and inflammation, whereas IRF-1 down-regulates inflammatory mediators. To understand the molecular basis for these differential outcomes within a single signal transduction pathway, we presently characterized the gene networks regulated by STAT1 and IRF-1 in β-cells. This was done by using siRNA approaches coupled to microarray analysis of insulin-producing cells exposed or not to IL-1β and IFN-γ. Relevant microarray findings were further studied in INS-1E cells and primary rat β-cells. STAT1, but not IRF-1, mediates the cytokine-induced loss of the differentiated β-cell phenotype, as indicated by decreased insulin, Pdx1, MafA, and Glut2. Furthermore, STAT1 regulates cytokine-induced apoptosis via up-regulation of the proapoptotic protein DP5. STAT1 and IRF-1 have opposite effects on cytokine-induced chemokine production, with IRF-1 exerting negative feedback inhibition on STAT1 and downstream chemokine expression. The present study elucidates the transcriptional networks through which the IFN-γ/STAT1/IRF-1 axis controls β-cell function/differentiation, demise, and islet inflammation.
|
48
|
A fuzzy gene expression-based computational approach improves breast cancer prognostication. Genome Biol 2010; 11:R18. [PMID: 20156340 PMCID: PMC2872878 DOI: 10.1186/gb-2010-11-2-r18] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.0] [Received: 09/14/2009] [Revised: 01/04/2010] [Accepted: 02/15/2010] [Indexed: 12/11/2022] Open
Abstract
A fuzzy computational approach that takes into account several molecular subtypes in order to provide more accurate breast cancer prognosis. Early gene expression studies classified breast tumors into at least three clinically relevant subtypes. Although most current gene signatures are prognostic for estrogen receptor (ER) positive/human epidermal growth factor receptor 2 (HER2) negative breast cancers, few are informative for ER negative/HER2 negative and HER2 positive subtypes. Here we present Gene Expression Prognostic Index Using Subtypes (GENIUS), a fuzzy approach for prognostication that takes into account the molecular heterogeneity of breast cancer. In systematic evaluations, GENIUS significantly outperformed current gene signatures and clinical indices in the global population of patients.
|
49
|
Machine learning techniques to identify putative genes involved in nitrogen catabolite repression in the yeast Saccharomyces cerevisiae. BMC Proc 2008; 2 Suppl 4:S5. [PMID: 19091052 PMCID: PMC2654973 DOI: 10.1186/1753-6561-2-s4-s5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Nitrogen is an essential nutrient for all life forms. Like most unicellular organisms, the yeast Saccharomyces cerevisiae transports and catabolizes good nitrogen sources in preference to poor ones. Nitrogen catabolite repression (NCR) refers to this selection mechanism. All known nitrogen catabolite pathways are regulated by four regulators. The ultimate goal is to infer the complete nitrogen catabolite pathways. Bioinformatics approaches offer the possibility to identify putative NCR genes and to discard uninteresting genes. RESULTS We present a machine learning approach where the identification of putative NCR genes in the yeast Saccharomyces cerevisiae is formulated as a supervised two-class classification problem. Classifiers predict whether genes are NCR-sensitive or not from a large number of variables related to the GATA motif in the upstream non-coding sequences of the genes. The positive and negative training sets are composed of annotated NCR genes and manually-selected genes known to be insensitive to NCR, respectively. Different classifiers and variable selection methods are compared. We show that all classifiers make significant and biologically valid predictions by comparing these predictions to annotated and putative NCR genes, and by performing several negative controls. In particular, the inferred NCR genes significantly overlap with putative NCR genes identified in three genome-wide experimental and bioinformatics studies. CONCLUSION These results suggest that our approach can successfully identify potential NCR genes. Hence, the dimensionality of the problem of identifying all genes involved in NCR is drastically reduced.
|
50
|
minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics 2008; 9:461. [PMID: 18959772 PMCID: PMC2630331 DOI: 10.1186/1471-2105-9-461] [Citation(s) in RCA: 323] [Impact Index Per Article: 20.2] [Received: 07/02/2008] [Accepted: 10/29/2008] [Indexed: 12/03/2022] Open
Abstract
RESULTS This paper presents the R/Bioconductor package minet (version 1.1.6), which provides a set of functions to infer mutual information networks from a dataset. Once fed with a microarray dataset, the package returns a network where nodes denote genes, edges model statistical dependencies between genes, and the weight of an edge quantifies the statistical evidence of a specific (e.g., transcriptional) gene-to-gene interaction. Four different entropy estimators are made available in the package minet (empirical, Miller-Madow, Schurmann-Grassberger and shrink), as well as four different inference methods, namely relevance networks, ARACNE, CLR and MRNET. The package also integrates accuracy assessment tools, such as F-scores, PR-curves and ROC-curves, to compare the inferred network with a reference one. CONCLUSION The package minet provides a series of tools for inferring transcriptional networks from microarray data. It is freely available from the Comprehensive R Archive Network (CRAN) as well as from the Bioconductor website.
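To make the idea of a mutual information network concrete, here is a minimal Python sketch of the simplest method named above, a relevance network, using the empirical entropy estimator with an optional Miller-Madow correction. It is not the minet implementation or its API: the function names, the toy profiles, and the threshold are all assumptions chosen for illustration, and the expression values are taken to be already discretized.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(values, miller_madow=False):
    """Empirical (plug-in) Shannon entropy, in nats, of a discrete sample."""
    n = len(values)
    counts = Counter(values)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    if miller_madow:
        # Miller-Madow bias correction: add (non-empty bins - 1) / (2n).
        h += (len(counts) - 1) / (2 * n)
    return h

def mutual_information(x, y, miller_madow=False):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discretized profiles."""
    joint = list(zip(x, y))
    return (entropy(x, miller_madow) + entropy(y, miller_madow)
            - entropy(joint, miller_madow))

def relevance_network(profiles, threshold):
    """Keep an edge between two genes when their MI exceeds the threshold."""
    edges = {}
    for a, b in combinations(sorted(profiles), 2):
        mi = mutual_information(profiles[a], profiles[b])
        if mi > threshold:
            edges[(a, b)] = mi
    return edges

# Hypothetical discretized expression levels over 8 samples.
profiles = {
    "geneA": [0, 0, 1, 1, 0, 0, 1, 1],
    "geneB": [0, 0, 1, 1, 0, 0, 1, 1],  # identical to geneA
    "geneC": [0, 1, 0, 1, 0, 1, 0, 1],  # statistically independent of geneA
}
network = relevance_network(profiles, threshold=0.1)
print(network)  # only the geneA-geneB edge survives the threshold
```

Methods such as ARACNE, CLR and MRNET start from the same pairwise mutual information matrix but prune or rescore edges differently, so the thresholding step here is the part they replace.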
|