1
Polaka I, Razuka-Ebela D, Park JY, Leja M. Taxonomy-based data representation for data mining: an example of the magnitude of risk associated with H. pylori infection. BioData Min 2021; 14:43. [PMID: 34454568 PMCID: PMC8400764 DOI: 10.1186/s13040-021-00271-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2021] [Accepted: 08/08/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND The amount of available and potentially significant data describing study subjects is ever growing with the introduction and integration of different registries and data banks. The individual attributes in these data are not always necessary; more often, membership of a specific group (e.g. diet, social 'bubble', living area) is enough to build a successful machine learning or data mining model without overfitting it. Therefore, in this article we propose an approach to building taxonomies using clustering to replace detailed data from large heterogeneous data sets from different sources, while improving interpretability. We used the GISTAR study database, which holds exhaustive self-assessment questionnaire data, to demonstrate this approach in the task of differentiating between H. pylori positive and negative study participants and assessing their potential risk factors. We compared the results of taxonomy-based classification with the results of classification using raw data.
RESULTS Evaluation of our approach was carried out using 6 classification algorithms that induce rule-based or tree-based classifiers. The taxonomy-based classification results show no significant loss of information, with similar and up to 2.5% better classification accuracy. Information held by 10 or more attributes can be replaced by one attribute indicating membership of a cluster in a hierarchy at a specific cut. The clusters created this way can be easily interpreted by researchers (doctors, epidemiologists) and describe the co-occurring features in the group, which is significant for the specific task.
CONCLUSIONS While there are always features and measurements that must be used in data analysis as they are, the parallel use of taxonomies for the description of study subjects makes it possible to use membership of specific naturally occurring groups and to assess their impact on an outcome. This can decrease the risk of overfitting (picking attributes and values specific to the training set without explaining the underlying conditions), improve the accuracy of the models, and improve privacy protection of study participants by decreasing the amount of specific information used to identify the individual.
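The idea of replacing many attributes with one cluster-membership attribute can be illustrated with a minimal sketch. The abstract does not state which clustering algorithm, linkage, or distance the authors used, so the average linkage, the Hamming distance, the function names, and the questionnaire answers below are all assumptions made for illustration only.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two binary attribute vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def average_linkage(c1, c2, data):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(hamming(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster_membership(data, n_clusters):
    """Agglomerative clustering, cut where `n_clusters` groups remain.

    Returns one cluster label per subject; the label can stand in for the
    original attribute vector as a single categorical feature.
    """
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > n_clusters:
        # Find and merge the two closest clusters (a < b by construction).
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], data),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [None] * len(data)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical yes/no answers to four diet questions for six subjects.
answers = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
labels = cluster_membership(answers, n_clusters=2)
print(labels)
```

Cutting the hierarchy at a different number of clusters gives a coarser or finer grouping, which corresponds to choosing a different level of the taxonomy.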
Affiliation(s)
- Inese Polaka
  - University of Latvia, Institute of Clinical and Preventive Medicine, Gailezera Street 1, Riga, LV-1079, Latvia
- Danute Razuka-Ebela
  - University of Latvia, Institute of Clinical and Preventive Medicine, Gailezera Street 1, Riga, LV-1079, Latvia
- Jin Young Park
  - International Agency for Research on Cancer, 150 Cours Albert Thomas, 69372, Lyon, CEDEX 08, France
- Marcis Leja
  - University of Latvia, Institute of Clinical and Preventive Medicine, Gailezera Street 1, Riga, LV-1079, Latvia
  - Center for Gastric Diseases GASTRO, Gailezera Street 1, Riga, LV-1079, Latvia
2
Abstract
Public databases featuring original, raw data from "Omics" experiments enable researchers to perform meta-analyses by combining either the raw data or the summarized results of several independent studies. In proteomics, high-throughput protein expression data are measured by diverse techniques such as mass spectrometry, 2-D gel electrophoresis or protein arrays, yielding data of different scales. Therefore, direct data merging can be problematic, and combining the summarized data of the individual studies can be advantageous. A special form of meta-analysis is network meta-analysis, where studies with different settings of experimental groups can be combined. However, all studies must be linked by one experimental group that has to appear in each study. Usually that is the control group. Then, a study network is formed and indirect statistical inferences can also be made between study groups that do not appear in each of the studies. In this chapter, we describe the working principle of and available software for network meta-analysis. The applicability to high-throughput protein expression data is demonstrated in an example from breast cancer research. We also describe the special challenges when applying this method.
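The indirect inference through a shared control group can be sketched in its simplest form, the adjusted indirect comparison: if one study reports A versus C and another reports B versus C, the A versus B effect is estimated as the difference of the two effects, and the variances add. This is a generic illustration, not the software described in the chapter; the function name and the numbers are hypothetical.

```python
import math


def indirect_comparison(effect_a_vs_c, se_a_vs_c, effect_b_vs_c, se_b_vs_c):
    """Indirect estimate of A vs B from two studies sharing control group C.

    Effects are assumed to be on an additive scale (e.g. log fold changes)
    and the two studies are assumed to be independent.
    """
    effect = effect_a_vs_c - effect_b_vs_c
    se = math.sqrt(se_a_vs_c ** 2 + se_b_vs_c ** 2)
    return effect, se


# Hypothetical log fold changes of one protein in two independent studies.
effect, se = indirect_comparison(1.2, 0.3, 0.5, 0.4)
low, high = effect - 1.96 * se, effect + 1.96 * se
print(f"A vs B: {effect:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The indirect estimate is always less precise than either direct comparison, because the standard error combines the uncertainty of both studies.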
Affiliation(s)
- Christine Winter
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Hannover, Germany
- Klaus Jung
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Hannover, Germany
3
Pedersen JW, Larsen LH, Thirsing C, Vezzaro L. Reconstruction of corrupted datasets from ammonium-ISE sensors at WRRFs through merging with daily composite samples. Water Res 2020; 185:116227. [PMID: 32736284 DOI: 10.1016/j.watres.2020.116227] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/25/2020] [Revised: 07/06/2020] [Accepted: 07/23/2020] [Indexed: 06/11/2023]
Abstract
Long-term, continuous datasets of high quality are important for instrumentation, control, and automation (ICA) efforts at wastewater resource recovery facilities (WRRFs). This study presents a methodology to increase the reliability of measurements from ammonium ion-selective electrodes (ISEs). This is done by correcting corrupted ISE data with a data source that is often available at WRRFs (volume-proportional composite samples). A yearlong measurement campaign showed that the existing standard protocols for sensor maintenance might still create corrupted datasets, with poor sensor recalibrations responsible for abrupt and unrealistic jumps in the measurements. The proposed automatic correction methodology removes both recalibration jumps and signal drift by using information from composite samples that are already taken for reporting to legal authorities. Results showed that the developed methodology provided a continuous, high-quality time series without the major data quality issues of the original signal. In fact, the signal was improved for 87% of days when a reference sample was available. The effect of correcting the data before use in a data-driven software sensor was also investigated. The corrected dataset led to noticeably smaller day-to-day variations in estimated NH4+ loads, and to large improvements in both median estimates and prediction bounds. The long time series allowed for an investigation of how much training data is required to fit a software sensor that provides estimates representative of the entire study period. The results showed that 8 weeks of data allowed for a good median estimate, while 16 weeks are required for obtaining good 80% prediction bounds. Overall, the proposed method can increase the applicability of relatively cheap ISE sensors for ICA applications within WRRFs.
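The principle of merging a sensor signal with a daily composite sample can be illustrated with a deliberately simplified sketch. It is not the authors' correction methodology, which the abstract does not detail. The sketch assumes a constant additive bias within one day and removes it by shifting the readings so that their flow-weighted mean equals the composite-sample concentration; all names and values are hypothetical.

```python
def flow_weighted_mean(concentrations, flows):
    """Flow-weighted mean concentration, comparable to a volume-proportional sample."""
    total_flow = sum(flows)
    return sum(c * q for c, q in zip(concentrations, flows)) / total_flow


def correct_day(sensor, flows, composite):
    """Shift one day's sensor readings so their flow-weighted mean matches the composite.

    Assumes the sensor error is a constant offset over the day, so the
    within-day dynamics of the signal are preserved.
    """
    offset = flow_weighted_mean(sensor, flows) - composite
    return [c - offset for c in sensor]


# Hypothetical data for one day: sensor readings (mg N/L), flows (m3/h),
# and the laboratory result of the daily composite sample (mg N/L).
sensor = [22.0, 26.0, 30.0, 24.0]
flows = [100.0, 200.0, 200.0, 100.0]
composite = 20.0

corrected = correct_day(sensor, flows, composite)
print([round(c, 2) for c in corrected])
```

A constant daily offset would leave steps at day boundaries, so a practical method also has to handle drift within the day and the transitions between days.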
Affiliation(s)
- Jonas Wied Pedersen
  - DTU Environment, Technical University of Denmark, Bygningstorvet, Building 115, 2800 Kgs. Lyngby, Denmark
- Laura Holm Larsen
  - DTU Environment, Technical University of Denmark, Bygningstorvet, Building 115, 2800 Kgs. Lyngby, Denmark
- Luca Vezzaro
  - DTU Environment, Technical University of Denmark, Bygningstorvet, Building 115, 2800 Kgs. Lyngby, Denmark; Krüger A/S, Veolia Water Technologies, Gladsaxevej 363, 2860 Søborg, Denmark
4
Liu L, O'Donnell P, Sullivan R, Katalinic A, Moser L, de Boer A, Meunier F. Cancer in Europe: Death sentence or life sentence? Eur J Cancer 2016; 65:150-5. [PMID: 27498140 DOI: 10.1016/j.ejca.2016.07.007] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 06/30/2016] [Accepted: 07/05/2016] [Indexed: 11/26/2022]
Abstract
With so many adults and children receiving successful treatment for their cancer, survivorship is now a 'new' and critical issue. It is increasingly recognised that the growing numbers of survivors face new challenges in their bid to return to 'normal' life. What is not yet so widely recognised is the need for a broad response to help them cope with stigmatisation, misunderstanding, lifelong issues of confidence and social adaptation, and even access to employment and to financial services. As a further stage in its programme of attention to this aspect of cancer, the European Organisation for Research and Treatment of Cancer (EORTC) brought survivors, researchers, carers, authorities and policymakers together at a meeting in Brussels in March/April 2016, to learn at first hand about the post-treatment experience of cancer survivors. The meeting demonstrated that while research is well advanced on many of the medical consequences of survivorship, understanding of many non-clinical, personal and administrative issues is still lacking. The meeting raised the discussion of survivorship research beyond the individual to a population-based approach, exploring the related socioeconomic issues. Its exploration of initiatives across European countries provoked new thinking on the need for effective collaboration, with a new focus on non-clinical issues, including effective dialogue with financial service providers and employers, improvements in collecting, exchanging and accessing data, and above all, ways of translating research outcomes into action. This will require wider recognition that, as Françoise Meunier, Director Special Projects, EORTC, said, 'It is time for a new mind set'.
Affiliation(s)
- Lifang Liu
  - The European Organisation for Research and Treatment of Cancer (EORTC), Avenue Emmanuel Mounier 83/11, 1200 Brussels, Belgium
- Richard Sullivan
  - Institute of Cancer Policy, King's College, London, United Kingdom
- Alexander Katalinic
  - Institute for Social Medicine and Epidemiology, University of Lubeck, Germany
- Angela de Boer
  - Coronel Institute of Occupational Health, Amsterdam Medical Center, Amsterdam, The Netherlands
- Francoise Meunier
  - The European Organisation for Research and Treatment of Cancer (EORTC), Avenue Emmanuel Mounier 83/11, 1200 Brussels, Belgium
5
Abstract
X-ray diffraction from crystals of membrane proteins very often yields incomplete datasets due to, among other things, severe radiation damage. Multiple crystals are thus required to form complete datasets, provided the crystals themselves are isomorphous. Selection and combination of data from multiple crystals is a difficult and tedious task that can be facilitated by purpose-built software. BLEND, in the CCP4 suite of programs for macromolecular crystallography (MX), has been created exactly for this reason. In this chapter the program is described and its workings illustrated by means of data from two membrane proteins.
6
Hokamp K. Perl One-Liners: Bridging the Gap Between Large Data Sets and Analysis Tools. Methods Mol Biol 2015; 1326:177-91. [PMID: 26498621 DOI: 10.1007/978-1-4939-2839-2_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/18/2023]
Abstract
Computational analyses of biological data are becoming increasingly powerful, and researchers intending to carry out their own analyses can often choose from a wide array of tools and resources. However, their application might be obstructed by the wide variety of data formats in use, from standard, commonly used formats to output files from high-throughput analysis platforms. The latter are often too large to be opened, viewed, or edited by standard programs, potentially leading to a bottleneck in the analysis. Perl one-liners provide a simple solution to quickly reformat, filter, and merge data sets in preparation for downstream analyses. This chapter presents example code that can be easily adjusted to meet individual requirements. An online version is available at http://bioinf.gen.tcd.ie/pol.
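The kind of task described here (streaming through a large tab-delimited file, filtering rows, and merging with a second table) can be sketched as follows. The chapter's own examples are Perl one-liners and are not reproduced; this is an equivalent illustration in Python, and the column names, threshold, and sample data are hypothetical.

```python
import csv
import io


def read_tsv(handle):
    """Stream rows of a tab-delimited file as dicts keyed by the header line."""
    return csv.DictReader(handle, delimiter="\t")


def filter_rows(rows, column, minimum):
    """Lazily keep rows whose numeric value in `column` is at least `minimum`."""
    return (row for row in rows if float(row[column]) >= minimum)


def merge_on_key(left_rows, right_rows, key):
    """Inner join: attach right-hand columns to left rows that share `key`."""
    lookup = {row[key]: row for row in right_rows}
    for row in left_rows:
        if row[key] in lookup:
            yield {**row, **lookup[row[key]]}


# In-memory stand-ins for two files; real use would pass open file handles.
expression = io.StringIO("gene\tcount\ngeneA\t120\ngeneB\t3\ngeneC\t45\n")
annotation = io.StringIO("gene\tdescription\ngeneA\tkinase\ngeneC\ttransporter\n")

kept = filter_rows(read_tsv(expression), "count", 10)
merged = list(merge_on_key(kept, read_tsv(annotation), "gene"))
for row in merged:
    print("\t".join(row[c] for c in ("gene", "count", "description")))
```

Only the smaller lookup table is held in memory; the larger file is read one row at a time, which is what makes this pattern usable on files too large to open in standard programs.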
7
Wang F, Song PXK, Wang L. Merging multiple longitudinal studies with study-specific missing covariates: A joint estimating function approach. Biometrics 2015; 71:929-40. [PMID: 26193911 DOI: 10.1111/biom.12356] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 04/01/2014] [Revised: 04/01/2015] [Accepted: 05/01/2015] [Indexed: 11/28/2022]
Abstract
Merging multiple datasets collected from studies with identical or similar scientific objectives is often undertaken in practice to increase statistical power. This article concerns the development of an effective statistical method for merging multiple longitudinal datasets subject to various heterogeneous characteristics, such as different follow-up schedules and study-specific missing covariates (e.g., covariates observed in some studies but missing in other studies). The presence of study-specific missing covariates presents a great methodological challenge in data merging and analysis. We propose a joint estimating function approach to addressing this challenge, in which a novel nonparametric estimating function constructed via splines-based sieve approximation is utilized to bridge estimating equations from studies with missing covariates to those with fully observed covariates. Under mild regularity conditions, we show that the proposed estimator is consistent and asymptotically normal. We evaluate the finite-sample performance of the proposed method through simulation studies. In comparison to the conventional multiple imputation approach, our method exhibits smaller estimation bias. We provide an illustrative data analysis using longitudinal cohorts collected in Mexico City to assess the effect of lead exposures on children's somatic growth.
Affiliation(s)
- Fei Wang
  - Global Analytics, Ford Motor Credit, Dearborn, Michigan 48126, U.S.A
- Peter X-K Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, U.S.A
- Lu Wang
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, U.S.A