1
Lourenço KS, Suleiman AKA, Pijl A, Dimitrov MR, Cantarella H, Kuramae EE. Mix-method toolbox for monitoring greenhouse gas production and microbiome responses to soil amendments. MethodsX 2024; 12:102699. [PMID: 38660030 PMCID: PMC11041840 DOI: 10.1016/j.mex.2024.102699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Accepted: 04/04/2024] [Indexed: 04/26/2024] Open
Abstract
In this study, we adopt an interdisciplinary approach, integrating agronomic field experiments with soil chemistry, molecular biology techniques, and statistics to investigate the impact of organic residue amendments, such as vinasse (a by-product of sugarcane ethanol production), on soil microbiome and greenhouse gas (GHG) production. The research investigates the effects of distinct disturbances, including organic residue application alone or combined with inorganic N fertilizer on the environment. The methods assess soil microbiome dynamics (composition and function), GHG emissions, and plant productivity. Detailed steps for field experimental setup, soil sampling, soil chemical analyses, determination of bacterial and fungal community diversity, quantification of genes related to nitrification and denitrification pathways, measurement and analysis of gas fluxes (N2O, CH4, and CO2), and determination of plant productivity are provided. The outcomes of the methods are detailed in our publications (Lourenço et al., 2018a; Lourenço et al., 2018b; Lourenço et al., 2019; Lourenço et al., 2020). Additionally, the statistical methods and scripts used for analyzing large datasets are outlined. The aim is to assist researchers by addressing common challenges in large-scale field experiments, offering practical recommendations to avoid common pitfalls, and proposing potential analyses, thereby encouraging collaboration among diverse research groups.
• Interdisciplinary methods and scientific questions allow for exploring broader interconnected environmental problems.
• The proposed method can serve as a model and protocol for evaluating the impact of soil amendments on soil microbiome, GHG emissions, and plant productivity, promoting more sustainable management practices.
• Time-series data can offer detailed insights into specific ecosystems, particularly concerning soil microbiota (taxonomy and functions).
Affiliation(s)
- Késia Silva Lourenço
  - Microbial Ecology Department, Netherlands Institute of Ecology (NIOO), Droevendaalsesteeg 10, Wageningen 6708, PB, The Netherlands
  - Soils and Environmental Resources Center, Agronomic Institute of Campinas (IAC), Av. Barão de Itapura 1481, Campinas 13020-902, SP, Brazil
- Afnan Khalil Ahmad Suleiman
  - Microbial Ecology Department, Netherlands Institute of Ecology (NIOO), Droevendaalsesteeg 10, Wageningen 6708, PB, The Netherlands
  - Soil Health group, Bioclear Earth B.V., Rozenburglaan 13, Groningen 9727 DL, The Netherlands
- Agata Pijl
  - Microbial Ecology Department, Netherlands Institute of Ecology (NIOO), Droevendaalsesteeg 10, Wageningen 6708, PB, The Netherlands
- Mauricio R. Dimitrov
  - Microbial Ecology Department, Netherlands Institute of Ecology (NIOO), Droevendaalsesteeg 10, Wageningen 6708, PB, The Netherlands
- Heitor Cantarella
  - Soils and Environmental Resources Center, Agronomic Institute of Campinas (IAC), Av. Barão de Itapura 1481, Campinas 13020-902, SP, Brazil
- Eiko Eurya Kuramae
  - Microbial Ecology Department, Netherlands Institute of Ecology (NIOO), Droevendaalsesteeg 10, Wageningen 6708, PB, The Netherlands
  - Ecology and Biodiversity, Institute of Environmental Biology, Utrecht University, Utrecht, The Netherlands
2
Tsagris M, Lagani V, Tsamardinos I. Feature selection for high-dimensional temporal data. BMC Bioinformatics 2018; 19:17. [PMID: 29357817 PMCID: PMC5778658 DOI: 10.1186/s12859-018-2023-7] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 07/17/2017] [Accepted: 01/11/2018] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND: Feature selection is commonly employed for identifying collectively-predictive biomarkers and biosignatures; it facilitates the construction of small statistical models that are easier to verify, visualize, and comprehend while providing insight to the human expert. In this work we extend established constraint-based feature selection methods to high-dimensional "omics" temporal data, where the number of measurements is orders of magnitude larger than the sample size. The extension required the development of conditional independence tests for temporal and/or static variables conditioned on a set of temporal variables.
RESULTS: The algorithm is able to return multiple, equivalent solution subsets of variables, scale to tens of thousands of features, and outperform or be on par with existing methods depending on the analysis task specifics.
CONCLUSIONS: The use of this algorithm is suggested for variable selection with high-dimensional temporal data.
Affiliation(s)
- Michail Tsagris
  - Department of Computer Science, University of Crete, Voutes Campus, Heraklion, 70013 Greece
- Vincenzo Lagani
  - Department of Computer Science, University of Crete, Voutes Campus, Heraklion, 70013 Greece
- Ioannis Tsamardinos
  - Department of Computer Science, University of Crete, Voutes Campus, Heraklion, 70013 Greece
3
Abstract
When count data exhibit excess zeros, that is, more zero counts than a simpler parametric distribution can model, the zero-inflated Poisson (ZIP) or zero-inflated negative binomial (ZINB) models are often used. Variable selection for these models is even more challenging than for other regression situations because the availability of p covariates implies 4^p possible models. We adapt to zero-inflated models an approach for variable selection that avoids the screening of all possible models. This approach is based on a stochastic search through the space of all possible models, which generates a chain of interesting models. As an additional novelty, we propose three ways of extracting information from this rich chain and we compare them in two simulation studies, where we also contrast our approach with regularization (penalized) techniques available in the literature. The analysis of a typical dataset that has motivated our research is also presented, before concluding with some recommendations.
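The size of the model space quoted in this abstract can be checked by brute force. The sketch below is only an illustration, not the authors' stochastic search: it assumes that each covariate may enter the count component, the zero-inflation component, both, or neither, which gives four states per covariate and hence 4^p candidate models. The function name and covariate labels are hypothetical.

```python
from itertools import product

def enumerate_zi_models(covariates):
    """Yield (count_part, zero_part) covariate subsets of a zero-inflated model."""
    # Assumption: a covariate is in the count part, the zero part, both, or neither.
    states = [(False, False), (True, False), (False, True), (True, True)]
    for assignment in product(states, repeat=len(covariates)):
        count_part = tuple(
            c for c, (in_count, _) in zip(covariates, assignment) if in_count
        )
        zero_part = tuple(
            c for c, (_, in_zero) in zip(covariates, assignment) if in_zero
        )
        yield count_part, zero_part

covariates = ["x1", "x2", "x3"]  # hypothetical covariate names
models = list(enumerate_zi_models(covariates))
n_models = len(models)  # equals 4 ** len(covariates)
```

Exhaustive enumeration of this kind is only feasible for a handful of covariates, which is the motivation the abstract gives for searching the space stochastically instead.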
Affiliation(s)
- Eva Cantoni
  - Research Center for Statistics and Geneva School of Economics and Management, University of Geneva, Geneva, Switzerland
- Marie Auda
  - Research Center for Statistics and Geneva School of Economics and Management, University of Geneva, Geneva, Switzerland
4
Efficient estimation for marginal generalized partially linear single-index models with longitudinal data. TEST-SPAIN 2016. [DOI: 10.1007/s11749-015-0462-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
5
Rajeswaran J, Blackstone EH. A multiphase non-linear mixed effects model: An application to spirometry after lung transplantation. Stat Methods Med Res 2016; 26:21-42. [PMID: 24919830 DOI: 10.1177/0962280214537255] [Citation(s) in RCA: 51] [Impact Index Per Article: 6.4] [Indexed: 11/17/2022]
Abstract
In medical sciences, we often encounter longitudinal temporal relationships that are non-linear in nature. The influence of risk factors may also change across longitudinal follow-up. A system of multiphase non-linear mixed effects model is presented to model temporal patterns of longitudinal continuous measurements, with temporal decomposition to identify the phases and risk factors within each phase. Application of this model is illustrated using spirometry data after lung transplantation using readily available statistical software. This application illustrates the usefulness of our flexible model when dealing with complex non-linear patterns and time-varying coefficients.
Affiliation(s)
- Jeevanantham Rajeswaran
  - Department of Quantitative Health Sciences, Heart and Vascular Institute, Cleveland Clinic, Cleveland, USA
- Eugene H Blackstone
  - Department of Quantitative Health Sciences, Heart and Vascular Institute, Cleveland Clinic, Cleveland, USA
6
Wang P, Zhou J, Qu A. Correlation structure selection for longitudinal data with diverging cluster size. CAN J STAT 2016. [DOI: 10.1002/cjs.11290] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
Affiliation(s)
- Peng Wang
  - Department of Operations, Business Analytics and Information Systems, University of Cincinnati; Cincinnati, OH U.S.A
- Jianhui Zhou
  - Department of Statistics, University of Virginia; Charlottesville, VA U.S.A
- Annie Qu
  - Department of Statistics, University of Illinois at Urbana-Champaign; Champaign, IL U.S.A
7
Guerrier S, Mili N, Molinari R, Orso S, Avella-Medina M, Ma Y. A Predictive Based Regression Algorithm for Gene Network Selection. Front Genet 2016; 7:97. [PMID: 27379155 PMCID: PMC4908120 DOI: 10.3389/fgene.2016.00097] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 01/29/2016] [Accepted: 05/16/2016] [Indexed: 11/13/2022] Open
Abstract
Gene selection has become a common task in most gene expression studies. The objective of such research is often to identify the smallest possible set of genes that can still achieve good predictive performance. To do so, many of the recently proposed classification methods require some form of dimension reduction of the problem, which ultimately provides a single model as output and, in most cases, relies on the likelihood function to achieve variable selection. We propose a new prediction-based objective function that can be tailored to the requirements of practitioners and can be used to assess and interpret a given problem. Based on cross-validation techniques and the idea of importance sampling, our proposal scans low-dimensional models under the assumption of sparsity and, for each of them, estimates the objective function to assess its predictive power in order to select among them. Two applications on cancer data sets and a simulation study show that the proposal compares favorably with competing alternatives such as Elastic Net and Support Vector Machine. Indeed, the proposed method not only selects smaller models with better, or at least comparable, classification errors but also provides a set of selected models instead of a single one, making it possible to construct a network of possible models for a target prediction accuracy level.
Affiliation(s)
- Stéphane Guerrier
  - Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, USA
- Nabil Mili
  - Research Center for Statistics, Geneva School of Economics and Management, University of Geneva, Geneva, Switzerland
- Roberto Molinari
  - Research Center for Statistics, Geneva School of Economics and Management, University of Geneva, Geneva, Switzerland
- Samuel Orso
  - Research Center for Statistics, Geneva School of Economics and Management, University of Geneva, Geneva, Switzerland
- Marco Avella-Medina
  - Research Center for Statistics, Geneva School of Economics and Management, University of Geneva, Geneva, Switzerland
- Yanyuan Ma
  - Department of Statistics, University of South Carolina, Columbia, SC, USA
8
Gosho M. Model selection in the weighted generalized estimating equations for longitudinal data with dropout. Biom J 2015; 58:570-87. [DOI: 10.1002/bimj.201400045] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 02/24/2014] [Revised: 02/23/2015] [Accepted: 08/10/2015] [Indexed: 11/10/2022]
Affiliation(s)
- Masahiko Gosho
  - Advanced Medical Research Center; Aichi Medical University; 1-1, Yazakokarimata Nagakute Aichi 480-1195 Japan
  - Department of Clinical Trial and Clinical Epidemiology; Faculty of Medicine; University of Tsukuba; 1-1-1, Tennodai Tsukuba Ibaraki 305-8575 Japan
9
Xu P, Zhu L, Li Y. Ultrahigh dimensional time course feature selection. Biometrics 2014; 70:356-65. [PMID: 24571586 PMCID: PMC4061374 DOI: 10.1111/biom.12137] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 04/01/2013] [Revised: 12/01/2013] [Accepted: 12/01/2013] [Indexed: 12/13/2022]
Abstract
Statistical challenges arise from modern biomedical studies that produce time course genomic data with ultrahigh dimensions. In a renal cancer study that motivated this paper, the pharmacokinetic measures of a tumor suppressor (CCI-779) and expression levels of 12,625 genes were measured for each of 33 patients at 8 and 16 weeks after the start of treatments, with the goal of identifying predictive gene transcripts and the interactions with time in peripheral blood mononuclear cells for pharmacokinetics over the time course. The resulting data set defies analysis even with regularized regression. Although some remedies have been proposed for both linear and generalized linear models, there are virtually no solutions in the time course setting. As such, a novel GEE-based screening procedure is proposed, which only pertains to the specifications of the first two marginal moments and a working correlation structure. Different from existing methods that either fit separate marginal models or compute pairwise correlation measures, the new procedure merely involves making a single evaluation of estimating functions and thus is extremely computationally efficient. The new method is robust against the mis-specification of correlation structures and enjoys theoretical readiness, which is further verified via Monte Carlo simulations. The procedure is applied to analyze the aforementioned renal cancer study and identify gene transcripts and possible time-interactions that are relevant to CCI-779 metabolism in peripheral blood.
Affiliation(s)
- Peirong Xu
  - Department of Mathematics, Southeast University, Nanjing, China
- Lixing Zhu
  - Department of Mathematics, Hong Kong Baptist University, Hong Kong, China
- Yi Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, USA
10
Data mining for longitudinal data under multicollinearity and time dependence using penalized generalized estimating equations. Comput Stat Data Anal 2014. [DOI: 10.1016/j.csda.2013.02.023] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022]
11
|
|
12
|
|
13
Huang SH, Wulsin LR, Li H, Guo J. Dimensionality reduction for knowledge discovery in medical claims database: application to antidepressant medication utilization study. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2009; 93:115-123. [PMID: 18835058 DOI: 10.1016/j.cmpb.2008.08.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 06/26/2008] [Revised: 08/06/2008] [Accepted: 08/13/2008] [Indexed: 05/26/2023]
Abstract
Data mining, through its capacity to discover knowledge embedded in large databases to improve organizational decision-making, has the potential to contribute to efficiencies and cost savings in the increasingly costly healthcare industry. One important aspect of mining medical databases is reducing dimensionality through feature selection. Traditionally, feature selection is accomplished through stepwise regression, which tends to produce an unnecessarily high number of "significant" variables. This paper applies a filter-based feature selection method, using an inconsistency rate measure and discretization, to a medical claims database to predict the adequacy of duration of antidepressant medication utilization. Compared to traditional stepwise logistic regression, which selected seven variables from a total of nine potential explanatory variables to characterize patients with inadequate antidepressant medication utilization, the filter-based method selected two variables (age and number of claims) to achieve a similar prediction accuracy. This comparison suggests it may be feasible and efficient to apply the filter-based feature selection method to reduce the dimensionality of healthcare databases.
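To make the filter criterion named in this abstract concrete, here is a minimal sketch of an inconsistency rate computed on already-discretized data. It uses the commonly cited definition (rows that share the same values on the selected features but carry different class labels count as inconsistent beyond the majority class), which is an assumption and may differ in detail from the paper's procedure. The function name, feature values, and labels are hypothetical.

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, feature_idx):
    """Fraction of rows not explained by the majority label of their pattern."""
    # Group class labels by the value pattern on the selected features.
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        pattern = tuple(row[i] for i in feature_idx)
        groups[pattern][label] += 1
    # Within each pattern, every row outside the majority class is inconsistent.
    inconsistent = sum(
        sum(counts.values()) - max(counts.values()) for counts in groups.values()
    )
    return inconsistent / len(rows)

# Hypothetical discretized records: (age band, claims band, region)
rows = [
    ("young", "few", "urban"),
    ("young", "few", "rural"),
    ("old", "many", "urban"),
    ("old", "many", "rural"),
    ("old", "few", "urban"),
    ("old", "few", "urban"),
]
labels = ["inadequate", "inadequate", "adequate", "adequate", "adequate", "inadequate"]

rate_all = inconsistency_rate(rows, labels, (0, 1, 2))     # all three features
rate_age_claims = inconsistency_rate(rows, labels, (0, 1))  # drop region
rate_region = inconsistency_rate(rows, labels, (2,))        # region only
```

In this toy data, dropping the region feature leaves the rate unchanged, while using region alone raises it. A filter method keeps the smallest subset whose rate stays close to that of the full feature set.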
Affiliation(s)
- Samuel H Huang
  - Department of Mechanical Engineering, University of Cincinnati, Cincinnati, OH 45221, USA.
14
Cui J, Qian G. Selection of Working Correlation Structure and Best Model in GEE Analyses of Longitudinal Data. COMMUN STAT-SIMUL C 2007. [DOI: 10.1080/03610910701539617] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.5] [Indexed: 10/22/2022]