1
Zoehler B, de Aguiar AM, Silveira GF. SAEDC: Development of a technological solution for exploratory data analysis and statistics in cytotoxicity. Comput Struct Biotechnol J 2024; 23:483-490. [PMID: 38261941 PMCID: PMC10796974 DOI: 10.1016/j.csbj.2023.12.020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 12/14/2023] [Accepted: 12/15/2023] [Indexed: 01/25/2024] Open
Abstract
INTRODUCTION: The intergovernmental organizations Organisation for Economic Co-operation and Development (OECD) and Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM) have developed guidelines for the use of in vitro models for toxicological evaluation of chemicals. However, the manual steps and the multiple tools required for data analysis are costly and time-consuming, and can lead researchers to introduce errors inadvertently. OBJECTIVES: We have developed the SAEDC platform (Technological Solution for Exploratory Data Analysis and Statistics for Cytotoxicity, in Portuguese), which enables analysis of cytotoxicity data from assays following OECD Guideline No. 129. METHODOLOGY: In vitro experimental data were used to compare the platform with the analysis methodology suggested in the Guideline. We analyzed 117 data sets covering chemicals from Category I to Unclassified according to GHS classification. RESULTS: The four parameters of the non-linear regression (4PL) calculated by the SAEDC platform showed no significant differences compared to the standard methodology in any of the data sets (p > 0.05). The coefficient of determination (R-squared) demonstrated both a good fit of the 4PL model to the data and significant similarity to the values obtained by the conventional methodology. Finally, the SAEDC platform predicted LD50 values for the chemicals from IC50, using the Registry of Cytotoxicity (RC) regression models. CONCLUSION: The comparison with the standard data analysis methodology revealed that the SAEDC platform fulfills the requirements for cytotoxicity data analysis, generating reliable and accurate results with fewer steps performed by researchers. The use of the SAEDC platform for obtaining toxicity values can reduce analysis time compared to the standard methodology proposed by regulatory agencies. 
Thus, automation of the analysis using the SAEDC platform has the potential to save time and resources for cytotoxicity researchers and laboratories while generating reliable results.
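As an illustration of the four-parameter logistic (4PL) model mentioned in the abstract, the sketch below evaluates one common parameterization of the curve. It is not the SAEDC implementation: the function name, the parameter names (`bottom`, `top`, `ic50`, `hill`), and the example values are all assumptions, and the abstract does not state which 4PL form the platform fits.

```python
def four_pl(x, bottom, top, ic50, hill):
    """Four-parameter logistic response at concentration x (x > 0).

    bottom/top are the lower and upper asymptotes, ic50 is the
    concentration giving the half-way response, hill is the slope factor.
    """
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** hill)


# Hypothetical viability curve: 100 % at low dose, 0 % at high dose.
params = dict(bottom=0.0, top=100.0, ic50=25.0, hill=1.5)

for conc in (1.0, 25.0, 500.0):
    print(conc, round(four_pl(conc, **params), 2))
```

With this parameterization the response at `x = ic50` is exactly midway between the two asymptotes (50.0 for the hypothetical values above), which is how an IC50 is read off a fitted curve.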
Affiliation(s)
- Bernardo Zoehler
  - Instituto Carlos Chagas – ICC, Fundação Oswaldo Cruz – Fiocruz, Brazil
- Alessandra Melo de Aguiar
  - Plataforma de Bioensaios com métodos alternativos em citotoxicidade, Instituto Carlos Chagas – ICC, Fundação Oswaldo Cruz – Fiocruz, Brazil
  - Laboratório de Biologia Básica de Células-tronco, Instituto Carlos Chagas – ICC, Fundação Oswaldo Cruz – Fiocruz, Brazil
2
Ordóñez Á, Sánchez E, Carlos Solano J, Parra-Domínguez J. Demand charges reduction with photovoltaics in industry. Heliyon 2024; 10:e23404. [PMID: 38169926 PMCID: PMC10758794 DOI: 10.1016/j.heliyon.2023.e23404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2023] [Revised: 11/13/2023] [Accepted: 12/04/2023] [Indexed: 01/05/2024] Open
Abstract
Demand charges are widely used for commercial and industrial consumers. These costs are often not well known, let alone the effects that PV can have on them. This work proposes a methodology to assess the effect of PV on reducing these charges and to optimise the power to be contracted, using techniques taken from exploratory data analysis. This methodology is applied to five case studies of industrial consumers from different sectors in Spain, finding savings between 5 % and 11 % of demand charges in industries with continuous operation and up to 28 % in cases of discontinuous operation. These savings can be even greater if the maximum power that can be contracted is lower than the optimum. The demand charges in Spain consist of a fixed part proportional to the contracted power and a variable part depending on the power peaks exceeding it. Since for the variable part the coincident and non-coincident models coexist, a comparison is made between the two models, finding that in the general case PV users can achieve higher savings with the coincident model.
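The charge structure described in this abstract (a fixed part proportional to the contracted power, plus a variable part for peaks above it) can be sketched as follows. This is a simplified illustration and not the authors' methodology: the rates, the peak values, and the function names are hypothetical, and billing periods and the coincident/non-coincident distinction are not modelled.

```python
def demand_charge(peaks_kw, contracted_kw, fixed_rate, excess_rate):
    """Fixed part on contracted power plus a penalty on peaks above it."""
    fixed = fixed_rate * contracted_kw
    variable = excess_rate * sum(max(0.0, p - contracted_kw) for p in peaks_kw)
    return fixed + variable


def best_contracted_power(peaks_kw, candidates_kw, fixed_rate, excess_rate):
    """Pick the candidate contracted power with the lowest total charge."""
    return min(
        candidates_kw,
        key=lambda c: demand_charge(peaks_kw, c, fixed_rate, excess_rate),
    )


# Hypothetical monthly peak demands in kW, without and with PV.
peaks_no_pv = [80.0, 95.0, 120.0, 90.0]
peaks_pv = [70.0, 85.0, 100.0, 80.0]
candidates = [float(c) for c in range(60, 131, 5)]
FIXED_RATE, EXCESS_RATE = 3.0, 2.0  # hypothetical currency units per kW

for label, peaks in (("no PV", peaks_no_pv), ("PV", peaks_pv)):
    power = best_contracted_power(peaks, candidates, FIXED_RATE, EXCESS_RATE)
    cost = demand_charge(peaks, power, FIXED_RATE, EXCESS_RATE)
    print(label, power, cost)
```

In this toy setting, lowering the peaks with PV lowers both the best contracted power and the resulting charge, which is the kind of effect the study quantifies on real consumption data.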
Affiliation(s)
- Ángel Ordóñez
  - Facultad de la Energía, Universidad Nacional de Loja, Avda. Pío Jaramillo Alvarado, 110110 Loja, Ecuador
  - Escuela Técnica Superior de Ingeniería Industrial, Universidad de Salamanca, Avda. Fernando Ballesteros, 2, 37700 Béjar, Spain
- Esteban Sánchez
  - Escuela Técnica Superior de Ingeniería Industrial, Universidad de Salamanca, Avda. Fernando Ballesteros, 2, 37700 Béjar, Spain
- Juan Carlos Solano
  - Facultad de la Energía, Universidad Nacional de Loja, Avda. Pío Jaramillo Alvarado, 110110 Loja, Ecuador
- Javier Parra-Domínguez
  - Escuela Técnica Superior de Ingeniería Industrial, Universidad de Salamanca, Avda. Fernando Ballesteros, 2, 37700 Béjar, Spain
  - Department of Business Studies - School of Economics and Business, University of Salamanca, Campus Miguel de Unamuno, P.° Francisco Tomás y Valiente, s/n, 37007 Salamanca, Spain
  - BISITE Research Group, Edificio I+D+i - C, C. Espejo, s/n, 37007 Salamanca, Spain
3
Fan YV, Čuček L, Si C, Jiang P, Vujanović A, Krajnc D, Lee CT. Uncovering environmental performance patterns of plastic packaging waste in high recovery rate countries: An example of EU-27. Environ Res 2024; 241:117581. [PMID: 37967705 DOI: 10.1016/j.envres.2023.117581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 10/30/2023] [Accepted: 11/01/2023] [Indexed: 11/17/2023]
Abstract
Plastic consumption and its end-of-life management pose a significant environmental footprint and are energy intensive. Waste-to-resources and prevention strategies have been promoted widely in Europe as countermeasures; however, their effectiveness remains uncertain. This study aims to uncover the environmental footprint patterns of the plastics value chain in the European Union Member States (EU-27) through exploratory data analysis with dimension reduction and grouping. Nine variables are assessed, ranging from socioeconomic and demographic to environmental impacts. Three clusters are formed according to the similarity of a range of characteristics (nine), with environmental impacts being identified as the primary influencing variable in determining the clusters. Most countries belong to Cluster 0, consisting of 17 countries in 2014 and 18 countries in 2019. They represent clusters with a relatively low global warming potential (GWP), with an average value of 2.64 t CO2eq/cap in 2014 and 4.01 t CO2eq/cap in 2019. Among all the assessed countries, Denmark showed a significant change when assessed within the traits of EU-27, categorised from Cluster 1 (high GWP) in 2014 to Cluster 0 (low GWP) in 2019. The analysis of plastic packaging waste statistics in 2019 (data released in 2022) shows that, despite an increase in the recovery rate within the EU-27, the GWP has not reduced, suggesting a rebound effect. The GWP tends to increase in correlation with the higher plastic waste amount. In contrast, other environmental impacts, like eutrophication, abiotic and acidification potential, are identified to be mitigated effectively via recovery, suppressing the adverse effects of an increase in plastic waste generation. The five-year interval data analysis identified distinct clusters within a set of patterns, categorising them based on their similarities. The categorisation and managerial insights serve as a foundation for devising a focused mitigation strategy.
Affiliation(s)
- Yee Van Fan
  - Sustainable Process Integration Laboratory - SPIL, NETME Centre, Faculty of Mechanical Engineering, Brno University of Technology, Technická 2896/2, 616 69 Brno, Czech Republic
- Lidija Čuček
  - Faculty of Chemistry and Chemical Engineering, University of Maribor, Smetanova 17, Maribor, Slovenia
- Chunyan Si
  - Sustainable Process Integration Laboratory - SPIL, NETME Centre, Faculty of Mechanical Engineering, Brno University of Technology, Technická 2896/2, 616 69 Brno, Czech Republic
- Peng Jiang
  - Department of Industrial Engineering and Management, Business School, Sichuan University, Chengdu 610064, China
- Annamaria Vujanović
  - Faculty of Chemistry and Chemical Engineering, University of Maribor, Smetanova 17, Maribor, Slovenia
- Damjan Krajnc
  - Faculty of Chemistry and Chemical Engineering, University of Maribor, Smetanova 17, Maribor, Slovenia
- Chew Tin Lee
  - Faculty of Chemical and Energy Engineering, Universiti Teknologi Malaysia, 81310, Johor Bahru, Johor, Malaysia
4
Gonzalez-Ponce K, Horta Andrade C, Hunter F, Kirchmair J, Martinez-Mayorga K, Medina-Franco JL, Rarey M, Tropsha A, Varnek A, Zdrazil B. School of cheminformatics in Latin America. J Cheminform 2023; 15:82. [PMID: 37726809 PMCID: PMC10507835 DOI: 10.1186/s13321-023-00758-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Accepted: 09/10/2023] [Indexed: 09/21/2023] Open
Abstract
We report the major highlights of the School of Cheminformatics in Latin America, Mexico City, November 24-25, 2022. Six lectures, one workshop, and one roundtable with four editors were presented during an online public event with speakers from academia, big pharma, and public research institutions. One thousand one hundred eighty-one students and academics from seventy-nine countries registered for the meeting. As part of the meeting, advances in enumeration and visualization of chemical space, applications in natural product-based drug discovery, drug discovery for neglected diseases, toxicity prediction, and general guidelines for data analysis were discussed. Experts from ChEMBL presented a workshop on how to use the resources of this major compounds database used in cheminformatics. The school also included a round table with editors of cheminformatics journals. The full program of the meeting and the recordings of the sessions are publicly available at https://www.youtube.com/@SchoolChemInfLA/featured .
Affiliation(s)
- Karla Gonzalez-Ponce
  - Institute of Chemistry, Campus Merida, National Autonomous University of Mexico, Merida‑Tetiz Highway, Km. 4.5, Ucu, Yucatan, Mexico
- Carolina Horta Andrade
  - LabMol - Laboratory for Molecular Modeling and Drug Design, Faculdade de Farmacia, Universidade Federal de Goias, Goiania, GO, Brazil
- Fiona Hunter
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
- Johannes Kirchmair
  - Division of Pharmaceutical Chemistry, Department of Pharmaceutical Sciences, University of Vienna, Josef-Holaubek-Platz 2, 2D 303, 1090, Vienna, Austria
- Karina Martinez-Mayorga
  - Institute of Chemistry, Campus Merida, National Autonomous University of Mexico, Merida‑Tetiz Highway, Km. 4.5, Ucu, Yucatan, Mexico
  - Institute for Applied Mathematics and Systems, Merida Research Unit, National Autonomous University of Mexico, Sierra Papacal, Merida, Yucatan, Mexico
- José L Medina-Franco
  - DIFACQUIM Research Group, Department of Pharmacy, School of Chemistry, National Autonomous University of Mexico, Avenida Universidad 3000, 04510, Mexico City, Mexico
- Matthias Rarey
  - ZBH - Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146, Hamburg, Germany
- Alexander Tropsha
  - Molecular Modeling Laboratory, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Alexandre Varnek
  - Laboratoire d'Infochimie, UMR 7177 CNRS, Université de Strasbourg, 4, Rue B. Pascal, 67000, Strasbourg, France
- Barbara Zdrazil
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
5
Hernandez-Betancur JD, Ruiz-Mercado GJ, Martin M. Tracking end-of-life stage of chemicals: A scalable data-centric and chemical-centric approach. Resour Conserv Recycl 2023; 196:1-13. [PMID: 37476199 PMCID: PMC10355112 DOI: 10.1016/j.resconrec.2023.107031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/22/2023]
Abstract
Chemical flow analysis (CFA) can be used for collecting life-cycle inventory (LCI), estimating environmental releases, and identifying potential exposure scenarios for chemicals of concern at the end-of-life (EoL) stage. Nonetheless, the demand for comprehensive data and the epistemic uncertainties about the pathway taken by the chemical flows make CFA, LCI, and exposure assessment time-consuming and challenging tasks. Due to the continuous growth of computer power and the appearance of more robust algorithms, data-driven modelling represents an attractive tool for streamlining these tasks. However, a data ingestion pipeline is required for the deployment of serving data-driven models in the real world. Hence, this work moves forward by contributing a chemical-centric and data-centric approach to extract, transform, and load comprehensive data for CFA at the EoL, integrating cross-year and country data and its provenance as part of the data lifecycle. The framework is scalable and adaptable to production-level machine learning operations. The framework can supply data at an annual rate, making it possible to deal with changes in the statistical distributions of model predictors like transferred amount and target variables (e.g., EoL activity identification) to avoid potential data-driven model performance decay over time. For instance, it can detect that recycling transfers of 643 chemicals over the reporting years (1988 to 2020) are 29.87%, 17.79%, and 20.56% for Canada, Australia, and the U.S. Finally, the developed approach enables research advancements on data-driven modelling to easily connect with other data sources for economic information on industry sectors, the economic value of chemicals, and the environmental regulatory implications that may affect the occurrence of an EoL transfer class or activity like recycling of a chemical over years and countries. 
Finally, stakeholders gain more context about environmental regulation stringency and economic affairs that could affect environmental decision-making and EoL chemical exposure predictions.
Affiliation(s)
- Gerardo J. Ruiz-Mercado
  - Office of Research & Development, U.S. Environmental Protection Agency, Cincinnati, OH, 45268, USA
  - Chemical Engineering Graduate Program, Universidad del Atlántico, Puerto Colombia, 080007, Colombia
- Mariano Martin
  - Department of Chemical Engineering, University of Salamanca, Salamanca, 37008, Spain
6
Tseng YJ, Chen CJ, Chang CW. lab: an R package for generating analysis-ready data from laboratory records. PeerJ Comput Sci 2023; 9:e1528. [PMID: 37705643 PMCID: PMC10495959 DOI: 10.7717/peerj-cs.1528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Accepted: 07/20/2023] [Indexed: 09/15/2023]
Abstract
Background: Electronic health records (EHRs) play a crucial role in healthcare decision-making by giving physicians insights into disease progression and suitable treatment options. Within EHRs, laboratory test results are frequently utilized for predicting disease progression. However, processing laboratory test results often poses challenges due to variations in units and formats. In addition, leveraging the temporal information in EHRs can improve outcomes, prognoses, and diagnosis prediction. Nevertheless, the irregular frequency of the data in these records necessitates data preprocessing, which can add complexity to time-series analyses. Methods: To address these challenges, we developed an open-source R package that facilitates the extraction of temporal information from laboratory records. The proposed lab package generates analysis-ready time-series data by segmenting the data into time-series windows and imputing missing values. Moreover, users can map local laboratory codes to the Logical Observation Identifier Names and Codes (LOINC), an international standard. This mapping allows users to incorporate additional information, such as reference ranges and related diseases. Moreover, the reference ranges provided by LOINC enable us to categorize results into normal or abnormal. Finally, the analysis-ready time-series data can be further summarized using descriptive statistics and utilized to develop models using machine learning technologies. Results: Using the lab package, we analyzed data from MIMIC-III, focusing on newborns with patent ductus arteriosus (PDA). We extracted time-series laboratory records and compared the differences in test results between patients with and without 30-day in-hospital mortality. We then identified significant variations in several laboratory test results 7 days after PDA diagnosis. 
Leveraging the analysis-ready time-series data, we trained a prediction model with the long short-term memory algorithm, achieving an area under the receiver operating characteristic curve of 0.83 for predicting 30-day in-hospital mortality in model training. These findings demonstrate the lab package's effectiveness in analyzing disease progression. Conclusions: The proposed lab package simplifies and expedites the workflow involved in laboratory records extraction. This tool is particularly valuable in assisting clinical data analysts in overcoming the obstacles associated with heterogeneous and sparse laboratory records.
Affiliation(s)
- Yi-Ju Tseng
  - Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
  - Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA, United States of America
- Chun Ju Chen
  - Department of Information Management, National Taiwan University, Taipei, Taiwan
- Chia Wei Chang
  - Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
7
Koteeswaran S, Suganya R, Surianarayanan C, Neeba EA, Suresh A, Chelliah PR, Buhari SM. A supervised learning approach for the influence of comorbidities in the analysis of COVID-19 mortality in Tamil Nadu. Soft comput 2023:1-15. [PMID: 37362286 PMCID: PMC10238245 DOI: 10.1007/s00500-023-08590-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/19/2023] [Indexed: 06/28/2023]
Abstract
COVID-19 has created many complications in today's world. It has negatively impacted the lives of many people and emphasized the need for a better health system everywhere. COVID-19 is a life-threatening disease, and a high proportion of people have lost their lives due to this pandemic. This situation enables us to dig deeper into mortality records and find meaningful patterns to save many lives in future. Based on the article from the New Indian Express (published on January 19, 2021), a whopping 82% of people who died of COVID-19 in Tamil Nadu had comorbidities, while 63 percent of people who died of the disease were above the age of 60, as per data from the Health Department. The data, part of a presentation shown to Union Health Minister Harsh Vardhan, show that of the 12,200 deaths till January 7, as many as 10,118 patients had comorbidities, and 7613 were aged above 60. A total of 3924 people (32%) were aged between 41 and 60. Compared to the 1st wave of COVID-19, the 2nd wave had a high mortality rate. Therefore, it is important to find meaningful insights from the mortality records of COVID-19 patients to know the most vulnerable population and to decide on comprehensive treatment strategies.
Affiliation(s)
- S. Koteeswaran
  - Department of CSE (AI&ML), S.A. Engineering College, Chennai, 600077 Tamil Nadu India
- R. Suganya
  - School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Tamil Nadu India
- Chellammal Surianarayanan
  - Centre for Distance and Online Education, Bharathidasan University, Tiruchirappalli, Tamil Nadu India
- E. A. Neeba
  - Department of Information Technology, Rajagiri School of Engineering and Technology, Kochi, Kerala India
- A. Suresh
  - Department of Networking and Communications, School of Computing, Faculty of Engineering and Technology, SRM Institute of Science and Technology, Kattankulathur, Chennai, 603202 Tamil Nadu India
- Seyed M. Buhari
  - School of Business, Universiti Teknologi Brunei, Jalan Tungku Link, Mukim Gadong A, BE1410 Brunei
8
Rahnenführer J, De Bin R, Benner A, Ambrogi F, Lusa L, Boulesteix AL, Migliavacca E, Binder H, Michiels S, Sauerbrei W, McShane L. Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges. BMC Med 2023; 21:182. [PMID: 37189125 DOI: 10.1186/s12916-023-02858-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2022] [Accepted: 04/03/2023] [Indexed: 05/17/2023] Open
Abstract
BACKGROUND: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. METHODS: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 "High-dimensional data" of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. RESULTS: The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. 
CONCLUSIONS: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
Affiliation(s)
- Axel Benner
  - Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Federico Ambrogi
  - Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy
  - Scientific Directorate, IRCCS Policlinico San Donato, San Donato Milanese, Italy
- Lara Lusa
  - Department of Mathematics, Faculty of Mathematics, Natural Sciences and Information Technology, University of Primorska, Koper, Slovenia
  - Institute of Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
- Anne-Laure Boulesteix
  - Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany
- Harald Binder
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Stefan Michiels
  - Service de Biostatistique et d'Épidémiologie, Gustave Roussy, Université Paris-Saclay, Villejuif, France
  - Oncostat U1018, Inserm, Université Paris-Saclay, Labeled Ligue Contre le Cancer, Villejuif, France
- Willi Sauerbrei
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Lisa McShane
  - Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD, USA
9
Pakkan S, Sudhakar C, Tripathi S, Rao M. A correlation study of sustainable development goal (SDG) interactions. Qual Quant 2023; 57:1937-1956. [PMID: 35729959 PMCID: PMC9189271 DOI: 10.1007/s11135-022-01443-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/20/2022] [Indexed: 11/05/2022]
Abstract
As universities are change agents of society, institutions from all nations set their goals to transform the world by exploring the societal challenges that humans are facing. Together, higher education systems across the world are developing strategies based on the United Nations' Sustainable Development Goals (SDGs). The current study aimed to give policymakers, academics, and researchers insight into the influence of 16 SDGs on each other, paving the way for universities to set clear goals for attaining the Sustainable Development Goals by 2030. To analyze the SDGs' interactions with each other, 201,844 research publications from India on 16 SDGs, published over five years, were retrieved from the Scopus database. Spearman rank correlation was applied to measure the correlation of each SDG with every other. We observed converging results from the interactions among the SDGs. Significant positive and moderately positive correlations between pairs of SDGs were identified. A significant number of negative correlations were also found, which calls for deep thinking among researchers to develop healthy relationships. The frequent interactions between SDGs are a positive sign for any university strategizing its goals towards the SDGs. The association of all university stakeholders and some constitutional and cultural changes are necessary to put the SDGs at the core of university management. Embracing this task by researchers will improve the overall performance of universities. The analysis presented in the present study is useful for academics, governments, funding agencies, researchers, and policy-makers.
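For readers unfamiliar with the statistic used in this study, the sketch below computes a Spearman rank correlation as the Pearson correlation of average ranks. It is a generic illustration, not the authors' code: the function names and the two example series (hypothetical yearly publication counts for two SDGs) are assumptions, since the abstract does not specify exactly which quantities were correlated.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical yearly publication counts for two SDGs over five years.
sdg_a = [120, 150, 180, 240, 300]
sdg_b = [30, 45, 40, 80, 95]
print(round(spearman(sdg_a, sdg_b), 3))
```

Because the statistic depends only on ranks, it captures monotonic association between two series and is insensitive to the very different publication volumes that individual SDGs may have.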
Affiliation(s)
- Sheeba Pakkan
  - Manipal Academy of Higher Education, Manipal, Karnataka India
- Christopher Sudhakar
  - Department of Quality, Manipal Academy of Higher Education, Manipal, Karnataka India
- Mahabaleshwara Rao
  - Department of Library and Information Sciences, Manipal Academy of Higher Education, Manipal, Karnataka India
10
Zoiros A, Vrahatis A. Effective Preprocessing of Single-Cell RNA-Seq for Unravelling Alzheimer's Disease Signatures. Adv Exp Med Biol 2023; 1423:251-256. [PMID: 37525052 DOI: 10.1007/978-3-031-31978-5_25] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/02/2023]
Abstract
The development in the field of biomedical technology has brought significant progress in the diagnosis and prediction of many complex diseases. Part of this development is the single-cell RNA sequencing analysis, which allows the study of a complex disease in great depth at the cellular level. Such analyses can decipher the mechanisms that cause complex diseases, such as Alzheimer's disease (AD). However, the increasing depth in the collection of single-cell RNA sequencing data implies, in addition to greater challenges, the production of a large amount of information, which needs careful analysis. Toward this direction, we examine the approach to single-cell RNA sequencing data through the development of an exploratory data analysis methodology. For this purpose, a combination of various tools is presented for their effective and efficient processing. At the same time, reference is made to the relevant biological concepts, the goals and challenges of the studies, and the workflows of sequencing, preprocessing, and analysis of the data. Our framework is applied to Alzheimer's disease data providing evidence that such data are quite complex while the appropriate preprocess step can boost the machine learning processes for identifying AD signatures.
Affiliation(s)
- Apollon Zoiros
  - Interdisciplinary PSP Bioinformatics and Neuroinformatics (BNP), School of Science and Technology, Hellenic Open University, Patras, Greece
- Aristidis Vrahatis
  - Bioinformatics and Human Electrophysiology Lab (BiHELab), Department of Informatics, Ionian University, Corfu, Greece
11
Kumar K, Pande BP. Air pollution prediction with machine learning: a case study of Indian cities. Int J Environ Sci Technol (Tehran) 2022; 20:5333-5348. [PMID: 35603096 PMCID: PMC9107909 DOI: 10.1007/s13762-022-04241-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/18/2021] [Revised: 02/17/2022] [Accepted: 04/19/2022] [Indexed: 05/06/2023]
Abstract
The survival of mankind cannot be imagined without air. Consistent developments in almost all realms of modern human society affected the health of the air adversely. Daily industrial, transport, and domestic activities are stirring hazardous pollutants in our environment. Monitoring and predicting air quality have become essentially important in this era, especially in developing countries like India. In contrast to the traditional methods, the prediction technologies based on machine learning techniques are proved to be the most efficient tools to study such modern hazards. The present work investigates six years of air pollution data from 23 Indian cities for air quality analysis and prediction. The dataset is well preprocessed and key features are selected through the correlation analysis. An exploratory data analysis is exercised to develop insights into various hidden patterns in the dataset and pollutants directly affecting the air quality index are identified. A significant fall in almost all pollutants is observed in the pandemic year, 2020. The data imbalance problem is solved with a resampling technique and five machine learning models are employed to predict air quality. The results of these models are compared with the standard metrics. The Gaussian Naive Bayes model achieves the highest accuracy while the Support Vector Machine model exhibits the lowest accuracy. The performances of these models are evaluated and compared through established performance parameters. The XGBoost model performed the best among the other models and gets the highest linearity between the predicted and actual data.
Affiliation(s)
- K. Kumar
- Sikh National College, Qadian, Guru Nanak Dev University, Amritsar, Punjab, India
- B. P. Pande
- Department of Computer Applications, LSM, Government PG College, Pithoragarh, Uttarakhand, India
|
12
|
Abstract
This commentary reflects on the articles included in the Psychometrika Special Issue on Network Psychometrics in Action. The contributions to the special issue are related to several possible future paths for research in this area. These include the development of models to analyze and represent interventions, improvement in exploratory and inferential techniques in network psychometrics, the articulation of psychometric theories in addition to psychometric models, and extensions of network modeling to novel data sources. Finally, network psychometrics is part of a larger movement in psychology that revolves around the analysis of human beings as complex systems, and it is timely that psychometricians start extending their rich modeling tradition to improve and extend the analysis of systems in psychology.
Affiliation(s)
- Denny Borsboom
- Department of Psychology, University of Amsterdam, Nieuwe Achtergracht 129-B, 1018 WT, Amsterdam, The Netherlands
|
13
|
Jagadev P, Naik S, Giri LI. Contactless monitoring of human respiration using infrared thermography and deep learning. Physiol Meas 2022; 43. [PMID: 35193123 DOI: 10.1088/1361-6579/ac57a8] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/15/2021] [Accepted: 02/22/2022] [Indexed: 11/11/2022]
Abstract
OBJECTIVE To monitor the human respiration rate (RR) using infrared thermography (IRT) and artificial intelligence, in a completely non-invasive and automated manner. APPROACH The human breathing signals (BS) were obtained using IRT. The RR was monitored under extreme conditions, by developing a deep learning (DL) based "Residual network 50+Facial landmark detection" (ResNet 50+FLD) model. This model was built and evaluated on 10,000 thermograms and is the first work that documents the use of a DL classifier on a large thermal dataset for nostril tracking. Further, the acquired BS were filtered using the Moving average filter (MAF) and the Butterworth filter (BF). The novel "Breathing signal characterization algorithm (BSCA)" was proposed to obtain the RR in an automated manner. This algorithm is the first work that identifies the breaths in the thermal BS as regular, prolonged, or rapid, using machine learning (ML). The "Exploratory data analysis" was performed to choose an appropriate ML algorithm for the BSCA. The performance of the "BSCA" was evaluated for both "Decision tree (DT)" and "Support vector machine (SVM)" models. MAIN RESULTS The "ResNet 50+FLD model" had Validation and Testing accuracy of 99.5% and 99.4%, respectively. The Precision, Sensitivity, Specificity, F-measure, and G-mean values were computed as well. The comparative analysis of the filters revealed that the BF performed better than the MAF. The "BSCA" performed better with the SVM classifier than with the DT classifier, with Validation accuracy and Testing accuracy of 99.5% and 98.83%, respectively. SIGNIFICANCE The ever-increasing number of critical cases and the limited availability of skilled medical attendants argue in favor of an automated and harmless health monitoring system. The proposed methodology eliminates the risk of infections that spread through contact. It can be used in darkness, and in remote areas as well, where there is a lack of medical attendants.
Affiliation(s)
- Preeti Jagadev
- ECE Department, National Institute of Technology Goa, Farmagudi, Ponda, Goa, 403401, India
- Shubham Naik
- Emerson Innovation center, Plot No. 23, Hinjawadi Phase 2 Rd, Phase 2, Hinjewadi Rajiv Gandhi Infotech Park, Maan, Pune, Maharashtra, 411057, India
- Lalat Indu Giri
- ECE Department, National Institute of Technology Goa, Farmagudi, Ponda, Goa, 403401, India
|
14
|
Dodge S, Toka M, Bae CJ. DynamoVis 1.0: an exploratory data visualization software for mapping movement in relation to internal and external factors. Mov Ecol 2021; 9:55. [PMID: 34736518 PMCID: PMC8567714 DOI: 10.1186/s40462-021-00291-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2021] [Accepted: 10/21/2021] [Indexed: 05/08/2023]
Abstract
BACKGROUND This paper introduces DynamoVis version 1.0, an open-source software developed to design, record and export custom animations and multivariate visualizations from movement data, enabling visual exploration and communication of patterns capturing the associations between animals' movement and its affecting internal and external factors. Proper representation of these dependencies grounded on cartographic principles and intuitive visual forms can facilitate scientific discovery, decision-making, collaborations, and foster understanding of movement. RESULTS DynamoVis offers a visualization platform that is accessible and easily usable for scientists and general public without a need for prior experience with data visualization or programming. The intuitive design focuses on a simple interface to apply cartographic techniques, giving ecologists of all backgrounds the power to visualize and communicate complex movement patterns. CONCLUSIONS DynamoVis 1.0 offers a flexible platform to quickly and easily visualize and animate animal tracks to uncover hidden patterns captured in the data, and explore the effects of internal and external factors on their movement path choices and motion capacities. Hence, DynamoVis can be used as a powerful communicative and hypothesis generation tool for scientific discovery and decision-making through visual reasoning. The visual products can be used as a research and pedagogical tool in movement ecology.
Affiliation(s)
- Somayeh Dodge
- Department of Geography, University of California, Santa Barbara, Santa Barbara, CA USA
- Mert Toka
- Media Arts and Technology Program, University of California, Santa Barbara, Santa Barbara, CA USA
- Crystal J. Bae
- Department of Geography, University of California, Santa Barbara, Santa Barbara, CA USA
|
15
|
Cain CN, Sudol PE, Berrier KL, Synovec RE. Development of variance rank initiated-unsupervised sample indexing for gas chromatography-mass spectrometry analysis. Talanta 2021; 233:122495. [PMID: 34215113 DOI: 10.1016/j.talanta.2021.122495] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/02/2021] [Revised: 04/29/2021] [Accepted: 04/30/2021] [Indexed: 02/08/2023]
Abstract
Traditional non-targeted chemometric workflows for gas chromatography-mass spectrometry (GC-MS) data rely on using supervised methods, which requires a priori knowledge of sample class membership. Herein, we propose a simple, unsupervised chemometric workflow known as variance rank initiated-unsupervised sample indexing (VRI-USI). VRI-USI discovers analyte peaks exhibiting high relative variance across all samples, followed by k-means clustering on the individual peaks. Based upon how the samples cluster for a given peak, a sample index assignment is provided. Using a probabilistic argument, if the same sample index assignment appears for several discovered peaks, then this outcome strongly suggests that the samples are properly classified by that particular sample index assignment. Thus, relevant chemical differences between the samples have been discovered in an unsupervised fashion. The VRI-USI workflow is demonstrated on three, increasingly difficult datasets: simulations, yeast metabolomics, and human cancer metabolomics. For simulated GC-MS datasets, VRI-USI discovered 85-90% of analytes modeled to vary between sample classes. Nineteen out of 53 peaks in the peak table developed for the yeast metabolome dataset had the same sample index assignments, indicating that those indices are most likely due to class-distinguishing chemical differences. A t-test revealed that 22 out of 53 peaks were statistically significant (p < 0.05) when using those sample index assignments. Likewise, for the human cancer metabolomics study, VRI-USI discovered 25 analytes that were statistically different (p < 0.05) using the sample index assignments determined to highlight meaningful sample-based differences. For all datasets, the sample index assignments that were deduced from VRI-USI were the correct class-based difference when using prior knowledge. 
VRI-USI holds promise as an exploratory data analysis workflow for studies in which analysts do not readily have a priori class information or want to uncover the underlying nature of their dataset.
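The clustering and indexing step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the variance-ranking stage is omitted, the two-cluster 1-D k-means is a simplification, and the helper names `kmeans_1d` and `sample_index` as well as the toy peak table are assumptions made purely for illustration.

```python
from collections import Counter

def kmeans_1d(values, iters=50):
    """Two-cluster 1-D k-means; returns a 0/1 label per value (hypothetical helper)."""
    lo, hi = min(values), max(values)  # initialise the two centroids at the extremes
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        group0 = [v for v, l in zip(values, labels) if l == 0]
        group1 = [v for v, l in zip(values, labels) if l == 1]
        new_lo = sum(group0) / len(group0) if group0 else lo
        new_hi = sum(group1) / len(group1) if group1 else hi
        if (new_lo, new_hi) == (lo, hi):  # centroids stopped moving
            break
        lo, hi = new_lo, new_hi
    return labels

def sample_index(values):
    """Canonical index assignment: the first sample is always labelled 0."""
    labels = kmeans_1d(values)
    if labels[0] == 1:
        labels = [1 - l for l in labels]
    return tuple(labels)

# Toy peak table (made-up numbers): one entry per peak, six samples each.
peaks = {
    "peak_a": [1.0, 1.2, 0.9, 5.1, 4.8, 5.3],
    "peak_b": [10.0, 11.0, 9.5, 2.0, 2.2, 1.9],
    "peak_c": [3.0, 7.0, 3.1, 7.2, 2.9, 7.1],
}
assignments = {name: sample_index(vals) for name, vals in peaks.items()}

# The assignment that recurs across several peaks is the candidate class structure.
most_common, count = Counter(assignments.values()).most_common(1)[0]
```

In this toy table two of the three peaks agree on the same split of the samples, which mirrors the abstract's argument that a recurring sample index assignment is unlikely to arise by chance.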
Affiliation(s)
- Caitlin N Cain
- Department of Chemistry, Box 351700, University of Washington, Seattle, WA, 98195, USA
- Paige E Sudol
- Department of Chemistry, Box 351700, University of Washington, Seattle, WA, 98195, USA
- Kelsey L Berrier
- Department of Chemistry, Box 351700, University of Washington, Seattle, WA, 98195, USA
- Robert E Synovec
- Department of Chemistry, Box 351700, University of Washington, Seattle, WA, 98195, USA.
|
16
|
Rosa LK, Costa FS, Hauagge CM, Mobile RZ, de Lima AAS, Amaral CDB, Machado RC, Nogueira ARA, Brancher JA, de Araujo MR. Oral health, organic and inorganic saliva composition of men with Schizophrenia: Case-control study. J Trace Elem Med Biol 2021; 66:126743. [PMID: 33740480 DOI: 10.1016/j.jtemb.2021.126743] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/19/2020] [Revised: 03/03/2021] [Accepted: 03/09/2021] [Indexed: 11/30/2022]
Abstract
BACKGROUND Schizophrenia (SCZ) presents complex challenges related to diagnosis and clinical monitoring. The study of conditions associated with SCZ can be facilitated by using potential markers and patterns that provide information to support the diagnosis and oral health. METHODS The salivary composition of patients diagnosed with SCZ (n = 50) was evaluated and compared to the control (n = 50). Saliva samples from male patients were collected and clinical parameters were evaluated. The concentrations of total proteins and amylase were determined and salivary macro- and microelements were quantified by ICP OES and ICP-MS. Exploratory data analysis based on artificial intelligence tools was used in the investigation. RESULTS There was a significant increase in the salivary concentrations of Al, Fe, Li, Mg, Na, and V, higher prevalence of caries (p < 0.001), periodontal disease (p < 0.001), and reduced salivary flow rate (p = 0.019) in SCZ patients. Also, samples were grouped into six clusters. As, Co, Cr, Cu, Mn, Mo, Ni, Se, and Sr were correlated with each other, while Fe, K, Li, Ti, and V showed the highest concentrations in the samples distributed in the clusters with the highest association between SCZ patients and controls. CONCLUSIONS The results obtained indicate changes in salivary flow, organic composition, and levels of macro- and microelements in SCZ patients. Salivary concentrations of Fe, Mg, and Na may be related to oral conditions, higher prevalence of caries, and periodontal disease. The exploratory analysis showed different patterns in the salivary composition of SCZ patients impacted by associations between oral health conditions and the use of medications. Future studies are encouraged to confirm the results investigated in this study.
Affiliation(s)
- Letícia Kreutz Rosa
- Federal University of Paraná, Department of Stomatology, Curitiba, PR, 80210-170, Brazil
- Cecília Moraes Hauagge
- Federal University of Paraná, Department of Stomatology, Curitiba, PR, 80210-170, Brazil
- Rafael Zancan Mobile
- Federal University of Paraná, Department of Stomatology, Curitiba, PR, 80210-170, Brazil
- Clarice D B Amaral
- Federal University of Paraná, Department of Chemistry, Curitiba, PR, 81531-980, Brazil
- Raquel C Machado
- Federal University of São Carlos, Department of Chemistry, São Carlos, SP, 13565-905, Brazil
- João Armando Brancher
- Pontifícia Universidade Católica do Paraná, Escola de Ciências da Vida, Curitiba, PR, 80215-901, Brazil
|
17
|
Wentzell PD, Gonçalves TR, Matsushita M, Valderrama P. Combinatorial projection pursuit analysis for exploring multivariate chemical data. Anal Chim Acta 2021; 1174:338716. [PMID: 34247741 DOI: 10.1016/j.aca.2021.338716] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/01/2021] [Revised: 05/26/2021] [Accepted: 05/30/2021] [Indexed: 11/19/2022]
Abstract
Kurtosis-based projection pursuit analysis (kPPA) has demonstrated the ability to visualize multivariate data in a way that complements other exploratory data analysis tools, such as principal components analysis (PCA). It is especially useful for partitioning binary data sets (2^k classes) with a balanced design. Since kPPA is not a variance-based method, it can often provide unsupervised class separation where other methods fail. However, when multiple classifications are possible (e.g. by gender, age, disease state, etc.), the projection provided by kPPA (corresponding to the global minimum kurtosis) will not necessarily be the one of greatest interest to the researcher. Fortunately, the optimization algorithm for kPPA allows for interrogation of projections obtained from numerous local minima. This strategy provides the basis of a new method described here, referred to as combinatorial projection pursuit analysis (CombPPA) because it presents alternative combinations of class separation. The method is truly exploratory in that it allows the landscape of interesting projections to be more fully probed. The approach uses Procrustes rotation to map local minima among the kPPA solutions, whereupon the researcher can visualize different projections. To demonstrate the new method, the clustering of grape juice samples using visible spectroscopy is presented as a model problem. This problem is well-suited to this type of study because there are eight classes of samples symmetrically partitioned into two classes by type (organic/non-organic) or four classes by brand. Results presented show the different combinations of projections that can be obtained, including the desired partitions. In addition, this work describes new enhancements to the kPPA algorithm that improve the orthogonality of solutions obtained.
Affiliation(s)
- Peter D Wentzell
- Trace Analysis Research Centre, Department of Chemistry, Dalhousie University, PO Box 15000, Halifax, NS B3H 4R2, Canada.
- Thays R Gonçalves
- Universidade Estadual de Maringá (UEM), Av. Colombo, 5790, 87020-900, Maringá, PR, Brazil
- Makoto Matsushita
- Universidade Estadual de Maringá (UEM), Av. Colombo, 5790, 87020-900, Maringá, PR, Brazil
- Patrícia Valderrama
- Universidade Tecnológica Federal do Paraná (UTFPR), Via Rosalina Maria dos Santos 1233, 87301-899, Campo Mourão, PR, Brazil
|
18
|
Fife DA, Longo G, Correll M, Tremoulet PD. A graph for every analysis: Mapping visuals onto common analyses using flexplot. Behav Res Methods 2021; 53:1876-94. [PMID: 33634423 DOI: 10.3758/s13428-020-01520-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Accepted: 12/05/2020] [Indexed: 11/08/2022]
Abstract
For decades, statisticians and methodologists have insisted that researchers use graphical analysis much more heavily. Despite cogent and passionate recommendations, there has been no graphical revolution. Instead, researchers rely heavily on misleading graphics that violate visual processing heuristics. Perhaps the main reason for the persistence of deceptive graphics is software; most software familiar to psychological researchers suffers from poor defaults and limited capabilities. Also, visualization is ancillary to statistical analysis, providing an incentive to not produce graphics at all. In this paper, we argue that every statistical analysis must have an accompanying graphic, and we introduce the point-and-click software Flexplot, available both in JASP and Jamovi. We then present the theoretical framework that guides Flexplot, as well as show how to perform the most common statistical analyses in psychological literature.
|
19
|
Ratnovsky A, Rozenes S, Bloch E, Halpern P. Statistical learning methodologies and admission prediction in an emergency department. Australas Emerg Care 2021; 24:241-247. [PMID: 33461906 DOI: 10.1016/j.auec.2020.11.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/29/2020] [Revised: 10/07/2020] [Accepted: 11/25/2020] [Indexed: 11/30/2022]
Abstract
BACKGROUND The quality of an emergency department (ED) is highly dependent on its ability to supply efficient, as well as high-quality, treatment for all patients. Key performance indicators are important when measuring the performance of an emergency department. This study aimed to perform an exploratory data analysis and to develop an admission prediction model based on a dataset that was constructed from key performance indicators selected by a panel of expert physicians, nurses and hospital administrators. METHODS A dataset of 172,695 records was retrospectively collected from an Emergency Department. The relationships within the dataset were analyzed and three machine learning algorithms were compared for an admission predictive model based on the initial patient information. RESULTS The dataset showed that mean length of stay was similar across weekdays, that there was a positive linear relationship between the length of stay and patient age, and that the admission predictive model yielded an AUC of 0.79. CONCLUSIONS The selected indicators can be used to study whether an emergency department allocates its resources properly to cope with overcrowding, and the predictive model may be employed by hospital and ED administrators to fill information gaps and support decision making for the improvement of the key performance indicators.
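The AUC reported above summarizes how well a model ranks admitted patients above discharged ones. A minimal sketch of that metric follows, assuming the pairwise-ranking definition of AUC; the function name `roc_auc` and the example labels and scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = admitted, 0 = discharged; scores are predicted probabilities.
admitted = [1, 0, 1, 0, 1, 0]
predicted = [0.9, 0.4, 0.35, 0.2, 0.8, 0.6]
auc = roc_auc(admitted, predicted)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a value such as 0.79 indicates useful but imperfect discrimination. The quadratic pairwise loop is chosen for clarity and would be replaced by a rank-based computation on a dataset of the size described.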
Affiliation(s)
- Anat Ratnovsky
- School of Medical Engineering, Afeka, Tel Aviv Academic College of Engineering, Israel.
- Shai Rozenes
- School of Industrial Engineering, Engineering and Management Programme, Afeka, Tel Aviv Academic College of Engineering, Israel
- Eli Bloch
- School of Industrial Engineering, Engineering and Management Programme, Afeka, Tel Aviv Academic College of Engineering, Israel
- Pinchas Halpern
- Department of Emergency Medicine, Tel Aviv Sourasky Medical Center, and Tel Aviv University Sackler Faculty of Medicine, Israel
|
20
|
Abstract
Normalization is an important step in the analysis of single-cell RNA-seq data. While no single method outperforms all others in all datasets, the choice of normalization can have profound impact on the results. Data-driven metrics can be used to rank normalization methods and select the best performers. Here, we show how to use R/Bioconductor to calculate normalization factors, apply them to compute normalized data, and compare several normalization approaches. Finally, we briefly show how to perform downstream analysis steps on the normalized data.
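To make the idea of normalization factors concrete, here is a minimal sketch of library-size scaling, one of the simplest approaches. It is written in Python rather than the R/Bioconductor tooling the chapter uses, and the function names `size_factors` and `normalize` and the toy count matrix are assumptions for illustration only.

```python
def size_factors(counts):
    """Library-size factors: each cell's total count divided by the mean total.

    Assumes every cell has a nonzero total count.
    """
    totals = [sum(cell) for cell in counts]
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]

def normalize(counts):
    """Divide every count by its cell's size factor."""
    factors = size_factors(counts)
    return [[c / f for c in cell] for cell, f in zip(counts, factors)]

# Toy matrix (made-up): each inner list is one cell, each position one gene.
counts = [
    [10, 0, 30],   # total 40
    [20, 0, 60],   # total 80
    [5, 5, 50],    # total 60
]
factors = size_factors(counts)
normalized = normalize(counts)
```

After scaling, every cell has the same total, so the first two cells, which differ only in sequencing depth, end up with identical profiles. More elaborate methods estimate the factors differently but apply them in the same way.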
Affiliation(s)
- Davide Risso
- Department of Statistical Sciences, University of Padova, Padova, Italy.
|
21
|
Tseng YJ, Chiu HJ, Chen CJ. dxpr: an R package for generating analysis-ready data from electronic health records-diagnoses and procedures. PeerJ Comput Sci 2021; 7:e520. [PMID: 34141876 PMCID: PMC8176530 DOI: 10.7717/peerj-cs.520] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/16/2020] [Accepted: 04/09/2021] [Indexed: 05/08/2023]
Abstract
BACKGROUND Enriched electronic health records (EHRs) contain crucial information related to disease progression, and this information can help with decision-making in the health care field. Data analytics in health care is deemed as one of the essential processes that help accelerate the progress of clinical research. However, processing and analyzing EHR data are common bottlenecks in health care data analytics. METHODS The dxpr R package provides mechanisms for integration, wrangling, and visualization of clinical data, including diagnosis and procedure records. First, the dxpr package helps users transform International Classification of Diseases (ICD) codes to a uniform format. After code format transformation, the dxpr package supports four strategies for grouping clinical diagnostic data. For clinical procedure data, two grouping methods can be chosen. After EHRs are integrated, users can employ a set of flexible built-in querying functions for dividing data into case and control groups by using specified criteria and splitting the data into before and after an event based on the record date. Subsequently, the structure of integrated long data can be converted into wide, analysis-ready data that are suitable for statistical analysis and visualization. RESULTS We conducted comorbidity data processes based on a cohort of newborns from Medical Information Mart for Intensive Care-III (n = 7,833) by using the dxpr package. We first defined patent ductus arteriosus (PDA) cases as patients who had at least one PDA diagnosis (ICD, Ninth Revision, Clinical Modification [ICD-9-CM] 7470*). Controls were defined as patients who never had PDA diagnosis. In total, 381 and 7,452 patients with and without PDA, respectively, were included in our study population. Then, we grouped the diagnoses into defined comorbidities. 
Finally, we observed a statistically significant difference in 8 of the 16 comorbidities among patients with and without PDA, including fluid and electrolyte disorders, valvular disease, and others. CONCLUSIONS This dxpr package helps clinical data analysts address the common bottleneck caused by clinical data characteristics such as heterogeneity and sparseness.
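The case and control definition used in this abstract (at least one diagnosis matching a code pattern versus none) can be sketched as follows. This is not the dxpr API: the function `split_case_control`, the record layout, and the patient identifiers are hypothetical, and the codes other than the 7470 prefix are included only to fill out the example.

```python
def split_case_control(records, prefix):
    """Cases have at least one diagnosis code starting with `prefix`; all others are controls."""
    patients = {pid for pid, _ in records}
    cases = {pid for pid, code in records if code.startswith(prefix)}
    return cases, patients - cases

# Made-up long-format records: (patient id, ICD-9-CM code without the decimal point).
records = [
    ("p1", "7470"), ("p1", "7742"),
    ("p2", "7742"),
    ("p3", "V3000"), ("p3", "7470"),
    ("p4", "769"),
]
cases, controls = split_case_control(records, "7470")
```

Working on long-format rows, one per diagnosis, keeps the rule simple: a patient becomes a case as soon as any of their rows matches.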
Affiliation(s)
- Yi-Ju Tseng
- Department of Information Management, National Central University, Taoyuan, Taiwan
- Department of Laboratory Medicine, Chang Gung Memorial Hospital at Linkou, Taoyuan, Taiwan
- Hsiang-Ju Chiu
- Department of Information Management, Chang Gung University, Taoyuan, Taiwan
- Chun Ju Chen
- Department of Information Management, Chang Gung University, Taoyuan, Taiwan
- Department of Information Management, National Taiwan University, Taipei, Taiwan
|
22
|
Adeniyi MO, Ekum MI, C I, S OA, A AJ, Oke SI, B MM. Dynamic model of COVID-19 disease with exploratory data analysis. Sci Afr 2020; 9:e00477. [PMID: 33521409 DOI: 10.1016/j.sciaf.2020.e00477] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/19/2020] [Revised: 06/17/2020] [Accepted: 07/09/2020] [Indexed: 11/22/2022] Open
Abstract
Novel Coronavirus is a highly infectious disease, with over one million confirmed cases and thousands of deaths recorded. The disease has become a pandemic, affecting almost all nations of the world, and has caused enormous economic, social and psychological burden on countries. Hygiene and educational campaign programmes have been identified as potent public health interventions that can curtail the spread of the highly infectious disease. In order to verify this claim quantitatively, we propose and analyze a non-linear mathematical model to investigate the effect of healthy sanitation and awareness on the transmission dynamics of Coronavirus disease (COVID-19) prevalence. Rigorous stability analysis of the model equilibrium points was performed to ascertain the basic reproduction number R0, a threshold that determines whether or not a disease dies out of the population. Our model assumes that education on the disease transmission and prevention induces behavioral changes in individuals to imbibe good hygiene, thereby reducing the basic reproduction number and disease burden. Numerical simulations are carried out using real life data to support the analytic results.
|
23
|
Yousif A, Drou N, Rowe J, Khalfan M, Gunsalus KC. NASQAR: a web-based platform for high-throughput sequencing data analysis and visualization. BMC Bioinformatics 2020; 21:267. [PMID: 32600310 DOI: 10.1186/s12859-020-03577-4] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 09/08/2019] [Accepted: 06/01/2020] [Indexed: 01/23/2023] Open
Abstract
Background As high-throughput sequencing applications continue to evolve, the rapid growth in quantity and variety of sequence-based data calls for the development of new software libraries and tools for data analysis and visualization. Often, effective use of these tools requires computational skills beyond those of many researchers. To ease this computational barrier, we have created a dynamic web-based platform, NASQAR (Nucleic Acid SeQuence Analysis Resource). Results NASQAR offers a collection of custom and publicly available open-source web applications that make extensive use of a variety of R packages to provide interactive data analysis and visualization. The platform is publicly accessible at http://nasqar.abudhabi.nyu.edu/. Open-source code is on GitHub at https://github.com/nasqar/NASQAR, and the system is also available as a Docker image at https://hub.docker.com/r/aymanm/nasqarall. NASQAR is a collaboration between the core bioinformatics teams of the NYU Abu Dhabi and NYU New York Centers for Genomics and Systems Biology. Conclusions NASQAR empowers non-programming experts with a versatile and intuitive toolbox to easily and efficiently explore, analyze, and visualize their Transcriptomics data interactively. Popular tools for a variety of applications are currently available, including Transcriptome Data Preprocessing, RNA-seq Analysis (including Single-cell RNA-seq), Metagenomics, and Gene Enrichment.
|
24
|
Giguere DJ, Macklaim JM, Lieng BY, Gloor GB. omicplotR: visualizing omic datasets as compositions. BMC Bioinformatics 2019; 20:580. [PMID: 31729955 PMCID: PMC6858670 DOI: 10.1186/s12859-019-3174-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2019] [Accepted: 10/24/2019] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND Differential abundance analysis is widely used with high-throughput sequencing data to compare gene abundance or expression between groups of samples. Many software packages exist for this purpose, but each uses a unique set of statistical assumptions to solve problems on a case-by-case basis. These software packages are typically difficult to use for researchers without command-line skills, and software that does offer a graphical user interface does not use a compositionally valid method. RESULTS omicplotR facilitates visual exploration of omic datasets for researchers with and without prior scripting knowledge. Reproducible visualizations include principal component analysis, hierarchical clustering, MA plots and effect plots. We demonstrate the functionality of omicplotR using a publicly available metatranscriptome dataset. CONCLUSIONS omicplotR provides a graphical user interface to explore sequence count data using generalizable compositional methods, facilitating visualization for investigators without command-line experience.
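A common way to treat count data as compositions is the centered log-ratio (CLR) transform. The abstract does not say which transform omicplotR applies, so the sketch below is an assumption about the general technique rather than a description of the package; the function name `clr`, the pseudocount, and the read counts are illustrative.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each part minus the mean log across all parts."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]

sample = [120, 30, 0, 850]  # made-up read counts for four features
transformed = clr(sample)

# Same composition at 10x sequencing depth (pseudocount scaled to match).
scaled = clr([c * 10 for c in sample], pseudocount=5.0)
```

The transformed values depend only on the ratios between features, so a sample sequenced ten times deeper yields the same result. That depth invariance is what makes the approach compositionally valid.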
Affiliation(s)
- Daniel J Giguere
- Department of Biochemistry, Schulich School of Medicine and Dentistry, Western University, London, N6A5C1, Canada.
- Jean M Macklaim
- Department of Biochemistry, Schulich School of Medicine and Dentistry, Western University, London, N6A5C1, Canada
- Brandon Y Lieng
- Department of Biochemistry, Schulich School of Medicine and Dentistry, Western University, London, N6A5C1, Canada
- Gregory B Gloor
- Department of Biochemistry, Schulich School of Medicine and Dentistry, Western University, London, N6A5C1, Canada
|
25
|
Álvarez Sánchez R, Beristain Iraola A, Epelde Unanue G, Carlin P. TAQIH, a tool for tabular data quality assessment and improvement in the context of health data. Comput Methods Programs Biomed 2019; 181:104824. [PMID: 30638900 DOI: 10.1016/j.cmpb.2018.12.029] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/27/2018] [Revised: 09/14/2018] [Accepted: 12/28/2018] [Indexed: 06/09/2023]
Abstract
BACKGROUND AND OBJECTIVES Data curation is a tedious task but of paramount relevance for data analytics, especially in the health context where data-driven decisions must be extremely accurate. The ambition of TAQIH is to support non-technical users in 1) the exploratory data analysis (EDA) process of tabular health data, and 2) the assessment and improvement of its quality. METHODS A web-based tool has been implemented with a simple yet powerful visual interface. First, it provides interfaces to understand the dataset: its content, structure and distribution. Then, it provides data visualization and improvement utilities for the data quality dimensions of completeness, accuracy, redundancy and readability. RESULTS It has been applied in two different scenarios. (1) The Northern Ireland General Practitioners (GPs) Prescription Data, an open data set containing drug prescriptions. (2) A glucose monitoring telehealth system dataset. Findings on (1) include: features that had a significant amount of missing values (e.g. AMP_NM variable 53.39%); instances that have a high percentage of variable values missing (e.g. 0.21% of the instances with > 75% of missing values); highly correlated variables (e.g. Gross and Actual cost almost completely correlated (∼ + 1.0)). Findings on (2) include: features that had a significant amount of missing values (e.g. patient height, weight and body mass index (BMI) (> 70%), date of diagnosis (13%)); highly correlated variables (e.g. height, weight and BMI). Full detail of the testing and insights related to findings are reported. CONCLUSIONS TAQIH enables and supports users to carry out EDA on tabular health data and to assess and improve its quality. Having the layout of the application menu arranged sequentially as the conventional EDA pipeline helps following a consistent analysis process. The general description of the dataset and features section is very useful for the first overview of the dataset. 
The missing value heatmap is also very helpful in visually identifying correlations among missing values. The correlations section has proved to be supportive as a preliminary step before further data analysis pipelines, as well as the outliers section. Finally, the data quality section provides a quantitative value to the dataset improvements.
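The completeness and redundancy checks summarized in this abstract can be illustrated with a minimal sketch. It is not TAQIH code: the function names, the `None`-as-missing convention and the toy records are assumptions made for illustration only.

```python
from math import sqrt

def missing_pct_by_column(rows, columns):
    """Percentage of missing (None) values for each column."""
    n = len(rows)
    return {c: 100.0 * sum(r.get(c) is None for r in rows) / n for c in columns}

def rows_mostly_missing(rows, columns, threshold=0.75):
    """Indices of rows whose fraction of missing values exceeds the threshold."""
    return [i for i, r in enumerate(rows)
            if sum(r.get(c) is None for c in columns) / len(columns) > threshold]

def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy)

# Toy records invented for illustration; they are not taken from either dataset.
rows = [
    {"gross_cost": 10.0, "actual_cost": 9.3, "amp_nm": None},
    {"gross_cost": 20.0, "actual_cost": 18.6, "amp_nm": "x"},
    {"gross_cost": 30.0, "actual_cost": 27.9, "amp_nm": None},
    {"gross_cost": None, "actual_cost": None, "amp_nm": None},
]
columns = ["gross_cost", "actual_cost", "amp_nm"]

missing = missing_pct_by_column(rows, columns)    # completeness per feature
sparse_rows = rows_mostly_missing(rows, columns)  # instances that are mostly empty
r = pearson([row["gross_cost"] for row in rows],
            [row["actual_cost"] for row in rows])  # redundancy between two features
```

In this toy data the two cost columns are exact multiples of each other, so their correlation is essentially +1, which mirrors the kind of redundancy reported for Gross and Actual cost.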
Affiliation(s)
- Roberto Álvarez Sánchez
- Vicomtech, Paseo Mikeletegi 57 Parque Científico y Tecnológico de Gipuzkoa, Donostia/San Sebastián 20009, Gipuzkoa, Spain; IIS Biodonostia, Paseo Doctor Beguiristain s/n, Donostia/San Sebastián, 20014, Gipuzkoa, Spain.
- Andoni Beristain Iraola
- Vicomtech, Paseo Mikeletegi 57 Parque Científico y Tecnológico de Gipuzkoa, Donostia/San Sebastián 20009, Gipuzkoa, Spain; IIS Biodonostia, Paseo Doctor Beguiristain s/n, Donostia/San Sebastián, 20014, Gipuzkoa, Spain
- Gorka Epelde Unanue
- Vicomtech, Paseo Mikeletegi 57 Parque Científico y Tecnológico de Gipuzkoa, Donostia/San Sebastián 20009, Gipuzkoa, Spain; IIS Biodonostia, Paseo Doctor Beguiristain s/n, Donostia/San Sebastián, 20014, Gipuzkoa, Spain
- Paul Carlin
- South Eastern Health and Social Care Trust, Upper Newtownards Road, Belfast, BT16 1RH, United Kingdom
26
Kluxen FM, Grégoire S, Schepky A, Hewitt NJ, Klaric M, Domoradzki JY, Felkers E, Fernandes J, Fisher P, McEuen SF, Parr-Dobrzanski R, Wiemann C. Dermal absorption study OECD TG 428 mass balance recommendations based on the EFSA database. Regul Toxicol Pharmacol 2019; 108:104475. [PMID: 31539567 DOI: 10.1016/j.yrtph.2019.104475] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 06/18/2019] [Revised: 08/21/2019] [Accepted: 09/13/2019] [Indexed: 11/24/2022]
Abstract
The European Food Safety Authority (EFSA) guidance (EFSA, 2017) for dermal absorption (DA) studies recommends stringent mass balance (MB) limits of 95-105%. EFSA suggested that test material can be lost after penetration and requires that, for chemicals with <5% absorption, the non-recovered material must be added to the absorbed dose if MB is <95%. This has huge consequences for low-absorption pesticides. Indeed, one third of the MBs in the EFSA DA database are outside the refined criteria. This is also true for DA data generated by Cosmetics Europe (Gregoire et al., 2019), indicating that this criterion is often not achieved even when using highly standardized protocols. While EFSA hypothesizes that modern analytical and pipetting techniques would make it possible to meet this criterion, no scientific basis was provided. We describe how protocol procedures impact MB and evaluate the EFSA DA database to demonstrate that MB is subject to random variation. Generic application of "the addition rule" skews the measured data and increases the DA estimate, which results in unnecessary risk assessment failure. In conclusion, "missing material" is just a random negative deviation from the nominal dose. We propose a data-driven MB criterion of 90-110%, fully in line with OECD recommendations.
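The effect of "the addition rule" described in this abstract can be worked through with a small sketch. It assumes that the non-recovered material equals 100% minus the measured mass balance, and the function names are hypothetical; the thresholds are the ones quoted in the abstract.

```python
def apply_addition_rule(absorbed_pct, mass_balance_pct):
    """Absorption estimate after the addition rule as summarized in the abstract.

    Assumption: non-recovered material is taken as 100 - mass balance (percent).
    """
    if absorbed_pct < 5.0 and mass_balance_pct < 95.0:
        return absorbed_pct + (100.0 - mass_balance_pct)
    return absorbed_pct

def within_limits(mass_balance_pct, low, high):
    """True if the mass balance lies inside the given acceptance window."""
    return low <= mass_balance_pct <= high

# A low-absorption example: 1% measured absorption with a 92% mass balance.
measured = 1.0
adjusted = apply_addition_rule(measured, 92.0)   # 1.0 + 8.0 = 9.0
passes_efsa = within_limits(92.0, 95.0, 105.0)   # 95-105% window
passes_proposed = within_limits(92.0, 90.0, 110.0)  # 90-110% window proposed above
```

With these illustrative numbers the estimate rises from 1% to 9%, while the same mass balance would be acceptable under the proposed 90-110% criterion.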
Affiliation(s)
- Felix M Kluxen
- ADAMA Deutschland GmbH, Edmund-Rumpler-Str. 6, 51149, Cologne, Germany.
- Sébastien Grégoire
- L'Oreal Research & Innovation, 1 Avenue Eugène Schueller, 93600, Aulnay-Sous-Bois, France.
- Nicky J Hewitt
- Cosmetics Europe, Avenue Herrmann-Debroux 40, 1160, Brussels, Belgium.
- Martina Klaric
- Cosmetics Europe, Avenue Herrmann-Debroux 40, 1160, Brussels, Belgium.
- Edgars Felkers
- ADAMA Deutschland GmbH, Edmund-Rumpler-Str. 6, 51149, Cologne, Germany.
- Joshua Fernandes
- Syngenta Ltd., Jealotts Hill Research Station, Warfield, Bracknell, RG42 6EY, UK.
- Philip Fisher
- Bayer SAS, Crop Science Division, 16 Rue Jean-Marie Leclair, 69266, Lyon, France.
- Steven F McEuen
- FMC Corporation, Stine Research Center, S300/427, P.O. Box 30, Newark, DE, 19714-0030, USA.
27
Orjuela S, Huang R, Hembach KM, Robinson MD, Soneson C. ARMOR: An Automated Reproducible MOdular Workflow for Preprocessing and Differential Analysis of RNA-seq Data. G3 (Bethesda) 2019; 9:2089-2096. [PMID: 31088905 PMCID: PMC6643886 DOI: 10.1534/g3.119.400185] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 03/12/2019] [Accepted: 05/13/2019] [Indexed: 01/08/2023]
Abstract
The extensive generation of RNA sequencing (RNA-seq) data in the last decade has resulted in a myriad of specialized software for its analysis. Each software module typically targets a specific step within the analysis pipeline, making it necessary to join several of them to get a single cohesive workflow. Multiple software programs automating this procedure have been proposed, but often lack modularity, transparency or flexibility. We present ARMOR, which performs an end-to-end RNA-seq data analysis, from raw read files, via quality checks, alignment and quantification, to differential expression testing, geneset analysis and browser-based exploration of the data. ARMOR is implemented using the Snakemake workflow management system and leverages conda environments; Bioconductor objects are generated to facilitate downstream analysis, ensuring seamless integration with many R packages. The workflow is easily implemented by cloning the GitHub repository, replacing the supplied input and reference files and editing a configuration file. Although we have selected the tools currently included in ARMOR, the setup is modular and alternative tools can be easily integrated.
Affiliation(s)
- Stephany Orjuela
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- Institute of Molecular Cancer Research, University of Zurich, Zurich, Switzerland
- Ruizhu Huang
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- Katharina M Hembach
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
- Mark D Robinson
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- Charlotte Soneson
- SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
28
Birgen C, Dürre P, Preisig HA, Wentzel A. Butanol production from lignocellulosic biomass: revisiting fermentation performance indicators with exploratory data analysis. Biotechnol Biofuels 2019; 12:167. [PMID: 31297155 PMCID: PMC6598312 DOI: 10.1186/s13068-019-1508-6] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 04/04/2019] [Accepted: 06/19/2019] [Indexed: 05/09/2023]
Abstract
After just over 100 years of industrial acetone-butanol-ethanol (ABE) fermentation, patented by Weizmann in the UK in 1915, butanol is again considered a promising biofuel alternative, based on several advantages compared to the more established biofuels ethanol and methanol. Large-scale fermentative production of butanol, however, still suffers from high substrate cost and low product titers and selectivity. There have been great advances in the last decades to tackle these problems. However, an understanding of the fermentation process variables and their interconnectedness, with a holistic view of the current scientific state of the art, is lacking to a great extent. To illustrate the benefits of such a comprehensive approach, we have developed a dataset by collecting data from 175 fermentations of lignocellulosic biomass and mixed sugars to produce butanol that were reported during the past three decades of scientific literature, and performed an exploratory data analysis to map current trends and bottlenecks. This review presents the results of this exploratory data analysis as well as the main features of fermentative butanol production from lignocellulosic biomass, with a focus on performance indicators as a useful tool to guide further research and development in the field towards more profitable butanol manufacturing for biofuel applications in the future.
Affiliation(s)
- Cansu Birgen
- Department of Chemical Engineering, NTNU, 7491 Trondheim, Norway
- Peter Dürre
- Institute of Microbiology and Biotechnology, Ulm University, 89069 Ulm, Germany
- Heinz A. Preisig
- Department of Chemical Engineering, NTNU, 7491 Trondheim, Norway
29
Abstract
BACKGROUND Principal component analysis (PCA) is frequently used in genomics applications for quality assessment and exploratory analysis in high-dimensional data, such as RNA sequencing (RNA-seq) gene expression assays. Despite the availability of many software packages developed for this purpose, an interactive and comprehensive interface for performing these operations is lacking. RESULTS We developed the pcaExplorer software package to enhance commonly performed analysis steps with an interactive and user-friendly application, which provides state saving as well as the automated creation of reproducible reports. pcaExplorer is implemented in R using the Shiny framework and exploits data structures from the open-source Bioconductor project. Users can easily generate a wide variety of publication-ready graphs, while assessing the expression data in the different modules available, including a general overview, dimension reduction on samples and genes, as well as functional interpretation of the principal components. CONCLUSION pcaExplorer is distributed as an R package in the Bioconductor project ( http://bioconductor.org/packages/pcaExplorer/ ), and is designed to assist a broad range of researchers in the critical step of interactive data exploration.
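As a language-neutral illustration of the sample-level dimension reduction mentioned in this abstract, the following is a minimal sketch of PCA computed through a singular value decomposition. It is written in Python with NumPy and is not pcaExplorer code (which is an R/Shiny package); the function name `pca` and the synthetic expression matrix are assumptions.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA of a samples-by-features matrix via SVD of the column-centered data."""
    Xc = X - X.mean(axis=0)                      # center each feature (gene)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # sample coordinates
    explained = (s ** 2) / np.sum(s ** 2)            # variance fraction per component
    return scores, Vt[:n_components], explained[:n_components]

# Synthetic data: 6 samples x 50 genes, with the last three samples shifted.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))
X[3:] += 2.0

scores, loadings, explained = pca(X)
```

Plotting the two columns of `scores` against each other gives the familiar sample-level PCA plot used for quality assessment; the rows of `loadings` indicate which features drive each component.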
Affiliation(s)
- Federico Marini
- Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Obere Zahlbacher Str. 69, Mainz, 55131 Germany
- Center for Thrombosis and Hemostasis (CTH), University Medical Center of the Johannes Gutenberg University Mainz, Langenbeckstr. 1, Mainz, 55131 Germany
- Harald Binder
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Stefan-Meier-Str. 26, Freiburg, 79104 Germany
30
Sirén K, Fischer U, Vestner J. Automated supervised learning pipeline for non-targeted GC-MS data analysis. Anal Chim Acta X 2019; 1:100005. [PMID: 33117972 PMCID: PMC7587030 DOI: 10.1016/j.acax.2019.100005] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/15/2018] [Revised: 12/21/2018] [Accepted: 01/02/2019] [Indexed: 11/15/2022]
Abstract
Non-targeted analysis is nowadays applied in many different domains of analytical chemistry such as metabolomics, environmental and food analysis. Conventional processing strategies for GC-MS data include baseline correction, feature detection, and retention time alignment before multivariate modeling. These techniques can be prone to errors and therefore time-consuming manual corrections are generally necessary. We introduce here a novel fully automated approach to non-targeted GC-MS data processing. This new approach avoids feature extraction and retention time alignment. Supervised machine learning on decomposed tensors of segmented chromatographic raw data signal is used to rank regions in the chromatograms contributing to differentiation between sample classes. The performance of this novel data analysis approach is demonstrated on three published datasets.
Affiliation(s)
- Kimmo Sirén
- Institute for Viticulture and Oenology, DLR Rheinpfalz, Breitenweg 71, D-67435, Neustadt, Germany
- Department of Chemistry, University of Kaiserslautern, Erwin-Schroedinger-Strasse 52, D-67663, Kaiserslautern, Germany
- Ulrich Fischer
- Institute for Viticulture and Oenology, DLR Rheinpfalz, Breitenweg 71, D-67435, Neustadt, Germany
- Jochen Vestner
- Institute for Viticulture and Oenology, DLR Rheinpfalz, Breitenweg 71, D-67435, Neustadt, Germany
- Corresponding author.
31
Breault MS, Sacré P, González-Martínez J, Gale JT, Sarma SV. An exploratory data analysis method for identifying brain regions and frequencies of interest from large-scale neural recordings. J Comput Neurosci 2019; 46:3-17. [PMID: 30511274 DOI: 10.1007/s10827-018-0705-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/17/2018] [Revised: 08/28/2018] [Accepted: 10/23/2018] [Indexed: 10/27/2022]
Abstract
High-resolution whole brain recordings have the potential to uncover unknown functionality but also present the challenge of how to find such associations between brain and behavior when presented with a large number of regions and spectral frequencies. In this paper, we propose an exploratory data analysis method that sorts through a massive quantity of multivariate neural recordings to quickly extract a subset of brain regions and frequencies that encode behavior. This approach combines existing tools and exploits low-rank approximation of matrices without a priori selection of regions and frequency bands for analysis. In detail, the spectral content of neural activity across all frequencies of each recording contact is computed and represented as a matrix. Then, the rank-1 approximation of the matrix is computed using singular value decomposition and the associated singular vectors are extracted. The temporal singular vector, which captures the salient features of the spectrogram, is then correlated to the trial-varying behavioral signal. The distribution of correlations for each brain region is efficiently computed and used to find a subset of regions and frequency bands of interest for further examination. As an illustration, we apply this approach to a data set of local field potentials collected using stereoelectroencephalography from a human subject performing a reaching task. Using the proposed procedure, we produced a comprehensive set of brain regions and frequencies related to our specific behavior. We demonstrate how this tool can produce preliminary results that capture neural patterns related to behavior and aid in formulating data-driven hypotheses, hence reducing the time it takes for any scientist to transition from the exploratory to the confirmatory phase.
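The core step described in this abstract (rank-1 approximation of a spectral matrix, followed by correlating the temporal singular vector with behavior) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation; the frequency-by-trial layout of the matrix, the function name and all numeric values are assumptions.

```python
import numpy as np

def rank1_vectors(S):
    """Leading singular triplet of S (rows: frequencies, columns: trials)."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return U[:, 0], s[0], Vt[0]

# Synthetic "spectral content" for one recording contact.
rng = np.random.default_rng(1)
n_freq, n_trials = 20, 40
behavior = rng.normal(size=n_trials)       # trial-varying behavioral signal
profile = np.linspace(1.0, 2.0, n_freq)    # assumed frequency profile
S = np.outer(profile, 3.0 + behavior) + 0.05 * rng.normal(size=(n_freq, n_trials))

freq_vec, sigma, trial_vec = rank1_vectors(S)

# Singular vectors are defined up to sign, so the magnitude of r is what matters.
r = np.corrcoef(trial_vec, behavior)[0, 1]
rel_error = np.linalg.norm(S - sigma * np.outer(freq_vec, trial_vec)) / np.linalg.norm(S)
```

Repeating this for every contact and collecting the correlations per brain region would give the distributions the abstract refers to; `freq_vec` shows which frequencies carry the dominant pattern.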
32
Kneale C, Brown SD. Uncharted forest: A technique for exploratory data analysis. Talanta 2018; 189:71-8. [PMID: 30086977 DOI: 10.1016/j.talanta.2018.06.061] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 04/21/2018] [Revised: 06/18/2018] [Accepted: 06/19/2018] [Indexed: 11/22/2022]
Abstract
Exploratory data analysis is crucial for developing and understanding classification models from high-dimensional datasets. We explore the utility of a new unsupervised tree ensemble called uncharted forest for visualizing class associations, sample-sample associations, class heterogeneity, and uninformative classes for provenance studies. The uncharted forest algorithm can be used to partition data using random selections of variables and metrics based on statistical spread. After each tree is grown, a tally of the samples that arrive at every terminal node is maintained. Those tallies are stored in a single sample association matrix, and a likelihood measure for each sample being partitioned with one another can be made. That matrix may be readily viewed as a heat map, and the probabilities can be quantified via new metrics that account for class or cluster membership. We display the advantages and limitations of using this technique by applying it to two classification datasets and three provenance study datasets. Two of the metrics presented in this paper are also compared with widely used metrics from two algorithms that have variance-based clustering mechanisms.
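The idea of tallying which samples share a terminal node can be illustrated with a deliberately simplified sketch. It is not the published uncharted forest algorithm: the split rule used here (the midpoint of the range of a randomly chosen variable) is a stand-in assumption for the spread-based metrics mentioned in the abstract, and all names and data are hypothetical.

```python
import random

def grow(indices, data, depth, rng, leaves):
    """Recursively split sample indices on a random variable; collect the leaves."""
    if len(indices) > 1 and depth > 0:
        var = rng.randrange(len(data[0]))
        vals = [data[i][var] for i in indices]
        cut = (min(vals) + max(vals)) / 2.0  # stand-in split rule (assumption)
        left = [i for i in indices if data[i][var] <= cut]
        right = [i for i in indices if data[i][var] > cut]
        if left and right:
            grow(left, data, depth - 1, rng, leaves)
            grow(right, data, depth - 1, rng, leaves)
            return
    leaves.append(indices)

def association_matrix(data, n_trees=50, depth=2, seed=0):
    """Fraction of trees in which each pair of samples ends in the same leaf."""
    rng = random.Random(seed)
    n = len(data)
    tally = [[0] * n for _ in range(n)]
    for _ in range(n_trees):
        leaves = []
        grow(list(range(n)), data, depth, rng, leaves)
        for leaf in leaves:
            for i in leaf:
                for j in leaf:
                    tally[i][j] += 1
    return [[t / n_trees for t in row] for row in tally]

# Two well-separated toy groups of five samples each, two variables per sample.
cluster_a = [[0.1 * k, 0.5 + 0.05 * k] for k in range(5)]
cluster_b = [[10.0 + 0.1 * k, 10.5 + 0.05 * k] for k in range(5)]
data = cluster_a + cluster_b
assoc = association_matrix(data)
```

The resulting matrix can be drawn as a heat map; in this toy case samples from different groups never share a leaf, while neighboring samples within a group always do.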
33
Gross T, Mapstone M, Miramontes R, Padilla R, Cheema AK, Macciardi F, Federoff HJ, Fiandaca MS. Toward Reproducible Results from Targeted Metabolomic Studies: Perspectives for Data Pre-processing and a Basis for Analytic Pipeline Development. Curr Top Med Chem 2018; 18:883-895. [PMID: 29992885 DOI: 10.2174/1568026618666180711144323] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 05/16/2018] [Revised: 06/20/2018] [Accepted: 06/28/2018] [Indexed: 11/22/2022]
Abstract
Contemporary metabolomics experiments generate a rich array of complex high-dimensional data. Consequently, there have been concurrent efforts to develop methodological standards and analytical workflows to streamline the generation of meaningful biochemical and clinical inferences from raw data generated using an analytical platform like mass spectrometry. While such considerations have been frequently addressed in untargeted metabolomics (i.e., the broad survey of all distinguishable metabolites within a sample of interest), this methodological scrutiny has seldom been applied to data generated using commercial, targeted metabolomics kits. We suggest that this may, in part, account for past and more recent incomplete replications of previously specified biomarker panels. Herein, we identify common impediments challenging the analysis of raw, targeted metabolomic abundance data from a commercial kit and review methods to remedy these issues. In doing so, we propose an analytical pipeline suitable for the pre-processing of data for downstream biomarker discovery. Operational and statistical considerations for integrating targeted data sets across experimental sites and analytical batches are discussed, as are best practices for developing predictive models relating pre-processed metabolomic data to associated phenotypic information.
Affiliation(s)
- Thomas Gross
- Translational Laboratory and Biorepository, University of California, Irvine School of Medicine, Irvine, CA 92697, United States; Department of Anatomy & Neurobiology, University of California, Irvine School of Medicine, Irvine, CA 92697, United States
- Mark Mapstone
- Translational Laboratory and Biorepository, University of California, Irvine School of Medicine, Irvine, CA 92697, United States; Department of Neurology, University of California, Irvine School of Medicine, Irvine, CA 92697, United States
- Ricardo Miramontes
- Translational Laboratory and Biorepository, University of California, Irvine School of Medicine, Irvine, CA 92697, United States; Department of Neurology, University of California, Irvine School of Medicine, Irvine, CA 92697, United States
- Robert Padilla
- Translational Laboratory and Biorepository, University of California, Irvine School of Medicine, Irvine, CA 92697, United States; Department of Neurology, University of California, Irvine School of Medicine, Irvine, CA 92697, United States
- Amrita K Cheema
- Department of Oncology, Georgetown University Medical Center, Washington DC, 20007, United States; Department of Biochemistry and Molecular and Cellular Biology, Georgetown University Medical Center, Washington, DC, 20007, United States
- Fabio Macciardi
- Translational Laboratory and Biorepository, University of California, Irvine School of Medicine, Irvine, CA 92697, United States; Department of Psychiatry and Human Behavior, University of California, Irvine School of Medicine, Irvine, CA 92697, United States
- Howard J Federoff
- Translational Laboratory and Biorepository, University of California, Irvine School of Medicine, Irvine, CA 92697, United States; Department of Neurology, University of California, Irvine School of Medicine, Irvine, CA 92697, United States; UCI Health, University of California, Irvine School of Medicine, Irvine, CA 92697, United States
- Massimo S Fiandaca
- Translational Laboratory and Biorepository, University of California, Irvine School of Medicine, Irvine, CA 92697, United States; Department of Anatomy & Neurobiology, University of California, Irvine School of Medicine, Irvine, CA 92697, United States; Department of Neurology, University of California, Irvine School of Medicine, Irvine, CA 92697, United States; Department of Neurological Surgery, University of California, Irvine School of Medicine, Irvine, CA 92697, United States
34
Owolabi FO, Oguntunde PE, Adetula DT, Fakile SA. Learning analytics: Data sets on the academic record of accounting students in a Nigerian University. Data Brief 2018; 19:1614-1619. [PMID: 30246078 PMCID: PMC6141959 DOI: 10.1016/j.dib.2018.06.078] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/30/2018] [Revised: 06/15/2018] [Accepted: 06/19/2018] [Indexed: 11/24/2022]
Abstract
This paper presents data on the academic performance of a particular set of accounting students from the year of inception into a Nigerian university to the year of graduation. Descriptive analysis was performed on the dataset and a regression model which is capable of making predictions was fitted to the dataset. From the dataset, 24 out of the students who started with a first class result (CGPA above 4.50) still maintained a first class result at graduation. 4 out of the students who started with a first class result dropped to second class upper division before graduation. 4 out of the students who started with a second class upper division result moved to first class result before graduation. 28 out of 35 students who started with a second class upper division maintained a second class upper division result at graduation.
35
Abeysinghe R, Cui L. Query-constraint-based mining of association rules for exploratory analysis of clinical datasets in the National Sleep Research Resource. BMC Med Inform Decis Mak 2018; 18:58. [PMID: 30066656 PMCID: PMC6069291 DOI: 10.1186/s12911-018-0633-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 12/14/2022]
Abstract
Background Association Rule Mining (ARM) has been widely used by biomedical researchers to perform exploratory data analysis and uncover potential relationships among variables in biomedical datasets. However, when biomedical datasets are high-dimensional, performing ARM on such datasets will yield a large number of rules, many of which may be uninteresting. Especially for imbalanced datasets, performing ARM directly would result in uninteresting rules that are dominated by certain variables that capture general characteristics. Methods We introduce a query-constraint-based ARM (QARM) approach for exploratory analysis of multiple, diverse clinical datasets in the National Sleep Research Resource (NSRR). QARM enables rule mining on a subset of data items satisfying a query constraint. We first perform a series of data-preprocessing steps including variable selection, merging semantically similar variables, combining multiple-visit data, and data transformation. We use the Top-k Non-Redundant (TNR) ARM algorithm to generate association rules. Then we remove general and subsumed rules so that unique and non-redundant rules result for a particular query constraint. Results Applying QARM to five datasets from NSRR yielded a total of 2517 association rules with a minimum confidence of 60% (using the top 100 rules for each query constraint). The results show that merging similar variables could avoid uninteresting rules. Also, removing general and subsumed rules resulted in a more concise and interesting set of rules. Conclusions QARM shows the potential to support exploratory analysis of large biomedical datasets. It is also shown to be a useful method to reduce the number of uninteresting association rules generated from imbalanced datasets.
A preliminary literature-based analysis showed that some association rules have supporting evidence from biomedical literature, while others without literature-based evidence may serve as the candidates for new hypotheses to explore and investigate. Together with literature-based evidence, the association rules mined over the NSRR clinical datasets may be used to support clinical decisions for sleep-related problems. Electronic supplementary material The online version of this article (10.1186/s12911-018-0633-7) contains supplementary material, which is available to authorized users.
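The notion of evaluating a rule only on the records that satisfy a query constraint can be illustrated with a minimal sketch of support and confidence. It does not implement the TNR algorithm or QARM's rule pruning; the function name, the item labels and the toy records are invented for illustration and are not clinical findings.

```python
def rule_stats(records, antecedent, consequent, query=None):
    """Support and confidence of antecedent -> consequent.

    Only records for which query(record) is true are considered; with
    query=None every record is used. Records and rule sides are sets of items.
    """
    subset = [r for r in records if query is None or query(r)]
    has_a = [r for r in subset if antecedent <= r]
    has_both = [r for r in has_a if consequent <= r]
    support = len(has_both) / len(subset) if subset else 0.0
    confidence = len(has_both) / len(has_a) if has_a else 0.0
    return support, confidence

# Toy records (hypothetical item labels).
records = [
    {"age>60", "apnea", "hypertension"},
    {"age>60", "apnea"},
    {"age>60", "apnea", "hypertension"},
    {"age<=60", "apnea"},
    {"age<=60"},
]

# Rule apnea -> hypertension, over all records and under a query constraint.
all_support, all_conf = rule_stats(records, {"apnea"}, {"hypertension"})
q_support, q_conf = rule_stats(records, {"apnea"}, {"hypertension"},
                               query=lambda r: "age>60" in r)

MIN_CONFIDENCE = 0.60  # the minimum confidence quoted in the abstract
kept_without_query = all_conf >= MIN_CONFIDENCE
kept_with_query = q_conf >= MIN_CONFIDENCE
```

In this toy example the rule falls below the 60% confidence threshold on the full data but passes it once mining is restricted to the queried subset, which is the effect a query constraint is meant to expose.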
Affiliation(s)
- Rashmie Abeysinghe
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
- Licong Cui
- Department of Computer Science, University of Kentucky, Lexington, KY, USA; Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, USA.
36
Abstract
The box-and-whiskers plot is an extraordinary graphical tool that provides a quick visual summary of an observed distribution. In spite of its many extensions, a really suitable boxplot to display circular data is not yet available. Thanks to its simplicity and strong visual impact, such a tool would be especially useful in all fields where circular measures arise: biometrics, astronomy, environmetrics, Earth sciences, to cite just a few. For this reason, in line with Tukey's original idea, a Tukey-like circular boxplot is introduced. Several simulated and real datasets arising in biology are used to illustrate the proposed graphical tool.
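Why circular data need their own summaries can be shown with a short worked example. The abstract does not describe how the circular boxplot is constructed, so the sketch below only contrasts the arithmetic mean with the standard circular mean of angles; the function name is an assumption.

```python
import math

def circular_mean_deg(angles_deg):
    """Circular mean of angles in degrees, returned in [0, 360)."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

# Two directions just either side of north.
angles = [350.0, 10.0]
arithmetic = sum(angles) / len(angles)  # 180.0, i.e. pointing due south
circular = circular_mean_deg(angles)    # numerically at 0 (equivalently 360) degrees

# Distance of the circular mean from 0 degrees, measured around the circle.
offset_from_north = min(circular, 360.0 - circular)
```

The same wrap-around problem affects medians, quartiles and whiskers, which is why a boxplot for circular measures cannot simply reuse the linear construction.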
Affiliation(s)
- Davide Buttarazzi
- Department of Economics and Law, University of Cassino and Southern Lazio, Italy
- Giuseppe Pandolfo
- Department of Industrial Engineering, University of Naples Federico II, Italy
- Giovanni C Porzio
- Department of Economics and Law, University of Cassino and Southern Lazio, Italy
37
Abstract
Finding useful patterns in datasets has attracted considerable interest in the field of visual analytics. One of the most common tasks is the identification and representation of clusters. However, this is non-trivial in heterogeneous datasets since the data needs to be analyzed from different perspectives. Indeed, highly variable patterns may mask underlying trends in the dataset. Dendrograms are graphical representations resulting from agglomerative hierarchical clustering and provide a framework for viewing the clustering at different levels of detail. However, dendrograms become cluttered when the dataset gets large, and the single cut of the dendrogram to demarcate different clusters can be insufficient in heterogeneous datasets. In this work, we propose a visual analytics methodology called MCLEAN that offers a general approach for guiding the user through the exploration and detection of clusters. Powered by a graph-based transformation of the relational data, it supports a scalable environment for representation of heterogeneous datasets by changing the spatialization. We thereby combine multilevel representations of the clustered dataset with community finding algorithms. Our approach entails displaying the results of the heuristics to users, providing a setting from which to start the exploration and data analysis. To evaluate our proposed approach, we conduct a qualitative user study, where participants are asked to explore a heterogeneous dataset, comparing the results obtained by MCLEAN with the dendrogram. These qualitative results reveal that MCLEAN is an effective way of aiding users in the detection of clusters in heterogeneous datasets. The proposed methodology is implemented in an R package available at https://bitbucket.org/vda-lab/mclean.
Affiliation(s)
- Daniel Alcaide
- Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Leuven, Belgium
- imec, KU Leuven, Leuven, Belgium
- Jan Aerts
- Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Leuven, Belgium
- imec, KU Leuven, Leuven, Belgium
38
Zhu Q, Fisher SA, Dueck H, Middleton S, Khaladkar M, Kim J. PIVOT: platform for interactive analysis and visualization of transcriptomics data. BMC Bioinformatics 2018; 19:6. [PMID: 29304726 PMCID: PMC5756333 DOI: 10.1186/s12859-017-1994-0] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 08/13/2017] [Accepted: 12/06/2017] [Indexed: 11/29/2022]
Abstract
Background Many R packages have been developed for transcriptome analysis, but their use often requires familiarity with R, and integrating the results of different packages requires scripts to wrangle the datatypes. Furthermore, exploratory data analyses often generate multiple derived datasets, such as data subsets or data transformations, which can be difficult to track. Results Here we present PIVOT, an R-based platform that wraps open source transcriptome analysis packages with a uniform user interface and graphical data management that allows non-programmers to interactively explore transcriptomics data. PIVOT supports more than 40 popular open source packages for transcriptome analysis and provides an extensive set of tools for statistical data manipulations. A graph-based visual interface is used to represent the links between derived datasets, allowing easy tracking of data versions. PIVOT further supports automatic report generation, publication-quality plots, and program/data state saving, such that all analyses can be saved, shared and reproduced. Conclusions PIVOT will allow researchers from a broad range of backgrounds to easily access sophisticated transcriptome analysis tools and interactively explore transcriptome datasets. Electronic supplementary material The online version of this article (10.1186/s12859-017-1994-0) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Qin Zhu
- Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Stephen A Fisher
- Department of Biology, University of Pennsylvania, Philadelphia, PA, USA
- Hannah Dueck
- Department of Biology, University of Pennsylvania, Philadelphia, PA, USA
- Sarah Middleton
- Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Mugdha Khaladkar
- Department of Biology, University of Pennsylvania, Philadelphia, PA, USA
- Junhyong Kim
- Department of Biology, University of Pennsylvania, Philadelphia, PA, USA.
39
Palace-Berl F, Pasqualoto KFM, Zingales B, Moraes CB, Bury M, Franco CH, da Silva Neto AL, Murayama JS, Nunes SL, Silva MN, Tavares LC. Investigating the structure-activity relationships of N'-[(5-nitrofuran-2-yl) methylene] substituted hydrazides against Trypanosoma cruzi to design novel active compounds. Eur J Med Chem 2017; 144:29-40. [PMID: 29247858 DOI: 10.1016/j.ejmech.2017.12.011] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 08/23/2017] [Revised: 11/29/2017] [Accepted: 12/02/2017] [Indexed: 10/18/2022]
Abstract
Chagas disease, caused by the protozoan Trypanosoma cruzi, is a neglected chronic tropical infection endemic in Latin America. New and effective treatments are urgently needed because the two available drugs - benznidazole (BZD) and nifurtimox (NFX) - have limited curative power in the chronic phase of the disease. We have previously reported the design and synthesis of N'-[(5-nitrofuran-2-yl) methylene] substituted hydrazides that showed high trypanocidal activity against axenic epimastigote forms of three T. cruzi strains. Here we show that these compounds are also active against a BZD- and NFX-resistant strain. Herein, multivariate approaches (hierarchical cluster analysis and principal component analysis) were applied to a set of thirty-six formerly characterized compounds. Based on the findings from exploratory data analysis, novel compounds were designed and synthesized. These compounds showed two- to three-fold higher trypanocidal activity against epimastigote forms than the previous set and were 25- to 30-fold more active than BZD. Their activity was also evaluated against intracellular amastigotes by high content screening (HCS). The most active compounds (BSF-38 to BSF-40) showed a selective index (SI') greater than 200, in contrast to the SI' values of reference drugs (NFX, 16.45; BZD, > 3), and a 70-fold greater activity than BZD. These findings indicate that nitrofuran compounds designed based on the activity against epimastigote forms show promising trypanocidal activity against intracellular amastigotes, which correspond to the predominant parasite stage in the chronic phase of Chagas disease.
Affiliation(s)
- Fanny Palace-Berl
- Department of Biochemical and Pharmaceutical Technology, Faculty of Pharmaceutical Sciences, University of São Paulo, SP, Brazil.
- Bianca Zingales
- Department of Biochemistry, Chemistry Institute, University of São Paulo, SP, Brazil
- Carolina Borsoi Moraes
- Laboratório Nacional de Biociências (LNBio), Centro Nacional de Pesquisa em Energia e Materiais (CNPEM), Campinas, Brazil
- Mariana Bury
- Department of Biochemistry, Chemistry Institute, University of São Paulo, SP, Brazil
- Caio Haddad Franco
- Laboratório Nacional de Biociências (LNBio), Centro Nacional de Pesquisa em Energia e Materiais (CNPEM), Campinas, Brazil
- Adelson Lopes da Silva Neto
- Department of Biochemical and Pharmaceutical Technology, Faculty of Pharmaceutical Sciences, University of São Paulo, SP, Brazil
- João Sussumu Murayama
- Department of Biochemical and Pharmaceutical Technology, Faculty of Pharmaceutical Sciences, University of São Paulo, SP, Brazil
- Solange Lessa Nunes
- Department of Biochemistry, Chemistry Institute, University of São Paulo, SP, Brazil
- Marcelo Nunes Silva
- Department of Biochemistry, Chemistry Institute, University of São Paulo, SP, Brazil
- Leoberto Costa Tavares
- Department of Biochemical and Pharmaceutical Technology, Faculty of Pharmaceutical Sciences, University of São Paulo, SP, Brazil
40
Abstract
BACKGROUND Instead of testing predefined hypotheses, the goal of exploratory data analysis (EDA) is to find what data can tell us. Following this strategy, we re-analyzed a large body of genomic data to study the complex gene regulation in mouse pre-implantation development (PD). RESULTS Starting with a single-cell RNA-seq dataset consisting of 259 mouse embryonic cells derived from zygote to blastocyst stages, we reconstructed the temporal and spatial gene expression pattern during PD. The dynamics of gene expression can be partially explained by the enrichment of transposable elements in gene promoters and the similarity of expression profiles with those of corresponding transposons. Long Terminal Repeats (LTRs) are associated with transient, strong induction of many nearby genes at the 2-4 cell stages, probably by providing binding sites for Obox and other homeobox factors. B1 and B2 SINEs (Short Interspersed Nuclear Elements) are correlated with the upregulation of thousands of nearby genes during zygotic genome activation. Such enhancer-like effects are also found for human Alu and bovine tRNA SINEs. SINEs also seem to be predictive of gene expression in embryonic stem cells (ESCs), raising the possibility that they may also be involved in regulating pluripotency. We also identified many potential transcription factors underlying PD and discussed the evolutionary necessity of transposons in enhancing genetic diversity, especially for species with longer generation time. CONCLUSIONS Together with other recent studies, our results provide further evidence that many transposable elements may play a role in establishing the expression landscape in early embryos. It also demonstrates that exploratory bioinformatics investigation can pinpoint developmental pathways for further study, and serve as a strategy to generate novel insights from big genomic data.
Affiliation(s)
- Steven Xijin Ge
- Department of Mathematics and Statistics, South Dakota State University, Box 2225, Brookings, SD, 57110, USA.
41
Komenda M, Karolyi M, Pokorná A, Vaitsis C. Medical and Healthcare Curriculum Exploratory Analysis. Stud Health Technol Inform 2017; 235:231-235. [PMID: 28423788] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
In recent years, medical and healthcare higher education institutions have compiled their curricula in different ways in order to cover all the topics and sections that students need to go through to succeed in their future clinical practice. A medical and healthcare curriculum consists of many descriptive parameters, which define what, when, and how students will learn in the course of their studies. To make a complicated medical and healthcare curriculum structure easier to understand, we have developed a web-oriented platform for curriculum management that covers formal metadata specifications in detail, in accordance with the approved pedagogical background, namely the outcome-based approach. Our platform provides a rich database that can be used for innovative, detailed educational data analysis. In this contribution we present how we used a proven process model to increase accuracy in solving individual analytical tasks with the available data. Moreover, we introduce an innovative approach to exploring a dataset in accordance with the selected methodology. The results for the selected analytical issues are presented in clear visual interpretations in an attempt to visually describe the entire medical and healthcare curriculum.
Affiliation(s)
- Martin Komenda
- Institute of Biostatistics and Analyses, Faculty of Medicine, Masaryk University
- Matěj Karolyi
- Institute of Biostatistics and Analyses, Faculty of Medicine, Masaryk University
- Andrea Pokorná
- Institute of Biostatistics and Analyses, Faculty of Medicine, Masaryk University
- Christos Vaitsis
- Department of Learning, Informatics Management and Ethics, Karolinska Institutet
42
Pimentel H, Sturmfels P, Bray N, Melsted P, Pachter L. The Lair: a resource for exploratory analysis of published RNA-Seq data. BMC Bioinformatics 2016; 17:490. [PMID: 27905880 DOI: 10.1186/s12859-016-1357-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 06/03/2016] [Accepted: 11/19/2016] [Indexed: 11/10/2022] Open
Abstract
Increased emphasis on reproducibility of published research in the last few years has led to the large-scale archiving of sequencing data. While this data can, in theory, be used to reproduce results in papers, it is difficult to use in practice. We introduce a series of tools for processing and analyzing RNA-Seq data in the Sequence Read Archive that together have allowed us to build an easily extendable resource for analysis of data underlying published papers. Our system makes the exploration of data easily accessible and usable without technical expertise. Our database and associated tools can be accessed at The Lair: http://pachterlab.github.io/lair.
43
González-Calabozo JM, Valverde-Albacete FJ, Peláez-Moreno C. Interactive knowledge discovery and data mining on genomic expression data with numeric formal concept analysis. BMC Bioinformatics 2016; 17:374. [PMID: 27628041 DOI: 10.1186/s12859-016-1234-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 02/01/2016] [Accepted: 09/01/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Gene Expression Data (GED) analysis poses a great challenge to the scientific community that can be framed into the Knowledge Discovery in Databases (KDD) and Data Mining (DM) paradigm. Biclustering has emerged as the machine learning method of choice to solve this task, but its unsupervised nature makes result assessment problematic. This is often addressed by means of Gene Set Enrichment Analysis (GSEA). RESULTS We put forward a framework in which GED analysis is understood as an Exploratory Data Analysis (EDA) process where we provide support for continuous human interaction with data aiming at improving the step of hypothesis abduction and assessment. We focus on the adaptation to human cognition of data interpretation and visualization of the output of EDA. First, we give a proper theoretical background to bi-clustering using Lattice Theory and provide a set of analysis tools revolving around [Formula: see text]-Formal Concept Analysis ([Formula: see text]-FCA), a lattice-theoretic unsupervised learning technique for real-valued matrices. By using different kinds of cost structures to quantify expression we obtain different sequences of hierarchical bi-clusterings for gene under- and over-expression using thresholds. Consequently, we provide a method with interleaved analysis steps and visualization devices so that the sequences of lattices for a particular experiment summarize the researcher's vision of the data. This also allows us to define measures of persistence and robustness of biclusters to assess them. Second, the resulting biclusters are used to index external omics databases-for instance, Gene Ontology (GO)-thus offering a new way of accessing publicly available resources. This provides different flavors of gene set enrichment against which to assess the biclusters, by obtaining their p-values according to the terminology of those resources. 
We illustrate the exploration procedure on a real data example confirming results previously published. CONCLUSIONS The GED analysis problem gets transformed into the exploration of a sequence of lattices enabling the visualization of the hierarchical structure of the biclusters with a certain degree of granularity. The ability of FCA-based bi-clustering methods to index external databases such as GO allows us to obtain a quality measure of the biclusters, to observe the evolution of a gene throughout the different biclusters it appears in, to look for relevant biclusters-by observing their genes and what their persistence is-to infer, for instance, hypotheses on their function.
44
Konarska M, Kuchida K, Tarr G, Polkinghorne RJ. Relationships between marbling measures across principal muscles. Meat Sci 2016; 123:67-78. [PMID: 27639062 DOI: 10.1016/j.meatsci.2016.09.005] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 03/30/2016] [Revised: 08/16/2016] [Accepted: 09/09/2016] [Indexed: 11/24/2022]
Abstract
As marbling is a principal input into many grading systems, it is important to have an accurate and reliable measurement procedure. This paper compares three approaches to measuring marbling: trained personnel, near infrared spectroscopy (NIR) and image analysis. One 25 mm slice of meat was utilised from up to 12 cuts from 48 carcasses processed in Poland and France. Each slice was frozen to enable a consistent post-slaughter period, then thawed for image analysis. The images were appraised by experienced beef graders and the sample used to determine fat content by NIR. We find that image analysis-based marbling measures capture something different from trained personnel and that there is a strong relationship between near infrared spectroscopy and trained personnel. Finally, we demonstrate that marbling measures taken on one muscle can be predictive of marbling in other muscles in the same carcase. This is particularly important for cut-based models such as the Meat Standards Australia system.
Affiliation(s)
- Małgorzata Konarska
- Warsaw University of Life Sciences, Department of Technique and Food Development, Faculty of Human Nutrition and Consumer Sciences, (WULS-SGGW), 159C Nowoursynowska Str., 02-776 Warsaw, Poland.
- Keigo Kuchida
- Obihiro University of Agriculture and Veterinary Medicine, Obihiro 080-8555, Japan
- Garth Tarr
- School of Mathematical and Physical Sciences, University of Newcastle, Callaghan, NSW 2308, Australia
45
Palace-Berl F, Pasqualoto KF, Jorge SD, Zingales B, Zorzi RR, Silva MN, Ferreira AK, de Azevedo RA, Teixeira SF, Tavares LC. Designing and exploring active N'-[(5-nitrofuran-2-yl) methylene] substituted hydrazides against three Trypanosoma cruzi strains more prevalent in Chagas disease patients. Eur J Med Chem 2015; 96:330-9. [PMID: 25899337 DOI: 10.1016/j.ejmech.2015.03.066] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 10/27/2014] [Revised: 02/26/2015] [Accepted: 03/30/2015] [Indexed: 12/28/2022]
Abstract
Chagas disease affects around 8 million people worldwide and its treatment depends on only two nitroheterocyclic drugs, benznidazole (BZD) and nifurtimox (NFX). Both drugs have limited curative power in the chronic phase of the disease. Nifuroxazide (NF), a nitroheterocyclic drug, was used as a lead to design a set of twenty-one compounds in order to improve the anti-Trypanosoma cruzi activity. Lipinski's rules were considered in order to support drug-likeness designing. The set of N'-[(5-nitrofuran-2-yl) methylene] substituted hydrazides was assayed against three T. cruzi strains, which represent the discrete typing units more prevalent in human patients: Y (TcII), Silvio X10 cl1 (TcI), and Bug 2149 cl10 (TcV). All the derivatives, except one, showed enhanced trypanocidal activity against the three strains as compared to BZD. In the Y strain 62% of the compounds were more active than NFX. The most active compound was N'-((5-nitrofuran-2-yl) methylene)biphenyl-4-carbohydrazide (C20), which showed IC50 values of 1.17 ± 0.12 μM; 3.17 ± 0.32 μM; and 1.81 ± 0.18 μM for Y, Silvio X10 cl1, and Bug 2149 cl10 strains, respectively. Cytotoxicity assays with human fibroblast cells have demonstrated high selectivity indices for several compounds. Exploratory data analysis indicated that primarily topological, steric/geometric, and electronic properties have contributed to the discrimination of the set of investigated compounds. The findings can be helpful to drive the designing, and subsequently, the synthesis of additional promising drugs against Chagas disease.
46
Damião MCFCB, Pasqualoto KFM, Ferreira AK, Teixeira SF, Azevedo RA, Barbuto JAM, Palace-Berl F, Franchi-Junior GC, Nowill AE, Tavares MT, Parise-Filho R. Novel capsaicin analogues as potential anticancer agents: synthesis, biological evaluation, and in silico approach. Arch Pharm (Weinheim) 2014; 347:885-95. [PMID: 25283529 DOI: 10.1002/ardp.201400233] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 06/09/2014] [Revised: 07/28/2014] [Accepted: 08/15/2014] [Indexed: 11/07/2022]
Abstract
A novel class of benzo[d][1,3]dioxol-5-ylmethyl alkyl/aryl amide and ester analogues of capsaicin were designed, synthesized, and evaluated for their cytotoxic activity against human and murine cancer cell lines (B16F10, SK-MEL-28, NCI-H1299, NCI-H460, SK-BR-3, and MDA-MB-231) and human lung fibroblasts (MRC-5). Three compounds (5f, 6c, and 6e) selectively inhibited the growth of aggressive cancer cells in the micromolar (µM) range. Furthermore, an exploratory data analysis pointed at the topological and electronic molecular properties as responsible for the discrimination process regarding the set of investigated compounds. The findings suggest that the applied designing strategy, besides providing more potent analogues, indicates the aryl amides and esters as well as the alkyl esters as interesting scaffolds to design and develop novel anticancer agents.
Affiliation(s)
- Mariana C F C B Damião
- Department of Pharmacy, School of Pharmaceutical Sciences, University of São Paulo, São Paulo, SP, Brazil
47
Jung S, Jang K, Yoon Y, Kang S. Contributing factors to vehicle to vehicle crash frequency and severity under rainfall. J Safety Res 2014; 50:1-10. [PMID: 25142355 DOI: 10.1016/j.jsr.2014.01.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/01/2013] [Revised: 01/09/2014] [Accepted: 01/14/2014] [Indexed: 06/03/2023]
Abstract
INTRODUCTION This study combined vehicle to vehicle crash frequency and severity estimations to examine factor impacts on Wisconsin highway safety in rainy weather. METHOD Because of data deficiency, the real-time water film depth, the car-following distance, and the vertical curve grade were estimated with available data sources and a GIS analysis to capture rainy weather conditions at the crash location and time. Using a negative binomial regression for crash frequency estimation, the average annual daily traffic per lane, the interaction between the posted speed limit change and the existence of an off-ramp, and the interaction between the travel lane number change and the pavement surface material change were found to increase the likelihood of vehicle to vehicle crashes under rainfall. RESULTS However, more average daily rainfall per month and a wider left shoulder were identified as factors that decrease the likelihood of vehicle to vehicle crashes. In the crash severity estimation using the multinomial logit model that outperformed the ordered logit model, the travel lane number, the interaction between the travel lane number and the slow grade, the deep water film, and the rear-end collision type were more likely to increase the likelihood of injury crashes under rainfall compared with crashes involving only property damage. PRACTICAL IMPLICATIONS As an exploratory data analysis, this study provides insight into potential strategies for rainy weather highway safety improvement, specifically, the following weather-sensitive strategies: road design and ITS implementation for drivers' safety awareness under rainfall.
Affiliation(s)
- Soyoung Jung
- Hanyang University Erica Campus, Department of Transportation and Logistics Engineering, 55 Hanyangdaehak-ro, Sangnok-gu, Ansan 426-791, Republic of Korea.
- Kitae Jang
- Korea Advanced Institute of Science and Technology, The Cho Chun Shik Graduate School for Green Transportation, 2116-1 Eureka Bldg., 335 Gwahak-ro, Yuseong-gu, Daejeon 305-701, Republic of Korea.
- Yoonjin Yoon
- Korea Advanced Institute of Science and Technology, Department of Civil and Environmental Engineering, 291 Daehak-ro, Yuseong-gu, Daejeon 305-701, Republic of Korea.
- Sanghyeok Kang
- Construction and Economy Research Institute of Korea, 11th F. Construction Bldg, 711 Eonjuro, Kangnam-gu, Seoul 135-701, Republic of Korea.
48
Xiao F, Gulliver JS, Simcik MF. Perfluorooctane sulfonate (PFOS) contamination of fish in urban lakes: a prioritization methodology for lake management. Water Res 2013; 47:7264-7272. [PMID: 24184022 DOI: 10.1016/j.watres.2013.09.063] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 02/26/2013] [Revised: 08/06/2013] [Accepted: 09/01/2013] [Indexed: 06/02/2023]
Abstract
The contamination of urban lakes by anthropogenic pollutants such as perfluorooctane sulfonate (PFOS) is a worldwide environmental problem. Large-scale, long-term monitoring of urban lakes requires careful prioritization of available resources, focusing efforts on potentially impaired lakes. Herein, a database of PFOS concentrations in 304 fish caught from 28 urban lakes was used for development of an urban-lake prioritization framework by means of exploratory data analysis (EDA) with the aid of a geographical information system. The prioritization scheme consists of three main tiers: preliminary classification, carried out by hierarchical cluster analysis; predictor screening, fulfilled by a regression tree method; and model development by means of a neural network. The predictive performance of the newly developed model was assessed using a training/validation splitting method and determined by an external validation set. The application of the model in the U.S. state of Minnesota identified 40 urban lakes that may contain elevated levels of PFOS; these lakes were not previously considered in PFOS monitoring programs. The model results also highlight ongoing industrial/commercial activities as a principal determinant of PFOS pollution in urban lakes, and suggest vehicular traffic as an important source and surface runoff as a primary pollution carrier. In addition, the EDA approach was further compared to a spatial interpolation method (kriging), and their advantages and disadvantages were discussed.
Affiliation(s)
- Feng Xiao
- St. Anthony Falls Laboratory, University of Minnesota, Minneapolis, MN 55414, United States.
49
Abstract
Descriptive, exploratory, and inferential statistics are necessary components of hypothesis-driven biomedical research. Despite the ubiquitous need for these tools, the emphasis on statistical methods in pharmacology has become dominated by inferential methods often chosen more by the availability of user-friendly software than by any understanding of the data set or the critical assumptions of the statistical tests. Such frank misuse of statistical methodology and the quest to reach the mystical α<0.05 criteria has hampered research via the publication of incorrect analysis driven by rudimentary statistical training. Perhaps more critically, a poor understanding of statistical tools limits the conclusions that may be drawn from a study by divorcing the investigator from their own data. The net result is a decrease in quality and confidence in research findings, fueling recent controversies over the reproducibility of high profile findings and effects that appear to diminish over time. The recent development of "omics" approaches leading to the production of massive higher dimensional data sets has amplified these issues making it clear that new approaches are needed to appropriately and effectively mine this type of data. Unfortunately, statistical education in the field has not kept pace. This commentary provides a foundation for an intuitive understanding of statistics that fosters an exploratory approach and an appreciation for the assumptions of various statistical tests that hopefully will increase the correct use of statistics, the application of exploratory data analysis, and the use of statistical study design, with the goal of increasing reproducibility and confidence in the literature.
Affiliation(s)
- Michael J Marino
- Merck Research Laboratories, Merck & Co., Inc., 770 Sumneytown Pike, West Point, PA 19486, United States.
50
Sato Y, Gosho M, Toshimori K. Usefulness of statistics for establishing evidence-based reproductive medicine. Reprod Med Biol 2011; 11:49-58. [PMID: 29699105 DOI: 10.1007/s12522-011-0106-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2011] [Accepted: 07/12/2011] [Indexed: 11/29/2022] Open
Abstract
During the last decade, evidence-based medicine has been described as a paradigm shift in clinical practice, and as "the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients". Appropriate statistical methods for analyzing data are critical for the correct interpretation of the results in proof of the evidence. However, in the medical literature, these statistical methods are often incorrectly interpreted or misinterpreted, leading to serious methodological errors and misinterpretations. This review highlights several important aspects related to the design and statistical analysis for evidence-based reproductive medicine. First, we clarify the distinction between ratios, proportions, and rates, and then provide a definition of pregnancy rate. Second, we focus on a special type of bias called 'confounding bias', which occurs when a factor is associated with both the exposure and the disease but is not part of the causal pathway. Finally, we present concerns regarding misuse of statistical software or application of inappropriate statistical methods, especially in medical research.
Affiliation(s)
- Yasunori Sato
- Clinical Research Center, Chiba University Hospital, 1-8-1 Inohana, Chuo-ku, 260-8677 Chiba, Japan; Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
- Masahiko Gosho
- Department of Management Science, Graduate School of Engineering, Tokyo University of Science, Tokyo, Japan
- Kiyotaka Toshimori
- Department of Anatomy and Developmental Biology, Chiba University Graduate School of Medicine, Chiba, Japan