1
Bernard M, Poli M, Karadayi J, Dupoux E. Shennong: A Python toolbox for audio speech features extraction. Behav Res Methods 2023; 55:4489-4501. [PMID: 36750521 DOI: 10.3758/s13428-022-02029-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/17/2022] [Indexed: 02/09/2023]
Abstract
We introduce Shennong, a Python toolbox and command-line utility for audio speech features extraction. It implements a wide range of well-established state-of-the-art algorithms: spectro-temporal filters such as Mel-Frequency Cepstral Filterbank or Predictive Linear Filters, pre-trained neural networks, pitch estimators, speaker normalization methods, and post-processing algorithms. Shennong is an open source, reliable and extensible framework built on top of the popular Kaldi speech processing library. The Python implementation makes it easy to use by non-technical users and integrates with third-party speech modeling and machine learning tools from the Python ecosystem. This paper describes the Shennong software architecture, its core components, and implemented algorithms. Then, three applications illustrate its use. We first present a benchmark of speech features extraction algorithms available in Shennong on a phone discrimination task. We then analyze the performances of a speaker normalization model as a function of the speech duration used for training. We finally compare pitch estimation algorithms on speech under various noise conditions.
Affiliation(s)
- Mathieu Bernard
- Cognitive Machine Learning, PSL Research University, CNRS, EHESS, ENS, Inria, Paris, France.
- EconomiX (UMR 7235), Université Paris Nanterre, CNRS, Nanterre, France.
- Maxime Poli
- Cognitive Machine Learning, PSL Research University, CNRS, EHESS, ENS, Inria, Paris, France
- Julien Karadayi
- Cognitive Machine Learning, PSL Research University, CNRS, EHESS, ENS, Inria, Paris, France
- Emmanuel Dupoux
- Cognitive Machine Learning, PSL Research University, CNRS, EHESS, ENS, Inria, Paris, France
- Meta AI Research, Paris, France
2
Hung LH, Straw E, Reddy S, Schmitz R, Colburn Z, Yeung KY. Cloud-enabled Biodepot workflow builder integrates image processing using Fiji with reproducible data analysis using Jupyter notebooks. Sci Rep 2022; 12:14920. [PMID: 36056115 PMCID: PMC9440253 DOI: 10.1038/s41598-022-19173-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Accepted: 08/25/2022] [Indexed: 11/16/2022] [Open access]
Abstract
Modern biomedical image analysis workflows contain multiple computational processing tasks, giving rise to problems in reproducibility. In addition, image datasets can span both spatial and temporal dimensions, with additional channels for fluorescence and other data, resulting in datasets that are too large to be processed locally on a laptop. For omics analyses, software containers have been shown to enhance reproducibility, facilitate installation and provide access to scalable computational resources on the cloud. However, most image analyses contain steps that are graphical and interactive, features that are not supported by most omics execution engines. We present the containerized and cloud-enabled Biodepot-workflow-builder platform that supports graphics from software containers and has been extended for image analyses. We demonstrate the potential of our modular approach with multi-step workflows that incorporate the popular and open-source Fiji suite for image processing. One of our examples integrates fully interactive ImageJ macros with Jupyter notebooks. Our second example illustrates how the complicated cloud setup of a computationally intensive process such as stitching 3D digital pathology datasets using BigStitcher can be automated and simplified. In both examples, users can leverage a form-based graphical interface to execute multi-step workflows with a single click, using the provided sample data and preset input parameters. Alternatively, users can interactively modify the image processing steps in the workflow, apply the workflows to their own data, and change the input parameters and macros. By providing interactive graphics support to software containers, our modular platform supports reproducible image analysis workflows, simplified access to cloud resources for analysis of large datasets, and integration across different applications such as Jupyter.
Affiliation(s)
- Ling-Hong Hung
- School of Engineering and Technology, University of Washington Tacoma, Box 358426, Tacoma, 98402, WA, USA
- Evan Straw
- Biodepot LLC, Seattle, 98195, WA, USA
- University of Washington, Seattle, 98195, WA, USA
- Shishir Reddy
- School of Engineering and Technology, University of Washington Tacoma, Box 358426, Tacoma, 98402, WA, USA
- Robert Schmitz
- School of Engineering and Technology, University of Washington Tacoma, Box 358426, Tacoma, 98402, WA, USA
- Biodepot LLC, Seattle, 98195, WA, USA
- Ka Yee Yeung
- School of Engineering and Technology, University of Washington Tacoma, Box 358426, Tacoma, 98402, WA, USA
- Biodepot LLC, Seattle, 98195, WA, USA
3
Yu M, Dolios G, Petrick L. Reproducible untargeted metabolomics workflow for exhaustive MS2 data acquisition of MS1 features. J Cheminform 2022; 14:6. [PMID: 35172886 PMCID: PMC8848943 DOI: 10.1186/s13321-022-00586-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/23/2021] [Accepted: 02/03/2022] [Indexed: 01/16/2023] [Open access]
Abstract
Unknown features in untargeted metabolomics and non-targeted analysis (NTA) are identified using fragment ions from MS/MS spectra to predict the structures of the unknown compounds. Selection of the precursor ion for fragmentation is commonly performed using data dependent acquisition (DDA) strategies or following statistical analysis using targeted MS/MS approaches. However, the selected precursor ions from DDA only cover a biased subset of the peaks or features found in full scan data. In addition, different statistical analyses can select different precursor ions for MS/MS analysis, which makes the post-hoc validation of ions selected following a secondary analysis impossible for precursor ions selected by the original statistical method. Here we propose an automated, exhaustive, statistical model-free workflow: paired mass distance-dependent analysis (PMDDA), for reproducible untargeted mass spectrometry MS2 fragment ion collection of unknown compounds found in MS1 full scan. Our workflow first removes redundant peaks from MS1 data and then exports a list of precursor ions for pseudo-targeted MS/MS analysis on independent peaks. This workflow provides comprehensive coverage of MS2 collection on unknown compounds found in full scan analysis using a “one peak for one compound” workflow without a priori redundant peak information. We compared pseudo-spectra formation and the number of MS2 spectra linked to MS1 data using the PMDDA workflow to that obtained using the CAMERA and RAMClustR algorithms. More annotated compounds, molecular networks, and unique MS/MS spectra were found using PMDDA compared with CAMERA and RAMClustR. In addition, PMDDA can generate a preferred ion list for iterative DDA to enhance coverage of compounds when instruments support such functions. Finally, compounds with signals in both positive and negative modes can be identified by the PMDDA workflow, to further reduce redundancies. The whole workflow is fully reproducible as a docker image xcmsrocker with both the original data and the data processing template.
Affiliation(s)
- Miao Yu
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA.
- Georgia Dolios
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
- Lauren Petrick
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA; The Institute for Exposomic Research, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
4
Krampis K. Democratizing bioinformatics through easily accessible software platforms for non-experts in the field. Biotechniques 2022; 72:36-38. [PMID: 35060754 PMCID: PMC8988881 DOI: 10.2144/btn-2021-0060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022] [Open access]
Affiliation(s)
- Konstantinos Krampis
- Department of Biological Sciences, Hunter College, City University of New York, NY, USA
5
Du X, Aristizabal-Henao JJ, Garrett TJ, Brochhausen M, Hogan WR, Lemas DJ. A Checklist for Reproducible Computational Analysis in Clinical Metabolomics Research. Metabolites 2022; 12:87. [PMID: 35050209 PMCID: PMC8779534 DOI: 10.3390/metabo12010087] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 12/10/2021] [Revised: 12/25/2021] [Accepted: 01/10/2022] [Indexed: 12/15/2022] [Open access]
Abstract
Clinical metabolomics has emerged as a novel approach for biomarker discovery with the translational potential to guide next-generation therapeutics and precision health interventions. However, reproducibility in clinical research employing metabolomics data is challenging. Checklists are a helpful tool for promoting reproducible research. Existing checklists that promote reproducible metabolomics research primarily focus on metadata and may not be sufficient to ensure reproducible metabolomics data processing. This paper provides a checklist of actions that researchers need to take to make computational steps reproducible in clinical metabolomics studies. We developed an eight-item checklist that includes criteria related to reusable data sharing and reproducible computational workflow development. We also provide recommended tools and resources to complete each item, as well as a GitHub project template to guide the process. The checklist is concise and easy to follow. Studies that follow this checklist and use the recommended resources may help other researchers reproduce metabolomics results easily and efficiently.
Affiliation(s)
- Xinsong Du
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Timothy J. Garrett
- Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Mathias Brochhausen
- Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA
- William R. Hogan
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Dominick J. Lemas
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32610, USA
6
Plonski NM, Johnson E, Frederick M, Mercer H, Fraizer G, Meindl R, Casadesus G, Piontkivska H. Automated Isoform Diversity Detector (AIDD): a pipeline for investigating transcriptome diversity of RNA-seq data. BMC Bioinformatics 2020; 21:578. [PMID: 33375933 PMCID: PMC7772930 DOI: 10.1186/s12859-020-03888-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2020] [Accepted: 11/18/2020] [Indexed: 11/16/2022] [Open access]
Abstract
Background: As the number of RNA-seq datasets that become available to explore transcriptome diversity increases, so does the need for easy-to-use comprehensive computational workflows. Many available tools facilitate analyses of one of the two major mechanisms of transcriptome diversity, namely, differential expression of isoforms due to alternative splicing, while the second major mechanism, RNA editing due to post-transcriptional changes of individual nucleotides, remains under-appreciated. Both these mechanisms play an essential role in physiological and disease processes, including cancer and neurological disorders. However, elucidation of RNA editing events at the transcriptome-wide level requires increasingly complex computational tools, in turn resulting in a steep entrance barrier for labs that are interested in high-throughput variant calling applications on a large scale but lack the manpower and/or computational expertise. Results: Here we present an easy-to-use, fully automated, computational pipeline (Automated Isoform Diversity Detector, AIDD) that contains open source tools for various tasks needed to map transcriptome diversity, including RNA editing events. To facilitate reproducibility and avoid system dependencies, the pipeline is contained within a pre-configured VirtualBox environment. The analytical tasks and format conversions are accomplished via a set of automated scripts that enable the user to go from a set of raw data, such as fastq files, to publication-ready results and figures in one step. A publicly available dataset of Zika virus-infected neural progenitor cells is used to illustrate AIDD’s capabilities. Conclusions: The AIDD pipeline offers a user-friendly interface for comprehensive and reproducible RNA-seq analyses. Among the unique features of AIDD are its ability to infer RNA editing patterns, including ADAR editing, and the inclusion of Guttman scale patterns for time series analysis of such editing landscapes. AIDD-based results show the importance of the diversity of ADAR isoforms, key RNA editing enzymes linked with the innate immune system and viral infections. These findings offer insights into the potential role of ADAR editing dysregulation in disease mechanisms, including those of congenital Zika syndrome. Because of its automated all-inclusive features, the AIDD pipeline enables even a novice user to easily explore common mechanisms of transcriptome diversity, including RNA editing landscapes.
Affiliation(s)
- Noel-Marie Plonski
- Department of Biological Sciences, Kent State University, 256 Cunningham Hall, Kent, OH, 44242, USA; School of Biomedical Sciences, Kent State University, PO Box 5190, Kent, OH, 44242, USA
- Emily Johnson
- Department of Biological Sciences, Kent State University, 256 Cunningham Hall, Kent, OH, 44242, USA
- Madeline Frederick
- Department of Biological Sciences, Kent State University, 256 Cunningham Hall, Kent, OH, 44242, USA
- Heather Mercer
- Department of Biological Sciences, Kent State University, 256 Cunningham Hall, Kent, OH, 44242, USA; University of Mount Union, 1972 Clark Ave, Alliance, OH, 44601, USA
- Gail Fraizer
- Department of Biological Sciences, Kent State University, 256 Cunningham Hall, Kent, OH, 44242, USA; School of Biomedical Sciences, Kent State University, PO Box 5190, Kent, OH, 44242, USA
- Richard Meindl
- School of Biomedical Sciences, Kent State University, PO Box 5190, Kent, OH, 44242, USA; Department of Anthropology, Kent State University, Kent, OH, 44242, USA
- Gemma Casadesus
- Department of Biological Sciences, Kent State University, 256 Cunningham Hall, Kent, OH, 44242, USA; School of Biomedical Sciences, Kent State University, PO Box 5190, Kent, OH, 44242, USA; Brain Health Research Institute, Kent State University, Kent, OH, 44242, USA; Department of Pharmacology & Therapeutics, College of Medicine, University of Florida, Gainesville, FL, 32610, USA
- Helen Piontkivska
- Department of Biological Sciences, Kent State University, 256 Cunningham Hall, Kent, OH, 44242, USA; School of Biomedical Sciences, Kent State University, PO Box 5190, Kent, OH, 44242, USA; Brain Health Research Institute, Kent State University, Kent, OH, 44242, USA
7
Wittman JT, Aukema BH. A Guide and Toolbox to Replicability and Open Science in Entomology. J Insect Sci 2020; 20:6. [PMID: 32441307 PMCID: PMC7423018 DOI: 10.1093/jisesa/ieaa036] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/10/2020] [Indexed: 05/04/2023]
Abstract
The ability to replicate scientific experiments is a cornerstone of the scientific method. Sharing ideas, workflows, data, and protocols facilitates testing the generalizability of results, increases the speed at which science progresses, and enhances quality control of published work. Fields of science such as medicine, the social sciences, and the physical sciences have embraced practices designed to increase replicability. Granting agencies, for example, may require data management plans, and journals may require data and code availability statements along with the deposition of data and code in publicly available repositories. While many tools commonly used in replicable workflows, such as distributed version control systems (e.g., 'git') or scripting languages for data cleaning and analysis, may have a steep learning curve, their adoption can increase individual efficiency and facilitate collaborations both within entomology and across disciplines. The open science movement is developing within the discipline of entomology, but practitioners of these concepts or those desiring to work more collaboratively across disciplines may be unsure where or how to embrace these initiatives. This article is meant to introduce some of the tools entomologists can incorporate into their workflows to increase the replicability and openness of their work. We describe these tools and others, recommend additional resources for learning more about these tools, and discuss the benefits to both individuals and the scientific community and potential drawbacks associated with implementing a replicable workflow.
Affiliation(s)
- Jacob T Wittman
- Department of Entomology, University of Minnesota, St. Paul, MN
- Brian H Aukema
- Department of Entomology, University of Minnesota, St. Paul, MN
8
Lachmann A, Clarke DJB, Torre D, Xie Z, Ma'ayan A. Interoperable RNA-Seq analysis in the cloud. Biochim Biophys Acta Gene Regul Mech 2020; 1863:194521. [PMID: 32156561 DOI: 10.1016/j.bbagrm.2020.194521] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 07/02/2019] [Revised: 03/01/2020] [Accepted: 03/01/2020] [Indexed: 12/25/2022]
Abstract
RNA-Sequencing (RNA-Seq) is currently the leading technology for genome-wide transcript quantification. Mapping the raw reads to transcript and gene level counts can be achieved by different aligners. Here we report an in-depth comparison of transcript quantification methods. Our goal is the specific use of cost-efficient RNA-Seq analysis for deployment in a cloud infrastructure composed of interacting microservices. The individual modules cover file transfer into the cloud and APIs to handle the cloud alignment jobs. We next demonstrate how newly generated RNA-Seq data can be placed in the context of thousands of previously published datasets in near real time. With in-depth benchmarks, we identify suitable gene count quantification methods to facilitate cost-effective, accurate, and cloud-based RNA-Seq analysis service. Pseudo-alignment algorithms such as kallisto and Salmon combine high read quality estimation with cost efficient runtime performance. HISAT2 is the fastest of the classical aligners with good alignment quality. This article is part of a Special Issue entitled: Transcriptional Profiles and Regulatory Gene Networks edited by Dr. Federico Manuel Giorgi and Dr. Shaun Mahony.
Affiliation(s)
- Alexander Lachmann
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1603, New York, NY 10029, USA; Library of Integrated Network-based Cellular Signatures, Data Coordination and Integration Center (BD2K-LINCS DCIC), USA; Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), USA.
- Daniel J B Clarke
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1603, New York, NY 10029, USA; Library of Integrated Network-based Cellular Signatures, Data Coordination and Integration Center (BD2K-LINCS DCIC), USA; Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), USA
- Denis Torre
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1603, New York, NY 10029, USA; Library of Integrated Network-based Cellular Signatures, Data Coordination and Integration Center (BD2K-LINCS DCIC), USA; Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), USA
- Zhuorui Xie
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1603, New York, NY 10029, USA; Library of Integrated Network-based Cellular Signatures, Data Coordination and Integration Center (BD2K-LINCS DCIC), USA
- Avi Ma'ayan
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1603, New York, NY 10029, USA; Library of Integrated Network-based Cellular Signatures, Data Coordination and Integration Center (BD2K-LINCS DCIC), USA; Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), USA
9
Building Containerized Workflows Using the BioDepot-Workflow-Builder. Cell Syst 2019; 9:508-514.e3. [PMID: 31521606 DOI: 10.1016/j.cels.2019.08.007] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 12/29/2018] [Revised: 05/21/2019] [Accepted: 08/16/2019] [Indexed: 11/22/2022]
Abstract
We present the BioDepot-workflow-builder (Bwb), a software tool that allows users to create and execute reproducible bioinformatics workflows using a drag-and-drop interface. Graphical widgets represent Docker containers executing a modular task. Widgets are linked graphically to build bioinformatics workflows that can be reproducibly deployed across different local and cloud platforms. Each widget contains a form-based user interface to facilitate parameter entry and a console to display intermediate results. Bwb provides tools for rapid customization of widgets, containers, and workflows. Saved workflows can be shared using Bwb's native format or exported as shell scripts.
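The widget-to-container model described in this abstract can be illustrated with a minimal sketch. It is not Bwb's code or file format: the `Widget` class, the `example/...` image names, and the `export_shell` helper are all hypothetical, assumed here only to show how a linear chain of containerized steps could be rendered as a shell script of `docker run` commands.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class Widget:
    """One workflow step: a container image plus the command it runs.

    Hypothetical model for illustration; not Bwb's internal representation.
    """
    name: str
    image: str
    command: list[str]
    volumes: dict[str, str] = field(default_factory=dict)  # host path -> container path


def export_shell(widgets: list[Widget]) -> str:
    """Render the widgets, in order, as a shell script of `docker run` calls."""
    lines = ["#!/bin/sh", "set -e  # stop at the first failing step"]
    for w in widgets:
        parts = ["docker", "run", "--rm"]
        for host, container in w.volumes.items():
            parts += ["-v", f"{host}:{container}"]
        parts += [w.image, *w.command]
        lines.append(f"# step: {w.name}")
        # Quote every argument so paths with spaces survive the shell.
        lines.append(" ".join(shlex.quote(p) for p in parts))
    return "\n".join(lines) + "\n"


# Two made-up steps sharing one mounted data directory.
workflow = [
    Widget("download", "example/fetch:1.0",
           ["fetch", "--out", "/data/reads.fastq"], {"/tmp/data": "/data"}),
    Widget("align", "example/align:1.0",
           ["align", "/data/reads.fastq"], {"/tmp/data": "/data"}),
]
script = export_shell(workflow)
print(script)
```

Pinning each image to an explicit tag, as in the made-up `:1.0` tags above, is what lets the same script behave identically on a laptop and on a cloud host.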
10
Aciole Barbosa D, Menegidio FB, Alencar VC, Gonçalves RS, Silva JDFS, Vilas Boas RO, Faustino de Maria YNL, Jabes DL, Costa de Oliveira R, Nunes LR. ParaDB: A manually curated database containing genomic annotation for the human pathogenic fungi Paracoccidioides spp. PLoS Negl Trop Dis 2019; 13:e0007576. [PMID: 31306428 PMCID: PMC6658007 DOI: 10.1371/journal.pntd.0007576] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/17/2019] [Revised: 07/25/2019] [Accepted: 06/24/2019] [Indexed: 11/18/2022] [Open access]
Abstract
BACKGROUND: The genus Paracoccidioides consists of thermodimorphic fungi responsible for Paracoccidioidomycosis (PCM), a systemic mycosis that has been registered to affect ~10 million people in Latin America. Biogeographical data subdivided the genus Paracoccidioides into five divergent subgroups, which have been recently classified as different species. Genomic sequencing of five Paracoccidioides isolates, representing each of these subgroups/species, provided an important framework for the development of post-genomic studies with these fungi. However, functional annotations of these genomes have not been submitted to manual curation and, as a result, ~60-90% of the Paracoccidioides protein-coding genes (depending on isolate/annotation) are currently described as responsible for hypothetical proteins, without any further functional/structural description. PRINCIPAL FINDINGS: The present work reviews the functional assignment of Paracoccidioides genes, reducing the number of hypothetical proteins to ~25-28%. These results were compiled in a relational database called ParaDB, dedicated to the main representatives of Paracoccidioides spp. ParaDB can be accessed through a friendly graphical interface, which offers search tools based on keywords or protein/DNA sequences. All data contained in ParaDB can be partially or completely downloaded through spreadsheet, multi-fasta and GFF3-formatted files, which can be subsequently used in a variety of downstream functional analyses. Moreover, the entire ParaDB environment has been configured in a Docker service, which has been submitted to the GitHub repository, ensuring long-term data availability to researchers. This service can be downloaded and used to perform fully functional local installations of the database in alternative computing ecosystems, allowing users to conduct their data mining and analyses in a personal and stable working environment.
CONCLUSIONS: These new annotations greatly reduce the number of genes identified solely as hypothetical proteins and are integrated into a dedicated database, providing resources to assist researchers in this field to conduct post-genomic studies with this group of human pathogenic fungi.
Affiliation(s)
- David Aciole Barbosa
- Núcleo Integrado de Biotecnologia, Universidade de Mogi das Cruzes (UMC), Mogi das Cruzes, São Paulo, Brazil
- Fabiano Bezerra Menegidio
- Núcleo Integrado de Biotecnologia, Universidade de Mogi das Cruzes (UMC), Mogi das Cruzes, São Paulo, Brazil
- Valquíria Campos Alencar
- Núcleo Integrado de Biotecnologia, Universidade de Mogi das Cruzes (UMC), Mogi das Cruzes, São Paulo, Brazil
- Rafael S. Gonçalves
- Núcleo Integrado de Biotecnologia, Universidade de Mogi das Cruzes (UMC), Mogi das Cruzes, São Paulo, Brazil
- Renata Ozelami Vilas Boas
- Núcleo Integrado de Biotecnologia, Universidade de Mogi das Cruzes (UMC), Mogi das Cruzes, São Paulo, Brazil
- Daniela Leite Jabes
- Núcleo Integrado de Biotecnologia, Universidade de Mogi das Cruzes (UMC), Mogi das Cruzes, São Paulo, Brazil
- Regina Costa de Oliveira
- Núcleo Integrado de Biotecnologia, Universidade de Mogi das Cruzes (UMC), Mogi das Cruzes, São Paulo, Brazil
- Luiz R. Nunes
- Centro de Ciências Naturais e Humanas, Universidade Federal do ABC (UFABC), São Bernardo do Campo, São Paulo, Brazil
11
Abstract
A basic task in first language acquisition likely involves discovering the boundaries between words or morphemes in input where these basic units are not overtly segmented. A number of unsupervised learning algorithms have been proposed in the last 20 years for these purposes, some of which have been implemented computationally, but whose results remain difficult to compare across papers. We created a tool that is open source, enables reproducible results, and encourages cumulative science in this domain. WordSeg has a modular architecture: It combines a set of corpora description routines, multiple algorithms varying in complexity and cognitive assumptions (including several that were not publicly available, or insufficiently documented), and a rich evaluation package. In the paper, we illustrate the use of this package by analyzing a corpus of child-directed speech in various ways, which further allows us to make recommendations for experimental design of follow-up work. Supplementary materials allow readers to reproduce every result in this paper, and detailed online instructions further enable them to go beyond what we have done. Moreover, the system can be installed within container software that ensures a stable and reliable environment. Finally, by virtue of its modular architecture and transparency, WordSeg can work as an open-source platform, to which other researchers can add their own segmentation algorithms.
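As a rough illustration of how the output of a segmentation algorithm can be scored against a gold standard, the sketch below computes boundary precision, recall, and F-score. It is not WordSeg's API or its evaluation package: the `boundaries` and `boundary_scores` helpers and the toy utterances are assumptions made for this example, and utterance-initial and utterance-final edges are deliberately not counted as boundaries.

```python
def boundaries(segmented: str) -> set[int]:
    """Return internal word-boundary positions, as character offsets
    in the utterance with spaces removed."""
    positions: set[int] = set()
    offset = 0
    words = segmented.split()
    for word in words[:-1]:  # the utterance end is given, not learned
        offset += len(word)
        positions.add(offset)
    return positions


def boundary_scores(gold: list[str], predicted: list[str]) -> tuple[float, float, float]:
    """Boundary precision, recall, and F-score pooled over all utterances."""
    true_pos = n_pred = n_gold = 0
    for g, p in zip(gold, predicted):
        if g.replace(" ", "") != p.replace(" ", ""):
            raise ValueError("gold and predicted utterances differ in content")
        gold_b, pred_b = boundaries(g), boundaries(p)
        true_pos += len(gold_b & pred_b)
        n_pred += len(pred_b)
        n_gold += len(gold_b)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data: one under-segmented and one over-segmented utterance.
gold = ["the dog barks", "a cat"]
pred = ["thedog barks", "a c at"]
precision, recall, f_score = boundary_scores(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f={f_score:.3f}")
```

Scoring every algorithm with the same function on the same gold corpus is one way to obtain the cross-paper comparability that the abstract argues has been missing.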
12
Menegidio FB, Aciole Barbosa D, Gonçalves RDS, Nishime MM, Jabes DL, Costa de Oliveira R, Nunes LR. Bioportainer Workbench: a versatile and user-friendly system that integrates implementation, management, and use of bioinformatics resources in Docker environments. Gigascience 2019; 8:5479503. [PMID: 31222200 PMCID: PMC6482343 DOI: 10.1093/gigascience/giz041] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 06/20/2018] [Revised: 12/18/2018] [Accepted: 03/21/2019] [Indexed: 11/14/2022] [Open access]
Abstract
Background: The Docker project provides a promising strategy for the development of virtualization systems in bioinformatics. However, implementing, managing, and launching Docker containers is not entirely trivial for users who are not familiar with command-line interfaces. This has prompted the development of graphical user interfaces to facilitate the interaction of inexperienced users with Docker environments. Results: We describe the BioPortainer Workbench, an integrated Docker system that assists inexperienced users in interacting with a bioinformatics-dedicated Docker environment at 3 main levels: (i) infrastructure, (ii) platform, and (iii) application. Conclusions: The BioPortainer Workbench represents a pioneering effort in developing a comprehensive and easy-to-use Docker platform focused on bioinformatics, which may greatly assist in the dissemination of Docker virtualization technology in this complex field of research.
Affiliation(s)
- Fabiano B Menegidio
- Núcleo Integrado de Biotecnologia, Universidade de Mogi das Cruzes (UMC), Av. Dr. Cândido Xavier de Almeida Souza, 200, Mogi das Cruzes, SP - 08780-911, Brazil
- David Aciole Barbosa
- Núcleo Integrado de Biotecnologia, Universidade de Mogi das Cruzes (UMC), Av. Dr. Cândido Xavier de Almeida Souza, 200, Mogi das Cruzes, SP - 08780-911, Brazil
- Rafael Dos S Gonçalves
- Núcleo Integrado de Biotecnologia, Universidade de Mogi das Cruzes (UMC), Av. Dr. Cândido Xavier de Almeida Souza, 200, Mogi das Cruzes, SP - 08780-911, Brazil
- Marcio M Nishime
- Núcleo Integrado de Biotecnologia, Universidade de Mogi das Cruzes (UMC), Av. Dr. Cândido Xavier de Almeida Souza, 200, Mogi das Cruzes, SP - 08780-911, Brazil
- Daniela L Jabes
- Núcleo Integrado de Biotecnologia, Universidade de Mogi das Cruzes (UMC), Av. Dr. Cândido Xavier de Almeida Souza, 200, Mogi das Cruzes, SP - 08780-911, Brazil
- Regina Costa de Oliveira
- Núcleo Integrado de Biotecnologia, Universidade de Mogi das Cruzes (UMC), Av. Dr. Cândido Xavier de Almeida Souza, 200, Mogi das Cruzes, SP - 08780-911, Brazil
- Luiz R Nunes
- Centro de Ciências Naturais e Humanas, Universidade Federal do ABC (UFABC), Alameda da Universidade, s/n, São Bernardo do Campo, SP - 09606-045, Brazil
13
Vázquez N, López-Fernández H, Vieira CP, Fdez-Riverola F, Vieira J, Reboiro-Jato M. BDBM 1.0: A Desktop Application for Efficient Retrieval and Processing of High-Quality Sequence Data and Application to the Identification of the Putative Coffea S-Locus. Interdiscip Sci 2019; 11:57-67. [PMID: 30712176 DOI: 10.1007/s12539-019-00320-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 08/24/2018] [Revised: 01/22/2019] [Accepted: 01/24/2019] [Indexed: 11/25/2022]
Abstract
Nowadays, bioinformatics is one of the most important areas in modern biology and the creation of high-quality scientific software supporting this recent research area is one of the core activities of many researchers. In this context, high-quality sequence datasets are needed to perform inferences on the evolution of species, genes, and gene families, or to get evidence for adaptive amino acid evolution, among others. Nevertheless, sequence data are very often spread over several databases, many useful genomes and transcriptomes are non-annotated, the available annotation is not for the desired coding sequence isoform, and/or is unlikely to be accurate. Moreover, although the FASTA text-based format is quite simple and usable by most software applications, there are a number of issues that may be critical depending on the software used to analyse such files. Therefore, researchers without training in informatics often use a fraction of all available data. The above issues can be addressed using already available software applications, but there is no easy-to-use single piece of software that allows performing all these tasks within the same graphical interface, such as the one here presented, named BDBM (Blast DataBase Manager). BDBM can be used to efficiently get gene sequences from annotated and non-annotated genomes and transcriptomes. Moreover, it can be used to look for alternatives to existing annotations and to easily create reliable custom databases. Such databases are essential to prepare high-quality datasets. The analyses that we have performed on the Coffea canephora genome using BDBM aimed at the identification of the S-locus region (that harbours the genes involved in gametophytic self-incompatibility) led to the conclusion that there are two likely regions, one on chromosome 2 (around region 6600000-6650000), and another on chromosome 5 (around 15830000-15930000). 
Such findings are discussed in the context of the Rubiaceae gametophytic self-incompatibility evolution.
Affiliation(s)
- Noé Vázquez
  - ESEI-Escuela Superior de Ingeniería Informática, Universidade de Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
- Hugo López-Fernández
  - ESEI-Escuela Superior de Ingeniería Informática, Universidade de Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
  - SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
  - Instituto de Biologia Molecular e Celular (IBMC), Rua Alfredo Allen, 208, 4200-135, Porto, Portugal
  - Instituto de Investigação e Inovação em Saúde (I3S), Universidade do Porto, Rua Alfredo Allen, 208, 4200-135, Porto, Portugal
- Cristina P Vieira
  - Instituto de Biologia Molecular e Celular (IBMC), Rua Alfredo Allen, 208, 4200-135, Porto, Portugal
  - Instituto de Investigação e Inovação em Saúde (I3S), Universidade do Porto, Rua Alfredo Allen, 208, 4200-135, Porto, Portugal
- Florentino Fdez-Riverola
  - ESEI-Escuela Superior de Ingeniería Informática, Universidade de Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
  - SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
- Jorge Vieira
  - Instituto de Biologia Molecular e Celular (IBMC), Rua Alfredo Allen, 208, 4200-135, Porto, Portugal
  - Instituto de Investigação e Inovação em Saúde (I3S), Universidade do Porto, Rua Alfredo Allen, 208, 4200-135, Porto, Portugal
- Miguel Reboiro-Jato
  - ESEI-Escuela Superior de Ingeniería Informática, Universidade de Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
  - SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
14
Vaidyam A, Halamka J, Torous J. Actionable digital phenotyping: a framework for the delivery of just-in-time and longitudinal interventions in clinical healthcare. Mhealth 2019; 5:25. [PMID: 31559270 PMCID: PMC6737424 DOI: 10.21037/mhealth.2019.07.04] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 01/09/2019] [Accepted: 07/15/2019] [Indexed: 11/06/2022] Open
Abstract
Designed to improve health, today numerous wearables and smartphone apps are used by millions across the world. Yet the wealth of data generated from the many sensors on these wearables and smartwatches has not yet transformed routine clinical care. One central reason for this gap between data and clinical insights is the lack of transparency and standards around data generated from mobile devices, which hinders interoperability and reproducibility. The clinical informatics community has offered solutions via the Fast Healthcare Interoperability Resources (FHIR) standard, which facilitates electronic health record interoperability but is less developed towards precision temporal contextually-tagged sensor measurements generated from today's ubiquitous mobile devices. In this paper we explore the opportunities and challenges of various theoretical approaches towards FHIR-compatible digital phenotyping, and offer a concrete example implementing one such framework as an Application Programming Interface (API) for the open-source mindLAMP platform. We aim to build a community with contributions from statisticians, clinicians, patients, family members, researchers, designers, engineers, and more.
Affiliation(s)
- Aditya Vaidyam
  - Department of Psychiatry, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
- John Halamka
  - Department of Emergency Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
- John Torous
  - Department of Psychiatry, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
15
Liu DM, Salganik MJ. Successes and Struggles with Computational Reproducibility: Lessons from the Fragile Families Challenge. Socius: Sociological Research for a Dynamic World 2019; 5:10.1177/2378023119849803. [PMID: 37309413 PMCID: PMC10260256 DOI: 10.1177/2378023119849803] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 06/14/2023]
Abstract
Reproducibility is fundamental to science, and an important component of reproducibility is computational reproducibility: the ability of a researcher to recreate the results of a published study using the original author's raw data and code. Although most people agree that computational reproducibility is important, it is still difficult to achieve in practice. In this article, the authors describe their approach to enabling computational reproducibility for the 12 articles in this special issue of Socius about the Fragile Families Challenge. The approach draws on two tools commonly used by professional software engineers but not widely used by academic researchers: software containers (e.g., Docker) and cloud computing (e.g., Amazon Web Services). These tools made it possible to standardize the computing environment around each submission, which will ease computational reproducibility both today and in the future. Drawing on their successes and struggles, the authors conclude with recommendations to researchers and journals.
16
Wagholikar KB, Dessai P, Sanz J, Mendis ME, Bell DS, Murphy SN. Implementation of informatics for integrating biology and the bedside (i2b2) platform as Docker containers. BMC Med Inform Decis Mak 2018; 18:66. [PMID: 30012140 PMCID: PMC6048900 DOI: 10.1186/s12911-018-0646-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 10/01/2017] [Accepted: 06/27/2018] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND Informatics for Integrating Biology and the Bedside (i2b2) is an open source clinical data analytics platform used at over 200 healthcare institutions for querying patient data. The i2b2 platform has several components with numerous dependencies and configuration parameters, which renders the task of installing or upgrading i2b2 a challenging one. Even with the availability of extensive documentation and tutorials, new users often require several weeks to correctly install a functional i2b2 platform. The goal of this work is to simplify the installation and upgrade process for i2b2. Specifically, we have containerized the core components of the platform, and evaluated the containers for ease of installation. RESULTS We developed three Docker container images: WildFly, database, and web, to encapsulate the three major deployment components of i2b2. These containers isolate the core functionalities of the i2b2 platform, and work in unison to provide its functionalities. Our evaluations indicate that i2b2 containers function successfully on the Linux platform. Our results demonstrate that the containerized components work out-of-the-box, with minimal configuration. CONCLUSIONS Containerization offers the potential to package the i2b2 platform components into standalone executable packages that are agnostic to the underlying host operating system. By releasing i2b2 as a Docker container, we anticipate that users will be able to create a working i2b2 hive installation without the need to download, compile, and configure individual components that constitute the i2b2 cells, thus making this platform accessible to a greater number of institutions.
Affiliation(s)
- Pralav Dessai
  - University of California Los Angeles, Los Angeles, CA, USA
- Javier Sanz
  - University of California Los Angeles, Los Angeles, CA, USA
- Shawn N. Murphy
  - Massachusetts General Hospital, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
17
Mittal V, Hung LH, Keswani J, Kristiyanto D, Lee SB, Yeung KY. GUIdock-VNC: using a graphical desktop sharing system to provide a browser-based interface for containerized software. Gigascience 2018; 6:1-6. [PMID: 28327936 PMCID: PMC5530313 DOI: 10.1093/gigascience/giw013] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 09/09/2016] [Accepted: 12/16/2016] [Indexed: 11/30/2022] Open
Abstract
Background: Software container technology such as Docker can be used to package and distribute bioinformatics workflows consisting of multiple software implementations and dependencies. However, Docker is a command line–based tool, and many bioinformatics pipelines consist of components that require a graphical user interface. Results: We present a container tool called GUIdock-VNC that uses a graphical desktop sharing system to provide a browser-based interface for containerized software. GUIdock-VNC uses the Virtual Network Computing protocol to render the graphics within most commonly used browsers. We also present a minimal image builder that can add our proposed graphical desktop sharing system to any Docker package, with the end result that any Docker package can be run using a graphical desktop within a browser. In addition, GUIdock-VNC uses the OAuth2 authentication protocol when deployed on the cloud. Conclusions: As a proof-of-concept, we demonstrated the utility of GUIdock-noVNC in gene network inference. We benchmarked our container implementation on various operating systems and showed that our solution creates minimal overhead.
18
Almugbel R, Hung LH, Hu J, Almutairy A, Ortogero N, Tamta Y, Yeung KY. Reproducible Bioconductor workflows using browser-based interactive notebooks and containers. J Am Med Inform Assoc 2018; 25:4-12. [PMID: 29092073 PMCID: PMC6381817 DOI: 10.1093/jamia/ocx120] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 05/30/2017] [Revised: 08/31/2017] [Accepted: 09/28/2017] [Indexed: 11/14/2022] Open
Abstract
Objective Bioinformatics publications typically include complex software workflows that are difficult to describe in a manuscript. We describe and demonstrate the use of interactive software notebooks to document and distribute bioinformatics research. We provide a user-friendly tool, BiocImageBuilder, that allows users to easily distribute their bioinformatics protocols through interactive notebooks uploaded to either a GitHub repository or a private server. Materials and methods We present four different interactive Jupyter notebooks using R and Bioconductor workflows to infer differential gene expression, analyze cross-platform datasets, process RNA-seq data and KinomeScan data. These interactive notebooks are available on GitHub. The analytical results can be viewed in a browser. Most importantly, the software contents can be executed and modified. This is accomplished using Binder, which runs the notebook inside software containers, thus avoiding the need to install any software and ensuring reproducibility. All the notebooks were produced using custom files generated by BiocImageBuilder. Results BiocImageBuilder facilitates the publication of workflows with a point-and-click user interface. We demonstrate that interactive notebooks can be used to disseminate a wide range of bioinformatics analyses. The use of software containers to mirror the original software environment ensures reproducibility of results. Parameters and code can be dynamically modified, allowing for robust verification of published results and encouraging rapid adoption of new methods. Conclusion Given the increasing complexity of bioinformatics workflows, we anticipate that these interactive software notebooks will become as necessary for documenting software methods as traditional laboratory notebooks have been for documenting bench protocols, and as ubiquitous.
Affiliation(s)
- Reem Almugbel
  - Institute of Technology, University of Washington, Tacoma, WA, USA
- Ling-Hong Hung
  - Institute of Technology, University of Washington, Tacoma, WA, USA
- Jiaming Hu
  - Institute of Technology, University of Washington, Tacoma, WA, USA
- Abeer Almutairy
  - Institute of Technology, University of Washington, Tacoma, WA, USA
- Nicole Ortogero
  - Department of Clinical Investigation, Madigan Army Medical Center, Tacoma, WA, USA
- Yashaswi Tamta
  - Institute of Technology, University of Washington, Tacoma, WA, USA
- Ka Yee Yeung
  - Institute of Technology, University of Washington, Tacoma, WA, USA
19
Kim B, Ali T, Lijeron C, Afgan E, Krampis K. Bio-Docklets: virtualization containers for single-step execution of NGS pipelines. Gigascience 2017; 6:1-7. [PMID: 28854616 PMCID: PMC5569920 DOI: 10.1093/gigascience/gix048] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 04/24/2017] [Revised: 06/13/2017] [Accepted: 06/14/2017] [Indexed: 11/12/2022] Open
Abstract
Processing of next-generation sequencing (NGS) data requires significant technical skills, involving installation, configuration, and execution of bioinformatics data pipelines, in addition to specialized postanalysis visualization and data mining software. In order to address some of these challenges, developers have leveraged virtualization containers toward seamless deployment of preconfigured bioinformatics software and pipelines on any computational platform. We present an approach for abstracting the complex data operations of multistep bioinformatics pipelines for NGS data analysis. As examples, we have deployed 2 pipelines for RNA sequencing and chromatin immunoprecipitation sequencing, preconfigured within Docker virtualization containers we call Bio-Docklets. Each Bio-Docklet exposes a single data input and output endpoint and, from a user perspective, running the pipelines is as simple as running a single bioinformatics tool. This is achieved using a "meta-script" that automatically starts the Bio-Docklets and controls the pipeline execution through the BioBlend software library and the Galaxy Application Programming Interface. The pipeline output is postprocessed by integration with the Visual Omics Explorer framework, providing interactive data visualizations that users can access through a web browser. Our goal is to enable easy access to NGS data analysis pipelines for nonbioinformatics experts on any computing environment, whether a laboratory workstation, university computer cluster, or a cloud service provider. Beyond end users, Bio-Docklets also enable developers to programmatically deploy and run a large number of pipeline instances for concurrent analysis of multiple datasets.
Affiliation(s)
- Baekdoo Kim
  - Center for Translational and Basic Research and Belfer Research Building, Hunter College of The City University of New York, 413 E 69th St, New York, NY 10021
- Thahmina Ali
  - Center for Translational and Basic Research and Belfer Research Building, Hunter College of The City University of New York, 413 E 69th St, New York, NY 10021
- Carlos Lijeron
  - Center for Translational and Basic Research and Belfer Research Building, Hunter College of The City University of New York, 413 E 69th St, New York, NY 10021
- Enis Afgan
  - Johns Hopkins University, Department of Biology, B3400 N Charles St, Mudd Hall 144, Baltimore MD 21218
- Konstantinos Krampis
  - Center for Translational and Basic Research and Belfer Research Building, Hunter College of The City University of New York, 413 E 69th St, New York, NY 10021
  - Department of Biological Sciences, Hunter College of The City University of New York, 695 Park Av., New York, NY, 10065
  - Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medical College, 413 E 69th St, New York, NY 10021
20
Costa RL, Gadelha L, Ribeiro-Alves M, Porto F. GeNNet: an integrated platform for unifying scientific workflows and graph databases for transcriptome data analysis. PeerJ 2017; 5:e3509. [PMID: 28695067 PMCID: PMC5501156 DOI: 10.7717/peerj.3509] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 03/26/2017] [Accepted: 06/06/2017] [Indexed: 12/28/2022] Open
Abstract
There are many steps in analyzing transcriptome data, from the acquisition of raw data to the selection of a subset of representative genes that explain a scientific hypothesis. The data produced can be represented as networks of interactions among genes and these may additionally be integrated with other biological databases, such as Protein-Protein Interactions, transcription factors and gene annotation. However, the results of these analyses remain fragmented, imposing difficulties, either for posterior inspection of results, or for meta-analysis by the incorporation of new related data. Integrating databases and tools into scientific workflows, orchestrating their execution, and managing the resulting data and its respective metadata are challenging tasks. Additionally, a great amount of effort is equally required to run in-silico experiments to structure and compose the information as needed for analysis. Different programs may need to be applied and different files are produced during the experiment cycle. In this context, the availability of a platform supporting experiment execution is paramount. We present GeNNet, an integrated transcriptome analysis platform that unifies scientific workflows with graph databases for selecting relevant genes according to the evaluated biological systems. It includes GeNNet-Wf, a scientific workflow that pre-loads biological data, pre-processes raw microarray data and conducts a series of analyses including normalization, differential expression inference, clusterization and gene set enrichment analysis. A user-friendly web interface, GeNNet-Web, allows for setting parameters, executing, and visualizing the results of GeNNet-Wf executions. To demonstrate the features of GeNNet, we performed case studies with data retrieved from GEO, particularly using a single-factor experiment in different analysis scenarios. As a result, we obtained differentially expressed genes for which biological functions were analyzed. 
The results are integrated into GeNNet-DB, a database about genes, clusters, experiments and their properties and relationships. The resulting graph database is explored with queries that demonstrate the expressiveness of this data model for reasoning about gene interaction networks. GeNNet is the first platform to integrate the analytical process of transcriptome data with graph databases. It provides a comprehensive set of tools that would otherwise be challenging for non-expert users to install and use. Developers can add new functionality to components of GeNNet. The derived data allows for testing previous hypotheses about an experiment and exploring new ones through the interactive graph database environment. It enables the analysis of different data on humans, rhesus, mice and rat coming from Affymetrix platforms. GeNNet is available as an open source platform at https://github.com/raquele/GeNNet and can be retrieved as a software container with the command docker pull quelopes/gennet.
Affiliation(s)
- Raquel L. Costa
  - DEXL Lab, National Laboratory for Scientific Computing (LNCC), Petrópolis, Rio de Janeiro, Brazil
  - National Institute of Cancer (INCA), Rio de Janeiro, RJ, Brazil
- Luiz Gadelha
  - DEXL Lab, National Laboratory for Scientific Computing (LNCC), Petrópolis, Rio de Janeiro, Brazil
- Marcelo Ribeiro-Alves
  - Laboratory of Clinical Research in DST- AIDS, National Institute of Infectology Evandro Chagas, Oswaldo Cruz Foundation, Rio de Janeiro, Brazil
- Fábio Porto
  - DEXL Lab, National Laboratory for Scientific Computing (LNCC), Petrópolis, Rio de Janeiro, Brazil
21
Brohi RD, Wang L, Hassine NB, Cao J, Talpur HS, Wu D, Huang CJ, Rehman ZU, Bhattarai D, Huo LJ. Expression, Localization of SUMO-1, and Analyses of Potential SUMOylated Proteins in Bubalus bubalis Spermatozoa. Front Physiol 2017; 8:354. [PMID: 28659810 PMCID: PMC5468435 DOI: 10.3389/fphys.2017.00354] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/16/2017] [Accepted: 05/15/2017] [Indexed: 11/19/2022] Open
Abstract
Mature spermatozoa have highly condensed DNA that is essentially silent both transcriptionally and translationally. Therefore, post-translational modifications are very important for regulating sperm motility and morphology, and for male fertility in general. Protein sumoylation was recently demonstrated in human and rodent spermatozoa, with potential consequences for sperm motility and DNA integrity. We examined the expression and localization of small ubiquitin-related modifier-1 (SUMO-1) in the sperm of water buffalo (Bubalus bubalis) using immunofluorescence analysis. We confirmed the expression of SUMO-1 in the acrosome. We further found that SUMO-1 was lost if the acrosome reaction was induced by calcium ionophore A23187. Proteins modified or conjugated by SUMO-1 in water buffalo sperm were pulled down and analyzed by mass spectrometry. Sixty proteins were identified, including proteins important for sperm morphology and motility, such as relaxin receptors and cytoskeletal proteins, including tubulin chains, actins, and dyneins. Forty-six proteins were predicted as potential sumoylation targets. The expression of SUMO-1 in the acrosome region of water buffalo sperm and the identification of potentially SUMOylated proteins important for sperm function implicate sumoylation as a crucial post-translational modification (PTM) related to sperm function.
Affiliation(s)
- Rahim Dad Brohi
  - Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Education Ministry of China, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
  - Department of Hubei Province's Engineering Research Center in Buffalo Breeding and Products, Wuhan, China
- Li Wang
  - Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Education Ministry of China, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
  - Department of Hubei Province's Engineering Research Center in Buffalo Breeding and Products, Wuhan, China
- Jing Cao
  - Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Education Ministry of China, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
  - Department of Hubei Province's Engineering Research Center in Buffalo Breeding and Products, Wuhan, China
- Hira Sajjad Talpur
  - Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Education Ministry of China, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
  - Department of Hubei Province's Engineering Research Center in Buffalo Breeding and Products, Wuhan, China
- Di Wu
  - Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Education Ministry of China, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
  - Department of Hubei Province's Engineering Research Center in Buffalo Breeding and Products, Wuhan, China
- Chun-Jie Huang
  - Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Education Ministry of China, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
  - Department of Hubei Province's Engineering Research Center in Buffalo Breeding and Products, Wuhan, China
- Zia-Ur Rehman
  - Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Education Ministry of China, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
  - Department of Hubei Province's Engineering Research Center in Buffalo Breeding and Products, Wuhan, China
- Dinesh Bhattarai
  - Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Education Ministry of China, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
  - Department of Hubei Province's Engineering Research Center in Buffalo Breeding and Products, Wuhan, China
- Li-Jun Huo
  - Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Education Ministry of China, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
  - Department of Hubei Province's Engineering Research Center in Buffalo Breeding and Products, Wuhan, China
22
List M. Using Docker Compose for the Simple Deployment of an Integrated Drug Target Screening Platform. J Integr Bioinform 2017; 14:/j/jib.ahead-of-print/jib-2017-0016/jib-2017-0016.xml. [PMID: 28600904 PMCID: PMC6042832 DOI: 10.1515/jib-2017-0016] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 03/21/2017] [Accepted: 04/18/2017] [Indexed: 12/28/2022] Open
Abstract
Docker virtualization allows for software tools to be executed in an isolated and controlled environment referred to as a container. In Docker containers, dependencies are provided exactly as intended by the developer and, consequently, they simplify the distribution of scientific software and foster reproducible research. The Docker paradigm is that each container encapsulates one particular software tool. However, to analyze complex biomedical data sets, it is often necessary to combine several software tools into elaborate workflows. To address this challenge, several Docker containers need to be instantiated and properly integrated, which complicates the software deployment process unnecessarily. Here, we demonstrate how an extension to Docker, Docker Compose, can be used to mitigate these problems by providing a unified setup routine that deploys several tools in an integrated fashion. We demonstrate the power of this approach with the example of a Docker Compose setup for a drug target screening platform consisting of five integrated web applications and shared infrastructure, deployable in just two lines of code.
23
Schulz WL, Durant TJS, Siddon AJ, Torres R. Use of application containers and workflows for genomic data analysis. J Pathol Inform 2016; 7:53. [PMID: 28163975 PMCID: PMC5248400 DOI: 10.4103/2153-3539.197197] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 09/12/2016] [Accepted: 11/27/2016] [Indexed: 11/29/2022] Open
Abstract
Background: The rapid acquisition of biological data and development of computationally intensive analyses has led to a need for novel approaches to software deployment. In particular, the complexity of common analytic tools for genomics makes them difficult to deploy and decreases the reproducibility of computational experiments. Methods: Recent technologies that allow for application virtualization, such as Docker, allow developers and bioinformaticians to isolate these applications and deploy secure, scalable platforms that have the potential to dramatically increase the efficiency of big data processing. Results: While limitations exist, this study demonstrates a successful implementation of a pipeline with several discrete software applications for the analysis of next-generation sequencing (NGS) data. Conclusions: With this approach, we significantly reduced the amount of time needed to perform clonal analysis from NGS data in acute myeloid leukemia.
Affiliation(s)
- Wade L Schulz
  - Department of Laboratory Medicine, Yale University School of Medicine, New Haven, CT, USA
- Thomas J S Durant
  - Department of Laboratory Medicine, Yale University School of Medicine, New Haven, CT, USA
- Alexa J Siddon
  - Department of Laboratory Medicine, Yale University School of Medicine, New Haven, CT, USA
  - Pathology and Laboratory Medicine Service, VA Connecticut Healthcare System, West Haven, CT, USA
- Richard Torres
  - Department of Laboratory Medicine, Yale University School of Medicine, New Haven, CT, USA