1
Investigating the overlap of machine learning algorithms in the final results of RNA-seq analysis on gene expression estimation. Health Inf Sci Syst 2024; 12:14. [PMID: 38435719 PMCID: PMC10904690 DOI: 10.1007/s13755-023-00265-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2023] [Accepted: 12/05/2023] [Indexed: 03/05/2024] Open
Abstract
Advances in computer science, in combination with next-generation sequencing, have introduced a new era in biology, enabling state-of-the-art analysis of complex biological data. Bioinformatics is evolving as a field uniting computer science and biology, enabling the representation, storage, management, analysis and exploration of many types of data with a plethora of machine learning algorithms and computing tools. In this study, we used machine learning algorithms to detect differentially expressed genes between different types of cancer and to show that their results overlap with the final results of RNA-sequencing analysis. The datasets were obtained from the National Center for Biotechnology Information resource, specifically dataset GSE68086, which corresponds to PMID:200,068,086. This dataset consists of 171 blood platelet samples collected from patients with six different tumors and healthy individuals. All steps of RNA-sequencing analysis (preprocessing, read alignment, transcriptome reconstruction, expression quantification and differential expression analysis) were followed. Machine learning-based Random Forest and Gradient Boosting algorithms were applied to predict significant genes. The RStudio statistical tool was used for the analysis.
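The overlap between a differential expression result and a machine learning gene ranking can be quantified in several ways. The sketch below uses the shared gene set and the Jaccard index; this is an illustrative choice, since the abstract does not state which measure the authors used. The function name and the gene symbols are hypothetical.

```python
def gene_overlap(rnaseq_genes, ml_genes):
    """Return the shared genes (sorted) and the Jaccard index of two gene lists."""
    a, b = set(rnaseq_genes), set(ml_genes)
    shared = a & b
    union = a | b
    # Jaccard index: size of the intersection divided by size of the union.
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard


# Hypothetical gene symbols, for illustration only.
de_genes = ["GENE_A", "GENE_B", "GENE_C"]   # e.g. from differential expression
rf_genes = ["GENE_B", "GENE_C", "GENE_D"]   # e.g. top Random Forest features
shared, jaccard = gene_overlap(de_genes, rf_genes)
```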
2
A Bioconductor workflow for processing, evaluating, and interpreting expression proteomics data. F1000Res 2023; 12:1402. [PMID: 38021401 PMCID: PMC10683783 DOI: 10.12688/f1000research.139116.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/15/2023] [Indexed: 12/01/2023] Open
Abstract
Background: Expression proteomics involves the global evaluation of protein abundances within a system. In turn, differential expression analysis can be used to investigate changes in protein abundance upon perturbation to such a system. Methods: Here, we provide a workflow for the processing, analysis and interpretation of quantitative mass spectrometry-based expression proteomics data. This workflow utilizes open-source R software packages from the Bioconductor project and guides users end-to-end and step-by-step through every stage of the analyses. As a use-case we generated expression proteomics data from HEK293 cells with and without a treatment. Of note, the experiment included cellular proteins labelled using tandem mass tag (TMT) technology and secreted proteins quantified using label-free quantitation (LFQ). Results: The workflow explains the software infrastructure before focusing on data import, pre-processing and quality control. This is done individually for TMT and LFQ datasets. The application of statistical differential expression analysis is demonstrated, followed by interpretation via gene ontology enrichment analysis. Conclusions: A comprehensive workflow for the processing, analysis and interpretation of expression proteomics is presented. The workflow is a valuable resource for the proteomics community, and specifically for beginners who are at least familiar with R and wish to understand and make data-driven decisions with regard to their analyses.
3
Data-driven identification of total RNA expression genes for estimation of RNA abundance in heterogeneous cell types highlighted in brain tissue. Genome Biol 2023; 24:233. [PMID: 37845779 PMCID: PMC10578035 DOI: 10.1186/s13059-023-03066-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/28/2022] [Accepted: 09/20/2023] [Indexed: 10/18/2023] Open
Abstract
We define and identify a new class of control genes for next-generation sequencing called total RNA expression genes (TREGs), which correlate with total RNA abundance in cell types of different sizes and transcriptional activity. We provide a data-driven method to identify TREGs from single-cell RNA sequencing data, allowing the estimation of total amount of RNA when restricted to quantifying a limited number of genes. We demonstrate our method in postmortem human brain using multiplex single-molecule fluorescent in situ hybridization and compare candidate TREGs against classic housekeeping genes. We identify AKT3 as a top TREG across five brain regions.
4
BiocMAP: a Bioconductor-friendly, GPU-accelerated pipeline for bisulfite-sequencing data. BMC Bioinformatics 2023; 24:340. [PMID: 37704947 PMCID: PMC10498615 DOI: 10.1186/s12859-023-05461-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2022] [Accepted: 08/31/2023] [Indexed: 09/15/2023] Open
Abstract
BACKGROUND Bisulfite sequencing is a powerful tool for profiling genomic methylation, an epigenetic modification critical in the understanding of cancer, psychiatric disorders, and many other conditions. Raw data generated by whole genome bisulfite sequencing (WGBS) requires several computational steps before it is ready for statistical analysis, and particular care is required to process data in a timely and memory-efficient manner. Alignment to a reference genome is one of the most computationally demanding steps in a WGBS workflow, taking several hours or even days with commonly used WGBS-specific alignment software. This naturally motivates the creation of computational workflows that can utilize GPU-based alignment software to greatly speed up the bottleneck step. In addition, WGBS produces raw data that is large and often unwieldy; a lack of memory-efficient representation of data by existing pipelines renders WGBS impractical or impossible to many researchers. RESULTS We present BiocMAP, a Bioconductor-friendly methylation analysis pipeline consisting of two modules, to address the above concerns. The first module performs computationally-intensive read alignment using Arioc, a GPU-accelerated short-read aligner. Since GPUs are not always available on the same computing environments where traditional CPU-based analyses are convenient, the second module may be run in a GPU-free environment. This module extracts and merges DNA methylation proportions, that is, the fractions of methylated cytosines across all cells in a sample at a given genomic site. Bioconductor-based output objects in R utilize an on-disk data representation to drastically reduce required main memory and make WGBS projects computationally feasible to more researchers. CONCLUSIONS BiocMAP is implemented using Nextflow and available at http://research.libd.org/BiocMAP/. To enable reproducible analysis across a variety of typical computing environments, BiocMAP can be containerized with Docker or Singularity, and executed locally or with the SLURM or SGE scheduling engines. By providing Bioconductor objects, BiocMAP's output can be integrated with powerful analytical open source software for analyzing methylation data.
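A methylation proportion at a site is commonly estimated from read counts as methylated / (methylated + unmethylated). The sketch below illustrates that arithmetic only; it is not BiocMAP's implementation, and the function name and input layout are assumptions made for the example.

```python
def methylation_proportions(counts):
    """Map each site to methylated / (methylated + unmethylated) read counts.

    counts: dict mapping a site key, here (chromosome, position), to a
    (methylated_reads, unmethylated_reads) pair. Sites with no coverage
    get None rather than a misleading 0.0.
    """
    props = {}
    for site, (meth, unmeth) in counts.items():
        total = meth + unmeth
        props[site] = meth / total if total > 0 else None
    return props


# Hypothetical read counts at two cytosine sites.
example = {("chr1", 100): (3, 1), ("chr1", 200): (0, 0)}
props = methylation_proportions(example)
```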
5
MBECS: Microbiome Batch Effects Correction Suite. BMC Bioinformatics 2023; 24:182. [PMID: 37138207 PMCID: PMC10155362 DOI: 10.1186/s12859-023-05252-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Accepted: 03/20/2023] [Indexed: 05/05/2023] Open
Abstract
Despite the availability of batch effect correcting algorithms (BECA), no comprehensive tool that combines batch correction and evaluation of the results exists for microbiome datasets. This work outlines the Microbiome Batch Effects Correction Suite development that integrates several BECAs and evaluation metrics into a software package for the statistical computation framework R.
6
Quality Control for the Target Decoy Approach for Peptide Identification. J Proteome Res 2023; 22:350-358. [PMID: 36648107 DOI: 10.1021/acs.jproteome.2c00423] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/18/2023]
Abstract
Reliable peptide identification is key in mass spectrometry (MS) based proteomics. To this end, the target decoy approach (TDA) has become the cornerstone for extracting a set of reliable peptide-to-spectrum matches (PSMs) that will be used in downstream analysis. Indeed, TDA is now the default method to estimate the false discovery rate (FDR) for a given set of PSMs, and users typically view it as a universal solution for assessing the FDR in the peptide identification step. However, the TDA also relies on a minimal set of assumptions, which are typically never verified in practice. We argue that a violation of these assumptions can lead to poor FDR control, which can be detrimental to any downstream data analysis. We here therefore first clearly spell out these TDA assumptions, and introduce TargetDecoy, a Bioconductor package with all the necessary functionality to control the TDA quality and its underlying assumptions for a given set of PSMs.
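For orientation, the basic TDA estimate for a concatenated target/decoy search divides the number of decoy PSMs above a score threshold by the number of target PSMs above it. The sketch below shows only that ratio; it is not the TargetDecoy package's API, and the function name and the (score, is_decoy) input layout are assumptions. The estimate is only meaningful when the assumptions discussed in the abstract hold.

```python
def estimate_fdr(psms, threshold):
    """Estimate the FDR among target PSMs scoring at or above threshold.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    Returns decoys / targets above the threshold, or 0.0 if no target passes.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score >= threshold:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    return decoys / targets if targets else 0.0


# Hypothetical PSM scores, for illustration only.
psms = [(10, False), (9, False), (8, True), (7, False), (6, False), (5, True)]
fdr_strict = estimate_fdr(psms, threshold=9)   # no decoys pass
fdr_loose = estimate_fdr(psms, threshold=5)    # 2 decoys, 4 targets
```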
7
RNA Preparation and RNA-Seq Bioinformatics for Comparative Transcriptomics. Methods Mol Biol 2023; 2704:99-113. [PMID: 37642840 DOI: 10.1007/978-1-0716-3385-4_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/31/2023]
Abstract
The principal transcriptome analysis is the determination of differentially expressed genes across experimental conditions. For this, next-generation sequencing of RNA (RNA-seq) has several advantages over other techniques: the capability of detecting all transcripts in one assay, compared with RT-qPCR; higher accuracy and a broader dynamic range than microarrays; and, compared with both techniques, the ability to detect novel transcripts, including non-coding RNA molecules, at nucleotide-level resolution. Despite these advantages, many microbiology laboratories have not yet applied RNA-seq analyses to their investigations. The high cost of the equipment for next-generation sequencing is no longer an issue since this intermediate part of the analysis can be provided by commercial or central services. Here, we detail a protocol for the first part of the analysis, the RNA extraction, and an introductory protocol for the bioinformatics analysis of the sequencing data to generate the differential expression results.
8
Abstract
msmsTests is an R/Bioconductor package providing functions for statistical tests in label-free LC-MS/MS data by spectral counts. These functions aim at discovering differentially expressed proteins between two biological conditions. Three tests are available: Poisson GLM regression, quasi-likelihood GLM regression, and the negative binomial model of the edgeR package. The three models admit blocking factors to control for nuisance variables. To ensure a good level of reproducibility, a post-test filter is available, where (1) a minimum effect size considered biologically relevant, and (2) a minimum expression of the most abundant condition, may be set. A companion package, msmsEDA, provides functions to explore datasets based on msms spectral counts. The provided graphics help in identifying outliers, detecting possible batch factors, and checking the effects of different normalizing strategies. This protocol illustrates the use of both packages on two examples: a purely spike-in experiment of 48 human proteins in a standard yeast cell lysate, and a cancer cell-line secretome dataset requiring a biological normalization.
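The post-test filter described above combines a significance cut-off with a minimum effect size and a minimum expression in the most abundant condition. The sketch below illustrates that logic in plain Python; it does not reproduce the msmsTests functions, and the record fields, the function name, and all default thresholds are hypothetical.

```python
def post_test_filter(results, alpha=0.05, min_abs_log_fc=1.0, min_counts=2.0):
    """Return the proteins that pass significance, effect size and abundance.

    results: list of dicts with keys "protein", "adj_p", "log_fc",
    "mean_a" and "mean_b" (mean spectral counts per condition).
    All thresholds are illustrative defaults, not recommended values.
    """
    kept = []
    for row in results:
        significant = row["adj_p"] <= alpha
        large_effect = abs(row["log_fc"]) >= min_abs_log_fc
        # Abundance is judged on the condition with the higher mean count.
        abundant = max(row["mean_a"], row["mean_b"]) >= min_counts
        if significant and large_effect and abundant:
            kept.append(row["protein"])
    return kept


# Hypothetical test results, for illustration only.
rows = [
    {"protein": "P1", "adj_p": 0.01, "log_fc": 2.0, "mean_a": 10.0, "mean_b": 2.5},
    {"protein": "P2", "adj_p": 0.01, "log_fc": 0.3, "mean_a": 10.0, "mean_b": 8.0},
    {"protein": "P3", "adj_p": 0.01, "log_fc": -3.0, "mean_a": 0.5, "mean_b": 1.5},
    {"protein": "P4", "adj_p": 0.20, "log_fc": 2.5, "mean_a": 12.0, "mean_b": 2.0},
]
passing = post_test_filter(rows)
```

Here P2 fails the effect-size condition, P3 fails the abundance condition, and P4 fails the significance condition.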
9
Katdetectr: an R/bioconductor package utilizing unsupervised changepoint analysis for robust kataegis detection. Gigascience 2022; 12:giad081. [PMID: 37848617 PMCID: PMC10580377 DOI: 10.1093/gigascience/giad081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2023] [Revised: 06/15/2023] [Accepted: 09/12/2023] [Indexed: 10/19/2023] Open
Abstract
BACKGROUND Kataegis refers to the occurrence of regional genomic hypermutation in cancer and is a phenomenon that has been observed in a wide range of malignancies. A kataegis locus constitutes a genomic region with a high mutation rate (i.e., a higher frequency of closely interspersed somatic variants than the overall mutational background). It has been shown that kataegis is of biological significance and possibly clinically relevant. Therefore, an accurate and robust workflow for kataegis detection is paramount. FINDINGS Here we present Katdetectr, an open-source R/Bioconductor-based package for the robust yet flexible and fast detection of kataegis loci in genomic data. In addition, Katdetectr houses functionalities to characterize and visualize kataegis and provides results in a standardized format useful for subsequent analysis. In brief, Katdetectr imports industry-standard formats (MAF, VCF, and VRanges), determines the intermutation distance of the genomic variants, and performs unsupervised changepoint analysis utilizing the Pruned Exact Linear Time search algorithm followed by kataegis calling according to user-defined parameters. We used synthetic data and an a priori labeled pan-cancer dataset of whole-genome sequenced malignancies for the performance evaluation of Katdetectr and 5 publicly available kataegis detection packages. Our performance evaluation shows that Katdetectr is robust regarding tumor mutational burden and shows the fastest mean computation time. Additionally, Katdetectr reveals the highest accuracy (0.99, 0.99) and normalized Matthews correlation coefficient (0.98, 0.92) of all evaluated tools for both datasets. CONCLUSIONS Katdetectr is a robust workflow for the detection, characterization, and visualization of kataegis and is available on Bioconductor: https://doi.org/doi:10.18129/B9.bioc.katdetectr.
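The intermutation distance step can be illustrated independently of the package: for each chromosome, sort the variant positions and take the differences between neighbours. The sketch below covers only that step, under the assumption that variants are given as (chromosome, position) pairs; the changepoint analysis and kataegis calling are not attempted, and the function name is hypothetical.

```python
from collections import defaultdict


def intermutation_distances(variants):
    """Return, per chromosome, the distances between consecutive variants.

    variants: iterable of (chromosome, position) pairs in any order.
    A chromosome with a single variant yields an empty list.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in variants:
        by_chrom[chrom].append(pos)

    distances = {}
    for chrom, positions in by_chrom.items():
        positions.sort()
        # Difference between each position and the one before it.
        distances[chrom] = [b - a for a, b in zip(positions, positions[1:])]
    return distances


# Hypothetical variant positions, for illustration only.
variants = [("chr1", 100), ("chr1", 250), ("chr2", 5), ("chr1", 120)]
imd = intermutation_distances(variants)
```

Runs of unusually short distances in such a list are what a changepoint method would then try to segment.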
10
surfaltr: An R/Bioconductor package to benchmark surface protein isoforms by rapid prediction and visualization of transmembrane topologies. Proteomics 2022; 22:e2200002. [PMID: 35678367 DOI: 10.1002/pmic.202200002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2022] [Revised: 05/20/2022] [Accepted: 06/07/2022] [Indexed: 11/09/2022]
Abstract
Cell surface proteins form a major fraction of the druggable proteome and can be used for tissue-specific delivery of oligonucleotide/cell-based therapeutics. Surface protein isoforms are regulated by alternative splicing, which drives subcellular localization and transmembrane (TM) topology, thereby shaping cell type specific signatures. Current advances in multiomic approaches have spurred interest in the discovery of tissue-specific alternatively spliced or novel surface protein isoforms. However, there exists a need for bioinformatic approaches for rapidly benchmarking the large number of isoforms identified by these approaches. To address this gap, we have developed surfaltr, an R package that takes user input isoforms, pairs them with the known principal isoform of the gene, predicts TM topologies, and generates a customizable graphical output. Further, surfaltr facilitates prioritization of topologically diverse isoform pairs through incorporation of three different ranking metrics and through protein alignment functions. Here, we demonstrate the utility of our R package by evaluating the mouse retina-specific novel surface protein isoforms identified in Ray et al. 2020. surfaltr is freely available through Bioconductor (https://bioconductor.org/packages/surfaltr) and the vignette provides extensive instructions for implementation.
11
Improve consensus partitioning via a hierarchical procedure. Brief Bioinform 2022; 23:bbac048. [PMID: 35289356 PMCID: PMC9116221 DOI: 10.1093/bib/bbac048] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/05/2021] [Revised: 01/20/2022] [Accepted: 01/30/2022] [Indexed: 11/22/2022] Open
Abstract
Consensus partitioning is an unsupervised method widely used in high-throughput data analysis for revealing subgroups and assigning stability for the classification. However, standard consensus partitioning procedures are weak at identifying large numbers of stable subgroups. There are two major issues. First, subgroups with small differences are difficult to separate if they are detected simultaneously with subgroups with large differences. Second, stability of classification generally decreases as the number of subgroups increases. In this work, we propose a new strategy to solve these two issues by applying consensus partitioning in a hierarchical procedure. We demonstrate that hierarchical consensus partitioning can efficiently reveal more meaningful subgroups. We also tested the performance of hierarchical consensus partitioning in revealing a great number of subgroups with a large deoxyribonucleic acid methylation dataset. The hierarchical consensus partitioning is implemented in the R package cola with comprehensive functionalities for analysis and visualization. It can also automate the analysis with as few as two lines of code, generating a detailed HTML report containing the complete analysis. The cola package is available at https://bioconductor.org/packages/cola/.
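The core object in consensus partitioning is the consensus matrix: for each pair of samples, the fraction of repeated partitions in which the two samples fall in the same subgroup. The sketch below builds that matrix from a list of label vectors. It is a simplified illustration, not the cola implementation: it assumes every sample appears in every run (no subsampling), and the function name is hypothetical.

```python
def consensus_matrix(partitions):
    """Return an n x n matrix of co-clustering frequencies.

    partitions: list of label lists, one per clustering run, all of the
    same length n. Entry [i][j] is the fraction of runs in which samples
    i and j received the same label.
    """
    runs = len(partitions)
    n = len(partitions[0])
    together = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("all partitions must label the same samples")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[count / runs for count in row] for row in together]


# Two hypothetical clustering runs over three samples.
runs = [[0, 0, 1], [0, 1, 1]]
consensus = consensus_matrix(runs)
```

Values near 0 or 1 indicate stable pairs; intermediate values, such as the 0.5 entries here, indicate samples whose grouping changes between runs.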
12
GenomicDistributions: fast analysis of genomic intervals with Bioconductor. BMC Genomics 2022; 23:299. [PMID: 35413804 PMCID: PMC9003978 DOI: 10.1186/s12864-022-08467-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2021] [Accepted: 03/13/2022] [Indexed: 11/10/2022] Open
Abstract
Background Epigenome analysis relies on defined sets of genomic regions output by widely used assays such as ChIP-seq and ATAC-seq. Statistical analysis and visualization of genomic region sets is essential to answer biological questions in gene regulation. As the epigenomics community continues generating data, there will be an increasing need for software tools that can efficiently deal with more abundant and larger genomic region sets. Here, we introduce GenomicDistributions, an R package for fast and easy summarization and visualization of genomic region data. Results GenomicDistributions offers a broad selection of functions to calculate properties of genomic region sets, such as feature distances, genomic partition overlaps, and more. GenomicDistributions functions are meticulously optimized for best-in-class speed and generally outperform comparable functions in existing R packages. GenomicDistributions also offers plotting functions that produce editable ggplot objects. All GenomicDistributions functions follow a uniform naming scheme and can handle either single or multiple region set inputs. Conclusions GenomicDistributions offers a fast and scalable tool for exploratory genomic region set analysis and visualization. GenomicDistributions excels in user-friendliness, flexibility of outputs, breadth of functions, and computational performance. GenomicDistributions is available from Bioconductor (https://bioconductor.org/packages/release/bioc/html/GenomicDistributions.html). Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08467-y.
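As an illustration of what a feature-distance calculation involves, the sketch below finds, for each region midpoint, the signed distance to the nearest feature position (for example a transcription start site) on the same chromosome, using binary search over sorted features. This is not the GenomicDistributions API; the function name, the single-chromosome input, and the sign convention (feature minus midpoint) are assumptions made for the example.

```python
from bisect import bisect_left


def nearest_feature_distances(region_mids, feature_positions):
    """Return the signed distance from each midpoint to its nearest feature.

    Positive values mean the nearest feature lies downstream (at a larger
    coordinate); negative values mean it lies upstream.
    """
    features = sorted(feature_positions)
    if not features:
        raise ValueError("at least one feature position is required")

    distances = []
    for mid in region_mids:
        i = bisect_left(features, mid)
        candidates = []
        if i < len(features):
            candidates.append(features[i] - mid)      # first feature at or after mid
        if i > 0:
            candidates.append(features[i - 1] - mid)  # last feature before mid
        distances.append(min(candidates, key=abs))
    return distances


# Hypothetical coordinates on one chromosome.
features = [100, 500, 1000]
mids = [90, 290, 700, 1200]
dists = nearest_feature_distances(mids, features)
```

Sorting once and searching with bisect keeps each lookup logarithmic in the number of features, which matters for large region sets.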
13
RNA-Seq Experiment and Data Analysis. Methods Mol Biol 2022; 2418:405-424. [PMID: 35119677 DOI: 10.1007/978-1-0716-1920-9_22] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/24/2022]
Abstract
With the ability to obtain several millions of reads per sample, high-throughput RNA sequencing (RNA-Seq) enables investigation of any transcriptome at a fine resolution. Not just the messenger RNA (mRNA), but a wide variety of different RNA populations (e.g., total RNA, microRNA, long ncRNA, pre-mRNA) can also be investigated using RNA-Seq. While facilitating accurate quantification of gene expression, RNA-Seq offers the opportunity to estimate abundance of isoforms and find novel transcripts and allele-specific transcripts. In this chapter, we describe a protocol to construct an RNA-Seq library for sequencing on Illumina NGS platforms and a computational pipeline to perform RNA-Seq data analysis. The protocols described in this chapter can be applied to the analysis of differential gene expression in control versus 17β-estradiol treatment of in vivo or in vitro systems.
14
geneExpressionFromGEO: An R Package to Facilitate Data Reading from Gene Expression Omnibus (GEO). Methods Mol Biol 2022; 2401:187-194. [PMID: 34902129 DOI: 10.1007/978-1-0716-1839-4_12] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/19/2022]
Abstract
Gene expression profiling is a useful way to measure the activity of genes in molecular biology and, because of its effectiveness, researchers have released thousands of gene expression datasets publicly in online databases and repositories, such as Gene Expression Omnibus (GEO). To read and analyze gene expression data, the computational biology community has developed several tools and platforms, including Bioconductor, an R open-source platform of software packages that can be used to analyze these data. Despite the usefulness of Bioconductor and of its packages, it is still difficult to read gene expression data from GEO, and to assign gene symbols to the probesets of datasets. To alleviate this problem, we introduce here a new R software package, geneExpressionFromGEO, which lets users easily download gene expression data from GEO and associate gene symbols with probesets. In this short chapter, we describe the assets of our software package, and we report an example of its usage. We believe that geneExpressionFromGEO can be very useful for the R community of bioinformaticians working on gene expression data.
15
scAnnotatR: framework to accurately classify cell types in single-cell RNA-sequencing data. BMC Bioinformatics 2022; 23:44. [PMID: 35038984 PMCID: PMC8762856 DOI: 10.1186/s12859-022-04574-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/30/2021] [Accepted: 01/11/2022] [Indexed: 12/02/2022] Open
Abstract
Background Automatic cell type identification is essential to alleviate a key bottleneck in scRNA-seq data analysis. While most existing classification tools show good sensitivity and specificity, they often fail to adequately not-classify cells that are missing from the reference used. Additionally, many tools do not scale to the continuously increasing size of current scRNA-seq datasets. Therefore, additional tools are needed to solve these challenges. Results scAnnotatR is a novel R package that provides a complete framework to classify cells in scRNA-seq datasets using pre-trained classifiers. It supports both Seurat and Bioconductor’s SingleCellExperiment and is thereby compatible with the vast majority of R-based analysis workflows. scAnnotatR uses hierarchically organised SVMs to distinguish a specific cell type versus all others. It shows comparable or even superior accuracy, sensitivity and specificity compared to existing tools while being able to not-classify unknown cell types. Moreover, scAnnotatR is the only one of the best-performing tools able to process datasets containing more than 600,000 cells. Conclusions scAnnotatR is freely available on GitHub (https://github.com/grisslab/scAnnotatR) and through Bioconductor (from version 3.14). It is consistently among the best-performing tools in terms of classification accuracy while scaling to the largest datasets. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04574-5.
16
GeneTonic: an R/Bioconductor package for streamlining the interpretation of RNA-seq data. BMC Bioinformatics 2021; 22:610. [PMID: 34949163 PMCID: PMC8697502 DOI: 10.1186/s12859-021-04461-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 05/25/2021] [Accepted: 10/26/2021] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND The interpretation of results from transcriptome profiling experiments via RNA sequencing (RNA-seq) can be a complex task, where the essential information is distributed among different tabular and list formats: normalized expression values, results from differential expression analysis, and results from functional enrichment analyses. A number of tools and databases are widely used for the purpose of identification of relevant functional patterns, yet often their contextualization within the data and results at hand is not straightforward, especially if these analytic components are not combined together efficiently. RESULTS We developed the GeneTonic software package, which serves as a comprehensive toolkit for streamlining the interpretation of functional enrichment analyses, by fully leveraging the information of expression values in a differential expression context. GeneTonic is implemented in R and Shiny, leveraging packages that enable HTML-based interactive visualizations for executing drilldown tasks seamlessly, viewing the data at a level of increased detail. GeneTonic is integrated with the core classes of existing Bioconductor workflows, and can accept the output of many widely used tools for pathway analysis, making this approach applicable to a wide range of use cases. Users can effectively navigate interlinked components (otherwise available as flat text or spreadsheet tables), bookmark features of interest during the exploration sessions, and obtain at the end a tailored HTML report, thus combining the benefits of both interactivity and reproducibility. CONCLUSION GeneTonic is distributed as an R package in the Bioconductor project (https://bioconductor.org/packages/GeneTonic/) under the MIT license. Offering both bird's-eye views of the components of transcriptome data analysis and the detailed inspection of single genes, individual signatures, and their relationships, GeneTonic aims at simplifying the process of interpretation of complex and compelling RNA-seq datasets for many researchers with different expertise profiles.
17
Abstract
The creation of visualizations to interpret genomics data remains an important aspect of data science within computational biology. The GenVisR Bioconductor package was created to lower the entry point for publication-quality graphics and has remained a popular suite of tools within this domain. GenVisR supports visualizations covering a breadth of topics including functions to produce visual summaries of copy-number alterations, somatic variants, sequence quality metrics, and more. Recently, the GenVisR package has undergone significant updates to increase performance and functionality. To demonstrate the utility of GenVisR, we present protocols for use of the updated Waterfall() function to create a customizable Oncoprint-style plot of the mutational landscape of a tumor cohort. We explain the basics of installation, data import, configuration, plotting, clinical annotation, and customization. A companion online workshop describing the GenVisR library, Waterfall() function, and other genomic visualization tools is available at genviz.org. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC.
Basic Protocol 1: Generating a Waterfall() plot from original mutation data
Basic Protocol 2: Adding clinical data to a Waterfall() plot
Basic Protocol 3: Customizing mutation burden in Waterfall() plots
Basic Protocol 4: Brief exploration of customizable options
Support Protocol 1: Installing GenVisR
18
iDEP Web Application for RNA-Seq Data Analysis. Methods Mol Biol 2021; 2284:417-443. [PMID: 33835455 DOI: 10.1007/978-1-0716-1307-8_22] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Indexed: 10/21/2022]
Abstract
RNA sequencing (RNA-seq) has become a routine method for transcriptomic profiling. We developed a user-friendly web app called iDEP (integrated differential expression and pathway analysis) to help biologists interpret read counts or other types of expression matrices derived from read mapping. With iDEP, users can easily conduct exploratory data analysis, identify differentially expressed genes, and perform pathway analysis. Due to its intuitive user interface and massive annotation database, iDEP is being widely adopted for interactive analysis of RNA-seq data. Using a public dataset on the effect of heat shock on mice with and without functional Hsf1, we demonstrate how users can prepare data files and conduct in-depth analysis. We also discuss the importance of critical interpretation of results (avoiding p-hacking and rationalizing) and validation of significant pathways by using different methods and independent annotation databases.
19
SPEAQeasy: a scalable pipeline for expression analysis and quantification for R/bioconductor-powered RNA-seq analyses. BMC Bioinformatics 2021; 22:224. [PMID: 33932985 PMCID: PMC8088074 DOI: 10.1186/s12859-021-04142-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/29/2021] [Accepted: 04/21/2021] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND RNA sequencing (RNA-seq) is a common and widespread biological assay, and an increasing amount of data is generated with it. In practice, there are a large number of individual steps a researcher must perform before raw RNA-seq reads yield directly valuable information, such as differential gene expression data. Existing software tools are typically specialized, only performing one step (such as alignment of reads to a reference genome) of a larger workflow. The demand for a more comprehensive and reproducible workflow has led to the production of a number of publicly available RNA-seq pipelines. However, we have found that most require computational expertise to set up or share among several users, are not actively maintained, or lack features we have found to be important in our own analyses. RESULTS In response to these concerns, we have developed a Scalable Pipeline for Expression Analysis and Quantification (SPEAQeasy), which is easy to install and share, and provides a bridge towards R/Bioconductor downstream analysis solutions. SPEAQeasy is portable across computational frameworks (SGE, SLURM, local, docker integration) and different configuration files are provided (http://research.libd.org/SPEAQeasy/). CONCLUSIONS SPEAQeasy is user-friendly and lowers the computational-domain entry barrier for biologists and clinicians to RNA-seq data processing as the main input file is a table with sample names and their corresponding FASTQ files. The goal is to provide a flexible pipeline that is immediately usable by researchers, regardless of their technical background or computing environment.
|
20
|
KnowSeq R-Bioc package: The automatic smart gene expression tool for retrieving relevant biological knowledge. Comput Biol Med 2021; 133:104387. [PMID: 33872966 DOI: 10.1016/j.compbiomed.2021.104387] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/24/2020] [Revised: 04/05/2021] [Accepted: 04/05/2021] [Indexed: 02/07/2023]
Abstract
The KnowSeq R/Bioc package is designed as powerful, scalable and modular software focused on automating and assembling renowned bioinformatic tools with new features and functionalities. It comprises a unified environment to perform complex gene expression analyses, covering all the processing steps needed to identify a gene signature for a specific disease and to gather understandable knowledge. This process may be initiated from raw files either available at well-known platforms or provided by the users themselves, and in either case coming from different information sources and different transcriptomic technologies. The pipeline makes use of a set of advanced algorithms, including the adaptation of a novel procedure for the selection of the most representative genes in a given multiclass problem. Similarly, it embeds an intelligent system able to classify new patients, which lets the user choose among a number of well-known and widespread classification and feature selection methods in bioinformatics. Furthermore, KnowSeq is engineered to automatically generate a complete and detailed HTML report of the whole process, which is also modular and scalable. Biclass breast cancer and multiclass lung cancer study cases were addressed to rigorously assess the usability and efficiency of KnowSeq. The models built using the differentially expressed genes obtained from both experiments reach high classification rates. Furthermore, biological knowledge was extracted in terms of Gene Ontologies, pathways and related diseases with the aim of helping the expert in the decision-making process. KnowSeq is available at Bioconductor (https://bioconductor.org/packages/KnowSeq), GitHub (https://github.com/CasedUgr/KnowSeq) and Docker (https://hub.docker.com/r/casedugr/knowseq).
|
21
|
In silico candidate variant and gene identification using inbred mouse strains. PeerJ 2021; 9:e11017. [PMID: 33763305 PMCID: PMC7956000 DOI: 10.7717/peerj.11017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2020] [Accepted: 02/06/2021] [Indexed: 12/05/2022] Open
Abstract
Mice are the most widely used animal model to study genotype to phenotype relationships. Inbred mice are genetically identical, which eliminates genetic heterogeneity and makes them particularly useful for genetic studies. Many different strains have been bred over decades and a vast amount of phenotypic data has been generated. In addition, whole genome sequencing-based genome-wide genotype data for many widely used inbred strains has recently been released. Here, we present an approach for in silico fine-mapping that uses genotypic data of 37 inbred mouse strains together with phenotypic data provided by the user to propose candidate variants and genes for the phenotype under study. Public genome-wide genotype data covering more than 74 million variant sites is queried efficiently in real time to provide those variants that are compatible with the observed phenotype differences between strains. Variants can be filtered by molecular consequences and by corresponding molecular impact. Candidate gene lists can be generated from variant lists on the fly. Fine-mapping together with annotation or filtering of results is provided in a Bioconductor package called MouseFM. In order to characterize candidate variant lists under various settings, MouseFM was applied to two expression data sets across 20 inbred mouse strains, one from neutrophils and one from CD4+ T cells. Fine-mapping was assessed for about 10,000 genes in each data set and identified candidate variants and haplotypes for many expression quantitative trait loci (eQTLs) reported previously based on these data. For albinism, MouseFM reports only one variant allele of moderate or high molecular impact that only albino mice share: a missense variant in the Tyr gene, reported previously to be causal for this phenotype. Performing in silico fine-mapping for interfrontal bone formation in mice using four strains with and five strains without interfrontal bone results in 12 genes. Of these, three are related to skull shaping abnormality. Finally, performing fine-mapping for dystrophic cardiac calcification by comparing nine strains showing the phenotype with eight strains lacking it, we identify only one moderate impact variant in the known causal gene Abcc6. In summary, this illustrates the benefit of using MouseFM for candidate variant and gene identification.
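The compatibility query in this abstract (variants whose alleles separate the strains showing a phenotype from those lacking it) can be illustrated with a minimal Python sketch. This is not MouseFM's implementation or API: the function name, the strain labels and the toy genotypes are hypothetical, and the rule assumed here is the strictest one, in which each group is uniform for one allele and the two groups carry different alleles.

```python
from typing import Dict, List


def compatible_variants(
    genotypes: Dict[str, Dict[str, str]],
    group_a: List[str],
    group_b: List[str],
) -> List[str]:
    """Return IDs of variants that perfectly separate two strain groups.

    genotypes maps variant ID -> {strain name -> allele}.
    A variant qualifies when every strain in group_a carries one allele
    and every strain in group_b carries a single, different allele.
    """
    hits = []
    for variant_id, alleles in genotypes.items():
        alleles_a = {alleles[strain] for strain in group_a}
        alleles_b = {alleles[strain] for strain in group_b}
        if len(alleles_a) == 1 and len(alleles_b) == 1 and alleles_a != alleles_b:
            hits.append(variant_id)
    return hits


# Hypothetical toy data: four strains, three variant sites.
toy_genotypes = {
    "var1": {"S1": "A", "S2": "A", "S3": "G", "S4": "G"},  # separates the groups
    "var2": {"S1": "C", "S2": "T", "S3": "T", "S4": "T"},  # group A is mixed
    "var3": {"S1": "G", "S2": "G", "S3": "G", "S4": "G"},  # no difference
}

print(compatible_variants(toy_genotypes, ["S1", "S2"], ["S3", "S4"]))
```

In this toy case only `var1` is reported. Filtering by molecular consequence or impact, as the abstract describes, would be a further step applied to the returned list.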
|
22
|
ideal: an R/Bioconductor package for interactive differential expression analysis. BMC Bioinformatics 2020; 21:565. [PMID: 33297942 PMCID: PMC7724894 DOI: 10.1186/s12859-020-03819-5] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 01/09/2020] [Accepted: 10/15/2020] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND RNA sequencing (RNA-seq) is an increasingly popular tool for transcriptome profiling. A key point in making the best use of the available data is to provide software tools that are easy to use but still provide flexibility and transparency in the adopted methods. Despite the availability of many packages focused on detecting differential expression, a method to streamline this type of bioinformatics analysis in a comprehensive, accessible, and reproducible way is lacking. RESULTS We developed the ideal software package, which serves as a web application for interactive and reproducible RNA-seq analysis, while producing a wealth of visualizations to facilitate data interpretation. ideal is implemented in R using the Shiny framework, and is fully integrated with the existing core structures of the Bioconductor project. Users can perform the essential steps of the differential expression analysis workflow in an assisted way, and generate a broad spectrum of publication-ready outputs, including diagnostic and summary visualizations in each module, all the way down to functional analysis. ideal also offers the possibility to seamlessly generate a full HTML report for storing and sharing results together with code for reproducibility. CONCLUSION ideal is distributed as an R package in the Bioconductor project (http://bioconductor.org/packages/ideal/), and provides a solution for performing interactive and reproducible analyses of summarized RNA-seq expression data, empowering researchers with many different profiles (life scientists, clinicians, but also experienced bioinformaticians) to make the ideal use of the data at hand.
|
23
|
multiGSEA: a GSEA-based pathway enrichment analysis for multi-omics data. BMC Bioinformatics 2020; 21:561. [PMID: 33287694 PMCID: PMC7720482 DOI: 10.1186/s12859-020-03910-x] [Citation(s) in RCA: 61] [Impact Index Per Article: 15.3] [Received: 08/28/2020] [Accepted: 11/25/2020] [Indexed: 01/08/2023] Open
Abstract
Background Gaining biological insights into molecular responses to treatments or diseases from omics data can be accomplished by gene set or pathway enrichment methods. A plethora of different tools and algorithms have been developed so far. Among those, the gene set enrichment analysis (GSEA) proved to control both type I and II errors well. In recent years the call for a combined analysis of multiple omics layers became prominent, giving rise to a few multi-omics enrichment tools. Each of these has its own drawbacks and restrictions regarding its universal application. Results Here, we present the multiGSEA package, which calculates a combined GSEA-based pathway enrichment on multiple omics layers. The package queries 8 different pathway databases and relies on the robust GSEA algorithm for a single-omics enrichment analysis. In a final step, the single-omics scores are combined to create a robust composite multi-omics pathway enrichment measure. multiGSEA supports 11 different organisms and includes a comprehensive mapping of transcript, protein, and metabolite IDs. Conclusions With multiGSEA we introduce a highly versatile tool for multi-omics pathway integration that minimizes previous restrictions in terms of omics layer selection, pathway database availability, organism selection and the mapping of omics feature identifiers. multiGSEA is publicly available under the GPL-3 license at https://github.com/yigbt/multiGSEA and at Bioconductor: https://bioconductor.org/packages/multiGSEA.
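The abstract does not say how the per-layer enrichment scores are combined. As one standard way to combine independent p-values for the same pathway, the sketch below applies Fisher's combined probability test in Python; this choice is an assumption made for illustration and is not necessarily what multiGSEA does. The function name and the example p-values are hypothetical.

```python
import math
from typing import Sequence


def fisher_combined_pvalue(pvalues: Sequence[float]) -> float:
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the upper tail has the closed form
        P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    so only the standard library is needed.
    """
    if not pvalues or any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (X/2)^i / i! incrementally
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one pathway in three omics layers
# (for example transcriptome, proteome, metabolome).
layer_pvalues = [0.04, 0.20, 0.01]
print(fisher_combined_pvalue(layer_pvalues))
```

With a single layer the function returns the input p-value unchanged, which is a quick sanity check on the closed form.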
|
24
|
MADloy: robust detection of mosaic loss of chromosome Y from genotype-array-intensity data. BMC Bioinformatics 2020; 21:533. [PMID: 33225898 PMCID: PMC7682048 DOI: 10.1186/s12859-020-03768-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/30/2019] [Accepted: 09/20/2020] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND Accurate protocols and methods to robustly detect the mosaic loss of chromosome Y (mLOY) are needed given its reported role in cancer, several age-related disorders and overall male mortality. Intensity SNP-array data have been used to infer mLOY status and to determine its prominent role in male disease. However, discrepancies among reported findings can be due to the uncertainty and variability of the methods used for mLOY detection and to differences in the tissue matrix used. RESULTS We created a publicly available software tool called MADloy (Mosaic Alteration Detection for LOY) that incorporates existing methods and includes a new robust approach, allowing efficient calling in large studies and comparisons between methods. MADloy optimizes mLOY calling by correctly modeling the underlying reference population with no-mLOY status and incorporating B-deviation information. We observed improvements in calling accuracy over previous methods, using experimentally validated samples, and an increase in the statistical power to detect associations with disease and mortality, using simulation studies and real dataset analyses. To understand discrepancies in mLOY detection across different tissues, we applied MADloy to detect the increase of mLOY cellularity in blood in 18 individuals after 3 years and to confirm that its detection in saliva was sub-optimal (41%). We additionally applied MADloy to detect the down-regulation of chromosome Y genes in kidney and bladder tumors with mLOY, and to perform pathway analyses for the detection of mLOY in blood. CONCLUSIONS MADloy is a new software tool implemented in R for the easy and robust calling of mLOY status across different tissues, aimed at facilitating its study in large epidemiological studies.
|
25
|
MSnbase, Efficient and Elegant R-Based Processing and Visualization of Raw Mass Spectrometry Data. J Proteome Res 2020; 20:1063-1069. [PMID: 32902283 DOI: 10.1021/acs.jproteome.0c00313] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Indexed: 12/18/2022]
Abstract
We present version 2 of the MSnbase R/Bioconductor package. MSnbase provides infrastructure for the manipulation, processing, and visualization of mass spectrometry data. We focus on the new on-disk infrastructure, which allows the handling of large raw mass spectrometry experiments on commodity hardware, and illustrate how the package is used for elegant data processing, method development, and visualization.
|
26
|
Abstract
We construct a simple workflow for fluent genomics data analysis using the R/Bioconductor ecosystem. This involves three core steps: import the data into an appropriate abstraction, model the data with respect to the biological questions of interest, and integrate the results with respect to their underlying genomic coordinates. Here we show how to implement these steps to integrate published RNA-seq and ATAC-seq experiments on macrophage cell lines. Using tximeta, we import RNA-seq transcript quantifications into an analysis-ready data structure, called the SummarizedExperiment, that contains the ranges of the reference transcripts and metadata on their provenance. Using SummarizedExperiments to represent the ATAC-seq and RNA-seq data, we model differentially accessible (DA) chromatin peaks and differentially expressed (DE) genes with existing Bioconductor packages. Using plyranges we then integrate the results to see if there is an enrichment of DA peaks near DE genes by finding overlaps and aggregating over log-fold change thresholds. The combination of these packages and their integration with the Bioconductor ecosystem provide a coherent framework for analysts to iteratively and reproducibly explore their biological data.
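The integration step described above (finding DA peaks near DE genes and aggregating over log-fold change thresholds) can be sketched in plain Python. This is only an illustration of the idea, not the plyranges workflow: the tuple layouts, the 10 kb window, the function names and the toy coordinates are all assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Gene = Tuple[str, str, int, int]    # (name, chromosome, start, end)
Peak = Tuple[str, int, int, float]  # (chromosome, start, end, log-fold change)


def peaks_near_genes(genes: Sequence[Gene], peaks: Sequence[Peak],
                     window: int) -> Dict[str, int]:
    """Count peaks overlapping each gene interval extended by `window` bp."""
    counts = {}
    for name, chrom, start, end in genes:
        lo, hi = start - window, end + window
        counts[name] = sum(
            1
            for p_chrom, p_start, p_end, _ in peaks
            if p_chrom == chrom and p_start <= hi and p_end >= lo
        )
    return counts


def overlaps_by_threshold(genes: Sequence[Gene], peaks: Sequence[Peak],
                          thresholds: Sequence[float],
                          window: int = 10_000) -> Dict[float, int]:
    """For each |LFC| threshold, keep peaks passing it and total the overlaps."""
    result = {}
    for threshold in thresholds:
        kept: List[Peak] = [p for p in peaks if abs(p[3]) >= threshold]
        result[threshold] = sum(peaks_near_genes(genes, kept, window).values())
    return result


# Hypothetical toy data: two DE genes and four DA peaks.
toy_genes = [("geneA", "chr1", 50_000, 60_000), ("geneB", "chr2", 10_000, 12_000)]
toy_peaks = [
    ("chr1", 45_000, 45_500, 1.5),    # within 10 kb of geneA
    ("chr1", 80_000, 80_400, 2.5),    # too far from geneA
    ("chr2", 11_000, 11_300, 0.4),    # inside geneB, small effect
    ("chr2", 500_000, 500_200, 3.0),  # far from geneB
]

print(overlaps_by_threshold(toy_genes, toy_peaks, [0.0, 1.0, 2.0]))
```

To assess enrichment, the same counts would be computed for a comparison set of non-DE genes and the two compared at each threshold.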
|
27
|
fcScan: a versatile tool to cluster combinations of sites using genomic coordinates. BMC Bioinformatics 2020; 21:194. [PMID: 32429868 PMCID: PMC7236483 DOI: 10.1186/s12859-020-3536-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2019] [Accepted: 05/05/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Finding combinations of homotypic or heterotypic genomic sites obeying a specific grammar in DNA sequences is a frequent task in bioinformatics. A typical case corresponds to the identification of cis-regulatory modules characterized by a combination of transcription factor binding sites in a defined window size. Although previous studies identified clusters of genomic sites in species with varying genome sizes, a dedicated and versatile tool to search for such clusters is lacking. RESULTS We present fcScan, an R/Bioconductor package to search for clusters of genomic sites based on user-defined criteria including cluster size, inter-cluster distances, and site order and orientation, allowing users to adapt their search criteria to specific biological questions. It supports GRanges, data frame and VCF/BED files as input and returns data in GRanges format. By performing clustering on vectorized data, fcScan can search for genomic clusters among millions of input sites in a short time and is thus well suited to scanning data generated by high-throughput methods, including next generation sequencing. CONCLUSIONS fcScan is ideal for detecting cis-regulatory modules of transcription factor binding sites with a specific grammar as well as genomic loci enriched for mutations. The flexibility in input parameters allows users to perform searches targeting specific research questions. It is released under the Artistic-2.0 License. The source code is freely available through Bioconductor (https://bioconductor.org/packages/fcScan) and GitHub (https://github.com/pkhoueiry/fcScan).
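The kind of search described here, a required combination of site types inside a window of limited size, can be illustrated with a small Python sketch. It is a deliberately simplified stand-in and not fcScan's algorithm or interface: it handles one chromosome, ignores site order, orientation and inter-cluster distance, and reports one minimal span per starting site. The function name and toy sites are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_clusters(sites: List[Tuple[int, str]], required: Dict[str, int],
                  window: int) -> List[Tuple[int, int]]:
    """Find spans of at most `window` bp containing the required site counts.

    sites:    (position, site name) pairs on a single chromosome.
    required: minimum number of occurrences needed for each site name.
    Returns (start, end) positions; each span starts at one input site and
    ends at the first site that completes the required combination.
    """
    ordered = sorted(sites)
    clusters = []
    for i, (start, _) in enumerate(ordered):
        seen = Counter()
        for position, name in ordered[i:]:
            if position - start > window:
                break  # remaining sites are outside the window
            seen[name] += 1
            if all(seen[site] >= count for site, count in required.items()):
                clusters.append((start, position))
                break
    return clusters


# Hypothetical binding sites for two transcription factors, A and B.
toy_sites = [(100, "A"), (150, "B"), (400, "A"), (900, "B"), (950, "A")]

print(find_clusters(toy_sites, {"A": 1, "B": 1}, window=200))
```

On the toy input this reports the spans (100, 150) and (900, 950); the isolated A site at position 400 has no B site within 200 bp and is skipped.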
|
28
|
gscreend: modelling asymmetric count ratios in CRISPR screens to decrease experiment size and improve phenotype detection. Genome Biol 2020; 21:53. [PMID: 32122365 PMCID: PMC7052974 DOI: 10.1186/s13059-020-1939-1] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 08/18/2019] [Accepted: 01/19/2020] [Indexed: 02/06/2023] Open
Abstract
Pooled CRISPR screens are a powerful tool to probe genotype-phenotype relationships at genome-wide scale. However, criteria for optimal design are missing, and it remains unclear how experimental parameters affect results. Here, we report that random decreases in gRNA abundance are more likely than increases due to bottleneck effects during the cell proliferation phase. Failure to consider this asymmetry leads to loss of detection power. We provide a new statistical test that addresses this problem and improves hit detection at reduced experiment size. The method is implemented in the R package gscreend, which is available at http://bioconductor.org/packages/gscreend.
|
29
|
Abstract
Allelic imbalance occurs when the two alleles of a gene are differentially expressed within a diploid organism, and can indicate important differences in cis-regulation and epigenetic state across the two chromosomes. Because of this, the ability to accurately quantify the proportion at which each allele of a gene is expressed is of great interest to researchers. This becomes challenging in the presence of small read counts and/or sample sizes, which can cause estimates for allelic expression proportions to have high variance. Investigators have traditionally dealt with this problem by filtering out genes with small counts and samples. However, this may inadvertently remove important genes that have truly large allelic imbalances. Another option is to use Bayesian estimators to reduce the variance. To this end, we evaluated the accuracy of three different estimators, the latter two of which are Bayesian shrinkage estimators: maximum likelihood, approximate posterior estimation of GLM coefficients (apeglm) and adaptive shrinkage (ash). We also wrote C++ code to quickly calculate ML and apeglm estimates, and integrated it into the apeglm package. The three methods were evaluated on both simulated and real data. Apeglm consistently performed better than ML according to a variety of criteria, including mean absolute error and concordance at the top. While ash had lower error and greater concordance than ML on the simulations, it also had a tendency to over-shrink large effects, and performed worse on the real data according to error and concordance. Furthermore, when compared to five other packages that also fit beta-binomial models, the apeglm package was substantially faster, making our package useful for quick and reliable analyses of allelic imbalance. Apeglm is available as an R/Bioconductor package at http://bioconductor.org/packages/apeglm.
|
30
|
Abstract
Allelic imbalance occurs when the two alleles of a gene are differentially expressed within a diploid organism and can indicate important differences in cis-regulation and epigenetic state across the two chromosomes. Because of this, the ability to accurately quantify the proportion at which each allele of a gene is expressed is of great interest to researchers. This becomes challenging in the presence of small read counts and/or sample sizes, which can cause estimators for allelic expression proportions to have high variance. Investigators have traditionally dealt with this problem by filtering out genes with small counts and samples. However, this may inadvertently remove important genes that have truly large allelic imbalances. Another option is to use pseudocounts or Bayesian estimators to reduce the variance. To this end, we evaluated the accuracy of four different estimators, the latter two of which are Bayesian shrinkage estimators: maximum likelihood, adding a pseudocount to each allele, approximate posterior estimation of GLM coefficients (apeglm) and adaptive shrinkage (ash). We also wrote C++ code to quickly calculate ML and apeglm estimates and integrated it into the apeglm package. The four methods were evaluated on two simulations and one real data set. Apeglm consistently performed better than ML according to a variety of criteria, and generally outperformed use of pseudocounts as well. Ash also performed better than ML in one of the simulations, but in the other performance was more mixed. Finally, when compared to five other packages that also fit beta-binomial models, the apeglm package was substantially faster and more numerically reliable, making our package useful for quick and reliable analyses of allelic imbalance. Apeglm is available as an R/Bioconductor package at http://bioconductor.org/packages/apeglm.
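To make the variance problem concrete, the sketch below compares the two simplest estimators named in this abstract, maximum likelihood and adding a pseudocount to each allele, for a gene with very few reads. It is a plain Python illustration with hypothetical function names and counts; it does not reproduce apeglm or ash, and the pseudocount of 1 is an assumption.

```python
def ml_proportion(alt_count: int, ref_count: int) -> float:
    """Maximum likelihood estimate of the proportion of reads from one allele."""
    total = alt_count + ref_count
    if total == 0:
        raise ValueError("at least one read is required")
    return alt_count / total


def pseudocount_proportion(alt_count: int, ref_count: int,
                           pseudo: float = 1.0) -> float:
    """Estimate after adding `pseudo` reads to each allele."""
    return (alt_count + pseudo) / (alt_count + ref_count + 2 * pseudo)


# Hypothetical counts: the same apparent imbalance at low and high coverage.
for alt, ref in [(3, 0), (300, 0)]:
    print(alt, ref, ml_proportion(alt, ref), pseudocount_proportion(alt, ref))
```

With three reads the ML estimate is already 1.0, while the pseudocount estimate is pulled back to 0.8; with 300 reads the two estimates nearly agree. That pull toward 0.5 at low counts is the kind of variance reduction that the Bayesian shrinkage estimators in the abstract pursue in a more principled way.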
|
31
|
Abstract
RCy3 is an R package in Bioconductor that communicates with Cytoscape via its REST API, providing access to the full feature set of Cytoscape from within the R programming environment. RCy3 has been redesigned to streamline its usage and future development as part of a broader Cytoscape Automation effort. Over 100 new functions have been added, including dozens of helper functions specifically for intuitive data overlay operations. Over 40 Cytoscape apps have implemented automation support so far, making hundreds of additional operations accessible via RCy3. Two-way conversion with networks from igraph and graph ensures interoperability with existing network biology workflows and dozens of other Bioconductor packages. These capabilities are demonstrated in a series of use cases involving public databases, enrichment analysis pipelines, shortest path algorithms and more. With RCy3, bioinformaticians will be able to quickly deliver reproducible network biology workflows as integrations of Cytoscape functions, complex custom analyses and other R packages.
|
34
|
Abstract
An increasing emphasis on understanding the dynamics of microbial communities in various settings has led to the proliferation of longitudinal metagenomic sampling studies. Data from whole metagenomic shotgun sequencing and marker-gene survey studies have characteristics that drive novel statistical methodological development for estimating time intervals of differential abundance. When designing a study and choosing the collection frequency beforehand, one may wish to model the ability to detect an effect, since there may be constraints with respect to cost, ease of access, etc. Additionally, while every study is unique, it is possible that in certain scenarios one statistical framework may be more appropriate than another. Here, we present a simulation paradigm implemented in the R/Bioconductor software package microbiomeDASim, available at http://bioconductor.org/packages/microbiomeDASim. microbiomeDASim allows investigators to simulate longitudinal differentially abundant microbiome features with a variety of known functional forms and with flexible parameters to control the desired signal-to-noise ratio. We present success metrics for one particular method, metaSplines.
|
36
|
Abstract
BACKGROUND 5'-end sequencing assays, and Cap Analysis of Gene Expression (CAGE) in particular, have been instrumental in studying transcriptional regulation. 5'-end methods provide genome-wide maps of transcription start sites (TSSs) with base pair resolution. Because active enhancers often feature bidirectional TSSs, such data can also be used to predict enhancer candidates. The current availability of mature and comprehensive computational tools for the analysis of 5'-end data is limited, preventing efficient analysis of new and existing 5'-end data. RESULTS We present CAGEfightR, a framework for analysis of CAGE and other 5'-end data implemented as an R/Bioconductor package. CAGEfightR can import data from BigWig files and allows for fast and memory-efficient prediction and analysis of TSSs and enhancers. Downstream analyses include quantification, normalization, annotation with transcript and gene models, TSS shape statistics, linking TSSs to enhancers via co-expression, identification of enhancer clusters, and genome-browser style visualization. While built to analyze CAGE data, we demonstrate the utility of CAGEfightR in analyzing nascent RNA 5'-data (PRO-Cap). CAGEfightR is implemented using standard Bioconductor classes, making it easy to learn, use and combine with other Bioconductor packages, for example popular differential expression tools such as limma, DESeq2 and edgeR. CONCLUSIONS CAGEfightR provides a single, scalable and easy-to-use framework for comprehensive downstream analysis of 5'-end data. CAGEfightR is designed to be interoperable with other Bioconductor packages, thereby unlocking hundreds of mature transcriptomic analysis tools for 5'-end data. CAGEfightR is freely available via Bioconductor: bioconductor.org/packages/CAGEfightR.
|
37
|
Abstract
Background Mutational signatures are specific patterns of somatic mutations introduced into the genome by oncogenic processes. Several mutational signatures have been identified and quantified from multiple cancer studies, and some of them have been linked to known oncogenic processes. Identification of the processes contributing to mutations observed in a sample is potentially informative for understanding the cancer etiology. Results We present here SigsPack, a Bioconductor package to estimate a sample’s exposure to mutational processes described by a set of mutational signatures. The package also provides functions to estimate the stability of these exposures, using bootstrapping. The performance of the exposure and exposure stability estimations has been validated using synthetic and real data. Finally, the package provides tools to normalize the mutation frequencies with respect to the tri-nucleotide contents of the regions probed in the experiment. The importance of this effect is illustrated in an example. Conclusion SigsPack provides a complete set of tools for individual sample exposure estimation, and for normalization of mutation catalogues and mutational signatures. Electronic supplementary material The online version of this article (10.1186/s12859-019-3043-7) contains supplementary material, which is available to authorized users.
|
38
|
Abstract
Benchmarking is a crucial step during computational analysis and method development. Recently, a number of new methods have been developed for analyzing high-dimensional cytometry data. However, it can be difficult for analysts and developers to find and access well-characterized benchmark datasets. Here, we present HDCytoData, a Bioconductor package providing streamlined access to several publicly available high-dimensional cytometry benchmark datasets. The package is designed to be extensible, allowing new datasets to be contributed by ourselves or other researchers in the future. Currently, the package includes a set of experimental and semi-simulated datasets, which have been used in our previous work to evaluate methods for clustering and differential analyses. Datasets are formatted into standard SummarizedExperiment and flowSet Bioconductor object formats, which include complete metadata within the objects. Access is provided through Bioconductor's ExperimentHub interface. The package is freely available from http://bioconductor.org/packages/HDCytoData.
|
39
|
Abstract
DNA transcription is intrinsically complex. Bioinformatic work with transcription factors (TFs) is complicated by a multiplicity of data resources and annotations. The Bioconductor package TFutils includes data structures and functions to enhance the precision and utility of integrative analyses that have components involving TFs. TFutils provides catalogs of human TFs from three reference sources (CISBP, HOCOMOCO, and GO), a catalog of TF targets derived from MSigDb, and multiple approaches to enumerating TF binding sites, including an interface to results of 690 ENCODE experiments. Aspects of integration of TF binding patterns and genome-wide association study results are explored in examples.
|
40
|
Development of an Interactive Web Application "Shiny App for Frequency Analysis on Homo sapiens Genome (SAFA-HsG)". Interdiscip Sci 2019; 11:723-729. [PMID: 31264054 DOI: 10.1007/s12539-019-00340-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2019] [Revised: 06/08/2019] [Accepted: 06/19/2019] [Indexed: 10/26/2022]
Abstract
The web application "Shiny App for Frequency Analysis on Homo sapiens Genome (SAFA-HsG)" was developed using R-based Bioconductor packages and the Shiny framework. Through the app, preliminary descriptive analyses of nucleotide frequency and of CpG islands, CpG non-islands, and CpG island shores and shelves (downstream and upstream) of the human reference genome can be carried out, which will help biologists working on human epigenomics. Tables of these analyses for all chromosomes can be viewed and downloaded by end users. Similarly, the respective comparative plots can be used to compare CpG sites. In addition, to introduce the Personal Genome Project, the present study carried out preliminary work on a few raw data sets, which are included in the app and are intended to create interest in personal genome information. The app is hosted at https://SAFA-HsG.shinyapps.io/home/. It is a multi-platform application and can be launched locally from any computer, whether or not R is installed. Its user-friendly interface allows biologists, even those with little computing knowledge, to access the data and analyze them further.
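As a rough illustration of the kind of descriptive statistics mentioned in this abstract, the following Python sketch computes nucleotide frequencies and counts CpG dinucleotides in a sequence string. It is not the app's code: the function names `nucleotide_frequency` and `cpg_count` are hypothetical, and ignoring non-ACGT symbols such as N is an assumption made here for simplicity.

```python
from collections import Counter


def nucleotide_frequency(sequence):
    """Return the relative frequency of A, C, G and T in a sequence.

    Symbols other than A, C, G, T (for example N) are ignored; this is an
    assumption of the sketch, not a documented behaviour of SAFA-HsG.
    """
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    return {base: (counts[base] / total if total else 0.0) for base in "ACGT"}


def cpg_count(sequence):
    """Count CpG dinucleotides (a C immediately followed by a G)."""
    # "CG" cannot overlap with itself, so str.count finds every occurrence.
    return sequence.upper().count("CG")
```

For example, `nucleotide_frequency("ACGTCG")` gives one sixth each for A and T and one third each for C and G, and `cpg_count("ACGTCG")` finds two CpG sites.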
|
41
|
FastqCleaner: an interactive Bioconductor application for quality-control, filtering and trimming of FASTQ files. BMC Bioinformatics 2019; 20:361. [PMID: 31253077 PMCID: PMC6599294 DOI: 10.1186/s12859-019-2961-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/05/2018] [Accepted: 06/20/2019] [Indexed: 11/10/2022] Open
Abstract
Background: Exploration and processing of FASTQ files are the first steps in state-of-the-art data analysis workflows for Next Generation Sequencing (NGS) platforms. The large amount of data generated by these technologies poses a challenge for rapid analysis and visualization of sequencing information. Recent integration of the R data analysis platform with web visual frameworks has stimulated the development of user-friendly, powerful, and dynamic NGS data analysis applications. Results: This paper presents FastqCleaner, a Bioconductor visual application for both quality control (QC) and pre-processing of FASTQ files. The interface shows diagnostic information for the input and output data and allows users to select a series of filtering and trimming operations in an interactive framework. FastqCleaner combines the technology of Bioconductor for NGS data analysis with the data visualization advantages of a web environment. Conclusions: FastqCleaner is a user-friendly, offline-capable tool that enables access to advanced Bioconductor infrastructure. The novel concept of a Bioconductor interactive application that can be used without programming skills makes FastqCleaner a valuable resource for NGS data analysis. Electronic supplementary material: The online version of this article (10.1186/s12859-019-2961-8) contains supplementary material, which is available to authorized users.
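To make "filtering and trimming" concrete, here is a minimal Python sketch of two typical operations on a single FASTQ read: trimming low-quality bases from the 3' end and filtering on length and mean quality. It does not reproduce FastqCleaner's implementation or interface. The function names, the thresholds, and the Phred+33 quality encoding are all assumptions of this sketch.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    An ASCII offset of 33 (Phred+33) is assumed here.
    """
    return [ord(ch) - offset for ch in quality]


def trim_3prime(sequence, quality, min_q=20):
    """Remove trailing bases whose Phred score is below min_q."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def passes_filter(sequence, quality, min_len=5, min_mean_q=20.0):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    if not sequence or len(sequence) < min_len:
        return False
    scores = phred_scores(quality)
    return sum(scores) / len(scores) >= min_mean_q
```

With these definitions, `trim_3prime("ACGTACGT", "IIIIII##")` drops the last two bases, because `#` decodes to a Phred score of 2 while `I` decodes to 40. Trimming before filtering, as in this usage, lets a read with a poor tail survive if the remainder is still long enough.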
|
42
|
pcaExplorer: an R/Bioconductor package for interacting with RNA-seq principal components. BMC Bioinformatics 2019; 20:331. [PMID: 31195976 PMCID: PMC6567655 DOI: 10.1186/s12859-019-2879-1] [Citation(s) in RCA: 118] [Impact Index Per Article: 23.6] [Received: 11/23/2018] [Accepted: 05/07/2019] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND: Principal component analysis (PCA) is frequently used in genomics applications for quality assessment and exploratory analysis in high-dimensional data, such as RNA sequencing (RNA-seq) gene expression assays. Despite the availability of many software packages developed for this purpose, an interactive and comprehensive interface for performing these operations is lacking. RESULTS: We developed the pcaExplorer software package to enhance commonly performed analysis steps with an interactive and user-friendly application, which provides state saving as well as the automated creation of reproducible reports. pcaExplorer is implemented in R using the Shiny framework and exploits data structures from the open-source Bioconductor project. Users can easily generate a wide variety of publication-ready graphs, while assessing the expression data in the different modules available, including a general overview, dimension reduction on samples and genes, as well as functional interpretation of the principal components. CONCLUSION: pcaExplorer is distributed as an R package in the Bioconductor project (http://bioconductor.org/packages/pcaExplorer/), and is designed to assist a broad range of researchers in the critical step of interactive data exploration.
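For readers unfamiliar with the computation behind PCA, the sketch below finds the first principal component of a small samples-by-genes matrix in plain Python. It illustrates the general technique only and is not how pcaExplorer computes its results. Power iteration on the covariance matrix is swapped in here for brevity, and the function name, the row and column layout, and the iteration count are assumptions of this sketch.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores, explained_fraction) for the first PC.

    rows: list of samples, each a list of expression values (one per gene).
    At least two samples are required. Power iteration is used instead of
    a full eigendecomposition to keep the sketch dependency-free.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]

    # Sample covariance matrix of the genes (p x p).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]

    # Power iteration: repeatedly multiply by cov and renormalise.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]

    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    explained = eigenvalue / total_variance if total_variance > 0 else 0.0
    return v, scores, explained
```

Calling `first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])` returns loadings proportional to (1, 2), since the second gene is exactly twice the first, and this single component accounts for essentially all of the variance. The per-sample scores are the values one would plot on the PC1 axis of a sample-level PCA plot.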
|
43
|
HMP16SData: Efficient Access to the Human Microbiome Project Through Bioconductor. Am J Epidemiol 2019; 188:1023-1026. [PMID: 30649166 PMCID: PMC6545282 DOI: 10.1093/aje/kwz006] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 06/14/2018] [Revised: 10/08/2018] [Accepted: 10/11/2018] [Indexed: 12/30/2022] Open
Abstract
Phase 1 of the Human Microbiome Project (HMP) investigated 18 body subsites of 242 healthy American adults to produce the first comprehensive reference for the composition and variation of the "healthy" human microbiome. Publicly available data sets from amplicon sequencing of two 16S ribosomal RNA variable regions, with extensive controlled-access participant data, provide a reference for ongoing microbiome studies. However, utilization of these data sets can be hindered by the complex bioinformatic steps required to access, import, decrypt, and merge the various components in formats suitable for ecological and statistical analysis. The HMP16SData package provides count data for both 16S ribosomal RNA variable regions, integrated with phylogeny, taxonomy, public participant data, and controlled participant data for authorized researchers, using standard integrative Bioconductor data objects. By removing bioinformatic hurdles of data access and management, HMP16SData enables epidemiologists with only basic R skills to quickly analyze HMP data.
|
44
|
Invited Commentary: Improving the Accessibility of Human Microbiome Project Data Through Integration With R/Bioconductor. Am J Epidemiol 2019; 188:1027-1030. [PMID: 30649168 DOI: 10.1093/aje/kwz007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 11/02/2018] [Accepted: 11/19/2018] [Indexed: 12/24/2022] Open
Abstract
Alterations in the composition of the microbiota have been implicated in many diseases. The Human Microbiome Project (HMP) provides a comprehensive reference data set of the "normal" human microbiome of 242 healthy adults at 5 major body sites. The HMP used both 16S ribosomal RNA gene sequencing and whole-genome metagenomic sequencing to profile the subjects' microbial communities. However, accessing and analyzing the HMP data set still presents technical and bioinformatic challenges, given that researchers must import the microbiome data, integrate phylogenetic trees, and access and merge public and restricted metadata. The HMP16SData R/Bioconductor package developed by Schiffer et al. (Am J Epidemiol. 2019;188(6):1023-1026) greatly simplifies access to the HMP data by combining 16S taxonomic abundance data, public patient metadata, and phylogenetic trees as a single data object. The authors also provide an interface for users with approved Database of Genotypes and Phenotypes (dbGaP) projects to easily retrieve and merge the controlled-access HMP metadata. This package has a broad range of appeal to researchers across disciplines and with various levels of expertise in using R and/or other statistical tools, which translates to improved data accessibility for public health research, with data from healthy individuals serving as a reference for disease-associated studies.
|
45
|
Abstract
Knowledge of the subcellular location of a protein gives valuable insight into its function. The field of spatial proteomics has become increasingly popular due to improved multiplexing capabilities in high-throughput mass spectrometry, which have made it possible to systematically localise thousands of proteins per experiment. In parallel with these experimental advances, improved methods for analysing spatial proteomics data have also been developed. In this workflow, we demonstrate using `pRoloc` for the Bayesian analysis of spatial proteomics data. We detail the software infrastructure and then provide step-by-step guidance of the analysis, including setting up a pipeline, assessing convergence, and interpreting downstream results. In several places we provide additional details on Bayesian analysis to provide users with a holistic view of Bayesian analysis for spatial proteomics data.
|
46
|
GSEPD: a Bioconductor package for RNA-seq gene set enrichment and projection display. BMC Bioinformatics 2019; 20:115. [PMID: 30841846 PMCID: PMC6404334 DOI: 10.1186/s12859-019-2697-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/30/2018] [Accepted: 02/21/2019] [Indexed: 12/04/2022] Open
Abstract
Background: RNA-seq, wherein RNA transcripts expressed in a sample are sequenced and quantified, has become a widely used technique to study disease and development. With RNA-seq, transcript abundance can be measured, and differentially expressed genes between groups and the functional enrichment of those genes can be computed. However, biological insights from RNA-seq are often limited by computational analysis and the enormous volume of resulting data, preventing facile and meaningful review and interpretation of gene expression profiles. In particular, where the samples under study exhibit uncontrolled variation, deeper analysis of functional enrichment is necessary to visualize samples’ gene expression activity under each biological function. Results: We developed a Bioconductor package, rgsepd, that streamlines RNA-seq data analysis by wrapping the commonly used tools DESeq2 and GOSeq in a user-friendly interface and performs a gene-subset linear projection to cluster heterogeneous samples by Gene Ontology (GO) terms. rgsepd computes significantly enriched GO terms for each experimental condition and generates multidimensional projection plots highlighting how each predefined gene set’s multidimensional expression may delineate samples. Conclusions: rgsepd serves to automate differential expression, functional annotation, and exploratory data analyses to highlight subtle expression differences among samples based on each significant biological function.
|
47
|
restfulSE: A semantically rich interface for cloud-scale genomics with Bioconductor. F1000Res 2019; 8:21. [PMID: 30828438 PMCID: PMC6392152 DOI: 10.12688/f1000research.17518.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/19/2018] [Indexed: 11/20/2022] Open
Abstract
Bioconductor's SummarizedExperiment class unites numerical assay quantifications with sample- and experiment-level metadata. SummarizedExperiment is the standard Bioconductor class for assays that produce matrix-like data, used by over 200 packages. We describe the restfulSE package, a deployment of this data model that supports remote storage. We illustrate use of SummarizedExperiment with remote HDF5 and Google BigQuery back ends, with two applications in cancer genomics. Our intent is to allow the use of familiar and semantically meaningful programmatic idioms to query genomic data, while abstracting the remote interface from end users and developers.
|
48
|
IMMAN: an R/Bioconductor package for Interolog protein network reconstruction, mapping and mining analysis. BMC Bioinformatics 2019; 20:73. [PMID: 30755155 PMCID: PMC6373071 DOI: 10.1186/s12859-019-2659-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/06/2018] [Accepted: 01/28/2019] [Indexed: 12/15/2022] Open
Abstract
Background: Reconstruction of protein-protein interaction networks (PPINs) has been riddled with controversy for decades. In particular, false-negative and false-positive interactions make this process even more complicated. Also, the lack of a standard PPIN limits comparison studies and leads to incompatible outcomes. Using an evolution-based concept, the interolog, which refers to interacting orthologous protein sets, paves the way toward an optimal benchmark. Results: Here, we provide an R package, IMMAN, as a tool for reconstructing an Interolog Protein Network (IPN) by integrating several protein-protein interaction networks (PPINs). Users can unify different PPINs to mine conserved common networks among species. IMMAN is designed to retrieve IPNs with different degrees of conservation to support prediction of protein functions according to their networks. Conclusions: An IPN consists of evolutionarily conserved nodes and their related edges with low false-positive rates, and can be considered a gold standard network in biological network analysis with respect to the PPINs from which it is derived. Electronic supplementary material: The online version of this article (10.1186/s12859-019-2659-y) contains supplementary material, which is available to authorized users.
|
49
|
Abstract
Bioconductor is a widely used R-based platform for genomics, but its host of complex genomic data structures places a cognitive burden on the user. For most tasks, the GRanges object would suffice, but there are gaps in the API that prevent its general use. By recognizing that the GRanges class follows “tidy” data principles, we create a grammar of genomic data transformation, defining verbs for performing actions on and between genomic interval data and providing a way of performing common data analysis tasks through a coherent interface to existing Bioconductor infrastructure. We implement this grammar as a Bioconductor/R package called plyranges.
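The idea of "verbs" acting on and between genomic intervals can be sketched outside R as well. The Python example below models an interval as one tidy record and defines two small verbs, one acting on a single interval and one acting between two sets of intervals. It is only an analogy: the `Interval` class and the `shift` and `filter_by_overlap` functions are hypothetical names invented for this sketch, not the plyranges API. Coordinates are assumed to be 1-based and closed, and strand is ignored when testing for overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One genomic range as a tidy record: one row, one field per variable."""
    seqname: str
    start: int  # 1-based, inclusive (assumed convention for this sketch)
    end: int    # inclusive
    strand: str = "*"


def overlaps(a, b):
    """True if two intervals share at least one position on the same sequence."""
    return a.seqname == b.seqname and a.start <= b.end and b.start <= a.end


def shift(interval, offset):
    """Verb on a single interval: move it by offset positions."""
    return Interval(interval.seqname, interval.start + offset,
                    interval.end + offset, interval.strand)


def filter_by_overlap(queries, subjects):
    """Verb between two sets: keep the queries that overlap any subject."""
    return [q for q in queries if any(overlaps(q, s) for s in subjects)]
```

Given genes on chr1 at 100-200 and 500-600 and a peak on chr1 at 190-210, `filter_by_overlap(genes, [peak])` keeps only the first gene. The scan over all subjects is quadratic and is chosen for clarity; interval trees or sorted sweeps are the usual choice for genome-scale data.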
|
50
|
Abstract
The importance of bioinformatics, computational biology, and data science in biomedical research continues to grow, driving a need for effective instruction and education. A workshop setting, with lectures and guided hands-on tutorials, is a common approach to teaching practical computational and analytical methods. Here, we detail the process we used to produce high-quality, community-authored educational materials that are available for public consumption and reuse. The coordinated efforts of 17 authors over 10 weeks resulted in 15 workshops available as a website and as a 388-page electronic book. We describe how we utilized cloud infrastructure, GitHub, and a literate programming approach to robustly deliver hands-on tutorials to participants of the annual Bioconductor conference. The scripts, raw and published workshop materials, and cloud machine image are all openly available. Our approach uses free services and software and can be adapted by workshop organizers and authors in other contexts with appropriate technical backgrounds.
|