1. An integrated computational pipeline for machine learning-driven diagnosis based on Raman spectra of saliva samples. Comput Biol Med 2024; 171:108028. [PMID: 38335817] [DOI: 10.1016/j.compbiomed.2024.108028]
Abstract
Raman spectroscopy promises to encode in spectral data the significant differences between biological samples from patients affected by a disease and samples from healthy controls. However, decoding and interpreting the Raman spectral fingerprint remains a difficult and time-consuming procedure, even for domain experts. In this work, we test an end-to-end deep-learning diagnostic pipeline able to classify spectral data from saliva samples. The pipeline has been validated for SARS-CoV-2 infection and for the screening of neurodegenerative diseases such as Parkinson's and Alzheimer's disease. The proposed system can be used for the fast prototyping of promising non-invasive, cost- and time-efficient diagnostic screening tests.
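The paper uses a deep-learning classifier; as a much simpler stand-in, the core idea of classifying spectra against labeled training samples can be sketched with a nearest-centroid rule. The labels, channel counts, and intensity values below are invented toy data, not the paper's saliva spectra.

```python
# Illustrative nearest-centroid classifier for labeled spectra.
# NOT the paper's deep-learning pipeline -- a minimal sketch on toy data.

def centroid(spectra):
    """Element-wise mean of a list of equal-length spectra."""
    n = len(spectra)
    return [sum(vals) / n for vals in zip(*spectra)]

def classify(spectrum, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(centroids, key=lambda label: dist(spectrum, centroids[label]))

# Toy "spectra": disease samples peak at channel 1, controls at channel 3.
train = {
    "disease": [[0.1, 0.9, 0.2, 0.1], [0.2, 0.8, 0.1, 0.2]],
    "control": [[0.1, 0.2, 0.2, 0.9], [0.2, 0.1, 0.1, 0.8]],
}
centroids = {label: centroid(s) for label, s in train.items()}
print(classify([0.1, 0.85, 0.15, 0.1], centroids))  # -> disease
```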
2. Personalized tumor combination therapy optimization using the single-cell transcriptome. Genome Med 2023; 15:105. [PMID: 38041202] [PMCID: PMC10691165] [DOI: 10.1186/s13073-023-01256-6]
Abstract
BACKGROUND The precise characterization of individual tumors and their immune microenvironments using transcriptome sequencing has created a great opportunity for successful personalized cancer treatment. However, treatment response is often characterized by in vitro assays or bulk transcriptomes that neglect the in vivo heterogeneity of malignant tumors and the immune microenvironment, motivating the use of single-cell transcriptomes for personalized cancer treatment. METHODS Here, we present comboSC, a computational proof-of-concept study exploring the feasibility of personalized cancer combination therapy optimization using single-cell transcriptomes. ComboSC stratifies individual patient samples by quantitatively evaluating their personalized immune microenvironment from single-cell RNA sequencing. It then maximizes the translational potential of in vitro cellular responses to identify, from a large collection of small molecules and drugs, synergistic drug/small-molecule combinations or small molecules that can be paired with immune checkpoint inhibitors to boost immunotherapy, and finally prioritizes them for personalized clinical use via bipartition graph optimization. RESULTS We apply comboSC to 119 publicly available single-cell transcriptome datasets from 119 tumor samples spanning 15 cancer types, and validate the predicted drug combinations with literature evidence, clinical trial data mining, perturbation of patient-derived cell lines, and in vivo samples.
CONCLUSIONS Overall, comboSC provides a feasible, one-stop computational prototype and proof-of-concept study for predicting potential drug combinations from single-cell transcriptomes for further experimental validation and clinical use. By reducing screening time across a large drug-combination space, it can facilitate and accelerate personalized tumor treatment and save valuable treatment time for individual patients. A user-friendly web server for both clinical and research users is available at www.combosc.top. The source code is available on GitHub at https://github.com/bm2-lab/comboSC.
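ComboSC's actual prioritization uses bipartition graph optimization over single-cell data; as a drastically simplified sketch of the underlying idea (score every candidate pair, keep the best), one can exhaustively rank pairs by a combined score. The drug names, per-drug efficacy values, and pairwise synergy terms below are invented toy numbers, not comboSC's model.

```python
# Toy ranking of drug pairs by per-drug score plus a pairwise synergy bonus.
# All values are invented; this is a sketch, not comboSC's algorithm.
from itertools import combinations

efficacy = {"drugA": 0.6, "drugB": 0.4, "drugC": 0.7}    # hypothetical scores
synergy = {("drugA", "drugB"): 0.3, ("drugA", "drugC"): -0.2,
           ("drugB", "drugC"): 0.1}                      # hypothetical terms

def pair_score(a, b):
    key = tuple(sorted((a, b)))
    return efficacy[a] + efficacy[b] + synergy.get(key, 0.0)

best = max(combinations(sorted(efficacy), 2), key=lambda p: pair_score(*p))
print(best)  # -> ('drugA', 'drugB'), since 0.6 + 0.4 + 0.3 beats the others
```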
3. SUsPECT: a pipeline for variant effect prediction based on custom long-read transcriptomes for improved clinical variant annotation. BMC Genomics 2023; 24:305. [PMID: 37280537] [PMCID: PMC10245480] [DOI: 10.1186/s12864-023-09391-5]
Abstract
Our incomplete knowledge of the human transcriptome impairs the detection of disease-causing variants, in particular if they affect transcripts expressed only under certain conditions. These transcripts are often missing from reference transcript sets, such as Ensembl/GENCODE and RefSeq, yet could be relevant for establishing genetic diagnoses. We present SUsPECT (Solving Unsolved Patient Exomes/gEnomes using Custom Transcriptomes), a pipeline based on the Ensembl Variant Effect Predictor (VEP) that predicts variant impact on custom transcript sets, such as those generated by long-read RNA sequencing, for downstream prioritization. Our pipeline predicts the functional consequence and likely deleteriousness scores of missense variants in the context of novel open reading frames predicted from any transcriptome. We demonstrate the utility of SUsPECT by uncovering potential mutational mechanisms of pathogenic ClinVar variants that are not predicted to be pathogenic using the reference transcript annotation. In further support of its utility, we identified an enrichment of immune-related variants predicted to have a more severe molecular consequence when annotated with a newly generated transcriptome from stimulated immune cells rather than the reference transcriptome. Our pipeline outputs crucial information for further prioritization of potentially disease-causing variants for any disease and will become increasingly useful as more long-read RNA sequencing datasets become available.
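The core operation SUsPECT automates via VEP is re-annotating a substitution against an ORF from a custom transcript. A minimal illustration of that consequence call, with a codon table truncated to the few codons used and a toy ORF (not SUsPECT's actual code or data):

```python
# Toy missense/stop-gained consequence call against an in-frame ORF.
# Codon table is deliberately truncated; ORF and variants are invented.
CODON = {"ATG": "M", "AAA": "K", "GAA": "E", "TAA": "*"}

def consequence(orf, pos, alt):
    """Classify a substitution at 0-based position `pos` of an in-frame ORF."""
    ref_codon = orf[pos - pos % 3 : pos - pos % 3 + 3]
    alt_codon = list(ref_codon)
    alt_codon[pos % 3] = alt
    alt_codon = "".join(alt_codon)
    ref_aa, alt_aa = CODON[ref_codon], CODON[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    return "stop_gained" if alt_aa == "*" else "missense"

orf = "ATGAAAGAA"                  # encodes M K E
print(consequence(orf, 3, "G"))    # AAA -> GAA (K -> E): missense
print(consequence(orf, 3, "T"))    # AAA -> TAA (K -> stop): stop_gained
```

The same variant can yield a different consequence under a different ORF, which is exactly why a custom transcriptome changes the annotation.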
4. A pipeline for testing drug mechanism of action and combination therapies: From microarray data to simulations via Linear-In-Flux-Expressions: Testing four-drug combinations for tuberculosis treatment. Math Biosci 2023; 360:108983. [PMID: 36931620] [DOI: 10.1016/j.mbs.2023.108983]
Abstract
Computational methods are becoming commonly used in many areas of medical research. Recently, the modeling of biological mechanisms associated with disease pathophysiology has benefited from approaches such as Quantitative Systems Pharmacology (QSP) and Physiologically Based Pharmacokinetics (PBPK). These methodologies show the potential to enhance, if not substitute for, animal models, mainly owing to their high accuracy and low cost. Solid mathematical foundations, such as compartmental systems and flux balance analysis, provide a good base on which to build computational tools. However, there are many choices to be made in model design that will have a large impact on how these methods perform as we scale up the network or perturb the system to uncover the mechanisms of action of new compounds or therapy combinations. A computational pipeline is presented here that starts with available -omic data and uses advanced mathematical simulations to inform the modeling of a biochemical system. Specific attention is devoted to creating a modular workflow, including mathematically rigorous tools to represent complex chemical reactions and to model drug action in terms of its impact on multiple pathways. An application to optimizing combination therapy for tuberculosis shows the potential of the approach.
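The compartmental-systems backbone such pipelines build on can be illustrated with a two-compartment flux simulation integrated by forward Euler. The rate constant, step size, and initial amounts below are arbitrary toy values, not the paper's tuberculosis model.

```python
# Minimal two-compartment flux simulation (forward Euler).
# Toy parameters; a sketch of the compartmental idea, not the paper's model.

def simulate(x1, x2, k, dt, steps):
    """Material flows from compartment 1 to compartment 2 at rate k * x1."""
    for _ in range(steps):
        flux = k * x1 * dt
        x1 -= flux
        x2 += flux
    return x1, x2

x1, x2 = simulate(100.0, 0.0, k=0.1, dt=0.01, steps=1000)
print(round(x1 + x2, 6))  # -> 100.0 (mass is conserved by construction)
```

A drug's mechanism of action would then be modeled as a perturbation of one or more rate constants, and combinations as simultaneous perturbations.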
5. Harnessing Single-Cell RNA Sequencing to Identify Dendritic Cell Types, Characterize Their Biological States, and Infer Their Activation Trajectory. Methods Mol Biol 2023; 2618:319-373. [PMID: 36905526] [DOI: 10.1007/978-1-0716-2938-3_22]
Abstract
Dendritic cells (DCs) orchestrate innate and adaptive immunity by translating the sensing of distinct danger signals into the induction of different effector lymphocyte responses, inducing the defense mechanisms best suited to the threat. Hence, DCs are very plastic, which results from two key characteristics. First, DCs encompass distinct cell types specialized in different functions. Second, each DC type can undergo different activation states, fine-tuning its functions depending on its tissue microenvironment and the pathophysiological context, by adapting the output signals it delivers to the input signals it receives. To better understand DC biology and harness it in the clinic, we must therefore determine which combinations of DC types and activation states mediate which functions, and how. To decipher the nature, functions, and regulation of DC types and their physiological activation states, one of the methods that can be harnessed most successfully is ex vivo single-cell RNA sequencing (scRNAseq). However, for new users of this approach, determining which analytic strategy and computational tools to choose can be quite challenging, given the rapid evolution and broad growth of the field. In addition, awareness must be raised of the need for specific, robust, and tractable strategies to annotate cells for cell type identity and activation state. It is also important to examine whether similar cell activation trajectories are inferred by different, complementary methods. In this chapter, we take these issues into account to provide a pipeline for scRNAseq analysis, illustrated with a tutorial reanalyzing a public dataset of mononuclear phagocytes isolated from the lungs of naïve or tumor-bearing mice.
We describe this pipeline step by step, including data quality control, dimensionality reduction, cell clustering, cell cluster annotation, inference of cell activation trajectories, and investigation of the underpinning molecular regulation. It is accompanied by a more complete tutorial on GitHub. We hope that this method will be helpful for both wet-lab and bioinformatics researchers interested in harnessing scRNAseq data to decipher the biology of DCs or other cell types, and that it will contribute to establishing high standards in the field.
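The first step listed above, data quality control, typically means discarding cells with too few detected genes (likely empty droplets) or implausibly deep counts (possible doublets). A minimal sketch of that filter on an invented toy count matrix; the thresholds and cell names are illustrative, not the chapter's actual values:

```python
# Toy scRNAseq QC: keep cells with enough detected genes and sane depth.
# Thresholds and data are invented for illustration.

def qc_filter(cells, min_genes=200, max_counts=10000):
    keep = {}
    for name, counts in cells.items():
        n_genes = sum(1 for c in counts if c > 0)
        if n_genes >= min_genes and sum(counts) <= max_counts:
            keep[name] = counts
    return keep

cells = {
    "good":    [5, 3, 1, 0],     # enough genes, plausible depth
    "empty":   [1, 0, 0, 0],     # likely empty droplet
    "doublet": [80, 40, 30, 20], # suspiciously deep -> possible doublet
}
kept = qc_filter(cells, min_genes=2, max_counts=100)
print(sorted(kept))  # -> ['good']
```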
6. Comparative Genomics for Evolutionary Cell Biology Using AMOEBAE: Understanding the Golgi and Beyond. Methods Mol Biol 2022; 2557:431-452. [PMID: 36512230] [DOI: 10.1007/978-1-0716-2639-9_26]
Abstract
Taking an evolutionary approach to cell biology can yield important new information about how the cell works and how it evolved to do so. This is as true of the Golgi apparatus as of any system within the cell. Comparative genomics is one of the crucial first steps in this line of research, but it comes with technical challenges that must be overcome for rigor and robustness. Here we introduce AMOEBAE, a workflow for mid-range-scale comparative genomic analyses. It allows customization of parameters, queries, and taxonomic sampling of genomic and transcriptomic data. This protocol article covers the rationale for an evolutionary approach to cell-biological study (i.e., when AMOEBAE would be useful), how to use AMOEBAE, and its limitations. It also provides an example dataset, which demonstrates that the Golgi protein AP4 Epsilon is present as the sole retained subunit of the AP4 complex in basidiomycete fungi. AMOEBAE can facilitate comparative genomic studies by balancing reproducibility and speed with user input and interpretation. It is hoped that AMOEBAE or similar tools will encourage cell biologists to incorporate an evolutionary context into their research.
7. Bioinformatic Analysis of Circular RNA Expression. Methods Mol Biol 2021; 2348:343-370. [PMID: 34160817] [DOI: 10.1007/978-1-0716-1581-2_22]
Abstract
Circular RNAs (circRNAs) are stable RNA molecules generated by backsplicing that exert regulatory functions through interactions with other RNAs and proteins, as well as by encoding peptides. Dysregulation of circRNA expression can drive cancer development and progression through different mechanisms. CircRNAs are currently regarded as extremely attractive molecules in cancer research, both for the identification of new and possibly targetable disease regulatory networks and for the development of biomarkers for cancer diagnosis, prognosis definition, and monitoring. Using specific experimental and computational protocols, circRNAs can be identified in RNA-seq data by spotting the reads spanning backsplice junctions, which are specific to circular molecules. In this chapter, we report a state-of-the-art computational protocol for genome-wide analysis of circRNAs from RNA-seq data, covering circRNA detection, quantification, and differential expression testing. Finally, we indicate how to determine circular transcript sequences and list resources for in silico functional characterization of circRNAs.
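The backsplice-junction signal mentioned above can be sketched very simply: a read split into segments whose genomic order is reversed relative to read order is a candidate backsplice read. The coordinates below are invented toy values, and real callers also check strand, gene boundaries, and alignment quality.

```python
# Toy backsplice detection: flag reads whose later segment maps *upstream*
# of an earlier one. Coordinates are invented; real tools do much more.

def is_backsplice(segments):
    """segments: list of (start, end) genomic coords, in read order."""
    return any(b_start < a_start
               for (a_start, _), (b_start, _) in zip(segments, segments[1:]))

linear_read = [(100, 150), (200, 250)]  # normal splice: 5' -> 3'
circ_read = [(200, 250), (100, 150)]    # second part maps upstream: backsplice
print(is_backsplice(linear_read), is_backsplice(circ_read))  # -> False True
```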
8.
Abstract
BACKGROUND Genomic microsatellites are genomic regions consisting of short, repetitive DNA motifs. Estimating the length distribution and state of a microsatellite region is an important computational step in cancer sequencing pipelines and is suggested to facilitate downstream analysis and clinical decision support. Although several state-of-the-art approaches have been proposed to identify microsatellite instability (MSI) events, they are limited in handling regions longer than one read length. Moreover, to the best of our knowledge, all of these approaches assume that the tumor purity of the sequenced samples is sufficiently high, which is inconsistent with reality; as a result, the inferred length distributions dilute the data signal and introduce false-positive errors. RESULTS In this article, we propose a computational approach, ELMSI, which detects MSI events from next-generation sequencing data. ELMSI estimates the specific length distributions and states of microsatellite regions from a mixed tumor sample paired with a control. It first estimates the purity of the tumor sample from the read counts of filtered SNV loci. The algorithm then identifies the length distributions and states of short microsatellites by adding a Maximum Likelihood Estimation (MLE) step to an existing algorithm. ELMSI then infers the length distributions of long microsatellites by combining a simplified Expectation Maximization (EM) algorithm with the central limit theorem, and uses statistical tests to output their states. In our experiments, ELMSI handled microsatellites with lengths ranging from shorter than one read length to 10 kbp.
CONCLUSIONS To verify the reliability of our algorithm, we first compared its ability to classify shorter microsatellites in mixed samples against the existing algorithm MSIsensor. We then varied the number of microsatellite regions, the read length, and the sequencing coverage to test ELMSI's performance in estimating longer microsatellites from mixed samples. ELMSI performed well on mixed samples and is therefore of great value for improving the recognition of microsatellite regions and supporting clinical decision-making. The source code is available at https://github.com/YixuanWang1120/ELMSI for academic use only.
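ELMSI's first step, purity estimation from SNV read counts, can be caricatured with the standard diploid heuristic: clonal heterozygous SNVs have a variant allele fraction of roughly purity/2, so purity ≈ 2 × median VAF. This is a stand-in for the idea only, not ELMSI's actual estimator, and the read counts are invented.

```python
# Simplified tumor-purity estimate from somatic SNV read counts.
# Assumes diploid, clonal heterozygous SNVs; NOT ELMSI's algorithm.
from statistics import median

def estimate_purity(snvs):
    """snvs: list of (alt_reads, total_reads) at filtered SNV loci."""
    vafs = [alt / total for alt, total in snvs]
    return min(1.0, 2 * median(vafs))

snvs = [(30, 100), (28, 100), (33, 100)]  # VAFs around 0.3
print(estimate_purity(snvs))  # -> 0.6
```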
9. High-throughput computational pipeline for 3-D structure preparation and in silico protein surface property screening: A case study on HBcAg dimer structures. Int J Pharm 2019; 563:337-346. [PMID: 30935914] [DOI: 10.1016/j.ijpharm.2019.03.057]
Abstract
Knowledge-based experimental design can aid the biopharmaceutical high-throughput screening (HTS) experiments needed to identify critical manufacturability parameters. Prior knowledge can be obtained via computational methods such as protein property extraction from 3-D protein structures. This study presents a high-throughput 3-D structure preparation and refinement pipeline that supports structure screenings with an automated, data-dependent workflow. As a case study, three chimeric virus-like particle (VLP) building blocks, hepatitis B core antigen (HBcAg) dimers, were constructed. Molecular dynamics (MD) refinement quality, speed, stability, and correlation with zeta potential data were evaluated using different MD simulation settings, including two force fields (YASARA2 and AMBER03) and two pKa computation methods (YASARA and H++). MD simulations used a data-dependent termination via identification of a 2 ns Window of Stability, which was also used for robust descriptor extraction. MD simulation with YASARA2, independent of the pKa computation method, was found to be the most stable and computationally efficient. These settings resulted in fast refinement (6.6-37.5 h), good structure quality (-1.17 to -1.13), and a strong linear dependence between dimer surface charge and the zeta potential of the complete chimeric HBcAg VLP. These results indicate the computational pipeline's applicability for early-stage candidate assessment and for the design optimization of HTS manufacturability or formulability experiments.
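The data-dependent termination idea, stopping once a Window of Stability is found, amounts to scanning a trajectory metric for the first window whose spread stays within a tolerance. The trace values, window width, and tolerance below are invented toy numbers, not the paper's 2 ns RMSD criterion.

```python
# Toy "window of stability" scan over a simulation trace.
# Values, width, and tolerance are invented for illustration.

def find_stable_window(trace, width, tol):
    """Return the start index of the first window with spread <= tol."""
    for i in range(len(trace) - width + 1):
        window = trace[i:i + width]
        if max(window) - min(window) <= tol:
            return i
    return None  # no stable window: keep simulating

trace = [2.0, 1.5, 1.2, 1.05, 1.02, 1.04, 1.03, 1.01]
print(find_stable_window(trace, width=4, tol=0.05))  # -> 3
```

Descriptors would then be extracted only from frames inside the returned window, where the structure has settled.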
10.
Abstract
BACKGROUND Haplotype phasing is an important step in many bioinformatics workflows. In cancer genomics, reconstructing the clonal haplotypes of a tumor sample could facilitate a comprehensive understanding of its clonal architecture and provide a valuable reference for clinical diagnosis and treatment. However, sequencing data are an admixture of reads sampled from different clonal haplotypes, which complicates the computational problem by exponentially increasing the solution space and drives existing algorithms to unacceptable time and space complexity. In addition, the evolutionary process among clonal haplotypes further weakens those algorithms by introducing indistinguishable candidate solutions. RESULTS To improve the performance of clonal haplotype phasing, we propose MixSubHap, a graph-based computational pipeline for cancer sequencing data. To reduce computational complexity, MixSubHap adopts three bounding strategies to limit the solution space and filter out false-positive candidates. It first estimates the global clonal structure by clustering the variant allele frequencies of sampled point mutations, which provides a prior on the number of clonal haplotypes when copy-number variations are not considered. It then uses a greedy extension algorithm to approximately find the longest linkage of locally assembled contigs. Finally, it applies a read-depth stripping algorithm to filter out false linkages according to the posterior estimate of tumor purity and the estimated percentage of each subclone in the sample. A series of experiments verify the performance of the proposed pipeline. CONCLUSIONS The results demonstrate that MixSubHap identifies, on average, about 90% of the preset clonal haplotypes under different simulation configurations.
In particular, MixSubHap is robust to decreasing mutation rates: the longest assembled contig can reach 10 kbp while the accuracy of assigning a mutation to its haplotype remains above 60% on average. MixSubHap is thus a practical algorithm for reconstructing clonal haplotypes from cancer sequencing data. The source code is available at https://github.com/YixuanWang1120/MixSubHap for academic use only.
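The first bounding step, clustering variant allele frequencies to bound the number of clones, can be caricatured with a simple 1-D gap-based clustering. The VAF values and gap threshold below are invented toy numbers; MixSubHap's actual clustering is more sophisticated.

```python
# Toy estimate of the number of clonal populations by clustering VAFs:
# start a new cluster whenever consecutive sorted VAFs differ by > gap.
# Values and threshold are invented; not MixSubHap's actual method.

def count_clones(vafs, gap=0.08):
    vafs = sorted(vafs)
    clusters = 1
    for a, b in zip(vafs, vafs[1:]):
        if b - a > gap:
            clusters += 1
    return clusters

vafs = [0.11, 0.12, 0.13, 0.29, 0.31, 0.48, 0.50]
print(count_clones(vafs))  # -> 3 (clusters near 0.12, 0.30, 0.49)
```

The cluster count then serves as the prior on the number of clonal haplotypes that the later graph steps must reconstruct.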
11. Next Generation-Targeted Amplicon Sequencing (NG-TAS): an optimised protocol and computational pipeline for cost-effective profiling of circulating tumour DNA. Genome Med 2019; 11:1. [PMID: 30609936] [PMCID: PMC6320579] [DOI: 10.1186/s13073-018-0611-9]
Abstract
Circulating tumour DNA (ctDNA) detection and monitoring have enormous potential clinical utility in oncology. We describe here a fast, flexible and cost-effective method to profile multiple genes simultaneously in low-input cell-free DNA (cfDNA): Next Generation-Targeted Amplicon Sequencing (NG-TAS). We designed a panel of 377 amplicons spanning 20 cancer genes and tested the NG-TAS pipeline using cell-free DNA from two HapMap lymphoblastoid cell lines. NG-TAS consistently detected mutations in cfDNA when the mutation allele fraction was >1%. We applied NG-TAS to a clinical cohort of metastatic breast cancer patients, demonstrating its potential for disease monitoring. The computational pipeline is available at https://github.com/cclab-brca/NGTAS_pipeline.
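The >1% detection threshold quoted above is just an allele-fraction cutoff on read counts; a minimal sketch, with invented read counts (the real pipeline applies error modeling well beyond this):

```python
# Toy ctDNA mutation call at a 1% allele-fraction threshold.
# Read counts are invented; real calling also models sequencing error.

def call_mutation(alt_reads, total_reads, min_af=0.01):
    af = alt_reads / total_reads
    return af > min_af, af

called, af = call_mutation(150, 10000)
print(called, af)  # -> True 0.015
```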
12.
Abstract
Drug repurposing is a methodology in which existing drugs are tested against diseases outside their initial indication, in order to reduce the high cost and long timelines of new drug development. In silico drug repurposing further speeds up the process by testing a large number of drugs against the biological signatures of known diseases. In this chapter, we present a step-by-step methodology for a transcriptomics-based computational drug repurposing pipeline, providing a comprehensive guide to the whole procedure, from proper dataset selection to the derivation of a short list of repurposed drugs that might act as inhibitors of the studied disease. The pipeline covers the selection and curation of suitable transcriptomics datasets, statistical analysis of the datasets to extract the top over- and under-expressed gene identifiers, appropriate identifier conversion, drug repurposing analysis, filtering of repurposed drugs, cross-tool screening, drug-list re-ranking, and validation of the results.
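The matching of disease signatures to drugs described above is often scored by signature reversal: a drug that down-regulates the disease's over-expressed genes (and up-regulates its under-expressed ones) scores best. The gene names, drug names, and expression changes below are invented toy data, and this additive score is a simplification of connectivity-style methods.

```python
# Toy signature-reversal score for drug repurposing.
# Gene sets, drugs, and values are invented; a sketch, not a real method.

def reversal_score(disease_up, disease_down, drug_sig):
    """drug_sig maps gene -> signed expression change under the drug."""
    up = sum(drug_sig.get(g, 0.0) for g in disease_up)
    down = sum(drug_sig.get(g, 0.0) for g in disease_down)
    return up - down  # more negative = drug reverses the disease signature

disease_up, disease_down = {"TP53", "MYC"}, {"GATA3"}
drugs = {
    "drugX": {"TP53": -0.8, "MYC": -0.5, "GATA3": 0.4},  # reverses signature
    "drugY": {"TP53": 0.6, "MYC": 0.2, "GATA3": -0.3},   # mimics it
}
ranked = sorted(drugs,
                key=lambda d: reversal_score(disease_up, disease_down, drugs[d]))
print(ranked[0])  # -> drugX
```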
13. VacSol: a high throughput in silico pipeline to predict potential therapeutic targets in prokaryotic pathogens using subtractive reverse vaccinology. BMC Bioinformatics 2017; 18:106. [PMID: 28193166] [PMCID: PMC5307925] [DOI: 10.1186/s12859-017-1540-0]
Abstract
Background With advances in reverse vaccinology, progressive improvement has been observed in the prediction of putative vaccine candidates. Reverse vaccinology has changed the discovery process and provides a means for target identification with reduced time and labour. High-throughput genomic sequencing technologies and supporting bioinformatics tools have greatly facilitated the prompt analysis of pathogens, and various predicted candidates have been found effective against certain infections and diseases. We designed VacSol, a pipeline based on a similar approach, to predict putative vaccine candidates rapidly and efficiently. Results VacSol is a highly scalable, multi-mode, configurable software package that automates the high-throughput in silico prediction of putative vaccine candidates from the proteomes of bacterial pathogens. Taking a proteome sequence in FASTA format as input, VacSol screens candidates using integrated, well-known and robust algorithms/tools for proteome analysis, and presents results in five different formats. The utility of VacSol was tested against published data using the Helicobacter pylori 26695 reference strain as a benchmark. Conclusion VacSol rapidly and efficiently screens whole bacterial proteomes to identify a few putative vaccine candidate proteins. The pipeline saves computational cost and time by efficiently reducing false-positive candidate hits. VacSol results do not depend on any universal set of rules and may vary with the provided input. It is freely available to download from https://sourceforge.net/projects/vacsol/.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1540-0) contains supplementary material, which is available to authorized users.
14. HiC-bench: comprehensive and reproducible Hi-C data analysis designed for parameter exploration and benchmarking. BMC Genomics 2017; 18:22. [PMID: 28056762] [PMCID: PMC5217551] [DOI: 10.1186/s12864-016-3387-6]
Abstract
BACKGROUND Chromatin conformation capture techniques have evolved rapidly over the last few years and have provided new insights into genome organization at unprecedented resolution. Analysis of Hi-C data is complex and computationally intensive, involving multiple tasks and requiring robust quality assessment; this has led to the development of several tools and methods for processing Hi-C data. However, most existing tools do not cover all aspects of the analysis and offer only a few quality-assessment options. Additionally, the multitude of available tools leaves scientists wondering how these tools and their associated parameters can be used optimally, and how potential discrepancies can be interpreted and resolved. Most importantly, investigators need assurance that slight changes in parameters and/or methods do not affect the conclusions of their studies. RESULTS To address these issues (compare, explore, reproduce), we introduce HiC-bench, a configurable computational platform for comprehensive and reproducible analysis of Hi-C sequencing data. HiC-bench performs all common Hi-C analysis tasks, such as alignment, filtering, contact-matrix generation and normalization, identification of topological domains, and scoring and annotation of specific interactions, using both published tools and our own. We have also embedded tasks that perform quality assessment and visualization. HiC-bench is implemented as a data-flow platform with an emphasis on analysis reproducibility. Additionally, the user can readily perform parameter exploration and comparison of different tools in a combinatorial manner that takes into account all desired parameter settings in each pipeline task. This unique feature facilitates the design and execution of complex benchmark studies that may involve combinations of multiple tool/parameter choices in each step of the analysis.
To demonstrate the usefulness of our platform, we performed a comprehensive benchmark of existing and new TAD callers, exploring different matrix-correction methods, parameter settings, and sequencing depths. Users can extend our pipeline by adding more tools as they become available. CONCLUSIONS HiC-bench is an easy-to-use and extensible platform for comprehensive analysis of Hi-C datasets. We expect that it will facilitate current analyses and help scientists formulate and test new hypotheses in the field of three-dimensional genome organization.
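The combinatorial parameter exploration described above boils down to running every combination in a parameter grid. A minimal sketch with `itertools.product`; the parameter names, values, and the stub `run_task` are invented placeholders, not HiC-bench's actual options.

```python
# Toy combinatorial parameter exploration over a settings grid.
# Parameter names/values and run_task are invented placeholders.
from itertools import product

grid = {
    "aligner": ["bowtie2", "bwa"],
    "resolution": [10000, 40000],
    "norm": ["iterative", "none"],
}

def run_task(settings):
    """Stand-in for one pipeline run under one settings combination."""
    return {"settings": settings, "ok": True}

runs = [run_task(dict(zip(grid, combo))) for combo in product(*grid.values())]
print(len(runs))  # -> 8 (2 aligners x 2 resolutions x 2 normalizations)
```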
15. CoreFlow: a computational platform for integration, analysis and modeling of complex biological data. J Proteomics 2014; 100:167-73. [PMID: 24503186] [DOI: 10.1016/j.jprot.2014.01.023]
Abstract
UNLABELLED A major challenge in mass spectrometry and other large-scale applications is how to handle, integrate, and model the data produced. Given the speed at which technology advances and the need to keep pace with biological experiments, we designed a computational platform, CoreFlow, which provides programmers with a framework to manage data in real time. It allows users to upload data into a relational database (MySQL) and to create custom scripts in high-level languages such as R, Python, or Perl for processing, correcting, and modeling these data. CoreFlow organizes these scripts into project-specific pipelines, tracks interdependencies between related tasks, and enables the generation of summary reports as well as publication-quality images. As a result, the gap between the experimental and computational components of a typical large-scale biology project is reduced, decreasing the time between data generation, analysis, and manuscript writing. CoreFlow is being released to the scientific community as an open-source software package complete with proteomics-specific examples, which include corrections for incomplete isotopic labeling of peptides (SILAC) or arginine-to-proline conversion, and modeling of multiple/selected reaction monitoring (MRM/SRM) results. BIOLOGICAL SIGNIFICANCE CoreFlow was purposely designed as an environment in which programmers can rapidly perform data analysis. These analyses are assembled into project-specific workflows that are readily shared with biologists to guide the next stages of experimentation. Its simple yet powerful interface provides a structure where scripts can be written and tested virtually simultaneously, shortening the life cycle of code development for a particular task. The scripts are exposed at every step so that a user can quickly see the relationships between the data, the assumptions that have been made, and the manipulations that have been performed.
Since the scripts use commonly available programming languages, they can easily be transferred to and from other computational environments for debugging or faster processing. This focus on 'on the fly' analysis sets CoreFlow apart from other workflow applications that require wrapping scripts into particular formats and developing specific user interfaces. Importantly, current and future releases of data analysis scripts in CoreFlow format will be of widespread benefit to the proteomics community, not only for uptake and use in individual labs, but also to enable full scrutiny of all analysis steps, thus increasing experimental reproducibility and decreasing errors. This article is part of a Special Issue entitled: Can Proteomics Fill the Gap Between Genomics and Phenotypes?
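The interdependency tracking described above is, at its core, a topological ordering of tasks over a dependency graph. A minimal sketch with invented task names (not CoreFlow's actual scheduler or API):

```python
# Toy dependency scheduler: order tasks so each runs after its prerequisites.
# Task names are invented; this sketches the idea, not CoreFlow's internals.

def schedule(deps):
    """deps: task -> set of prerequisite tasks. Returns a valid run order."""
    order, done = [], set()
    while len(done) < len(deps):
        ready = [t for t in deps if t not in done and deps[t] <= done]
        if not ready:
            raise ValueError("cyclic dependency")
        for t in sorted(ready):  # sorted() keeps the order deterministic
            order.append(t)
            done.add(t)
    return order

deps = {"load": set(), "correct": {"load"},
        "model": {"correct"}, "report": {"model", "correct"}}
print(schedule(deps))  # -> ['load', 'correct', 'model', 'report']
```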