1
|
The effect of data transformation on low-dimensional integration of single-cell RNA-seq. BMC Bioinformatics 2024; 25:171. [PMID: 38689234 PMCID: PMC11059821 DOI: 10.1186/s12859-024-05788-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Accepted: 04/16/2024] [Indexed: 05/02/2024] Open
Abstract
BACKGROUND Recent developments in single-cell RNA sequencing have opened up a multitude of possibilities to study tissues at the level of cellular populations. However, the heterogeneity in single-cell sequencing data necessitates appropriate procedures to adjust for technological limitations and various sources of noise when integrating datasets from different studies. While many analysis procedures employ various preprocessing steps, they often overlook the importance of selecting and optimizing the employed data transformation methods. RESULTS This work investigates data transformation approaches used in single-cell clustering analysis tools and their effects on batch integration analysis. In particular, we compare 16 transformations and their impact on the low-dimensional representations, aiming to reduce the batch effect and integrate multiple single-cell sequencing datasets. Our results show that data transformations strongly influence the results of single-cell clustering in low-dimensional spaces, such as those generated by UMAP or PCA. Moreover, these changes in low-dimensional space also significantly affect trajectory analysis using multiple datasets. However, the performance of the data transformations varies greatly across datasets, and the optimal method was different for each dataset. Additionally, we explored how data transformation impacts the analysis of deep feature encodings using deep neural network-based models, including autoencoder-based models and prototypical networks. Data transformation also strongly affects the outcome of deep neural network models. CONCLUSIONS Our findings suggest that the batch effect and noise in integrative analysis are highly influenced by data transformation. Low-dimensional features can integrate different batches well when proper data transformation is applied. Furthermore, we found that the batch mixing score on low-dimensional space can guide the selection of the optimal data transformation. In conclusion, data preprocessing is one of the most crucial analysis steps and needs to be cautiously considered in the integrative analysis of multiple scRNA-seq datasets.
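To make the notion of a data transformation concrete, the following minimal sketch applies library-size normalization followed by a log1p transform to a small count matrix. This is one widely used transformation for scRNA-seq counts; the abstract does not list the 16 transformations compared, so whether this one is among them is an assumption. The function name `log_normalize` and the scale factor of 10,000 are illustrative choices, not taken from the paper.

```python
import math

def log_normalize(counts, scale=1e4):
    """Library-size normalize each cell (row), then apply log(1 + x).

    `counts` is a list of cells, each a list of raw gene counts.
    The scale factor is an illustrative assumption.
    """
    transformed = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # A cell with no reads stays all-zero rather than dividing by zero.
            transformed.append([0.0] * len(cell))
            continue
        transformed.append([math.log1p(x / total * scale) for x in cell])
    return transformed

# Two toy cells with very different sequencing depth.
cells = [[0, 5, 95], [10, 10, 0]]
result = log_normalize(cells)
```

After this step, zero counts remain zero and genes with equal counts within a cell receive equal values, while differences in sequencing depth between cells are scaled out before any low-dimensional embedding is computed.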
|
2
|
CLARUS: An interactive explainable AI platform for manual counterfactuals in graph neural networks. J Biomed Inform 2024; 150:104600. [PMID: 38301750 DOI: 10.1016/j.jbi.2024.104600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 01/22/2024] [Accepted: 01/22/2024] [Indexed: 02/03/2024]
Abstract
BACKGROUND Lack of trust in artificial intelligence (AI) models in medicine is still the key blockage for the use of AI in clinical decision support systems (CDSS). Although AI models are already performing excellently in systems medicine, their black-box nature entails that patient-specific decisions are incomprehensible for the physician. Explainable AI (XAI) algorithms aim to "explain" to a human domain expert which input features influenced a specific recommendation. However, in the clinical domain, these explanations must lead to some degree of causal understanding by a clinician. RESULTS We developed the CLARUS platform, aiming to promote human understanding of graph neural network (GNN) predictions. CLARUS enables the visualisation of patient-specific networks, as well as relevance values for genes and interactions, computed by XAI methods such as GNNExplainer. This enables domain experts to gain deeper insights into the network; more importantly, the expert can interactively alter the patient-specific network based on the acquired understanding and initiate re-prediction or retraining. This interactivity allows us to ask manual counterfactual questions and analyse the effects on the GNN prediction. CONCLUSION We present the first interactive XAI platform prototype, CLARUS, that allows not only the evaluation of specific human counterfactual questions based on user-defined alterations of patient networks and a re-prediction of the clinical outcome but also a retraining of the entire GNN after changing the underlying graph structures. The platform is currently hosted by the GWDG on https://rshiny.gwdg.de/apps/clarus/.
|
3
|
A primer on the use of machine learning to distil knowledge from data in biological psychiatry. Mol Psychiatry 2024; 29:387-401. [PMID: 38177352 DOI: 10.1038/s41380-023-02334-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 09/21/2023] [Accepted: 11/17/2023] [Indexed: 01/06/2024]
Abstract
Applications of machine learning in the biomedical sciences are growing rapidly. This growth has been spurred by diverse cross-institutional and interdisciplinary collaborations, public availability of large datasets, an increase in the accessibility of analytic routines, and the availability of powerful computing resources. With this increased access and exposure to machine learning comes a responsibility for education and a deeper understanding of its bases and bounds, borne equally by data scientists seeking to ply their analytic wares in medical research and by biomedical scientists seeking to harness such methods to glean knowledge from data. This article provides an accessible and critical review of machine learning for a biomedically informed audience, as well as its applications in psychiatry. The review covers definitions and expositions of commonly used machine learning methods, and historical trends of their use in psychiatry. We also provide a set of standards, namely Guidelines for REporting Machine Learning Investigations in Neuropsychiatry (GREMLIN), for designing and reporting studies that use machine learning as a primary data-analysis approach. Lastly, we propose the establishment of the Machine Learning in Psychiatry (MLPsych) Consortium, enumerate its objectives, and identify areas of opportunity for future applications of machine learning in biological psychiatry. This review serves as a cautiously optimistic primer on machine learning for those on the precipice as they prepare to dive into the field, either as methodological practitioners or well-informed consumers.
|
4
|
Species-agnostic transfer learning for cross-species transcriptomics data integration without gene orthology. Brief Bioinform 2024; 25:bbae004. [PMID: 38305455 PMCID: PMC10835749 DOI: 10.1093/bib/bbae004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2023] [Revised: 11/24/2023] [Accepted: 12/10/2023] [Indexed: 02/03/2024] Open
Abstract
Novel hypotheses in biomedical research are often developed or validated in model organisms such as mice and zebrafish, which thus play a crucial role. However, due to biological differences between species, translating these findings into human applications remains challenging. Moreover, commonly used orthologous gene information is often incomplete and entails a significant information loss during gene-id conversion. To address these issues, we present a novel methodology for species-agnostic transfer learning with heterogeneous domain adaptation. We extended the cross-domain structure-preserving projection toward out-of-sample prediction. Our approach not only allows knowledge integration and translation across various species without relying on gene orthology but also identifies similar Gene Ontology (GO) terms among the most influential genes composing the latent space for integration. Subsequently, during the alignment of latent spaces, each composed of species-specific genes, it is possible to identify functional annotations of genes missing from public orthology databases. We evaluated our approach with four different single-cell sequencing datasets focusing on cell-type prediction and compared it against related machine-learning approaches. In summary, the developed model outperforms related methods working without prior knowledge when predicting unseen cell types based on other species' data. The results demonstrate that our novel approach allows knowledge transfer beyond species barriers without depending on known gene orthology while utilizing the entire gene sets.
|
5
|
Ensemble-GNN: federated ensemble learning with graph neural networks for disease module discovery and classification. Bioinformatics 2023; 39:btad703. [PMID: 37988152 PMCID: PMC10684359 DOI: 10.1093/bioinformatics/btad703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Revised: 09/06/2023] [Accepted: 11/20/2023] [Indexed: 11/22/2023] Open
Abstract
SUMMARY Federated learning enables collaboration in medicine, where data is scattered across multiple centers, without the need to aggregate the data in a central cloud. While, in general, machine learning models can be applied to a wide range of data types, graph neural networks (GNNs) are developed specifically for graphs, which are very common in the biomedical domain. For instance, a patient can be represented by a protein-protein interaction (PPI) network where the nodes contain the patient-specific omics features. Here, we present our Ensemble-GNN software package, which can be used to deploy federated, ensemble-based GNNs in Python. Ensemble-GNN allows users to quickly build predictive models utilizing PPI networks consisting of various node features such as gene expression and/or DNA methylation. As an example, we show results from a public dataset of 981 patients and 8469 genes from the Cancer Genome Atlas (TCGA). AVAILABILITY AND IMPLEMENTATION The source code is available at https://github.com/pievos101/Ensemble-GNN, and the data at Zenodo (DOI: 10.5281/zenodo.8305122).
|
6
|
The FeatureCloud Platform for Federated Learning in Biomedicine: Unified Approach. J Med Internet Res 2023; 25:e42621. [PMID: 37436815 PMCID: PMC10372562 DOI: 10.2196/42621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2022] [Revised: 01/13/2023] [Accepted: 02/26/2023] [Indexed: 07/13/2023] Open
Abstract
BACKGROUND Machine learning and artificial intelligence have shown promising results in many areas and are driven by the increasing amount of available data. However, these data are often distributed across different institutions and cannot be easily shared owing to strict privacy regulations. Federated learning (FL) allows the training of distributed machine learning models without sharing sensitive data. However, its implementation is time-consuming and requires advanced programming skills and complex technical infrastructures. OBJECTIVE Various tools and frameworks have been developed to simplify the development of FL algorithms and provide the necessary technical infrastructure. Although there are many high-quality frameworks, most focus only on a single application case or method. To our knowledge, there are no generic frameworks, meaning that the existing solutions are restricted to a particular type of algorithm or application field. Furthermore, most of these frameworks provide an application programming interface that needs programming knowledge. There is no collection of ready-to-use FL algorithms that are extendable and allow users (eg, researchers) without programming knowledge to apply FL. A central FL platform for both FL algorithm developers and users does not exist. This study aimed to address this gap and make FL available to everyone by developing FeatureCloud, an all-in-one platform for FL in biomedicine and beyond. METHODS The FeatureCloud platform consists of 3 main components: a global frontend, a global backend, and a local controller. Our platform uses Docker to separate the locally acting components of the platform from the sensitive data systems. We evaluated our platform using 4 different algorithms on 5 data sets for both accuracy and runtime. RESULTS FeatureCloud removes the complexity of distributed systems for developers and end users by providing a comprehensive platform for executing multi-institutional FL analyses and implementing FL algorithms. Through its integrated artificial intelligence store, federated algorithms can easily be published and reused by the community. To secure sensitive raw data, FeatureCloud supports privacy-enhancing technologies to secure the shared local models and assures high standards in data privacy to comply with the strict General Data Protection Regulation. Our evaluation shows that applications developed in FeatureCloud can produce highly similar results compared with centralized approaches and scale well for an increasing number of participating sites. CONCLUSIONS FeatureCloud provides a ready-to-use platform that integrates the development and execution of FL algorithms while reducing the complexity to a minimum and removing the hurdles of federated infrastructure. Thus, we believe that it has the potential to greatly increase the accessibility of privacy-preserving and distributed data analyses in biomedicine and beyond.
|
7
|
Analysis of a Deep Learning Model for 12-Lead ECG Classification Reveals Learned Features Similar to Diagnostic Criteria. IEEE J Biomed Health Inform 2023; PP:1-12. [PMID: 37126621 DOI: 10.1109/jbhi.2023.3271858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2023]
Abstract
Despite their remarkable performance, deep neural networks remain unadopted in clinical practice, which is considered to be partially due to their lack of explainability. In this work, we apply explainable attribution methods to a pre-trained deep neural network for abnormality classification in 12-lead electrocardiography to open this "black box" and understand the relationship between model prediction and learned features. We classify data from two public databases (CPSC 2018, PTB-XL) and the attribution methods assign a "relevance score" to each sample of the classified signals. This allows analyzing what the network learned during training, for which we propose quantitative methods: average relevance scores over a) classes, b) leads, and c) average beats. The analyses of relevance scores for atrial fibrillation and left bundle branch block compared to healthy controls show that their mean values a) increase with higher classification probability and correspond to false classifications when around zero, and b) correspond to clinical recommendations regarding which lead to consider. Furthermore, c) visible P-waves and concordant T-waves result in clearly negative relevance scores in atrial fibrillation and left bundle branch block classification, respectively. Results are similar across both databases despite differences in study population and hardware. In summary, our analysis suggests that the DNN learned features similar to cardiology textbook knowledge.
|
8
|
MirDIP 5.2: tissue context annotation and novel microRNA curation. Nucleic Acids Res 2022; 51:D217-D225. [PMID: 36453996 PMCID: PMC9825511 DOI: 10.1093/nar/gkac1070] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/15/2022] [Revised: 10/16/2022] [Accepted: 11/28/2022] [Indexed: 12/05/2022] Open
Abstract
MirDIP is a well-established database that aggregates microRNA-gene human interactions from multiple databases to increase coverage, reduce bias, and improve usability by providing an integrated score proportional to the probability of the interaction occurring. In version 5.2, we removed eight outdated resources, added a new resource (miRNATIP), and ran five prediction algorithms for miRBase and mirGeneDB. In total, mirDIP 5.2 includes 46 364 047 predictions for 27 936 genes and 2734 microRNAs, making it the first database to provide interactions using data from mirGeneDB. Moreover, we curated and integrated 32 497 novel microRNAs from 14 publications to accelerate the use of these novel data. In this release, we also extend the content and functionality of mirDIP by associating contexts with microRNAs, genes, and microRNA-gene interactions. We collected and processed microRNA and gene expression data from 20 resources and acquired information on 330 tissue and disease contexts for 2657 microRNAs, 27 576 genes and 123 651 910 gene-microRNA-tissue interactions. Finally, we improved the usability of mirDIP by enabling the user to search the database using precursor IDs, and we integrated miRAnno, a network-based tool for identifying pathways linked to specific microRNAs. We also provide a mirDIP API to facilitate access to its integrated predictions. Updated mirDIP is available at https://ophid.utoronto.ca/mirDIP.
|
9
|
Guideline for Software Life Cycle in Health Informatics. iScience 2022; 25:105534. [DOI: 10.1016/j.isci.2022.105534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022] Open
|
10
|
Editorial: Computational systems biomedicine. Front Genet 2022; 13:1047760. [DOI: 10.3389/fgene.2022.1047760] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2022] [Accepted: 09/23/2022] [Indexed: 11/13/2022] Open
|
11
|
Federated Random Forests can improve local performance of predictive models for various healthcare applications. Bioinformatics 2022; 38:2278-2286. [PMID: 35139148 DOI: 10.1093/bioinformatics/btac065] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/04/2021] [Revised: 01/08/2022] [Accepted: 02/01/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Limited data access has hindered the field of precision medicine from exploring its full potential, e.g. concerning machine learning and privacy and data protection rules. Our study evaluates the efficacy of federated Random Forests (FRF) models, focusing particularly on the heterogeneity within and between datasets. We addressed three common challenges: (i) number of parties, (ii) sizes of datasets and (iii) imbalanced phenotypes, evaluated on five biomedical datasets. RESULTS The FRF outperformed the average local models and performed comparably to the data-centralized models trained on the entire data. With an increasing number of models and decreasing dataset size, the performance of local models decreases drastically. The FRF, however, do not decrease significantly. When combining datasets of different sizes, the FRF vastly improve compared to the average local models. We demonstrate that the FRF remain more robust and outperform the local models by analyzing different class-imbalances. Our results support that FRF overcome boundaries of clinical research and enable collaborations across institutes without violating privacy or legal regulations. Clinicians benefit from a vast collection of unbiased data aggregated from different geographic locations, demographics and other varying factors. They can build more generalizable models to make better clinical decisions, which will have relevance, especially for patients in rural areas and rare or geographically uncommon diseases, enabling personalized treatment. In combination with secure multi-party computation, federated learning has the power to revolutionize clinical practice by increasing the accuracy and robustness of healthcare AI and thus paving the way for precision medicine. AVAILABILITY AND IMPLEMENTATION The implementation of the federated random forests can be found at https://featurecloud.ai/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
12
|
Evaluation of machine learning strategies for imaging confirmed prostate cancer recurrence prediction on electronic health records. Comput Biol Med 2022; 143:105263. [PMID: 35131608 DOI: 10.1016/j.compbiomed.2022.105263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2021] [Revised: 01/21/2022] [Accepted: 01/21/2022] [Indexed: 11/03/2022]
Abstract
BACKGROUND The main screening parameter to monitor prostate cancer recurrence (PCR) after primary treatment is the serum concentration of prostate-specific antigen (PSA). In recent years, Ga-68-PSMA PET/CT has become an important method for additional diagnostics in patients with biochemical recurrence. PURPOSE While Ga-68-PSMA PET/CT performs better, it is an expensive, invasive, and time-consuming examination. Therefore, in this study, we aim to employ modern multivariate Machine Learning (ML) methods on electronic health records (EHR) of prostate cancer patients to improve the prediction of imaging confirmed PCR (IPCR). METHODS We retrospectively analyzed the clinical information of 272 patients, who were examined using Ga-68-PSMA PET/CT. The PSA values ranged from 0 ng/mL to 2270.38 ng/mL with a median PSA level at 1.79 ng/mL. We performed a descriptive analysis using Logistic Regression. Additionally, we evaluated the predictive performance of Logistic Regression, Support Vector Machine, Gradient Boosting, and Random Forest. Finally, we assessed the importance of all features using Ensemble Feature Selection (EFS). RESULTS The descriptive analysis found significant associations between IPCR and logarithmic PSA values as well as between IPCR and performed hormonal therapy. Our models were able to predict IPCR with an AUC score of 0.78 ± 0.13 (mean ± standard deviation) and a sensitivity of 0.997 ± 0.01. Features such as PSA, PSA doubling time, PSA velocity, hormonal therapy, radiation treatment, and injected activity show high importance for IPCR prediction using EFS. CONCLUSION This study demonstrates the potential of employing a multitude of parameters into multivariate ML models to improve identification of non-recurring patients compared to the current focus on the main screening parameter (PSA). We showed that ML models are able to predict IPCR, detectable by Ga-68-PSMA PET/CT, and thereby pave the way for optimized early imaging and treatment.
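The two metrics reported above, sensitivity and AUC, can be computed as in the minimal sketch below. The labels and scores are invented toy values, not data from the study, and the 0.5 decision threshold is an assumption. The AUC is computed from its rank interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as one half.

```python
def sensitivity(y_true, y_pred):
    """True-positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        (1.0 if p > n else 0.5 if p == n else 0.0)
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical values): 1 = recurrence confirmed, 0 = not.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

sens = sensitivity(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Sensitivity depends on the chosen threshold, whereas AUC summarizes ranking quality over all thresholds, which is why a model can combine a very high sensitivity with a more moderate AUC.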
|
13
|
Fractal construction of constrained code words for DNA storage systems. Nucleic Acids Res 2021; 50:e30. [PMID: 34908135 PMCID: PMC8934655 DOI: 10.1093/nar/gkab1209] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/22/2021] [Revised: 11/16/2021] [Accepted: 11/24/2021] [Indexed: 12/29/2022] Open
Abstract
The use of complex biological molecules to solve computational problems is an emerging field at the interface between biology and computer science. There are two main categories in which biological molecules, especially DNA, are investigated as alternatives to silicon-based computer technologies. One is to use DNA as a storage medium, and the other is to use DNA for computing. Both strategies come with certain constraints. In the current study, we present a novel approach derived from chaos game representation for DNA to generate DNA code words that fulfill user-defined constraints, namely GC content, homopolymers, and undesired motifs, and thus, can be used to build codes for reliable DNA storage systems.
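The three constraints named above (GC content, homopolymers, undesired motifs) can be illustrated with a simple checker. This is only a filter that tests a candidate code word against the constraints; it is not the chaos-game-based construction the study presents. The function names and the default thresholds (40 to 60% GC, homopolymer runs of at most 3) are illustrative assumptions.

```python
import re

def gc_content(seq):
    """Fraction of G and C bases in a non-empty DNA sequence."""
    return sum(1 for base in seq if base in "GC") / len(seq)

def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    return max(len(m.group(0)) for m in re.finditer(r"(.)\1*", seq))

def satisfies_constraints(seq, gc_range=(0.4, 0.6), max_run=3, forbidden=()):
    """Check a candidate code word against user-defined constraints.

    All defaults are hypothetical; real systems choose them to match
    the synthesis and sequencing technology in use.
    """
    if not gc_range[0] <= gc_content(seq) <= gc_range[1]:
        return False
    if longest_homopolymer(seq) > max_run:
        return False
    return not any(motif in seq for motif in forbidden)

ok = satisfies_constraints("ACGTACGT")
too_long_run = satisfies_constraints("AAAACGCG")
has_motif = satisfies_constraints("ACGTACGT", forbidden=("GTAC",))
```

Filtering random sequences this way discards many candidates, which is part of the motivation for constructing valid code words directly.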
|
14
|
Transfer learning compensates limited data, batch effects and technological heterogeneity in single-cell sequencing. NAR Genom Bioinform 2021; 3:lqab104. [PMID: 34805988 PMCID: PMC8598306 DOI: 10.1093/nargab/lqab104] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/31/2021] [Revised: 10/07/2021] [Accepted: 10/18/2021] [Indexed: 12/18/2022] Open
Abstract
Tremendous advances in next-generation sequencing technology have enabled the accumulation of large amounts of omics data in various research areas over the past decade. However, study limitations due to small sample sizes, especially in rare disease clinical research, technological heterogeneity and batch effects limit the applicability of traditional statistics and machine learning analysis. Here, we present a meta-transfer learning approach to transfer knowledge from big data and reduce the search space in data with small sample sizes. Few-shot learning algorithms integrate meta-learning to overcome data scarcity and data heterogeneity by transferring molecular pattern recognition models from datasets of unrelated domains. We explore few-shot learning models with the large-scale public datasets TCGA (The Cancer Genome Atlas) and GTEx, and demonstrate their potential as pre-training datasets in other molecular pattern recognition tasks. Our results show that meta-transfer learning is very effective for datasets with a limited sample size. Furthermore, we show that our approach can transfer knowledge across technological heterogeneity, for example, from bulk-cell to single-cell data. Our approach can overcome study size constraints, batch effects and technical limitations in analyzing single-cell data by leveraging existing bulk-cell sequencing data.
|
15
|
Prediction of antimicrobial resistance based on whole-genome sequencing and machine learning. Bioinformatics 2021; 38:325-334. [PMID: 34613360 PMCID: PMC8722762 DOI: 10.1093/bioinformatics/btab681] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 06/29/2021] [Revised: 08/27/2021] [Accepted: 09/24/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Antimicrobial resistance (AMR) is one of the biggest global problems threatening human and animal health. Rapid and accurate AMR diagnostic methods are thus urgently needed. However, traditional antimicrobial susceptibility testing (AST) is time-consuming, low throughput and viable only for cultivable bacteria. Machine learning methods may pave the way for automated AMR prediction based on genomic data of the bacteria. However, comparing different machine learning methods for the prediction of AMR based on different encodings and whole-genome sequencing data without prior knowledge remains to be done. RESULTS In this study, we evaluated logistic regression (LR), support vector machine (SVM), random forest (RF) and convolutional neural network (CNN) for the prediction of AMR for the antibiotics ciprofloxacin, cefotaxime, ceftazidime and gentamicin. We could demonstrate that these models can effectively predict AMR with label encoding, one-hot encoding and frequency matrix chaos game representation (FCGR encoding) on whole-genome sequencing data. We trained these models on a large AMR dataset and evaluated them on an independent public dataset. Generally, RFs and CNNs perform better than LR and SVM with AUCs up to 0.96. Furthermore, we were able to identify mutations that are associated with AMR for each antibiotic. AVAILABILITY AND IMPLEMENTATION Source code for data preparation and model training is provided on GitHub (https://github.com/YunxiaoRen/ML-iAMR). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
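The three encodings named above can be sketched for a short nucleotide string as follows. The integer codes, the treatment of unknown bases, and the corner assignment of the chaos game representation are all assumptions chosen for illustration; the abstract does not specify the exact conventions used in the study, and corner layouts differ between CGR implementations.

```python
LABELS = {"A": 1, "C": 2, "G": 3, "T": 4}  # assumed codes; 0 = unknown base
# Assumed CGR corners as (x, y) bits: A bottom-left, C top-left,
# G top-right, T bottom-right.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def label_encode(seq):
    """Map each base to a single integer."""
    return [LABELS.get(base, 0) for base in seq]

def one_hot(seq):
    """Map each base to a 4-element indicator vector (A, C, G, T)."""
    order = "ACGT"
    return [[1 if base == b else 0 for b in order] for base in seq]

def fcgr(seq, k):
    """Frequency matrix CGR: count k-mers on a 2^k x 2^k grid.

    Each base contributes one bit to the x and y cell index; later
    bases set more significant bits, so the last base of a k-mer
    selects the coarsest quadrant, as in the chaos game.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers containing ambiguous bases
        x = y = 0
        for j, base in enumerate(kmer):
            bx, by = CORNERS[base]
            x |= bx << j
            y |= by << j
        grid[y][x] += 1
    return grid

labels = label_encode("ACGTN")
vectors = one_hot("ACGT")
matrix = fcgr("ACGT", 1)
```

Label and one-hot encodings keep positional information and grow with sequence length, while the FCGR produces a fixed-size image-like matrix for any sequence length, which makes it a natural input for convolutional networks.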
|
16
|
Abstract
Computational methods can transform healthcare. In particular, health informatics with artificial intelligence has shown tremendous potential when applied in various fields of medical research and has opened a new era for precision medicine. The development of reusable biomedical software for research or clinical practice is time-consuming and requires rigorous compliance with quality requirements as defined by international standards. However, research projects rarely implement such measures, hindering smooth technology transfer into the research community or manufacturers as well as reproducibility and reusability. Here, we present a guideline for quality management systems (QMS) for academic organizations incorporating the essential components while confining the requirements to an easily manageable effort. It provides a starting point to implement a QMS tailored to specific needs effortlessly and greatly facilitates technology transfer in a controlled manner, thereby supporting reproducibility and reusability. Ultimately, the emerging standardized workflows can pave the way for an accelerated deployment in clinical practice.
|
17
|
Integrative Analysis of Next-Generation Sequencing for Next-Generation Cancer Research toward Artificial Intelligence. Cancers (Basel) 2021; 13:3148. [PMID: 34202427 PMCID: PMC8269018 DOI: 10.3390/cancers13133148] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 05/20/2021] [Revised: 06/16/2021] [Accepted: 06/21/2021] [Indexed: 12/18/2022] Open
Abstract
The rapid improvement of next-generation sequencing (NGS) technologies and their application in large-scale cohorts in cancer research led to common challenges of big data. It opened a new research area incorporating systems biology and machine learning. As large-scale NGS data accumulated, sophisticated data analysis methods became indispensable. In addition, NGS data have been integrated with systems biology to build better predictive models to determine the characteristics of tumors and tumor subtypes. Therefore, various machine learning algorithms were introduced to identify underlying biological mechanisms. In this work, we review novel technologies developed for NGS data analysis, and we describe how these computational methodologies integrate systems biology and omics data. Subsequently, we discuss how deep neural networks outperform other approaches, the potential of graph neural networks (GNN) in systems biology, and the limitations in NGS biomedical research. To reflect on the various challenges and corresponding computational solutions, we will discuss the following three topics: (i) molecular characteristics, (ii) tumor heterogeneity, and (iii) drug discovery. We conclude that machine learning and network-based approaches can add valuable insights and build highly accurate models. However, a well-informed choice of learning algorithm and biological network information is crucial for the success of each specific research question.
|
18
|
A large-scale comparative study on peptide encodings for biomedical classification. NAR Genom Bioinform 2021; 3:lqab039. [PMID: 34046590 PMCID: PMC8140742 DOI: 10.1093/nargab/lqab039] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 01/15/2021] [Revised: 04/13/2021] [Accepted: 04/26/2021] [Indexed: 01/19/2023] Open
Abstract
Owing to the great variety of distinct peptide encodings, working on a biomedical classification task at hand is challenging. Researchers have to determine encodings capable of representing underlying patterns as numerical input for the subsequent machine learning. A general guideline is lacking in the literature; thus, we present here the first large-scale comprehensive study to investigate the performance of a wide range of encodings on multiple datasets from different biomedical domains. For the sake of completeness, we added additional sequence- and structure-based encodings. In particular, we collected 50 biomedical datasets and defined a fixed parameter space for 48 encoding groups, leading to a total of 397 700 encoded datasets. Our results demonstrate that none of the encodings is superior for all biomedical domains. Nevertheless, some encodings often outperform others, thus reducing the initial encoding selection substantially. Our work enables researchers to objectively compare novel encodings to the state of the art. Our findings pave the way for a more sophisticated encoding optimization, for example, as part of automated machine learning pipelines. The work presented here is implemented as a large-scale, end-to-end workflow designed for easy reproducibility and extensibility. All standardized datasets and results are available for download to comply with FAIR standards.
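To make the notion of a sequence-based peptide encoding concrete, the following minimal sketch computes the amino acid composition of a peptide, one of the simplest ways to turn a sequence into a fixed-length numerical vector. It is not taken from the study's workflow; the function name `aac_encode` and the example peptide are assumptions chosen for illustration only.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order, so that every peptide maps
# to a vector of the same length regardless of its sequence length.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_encode(peptide: str) -> list[float]:
    """Return the relative frequency of each standard amino acid (hypothetical helper)."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    counts = Counter(peptide)
    unknown = set(counts) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    # One feature per amino acid; the features sum to 1.
    return [counts[aa] / len(peptide) for aa in AMINO_ACIDS]


if __name__ == "__main__":
    vector = aac_encode("ACDA")  # made-up example peptide
    print(len(vector))           # number of features
    print(vector[:3])            # frequencies of A, C and D
```

Because the output length does not depend on the peptide length, such a vector can be fed directly into a standard classifier, which is the role encodings play in the comparison described above.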
|
19
|
Computational strategies to combat COVID-19: useful tools to accelerate SARS-CoV-2 and coronavirus research. Brief Bioinform 2021; 22:642-663. [PMID: 33147627 PMCID: PMC7665365 DOI: 10.1093/bib/bbaa232] [Citation(s) in RCA: 81] [Impact Index Per Article: 27.0] [Received: 05/27/2020] [Revised: 07/28/2020] [Accepted: 08/26/2020] [Indexed: 12/16/2022] Open
Abstract
SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to get insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@uni-jena.de.
|
20
|
Interleukin-6 Gene Expression Changes after a 4-Week Intake of a Multispecies Probiotic in Major Depressive Disorder-Preliminary Results of the PROVIT Study. Nutrients 2020; 12:E2575. [PMID: 32858844 PMCID: PMC7551871 DOI: 10.3390/nu12092575] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 05/29/2020] [Revised: 08/20/2020] [Accepted: 08/21/2020] [Indexed: 12/13/2022] Open
Abstract
Major depressive disorder (MDD) is a prevalent disease, in which one third of sufferers do not respond to antidepressants. Probiotics have the potential to be well-tolerated and cost-efficient treatment options. However, the molecular pathways of their effects are not fully elucidated yet. Based on previous literature, we assume that probiotics can positively influence inflammatory mechanisms. We aimed at analyzing the effects of probiotics on gene expression of inflammation genes as part of the randomized, placebo-controlled, multispecies probiotics PROVIT study in Graz, Austria. Fasting blood of 61 inpatients with MDD was collected before and after four weeks of probiotic intake or placebo. We analyzed the effects on gene expression of tumor necrosis factor (TNF), nuclear factor kappa B subunit 1 (NFKB1) and interleukin-6 (IL-6). For IL-6, we found no significant main effect of group (F(1,44) = 1.33, p = ns) or time (F(1,44) = 0.00, p = ns), but the group-by-time interaction was significant (F(1,44) = 5.67, p < 0.05). The intervention group showed decreasing IL-6 gene expression levels while the placebo group showed increasing gene expression levels of IL-6. Probiotics could be a useful additional treatment in MDD, due to their anti-inflammatory effects. Results of the current study are promising, but further studies are required to investigate the beneficial effects of probiotic interventions in depressed individuals.
|
21
|
CORDITE: The Curated CORona Drug InTERactions Database for SARS-CoV-2. iScience 2020; 23:101297. [PMID: 32619700 PMCID: PMC7305714 DOI: 10.1016/j.isci.2020.101297] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 05/04/2020] [Revised: 06/09/2020] [Accepted: 06/15/2020] [Indexed: 01/18/2023] Open
Abstract
Since the outbreak in 2019, researchers have been trying to find effective drugs against the SARS-CoV-2 virus based on de novo drug design and drug repurposing. The former approach is very time consuming and needs extensive testing in humans, whereas drug repurposing is more promising, as the drugs have already been tested for side effects, etc. At present, there is no treatment for COVID-19 that is clinically effective, but there is a huge amount of data from studies that analyze potential drugs. We developed CORDITE to efficiently combine state-of-the-art knowledge on potential drugs and make it accessible to scientists and clinicians. The web interface also provides access to an easy-to-use API that allows wide use by other software and applications, e.g., for meta-analysis, design of new clinical studies, or simple literature search. CORDITE is currently empowering many scientists across all continents and accelerates research in the knowledge domains of virology and drug design.
|
22
|
Urinary proteomics links keratan sulfate degradation and lysosomal enzymes to early type 1 diabetes. PLoS One 2020; 15:e0233639. [PMID: 32453760 PMCID: PMC7250451 DOI: 10.1371/journal.pone.0233639] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 02/14/2020] [Accepted: 05/09/2020] [Indexed: 01/09/2023] Open
Abstract
Diabetes is the leading cause of end-stage renal disease worldwide. Our understanding of the early kidney response to chronic hyperglycemia remains incomplete. To address this, we first investigated the urinary proteomes of otherwise healthy youths with and without type 1 diabetes and subsequently examined the enriched pathways that might be dysregulated in early disease using systems biology approaches. This cross-sectional study included two separate cohorts for the discovery (N = 30) and internal validation (N = 30) of differentially excreted proteins. Discovery proteomics was performed on a Q Exactive Plus hybrid quadrupole-orbitrap mass spectrometer. We then searched the pathDIP, KEGG, and Reactome databases to identify enriched pathways in early diabetes; the Integrated Interactions Database to retrieve protein-protein interaction data; and the PubMed database to compare fold changes of our signature proteins with those published in similarly designed studies. Proteins were selected for internal validation based on pathway enrichment and availability of commercial enzyme-linked immunosorbent assay kits. Of the 2451 proteins identified, 576 were quantified in all samples from the discovery cohort; 34 comprised the urinary signature for early diabetes after Benjamini-Hochberg adjustment (Q < 0.05). The top pathways associated with this signature included lysosome, glycosaminoglycan degradation, and innate immune system (Q < 0.01). Notably, all enzymes involved in keratan sulfate degradation were significantly elevated in urines from youths with diabetes (|fold change| > 1.6). Increased urinary excretion of monocyte differentiation antigen CD14, hexosaminidase A, and lumican was also observed in the validation cohort (P < 0.05). Twenty-one proteins from our signature have been reported elsewhere as potential mediators of early diabetes. 
In this study, we identified a urinary proteomic signature for early type 1 diabetes, of which lysosomal enzymes were major constituents. Our findings highlight novel pathways such as keratan sulfate degradation in the early kidney response to hyperglycemia.
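The Benjamini-Hochberg adjustment mentioned in this abstract can be sketched in a few lines. The code below is a generic illustration of the procedure, not the authors' analysis code; the function name `benjamini_hochberg` and the example p-values are assumptions.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values (Q) in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest; taking the running
    # minimum keeps the adjusted values monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for four proteins
    adjusted = benjamini_hochberg(raw)
    # A protein would enter the signature if its adjusted value is below 0.05.
    print([q < 0.05 for q in adjusted])
```

Each raw p-value is scaled by the number of tests divided by its rank, so a Q < 0.05 cut-off like the one in the abstract controls the expected fraction of false discoveries rather than the chance of any single false positive.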
|
23
|
mirDIP 4.1-integrative database of human microRNA target predictions. Nucleic Acids Res 2019; 46:D360-D370. [PMID: 29194489 PMCID: PMC5753284 DOI: 10.1093/nar/gkx1144] [Citation(s) in RCA: 348] [Impact Index Per Article: 69.6] [Received: 09/17/2017] [Accepted: 10/30/2017] [Indexed: 02/07/2023] Open
Abstract
MicroRNAs are important regulators of gene expression, achieved by binding to the gene to be regulated. Even with modern high-throughput technologies, it is laborious and expensive to detect all possible microRNA targets. For this reason, several computational microRNA-target prediction tools have been developed, each with its own strengths and limitations. Integration of different tools has been a successful approach to minimize the shortcomings of individual databases. Here, we present mirDIP v4.1, providing nearly 152 million human microRNA-target predictions, which were collected across 30 different resources. We also introduce an integrative score, which was statistically inferred from the obtained predictions, and was assigned to each unique microRNA-target interaction to provide a unified measure of confidence. We demonstrate that integrating predictions across multiple resources does not cumulate prediction bias toward biological processes or pathways. mirDIP v4.1 is freely available at http://ophid.utoronto.ca/mirDIP/.
|
24
|
GWAS-based machine learning approach to predict duloxetine response in major depressive disorder. J Psychiatr Res 2018; 99:62-68. [PMID: 29407288 DOI: 10.1016/j.jpsychires.2017.12.009] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Received: 06/19/2017] [Revised: 10/31/2017] [Accepted: 12/14/2017] [Indexed: 12/22/2022]
Abstract
Major depressive disorder (MDD) is one of the most prevalent psychiatric disorders and is commonly treated with antidepressant drugs. However, large variability is observed in terms of response to antidepressants. Machine learning (ML) models may be useful to predict treatment outcomes. A sample of 186 MDD patients who received treatment with duloxetine for up to 8 weeks was categorized as "responders" based on a MADRS change >50% from baseline, or "remitters" based on a MADRS score ≤10 at end point. The initial dataset (N = 186) was randomly divided into training and test sets in a nested 5-fold cross-validation, where 80% was used as a training set and 20% made up five independent test sets. We performed genome-wide logistic regression to identify potentially significant variants related to duloxetine response/remission and extracted the most promising predictors using LASSO regression. Subsequently, classification-regression trees (CRT) and support vector machines (SVM) were applied to construct models, using ten-fold cross-validation. With regard to response, none of the pairs performed significantly better than chance (accuracy p > .1). For remission, SVM achieved moderate performance with an accuracy of 0.52, a sensitivity of 0.58, and a specificity of 0.46, while CRT scored 0.51 on all three measures. The best-performing SVM fold was characterized by an accuracy of 0.66 (p = .071), a sensitivity of 0.70 and a specificity of 0.61. In this study, the potential of using GWAS data to predict duloxetine outcomes was examined using ML models. The models were characterized by a promising sensitivity, but specificity remained moderate at best. The inclusion of additional non-genetic variables to create integrated models may improve prediction.
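The three performance measures reported in this abstract follow directly from the confusion matrix of a binary classifier. The sketch below shows the standard definitions; the function name `classification_metrics` and the toy labels are assumptions and do not reproduce the study's data.

```python
def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy, sensitivity and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # remitters found
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # remitters missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-remitters found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else nan,
        # Sensitivity: fraction of true positives that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Specificity: fraction of true negatives that were detected.
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0, 0, 1]      # toy labels, 1 = remitter
    predicted = [1, 1, 0, 0, 0, 1, 0, 1]  # toy model output
    print(classification_metrics(truth, predicted))
```

With roughly balanced classes, an accuracy near 0.5 corresponds to chance-level prediction, which is why the abstract describes the reported values as moderate.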
|
25
|
LifeStyle-Specific-Islands (LiSSI): Integrated Bioinformatics Platform for Genomic Island Analysis. J Integr Bioinform 2017; 14. [PMID: 28678736 PMCID: PMC6042826 DOI: 10.1515/jib-2017-0010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2017] [Accepted: 04/19/2017] [Indexed: 11/20/2022] Open
Abstract
Distinct bacteria are able to cope with highly diverse lifestyles; for instance, they can be free living or host-associated. Thus, these organisms must possess a large and varied genomic arsenal to withstand different environmental conditions. To facilitate the identification of genomic features that might influence bacterial adaptation to a specific niche, we introduce LifeStyle-Specific-Islands (LiSSI). LiSSI combines evolutionary sequence analysis with statistical learning (Random Forest with feature selection, model tuning and robustness analysis). In summary, our strategy aims to identify conserved consecutive homology sequences (islands) in genomes and to identify the most discriminant islands for each lifestyle.
|
26
|
|
27
|
Carotta: Revealing Hidden Confounder Markers in Metabolic Breath Profiles. Metabolites 2015; 5:344-63. [PMID: 26065494 PMCID: PMC4495376 DOI: 10.3390/metabo5020344] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 03/13/2015] [Revised: 05/20/2015] [Accepted: 05/25/2015] [Indexed: 12/20/2022] Open
Abstract
Computational breath analysis is a growing research area aiming at identifying volatile organic compounds (VOCs) in human breath to assist medical diagnostics of the next generation. While inexpensive and non-invasive bioanalytical technologies for metabolite detection in exhaled air and bacterial/fungal vapor exist and the first studies on the power of supervised machine learning methods for profiling of the resulting data were conducted, we lack methods to extract hidden data features emerging from confounding factors. Here, we present Carotta, a new cluster analysis framework dedicated to uncovering such hidden substructures by sophisticated unsupervised statistical learning methods. We study the power of transitivity clustering and hierarchical clustering to identify groups of VOCs with similar expression behavior over most patient breath samples and/or groups of patients with a similar VOC intensity pattern. This enables the discovery of dependencies between metabolites. On the one hand, this allows us to eliminate the effect of potential confounding factors hindering disease classification, such as smoking. On the other hand, we may also identify VOCs associated with disease subtypes or concomitant diseases. Carotta is an open source software with an intuitive graphical user interface promoting data handling, analysis and visualization. The back-end is designed to be modular, allowing for easy extensions with plugins in the future, such as new clustering methods and statistics. It does not require much prior knowledge or technical skills to operate. We demonstrate its power and applicability by means of one artificial dataset. We also apply Carotta exemplarily to a real-world example dataset on chronic obstructive pulmonary disease (COPD). 
While the artificial data are utilized as a proof of concept, we will demonstrate how Carotta finds candidate markers in our real dataset associated with confounders rather than the primary disease (COPD) and bronchial carcinoma (BC). Carotta is publicly available at http://carotta.compbio.sdu.dk [1].
|
28
|
Classification of Breast Cancer Subtypes by combining Gene Expression and DNA Methylation Data. J Integr Bioinform 2014. [DOI: 10.1515/jib-2014-236] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Indexed: 11/15/2022] Open
Abstract
Summary Selecting the most promising treatment strategy for breast cancer crucially depends on determining the correct subtype. In recent years, gene expression profiling has been investigated as an alternative to histochemical methods. Since databases like TCGA provide easy and unrestricted access to gene expression data for hundreds of patients, the challenge is to extract a minimal optimal set of genes with good prognostic properties from a large bulk of genes making a moderate contribution to classification. Several studies have successfully applied machine learning algorithms to solve this so-called gene selection problem. However, more diverse data from other OMICS technologies are available, including methylation. We hypothesize that combining methylation and gene expression data could already lead to a largely improved classification model, since the resulting model will reflect differences not only on the transcriptomic, but also on an epigenetic level. We compared so-called random forest derived classification models based on gene expression and methylation data alone, to a model based on the combined features and to a model based on the gold standard PAM50. We obtained bootstrap errors of 10-20% and classification error of 1-50%, depending on breast cancer subtype and model. The gene expression model was clearly superior to the methylation model, which was also reflected in the combined model, which mainly selected features from gene expression data. However, the methylation model was able to identify unique features not considered as relevant by the gene expression model, which might provide deeper insights into breast cancer subtype differentiation on an epigenetic level.
|
29
|
On the limits of computational functional genomics for bacterial lifestyle prediction. Brief Funct Genomics 2014; 13:398-408. [PMID: 24855068 DOI: 10.1093/bfgp/elu014] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 01/28/2023] Open
Abstract
We review the level of genomic specificity regarding actinobacterial pathogenicity. As they occupy various niches in diverse habitats, one may assume the existence of lifestyle-specific genomic features. We include 240 actinobacteria classified into four pathogenicity classes: human pathogens (HPs), broad-spectrum pathogens (BPs), opportunistic pathogens (OPs) and non-pathogenic (NP). We hypothesize: (H1) Pathogens (HPs and BPs) possess specific pathogenicity signature genes. (H2) The same holds for OPs. (H3) Broad-spectrum and exclusively HPs cannot be distinguished from each other because of an observation bias, i.e. many HPs might yet be unclassified BPs. (H4) There is no intrinsic genomic characteristic of OPs compared with pathogens, as small mutations are likely to play a more dominant role to survive the immune system. To study these hypotheses, we implemented a bioinformatics pipeline that combines evolutionary sequence analysis with statistical learning methods (Random Forest with feature selection, model tuning and robustness analysis). Essentially, we present orthologous gene sets that computationally distinguish pathogens from NPs (H1). We further show a clear limit in differentiating OPs from both NPs (H2) and pathogens (H4). HPs may also not be distinguished from bacteria annotated as BPs based only on a small set of orthologous genes (H3), as many HPs might as well target a broad range of mammals but have not been annotated accordingly. In conclusion, we illustrate that even in the post-genome era and despite next-generation sequencing technology, our ability to efficiently deduce real-world conclusions, such as pathogenicity classification, remains quite limited.
|
30
|
|
31
|
An Integrative Clinical Database and Diagnostics Platform for Biomarker Identification and Analysis in Ion Mobility Spectra of Human Exhaled Air. J Integr Bioinform 2013. [DOI: 10.1515/jib-2013-218] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 11/15/2022] Open
Abstract
Summary Over the last decade the evaluation of odors and vapors in human breath has gained more and more attention, particularly in the diagnostics of pulmonary diseases. Ion mobility spectrometry coupled with multi-capillary columns (MCC/IMS) is a well-known technology for detecting volatile organic compounds (VOCs) in air. It is a comparatively inexpensive, non-invasive, high-throughput method, which is able to handle the moisture that comes with human exhaled air, and allows for characterizing VOCs in very low concentrations. To identify discriminating compounds as biomarkers, it is necessary to have a clear understanding of the detailed composition of human breath. Therefore, in addition to the clinical studies, there is a need for a flexible and comprehensive centralized data repository, which is capable of gathering all kinds of related information. Moreover, there is a demand for automated data integration and semi-automated data analysis, in particular with regard to the rapid data accumulation emerging from the high-throughput nature of the MCC/IMS technology. Here, we present a comprehensive database application and analysis platform, which combines metabolic maps with heterogeneous biomedical data in a well-structured manner. The design of the database is based on a hybrid of the entity-attribute-value (EAV) model and the EAV-CR, which incorporates the concepts of classes and relationships. Additionally, it offers an intuitive user interface that provides easy and quick access to the platform’s functionality: automated data integration and integrity validation, versioning and roll-back strategy, data retrieval as well as semi-automatic data mining and machine learning capabilities. The platform will support MCC/IMS-based biomarker identification and validation. The software, schemata, data sets and further information are publicly available at http://imsdb.mpi-inf.mpg.de.
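For readers unfamiliar with the entity-attribute-value (EAV) model named in this abstract, the sketch below shows the plain EAV idea using Python's built-in `sqlite3` module. It is a simplified illustration only: the table layout, the helper names and the example record are assumptions, and the class and relationship extensions of EAV-CR are not modelled.

```python
import sqlite3


def create_eav_store() -> sqlite3.Connection:
    """Create an in-memory database with one generic EAV table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE eav ("
        " entity TEXT NOT NULL,"
        " attribute TEXT NOT NULL,"
        " value TEXT,"
        " PRIMARY KEY (entity, attribute))"
    )
    return conn


def set_value(conn: sqlite3.Connection, entity: str, attribute: str, value: object) -> None:
    """Store one fact; new attributes need no schema change."""
    conn.execute(
        "INSERT OR REPLACE INTO eav (entity, attribute, value) VALUES (?, ?, ?)",
        (entity, attribute, str(value)),
    )


def get_entity(conn: sqlite3.Connection, entity: str) -> dict[str, str]:
    """Collect all attribute/value pairs recorded for one entity."""
    rows = conn.execute(
        "SELECT attribute, value FROM eav WHERE entity = ?", (entity,)
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    store = create_eav_store()
    set_value(store, "sample-001", "diagnosis", "COPD")  # made-up record
    set_value(store, "sample-001", "smoker", True)
    print(get_entity(store, "sample-001"))
```

The appeal of this layout for heterogeneous biomedical data is that each record is a row rather than a column, so previously unseen attributes can be stored without altering the table definition.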
|
32
|
Peak detection method evaluation for ion mobility spectrometry by using machine learning approaches. Metabolites 2013; 3:277-93. [PMID: 24957992 PMCID: PMC3901270 DOI: 10.3390/metabo3020277] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 01/25/2013] [Revised: 03/15/2013] [Accepted: 04/09/2013] [Indexed: 11/25/2022] Open
Abstract
Ion mobility spectrometry with pre-separation by multi-capillary columns (MCC/IMS) has become an established inexpensive, non-invasive bioanalytics technology for detecting volatile organic compounds (VOCs) with various metabolomics applications in medical research. To pave the way for this technology towards daily usage in medical practice, different steps still have to be taken. With respect to modern biomarker research, one of the most important tasks is the automatic classification of patient-specific data sets into different groups, healthy or not, for instance. Although sophisticated machine learning methods exist, an inevitable preprocessing step is reliable and robust peak detection without manual intervention. In this work we evaluate four state-of-the-art approaches for automated IMS-based peak detection: local maxima search, watershed transformation with IPHEx, region-merging with VisualNow, and peak model estimation (PME). We manually generated a gold standard with the aid of a domain expert (manual) and compare the performance of the four peak calling methods with respect to two distinct criteria. We first utilize established machine learning methods and systematically study their classification performance based on the four peak detectors’ results. Second, we investigate the classification variance and robustness regarding perturbation and overfitting. Our main finding is that classification accuracy is almost equally good for all methods, that is, for the manually created gold standard as well as the four automatic peak-finding methods. In addition, we note that all tools, manual and automatic, are similarly robust against perturbations. However, the classification performance is more robust against overfitting when using the PME as peak calling preprocessor. In summary, we conclude that all methods, though small differences exist, are largely reliable and enable a wide spectrum of real-world biomedical applications.
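Of the four peak detection approaches listed in this abstract, local maxima search is the simplest to illustrate. The sketch below marks a cell of a two-dimensional intensity matrix as a peak when it exceeds a threshold and is strictly larger than all of its neighbours. It is a generic illustration, not the implementation evaluated in the paper; the function name `local_maxima`, the 8-neighbourhood and the threshold parameter are assumptions.

```python
def local_maxima(matrix: list[list[float]], threshold: float = 0.0) -> list[tuple[int, int]]:
    """Return (row, column) positions of strict local maxima above a threshold."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = matrix[r][c]
            if value <= threshold:
                continue  # ignore background intensity
            # Collect the up to eight surrounding cells, clipped at the border.
            neighbours = [
                matrix[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c))
    return peaks


if __name__ == "__main__":
    # Toy intensity matrix standing in for a small spectrum excerpt.
    spectrum = [
        [0, 1, 0, 0],
        [1, 5, 1, 0],
        [0, 1, 0, 3],
        [0, 0, 2, 1],
    ]
    print(local_maxima(spectrum, threshold=0.5))
```

A plain neighbourhood test like this is sensitive to noise, which is one reason more elaborate approaches such as watershed transformation or peak model estimation are compared against it.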
|
33
|
Robust modelling, measurement and analysis of human and animal metabolic systems. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 2009; 367:1971-92. [PMID: 19380321 DOI: 10.1098/rsta.2008.0305] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/10/2023]
Abstract
Modelling human and animal metabolism is impeded by the lack of accurate quantitative parameters and the large number of biochemical reactions. This problem may be tackled by: (i) study of modules of the network independently; (ii) ensemble simulations to explore many plausible parameter combinations; (iii) analysis of 'sloppy' parameter behaviour, revealing interdependent parameter combinations with little influence; (iv) multiscale analysis that combines molecular and whole network data; and (v) measuring metabolic flux (rate of flow) in vivo via stable isotope labelling. For the latter method, carbon transition networks were modelled with systems of ordinary differential equations, but we show that coloured Petri nets provide a more intuitive graphical approach. Analysis of parameter sensitivities shows that only a few parameter combinations have a large effect on predictions. Model analysis of high-energy phosphate transport indicates that membrane permeability, inaccurately known at the organellar level, can be well determined from whole-organ responses. Ensemble simulations that take into account the imprecision of measured molecular parameters contradict the popular hypothesis that high-energy phosphate transport in heart muscle is mostly by phosphocreatine. Combining modular, multiscale, ensemble and sloppy modelling approaches with in vivo flux measurements may prove indispensable for the modelling of the large human metabolic system.
|