1
Li S, Liu G, Wei T, Jia S, Zhang J. EvoVis: A Visual Analytics Method to Understand the Labeling Iterations in Data Programming. IEEE Transactions on Visualization and Computer Graphics 2025; 31:1802-1817. [PMID: 38416617 DOI: 10.1109/tvcg.2024.3370654] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/01/2024]
Abstract
Obtaining high-quality labeled training data poses a significant bottleneck in the domain of machine learning. Data programming has emerged as a new paradigm to address this issue by converting human knowledge into labeling functions (LFs) to quickly produce low-cost probabilistic labels. To ensure the quality of labeled data, data programmers commonly iterate LFs for many rounds until satisfactory performance is achieved. However, the challenge in understanding the labeling iterations stems from interpreting the intricate relationships between data programming elements, exacerbated by their many-to-many and directed characteristics, inconsistent formats, and the large scale of data typically involved in labeling tasks. These complexities may impede the evaluation of label quality, identification of areas for improvement, and the effective optimization of LFs for acquiring high-quality labeled data. In this article, we introduce EvoVis, a visual analytics method for multi-class text labeling tasks. It seamlessly integrates relationship analysis and temporal overview to display contextual and historical information on a single screen, aiding in explaining the labeling iterations in data programming. We assessed its utility and effectiveness through case studies and user studies. The results indicate that EvoVis can effectively assist data programmers in understanding labeling iterations and improving the quality of labeled data, as evidenced by an increase of 0.16 in the average F1 score when compared to the default analysis tool.
2
Zhuang Y, Ouyang Y, Ding L, Xu M, Shi F, Shan D, Cao D, Cao X. Source Tracing of Kidney Injury via the Multispectral Fingerprint Identified by Machine Learning-Driven Surface-Enhanced Raman Spectroscopic Analysis. ACS Sens 2024; 9:2622-2633. [PMID: 38700898 DOI: 10.1021/acssensors.4c00407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2024]
Abstract
Early diagnosis of drug-induced kidney injury (DIKI) is essential for clinical treatment and intervention. However, developing a reliable method to trace kidney injury origins through retrospective studies remains a challenge. In this study, we designed ordered fried-bun-shaped Au nanocone arrays (FBS NCAs) to create microarray chips as a surface-enhanced Raman scattering (SERS) analysis platform. Subsequently, a principal component analysis (PCA)-two-layer nearest neighbor (TLNN) model was constructed to identify and analyze the SERS spectra of exosomes from renal injury induced by cisplatin and gentamicin. The established PCA-TLNN model successfully differentiated the SERS spectra of exosomes from renal injury at different stages and from different causes, capturing the most significant spectral features for distinguishing these variations. For the SERS spectra of exosomes from renal injury at different induction times, the accuracy of PCA-TLNN reached 97.8% (cisplatin) and 93.3% (gentamicin). For the SERS spectra of exosomes from renal injury caused by different agents, the accuracy of PCA-TLNN reached 100% (7 days) and 96.7% (14 days). This study demonstrates that the combination of label-free exosome SERS and machine learning could serve as an innovative strategy for medical diagnosis and therapeutic intervention.
Affiliation(s)
- Yanwen Zhuang
  - Institute of Translational Medicine, Medical College, Yangzhou University, Yangzhou 225001, P. R. China
- Yu Ouyang
  - Department of Clinical Laboratory, The Affiliated Taizhou Second People's Hospital of Yangzhou University, Taizhou 225300, P. R. China
- Li Ding
  - Institute of Translational Medicine, Medical College, Yangzhou University, Yangzhou 225001, P. R. China
- Miaowen Xu
  - Institute of Translational Medicine, Medical College, Yangzhou University, Yangzhou 225001, P. R. China
- Fanfeng Shi
  - Yangzhou Polytechnic Institute, Yangzhou 225002, P. R. China
- Dan Shan
  - School of Information Engineering/Carbon Based Low Dimensional Semiconductor Materials and Device Engineering Research Center of Jiangsu Province, Yangzhou Polytechnic Institute, Yangzhou 225127, P. R. China
- Dawei Cao
  - Yangzhou Polytechnic Institute, Yangzhou 225002, P. R. China
  - School of Information Engineering/Carbon Based Low Dimensional Semiconductor Materials and Device Engineering Research Center of Jiangsu Province, Yangzhou Polytechnic Institute, Yangzhou 225127, P. R. China
- Xiaowei Cao
  - Institute of Translational Medicine, Medical College, Yangzhou University, Yangzhou 225001, P. R. China
  - Jiangsu Key Laboratory of Integrated Traditional Chinese and Western Medicine for Prevention and Treatment of Senile Diseases, Medical College, Yangzhou University, Yangzhou 225001, P. R. China
  - Jiangsu Key Laboratory of Experimental & Translational Non-coding RNA Research, Medical College, Yangzhou University, Yangzhou 225001, P. R. China
3
Xenopoulos P, Rulff J, Nonato LG, Barr B, Silva C. Calibrate: Interactive Analysis of Probabilistic Model Output. IEEE Transactions on Visualization and Computer Graphics 2023; 29:853-863. [PMID: 36166523 DOI: 10.1109/tvcg.2022.3209489] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Analyzing classification model performance is a crucial task for machine learning practitioners. While practitioners often use count-based metrics derived from confusion matrices, like accuracy, many applications, such as weather prediction, sports betting, or patient risk prediction, rely on a classifier's predicted probabilities rather than predicted labels. In these instances, practitioners are concerned with producing a calibrated model, that is, one which outputs probabilities that reflect those of the true distribution. Model calibration is often analyzed visually through static reliability diagrams; however, the traditional calibration visualization may suffer from a variety of drawbacks due to the strong aggregations it necessitates. Furthermore, count-based approaches are unable to sufficiently analyze model calibration. We present Calibrate, an interactive reliability diagram that addresses the aforementioned issues. Calibrate constructs a reliability diagram that is resistant to drawbacks in traditional approaches, and allows for interactive subgroup analysis and instance-level inspection. We demonstrate the utility of Calibrate through use cases on both real-world and synthetic data. We further validate Calibrate by presenting the results of a think-aloud experiment with data scientists who routinely analyze model calibration.
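For readers unfamiliar with reliability diagrams, the following is a minimal sketch of the traditional binned computation that the abstract contrasts against. It is not Calibrate's method. The function names `reliability_bins` and `expected_calibration_error`, the equal-width binning, and the binary-label setting are all illustrative assumptions, and probabilities are assumed to lie in [0, 1].

```python
from typing import List, Tuple


def reliability_bins(
    probs: List[float], labels: List[int], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Group predictions into equal-width probability bins.

    Returns one (mean predicted probability, observed positive rate, count)
    tuple per non-empty bin. Hypothetical helper, not from the cited paper.
    """
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        # Equal-width bins over [0, 1]; p == 1.0 is placed in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        sums[b] += p
        hits[b] += y
        counts[b] += 1
    return [
        (sums[b] / counts[b], hits[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


def expected_calibration_error(
    probs: List[float], labels: List[int], n_bins: int = 10
) -> float:
    """Count-weighted mean gap between predicted and observed rates."""
    n = len(probs)
    return sum(
        count / n * abs(mean_p - rate)
        for mean_p, rate, count in reliability_bins(probs, labels, n_bins)
    )
```

Plotting the first element of each tuple against the second gives the static diagram; a perfectly calibrated model falls on the diagonal. The per-bin averaging is the kind of strong aggregation the abstract refers to, since it hides how individual instances or subgroups behave inside a bin.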
4
Huang Y, Chang H, Chen X, Meng J, Han M, Huang T, Yuan L, Zhang G. A cell marker-based clustering strategy (cmCluster) for precise cell type identification of scRNA-seq data. Quantitative Biology 2023. [DOI: 10.15302/j-qb-022-0311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/11/2023]
5
Vajiac C, Chau DH, Olligschlaeger A, Mackenzie R, Nair P, Lee MC, Li Y, Park N, Rabbany R, Faloutsos C. TRAFFICVIS: Visualizing Organized Activity and Spatio-Temporal Patterns for Detecting and Labeling Human Trafficking. IEEE Transactions on Visualization and Computer Graphics 2022; PP:1-10. [PMID: 36201417 DOI: 10.1109/tvcg.2022.3209403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Law enforcement and domain experts can detect human trafficking (HT) in online escort websites by analyzing suspicious clusters of connected ads. How can we explain clustering results intuitively and interactively, visualizing potential evidence for experts to analyze? We present TRAFFICVIS, the first interface for cluster-level HT detection and labeling. Developed through months of participatory design with domain experts, TRAFFICVIS provides coordinated views in conjunction with carefully chosen backend algorithms to effectively show spatio-temporal and text patterns to a wide variety of anti-HT stakeholders. We build upon state-of-the-art text clustering algorithms by incorporating shared metadata as a signal of connected and possibly suspicious activity, then visualize the results. Domain experts can use TRAFFICVIS to label clusters as HT, or as other suspicious but non-HT activity such as spam and scams, quickly creating labeled datasets to enable further HT research. Through domain expert feedback and a usage scenario, we demonstrate TRAFFICVIS's efficacy. The feedback was overwhelmingly positive, with repeated high praise for the usability and explainability of our tool, the latter being vital for indicting possible criminals.
6
Humer C, Heberle H, Montanari F, Wolf T, Huber F, Henderson R, Heinrich J, Streit M. ChemInformatics Model Explorer (CIME): exploratory analysis of chemical model explanations. J Cheminform 2022; 14:21. [PMID: 35379315 PMCID: PMC8981840 DOI: 10.1186/s13321-022-00600-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/07/2021] [Accepted: 03/12/2022] [Indexed: 11/10/2022] [Open access]
Abstract
The introduction of machine learning to small molecule research, an inherently multidisciplinary field in which chemists and data scientists combine their expertise and collaborate, has been vital to making screening processes more efficient. In recent years, numerous models that predict pharmacokinetic properties or bioactivity have been published, and these are used on a daily basis by chemists to make decisions and prioritize ideas. The emerging field of explainable artificial intelligence is opening up new possibilities for understanding the reasoning that underlies a model. In small molecule research, this means relating contributions of substructures of compounds to their predicted properties, which in turn also allows the areas of the compounds that have the greatest influence on the outcome to be identified. However, there is no interactive visualization tool that facilitates such interdisciplinary collaborations towards interpretability of machine learning models for small molecules. To fill this gap, we present CIME (ChemInformatics Model Explorer), an interactive web-based system that allows users to inspect chemical data sets, visualize model explanations, compare interpretability techniques, and explore subgroups of compounds. The tool is model-agnostic and can be run on a server or a workstation.
Affiliation(s)
- Henry Heberle
  - Division Crop Science, Bayer AG, 40789, Monheim am Rhein, DE, Germany
- Thomas Wolf
  - Division Crop Science, Bayer AG, 65926, Frankfurt, DE, Germany
- Florian Huber
  - Division Crop Science, Bayer AG, 65926, Frankfurt, DE, Germany
- Ryan Henderson
  - Digital Technologies, Bayer AG, 13353, Berlin, DE, Germany
- Julian Heinrich
  - Division Crop Science, Bayer AG, 40789, Monheim am Rhein, DE, Germany
- Marc Streit
  - Johannes Kepler University Linz, Linz, Austria
7
Bernard J, Hutter M, Sedlmair M, Zeppelzauer M, Munzner T. A Taxonomy of Property Measures to Unify Active Learning and Human-centered Approaches to Data Labeling. ACM T INTERACT INTEL 2021. [DOI: 10.1145/3439333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Strategies for selecting the next data instance to label, in service of generating labeled data for machine learning, have been considered separately in the machine learning literature on active learning and in the visual analytics literature on human-centered approaches. We propose a unified design space for instance selection strategies to support detailed and fine-grained analysis covering both of these perspectives. We identify a concise set of 15 properties, namely measurable characteristics of datasets or of machine learning models applied to them, that cover most of the strategies in these literatures. To quantify these properties, we introduce Property Measures (PM) as fine-grained building blocks that can be used to formalize instance selection strategies. In addition, we present a taxonomy of PMs to support the description, evaluation, and generation of PMs across four dimensions: machine learning (ML) Model Output, Instance Relations, Measure Functionality, and Measure Valence. We also create computational infrastructure to support qualitative visual data analysis: a visual analytics explainer for PMs built around an implementation of PMs using cascades of eight atomic functions. It supports eight analysis tasks, covering the analysis of datasets and ML models using visual comparison within and between PMs and groups of PMs, and over time during the interactive labeling process. We iteratively refined the PM taxonomy, the explainer, and the task abstraction in parallel with each other during a two-year formative process, and show evidence of their utility through a summative evaluation with the same infrastructure. This research builds a formal baseline for the better understanding of the commonalities and differences of instance selection strategies, which can serve as the stepping stone for the synthesis of novel strategies in future work.
8
Hinterreiter A, Steinparz C, Schöfl M, Stitz H, Streit M. Projection Path Explorer: Exploring Visual Patterns in Projected Decision-making Paths. ACM T INTERACT INTEL 2021. [DOI: 10.1145/3387165] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/20/2022]
Abstract
In problem-solving, a path towards a solution can be viewed as a sequence of decisions. The decisions, made by humans or computers, describe a trajectory through a high-dimensional representation space of the problem. By means of dimensionality reduction, these trajectories can be visualized in lower-dimensional space. Such embedded trajectories have previously been applied to a wide variety of data, but analysis has focused almost exclusively on the self-similarity of single trajectories. In contrast, we describe patterns emerging from drawing many trajectories (for different initial conditions, end states, and solution strategies) in the same embedding space. We argue that general statements about the problem-solving tasks and solving strategies can be made by interpreting these patterns. We explore and characterize such patterns in trajectories resulting from human and machine-made decisions in a variety of application domains: logic puzzles (Rubik’s cube), strategy games (chess), and optimization problems (neural network training). We also discuss the importance of suitably chosen representation spaces and similarity metrics for the embedding.
Affiliation(s)
- Andreas Hinterreiter
  - Johannes Kepler University Linz, Austria and Imperial College London, London, UK
- Holger Stitz
  - Johannes Kepler University Linz, Austria and datavisyn GmbH, Austria
- Marc Streit
  - Johannes Kepler University Linz, Linz, Austria