1
Jo T, Bice P, Nho K, Saykin AJ. LD-informed deep learning for Alzheimer's gene loci detection using WGS data. ALZHEIMER'S & DEMENTIA (NEW YORK, N. Y.) 2025; 11:e70041. [PMID: 39822590 PMCID: PMC11736638 DOI: 10.1002/trc2.70041]
Abstract
INTRODUCTION: The exponential growth of genomic datasets necessitates advanced analytical tools to effectively identify genetic loci from large-scale high-throughput sequencing data. This study presents Deep-Block, a multi-stage deep learning framework that incorporates biological knowledge into its AI architecture to identify genetic regions significantly associated with Alzheimer's disease (AD). The framework employs a three-stage approach: (1) genome segmentation based on linkage disequilibrium (LD) patterns, (2) selection of relevant LD blocks using sparse attention mechanisms, and (3) application of TabNet and Random Forest algorithms to quantify single nucleotide polymorphism (SNP) feature importance, thereby identifying genetic factors contributing to AD risk.
METHODS: Deep-Block was applied to a large-scale whole genome sequencing (WGS) dataset from the Alzheimer's Disease Sequencing Project (ADSP), comprising 7,416 non-Hispanic white (NHW) participants (3,150 cognitively normal older adults (CN), 4,266 AD).
RESULTS: A total of 30,218 LD blocks were identified and then ranked by their relevance to AD. Deep-Block then identified novel SNPs within the top 1,500 LD blocks and confirmed previously known variants, including APOE rs429358 and rs769449. Expression quantitative trait loci (eQTL) analysis across 13 brain regions provided functional evidence for the identified variants. The results were cross-validated against established AD-associated loci from the European Alzheimer's and Dementia Biobank (EADB) and the GWAS Catalog.
DISCUSSION: The Deep-Block framework effectively processes large-scale high-throughput sequencing data while preserving SNP interactions during dimensionality reduction, minimizing bias and information loss. The framework's findings are supported by tissue-specific eQTL evidence across brain regions, indicating the functional relevance of the identified variants. Additionally, Deep-Block identified both known and novel genetic variants, enhancing our understanding of the genetic architecture of AD and demonstrating its potential for application in large-scale sequencing studies.
Highlights:
- Growing genomic datasets require advanced tools to identify genetic loci in sequencing data.
- Deep-Block, a novel AI framework, was used to process large-scale ADSP WGS data.
- Deep-Block identified both known and novel AD-associated genetic loci.
- rs429358 (APOE) was key; rs11556505 (TOMM40) and rs34342646 (NECTIN2) were also significant.
- The AI framework uses biological knowledge to enhance detection of Alzheimer's loci.
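The stage-3 ranking idea lends itself to a compact illustration. Below is a minimal sketch, on synthetic genotypes rather than ADSP data, of scoring SNPs with Random Forest feature importance and aggregating the scores per LD block; the paper's sparse-attention and TabNet stages are omitted, and all names and sizes are illustrative assumptions.

```python
# Minimal sketch of block-level ranking via Random Forest importance.
# Genotypes, block assignments, and the phenotype are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_snps, n_blocks = 500, 200, 20
X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)  # 0/1/2 genotypes
block_of = rng.integers(0, n_blocks, size=n_snps)               # SNP -> LD block
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, n_samples) > 2).astype(int)  # toy phenotype

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
snp_importance = rf.feature_importances_

# Aggregate SNP importances within each LD block and rank the blocks.
block_scores = np.zeros(n_blocks)
for b in range(n_blocks):
    block_scores[b] = snp_importance[block_of == b].sum()
top_blocks = np.argsort(block_scores)[::-1]
print("Top 5 LD blocks:", top_blocks[:5])
```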
Affiliation(s)
- Taeho Jo
- Indiana Alzheimer Disease Research Center and Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Paula Bice
- Indiana Alzheimer Disease Research Center and Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Kwangsik Nho
- Indiana Alzheimer Disease Research Center and Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Andrew J. Saykin
- Indiana Alzheimer Disease Research Center and Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, Indiana, USA
2
Montambault B, Appleby G, Rogers J, Brumar CD, Li M, Chang R. DimBridge: Interactive Explanation of Visual Patterns in Dimensionality Reductions with Predicate Logic. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2025; 31:207-217. [PMID: 39312423 DOI: 10.1109/tvcg.2024.3456391]
Abstract
Dimensionality reduction techniques are widely used for visualizing high-dimensional data. However, support for interpreting patterns in dimensionality reduction results in the context of the original data space is often insufficient. Consequently, users may struggle to extract insights from the projections. In this paper, we introduce DimBridge, a visual analytics tool that allows users to interact with visual patterns in a projection and retrieve corresponding data patterns. DimBridge supports several interactions, allowing users to perform various analyses, from contrasting multiple clusters to explaining complex latent structures. Leveraging first-order predicate logic, DimBridge identifies subspaces in the original dimensions relevant to a queried pattern and provides an interface for users to visualize and interact with them. We demonstrate how DimBridge can help users overcome the challenges associated with interpreting visual patterns in projections.
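The predicate idea can be sketched numerically. The toy example below assumes simple axis-aligned interval predicates and an F1-style score (not necessarily the paper's exact formulation) to show how a brushed selection in a projection might be described by ranges over the original dimensions.

```python
# Toy sketch: explain a brushed selection with per-dimension interval predicates.
# The data and scoring rule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))          # original high-dimensional data
selected = X[:, 2] > 0.8               # pretend these points were brushed in a projection

def interval_predicate(col, sel):
    lo, hi = col[sel].min(), col[sel].max()
    covered = (col >= lo) & (col <= hi)
    # F1-style score: how well does "lo <= x_d <= hi" describe the selection?
    tp = np.sum(covered & sel)
    precision, recall = tp / covered.sum(), tp / sel.sum()
    return (lo, hi), 2 * precision * recall / (precision + recall)

for d in range(X.shape[1]):
    (lo, hi), score = interval_predicate(X[:, d], selected)
    print(f"dim {d}: {lo:.2f} <= x <= {hi:.2f}  score={score:.2f}")
```

A high-scoring, narrow interval (here on dimension 2) is the kind of predicate a tool like this would surface as an explanation.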
3
Jo T, Bice P, Nho K, Saykin AJ. LD-informed deep learning for Alzheimer's gene loci detection using WGS data. MEDRXIV: THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.09.19.24313993. [PMID: 39371140 PMCID: PMC11451815 DOI: 10.1101/2024.09.19.24313993]
Abstract
INTRODUCTION: The exponential growth of genomic datasets necessitates advanced analytical tools to effectively identify genetic loci from large-scale high-throughput sequencing data. This study presents Deep-Block, a multi-stage deep learning framework that incorporates biological knowledge into its AI architecture to identify genetic regions significantly associated with Alzheimer's disease (AD). The framework employs a three-stage approach: (1) genome segmentation based on linkage disequilibrium (LD) patterns, (2) selection of relevant LD blocks using sparse attention mechanisms, and (3) application of TabNet and Random Forest algorithms to quantify single nucleotide polymorphism (SNP) feature importance, thereby identifying genetic factors contributing to AD risk.
METHODS: Deep-Block was applied to a large-scale whole genome sequencing (WGS) dataset from the Alzheimer's Disease Sequencing Project (ADSP), comprising 7,416 non-Hispanic white participants (3,150 cognitively normal older adults (CN), 4,266 AD).
RESULTS: A total of 30,218 LD blocks were identified and then ranked by their relevance to AD. Deep-Block then identified novel SNPs within the top 1,500 LD blocks and confirmed previously known variants, including APOE rs429358 and rs769449. Expression quantitative trait loci (eQTL) analysis across 13 brain regions provided functional evidence for the identified variants. The results were cross-validated against established AD-associated loci from the European Alzheimer's and Dementia Biobank (EADB) and the GWAS Catalog.
DISCUSSION: The Deep-Block framework effectively processes large-scale high-throughput sequencing data while preserving SNP interactions during dimensionality reduction, minimizing bias and information loss. The framework's findings are supported by tissue-specific eQTL evidence across brain regions, indicating the functional relevance of the identified variants. Additionally, Deep-Block identified both known and novel genetic variants, enhancing our understanding of the genetic architecture of AD and demonstrating its potential for application in large-scale sequencing studies.
Affiliation(s)
- Taeho Jo
- Indiana Alzheimer Disease Research Center and Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Paula Bice
- Indiana Alzheimer Disease Research Center and Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Kwangsik Nho
- Indiana Alzheimer Disease Research Center and Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Andrew J. Saykin
- Indiana Alzheimer Disease Research Center and Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
4
Zhang S, Chen H, Wan Y, Wang H, Qu H. A Data-Driven Approach for Leveraging Inline and Offline Data to Determine the Causes of Monoclonal Antibody Productivity Reduction in the Commercial-Scale Cell Culture Process. Pharmaceutics 2024; 16:1082. [PMID: 39204427 PMCID: PMC11359819 DOI: 10.3390/pharmaceutics16081082]
Abstract
The monoclonal antibody (mAb) manufacturing process brings high profits at high costs, making mAb productivity of vital importance. However, many factors can impact the cell culture process and lead to reduced mAb productivity. The biopharma industry now widely employs manufacturing information systems that integrate both online and offline data. Although the volume of data is large, data mining studies aimed at improving mAb productivity remain rare. Therefore, this study proposes a data-driven approach that leverages both the inline and offline data of the cell culture process to discover the causes of mAb productivity reduction. The approach consists of four steps: data preprocessing, phase division, feature extraction and fusion, and cluster comparison. First, data quality issues are resolved during the data preprocessing step. Next, the inline data are divided into several phases using a moving-window k-nearest neighbor method. Then, inline data features are extracted via functional data analysis and combined with the offline data features. Finally, the causes of mAb productivity reduction are identified using the contrasting clusters via principal component analysis method. A commercial-scale cell culture process case study is provided to verify the effectiveness of the approach. Data from 35 batches were collected, each containing nine inline variables and seven offline variables. The cause of mAb productivity reduction was identified as a lack of nutrients, and recommended actions were taken accordingly; their effectiveness was subsequently confirmed in six validation batches.
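The phase-division step can be illustrated in a few lines. Below is a rough sketch, with an invented signal and illustrative window size, neighbor count, and threshold (not the study's tuned values), of flagging phase boundaries where a window's k-nearest-neighbor distance to the preceding window jumps.

```python
# Rough sketch of moving-window kNN phase division on one inline variable.
# Signal, window size, k, and threshold are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
# Synthetic inline variable with three phases (different means).
signal = np.concatenate([rng.normal(0, 1, 200),
                         rng.normal(4, 1, 200),
                         rng.normal(1, 1, 200)])

w, k = 20, 5
scores = []
for t in range(w, len(signal) - w):
    prev = signal[t - w:t].reshape(-1, 1)
    curr = signal[t:t + w].reshape(-1, 1)
    nn = NearestNeighbors(n_neighbors=k).fit(prev)
    dist, _ = nn.kneighbors(curr)
    scores.append(dist.mean())       # high = window looks unlike the recent past

scores = np.array(scores)
boundaries = np.where(scores > scores.mean() + 2 * scores.std())[0] + w
print("Candidate phase boundaries near indices:", boundaries[:10])
```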
Affiliation(s)
- Sheng Zhang
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China; (S.Z.); (H.C.)
- Hang Chen
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China; (S.Z.); (H.C.)
- Yuxiang Wan
- BioRay Pharmaceutical Co., Ltd., Taizhou 318000, China; (Y.W.); (H.W.)
- Haibin Wang
- BioRay Pharmaceutical Co., Ltd., Taizhou 318000, China; (Y.W.); (H.W.)
- Haibin Qu
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China; (S.Z.); (H.C.)
5
Dennig FL, Miller M, Keim DA, El-Assady M. FS/DS: A Theoretical Framework for the Dual Analysis of Feature Space and Data Space. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2024; 30:5165-5182. [PMID: 37342951 DOI: 10.1109/tvcg.2023.3288356]
Abstract
With the surge of data-driven analysis techniques, there is a rising demand for enhancing the exploration of large high-dimensional data by enabling interactions for the joint analysis of features (i.e., dimensions). Such a dual analysis of the feature space and data space is characterized by three components: 1) a view visualizing feature summaries, 2) a view visualizing the data records, and 3) a bidirectional linking of the two plots triggered by human interaction in either visualization, e.g., linking & brushing. Dual analysis approaches span many domains, e.g., medicine, crime analysis, and biology. The proposed solutions encapsulate various techniques, such as feature selection or statistical analysis. However, each approach establishes its own definition of dual analysis. To address this gap, we systematically reviewed published dual analysis methods to investigate and formalize the key elements, such as the techniques used to visualize the feature space and data space, as well as the interaction between both spaces. From the information elicited during our review, we propose a unified theoretical framework for dual analysis that encompasses all existing approaches and extends the field. We apply our proposed formalization to describe the interactions between components and relate them to the tasks they address. Additionally, we categorize the existing approaches using our framework and derive future research directions for advancing dual analysis by incorporating state-of-the-art visual analysis techniques to improve data exploration.
6
Rafiei MH, Gauthier LV, Adeli H, Takabi D. Self-Supervised Learning for Electroencephalography. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:1457-1471. [PMID: 35867362 DOI: 10.1109/tnnls.2022.3190448]
Abstract
Decades of research have shown that machine learning is superior to conventional statistical techniques at discovering highly nonlinear patterns embedded in electroencephalography (EEG) records. However, even the most advanced machine learning techniques require relatively large, labeled EEG repositories. EEG data collection and labeling are costly. Moreover, combining available datasets to achieve a large data volume is usually infeasible due to inconsistent experimental paradigms across trials. Self-supervised learning (SSL) addresses these challenges because it enables learning from EEG records across trials with variable experimental paradigms, even when the trials explore different phenomena. It aggregates multiple EEG repositories to increase accuracy, reduce bias, and mitigate overfitting in machine learning training. In addition, SSL can be employed in situations where labeled training data are limited and manual labeling is costly. This article: 1) provides a brief introduction to SSL; 2) describes some SSL techniques employed in recent studies, including EEG; 3) proposes current and potential SSL techniques for future investigations in EEG studies; 4) discusses the pros and cons of different SSL techniques; and 5) proposes holistic implementation tips and potential future directions for EEG SSL practices.
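To make the pretext-task idea concrete, here is a compact sketch of one SSL task from the EEG SSL literature, relative positioning: window pairs drawn from an unlabeled signal are labeled by temporal proximity, providing free supervision without any human annotation. The synthetic noise signal, shapes, and threshold are illustrative assumptions; on real EEG, temporal structure makes the task informative and the learned features reusable downstream.

```python
# Sketch of a "relative positioning" pretext task on an unlabeled signal.
# The white-noise stand-in carries no temporal structure, so accuracy will be
# near chance here; real EEG is what makes this pretext task learnable.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
eeg = rng.normal(size=(1, 10_000))     # one unlabeled channel
win, tau = 200, 1_000                  # window length; "close in time" threshold

pairs, labels = [], []
for _ in range(2_000):
    t1 = rng.integers(0, eeg.shape[1] - win)
    t2 = rng.integers(0, eeg.shape[1] - win)
    # Pretext label: 1 if the windows are close in time, else 0.
    labels.append(int(abs(t1 - t2) < tau))
    pairs.append(np.concatenate([eeg[0, t1:t1 + win], eeg[0, t2:t2 + win]]))

clf = LogisticRegression(max_iter=1_000).fit(np.array(pairs), np.array(labels))
print("Pretext-task training accuracy:", clf.score(np.array(pairs), np.array(labels)))
```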
7
Fujiwara T, Liu TP. Contrastive multiple correspondence analysis (cMCA): Using contrastive learning to identify latent subgroups in political parties. PLoS One 2023; 18:e0287180. [PMID: 37428735 PMCID: PMC10332614 DOI: 10.1371/journal.pone.0287180]
Abstract
Scaling methods have long been used to simplify and cluster high-dimensional data. However, the general latent spaces these methods derive across all predefined groups sometimes fail to capture the specific within-group patterns researchers are interested in. To tackle this issue, we adopt an emerging analysis approach called contrastive learning. We contribute to this growing field by extending its ideas to multiple correspondence analysis (MCA) in order to enable an analysis of data often encountered by social scientists: data containing binary, ordinal, and nominal variables. We demonstrate the utility of contrastive MCA (cMCA) by analyzing two different surveys of voters in the U.S. and U.K. Our results suggest that, first, cMCA can identify substantively important dimensions and divisions among subgroups that are overlooked by traditional methods; second, in other cases, cMCA can derive latent traits that emphasize subgroups only moderately visible in those derived by traditional methods.
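A simplified numerical sketch of the contrastive ingredient: one-hot encode the categorical responses (as MCA's indicator matrix does) and eigendecompose cov(target) - alpha * cov(background) to find directions strong in the target group but weak in the background. This stands in for, but is not identical to, the paper's full cMCA formulation; the data and alpha are invented.

```python
# Simplified contrastive decomposition on one-hot-encoded categorical data.
# A plain-covariance stand-in for cMCA; data and alpha are illustrative.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(4)
answers = rng.integers(0, 3, size=(400, 6)).astype(str)   # nominal survey items
target, background = answers[:200], answers[200:]

enc = OneHotEncoder(sparse_output=False).fit(answers)     # MCA-style indicator matrix
T, B = enc.transform(target), enc.transform(background)

alpha = 2.0
C = np.cov(T, rowvar=False) - alpha * np.cov(B, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
components = eigvecs[:, ::-1][:, :2]                      # top-2 contrastive directions
projection = (T - T.mean(axis=0)) @ components
print("Projected target group shape:", projection.shape)
```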
8
Xu C, Neuroth T, Fujiwara T, Liang R, Ma KL. A Predictive Visual Analytics System for Studying Neurodegenerative Disease Based on DTI Fiber Tracts. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2023; 29:2020-2035. [PMID: 34965212 DOI: 10.1109/tvcg.2021.3137174]
Abstract
Diffusion tensor imaging (DTI) has been used to study the effects of neurodegenerative diseases on neural pathways, which may lead to more reliable and earlier diagnosis of these diseases as well as a better understanding of how they affect the brain. We introduce a predictive visual analytics system for studying patient groups based on their labeled DTI fiber tract data and corresponding statistics. The system's machine-learning-augmented interface guides the user through an organized and holistic analysis space, including the statistical feature space, the physical space, and the space of patients over different groups. We use a custom machine learning pipeline to help narrow down this large analysis space and then explore it pragmatically through a range of linked visualizations. We conduct several case studies using DTI and T1-weighted images from the research database of the Parkinson's Progression Markers Initiative.
9
Gebrye H, Wang Y, Li F. Traffic data extraction and labeling for machine learning based attack detection in IoT networks. INT J MACH LEARN CYB 2023. [DOI: 10.1007/s13042-022-01765-7]
10
Ye S, Chen Z, Chu X, Li K, Luo J, Li Y, Geng G, Wu Y. PuzzleFixer: A Visual Reassembly System for Immersive Fragments Restoration. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2023; 29:429-439. [PMID: 36179001 DOI: 10.1109/tvcg.2022.3209388]
Abstract
We present PuzzleFixer, an immersive interactive system for experts to rectify defective reassembled 3D objects. Reassembling the fragments of a broken object to restore its original state is the prerequisite of many analytical tasks such as cultural relics analysis and forensics reasoning. While existing computer-aided methods can automatically reassemble fragments, they often derive incorrect objects due to the complex and ambiguous fragment shapes. Thus, experts usually need to refine the object manually. Prior advances in immersive technologies provide benefits for realistic perception and direct interactions to visualize and interact with 3D fragments. However, few studies have investigated the reassembled object refinement. The specific challenges include: 1) the fragment combination set is too large to determine the correct matches, and 2) the geometry of the fragments is too complex to align them properly. To tackle the first challenge, PuzzleFixer leverages dimensionality reduction and clustering techniques, allowing users to review possible match categories, select the matches with reasonable shapes, and drill down to shapes to correct the corresponding faces. For the second challenge, PuzzleFixer embeds the object with node-link networks to augment the perception of match relations. Specifically, it instantly visualizes matches with graph edges and provides force feedback to facilitate the efficiency of alignment interactions. To demonstrate the effectiveness of PuzzleFixer, we conducted an expert evaluation based on two cases on real-world artifacts and collected feedback through post-study interviews. The results suggest that our system is suitable and efficient for experts to refine incorrect reassembled objects.
11
Li J, Zhou CQ. Incorporation of Human Knowledge into Data Embeddings to Improve Pattern Significance and Interpretability. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2023; 29:723-733. [PMID: 36155441 DOI: 10.1109/tvcg.2022.3209382]
Abstract
Embedding is a common technique for analyzing multi-dimensional data. However, the embedding projection cannot always form significant and interpretable visual structures that foreshadow underlying data patterns. We propose an approach that incorporates human knowledge into data embeddings to improve pattern significance and interpretability. The core idea is (1) externalizing tacit human knowledge as explicit sample labels and (2) adding a classification loss in the embedding network to encode samples' classes. The approach pulls samples of the same class with similar data features closer in the projection, leading to more compact (significant) and class-consistent (interpretable) visual structures. We give an embedding network with a customized classification loss to implement the idea and integrate the network into a visualization system to form a workflow that supports flexible class creation and pattern exploration. Patterns found on open datasets in case studies, subjects' performance in a user study, and quantitative experiment results illustrate the general usability and effectiveness of the approach.
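The core idea, an embedding network with an added classification loss, fits in a short PyTorch sketch. The architecture and sizes below are illustrative assumptions, and only the classification term is shown, without the additional embedding objectives a full implementation would combine it with.

```python
# Minimal sketch: a parametric 2D embedding trained with a classification loss,
# so samples the user labels as the same class are pulled together in the layout.
# Sizes and architecture are illustrative, not the paper's network.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)                      # data features
y = torch.randint(0, 3, (256,))               # user-provided class labels

encoder = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
classifier = nn.Linear(2, 3)                  # classification head on the 2D embedding
opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    emb = encoder(X)                          # 2D coordinates for plotting
    loss = loss_fn(classifier(emb), y)        # encodes the classes into the layout
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final classification loss:", loss.item())
```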
12
Wang X, Chen W, Xia J, Wen Z, Zhu R, Schreck T. HetVis: A Visual Analysis Approach for Identifying Data Heterogeneity in Horizontal Federated Learning. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2023; 29:310-319. [PMID: 36197857 DOI: 10.1109/tvcg.2022.3209347]
Abstract
Horizontal federated learning (HFL) enables distributed clients to train a shared model while preserving data privacy. In training high-quality HFL models, data heterogeneity among clients is one of the major concerns. However, due to security constraints and the complexity of deep learning models, it is challenging to investigate data heterogeneity across different clients. To address this issue, based on a requirement analysis, we developed HetVis, a visual analytics tool for participating clients to explore data heterogeneity. We identify data heterogeneity by comparing the prediction behaviors of the global federated model and a stand-alone model trained with local data. A context-aware clustering of the inconsistent records is then performed to provide a summary of data heterogeneity. Combined with the proposed comparison techniques, we develop a novel set of visualizations to identify heterogeneity issues in HFL. We designed three case studies to introduce how HetVis can assist client analysts in understanding different types of heterogeneity issues. Expert reviews and a comparative study demonstrate the effectiveness of HetVis.
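The comparison step can be approximated outside a real federated setup. The sketch below is a stand-in that pools data only to fake a "global" model (a true HFL deployment never pools raw data): it finds records where global and local predictions disagree and clusters them as a heterogeneity summary.

```python
# Toy stand-in for the global-vs-local comparison behind HetVis.
# Models and data are illustrative; clustering here is plain KMeans, not the
# paper's context-aware clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X_client = rng.normal(size=(300, 5)); y_client = (X_client[:, 0] > 0).astype(int)
X_others = rng.normal(loc=1.0, size=(700, 5)); y_others = (X_others[:, 1] > 1).astype(int)

local = LogisticRegression().fit(X_client, y_client)
global_ = LogisticRegression().fit(np.vstack([X_client, X_others]),
                                   np.concatenate([y_client, y_others]))

# Inconsistent records: the two models disagree on this client's data.
disagree = local.predict(X_client) != global_.predict(X_client)
print(f"{disagree.sum()} inconsistent records on this client")
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_client[disagree])
print("cluster sizes:", np.bincount(clusters))
```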
13
Fujiwara T, Sakamoto N, Nonaka J, Ma KL. A Visual Analytics Approach for Hardware System Monitoring with Streaming Functional Data Analysis. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2022; 28:2338-2349. [PMID: 35394909 DOI: 10.1109/tvcg.2022.3165348]
Abstract
Many real-world applications involve analyzing time-dependent phenomena, which are intrinsically functional, consisting of curves varying over a continuum (e.g., time). When analyzing continuous data, functional data analysis (FDA) provides substantial benefits, such as the ability to study the derivatives and to restrict the ordering of data. However, continuous data inherently have infinite dimensions, and for a long time series, FDA methods often suffer from high computational costs. The analysis problem becomes even more challenging when the FDA results must be updated for continuously arriving data. In this paper, we present a visual analytics approach that uses FDA to monitor and review time series data streamed from a hardware system, with a focus on identifying outliers. To perform FDA while addressing the computational problem, we introduce new incremental and progressive algorithms that promptly generate the magnitude-shape (MS) plot, which conveys both the functional magnitude and shape outlyingness of time series data. In addition, by using the MS plot in conjunction with an FDA version of principal component analysis, we enhance the analyst's ability to investigate the visually identified outliers. We illustrate the effectiveness of our approach with two use scenarios using real-world datasets. The resulting tool is evaluated by industry experts using real-world streaming datasets.
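The MS-plot coordinates have a compact batch formulation. The sketch below computes a common variant, pointwise outlyingness against the median/MAD, summarized per curve by its mean (magnitude) and variance (shape); the paper's incremental and progressive streaming algorithms are not reproduced here, and the curves are synthetic.

```python
# Bare-bones magnitude-shape (MS) plot coordinates for a batch of curves.
# A simplified variant on synthetic data; not the paper's streaming algorithms.
import numpy as np

rng = np.random.default_rng(6)
curves = rng.normal(size=(100, 50)).cumsum(axis=1)   # 100 time series
curves[0] += 15                                      # magnitude outlier
curves[1] *= -1                                      # shape outlier

med = np.median(curves, axis=0)
mad = np.median(np.abs(curves - med), axis=0) * 1.4826
out = (curves - med) / mad                           # pointwise outlyingness
MO = out.mean(axis=1)                                # mean outlyingness (magnitude)
VO = out.var(axis=1)                                 # variability of outlyingness (shape)

# Plotting MO vs. VO gives the MS plot; the injected outliers land far from the bulk.
for i in np.argsort(VO)[::-1][:3]:
    print(f"curve {i}: MO={MO[i]:+.2f}  VO={VO[i]:.2f}")
```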
14
Muhammad U, Hoque MZ, Oussalah M, Keskinarkaus A, Seppänen T, Sarder P. SAM: Self-augmentation mechanism for COVID-19 detection using chest X-ray images. Knowl Based Syst 2022; 241:108207. [PMID: 35068707 PMCID: PMC8762871 DOI: 10.1016/j.knosys.2022.108207]
Abstract
COVID-19 is a rapidly spreading viral disease that has affected over 100 countries worldwide. The numbers of casualties and cases of infection have escalated, particularly in countries with weakened healthcare systems. Reverse transcription-polymerase chain reaction (RT-PCR) is the test of choice for diagnosing COVID-19. However, current evidence suggests that COVID-19 infection typically triggers a lung infection after contact with the virus. Therefore, chest X-ray (i.e., radiography) and chest CT can serve as surrogates in countries where PCR is not readily available. This has motivated the scientific community to detect COVID-19 infection from X-ray images, and recently proposed machine learning methods offer great promise for fast and accurate detection. Deep learning with convolutional neural networks (CNNs) has been successfully applied to radiological imaging to improve diagnostic accuracy. However, performance remains limited by the lack of representative X-ray images in public benchmark datasets. To alleviate this issue, we propose a self-augmentation mechanism that performs data augmentation in the feature space rather than in the data space using reconstruction independent component analysis (RICA). Specifically, we propose a unified architecture containing a deep convolutional neural network (CNN), a feature augmentation mechanism, and a bidirectional LSTM (BiLSTM). The CNN provides high-level features extracted at the pooling layer, from which the augmentation mechanism chooses the most relevant features and generates low-dimensional augmented features. Finally, the BiLSTM classifies the processed sequential information. Experiments on three publicly available databases show that the proposed approach achieves state-of-the-art results with accuracies of 97%, 84%, and 98%. Explainability analysis was carried out using feature visualization through PCA projection and t-SNE plots.
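The RICA objective at the heart of the feature-space augmentation can be written down directly. Below is a minimal sketch, with random features standing in for pooled CNN activations and illustrative dimensions and weights: a reconstruction penalty on an unconstrained filter matrix plus a smooth L1 sparsity penalty on its responses.

```python
# Minimal reconstruction ICA (RICA) objective: ||W^T W x - x||^2 plus a smooth
# L1 penalty on the responses Wx. Random data stands in for pooled CNN features.
import torch

torch.manual_seed(0)
F = torch.randn(512, 64)                     # stand-in for pooled CNN features
W = torch.randn(32, 64, requires_grad=True)  # 32 RICA filters
opt = torch.optim.Adam([W], lr=1e-2)

lam, eps = 0.1, 1e-6
for step in range(300):
    z = F @ W.T                              # filter responses (augmented features)
    recon = z @ W                            # W^T W x reconstruction
    loss = ((recon - F) ** 2).mean() + lam * torch.sqrt(z ** 2 + eps).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

augmented = (F @ W.T).detach()               # low-dimensional augmented features
print("augmented feature shape:", tuple(augmented.shape))
```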
Affiliation(s)
- Usman Muhammad
- Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu, Finland
- Md Ziaul Hoque
- Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu, Finland
- Mourad Oussalah
- Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu, Finland
- Medical Imaging, Physics, and Technology (MIPT), Faculty of Medicine, University of Oulu, Finland
- Anja Keskinarkaus
- Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu, Finland
- Tapio Seppänen
- Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu, Finland
- Pinaki Sarder
- Department of Pathology and Anatomical Sciences, University at Buffalo, USA
15
Fujiwara T, Wei X, Zhao J, Ma KL. Interactive Dimensionality Reduction for Comparative Analysis. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2022; 28:758-768. [PMID: 34591765 DOI: 10.1109/tvcg.2021.3114807]
Abstract
Finding the similarities and differences between groups of datasets is a fundamental analysis task. For high-dimensional data, dimensionality reduction (DR) methods are often used to find the characteristics of each group. However, existing DR methods provide limited capability and flexibility for such comparative analysis as each method is designed only for a narrow analysis target, such as identifying factors that most differentiate groups. This paper presents an interactive DR framework where we integrate our new DR method, called ULCA (unified linear comparative analysis), with an interactive visual interface. ULCA unifies two DR schemes, discriminant analysis and contrastive learning, to support various comparative analysis tasks. To provide flexibility for comparative analysis, we develop an optimization algorithm that enables analysts to interactively refine ULCA results. Additionally, the interactive visualization interface facilitates interpretation and refinement of the ULCA results. We evaluate ULCA and the optimization algorithm to show their efficiency as well as present multiple case studies using real-world datasets to demonstrate the usefulness of this framework.
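The family of linear methods ULCA unifies reduces to generalized eigenproblems over group covariance matrices. As an illustrative special case (not ULCA's full parameterization), the sketch below maximizes target-group variance relative to background-group variance.

```python
# Simplified special case of the discriminant/contrastive family ULCA unifies:
# maximize target variance relative to background variance via a generalized
# eigenproblem. Data and the weighting are illustrative assumptions.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(7)
target = rng.normal(size=(200, 5)) @ np.diag([3, 1, 1, 1, 1])
background = rng.normal(size=(200, 5))

cov_t = np.cov(target, rowvar=False)
cov_b = np.cov(background, rowvar=False)

# Solve cov_t v = lambda (cov_b + gamma I) v; gamma regularizes the denominator.
gamma = 1e-3
eigvals, eigvecs = eigh(cov_t, cov_b + gamma * np.eye(5))
M = eigvecs[:, ::-1][:, :2]          # top-2 directions
print("projected target shape:", (target @ M).shape)
```

Interactive refinement in the paper's framework amounts to re-weighting the matrices entering this eigenproblem and re-solving, which is why the optimization stays fast enough for a visual interface.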
16
IXVC: An interactive pipeline for explaining visual clusters in dimensionality reduction visualizations with decision trees. ARRAY 2021. [DOI: 10.1016/j.array.2021.100080]
17
Natsukawa H, Deyle ER, Pao GM, Koyamada K, Sugihara G. A Visual Analytics Approach for Ecosystem Dynamics based on Empirical Dynamic Modeling. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2021; 27:506-516. [PMID: 33026998 DOI: 10.1109/tvcg.2020.3028956]
Abstract
An important approach for scientific inquiry across many disciplines involves using observational time series data to understand the relationships between key variables to gain mechanistic insights into the underlying rules that govern the given system. In real systems, such as those found in ecology, the relationships between time series variables are generally not static; instead, these relationships are dynamical and change in a nonlinear or state-dependent manner. To further understand such systems, we investigate integrating methods that appropriately characterize these dynamics (i.e., methods that measure interactions as they change with time-varying system states) with visualization techniques that can help analyze the behavior of the system. Here, we focus on empirical dynamic modeling (EDM) as a state-of-the-art method that specifically identifies causal variables and measures changing state-dependent relationships between time series variables. Instead of using approaches centered on parametric equations, EDM is an equation-free approach that studies systems based on their dynamic attractors. We propose a visual analytics system to support the identification and mechanistic interpretation of system states using an EDM-constructed dynamic graph. This work, as detailed in four analysis tasks and demonstrated with a GUI, provides a novel synthesis of EDM and visualization techniques such as brush-link visualization and visual summarization to interpret dynamic graphs representing ecosystem dynamics. We applied our proposed system to ecological simulation data and real data from a marine mesocosm study as two key use cases. Our case studies show that our visual analytics tools support the identification and interpretation of the system state by the user, and enable us to discover both confirmatory and new findings in ecosystem dynamics. Overall, we demonstrated that our system can facilitate an understanding of how systems function beyond the intuitive analysis of high-dimensional information based on specific domain knowledge.
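The EDM building blocks behind this system are easy to sketch. Below, an assumed logistic-map series stands in for an ecological time series: it is delay-embedded (Takens) and forecast by simplex projection, predicting each state from its nearest neighbors on the reconstructed attractor. The embedding dimension E, lag tau, and neighbor count are illustrative choices.

```python
# Tiny EDM sketch: time-delay embedding plus simplex projection forecasting.
# The logistic map is an illustrative stand-in for an ecological series.
import numpy as np

x = np.empty(500); x[0] = 0.4
for t in range(499):                          # simple nonlinear dynamics
    x[t + 1] = 3.9 * x[t] * (1 - x[t])

E, tau = 3, 1
idx = np.arange((E - 1) * tau, len(x) - 1)
lib = np.column_stack([x[idx - j * tau] for j in range(E)])   # embedded states
targets = x[idx + 1]                                          # next value per state

def simplex_forecast(query, library, lib_targets, k=E + 1):
    d = np.linalg.norm(library - query, axis=1)
    nn = np.argsort(d)[:k]                    # nearest neighbors on the attractor
    w = np.exp(-d[nn] / max(d[nn[0]], 1e-12)) # exponential distance weights
    return np.sum(w * lib_targets[nn]) / w.sum()

pred = simplex_forecast(lib[-1], lib[:-1], targets[:-1])
print(f"one-step forecast: {pred:.3f}, actual: {targets[-1]:.3f}")
```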
18
Knittel J, Lalama A, Koch S, Ertl T. Visual Neural Decomposition to Explain Multivariate Data Sets. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2021; 27:1374-1384. [PMID: 33048724 DOI: 10.1109/tvcg.2020.3030420]
Abstract
Investigating relationships between variables in multi-dimensional data sets is a common task for data analysts and engineers. More specifically, it is often valuable to understand which ranges of which input variables lead to particular values of a given target variable. Unfortunately, with an increasing number of independent variables, this process may become cumbersome and time-consuming due to the many possible combinations that have to be explored. In this paper, we propose a novel approach to visualize correlations between input variables and a target output variable that scales to hundreds of variables. We developed a visual model based on neural networks that can be explored in a guided way to help analysts find and understand such correlations. First, we train a neural network to predict the target from the input variables. Then, we visualize the inner workings of the resulting model to help understand relations within the data set. We further introduce a new regularization term for the backpropagation algorithm that encourages the neural network to learn representations that are easier to interpret visually. We apply our method to artificial and real-world data sets to show its utility.
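The setup can be sketched in PyTorch. The paper's custom regularizer is not specified in the abstract, so the sketch below substitutes an L1 penalty on the first-layer weights, a labeled stand-in rather than their term, to encourage sparse, readable input-to-hidden connections.

```python
# Sketch: train an MLP on a target variable, then inspect which inputs drive
# each hidden node. The L1 penalty is a stand-in for the paper's regularizer.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(500, 20)
y = ((X[:, 3] > 0.5) & (X[:, 7] < 0)).float()     # target depends on two inputs

first = nn.Linear(20, 8)
model = nn.Sequential(first, nn.ReLU(), nn.Linear(8, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
bce = nn.BCEWithLogitsLoss()

for step in range(400):
    loss = bce(model(X).squeeze(1), y) + 1e-3 * first.weight.abs().sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Visualize" each hidden node by its strongest input connection.
top = first.weight.abs().argmax(dim=1)
print("dominant input variable per hidden node:", top.tolist())
```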
19
Fujiwara T, Sakamoto N, Nonaka J, Yamamoto K, Ma KL. A Visual Analytics Framework for Reviewing Multivariate Time-Series Data with Dimensionality Reduction. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2021; 27:1601-1611. [PMID: 33026990 DOI: 10.1109/tvcg.2020.3028889]
Abstract
Data-driven problem solving in many real-world applications involves analysis of time-dependent multivariate data, for which dimensionality reduction (DR) methods are often used to uncover the intrinsic structure and features of the data. However, DR is usually applied to a subset of data that is either single-time-point multivariate or univariate time-series, resulting in the need to manually examine and correlate the DR results out of different data subsets. When the number of dimensions is large either in terms of the number of time points or attributes, this manual task becomes too tedious and infeasible. In this paper, we present MulTiDR, a new DR framework that enables processing of time-dependent multivariate data as a whole to provide a comprehensive overview of the data. With the framework, we employ DR in two steps. When treating the instances, time points, and attributes of the data as a 3D array, the first DR step reduces the three axes of the array to two, and the second DR step visualizes the data in a lower-dimensional space. In addition, by coupling with a contrastive learning method and interactive visualizations, our framework enhances analysts' ability to interpret DR results. We demonstrate the effectiveness of our framework with four case studies using real-world datasets.
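The two-step scheme can be mimicked on a toy 3D array: a first DR step collapses one axis to turn the (instances x time x attributes) array into a matrix, and a second step embeds instances in 2D. The axis choice, component counts, and the PCA + t-SNE pairing below are illustrative assumptions, not the paper's exact configuration.

```python
# Two-step DR sketch for an (instances x time points x attributes) array.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(9)
data = rng.normal(size=(80, 30, 6))            # instances x time x attributes

# Step 1: reduce the time axis with PCA, turning the 3D array into a matrix.
flat_time = data.transpose(0, 2, 1).reshape(-1, 30)   # (instances*attrs) x time
reduced_time = PCA(n_components=1).fit_transform(flat_time)
matrix = reduced_time.reshape(80, 6)           # instances x attributes

# Step 2: embed instances in 2D for plotting.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(matrix)
print("2D embedding shape:", embedding.shape)
```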
20
Neto MP, Paulovich FV. Explainable Matrix - Visualization for Global and Local Interpretability of Random Forest Classification Ensembles. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2021; 27:1427-1437. [PMID: 33048689 DOI: 10.1109/tvcg.2020.3030354]
Abstract
Over the past decades, classification models have proven to be essential machine learning tools given their potential and applicability in various domains. In those years, the goal of most researchers was to improve quantitative metrics, notwithstanding how little information about models' decisions such metrics convey. This paradigm has recently shifted, and strategies beyond tables and numbers to assist in interpreting models' decisions are increasing in importance. As part of this trend, visualization techniques have been extensively used to support classification model interpretability, with a significant focus on rule-based models. Despite the advances, the existing approaches present limitations in terms of visual scalability, and the visualization of large and complex models, such as those produced by the Random Forest (RF) technique, remains a challenge. In this paper, we propose Explainable Matrix (ExMatrix), a novel visualization method for RF interpretability that can handle models with massive quantities of rules. It employs a simple yet powerful matrix-like visual metaphor, where rows are rules, columns are features, and cells are rule predicates, enabling the analysis of entire models and the auditing of classification results. The applicability of ExMatrix is confirmed via different examples, showing how it can be used in practice to promote the interpretability of RF models.
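The raw material of an ExMatrix-style view, rules as rows and features as columns, can be extracted from a scikit-learn forest. The sketch below enumerates the root-to-leaf paths of one tree as interval predicates; the matrix assembly is a simplified illustration, not the paper's visualization code.

```python
# Enumerate one tree's decision rules as (feature -> interval) predicates,
# the rows/columns/cells of an ExMatrix-style matrix.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0).fit(X, y)
tree = forest.estimators_[0].tree_

rules = []
def walk(node, bounds):
    if tree.children_left[node] == -1:          # leaf: one rule per path
        rules.append(dict(bounds))
        return
    f, t = tree.feature[node], tree.threshold[node]
    lo, hi = bounds.get(f, (-np.inf, np.inf))
    walk(tree.children_left[node], {**bounds, f: (lo, min(hi, t))})   # x_f <= t
    walk(tree.children_right[node], {**bounds, f: (max(lo, t), hi)})  # x_f > t
walk(0, {})

for i, r in enumerate(rules):                   # rows = rules, cols = features
    row = ["          " if f not in r else f"{r[f][0]:5.1f}-{r[f][1]:4.1f}"
           for f in range(4)]
    print(f"rule {i:2d} | " + " | ".join(row))
```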
21
Boileau P, Hejazi NS, Dudoit S. Exploring high-dimensional biological data with sparse contrastive principal component analysis. Bioinformatics 2020; 36:3422-3430. [PMID: 32176249 DOI: 10.1093/bioinformatics/btaa176]
Abstract
MOTIVATION: Statistical analyses of high-throughput sequencing data have reshaped the biological sciences. In spite of myriad advances, recovering interpretable biological signal from data corrupted by technical noise remains a prevalent open problem. Several classes of procedures, among them classical dimensionality reduction techniques and others incorporating subject-matter knowledge, have provided effective advances. However, no procedure currently satisfies the dual objectives of recovering stable and relevant features simultaneously.
RESULTS: Inspired by recent proposals for making use of control data in the removal of unwanted variation, we propose a variant of principal component analysis (PCA), sparse contrastive PCA, that extracts sparse, stable, interpretable and relevant biological signal. The new methodology is compared to competing dimensionality reduction approaches through a simulation study and via analyses of several publicly available protein expression, microarray gene expression and single-cell transcriptome sequencing datasets.
AVAILABILITY AND IMPLEMENTATION: A free and open-source software implementation of the methodology, the scPCA R package, is made available via the Bioconductor Project. Code for all analyses presented in this article is also available via GitHub.
CONTACT: philippe_boileau@berkeley.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
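Both ingredients can be imitated numerically. In the sketch below, contrastive PCA eigendecomposes cov(target) - alpha * cov(background), and a soft-threshold on the loadings stands in for the penalized estimation the scPCA package actually performs; the data, alpha, and threshold are illustrative.

```python
# Contrastive PCA plus a soft-threshold sparsity step (a simplified stand-in
# for scPCA's penalized estimation). Data and parameters are illustrative.
import numpy as np

rng = np.random.default_rng(10)
background = rng.normal(size=(300, 30))                      # e.g., control data
target = rng.normal(size=(300, 30))
target[:, :3] += rng.normal(scale=3, size=(300, 3))          # wanted signal in 3 features

alpha = 1.5
C = np.cov(target, rowvar=False) - alpha * np.cov(background, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
v = eigvecs[:, -1]                                           # leading contrastive direction

thresh = 0.1
v_sparse = np.sign(v) * np.maximum(np.abs(v) - thresh, 0)    # soft-threshold loadings
print("nonzero loadings:", np.flatnonzero(v_sparse))
```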
Affiliation(s)
- Nima S Hejazi
- Graduate Group in Biostatistics
- Center for Computational Biology
- Sandrine Dudoit
- Center for Computational Biology
- Division of Epidemiology and Biostatistics, School of Public Health
- Department of Statistics, University of California, Berkeley, CA 94720, USA
22
Chatzimparmpas A, Martins RM, Kerren A. t-viSNE: Interactive Assessment and Interpretation of t-SNE Projections. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2020; 26:2696-2714. [PMID: 32305922 DOI: 10.1109/tvcg.2020.2986996]
Abstract
t-Distributed Stochastic Neighbor Embedding (t-SNE) for the visualization of multidimensional data has proven to be a popular approach, with successful applications in a wide range of domains. Despite their usefulness, t-SNE projections can be hard to interpret or even misleading, which hurts the trustworthiness of the results. Understanding the details of t-SNE itself and the reasons behind specific patterns in its output may be a daunting task, especially for non-experts in dimensionality reduction. In this article, we present t-viSNE, an interactive tool for the visual exploration of t-SNE projections that enables analysts to inspect different aspects of their accuracy and meaning, such as the effects of hyper-parameters, distance and neighborhood preservation, densities and costs of specific neighborhoods, and the correlations between dimensions and visual patterns. We propose a coherent, accessible, and well-integrated collection of different views for the visualization of t-SNE projections. The applicability and usability of t-viSNE are demonstrated through hypothetical usage scenarios with real data sets. Finally, we present the results of a user study where the tool's effectiveness was evaluated. By bringing to light information that would normally be lost after running t-SNE, we hope to support analysts in using t-SNE and making its results better understandable.
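One diagnostic of this kind, neighborhood preservation, is directly computable. The sketch below scores t-SNE embeddings of a standard dataset with scikit-learn's trustworthiness measure under two illustrative perplexity settings, the sort of hyper-parameter comparison the tool supports visually.

```python
# Quantify how well t-SNE preserves neighborhoods under different perplexities.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness

X, _ = load_digits(return_X_y=True)
for perplexity in (5, 30):
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
    score = trustworthiness(X, emb, n_neighbors=10)   # 1.0 = neighborhoods fully preserved
    print(f"perplexity={perplexity}: trustworthiness={score:.3f}")
```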