1
Xu Z, Lu D, Luo J, Zheng Y, Tong RKY. Separated collaborative learning for semi-supervised prostate segmentation with multi-site heterogeneous unlabeled MRI data. Med Image Anal 2024; 93:103095. PMID: 38310678. DOI: 10.1016/j.media.2024.103095.
Abstract
Segmenting prostate from magnetic resonance imaging (MRI) is a critical procedure in prostate cancer staging and treatment planning. Considering the nature of labeled data scarcity for medical images, semi-supervised learning (SSL) becomes an appealing solution since it can simultaneously exploit limited labeled data and a large amount of unlabeled data. However, SSL relies on the assumption that the unlabeled images are abundant, which may not be satisfied when the local institute has limited image collection capabilities. An intuitive solution is to seek support from other centers to enrich the unlabeled image pool. However, this further introduces data heterogeneity, which can impede SSL that works under identical data distribution with certain model assumptions. Aiming at this under-explored yet valuable scenario, in this work, we propose a separated collaborative learning (SCL) framework for semi-supervised prostate segmentation with multi-site unlabeled MRI data. Specifically, on top of the teacher-student framework, SCL exploits multi-site unlabeled data by: (i) Local learning, which advocates local distribution fitting, including the pseudo label learning that reinforces confirmation of low-entropy easy regions and the cyclic propagated real label learning that leverages class prototypes to regularize the distribution of intra-class features; (ii) External multi-site learning, which aims to robustly mine informative clues from external data, mainly including the local-support category mutual dependence learning, which takes the spirit that mutual information can effectively measure the amount of information shared by two variables even from different domains, and the stability learning under strong adversarial perturbations to enhance robustness to heterogeneity. 
Extensive experiments on prostate MRI data from six different clinical centers show that our method can effectively generalize SSL on multi-site unlabeled data and significantly outperform other semi-supervised segmentation methods. Besides, we validate the extensibility of our method on the multi-class cardiac MRI segmentation task with data from four different clinical centers.
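The pseudo-label step the abstract describes, reinforcing only "low-entropy easy regions," can be sketched as an entropy filter on the teacher model's softmax output. This is an illustrative reconstruction, not the paper's code; the function name and threshold value are assumptions.

```python
import numpy as np

def entropy_mask(probs, threshold=0.2):
    """Keep only pixels whose prediction entropy is below `threshold`.

    probs: (C, H, W) softmax output of the teacher model.
    Returns (pseudo_labels, mask), where mask marks low-entropy "easy" pixels
    that are safe to use as pseudo-label supervision for the student.
    """
    eps = 1e-8
    entropy = -np.sum(probs * np.log(probs + eps), axis=0)  # per-pixel entropy
    mask = entropy < threshold
    pseudo_labels = np.argmax(probs, axis=0)
    return pseudo_labels, mask

# Toy 2-class map with one confident pixel and one uncertain pixel.
probs = np.array([[[0.99, 0.55]],
                  [[0.01, 0.45]]])          # shape (C=2, H=1, W=2)
labels, mask = entropy_mask(probs, threshold=0.2)
print(labels)  # argmax labels for every pixel
print(mask)    # only the confident pixel passes the entropy filter
```

In the full framework this mask would gate the pseudo-label loss, so the student is never penalized on ambiguous boundary regions.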
Affiliation(s)
- Zhe Xu: Department of Biomedical Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong, China.
- Donghuan Lu: Tencent Jarvis Research Center, Youtu Lab, Shenzhen, China.
- Jie Luo: Massachusetts General Hospital, Harvard Medical School, Boston, USA.
- Yefeng Zheng: Tencent Jarvis Research Center, Youtu Lab, Shenzhen, China.
- Raymond Kai-Yu Tong: Department of Biomedical Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong, China.
2
Thwal CM, Nguyen MNH, Tun YL, Kim ST, Thai MT, Hong CS. OnDev-LCT: On-Device Lightweight Convolutional Transformers towards federated learning. Neural Netw 2024; 170:635-649. PMID: 38100846. DOI: 10.1016/j.neunet.2023.11.044.
Abstract
Federated learning (FL) has emerged as a promising approach to collaboratively train machine learning models across multiple edge devices while preserving privacy. The success of FL hinges on the efficiency of participating models and their ability to handle the unique challenges of distributed learning. While several variants of Vision Transformer (ViT) have shown great potential as alternatives to modern convolutional neural networks (CNNs) for centralized training, their unprecedented size and high computational demands hinder deployment on resource-constrained edge devices, limiting their widespread application in FL. Since client devices in FL typically have limited computing resources and communication bandwidth, models intended for such devices must strike a balance between model size, computational efficiency, and the ability to adapt to the diverse and non-IID data distributions encountered in FL. To address these challenges, we propose OnDev-LCT: Lightweight Convolutional Transformers for On-Device vision tasks with limited training data and resources. Our models incorporate image-specific inductive biases through the LCT tokenizer by leveraging efficient depthwise separable convolutions in residual linear bottleneck blocks to extract local features, while the multi-head self-attention (MHSA) mechanism in the LCT encoder implicitly facilitates capturing global representations of images. Extensive experiments on benchmark image datasets indicate that our models outperform existing lightweight vision models while having fewer parameters and lower computational demands, making them suitable for FL scenarios with data heterogeneity and communication bottlenecks.
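The parameter savings from the depthwise separable convolutions that the LCT tokenizer relies on can be verified with a back-of-the-envelope count. These are the standard textbook formulas, not code from the paper:

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return c_in * k * k + c_in * c_out

# Example layer: 128 -> 128 channels with a 3 x 3 kernel.
std = conv_params(128, 128, 3)                 # 128*128*9  = 147456
sep = depthwise_separable_params(128, 128, 3)  # 128*9 + 128*128 = 17536
print(std, sep, round(std / sep, 1))           # roughly 8x fewer weights
```

For a 3 x 3 kernel the savings factor approaches 9 as the channel count grows, which is exactly why the design suits resource-constrained FL clients.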
Affiliation(s)
- Chu Myaet Thwal: Department of Computer Science and Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do 17104, South Korea.
- Minh N H Nguyen: Vietnam - Korea University of Information and Communication Technology, Danang, Viet Nam.
- Ye Lin Tun: Department of Computer Science and Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do 17104, South Korea.
- Seong Tae Kim: Department of Computer Science and Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do 17104, South Korea.
- My T Thai: Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida 32611, USA.
- Choong Seon Hong: Department of Computer Science and Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do 17104, South Korea.
3
Tun YL, Nguyen MNH, Thwal CM, Choi J, Hong CS. Contrastive encoder pre-training-based clustered federated learning for heterogeneous data. Neural Netw 2023; 165:689-704. PMID: 37385023. DOI: 10.1016/j.neunet.2023.06.010.
Abstract
Federated learning (FL) is a promising approach that enables distributed clients to collaboratively train a global model while preserving their data privacy. However, FL often suffers from data heterogeneity problems, which can significantly affect its performance. To address this, clustered federated learning (CFL) has been proposed to construct personalized models for different client clusters. One effective client clustering strategy is to allow clients to choose their own local models from a model pool based on their performance. However, without pre-trained model parameters, such a strategy is prone to clustering failure, in which all clients choose the same model. Unfortunately, collecting a large amount of labeled data for pre-training can be costly and impractical in distributed environments. To overcome this challenge, we leverage self-supervised contrastive learning to exploit unlabeled data for the pre-training of FL systems. Together, self-supervised pre-training and client clustering can be crucial components for tackling the data heterogeneity issues of FL. Leveraging these two crucial strategies, we propose contrastive pre-training-based clustered federated learning (CP-CFL) to improve the model convergence and overall performance of FL systems. In this work, we demonstrate the effectiveness of CP-CFL through extensive experiments in heterogeneous FL settings, and present various interesting observations.
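The clustering strategy described, where each client chooses its own model from a pool based on local performance, can be sketched in a few lines. The collapse check mirrors the "clustering failure" mode the abstract warns about, in which every client picks the same model; names and shapes are illustrative.

```python
import numpy as np

def assign_clusters(client_losses):
    """Each client picks the pool model with the lowest local evaluation loss.

    client_losses: (n_clients, n_models) array of per-client losses.
    Returns the chosen model index per client, plus a flag for clustering
    failure (every client choosing the same model) -- the situation that
    contrastive pre-training of the pool models helps avoid.
    """
    choices = np.argmin(client_losses, axis=1)
    collapsed = len(set(choices.tolist())) == 1
    return choices, collapsed

# Three clients evaluating a pool of three candidate models.
losses = np.array([[0.9, 0.2, 0.8],
                   [0.3, 0.7, 0.6],
                   [0.8, 0.1, 0.9]])
choices, collapsed = assign_clusters(losses)
print(choices, collapsed)  # clients 0 and 2 form one cluster, client 1 another
```

In the full CP-CFL loop this assignment would be recomputed each round, with model aggregation performed within each cluster rather than globally.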
Affiliation(s)
- Ye Lin Tun: Department of Computer Science and Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do 17104, South Korea.
- Minh N H Nguyen: Vietnam - Korea University of Information and Communication Technology, Danang, Viet Nam.
- Chu Myaet Thwal: Department of Computer Science and Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do 17104, South Korea.
- Jinwoo Choi: Department of Computer Science and Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do 17104, South Korea.
- Choong Seon Hong: Department of Computer Science and Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do 17104, South Korea.
4
Reitemeyer F, Fritz D, Jacobi N, Díaz-Bone L, Mariño Viteri C, Kropp JP. Quantification of urban mitigation potentials - coping with data heterogeneity. Heliyon 2023; 9:e16733. PMID: 37303575. PMCID: PMC10250789. DOI: 10.1016/j.heliyon.2023.e16733.
Abstract
Cities are at the forefront of European and international climate action. However, in many cities, the ever-growing urban population is putting pressure on settlement and infrastructure development, increasing attention to urban planning, infrastructure and buildings. This paper introduces a set of quantification approaches, capturing impacts of urban planning measures in three fields of action: sustainable building, transport and redensification. The quantification approaches have been developed to account for different levels of data availability, thus providing users with quantification approaches that are applicable across cities. The mitigation potentials of various measures such as a modal shift, the substitution of building materials with wood, and different redensification scenarios were calculated. The substitution of conventional building materials with wood was found to have a high mitigation potential. Building construction, in combination with urban planning and design, is a key driver for mitigating climate change in cities. Given the data heterogeneity among cities, mixed quantification approaches could be defined and the measures and policy areas with the greatest climate mitigation potential identified.
Affiliation(s)
- Fabian Reitemeyer: Potsdam Institute for Climate Impact Research (PIK), Member of Leibniz Association, P.O. Box 601203, Potsdam, 14412, Germany.
- David Fritz: Environment Agency Austria, Spittelauer Lände 5, Vienna, 1090, Austria.
- Nikolai Jacobi: ICLEI European Secretariat, Leopoldring 3, Freiburg, 79098, Germany.
- León Díaz-Bone: ICLEI - Local Governments for Sustainability e.V., Kaiser-Friedrich-Str. 7, Bonn, 53113, Germany.
- Carla Mariño Viteri: ICLEI - Local Governments for Sustainability e.V., Kaiser-Friedrich-Str. 7, Bonn, 53113, Germany.
- Juergen P. Kropp: Potsdam Institute for Climate Impact Research (PIK), Member of Leibniz Association, P.O. Box 601203, Potsdam, 14412, Germany; Bauhaus Earth, Dortustraße 46, Potsdam, 14467, Germany; University of Potsdam, Institute of Environmental Science and Geography, Karl-Liebknecht-Str. 24-25, Potsdam, 14476, Germany.
5
Li J, Zhang W, Wang P, Li Q, Zhang K, Liu Y. Nonparametric prediction distribution from resolution-wise regression with heterogeneous data. J Bus Econ Stat 2022; 41:1157-1172. PMID: 38046827. PMCID: PMC10691808. DOI: 10.1080/07350015.2022.2115498.
Abstract
Modeling and inference for heterogeneous data have gained great interest recently due to rapid developments in personalized marketing. Most existing regression approaches are based on the conditional mean and may require additional cluster information to accommodate data heterogeneity. In this paper, we propose a novel nonparametric resolution-wise regression procedure to provide an estimated distribution of the response instead of one single value. We achieve this by decomposing the information of the response and the predictors into resolutions and patterns respectively based on marginal binary expansions. The relationships between resolutions and patterns are modeled by penalized logistic regressions. Combining the resolution-wise prediction, we deliver a histogram of the conditional response to approximate the distribution. Moreover, we show a sure independence screening property and the consistency of the proposed method for growing dimensions. Simulations and a real estate valuation dataset further illustrate the effectiveness of the proposed method.
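The marginal binary expansion underlying the resolution-wise decomposition can be illustrated for a response rescaled to (0, 1). This sketch follows the standard dyadic expansion, not the authors' implementation: bit d answers "is the value in the upper half at resolution d?", so the bits jointly locate the value in one of 2^depth equal-width cells, and each bit can then be modeled separately (in the paper, via penalized logistic regressions).

```python
import numpy as np

def binary_expansion(u, depth=3):
    """Marginal binary expansion of values in (0, 1) into `depth` resolution bits."""
    u = np.asarray(u, dtype=float)
    bits = []
    for _ in range(depth):
        u = 2 * u
        b = (u >= 1).astype(int)   # upper or lower half at this resolution?
        bits.append(b)
        u = u - b                  # keep the fractional remainder for the next bit
    return np.stack(bits, axis=-1)

# 0.3 falls in cell [0.25, 0.375) -> bits 0,1,0; 0.8 in [0.75, 0.875) -> 1,1,0.
print(binary_expansion([0.3, 0.8], depth=3))
```

Combining the per-bit predicted probabilities across resolutions recovers a histogram over the 2^depth cells, which is the paper's approximation to the conditional response distribution.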
Affiliation(s)
- Jialu Li: School of Mathematics and Statistics, Beijing Institute of Technology, Beijing 100081, China.
- Wan Zhang: Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
- Peiyao Wang: Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
- Qizhai Li: LSC, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, and University of Chinese Academy of Sciences, Beijing 100190, China.
- Kai Zhang: Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
- Yufeng Liu: Department of Statistics and Operations Research, Department of Genetics, Department of Biostatistics, Carolina Center for Genome Science, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
6
Qu L, Balachandar N, Zhang M, Rubin D. Handling data heterogeneity with generative replay in collaborative learning for medical imaging. Med Image Anal 2022; 78:102424. PMID: 35390737. DOI: 10.1016/j.media.2022.102424.
Abstract
Collaborative learning, which enables collaborative and decentralized training of deep neural networks at multiple institutions in a privacy-preserving manner, is rapidly emerging as a valuable technique in healthcare applications. However, its distributed nature often leads to significant heterogeneity in data distributions across institutions. In this paper, we present a novel generative replay strategy to address the challenge of data heterogeneity in collaborative learning methods. Different from traditional methods that directly aggregate the model parameters, we leverage generative adversarial learning to aggregate the knowledge from all the local institutions. Specifically, instead of directly training a model for task performance, we develop a novel dual model architecture: a primary model learns the desired task, and an auxiliary "generative replay model" allows aggregating knowledge from the heterogeneous clients. The auxiliary model is then broadcast to the central server to regulate the training of the primary model with an unbiased target distribution. Experimental results demonstrate the capability of the proposed method in handling heterogeneous data across institutions. On highly heterogeneous data partitions, our model achieves ∼4.88% improvement in the prediction accuracy on a diabetic retinopathy classification dataset, and ∼49.8% reduction in mean absolute error on a Bone Age prediction dataset, respectively, compared to state-of-the-art collaborative learning methods.
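The core replay idea, training the primary model on an unbiased mixture sampled from every client's generator rather than on any one client's skewed data, can be sketched with toy Gaussian "generators" standing in for the trained generative replay models. All names and distributions here are illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the clients' trained replay generators: each client's
# local data comes from a different (heterogeneous) distribution.
client_generators = [
    lambda n: rng.normal(-2.0, 1.0, size=n),   # client A's distribution
    lambda n: rng.normal(3.0, 0.5, size=n),    # client B's distribution
]

def replay_batch(generators, n_per_client=1000):
    """Server-side training batch: sample equally from every client's
    generator so the primary model sees an unbiased mixture instead of
    one institution's skewed data."""
    return np.concatenate([g(n_per_client) for g in generators])

batch = replay_batch(client_generators)
print(batch.shape, batch.mean())  # mixture mean near the midpoint, 0.5
```

The equal per-client sampling is what gives the "unbiased target distribution" the abstract refers to; directly averaging models trained on such disparate distributions would not.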
7
Chen D, Hosner PA, Dittmann DL, O'Neill JP, Birks SM, Braun EL, Kimball RT. Divergence time estimation of Galliformes based on the best gene shopping scheme of ultraconserved elements. BMC Ecol Evol 2021; 21:209. PMID: 34809586. PMCID: PMC8609756. DOI: 10.1186/s12862-021-01935-1.
Abstract
BACKGROUND Divergence time estimation is fundamental to understanding many aspects of the evolution of organisms, such as character evolution, diversification, and biogeography. With the development of sequencing technology, improved analytical methods, and knowledge of fossils for calibration, it is possible to obtain robust molecular dating results. However, while phylogenomic datasets show great promise in phylogenetic estimation, the best ways to leverage the large amounts of data for divergence time estimation have not been well explored. A potential solution is to focus on a subset of data for divergence time estimation, which can significantly reduce the computational burdens and avoid problems with data heterogeneity that may bias results. RESULTS In this study, we obtained thousands of ultraconserved elements (UCEs) from 130 extant galliform taxa, including representatives of all genera, to determine the divergence times throughout galliform history. We tested the effects of different "gene shopping" schemes on divergence time estimation using a carefully vetted and previously validated set of fossils. Our results indicate that commonly used clock-like schemes may not be suitable for UCE dating (or other data types) where some loci have little information. We suggest that partitioning (e.g., with PartitionFinder) and selecting tree-like partitions may be good strategies to select a subset of data for divergence time estimation from UCEs. Our galliform time tree is largely consistent with other molecular clock studies of mitochondrial and nuclear loci. With our increased taxon sampling, a well-resolved topology, carefully vetted fossil calibrations, and suitable molecular dating methods, we obtained a high-quality galliform time tree.
CONCLUSIONS We provide a robust galliform backbone time tree that can be combined with more fossil records to further facilitate our understanding of the evolution of Galliformes and can be used as a resource for comparative and biogeographic studies in this group.
Affiliation(s)
- De Chen: MOE Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China; Department of Biology, University of Florida, Gainesville, FL, USA.
- Peter A Hosner: Department of Biology, University of Florida, Gainesville, FL, USA; Natural History Museum of Denmark and Center for Global Mountain Biodiversity, University of Copenhagen, Copenhagen, Denmark.
- Donna L Dittmann: Museum of Natural Science, Louisiana State University, Baton Rouge, LA, USA.
- John P O'Neill: Museum of Natural Science, Louisiana State University, Baton Rouge, LA, USA.
- Sharon M Birks: Burke Museum of Natural History and Culture, University of Washington, Seattle, WA, USA.
- Edward L Braun: Department of Biology, University of Florida, Gainesville, FL, USA.
8
Wang T, Chen R, Liu W, Yu M. Structure-preserving integrated analysis for risk stratification with application to cancer staging. Biostatistics 2021; 23:990-1006. PMID: 33738474. DOI: 10.1093/biostatistics/kxab005.
Abstract
To provide an appropriate and practical level of health care, it is critical to group patients into relatively few strata that have distinct prognosis. Such grouping or stratification is typically based on well-established risk factors and clinical outcomes. A well-known example is the American Joint Committee on Cancer staging for cancer that uses tumor size, node involvement, and metastasis status. We consider a statistical method for such grouping based on individual patient data from multiple studies. The method encourages a common grouping structure as a basis for borrowing information, but acknowledges data heterogeneity including unbalanced data structures across multiple studies. We build on the "lasso-tree" method, which is more versatile than the well-known classification and regression tree method in generating possible grouping patterns. In addition, the parametrization of the lasso-tree method makes it very natural to incorporate the underlying order information in the risk factors. In this article, we also strengthen the lasso-tree method by establishing its theoretical properties, which Lin and others (2013. Lasso tree for cancer staging with survival data. Biostatistics 14, 327-339) did not pursue. We evaluate our method in extensive simulation studies and an analysis of multiple breast cancer data sets.
Affiliation(s)
- Tianjie Wang: Department of Statistics, University of Wisconsin, Madison, WI, USA.
- Rui Chen: Department of Statistics, University of Wisconsin, Madison, WI, USA.
- Wenshuo Liu: Department of Research & Innovation, Interactions LLC, 31 Hayward Street Suite E, Franklin, MA 02038, USA.
- Menggang Yu: Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA.
9
Zhang Y, Bernau C, Parmigiani G, Waldron L. The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models. Biostatistics 2020; 21:253-268. PMID: 30202918. DOI: 10.1093/biostatistics/kxy044.
Abstract
Cross-study validation (CSV) of prediction models is an alternative to traditional cross-validation (CV) in domains where multiple comparable datasets are available. Although many studies have noted potential sources of heterogeneity in genomic studies, to our knowledge none have systematically investigated their intertwined impacts on prediction accuracy across studies. We employ a hybrid parametric/non-parametric bootstrap method to realistically simulate publicly available compendia of microarray, RNA-seq, and whole metagenome shotgun microbiome studies of health outcomes. Three types of heterogeneity between studies are manipulated and studied: (i) imbalances in the prevalence of clinical and pathological covariates, (ii) differences in gene covariance that could be caused by batch, platform, or tumor purity effects, and (iii) differences in the "true" model that associates gene expression and clinical factors to outcome. We assess model accuracy, while altering these factors. Lower accuracy is seen in CSV than in CV. Surprisingly, heterogeneity in known clinical covariates and differences in gene covariance structure have very limited contributions in the loss of accuracy when validating in new studies. However, forcing identical generative models greatly reduces the within/across study difference. These results, observed consistently for multiple disease outcomes and omics platforms, suggest that the most easily identifiable sources of study heterogeneity are not necessarily the primary ones that undermine the ability to accurately replicate the accuracy of omics prediction models in new studies. Unidentified heterogeneity, such as could arise from unmeasured confounding, may be more important.
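Cross-study validation as described, training on one study and scoring on every other, reduces to filling a study-by-study accuracy matrix whose off-diagonal entries reveal the loss of accuracy in new studies. The sketch below uses a deliberately trivial threshold learner and a simulated batch-effect shift; everything here is illustrative, not the authors' pipeline.

```python
import numpy as np

def csv_matrix(studies, fit, score):
    """Cross-study validation: train on study i, evaluate on study j.

    Diagonal entries mimic within-study performance; off-diagonal entries
    show the accuracy drop when validating in a different study.
    """
    k = len(studies)
    acc = np.zeros((k, k))
    for i, (Xi, yi) in enumerate(studies):
        model = fit(Xi, yi)
        for j, (Xj, yj) in enumerate(studies):
            acc[i, j] = score(model, Xj, yj)
    return acc

# Minimal stand-in learner: a threshold midway between the class means.
def fit(X, y):
    return 0.5 * X[y == 1].mean() + 0.5 * X[y == 0].mean()

def score(thr, X, y):
    return float(((X > thr).astype(int) == y).mean())

rng = np.random.default_rng(1)
def make_study(shift):
    y = rng.integers(0, 2, 200)
    X = y * 2.0 + shift + rng.normal(0, 0.5, 200)   # shift = batch effect
    return X, y

studies = [make_study(0.0), make_study(1.5)]
acc = csv_matrix(studies, fit, score)
print(acc)   # off-diagonal entries drop well below the diagonal
```

Even this toy shows the paper's headline pattern: a shift in the generative model between studies degrades transferred accuracy while leaving within-study accuracy intact.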
Affiliation(s)
- Yuqing Zhang: Graduate Program in Bioinformatics, Boston University, 24 Cummington Mall, Boston, MA, USA.
- Christoph Bernau: Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, Germany.
- Giovanni Parmigiani: Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 3 Blackfan Cir, Boston, MA, USA; Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Ave, Boston, MA, USA.
- Levi Waldron: Graduate School of Public Health and Health Policy, Institute for Implementation Science in Population Health, City University of New York, 55 W 125th St, New York, NY, USA.
10
Wang K, Zhao S, Jackson E. Investigating exposure measures and functional forms in urban and suburban intersection safety performance functions using generalized negative binomial-P model. Accid Anal Prev 2020; 148:105838. PMID: 33125923. DOI: 10.1016/j.aap.2020.105838.
Abstract
Selecting an appropriate exposure measure and functional form for Safety Performance Functions (SPFs) is critical in precisely predicting crash counts by different crash types for intersections. This study proposes a new approach, namely the Generalized Negative Binomial-P (GNB-P) model, to model the complex relationship between crashes and different exposure measures by crash type for intersections, which helps not only identify the most reliable exposure measure for intersection SPFs, but also explore the most appropriate functional form of the NB models. To this end, three types of SPF functional forms, namely the Power function, Hoerl function 1 and Hoerl function 2, with different exposure measures including major road AADT, minor road AADT and total AADT, were estimated by crash type for stop-controlled and two types of signalized intersections. The over-dispersion of the SPF models was estimated using the exposure measures to account for crash data variation across different intersections. The SPF estimation results highlighted that the mean-variance structure of NB models is not consistent and varies by crash data. The over-dispersion of SPFs by crash type is not constant and varies across different intersections. The minor road AADT is shown to be positively correlated with the over-dispersion of SPFs in estimating crash counts for Same-Direction Crashes (SDC), Intersecting-Direction Crashes (IDC) and Single-Vehicle Crashes (SVC). Estimating the over-dispersion using exposure measures results in more reliable SPF results. Furthermore, it is found that the Power function with major road and minor road AADT as the exposure measure performs the best in estimating SPFs for Opposite-Direction Crashes (ODC). The Hoerl function 2 with total AADT and the proportion of minor road AADT over the total as the exposure measure performs the best in estimating SVC SPFs for intersections. The Hoerl function 1 with major road and minor road AADT as the exposure measure is more accurate in estimating SPFs for both SDC and IDC.
Affiliation(s)
- Kai Wang: Connecticut Transportation Safety Research Center, Connecticut Transportation Institute, University of Connecticut, 270 Middle Turnpike, Unit 5202, Storrs, CT 06269-5202, USA.
- Shanshan Zhao: Connecticut Transportation Safety Research Center, Connecticut Transportation Institute, University of Connecticut, 270 Middle Turnpike, Unit 5202, Storrs, CT 06269-5202, USA.
- Eric Jackson: Connecticut Transportation Safety Research Center, Connecticut Transportation Institute, University of Connecticut, 270 Middle Turnpike, Unit 5202, Storrs, CT 06269-5202, USA.
11
Wang K, Zhao S, Jackson E. Functional forms of the negative binomial models in safety performance functions for rural two-lane intersections. Accid Anal Prev 2019; 124:193-201. PMID: 30665054. DOI: 10.1016/j.aap.2019.01.015.
Abstract
Safety Performance Functions (SPFs) play a prominent role in estimating intersection crashes, and identifying the sites with the highest potential for safety improvement. To maximize the crash prediction accuracy, this paper describes the application of different functional forms of the Negative Binomial (NB) models (i.e. NB-1, NB-2 and NB-P) in estimating safety performance functions by crash type for three types of rural two-lane intersections, including three-leg stop-controlled (3ST) intersections, four-leg stop-controlled (4ST) intersections and four-leg signalized (4SG) intersections. Crash types were aggregated into same-direction, opposite-direction, intersecting-direction and single-vehicle crashes. Major and minor road Annual Average Daily Traffic (AADT) were used as predictors in the SPF estimation. In addition, major and minor road AADT were also used as predictors in the estimation of the over-dispersion parameter of the NB models to account for the crash data heterogeneity. In the end, all NB models were compared based on both the model estimation goodness-of-fit and the prediction performance. The model goodness-of-fit indicates that the NB-P model outperforms the NB-1 and NB-2 models for most crash types and intersection types, by providing a flexible variance structure to the NB approaches. The parameterization of the over-dispersion factor verifies that the over-dispersion parameter of the NB models highly depends on how the variance structure is defined in the model, and the over-dispersion parameter is shown to vary among different intersections for each crash type and can be estimated using both the major and minor road AADT at rural two-lane intersections. The NB-P model is found to more effectively capture the variation of over-dispersion among intersections in NB models, which benefits the accommodation of data heterogeneity in intersection SPF development. The prediction performance comparison illustrates that the NB-P model slightly improves the crash prediction accuracy compared with the other two models, especially for the 3ST and 4SG intersections. In conclusion, the NB-P model with parameterized over-dispersion factor is recommended to provide more unbiased parameter estimates when estimating SPFs by crash type for rural two-lane intersections.
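The NB-1/NB-2/NB-P distinction that both of these intersection-safety articles turn on is simply the exponent in the mean-variance relationship, Var(y) = mu + alpha * mu^p. A minimal illustration of the standard formulation (not the papers' estimation code) makes the flexibility concrete:

```python
def nb_variance(mu, alpha, p):
    """Variance of a Negative Binomial-P model: Var(y) = mu + alpha * mu**p.

    p = 1 gives NB-1 (over-dispersion linear in the mean), p = 2 gives NB-2
    (the usual quadratic form); NB-P estimates p from the data, which is
    what lets the mean-variance structure adapt to each crash type.
    """
    return mu + alpha * mu ** p

mu, alpha = 4.0, 0.5
print(nb_variance(mu, alpha, 1))    # NB-1: 4 + 0.5 * 4
print(nb_variance(mu, alpha, 2))    # NB-2: 4 + 0.5 * 16
print(nb_variance(mu, alpha, 1.5))  # NB-P with an intermediate p
```

Because the NB-1 and NB-2 variances can differ substantially at the same mean, fixing p in advance bakes in an assumption about crash-count dispersion that NB-P instead lets the data decide.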
Affiliation(s)
- Kai Wang: Connecticut Transportation Safety Research Center, Connecticut Transportation Institute, University of Connecticut, 270 Middle Turnpike, Unit 5202, Storrs, CT 06269-5202, USA.
- Shanshan Zhao: Connecticut Transportation Safety Research Center, Connecticut Transportation Institute, University of Connecticut, 270 Middle Turnpike, Unit 5202, Storrs, CT 06269-5202, USA.
- Eric Jackson: Connecticut Transportation Safety Research Center, Connecticut Transportation Institute, University of Connecticut, 270 Middle Turnpike, Unit 5202, Storrs, CT 06269-5202, USA.
|
12
|
Cui L, Zeng N, Kim M, Mueller R, Hankosky ER, Redline S, Zhang GQ. X-search: an open access interface for cross-cohort exploration of the National Sleep Research Resource. BMC Med Inform Decis Mak 2018; 18:99. [PMID: 30424756 PMCID: PMC6234631 DOI: 10.1186/s12911-018-0682-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2017] [Accepted: 10/18/2018] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND The National Sleep Research Resource (NSRR) is a large-scale, openly shared data repository of de-identified, highly curated clinical sleep data from multiple NIH-funded epidemiological studies. Although many data repositories allow users to browse their content, few support fine-grained, cross-cohort query and exploration at the study-subject level. We introduce a cross-cohort query and exploration system, called X-search, to enable researchers to query patient cohort counts across a growing number of completed, NIH-funded studies in NSRR and to explore the feasibility of reusing the data for research studies. METHODS X-search is designed as a general framework with two loosely coupled components: a semantically annotated data repository and a cross-cohort exploration engine. The semantically annotated data repository comprises a canonical data dictionary, data sources each with their own data dictionary, and mappings between each individual data dictionary and the canonical one. The cross-cohort exploration engine consists of five modules: query builder, graphical exploration, case-control exploration, query translation, and query execution. The canonical data dictionary serves as the unified metadata that drives the visual exploration interfaces and facilitates query translation through the mappings. RESULTS X-search is publicly available at https://www.x-search.net/ with nine NSRR datasets consisting of over 26,000 unique subjects. The canonical data dictionary contains over 900 common data elements across the datasets. X-search has received over 1800 cross-cohort queries from users in 16 countries. CONCLUSIONS X-search provides a powerful cross-cohort exploration interface for querying and exploring heterogeneous datasets in the NSRR data repository, enabling researchers to evaluate the feasibility of potential research studies and generate hypotheses using the NSRR data.
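The mapping-driven query translation described above can be illustrated with a toy sketch. This is not the X-search implementation; the cohort names, variable names, and dictionary structure below are all hypothetical, chosen only to show how a canonical data element resolves to a cohort-specific variable name through the mappings.

```python
# Illustrative sketch (not X-search's actual code): a canonical data element
# is translated into each cohort's local variable name via per-cohort
# mappings, mirroring the canonical-dictionary-driven query translation.

CANONICAL_TO_LOCAL = {  # hypothetical mappings for two hypothetical cohorts
    "cohort_a": {"bmi": "body_mass_index", "ahi": "apnea_hypopnea_idx"},
    "cohort_b": {"bmi": "bmi_v2", "ahi": "ahi_total"},
}

def translate(canonical_term: str, cohort: str):
    """Map a canonical data element to a cohort-specific variable name."""
    return CANONICAL_TO_LOCAL[cohort].get(canonical_term)

assert translate("bmi", "cohort_a") == "body_mass_index"
assert translate("ahi", "cohort_b") == "ahi_total"
```

A single canonical query can then be fanned out to every cohort whose mapping covers the requested elements, which is what makes cross-cohort counts possible.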
Affiliation(s)
- Licong Cui
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
- Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, USA
- Ningzhou Zeng
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
- Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, USA
- Matthew Kim
- Brigham and Women’s Hospital, Boston, MA, USA
- Harvard Medical School, Harvard University, Boston, MA, USA
- Remo Mueller
- Brigham and Women’s Hospital, Boston, MA, USA
- Harvard Medical School, Harvard University, Boston, MA, USA
- Emily R. Hankosky
- Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, USA
- Susan Redline
- Brigham and Women’s Hospital, Boston, MA, USA
- Harvard Medical School, Harvard University, Boston, MA, USA
- Guo-Qiang Zhang
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
- Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, USA
|
13
|
Kumar K, Cava F. Principal coordinate analysis assisted chromatographic analysis of bacterial cell wall collection: A robust classification approach. Anal Biochem 2018; 550:8-14. [PMID: 29649471 DOI: 10.1016/j.ab.2018.04.008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2018] [Revised: 03/28/2018] [Accepted: 04/08/2018] [Indexed: 11/20/2022]
Abstract
In the present work, principal coordinate analysis (PCoA) is introduced to develop a robust model for classifying chromatographic data sets of peptidoglycan samples. PCoA captures the heterogeneity present in the data sets by using the dissimilarity matrix as input. Thus, in principle, it can capture even subtle differences in bacterial peptidoglycan composition and provide a more robust and fast approach for classifying bacterial collections and identifying novel cell wall targets for further biological and clinical studies. The utility of the proposed approach is successfully demonstrated by analysing two different kinds of bacterial collections. The first set comprised peptidoglycan samples belonging to different subclasses of Alphaproteobacteria, whereas the second set, which is more intricate for chemometric analysis, consisted of wild-type Vibrio cholerae and mutants with subtle differences in their peptidoglycan composition. The present work proposes a useful approach for classifying chromatographic data sets of peptidoglycan samples with subtle differences, and suggests that PCoA can be a method of choice in any data analysis workflow.
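PCoA (classical multidimensional scaling) works exactly as the abstract says: it takes a dissimilarity matrix as input and recovers sample coordinates from it. A minimal numpy sketch of the standard algorithm (double-centering followed by eigendecomposition; this is the textbook method, not the authors' specific pipeline):

```python
import numpy as np

def pcoa(D, k=2):
    """Classical PCoA: embed n samples from an n-by-n dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]               # largest eigenvalues first
    vals, vecs = vals[order][:k], vecs[:, order][:, :k]
    return vecs * np.sqrt(np.maximum(vals, 0.0)) # principal coordinates

# Three samples whose dissimilarities come from points 0, 1, 3 on a line:
D = np.array([[0., 1., 3.],
              [1., 0., 2.],
              [3., 2., 0.]])
X = pcoa(D, k=1)
# The 1-D embedding reproduces the input dissimilarities exactly.
assert np.allclose(abs(X[0, 0] - X[2, 0]), 3.0)
```

For Euclidean dissimilarities the embedding is exact; for non-Euclidean chromatographic dissimilarities, small negative eigenvalues appear and are clamped to zero, which is why PCoA degrades gracefully on heterogeneous data.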
|
14
|
Shi Q, Zhang C, Guo W, Zeng T, Lu L, Jiang Z, Wang Z, Liu J, Chen L. Local network component analysis for quantifying transcription factor activities. Methods 2017; 124:25-35. [PMID: 28710010 DOI: 10.1016/j.ymeth.2017.06.018] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2017] [Revised: 05/02/2017] [Accepted: 06/17/2017] [Indexed: 12/16/2022] Open
Abstract
Transcription factors (TFs) can regulate physiological transitions or determine stable phenotypic diversity. Accurate estimation of TF regulatory signals or functional activities is of great significance for guiding biological experiments and elucidating molecular mechanisms, but remains challenging. Traditional methods identify TF regulatory signals at the population level, which masks heterogeneous regulation mechanisms in individuals or subgroups and thus yields inaccurate analyses. Here, we propose a novel computational framework, local network component analysis (LNCA), to exploit data heterogeneity and automatically quantify accurate transcription factor activity (TFA) by integrating partitioned expression sets (i.e., local information) with prior TF-gene regulatory knowledge. Specifically, LNCA adopts an adaptive optimization strategy, which evaluates the local similarities of regulation controls and corrects biases during data integration, to construct the TFA landscape. We first numerically demonstrate the effectiveness of LNCA on simulated data sets, compared with traditional methods such as FastNCA, ROBNCA and NINCA. We then apply our model to two real data sets with implicit temporal or spatial regulation variations. The results show that LNCA not only recognizes the periodic mode along the S. cerevisiae cell cycle process, but also substantially outperforms the other methods in terms of accuracy and consistency. In addition, a cross-validation study on glioblastoma multiforme (GBM) indicates that the TFAs identified by LNCA distinguish clinically distinct tumor groups better than the expression values of the corresponding TFs, opening a new way to classify tumor subtypes and providing novel insight into cancer heterogeneity.
AVAILABILITY LNCA was implemented as a Matlab package, which is available at http://sysbio.sibcb.ac.cn/cb/chenlab/software.htm/LNCApackage_0.1.rar.
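The network component analysis idea underlying LNCA and its baselines (FastNCA, ROBNCA, NINCA) can be sketched in a few lines. This is the generic noise-free NCA setup, not the authors' LNCA code: given a known TF-gene connectivity matrix A, expression E ≈ A · S, and TF activities S are recovered by least squares. The matrices below are synthetic.

```python
import numpy as np

# Hedged sketch of the core NCA step (not the LNCA package): with known
# TF-gene connectivity A, recover TF activities S from expression E = A @ S.

A = np.array([[1.0, 0.0],    # gene 1 regulated by TF1 only
              [0.5, 1.0],    # gene 2 regulated by both TFs
              [0.0, 2.0]])   # gene 3 regulated by TF2 only

rng = np.random.default_rng(0)
S_true = rng.normal(size=(2, 5))   # 2 TF activities across 5 samples
E = A @ S_true                     # noise-free synthetic expression

# Least-squares estimate of the TF activity matrix.
S_est, *_ = np.linalg.lstsq(A, E, rcond=None)
assert np.allclose(S_est, S_true)
```

LNCA's contribution, per the abstract, is to run this kind of decomposition on partitioned ("local") expression sets and adaptively integrate the results, rather than fitting one population-level model.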
|
15
|
Qian P, Zhao K, Jiang Y, Su KH, Deng Z, Wang S, Muzic RF Jr. Knowledge-leveraged transfer fuzzy C-Means for texture image segmentation with self-adaptive cluster prototype matching. Knowl Based Syst 2017; 130:33-50. [PMID: 30050232 DOI: 10.1016/j.knosys.2017.05.018] [Citation(s) in RCA: 70] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
We study a novel fuzzy clustering method that improves segmentation performance on a target texture image by leveraging knowledge from a prior texture image. Two knowledge transfer mechanisms, knowledge-leveraged prototype transfer (KL-PT) and knowledge-leveraged prototype matching (KL-PM), are first introduced as the bases. Applying them, the knowledge-leveraged transfer fuzzy C-means (KL-TFCM) method and its three-stage interlinked framework, comprising knowledge extraction, knowledge matching, and knowledge utilization, are developed. There are two specific versions, KL-TFCM-c and KL-TFCM-f, i.e. the crisp and flexible forms, which use the strategies of maximum matching degree and weighted sum, respectively. The significance of our work is fourfold: 1) owing to the adjustable degree of reference between the source and target domains, KL-PT can appropriately learn insightful knowledge, i.e. the cluster prototypes, from the source domain; 2) KL-PM can self-adaptively determine reasonable pairwise relationships between cluster prototypes in the source and target domains, even if the numbers of clusters differ in the two domains; 3) the joint action of KL-PM and KL-PT effectively resolves the data inconsistency and heterogeneity between the source and target domains, e.g. data distribution diversity and cluster number difference; thus, through the three-stage knowledge transfer, beneficial knowledge from the source domain can be extensively and self-adaptively leveraged in the target domain, and as evidence of this, both KL-TFCM-c and KL-TFCM-f surpass many existing clustering methods in texture image segmentation; and 4) when the source and target domains have different cluster numbers, KL-TFCM-f achieves higher clustering effectiveness and segmentation performance than KL-TFCM-c.
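The prototype-matching step can be illustrated with a toy sketch. This is an assumption-laden simplification, not the paper's KL-PM formulation: here "matching degree" is reduced to plain Euclidean distance, and each target-domain cluster prototype is paired with its nearest source-domain prototype, which works even when the two domains have different numbers of clusters.

```python
import numpy as np

# Illustrative nearest-prototype matching in the spirit of KL-PM (the
# distance-based matching rule here is a simplifying assumption, not the
# paper's exact matching-degree definition).

def match_prototypes(source, target):
    """For each target prototype, return the index of the nearest source one."""
    d = np.linalg.norm(target[:, None, :] - source[None, :, :], axis=2)
    return d.argmin(axis=1)

src = np.array([[0.0, 0.0], [10.0, 10.0], [5.0, 0.0]])  # 3 source prototypes
tgt = np.array([[9.0, 9.5], [0.5, -0.2]])               # 2 target prototypes
assert match_prototypes(src, tgt).tolist() == [1, 0]
```

In the transfer step, the matched source prototypes would then regularize the target clustering, either crisply (best match only, as in KL-TFCM-c) or as a weighted sum over matches (as in KL-TFCM-f).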
|
16
|
Abraham A, Milham MP, Di Martino A, Craddock RC, Samaras D, Thirion B, Varoquaux G. Deriving reproducible biomarkers from multi-site resting-state data: An Autism-based example. Neuroimage 2016; 147:736-745. [PMID: 27865923 DOI: 10.1016/j.neuroimage.2016.10.045] [Citation(s) in RCA: 292] [Impact Index Per Article: 36.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2016] [Revised: 10/16/2016] [Accepted: 10/21/2016] [Indexed: 12/30/2022] Open
Abstract
Resting-state functional Magnetic Resonance Imaging (R-fMRI) holds the promise of revealing functional biomarkers of neuropsychiatric disorders. However, extracting such biomarkers is challenging for complex, multi-faceted neuropathologies such as autism spectrum disorders. Large multi-site datasets increase sample sizes to compensate for this complexity, at the cost of uncontrolled heterogeneity. This heterogeneity raises new challenges, akin to those faced in realistic diagnostic applications. Here, we demonstrate the feasibility of inter-site classification of neuropsychiatric status, with an application to the Autism Brain Imaging Data Exchange (ABIDE) database, a large (N=871) multi-site autism dataset. For this purpose, we investigate pipelines that extract the most predictive biomarkers from the data. These R-fMRI pipelines build participant-specific connectomes from functionally defined brain areas. Connectomes are then compared across participants to learn patterns of connectivity that differentiate typical controls from individuals with autism. We predict this neuropsychiatric status for participants from the same acquisition sites or from different, unseen ones. Good choices of methods for the various steps of the pipeline lead to 67% prediction accuracy on the full ABIDE data, which is significantly better than previously reported results. We perform extensive validation on multiple subsets of the data defined by different inclusion criteria, which enables detailed analysis of the factors contributing to successful connectome-based prediction. First, prediction accuracy improves as more subjects are included, up to the maximum number available. Second, the definition of functional brain areas is of paramount importance for biomarker discovery: brain areas extracted from large R-fMRI datasets outperform reference atlases in the classification tasks.
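The connectome-building step described above can be sketched in numpy. This is a generic featurization under assumed details (correlation connectivity, upper-triangle vectorization), not the authors' full pipeline: per-region time series are correlated, and the unique off-diagonal entries become one participant's feature vector for a downstream classifier.

```python
import numpy as np

# Sketch of connectome featurization (assumed simplification of the
# pipeline): correlate region time series, keep the upper triangle.

def connectome_features(ts):
    """ts: (timepoints, regions) array -> vector of unique edge correlations."""
    corr = np.corrcoef(ts.T)                 # regions x regions connectivity
    iu = np.triu_indices_from(corr, k=1)     # drop diagonal and duplicates
    return corr[iu]

rng = np.random.default_rng(42)
ts = rng.normal(size=(120, 4))               # 120 timepoints, 4 brain areas
feats = connectome_features(ts)
assert feats.shape == (6,)                   # 4 * (4 - 1) / 2 unique edges
```

Stacking one such vector per participant yields the samples-by-edges matrix on which a linear classifier can learn connectivity patterns that separate the diagnostic groups.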
Affiliation(s)
- Alexandre Abraham
- Parietal Team, INRIA Saclay Île-de-France, Saclay, France; CEA, Neurospin bât 145, 91191 Gif-sur-Yvette, France.
- Michael P Milham
- Center for the Developing Brain, Child Mind Institute, New York, USA; Center for Biomedical Imaging and Neuromodulation, Nathan S. Kline Institute for Psychiatric Research, Orangeburg, NY, USA
- R Cameron Craddock
- Center for the Developing Brain, Child Mind Institute, New York, USA; Center for Biomedical Imaging and Neuromodulation, Nathan S. Kline Institute for Psychiatric Research, Orangeburg, NY, USA
- Dimitris Samaras
- Stony Brook University, NY 11794, USA; Ecole Centrale, 92290 Châtenay-Malabry, France
- Bertrand Thirion
- Parietal Team, INRIA Saclay Île-de-France, Saclay, France; CEA, Neurospin bât 145, 91191 Gif-sur-Yvette, France
- Gael Varoquaux
- Parietal Team, INRIA Saclay Île-de-France, Saclay, France; CEA, Neurospin bât 145, 91191 Gif-sur-Yvette, France
|