1
Yuan T, Li Z, Liu B, Tang Y, Liu Y. ARPruning: An automatic channel pruning based on attention map ranking. Neural Netw 2024; 174:106220. [PMID: 38447427] [DOI: 10.1016/j.neunet.2024.106220]
Abstract
Structured pruning is a representative model compression technique for convolutional neural networks (CNNs) that removes less important filters or channels. Most recent structured pruning methods establish criteria to measure filter importance, mainly based on the magnitude of weights or other parameters in CNNs. However, these criteria lack explainability, and relying solely on the numerical values of network parameters is insufficient to assess the relationship between a channel and model performance. Moreover, directly applying these criteria for global pruning may lead to suboptimal solutions; search algorithms are therefore needed to determine the pruning ratio for each layer. To address these issues, we propose ARPruning (Attention-map-based Ranking Pruning), which constructs a new intra-layer channel importance criterion and develops a new local neighborhood search algorithm for determining the optimal inter-layer pruning ratio. To measure the relationship between a channel to be pruned and model performance, we build the intra-layer channel importance criterion from the attention map of each layer. We then propose an automatic pruning strategy search method that finds the optimal solution effectively and efficiently. By integrating the well-designed pruning criterion and search strategy, ARPruning maintains a high compression rate while achieving outstanding accuracy. Our experiments further show that ARPruning achieves better compression results than state-of-the-art pruning methods. The code can be obtained at https://github.com/dozingLee/ARPruning.
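The attention-map ranking idea can be sketched in a few lines: score each channel by the mean absolute value of its (attention-weighted) map, rank channels, and drop the lowest fraction. This is a simplified illustration, not the authors' implementation; the scoring function and the toy maps are assumptions.

```python
def channel_importance(feature_maps):
    """Score each channel by the mean absolute value of its map."""
    return [sum(abs(v) for v in fmap) / len(fmap) for fmap in feature_maps]

def prune_by_ranking(feature_maps, prune_ratio):
    """Rank channels by importance and drop the lowest fraction."""
    scores = channel_importance(feature_maps)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_prune = int(len(scores) * prune_ratio)
    pruned = set(order[:n_prune])  # least important channels
    return [i for i in range(len(scores)) if i not in pruned]

# Four toy channels; channel 2 carries almost no activation.
maps = [[0.9, -0.8], [0.5, 0.4], [0.01, -0.02], [0.7, 0.6]]
kept = prune_by_ranking(maps, prune_ratio=0.25)  # channel 2 is removed
```

ARPruning additionally searches per-layer pruning ratios; here the ratio is a fixed input for brevity.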
Affiliation(s)
- Zulin Li
- Beijing University of Technology, China.
- Bo Liu
- Beijing University of Technology, China.
- Yinan Tang
- Inspur Electronic Information Industry Co., Ltd, China.
- Yujia Liu
- Beijing University of Technology, China.
2
Zhang Z, Lu Y, Wang T, Wei X, Wei Z. DDK: Dynamic structure pruning based on differentiable search and recursive knowledge distillation for BERT. Neural Netw 2024; 173:106164. [PMID: 38367353] [DOI: 10.1016/j.neunet.2024.106164]
Abstract
Large-scale pre-trained models, such as BERT, have demonstrated outstanding performance in Natural Language Processing (NLP). Nevertheless, the large number of parameters in these models increases the demand for hardware storage and computational resources, posing a challenge for practical deployment. In this article, we propose a combined method of model pruning and knowledge distillation to compress and accelerate large-scale pre-trained language models. Specifically, we introduce DDK, a dynamic structure pruning method based on differentiable search and recursive knowledge distillation that automatically prunes the BERT model. We define the search space for network pruning as all feed-forward layer channels and self-attention heads at each layer of the network, and utilize differentiable methods to determine their optimal number. Additionally, we design a recursive knowledge distillation method that employs adaptive weighting to extract the most important features from multiple intermediate layers of the teacher model and fuse them to supervise the student network's learning. Experimental results on the GLUE benchmark and ablation analysis demonstrate that our proposed method outperforms other advanced methods in average performance.
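The adaptive-weighting step of the distillation can be illustrated as follows: softmax-normalized weights fuse several teacher intermediate-layer features into a single supervision target for the student. A minimal sketch with fixed weights (in the paper they are learned), not the actual training code.

```python
import math

def adaptive_fuse(teacher_feats, weights):
    """Fuse teacher intermediate-layer features with softmax-normalised
    adaptive weights (learned in the paper; fixed here for illustration)."""
    exps = [math.exp(w) for w in weights]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(teacher_feats[0])
    return [sum(a * f[d] for a, f in zip(alphas, teacher_feats))
            for d in range(dim)]

def distill_loss(student_feat, fused_feat):
    """Mean-squared error between the student feature and the fused target."""
    return sum((s - t) ** 2 for s, t in zip(student_feat, fused_feat)) / len(fused_feat)

# Equal weights average three teacher layers into one supervision signal.
target = adaptive_fuse([[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]], [0.0, 0.0, 0.0])
```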
Affiliation(s)
- Zhou Zhang
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China.
- Yang Lu
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China; Anhui Mine IOT and Security Monitoring Technology Key Laboratory, Hefei 230088, China.
- Tengfei Wang
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China.
- Xing Wei
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China; Intelligent Manufacturing Institute of Hefei University of Technology, Hefei 230009, China.
- Zhen Wei
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China; Intelligent Manufacturing Institute of Hefei University of Technology, Hefei 230009, China.
3
Wang Y, Guo S, Guo J, Zhang J, Zhang W, Yan C, Zhang Y. Towards performance-maximizing neural network pruning via global channel attention. Neural Netw 2024; 171:104-113. [PMID: 38091754] [DOI: 10.1016/j.neunet.2023.11.065]
Abstract
Network pruning has attracted increasing attention recently for its capability of transferring large-scale neural networks (e.g., CNNs) to resource-constrained devices. Such a transfer is typically achieved by removing redundant network parameters while retaining generalization performance, in either a static or a dynamic manner. Static pruning usually produces a larger, fit-to-all-samples compressed network by removing the same channels for all samples, which cannot fully exploit the redundancy in the given network. In contrast, dynamic pruning can adaptively remove (more) different channels for different samples and obtains state-of-the-art performance along with a higher compression ratio. However, since the system must preserve the complete network information for sample-specific pruning, dynamic pruning methods are usually not memory-efficient. In this paper, we explore a static alternative, dubbed GlobalPru, from a different perspective that respects the differences among data. Specifically, a novel channel-attention-based learn-to-rank framework is proposed to learn a global ranking of channels with respect to network redundancy. In this method, each sample-wise (local) channel attention is forced to reach an agreement on the global ranking across different data. Hence, all samples can empirically share the same ranking of channels, and pruning can be performed statically in practice. Extensive experiments on ImageNet, SVHN, and CIFAR-10/100 demonstrate that the proposed GlobalPru outperforms state-of-the-art static and dynamic pruning methods by significant margins.
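Forcing sample-wise channel attentions to agree on one global ordering can be approximated by rank aggregation; the Borda-style rank averaging below is an illustrative stand-in for the paper's learn-to-rank framework, not its actual objective.

```python
def global_ranking(per_sample_scores):
    """Borda-style aggregation: sum each channel's rank across samples
    so that all samples share one static channel ordering."""
    n = len(per_sample_scores[0])
    rank_sum = [0] * n
    for scores in per_sample_scores:
        order = sorted(range(n), key=lambda i: scores[i])
        for rank, ch in enumerate(order):
            rank_sum[ch] += rank
    # channels ordered from least to most important
    return sorted(range(n), key=lambda ch: rank_sum[ch])
```

Once the global ranking is fixed, the lowest-ranked channels can be removed identically for every input, which is what makes the result a static pruning.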
Affiliation(s)
- Yingchun Wang
- BDKE Lab, School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China; Department of Computing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region of China.
- Song Guo
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region of China.
- Jingcai Guo
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region of China.
- Jie Zhang
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region of China.
- Weizhan Zhang
- BDKE Lab, School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China.
- Caixia Yan
- BDKE Lab, School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China.
- Yuanhong Zhang
- BDKE Lab, School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China.
4
Niyaz U, Sambyal AS, Bathula DR. Leveraging different learning styles for improved knowledge distillation in biomedical imaging. Comput Biol Med 2024; 168:107764. [PMID: 38056210] [DOI: 10.1016/j.compbiomed.2023.107764]
Abstract
Learning style refers to the type of training mechanism an individual adopts to gain new knowledge. As suggested by the VARK model, humans have different learning preferences, like Visual (V), Auditory (A), Read/Write (R), and Kinesthetic (K), for acquiring and effectively processing information. Our work leverages this concept of knowledge diversification to improve the performance of model compression techniques like Knowledge Distillation (KD) and Mutual Learning (ML). Consequently, we use a single-teacher, two-student network in a unified framework that not only allows for the transfer of knowledge from teacher to students (KD) but also encourages collaborative learning between students (ML). Unlike the conventional approach, where the teacher shares the same knowledge in the form of predictions or feature representations with the student network, our proposed approach employs a more diversified strategy by training one student with predictions and the other with feature maps from the teacher. We further extend this knowledge diversification by facilitating the exchange of predictions and feature maps between the two student networks, enriching their learning experiences. We have conducted comprehensive experiments on three benchmark datasets for both classification and segmentation tasks using two different network architecture combinations. The results demonstrate that knowledge diversification in a combined KD and ML framework outperforms conventional KD or ML techniques (with similar network configurations) that only use predictions, with an average improvement of 2%. Furthermore, consistent improvement in performance across different tasks, with various network architectures, and over state-of-the-art techniques establishes the robustness and generalizability of the proposed model.
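The diversification idea, one student distilling the teacher's predictions and the other its feature maps, with the complementary signal exchanged between students, can be sketched as a pair of losses. The specific loss composition and the unit weighting are illustrative assumptions, not the paper's exact formulation.

```python
import math

def kl_div(p, q):
    """KL divergence between two prediction distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(a, b):
    """Mean-squared error between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def diversified_losses(t_pred, t_feat, s1_pred, s1_feat, s2_pred, s2_feat):
    """Student 1 distils the teacher's predictions, student 2 the teacher's
    feature maps; each also learns the complementary signal from its peer
    (mutual learning)."""
    loss_s1 = kl_div(t_pred, s1_pred) + mse(s2_feat, s1_feat)
    loss_s2 = mse(t_feat, s2_feat) + kl_div(s1_pred, s2_pred)
    return loss_s1, loss_s2
```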
Affiliation(s)
- Usma Niyaz
- Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Rupnagar, 140001, Punjab, India.
- Abhishek Singh Sambyal
- Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Rupnagar, 140001, Punjab, India.
- Deepti R Bathula
- Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Rupnagar, 140001, Punjab, India.
5
López-González CI, Gascó E, Barrientos-Espillco F, Besada-Portas E, Pajares G. Filter pruning for convolutional neural networks in semantic image segmentation. Neural Netw 2024; 169:713-732. [PMID: 37976595] [DOI: 10.1016/j.neunet.2023.11.010]
Abstract
The remarkable performance of Convolutional Neural Networks (CNNs) has increased their use in real-time systems and devices with limited resources. Hence, compacting these networks while preserving accuracy has become necessary, leading to multiple compression methods. However, the majority require intensive iterative procedures and do not delve into the influence of the data used. To overcome these issues, this paper presents several contributions, framed in the context of explainable Artificial Intelligence (xAI): (a) two filter pruning methods for CNNs, which remove the less significant convolutional kernels; (b) a fine-tuning strategy to recover generalization; (c) a layer pruning approach for U-Net; and (d) an explanation of the relationship between performance and the data used. Filter and feature-map information is used in the pruning process: Principal Component Analysis (PCA) is combined with a next-convolution influence metric, while the latter and the mean standard deviation are used in an importance-score distribution-based method. The developed strategies are generic and therefore applicable to different models. Experiments demonstrating their effectiveness are conducted on distinct CNNs and datasets, focusing mainly on semantic segmentation (using U-Net, DeepLabv3+, SegNet, and VGG-16 as highly representative models). Pruned U-Net on agricultural benchmarks achieves a 98.7% parameter and 97.5% FLOPs drop, with a 0.35% gain in accuracy. DeepLabv3+ and SegNet on CamVid reach 46.5% and 72.4% parameter reductions and 51.9% and 83.6% FLOPs drops respectively, with almost no decrease in accuracy. VGG-16 on CIFAR-10 obtains up to an 86.5% parameter and 82.2% FLOPs decrease with a 0.78% accuracy gain.
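An importance score combining feature-map statistics with a next-convolution influence metric might look like the sketch below; the exact combination (product of the map's standard deviation and the mean absolute next-layer weight) is an assumption for illustration, not the paper's formula.

```python
import statistics

def filter_scores(feature_maps, next_layer_weights):
    """Hypothetical importance score: each feature map's standard deviation
    times the mean |weight| with which the next convolution consumes it."""
    scores = []
    for fmap, w in zip(feature_maps, next_layer_weights):
        influence = sum(abs(v) for v in w) / len(w)
        scores.append(statistics.pstdev(fmap) * influence)
    return scores

# A constant feature map scores zero however strongly the next layer reads it.
scores = filter_scores([[1.0, 1.0, 1.0], [0.0, 2.0]], [[0.5, 0.5], [1.0, -1.0]])
```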
Affiliation(s)
- Clara I López-González
- Department of Software Engineering and Artificial Intelligence, Complutense University of Madrid, Madrid, 28040, Spain.
- Esther Gascó
- Department of Software Engineering and Artificial Intelligence, Complutense University of Madrid, Madrid, 28040, Spain.
- Fredy Barrientos-Espillco
- Department of Computer Architecture and Automation, Complutense University of Madrid, Madrid, 28040, Spain.
- Eva Besada-Portas
- Department of Computer Architecture and Automation, Complutense University of Madrid, Madrid, 28040, Spain.
- Gonzalo Pajares
- Institute for Knowledge Technology, Complutense University of Madrid, Madrid, 28040, Spain.
6
Wang W, Zhang Y, Zhu L. DRF-DRC: dynamic receptive field and dense residual connections for model compression. Cogn Neurodyn 2023; 17:1561-1573. [PMID: 37974581] [PMCID: PMC10640440] [DOI: 10.1007/s11571-022-09913-z]
Abstract
Deep convolutional neural networks have achieved remarkable progress on computer vision tasks in recent years. These architectures are mostly designed manually by human experts, which is a time-consuming process and rarely yields the best solution. Hence, neural architecture search (NAS) has become a hot research topic for the design of neural architectures. In this paper, we propose the dynamic receptive field (DRF) operation and measurable dense residual connections (DRC) in the search space for designing efficient networks, i.e., DRENet. The search method can be deployed on the MobileNetV2-based search space. Experimental results on the CIFAR10/100, SVHN, CUB-200-2011, ImageNet, and COCO benchmark datasets, together with an application example in a railway intelligent surveillance system, demonstrate the effectiveness of our scheme, which achieves superior performance.
Affiliation(s)
- Wei Wang
- Avic Xi'an Aircraft Industry Group Company Ltd., Xi'an, 710089, China.
- Yongde Zhang
- Avic Xi'an Aircraft Industry Group Company Ltd., Xi'an, 710089, China.
- Liqiang Zhu
- School of Mechanical, Electronic and Control Engineering, Beijing Jiaotong University, Beijing, 100044, China.
7
Zhen C, Zhang W, Mo J, Ji M, Zhou H, Zhu J. RASP: Regularization-based Amplitude Saliency Pruning. Neural Netw 2023; 168:1-13. [PMID: 37734135] [DOI: 10.1016/j.neunet.2023.09.002]
Abstract
Due to the prevalent data-dependent nature of existing pruning criteria, data-independent norm criteria play a crucial role in filter pruning, offering promising prospects for deploying deep neural networks on resource-constrained devices. However, norm criteria based on amplitude measurements have long posed challenges in terms of theoretical feasibility. Existing methods rely on data-derived information, such as derivatives, to establish reasonable pruning standards. Nonetheless, a quantitative analysis of the "smaller-norm-less-important" notion remains elusive within the norm-criterion context. To address the need for data independence and theoretical feasibility, we conducted saliency analysis on filters and proposed a regularization-based amplitude saliency pruning criterion (RASP). This amplitude saliency not only attains data independence but also establishes usage guidelines for norm criteria. We further investigated the amplitude saliency, addressing the issues of data dependency in model evaluation and inter-class filter selection, and introduced model saliency together with an adaptive parameter group lasso (AGL) regularization approach that is sensitive to different layers. Theoretically, we thoroughly analyzed the feasibility of amplitude saliency and employed quantitative saliency analysis to validate the advantages of our method over previous approaches. Experimentally, on the CIFAR-10 and ImageNet image classification benchmarks, we extensively validated the improved performance of our method compared to previous methods. Even when the pruned model has the same or a smaller number of FLOPs, our method achieves equivalent or higher accuracy. Notably, in our ImageNet experiment, RASP achieved a 51.9% reduction in FLOPs while maintaining an accuracy of 76.19% on ResNet-50.
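A norm criterion paired with group-lasso regularization can be sketched generically: penalize each filter's L2 norm during training so that small-norm filters genuinely become unimportant, then prune by a norm threshold. This is a plain group-lasso illustration, not RASP's adaptive, layer-sensitive variant; the threshold and coefficient are assumed values.

```python
import math

def group_lasso_penalty(filters, lam):
    """Group-lasso regulariser: lam times the sum of per-filter L2 norms,
    which drives whole filters toward zero during training."""
    return lam * sum(math.sqrt(sum(w * w for w in f)) for f in filters)

def prune_small_norm(filters, threshold):
    """Keep only the indices of filters whose L2 norm exceeds the threshold,
    i.e. apply the smaller-norm-less-important criterion."""
    return [i for i, f in enumerate(filters)
            if math.sqrt(sum(w * w for w in f)) > threshold]

# A near-zero filter is dropped; the penalty is what pushed it there.
kept = prune_small_norm([[3.0, 4.0], [0.01, 0.02]], threshold=1.0)
```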
Affiliation(s)
- Chenghui Zhen
- College of Information Science and Engineering, Huaqiao University, Xiamen, 361021, Fujian, China.
- Weiwei Zhang
- College of Engineering, Huaqiao University, Quanzhou, 362021, Fujian, China.
- Jian Mo
- College of Engineering, Huaqiao University, Quanzhou, 362021, Fujian, China.
- Ming Ji
- College of Engineering, Huaqiao University, Quanzhou, 362021, Fujian, China.
- Hongbo Zhou
- College of Engineering, Huaqiao University, Quanzhou, 362021, Fujian, China; Intelligent Software Research Center, Institute of Software Chinese Academy of Sciences, 100190, Beijing, China.
- Jianqing Zhu
- College of Engineering, Huaqiao University, Quanzhou, 362021, Fujian, China.
8
Gong Z, Chen C, Chen C, Li C, Tian X, Gong Z, Lv X. RamanCMP: A Raman spectral classification acceleration method based on lightweight model and model compression techniques. Anal Chim Acta 2023; 1278:341758. [PMID: 37709483] [DOI: 10.1016/j.aca.2023.341758]
Abstract
In recent years, Raman spectroscopy combined with deep learning techniques has been widely used in fields such as medicine, chemistry, and geology. However, there is still room to optimize the deep learning techniques and model compression algorithms used to process Raman spectral data. To further optimize deep learning models applied to Raman spectroscopy, this study uses time, accuracy, sensitivity, specificity, and the number of floating-point operations (FLOPs) as evaluation metrics, and the resulting model is named RamanCompact (RamanCMP). The experimental data are selected from the RRUFF public dataset, which consists of 723 Raman spectroscopy samples from 10 different mineral categories. In this paper, 1D-EfficientNet and 1D-DRSN, both adapted to spectral data, are proposed to improve classification accuracy. To achieve better classification accuracy while optimizing the time parameters, three model compression methods are designed: knowledge distillation using the 1D-EfficientNet model as a teacher to train a convolutional neural network (CNN); a channel conversion method to optimize the 1D-DRSN model; and using the 1D-DRSN model as a feature extractor in combination with a linear discriminant analysis (LDA) classifier. Compared with traditional LDA and CNN models, the accuracy of 1D-EfficientNet and 1D-DRSN is improved by more than 20%. The inference time of the distilled model is reduced by 9680.9 s compared with the teacher model 1D-EfficientNet, at the cost of 2.07% accuracy. The accuracy of the distilled model is improved by 20% compared to the CNN student model while keeping inference efficiency constant. The 1D-DRSN optimized with the channel conversion method saves 60% of the original 1D-DRSN model's inference time. Feature extraction reduces the inference time of the 1D-DRSN model by 93% with 94.48% accuracy.
This study innovatively combines lightweight models and model compression algorithms to improve the classification speed of deep learning models in the field of Raman spectroscopy, forming a complete set of analysis methods and laying a foundation for future research.
Affiliation(s)
- Zengyun Gong
- College of Software, Xinjiang University, Urumqi, 830046, Xinjiang, China.
- Chen Chen
- College of Information Science and Engineering, Xinjiang University, Urumqi, 830046, Xinjiang, China.
- Cheng Chen
- College of Software, Xinjiang University, Urumqi, 830046, Xinjiang, China.
- Chenxi Li
- Oncological Department of Oral and Maxillofacial Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, China.
- Xuecong Tian
- College of Information Science and Engineering, Xinjiang University, Urumqi, 830046, Xinjiang, China.
- Zhongcheng Gong
- Oncological Department of Oral and Maxillofacial Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, China; Hospital of Stomatology Xinjiang Medical University, Urumqi, 830054, Xinjiang, China; Stomatological Research Institute of Xinjiang Uygur Autonomous Region, Urumqi, 830054, Xinjiang, China.
- Xiaoyi Lv
- College of Software, Xinjiang University, Urumqi, 830046, Xinjiang, China; Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi, 830046, Xinjiang, China.
9
Jantre S, Bhattacharya S, Maiti T. Layer adaptive node selection in Bayesian neural networks: Statistical guarantees and implementation details. Neural Netw 2023; 167:309-330. [PMID: 37666188] [DOI: 10.1016/j.neunet.2023.08.029]
Abstract
Sparse deep neural networks have proven to be efficient for predictive model building in large-scale studies. Although several works have studied theoretical and numerical properties of sparse neural architectures, they have primarily focused on edge selection. Sparsity through edge selection might be intuitively appealing; however, it does not necessarily reduce the structural complexity of a network. Instead, pruning excess nodes leads to a structurally sparse network with significant computational speedup during inference. To this end, we propose a Bayesian sparse solution using spike-and-slab Gaussian priors to allow for automatic node selection during training. The use of a spike-and-slab prior alleviates the need for an ad-hoc thresholding rule for pruning. In addition, we adopt a variational Bayes approach to circumvent the computational challenges of traditional Markov Chain Monte Carlo (MCMC) implementations. In the context of node selection, we establish the fundamental result of variational posterior consistency together with the characterization of prior parameters. In contrast to previous works, our theoretical development relaxes the assumptions of an equal number of nodes and uniform bounds on all network weights, thereby accommodating sparse networks with layer-dependent node structures or coefficient bounds. With a layer-wise characterization of prior inclusion probabilities, we discuss the optimal contraction rates of the variational posterior. We empirically demonstrate that our proposed approach outperforms the edge selection method in computational complexity with similar or better predictive performance. Our experimental evidence further substantiates that our theoretical work facilitates layer-wise optimal node recovery.
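The appeal of the spike-and-slab prior is that pruning needs no threshold: a node is either exactly zero (spike) or active with a Gaussian weight (slab). A toy sampler illustrating the generative side (the paper's variational training of the inclusion probabilities is not shown):

```python
import random

def spike_and_slab_sample(incl_probs, slab_std, rng):
    """Draw node weights: with probability p a node is 'on' and its weight
    comes from the Gaussian slab; otherwise it sits exactly at the spike
    (zero), so pruning needs no post-hoc threshold."""
    weights = []
    for p in incl_probs:
        if rng.random() < p:
            weights.append(rng.gauss(0.0, slab_std))
        else:
            weights.append(0.0)
    return weights

# A node with inclusion probability 0 is pruned exactly, not approximately.
w = spike_and_slab_sample([1.0, 0.0], slab_std=1.0, rng=random.Random(0))
```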
Affiliation(s)
- Sanket Jantre
- Department of Statistics and Probability, Michigan State University, United States of America.
- Shrijita Bhattacharya
- Department of Statistics and Probability, Michigan State University, United States of America.
- Tapabrata Maiti
- Department of Statistics and Probability, Michigan State University, United States of America.
10
Sun C, Chen J, Li Y, Wang W, Ma T. Random pruning: channel sparsity by expectation scaling factor. PeerJ Comput Sci 2023; 9:e1564. [PMID: 37705629] [PMCID: PMC10495938] [DOI: 10.7717/peerj-cs.1564]
Abstract
Pruning is an efficient method for deep neural network model compression and acceleration. However, existing pruning strategies, both at the filter level and at the channel level, often introduce a large amount of computation and adopt complex methods for finding sub-networks. We find that there is a linear relationship between the sum of matrix elements of a channel in convolutional neural networks (CNNs) and the expectation scaling ratio of the image pixel distribution, which reflects how the expectation of the pixel distribution changes between the input data and the feature mapping. This implies that channels with similar expectation scaling factors (δE) cause similar expectation changes to the input data, thus producing redundant feature mappings. This article therefore proposes a new structured pruning method called EXP. In the proposed method, channels with similar δE are randomly removed in each convolutional layer, so the whole network achieves random sparsity and yields non-redundant, non-unique sub-networks. Experiments on pruning various networks show that EXP achieves a significant reduction in FLOPs. For example, on the CIFAR-10 dataset, EXP reduces the FLOPs of the ResNet-56 model by 71.9% with a 0.23% loss in Top-1 accuracy. On ILSVRC-2012, it reduces the FLOPs of the ResNet-50 model by 60.0% with a 1.13% loss in Top-1 accuracy. Our code is available at https://github.com/EXP-Pruning/EXP_Pruning (DOI: 10.5281/zenodo.8141065).
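The δE grouping can be sketched as: compute each channel's kernel-element sum, group channels whose δE values are close, and randomly keep one representative per group. The tolerance value and the greedy one-pass grouping over sorted factors are illustrative simplifications of the paper's procedure.

```python
import random

def expectation_scaling_factors(channels):
    """delta-E of a channel: the sum of its kernel elements, which scales
    the expectation of the input pixel distribution."""
    return [sum(kernel) for kernel in channels]

def random_prune_similar(channels, tol, rng):
    """Group channels whose delta-E values are within tol of each other
    (one greedy pass over the sorted factors), then randomly keep a single
    representative per group."""
    factors = expectation_scaling_factors(channels)
    order = sorted(range(len(factors)), key=lambda i: factors[i])
    groups, current = [], [order[0]]
    for i in order[1:]:
        if abs(factors[i] - factors[current[-1]]) < tol:
            current.append(i)
        else:
            groups.append(current)
            current = [i]
    groups.append(current)
    return sorted(rng.choice(g) for g in groups)

# Channels 0 and 1 have near-identical delta-E; one of them is dropped at random.
kept = random_prune_similar([[1.0, 1.0], [1.01, 1.0], [5.0, 5.0]], tol=0.5,
                            rng=random.Random(0))
```

The random choice within each group is what makes the surviving sub-network non-unique, matching the abstract's "random sparsity" claim.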
Affiliation(s)
- Chuanmeng Sun
- North University of China, State Key Laboratory of Dynamic Measurement Technology, Taiyuan, Shanxi, China; North University of China, School of Electrical and Control Engineering, Taiyuan, Shanxi, China.
- Jiaxin Chen
- North University of China, State Key Laboratory of Dynamic Measurement Technology, Taiyuan, Shanxi, China; North University of China, School of Electrical and Control Engineering, Taiyuan, Shanxi, China.
- Yong Li
- Chongqing University, State Key Laboratory of Coal Mine Disaster Dynamics and Control, Chongqing, China.
- Wenbo Wang
- North University of China, State Key Laboratory of Dynamic Measurement Technology, Taiyuan, Shanxi, China; North University of China, School of Electrical and Control Engineering, Taiyuan, Shanxi, China.
- Tiehua Ma
- North University of China, State Key Laboratory of Dynamic Measurement Technology, Taiyuan, Shanxi, China; North University of China, School of Electrical and Control Engineering, Taiyuan, Shanxi, China.
11
Shang R, Li W, Zhu S, Jiao L, Li Y. Multi-teacher knowledge distillation based on joint Guidance of Probe and Adaptive Corrector. Neural Netw 2023; 164:345-356. [PMID: 37163850] [DOI: 10.1016/j.neunet.2023.04.015]
Abstract
Knowledge distillation (KD) has been widely used in model compression. However, in current multi-teacher KD algorithms, the student can only passively acquire the knowledge of the teachers' middle layers in a single form, and all teachers apply an identical guiding scheme to the student. To solve these problems, this paper proposes a multi-teacher KD method based on the joint Guidance of a Probe and an Adaptive Corrector (GPAC). First, GPAC proposes a teacher selection strategy guided by the Linear Classifier Probe (LCP), which allows the student to select better teachers at the middle layer; teachers are evaluated using the classification accuracy detected by the LCP. Then, GPAC designs an adaptive multi-teacher instruction mechanism that uses instructional weights to emphasize the student's predicted direction and reduce the student's difficulty in learning from teachers. At the same time, each teacher can formulate its guiding scheme according to the Kullback-Leibler divergence loss between the student and itself. Finally, GPAC develops a multi-level mechanism for adjusting the spatial attention loss. This mechanism uses a piecewise function that varies with the number of epochs and classifies the student's learning of spatial attention into three levels, which efficiently exploits the teachers' spatial attention. GPAC and current state-of-the-art distillation methods are tested on the CIFAR-10 and CIFAR-100 datasets. The experimental results demonstrate that the proposed method obtains higher classification accuracy.
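The LCP-guided selection reduces to: attach a linear probe to each candidate teacher layer, measure the probe's classification accuracy, and let the student learn from the best-scoring teacher. A sketch of that selection step only (probe training itself is omitted, and the toy predictions are assumptions):

```python
def probe_accuracy(probe_preds, labels):
    """Accuracy of a linear classifier probe attached to one teacher layer."""
    return sum(p == y for p, y in zip(probe_preds, labels)) / len(labels)

def select_teacher(teacher_probe_preds, labels):
    """Pick, for this layer, the teacher whose probe classifies best."""
    accs = [probe_accuracy(preds, labels) for preds in teacher_probe_preds]
    return max(range(len(accs)), key=lambda i: accs[i])

# Teacher 0's probe is perfect, teacher 1's is at chance: teacher 0 is chosen.
best = select_teacher([[0, 1, 1, 0], [0, 0, 0, 0]], labels=[0, 1, 1, 0])
```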
Affiliation(s)
- Ronghua Shang
- Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi'an, Shaanxi, China.
- Wenzheng Li
- Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Guangzhou Institute of Technology, Xidian University, Guangzhou, Guangdong, China.
- Songling Zhu
- Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi'an, Shaanxi, China.
- Licheng Jiao
- Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi'an, Shaanxi, China.
- Yangyang Li
- Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi'an, Shaanxi, China.
12
|
Yun HI, Park JS. End-to-end emotional speech recognition using acoustic model adaptation based on knowledge distillation. Multimed Tools Appl 2023; 82:22759-22776. [PMID: 36817556 PMCID: PMC9923643 DOI: 10.1007/s11042-023-14680-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/16/2022] [Revised: 07/28/2022] [Accepted: 02/03/2023] [Indexed: 06/01/2023]
Abstract
The end-to-end approach provides better performance in speech recognition than the traditional hidden Markov model-deep neural network (HMM-DNN)-based approach, but still performs poorly on abnormal speech, especially emotional speech. The ideal solution would be to build an acoustic model for each emotion using only that emotion's speech data, but this is impractical because it is difficult to collect a sufficient amount of emotional speech data per emotion. In this study, we propose a method to improve emotional speech recognition performance using knowledge distillation, a technique originally introduced to decrease the computational intensity of deep learning-based approaches by reducing the number of model parameters. Beyond model compression, we employ this technique for model adaptation to emotional speech. The proposed method builds a basic model (referred to as a teacher model) with a large number of parameters using a large amount of normal speech data, and then constructs a target model (referred to as a student model) with fewer parameters using a small amount of emotional speech data (i.e., adaptation data). Since the student model is built with emotional speech data, it is expected to reflect the characteristics of each emotion well. In the emotional speech recognition experiment, the student model maintained recognition performance regardless of the number of model parameters, whereas the teacher model's performance degraded significantly as the number of parameters decreased, with a degradation of about 10% in word error rate. This result demonstrates that the student model serves as an acoustic model suitable for emotional speech recognition even though it does not require much emotional speech data.
Affiliation(s)
- Hong-In Yun
- Department of English Linguistics, Hankuk University of Foreign Studies, Seoul, Republic of Korea
- Jeong-Sik Park
- Department of English Linguistics & Language Technology, Hankuk University of Foreign Studies, Seoul, Republic of Korea

13
Li L, Su W, Liu F, He M, Liang X. Knowledge Fusion Distillation: Improving Distillation with Multi-scale Attention Mechanisms. Neural Process Lett 2023; 55:1-16. [PMID: 36619739 PMCID: PMC9807430 DOI: 10.1007/s11063-022-11132-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/12/2022] [Indexed: 01/04/2023]
Abstract
The success of deep learning has brought breakthroughs in many fields. However, the increased performance of deep learning models is often accompanied by an increase in their depth and width, which conflicts with the storage, energy, and computational constraints of edge devices. Knowledge distillation, an effective model compression method, can transfer knowledge from complex teacher models to student models. Self-distillation is a special type of knowledge distillation that does not require a pre-trained teacher model. However, existing self-distillation methods rarely consider how to use the early features of the model effectively. Furthermore, most self-distillation methods use features from the deepest layers of the network to guide the training of its branches, which we find is not the optimal choice. In this paper, we observe that the feature maps obtained by early feature fusion do not serve as a good teacher for guiding their own training. Based on this, we propose a selective feature fusion module and, building on it, a new self-distillation method, knowledge fusion distillation. Extensive experiments on three datasets demonstrate that our method achieves performance comparable to state-of-the-art distillation methods. In addition, the performance of the network can be further enhanced when fused features are integrated into the network.
Affiliation(s)
- Linfeng Li
- School of Artificial Intelligence, Tiangong University, Tianjin 300387, China
- Weixing Su
- School of Computer Science and Technology, Tiangong University, Tianjin 300387, China
- Fang Liu
- School of Software, Tiangong University, Tianjin 300387, China
- Maowei He
- School of Computer Science and Technology, Tiangong University, Tianjin 300387, China
- Xiaodan Liang
- School of Computer Science and Technology, Tiangong University, Tianjin 300387, China

14
Abrar S, Samad MD. Perturbation of deep autoencoder weights for model compression and classification of tabular data. Neural Netw 2022; 156:160-169. [PMID: 36270199 PMCID: PMC9669225 DOI: 10.1016/j.neunet.2022.09.020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2022] [Revised: 07/18/2022] [Accepted: 09/19/2022] [Indexed: 11/16/2022]
Abstract
Fully connected deep neural networks (DNN) often include redundant weights leading to overfitting and high memory requirements. Additionally, in tabular data classification, DNNs are challenged by the often superior performance of traditional machine learning models. This paper proposes periodic perturbations (prune and regrow) of DNN weights, especially at the self-supervised pre-training stage of deep autoencoders. The proposed weight perturbation strategy outperforms dropout learning or weight regularization (L1 or L2) for four out of six tabular data sets in downstream classification tasks. Unlike dropout learning, the proposed weight perturbation routine additionally achieves 15% to 40% sparsity across six tabular data sets, resulting in compressed pretrained models. The proposed pretrained model compression improves the accuracy of downstream classification, unlike traditional weight pruning methods that trade off performance for model compression. Our experiments reveal that a pretrained deep autoencoder with weight perturbation can outperform traditional machine learning in tabular data classification, whereas baseline fully-connected DNNs yield the worst classification accuracy. However, traditional machine learning models are superior to any deep model when a tabular data set contains uncorrelated variables. Therefore, the performance of deep models with tabular data is contingent on the types and statistics of constituent variables.
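The periodic prune-and-regrow perturbation described above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code; the pruning/regrowth fractions and the small-Gaussian re-initialization scale are assumptions.

```python
import numpy as np

def perturb_weights(w, prune_frac=0.2, regrow_frac=0.1, rng=None):
    """One prune-and-regrow step: zero the smallest-magnitude fraction
    of weights, then re-initialize a smaller fraction of the zeroed
    positions with small random values, leaving net sparsity behind."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = w.astype(float).copy()
    flat = w.ravel()                      # view into the copy
    k_prune = int(prune_frac * flat.size)
    k_regrow = int(regrow_frac * flat.size)
    order = np.argsort(np.abs(flat))
    flat[order[:k_prune]] = 0.0           # prune smallest magnitudes
    zeros = np.flatnonzero(flat == 0.0)
    regrow = rng.choice(zeros, size=min(k_regrow, zeros.size),
                        replace=False)
    flat[regrow] = rng.normal(0.0, 0.01, size=regrow.size)
    return flat.reshape(w.shape)
```

Applied periodically during pre-training, each step removes 20% of the weights and revives 10%, so the surviving zeros accumulate as the compressed-model sparsity the abstract reports.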
Affiliation(s)
- Sakib Abrar
- Department of Computer Science, Tennessee State University, Nashville, TN 37209, United States
- Manar D Samad
- Department of Computer Science, Tennessee State University, Nashville, TN 37209, United States

15
Li G, Togo R, Ogawa T, Haseyama M. Compressed gastric image generation based on soft-label dataset distillation for medical data sharing. Comput Methods Programs Biomed 2022; 227:107189. [PMID: 36323177 DOI: 10.1016/j.cmpb.2022.107189] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/27/2021] [Revised: 07/07/2022] [Accepted: 10/17/2022] [Indexed: 06/16/2023]
Abstract
BACKGROUND AND OBJECTIVE Sharing of medical data is required to enable the cross-agency flow of healthcare information and construct high-accuracy computer-aided diagnosis systems. However, the large sizes of medical datasets, the massive amount of memory of saved deep convolutional neural network (DCNN) models, and patients' privacy protection are problems that can lead to inefficient medical data sharing. Therefore, this study proposes a novel soft-label dataset distillation method for medical data sharing. METHODS The proposed method distills valid information of medical image data and generates several compressed images with different data distributions for anonymous medical data sharing. Furthermore, our method can extract essential weights of DCNN models to reduce the memory required to save trained models for efficient medical data sharing. RESULTS The proposed method can compress tens of thousands of images into several soft-label images and reduce the size of a trained model to a few hundredths of its original size. The compressed images obtained after distillation have been visually anonymized; therefore, they do not contain the private information of the patients. Furthermore, we can realize high-detection performance with a small number of compressed images. CONCLUSIONS The experimental results show that the proposed method can improve the efficiency and security of medical data sharing.
Affiliation(s)
- Guang Li
- Graduate School of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-Ku, Sapporo, 060-0814, Japan
- Ren Togo
- Faculty of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-Ku, Sapporo, 060-0814, Japan
- Takahiro Ogawa
- Faculty of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-Ku, Sapporo, 060-0814, Japan
- Miki Haseyama
- Faculty of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-Ku, Sapporo, 060-0814, Japan

16
Li J, Zhao B, Liu D. DMPP: Differentiable multi-pruner and predictor for neural network pruning. Neural Netw 2021; 147:103-112. [PMID: 34998270 DOI: 10.1016/j.neunet.2021.12.020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2021] [Revised: 12/13/2021] [Accepted: 12/23/2021] [Indexed: 10/19/2022]
Abstract
Neural network pruning can trim over-parameterized neural networks effectively by removing a number of network parameters. However, traditional rule-based approaches depend on manual experience, and existing heuristic search methods in discrete search spaces are usually time-consuming and sub-optimal. In this paper, we develop a differentiable multi-pruner and predictor (DMPP) to prune neural networks automatically. The pruner, composed of learnable parameters, generates the pruning ratios of all convolutional layers as a continuous representation of the network. A neural network-based predictor is employed to predict the performance of different structures, which accelerates the search process. Together, the pruner and predictor enable direct gradient-based optimization to find a better structure. In addition, a multi-pruner is presented to improve search efficiency, and knowledge distillation is leveraged to improve the performance of the pruned network. To evaluate the effectiveness of the proposed method, extensive experiments are performed on the CIFAR-10, CIFAR-100, and ImageNet datasets with VGGNet and ResNet. Results show that the proposed DMPP achieves better performance than many previous state-of-the-art methods.
Affiliation(s)
- Jiaxin Li
- School of Automation, Guangdong University of Technology, Guangzhou 510006, China
- Bo Zhao
- School of System Science, Beijing Normal University, Beijing 100875, China
- Derong Liu
- School of Automation, Guangdong University of Technology, Guangzhou 510006, China; Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL 60607, USA

17
Deng X, Zhang Z. Sparsity-control ternary weight networks. Neural Netw 2021; 145:221-232. [PMID: 34773898 DOI: 10.1016/j.neunet.2021.10.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2021] [Revised: 08/10/2021] [Accepted: 10/21/2021] [Indexed: 11/18/2022]
Abstract
Deep neural networks (DNNs) have been widely and successfully applied to various applications, but they require large amounts of memory and computational power, which severely restricts their deployment on resource-limited devices. To address this issue, many efforts have been made on training low-bit-weight DNNs. In this paper, we focus on training ternary weight {-1, 0, +1} networks, which avoid multiplications and dramatically reduce the memory and computation requirements. A ternary weight network can be considered a sparser version of its binary counterpart, obtained by replacing some -1s or 1s in the binary weights with 0s, leading to more efficient inference but higher memory cost. However, existing approaches to training ternary weight networks cannot control the sparsity (i.e., the percentage of 0s) of the ternary weights, which undermines the advantage of ternary weights. In this paper, we propose, to the best of our knowledge, the first sparsity-control approach (SCA) for training ternary weight networks, achieved simply by a weight discretization regularizer (WDR). SCA differs from all existing regularizer-based approaches in that it controls the sparsity of the ternary weights through a controller α and does not rely on gradient estimators. We theoretically and empirically show that the sparsity of the trained ternary weights is positively related to α. SCA is extremely simple and easy to implement, and is shown to consistently and significantly outperform state-of-the-art approaches over several benchmark datasets, even matching the performance of the full-precision counterparts.
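The role of a sparsity controller can be illustrated with the ternary mapping itself: a threshold sends small-magnitude weights to zero, and a larger threshold yields higher sparsity. Note that SCA actually controls sparsity during training through the regularizer weight α rather than a hand-set threshold; this sketch (hypothetical function names) only shows the monotone threshold-to-sparsity relationship that the controller exploits.

```python
import numpy as np

def ternarize(w, delta):
    """Map full-precision weights to {-1, 0, +1}. Weights within
    [-delta, delta] become 0, so a larger delta zeros more weights."""
    t = np.zeros_like(w, dtype=float)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    return t

def sparsity(t):
    # Fraction of zeros in the ternary weight tensor.
    return float(np.mean(t == 0.0))
```

For example, raising the threshold from 0.1 to 0.5 strictly grows the set of zeroed weights, mirroring the abstract's claim that sparsity is positively related to the controller.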
Affiliation(s)
- Xiang Deng
- State University of New York at Binghamton, Binghamton, NY, United States
- Zhongfei Zhang
- State University of New York at Binghamton, Binghamton, NY, United States

18
Zhang R, Chung ACS. MedQ: Lossless ultra-low-bit neural network quantization for medical image segmentation. Med Image Anal 2021; 73:102200. [PMID: 34416578 DOI: 10.1016/j.media.2021.102200] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2020] [Revised: 06/30/2021] [Accepted: 07/26/2021] [Indexed: 10/20/2022]
Abstract
Implementing deep convolutional neural networks (CNNs) with boolean arithmetic is ideal for eliminating the notoriously high computational expense of deep learning models. However, although lossless model compression via weight-only quantization has been achieved in previous works, how to reduce the computation precision of CNNs without losing performance remains an open problem, especially for medical image segmentation tasks, where data dimensionality is high and annotation is scarce. This paper presents a novel CNN quantization framework that can squeeze a deep model (both parameters and activations) to extremely low bitwidth, e.g., 1∼2 bits, while maintaining its high performance. In the new method, we first design a strong baseline quantizer with an optimizable quantization range. Then, to relieve the back-propagation difficulty caused by the discontinuous quantization function, we design a radical residual connection scheme that allows gradients to flow freely through every quantized layer. Moreover, a tanh-based derivative function is used to further boost gradient flow, and a distributional loss is employed to regularize the model output. Extensive experiments and ablation studies are conducted on two well-established public 3D segmentation datasets, BRATS2020 and LiTS. The results show that our framework not only outperforms state-of-the-art quantization approaches significantly but also achieves lossless performance on both datasets with ternary (2-bit) quantization.
Affiliation(s)
- Rongzhao Zhang
- Lo Kwee-Seong Medical Image Analysis Laboratory, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong
- Albert C S Chung
- Lo Kwee-Seong Medical Image Analysis Laboratory, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong

19
Abstract
The use of deep neural networks (DNNs) has dramatically elevated the performance of speech enhancement over the last decade. However, to achieve strong enhancement performance typically requires a large DNN, which is both memory and computation consuming, making it difficult to deploy such speech enhancement systems on devices with limited hardware resources or in applications with strict latency requirements. In this study, we propose two compression pipelines to reduce the model size for DNN-based speech enhancement, which incorporates three different techniques: sparse regularization, iterative pruning and clustering-based quantization. We systematically investigate these techniques and evaluate the proposed compression pipelines. Experimental results demonstrate that our approach reduces the sizes of four different models by large margins without significantly sacrificing their enhancement performance. In addition, we find that the proposed approach performs well on speaker separation, which further demonstrates the effectiveness of the approach for compressing speech separation models.
Affiliation(s)
- Ke Tan
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277, USA
- DeLiang Wang
- Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277, USA

20
Alkhulaifi A, Alsahli F, Ahmad I. Knowledge distillation in deep learning and its applications. PeerJ Comput Sci 2021; 7:e474. [PMID: 33954248 PMCID: PMC8053015 DOI: 10.7717/peerj-cs.474] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2020] [Accepted: 03/16/2021] [Indexed: 05/20/2023]
Abstract
Deep learning based models are relatively large, and it is hard to deploy such models on resource-limited devices such as mobile phones and embedded devices. One possible solution is knowledge distillation, whereby a smaller model (the student model) is trained using information from a larger model (the teacher model). In this paper, we present an overview of knowledge distillation techniques applied to deep learning models. To compare the performance of different techniques, we propose a new measure called the distillation metric, which compares knowledge distillation solutions based on model size and accuracy score. Based on the survey, some interesting conclusions are drawn and presented, including current challenges and possible research directions.
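One plausible form of such a distillation metric — combining the student/teacher size ratio with accuracy retention so that lower is better — might look like the following. This is a hypothetical sketch; the paper's exact definition may differ, and `alpha` is an assumed trade-off parameter.

```python
def distillation_metric(student_size, teacher_size,
                        student_acc, teacher_acc, alpha=0.5):
    """Score a distillation outcome: smaller models and accuracy
    closer to the teacher both push the score toward 0 (better)."""
    size_term = student_size / teacher_size          # in (0, 1] when compressed
    acc_term = 1.0 - student_acc / teacher_acc       # 0 when accuracy retained
    return alpha * size_term + (1.0 - alpha) * acc_term
```

Under this form, a 10× smaller student that keeps the teacher's accuracy scores far better than an uncompressed one, and losing accuracy raises the score even at the same size.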
21
Wen L, Zhang X, Bai H, Xu Z. Structured pruning of recurrent neural networks through neuron selection. Neural Netw 2019; 123:134-141. [PMID: 31855748 DOI: 10.1016/j.neunet.2019.11.018] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2019] [Revised: 10/01/2019] [Accepted: 11/19/2019] [Indexed: 11/17/2022]
Abstract
Recurrent neural networks (RNNs) have recently achieved remarkable success in a number of applications. However, the huge size and computational burden of these models make their deployment on edge devices difficult. A practically effective approach is to reduce the overall storage and computation costs of RNNs through network pruning. However, pruning methods based on Lasso produce irregular sparse patterns in weight matrices, which do not translate into practical speedup. To address this issue, we propose a structured pruning method through neuron selection that can remove independent neurons of RNNs. More specifically, we introduce two sets of binary random variables, which can be interpreted as gates or switches on the input neurons and the hidden neurons, respectively. We demonstrate that the corresponding optimization problem can be addressed by minimizing the L0 norm of the weight matrix. Finally, experimental results on language modeling and machine reading comprehension tasks indicate the advantages of the proposed method over state-of-the-art pruning competitors. In particular, nearly 20× practical speedup during inference was achieved without losing performance for the language model on the Penn TreeBank dataset, indicating the promising performance of the proposed method.
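The gate-per-neuron idea can be illustrated in a feed-forward simplification: a binary gate that switches off a hidden neuron zeros an entire column of the weight matrix, so the sparsity is structured and the gated columns can be physically removed. This is a sketch with assumed helper names; the paper learns the gates via L0 minimization rather than fixing them by hand.

```python
import numpy as np

def gated_forward(x, W, gates):
    # One binary gate per hidden neuron: gate 0 kills that neuron's
    # whole output column, giving structured (column-wise) sparsity.
    return (x @ W) * gates

def prune_columns(W, gates):
    # Physically remove the gated-off neurons' columns.
    return W[:, np.flatnonzero(gates)]
```

The pruned matrix computes exactly the surviving neurons' activations, which is why structured pruning yields real wall-clock speedup rather than irregular zeros.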
Affiliation(s)
- Liangjian Wen
- SMILE Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610031, China
- Xuanyang Zhang
- SMILE Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610031, China
- Haoli Bai
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin NT 999077, Hong Kong SAR
- Zenglin Xu
- SMILE Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610031, China; Center of Artificial Intelligence, Peng Cheng Lab, Shenzhen, Guangdong, China

22
Patra A, Cai Y, Chatelain P, Sharma H, Drukker L, Papageorghiou A, Noble JA. Efficient Ultrasound Image Analysis Models with Sonographer Gaze Assisted Distillation. Med Image Comput Comput Assist Interv 2019; 22:394-402. [PMID: 31942569 DOI: 10.1007/978-3-030-32251-9_43] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/10/2023]
Abstract
Recent automated medical image analysis methods have attained state-of-the-art performance but rely on memory- and compute-intensive deep learning models. Reducing model size without significant loss in performance metrics is crucial for time- and memory-efficient automated image-based decision-making. Traditional deep learning based image analysis uses expert knowledge only in the form of manual annotations. Recently, there has been interest in introducing other forms of expert knowledge into deep learning architecture design. This is the approach considered in this paper, where we propose to combine ultrasound video with the point-of-gaze tracked from expert sonographers as they scan, in order to train memory-efficient ultrasound image analysis models. Specifically, we develop teacher-student knowledge transfer models for the exemplar task of frame classification for the fetal abdomen, head, and femur. The best-performing memory-efficient models attain performance within 5% of conventional models that are 1000× larger in size.
23
Guo J, Zhou B, Zeng X, Freyberg Z, Xu M. Model Compression for Faster Structural Separation of Macromolecules Captured by Cellular Electron Cryo-Tomography. Image Anal Recognit 2018; 10882:144-152. [PMID: 31231722 PMCID: PMC6588193 DOI: 10.1007/978-3-319-93000-8_17] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/13/2023]
Abstract
Electron Cryo-Tomography (ECT) enables 3D visualization of macromolecule structure inside single cells. Macromolecule classification approaches based on convolutional neural networks (CNN) have been developed to systematically separate the millions of macromolecules captured by ECT. However, given the fast accumulation of ECT data, it will soon become necessary to use CNN models to separate substantially more macromolecules efficiently and accurately at the prediction stage, which incurs additional computational cost. To speed up prediction, we compress the classification models into compact neural networks with little loss in accuracy for deployment. Specifically, we propose to perform model compression through knowledge distillation: first, a complex teacher network is trained to generate soft labels with better classification feasibility; then, customized student networks with simple architectures are trained on the soft labels to reduce model complexity. Our tests demonstrate that the compressed models significantly reduce the number of parameters and time cost while maintaining similar classification accuracy.
Affiliation(s)
- Bo Zhou
- School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
- Xiangrui Zeng
- School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
- Zachary Freyberg
- Departments of Psychiatry and Cell Biology, University of Pittsburgh, Pittsburgh, USA
- Min Xu
- School of Computer Science, Carnegie Mellon University, Pittsburgh, USA

24
Heinrich MP, Blendowski M, Oktay O. TernaryNet: faster deep model inference without GPUs for medical 3D segmentation using sparse and binary convolutions. Int J Comput Assist Radiol Surg 2018; 13:1311-1320. [PMID: 29850978 DOI: 10.1007/s11548-018-1797-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2018] [Accepted: 05/21/2018] [Indexed: 10/16/2022]
Abstract
PURPOSE Deep convolutional neural networks (DCNN) are currently ubiquitous in medical imaging. While their versatility and high-quality results for common image analysis tasks, including segmentation, localisation and prediction, are astonishing, this large representational power comes at the cost of highly demanding computational effort. This limits their practical application to image-guided interventions and diagnostic (point-of-care) support on mobile devices without graphics processing units (GPU). METHODS We propose a new scheme that approximates both trainable weights and neural activations in deep networks by ternary values and tackles the open question of backpropagation when dealing with non-differentiable functions. Our solution enables the removal of the expensive floating-point matrix multiplications throughout any convolutional neural network and replaces them with energy- and time-preserving binary operators and population counts. RESULTS We evaluate our approach for the segmentation of the pancreas in CT. Here, our ternary approximation within a fully convolutional network leads to more than 90% memory reduction and high accuracy (without any post-processing), with a Dice overlap of 71.0% that comes close to the one obtained with networks using high-precision weights and activations. We further provide a concept for sub-second inference without GPUs and demonstrate significant improvements in comparison with binary quantisation and with a variant lacking our proposed ternary hyperbolic tangent continuation. CONCLUSIONS We present a key enabling technique for highly efficient DCNN inference without GPUs that will help bring the advances of deep learning to practical clinical applications. It also has great promise for improving accuracies in large-scale medical data retrieval.
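The replacement of floating-point multiplications by binary operators and population counts rests on a standard identity: for two ±1 vectors packed as bitmasks (bit set ↔ +1), the dot product equals n − 2·popcount(a XOR b), since XOR counts the mismatched positions. A minimal sketch of that identity (illustrative, not the authors' implementation):

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n ±1 vectors encoded as bitmasks.
    Matching bits contribute +1, mismatching bits -1, so
    dot = (n - d) - d = n - 2*d, with d = popcount(a XOR b)."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")
```

This is what lets a binary or ternary convolution swap every multiply-accumulate for an XOR plus a population count, which is the source of the CPU-only speedups reported above.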
Affiliation(s)
- Mattias P Heinrich
- Institute of Medical Informatics, University of Lübeck, Ratzeburger Allee 160, 23562, Lübeck, Germany
- Max Blendowski
- Institute of Medical Informatics, University of Lübeck, Ratzeburger Allee 160, 23562, Lübeck, Germany
- Ozan Oktay
- Biomedical Image Analysis Group, Department of Computing, Imperial College London, London, SW7 2AZ, UK