publications
(*) denotes equal contribution
2024
- AMMD: Attentive maximum mean discrepancy for few-shot image classification. Ji Wu, Shipeng Wang*, and Jian Sun. Pattern Recognition.
Metric-based methods have attained promising performance for few-shot image classification. Maximum Mean Discrepancy (MMD) is a typical distance between distributions, which requires computing expectations with respect to the data distributions. In this paper, we propose Attentive Maximum Mean Discrepancy (AMMD) to measure the distances between query images and support classes for few-shot classification. Each query image is classified into the support class with the minimal AMMD distance. The proposed AMMD assists MMD with distributions adaptively estimated by an Attention-based Distribution Generation Module (ADGM). ADGM is learned to put more mass on more discriminative features, which makes the proposed AMMD distance emphasize discriminative features and overlook spurious ones. Extensive experiments show that our AMMD achieves competitive or state-of-the-art performance on multiple few-shot classification benchmark datasets.
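To make the distance concrete, below is a minimal sketch of an attention-weighted MMD between a query image's features and a support class's features, assuming a Gaussian kernel and random placeholder attention weights standing in for the ADGM (the module itself is not reproduced here):

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel matrix between two sets of feature vectors."""
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def weighted_mmd2(query_feats, support_feats, w_q, w_s, sigma=1.0):
    """Squared MMD between two empirical distributions whose sample
    weights (here: placeholder attention weights) sum to one."""
    k_qq = gaussian_kernel(query_feats, query_feats, sigma)
    k_ss = gaussian_kernel(support_feats, support_feats, sigma)
    k_qs = gaussian_kernel(query_feats, support_feats, sigma)
    return (w_q @ k_qq @ w_q) + (w_s @ k_ss @ w_s) - 2 * (w_q @ k_qs @ w_s)

# Toy usage: classify one query into the support class with minimal distance.
torch.manual_seed(0)
query = torch.randn(5, 64)                         # 5 local features of one query image
classes = [torch.randn(25, 64) for _ in range(3)]  # 3 support classes, 25 features each
w_q = torch.softmax(torch.randn(5), dim=0)         # placeholder for ADGM attention weights
dists = []
for feats in classes:
    w_s = torch.softmax(torch.randn(feats.shape[0]), dim=0)
    dists.append(weighted_mmd2(query, feats, w_q, w_s))
pred = int(torch.stack(dists).argmin())
print("predicted class:", pred)
```

With uniform weights this reduces to the standard empirical MMD; the intent of the attention weights is to shift mass toward discriminative features before the distance is computed.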
2023
- Variational Data-Free Knowledge Distillation for Continual Learning. Xiaorong Li, Shipeng Wang, Jian Sun, and Zongben Xu. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Deep neural networks suffer from catastrophic forgetting when trained on sequential tasks in continual learning. Various methods rely on storing data of previous tasks to mitigate catastrophic forgetting, which is prohibited in real-world applications owing to privacy and security concerns. In this paper, we consider a realistic setting of continual learning, where training data of previous tasks are unavailable and memory resources are limited. We contribute a novel knowledge distillation-based method in an information-theoretic framework by maximizing the mutual information between the outputs of the previously learned and the current networks. Because the mutual information is intractable to compute, we instead maximize its variational lower bound, where the covariance of the variational distribution is modeled by a graph convolutional network. The inaccessibility of data of previous tasks is tackled by Taylor expansion, yielding a novel regularizer in the network training loss for continual learning. The regularizer relies on compressed gradients of the network parameters and avoids storing previous task data or previously learned networks. Additionally, we employ a self-supervised learning technique to learn effective features, which improves the performance of continual learning. We conduct extensive experiments, including image classification and semantic segmentation, and the results show that our method achieves state-of-the-art performance on continual learning benchmarks.
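As a point of reference for the information-theoretic framing, a generic variational lower bound on mutual information (the Barber-Agakov bound) is sketched below in LaTeX; the paper's specific Gaussian variational distribution with a GCN-modeled covariance, and the subsequent Taylor-expansion regularizer, are not reproduced here:

```latex
% Variational lower bound on the mutual information between the outputs
% y_o of the previously learned network and y_c of the current network:
\begin{align}
I(y_o; y_c) &= H(y_o) - H(y_o \mid y_c) \\
            &\ge H(y_o) + \mathbb{E}_{p(y_o,\, y_c)}\!\left[ \log q(y_o \mid y_c) \right],
\end{align}
% valid for any variational distribution q, since the gap equals
% \mathbb{E}_{y_c}\, \mathrm{KL}\!\left( p(y_o \mid y_c) \,\|\, q(y_o \mid y_c) \right) \ge 0;
% maximizing the bound over q tightens it toward the true mutual information.
```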
- Memory efficient data-free distillation for continual learning. Xiaorong Li, Shipeng Wang*, Jian Sun, and Zongben Xu. Pattern Recognition.
Deep neural networks suffer from the catastrophic forgetting phenomenon when trained on sequential tasks in continual learning, especially when data from previous tasks are unavailable. To mitigate catastrophic forgetting, various methods either store data from previous tasks, which may raise privacy concerns, or require large memory storage. In particular, distillation-based methods mitigate catastrophic forgetting by using proxy datasets; however, proxy datasets may not match the distributions of the original datasets of previous tasks. To address these problems in a setting where the full training data of previous tasks are unavailable and memory resources are limited, we propose a novel data-free distillation method. Our method encodes the knowledge of previous tasks into network parameter gradients via Taylor expansion, yielding a regularizer in the network training loss that relies on these gradients. To improve memory efficiency, we design an approach to compress the gradients in the regularizer. Moreover, we theoretically analyze the approximation error of our method. Experimental results on multiple datasets demonstrate that the proposed method outperforms existing approaches in continual learning.
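Below is a rough sketch of the general shape of a gradient-based, data-free regularizer of this kind, assuming a first-order Taylor expansion around the previous-task parameters and a simple top-k magnitude compression; both choices are illustrative placeholders, not the paper's actual scheme:

```python
import torch

def compress_topk(grad, k):
    """Keep only the k largest-magnitude gradient entries (placeholder
    compression; the paper's compression approach may differ)."""
    flat = grad.flatten()
    idx = flat.abs().topk(k).indices
    return idx, flat[idx]

def taylor_regularizer(params, params_old, stored):
    """First-order Taylor term |g_old^T (theta - theta_old)| accumulated
    over layers, using only the compressed gradient entries."""
    reg = 0.0
    for p, p_old, (idx, g_vals) in zip(params, params_old, stored):
        delta = (p - p_old).flatten()[idx]
        reg = reg + (g_vals * delta).sum().abs()
    return reg

# Toy usage with two "layers".
torch.manual_seed(0)
params_old = [torch.randn(100), torch.randn(50)]
grads_old = [torch.randn(100), torch.randn(50)]       # gradients saved after the previous task
stored = [compress_topk(g, k=10) for g in grads_old]  # memory-efficient storage
params = [p + 0.01 * torch.randn_like(p) for p in params_old]
print(float(taylor_regularizer(params, params_old, stored)))
```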
2022
- Variational HyperAdam: A Meta-Learning Approach to Network Training. Shipeng Wang, Yan Yang, Jian Sun, and Zongben Xu. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Stochastic optimization algorithms have been popular for training deep neural networks. Recently, learning-based optimizers have emerged as a new approach and have achieved promising performance for training neural networks. However, these black-box learning-based optimizers do not fully take advantage of the experience embodied in human-designed optimizers and rely heavily on learning from meta-training tasks, and therefore have limited generalization ability. In this paper, we propose a novel optimizer, dubbed Variational HyperAdam, which is based on a parametric generalized Adam algorithm, i.e., HyperAdam, in a variational framework. With Variational HyperAdam as the optimizer for training a neural network, the parameter update vector of the network at each training step is treated as a random variable, whose approximate posterior distribution given the training data and the current network parameters is predicted by Variational HyperAdam. The parameter update vector for network training is sampled from this approximate posterior distribution. Specifically, in Variational HyperAdam, we design a learnable generalized Adam algorithm for estimating the expectation, paired with a VarBlock for estimating the variance, of the approximate posterior distribution of the parameter update vector. Variational HyperAdam is learned in a meta-learning manner with a meta-training loss derived by variational inference. Experiments verify that the learned Variational HyperAdam achieves state-of-the-art network training performance for various types of networks on different datasets, such as multilayer perceptrons, CNNs, LSTMs and ResNets.
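A toy sketch of the underlying idea, treating the update as a random variable, is given below: an Adam-style mean is paired with a fixed placeholder variance, and the applied update is sampled from the resulting Gaussian; the learned components of Variational HyperAdam (the generalized Adam algorithm and the VarBlock) are not reproduced:

```python
import torch

def adam_mean(grad, state, beta1=0.9, beta2=0.999, lr=1e-3, eps=1e-8):
    """Adam-style update direction (bias correction omitted for brevity),
    serving as the mean of the approximate posterior over the update."""
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    return -lr * state["m"] / (state["v"].sqrt() + eps)

def sample_update(grad, state, log_var):
    """Sample the update from N(mean, diag(exp(log_var))); in Variational
    HyperAdam the mean and variance would be predicted by learned modules."""
    mean = adam_mean(grad, state)
    std = (0.5 * log_var).exp()
    return mean + std * torch.randn_like(mean)

# Toy usage on a quadratic objective.
torch.manual_seed(0)
theta = torch.randn(10, requires_grad=True)
state = {"m": torch.zeros(10), "v": torch.zeros(10)}
log_var = torch.full((10,), -12.0)   # fixed placeholder variance, not a learned VarBlock
for _ in range(2000):
    loss = (theta ** 2).sum()
    grad, = torch.autograd.grad(loss, theta)
    with torch.no_grad():
        theta += sample_update(grad, state, log_var)
print(float((theta ** 2).sum()))     # expected to be close to zero after training
```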
2021
- Training Networks in Null Space of Feature Covariance for Continual Learning. Shipeng Wang, Xiaorong Li, Jian Sun, and Zongben Xu. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Oral Presentation [Top 4%].
In the setting of continual learning, a network is trained on a sequence of tasks and suffers from catastrophic forgetting. To balance the plasticity and stability of the network in continual learning, in this paper we propose a novel network training algorithm called Adam-NSCL, which sequentially optimizes network parameters in the null space of previous tasks. We first propose two mathematical conditions for achieving network stability and plasticity in continual learning, respectively. Based on them, network training for sequential tasks can be achieved simply by projecting the candidate parameter update into the approximate null space of all previous tasks during training, where the candidate parameter update can be generated by Adam. The approximate null space can be derived by applying singular value decomposition to the uncentered covariance matrix of all input features of previous tasks for each linear layer. For efficiency, the uncentered covariance matrix can be computed incrementally after learning each task. We also empirically verify the rationality of the approximate null space at each linear layer. We apply our approach to training networks for continual learning on the benchmark datasets CIFAR-100 and TinyImageNet, and the results suggest that the proposed approach outperforms or matches the state-of-the-art continual learning approaches.
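A compact sketch of the projection step described above, assuming the uncentered input-feature covariance of previous tasks is available for a linear layer: singular directions with near-zero singular values span the approximate null space, and the candidate update is projected onto them.

```python
import torch

def null_space_projector(feature_cov, eps=1e-3):
    """Build a projector onto the approximate null space of the uncentered
    input-feature covariance (directions with near-zero singular values)."""
    U, S, _ = torch.linalg.svd(feature_cov)
    null_basis = U[:, S < eps * S.max()]   # columns spanning the approximate null space
    return null_basis @ null_basis.T

# Toy usage for one linear layer with input dimension d.
torch.manual_seed(0)
d, n = 32, 200
feats = torch.randn(n, d) @ torch.diag(torch.linspace(1.0, 0.0, d))  # low-rank-ish features
cov = feats.T @ feats / n                 # uncentered covariance, updatable incrementally
P = null_space_projector(cov)
candidate_update = torch.randn(64, d)     # e.g. an Adam step for a (64 x d) weight matrix
projected_update = candidate_update @ P   # constrain the step to the approximate null space
# Outputs on previous-task features are (approximately) unchanged by the projected step:
print(float((feats @ projected_update.T).abs().max()))
```

Because the projected step lies approximately in the null space of the previous tasks' features, the layer's outputs on those features are left approximately unchanged, which corresponds to the stability condition mentioned above.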
2019
- HyperAdam: A Learnable Task-Adaptive Adam for Network Training. Shipeng Wang, Jian Sun, and Zongben Xu. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). Poster Spotlight.
Deep neural networks are traditionally trained using human-designed stochastic optimization algorithms, such as SGD and Adam. Recently, the approach of learning to optimize network parameters has emerged as a promising research topic. However, these learned black-box optimizers sometimes do not fully utilize the experience embodied in human-designed optimizers, and therefore have limited generalization ability. In this paper, a new optimizer, dubbed HyperAdam, is proposed that combines the idea of “learning to optimize” with the traditional Adam optimizer. Given a network for training, the parameter update generated by HyperAdam in each iteration is an adaptive combination of multiple updates generated by Adam with varying decay rates. The combination weights and decay rates in HyperAdam are adaptively learned depending on the task. HyperAdam is modeled as a recurrent neural network with an AdamCell, a WeightCell and a StateCell. It is shown to achieve state-of-the-art performance for training various networks, such as multilayer perceptrons, CNNs and LSTMs.
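A toy sketch of the combination idea follows: several Adam-style candidate updates with different decay-rate pairs are mixed by weights; in HyperAdam both the mixing weights and the decay rates would be produced per step by the learned cells (AdamCell, WeightCell, StateCell), which are replaced here by fixed placeholders:

```python
import torch

def adam_candidate(grad, state, beta1, beta2, lr=1e-3, eps=1e-8):
    """One Adam-style candidate update for a given pair of decay rates
    (bias correction omitted for brevity)."""
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    return -lr * state["m"] / (state["v"].sqrt() + eps)

# Candidate decay-rate pairs and mixing weights; in HyperAdam these would be
# predicted at every step by the learned cells rather than fixed.
decay_pairs = [(0.5, 0.99), (0.9, 0.999), (0.99, 0.9999)]
weights = torch.softmax(torch.zeros(len(decay_pairs)), dim=0)  # uniform placeholder

# Toy usage: minimize ||theta||^2 with the combined update.
torch.manual_seed(0)
theta = torch.randn(10)
states = [{"m": torch.zeros(10), "v": torch.zeros(10)} for _ in decay_pairs]
for _ in range(2000):
    grad = 2 * theta                       # gradient of the toy loss
    candidates = [adam_candidate(grad, s, b1, b2)
                  for s, (b1, b2) in zip(states, decay_pairs)]
    theta = theta + sum(w * c for w, c in zip(weights, candidates))
print(float((theta ** 2).sum()))           # expected to be close to zero
```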