Bayesian Adaptation of Network Depth and Width for Continual Learning
Authors: Jeevan Thapa, Rui Li
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that by continuously updating network weights and adapting network depth and width across tasks, our method achieves superior or comparable performance to existing state-of-the-art models across various continual learning benchmarks. Moreover, our approach can be readily extended to unsupervised continual learning, showcasing competitive performance compared to existing techniques. We analyze the behavior of our framework across various settings on benchmark datasets, evaluate our method in task-incremental learning using different backbone networks, and conduct an ablation study to investigate the importance of each component in our method. |
| Researcher Affiliation | Academia | College of Computing and Information Sciences, Rochester Institute of Technology, Rochester, New York, USA. Correspondence to: Rui Li <rxlics@rit.edu>. |
| Pseudocode | No | No: The paper describes algorithms and mathematical formulations in text and equations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | The link to our codebase is https://github.com/jt4812/bayes_struc_adap_cl. |
| Open Datasets | Yes | For supervised continual learning with fully connected neural networks, three datasets are used: permuted MNIST, split MNIST, and split Fashion MNIST. For convolutional neural networks, experiments are conducted on split CIFAR10-5, split CIFAR100-10, split CIFAR100-20, and split Tiny Imagenet-10. For unsupervised continual learning, two datasets are used for evaluation: MNIST and notMNIST. |
| Dataset Splits | Yes | Permuted MNIST is a 10-class classification problem in which the pixels of all task images are shuffled according to a fixed permutation. Split MNIST comprises five binary classification tasks presented sequentially: 0/1, 2/3, ..., 8/9 (a minimal construction sketch is given after the table). Split CIFAR10-5 is a sequence of five tasks from CIFAR10, while split CIFAR100-n consists of n tasks from CIFAR100; similarly, split Tiny Imagenet-10 comprises ten tasks from the Tiny Imagenet dataset. In addition to hyper-parameter search, we utilize the validation set to determine the optimal model weights during training. |
| Hardware Specification | Yes | We trained and evaluated our models on NVIDIA A100 GPUs. |
| Software Dependencies | No | No: The paper mentions using PyTorch for some functions ('The implementation for the implicit reparameterization and sampling for beta distribution is available in PyTorch (Paszke et al., 2017)'; see the sketch after the table), but it does not list specific version numbers for PyTorch or any other software libraries required to reproduce the experiments. |
| Experiment Setup | Yes | We use a batch size of 512 for all fully connected experiments, estimating the log-likelihood with 10 samples during training and 100 samples at test time. We train with the Adam optimizer at its default settings (β1 = 0.9, β2 = 0.999) and re-initialize the optimizer for each task. Additionally, we found that multiplying the KL terms by a factor of 0.1 enhances model training. The minimum pseudo-count ϵ for activated weights used to identify an activated layer is set to 0.001. For all CNN experiments, we use a batch size of 256 and set the temperature τ = 0.1, with 2 Monte Carlo samples during training and 10 samples for predictions. We use a learning rate of 0.003 for the weights in all CIFAR experiments and 0.001 for Tiny Imagenet-10, and we reduce the weight learning rate by a factor of 0.85 for every new task. For the structure, we maintain a constant learning rate of 0.2 for CIFAR10-5, CIFAR100-10, and Tiny Imagenet-10, while a learning rate of 0.5 works best for CIFAR100-20. During shared training we run 25 epochs, and for selective fine-tuning of the task-specific mask we use 15 epochs in the Alexnet experiments. (These settings are collected into a configuration sketch after the table.) |
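
The split MNIST protocol described in the Dataset Splits row can be reproduced mechanically. The sketch below is a minimal, hedged example of how the five binary tasks (0/1 through 8/9) could be built with torchvision; the helper name `make_split_mnist_tasks` is illustrative and does not come from the authors' codebase.

```python
# Minimal sketch (not from the authors' codebase): building the split MNIST
# task sequence, where each task is a binary classification problem over
# consecutive digit pairs 0/1, 2/3, ..., 8/9.
import torch
from torchvision import datasets, transforms
from torch.utils.data import Subset

def make_split_mnist_tasks(root="./data", train=True):
    """Return a list of five Subsets, one per binary task."""
    mnist = datasets.MNIST(root, train=train, download=True,
                           transform=transforms.ToTensor())
    targets = mnist.targets
    tasks = []
    for a, b in [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]:
        idx = torch.where((targets == a) | (targets == b))[0]
        tasks.append(Subset(mnist, idx.tolist()))
    return tasks

train_tasks = make_split_mnist_tasks(train=True)
print([len(t) for t in train_tasks])  # five tasks, roughly 12k examples each
```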
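
As a companion to the Software Dependencies row, here is a minimal sketch of the reparameterized (implicitly differentiable) Beta sampling that the paper attributes to PyTorch. The concentration values and the toy objective are illustrative assumptions, not the authors' model.

```python
# Hedged sketch: reparameterized sampling from a Beta distribution in PyTorch.
# The concentrations and the toy objective below are illustrative only.
import torch
from torch.distributions import Beta, kl_divergence

alpha = torch.tensor([2.0], requires_grad=True)  # variational concentration1
beta = torch.tensor([5.0], requires_grad=True)   # variational concentration0

q = Beta(alpha, beta)                         # posterior over a stick-breaking fraction
p = Beta(torch.ones(1), 3.0 * torch.ones(1))  # illustrative prior

v = q.rsample()  # implicit reparameterization: gradients flow through the sample
loss = (kl_divergence(q, p) - torch.log(v + 1e-8)).sum()  # toy objective
loss.backward()
print(alpha.grad, beta.grad)  # both concentration parameters receive gradients
```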
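
Finally, a hedged sketch gathering the Experiment Setup values quoted above into a single configuration, with a per-task Adam re-initialization helper. The names (`CONFIG`, `build_optimizer`) are illustrative and do not come from the released code; only the numeric values are taken from the paper.

```python
# Hedged sketch: the hyper-parameters quoted above collected into one place.
import torch

CONFIG = {
    "fc_batch_size": 512,            # fully connected experiments
    "fc_train_ll_samples": 10,       # samples for the log-likelihood estimate (train)
    "fc_test_ll_samples": 100,       # samples at test time
    "cnn_batch_size": 256,           # CNN experiments
    "cnn_train_mc_samples": 2,       # Monte Carlo samples during CNN training
    "cnn_test_mc_samples": 10,       # samples for predictions
    "kl_scale": 0.1,                 # multiply KL terms by 0.1
    "eps_active": 0.001,             # minimum pseudo-count for an activated layer
    "temperature": 0.1,              # tau for CNN experiments
    "weight_lr_cifar": 0.003,
    "weight_lr_tiny_imagenet": 0.001,
    "weight_lr_decay_per_task": 0.85,
    "struct_lr_default": 0.2,        # CIFAR10-5, CIFAR100-10, Tiny Imagenet-10
    "struct_lr_cifar100_20": 0.5,
    "shared_epochs": 25,             # shared training (Alexnet experiments)
    "mask_finetune_epochs": 15,      # selective fine-tuning of the task-specific mask
}

def build_optimizer(params, task_id, base_lr):
    """Re-initialize Adam for each task, decaying the weight learning rate per task."""
    lr = base_lr * CONFIG["weight_lr_decay_per_task"] ** task_id
    return torch.optim.Adam(params, lr=lr, betas=(0.9, 0.999))
```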