Overcoming Multi-model Forgetting
Authors: Yassine Benyahia, Kaicheng Yu, Kamil Bennani-Smires, Martin Jaggi, Anthony C. Davison, Mathieu Salzmann, Claudiu Musat
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate its effectiveness when training two models sequentially and for neural architecture search. ... WPL can reduce the forgetting effect by 99% when model A converges fully, and by 52% in the loose convergence case. ... For language modeling the perplexity decreases from 65.01 for ENAS without WPL to 61.9 with WPL. For image classification WPL yields a drop of top-1 error from 4.87% to 3.81%. |
| Researcher Affiliation | Collaboration | (1) Institute of Mathematics, EPFL; (2) Computer Vision Lab, EPFL; (3) Artificial Intelligence Lab, Swisscom; (4) Machine Learning and Optimization Lab, EPFL. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the provided text. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/kcyu2014/multimodel-forgetting. |
| Open Datasets | Yes | To test WPL in the general scenario, we used the MNIST handwritten digit recognition dataset (Le Cun & Cortes, 2010). ... For neural architecture search, we implement WPL within the efficient ENAS method of Pham et al. (2018), a state-of-the-art technique that relies on parameter sharing and corresponds to the loose convergence setting. ... Our final results on the best architecture found by the search confirm that limiting multi-model forgetting yields better results and better convergence for both language modeling (on the PTB dataset (Marcus et al., 1994)) and image classification (on the CIFAR10 dataset (Krizhevsky et al., 2009)). |
| Dataset Splits | Yes | To compute the Fisher information, we used the backward gradients of θ_s calculated on 200 images in the validation set. ... We performed two experiments: RNN cell search on the PTB dataset and CNN micro-cell search on the CIFAR10 dataset. |
| Hardware Specification | No | training the best model that was found from scratch takes around 4 GPU days. No specific GPU models, CPU types, or detailed cloud/cluster resources were mentioned. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were explicitly stated in the provided text. |
| Experiment Setup | Yes | To better satisfy our assumption that the parameters of previously-trained models should be optimal, we follow the original ENAS training strategy for n epochs, with n = 5 for RNN search and n = 3 for CNN search in our experiments. We also update the Fisher information, F_{θi}^t = (1 − η) F_{θi}^{t−1} + η (∂L/∂θi)², with η = 0.9. ... We also use a scheduled decay for α in equation (8). A hedged sketch of this Fisher update follows the table. |
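
The exponential-moving-average Fisher update quoted in the Experiment Setup row is compact enough to illustrate directly. The sketch below is not the authors' released implementation (see the repository linked above); it only shows, under assumptions, how F^t = (1 − η) F^{t−1} + η (∂L/∂θ)² could be accumulated from backward gradients on a handful of validation batches, as described in the Dataset Splits row. The toy model, loss, and random data are placeholders.

```python
import torch
import torch.nn as nn


def update_fisher(model, fisher, eta=0.9):
    """EMA estimate of the diagonal Fisher information.

    Implements F^t = (1 - eta) * F^{t-1} + eta * (dL/dtheta)^2 using the
    gradients currently stored in .grad, so call it right after backward().
    `fisher` maps parameter names to running estimates, updated in place.
    """
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        sq_grad = param.grad.detach() ** 2
        if name not in fisher:
            fisher[name] = torch.zeros_like(param)
        fisher[name].mul_(1.0 - eta).add_(eta * sq_grad)
    return fisher


# Toy stand-in for the shared parameters and validation batches; the paper
# uses backward gradients of the shared parameters on ~200 validation images.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss()
fisher = {}
for _ in range(4):  # a few fake "validation batches"
    images = torch.randn(50, 784)
    labels = torch.randint(0, 10, (50,))
    model.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    update_fisher(model, fisher, eta=0.9)
```

The per-parameter estimates collected this way are what a weight-plasticity-style penalty would weight squared parameter changes by; the decay schedule for α mentioned in the same row is not specified in the quoted text, so it is not reproduced here.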