Overcoming Multi-model Forgetting

Authors: Yassine Benyahia, Kaicheng Yu, Kamil Bennani-Smires, Martin Jaggi, Anthony C. Davison, Mathieu Salzmann, Claudiu Musat

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate its effectiveness when training two models sequentially and for neural architecture search. ... WPL can reduce the forgetting effect by 99% when model A converges fully, and by 52% in the loose convergence case. ... For language modeling the perplexity decreases from 65.01 for ENAS without WPL to 61.9 with WPL. For image classification WPL yields a drop of top-1 error from 4.87% to 3.81%.
Researcher Affiliation | Collaboration | (1) Institute of Mathematics, EPFL; (2) Computer Vision Lab, EPFL; (3) Artificial Intelligence Lab, Swisscom; (4) Machine Learning and Optimization Lab, EPFL.
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the provided text.
Open Source Code | Yes | Our code is publicly available at https://github.com/kcyu2014/multimodel-forgetting.
Open Datasets | Yes | To test WPL in the general scenario, we used the MNIST handwritten digit recognition dataset (Le Cun & Cortes, 2010). ... For neural architecture search, we implement WPL within the efficient ENAS method of Pham et al. (2018), a state-of-the-art technique that relies on parameter sharing and corresponds to the loose convergence setting. ... Our final results on the best architecture found by the search confirm that limiting multi-model forgetting yields better results and better convergence for both language modeling (on the PTB dataset (Marcus et al., 1994)) and image classification (on the CIFAR10 dataset (Krizhevsky et al., 2009)).
Dataset Splits | Yes | To compute the Fisher information, we used the backward gradients of θ_s calculated on 200 images in the validation set. ... We performed two experiments: RNN cell search on the PTB dataset and CNN micro-cell search on the CIFAR10 dataset.
Hardware Specification | No | Training the best model that was found from scratch takes around 4 GPU days. No specific GPU models, CPU types, or detailed cloud/cluster resources were mentioned.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were explicitly stated in the provided text.
Experiment Setup | Yes | To better satisfy our assumption that the parameters of previously-trained models should be optimal, we follow the original ENAS training strategy for n epochs, with n = 5 for RNN search and n = 3 for CNN search in our experiments. We also update the Fisher information, $F_{\theta_i}^{t} = (1-\eta)\,F_{\theta_i}^{t-1} + \eta\,(\partial L/\partial \theta_i)^2$, with η = 0.9. ... We also use a scheduled decay for α in equation (8).
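
The Dataset Splits row states that the Fisher information is estimated from backward gradients of the shared parameters θ_s on 200 validation images. Below is a minimal PyTorch-style sketch of that estimation step, assuming a standard supervised model and data loader; the name estimate_diagonal_fisher and the arguments val_loader and loss_fn are illustrative and not taken from the released code.

```python
import torch

def estimate_diagonal_fisher(model, val_loader, loss_fn, n_images=200, device="cpu"):
    """Accumulate squared backward gradients over ~n_images validation examples
    as a diagonal Fisher estimate for the model's trainable parameters."""
    fisher = {name: torch.zeros_like(p)
              for name, p in model.named_parameters() if p.requires_grad}
    seen = 0
    model.eval()
    for inputs, targets in val_loader:
        if seen >= n_images:
            break
        inputs, targets = inputs.to(device), targets.to(device)
        model.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                fisher[name] += p.grad.detach() ** 2
        seen += inputs.size(0)
    # Normalise the accumulated squared gradients by the number of images used.
    return {name: f / max(seen, 1) for name, f in fisher.items()}
```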
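
The Experiment Setup row quotes the running Fisher update with η = 0.9 and mentions a scheduled decay for the WPL weight α without giving the exact schedule. The sketch below shows one way to implement both under those assumptions; update_fisher_ema and alpha_schedule are illustrative names, and the exponential decay chosen for α is a hypothetical stand-in for the unspecified schedule.

```python
def update_fisher_ema(fisher_prev, model, eta=0.9):
    """Running Fisher update F_t = (1 - eta) * F_{t-1} + eta * (dL/dtheta)^2,
    applied to the gradients currently stored on the model's parameters."""
    fisher_new = dict(fisher_prev)
    for name, p in model.named_parameters():
        if name in fisher_prev and p.grad is not None:
            fisher_new[name] = (1.0 - eta) * fisher_prev[name] + eta * p.grad.detach() ** 2
    return fisher_new


def alpha_schedule(step, alpha0=1.0, decay=0.95):
    # Hypothetical exponential decay for the WPL weight alpha; the excerpt only
    # says a scheduled decay is used, not which schedule.
    return alpha0 * decay ** step
```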