Overcoming Multi-model Forgetting
Authors: Yassine Benyahia, Kaicheng Yu, Kamil Bennani-Smires, Martin Jaggi, Anthony C. Davison, Mathieu Salzmann, Claudiu Musat
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate its effectiveness when training two models sequentially and for neural architecture search. ... WPL can reduce the forgetting effect by 99% when model A converges fully, and by 52% in the loose convergence case. ... For language modeling the perplexity decreases from 65.01 for ENAS without WPL to 61.9 with WPL. For image classification WPL yields a drop of top-1 error from 4.87% to 3.81%. |
| Researcher Affiliation | Collaboration | (1) Institute of Mathematics, EPFL; (2) Computer Vision Lab, EPFL; (3) Artificial Intelligence Lab, Swisscom; (4) Machine Learning and Optimization Lab, EPFL. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the provided text. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/kcyu2014/multimodel-forgetting. |
| Open Datasets | Yes | To test WPL in the general scenario, we used the MNIST handwritten digit recognition dataset (Le Cun & Cortes, 2010). ... For neural architecture search, we implement WPL within the efficient ENAS method of Pham et al. (2018), a state-of-the-art technique that relies on parameter sharing and corresponds to the loose convergence setting. ... Our final results on the best architecture found by the search confirm that limiting multi-model forgetting yields better results and better convergence for both language modeling (on the PTB dataset (Marcus et al., 1994)) and image classification (on the CIFAR10 dataset (Krizhevsky et al., 2009)). |
| Dataset Splits | Yes | To compute the Fisher information, we used the backward gradients of θ_s calculated on 200 images in the validation set. ... We performed two experiments: RNN cell search on the PTB dataset and CNN micro-cell search on the CIFAR10 dataset. |
| Hardware Specification | No | training the best model that was found from scratch takes around 4 GPU days. No specific GPU models, CPU types, or detailed cloud/cluster resources were mentioned. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were explicitly stated in the provided text. |
| Experiment Setup | Yes | To better satisfy our assumption that the parameters of previously-trained models should be optimal, we follow the original ENAS training strategy for n epochs, with n = 5 for RNN search and n = 3 for CNN search in our experiments. We also update the Fisher information, F_{θi}^t = (1 − η) F_{θi}^{t−1} + η (∂L/∂θi)², with η = 0.9. ... We also use a scheduled decay for α in equation (8). A hedged sketch of this Fisher update follows the table. |
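
The exponential-moving-average Fisher update quoted in the Experiment Setup row is compact enough to illustrate directly. The sketch below is not the authors' released implementation (see the repository linked above); it only shows, under assumptions, how F^t = (1 − η) F^{t−1} + η (∂L/∂θ)² could be accumulated from backward gradients on a handful of validation batches, as described in the Dataset Splits row. The toy model, loss, and random data are placeholders.

```python
import torch
import torch.nn as nn


def update_fisher(model, fisher, eta=0.9):
    """EMA estimate of the diagonal Fisher information.

    Implements F^t = (1 - eta) * F^{t-1} + eta * (dL/dtheta)^2 using the
    gradients currently stored in .grad, so call it right after backward().
    `fisher` maps parameter names to running estimates, updated in place.
    """
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        sq_grad = param.grad.detach() ** 2
        if name not in fisher:
            fisher[name] = torch.zeros_like(param)
        fisher[name].mul_(1.0 - eta).add_(eta * sq_grad)
    return fisher


# Toy stand-in for the shared parameters and validation batches; the paper
# uses backward gradients of the shared parameters on ~200 validation images.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss()
fisher = {}
for _ in range(4):  # a few fake "validation batches"
    images = torch.randn(50, 784)
    labels = torch.randint(0, 10, (50,))
    model.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    update_fisher(model, fisher, eta=0.9)
```

The per-parameter estimates collected this way are what a weight-plasticity-style penalty would weight squared parameter changes by; the decay schedule for α mentioned in the same row is not specified in the quoted text, so it is not reproduced here.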