Careful with that Scalpel: Improving Gradient Surgery with an EMA

Authors: Yu-Guan Hsieh, James Thornton, Eugene Ndiaye, Michal Klein, Marco Cuturi, Pierre Ablin

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we demonstrate the effectiveness of Bloop via numerical experiments on problems of three distinct categories: the use of auxiliary loss for imposing an explicit bias, multi-task learning, and joint dataset training. For each of these experiments, we use an optimizer with hyperparameters that work well for the minimization of solely the main loss, and never change these hyperparameters.
Researcher Affiliation | Industry | Apple. Correspondence to: Pierre Ablin <p_ablin@apple.com>.
Pseudocode | Yes | Algorithm 1: The Bloop algorithm. (A hedged code sketch of an EMA-based update of this kind appears after the table.)
Open Source Code | No | The paper does not provide an explicit statement that the code is open-sourced, nor a link to a code repository for the described methodology.
Open Datasets | Yes | For this, we use the MNIST dataset (LeCun et al., 2010) and an MLP of two hidden layers. ... For ImageNet training, we employ SGD with a batch size of 2048... ...construct a Cifar10Mnist dataset by overlapping digits from MNIST on images from CIFAR-10 (Krizhevsky et al., 2009)... ...30M examples from the c4 dataset (Raffel et al., 2020), while the auxiliary loss corresponds to 20K examples from the RCV-1 dataset (Lewis et al., 2004). ...Paracrawl dataset (Bañón et al., 2020), with 36m sentence pairs... ...WMT dataset, yielding 10k sentence pairs (Farhad et al., 2021). (An illustrative Cifar10Mnist construction sketch follows the table.)
Dataset Splits | No | The paper uses various datasets but does not provide specific train/validation/test splits (e.g., percentages or sample counts) needed for reproduction. For instance, it states 'we construct a Cifar10Mnist dataset' but does not specify the splits used.
Hardware Specification | No | The paper describes network architectures, optimizers, and training parameters in Appendix B but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for the experiments.
Software Dependencies | No | The paper mentions using 'optax-like notations' and that 'Our implementation is derived from the flax example', but it does not specify version numbers for these or other software dependencies.
Experiment Setup | Yes | In this section we report the missing details from Section 5. ... All the methods are trained with Adam optimizer at learning rate of 3e-4 for 100 epochs and a cosine learning rate schedule. For consistency with the other classification experiments we also include 5 epochs of warm-up. The batch size is fixed at 256... For ImageNet training, we employ SGD with a batch size of 2048, Nesterov momentum of 0.9, and a learning rate of 0.8. ...We train the model for 300K iterations. ...We train the model for 500K iterations. (An optax-style sketch of these optimizer settings follows the table.)
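The Pseudocode row above refers to Algorithm 1, the Bloop algorithm, which is not reproduced in this report. The snippet below is only a minimal sketch of an EMA-based gradient-surgery update of the kind the title suggests, assuming the auxiliary gradient is projected orthogonally to an exponential moving average of the main-loss gradient before being added to the update direction; the name bloop_update, the hyperparameters ema_decay and lam, and the exact projection rule are assumptions, not the paper's verbatim algorithm.

```python
# Hypothetical sketch of an EMA-based gradient-surgery step (not the paper's
# verbatim Algorithm 1). The auxiliary gradient is projected orthogonally to
# an exponential moving average (EMA) of the main-loss gradient, then added
# to the main gradient with weight `lam`.
import jax.numpy as jnp


def bloop_update(g_main, g_aux, ema, ema_decay=0.99, lam=0.1, eps=1e-12):
    """One assumed Bloop-style direction; returns (update_direction, new_ema)."""
    # Update the EMA of the main-loss gradient.
    new_ema = ema_decay * ema + (1.0 - ema_decay) * g_main
    # Remove from g_aux its component along the EMA direction.
    coef = jnp.vdot(g_aux, new_ema) / (jnp.vdot(new_ema, new_ema) + eps)
    g_aux_orth = g_aux - coef * new_ema
    # Combine: main gradient plus the orthogonalized auxiliary gradient.
    return g_main + lam * g_aux_orth, new_ema


# Toy usage on flat parameter vectors.
g_main = jnp.array([1.0, 0.0])
g_aux = jnp.array([0.5, 0.5])
ema = jnp.zeros(2)
direction, ema = bloop_update(g_main, g_aux, ema)
```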
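The Open Datasets row quotes the construction of a Cifar10Mnist dataset by overlapping MNIST digits on CIFAR-10 images. The quoted excerpt does not give the compositing rule, so the sketch below is only one plausible reading: pad the 28x28 digit to 32x32, broadcast it to three channels, and take a pixel-wise maximum. The function overlay_digit and the random stand-in arrays are illustrative only.

```python
# Plausible illustration of overlaying MNIST digits onto CIFAR-10 images to
# form a "Cifar10Mnist"-style dataset. The compositing rule (pad to 32x32,
# broadcast to 3 channels, pixel-wise maximum) is an assumption, not the
# paper's exact recipe.
import numpy as np


def overlay_digit(cifar_img, mnist_digit):
    """cifar_img: (32, 32, 3) floats in [0, 1]; mnist_digit: (28, 28) floats in [0, 1]."""
    # Pad the 28x28 digit to 32x32 and replicate it across the 3 color channels.
    digit = np.pad(mnist_digit, ((2, 2), (2, 2)))
    digit = np.repeat(digit[:, :, None], 3, axis=2)
    # Overlay by taking the pixel-wise maximum, so the digit stays visible.
    return np.maximum(cifar_img, digit)


# Toy usage with random stand-ins for real CIFAR-10 / MNIST samples.
rng = np.random.default_rng(0)
cifar_img = rng.uniform(size=(32, 32, 3))
mnist_digit = rng.uniform(size=(28, 28))
combined = overlay_digit(cifar_img, mnist_digit)  # shape (32, 32, 3)
```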
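As one concrete reading of the Experiment Setup row, the quoted classification recipe (Adam at a 3e-4 learning rate, 100 epochs, cosine schedule with 5 warm-up epochs, batch size 256) and the ImageNet recipe (SGD, batch size 2048, Nesterov momentum 0.9, learning rate 0.8) could be expressed with optax, in line with the 'optax-like notations' the paper mentions. The dataset size, and therefore the step counts, are placeholders; the quoted excerpt does not tie epoch counts to step counts.

```python
# Hedged sketch of the reported optimizer settings in optax. The dataset size
# below is a placeholder, not a value from the paper.
import optax

num_examples = 50_000          # placeholder; not specified in the quoted excerpt
batch_size = 256
steps_per_epoch = num_examples // batch_size
num_epochs, warmup_epochs = 100, 5

# Cosine learning-rate schedule with 5 warm-up epochs, peaking at 3e-4.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-4,
    warmup_steps=warmup_epochs * steps_per_epoch,
    decay_steps=num_epochs * steps_per_epoch,
)
classification_opt = optax.adam(learning_rate=schedule)

# The quoted ImageNet recipe: SGD, batch size 2048, Nesterov momentum 0.9,
# learning rate 0.8 (no schedule details beyond these numbers are quoted).
imagenet_opt = optax.sgd(learning_rate=0.8, momentum=0.9, nesterov=True)
```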