Careful with that Scalpel: Improving Gradient Surgery with an EMA

Authors: Yu-Guan Hsieh, James Thornton, Eugene Ndiaye, Michal Klein, Marco Cuturi, Pierre Ablin

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we demonstrate the effectiveness of Bloop via numerical experiments on problems of three distinct categories: the use of auxiliary loss for imposing an explicit bias, multi-task learning, and joint dataset training. For each of these experiments, we use an optimizer with hyperparameters that work well for the minimization of solely the main loss, and never change these hyperparameters.
Researcher Affiliation | Industry | Apple. Correspondence to: Pierre Ablin <p_ablin@apple.com>.
Pseudocode | Yes | Algorithm 1: The Bloop algorithm. (A hedged code sketch of an EMA-based update of this kind appears after the table.)
Open Source Code | No | The paper does not provide an explicit statement that the code is open-sourced, nor a link to a code repository for the described methodology.
Open Datasets | Yes | For this, we use the MNIST dataset (LeCun et al., 2010) and an MLP of two hidden layers. ... For ImageNet training, we employ SGD with a batch size of 2048... ...construct a Cifar10Mnist dataset by overlapping digits from MNIST on images from CIFAR-10 (Krizhevsky et al., 2009)... ...30M examples from the c4 dataset (Raffel et al., 2020), while the auxiliary loss corresponds to 20K examples from the RCV-1 dataset (Lewis et al., 2004). ...Paracrawl dataset (Bañón et al., 2020), with 36m sentence pairs... ...WMT dataset, yielding 10k sentence pairs (Farhad et al., 2021). (An illustrative Cifar10Mnist construction sketch follows the table.)
Dataset Splits | No | The paper uses various datasets but does not provide specific train/validation/test splits (e.g., percentages or sample counts) needed for reproduction. For instance, it states 'we construct a Cifar10Mnist dataset' but does not specify the splits used.
Hardware Specification | No | The paper describes network architectures, optimizers, and training parameters in Appendix B but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for the experiments.
Software Dependencies | No | The paper mentions using 'optax-like notations' and that 'Our implementation is derived from the flax example', but it does not specify version numbers for these or other software dependencies.
Experiment Setup | Yes | In this section we report the missing details from Section 5. ... All the methods are trained with Adam optimizer at learning rate of 3e-4 for 100 epochs and a cosine learning rate schedule. For consistency with the other classification experiments we also include 5 epochs of warm-up. The batch size is fixed at 256... For ImageNet training, we employ SGD with a batch size of 2048, Nesterov momentum of 0.9, and a learning rate of 0.8. ...We train the model for 300K iterations. ...We train the model for 500K iterations. (An optax-style sketch of these optimizer settings follows the table.)
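The Pseudocode row above refers to Algorithm 1, the Bloop algorithm, which is not reproduced in this report. The snippet below is only a minimal sketch of an EMA-based gradient-surgery update of the kind the title suggests, assuming the auxiliary gradient is projected orthogonally to an exponential moving average of the main-loss gradient before being added to the update direction; the name bloop_update, the hyperparameters ema_decay and lam, and the exact projection rule are assumptions, not the paper's verbatim algorithm.

```python
# Hypothetical sketch of an EMA-based gradient-surgery step (not the paper's
# verbatim Algorithm 1). The auxiliary gradient is projected orthogonally to
# an exponential moving average (EMA) of the main-loss gradient, then added
# to the main gradient with weight `lam`.
import jax.numpy as jnp


def bloop_update(g_main, g_aux, ema, ema_decay=0.99, lam=0.1, eps=1e-12):
    """One assumed Bloop-style direction; returns (update_direction, new_ema)."""
    # Update the EMA of the main-loss gradient.
    new_ema = ema_decay * ema + (1.0 - ema_decay) * g_main
    # Remove from g_aux its component along the EMA direction.
    coef = jnp.vdot(g_aux, new_ema) / (jnp.vdot(new_ema, new_ema) + eps)
    g_aux_orth = g_aux - coef * new_ema
    # Combine: main gradient plus the orthogonalized auxiliary gradient.
    return g_main + lam * g_aux_orth, new_ema


# Toy usage on flat parameter vectors.
g_main = jnp.array([1.0, 0.0])
g_aux = jnp.array([0.5, 0.5])
ema = jnp.zeros(2)
direction, ema = bloop_update(g_main, g_aux, ema)
```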
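The Open Datasets row quotes the construction of a Cifar10Mnist dataset by overlapping MNIST digits on CIFAR-10 images. The quoted excerpt does not give the compositing rule, so the sketch below is only one plausible reading: pad the 28x28 digit to 32x32, broadcast it to three channels, and take a pixel-wise maximum. The function overlay_digit and the random stand-in arrays are illustrative only.

```python
# Plausible illustration of overlaying MNIST digits onto CIFAR-10 images to
# form a "Cifar10Mnist"-style dataset. The compositing rule (pad to 32x32,
# broadcast to 3 channels, pixel-wise maximum) is an assumption, not the
# paper's exact recipe.
import numpy as np


def overlay_digit(cifar_img, mnist_digit):
    """cifar_img: (32, 32, 3) floats in [0, 1]; mnist_digit: (28, 28) floats in [0, 1]."""
    # Pad the 28x28 digit to 32x32 and replicate it across the 3 color channels.
    digit = np.pad(mnist_digit, ((2, 2), (2, 2)))
    digit = np.repeat(digit[:, :, None], 3, axis=2)
    # Overlay by taking the pixel-wise maximum, so the digit stays visible.
    return np.maximum(cifar_img, digit)


# Toy usage with random stand-ins for real CIFAR-10 / MNIST samples.
rng = np.random.default_rng(0)
cifar_img = rng.uniform(size=(32, 32, 3))
mnist_digit = rng.uniform(size=(28, 28))
combined = overlay_digit(cifar_img, mnist_digit)  # shape (32, 32, 3)
```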
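As one concrete reading of the Experiment Setup row, the quoted classification recipe (Adam at a 3e-4 learning rate, 100 epochs, cosine schedule with 5 warm-up epochs, batch size 256) and the ImageNet recipe (SGD, batch size 2048, Nesterov momentum 0.9, learning rate 0.8) could be expressed with optax, in line with the 'optax-like notations' the paper mentions. The dataset size, and therefore the step counts, are placeholders; the quoted excerpt does not tie epoch counts to step counts.

```python
# Hedged sketch of the reported optimizer settings in optax. The dataset size
# below is a placeholder, not a value from the paper.
import optax

num_examples = 50_000          # placeholder; not specified in the quoted excerpt
batch_size = 256
steps_per_epoch = num_examples // batch_size
num_epochs, warmup_epochs = 100, 5

# Cosine learning-rate schedule with 5 warm-up epochs, peaking at 3e-4.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-4,
    warmup_steps=warmup_epochs * steps_per_epoch,
    decay_steps=num_epochs * steps_per_epoch,
)
classification_opt = optax.adam(learning_rate=schedule)

# The quoted ImageNet recipe: SGD, batch size 2048, Nesterov momentum 0.9,
# learning rate 0.8 (no schedule details beyond these numbers are quoted).
imagenet_opt = optax.sgd(learning_rate=0.8, momentum=0.9, nesterov=True)
```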