Gap-Aware Mitigation of Gradient Staleness
Authors: Saar Barkai, Ido Hakimi, Assaf Schuster
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation on the CIFAR, ImageNet, and WikiText-103 datasets shows that GA outperforms the currently acceptable gradient penalization method, in final test accuracy. We also provide convergence rate proof for GA. |
| Researcher Affiliation | Academia | Saar Barkai, Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel (saarbarkai@gmail.com); Ido Hakimi, Department of Computer Science, Technion - Israel Institute of Technology, Haifa, Israel (idohakimi@gmail.com); Assaf Schuster, Department of Computer Science, Technion - Israel Institute of Technology, Haifa, Israel (assaf@sc.technion.ac.il) |
| Pseudocode | Yes | Algorithm 1 (Momentum-ASGD: worker i); Algorithm 2 (Momentum-ASGD: master); Algorithm 3 (Staleness-Aware: master); Algorithm 4 (Gap-Aware: master). A hedged sketch of the Gap-Aware master update is given after the table. |
| Open Source Code | Yes | Appendix A (Source Code): The source code of DANA-Gap-Aware is provided via: DOWNLOAD LINK |
| Open Datasets | Yes | To validate our claims, we performed experiments on the CIFAR10, CIFAR100 (Krizhevsky, 2012), ImageNet (Russakovsky et al., 2015), and WikiText-103 (Merity et al., 2016) datasets, using several state-of-the-art architectures. |
| Dataset Splits | Yes | The CIFAR-10 (Krizhevsky, 2012) dataset comprises 60K RGB images partitioned into 50K training images and 10K test images. The ImageNet images are partitioned into 1.28 million training images and 50K validation images. |
| Hardware Specification | No | Every asynchronous worker is a machine with 8 GPUs, so the 128 workers in our experiments simulate a total of 1024 GPUs. The specific GPU model and other hardware details are not stated. |
| Software Dependencies | No | The paper mentions optimizers like NAG and Adam, and models like Transformer-XL, often with citations to their original papers or repositories. However, it does not provide specific version numbers for the programming languages, libraries, or frameworks used for its implementation (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | C.4 Hyperparameters: Since one of our intentions was to verify that penalizing the gradients linearly to the Gap is the factor that leads to a better final test error and convergence rate, we used the same hyperparameters across all algorithms tested. These hyperparameters are the original hyperparameters of the respective neural network architecture, which are tuned for a single worker. CIFAR-10 / ResNet-20: initial learning rate η = 0.1; momentum coefficient γ = 0.9 with NAG; dampening = 0 (no dampening); batch size B = 128; weight decay = 0.0005; learning rate decay = 0.1 at epochs 80 and 120; total epochs = 160. A sketch of this recipe appears after the table. |
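
The Gap-Aware master update (Algorithm 4) is only available as pseudocode in the paper. The snippet below is a minimal sketch of one plausible reading, not the authors' implementation: it assumes the Gap is computed element-wise as the distance between the current master parameters and the worker's snapshot, normalized by a running average of the master's step size, and that the stale gradient is divided by this Gap, clamped to at least 1 so gradients are only dampened, never amplified. The names `gap_aware_update` and `avg_step` are hypothetical.

```python
# Illustrative sketch only; assumes the Gap-based penalization described above.
import torch

def gap_aware_update(theta_master, theta_snapshot, stale_grad, avg_step, lr,
                     eps=1e-8):
    """Apply one master step with a gap-penalized stale gradient (sketch).

    theta_master:   current master parameters
    theta_snapshot: parameters the worker held when it computed stale_grad
    avg_step:       running average of the master's per-parameter step size
                    (assumed to be maintained elsewhere by the master)
    """
    # Element-wise Gap: how far the master has drifted from the worker's
    # snapshot, measured in units of the average master step size.
    gap = (theta_master - theta_snapshot).abs() / (avg_step + eps)
    gap = torch.clamp(gap, min=1.0)      # never amplify a gradient
    penalized_grad = stale_grad / gap    # penalize linearly in the Gap
    return theta_master - lr * penalized_grad
```

Dividing by the clamped Gap keeps updates from near-synchronous workers untouched while shrinking contributions whose snapshots have fallen far behind the master.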
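
The C.4 hyperparameters for CIFAR-10 / ResNet-20 map onto a standard optimizer configuration. Below is a minimal sketch assuming a PyTorch implementation (the paper does not name the framework or versions, per the Software Dependencies row); `model` is a placeholder, not the authors' ResNet-20.

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)   # placeholder; the paper uses ResNet-20

# NAG with the single-worker hyperparameters listed in Appendix C.4.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,              # initial learning rate eta
    momentum=0.9,        # momentum coefficient gamma
    dampening=0,         # no dampening
    weight_decay=5e-4,   # weight decay 0.0005
    nesterov=True,       # NAG
)

# Decay the learning rate by a factor of 0.1 at epochs 80 and 120 (160 epochs total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80, 120], gamma=0.1
)

# The batch size B = 128 would be set on the CIFAR-10 DataLoader.
```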