Gap-Aware Mitigation of Gradient Staleness
Authors: Saar Barkai, Ido Hakimi, Assaf Schuster
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation on the CIFAR, ImageNet, and WikiText-103 datasets shows that GA outperforms the currently acceptable gradient penalization method, in final test accuracy. We also provide convergence rate proof for GA. |
| Researcher Affiliation | Academia | Saar Barkai, Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel (saarbarkai@gmail.com); Ido Hakimi, Department of Computer Science, Technion - Israel Institute of Technology, Haifa, Israel (idohakimi@gmail.com); Assaf Schuster, Department of Computer Science, Technion - Israel Institute of Technology, Haifa, Israel (assaf@sc.technion.ac.il) |
| Pseudocode | Yes | Algorithm 1 (Momentum-ASGD: worker i); Algorithm 2 (Momentum-ASGD: master); Algorithm 3 (Staleness-Aware: master); Algorithm 4 (Gap-Aware: master). A hedged sketch of the Gap-Aware master update is given after the table. |
| Open Source Code | Yes | Appendix A (Source Code): The source code of DANA-Gap-Aware is provided via: DOWNLOAD LINK |
| Open Datasets | Yes | To validate our claims, we performed experiments on the CIFAR10, CIFAR100 (Krizhevsky, 2012), ImageNet (Russakovsky et al., 2015), and WikiText-103 (Merity et al., 2016) datasets, using several state-of-the-art architectures. |
| Dataset Splits | Yes | The CIFAR-10 (Krizhevsky, 2012) dataset comprises 60K RGB images partitioned into 50K training images and 10K test images. The ImageNet images are partitioned into 1.28 million training images and 50K validation images. |
| Hardware Specification | No | Every asynchronous worker is a machine with 8 GPUs, so the 128 workers in our experiments simulate a total of 1024 GPUs. The specific GPU model and other hardware details are not stated. |
| Software Dependencies | No | The paper mentions optimizers like NAG and Adam, and models like Transformer-XL, often with citations to their original papers or repositories. However, it does not provide specific version numbers for the programming languages, libraries, or frameworks used for its implementation (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | C.4 Hyperparameters: Since one of our intentions was to verify that penalizing the gradients linearly to the Gap is the factor that leads to a better final test error and convergence rate, we used the same hyperparameters across all algorithms tested. These hyperparameters are the original hyperparameters of the respective neural network architecture, which are tuned for a single worker. CIFAR-10 / ResNet-20: initial learning rate η = 0.1; momentum coefficient γ = 0.9 with NAG; dampening = 0 (no dampening); batch size B = 128; weight decay = 0.0005; learning rate decay = 0.1 at epochs 80 and 120; total epochs = 160. A sketch of this recipe appears after the table. |
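
The Gap-Aware master update (Algorithm 4) is only available as pseudocode in the paper. The snippet below is a minimal sketch of one plausible reading, not the authors' implementation: it assumes the Gap is computed element-wise as the distance between the current master parameters and the worker's snapshot, normalized by a running average of the master's step size, and that the stale gradient is divided by this Gap, clamped to at least 1 so gradients are only dampened, never amplified. The names `gap_aware_update` and `avg_step` are hypothetical.

```python
# Illustrative sketch only; assumes the Gap-based penalization described above.
import torch

def gap_aware_update(theta_master, theta_snapshot, stale_grad, avg_step, lr,
                     eps=1e-8):
    """Apply one master step with a gap-penalized stale gradient (sketch).

    theta_master:   current master parameters
    theta_snapshot: parameters the worker held when it computed stale_grad
    avg_step:       running average of the master's per-parameter step size
                    (assumed to be maintained elsewhere by the master)
    """
    # Element-wise Gap: how far the master has drifted from the worker's
    # snapshot, measured in units of the average master step size.
    gap = (theta_master - theta_snapshot).abs() / (avg_step + eps)
    gap = torch.clamp(gap, min=1.0)      # never amplify a gradient
    penalized_grad = stale_grad / gap    # penalize linearly in the Gap
    return theta_master - lr * penalized_grad
```

Dividing by the clamped Gap keeps updates from near-synchronous workers untouched while shrinking contributions whose snapshots have fallen far behind the master.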
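
The C.4 hyperparameters for CIFAR-10 / ResNet-20 map onto a standard optimizer configuration. Below is a minimal sketch assuming a PyTorch implementation (the paper does not name the framework or versions, per the Software Dependencies row); `model` is a placeholder, not the authors' ResNet-20.

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)   # placeholder; the paper uses ResNet-20

# NAG with the single-worker hyperparameters listed in Appendix C.4.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,              # initial learning rate eta
    momentum=0.9,        # momentum coefficient gamma
    dampening=0,         # no dampening
    weight_decay=5e-4,   # weight decay 0.0005
    nesterov=True,       # NAG
)

# Decay the learning rate by a factor of 0.1 at epochs 80 and 120 (160 epochs total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80, 120], gamma=0.1
)

# The batch size B = 128 would be set on the CIFAR-10 DataLoader.
```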