ErrorCompensatedX: error compensation for variance reduced algorithms

Authors: Hanlin Tang, Yao Li, Ji Liu, Ming Yan

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we train ResNet-50 (He et al., 2016) on CIFAR10, which consists of 50000 training images and 10000 testing images, each has 10 labels. We run the experiments on eight workers, each having a 1080Ti GPU. The batch size on each worker is 16 and the total batch size is 128. ... Figure 2: Epoch-wise convergence comparison on ResNet-50 for Momentum SGD (left column), STORM (middle column), and IGT (right column) with different communication implementations.
Researcher Affiliation | Collaboration | Hanlin Tang, Department of Computer Science, University of Rochester (tanghl1994@gmail.com); Yao Li, Department of Mathematics, Michigan State University (liyao6@msu.edu); Ji Liu, Kuaishou Technology (ji.liu.uwisc@gmail.com); Ming Yan, Department of Computational Mathematics, Science and Technology and Department of Mathematics, Michigan State University (myan@msu.edu)
Pseudocode | Yes | Algorithm 1: ErrorCompensatedX for general A(x; ξ) (see the illustrative sketch below the table)
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | In this section, we train ResNet-50 (He et al., 2016) on CIFAR10, which consists of 50000 training images and 10000 testing images, each has 10 labels.
Dataset Splits | No | The paper states '50000 training images and 10000 testing images' for CIFAR-10 but does not specify a validation set split.
Hardware Specification | Yes | We run the experiments on eight workers, each having a 1080Ti GPU.
Software Dependencies | No | The paper does not specify version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | The batch size on each worker is 16 and the total batch size is 128. ... We use the 1-bit compression in Tang et al. (2019), which leads to an overall 96% of communication volume reduction. ... We grid search the best learning rate from {0.5, 0.1, 0.001} and c0 from {0.1, 0.05, 0.001}, and find that the best learning rate is 0.01 with c0 = 0.05 for both original STORM and IGT. ... We set β = 0.3 for the low-pass filter in all cases.
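
The Pseudocode and Experiment Setup rows above refer to error compensation applied on top of 1-bit gradient compression. As a rough, self-contained illustration only (not the authors' exact Algorithm 1, which is stated for a general update A(x; ξ), nor the exact 1-bit scheme of Tang et al. (2019)), the sketch below shows the basic error-compensated compression step a worker might apply before communicating; the function and variable names, and the l1-mean scaling of the sign compressor, are assumptions made for this example.

    import numpy as np

    def one_bit_compress(v):
        """Sign ("1-bit") compression: keep only the sign of each entry,
        scaled by the mean absolute value so the message preserves the
        overall magnitude of the input (one common choice; an assumption
        here, not necessarily the paper's exact compressor)."""
        scale = np.mean(np.abs(v))
        return scale * np.sign(v)

    def error_compensated_step(update, error, compress=one_bit_compress):
        """One error-compensated communication round.

        The worker compresses (update + carried-over error) instead of the
        raw update, sends the compressed message, and keeps the new
        compression residual locally to fold into the next round.
        """
        corrected = update + error          # add back last round's residual
        message = compress(corrected)       # what is actually communicated
        new_error = corrected - message     # residual stored on the worker
        return message, new_error

    # Toy usage: compress a stream of pseudo-gradients with error feedback.
    rng = np.random.default_rng(0)
    error = np.zeros(10)
    for step in range(3):
        grad = rng.normal(size=10)          # stand-in for a local (variance-reduced) update
        msg, error = error_compensated_step(grad, error)
        print(step, np.linalg.norm(grad - msg))

The point of the error-feedback pattern is that the residual left behind by compression is re-injected into the next round's message rather than discarded, so compression error does not silently accumulate across iterations; ErrorCompensatedX applies this idea to general (including variance-reduced) update rules.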