ErrorCompensatedX: error compensation for variance reduced algorithms
Authors: Hanlin Tang, Yao Li, Ji Liu, Ming Yan
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we train ResNet-50 (He et al., 2016) on CIFAR10, which consists of 50000 training images and 10000 testing images, each labeled with one of 10 classes. We run the experiments on eight workers, each having a 1080Ti GPU. The batch size on each worker is 16 and the total batch size is 128. ... Figure 2: Epoch-wise convergence comparison on ResNet-50 for Momentum SGD (left column), STORM (middle column), and IGT (right column) with different communication implementations. |
| Researcher Affiliation | Collaboration | Hanlin Tang, Department of Computer Science, University of Rochester, tanghl1994@gmail.com; Yao Li, Department of Mathematics, Michigan State University, liyao6@msu.edu; Ji Liu, Kuaishou Technology, ji.liu.uwisc@gmail.com; Ming Yan, Department of Computational Mathematics, Science and Technology and Department of Mathematics, Michigan State University, myan@msu.edu |
| Pseudocode | Yes | Algorithm 1: ErrorCompensatedX for general A(x; ξ) (see the sketch after this table) |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | In this section, we train ResNet-50 (He et al., 2016) on CIFAR10, which consists of 50000 training images and 10000 testing images, each labeled with one of 10 classes. |
| Dataset Splits | No | The paper states '50000 training images and 10000 testing images' for CIFAR-10 but does not specify a validation set split. |
| Hardware Specification | Yes | We run the experiments on eight workers, each having a 1080Ti GPU. |
| Software Dependencies | No | The paper does not specify version numbers for any software dependencies used in the experiments. |
| Experiment Setup | Yes | The batch size on each worker is 16 and the total batch size is 128. ... We use the 1-bit compression in Tang et al. (2019), which leads to an overall 96% reduction in communication volume. ... We grid search the best learning rate from {0.5, 0.1, 0.001} and c0 from {0.1, 0.05, 0.001}, and find that the best learning rate is 0.01 with c0 = 0.05 for both original STORM and IGT. ... We set β = 0.3 for the low-pass filter in all cases. |
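
The Pseudocode row above points to Algorithm 1 (ErrorCompensatedX for a general update rule A(x; ξ)), and the Experiment Setup row mentions 1-bit compression with error compensation. Below is a minimal, generic error-feedback sketch in Python; the function names and the mean-magnitude scaling are illustrative assumptions, not the authors' code, and the paper's algorithm additionally compensates the compression error inside variance-reduced estimators such as STORM and IGT.

```python
import numpy as np

def one_bit_compress(v):
    # Sign-based 1-bit compression with a mean-magnitude scale. The scaling here
    # is an assumption for illustration; the paper uses the compressor of Tang et al. (2019).
    return np.mean(np.abs(v)) * np.sign(v)

def error_compensated_step(grad_estimate, error, lr):
    # Generic error-feedback update: compress (estimate + carried error),
    # communicate only the compressed value, and keep the residual locally.
    corrected = grad_estimate + error
    compressed = one_bit_compress(corrected)
    new_error = corrected - compressed   # residual carried to the next step
    update = -lr * compressed
    return update, new_error

# Toy usage on a random "gradient" vector.
rng = np.random.default_rng(0)
g = rng.normal(size=8)
err = np.zeros_like(g)
step, err = error_compensated_step(g, err, lr=0.1)
```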
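The Experiment Setup row can also be read as a small configuration and grid-search sketch, shown below. The numeric values restate what the paper reports; the dictionary keys, the `low_pass_filter` helper, and the `train_and_evaluate` placeholder are hypothetical names introduced only for illustration.

```python
from itertools import product

# Reported setup: ResNet-50 on CIFAR10, 8 workers x batch 16 = total batch 128,
# learning-rate grid {0.5, 0.1, 0.001}, c0 grid {0.1, 0.05, 0.001}, beta = 0.3.
config = {
    "model": "ResNet-50",
    "dataset": "CIFAR10",
    "num_workers": 8,
    "batch_size_per_worker": 16,   # total batch size 8 * 16 = 128
    "beta_low_pass": 0.3,
}

def low_pass_filter(prev, new, beta=0.3):
    # Exponential low-pass filter with coefficient beta; where exactly this filter
    # is applied in the training loop is specified in the paper, not here.
    return (1.0 - beta) * prev + beta * new

for lr, c0 in product([0.5, 0.1, 0.001], [0.1, 0.05, 0.001]):
    # train_and_evaluate(config, lr, c0) would launch one distributed run;
    # it is a placeholder, not a function from the authors' code.
    print(f"would launch run with lr={lr}, c0={c0}")
```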