Communication Efficient Distributed Training with Distributed Lion

Authors: Bo Liu, Lemeng Wu, Lizhang Chen, Kaizhao Liang, Jiaxu Zhu, Chen Liang, Raghuraman Krishnamoorthi, Qiang Liu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results demonstrate its robustness across a range of tasks, worker counts, and batch sizes, on both vision and language problems. In this section, we perform a thorough evaluation of the Distributed Lion algorithm, employing both the averaging and majority vote aggregation methods.
Researcher Affiliation | Collaboration | Bo Liu, The University of Texas at Austin, bliu@cs.utexas.edu; Lemeng Wu, Meta AI, lmwu@google.com; Lizhang Chen, The University of Texas at Austin, lzchen@utexas.edu; Kaizhao Liang, The University of Texas at Austin, kaizhaol@utexas.edu; Jiaxu Zhu, Meta AI, jiaxuzhu@meta.com; Chen Liang, Google, crazydonkey@google.com; Raghuraman Krishnamoorthi, Meta AI, raghuraman@meta.com; Qiang Liu, The University of Texas at Austin, lqiang@cs.utexas.edu
Pseudocode | Yes | Algorithm 1: Distributed Lion Training
Open Source Code | No | We will release code upon acceptance.
Open Datasets | Yes | We conduct experiments on the CIFAR-10 dataset... For the ImageNet-1K benchmark... For pretraining language models on the OpenWebText dataset... (Citations: [31] Olga Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.; [17] Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.)
Dataset Splits | Yes | For language modeling experiments, we report the validation perplexity.
Hardware Specification | No | The paper states how many workers are needed for each experiment, but the GPU resources can be arbitrary as long as the model fits in memory.
Software Dependencies | No | The paper does not provide specific software dependency names with their version numbers (e.g., 'PyTorch 1.9', 'Python 3.8').
Experiment Setup | Yes | For GradDrop, DGC, and TernGrad, we choose a compression rate of 0.04 (note that 1/32 = 0.03125) to match the bandwidth of D-Lion (MaVo). We conduct experiments on the CIFAR-10 dataset using a vision transformer (ViT) with 6 layers, 8 heads, and a hidden dimension of 512... We list the optimal hyperparameters selected for each method from Figure 2 in Table 4. The learning rates are selected from {0.00005, 0.001, 0.005, 0.01} and the weight decays are selected from {0.0005, 0.001, 0.005}. For each experiment, we use a cosine learning rate scheduler and run for 200 epochs...
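The majority-vote aggregation evaluated above can be sketched in a few lines. This is a minimal illustration rather than the authors' released implementation: the function names, hyperparameter defaults, and the NumPy-based aggregation are assumptions, based only on the standard Lion update rule and the paper's description of workers communicating sign (±1) updates.

```python
import numpy as np

def lion_worker_step(grad, momentum, beta1=0.9, beta2=0.99):
    """One Lion step on a worker. The worker communicates only the
    sign of its update (1 bit per parameter instead of a 32-bit float)."""
    update = np.sign(beta1 * momentum + (1 - beta1) * grad)
    momentum = beta2 * momentum + (1 - beta2) * grad
    return update, momentum

def majority_vote(worker_updates):
    """Server-side aggregation: elementwise majority vote over the
    workers' binary updates, i.e. the sign of their sum (ties give 0)."""
    return np.sign(np.sum(worker_updates, axis=0))
```

With three workers sending updates [1, -1, 1], [1, 1, -1], and [1, -1, -1], the vote yields [1, -1, -1]; the server then applies this sign vector scaled by the learning rate.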
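The parenthetical "1/32 = 0.03125" in the setup follows from sign-based communication sending 1 bit per parameter instead of a 32-bit float, so a 0.04 rate for the baselines slightly exceeds D-Lion's bandwidth. A quick check of that arithmetic (constant names here are illustrative, not from the paper):

```python
FLOAT_BITS = 32   # bits per parameter with full-precision communication
SIGN_BITS = 1     # bits per parameter when only the sign is sent

# Relative bandwidth of sign-based updates vs. full precision.
sign_rate = SIGN_BITS / FLOAT_BITS   # 0.03125
baseline_rate = 0.04                 # rate chosen for GradDrop, DGC, TernGrad
```

Because 0.04 > 0.03125, the compressed-gradient baselines are given at least as much communication budget as D-Lion (MaVo).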