Communication Efficient Distributed Training with Distributed Lion
Authors: Bo Liu, Lemeng Wu, Lizhang Chen, Kaizhao Liang, Jiaxu Zhu, Chen Liang, Raghuraman Krishnamoorthi, Qiang Liu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate its robustness across a range of tasks, worker counts, and batch sizes, on both vision and language problems. In this section, we perform a thorough evaluation of the Distributed Lion algorithm, employing both the averaging and majority vote aggregation methods. |
| Researcher Affiliation | Collaboration | Bo Liu The University of Texas at Austin bliu@cs.utexas.edu, Lemeng Wu Meta AI lmwu@google.com, Lizhang Chen The University of Texas at Austin lzchen@utexas.edu, Kaizhao Liang The University of Texas at Austin kaizhaol@utexas.edu, Jiaxu Zhu Meta AI jiaxuzhu@meta.com, Chen Liang Google crazydonkey@google.com, Raghuraman Krishnamoorthi Meta AI raghuraman@meta.com, Qiang Liu The University of Texas at Austin lqiang@cs.utexas.edu |
| Pseudocode | Yes | Algorithm 1 Distributed Lion Training |
| Open Source Code | No | We will release code upon acceptance. |
| Open Datasets | Yes | We conduct experiments on the CIFAR-10 dataset... For the ImageNet-1K benchmark... For pretraining language models on the OpenWebText dataset... (Citations: [31] Olga Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015., [17] Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.) |
| Dataset Splits | Yes | For language modeling experiments, we report the validation perplexity. |
| Hardware Specification | No | We specified how many workers are needed for each experiment; the GPU resources can be arbitrary as long as the model fits in memory. |
| Software Dependencies | No | The paper does not provide specific software dependency names with their version numbers (e.g., 'PyTorch 1.9', 'Python 3.8'). |
| Experiment Setup | Yes | Experiment Setup For GradDrop, DGC, and TernGrad, we choose the compression rate of 0.04 (note that 1/32 = 0.03125) to match the bandwidth of the D-Lion (MaVo). We conduct experiments on the CIFAR-10 dataset using a vision transformer (ViT) with 6 layers, 8 heads, and a hidden dimension of 512... We list the optimal hyperparameters selected for each method from Figure 2 in Table 4. The learning rates are selected from {0.00005, 0.001, 0.005, 0.01} and the weight decays are selected from {0.0005, 0.001, 0.005}. For each experiment, we use a cosine learning rate scheduler and run for 200 epochs... |
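To make the reviewed algorithm concrete: the paper's Algorithm 1 has each worker compute Lion's 1-bit sign update locally and the server aggregate those binary vectors either by averaging or by majority vote, which is what makes the communication cheap (roughly 1 bit per coordinate versus 32 for full-precision gradients). The following is a minimal NumPy sketch based on that description, not the authors' released implementation; the function name, defaults, and single-vector layout are illustrative assumptions.

```python
import numpy as np

def distributed_lion_step(params, grads, moments, lr=1e-4, beta1=0.9,
                          beta2=0.99, weight_decay=0.0, vote="majority"):
    """One simulated step of Distributed Lion (sketch, not the official code).

    params:  (d,) shared parameter vector
    grads:   (n_workers, d) per-worker gradients on their local batches
    moments: (n_workers, d) per-worker Lion momentum buffers (updated in place)
    """
    # Each worker forms Lion's 1-bit local update: sign(beta1 * m + (1 - beta1) * g).
    local_signs = np.sign(beta1 * moments + (1.0 - beta1) * grads)
    # Each worker then refreshes its own momentum: m <- beta2 * m + (1 - beta2) * g.
    moments *= beta2
    moments += (1.0 - beta2) * grads
    # Server aggregates the binary updates.
    if vote == "majority":
        # Majority vote: the result is again a 1-bit vector, so the
        # broadcast back to workers is also compressed.
        update = np.sign(local_signs.sum(axis=0))
    else:
        # Averaging: entries take one of (n_workers + 1) discrete levels in [-1, 1].
        update = local_signs.mean(axis=0)
    # Apply the update with Lion-style decoupled weight decay.
    params -= lr * (update + weight_decay * params)
    return params
```

For example, if all workers see gradients of the same sign, the majority vote reproduces the single-machine Lion sign step; with disagreeing workers, averaging yields a fractional step while majority vote snaps to the winning sign.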