CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks
Authors: Jue Wang, Yucheng Lu, Binhang Yuan, Beidi Chen, Percy Liang, Christopher De Sa, Christopher Ré, Ce Zhang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that COCKTAILSGD achieves up to 117× compression in fine-tuning LLMs up to 20 billion parameters without hurting convergence. On a 500Mbps network, COCKTAILSGD only incurs ~1.2× slowdown compared with data center networks. |
| Researcher Affiliation | Academia | (1) ETH Zurich, Switzerland; (2) Cornell University, USA; (3) Carnegie Mellon University, USA; (4) Stanford University, USA. |
| Pseudocode | Yes | Algorithm 1: COCKTAILSGD. Algorithm 2: Compressor C[δ]. (An illustrative sketch of the compressor follows the table.) |
| Open Source Code | No | The paper mentions using 'open models' and references 'Together Computer (https://together.xyz/)' as providing computation, but it does not state that its own methodology's source code is publicly available or provide a link to it. |
| Open Datasets | Yes | For instruction tuning, we use a collection of Natural-Instructions (NI) (Mishra et al., 2022; Wang et al., 2022), Public Pool of Prompts (P3) (Bach et al., 2022), Chain-of-Thought (Wei et al., 2022) data, and The Pile (Gao et al., 2020) to prevent catastrophic forgetting of previously learned knowledge. ... For language modeling, we train on WIKITEXT-103 data (Merity et al., 2016). |
| Dataset Splits | No | The paper mentions training and evaluation on datasets and specifies batch size and sequence length, but it does not provide explicit details about how these datasets were split into training, validation, and test sets (e.g., percentages or sample counts for each split). |
| Hardware Specification | Yes | We use A100-80GB GPUs to train large language models. ... For the models OPT-1.3B and GPT-J-6B, we utilize 4 data parallel workers with 2 A100 GPUs each. For the GPT-NeoX-20B model, we use 4 data parallel workers with 8 A100 GPUs each, for a total of 32 A100 GPUs. |
| Software Dependencies | No | The paper mentions 'mixed precision (FP16) training' and 'the default Adam optimizer' but does not list specific version numbers for any software dependencies like Python, PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | We train in mixed precision (FP16) and conduct careful tuning for all methods on all datasets. We use the default Adam optimizer. The optimal learning rate is determined through a grid search, ranging from 1e-6 to 1e-3... We use a batch size of 64, 128, 128 for OPT-1.3B, GPT-J-6B, GPT-NeoX-20B, respectively, and a sequence length of 2048. (A configuration sketch follows the table.) |
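
The Pseudocode row refers to Algorithm 2, a compressor that composes several cheap compression primitives. Below is a minimal, hypothetical sketch of that "cocktail" idea in PyTorch: random sparsification, then top-K selection, then low-bit quantization applied in sequence to a parameter delta. The function names, keep ratios, and bit width are illustrative assumptions, not the authors' released implementation, which also includes error feedback and other details omitted here.

```python
import torch


def cocktail_compress(delta: torch.Tensor,
                      random_keep: float = 0.1,
                      topk_keep: float = 0.1,
                      num_bits: int = 4):
    """Illustrative cocktail-style compressor: random sparsification,
    then top-K sparsification, then uniform quantization."""
    flat = delta.flatten()

    # 1) Random sparsification: keep a random subset of coordinates.
    n_rand = max(1, int(random_keep * flat.numel()))
    rand_idx = torch.randperm(flat.numel())[:n_rand]
    rand_vals = flat[rand_idx]

    # 2) Top-K sparsification: keep the largest-magnitude survivors.
    k = max(1, int(topk_keep * rand_vals.numel()))
    _, top_pos = torch.topk(rand_vals.abs(), k)
    keep_idx = rand_idx[top_pos]
    keep_vals = flat[keep_idx]

    # 3) Uniform quantization of the kept values to `num_bits` bits
    #    (stored in an int8 container for simplicity).
    scale = keep_vals.abs().max().clamp(min=1e-12)
    levels = 2 ** (num_bits - 1) - 1
    q_vals = torch.round(keep_vals / scale * levels).to(torch.int8)

    return keep_idx, q_vals, scale


def cocktail_decompress(keep_idx, q_vals, scale, numel, num_bits: int = 4):
    """Reconstruct a dense (mostly zero) delta from the compressed form."""
    levels = 2 ** (num_bits - 1) - 1
    out = torch.zeros(numel)
    out[keep_idx] = q_vals.float() / levels * scale
    return out
```

Because the three stages compose, their individual ratios multiply (e.g., 10% random keep × 10% top-K × 4-bit values), which is how high end-to-end compression on the order of the paper's reported 117× becomes plausible, though the exact ratios and mechanisms above are only an assumption for illustration.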
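
The Experiment Setup row can be summarized as a small configuration sketch, again assuming PyTorch: the default Adam optimizer, FP16 mixed precision via a gradient scaler, and a learning-rate grid search over 1e-6 to 1e-3. The specific grid points and helper names below are illustrative assumptions; only the search range, batch sizes, and sequence length come from the paper.

```python
import torch

# Assumed grid points; the paper only states the search range 1e-6 to 1e-3.
LEARNING_RATE_GRID = [1e-6, 3e-6, 1e-5, 3e-5, 1e-4, 3e-4, 1e-3]

# Batch sizes and sequence length as reported in the Experiment Setup row.
BATCH_SIZE = {"OPT-1.3B": 64, "GPT-J-6B": 128, "GPT-NeoX-20B": 128}
SEQUENCE_LENGTH = 2048


def build_training(model: torch.nn.Module, lr: float):
    """Default Adam optimizer plus an FP16 GradScaler for mixed-precision training."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
    return optimizer, scaler
```

A grid search would simply call `build_training` once per learning rate in `LEARNING_RATE_GRID` and keep the run with the best validation loss; the paper does not specify the selection criterion beyond "optimal learning rate".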