CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks
Authors: Jue Wang, Yucheng Lu, Binhang Yuan, Beidi Chen, Percy Liang, Christopher De Sa, Christopher Ré, Ce Zhang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that COCKTAILSGD achieves up to 117× compression in fine-tuning LLMs up to 20 billion parameters without hurting convergence. On a 500Mbps network, COCKTAILSGD only incurs ~1.2× slowdown compared with data center networks. |
| Researcher Affiliation | Academia | (1) ETH Zurich, Switzerland; (2) Cornell University, USA; (3) Carnegie Mellon University, USA; (4) Stanford University, USA. |
| Pseudocode | Yes | Algorithm 1: COCKTAILSGD. Algorithm 2: Compressor C[δ]. (An illustrative sketch of the compressor follows the table.) |
| Open Source Code | No | The paper mentions using 'open models' and references 'Together Computer (https://together.xyz/)' as providing computation, but it does not state that its own methodology's source code is publicly available or provide a link to it. |
| Open Datasets | Yes | For instruction tuning, we use a collection of Natural-Instructions (NI) (Mishra et al., 2022; Wang et al., 2022), Public Pool of Prompts (P3) (Bach et al., 2022), Chain-of-Thought (Wei et al., 2022) data, and The Pile (Gao et al., 2020) to prevent catastrophic forgetting of previously learned knowledge. ... For language modeling, we train on WIKITEXT-103 data (Merity et al., 2016). |
| Dataset Splits | No | The paper mentions training and evaluation on datasets and specifies batch size and sequence length, but it does not provide explicit details about how these datasets were split into training, validation, and test sets (e.g., percentages or sample counts for each split). |
| Hardware Specification | Yes | We use A100-80GB GPUs to train large language models. ... For the models OPT-1.3B and GPT-J-6B, we utilize 4 data parallel workers with 2 A100 GPUs each. For the GPT-NeoX-20B model, we use 4 data parallel workers with 8 A100 GPUs each, for a total of 32 A100 GPUs. |
| Software Dependencies | No | The paper mentions 'mixed precision (FP16) training' and 'the default Adam optimizer' but does not list specific version numbers for any software dependencies like Python, PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | We train in mixed precision (FP16) and conduct careful tuning for all methods on all datasets. We use the default Adam optimizer. The optimal learning rate is determined through a grid search, ranging from 1e-6 to 1e-3... We use a batch size of 64, 128, 128 for OPT-1.3B, GPT-J-6B, GPT-NeoX-20B, respectively, and a sequence length of 2048. (A configuration sketch follows the table.) |
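
The Pseudocode row refers to Algorithm 2, a compressor that composes several cheap compression primitives. Below is a minimal, hypothetical sketch of that "cocktail" idea in PyTorch: random sparsification, then top-K selection, then low-bit quantization applied in sequence to a parameter delta. The function names, keep ratios, and bit width are illustrative assumptions, not the authors' released implementation, which also includes error feedback and other details omitted here.

```python
import torch


def cocktail_compress(delta: torch.Tensor,
                      random_keep: float = 0.1,
                      topk_keep: float = 0.1,
                      num_bits: int = 4):
    """Illustrative cocktail-style compressor: random sparsification,
    then top-K sparsification, then uniform quantization."""
    flat = delta.flatten()

    # 1) Random sparsification: keep a random subset of coordinates.
    n_rand = max(1, int(random_keep * flat.numel()))
    rand_idx = torch.randperm(flat.numel())[:n_rand]
    rand_vals = flat[rand_idx]

    # 2) Top-K sparsification: keep the largest-magnitude survivors.
    k = max(1, int(topk_keep * rand_vals.numel()))
    _, top_pos = torch.topk(rand_vals.abs(), k)
    keep_idx = rand_idx[top_pos]
    keep_vals = flat[keep_idx]

    # 3) Uniform quantization of the kept values to `num_bits` bits
    #    (stored in an int8 container for simplicity).
    scale = keep_vals.abs().max().clamp(min=1e-12)
    levels = 2 ** (num_bits - 1) - 1
    q_vals = torch.round(keep_vals / scale * levels).to(torch.int8)

    return keep_idx, q_vals, scale


def cocktail_decompress(keep_idx, q_vals, scale, numel, num_bits: int = 4):
    """Reconstruct a dense (mostly zero) delta from the compressed form."""
    levels = 2 ** (num_bits - 1) - 1
    out = torch.zeros(numel)
    out[keep_idx] = q_vals.float() / levels * scale
    return out
```

Because the three stages compose, their individual ratios multiply (e.g., 10% random keep × 10% top-K × 4-bit values), which is how high end-to-end compression on the order of the paper's reported 117× becomes plausible, though the exact ratios and mechanisms above are only an assumption for illustration.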
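
The Experiment Setup row can be summarized as a small configuration sketch, again assuming PyTorch: the default Adam optimizer, FP16 mixed precision via a gradient scaler, and a learning-rate grid search over 1e-6 to 1e-3. The specific grid points and helper names below are illustrative assumptions; only the search range, batch sizes, and sequence length come from the paper.

```python
import torch

# Assumed grid points; the paper only states the search range 1e-6 to 1e-3.
LEARNING_RATE_GRID = [1e-6, 3e-6, 1e-5, 3e-5, 1e-4, 3e-4, 1e-3]

# Batch sizes and sequence length as reported in the Experiment Setup row.
BATCH_SIZE = {"OPT-1.3B": 64, "GPT-J-6B": 128, "GPT-NeoX-20B": 128}
SEQUENCE_LENGTH = 2048


def build_training(model: torch.nn.Module, lr: float):
    """Default Adam optimizer plus an FP16 GradScaler for mixed-precision training."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
    return optimizer, scaler
```

A grid search would simply call `build_training` once per learning rate in `LEARNING_RATE_GRID` and keep the run with the best validation loss; the paper does not specify the selection criterion beyond "optimal learning rate".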