Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Authors: Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we evaluate our method on several benchmark datasets including the Hugging Face Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks.
Researcher Affiliation | Academia | Department of Computer Science, University of California, Los Angeles, CA 90095, USA.
Pseudocode | Yes | Algorithm 1 Self-Play Fine-Tuning (SPIN) (see the objective sketch after this table)
Open Source Code | Yes | Codes are available at https://github.com/uclaml/SPIN.
Open Datasets | Yes | This model derives from the pre-trained Mistral-7B (Jiang et al., 2023) and has been further fine-tuned on the SFT dataset Ultrachat200k by Hugging Face. Ultrachat200k represents a high-quality 200k subset of the larger UltraChat (Ding et al., 2023) corpus.
Dataset Splits | No | The paper states 'We randomly sample 50k prompts and use the base model to generate the synthetic responses' for training and evaluates on external benchmarks, but does not provide specific training/validation/test splits for its own training process. (see the sampling sketch after this table)
Hardware Specification | Yes | Results were obtained using a machine with 8x A100 (80G) GPUs.
Software Dependencies | No | The paper mentions software such as the Alignment Handbook library, DeepSpeed ZeRO-3, FlashAttention-2, and the Accelerate library, but does not provide specific version numbers for these components.
Experiment Setup | Yes | We train our models with the RMSProp (Hinton et al., 2012) optimizer with no weight decay for all iterations, as commonly used in fine-tuning LLMs for alignment, with a global batch size of 64, 10% warmup steps and bfloat16 precision. We set the peak learning rate to be 5e-7 for iterations 0 and 1, and decay this peak learning rate to 1e-7 for iterations 2 and 3... We note that at the last iteration (iter-3) where the model is close to convergence, we increase the value of β to 5.0. We use... max sequence length to be 2048 tokens. (see the configuration sketch after this table)
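The Pseudocode row cites Algorithm 1 (SPIN), in which the frozen checkpoint from the previous iteration generates synthetic responses and the current model is trained to prefer the human-written SFT responses over them. Below is a minimal sketch of that pairwise logistic objective in plain PyTorch; the function name spin_loss, the argument names, and the toy inputs are illustrative assumptions and do not reproduce the authors' released implementation at https://github.com/uclaml/SPIN.

```python
import torch
import torch.nn.functional as F

def spin_loss(policy_real_logp: torch.Tensor,
              opponent_real_logp: torch.Tensor,
              policy_synth_logp: torch.Tensor,
              opponent_synth_logp: torch.Tensor,
              beta: float) -> torch.Tensor:
    """Pairwise logistic loss in the spirit of SPIN's Algorithm 1.

    Each tensor holds per-example sequence log-probabilities log p(y | x)
    summed over response tokens: "policy" is the model being trained,
    "opponent" is the frozen checkpoint from the previous iteration,
    "real" marks the ground-truth SFT responses, and "synth" marks the
    responses the opponent generated for the same prompts.
    """
    real_ratio = policy_real_logp - opponent_real_logp      # log-ratio on real y
    synth_ratio = policy_synth_logp - opponent_synth_logp   # log-ratio on synthetic y'
    # logistic loss l(t) = log(1 + exp(-t)); softplus(-t) is its stable form
    return F.softplus(-beta * (real_ratio - synth_ratio)).mean()

# Toy check on random log-probabilities for a batch of four examples;
# beta=0.1 here is an arbitrary illustrative value, not taken from the paper.
if __name__ == "__main__":
    fake = [torch.randn(4) for _ in range(4)]
    print(spin_loss(*fake, beta=0.1))
```

In the full loop, the newly trained model becomes the opponent for the next iteration, so the synthetic responses are regenerated before each round.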
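The Dataset Splits row quotes the step of sampling 50k prompts for the base model to answer. A hedged sketch of that sampling with the Hugging Face datasets library follows; the hub id HuggingFaceH4/ultrachat_200k, the train_sft split, and the prompt column are assumptions about the public Ultrachat200k release rather than details confirmed by the paper, and the generation step itself is only indicated in a comment.

```python
from datasets import load_dataset

# Assumed hub id and split name for the public Ultrachat200k release.
ultrachat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# "We randomly sample 50k prompts and use the base model to generate the
# synthetic responses": only the sampling half is shown here; generating
# answers with the previous-iteration model is left to the surrounding stack.
prompts = ultrachat.shuffle(seed=42).select(range(50_000))["prompt"]
print(len(prompts))  # 50000
```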
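The Experiment Setup row pins down most of the optimization hyperparameters. The configuration sketch below simply restates them; the dictionary layout and the build_optimizer helper are illustrative choices, not the structure of the SPIN repository's training scripts, and values the excerpt does not state (such as β before the final iteration) are omitted.

```python
import torch

# Hyperparameters as quoted in the Experiment Setup row; anything the
# excerpt does not state (e.g. beta before the final iteration) is omitted.
TRAIN_CONFIG = {
    "global_batch_size": 64,
    "warmup_ratio": 0.10,            # 10% warmup steps
    "precision": "bfloat16",
    "max_seq_length": 2048,
    "weight_decay": 0.0,
    "peak_lr": {0: 5e-7, 1: 5e-7, 2: 1e-7, 3: 1e-7},  # per SPIN iteration
    "beta_final_iteration": 5.0,     # raised to 5.0 at iter-3
}

def build_optimizer(model: torch.nn.Module, iteration: int) -> torch.optim.Optimizer:
    """RMSprop with no weight decay, using the quoted per-iteration peak LR."""
    return torch.optim.RMSprop(
        model.parameters(),
        lr=TRAIN_CONFIG["peak_lr"][iteration],
        weight_decay=TRAIN_CONFIG["weight_decay"],
    )
```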