Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Authors: Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we evaluate our method on several benchmark datasets including the Hugging Face Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks.
Researcher Affiliation | Academia | Department of Computer Science, University of California, Los Angeles, CA 90095, USA.
Pseudocode | Yes | Algorithm 1 Self-Play Fine-Tuning (SPIN) (see the objective sketch after this table)
Open Source Code | Yes | Codes are available at https://github.com/uclaml/SPIN.
Open Datasets | Yes | This model derives from the pre-trained Mistral-7B (Jiang et al., 2023) and has been further fine-tuned on the SFT dataset Ultrachat200k by Hugging Face. Ultrachat200k represents a high-quality 200k subset of the larger UltraChat (Ding et al., 2023) corpus.
Dataset Splits | No | The paper states 'We randomly sample 50k prompts and use the base model to generate the synthetic responses' for training and evaluates on external benchmarks, but does not provide specific training/validation/test splits for its own training process. (see the sampling sketch after this table)
Hardware Specification | Yes | Results were obtained using a machine with 8x A100 (80G) GPUs.
Software Dependencies | No | The paper mentions software such as the Alignment Handbook library, DeepSpeed ZeRO-3, FlashAttention-2, and the Accelerate library, but does not provide specific version numbers for these components.
Experiment Setup | Yes | We train our models with the RMSProp (Hinton et al., 2012) optimizer with no weight decay for all iterations, as commonly used in fine-tuning LLMs for alignment, with a global batch size of 64, 10% warmup steps and bfloat16 precision. We set the peak learning rate to be 5e-7 for iterations 0 and 1, and decay this peak learning rate to 1e-7 for iterations 2 and 3... We note that at the last iteration (iter-3) where the model is close to convergence, we increase the value of β to 5.0. We use... max sequence length to be 2048 tokens. (see the configuration sketch after this table)
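The Pseudocode row cites Algorithm 1 (SPIN), in which the frozen checkpoint from the previous iteration generates synthetic responses and the current model is trained to prefer the human-written SFT responses over them. Below is a minimal sketch of that pairwise logistic objective in plain PyTorch; the function name spin_loss, the argument names, and the toy inputs are illustrative assumptions and do not reproduce the authors' released implementation at https://github.com/uclaml/SPIN.

```python
import torch
import torch.nn.functional as F

def spin_loss(policy_real_logp: torch.Tensor,
              opponent_real_logp: torch.Tensor,
              policy_synth_logp: torch.Tensor,
              opponent_synth_logp: torch.Tensor,
              beta: float) -> torch.Tensor:
    """Pairwise logistic loss in the spirit of SPIN's Algorithm 1.

    Each tensor holds per-example sequence log-probabilities log p(y | x)
    summed over response tokens: "policy" is the model being trained,
    "opponent" is the frozen checkpoint from the previous iteration,
    "real" marks the ground-truth SFT responses, and "synth" marks the
    responses the opponent generated for the same prompts.
    """
    real_ratio = policy_real_logp - opponent_real_logp      # log-ratio on real y
    synth_ratio = policy_synth_logp - opponent_synth_logp   # log-ratio on synthetic y'
    # logistic loss l(t) = log(1 + exp(-t)); softplus(-t) is its stable form
    return F.softplus(-beta * (real_ratio - synth_ratio)).mean()

# Toy check on random log-probabilities for a batch of four examples;
# beta=0.1 here is an arbitrary illustrative value, not taken from the paper.
if __name__ == "__main__":
    fake = [torch.randn(4) for _ in range(4)]
    print(spin_loss(*fake, beta=0.1))
```

In the full loop, the newly trained model becomes the opponent for the next iteration, so the synthetic responses are regenerated before each round.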
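The Dataset Splits row quotes the step of sampling 50k prompts for the base model to answer. A hedged sketch of that sampling with the Hugging Face datasets library follows; the hub id HuggingFaceH4/ultrachat_200k, the train_sft split, and the prompt column are assumptions about the public Ultrachat200k release rather than details confirmed by the paper, and the generation step itself is only indicated in a comment.

```python
from datasets import load_dataset

# Assumed hub id and split name for the public Ultrachat200k release.
ultrachat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# "We randomly sample 50k prompts and use the base model to generate the
# synthetic responses": only the sampling half is shown here; generating
# answers with the previous-iteration model is left to the surrounding stack.
prompts = ultrachat.shuffle(seed=42).select(range(50_000))["prompt"]
print(len(prompts))  # 50000
```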
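The Experiment Setup row pins down most of the optimization hyperparameters. The configuration sketch below simply restates them; the dictionary layout and the build_optimizer helper are illustrative choices, not the structure of the SPIN repository's training scripts, and values the excerpt does not state (such as β before the final iteration) are omitted.

```python
import torch

# Hyperparameters as quoted in the Experiment Setup row; anything the
# excerpt does not state (e.g. beta before the final iteration) is omitted.
TRAIN_CONFIG = {
    "global_batch_size": 64,
    "warmup_ratio": 0.10,            # 10% warmup steps
    "precision": "bfloat16",
    "max_seq_length": 2048,
    "weight_decay": 0.0,
    "peak_lr": {0: 5e-7, 1: 5e-7, 2: 1e-7, 3: 1e-7},  # per SPIN iteration
    "beta_final_iteration": 5.0,     # raised to 5.0 at iter-3
}

def build_optimizer(model: torch.nn.Module, iteration: int) -> torch.optim.Optimizer:
    """RMSprop with no weight decay, using the quoted per-iteration peak LR."""
    return torch.optim.RMSprop(
        model.parameters(),
        lr=TRAIN_CONFIG["peak_lr"][iteration],
        weight_decay=TRAIN_CONFIG["weight_decay"],
    )
```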