Self-playing Adversarial Language Game Enhances LLM Reasoning

Authors: Pengyu Cheng, Tianhao Hu, Han Xu, Zhisong Zhang, Yong Dai, Lei Han, Nan Du, Xiaolong Li

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental With this goal, we select several open-source LLMs and let each act as the attacker and play with a copy of itself as the defender on an extensive range of target words. Through reinforcement learning on the game outcomes, we observe that the LLMs' performances uniformly improve on a broad range of reasoning benchmarks. To verify the effectiveness of SPAG, we select open-source pretrained LLMs of different sources and model sizes, particularly LLaMA-2-7B [Touvron et al., 2023] and Baichuan-2-13B [Yang et al., 2023a]. The main results are shown in Figure 1, where each axis is normalized by the maximum answer-accuracy value.
Researcher Affiliation Industry Tencent AI Lab (Shenzhen & Seattle); Tencent Robotics X Lab
Pseudocode Yes Algorithm 1 (Data collection of LLM self-plays for the adversarial language game) and Algorithm 2 (Self-play of adversarial language games, SPAG). A hedged sketch of the data-collection loop is given after this table.
Open Source Code Yes The code is available at https://github.com/Linear95/SPAG.
Open Datasets Yes Hence, we collect the 50K most frequently used words from the Corpus of Contemporary American English (CoCA) [Davies, 2020] as the target word list Vtarget. To collect the game episodes of GPT-4 [Achiam et al., 2023], we use the data collection procedure described in Algorithm 1. We use Alpaca [Taori et al., 2023] as the SFT set, which contains 52K instruction-following data from GPT-3 [Brown et al., 2020].
Dataset Splits No The paper describes data collection and evaluation but does not explicitly provide training/validation/test dataset splits with percentages or sample counts.
Hardware Specification Yes All our experiments are conducted using 32 NVIDIA A100-SXM4 GPUs with 40GB memory.
Software Dependencies No The paper mentions software like NLTK and TextBlob, but does not provide specific version numbers for these or other key software dependencies like Python or PyTorch.
Experiment Setup Yes For imitation learning, the learning rate is 5e-6, and the KL-penalty coefficient β1 = 0.1. For SPAG training, the learning rate is 2e-6, the KL-penalty coefficient β2 = 0.2, and the SFT coefficient α = 0.5. For the Alpaca SFT baseline, we exactly follow the training setups of Alpaca and set the learning rate to 2e-6. Among all training stages, the batch size is 128 and the max sequence length is 2048. Each training process maintains one epoch over the offline collected trajectories. The decay parameter γ is set to 0.8. The maximum number of turns of the Adversarial Taboo game is 5. These settings are gathered in the config sketch after the table.
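
The row on pseudocode cites Algorithm 1 (self-play data collection for the Adversarial Taboo game). The following is a minimal Python sketch of how such an episode-collection loop could look, assuming a generic text-generation callable and hypothetical helper predicates (llm, mentions_target, guessed_target); it is not the authors' implementation, only an illustration of the attacker/defender self-play and outcome assignment described in the quoted excerpts.

```python
# Hedged sketch of a self-play data-collection loop in the spirit of the
# paper's Algorithm 1. Helper names and prompt formats are hypothetical.
from typing import Callable, Dict, List

MAX_TURNS = 5  # the paper caps Adversarial Taboo episodes at 5 turns


def collect_episode(
    llm: Callable[[str, List[Dict]], str],
    target_word: str,
    mentions_target: Callable[[str, str], bool],
    guessed_target: Callable[[str, str], bool],
) -> Dict:
    """Play one Adversarial Taboo episode with the same LLM as attacker and defender."""
    history: List[Dict] = []
    outcome = "tie"  # no decision within the turn limit
    for _ in range(MAX_TURNS):
        # Attacker (knows the target word) tries to induce the defender to say it.
        atk_msg = llm(f"attacker; target word: {target_word}", history)
        history.append({"role": "attacker", "text": atk_msg})
        if mentions_target(atk_msg, target_word):
            outcome = "defender_wins"  # attacker leaked the word itself
            break
        # Defender tries to infer the hidden word while avoiding saying it.
        def_msg = llm("defender", history)
        history.append({"role": "defender", "text": def_msg})
        if mentions_target(def_msg, target_word):
            outcome = "attacker_wins"  # defender uttered the word unconsciously
            break
        if guessed_target(def_msg, target_word):
            outcome = "defender_wins"  # defender correctly inferred the word
            break
    return {"target": target_word, "episode": history, "outcome": outcome}
```

Under this sketch, decided episodes (attacker or defender wins) would serve as the offline self-play trajectories that the reinforcement-learning stage quoted above trains on.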
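
For convenience, the reported hyperparameters from the Experiment Setup row are gathered below as a plain configuration dictionary; the key names are illustrative and do not reflect the repository's actual config schema.

```python
# Hedged summary of the reported training setups; numeric values are taken
# from the paper's excerpt, key names are assumptions for illustration only.
TRAINING_CONFIG = {
    "imitation_learning": {"lr": 5e-6, "kl_coef_beta1": 0.1},
    "spag_training": {"lr": 2e-6, "kl_coef_beta2": 0.2, "sft_coef_alpha": 0.5},
    "alpaca_sft_baseline": {"lr": 2e-6},
    "shared": {
        "batch_size": 128,
        "max_seq_len": 2048,
        "epochs_per_stage": 1,       # one epoch over offline-collected trajectories
        "reward_decay_gamma": 0.8,   # decay parameter γ
        "max_game_turns": 5,         # Adversarial Taboo turn limit
    },
}
```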