Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SPACE: Noise Contrastive Estimation Stabilizes Self-Play Fine-Tuning for Large Language Models

Authors: Yibo Wang, Guangda Huzhang, Qingguo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Lijun Zhang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we present empirical studies to validate the effectiveness of SPACE. We first describe experimental settings, including datasets, pre-trained models, implementations and evaluations. We then report the results with corresponding analyses.
Researcher Affiliation	Collaboration	¹National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; ³Alibaba International Digital Commerce
Pseudocode	Yes	We present the pseudocode to computing the loss function in SPACE as follows: def space_loss(mu, player_real_logps, player_generated_logps, opponent_real_logps, opponent_generated_logps):
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We have reported the complete descriptions of the experiments in Section 4, and have provided the pseudocode in the Appendix E.
Open Datasets	Yes	Following Chen et al. [10], we randomly sample 50k prompts with their corresponding high-quality responses from the Ultrachat200k dataset [15]
Dataset Splits	No	Following Chen et al. [10], we randomly sample 50k prompts with their corresponding high-quality responses from the Ultrachat200k dataset [15], and choose Zephyr-7B-SFT-full [75] and Mistral-7B-Base [35] as pretrained models in experiments. During the training, we first generate synthetic response y for each x with the latest model at each iteration. The resulting synthetic response is then combined with annotated one to update the large language model for the subsequent iteration. We evaluate the performances with different tasks from the Hugging Face Open LLM Leaderboard [6, 18], each targeting a distinct capability of LLMs. These tasks cover a range of domains: science question answering with ARC-Challenge [12] and GPQA [64], mathematical reasoning with GSM8K [13], commonsense inference with Winogrande [67] and HellaSwag [93], multitask language understanding through MMLU [30] and MMLU-Pro [83], truthfulness and factuality with TruthfulQA [43], instruction following using IFEval [97], and complex reasoning with BBH [72]. All tasks are implemented with the default configurations provided by the Language Model Evaluation Harness [20].
Hardware Specification	Yes	All experiments are conducted on a single machine equipped with 8 H100 GPUs, and we report the costs of generation and training in Table 3.
Software Dependencies	No	Our implementation is based on the codebase Alignment Handbook [76] and the Accelerate library [23]. We choose RMSProp [69] with default configurations as the optimizer, and set the global batch size as 64 and the epoch as 2.
Experiment Setup	Yes	Our implementation is based on the codebase Alignment Handbook [76] and the Accelerate library [23]. We choose RMSProp [69] with default configurations as the optimizer, and set the global batch size as 64 and the epoch as 2. For SPACE, we choose to set the generation ratio µ = 1 (i.e., m = n) in our experiments, and we will return to this configuration later.
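
The quoted setup values in the table above (50k sampled prompts, 8 H100 GPUs, global batch size 64, 2 epochs, RMSProp, µ = 1) can be collected into a small summary to check what they imply for a rerun. The following is a minimal sketch, not the authors' code: the class, field, and function names are hypothetical, and the gradient accumulation steps, the sampling seed, and the assumption of one training example per prompt are not stated in the quoted text.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupSummary:
    """Values quoted in the table above; all field names are hypothetical."""
    num_prompts: int = 50_000          # prompts sampled from Ultrachat200k
    num_gpus: int = 8                  # single machine with 8 H100 GPUs
    global_batch_size: int = 64
    num_epochs: int = 2
    optimizer: str = "RMSProp"         # default configuration, per the paper
    generation_ratio_mu: float = 1.0   # mu = 1, i.e. m = n


def per_device_batch(cfg, grad_accum_steps=1):
    """Per-GPU micro-batch size implied by the global batch size.

    grad_accum_steps is an assumption; the quoted text does not report it.
    """
    denom = cfg.num_gpus * grad_accum_steps
    if cfg.global_batch_size % denom != 0:
        raise ValueError("global batch size must be divisible by GPUs x accumulation steps")
    return cfg.global_batch_size // denom


def optimizer_steps_per_epoch(cfg):
    """Optimizer updates per pass over the data, rounding the last batch up.

    Assumes one training example per sampled prompt (an assumption).
    """
    return -(-cfg.num_prompts // cfg.global_batch_size)  # ceiling division


def sample_prompt_indices(pool_size, k, seed):
    """Draw k distinct indices uniformly at random; the seed is a hypothetical choice."""
    return random.Random(seed).sample(range(pool_size), k)


cfg = SetupSummary()
micro_batch = per_device_batch(cfg)            # 64 / 8 GPUs
steps = optimizer_steps_per_epoch(cfg)         # ceil(50000 / 64)
```

Under these assumptions each GPU processes 8 examples per step and one epoch takes 782 optimizer steps. Because the paper reports neither the seed used to sample the 50k prompts nor a released index list, a rerun with `sample_prompt_indices` would draw a different subset, which is consistent with the "Dataset Splits: No" result above.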