Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning

Authors: Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, Kwan-Yee K. Wong

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments have been conducted to validate the effectiveness of our proposed self-play critic. After one round of supervised fine-tuning on Qwen2.5-7B-Instruct and two rounds of iterative reinforcement fine-tuning, our SPC has shown continuously evolving performance on three human-annotated reasoning process assessment benchmarks (ProcessBench [27], PRM800K [23] and DeltaBench [35]).
Researcher Affiliation | Collaboration | (1) The University of Hong Kong, (2) Tencent, (3) Tsinghua University, (4) MBZUAI
Pseudocode | No | The paper describes the methodology in text and uses figures (e.g., Fig. 1, Fig. 2) to illustrate the framework, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Project: https://chen-judge.github.io/SPC/ Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: we upload the data and code.
Open Datasets | Yes | Evaluation: We adopt PRM800K [23], ProcessBench [27], and DeltaBench [35] that include human annotations of mathematical reasoning steps for evaluation. ... Besides, we evaluate the effectiveness of the critic models in assisting LLMs to solve math problems on MATH500 [54] and AIME2024 [37].
Dataset Splits | Yes | For incorrect solutions, we only retain the first incorrect step, while we randomly sample one correct step from a correct solution. We then feed the mixed 1,700 correct steps and 1,700 incorrect steps along with their corresponding partial solutions into the critic models. ... Similarly, we retain the labeled erroneous steps and sample the same number of correct steps, totaling 1,542. ... For the SFT phase of the critic, we utilize the reasoning process data from PRM800K... ultimately obtaining 21.8K data, including 9.4K correct steps and 12.4K incorrect steps. ... The first round of self-play... collect 6.4K data for the critic model for reinforcement learning, with a 1:1 ratio of positive to negative samples. Meanwhile, the sneaky generator receives 6K data, divided equally (2K each) into three scenarios: failing to attack the LLM solver, successfully attacking the LLM solver but losing to the critic, and successfully attacking the LLM solver while defeating the critic. ... The second round of self-play... collect 6.8K data for the critic model, maintaining a 1:1 ratio between positive and negative samples, while continuing to gather 6K data for the sneaky generator, with the three scenarios still evenly distributed at 1/3 each.
Hardware Specification | No | The paper mentions specific LLM models used (e.g., Qwen2.5-7B-Instruct, GPT-4, DeepSeek-R1-Distill-Qwen-7B) and their sizes, but it does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for training or inference.
Software Dependencies | No | The paper mentions various LLM models (e.g., Qwen2.5-7B-Instruct, GPT-4, GPT-4o, DeepSeek-R1-Distill-Qwen-7B) and datasets, but it does not specify versions for any programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | In the SFT initialization phase for both sneaky generator and critic models, we employ a batch size of 64 and a learning rate of 5e-6. We train the models for 3 epochs, with the maximum sequence length set to 4,096. To ensure both stability and convergence during training, we also incorporate a KL penalty into the training loss, setting the KL coefficient at 0.1. During the reinforcement learning of the self-play phase, we keep the batch size as 64 but use a learning rate of 2e-6. Except for setting the KL coefficient at 0.1, we also add an SFT loss with a coefficient of 0.15 to ensure the stability of RL training.
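
The hyperparameters quoted under Experiment Setup can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the names StageConfig, SFT_INIT, SELF_PLAY_RL and total_loss are hypothetical, and the additive form of the combined loss is an assumption, since the quoted text gives the coefficients but not the exact formula. Values the source does not state for the self-play phase (epochs, maximum sequence length) are left as None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (hypothetical container)."""
    batch_size: int
    learning_rate: float
    kl_coef: float
    sft_coef: float = 0.0              # 0.0: no auxiliary SFT loss is quoted for this stage
    epochs: Optional[int] = None       # None: not stated in the quoted text
    max_seq_len: Optional[int] = None  # None: not stated in the quoted text


# SFT initialization of the sneaky generator and the critic (values as quoted).
SFT_INIT = StageConfig(
    batch_size=64, learning_rate=5e-6, kl_coef=0.1, epochs=3, max_seq_len=4096
)

# Reinforcement learning during self-play (values as quoted).
SELF_PLAY_RL = StageConfig(
    batch_size=64, learning_rate=2e-6, kl_coef=0.1, sft_coef=0.15
)


def total_loss(cfg: StageConfig, main_loss: float, kl_penalty: float,
               sft_loss: float = 0.0) -> float:
    """Combine the stage's main loss with weighted auxiliary terms.

    ASSUMPTION: a plain weighted sum; the source reports only the coefficients.
    """
    return main_loss + cfg.kl_coef * kl_penalty + cfg.sft_coef * sft_loss
```

For example, total_loss(SELF_PLAY_RL, main_loss, kl_penalty, sft_loss) weights the KL term by 0.1 and the SFT term by 0.15, while the same call with SFT_INIT ignores the SFT term because its coefficient defaults to zero.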