Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

Authors: Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, Xing Sun

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to a 8B LLM. Evaluation on OOD constraints also confirms the generalizability of our RAIF.
Researcher Affiliation	Collaboration	1 Tencent Youtu Lab; 2 Xiamen University; 3 The Chinese University of Hong Kong
Pseudocode	No	The paper describes methods and provides mathematical formulations, but it does not include any explicit pseudocode blocks or algorithm listings in a structured format.
Open Source Code	Yes	All the codes and data have been available and will be released at https://anonymous.4open.science/r/IRAIF-B3A0/README.md.
Open Datasets	Yes	We start by selecting a set of seed instructions Dseed from the commonly used WildChat [35] and Alpaca [88] datasets. To ensure the diversity of Dseed, we follow [89] to tag each instruction by its topics and tasks for a wide-range selection of task abilities. Details on the tagging and selection process can be found in Sec. A.4.1. We also incorporated DeepScaleR [29] (see Sec. A.5).
Dataset Splits	Yes	All experiments are performed on workstations... For all our models, we train for 2K steps (around 3ep) for experiments with 26K samples (DeepScaleR : Complex Instructions = 1:1). ... We finally obtain the complex instruction dataset of 13K instances (with a retention less than 10%).
Hardware Specification	Yes	All experiments are performed on workstations with 380 CPU cores, 2.2TB memory, and 8 GPUs. The 7B and 8B models are trained with 16 GPUs with 4 GPUs for both the policy actor model and reference model, 4 GPUs for the reward model, and 8 GPUs for vLLM [105] engines. In contrast, the 1.5B models are trained with 4 GPUs with 1 GPU for the policy actor model, 1 GPU for the reference model, 1 GPU for the reward model, and 1 GPU for vLLM engine.
Software Dependencies	No	We use OpenRLHF [94] for both cold-start (Qwen2.5-1.5B/7B [95], LLaMA3.1-8B [2], and Ministral-8B [96]) and warm-start (DeepSeek-Qwen1.5B/7B [19] and DeepScaleR-1.5B [29]) experiments. Detailed settings can be found in Sec. A.6. ... For the SFT experiments in the baselines, we also follow [104] to use the recommended default settings. The detailed settings of the hyper-parameters are presented in Table 13.
Experiment Setup	Yes	We present the details of the hyper-parameter settings in the present study (see Table 12). ... The detailed settings of the hyper-parameters are presented in Table 13.
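
The GPU budgets and training figures quoted in the table can be cross-checked with a little arithmetic. The sketch below is illustrative only: the dictionary layout and function names are hypothetical, and it assumes that "4 GPUs for both the policy actor model and reference model" means the two models share one group of 4 GPUs (the reading under which the quoted total of 16 adds up). The samples-per-step figure is a rough implication of "2K steps (around 3ep)" on 26K samples, not a batch size reported in the table.

```python
# Figures copied from the Hardware Specification and Dataset Splits rows.
# Assumption (not stated in the table): the actor and reference models of
# the 7B/8B runs share a single group of 4 GPUs.
ALLOCATIONS = {
    "7B/8B": {
        "total": 16,
        "roles": {"actor+reference": 4, "reward": 4, "vllm": 8},
    },
    "1.5B": {
        "total": 4,
        "roles": {"actor": 1, "reference": 1, "reward": 1, "vllm": 1},
    },
}


def allocation_is_consistent(entry):
    """Return True when the per-role GPU counts sum to the quoted total."""
    return sum(entry["roles"].values()) == entry["total"]


def implied_samples_per_step(num_samples, epochs, steps):
    """Rough number of samples consumed per optimizer step."""
    return num_samples * epochs / steps


if __name__ == "__main__":
    for name, entry in ALLOCATIONS.items():
        print(name, "consistent:", allocation_is_consistent(entry))
    # 26K samples, about 3 epochs, 2K steps (all approximate in the source).
    print("samples per step:", implied_samples_per_step(26_000, 3, 2_000))
```

Under the alternative reading (4 GPUs each for the actor and the reference model), the 7B/8B roles would sum to 20 rather than 16, so the shared-group reading is the one consistent with the quoted total.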