Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Flexible Realignment of Language Models

Authors: Wenhong Zhu, Ruobing Xie, Weinan Zhang, Rui Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Section 4: Experiments. Section 4.1: Training-time Realignment Evaluation Settings. (a) Models and Baselines: We use DeepSeek-R1-Distill-Qwen-1.5B [3] as our reference model and DeepScaleR-1.5B-Preview (trained on 40K high-quality math problems with 3,800 A100 hours) [14] as our aligned model... (c) Evaluation Dataset: We evaluate on challenging reasoning tasks including AIME-24, AIME-25, and MATH-500 to assess performance. (d) Setup: We realign DeepSeek-R1-Distill-Qwen-1.5B for 200 steps with a batch size of 16. Performance is measured using the Pass@1 metric and token count...
Researcher Affiliation	Collaboration	Wenhong Zhu (1,2), Ruobing Xie (3), Weinan Zhang (1,2), Rui Wang (1). (1) Shanghai Jiao Tong University; (2) Shanghai Innovation Institute; (3) Large Language Department, Tencent
Pseudocode	No	The paper describes methods and equations but does not contain any explicitly labeled pseudocode or algorithm blocks, nor does it present structured, code-like procedural steps.
Open Source Code	Yes	Corresponding author. Code: https://github.com/zwhong714/ReAligner Email: EMAIL
Open Datasets	Yes	Section 4.1: (b) Calibrated Training Datasets: We use the OpenR1-Math-220K dataset [21]... (c) Evaluation Dataset: We evaluate on challenging reasoning tasks including AIME-24, AIME-25, and MATH-500 to assess performance. Section 4.3: (d) Setup: We first train the base models using the UltraChat-200k dataset [24]... Subsequently, we apply DPO on the UltraFeedback dataset [25]...
Dataset Splits	No	The paper uses specific datasets for training (e.g., OpenR1-Math-220K, UltraChat-200k, UltraFeedback) and evaluation (e.g., AIME-24, AIME-25, MATH-500, MT-Bench, AlpacaEval 2, Arena-Hard), and mentions filtering the OpenR1-Math-220K dataset, but it does not specify explicit train/validation/test splits for these datasets within its own experimental setup.
Hardware Specification	No	The paper mentions "A100 GPU hours" when discussing the computational cost of replicating DeepSeek-R1 experiments and the training of DeepScaleR-1.5B-Preview (the baseline model), but it does not explicitly state the hardware used for its own experiments (TrRa and InRa training/inference).
Software Dependencies	No	Appendix E: The training framework utilizes the LLaMA-Factory [39] repository. Appendix E.3: All inferences are conducted using the vLLM engine [20]. While these software tools are mentioned, specific version numbers for them are not provided.
Experiment Setup	Yes	Section 4.1: (d) Setup: We realign DeepSeek-R1-Distill-Qwen-1.5B for 200 steps with a batch size of 16... Each generation has a maximum length of 16384 tokens, with temperature set to 0.7 and top-p set to 0.95. Appendix E.1: The learning rate is set to 2 × 10⁻⁵, and the batch size is 16. Section 4.2: (c) Setup: We train our model for three epochs using a batch size of 128. Appendix E.3: For SFT training, a learning rate of 2e-6 is used for all models.
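
The Research Type and Experiment Setup rows quote a Pass@1 metric, a token count, and fixed decoding settings (maximum length 16384, temperature 0.7, top-p 0.95). The sketch below shows one common way such numbers are aggregated; it is an illustration only, not the authors' evaluation code. The names `SamplingConfig`, `pass_at_1`, and `mean_token_count` are hypothetical, and since this record does not state how many samples per problem the paper draws, the averaging scheme is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Decoding settings as quoted in the Experiment Setup row."""
    max_tokens: int = 16384
    temperature: float = 0.7
    top_p: float = 0.95


def pass_at_1(correct_flags: list[list[bool]]) -> float:
    """Average, over problems, of the fraction of samples judged correct.

    correct_flags[i][j] is True if sample j for problem i was correct.
    Assumption: the per-problem sample count is not given in this record.
    """
    if not correct_flags or any(len(flags) == 0 for flags in correct_flags):
        raise ValueError("every problem needs at least one sample")
    per_problem = [sum(flags) / len(flags) for flags in correct_flags]
    return sum(per_problem) / len(per_problem)


def mean_token_count(token_counts: list[int]) -> float:
    """Average generated length across all sampled responses."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    return sum(token_counts) / len(token_counts)
```

With one sample per problem, `pass_at_1` reduces to plain accuracy; with several samples it averages the per-problem success rate, which reduces variance at temperature 0.7.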