Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Flexible Realignment of Language Models
Authors: Wenhong Zhu, Ruobing Xie, Weinan Zhang, Rui Wang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4: Experiments. Section 4.1: Training-time Realignment Evaluation Settings. (a) Models and Baselines: We use DeepSeek-R1-Distill-Qwen-1.5B [3] as our reference model and DeepScaleR-1.5B-Preview (trained on 40K high-quality math problems with 3,800 A100 hours) [14] as our aligned model... (c) Evaluation Dataset: We evaluate on challenging reasoning tasks including AIME-24, AIME-25, and MATH-500 to assess performance. (d) Setup: We realign DeepSeek-R1-Distill-Qwen-1.5B for 200 steps with a batch size of 16. Performance is measured using the Pass@1 metric and token count... |
| Researcher Affiliation | Collaboration | Wenhong Zhu¹,², Ruobing Xie³, Weinan Zhang¹,², Rui Wang¹; ¹Shanghai Jiao Tong University, ²Shanghai Innovation Institute, ³Large Language Department, Tencent |
| Pseudocode | No | The paper describes methods and equations but does not contain any explicitly labeled pseudocode or algorithm blocks, nor does it present structured, code-like procedural steps. |
| Open Source Code | Yes | Corresponding author. Code: https://github.com/zwhong714/ReAligner Email: EMAIL |
| Open Datasets | Yes | Section 4.1: (b) Calibrated Training Datasets: We use the OpenR1-Math-220K dataset [21]... (c) Evaluation Dataset: We evaluate on challenging reasoning tasks including AIME-24, AIME-25, and MATH-500 to assess performance. Section 4.3: (d) Setup: We first train the base models using the UltraChat-200k dataset [24]... Subsequently, we apply DPO on the UltraFeedback dataset [25]... |
| Dataset Splits | No | The paper uses specific datasets for training (e.g., OpenR1-Math-220K, UltraChat-200k, UltraFeedback) and evaluation (e.g., AIME-24, AIME-25, MATH-500, MT-Bench, AlpacaEval 2, Arena-Hard), and mentions filtering the OpenR1-Math-220K dataset, but it does not specify explicit train/validation/test splits for these datasets within its own experimental setup. |
| Hardware Specification | No | The paper mentions "A100 GPU hours" when discussing the computational cost of replicating DeepSeek-R1 experiments and the training of DeepScaleR-1.5B-Preview (the baseline model), but it does not explicitly state the hardware used for its own experiments (TrRa and InRa training/inference). |
| Software Dependencies | No | Appendix E: The training framework utilizes the LLaMA-Factory [39] repository. Appendix E.3: All inferences are conducted using the vLLM engine [20]. While these software tools are mentioned, specific version numbers for them are not provided. |
| Experiment Setup | Yes | Section 4.1: (d) Setup: We realign DeepSeek-R1-Distill-Qwen-1.5B for 200 steps with a batch size of 16... Each generation has a maximum length of 16384 tokens, with temperature set to 0.7 and top-p set to 0.95. Appendix E.1: The learning rate is set to 2 × 10⁻⁵, and the batch size is 16. Section 4.2: (c) Setup: We train our model for three epochs using a batch size of 128. Appendix E.3: For SFT training, a learning rate of 2e-6 is used for all models. |
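
The Experiment Setup row reports decoding settings and the Pass@1 metric without showing how the two fit together. The following is a minimal sketch, not the authors' evaluation code: the dictionary key names and the helper functions `pass_at_k` and `mean_pass_at_1` are hypothetical, and the number of samples drawn per problem is an assumption because the summary above does not state it. The estimator is the standard unbiased pass@k formula, which for k = 1 reduces to the fraction of correct samples.

```python
from math import comb

# Decoding settings quoted in the Experiment Setup row (Section 4.1).
# The key names are illustrative, not taken from the paper.
GENERATION_CONFIG = {"max_tokens": 16384, "temperature": 0.7, "top_p": 0.95}


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled generations (assumed; not stated in the summary)
    c: number of those generations judged correct
    k: evaluation budget (k = 1 gives Pass@1)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average Pass@1 over problems; results holds (n_samples, n_correct) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical tallies for two problems with four samples each.
    print(mean_pass_at_1([(4, 1), (4, 3)]))
```

Averaging the per-problem estimates, rather than pooling all samples, weights every benchmark problem equally even when sample counts differ.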