Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ARM: Adaptive Reasoning Model

Authors: Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, Yanghua Xiao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive evaluations show that ARM trained with Ada-GRPO achieves comparable performance while using 30% fewer tokens than GRPO (as shown in Figure 1b), across both in-domain and out-of-domain tasks in commonsense, mathematical, and symbolic reasoning. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2× speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens, ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.
Researcher Affiliation	Academia	Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University; The Ohio State University; Shanghai Academy of AI for Science
Pseudocode	No	The paper provides mathematical formulas for Ada-GRPO and its objective in Section 3.2 and Appendix A, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Project Page: https://team-arm.github.io/arm
Open Datasets	Yes	Stage 1: We use AQuA-Rat [26] as the SFT dataset, as its answers can be naturally transformed into four distinct reasoning formats. In addition to the Direct Answer and Short CoT rationales provided with the dataset, we utilize GPT-4o [30] and DeepSeek-R1 [11] to supplement the Code and Long CoT rationales, respectively. [...] Stage 2: To prevent data leakage, we employ three additional datasets exclusively for the RL stage. These datasets cover a range of difficulty levels, from relatively simple commonsense reasoning tasks to more complex mathematical reasoning tasks, including CommonsenseQA (CSQA) [44], GSM8K [6], and MATH [15], collectively comprising 19.8K verifiable question-answer pairs.
Dataset Splits	Yes	To ensure the quality of the generated rationales, we filter out those that lead to incorrect answers, resulting in a training set containing 3.0K multiple-choice and 7.8K open-form questions, each with four reasoning formats. [...] As a validation set, we sample 10% of the training data and keep the checkpoint with the lowest perplexity on the validation set for testing and the second stage.
Hardware Specification	Yes	Our training is performed using 8 NVIDIA A800 GPUs.
Software Dependencies	No	We utilize the open-source training framework LLAMAFACTORY [58] to perform SFT. [...] We utilize the open-source training framework VeRL [40] to perform RL. The paper mentions software frameworks like LLAMAFACTORY, LoRA, DeepSpeed, and VeRL but does not provide specific version numbers for these or any other software components.
Experiment Setup	Yes	D.1 Stage 1: SFT: The training is conducted with a batch size of 128 and a learning rate of 2e-4. We adopt a cosine learning rate scheduler with a 10% warm-up period over 6 epochs. [...] D.2 Stage 2: RL: During training, we use a batch size of 1024 and generate 8 rollouts per prompt (G = 8), with a maximum rollout length of 4096 tokens. The model is trained with a mini-batch size of 180, a KL loss coefficient of 1e-3, and a total of 9 training epochs. The default sampling temperature is set to 1.0.
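
The Consensus-Guided Mode quoted in the Research Type row (aggregate the three efficient formats, fall back to Long CoT on disagreement) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the format labels, the `answer_fn` callback, and the choice of unanimous agreement as the consensus criterion are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

# Hypothetical labels for the three efficient formats and the fallback format.
EFFICIENT_FORMATS: Tuple[str, ...] = ("direct_answer", "short_cot", "code")
FALLBACK_FORMAT = "long_cot"


def consensus_guided_answer(
    question: str,
    answer_fn: Callable[[str, str], str],
) -> Dict[str, str]:
    """Return the shared answer if all efficient formats agree, else use Long CoT.

    answer_fn(question, fmt) is a hypothetical callback that queries a model
    in the given reasoning format and returns its final answer as a string.
    """
    answers = [answer_fn(question, fmt).strip() for fmt in EFFICIENT_FORMATS]

    # Assumption: "consensus" means all three efficient formats give the same answer.
    if len(set(answers)) == 1:
        return {"answer": answers[0], "format": "consensus"}

    # Disagreement: spend more tokens on the Long CoT format.
    fallback = answer_fn(question, FALLBACK_FORMAT).strip()
    return {"answer": fallback, "format": FALLBACK_FORMAT}


def stub_model(question: str, fmt: str) -> str:
    """Canned answers standing in for a real model call (illustration only)."""
    canned = {"direct_answer": "12", "short_cot": "12", "code": "13", "long_cot": "12"}
    return canned[fmt]


result = consensus_guided_answer("What is 3 * 4?", stub_model)
print(result)
```

In this example the stubbed Code format disagrees with the other two, so the sketch falls back to Long CoT. A majority-vote criterion would be a different design choice and is not what the quoted description suggests.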