Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

RAST: Reasoning Activation in LLMs via Small-model Transfer

Authors: Siru Ouyang, Xinyu Zhu, Zilin Xiao, Minhao Jiang, Yu Meng, Jiawei Han

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across multiple mathematical reasoning benchmarks show that RAST substantially and consistently enhances the reasoning capabilities of base models while requiring significantly lower GPU memory than direct RL training, sometimes even yielding better performance than the RL-trained counterparts.
Researcher Affiliation | Collaboration | 1 University of Illinois Urbana-Champaign, 2 University of Virginia, 3 Rice University, 4 GE HealthCare
Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations (e.g., Equation 3) along with an illustrative figure (Figure 2), but does not contain explicitly structured pseudocode or algorithm blocks.
Open Source Code | Yes | The project page of RAST is available at https://ozyyshr.github.io/RAST/. Additionally, the NeurIPS checklist states: "Answer: [Yes] Justification: We released our code, and provided experimental details to reproduce our results."
Open Datasets | Yes | All datasets or benchmarks used in this paper are publicly available online. For mathematical reasoning tasks, we include 6 widely used datasets, detailed below: MATH500 [18], Minerva [30], OlympiadBench [17], GSM8K [6], as well as competition-level benchmarks AIME24 and AMC23.
Dataset Splits | Yes | MATH500 (could be found at https://huggingface.co/datasets/HuggingFaceH4/MATH-500) is a non-standard train/test split of the original MATH dataset [18], following [33] to avoid the risk of over-fitting and for more efficient testing configurations. These 500 test problems are selected uniformly at random, and are representative of the test set as a whole. The test set of GSM8K (could be found at https://huggingface.co/datasets/openai/gsm8k) includes 1,319 problems in total.
Hardware Specification | Yes | Our experiments are conducted over 8 NVIDIA A6000 GPUs on a single node, with GPU utilization and tensor parallelism parameters dynamically adjusted based on the model size.
Software Dependencies | No | The paper mentions specific tools like "vLLM [27]" and "EvalPlus [38, 39]" but does not provide specific version numbers for these or other key software components (e.g., programming languages, libraries) used in the experiments.
Experiment Setup | Yes | For mathematical reasoning, decoding is performed using a temperature setting of 1.0 and nucleus sampling with a top-p of 0.95, allowing a maximum generation length of 16,384 tokens, consistent with prior work [76]. We set λ in Equation 3 to 1.0 for all experiments. Appendix B, Table 5 further details decoding configurations for different models.
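
To make the decoding settings in the Experiment Setup entry concrete, the following is a minimal sketch of temperature scaling followed by nucleus (top-p) sampling, using the quoted values (temperature 1.0, top-p 0.95, 16,384 maximum tokens). It is an illustration of the general technique only, not the authors' code: the function names `softmax`, `nucleus`, and `sample_token` are hypothetical, and the sketch does not attempt to reproduce Equation 3 or the role of λ, which are not spelled out on this page.

```python
import math
import random

# Decoding values quoted in the Experiment Setup entry.
TEMPERATURE = 1.0
TOP_P = 0.95
MAX_NEW_TOKENS = 16_384  # generation cap; not used by the single-step sampler below


def softmax(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    mass = sum(p for _, p in kept)
    return [(token_id, p / mass) for token_id, p in kept]


def sample_token(logits, temperature=TEMPERATURE, top_p=TOP_P, rng=random):
    """Sample one token id: temperature scaling, top-p filtering, then a draw."""
    kept = nucleus(softmax(logits, temperature), top_p)
    ids = [token_id for token_id, _ in kept]
    weights = [p for _, p in kept]
    return rng.choices(ids, weights=weights, k=1)[0]
```

With a temperature of 1.0 the logits are left unscaled, so the only filtering comes from the top-p cutoff, which discards the low-probability tail before sampling. In vLLM, which the paper cites without a version number, the same three settings would presumably be passed as `SamplingParams(temperature=1.0, top_p=0.95, max_tokens=16384)`; the page does not confirm the exact invocation.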