Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Ada-R1: Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization

Authors: Haotian Luo, Haiying He, Yibo Wang, Jinluan Yang, Rui Liu, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, Li Shen

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments demonstrate that our method significantly reduces inference costs compared to other baseline approaches, while maintaining performance. Notably, on five mathematical datasets, the average length of reasoning is reduced by more than 50%, highlighting the potential of adaptive strategies to optimize reasoning efficiency in large language models.
Researcher Affiliation	Collaboration	1 Shenzhen Campus of Sun Yat-sen University; 2 Center for AI Theoretical Foundation and Systems, Shenzhen Loop Area Institute; 3 China Agricultural University; 4 Tsinghua University; 5 Zhejiang University; 6 Didichuxing Co. Ltd; 7 Nanyang Technological University
Pseudocode	No	The paper describes its methods through textual explanations and mathematical formulations, but does not include a distinct 'Pseudocode' or 'Algorithm' block.
Open Source Code	No	Our code is coming soon at https://github.com/StarDewXXX/AdaR1
Open Datasets	Yes	Dataset. Following s1[35] and Light-R1[36], we construct a mixed training dataset to ensure coverage across mathematical problems of varying difficulty levels. Specifically, we combine GSM8K, MATH, and AIME datasets in a ratio of 1:3:1, resulting in a total of 2,500 diverse math problems.
Dataset Splits	Yes	Evaluation. We use the GSM8K test set, the MATH test set, and AIME25 as in-distribution evaluation data, while Olympiad[37] and Minerva[38] are employed as out-of-distribution test sets.
Hardware Specification	Yes	For both models, we selected 2,500 problems from the mixed Mathematics as training data. For each problem, we sample 12 times. From each set of solutions, we randomly selected 2 solutions for training. After computing the rewards, we normalized the reward values. Both models are trained with 8 * A800-80G GPUs.
Software Dependencies	No	The paper lists hyperparameters for training in Table 7 but does not specify versions of software libraries or frameworks used (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	Table 7: Hyperparameters for the Deepseek-Distill-1.5B and Deepseek-Distill-7B (values given as 1.5B / 7B). cutoff_len: 4096 / 4096; batch_size: 32 / 32; learning_rate: 5.0e-7 / 5.0e-7; num_train_epochs: 2.0 / 2.0; lr_scheduler_type: constant / constant; M1: 4 / 4; M2: 2 / 2; beta: 0.05 / 0.1
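
For readers who want to work with the reported setup, the sketch below restates two rows of the table in code: the 1:3:1 GSM8K/MATH/AIME mixture over 2,500 problems, and the Table 7 hyperparameters. It is an illustration only, not the authors' code. The helper names (`split_by_ratio`, `config_for`) are hypothetical, and it assumes that the two values in each Table 7 row are ordered as in the caption (1.5B first, then 7B).

```python
def split_by_ratio(total, ratio):
    """Split `total` items across named sources in proportion to integer weights."""
    weight_sum = sum(ratio.values())
    counts = {name: total * weight // weight_sum for name, weight in ratio.items()}
    # Integer division can leave a few items unassigned; hand them to the
    # heaviest sources first so the counts always add up to `total`.
    leftover = total - sum(counts.values())
    for name in sorted(ratio, key=ratio.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts


# Training mixture quoted in the "Open Datasets" row: 1:3:1 over 2,500 problems.
MIX = split_by_ratio(2500, {"GSM8K": 1, "MATH": 3, "AIME": 1})

# Values copied from the "Experiment Setup" row (Table 7 of the paper).
SHARED = {
    "cutoff_len": 4096,
    "batch_size": 32,
    "learning_rate": 5.0e-7,
    "num_train_epochs": 2.0,
    "lr_scheduler_type": "constant",
    "M1": 4,
    "M2": 2,
}

# Assumption: the first value in the table belongs to the 1.5B model.
BETA = {"Deepseek-Distill-1.5B": 0.05, "Deepseek-Distill-7B": 0.1}


def config_for(model):
    """Return the full hyperparameter set for one of the two models."""
    return {**SHARED, "beta": BETA[model]}


if __name__ == "__main__":
    print(MIX)
    print(config_for("Deepseek-Distill-7B"))
```

Under this reading, the mixture works out to 500 GSM8K, 1,500 MATH, and 500 AIME problems, and beta is the only hyperparameter that differs between the two models.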