Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Ada-R1: Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization
Authors: Haotian Luo, Haiying He, Yibo Wang, Jinluan Yang, Rui Liu, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, Li Shen
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our method significantly reduces inference costs compared to other baseline approaches, while maintaining performance. Notably, on five mathematical datasets, the average length of reasoning is reduced by more than 50%, highlighting the potential of adaptive strategies to optimize reasoning efficiency in large language models. |
| Researcher Affiliation | Collaboration | 1 Shenzhen Campus of Sun Yat-sen University; 2 Center for AI Theoretical Foundation and Systems, Shenzhen Loop Area Institute; 3 China Agricultural University; 4 Tsinghua University; 5 Zhejiang University; 6 Didichuxing Co. Ltd; 7 Nanyang Technological University |
| Pseudocode | No | The paper describes its methods through textual explanations and mathematical formulations, but does not include a distinct 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | Our code is coming soon at https://github.com/StarDewXXX/AdaR1 |
| Open Datasets | Yes | Dataset. Following s1[35] and Light-R1[36], we construct a mixed training dataset to ensure coverage across mathematical problems of varying difficulty levels. Specifically, we combine GSM8K, MATH, and AIME datasets in a ratio of 1:3:1, resulting in a total of 2,500 diverse math problems. |
| Dataset Splits | Yes | Evaluation. We use the GSM8K test set, the MATH test set, and AIME25 as in-distribution evaluation data, while Olympiad[37] and Minerva[38] are employed as out-of-distribution test sets. |
| Hardware Specification | Yes | For both models, we selected 2,500 problems from the mixed Mathematics as training data. For each problem, we sample 12 times. From each set of solutions, we randomly selected 2 solutions for training. After computing the rewards, we normalized the reward values. Both models are trained with 8 * A800-80G GPUs. |
| Software Dependencies | No | The paper lists hyperparameters for training in Table 7 but does not specify versions of software libraries or frameworks used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Table 7: Hyperparameters for the Deepseek-Distill-1.5B and Deepseek-Distill-7B (values listed in that order). cutoff_len: 4096 / 4096; batch_size: 32 / 32; learning_rate: 5.0e-7 / 5.0e-7; num_train_epochs: 2.0 / 2.0; lr_scheduler_type: constant / constant; M1: 4 / 4; M2: 2 / 2; beta: 0.05 / 0.1 |
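
The figures quoted in the Open Datasets and Experiment Setup rows can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper's code: the helper `split_by_ratio` and the dictionary `TABLE7` are hypothetical names, and the mapping of the first and second value in each Table 7 row to the 1.5B and 7B models is an assumption based on the order in the caption.

```python
def split_by_ratio(total, ratios):
    """Split `total` items across sources in proportion to integer `ratios`.

    Hypothetical helper: any remainder left by integer division is handed
    out one item at a time to the sources in insertion order.
    """
    weight_sum = sum(ratios.values())
    counts = {name: total * weight // weight_sum for name, weight in ratios.items()}
    leftover = total - sum(counts.values())
    for name in list(ratios)[:leftover]:
        counts[name] += 1
    return counts


# Training mix quoted in the report: GSM8K, MATH and AIME in a 1:3:1 ratio,
# 2,500 problems in total.
mix = split_by_ratio(2500, {"GSM8K": 1, "MATH": 3, "AIME": 1})

# Table 7 values as quoted in the report. Assumption: the first value in each
# row belongs to the 1.5B model and the second to the 7B model.
shared = {
    "cutoff_len": 4096,
    "batch_size": 32,
    "learning_rate": 5.0e-7,
    "num_train_epochs": 2.0,
    "lr_scheduler_type": "constant",
    "M1": 4,
    "M2": 2,
}
TABLE7 = {
    "Deepseek-Distill-1.5B": {**shared, "beta": 0.05},
    "Deepseek-Distill-7B": {**shared, "beta": 0.1},
}

if __name__ == "__main__":
    print(mix)
    for model, params in TABLE7.items():
        print(model, params)
```

Under these assumptions the 1:3:1 ratio corresponds to 500 GSM8K, 1,500 MATH and 500 AIME problems, and the two training configurations differ only in `beta`.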