Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

AdaReasoner: Adaptive Reasoning Enables More Flexible Thinking

Authors: Xiangqi Wang, Yue Huang, Yanbo Wang, Xiaonan Luo, Kehan Guo, Yujun Zhou, Xiangliang Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To evaluate the performance of AdaReasoner, we selected datasets that engage distinct cognitive processes, ranging from logical and mathematical to figurative and generative reasoning. MMLU: This is a collection of data examples that are in the Math category from the Massive Multitask Language Understanding (MMLU) benchmark [12], focusing on numerical reasoning, symbolic manipulation, and procedural problem solving. Metaphor [49]: This dataset focuses on evaluating whether a highlighted word in context is used metaphorically in the context. TruthfulQA [21]: This dataset tests LLM trustworthy generation by posing questions with common misconceptions or false premises. LogiQA [23]: This dataset is designed for multi-step logical reasoning based on Chinese civil service exam questions. Each dataset contributes 250 samples, randomly sampled from the full dataset. The combined dataset is then divided into a training set of 100 samples and a test set of 900 samples forming thus a few-shot setting. Examples of the four datasets are displayed at Table 5 and distribution of each dataset is shown at Figure 5. Baselines. We compare AdaReasoner with several baselines that adopt different strategies to improve LLM reasoning: CoT (Chain-of-Thought) [54]: Prompts the model to think step-by-step for reasoning. Think Short: Prompts the model for brief, quick responses with prompt at Figure 10. ToT (Tree-of-Thought) [57]: Structures reasoning path as a tree, exploring and selecting among multiple paths. Best-of-N [16]: Produces N candidate chains, selects the best based on a predefined scoring metric. Auto-CoT [58]: For each query, retrieve semantically nearest exemplars from a few-shot pool (via embedding clustering), generate CoT rationales, and concatenate the question-rationale-answer triplets as the in-context prompt; other settings follow the original. In-context CoT (ICL) [5]: Leverages in-context CoT generation by presenting examples of few-shot train set directly within the prompt. Evaluation and other details. To evaluate the alignment between LLM-generated responses and the ground truth, we adopt the LLM-as-a-Judge paradigm [60], utilizing GPT-4o to assess both the semantic equivalence of answers and the quality of their explanations through dedicated judgment prompts, as illustrated in Figure 8.
Researcher Affiliation Academia Xiangqi Wang¹, Yue Huang¹, Yanbo Wang², Xiaonan Luo¹, Kehan Guo¹, Yujun Zhou¹, Xiangliang Zhang¹; ¹University of Notre Dame, ²MBZUAI
Pseudocode Yes A AdaReasoner Algorithm. Algorithm 1: AdaReasoner Algorithm
Open Source Code Yes Data code is provided via an anonymous GitHub repository link: https://anonymous.4open.science/r/officialadareasoner-B9B
Open Datasets Yes To evaluate the performance of AdaReasoner, we selected datasets that engage distinct cognitive processes, ranging from logical and mathematical to figurative and generative reasoning. MMLU: This is a collection of data examples that are in the Math category from the Massive Multitask Language Understanding (MMLU) benchmark [12], focusing on numerical reasoning, symbolic manipulation, and procedural problem solving. Metaphor [49]: This dataset focuses on evaluating whether a highlighted word in context is used metaphorically in the context. TruthfulQA [21]: This dataset tests LLM trustworthy generation by posing questions with common misconceptions or false premises. LogiQA [23]: This dataset is designed for multi-step logical reasoning based on Chinese civil service exam questions.
Dataset Splits Yes Each dataset contributes 250 samples, randomly sampled from the full dataset. The combined dataset is then divided into a training set of 100 samples and a test set of 900 samples forming thus a few-shot setting. Examples of the four datasets are displayed at Table 5 and distribution of each dataset is shown at Figure 5.
Hardware Specification No The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies No The paper mentions "pre-trained BERT model [55]" and a "DeBERTa-based" reward model [30] and "DeBERTa in huggingface" as components of their system, but does not provide specific version numbers for these software dependencies or any other ancillary software.
Experiment Setup Yes Evaluation and other details. To evaluate the alignment between LLM-generated responses and the ground truth, we adopt the LLM-as-a-Judge paradigm [60], utilizing GPT-4o to assess both the semantic equivalence of answers and the quality of their explanations through dedicated judgment prompts, as illustrated in Figure 8. In each evaluation, the top p parameter is set to 0.1 and the max token parameter is set to 5,000, with no system prompt utilized. We randomly select 100 out of 1,000 samples as few-shot examples for AdaReasoner and ICL. ToT uses a beam width of 2 and a max length of 3. Baselines follow default settings with in-context examples from the same dataset and type. AdaReasoner uses a fixed learning rate of 0.01, BERT embeddings (768-d) for the input question, and a 3-layer MLP for each policy head.
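
The Dataset Splits and Experiment Setup rows above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes a uniform random split of the combined pool (the quoted text does not say whether the split is stratified), a fixed seed that the paper does not report, and placeholder data in place of the real datasets. The names `ReportedSetup` and `build_few_shot_split`, along with all field names, are hypothetical; only the numeric values come from the quoted passages.

```python
import random
from dataclasses import dataclass

DATASETS = ("MMLU", "Metaphor", "TruthfulQA", "LogiQA")


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    top_p: float = 0.1
    max_tokens: int = 5000
    tot_beam_width: int = 2
    tot_max_length: int = 3
    learning_rate: float = 0.01
    embedding_dim: int = 768      # BERT embedding size for the input question
    policy_head_layers: int = 3   # MLP depth for each policy head


def build_few_shot_split(pools, per_dataset=250, n_train=100, seed=0):
    """Sample per_dataset items from each pool, then split into train/test."""
    rng = random.Random(seed)  # assumption: the paper does not report a seed
    combined = []
    for name in DATASETS:
        sampled = rng.sample(pools[name], per_dataset)
        combined.extend((name, item) for item in sampled)
    rng.shuffle(combined)  # assumption: uniform split over the combined pool
    return combined[:n_train], combined[n_train:]


# Placeholder pools standing in for the real datasets (sizes are made up).
pools = {name: [f"{name}-{i}" for i in range(400)] for name in DATASETS}
train, test = build_few_shot_split(pools)
print(len(train), len(test))  # 100 900
```

With four datasets contributing 250 samples each, the combined pool holds 1,000 items, so a 100-item training set leaves 900 for testing, matching the figures quoted in the table.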