Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

DynaAct: Large Language Model Reasoning with Dynamic Action Spaces

Authors: Xueliang Zhao, Wei Wu, Jian Guan, Qintong Li, Lingpeng Kong

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on six diverse standard benchmarks demonstrate that our approach significantly improves overall performance, while maintaining efficient inference without introducing substantial latency. The implementation is available at https://github.com/zhaoxlpku/DynaAct.
Researcher Affiliation Collaboration Xueliang Zhao, Wei Wu, Jian Guan, Qintong Li, Lingpeng Kong; The University of Hong Kong; Ant Group
Pseudocode Yes Algorithm 1 Complete Pipeline of Sequential Reasoning.
Require: Input question q, dataset D, number of groups k, candidate selection budget m, maximum reasoning steps T
/* Proxy Action Space Estimation (performed once) */
Partition the dataset D into k groups: {D_1, D_2, ..., D_k}
for i = 1 to k do
    Extract an observation sketch o_i = a_1, a_2, ..., a_{|o_i|} ← LLMQuery(D_i)
Form the action space A = ⋃_{i=1}^{k} o_i = ⋃_{i=1}^{k} a_1, a_2, ..., a_{|o_i|}
Train the embedding function e using Q-learning objective (Eq. (5)) with observation sketches as demonstration data
s_0 ← InitializeState(q)
for t = 0 to T − 1 do
    /* Candidate Action Selection via Greedy Algorithm */
    X ← ∅
    for i = 1 to m do
        X_d ← A \ X
        a* ← argmax_{a ∈ X_d} F(X ∪ {a}; s_t)
        X ← X ∪ {a*}
    A_t(s_t) ← X
    /* Action Evaluation using MCTS */
    for all a ∈ A_t(s_t) do
        Q(s_t, a) ← MCTS(s_t, a)
    /* Action Selection and State Update */
    a_t ← argmax_{a ∈ A_t(s_t)} Q(s_{t−1}, a)
    s_{t+1} ← UpdateState(s_t, a_t)
Output: {s_0, a_0, s_1, ..., s_T}
Open Source Code Yes The implementation is available at https://github.com/zhaoxlpku/DynaAct.
Open Datasets Yes We employ six standard benchmarks covering three domains: general, reasoning, and math. Specifically, we use the following datasets: (1) MMLU [Hendrycks et al., 2020]; (2) MMLU-Pro [Wang et al., 2024b]; (3) GPQA [Rein et al., 2023]; (4) ARC-challenge (ARC-C) [Clark et al., 2018]; (5) GSM8K [Cobbe et al., 2021]; and (6) MATH-500 [Lightman et al., 2023].
Dataset Splits Yes We employ six standard benchmarks covering three domains: general, reasoning, and math. Specifically, we use the following datasets: (1) MMLU [Hendrycks et al., 2020]; (2) MMLU-Pro [Wang et al., 2024b]; (3) GPQA [Rein et al., 2023]; (4) ARC-challenge (ARC-C) [Clark et al., 2018]; (5) GSM8K [Cobbe et al., 2021]; and (6) MATH-500 [Lightman et al., 2023]. ... Evaluation results on the Level 3, Level 4, and Level 5 subsets of MATH-500.
Hardware Specification Yes All measurements were taken on an 8×A100 GPU machine using the MATH-500 dataset, with each method evaluated under consistent rollout settings.
Software Dependencies Yes For embedding efficiency, we use Llama-3.2-1B-Instruct [Dubey et al., 2024] as the backbone... We use Llama-3.1-8B-Instruct [Dubey et al., 2024] as the world model [Hao et al., 2023]...
Experiment Setup Yes For the proxy action space estimation (§3.1), we use the Open-Platypus [Lee et al., 2023] corpus... We divide the corpus into k = 2,500 groups... In our submodular function definition (§3.2), we set the balancing parameters α = 0.9 and β = 0.1... The embedding function is fine-tuned using the Q-learning objective, with a total of 83,083 state-action pairs, and the learning rate is set to 1e-5. For the action space construction (§3.3), we set the size of the candidate action set A_t at each time step m = 5.
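
The greedy candidate selection step quoted in the Pseudocode row (start from an empty set X and add, m times, the action that maximizes F(X ∪ {a}; s_t)) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name greedy_select, the toy objective, and the RELEVANCE and TOPIC tables are hypothetical and exist only for illustration. The paper's actual F is defined in its §3.2 and depends on the state s_t through a learned embedding, which this toy version omits. Only α = 0.9, β = 0.1, and m = 5 are taken from the Experiment Setup row.

```python
from typing import Callable, FrozenSet, Iterable, List

ALPHA, BETA = 0.9, 0.1  # balancing parameters reported in the Experiment Setup row
BUDGET_M = 5            # candidate set size m reported in the Experiment Setup row


def greedy_select(
    actions: Iterable[str],
    objective: Callable[[FrozenSet[str]], float],
    budget: int,
) -> List[str]:
    """Pick up to `budget` actions, each time adding the one that maximizes
    objective(X ∪ {a}) over the actions not yet selected."""
    remaining = list(dict.fromkeys(actions))  # de-duplicate, keep input order
    selected: List[str] = []
    for _ in range(min(budget, len(remaining))):
        best = max(remaining, key=lambda a: objective(frozenset(selected + [a])))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: a per-action relevance score and a coarse topic label.
RELEVANCE = {
    "decompose": 0.9, "verify": 0.8, "recall formula": 0.7,
    "guess": 0.1, "restate": 0.4, "check units": 0.6,
}
TOPIC = {
    "decompose": "plan", "verify": "check", "recall formula": "knowledge",
    "guess": "other", "restate": "plan", "check units": "check",
}


def toy_objective(subset: FrozenSet[str]) -> float:
    """Assumed stand-in for F: weighted sum of a utility term and a diversity term."""
    utility = sum(RELEVANCE[a] for a in subset)
    diversity = len({TOPIC[a] for a in subset})  # number of distinct topics covered
    return ALPHA * utility + BETA * diversity


if __name__ == "__main__":
    print(greedy_select(RELEVANCE, toy_objective, BUDGET_M))
    # ['decompose', 'verify', 'recall formula', 'check units', 'restate']
```

With this toy objective, the low-relevance action "guess" is the only one left out at m = 5. Each greedy step evaluates the objective once per remaining action, so one selection costs on the order of m·|A| objective calls.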