Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DynaAct: Large Language Model Reasoning with Dynamic Action Spaces
Authors: Xueliang Zhao, Wei Wu, Jian Guan, Qintong Li, Lingpeng Kong
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on six diverse standard benchmarks demonstrate that our approach significantly improves overall performance, while maintaining efficient inference without introducing substantial latency. The implementation is available at https://github.com/zhaoxlpku/DynaAct. |
| Researcher Affiliation | Collaboration | Xueliang Zhao, Wei Wu, Jian Guan, Qintong Li, Lingpeng Kong; The University of Hong Kong; Ant Group |
| Pseudocode | Yes | Algorithm 1: Complete Pipeline of Sequential Reasoning. Require: input question $q$, dataset $D$, number of groups $k$, candidate selection budget $m$, maximum reasoning steps $T$. 1: /* Proxy Action Space Estimation (performed once) */ 2: Partition the dataset $D$ into $k$ groups: $\{D_1, D_2, \ldots, D_k\}$ 3: for $i = 1$ to $k$ do 4: Extract an observation sketch $o_i = a_1, a_2, \ldots, a_{\lvert o_i \rvert} \leftarrow \text{LLMQuery}(D_i)$ 5: Form the action space $\mathcal{A} = \bigcup_{i=1}^{k} o_i = \bigcup_{i=1}^{k} a_1, a_2, \ldots, a_{\lvert o_i \rvert}$ 6: Train the embedding function $e$ using the Q-learning objective (Eq. (5)) with observation sketches as demonstration data 7: $s_0 \leftarrow \text{InitializeState}(q)$ 8: for $t = 0$ to $T - 1$ do 9: /* Candidate Action Selection via Greedy Algorithm */ 10: $X \leftarrow \emptyset$ 11: for $i = 1$ to $m$ do 12: $X_d \leftarrow \mathcal{A} \setminus X$ 13: $a^{\ast} \leftarrow \arg\max_{a \in X_d} F(X \cup \{a\}; s_t)$ 14: $X \leftarrow X \cup \{a^{\ast}\}$ 15: $\mathcal{A}_t(s_t) \leftarrow X$ 16: /* Action Evaluation using MCTS */ 17: for all $a \in \mathcal{A}_t(s_t)$ do 18: $Q(s_t, a) \leftarrow \text{MCTS}(s_t, a)$ 19: /* Action Selection and State Update */ 20: $a_t \leftarrow \arg\max_{a \in \mathcal{A}_t(s_t)} Q(s_t, a)$ 21: $s_{t+1} \leftarrow \text{UpdateState}(s_t, a_t)$ Output: $\{s_0, a_0, s_1, \ldots, s_T\}$ |
| Open Source Code | Yes | The implementation is available at https://github.com/zhaoxlpku/DynaAct. |
| Open Datasets | Yes | We employ six standard benchmarks covering three domains: general, reasoning, and math. Specifically, we use the following datasets: (1) MMLU [Hendrycks et al., 2020]; (2) MMLU-Pro [Wang et al., 2024b]; (3) GPQA [Rein et al., 2023]; (4) ARC-challenge (ARC-C) [Clark et al., 2018]; (5) GSM8K [Cobbe et al., 2021]; and (6) MATH-500 [Lightman et al., 2023]. |
| Dataset Splits | Yes | We employ six standard benchmarks covering three domains: general, reasoning, and math. Specifically, we use the following datasets: (1) MMLU [Hendrycks et al., 2020]; (2) MMLU-Pro [Wang et al., 2024b]; (3) GPQA [Rein et al., 2023]; (4) ARC-challenge (ARC-C) [Clark et al., 2018]; (5) GSM8K [Cobbe et al., 2021]; and (6) MATH-500 [Lightman et al., 2023]. ... Evaluation results on the Level 3, Level 4, and Level 5 subsets of MATH-500. |
| Hardware Specification | Yes | All measurements were taken on an 8×A100 GPU machine using the MATH-500 dataset, with each method evaluated under consistent rollout settings. |
| Software Dependencies | Yes | For embedding efficiency, we use Llama-3.2-1B-Instruct [Dubey et al., 2024] as the backbone... We use Llama-3.1-8B-Instruct [Dubey et al., 2024] as the world model [Hao et al., 2023]... |
| Experiment Setup | Yes | For the proxy action space estimation (§3.1), we use the Open-Platypus [Lee et al., 2023] corpus... We divide the corpus into k = 2,500 groups... In our submodular function definition (§3.2), we set the balancing parameters α = 0.9 and β = 0.1... The embedding function is fine-tuned using the Q-learning objective, with a total of 83,083 state-action pairs, and the learning rate is set to 1e-5. For the action space construction (§3.3), we set the size of the candidate action set $\mathcal{A}_t$ at each time step m = 5. |
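
The greedy candidate selection quoted in the Pseudocode row (steps 10 to 15 of Algorithm 1) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `greedy_select` and `toy_score` are hypothetical names, and `toy_score` is a stand-in objective invented for illustration (word overlap with the question plus a small diversity bonus), since the excerpt does not define the submodular function F. Only the weights 0.9 and 0.1 and the budget parameter m are taken from the quoted setup.

```python
from typing import Callable, List, Sequence


def greedy_select(
    actions: Sequence[str],
    score: Callable[[frozenset, str], float],
    state: str,
    m: int,
) -> List[str]:
    """Greedily build a candidate set X of at most m actions.

    Each round adds the remaining action a that maximizes score(X | {a}, state),
    mirroring steps 10-15 of the quoted Algorithm 1.
    """
    selected: List[str] = []
    for _ in range(m):
        remaining = [a for a in actions if a not in selected]
        if not remaining:  # action space exhausted before the budget was used
            break
        base = frozenset(selected)
        best = max(remaining, key=lambda a: score(base | {a}, state))
        selected.append(best)
    return selected


def toy_score(subset: frozenset, state: str, alpha: float = 0.9, beta: float = 0.1) -> float:
    """Hypothetical objective, NOT the paper's F.

    utility   = number of words each action shares with the question
    diversity = number of distinct leading verbs in the subset
    """
    state_words = set(state.split())
    utility = sum(len(set(a.split()) & state_words) for a in subset)
    diversity = len({a.split()[0] for a in subset})
    return alpha * utility + beta * diversity


# Illustrative inputs only; these actions are made up for the example.
actions = [
    "simplify the equation",
    "simplify the fraction",
    "check the units",
    "draw a diagram",
]
state = "solve the equation for x"
picked = greedy_select(actions, toy_score, state, m=2)
print(picked)
```

With these toy inputs the second pick favors an action with a different leading verb over a near-duplicate of the first, which is the kind of trade-off a utility-plus-diversity objective is meant to encode. In the quoted pipeline, the selected set would then be scored by MCTS rather than used directly.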