Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

PlanU: Large Language Model Reasoning through Planning under Uncertainty

Authors: Ziwei Deng, Mian Deng, Chenjing Liang, Zeming Gao, Chennan Ma, Chenxing Lin, Haipeng Zhang, Songzhu Mei, Siqi Shen, Cheng Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we demonstrate the effectiveness of PlanU in LLM-based reasoning tasks under uncertainty. ... 5 Evaluation: We study the performance of PlanU on five decision-making benchmarks: Blocksworld, Overcooked, VirtualHome, TravelPlanner and WebShop. We show that PlanU can obtain promising performance across multiple decision-making scenarios under uncertainty. Ablation studies reveal the importance of measuring uncertainty using value distribution and the UCC score for achieving good performance.
Researcher Affiliation | Academia | (a) Fujian Key Laboratory of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University (XMU), China; (b) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, XMU, China; (c) School of Computer, National University of Defense Technology, China. EMAIL, EMAIL, {sz.mei}@nudt.edu.cn
Pseudocode | Yes | A PlanU Algorithm; A.1 Algorithm: The PlanU algorithm is described in Algorithm 1. Algorithm 1: The PlanU Algorithm
Open Source Code | Yes | Answer: [Yes] Justification: We have included the code and experimental results in the supplementary materials.
Open Datasets | Yes | We study the performance of PlanU on five decision-making benchmarks: Blocksworld, Overcooked, VirtualHome, TravelPlanner and WebShop. ... Answer: [Yes] Justification: All existing assets used in this paper are properly credited. We have cited the original papers that produced the code packages and datasets. The specific versions of the assets used are mentioned, and URLs to the sources are provided in Appendix C.
Dataset Splits | Yes | The tasks are categorized into three difficulty tiers according to the minimum required action steps: 2-step (37 tasks), 4-step (76 tasks), 6-step (145 tasks), and 8-step (143 tasks).
Hardware Specification | Yes | The experiments were conducted on high-performance computing clusters equipped with NVIDIA A40 GPUs, each with 48GB of memory. The CPUs used in the cluster are Intel(R) Xeon(R) Silver 4216 processors, each running at 2.10GHz. The memory of each computing node is 1024GB.
Software Dependencies | No | Specifically, we utilize Sentence Transformers 3 to encode textual states into dense embeddings. ... sentence-transformers/all-mpnet-base-v2 (used as the default model in our main experiments); sentence-transformers/all-MiniLM-L6-v2; sentence-transformers/all-MiniLM-L12-v2
Experiment Setup | Yes | In this work, all the experiments are repeated with 5 different seeds. For the baseline methods, we use their default configuration. We set the number of quantiles to 50. The c1 in Equation 8 is set to 0.25, and the update rate of the quantile distribution is fixed at 0.5. ... Table 11: Hyperparameter Settings for UCC (hyper-parameter = value): learning_rate = 1e-5; hidden_size_list = [64, 64, 128]; update_per_collect = 5; obs_norm = True; obs_norm_clamp_min = -1; obs_norm_clamp_max = 1; intrinsic_reward_weight = 0.01
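
For readers who want to carry the reported experiment setup into their own scripts, the values quoted in the Experiment Setup row can be collected into a small configuration object. The sketch below is illustrative only: the class names `UCCConfig` and `ReportedSetup` and the field names `num_seeds`, `num_quantiles`, and `quantile_update_rate` are assumptions introduced here, not identifiers from the PlanU code. Only the numeric values and the Table 11 key names come from the quoted text, and the seed values themselves are not reported, so none are listed.

```python
from dataclasses import asdict, dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class UCCConfig:
    """UCC hyperparameters; key names and values as quoted from Table 11."""
    learning_rate: float = 1e-5
    hidden_size_list: Tuple[int, ...] = (64, 64, 128)
    update_per_collect: int = 5
    obs_norm: bool = True
    obs_norm_clamp_min: float = -1.0
    obs_norm_clamp_max: float = 1.0
    intrinsic_reward_weight: float = 0.01


@dataclass(frozen=True)
class ReportedSetup:
    """Top-level settings from the quoted setup text.

    Field names are hypothetical; only the values are taken from the source.
    """
    num_seeds: int = 5                 # "repeated with 5 different seeds"
    num_quantiles: int = 50            # "number of quantiles to 50"
    c1: float = 0.25                   # "c1 in Equation 8 is set to 0.25"
    quantile_update_rate: float = 0.5  # "update rate ... fixed at 0.5"
    ucc: UCCConfig = field(default_factory=UCCConfig)


setup = ReportedSetup()
flat = asdict(setup)  # nested dict, convenient for logging or serialization
print(flat["num_quantiles"], flat["ucc"]["hidden_size_list"])
```

Freezing the dataclasses keeps the recorded values from being changed by accident during a run; a variant for a sweep could be derived with `dataclasses.replace(setup, c1=0.5)` instead of mutating the original.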