Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Tru-POMDP: Task Planning Under Uncertainty via Tree of Hypotheses and Open-Ended POMDPs
Authors: Wenjing Tang, Xinyu He, Yongxi Huang, Yunxiao Xiao, Cewu Lu, Panpan Cai
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on complex object rearrangement tasks across diverse kitchen environments show that Tru-POMDP significantly outperforms state-of-the-art LLM-based and LLM-tree-search hybrid planners, achieving higher success rates with significantly better plans, stronger robustness to ambiguity and occlusion, and greater planning efficiency. |
| Researcher Affiliation | Academia | (1) Shanghai Jiao Tong University; (2) Shanghai Innovation Institute; (3) East China Normal University; (4) Beijing University of Posts and Telecommunications |
| Pseudocode | Yes | LLM-Generated Code for Rollout Policy. ActionSimple NextAction(const std::vector<std::array<int, 2>>& unreached_goals, const SceneGraphSimple& current_scene_graph) { // If no goals remain, return no-op if (unreached_goals.empty()) { return ActionSimple(); } |
| Open Source Code | Yes | The code and demonstration video are available at: https://tru-pomdp.github.io |
| Open Datasets | Yes | We evaluate Tru-POMDP in five kitchen environments from RoboCasa [13] |
| Dataset Splits | No | We categorize tasks into three difficulty levels based on the number of target objects required in the goal: easy (requiring 2 target objects), medium (3), and hard (4-8). Each additional target object in the goal leads to exponential growth of uncertainty. For each level, we generate 100 tasks, resulting in a total of 300 tasks. |
| Hardware Specification | Yes | Experiments are run on a local machine equipped with a 12th Gen Intel Core i7-12700KF CPU (20 threads), without GPU acceleration. |
| Software Dependencies | Yes | All methods consistently use GPT-4.1 as the LLM. |
| Experiment Setup | Yes | Parameter (description): value. C1, C2 (number of candidates for Levels 1 and 2 in the Tree of Hypotheses): 3; T (temperature for the LLM in the Tree of Hypotheses): 0.1; ϵ (threshold in Hybrid Belief Update): 0.7; k (number of scenarios in Belief Tree Search): 30; ds (maximum search depth in Belief Tree Search): 20; dr (rollout policy execution depth): 10 |
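
For readers who want to reuse the reported settings, the hyperparameters in the Experiment Setup row could be collected in a small configuration struct. The following is a minimal sketch, not code from the Tru-POMDP repository: the struct name, field names, and the `IsValid` helper are hypothetical, and it assumes that the single reported value 3 applies to both C1 and C2. The range checks (for example, that ϵ lies in [0, 1]) are likewise assumptions rather than constraints stated in the paper.

```cpp
// Hypothetical container for the hyperparameters listed in the table above.
// Names are illustrative only; they are not taken from the Tru-POMDP code.
struct TruPomdpConfig {
    int num_candidates_level1 = 3;         // C1: candidates at Level 1 of the Tree of Hypotheses
    int num_candidates_level2 = 3;         // C2: candidates at Level 2 (assumed equal to C1)
    double llm_temperature = 0.1;          // T: LLM temperature in the Tree of Hypotheses
    double belief_update_threshold = 0.7;  // epsilon: threshold in Hybrid Belief Update
    int num_scenarios = 30;                // k: scenarios in Belief Tree Search
    int max_search_depth = 20;             // ds: maximum search depth in Belief Tree Search
    int rollout_depth = 10;                // dr: rollout policy execution depth
};

// Basic sanity checks. The accepted ranges are assumptions made for this
// sketch, not requirements documented by the authors.
inline bool IsValid(const TruPomdpConfig& c) {
    return c.num_candidates_level1 > 0 &&
           c.num_candidates_level2 > 0 &&
           c.llm_temperature >= 0.0 &&
           c.belief_update_threshold >= 0.0 &&
           c.belief_update_threshold <= 1.0 &&
           c.num_scenarios > 0 &&
           c.max_search_depth > 0 &&
           c.rollout_depth > 0;
}
```

A default-constructed `TruPomdpConfig` carries the values reported in the table; individual fields can be overridden for ablations and re-checked with `IsValid`.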