Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Tru-POMDP: Task Planning Under Uncertainty via Tree of Hypotheses and Open-Ended POMDPs

Authors: Wenjing Tang, Xinyu He, Yongxi Huang, Yunxiao Xiao, Cewu Lu, Panpan Cai

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on complex object rearrangement tasks across diverse kitchen environments show that Tru-POMDP significantly outperforms state-of-the-art LLM-based and LLM-tree-search hybrid planners, achieving higher success rates with significantly better plans, stronger robustness to ambiguity and occlusion, and greater planning efficiency.
Researcher Affiliation | Academia | 1 Shanghai Jiao Tong University; 2 Shanghai Innovation Institute; 3 East China Normal University; 4 Beijing University of Posts and Telecommunications
Pseudocode | Yes | LLM-Generated Code for Rollout Policy. ActionSimple NextAction(const std::vector<std::array<int, 2>>& unreached_goals, const SceneGraphSimple& current_scene_graph) { // If no goals remain, return no-op if (unreached_goals.empty()) { return ActionSimple(); }
Open Source Code | Yes | The code and demonstration video are available at: https://tru-pomdp.github.io
Open Datasets | Yes | We evaluate Tru-POMDP in five kitchen environments from RoboCasa [13]
Dataset Splits | No | We categorize tasks into three difficulty levels based on the number of target objects required in the goal: easy (requiring 2 target objects), medium (3), and hard (4–8). Each additional target object in the goal leads to exponential growth of uncertainty. For each level, we generate 100 tasks, resulting in a total of 300 tasks.
Hardware Specification | Yes | Experiments are run on a local machine equipped with a 12th Gen Intel Core i7-12700KF CPU (20 threads), without GPU acceleration.
Software Dependencies | Yes | All methods consistently use GPT-4.1 as the LLM.
Experiment Setup | Yes | Parameter (description) = value: C_1, C_2 (number of candidates for Levels 1 and 2 in the Tree of Hypotheses) = 3; T (temperature for the LLM in the Tree of Hypotheses) = 0.1; ϵ (threshold in Hybrid Belief Update) = 0.7; k (number of scenarios in Belief Tree Search) = 30; d_s (maximum search depth in Belief Tree Search) = 20; d_r (rollout policy execution depth) = 10
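
For readers who want the Experiment Setup and Dataset Splits rows in one place, the quoted values can be collected into a small configuration sketch. This is illustrative only and is not taken from the authors' code: the names PlannerParams, ClassifyTask, CheckDefaults, and every field name are hypothetical, the single value 3 is assumed to apply to both C_1 and C_2, and the hard level is read as the range 4 to 8 target objects. It requires C++17 for std::optional.

```cpp
#include <cassert>
#include <optional>

// Hypothetical container for the hyperparameters quoted in the Experiment Setup row.
// Field names are illustrative; only the values come from the table.
struct PlannerParams {
    int num_candidates_level1 = 3;         // C_1 (assumed: the value 3 applies to both levels)
    int num_candidates_level2 = 3;         // C_2
    double llm_temperature = 0.1;          // T, temperature in the Tree of Hypotheses
    double belief_update_threshold = 0.7;  // epsilon, threshold in Hybrid Belief Update
    int num_scenarios = 30;                // k, scenarios in Belief Tree Search
    int max_search_depth = 20;             // d_s, maximum search depth
    int rollout_depth = 10;                // d_r, rollout policy execution depth
};

enum class Difficulty { Easy, Medium, Hard };

// Maps the number of target objects in a goal to a difficulty level.
// Returns std::nullopt for counts outside the ranges quoted in the Dataset Splits row.
// Assumption: "hard" covers 4 through 8 target objects inclusive.
std::optional<Difficulty> ClassifyTask(int num_target_objects) {
    if (num_target_objects == 2) return Difficulty::Easy;
    if (num_target_objects == 3) return Difficulty::Medium;
    if (num_target_objects >= 4 && num_target_objects <= 8) return Difficulty::Hard;
    return std::nullopt;
}

// Task counts quoted in the Dataset Splits row: 100 tasks per level, 3 levels.
constexpr int kTasksPerLevel = 100;
constexpr int kNumLevels = 3;
constexpr int kTotalTasks = kTasksPerLevel * kNumLevels;
static_assert(kTotalTasks == 300, "total should match the 300 tasks quoted above");

// Usage example: sanity-check the defaults against the quoted values.
inline void CheckDefaults() {
    PlannerParams params;
    assert(params.num_scenarios == 30);
    assert(params.rollout_depth <= params.max_search_depth);
    assert(ClassifyTask(4) == Difficulty::Hard);
}
```

Returning std::optional rather than a default level keeps out-of-range object counts (for example 1 or 9) visibly unclassified instead of silently assigning them to a level the report does not define.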