Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Towards Reliable LLM-based Robots Planning via Combined Uncertainty Estimation

Authors: Shiyuan Yin, Chenjia Bai, Zihao Zhang, Junwei Jin, Xinxin Zhang, Chi Zhang, Xuelong Li

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We validated our approach in two distinct experimental settings: kitchen manipulation and tabletop rearrangement experiments. The results show that, compared to existing methods, our approach yields uncertainty estimates that are more closely aligned with the actual execution outcomes.
Researcher Affiliation	Collaboration	Shiyuan Yin1, Chenjia Bai2, Zihao Zhang1, Junwei Jin1, Xinxin Zhang1, Chi Zhang2, Xuelong Li2. 1 School of Artificial Intelligence, Henan University of Technology; 2 Institute of Artificial Intelligence (TeleAI), China Telecom. Correspondence to Chenjia Bai <EMAIL>
Pseudocode	No	The paper describes methods using textual descriptions and mathematical formulas in sections like '3 CURE Method', '3.2 Task Familiarity Assessment', and '3.3 Assessment of Task Clarity and Expected Success Rate', and illustrates processes with figures (e.g., Figure 2, Figure 3). However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code	Yes	The code is at https://github.com/Firesuiry/CURE.
Open Datasets	Yes	The datasets used for evaluation are identical to those in KnowNo [47]. We adopted datasets originally proposed by KnowNo [47], published under Apache-2.0 license.
Dataset Splits	No	The paper mentions constructing a training dataset and splitting a test set into two parts for calibration and testing. However, it does not provide specific percentages, sample counts, or detailed methodology for the main training/test/validation splits used in the experiments. It states, "We conducted additional experiments to evaluate the Overstep Rate, Overask Rate, and Help Rate of the IntroPlan method, CURE method, and IntroPlan + CURE method when the target success rate is 90%." and "We tested this method for a target success rate of 85%." but these are for specific analyses, not general dataset splits.
Hardware Specification	Yes	All experiments were conducted on a computing server equipped with dual Intel Xeon Gold 6348 processors, 512GB of RAM, and four NVIDIA A100-PCIE-40GB GPUs.
Software Dependencies	No	The paper mentions using 'Llama-3.3-70B-Instruct' and 'Llama-3.2-8B-Instruct' models, and the 'PyBullet simulator'. However, it does not provide specific version numbers for these software components or any other libraries/frameworks (e.g., Python, PyTorch, TensorFlow) to allow for reproducible software environment setup.
Experiment Setup	Yes	In this paper, we set α1 = 1, α2 = 0.6, and α3 = 30. Appendix E contains the hyperparameter search experiments for these parameters.