Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Towards Reliable Code-as-Policies: A Neuro-Symbolic Framework for Embodied Task Planning

Authors: Sanghyun Ahn, Wonje Choi, Junyong Lee, Jinwoo Park, Honguk Woo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our framework on RLBench and in real-world settings across dynamic, partially observable scenarios. Experimental results demonstrate that our framework improves task success rates by 46.2% over Code as Policies baselines and attains over 86.8% executability of task-relevant actions, thereby enhancing the reliability of task planning in dynamic environments.
Researcher Affiliation	Academia	¹Department of Computer Science and Engineering, Sungkyunkwan University; ²Department of Artificial Intelligence, Sungkyunkwan University
Pseudocode	Yes	We provide the full pseudocode for our neuro-symbolic task execution pipeline in Algorithm 1 and Algorithm 2.
Open Source Code	Yes	We plan to release our project as open-source.
Open Datasets	Yes	We conducted experiments in both RLBench [8] and real-world settings using a 7DoF Franka Emika Research 3 robotic arm, enabling reproducible evaluations via randomized initial states and instructions to analyze safe probe strategies in dynamic, partially observable scenarios.
Dataset Splits	No	The paper mentions "randomized initial states" and "varied initial conditions and instructions" for RLBench, and "The initial positions of all objects are randomized for each trial" in the real-world setup, but does not specify explicit training/test/validation dataset splits with percentages or sample counts for models, nor does it refer to predefined splits from external datasets. The focus is on task success under different observability conditions rather than traditional data splits for model training.
Hardware Specification	Yes	Most experiments were conducted on a local machine with an Intel(R) Core(TM) i7-9700KF CPU and an NVIDIA GeForce RTX 4080 GPU (16GB VRAM). Each task instance used a single GPU, and RLBench simulation was executed with up to 32GB of system memory. Symbolic verification and PDDL planning were run on the CPU. For experiments using the larger language models listed in Table 6 in main paper, such as Llama-3.1-8B and Qwen3-30B-A3B, we used a cloud-based CUDA cluster with GPUs equipped with approximately 82GB of VRAM.
Software Dependencies	Yes	We employ GPT-4o-mini [51] for code generation and feedback generation. Additionally, Llama-3.2-3B [52] is used to compute the CSC. ... For the verification phase (i), the Z3 SMT solver [20] is employed as the symbolic verification tool, while for the validation phase (ii), the Fast Downward planner [53] is used as the symbolic validation tool.
Experiment Setup	Yes	The decoding temperature is fixed at 0.0 for all generation steps. ... The only hyperparameter in our framework is the confidence threshold ϵ used during neuro-symbolic validation. For each skill, we perform five safe exploration probes under varied initial conditions to estimate its execution confidence. To determine a suitable value of ϵ for a given environment, we exclude outlier trials in which the probe failed due to non-informative reasons, which could otherwise deflate confidence estimates.
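
Illustrative sketch (not taken from the paper): the Experiment Setup entry describes estimating a per-skill execution confidence from five safe exploration probes, excluding outlier trials that failed for non-informative reasons, and comparing the result against the threshold ϵ. The excerpt does not give the confidence formula, so the Python below assumes a plain success rate over the retained trials. The names `ProbeTrial`, `execution_confidence`, and `passes_threshold`, and the example outcomes, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ProbeTrial:
    """One safe exploration probe of a skill (hypothetical record type)."""
    succeeded: bool
    # False marks an outlier: the probe failed for a reason unrelated to the skill.
    informative: bool = True


def execution_confidence(trials: Iterable[ProbeTrial]) -> Optional[float]:
    """Assumed estimator: success rate over informative trials only.

    Returns None when no informative trial remains.
    """
    kept = [t for t in trials if t.informative]
    if not kept:
        return None
    return sum(t.succeeded for t in kept) / len(kept)


def passes_threshold(trials: Iterable[ProbeTrial], epsilon: float) -> bool:
    """True when the estimated confidence reaches the threshold epsilon."""
    confidence = execution_confidence(trials)
    return confidence is not None and confidence >= epsilon


# Five probes for one skill; the last failure is flagged as non-informative.
probes = [
    ProbeTrial(True),
    ProbeTrial(True),
    ProbeTrial(False),
    ProbeTrial(True),
    ProbeTrial(False, informative=False),
]
print(execution_confidence(probes))
```

With the outlier excluded, the estimate covers four trials instead of five; counting the flagged failure would lower the success rate, which is the deflation the entry mentions.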