Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

InstructFlow: Adaptive Symbolic Constraint-Guided Code Generation for Long-Horizon Planning

Authors: Haotian Chi, Zeyu Feng, Yueming LYU, Chengqi Zheng, Linbo Luo, Yew Soon Ong, Ivor Tsang, Hechang Chen, Yi Chang, Haiyan Yin

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our approach on three simulated domains, each designed to test different aspects of long-horizon planning with parameterized skills and physical constraints: (1) Drawing: ... (2) Arrange-Blocks: ... (3) Arrange-YCB: ... As shown in Table 1, across drawing, block arrangement, and YCB manipulation tasks, InstructFlow outperforms prior methods by 20-40% in task success rate.
Researcher Affiliation	Academia	1 School of Artificial Intelligence, Jilin University, China; 2 CFAR and IHPC, Agency for Science, Technology and Research (A*STAR), Singapore; 3 Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, Jilin University, China; 4 Nanyang Technological University (NTU), Singapore; 5 Xidian University, China
Pseudocode	No	The paper describes the methodology and agent interactions in text and via diagrams (e.g., Figure 1), but it does not include a dedicated section or figure explicitly labeled 'Pseudocode' or 'Algorithm' with structured steps.
Open Source Code	Yes	The implementation is available at https://github.com/chiht21/InstructFlow.
Open Datasets	Yes	All experiments are conducted in the Ravens [31] simulation environment, using a 6-DoF UR5 arm with a Robotiq 2F-85 gripper in a tabletop workspace. (3) Arrange-YCB: The robot manipulates complex objects from the YCB dataset (e.g., banana, meat can) to perform packing and stacking.
Dataset Splits	No	Each approach is evaluated over 10 randomized seeds per simulated task. We use a maximum budget of 1000 samples per trial (10000 for drawing tasks). We limit the number of feedback iterations to 5. All methods are queried via OpenAI's GPT-4o unless otherwise stated. A task is considered successful if the final robot state satisfies the goal condition without violating any constraints (see Appendix B.1 for more details on experiment settings).
Hardware Specification	Yes	Simulations run on CPUs with 32GB RAM, with all baseline implementations integrated into a unified evaluation framework.
Software Dependencies	No	Physics-based execution and constraint checking are handled via PyBullet. Simulations run on CPUs with 32GB RAM, with all baseline implementations integrated into a unified evaluation framework. All methods are queried via OpenAI's GPT-4o unless otherwise stated.
Experiment Setup	Yes	Each approach is evaluated over 10 randomized seeds per simulated task. We use a maximum budget of 1000 samples per trial (10000 for drawing tasks). We limit the number of feedback iterations to 5. All methods are queried via OpenAI's GPT-4o unless otherwise stated.