Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Improving Generative Behavior Cloning via Self-Guidance and Adaptive Chunking

Authors: Junhyuk So, Chiwoong Lee, Shinyoung Lee, Jungseul Ok, Eunhyeok Park

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments show that our approach substantially improves GBC performance across a wide range of simulated and real-world robotic manipulation tasks. Extensive evaluations across simulated and real-world robotic environments demonstrate that our approach outperforms Vanilla Diffusion Policy by 23.25% and the state-of-the-art BID by 12.27%
Researcher Affiliation	Academia	(1) Department of Computer Science & Engineering, (2) Graduate School of Artificial Intelligence, POSTECH, South Korea EMAIL
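Pseudocode	Yes	Let $A_{\text{queue}}$ denote the action chunk queue, $\hat{a}_{t:t+H} \sim \pi(a \mid s_t)$ the newly predicted action chunk, and $\tau$ the similarity threshold. The update rule is defined as: $A_{\text{queue}} \leftarrow \begin{cases} A_{\text{queue}}.\mathrm{enqueue}(\hat{a}_{t+H}) & \text{if } \cos(A_{\text{queue}}[0], \hat{a}[0]) > \tau \\ \hat{a}_{t:t+H} & \text{else} \end{cases}$ (14), where $\cos(\cdot)$ denotes cosine similarity. At each timestep, the first action in the queue is dequeued and executed: $a_t = A_{\text{queue}}.\mathrm{dequeue}()$.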
Open Source Code	Yes	Our code is available at https://github.com/junhyukso/SGAC.
Open Datasets	Yes	These include simple tasks like Push T [9], standard benchmarks from Robomimic [24], and the particularly challenging long-horizon Kitchen [25] environment.
Dataset Splits	No	While the paper mentions using specific numbers of episodes for evaluation and collecting 300 demonstration episodes for real-world experiments, it does not explicitly provide information on how these demonstration datasets were split into training, validation, and test sets in terms of percentages or counts for reproducibility.
Hardware Specification	Yes	All experiments are conducted on one A6000 GPU server with DDIM-10 Solver with 30Hz standard visuomotor control frequencies. which requires 27H with one NVIDIA RTX 6000 Ada Generation GPU and AMD Ryzen Threadripper PRO 7985WX CPU.
Software Dependencies	No	The paper mentions using 'DDIM-30 solver' and frameworks like 'Lerobot(Huggingface)' and 'Diffusers'. However, it does not provide specific version numbers for these software components or any other libraries like Python or PyTorch, which are necessary for full reproducibility.
Experiment Setup	Yes	Hyperparameter Settings: The hyperparameters used in our simulation experiments in the main paper are summarized in Table 3. Additional hyperparameter details are listed in Table 5.
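
Illustration of the Pseudocode entry: the queue update rule quoted in that row can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation (see the repository linked under Open Source Code for that). The names `cosine_similarity` and `adaptive_chunk_step` are hypothetical, actions are modeled as plain lists of floats, the strictness of the comparison (`>` versus `>=`) is an assumption, and treating an empty queue or a zero-norm action as "not similar" is a choice made here for the sketch.

```python
from collections import deque
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length action vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def adaptive_chunk_step(queue, new_chunk, tau):
    """Apply one update of the action chunk queue and return the action to execute.

    queue     -- deque of pending actions (the action chunk queue)
    new_chunk -- newly predicted action chunk, a non-empty list of actions
    tau       -- similarity threshold
    """
    if queue and cosine_similarity(queue[0], new_chunk[0]) > tau:
        # Old plan still agrees with the new prediction:
        # keep it and extend it with the last newly predicted action.
        queue.append(new_chunk[-1])
    else:
        # Plans disagree (or nothing is queued): switch to the new chunk.
        queue.clear()
        queue.extend(new_chunk)
    # The first action in the queue is dequeued and executed.
    return queue.popleft()
```

In a control loop, `adaptive_chunk_step` would be called once per timestep with the policy's latest prediction; a high threshold makes the controller replan more often, while a low one keeps it committed to the queued chunk.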