Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Blindfolded Experts Generalize Better: Insights from Robotic Manipulation and Videogames

Authors: Ev Zisselman, Mirco Mutti, Shelly Francis-Meretzki, Elisei Shafer, Aviv Tamar

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct experiments of real-world robot peg insertion tasks with (limited) human demonstrations, alongside videogames from the Procgen benchmark. Additionally, we support our findings with theoretical analysis, which confirms that the generalization error scales with $\sqrt{I/m}$, where I measures the amount of task information available to the demonstrator, and m is the number of demonstrated tasks.
Researcher Affiliation	Academia	Ev Zisselman, Mirco Mutti, Shelly Francis-Meretzki, Elisei Shafer, Aviv Tamar; Technion, Israel Institute of Technology
Pseudocode	No	The paper includes a theoretical analysis and describes methods, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code in the main text or appendices.
Open Source Code	Yes	Project page with videos and code: https://sites.google.com/view/blindfoldedexperts/home.
Open Datasets	Yes	Our human demonstration dataset for Procgen maze and heist is available on the project website. Procgen [8] is a popular benchmark for measuring sample efficiency and generalization. Functional Manipulation Benchmark [FMB, 25], which focuses on inserting variously shaped pegs into tightly matching holes.
Dataset Splits	Yes	For training, we consider 200 procedurally generated levels. We split the ten shapes into training and test shapes. Training was conducted for k ∈ {2, 3, 4, 5} peg shapes, with the remaining shapes serving as a withheld test set.
Hardware Specification	Yes	We use a Franka Emika Panda robot arm and teleoperate the robot using a SpaceMouse. The observations are obtained as RGB-only images from two Intel RealSense D405 cameras, mounted on the robot end-effector.
Software Dependencies	No	The paper mentions several software components and architectures, such as ResNet [17], GRU [7], Segment Anything 2 (SAM2) [39], SERL [24], and the Adam optimizer. However, it does not provide specific version numbers for these components or for the general programming languages and frameworks used, which is required for a reproducible description of ancillary software.
Experiment Setup	Yes	Policy architecture and training. Training is conducted from scratch on the demonstrated trajectories by minimizing the negative log likelihood 1. We use the architecture from [27] for both the Expert (πBC) and BF-Expert (πBF BC): a ResNet [17] to encode the observations, which are then processed by two fully-connected layers. To capture the exploratory behavior, we add a single GRU [7] before the Softmax policy layer (further details are in Appendix C). Appendix B.2 and C.1 (Hyperparameters and constants sections) detail hyperparameters such as batch size, hidden size, learning rate, and learning rate schedule for both Procgen and peg insertion experiments.
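
As a rough illustration of the peg-shape split quoted in the Dataset Splits row and the rate quoted in the Research Type row, the following Python sketch may help. It is not the authors' code. The shape names, the random choice of training shapes, the seed, the value I = 1.0, and the identification of m with the number of training shapes k are all assumptions made for illustration; the rows above state only that ten shapes are split, that k ∈ {2, 3, 4, 5} shapes are used for training, and that the error scales with sqrt(I/m).

```python
import math
import random

# Hypothetical placeholder names: the source says only that there are ten peg shapes.
SHAPES = [f"shape_{i}" for i in range(10)]


def split_shapes(shapes, k, seed=0):
    """Pick k training shapes at random; the remaining shapes form the withheld test set.

    Random selection is an assumption; the paper may choose training shapes differently.
    """
    rng = random.Random(seed)
    train = rng.sample(shapes, k)
    test = [s for s in shapes if s not in train]
    return train, test


def error_scale(info, m):
    """Return sqrt(I/m), the quoted scaling rate, with all constants omitted."""
    return math.sqrt(info / m)


# Assumption: treat each training shape as one demonstrated task (m = k), with I fixed at 1.0.
for k in (2, 3, 4, 5):
    train, test = split_shapes(SHAPES, k)
    print(k, len(train), len(test), round(error_scale(1.0, k), 3))
```

Under these assumptions, the printed scale shrinks as k grows, which matches the qualitative reading that more demonstrated tasks tighten the bound for a fixed amount of task information.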