Program Synthesis Guided Reinforcement Learning for Partially Observed Environments
Authors: Yichen Yang, Jeevana Priya Inala, Osbert Bastani, Yewen Pu, Armando Solar-Lezama, Martin Rinard
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we show that our approach significantly outperforms non-program-guided approaches on a set of challenging benchmarks, including a 2D Minecraft-inspired environment where the agent must complete a complex sequence of subtasks to achieve its goal, and achieves a similar performance as using handcrafted programs to guide the agent. |
| Researcher Affiliation | Collaboration | Yichen David Yang (MIT EECS & CSAIL); Jeevana Priya Inala (Microsoft Research); Osbert Bastani (University of Pennsylvania); Yewen Pu (Autodesk Research); Armando Solar-Lezama (MIT EECS & CSAIL); Martin Rinard (MIT EECS & CSAIL) |
| Pseudocode | No | The paper describes the architecture and algorithm steps in text and figures, but does not contain a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | The code is available at: https://github.com/yycdavid/program-synthesis-guided-RL |
| Open Datasets | No | The paper uses custom-generated environments (a '2D Minecraft-inspired game' and 'box-world') with randomly sampled maps and goals; it does not provide public access information (link, DOI, or formal citation) for any pre-existing dataset. |
| Dataset Splits | No | The paper mentions training and test sets ('evaluate on a test set'), but does not explicitly describe a separate validation set, nor does it give specific split details (percentages or counts). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions software such as 'Z3' and 'MuJoCo' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For our approach, we use a CVAE hallucinator, with MLP (with 200 hidden units) encoder/decoder, trained on 20K (s, o) pairs collected by a random agent. We use m = 3 hallucinated environments, N = 20 steps before replanning in our main experiments, and N = 5 in the example behaviors we show for better demonstrations. We use the same actor (resp., critic) network architecture for the policies across all approaches i.e., an MLP with 128 (resp., 32) hidden units. For the hallucinator, we use the same architecture as in the craft environment but with 300 hidden units, and trained with 100K (s, o) pairs. For the synthesizer, we use m = 3 and N = 10. |
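
The hyperparameters quoted in the Experiment Setup row map onto a fairly standard CVAE-plus-actor-critic layout. The sketch below is a minimal PyTorch reconstruction of those reported dimensions only (200/300-unit hallucinator MLPs, 128-unit actor, 32-unit critic, m = 3 hallucinated environments, N = 20 steps before replanning); it is not the authors' implementation (which is in the linked repository), and the state, observation, action, and latent sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: the real state/observation encodings for the
# craft and box-world environments are defined in the authors' repository.
OBS_DIM, STATE_DIM, ACTION_DIM, LATENT_DIM = 64, 64, 5, 16


class Hallucinator(nn.Module):
    """CVAE that samples candidate full states s conditioned on an observation o.

    The quoted setup reports MLP encoder/decoder with 200 hidden units for the
    craft environment and 300 hidden units for box-world.
    """

    def __init__(self, hidden: int = 200):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(STATE_DIM + OBS_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * LATENT_DIM),  # latent mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM + OBS_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, STATE_DIM),
        )

    def sample(self, obs: torch.Tensor, m: int = 3) -> torch.Tensor:
        """Hallucinate m candidate environments (m = 3 in the paper) from one observation."""
        z = torch.randn(m, LATENT_DIM)
        obs = obs.expand(m, -1)
        return self.decoder(torch.cat([z, obs], dim=-1))


class ActorCritic(nn.Module):
    """MLP actor (128 hidden units) and critic (32 hidden units), per the quoted setup."""

    def __init__(self):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, ACTION_DIM)
        )
        self.critic = nn.Sequential(
            nn.Linear(OBS_DIM, 32), nn.ReLU(), nn.Linear(32, 1)
        )


# Replanning interval: N = 20 in the main craft experiments, N = 10 for box-world.
REPLAN_EVERY_N_STEPS = 20
```

A loop consistent with the quoted setup would call `Hallucinator.sample(obs, m=3)` once every `REPLAN_EVERY_N_STEPS` environment steps, synthesize a guiding program over the sampled states, and let the actor-critic policy execute until the next replanning point; the exact interaction between synthesizer and policy is described in the paper and repository rather than reconstructed here.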