Program Synthesis Guided Reinforcement Learning for Partially Observed Environments
Authors: Yichen Yang, Jeevana Priya Inala, Osbert Bastani, Yewen Pu, Armando Solar-Lezama, Martin Rinard
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we show that our approach significantly outperforms non-program-guided approaches on a set of challenging benchmarks, including a 2D Minecraft-inspired environment where the agent must complete a complex sequence of subtasks to achieve its goal, and achieves a similar performance as using handcrafted programs to guide the agent. |
| Researcher Affiliation | Collaboration | Yichen David Yang (MIT EECS & CSAIL); Jeevana Priya Inala (Microsoft Research); Osbert Bastani (University of Pennsylvania); Yewen Pu (Autodesk Research); Armando Solar-Lezama (MIT EECS & CSAIL); Martin Rinard (MIT EECS & CSAIL) |
| Pseudocode | No | The paper describes the architecture and algorithm steps in text and figures, but does not contain a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | The code is available at: https://github.com/yycdavid/program-synthesis-guided-RL |
| Open Datasets | No | The paper uses custom-generated environments (a '2D Minecraft-inspired game' and 'box-world') with randomly sampled maps and goals; it does not provide public access information (link, DOI, or formal citation) for any pre-existing dataset. |
| Dataset Splits | No | The paper mentions training and test sets ('evaluate on a test set'), but does not explicitly describe a separate validation set, nor does it give specific split details (percentages or counts). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions software such as 'Z3' and 'MuJoCo' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For our approach, we use a CVAE hallucinator, with MLP (with 200 hidden units) encoder/decoder, trained on 20K (s, o) pairs collected by a random agent. We use m = 3 hallucinated environments, N = 20 steps before replanning in our main experiments, and N = 5 in the example behaviors we show for better demonstrations. We use the same actor (resp., critic) network architecture for the policies across all approaches i.e., an MLP with 128 (resp., 32) hidden units. For the hallucinator, we use the same architecture as in the craft environment but with 300 hidden units, and trained with 100K (s, o) pairs. For the synthesizer, we use m = 3 and N = 10. |
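
The hyperparameters quoted in the Experiment Setup row map onto a fairly standard CVAE-plus-actor-critic layout. The sketch below is a minimal PyTorch reconstruction of those reported dimensions only (200/300-unit hallucinator MLPs, 128-unit actor, 32-unit critic, m = 3 hallucinated environments, N = 20 steps before replanning); it is not the authors' implementation (which is in the linked repository), and the state, observation, action, and latent sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: the real state/observation encodings for the
# craft and box-world environments are defined in the authors' repository.
OBS_DIM, STATE_DIM, ACTION_DIM, LATENT_DIM = 64, 64, 5, 16


class Hallucinator(nn.Module):
    """CVAE that samples candidate full states s conditioned on an observation o.

    The quoted setup reports MLP encoder/decoder with 200 hidden units for the
    craft environment and 300 hidden units for box-world.
    """

    def __init__(self, hidden: int = 200):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(STATE_DIM + OBS_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * LATENT_DIM),  # latent mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM + OBS_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, STATE_DIM),
        )

    def sample(self, obs: torch.Tensor, m: int = 3) -> torch.Tensor:
        """Hallucinate m candidate environments (m = 3 in the paper) from one observation."""
        z = torch.randn(m, LATENT_DIM)
        obs = obs.expand(m, -1)
        return self.decoder(torch.cat([z, obs], dim=-1))


class ActorCritic(nn.Module):
    """MLP actor (128 hidden units) and critic (32 hidden units), per the quoted setup."""

    def __init__(self):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, ACTION_DIM)
        )
        self.critic = nn.Sequential(
            nn.Linear(OBS_DIM, 32), nn.ReLU(), nn.Linear(32, 1)
        )


# Replanning interval: N = 20 in the main craft experiments, N = 10 for box-world.
REPLAN_EVERY_N_STEPS = 20
```

A loop consistent with the quoted setup would call `Hallucinator.sample(obs, m=3)` once every `REPLAN_EVERY_N_STEPS` environment steps, synthesize a guiding program over the sampled states, and let the actor-critic policy execute until the next replanning point; the exact interaction between synthesizer and policy is described in the paper and repository rather than reconstructed here.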