Imitation-Projected Programmatic Reinforcement Learning
Authors: Abhinav Verma, Hoang Le, Yisong Yue, Swarat Chaudhuri
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present theoretical convergence results for PROPEL and empirically evaluate the approach in three continuous control domains. The experiments show that PROPEL can significantly outperform state-of-the-art approaches for learning programmatic policies. |
| Researcher Affiliation | Academia | Abhinav Verma, Rice University (averma@rice.edu); Hoang M. Le, Caltech (hmle@caltech.edu); Yisong Yue, Caltech (yyue@caltech.edu); Swarat Chaudhuri, Rice University (swarat@rice.edu) |
| Pseudocode | Yes | Algorithm 1 Imitation-Projected Programmatic Reinforcement Learning (PROPEL); Algorithm 2 UPDATEF: neural policy gradient for mixed policies; Algorithm 3 PROJECTΠ: program synthesis via imitation learning (a structural sketch of this loop appears below the table) |
| Open Source Code | Yes | The code for the TORCS experiments can be found at: https://bitbucket.org/averma8053/propel |
| Open Datasets | Yes | We evaluate over five distinct tracks in the TORCS simulator. Empirical results on two additional classic control tasks, Mountain-Car and Pendulum, are provided in Appendix B. |
| Dataset Splits | No | The paper mentions running experiments with "twenty-five random seeds" and "training for 600 episodes", but it does not specify explicit training/validation/test dataset splits (e.g., percentages or sample counts) as would be typical for static datasets. Since it's a simulation environment, data is generated dynamically. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment. |
| Experiment Setup | Yes | We perform the experiments with twenty-five random seeds and report the median lap time over these twenty-five trials. ... DDPG, a neural policy learned using the Deep Deterministic Policy Gradients [36] algorithm, with 600 episodes of training for each track. |
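
The algorithm names in the pseudocode row suggest PROPEL's high-level structure: an outer loop that alternates a policy-gradient update on a mixed (program + neural) policy with a projection back onto program space via imitation learning. The following Python sketch shows only that outer-loop structure under those assumptions; every identifier here (`propel`, `update_f`, `project_pi`, `collect_demos`, `PropelConfig`) is an illustrative placeholder, not the paper's actual implementation or API.

```python
# Minimal structural sketch of PROPEL's outer loop, inferred from the listed
# algorithms (1: PROPEL, 2: UPDATE_F, 3: PROJECT_Pi). The concrete update and
# projection routines are passed in as callables; nothing here is the paper's
# actual implementation.

from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

State = List[float]
Action = List[float]
Policy = Callable[[State], Action]
Demo = Tuple[State, Action]


@dataclass
class PropelConfig:
    n_iterations: int = 5            # outer PROPEL iterations (assumed value)
    episodes_per_update: int = 600   # policy-gradient episodes per iteration (per the paper's setup)


def propel(
    initial_program: Policy,
    update_f: Callable[[Policy, int], Policy],       # Algorithm 2: policy-gradient update of a mixed policy
    collect_demos: Callable[[Policy], List[Demo]],   # roll out the mixed policy to gather demonstrations
    project_pi: Callable[[List[Demo]], Policy],      # Algorithm 3: imitation-learn a program from demos
    config: Optional[PropelConfig] = None,
) -> Policy:
    """Alternate neural policy-gradient updates with projection back onto program space."""
    config = config or PropelConfig()
    program = initial_program
    for _ in range(config.n_iterations):
        # UPDATE_F: lift the current program with a neural correction and improve it
        # using a policy-gradient method (DDPG in the paper's TORCS experiments).
        mixed_policy = update_f(program, config.episodes_per_update)
        # PROJECT_Pi: synthesize a new programmatic policy that imitates the mixed policy.
        demonstrations = collect_demos(mixed_policy)
        program = project_pi(demonstrations)
    return program
```

Keeping `update_f` and `project_pi` as injected callables mirrors the separation between the neural update and the program-synthesis projection without committing to a specific policy-gradient method or program synthesizer.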