Imitation-Projected Programmatic Reinforcement Learning

Authors: Abhinav Verma, Hoang Le, Yisong Yue, Swarat Chaudhuri

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present theoretical convergence results for PROPEL and empirically evaluate the approach in three continuous control domains. The experiments show that PROPEL can significantly outperform state-of-the-art approaches for learning programmatic policies.
Researcher Affiliation | Academia | Abhinav Verma (Rice University, averma@rice.edu); Hoang M. Le (Caltech, hmle@caltech.edu); Yisong Yue (Caltech, yyue@caltech.edu); Swarat Chaudhuri (Rice University, swarat@rice.edu)
Pseudocode | Yes | Algorithm 1: Imitation-Projected Programmatic Reinforcement Learning (PROPEL); Algorithm 2: UPDATEF, neural policy gradient for mixed policies; Algorithm 3: PROJECTΠ, program synthesis via imitation learning (see the sketch of this loop after the table).
Open Source Code | Yes | The code for the TORCS experiments can be found at: https://bitbucket.org/averma8053/propel
Open Datasets | Yes | We evaluate over five distinct tracks in the TORCS simulator. Empirical results on two additional classic control tasks, Mountain-Car and Pendulum, are provided in Appendix B.
Dataset Splits | No | The paper mentions running experiments with "twenty-five random seeds" and "training for 600 episodes", but it does not specify explicit training/validation/test dataset splits (e.g., percentages or sample counts) as would be typical for static datasets. Because the experiments run in a simulation environment, data is generated dynamically rather than drawn from a fixed dataset.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers, such as Python 3.8 or CPLEX 12.4) needed to replicate the experiment.
Experiment Setup | Yes | We perform the experiments with twenty-five random seeds and report the median lap time over these twenty-five trials. ... DDPG, a neural policy learned using the Deep Deterministic Policy Gradients [36] algorithm, with 600 episodes of training for each track. (A sketch of this evaluation protocol follows the PROPEL sketch below.)
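
The three algorithms named in the Pseudocode row fit together as a lift-update-project loop: the current programmatic policy is combined with a neural correction, improved with policy gradients (UPDATEF), and then projected back onto the programmatic policy class by imitating the improved mixed policy (PROJECTΠ). The following is a minimal sketch of that loop, not the authors' implementation; the helpers update_f, collect_trajectories, and synthesize_program_by_imitation are hypothetical placeholders for the code in the linked repository.

```python
def propel(env, init_program, n_iterations, n_rollouts):
    """Hedged sketch of the PROPEL outer loop (Algorithm 1).

    The helper functions used below are hypothetical stand-ins for the
    authors' TORCS/classic-control implementation.
    """
    program = init_program  # current programmatic policy
    for _ in range(n_iterations):
        # UPDATEF (Algorithm 2): keep the program fixed and train a neural
        # correction so that the mixed policy (program + correction) improves,
        # using an off-the-shelf policy-gradient method such as DDPG.
        neural_correction = update_f(env, program)
        mixed_policy = lambda state: program(state) + neural_correction(state)

        # PROJECTΠ (Algorithm 3): project the improved mixed policy back onto
        # the programmatic policy class by imitating it on fresh rollouts.
        demos = collect_trajectories(env, mixed_policy, n_rollouts)
        program = synthesize_program_by_imitation(demos)
    return program
```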
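
The report quotes the evaluation protocol (twenty-five random seeds, 600 training episodes per track, median lap time) but the harness itself is not reproduced here. The sketch below shows one way that protocol could be scripted, under the assumption of two hypothetical helpers, train_ddpg and measure_lap_time, standing in for the authors' TORCS training and evaluation code.

```python
import statistics

N_SEEDS = 25       # "twenty-five random seeds"
N_EPISODES = 600   # "600 episodes of training for each track"

def evaluate_track(track_name):
    """Hedged sketch of the per-track evaluation protocol described in the paper."""
    lap_times = []
    for seed in range(N_SEEDS):
        # train_ddpg and measure_lap_time are hypothetical placeholders.
        policy = train_ddpg(track_name, episodes=N_EPISODES, seed=seed)
        lap_times.append(measure_lap_time(track_name, policy))
    # The paper reports the median lap time over the twenty-five trials.
    return statistics.median(lap_times)
```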