Programmatic Reinforcement Learning without Oracles

Authors: Wenjie Qiu, He Zhu

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results demonstrate that our algorithm excels in discovering optimal programmatic policies that are highly interpretable.
Researcher Affiliation | Academia | Wenjie Qiu, Department of Computer Science, Rutgers University, wq37@cs.rutgers.edu; He Zhu, Department of Computer Science, Rutgers University, hz375@cs.rutgers.edu
Pseudocode | Yes | Algorithm 1: Programmatic Reinforcement Learning without Oracles
Open Source Code | Yes | Code is available at https://github.com/RU-Automated-Reasoning-Group/pi-PRL.
Open Datasets | Yes | We evaluated our approach on two groups of challenging continuous control benchmarks involving motion control and task planning. Group one: (1) Ant Cross Maze: the example depicted in Fig. 2. (2) Ant Random Goal: the quadruped MuJoCo Ant in Fig. 10a is trained to reach a randomly sampled goal location within a confined circular region. (3) Pusher: a robotic arm in Fig. 10b is trained to push a cylinder object to a given target location. (4) Half Cheetah Hurdle: a MuJoCo half-cheetah in Fig. 10c is required to run and jump over three hurdles to reach a given goal area. Group two consists of three hierarchical RL benchmarks: (1) Ant Maze: the example depicted in Fig. 6a. (2) Ant Push: this task requires the Ant in Fig. 10d to push away a movable block to reach the goal region behind it. (3) Ant Fall: the Ant in Fig. 10e is required to push a movable block into a rift to fill the gap and then walk across it to reach the target on the other side of the rift.
Dataset Splits | No | The paper describes training in reinforcement learning environments but does not provide the explicit train/validation/test splits common in supervised learning; RL instead involves continuous interaction with the environment.
Hardware Specification | No | The paper does not explicitly describe the hardware used for running experiments, such as specific GPU or CPU models.
Software Dependencies | No | The paper mentions using SAC and TRPO algorithms implemented in the OpenAI Spinning Up RL framework but does not provide specific version numbers for these software components or other libraries.
Experiment Setup | Yes | The following hyperparameters are used to train primitive policies with the SAC algorithm (Haarnoja et al., 2018): discount factor γ = 0.99; SGD optimizer with actor learning rate 0.001 and critic learning rate 0.001; mini-batch size n = 100; replay buffer of size 100000; soft-update target τ = 0.005; target update interval and gradient steps set to 1. The following hyperparameters are used to train π-PRL programs for the group-one tasks with the TRPO algorithm (Schulman et al., 2015): discount factor γ = 0.99; number of trajectories per epoch N = 50; maximum search depth of the program derivation graph Dg = 6; KL-divergence limit δ = 0.01; GAE λ = 0.97; Gumbel-Softmax temperature T = 0.25. The following hyperparameters are used to train π-HPRL programs for the group-two tasks with TRPO: discount factor γ = 0.995; number of trajectories per epoch N = 100; KL-divergence limit δ = 0.01; GAE λ = 0.97; Gumbel-Softmax temperature T = 0.25. (Hedged configuration sketches based on these values follow this table.)
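As a reading aid for the Experiment Setup row, here is a minimal Python sketch that maps the reported SAC hyperparameters onto the OpenAI Spinning Up interface the paper says it uses. It is not the authors' training script: the environment id AntCrossMaze-v0 is a hypothetical placeholder for the custom MuJoCo benchmarks shipped with the pi-PRL repository, the sac_pytorch entry point is an assumption, and stock Spinning Up SAC uses Adam rather than the SGD optimizer reported above, so the authors' actual implementation is likely modified.

    import gym
    from spinup import sac_pytorch as sac  # assumed Spinning Up entry point

    def env_fn():
        # Hypothetical environment id; the paper's benchmarks are custom
        # MuJoCo tasks provided in the authors' repository, not standard Gym envs.
        return gym.make("AntCrossMaze-v0")

    sac(
        env_fn,
        gamma=0.99,           # discount factor γ = 0.99
        lr=1e-3,              # actor and critic learning rates (both 0.001)
        batch_size=100,       # mini-batch size n = 100
        replay_size=100_000,  # replay buffer of size 100000
        polyak=0.995,         # Spinning Up's polyak = 1 - τ, so τ = 0.005
        update_every=1,       # target update interval / gradient steps of 1
    )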
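The program-level TRPO hyperparameters could be passed to Spinning Up's TRPO implementation roughly as below. This is again only a sketch under the same assumptions: π-PRL-specific quantities such as the number of trajectories per epoch N, the derivation-graph depth Dg, and the Gumbel-Softmax temperature T have no counterpart in stock Spinning Up and are handled by the authors' own code, and the trpo_tf1 entry point name is assumed.

    import gym
    from spinup import trpo_tf1 as trpo  # assumed TF1-only TRPO entry point

    def env_fn():
        # Same hypothetical environment id as in the SAC sketch above.
        return gym.make("AntCrossMaze-v0")

    trpo(
        env_fn,
        gamma=0.99,  # discount factor for group-one tasks (0.995 reported for group two)
        delta=0.01,  # KL-divergence limit δ = 0.01
        lam=0.97,    # GAE λ = 0.97
    )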