Programmatic Reinforcement Learning without Oracles
Authors: Wenjie Qiu, He Zhu
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results demonstrate that our algorithm excels in discovering optimal programmatic policies that are highly interpretable. |
| Researcher Affiliation | Academia | Wenjie Qiu Department of Computer Science Rutgers University wq37@cs.rutgers.edu He Zhu Department of Computer Science Rutgers University hz375@cs.rutgers.edu |
| Pseudocode | Yes | Algorithm 1 Programmatic Reinforcement Learning without Oracles |
| Open Source Code | Yes | Code is available at https://github.com/RU-Automated-Reasoning-Group/pi-PRL. |
| Open Datasets | Yes | We evaluated our approach on two groups of challenging continuous control benchmarks involving motion control and task planning: (1) Ant Cross Maze: the example depicted in Fig. 2. (2) Ant Random Goal: The quadruped MuJoCo Ant in Fig. 10a is trained to reach a randomly sampled goal location within a confined circular region. (3) Pusher: A robotic arm in Fig. 10b is trained to push a cylinder object to a given target location. (4) Half Cheetah Hurdle: A MuJoCo half-cheetah in Fig. 10c is required to run and jump over three hurdles to reach a given goal area. Group two consists of three hierarchical RL benchmarks: (1) Ant Maze: the example depicted in Fig. 6a. (2) Ant Push: This task requires the Ant in Fig. 10d to push away a movable block to reach the goal region behind it. (3) Ant Fall: The Ant in Fig. 10e is required to push a movable block into a rift to fill the gap and then walk across it to reach the target on the other side of the rift. |
| Dataset Splits | No | The paper describes training in reinforcement learning environments but does not provide explicit train/validation/test dataset splits, as is common for supervised learning tasks. RL involves continuous interaction with the environment. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions using the SAC and TRPO algorithms as implemented in the OpenAI Spinning Up RL framework but does not provide specific version numbers for these software components or other libraries. |
| Experiment Setup | Yes | The following hyperparameters are used to train primitive policies with the SAC (Haarnoja et al., 2018) algorithm: discount factor γ = 0.99; SGD optimizer; actor learning rate 0.001; critic learning rate 0.001; mini-batch size n = 100; replay buffer of size 100000; soft target update rate τ = 0.005; target update interval and gradient steps set to 1. The following hyperparameters are used to train π-PRL programs for solving tasks in group-one environments with the TRPO (Schulman et al., 2015) algorithm: discount factor γ = 0.99; number of trajectories per epoch N = 50; maximum search depth of the program derivation graph Dg = 6; KL-divergence limit δ = 0.01; GAE λ = 0.97; Gumbel-Softmax temperature T = 0.25. The following hyperparameters are used to train π-HPRL programs for solving tasks in group-two environments with the TRPO algorithm: discount factor γ = 0.995; number of trajectories per epoch N = 100; KL-divergence limit δ = 0.01; GAE λ = 0.97; Gumbel-Softmax temperature T = 0.25. (Hedged configuration sketches follow the table.) |
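
For readers interested in reproducing the primitive-policy training, the sketch below shows one plausible way to pass the reported SAC hyperparameters to the OpenAI Spinning Up implementation the paper mentions. It is only an illustration: the environment name is a stand-in (the paper's benchmarks are custom MuJoCo tasks from the authors' repository), the mapping of the soft-update rate τ to Spinning Up's `polyak` argument is an assumption, and the paper's stated SGD optimizer would require changing Spinning Up's default optimizer.

```python
# Hedged sketch: passing the reported SAC hyperparameters to OpenAI Spinning Up.
# "Ant-v2" is only a placeholder; the paper's benchmarks are custom environments.
import gym
from spinup import sac_pytorch as sac

sac(
    env_fn=lambda: gym.make("Ant-v2"),  # placeholder environment
    gamma=0.99,           # discount factor (reported)
    lr=1e-3,              # actor/critic learning rate 0.001 (reported)
    batch_size=100,       # mini-batch size n = 100 (reported)
    replay_size=100_000,  # replay buffer size (reported)
    polyak=0.995,         # assuming Spinning Up's polyak corresponds to 1 - τ, with τ = 0.005
)
```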
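
The TRPO-side settings for π-PRL and π-HPRL can likewise be collected into a single configuration. The dictionary below merely restates the values reported above; the key names are illustrative and do not come from the authors' code.

```python
# Illustrative restatement of the reported TRPO hyperparameters;
# the key names are hypothetical and not taken from the authors' code.
TRPO_CONFIG = {
    "pi_prl": {   # group-one tasks (motion control and task planning)
        "gamma": 0.99,                      # discount factor
        "trajectories_per_epoch": 50,       # N
        "max_derivation_graph_depth": 6,    # Dg
        "kl_limit": 0.01,                   # KL-divergence limit delta
        "gae_lambda": 0.97,
        "gumbel_softmax_temperature": 0.25,
    },
    "pi_hprl": {  # group-two hierarchical RL tasks
        "gamma": 0.995,
        "trajectories_per_epoch": 100,      # N
        "kl_limit": 0.01,
        "gae_lambda": 0.97,
        "gumbel_softmax_temperature": 0.25,
    },
}
```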