Programmatic Reinforcement Learning without Oracles
Authors: Wenjie Qiu, He Zhu
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results demonstrate that our algorithm excels in discovering optimal programmatic policies that are highly interpretable. |
| Researcher Affiliation | Academia | Wenjie Qiu Department of Computer Science Rutgers University wq37@cs.rutgers.edu He Zhu Department of Computer Science Rutgers University hz375@cs.rutgers.edu |
| Pseudocode | Yes | Algorithm 1 Programmatic Reinforcement Learning without Oracles |
| Open Source Code | Yes | Code is available at https://github.com/RU-Automated-Reasoning-Group/pi-PRL. |
| Open Datasets | Yes | We evaluated our approach on two groups of challenging continuous control benchmarks involving motion control and task planning: (1) Ant Cross Maze: the example depicted in Fig. 2. (2) Ant Random Goal: The quadruped MuJoCo Ant in Fig. 10a is trained to reach a randomly sampled goal location within a confined circular region. (3) Pusher: A robotic arm in Fig. 10b is trained to push a cylinder object to a given target location. (4) Half Cheetah Hurdle: A MuJoCo half-cheetah in Fig. 10c is required to run and jump over three hurdles to reach a given goal area. Group two consists of three hierarchical RL benchmarks: (1) Ant Maze: the example depicted in Fig. 6a. (2) Ant Push: This task requires the Ant in Fig. 10d to push away a movable block to reach the goal region behind it. (3) Ant Fall: The Ant in Fig. 10e is required to push a movable block into a rift to fill the gap and then walk across it to reach the target on the other side of the rift. |
| Dataset Splits | No | The paper describes training in reinforcement learning environments but does not provide explicit train/validation/test dataset splits, as is common for supervised learning tasks. RL involves continuous interaction with the environment. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions using the SAC and TRPO algorithms as implemented in the OpenAI Spinning Up RL framework but does not provide specific version numbers for these software components or other libraries. |
| Experiment Setup | Yes | The following hyperparameters are used to train primitive policies with the SAC (Haarnoja et al., 2018) algorithm: discount factor γ = 0.99; SGD optimizer; actor learning rate 0.001; critic learning rate 0.001; mini-batch size n = 100; replay buffer of size 100000; soft target update rate τ = 0.005; target update interval and gradient steps set to 1. The following hyperparameters are used to train π-PRL programs for solving tasks in group-one environments with the TRPO (Schulman et al., 2015) algorithm: discount factor γ = 0.99; number of trajectories per epoch N = 50; maximum search depth of the program derivation graph Dg = 6; KL-divergence limit δ = 0.01; GAE λ = 0.97; Gumbel-Softmax temperature T = 0.25. The following hyperparameters are used to train π-HPRL programs for solving tasks in group-two environments with the TRPO algorithm: discount factor γ = 0.995; number of trajectories per epoch N = 100; KL-divergence limit δ = 0.01; GAE λ = 0.97; Gumbel-Softmax temperature T = 0.25. (Hedged configuration sketches follow the table.) |
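
For readers interested in reproducing the primitive-policy training, the sketch below shows one plausible way to pass the reported SAC hyperparameters to the OpenAI Spinning Up implementation the paper mentions. It is only an illustration: the environment name is a stand-in (the paper's benchmarks are custom MuJoCo tasks from the authors' repository), the mapping of the soft-update rate τ to Spinning Up's `polyak` argument is an assumption, and the paper's stated SGD optimizer would require changing Spinning Up's default optimizer.

```python
# Hedged sketch: passing the reported SAC hyperparameters to OpenAI Spinning Up.
# "Ant-v2" is only a placeholder; the paper's benchmarks are custom environments.
import gym
from spinup import sac_pytorch as sac

sac(
    env_fn=lambda: gym.make("Ant-v2"),  # placeholder environment
    gamma=0.99,           # discount factor (reported)
    lr=1e-3,              # actor/critic learning rate 0.001 (reported)
    batch_size=100,       # mini-batch size n = 100 (reported)
    replay_size=100_000,  # replay buffer size (reported)
    polyak=0.995,         # assuming Spinning Up's polyak corresponds to 1 - τ, with τ = 0.005
)
```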
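
The TRPO-side settings for π-PRL and π-HPRL can likewise be collected into a single configuration. The dictionary below merely restates the values reported above; the key names are illustrative and do not come from the authors' code.

```python
# Illustrative restatement of the reported TRPO hyperparameters;
# the key names are hypothetical and not taken from the authors' code.
TRPO_CONFIG = {
    "pi_prl": {   # group-one tasks (motion control and task planning)
        "gamma": 0.99,                      # discount factor
        "trajectories_per_epoch": 50,       # N
        "max_derivation_graph_depth": 6,    # Dg
        "kl_limit": 0.01,                   # KL-divergence limit delta
        "gae_lambda": 0.97,
        "gumbel_softmax_temperature": 0.25,
    },
    "pi_hprl": {  # group-two hierarchical RL tasks
        "gamma": 0.995,
        "trajectories_per_epoch": 100,      # N
        "kl_limit": 0.01,
        "gae_lambda": 0.97,
        "gumbel_softmax_temperature": 0.25,
    },
}
```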