Instructing Goal-Conditioned Reinforcement Learning Agents with Temporal Logic Objectives

Authors: Wenjie Qiu, Wensen Mao, He Zhu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results demonstrate the effectiveness of our approach in adapting goal-conditioned RL agents to satisfy complex temporal logic task specifications zero-shot. We evaluate GCRL-LTL in the Zone Env environment shown in Fig. 1 and the Ant-16rooms environment depicted in Fig. 3. We also include a 7×7 discrete grid-based environment, Letter World, from Andreas et al. [2017]. Figure 6 shows the evaluation results across training iterations on both partially ordered and avoidance tasks in Letter World. Fig. 7 demonstrates the results for avoidance tasks in Zone Env.
Researcher Affiliation | Academia | Wenjie Qiu, Rutgers University, wq37@cs.rutgers.edu; Wensen Mao, Rutgers University, wm300@cs.rutgers.edu; He Zhu, Rutgers University, hz375@cs.rutgers.edu
Pseudocode | Yes | Algorithm 1: Goal-Conditioned Proximal Policy Optimization Algorithm; Algorithm 2: GCSL with organizing past goals into a graph structure G (see the goal-graph sketch after this table).
Open Source Code | Yes | GCRL-LTL is available at https://github.com/RU-Automated-Reasoning-Group/GCRL-LTL
Open Datasets | Yes | We evaluate GCRL-LTL in the Zone Env environment shown in Fig. 1 and the Ant-16rooms environment depicted in Fig. 3. We also include a 7×7 discrete grid-based environment, Letter World, from Andreas et al. [2017]. Zone Env: derived from OpenAI's Safety Gym (Ray et al. [2019]). Ant-16rooms: a continuous observation and action space environment adapted from the 16-rooms environment of Jothimurugan et al. [2021].
Dataset Splits | No | The paper describes training procedures and evaluation episodes but does not provide specific training/validation/test dataset splits (e.g., percentages or absolute counts).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100) or CPU models used for running the experiments.
Software Dependencies | No | The paper mentions software components such as the 'Stable-Baselines3 RL framework' and the 'OpenAI Spinning Up RL framework' but does not specify their version numbers or other crucial software dependencies with version information for reproducibility.
Experiment Setup | Yes | The following hyperparameters are used to train the dynamic primitive policies for Zone Env with PPO (Schulman et al. [2017]): discount factor γ = 0.998; SGD optimizer with actor learning rate 0.001 and critic learning rate 0.001; mini-batch size n = 256. For all training with the PPO algorithm, we use GAE λ = 0.97, clip range ϵ = 0.2, and set the number of iterations when optimizing the surrogate loss to 10.
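
For readers attempting to reproduce the setup, the sketch below maps the reported Experiment Setup hyperparameters onto the Stable-Baselines3 PPO API mentioned under Software Dependencies. It is a minimal illustration, not the authors' training code: the environment ID, policy architecture, and training budget are placeholders, and Stable-Baselines3's PPO uses a single Adam optimizer and learning rate, so the paper's SGD optimizer and separate actor/critic learning rates (both 0.001) are approximated by one learning rate here.

# Hypothetical mapping of the reported PPO hyperparameters onto Stable-Baselines3.
# The environment ID, policy architecture, and timestep budget are placeholders;
# the paper's SGD optimizer and separate actor/critic learning rates are
# approximated by SB3's single Adam learning rate.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # placeholder; the paper trains primitive policies in Zone Env

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=1e-3,   # actor/critic learning rate 0.001
    gamma=0.998,          # discount factor
    gae_lambda=0.97,      # GAE lambda
    clip_range=0.2,       # PPO clip range epsilon
    batch_size=256,       # mini-batch size n
    n_epochs=10,          # iterations when optimizing the surrogate loss
    verbose=1,
)
model.learn(total_timesteps=100_000)  # placeholder training budget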
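
Algorithm 2 (referenced in the Pseudocode row) organizes past goals into a graph structure G. The snippet below is only a hedged illustration of that general idea, assuming goals reached along past trajectories become graph nodes, consecutively reached goals become edges, and a sequence of subgoals can then be planned by shortest-path search; it is not the paper's algorithm, and all names (add_trajectory, plan_subgoals) are hypothetical.

# Hedged illustration (not the paper's Algorithm 2): organize past goals into a
# graph G so that subgoal sequences can be planned by shortest-path search.
import networkx as nx

G = nx.DiGraph()

def add_trajectory(goals_reached):
    # Add goals visited along one trajectory as nodes, linking consecutive goals.
    for a, b in zip(goals_reached, goals_reached[1:]):
        G.add_edge(a, b, weight=1.0)  # a real system could weight edges by value estimates

def plan_subgoals(start, subgoal_sequence):
    # Chain shortest paths through the goal graph to visit subgoals in order.
    path, current = [start], start
    for sg in subgoal_sequence:
        segment = nx.shortest_path(G, current, sg, weight="weight")
        path.extend(segment[1:])
        current = sg
    return path

# Example: two logged trajectories, then a plan visiting goals "b" then "d".
add_trajectory(["a", "b", "c"])
add_trajectory(["c", "d"])
print(plan_subgoals("a", ["b", "d"]))  # -> ['a', 'b', 'c', 'd']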