Instructing Goal-Conditioned Reinforcement Learning Agents with Temporal Logic Objectives
Authors: Wenjie Qiu, Wensen Mao, He Zhu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results demonstrate the effectiveness of our approach in adapting goal-conditioned RL agents to satisfy complex temporal logic task specifications zero-shot. We evaluate GCRL-LTL in the Zone Env environment shown in Fig. 1 and the Ant-16rooms environments depicted in Fig. 3. We also include a 7×7 discrete grid-based environment, Letter World, from Andreas et al. [2017]. Figure 6 shows the evaluation results across training iterations on both partially ordered and avoidance tasks in Letter World. Fig. 7 demonstrates the results for avoidance tasks in Zone Env. |
| Researcher Affiliation | Academia | Wenjie Qiu Rutgers University wq37@cs.rutgers.edu Wensen Mao Rutgers University wm300@cs.rutgers.edu He Zhu Rutgers University hz375@cs.rutgers.edu |
| Pseudocode | Yes | Algorithm 1 Goal-Conditioned Proximal Policy Optimization Algorithm; Algorithm 2 GCSL with organizing past goals into a graph structure G |
| Open Source Code | Yes | GCRL-LTL is available at https://github.com/RU-Automated-Reasoning-Group/GCRL-LTL |
| Open Datasets | Yes | We evaluate GCRL-LTL in the Zone Env environment shown in Fig. 1 and the Ant-16rooms environments depicted in Fig. 3. We also include a 7×7 discrete grid-based environment, Letter World, from Andreas et al. [2017]. Zone Env: this environment is derived from OpenAI's Safety Gym (Ray et al. [2019]). Ant-16rooms: this environment with continuous observation and action spaces is adapted from the 16-rooms environment of Jothimurugan et al. [2021]. |
| Dataset Splits | No | The paper describes training procedures and evaluation episodes but does not provide specific training/validation/test dataset splits (e.g., percentages or absolute counts). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100) or CPU models used for running the experiments. |
| Software Dependencies | No | The paper mentions software components such as the 'Stable-Baselines3 RL framework' and the 'OpenAI Spinning Up RL framework' but does not specify their version numbers or other software dependencies with version information needed for reproducibility. |
| Experiment Setup | Yes | The following hyperparameters are used to train the dynamic primitive policies for Zone Env with PPO (Schulman et al. [2017]): discount factor γ = 0.998; SGD optimizer; actor learning rate 0.001; critic learning rate 0.001; mini-batch size n = 256. For all training with the PPO algorithm, we use GAE λ = 0.97 and clip range ϵ = 0.2, and the number of iterations when optimizing the surrogate loss is set to 10. A hedged configuration sketch using these values appears below the table. |
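
The sketch below shows one way the reported PPO hyperparameters could be wired up in Stable-Baselines3 (the framework named under Software Dependencies). It is a minimal, illustrative configuration, not the authors' training script: the environment is a placeholder continuous-control task standing in for Zone Env, the rollout length `n_steps` and total timesteps are assumptions not stated in the excerpt, and because SB3 uses a single learning rate with Adam by default, the paper's SGD optimizer with identical actor/critic rates (both 0.001) is approximated via `policy_kwargs`.

```python
# Minimal sketch: PPO configured with the hyperparameters reported in the paper,
# using Stable-Baselines3. The environment below is a placeholder; the actual
# experiments use Zone Env (derived from OpenAI's Safety Gym), which is not
# reproduced here.
import gymnasium as gym
import torch
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")  # placeholder continuous-control env, not Zone Env

model = PPO(
    policy="MlpPolicy",
    env=env,
    gamma=0.998,          # discount factor reported in the paper
    learning_rate=1e-3,   # actor and critic learning rates are both 0.001
    batch_size=256,       # mini-batch size n = 256
    n_epochs=10,          # iterations when optimizing the surrogate loss
    gae_lambda=0.97,      # GAE lambda
    clip_range=0.2,       # PPO clip range epsilon
    n_steps=2048,         # assumption: rollout length is not reported in the excerpt
    # The paper reports an SGD optimizer; SB3 defaults to Adam, so override it here.
    policy_kwargs=dict(optimizer_class=torch.optim.SGD),
    verbose=1,
)
model.learn(total_timesteps=10_000)  # assumption: training budget not given in the excerpt
```

Because the paper gives identical actor and critic learning rates, a single `learning_rate` suffices here; separate per-network rates would require a customized policy class rather than the stock `MlpPolicy`.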