Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Instructing Goal-Conditioned Reinforcement Learning Agents with Temporal Logic Objectives
Authors: Wenjie Qiu, Wensen Mao, He Zhu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results demonstrate the effectiveness of our approach in adapting goal-conditioned RL agents to satisfy complex temporal logic task specifications zero-shot. We evaluate GCRL-LTL in the Zone Env environment shown in Fig. 1 and the Ant-16rooms environments depicted in Fig. 3. We also include a 7 × 7 discrete grid-based environment Letter World from Andreas et al. [2017]. Figure 6 shows the evaluation results across training iterations on both partially ordered and avoidance tasks in Letter World. Fig. 7 demonstrates the results for avoidance tasks in Zone Env. |
| Researcher Affiliation | Academia | Wenjie Qiu, Rutgers University; Wensen Mao, Rutgers University; He Zhu, Rutgers University |
| Pseudocode | Yes | Algorithm 1 Goal-Conditioned Proximal Policy Optimization Algorithm; Algorithm 2 GCSL with organizing past goals into a graph structure G |
| Open Source Code | Yes | GCRL-LTL is available at https://github.com/RU-Automated-Reasoning-Group/GCRL-LTL |
| Open Datasets | Yes | We evaluate GCRL-LTL in the Zone Env environment shown in Fig. 1 and the Ant-16rooms environments depicted in Fig. 3. We also include a 7 × 7 discrete grid-based environment Letter World from Andreas et al. [2017]. Zone Env. The Zone Env environment is derived from OpenAI's Safety Gym (Ray et al. [2019]). Ant-16rooms. This environment with continuous observation and action space is adapted from the 16 rooms environment from Jothimurugan et al. [2021]. |
| Dataset Splits | No | The paper describes training procedures and evaluation episodes but does not provide specific training/validation/test dataset splits (e.g., percentages or absolute counts). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100) or CPU models used for running the experiments. |
| Software Dependencies | No | The paper mentions software components such as 'Stable-Baselines3 RL framework' and 'OpenAI Spinning Up RL framework' but does not specify their version numbers or other crucial software dependencies with version information for reproducibility. |
| Experiment Setup | Yes | The following hyperparameters are used to train the dynamic primitive policies for Zone Env with PPO (Schulman et al. [2017]). Discount factor γ = 0.998. SGD optimizer; actor learning rate 0.001; critic learning rate 0.001. Mini-batch size n = 256. For all training with PPO algorithm, we use GAE λ = 0.97, clip range ϵ = 0.2, and we set the number of iterations when optimizing the surrogate loss to be 10. |
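
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which may help when attempting to reproduce the Zone Env training runs. The sketch below is illustrative only: the class `PPOHyperparams` and the helper `to_sb3_kwargs` are hypothetical names, and the mapping onto Stable-Baselines3 `PPO` keyword arguments is an assumption, since the table does not state which framework was used for which environment. Stable-Baselines3's `PPO` accepts a single `learning_rate` and uses Adam by default, so the SGD optimizer and the separate actor and critic rates named in the row are not fully captured here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOHyperparams:
    """Values quoted in the Experiment Setup row (hypothetical container)."""
    gamma: float = 0.998       # discount factor
    actor_lr: float = 1e-3     # actor learning rate
    critic_lr: float = 1e-3    # critic learning rate
    batch_size: int = 256      # mini-batch size n
    gae_lambda: float = 0.97   # GAE lambda
    clip_range: float = 0.2    # PPO clip range epsilon
    surrogate_iters: int = 10  # optimization iterations on the surrogate loss


def to_sb3_kwargs(hp: PPOHyperparams) -> dict:
    """Map the quoted values to Stable-Baselines3 PPO keyword names.

    Assumption: a single shared learning rate is acceptable, which only
    holds when the actor and critic rates are equal.
    """
    if hp.actor_lr != hp.critic_lr:
        raise ValueError("SB3 PPO takes one learning_rate; rates differ")
    return {
        "learning_rate": hp.actor_lr,
        "gamma": hp.gamma,
        "gae_lambda": hp.gae_lambda,
        "clip_range": hp.clip_range,
        "n_epochs": hp.surrogate_iters,
        "batch_size": hp.batch_size,
    }


kwargs = to_sb3_kwargs(PPOHyperparams())
```

With Stable-Baselines3 installed, the resulting dictionary could be passed as `PPO("MlpPolicy", env, **kwargs)`, where `env` is whatever Gym-compatible environment is being trained on. The policy class and environment are not specified in the table and would need to be taken from the released code.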