Generalized Off-Policy Actor-Critic
Authors: Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks. |
| Researcher Affiliation | Academia | Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson Department of Computer Science University of Oxford {shangtong.zhang, wendelin.boehmer, shimon.whiteson}@cs.ox.ac.uk |
| Pseudocode | No | The main text does not include pseudocode; the paper states only that "Pseudocode of Geoff-PAC is provided in supplementary materials." |
| Open Source Code | Yes | More details are provided in supplementary materials and all the implementations are publicly available at https://github.com/ShangtongZhang/DeepRL |
| Open Datasets | Yes | We benchmarked Off-PAC, ACE, DDPG, TD3, and Geoff-PAC on five Mujoco robot simulation tasks from OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | No | The paper uses Mujoco robot simulation environments for reinforcement learning rather than static datasets, so no explicit train/validation/test splits (percentages or sample counts) are reported. |
| Hardware Specification | No | The paper mentions 'a generous equipment grant from NVIDIA' but does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper does not list specific version numbers for software dependencies or libraries used in the implementation or experiments. It only mentions that 'all the implementations are publicly available'. |
| Experiment Setup | Yes | To stabilize training, we adopted the A2C (Clemente et al., 2017) paradigm with multiple workers and utilized a target network (Mnih et al., 2015) and a replay buffer (Lin, 1992). All three algorithms share the same architecture and the same parameterization. We found ACE was not sensitive to λ1 and set λ1 = 0 for all experiments. For Geoff-PAC, we found λ1 = 0.7, λ2 = 0.6, γ̂ = 0.2 produced good empirical results and used this combination for all remaining tasks. |
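
The setup row above quotes the hyperparameters reported in the paper (λ1, λ2, γ̂ for Geoff-PAC, and λ1 = 0 for ACE). The sketch below is a minimal, hypothetical Python configuration that collects those reported values; it is not the authors' implementation, and any value marked "assumed" (worker count, buffer size) is not stated in the paper.

```python
# Minimal illustrative configuration of the reported experiment setup.
# Not the authors' code; values marked "assumed" are placeholders that
# the paper does not specify.

geoff_pac_hyperparams = {
    "lambda_1": 0.7,   # reported: used for all remaining tasks
    "lambda_2": 0.6,   # reported
    "gamma_hat": 0.2,  # reported: the extra discount factor gamma-hat
}

ace_hyperparams = {
    "lambda_1": 0.0,   # reported: ACE was not sensitive to lambda_1, so it was set to 0
}

training_setup = {
    "paradigm": "A2C with multiple workers",  # reported (Clemente et al., 2017)
    "num_workers": 16,                        # assumed: paper only says "multiple workers"
    "target_network": True,                   # reported (Mnih et al., 2015)
    "replay_buffer": True,                    # reported (Lin, 1992)
    "replay_buffer_size": 1_000_000,          # assumed: size not stated in the paper
}

if __name__ == "__main__":
    for name, cfg in [("Geoff-PAC", geoff_pac_hyperparams),
                      ("ACE", ace_hyperparams),
                      ("Training", training_setup)]:
        print(name, cfg)
```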