Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Generalized Off-Policy Actor-Critic
Authors: Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson
NeurIPS 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks. |
| Researcher Affiliation | Academia | Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson Department of Computer Science University of Oxford EMAIL |
| Pseudocode | No | Pseudocode of Geoff-PAC is provided in supplementary materials. |
| Open Source Code | Yes | More details are provided in supplementary materials and all the implementations are publicly available⁵. ⁵ https://github.com/ShangtongZhang/DeepRL |
| Open Datasets | Yes | We benchmarked Off-PAC, ACE, DDPG, TD3, and Geoff-PAC on five Mujoco robot simulation tasks from OpenAI gym (Brockman et al., 2016). |
| Dataset Splits | No | The paper describes using Mujoco robot simulation tasks, which are environments for reinforcement learning, not static datasets with explicit train/validation/test splits in the traditional supervised learning sense. No specific percentages or sample counts for data splits are provided. |
| Hardware Specification | No | The paper mentions 'a generous equipment grant from NVIDIA' but does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper does not list specific version numbers for software dependencies or libraries used in the implementation or experiments. It only mentions that 'all the implementations are publicly available'. |
| Experiment Setup | Yes | To stabilize training, we adopted the A2C (Clemente et al., 2017) paradigm with multiple workers and utilized a target network (Mnih et al., 2015) and a replay buffer (Lin, 1992). All three algorithms share the same architecture and the same parameterization. We found ACE was not sensitive to λ1 and set λ1 = 0 for all experiments. For Geoff-PAC, we found λ1 = 0.7, λ2 = 0.6, γ̂ = 0.2 produced good empirical results and used this combination for all remaining tasks. |
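
The hyperparameters quoted in the Experiment Setup row can be recorded in a small configuration object for anyone attempting a rerun. The following is only a minimal sketch: the class and field names (`GeoffPACSetup`, `lambda1`, `lambda2`, `gamma_hat`, and the two boolean flags) are hypothetical and do not come from the paper or its repository. Only the numeric values and the use of a target network and replay buffer are taken from the row above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GeoffPACSetup:
    """Hypothetical container for values quoted in the Experiment Setup row."""

    lambda1: float = 0.7             # λ1 reported for Geoff-PAC
    lambda2: float = 0.6             # λ2 reported for Geoff-PAC
    gamma_hat: float = 0.2           # γ̂ reported for Geoff-PAC
    use_target_network: bool = True  # the row mentions a target network
    use_replay_buffer: bool = True   # the row mentions a replay buffer


# The row reports λ1 = 0 for ACE; no other ACE-specific value is quoted.
ACE_LAMBDA1 = 0.0

geoff_pac = GeoffPACSetup()
config = asdict(geoff_pac)  # plain dict, convenient for logging a run
```

Values not quoted in the row (learning rates, network sizes, number of workers) are deliberately left out rather than guessed.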