What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study
Authors: Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, Olivier Bachem
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train over 250 000 agents in five continuous control environments of different complexity and provide insights and practical recommendations for the training of on-policy deep actor-critic RL agents. |
| Researcher Affiliation | Industry | Google Research, Brain Team |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | The implementation is available at https://github.com/google-research/seed_rl. |
| Open Datasets | Yes | As benchmark environments, we consider five widely used continuous control environments from OpenAI Gym [12] of varying complexity: Hopper-v1, Walker2d-v1, HalfCheetah-v1, Ant-v1, and Humanoid-v1. |
| Dataset Splits | No | The paper describes training and evaluation procedures on continuous control environments (e.g., 'train 3 models with independent random seeds' and 'evaluate trained policies every hundred thousand steps by freezing the policy and computing the average undiscounted episode return of 100 episodes'; a minimal sketch of this evaluation loop is included after the table). However, it does not specify traditional train/test/validation dataset splits (e.g., percentages or sample counts of a fixed dataset). |
| Hardware Specification | No | The paper mentions 'on machines with accelerators such as GPUs and TPUs' but does not provide specific hardware details like exact GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | Yes | We used MuJoCo 2.0 in our experiments. |
| Experiment Setup | Yes | All other settings (for choices not in the group) are set to settings of a competitive base configuration (detailed in Appendix C) that is close to the default PPOv2 configuration scaled up to 256 parallel environments. A hedged sketch of such a 256-environment setup is included after the table. |
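The Dataset Splits row quotes the paper's evaluation protocol: freeze the policy and average the undiscounted return over 100 episodes, repeated every hundred thousand environment steps. The snippet below is a minimal, hypothetical sketch of that evaluation loop using the classic Gym API; the `policy` callable and its deterministic evaluation behaviour are assumptions, not the authors' code.

```python
import gym
import numpy as np


def evaluate(policy, env_name="Hopper-v1", episodes=100):
    """Average undiscounted episode return of a frozen policy over `episodes` rollouts."""
    env = gym.make(env_name)
    returns = []
    for _ in range(episodes):
        obs = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = policy(obs)                      # assumed: policy maps obs -> action
            obs, reward, done, _ = env.step(action)   # classic 4-tuple Gym step API
            episode_return += reward                  # undiscounted sum of rewards
        returns.append(episode_return)
    return float(np.mean(returns))
```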
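The Experiment Setup row states that the base configuration is close to the default PPOv2 configuration scaled up to 256 parallel environments. Below is a hedged sketch of what such vectorized data collection could look like with `gym.vector`; the paper's actual implementation is the SEED RL codebase linked above, and every hyperparameter value shown here other than the 256-environment count is an illustrative placeholder rather than the Appendix C setting.

```python
import gym

# The paper scales the default PPOv2 configuration to 256 parallel environments;
# the remaining values are placeholders (see Appendix C of the paper for the real settings).
config = {
    "env_id": "Hopper-v1",   # one of the five benchmark tasks listed above
    "num_envs": 256,
    "discount": 0.99,        # placeholder, not the paper's tuned value
    "gae_lambda": 0.95,      # placeholder, not the paper's tuned value
}

# Vectorized environments (gym >= 0.15 API assumed); SEED RL uses its own actor machinery.
envs = gym.vector.AsyncVectorEnv(
    [lambda: gym.make(config["env_id"]) for _ in range(config["num_envs"])]
)

obs = envs.reset()                        # batched observations, shape (256, obs_dim)
actions = envs.action_space.sample()      # stand-in for the agent's batched actions
obs, rewards, dones, infos = envs.step(actions)
```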