What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study
Authors: Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, Olivier Bachem
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train over 250 000 agents in five continuous control environments of different complexity and provide insights and practical recommendations for the training of on-policy deep actor-critic RL agents. |
| Researcher Affiliation | Industry | Google Research, Brain Team |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | The implementation is available at https://github.com/google-research/seed_rl. |
| Open Datasets | Yes | As benchmark environments, we consider five widely used continuous control environments from OpenAI Gym [12] of varying complexity: Hopper-v1, Walker2d-v1, HalfCheetah-v1, Ant-v1, and Humanoid-v1. |
| Dataset Splits | No | The paper describes training and evaluation procedures on continuous control environments (e.g., 'train 3 models with independent random seeds' and 'evaluate trained policies every hundred thousand steps by freezing the policy and computing the average undiscounted episode return of 100 episodes'; a minimal sketch of this evaluation loop is included after the table). However, it does not specify traditional train/test/validation dataset splits (e.g., percentages or sample counts of a fixed dataset). |
| Hardware Specification | No | The paper mentions 'on machines with accelerators such as GPUs and TPUs' but does not provide specific hardware details like exact GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | Yes | We used MuJoCo 2.0 in our experiments. |
| Experiment Setup | Yes | All other settings (for choices not in the group) are set to settings of a competitive base configuration (detailed in Appendix C) that is close to the default PPOv2 configuration scaled up to 256 parallel environments. A hedged sketch of such a 256-environment setup is included after the table. |
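The Dataset Splits row quotes the paper's evaluation protocol: freeze the policy and average the undiscounted return over 100 episodes, repeated every hundred thousand environment steps. The snippet below is a minimal, hypothetical sketch of that evaluation loop using the classic Gym API; the `policy` callable and its deterministic evaluation behaviour are assumptions, not the authors' code.

```python
import gym
import numpy as np


def evaluate(policy, env_name="Hopper-v1", episodes=100):
    """Average undiscounted episode return of a frozen policy over `episodes` rollouts."""
    env = gym.make(env_name)
    returns = []
    for _ in range(episodes):
        obs = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = policy(obs)                      # assumed: policy maps obs -> action
            obs, reward, done, _ = env.step(action)   # classic 4-tuple Gym step API
            episode_return += reward                  # undiscounted sum of rewards
        returns.append(episode_return)
    return float(np.mean(returns))
```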
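The Experiment Setup row states that the base configuration is close to the default PPOv2 configuration scaled up to 256 parallel environments. Below is a hedged sketch of what such vectorized data collection could look like with `gym.vector`; the paper's actual implementation is the SEED RL codebase linked above, and every hyperparameter value shown here other than the 256-environment count is an illustrative placeholder rather than the Appendix C setting.

```python
import gym

# The paper scales the default PPOv2 configuration to 256 parallel environments;
# the remaining values are placeholders (see Appendix C of the paper for the real settings).
config = {
    "env_id": "Hopper-v1",   # one of the five benchmark tasks listed above
    "num_envs": 256,
    "discount": 0.99,        # placeholder, not the paper's tuned value
    "gae_lambda": 0.95,      # placeholder, not the paper's tuned value
}

# Vectorized environments (gym >= 0.15 API assumed); SEED RL uses its own actor machinery.
envs = gym.vector.AsyncVectorEnv(
    [lambda: gym.make(config["env_id"]) for _ in range(config["num_envs"])]
)

obs = envs.reset()                        # batched observations, shape (256, obs_dim)
actions = envs.action_space.sample()      # stand-in for the agent's batched actions
obs, rewards, dones, infos = envs.step(actions)
```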