Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning
Authors: Denis Yarats, Rob Fergus, Alessandro Lazaric, Lerrel Pinto
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present DrQ-v2, a model-free reinforcement learning (RL) algorithm for visual continuous control. DrQ-v2 builds on DrQ, an off-policy actor-critic approach that uses data augmentation to learn directly from pixels. We introduce several improvements that yield state-of-the-art results on the DeepMind Control Suite. Notably, DrQ-v2 is able to solve complex humanoid locomotion tasks directly from pixel observations, previously unattained by model-free RL. |
| Researcher Affiliation | Collaboration | Anonymous authors Paper under double-blind review |
| Pseudocode | No | The paper describes the algorithm and equations but does not provide structured pseudocode or an explicit algorithm block labeled as such. |
| Open Source Code | Yes | DrQ-v2's implementation is available at https://anonymous.4open.science/r/drqv2. |
| Open Datasets | Yes | We consider a set of MuJoCo tasks (Todorov et al., 2012) provided by DMC (Tassa et al., 2018), a widely used benchmark for continuous control. |
| Dataset Splits | No | The paper describes evaluation periodicity and averaging episode returns, but it does not specify explicit train/validation/test dataset splits in the traditional sense for a static dataset. For RL environments, validation often refers to periodic performance checks on the environment, not a separate dataset split. |
| Hardware Specification | Yes | To facilitate fair wall-clock time comparison all algorithms are trained on the same hardware (i.e., a single NVIDIA V100 GPU machine) and evaluated with the same periodicity of 20000 environment steps. |
| Software Dependencies | No | The paper mentions PyTorch for image sampling ("PyTorch (i.e., grid_sample)") but does not provide specific version numbers for PyTorch or other key software dependencies required for replication. |
| Experiment Setup | Yes | Key Hyper-Parameters: We conduct an extensive hyper-parameter search and identify several hyper-parameter changes compared to DrQ. The three most important hyper-parameters are: (i) the size of the replay buffer, (ii) mini-batch size, and (iii) learning rate. Specifically, we use a 10 times larger replay buffer than DrQ. We also use a smaller mini-batch size of 256 without any noticeable performance degradation... Finally, we find that using smaller learning rate of 1 × 10⁻⁴... A full list of hyper-parameters can be found in Appendix E. |
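The Experiment Setup and Hardware Specification rows quote a few concrete settings: a replay buffer 10 times larger than DrQ's, a mini-batch size of 256, a learning rate of 1 × 10⁻⁴, and evaluation every 20000 environment steps. The snippet below is a minimal sketch that records only those quoted values. The class and function names are hypothetical, and so is the DrQ baseline buffer size. That size is taken as an argument because the table does not state it. Appendix E of the paper remains the authoritative source for the full configuration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DrQv2KeyHyperParams:
    """Hypothetical container for the settings quoted in the table.

    Only values stated in the table are filled in; everything else in the
    paper's Appendix E is deliberately left out.
    """

    replay_buffer_size: int
    batch_size: int = 256                # "smaller mini-batch size of 256"
    learning_rate: float = 1e-4          # "smaller learning rate of 1 x 10^-4"
    eval_every_env_steps: int = 20_000   # evaluation periodicity (hardware row)


def drqv2_hyperparams(drq_replay_buffer_size: int) -> DrQv2KeyHyperParams:
    """Scale an assumed DrQ buffer size by 10, as the paper describes."""
    if drq_replay_buffer_size <= 0:
        raise ValueError("replay buffer size must be positive")
    return DrQv2KeyHyperParams(replay_buffer_size=10 * drq_replay_buffer_size)


def evaluation_steps(total_env_steps: int, every: int) -> list[int]:
    """Environment steps at which a periodic evaluation would be triggered."""
    return list(range(every, total_env_steps + 1, every))
```

For example, `drqv2_hyperparams(100_000)` yields a buffer of 1,000,000 transitions, with 100,000 as an illustrative input only. `evaluation_steps(100_000, 20_000)` lists the five evaluation points `[20000, 40000, 60000, 80000, 100000]`.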
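The table notes that DrQ-v2 learns from pixels with data augmentation and uses PyTorch's `grid_sample` for image sampling. The snippet below is a minimal sketch of one such augmentation, assuming a random-shift scheme. The image is padded by replicating its border pixels and then cropped back to its original size at a random offset. The function names and the default pad of 4 pixels are hypothetical. The sketch uses integer indexing on plain Python lists rather than `grid_sample`, so it omits any interpolation the original implementation may apply.

```python
import random
from typing import List, Optional

Image = List[List[float]]  # single-channel image stored as rows of pixel values


def shift_image(image: Image, dy: int, dx: int) -> Image:
    """Shift by (dy, dx) pixels, filling exposed borders by edge replication.

    Equivalent to replicate-padding the image and cropping a same-sized
    window whose top-left corner is offset by (dy, dx).
    """
    h, w = len(image), len(image[0])

    def pixel(r: int, c: int) -> float:
        # Clamp coordinates so out-of-range reads return the nearest edge pixel.
        return image[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

    return [[pixel(r + dy, c + dx) for c in range(w)] for r in range(h)]


def random_shift(image: Image, pad: int = 4,
                 rng: Optional[random.Random] = None) -> Image:
    """Apply a shift drawn uniformly from [-pad, pad] on each axis (pad is assumed)."""
    rng = rng or random.Random()
    dy = rng.randint(-pad, pad)
    dx = rng.randint(-pad, pad)
    return shift_image(image, dy, dx)
```

Replicating edge pixels rather than zero-filling keeps the shifted frame free of artificial black borders. The small shift range changes the agent's view only slightly, so the image content stays recognizable.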