Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic

Authors: Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Sergey Levine

ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that conservative Q-Prop provides substantial gains in sample efficiency over trust region policy optimization (TRPO) with generalized advantage estimation (GAE), and improves stability over deep deterministic policy gradient (DDPG), the state-of-the-art on-policy and off-policy methods, on OpenAI Gym's MuJoCo continuous control environments.
Researcher Affiliation | Collaboration | 1. University of Cambridge, UK; 2. Max Planck Institute for Intelligent Systems, Tübingen, Germany; 3. Google Brain, USA; 4. DeepMind, UK; 5. UC Berkeley, USA; 6. Uber AI Labs, USA
Pseudocode | Yes | Algorithm 1 Adaptive Q-Prop (an illustrative sketch of the adaptive estimator appears after this table).
Open Source Code | Yes | Our algorithm implementations are built on top of the rllab TRPO and DDPG codes from Duan et al. (2016) and available at https://github.com/shaneshixiang/rllabplusplus.
Open Datasets | Yes | We evaluated Q-Prop and its variants on continuous control environments from the OpenAI Gym benchmark (Brockman et al., 2016) using the MuJoCo physics simulator (Todorov et al., 2012), as shown in Figure 1.
Dataset Splits | No | The paper reports training on OpenAI Gym's MuJoCo environments and discusses batch sizes, but it does not specify explicit training, validation, or test splits (as percentages or counts) and does not cite predefined splits for these environments. Although hyperparameters were tuned, the data-splitting methodology is not described.
Hardware Specification | No | The paper does not provide hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for its experiments.
Software Dependencies | No | The paper mentions using Adam (Kingma & Ba, 2014) but does not give version numbers for its software dependencies (e.g., the Python version, deep learning framework versions, or other library versions).
Experiment Setup | Yes | Training details. This section describes parameters of the training algorithms and their hyperparameter search values in {}; results for the best-performing hyperparameters are reported. Policy gradient methods (VPG, TRPO, Q-Prop) used batch sizes of {1000, 5000, 25000} time steps, step sizes of {0.1, 0.01, 0.001} for the trust-region method, and base learning rates of {0.001, 0.0001} with Adam (Kingma & Ba, 2014). For Q-Prop and DDPG, Q_w is learned with the same technique as in DDPG (Lillicrap et al., 2016), using soft target networks with τ = 0.999, a replay buffer of size 10^6 steps, a mini-batch size of 64, and a base learning rate of {0.001, 0.0001} with Adam (Kingma & Ba, 2014). For Q-Prop we also tuned the relative ratio of gradient steps on the critic Q_w against the number of steps on the policy, in the range {0.1, 0.5, 1.0}, where 0.1 corresponds to 100 critic updates for every policy update if the batch size is 1000. For DDPG, we swept the reward scaling over {0.01, 0.1, 1.0}, as it is sensitive to this parameter. (A configuration sketch of this search space follows the table.)
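
The Pseudocode row refers to Algorithm 1, Adaptive Q-Prop, which combines the on-policy Monte Carlo policy gradient with a control variate built from a first-order Taylor expansion of the off-policy critic Q_w. The snippet below is a minimal sketch of how the adaptive weighting might be computed, assuming a single sampled action per state; the names compute_qprop_eta, adv_mc, and adv_bar are illustrative and are not taken from the paper or the rllabplusplus code.

```python
import numpy as np

def compute_qprop_eta(adv_mc, adv_bar, conservative=True):
    """Illustrative sketch (not the authors' code) of the adaptive Q-Prop
    weighting, assuming one sampled action per state.

    adv_mc  : Monte Carlo / GAE advantage estimates A_hat, shape (N,)
    adv_bar : control variate A_bar(s, a) = grad_a Q_w(s, a)|_{a=mu(s)} . (a - mu(s)),
              i.e. the first-order Taylor expansion of the off-policy critic
    Returns per-sample eta weights and the residual advantages used in the
    likelihood-ratio (REINFORCE) term of the Q-Prop gradient.
    """
    adv_mc = np.asarray(adv_mc, dtype=np.float64)
    adv_bar = np.asarray(adv_bar, dtype=np.float64)
    # With a single action sample per state, the product A_hat * A_bar serves
    # as a one-sample estimate of their covariance (A_bar has zero mean under
    # the Gaussian policy).
    cov_est = adv_mc * adv_bar
    if conservative:
        # Conservative Q-Prop: enable the control variate (eta = 1) only
        # where it is estimated to reduce variance; otherwise disable it.
        eta = (cov_est > 0).astype(np.float64)
    else:
        # Aggressive Q-Prop: additionally flip the control variate's sign.
        eta = np.sign(cov_est)
    residual_adv = adv_mc - eta * adv_bar
    # The full Q-Prop estimator adds an analytic term of the form
    # E[eta * grad_a Q_w(s, a)|_{a=mu(s)} * grad_theta mu(s)], computed from
    # the critic and policy networks; it is omitted in this sketch.
    return eta, residual_adv
```

Conservative Q-Prop, the variant highlighted in the abstract, corresponds to conservative=True here.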
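
The training details in the Experiment Setup row amount to a small hyperparameter grid. The dictionary below is a hedged summary of that search space; the key names (e.g., QPROP_SEARCH_SPACE, trust_region_step_size) are hypothetical and not taken from the released rllabplusplus code, while the values are those reported above.

```python
# Hypothetical summary of the hyperparameter grid reported in the paper;
# key names are illustrative, values are as stated in the training details.
QPROP_SEARCH_SPACE = {
    "policy_gradient": {                        # VPG, TRPO, Q-Prop
        "batch_size_steps": [1000, 5000, 25000],
        "trust_region_step_size": [0.1, 0.01, 0.001],
        "adam_learning_rate": [0.001, 0.0001],
    },
    "critic": {                                 # Q_w, trained as in DDPG
        "soft_target_tau": 0.999,
        "replay_buffer_size": 10**6,            # steps
        "minibatch_size": 64,
        "adam_learning_rate": [0.001, 0.0001],
    },
    # Ratio of critic gradient steps to policy batch size: 0.1 with a batch
    # size of 1000 means 100 critic updates per policy update.
    "critic_update_ratio": [0.1, 0.5, 1.0],
    # Swept for the DDPG baseline only (DDPG is sensitive to reward scale).
    "ddpg_reward_scaling": [0.01, 0.1, 1.0],
}
```

The paper reports results for the best-performing configuration from this search.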