Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic

Authors: Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Sergey Levine

ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that conservative Q-Prop provides substantial gains in sample efficiency over trust region policy optimization (TRPO) with generalized advantage estimation (GAE), and improves stability over deep deterministic policy gradient (DDPG), the state-of-the-art on-policy and off-policy methods, on OpenAI Gym's MuJoCo continuous control environments.
Researcher Affiliation | Collaboration | 1. University of Cambridge, UK; 2. Max Planck Institute for Intelligent Systems, Tübingen, Germany; 3. Google Brain, USA; 4. DeepMind, UK; 5. UC Berkeley, USA; 6. Uber AI Labs, USA
Pseudocode | Yes | Algorithm 1 Adaptive Q-Prop (an illustrative sketch of the adaptive estimator appears after this table).
Open Source Code | Yes | Our algorithm implementations are built on top of the rllab TRPO and DDPG codes from Duan et al. (2016) and available at https://github.com/shaneshixiang/rllabplusplus.
Open Datasets | Yes | We evaluated Q-Prop and its variants on continuous control environments from the OpenAI Gym benchmark (Brockman et al., 2016) using the MuJoCo physics simulator (Todorov et al., 2012), as shown in Figure 1.
Dataset Splits | No | The paper reports training on OpenAI Gym's MuJoCo environments and discusses batch sizes, but it does not specify explicit training, validation, or test splits (as percentages or counts) and does not cite predefined splits for these environments. Although hyperparameters were tuned, the data-splitting methodology is not described.
Hardware Specification | No | The paper does not provide hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for its experiments.
Software Dependencies | No | The paper mentions using Adam (Kingma & Ba, 2014) but does not give version numbers for its software dependencies (e.g., the Python version, deep learning framework versions, or other library versions).
Experiment Setup | Yes | Training details. This section describes parameters of the training algorithms and their hyperparameter search values in {}; results for the best-performing hyperparameters are reported. Policy gradient methods (VPG, TRPO, Q-Prop) used batch sizes of {1000, 5000, 25000} time steps, step sizes of {0.1, 0.01, 0.001} for the trust-region method, and base learning rates of {0.001, 0.0001} with Adam (Kingma & Ba, 2014). For Q-Prop and DDPG, Q_w is learned with the same technique as in DDPG (Lillicrap et al., 2016), using soft target networks with τ = 0.999, a replay buffer of size 10^6 steps, a mini-batch size of 64, and a base learning rate of {0.001, 0.0001} with Adam (Kingma & Ba, 2014). For Q-Prop we also tuned the relative ratio of gradient steps on the critic Q_w against the number of steps on the policy, in the range {0.1, 0.5, 1.0}, where 0.1 corresponds to 100 critic updates for every policy update if the batch size is 1000. For DDPG, we swept the reward scaling over {0.01, 0.1, 1.0}, as it is sensitive to this parameter. (A configuration sketch of this search space follows the table.)
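
The Pseudocode row refers to Algorithm 1, Adaptive Q-Prop, which combines the on-policy Monte Carlo policy gradient with a control variate built from a first-order Taylor expansion of the off-policy critic Q_w. The snippet below is a minimal sketch of how the adaptive weighting might be computed, assuming a single sampled action per state; the names compute_qprop_eta, adv_mc, and adv_bar are illustrative and are not taken from the paper or the rllabplusplus code.

```python
import numpy as np

def compute_qprop_eta(adv_mc, adv_bar, conservative=True):
    """Illustrative sketch (not the authors' code) of the adaptive Q-Prop
    weighting, assuming one sampled action per state.

    adv_mc  : Monte Carlo / GAE advantage estimates A_hat, shape (N,)
    adv_bar : control variate A_bar(s, a) = grad_a Q_w(s, a)|_{a=mu(s)} . (a - mu(s)),
              i.e. the first-order Taylor expansion of the off-policy critic
    Returns per-sample eta weights and the residual advantages used in the
    likelihood-ratio (REINFORCE) term of the Q-Prop gradient.
    """
    adv_mc = np.asarray(adv_mc, dtype=np.float64)
    adv_bar = np.asarray(adv_bar, dtype=np.float64)
    # With a single action sample per state, the product A_hat * A_bar serves
    # as a one-sample estimate of their covariance (A_bar has zero mean under
    # the Gaussian policy).
    cov_est = adv_mc * adv_bar
    if conservative:
        # Conservative Q-Prop: enable the control variate (eta = 1) only
        # where it is estimated to reduce variance; otherwise disable it.
        eta = (cov_est > 0).astype(np.float64)
    else:
        # Aggressive Q-Prop: additionally flip the control variate's sign.
        eta = np.sign(cov_est)
    residual_adv = adv_mc - eta * adv_bar
    # The full Q-Prop estimator adds an analytic term of the form
    # E[eta * grad_a Q_w(s, a)|_{a=mu(s)} * grad_theta mu(s)], computed from
    # the critic and policy networks; it is omitted in this sketch.
    return eta, residual_adv
```

Conservative Q-Prop, the variant highlighted in the abstract, corresponds to conservative=True here.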
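
The training details in the Experiment Setup row amount to a small hyperparameter grid. The dictionary below is a hedged summary of that search space; the key names (e.g., QPROP_SEARCH_SPACE, trust_region_step_size) are hypothetical and not taken from the released rllabplusplus code, while the values are those reported above.

```python
# Hypothetical summary of the hyperparameter grid reported in the paper;
# key names are illustrative, values are as stated in the training details.
QPROP_SEARCH_SPACE = {
    "policy_gradient": {                        # VPG, TRPO, Q-Prop
        "batch_size_steps": [1000, 5000, 25000],
        "trust_region_step_size": [0.1, 0.01, 0.001],
        "adam_learning_rate": [0.001, 0.0001],
    },
    "critic": {                                 # Q_w, trained as in DDPG
        "soft_target_tau": 0.999,
        "replay_buffer_size": 10**6,            # steps
        "minibatch_size": 64,
        "adam_learning_rate": [0.001, 0.0001],
    },
    # Ratio of critic gradient steps to policy batch size: 0.1 with a batch
    # size of 1000 means 100 critic updates per policy update.
    "critic_update_ratio": [0.1, 0.5, 1.0],
    # Swept for the DDPG baseline only (DDPG is sensitive to reward scale).
    "ddpg_reward_scaling": [0.01, 0.1, 1.0],
}
```

The paper reports results for the best-performing configuration from this search.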