Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift

Authors: Carles Gelada, Marc G. Bellemare (pp. 3647-3655)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our analysis with an empirical evaluation of the two techniques in an off-policy setting on the game Pong from the Atari domain where we find discounted COP-TD to be better behaved in practice than the soft normalization penalty. Finally, we perform a more extensive evaluation of discounted COP-TD in 5 games of the Atari domain, where we find performance gains for our approach.
Researcher Affiliation | Industry | Carles Gelada, Marc G. Bellemare, Google Brain, cgel@google.com, bellemare@google.com
Pseudocode | No | The paper describes update rules and operators mathematically (e.g., equations (6) and (9)) but does not include any explicit pseudocode blocks or algorithm listings.
Open Source Code | No | The paper does not provide any specific links to source code repositories, nor does it state that the code for the described methodology is publicly available or included in supplementary materials.
Open Datasets | Yes | In our experiments we use the Arcade Learning Environment (ALE) (Bellemare et al. 2013), an RL interface to Atari 2600 games. (See the ALE usage sketch after the table.)
Dataset Splits | No | The paper describes a continuous reinforcement learning setup with replay memory and online training, rather than explicit train/validation/test dataset splits with specified percentages or counts for a fixed dataset.
Hardware Specification | No | The paper mentions a 'single-GPU agent setup' but does not provide specific hardware details such as GPU or CPU models, memory specifications, or cloud computing instance types used for experiments.
Software Dependencies | No | The paper mentions using the C51 distributional reinforcement learning agent but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Our baseline is the C51 distributional reinforcement learning agent (Bellemare, Dabney, and Munos 2017), and we use published hyperparameters unless otherwise noted. We augment the C51 network by adding an extra head, the ratio model c(s), to the final convolutional layer, whose role is to predict the ratio d_π/d_µ. The ratio model consists of a two-layer fully-connected network with a ReLU hidden layer of size 512. ... Identical to C51, the target policy π_θ is the ϵ-greedy policy with respect to the expected value of the distribution output of the target network. We set ϵ = 0.1. The ratio model is trained by adding the squared loss η(γ̂ c_θ(s) π_θ(a|s)/µ(a|s) + (1 − γ̂) − c_θ(s'))² (9) to the usual distributional loss of the agent, where η > 0 is a hyperparameter trading off the two losses. ... We run C51 with discounted corrections with γ̂ = 0.99 and η = 0.02. (A code sketch of the ratio head and loss (9) follows the table.)
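The Open Datasets row names the Arcade Learning Environment (ALE) as the experimental platform, but the paper specifies no particular software package or ROM source. The sketch below is a minimal, assumed example of stepping a Pong episode through the `ale-py` Python bindings; the ROM path, seed, and placeholder policy are hypothetical.

```python
# Minimal ALE interaction sketch, assuming the `ale-py` bindings; the ROM
# path, seed, and placeholder policy are hypothetical, not from the paper.
from ale_py import ALEInterface

ale = ALEInterface()
ale.setInt("random_seed", 123)         # fix the emulator seed
ale.loadROM("roms/pong.bin")           # hypothetical path to a Pong ROM

actions = ale.getMinimalActionSet()    # legal actions for this game
ale.reset_game()

episode_return = 0.0
while not ale.game_over():
    frame = ale.getScreenRGB()         # (210, 160, 3) uint8 screen observation
    action = actions[0]                # placeholder for the agent's policy
    episode_return += ale.act(action)  # act() returns the per-step reward
print("episode return:", episode_return)
```

In practice the raw frames would pass through the standard Atari preprocessing used by DQN-style agents (frame skipping, resizing, frame stacking) before reaching the C51 network.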
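The Experiment Setup row specifies the ratio head (two fully connected layers with a 512-unit ReLU hidden layer on top of the final convolutional layer) and the discounted COP-TD loss (9) with γ̂ = 0.99 and η = 0.02. Since no code is released, the PyTorch sketch below is only an illustrative reading of that description: the `RatioHead` and `discounted_cop_td_loss` names, the Softplus output, and the stop-gradient on the bootstrap target are assumptions.

```python
# Illustrative sketch (not the authors' code) of the ratio head and the
# discounted COP-TD loss (9), using PyTorch.
import torch
import torch.nn as nn


class RatioHead(nn.Module):
    """Ratio model c(s): two fully connected layers with a 512-unit ReLU hidden layer."""

    def __init__(self, conv_features: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(conv_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Softplus(),  # assumption: keeps the predicted ratio d_pi/d_mu non-negative
        )

    def forward(self, conv_out: torch.Tensor) -> torch.Tensor:
        # conv_out: flattened output of the final convolutional layer, shape [batch, conv_features]
        return self.net(conv_out).squeeze(-1)  # c(s), shape [batch]


def discounted_cop_td_loss(c_s, c_next, pi_a, mu_a, gamma_hat=0.99, eta=0.02):
    """Squared loss (9): eta * (gamma_hat * c(s) * pi(a|s)/mu(a|s) + (1 - gamma_hat) - c(s'))^2.

    c_s is treated as a fixed bootstrap target (stop-gradient) and c_next as the
    online prediction c_theta(s'); this gradient choice is an assumption.
    """
    target = gamma_hat * c_s.detach() * (pi_a / mu_a) + (1.0 - gamma_hat)
    return eta * (target - c_next).pow(2).mean()
```

This term would be added to the usual C51 distributional loss, with the ϵ-greedy target policy (ϵ = 0.1) supplying π_θ(a|s) and the logged behaviour policy from the replay memory supplying µ(a|s).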