Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift

Authors: Carles Gelada, Marc G. Bellemare (pp. 3647-3655)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our analysis with an empirical evaluation of the two techniques in an off-policy setting on the game Pong from the Atari domain where we find discounted COP-TD to be better behaved in practice than the soft normalization penalty. Finally, we perform a more extensive evaluation of discounted COP-TD in 5 games of the Atari domain, where we find performance gains for our approach.
Researcher Affiliation | Industry | Carles Gelada, Marc G. Bellemare, Google Brain, cgel@google.com, bellemare@google.com
Pseudocode | No | The paper describes update rules and operators mathematically (e.g., equations (6) and (9)) but does not include any explicit pseudocode blocks or algorithm listings.
Open Source Code | No | The paper does not provide any specific links to source code repositories, nor does it state that the code for the described methodology is publicly available or included in supplementary materials.
Open Datasets | Yes | In our experiments we use the Arcade Learning Environment (ALE) (Bellemare et al. 2013), an RL interface to Atari 2600 games. (See the ALE usage sketch after the table.)
Dataset Splits | No | The paper describes a continuous reinforcement learning setup with replay memory and online training, rather than explicit train/validation/test dataset splits with specified percentages or counts for a fixed dataset.
Hardware Specification | No | The paper mentions a 'single-GPU agent setup' but does not provide specific hardware details such as GPU or CPU models, memory specifications, or cloud computing instance types used for experiments.
Software Dependencies | No | The paper mentions using the C51 distributional reinforcement learning agent but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Our baseline is the C51 distributional reinforcement learning agent (Bellemare, Dabney, and Munos 2017), and we use published hyperparameters unless otherwise noted. We augment the C51 network by adding an extra head, the ratio model c(s), to the final convolutional layer, whose role is to predict the ratio d_π/d_µ. The ratio model consists of a two-layer fully-connected network with a ReLU hidden layer of size 512. ... Identical to C51, the target policy π_θ is the ϵ-greedy policy with respect to the expected value of the distribution output of the target network. We set ϵ = 0.1. The ratio model is trained by adding the squared loss η(γ̂ c_θ(s) π_θ(a|s)/µ(a|s) + (1 − γ̂) − c_θ(s'))² (9) to the usual distributional loss of the agent, where η > 0 is a hyperparameter trading off the two losses. ... We run C51 with discounted corrections with γ̂ = 0.99 and η = 0.02. (A code sketch of the ratio head and loss (9) follows the table.)
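The Open Datasets row names the Arcade Learning Environment (ALE) as the experimental platform, but the paper specifies no particular software package or ROM source. The sketch below is a minimal, assumed example of stepping a Pong episode through the `ale-py` Python bindings; the ROM path, seed, and placeholder policy are hypothetical.

```python
# Minimal ALE interaction sketch, assuming the `ale-py` bindings; the ROM
# path, seed, and placeholder policy are hypothetical, not from the paper.
from ale_py import ALEInterface

ale = ALEInterface()
ale.setInt("random_seed", 123)         # fix the emulator seed
ale.loadROM("roms/pong.bin")           # hypothetical path to a Pong ROM

actions = ale.getMinimalActionSet()    # legal actions for this game
ale.reset_game()

episode_return = 0.0
while not ale.game_over():
    frame = ale.getScreenRGB()         # (210, 160, 3) uint8 screen observation
    action = actions[0]                # placeholder for the agent's policy
    episode_return += ale.act(action)  # act() returns the per-step reward
print("episode return:", episode_return)
```

In practice the raw frames would pass through the standard Atari preprocessing used by DQN-style agents (frame skipping, resizing, frame stacking) before reaching the C51 network.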
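The Experiment Setup row specifies the ratio head (two fully connected layers with a 512-unit ReLU hidden layer on top of the final convolutional layer) and the discounted COP-TD loss (9) with γ̂ = 0.99 and η = 0.02. Since no code is released, the PyTorch sketch below is only an illustrative reading of that description: the `RatioHead` and `discounted_cop_td_loss` names, the Softplus output, and the stop-gradient on the bootstrap target are assumptions.

```python
# Illustrative sketch (not the authors' code) of the ratio head and the
# discounted COP-TD loss (9), using PyTorch.
import torch
import torch.nn as nn


class RatioHead(nn.Module):
    """Ratio model c(s): two fully connected layers with a 512-unit ReLU hidden layer."""

    def __init__(self, conv_features: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(conv_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Softplus(),  # assumption: keeps the predicted ratio d_pi/d_mu non-negative
        )

    def forward(self, conv_out: torch.Tensor) -> torch.Tensor:
        # conv_out: flattened output of the final convolutional layer, shape [batch, conv_features]
        return self.net(conv_out).squeeze(-1)  # c(s), shape [batch]


def discounted_cop_td_loss(c_s, c_next, pi_a, mu_a, gamma_hat=0.99, eta=0.02):
    """Squared loss (9): eta * (gamma_hat * c(s) * pi(a|s)/mu(a|s) + (1 - gamma_hat) - c(s'))^2.

    c_s is treated as a fixed bootstrap target (stop-gradient) and c_next as the
    online prediction c_theta(s'); this gradient choice is an assumption.
    """
    target = gamma_hat * c_s.detach() * (pi_a / mu_a) + (1.0 - gamma_hat)
    return eta * (target - c_next).pow(2).mean()
```

This term would be added to the usual C51 distributional loss, with the ϵ-greedy target policy (ϵ = 0.1) supplying π_θ(a|s) and the logged behaviour policy from the replay memory supplying µ(a|s).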