Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift
Authors: Carles Gelada, Marc G. Bellemare (pp. 3647-3655)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We complement our analysis with an empirical evaluation of the two techniques in an off-policy setting on the game Pong from the Atari domain where we find discounted COP-TD to be better behaved in practice than the soft normalization penalty. Finally, we perform a more extensive evaluation of discounted COP-TD in 5 games of the Atari domain, where we find performance gains for our approach. |
| Researcher Affiliation | Industry | Carles Gelada, Marc G. Bellemare Google Brain cgel@google.com, bellemare@google.com |
| Pseudocode | No | The paper describes update rules and operators mathematically (e.g., equation (6), (9)) but does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | No | The paper does not provide any specific links to source code repositories, nor does it state that the code for the described methodology is publicly available or included in supplementary materials. |
| Open Datasets | Yes | In our experiments we use the Arcade Learning Environment (ALE) (Bellemare et al. 2013), an RL interface to Atari 2600 games. |
| Dataset Splits | No | The paper describes a continuous reinforcement learning setup with replay memory and online training, rather than explicit train/validation/test dataset splits with specified percentages or counts for a fixed dataset. |
| Hardware Specification | No | The paper mentions a 'single-GPU agent setup' but does not provide specific hardware details such as GPU or CPU models, memory specifications, or cloud computing instance types used for experiments. |
| Software Dependencies | No | The paper mentions using the C51 distributional reinforcement learning agent but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Our baseline is the C51 distributional reinforcement learning agent (Bellemare, Dabney, and Munos 2017), and we use published hyperparameters unless otherwise noted. We augment the C51 network by adding an extra head, the ratio model c(s), to the final convolutional layer, whose role is to predict the ratio $d_\pi / d_\mu$. The ratio model consists of a two-layer fully-connected network with a ReLU hidden layer of size 512. ... Identical to C51, the target policy $\pi_\theta$ is the ϵ-greedy policy with respect to the expected value of the distribution output of the target network. We set ϵ = 0.1. The ratio model is trained by adding the squared loss $\eta\left(\hat{\gamma}\, c_\theta(s)\,\frac{\pi_\theta(a\mid s)}{\mu(a\mid s)} + (1-\hat{\gamma}) - c_\theta(s')\right)^2$ (9) to the usual distributional loss of the agent, where η > 0 is a hyperparameter trading off the two losses. ... We run C51 with discounted corrections with $\hat{\gamma} = 0.99$ and η = 0.02. |
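
The Experiment Setup row quotes enough detail (ratio head on the final convolutional layer, 512-unit ReLU hidden layer, the equation (9) loss, $\hat{\gamma} = 0.99$, η = 0.02) to sketch the ratio-model training objective. Below is a minimal sketch, assuming PyTorch; the names `RatioHead` and `discounted_cop_td_loss`, the non-negativity clamp, the detach-based bootstrap target, and the feature dimension are assumptions of this sketch, not the authors' implementation (no public code is provided with the paper).

```python
import torch
import torch.nn as nn


class RatioHead(nn.Module):
    """Two-layer fully-connected head that predicts c(s) ~ d_pi(s) / d_mu(s)."""

    def __init__(self, conv_features: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(conv_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, conv_out: torch.Tensor) -> torch.Tensor:
        # Keep the predicted ratio non-negative (a choice of this sketch,
        # not specified in the quoted setup).
        return torch.relu(self.net(conv_out)).squeeze(-1)


def discounted_cop_td_loss(c_s, c_next, pi_prob, mu_prob,
                           gamma_hat: float = 0.99, eta: float = 0.02):
    """Squared loss from equation (9), averaged over the batch:

        eta * (gamma_hat * c(s) * pi(a|s)/mu(a|s) + (1 - gamma_hat) - c(s'))^2

    Treating the bootstrapped term as a fixed target (detach) mirrors a
    TD-style update; the paper's exact target-network handling for c is
    not reproduced here.
    """
    target = gamma_hat * c_s * (pi_prob / mu_prob) + (1.0 - gamma_hat)
    return eta * (target.detach() - c_next).pow(2).mean()


# Illustrative usage: the ratio loss is added to the usual distributional loss.
if __name__ == "__main__":
    batch, conv_features = 32, 3136  # 3136 = 64 * 7 * 7 for the standard Atari conv stack (assumed)
    head = RatioHead(conv_features)
    feats_s = torch.randn(batch, conv_features)   # conv features of s
    feats_s2 = torch.randn(batch, conv_features)  # conv features of s'
    pi_prob = torch.full((batch,), 0.9)           # pi_theta(a|s), placeholder values
    mu_prob = torch.full((batch,), 0.25)          # mu(a|s), placeholder values
    ratio_loss = discounted_cop_td_loss(head(feats_s), head(feats_s2), pi_prob, mu_prob)
    c51_loss = torch.tensor(0.0)                  # stand-in for the C51 distributional loss
    total_loss = c51_loss + ratio_loss
    print(float(total_loss))
```

As in the quoted setup, the ratio head shares the convolutional trunk with the C51 value head, and η trades off the ratio loss against the distributional loss; the detach reflects the interpretation that c(s') is regressed toward a bootstrapped target built from c(s).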