Towards a better understanding of representation dynamics under TD-learning
Authors: Yunhao Tang, Remi Munos
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate this theoretical insight with tabular and deep RL experiments over Atari game suites. |
| Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Yunhao Tang <robintyh@deepmind.com>. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide a statement about releasing its own source code, nor does it provide a link to a code repository for its methodology. |
| Open Datasets | Yes | We use DQN (Mnih et al., 2013) as a baseline, and generate random reward functions R^π_i(x, a) via outputs of randomly initialized networks, following the practice of (Dabney et al., 2021). [...] Our testbed is a subset of 15 Atari games (Bellemare et al., 2013) on which it has been shown that DQN can achieve reasonable performance. (A hedged sketch of this random-reward construction appears below the table.) |
| Dataset Splits | No | The paper mentions 'validation' in the context of value approximation error decay (Figure 1), but it does not specify explicit dataset splits (e.g., percentages or counts) for training, validation, and testing sets in its empirical deep RL experiments. Although hyperparameters were tuned, no train/validation split procedure is described. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models) used for running the experiments. |
| Software Dependencies | Yes | All results are based on solving the exact ODE dynamics, using the Scipy ODE solver (Virtanen et al., 2020). (An illustrative SciPy ODE sketch appears below the table.) |
| Experiment Setup | Yes | We tune the learning rate η ∈ {0.00025, 0.0001, 0.00005} as suggested in (Dabney et al., 2021). The default DQN uses η = 0.00025. We find that at η = 0.0001 the tuned DQN performs the best. For the auxiliary task, we tune the number of random rewards h ∈ {4, 16, 64, 256}. We find that h = 16 performs slightly better than other alternatives. (The tuning grid is sketched below the table.) |
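
The "Open Datasets" row quotes the paper's use of random reward functions produced by randomly initialized networks, following Dabney et al. (2021). The snippet below is a minimal sketch of that idea, not the authors' code: the network width, the observation size, and the state-only (rather than state-action) rewards are assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
h, obs_dim, hidden = 16, 84 * 84, 256  # h = 16 as reported best; obs_dim and hidden are assumed

# Frozen, randomly initialized two-layer network whose h outputs serve as auxiliary rewards.
W1 = rng.normal(scale=1.0 / np.sqrt(obs_dim), size=(obs_dim, hidden))
W2 = rng.normal(scale=1.0 / np.sqrt(hidden), size=(hidden, h))

def random_rewards(obs_batch: np.ndarray) -> np.ndarray:
    """Map flattened observations of shape (B, obs_dim) to (B, h) fixed random rewards."""
    return np.maximum(obs_batch @ W1, 0.0) @ W2
```

Because the weights are never trained, each output dimension defines a fixed pseudo-reward that an auxiliary value head can learn alongside the main DQN objective.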
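The "Software Dependencies" row notes that the paper's analytical results come from solving exact ODE dynamics with SciPy. Below is an illustrative sketch, under assumed shapes and a uniform state weighting, of how an expected linear-TD ODE of the form dθ/dt = Φᵀ(R + γPΦθ − Φθ) can be integrated with `scipy.integrate.solve_ivp`; it is not the authors' setup.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
S, d, gamma = 10, 4, 0.9                 # states, feature dimension, discount (assumed)
P = rng.dirichlet(np.ones(S), size=S)    # random transition matrix under a fixed policy
R = rng.normal(size=S)                   # reward vector
Phi = rng.normal(size=(S, d))            # fixed feature matrix

def td_vector_field(t, theta):
    """Expected linear-TD update with uniform state weighting (a simplification)."""
    v = Phi @ theta
    return Phi.T @ (R + gamma * (P @ v) - v)

sol = solve_ivp(td_vector_field, t_span=(0.0, 50.0), y0=np.zeros(d))
theta_final = sol.y[:, -1]               # parameters at the end of the ODE trajectory
print(theta_final)
```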
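The "Experiment Setup" row describes tuning the learning rate and the number of random rewards. A hedged sketch of that grid follows; whether the two hyperparameters were swept jointly or independently is not stated in the quote, and `run_dqn_with_random_rewards` is a hypothetical entry point.

```python
from itertools import product

learning_rates = [0.00025, 0.0001, 0.00005]  # 0.00025 is the DQN default; 0.0001 reported best
num_random_rewards = [4, 16, 64, 256]        # h = 16 reported slightly better

for eta, h in product(learning_rates, num_random_rewards):
    config = {"learning_rate": eta, "num_random_rewards": h}
    # run_dqn_with_random_rewards(**config)  # hypothetical training entry point
    print(config)
```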