Deep Conservative Policy Iteration
Authors: Nino Vieillard, Olivier Pietquin, Matthieu Geist
AAAI 2020, pp. 6070-6077
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment thoroughly the resulting algorithm on the simple Cartpole problem, and validate the proposed method on a representative subset of Atari games. |
| Researcher Affiliation | Industry | Google Research, Brain Team |
| Pseudocode | Yes | Algorithm 1 DCPI |
| Open Source Code | No | The paper does not provide any concrete access to the source code for the described methodology (no link or explicit statement of release). |
| Open Datasets | Yes | We use the version of Cartpole implemented in OpenAI Gym (Brockman et al. 2016)... We used the DQN implementation from the Dopamine library as our baseline... Atari is a challenging discrete-actions control environment, introduced by Bellemare et al. (2013), consisting of 57 games. |
| Dataset Splits | No | The paper describes training steps and evaluation on the environments (Cartpole, Atari), but it does not describe a validation split of a static dataset, nor how such a split would be generated or used for hyperparameter tuning separately from testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory, cloud instance types). |
| Software Dependencies | No | The paper mentions using the "Dopamine library (Castro et al. 2018)" and "OpenAI Gym (Brockman et al. 2016)" but does not specify version numbers for these or for any other software dependencies. |
| Experiment Setup | Yes | Notably, we used the same network architecture for the q-network and the policy network and two identical Adam optimizers; we compute a gradient step every F = 4 interactions with the environment, and update the target networks every C = 100 interactions. Full parameters are reported in the Appendix. ... we chose β1 = 0.99... β2 = 0.9999. ... After a small hyperparameter search on a few games (Pong, Asterix and Space Invaders), we chose α0 = 1 and the Adamax mixture rate (see Eq. (8)). |
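
The Pseudocode row above only names Algorithm 1 (DCPI) without reproducing it. As background, the sketch below shows the classic conservative policy iteration mixture step that DCPI builds on, written for a tabular discrete-action case. The function name and array layout are ours, and the paper's deep, network-based instantiation of this update differs; this is a minimal illustration, not the authors' algorithm.

```python
# Hedged sketch (not the authors' code): the conservative mixture update that
# CPI-style methods use, for a tabular policy over discrete actions.
import numpy as np

def mixture_update(policy, q_values, alpha):
    """Return (1 - alpha) * policy + alpha * greedy(q_values).

    policy:   array of shape (n_states, n_actions); each row sums to 1.
    q_values: array of shape (n_states, n_actions); current Q estimates.
    alpha:    mixture rate in [0, 1].
    """
    greedy = np.zeros_like(policy)
    greedy[np.arange(policy.shape[0]), q_values.argmax(axis=1)] = 1.0
    return (1.0 - alpha) * policy + alpha * greedy
```

With alpha = 1 the update reduces to a plain greedy improvement step; smaller values of alpha move the policy only partway toward the greedy policy, which is the "conservative" part of the scheme.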
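The Open Datasets row cites the Gym version of Cartpole. The following minimal sketch shows how that environment is typically instantiated and stepped. The environment id (`CartPole-v0`), the random placeholder policy, and the classic 4-tuple `step` signature are assumptions about the paper-era Gym API, not details taken from the paper.

```python
# Hedged sketch: running one episode of Gym's Cartpole with a random policy.
# Uses the classic Gym API (reset() -> obs, step() -> 4-tuple); newer Gym
# releases return (obs, info) from reset() and a 5-tuple from step().
import gym

env = gym.make("CartPole-v0")
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()           # random action as a placeholder policy
    obs, reward, done, info = env.step(action)   # classic 4-tuple step signature
    episode_return += reward
print("episode return:", episode_return)
```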
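The Experiment Setup row quotes the paper's main hyperparameters. For reference, they are collected below into a plain Python dictionary. The key names are ours, the exact role of β1 and β2 is not restated here (the comments only echo the reported values), and the paper's appendix contains the full parameter list.

```python
# Hedged sketch (not from the paper's code): hyperparameters quoted in the
# Experiment Setup row, gathered into a plain dictionary. Key names are ours.
dcpi_reported_hyperparameters = {
    "gradient_step_period_F": 4,    # gradient step every F = 4 environment interactions
    "target_update_period_C": 100,  # target networks updated every C = 100 interactions
    "beta_1": 0.99,                 # reported beta_1
    "beta_2": 0.9999,               # reported beta_2
    "alpha_0": 1.0,                 # reported initial mixture rate alpha_0
}
# The mixture-rate schedule itself ("Adamax mixture rate", Eq. (8)) and the
# remaining parameters are given only in the paper's appendix.
```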