Deep Conservative Policy Iteration

Authors: Nino Vieillard, Olivier Pietquin, Matthieu Geist

AAAI 2020, pp. 6070-6077 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We experiment thoroughly the resulting algorithm on the simple Cartpole problem, and validate the proposed method on a representative subset of Atari games."
Researcher Affiliation | Industry | Google Research, Brain Team
Pseudocode | Yes | Algorithm 1 DCPI (a minimal sketch of its mixture update is given after this table)
Open Source Code | No | The paper provides no concrete access to source code for the described method, such as a link or an explicit release statement.
Open Datasets | Yes | "We use the version of Cartpole implemented in Open AI Gym (Brockman et al. 2016)... We used the DQN implementation from the Dopamine library as our baseline... Atari is a challenging discrete-actions control environment, introduced by Bellemare et al. (2013) consisting of 57 games." (Environment loading is sketched after this table.)
Dataset Splits | No | The paper covers training steps and evaluation on the environments (Cartpole, Atari), but it does not describe a validation split from a static dataset, or how such a split would be generated and used for hyperparameter tuning separately from testing.
Hardware Specification | No | The paper does not specify the hardware used for the experiments (e.g., GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions the Dopamine library (Castro et al. 2018) and OpenAI Gym (Brockman et al. 2016), but gives no version numbers for them or for any other software dependencies.
Experiment Setup | Yes | "Notably, we used the same network architecture for the q-network and the policy network and two identical Adam optimizers; we compute a gradient step every F = 4 interactions with the environment, and update the target networks every C = 100 interactions. Full parameters are reported in the Appendix. ... we chose β1 = 0.99... β2 = 0.9999. ... After a small hyperparameter search on a few games (Pong, Asterix and Space Invaders), we chose α0 = 1 and the Adamax mixture rate (see Eq. (8))." (This schedule is turned into a training-loop skeleton after this table.)
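
Algorithm 1 (DCPI) itself is not reproduced in this report. For orientation, below is a minimal tabular sketch of the conservative mixture step at the core of CPI-style methods; the paper's deep variant fits a policy network to this mixture rather than representing it explicitly, and the array shapes here are illustrative assumptions.

```python
import numpy as np

def conservative_policy_update(pi, q, alpha):
    """One conservative (mixture) policy-improvement step:
    pi_{k+1} = (1 - alpha) * pi_k + alpha * greedy(q_k).

    Tabular sketch only, not the paper's deep implementation.
    pi    : (n_states, n_actions) stochastic policy
    q     : (n_states, n_actions) Q-value estimates
    alpha : mixture rate in [0, 1]
    """
    greedy = np.zeros_like(pi)
    greedy[np.arange(pi.shape[0]), q.argmax(axis=1)] = 1.0
    return (1.0 - alpha) * pi + alpha * greedy
```

Setting alpha = 1 recovers plain policy iteration; smaller values move the policy only partway toward the greedy policy, which is the "conservative" aspect the title refers to.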
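Both benchmarks are standard Gym environments, so loading them is routine. In the sketch below, the version strings and the classic 4-tuple step API are assumptions (older Gym releases), since the reproducibility table notes that the paper pins no software versions.

```python
import gym

# Cartpole as implemented in Gym (Brockman et al. 2016), plus one of
# the 57 Atari games (Bellemare et al. 2013); Atari needs gym[atari].
cartpole = gym.make("CartPole-v0")
atari = gym.make("BreakoutNoFrameskip-v4")

obs = cartpole.reset()
obs, reward, done, info = cartpole.step(cartpole.action_space.sample())
```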
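The update schedule quoted in the Experiment Setup row translates directly into a training-loop skeleton. The sketch below is an illustration, not the authors' code (which is not released): the `agent` interface is hypothetical, and β1 = 0.99, β2 = 0.9999, and α0 = 1 parameterize the paper's Adamax-style mixture-rate schedule (Eq. (8)), which is not reproduced here.

```python
F = 4    # gradient step every F environment interactions (from the paper)
C = 100  # target-network update every C interactions (from the paper)

def training_loop(env, agent, num_interactions):
    """Skeleton of the update schedule quoted above.

    `agent` is a hypothetical object exposing act/store/gradient_step/
    sync_target_networks; the paper builds on Dopamine's DQN baseline.
    """
    obs = env.reset()
    for t in range(1, num_interactions + 1):
        action = agent.act(obs)
        next_obs, reward, done, _ = env.step(action)
        agent.store(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs
        if t % F == 0:
            agent.gradient_step()         # one Adam step on q- and policy networks
        if t % C == 0:
            agent.sync_target_networks()  # refresh both target networks
```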