Bayesian Distributional Policy Gradients

Authors: Luchen Li, A. Aldo Faisal

AAAI 2021, pp. 8429-8437

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate in a suite of Atari 2600 games and MuJoCo tasks, including well-known hard-exploration challenges, how BDPG learns generally faster and with higher asymptotic performance than reference distributional RL algorithms.
Researcher Affiliation | Academia | Luchen Li (1), A. Aldo Faisal (1,2,3); (1) Brain & Behaviour Lab, Dept. of Computing, Imperial College London; (2) Brain & Behaviour Lab, Dept. of Bioengineering, Imperial College London; (3) UKRI Centre in AI for Healthcare, Imperial College London
Pseudocode | Yes | Algorithm 1: Bayesian Distributional Policy Gradients
Open Source Code | No | The paper does not provide an explicit statement about, or link to, open-source code for its methodology.
Open Datasets | Yes | We evaluate and compare our method to other distributional RL approaches on the Arcade Learning Environment Atari 2600 games (Bellemare et al. 2013), including some of the best known hard-exploration cases, and on MuJoCo continuous-control tasks (Todorov, Erez, and Tassa 2012).
Dataset Splits | No | The paper describes using 'training batches' and 'unroll steps' but does not specify a distinct validation dataset split or how validation was performed.
Hardware Specification | No | The paper mentions '16 parallel workers' but does not specify any particular hardware components such as GPU or CPU models, or memory.
Software Dependencies | No | The paper mentions using Generalised Advantage Estimation (GAE), Proximal Policy Optimisation (PPO), Wasserstein-GAN (WGAN), and Quantile Regression (QR) but does not provide specific version numbers for any software libraries or dependencies.
Experiment Setup | Yes | For both Atari and MuJoCo environments, we use 16 parallel workers for data collection, and train in mini-batches. For Atari, we unroll 128 steps with each worker in each training batch for all algorithms, and average scores every 80 training batches. For MuJoCo, we unroll 256 steps, and average scores every 4 batches. We conduct ablation and parameter studies to investigate the impact of the bootstrap length k, and of the truncation cap u on the information gain IG_u(s), on the Atari games Breakout and Q*bert. The exploration coefficient η_t is logarithmically decayed as η_t = η·√(log t / t).
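
Following up on the Open Datasets row above, the sketch below shows one way the two benchmark suites could be instantiated with OpenAI Gym. The environment IDs and the classic Gym reset/step API are assumptions for illustration; the paper does not report the exact environment versions or wrappers it used.

```python
# Illustrative only: the environment IDs and the classic Gym API shown here
# are assumptions, since the paper does not report versions or wrappers.
import gym

# Atari 2600 via the Arcade Learning Environment (Bellemare et al. 2013);
# Breakout and Q*bert are the games used in the paper's ablation studies.
atari_env = gym.make("BreakoutNoFrameskip-v4")

# A MuJoCo continuous-control task (Todorov, Erez, and Tassa 2012); HalfCheetah
# is a hypothetical pick here, the paper evaluates on a suite of MuJoCo tasks.
mujoco_env = gym.make("HalfCheetah-v2")

# One interaction step with the classic (pre-0.26) Gym API.
obs = atari_env.reset()
obs, reward, done, info = atari_env.step(atari_env.action_space.sample())
```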
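
The Experiment Setup row quotes the exploration-coefficient schedule η_t = η·√(log t / t). Below is a minimal sketch of that schedule, assuming t counts training iterations starting at 1; the initial coefficient η is not reported in the row above, so the value used in the example is a placeholder.

```python
import math

# Batching constants quoted in the Experiment Setup row.
NUM_WORKERS = 16
ATARI_UNROLL_STEPS = 128
MUJOCO_UNROLL_STEPS = 256

def exploration_coefficient(eta: float, t: int) -> float:
    """Decay schedule eta_t = eta * sqrt(log t / t), for t >= 1."""
    return eta * math.sqrt(math.log(t) / t)

if __name__ == "__main__":
    # eta = 1.0 is a placeholder; the paper's initial coefficient is not given here.
    for t in (1, 10, 100, 1000, 10000):
        print(t, round(exploration_coefficient(1.0, t), 4))
```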