Bayesian Distributional Policy Gradients

Authors: Luchen Li, A. Aldo Faisal

AAAI 2021, pp. 8429-8437

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate in a suite of Atari 2600 games and MuJoCo tasks, including well-known hard-exploration challenges, how BDPG learns generally faster and with higher asymptotic performance than reference distributional RL algorithms.
Researcher Affiliation | Academia | Luchen Li (1), A. Aldo Faisal (1,2,3); (1) Brain & Behaviour Lab, Dept. of Computing, Imperial College London; (2) Brain & Behaviour Lab, Dept. of Bioengineering, Imperial College London; (3) UKRI Centre in AI for Healthcare, Imperial College London
Pseudocode | Yes | Algorithm 1: Bayesian Distributional Policy Gradients
Open Source Code | No | The paper does not provide an explicit statement about, or link to, open-source code for its methodology.
Open Datasets | Yes | We evaluate and compare our method to other distributional RL approaches on the Arcade Learning Environment Atari 2600 games (Bellemare et al. 2013), including some of the best known hard-exploration cases, and on MuJoCo continuous-control tasks (Todorov, Erez, and Tassa 2012).
Dataset Splits | No | The paper describes using 'training batches' and 'unroll steps' but does not specify a distinct validation dataset split or how validation was performed.
Hardware Specification | No | The paper mentions '16 parallel workers' but does not specify any particular hardware components such as GPU or CPU models, or memory.
Software Dependencies | No | The paper mentions using Generalised Advantage Estimation (GAE), Proximal Policy Optimisation (PPO), Wasserstein-GAN (WGAN), and Quantile Regression (QR) but does not provide specific version numbers for any software libraries or dependencies.
Experiment Setup | Yes | For both Atari and MuJoCo environments, we use 16 parallel workers for data collection, and train in mini-batches. For Atari, we unroll 128 steps with each worker in each training batch for all algorithms, and average scores every 80 training batches. For MuJoCo, we unroll 256 steps, and average scores every 4 batches. We conduct ablation and parameter studies to investigate the impact of the bootstrap length k, and of the truncation cap u on the information gain IG_u(s), on the Atari games Breakout and Q*bert. The exploration coefficient η_t is logarithmically decayed as η_t = η·√(log t / t).
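
Following up on the Open Datasets row above, the sketch below shows one way the two benchmark suites could be instantiated with OpenAI Gym. The environment IDs and the classic Gym reset/step API are assumptions for illustration; the paper does not report the exact environment versions or wrappers it used.

```python
# Illustrative only: the environment IDs and the classic Gym API shown here
# are assumptions, since the paper does not report versions or wrappers.
import gym

# Atari 2600 via the Arcade Learning Environment (Bellemare et al. 2013);
# Breakout and Q*bert are the games used in the paper's ablation studies.
atari_env = gym.make("BreakoutNoFrameskip-v4")

# A MuJoCo continuous-control task (Todorov, Erez, and Tassa 2012); HalfCheetah
# is a hypothetical pick here, the paper evaluates on a suite of MuJoCo tasks.
mujoco_env = gym.make("HalfCheetah-v2")

# One interaction step with the classic (pre-0.26) Gym API.
obs = atari_env.reset()
obs, reward, done, info = atari_env.step(atari_env.action_space.sample())
```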
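
The Experiment Setup row quotes the exploration-coefficient schedule η_t = η·√(log t / t). Below is a minimal sketch of that schedule, assuming t counts training iterations starting at 1; the initial coefficient η is not reported in the row above, so the value used in the example is a placeholder.

```python
import math

# Batching constants quoted in the Experiment Setup row.
NUM_WORKERS = 16
ATARI_UNROLL_STEPS = 128
MUJOCO_UNROLL_STEPS = 256

def exploration_coefficient(eta: float, t: int) -> float:
    """Decay schedule eta_t = eta * sqrt(log t / t), for t >= 1."""
    return eta * math.sqrt(math.log(t) / t)

if __name__ == "__main__":
    # eta = 1.0 is a placeholder; the paper's initial coefficient is not given here.
    for t in (1, 10, 100, 1000, 10000):
        print(t, round(exploration_coefficient(1.0, t), 4))
```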