Bayesian Distributional Policy Gradients
Authors: Luchen Li, A. Aldo Faisal
AAAI 2021, pp. 8429-8437
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate in a suite of Atari 2600 games and MuJoCo tasks, including well-known hard-exploration challenges, how BDPG learns generally faster and with higher asymptotic performance than reference distributional RL algorithms. |
| Researcher Affiliation | Academia | Luchen Li,1 A. Aldo Faisal 1,2,3 — 1 Brain & Behaviour Lab, Dept. of Computing, Imperial College London; 2 Brain & Behaviour Lab, Dept. of Bioengineering, Imperial College London; 3 UKRI Centre in AI for Healthcare, Imperial College London |
| Pseudocode | Yes | Algorithm 1 Bayesian Distributional Policy Gradients |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of their methodology. |
| Open Datasets | Yes | We evaluate and compare our method to other distributional RL approaches on the Arcade Learning Environment Atari 2600 games (Bellemare et al. 2013), including some of the best known hard-exploration cases, and on MuJoCo continuous-control tasks (Todorov, Erez, and Tassa 2012). |
| Dataset Splits | No | The paper describes using 'training batches' and 'unroll steps' but does not specify a distinct validation dataset split or how validation was performed. |
| Hardware Specification | No | The paper mentions '16 parallel workers' but does not specify any particular hardware components like GPU or CPU models, or memory. |
| Software Dependencies | No | The paper mentions using Generalised Advantage Estimation (GAE), Proximal Policy Optimisation (PPO), Wasserstein-GAN (WGAN), and Quantile Regression (QR) but does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | For both Atari and MuJoCo environments, we use 16 parallel workers for data collection, and train in mini-batches. For Atari, we unroll 128 steps with each worker in each training batch for all algorithms, and average scores every 80 training batches. For MuJoCo, we unroll 256 steps, and average scores every 4 batches. We conduct ablation and parameter studies to investigate the impact of the bootstrap length k, and of the truncation cap u on IG_u(s), on the Atari games Breakout and Q*bert. The exploration coefficient η_t is logarithmically decayed as η_t = η √(log t / t). |
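
The exploration-coefficient schedule quoted in the Experiment Setup row can be made concrete with a short sketch. The snippet below is a minimal illustration of the decay η_t = η √(log t / t), assuming it is evaluated per training step t ≥ 1; the base value η = 0.1 used in the usage example is hypothetical and not taken from the paper.

```python
import math

def exploration_coefficient(eta: float, t: int) -> float:
    """Logarithmic decay eta_t = eta * sqrt(log t / t), as quoted in the Experiment Setup row."""
    assert t >= 1, "schedule assumed to be defined for training steps t >= 1"
    return eta * math.sqrt(math.log(t) / t)

# Usage with a hypothetical base value eta = 0.1 (not specified in the quoted text):
for t in (1, 10, 100, 1000):
    print(f"t={t:4d}  eta_t={exploration_coefficient(0.1, t):.4f}")
```

At t = 1 the coefficient starts at zero (log 1 = 0) and the schedule otherwise decays roughly as 1/√t, which matches the "logarithmically decayed" description in the quoted setup.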