Distributional Reinforcement Learning via Moment Matching
Authors: Thanh Nguyen-Tang, Sunil Gupta, Svetha Venkatesh
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the suite of Atari games show that our method outperforms the distributional RL baselines and sets a new record in the Atari games for non-distributed agents. Experimental Results: We first present results with a tabular version of MMDRL to illustrate its behaviour in a distribution approximation task. We then combine the MMDRL update with the DQN-style architecture to create a novel deep RL algorithm, namely MMDQN, and evaluate it on the Atari-57 games. |
| Researcher Affiliation | Academia | Thanh Nguyen-Tang, Sunil Gupta, Svetha Venkatesh Applied Artificial Intelligence Institute (A2I2), Deakin University, Australia |
| Pseudocode | Yes | Algorithm 1: Generic MMDRL update (a hedged sketch of the corresponding MMD particle loss appears below the table) |
| Open Source Code | Yes | Our official code is available at https://github.com/thanhnguyentang/mmdrl. |
| Open Datasets | Yes | We evaluated our algorithm on 55 Atari 2600 games (Bellemare et al. 2013) following the standard training and evaluation procedures (Mnih et al. 2015; van Hasselt, Guez, and Silver 2016) |
| Dataset Splits | No | The paper mentions following 'standard training and evaluation procedures (Mnih et al. 2015; van Hasselt, Guez, and Silver 2016)' but does not explicitly state specific training/validation/test dataset splits, percentages, or counts within its text. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Open AI Gym' and the 'Dopamine framework' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For fair comparison with QR-DQN, we used the same hyperparameters: N = 200, Adam optimizer (Kingma and Ba 2015) with lr = 0.00005, ϵ_Adam = 0.01/32. We used an ϵ-greedy policy with ϵ decayed at the same rate as in DQN but to a lower value ϵ = 0.01, as commonly used by distributional RL methods. We used a target network to compute the distributional Bellman target, as with DQN. (These values are gathered into a configuration sketch after the table.) |
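
The pseudocode row above refers to Algorithm 1, the generic MMDRL update, which matches the predicted return particles to the distributional Bellman target particles under a maximum mean discrepancy criterion. The snippet below is a minimal illustrative sketch of such a squared-MMD particle loss; the Gaussian kernel, single bandwidth, PyTorch layout, and all variable names are assumptions made here for illustration, not the authors' implementation (the official repository linked above is authoritative).

```python
# Illustrative sketch only: an empirical squared-MMD loss between predicted
# return particles and distributional Bellman target particles, in the spirit
# of a generic MMDRL-style update. Kernel choice and all names are assumptions.
import torch

def gaussian_kernel(x, y, bandwidth=1.0):
    # x: (B, N), y: (B, M) -> pairwise Gaussian kernel values, shape (B, N, M).
    diff = x.unsqueeze(-1) - y.unsqueeze(-2)
    return torch.exp(-diff.pow(2) / (2.0 * bandwidth ** 2))

def mmd_squared(pred, target, bandwidth=1.0):
    # Biased (V-statistic) estimate of MMD^2 between the two particle sets,
    # averaged over the batch.
    k_pp = gaussian_kernel(pred, pred, bandwidth).mean(dim=(-2, -1))
    k_pt = gaussian_kernel(pred, target, bandwidth).mean(dim=(-2, -1))
    k_tt = gaussian_kernel(target, target, bandwidth).mean(dim=(-2, -1))
    return (k_pp - 2.0 * k_pt + k_tt).mean()

if __name__ == "__main__":
    B, N, gamma = 32, 200, 0.99                          # batch size, particles, discount
    pred = torch.randn(B, N, requires_grad=True)         # predicted particles of Z_theta(x, a)
    reward = torch.randn(B, 1)
    next_particles = torch.randn(B, N)                   # particles of Z(x', a*) from a target net
    target = (reward + gamma * next_particles).detach()  # distributional Bellman target
    loss = mmd_squared(pred, target)
    loss.backward()                                      # gradients flow to the predicted particles
```

In a deep-RL setting the predicted particles would be the output head of a DQN-style network rather than free tensors, and the loss would be minimized with the optimizer settings quoted in the experiment-setup row.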
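For the experiment-setup row, the following is a minimal sketch that collects the quoted MMDQN hyperparameters into a single configuration mapping. Only the numerical values come from the paper; the key names and the dict structure are hypothetical.

```python
# Hypothetical configuration sketch; the values are those quoted above,
# the key names are assumptions made for illustration.
mmdqn_config = {
    "num_particles": 200,            # N, matching QR-DQN for a fair comparison
    "optimizer": "adam",
    "learning_rate": 0.00005,        # lr
    "adam_epsilon": 0.01 / 32,       # eps_Adam
    "epsilon_greedy_final": 0.01,    # decayed at the same rate as in DQN
    "use_target_network": True,      # target net computes the distributional Bellman target
}
```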