Fully Parameterized Quantile Function for Distributional Reinforcement Learning
Authors: Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, Tie-Yan Liu
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 55 Atari games show that our algorithm significantly outperforms existing distributional RL algorithms and creates a new record for the Atari Learning Environment for non-distributed agents. |
| Researcher Affiliation | Collaboration | Derek Yang, UC San Diego, dyang1206@gmail.com; Li Zhao, Microsoft Research, lizo@microsoft.com; Zichuan Lin, Tsinghua University, linzc16@mails.tsinghua.edu.cn; Tao Qin, Microsoft Research, taoqin@microsoft.com; Jiang Bian, Microsoft Research, jiang.bian@microsoft.com; Tie-Yan Liu, Microsoft Research, tyliu@microsoft.com |
| Pseudocode | Yes | Algorithm 1: FQF update (a hedged sketch of this update is given after the table) |
| Open Source Code | No | The paper states, "We implement FQF based on the Dopamine framework," but does not provide a link to, or a statement about releasing, the authors' own FQF implementation. |
| Open Datasets | Yes | We test our algorithm on the Atari games from Arcade Learning Environment (ALE) Bellemare et al. [2013]. |
| Dataset Splits | No | The paper specifies training and evaluation phases but does not explicitly detail a separate validation dataset split with percentages or sample counts. |
| Hardware Specification | Yes | All experiments are performed on NVIDIA Tesla V100 16GB graphics cards. |
| Software Dependencies | No | The paper mentions implementing FQF based on the Dopamine framework but does not specify version numbers for Dopamine or any other software components. |
| Experiment Setup | Yes | Our hyper-parameter setting is aligned with IQN for fair comparison. The number of τ for FQF is 32. The weights of the fraction proposal network are initialized so that initial probabilities are uniform as in QR-DQN, and its learning rates are relatively small compared with those of the quantile value network to keep the probabilities relatively stable while training. We run all agents with 200 million frames. At the training stage, we use ϵ-greedy with ϵ = 0.01. For each evaluation stage, we test the agent for 0.125 million frames with ϵ = 0.001. For each algorithm we run 3 random seeds. (The reported values are collected in the config sketch after the table.) |
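
Below is a minimal, hedged sketch of an FQF-style update in the spirit of Algorithm 1. It is written in PyTorch purely for illustration (the paper builds on the Dopamine framework and links no code of its own), and every module name, tensor shape, and helper here (`FractionProposal`, `quantile_huber_loss`, `fraction_loss`, `embed_dim`) is our assumption rather than the authors' implementation.

```python
# Hedged sketch of an FQF-style update: fraction proposal plus the two losses.
# Illustrative only; names and shapes are assumptions, not the authors' code.
import torch
import torch.nn as nn

N_FRACTIONS = 32   # "The number of tau for FQF is 32" (paper)
KAPPA = 1.0        # Huber threshold, as in QR-DQN/IQN


class FractionProposal(nn.Module):
    """Maps a state embedding to monotone fractions 0 = tau_0 < ... < tau_N = 1."""

    def __init__(self, embed_dim: int, n: int = N_FRACTIONS):
        super().__init__()
        self.fc = nn.Linear(embed_dim, n)

    def forward(self, state_embed: torch.Tensor):
        probs = torch.softmax(self.fc(state_embed), dim=-1)           # (B, N), sums to 1
        tau = torch.cumsum(probs, dim=-1)                             # tau_1 .. tau_N (= 1)
        tau = torch.cat([torch.zeros_like(tau[:, :1]), tau], dim=-1)  # prepend tau_0 = 0
        tau_hat = (tau[:, :-1] + tau[:, 1:]) / 2.0                    # midpoints fed to the quantile net
        entropy = -(probs * torch.log(probs + 1e-8)).sum(-1)          # optional entropy regularizer
        return tau, tau_hat, entropy


def quantile_huber_loss(pred: torch.Tensor, target: torch.Tensor,
                        tau_hat: torch.Tensor, kappa: float = KAPPA) -> torch.Tensor:
    """Quantile Huber loss between predicted quantiles (B, N) and target quantiles (B, N')."""
    diff = target.unsqueeze(1) - pred.unsqueeze(2)                    # (B, N, N')
    huber = torch.where(diff.abs() <= kappa,
                        0.5 * diff.pow(2),
                        kappa * (diff.abs() - 0.5 * kappa))
    weight = (tau_hat.unsqueeze(2) - (diff.detach() < 0).float()).abs()
    return (weight * huber / kappa).sum(dim=1).mean()


def fraction_loss(values_at_tau: torch.Tensor, values_at_tau_hat: torch.Tensor,
                  tau: torch.Tensor) -> torch.Tensor:
    """Surrogate loss for the fraction proposal network.

    values_at_tau:     (B, N-1) quantile values at the interior fractions tau_1..tau_{N-1}.
    values_at_tau_hat: (B, N)   quantile values at the midpoints tau_hat_0..tau_hat_{N-1}.
    Its gradient w.r.t. tau_i is 2 F^{-1}(tau_i) - F^{-1}(tau_hat_{i-1}) - F^{-1}(tau_hat_i).
    """
    grad = 2 * values_at_tau - values_at_tau_hat[:, :-1] - values_at_tau_hat[:, 1:]
    # Treat the analytic derivative as a constant and move tau along it.
    return (grad.detach() * tau[:, 1:-1]).sum(dim=-1).mean()
```

In a full agent, `pred` would be the chosen action's quantile values at `tau_hat`, `target` the detached Bellman-backed-up quantiles from a target network, and the two losses would be minimized by separate optimizers, with a smaller learning rate for the fraction proposal network as the paper notes.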
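
As a quick reference, the numbers reported in the table can be collected into a single config sketch. The key names below are ours; the values are taken directly from the paper's text quoted above.

```python
# Reported FQF experiment setup, gathered from the statements quoted in the table.
# Key names are illustrative; values are as stated in the paper.
FQF_REPORTED_SETUP = {
    "n_fractions": 32,                  # number of tau per state
    "training_frames": 200_000_000,     # 200 million frames per agent
    "train_epsilon": 0.01,              # epsilon-greedy during training
    "eval_epsilon": 0.001,              # epsilon-greedy during evaluation
    "eval_frames_per_stage": 125_000,   # 0.125 million frames per evaluation stage
    "random_seeds": 3,                  # runs per algorithm
    "base_framework": "Dopamine",       # version not reported
    "gpu": "NVIDIA Tesla V100 16GB",    # hardware used for all experiments
}
```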