Distributional Reward Decomposition for Reinforcement Learning
Authors: Zichuan Lin, Li Zhao, Derek Yang, Tao Qin, Tie-Yan Liu, Guangwen Yang
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our method captures the multi-channel structure and discovers meaningful reward decomposition, without any requirements on prior knowledge. Consequently, our agent achieves better performance than existing methods on environments with multiple reward channels. We test our algorithm on chosen Atari Games with multiple reward channels. |
| Researcher Affiliation | Collaboration | Zichuan Lin, Tsinghua University, linzc16@mails.tsinghua.edu.cn; Li Zhao, Microsoft Research, lizo@microsoft.com; Derek Yang, UC San Diego, dyang1206@gmail.com; Tao Qin, Microsoft Research, taoqin@microsoft.com; Guangwen Yang, Tsinghua University, ygw@tsinghua.edu.cn; Tie-Yan Liu, Microsoft Research, tyliu@microsoft.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'We also provide videos of running sub-policies defined by π_i = arg max_a E(Z_i)', with a footnote pointing to https://sites.google.com/view/drdpaper, but it does not explicitly state that the source code for the methodology is provided at this link or elsewhere. (A hedged sketch of the quoted sub-policy rule appears after the table.) |
| Open Datasets | Yes | We tested our algorithm on the games from Arcade Learning Environment (ALE; Bellemare et al. [2013]). |
| Dataset Splits | No | The paper mentions '0.125 million of evaluation steps' but does not specify a validation dataset split or how it's derived from the main dataset. |
| Hardware Specification | Yes | All experiments are performed on NVIDIA Tesla V100 16GB graphics cards. |
| Software Dependencies | No | The paper states 'Our code is built upon dopamine framework (Castro et al. [2018]).' but does not provide specific version numbers for Dopamine or any other software dependencies. |
| Experiment Setup | Yes | We use the default well-tuned hyper-parameter setting in dopamine. For our updating rule in Eq. 9, we set λ = 0.0001. We run our agents for 100 epochs, each with 0.25 million training steps and 0.125 million evaluation steps. (A configuration sketch collecting these reported values appears after the table.) |
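
The sub-policy rule quoted in the Open Source Code row, π_i = arg max_a E(Z_i), selects for each reward channel i the action that maximizes the expected value of that channel's return distribution. The following is a minimal sketch of that selection step, not the authors' released code: the head layout (one categorical distribution per channel and action), the shared atom support, and all names below are assumptions made for illustration.

```python
import numpy as np

# Sketch of sub-policy extraction pi_i = argmax_a E[Z_i].
# Assumed (hypothetical) shapes:
#   logits:  (num_channels, num_actions, num_atoms) -- one categorical return
#            distribution Z_i(s, a) per reward channel i and action a.
#   support: (num_atoms,) -- atom values shared by all distributions.

def sub_policy_actions(logits: np.ndarray, support: np.ndarray) -> np.ndarray:
    """For each reward channel i, return the greedy action argmax_a E[Z_i(s, a)]."""
    # Softmax over atoms turns logits into probability mass per atom.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Expected return per channel and action: E[Z_i(s, a)] = sum_j p_ij(s, a) * z_j.
    q_values = (probs * support).sum(axis=-1)   # shape (num_channels, num_actions)
    return q_values.argmax(axis=-1)             # one greedy action per channel

# Toy usage with 2 hypothetical reward channels, 4 actions, 51 atoms.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 51))
support = np.linspace(-10.0, 10.0, 51)
print(sub_policy_actions(logits, support))      # one greedy action per reward channel
```

The Experiment Setup row reports λ = 0.0001 for the Eq. 9 update and 100 epochs of 0.25 million training steps and 0.125 million evaluation steps on top of Dopamine's default hyper-parameters. Below is a hedged sketch that collects those reported values into a single run configuration; only the numbers come from the paper, while the dictionary keys and the `run_experiment` helper are hypothetical names, not Dopamine's actual API.

```python
# Reported experiment settings gathered into one place (values from the paper;
# key names and the runner function are illustrative assumptions).
REPORTED_SETTINGS = {
    "lambda_reg": 1e-4,           # lambda in the paper's Eq. 9 update rule
    "num_epochs": 100,            # "100 epochs"
    "training_steps": 250_000,    # 0.25 million training steps per epoch
    "evaluation_steps": 125_000,  # 0.125 million evaluation steps per epoch
}

def run_experiment(settings: dict) -> None:
    """Placeholder loop showing how the reported schedule would be iterated."""
    for epoch in range(settings["num_epochs"]):
        # A real run would call the agent's train/eval phases here, e.g.
        # train(settings["training_steps"]) and evaluate(settings["evaluation_steps"]).
        pass

if __name__ == "__main__":
    run_experiment(REPORTED_SETTINGS)
```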
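Keeping the reported values in one explicit configuration object like this makes it easier to compare a reproduction attempt against the paper's stated schedule, even when the underlying framework (here, Dopamine) supplies the remaining defaults.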