Distributional Reward Decomposition for Reinforcement Learning

Authors: Zichuan Lin, Li Zhao, Derek Yang, Tao Qin, Tie-Yan Liu, Guangwen Yang

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our method captures the multi-channel structure and discovers meaningful reward decomposition, without any requirements on prior knowledge. Consequently, our agent achieves better performance than existing methods on environments with multiple reward channels. We test our algorithm on chosen Atari games with multiple reward channels.
Researcher Affiliation | Collaboration | Zichuan Lin (Tsinghua University, linzc16@mails.tsinghua.edu.cn); Li Zhao (Microsoft Research, lizo@microsoft.com); Derek Yang (UC San Diego, dyang1206@gmail.com); Tao Qin (Microsoft Research, taoqin@microsoft.com); Guangwen Yang (Tsinghua University, ygw@tsinghua.edu.cn); Tie-Yan Liu (Microsoft Research, tyliu@microsoft.com)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'We also provide videos of running sub-policies defined by π_i = argmax_a E(Z_i)' and links https://sites.google.com/view/drdpaper, but it does not explicitly state that source code for the methodology is provided at this link or elsewhere. (A hedged sketch of such a greedy sub-policy appears after the table.)
Open Datasets | Yes | We tested our algorithm on the games from the Arcade Learning Environment (ALE; Bellemare et al. [2013]).
Dataset Splits | No | The paper mentions '0.125 million of evaluation steps' but does not specify a validation dataset split or how it is derived from the main dataset.
Hardware Specification | Yes | All experiments are performed on NVIDIA Tesla V100 16GB graphics cards.
Software Dependencies | No | The paper states 'Our code is built upon dopamine framework (Castro et al. [2018])' but does not provide specific version numbers for Dopamine or any other software dependencies.
Experiment Setup | Yes | We use the default well-tuned hyper-parameter setting in dopamine. For our updating rule in Eq. 9, we set λ = 0.0001. We run our agents for 100 epochs, each with 0.25 million training steps and 0.125 million evaluation steps. (See the configuration sketch after the table.)
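
To make the sub-policy definition quoted in the Open Source Code row concrete, the snippet below is a minimal sketch of a greedy per-channel policy π_i = argmax_a E[Z_i(s, a)], assuming a categorical (C51-style) return distribution for each reward channel. The function and argument names (greedy_sub_policy, channel_probs, support) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def greedy_sub_policy(channel_probs, support):
    """Greedy sub-policy pi_i = argmax_a E[Z_i(s, a)] for one reward channel.

    channel_probs: (num_actions, num_atoms) categorical probabilities of the
        channel's return distribution at the current state (assumed layout).
    support: (num_atoms,) atom values of the categorical distribution.
    Returns the index of the action with the highest expected channel return.
    """
    expected_returns = channel_probs @ support  # E[Z_i(s, a)] for each action
    return int(np.argmax(expected_returns))

# Toy usage: 3 actions, 51 atoms on [-10, 10], random placeholder distributions.
support = np.linspace(-10.0, 10.0, 51)
channel_probs = np.random.dirichlet(np.ones(51), size=3)
print(greedy_sub_policy(channel_probs, support))
```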
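
The Experiment Setup row maps naturally onto Dopamine's iteration-based runner (100 iterations of 250k training and 125k evaluation steps each). The dictionary below is a minimal sketch of those reported values; the key names mirror Dopamine's Runner arguments, and lambda_reg is a hypothetical name for the λ weight in Eq. 9, since the paper does not publish a configuration file.

```python
# Sketch of the reported training schedule; keys echo Dopamine's Runner
# arguments (num_iterations, training_steps, evaluation_steps), but this
# dictionary is illustrative, not the authors' actual configuration.
DRD_EXPERIMENT_SETUP = {
    "num_iterations": 100,        # "100 epochs"
    "training_steps": 250_000,    # 0.25 million training steps per epoch
    "evaluation_steps": 125_000,  # 0.125 million evaluation steps per epoch
    "lambda_reg": 1e-4,           # hypothetical name for lambda in Eq. 9
}
# All other hyper-parameters follow Dopamine's default, well-tuned setting.
```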