Distributional Reward Decomposition for Reinforcement Learning

Authors: Zichuan Lin, Li Zhao, Derek Yang, Tao Qin, Tie-Yan Liu, Guangwen Yang

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our method captures the multi-channel structure and discovers meaningful reward decomposition, without any requirements on prior knowledge. Consequently, our agent achieves better performance than existing methods on environments with multiple reward channels. We test our algorithm on chosen Atari games with multiple reward channels.
Researcher Affiliation | Collaboration | Zichuan Lin (Tsinghua University, linzc16@mails.tsinghua.edu.cn); Li Zhao (Microsoft Research, lizo@microsoft.com); Derek Yang (UC San Diego, dyang1206@gmail.com); Tao Qin (Microsoft Research, taoqin@microsoft.com); Guangwen Yang (Tsinghua University, ygw@tsinghua.edu.cn); Tie-Yan Liu (Microsoft Research, tyliu@microsoft.com)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'We also provide videos of running sub-policies defined by π_i = argmax_a E(Z_i)' and links https://sites.google.com/view/drdpaper, but it does not explicitly state that source code for the methodology is provided at this link or elsewhere. (A hedged sketch of such a greedy sub-policy appears after the table.)
Open Datasets | Yes | We tested our algorithm on the games from the Arcade Learning Environment (ALE; Bellemare et al. [2013]).
Dataset Splits | No | The paper mentions '0.125 million of evaluation steps' but does not specify a validation dataset split or how it is derived from the main dataset.
Hardware Specification | Yes | All experiments are performed on NVIDIA Tesla V100 16GB graphics cards.
Software Dependencies | No | The paper states 'Our code is built upon dopamine framework (Castro et al. [2018])' but does not provide specific version numbers for Dopamine or any other software dependencies.
Experiment Setup | Yes | We use the default well-tuned hyper-parameter setting in dopamine. For our updating rule in Eq. 9, we set λ = 0.0001. We run our agents for 100 epochs, each with 0.25 million training steps and 0.125 million evaluation steps. (See the configuration sketch after the table.)
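
To make the sub-policy definition quoted in the Open Source Code row concrete, the snippet below is a minimal sketch of a greedy per-channel policy π_i = argmax_a E[Z_i(s, a)], assuming a categorical (C51-style) return distribution for each reward channel. The function and argument names (greedy_sub_policy, channel_probs, support) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def greedy_sub_policy(channel_probs, support):
    """Greedy sub-policy pi_i = argmax_a E[Z_i(s, a)] for one reward channel.

    channel_probs: (num_actions, num_atoms) categorical probabilities of the
        channel's return distribution at the current state (assumed layout).
    support: (num_atoms,) atom values of the categorical distribution.
    Returns the index of the action with the highest expected channel return.
    """
    expected_returns = channel_probs @ support  # E[Z_i(s, a)] for each action
    return int(np.argmax(expected_returns))

# Toy usage: 3 actions, 51 atoms on [-10, 10], random placeholder distributions.
support = np.linspace(-10.0, 10.0, 51)
channel_probs = np.random.dirichlet(np.ones(51), size=3)
print(greedy_sub_policy(channel_probs, support))
```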
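
The Experiment Setup row maps naturally onto Dopamine's iteration-based runner (100 iterations of 250k training and 125k evaluation steps each). The dictionary below is a minimal sketch of those reported values; the key names mirror Dopamine's Runner arguments, and lambda_reg is a hypothetical name for the λ weight in Eq. 9, since the paper does not publish a configuration file.

```python
# Sketch of the reported training schedule; keys echo Dopamine's Runner
# arguments (num_iterations, training_steps, evaluation_steps), but this
# dictionary is illustrative, not the authors' actual configuration.
DRD_EXPERIMENT_SETUP = {
    "num_iterations": 100,        # "100 epochs"
    "training_steps": 250_000,    # 0.25 million training steps per epoch
    "evaluation_steps": 125_000,  # 0.125 million evaluation steps per epoch
    "lambda_reg": 1e-4,           # hypothetical name for lambda in Eq. 9
}
# All other hyper-parameters follow Dopamine's default, well-tuned setting.
```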