Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Distributional Reward Decomposition for Reinforcement Learning

Authors: Zichuan Lin, Li Zhao, Derek Yang, Tao Qin, Tie-Yan Liu, Guangwen Yang

NeurIPS 2019 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our method captures the multi-channel structure and discovers meaningful reward decomposition, without any requirements on prior knowledge. Consequently, our agent achieves better performance than existing methods on environments with multiple reward channels. We test our algorithm on chosen Atari Games with multiple reward channels.
Researcher Affiliation | Collaboration | Zichuan Lin (Tsinghua University), Li Zhao (Microsoft Research), Derek Yang (UC San Diego), Tao Qin (Microsoft Research), Guangwen Yang (Tsinghua University), Tie-Yan Liu (Microsoft Research)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'We also provide videos² of running sub-policies defined by π_i = arg max_a E(Z_i). ²https://sites.google.com/view/drdpaper' but does not explicitly state that the source code for the methodology is provided at this link or elsewhere. (The sub-policy definition is sketched below the table.)
Open Datasets | Yes | We tested our algorithm on the games from Arcade Learning Environment (ALE; Bellemare et al. [2013]).
Dataset Splits | No | The paper mentions '0.125 million of evaluation steps' but does not specify a validation dataset split or how it's derived from the main dataset.
Hardware Specification | Yes | All experiments are performed on NVIDIA Tesla V100 16GB graphics cards.
Software Dependencies | No | The paper states 'Our code is built upon dopamine framework (Castro et al. [2018]).' but does not provide specific version numbers for Dopamine or any other software dependencies.
Experiment Setup | Yes | We use the default well-tuned hyper-parameter setting in dopamine. For our updating rule in Eq. 9, we set λ = 0.0001. We run our agents for 100 epochs, each with 0.25 million of training steps and 0.125 million of evaluation steps. (See the schedule sketch below the table.)
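
The sub-policy definition quoted in the Open Source Code row, π_i = arg max_a E(Z_i), can be illustrated with a minimal sketch. This assumes a C51-style categorical return distribution for one reward channel; the names and shapes (`support`, `probs`, 51 atoms) are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

# Hypothetical sketch of a greedy sub-policy pi_i = argmax_a E[Z_i(s, a)]
# for one reward channel i, assuming a C51-style categorical return
# distribution. Names and shapes are illustrative, not the authors' code.

num_actions, num_atoms = 4, 51
support = np.linspace(-10.0, 10.0, num_atoms)  # atom locations z_1..z_51

# probs[a, j] = p_i(s, a, z_j): per-action probability mass on each atom,
# as a distributional head for channel i might output after a softmax.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(num_atoms), size=num_actions)

q_values = probs @ support          # E[Z_i(s, a)] for every action a
action = int(np.argmax(q_values))   # greedy action under sub-policy pi_i
print(q_values, action)
```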
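The schedule in the Experiment Setup row appears consistent with Dopamine's per-iteration defaults (250k training steps, 125k evaluation steps). A minimal sketch of the reported numbers follows; the constant names are illustrative assumptions, not taken from the authors' configuration files.

```python
# Minimal sketch of the reported training schedule; constant names are
# illustrative assumptions, not from the authors' configuration files.

EPOCHS = 100             # "100 epochs"
TRAIN_STEPS = 250_000    # "0.25 million of training steps" per epoch
EVAL_STEPS = 125_000     # "0.125 million of evaluation steps" per epoch
LAMBDA = 1e-4            # weight lambda = 0.0001 in the paper's Eq. 9

total_train = EPOCHS * TRAIN_STEPS  # 25,000,000 training steps overall
total_eval = EPOCHS * EVAL_STEPS    # 12,500,000 evaluation steps overall
print(f"train steps: {total_train:,}, eval steps: {total_eval:,}")
```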