Diverse Projection Ensembles for Distributional Reinforcement Learning

Authors: Moritz Akiya Zanger, Wendelin Boehmer, Matthijs T. J. Spaan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our algorithm on the behavior suite benchmark and VizDoom and find that diverse projection ensembles lead to significant performance improvements over existing methods on a variety of tasks, with the most pronounced gains in directed exploration problems.
Researcher Affiliation | Academia | Moritz A. Zanger, Wendelin Böhmer, Matthijs T. J. Spaan; Delft University of Technology, The Netherlands; {m.a.zanger, j.w.bohmer, m.t.j.spaan}@tudelft.nl
Pseudocode | Yes | Algorithm 1 PE-DQN
Open Source Code | Yes | C51 requires us to define return ranges, which we defined manually; these can be found in the online code repository.
Open Datasets | Yes | We evaluate our algorithm on the behavior suite (Osband et al., 2020), a benchmark collection of 468 environments, and a set of hard exploration problems in the visual domain VizDoom (Kempka et al., 2016). (A bsuite loading sketch follows the table.)
Dataset Splits | No | The paper uses reinforcement learning environments (bsuite, VizDoom) rather than traditional datasets with explicit train/validation/test splits. While a subset of environments was used for hyperparameter tuning, this does not constitute a dataset split as defined.
Hardware Specification | Yes | We deployed bsuite environments in 16 parallel jobs to be executed on 8 NVIDIA Tesla V100S 32GB GPUs, 16 Intel XEON E5-6248R 24C 3.0GHz CPUs, and 64GB of memory in total.
Software Dependencies | No | All algorithms use the Adam optimizer (Kingma and Ba, 2015). The hyperparameter search was conducted using Optuna (Akiba et al., 2019).
Experiment Setup | Yes | Our experiments are designed to provide us with a better understanding of how PE-DQN operates, in comparison to related algorithms as well as in relation to its algorithmic elements. To this end, we aimed to keep codebases and hyperparameters equal across all implementations up to algorithm-specific parameters, which we optimized with a grid search on a selected subset of problems. Further details regarding the experimental design and implementations are provided in Appendix B. (A hedged grid-search sketch follows the table.)
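
As noted in the Open Datasets row, the experiments run on the behavior suite (bsuite) and VizDoom environments rather than static datasets. The snippet below is a minimal sketch of how a bsuite environment can be loaded and stepped through its dm_env interface; the environment ID ('deep_sea/0'), the results directory, and the trivial placeholder policy are illustrative assumptions, not the paper's code.

```python
# Minimal sketch: loading one bsuite environment and logging results to CSV.
# The bsuite_id and results_dir below are illustrative choices only.
import bsuite
from bsuite import sweep

# The full bsuite sweep enumerates every environment ID in the benchmark.
print(f"bsuite sweep contains {len(sweep.SWEEP)} environment IDs")

env = bsuite.load_and_record_to_csv('deep_sea/0', results_dir='/tmp/bsuite')

timestep = env.reset()
while not timestep.last():
    action = 0  # placeholder policy; an agent such as PE-DQN would act here
    timestep = env.step(action)
```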
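The Software Dependencies and Experiment Setup rows mention the Adam optimizer, Optuna, and a grid search over algorithm-specific parameters on a selected subset of problems. The following is a hedged sketch of such a search using Optuna's GridSampler; the learning-rate grid and the train_and_evaluate stub are hypothetical placeholders, not the authors' tuning code.

```python
# Hedged sketch of a hyperparameter grid search with Optuna, in the spirit of
# the setup described above. `train_and_evaluate` and the learning-rate grid
# are hypothetical placeholders.
import optuna


def train_and_evaluate(learning_rate: float) -> float:
    # Stand-in: in the real setup this would train the agent with an Adam
    # optimizer at `learning_rate` on a selected subset of environments and
    # return its mean evaluation score.
    return 0.0


def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_categorical('learning_rate', [1e-4, 5e-4, 1e-3])
    return train_and_evaluate(learning_rate=lr)


search_space = {'learning_rate': [1e-4, 5e-4, 1e-3]}
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.GridSampler(search_space),
)
study.optimize(objective, n_trials=len(search_space['learning_rate']))
print(study.best_params)
```

The GridSampler exhaustively enumerates the declared search space, which matches the grid-search procedure described in the Experiment Setup row more closely than Optuna's default sampler.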