The Curse of Diversity in Ensemble-Based Exploration

Authors: Zhixuan Lin, Pierluca D'Oro, Evgenii Nikishin, Aaron Courville

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify our hypothesis in the Arcade Learning Environment (Bellemare et al., 2012) with the Bootstrapped DQN (Osband et al., 2016) algorithm and the Gym MuJoCo benchmark (Towers et al., 2023) with an ensemble SAC (Haarnoja et al., 2018a) algorithm. We show that, in many environments, the individual members of a data-sharing ensemble significantly underperform their single-agent counterparts. Moreover, while aggregating the policies of all ensemble members via voting or averaging sometimes compensates for the degradation in individual members' performance, this is not always the case. These results suggest that ensemble-based exploration has a hidden negative effect that might weaken or even completely eliminate its advantages. We perform a series of experiments to confirm the connection between the observed performance degradation and the off-policy learning challenge posed by a diverse ensemble. (A minimal sketch of the voting and averaging aggregation schemes appears after the table.)
Researcher Affiliation | Academia | Zhixuan Lin, Pierluca D'Oro, Evgenii Nikishin & Aaron Courville, Mila - Quebec AI Institute, Université de Montréal
Pseudocode | Yes | Appendix A (Algorithms): In this section, we provide the pseudocode for the ensemble algorithms used in this work. (An illustrative sketch of such a data-sharing ensemble training loop appears after the table.)
Open Source Code | Yes | The source code is available at the following repositories: Atari: https://github.com/zhixuan-lin/ensemble-rl-discrete; MuJoCo: https://github.com/zhixuan-lin/ensemble-rl-continuous
Open Datasets | Yes | We use 55 Atari games from the Arcade Learning Environment and 4 Gym MuJoCo tasks for our analysis. We use the same set of 55 games as Agarwal et al. (2021). The scores of the human and random agents are taken from DQN Zoo (Quan & Ostrovski, 2020). Gym MuJoCo benchmark (Towers et al., 2023). (A sketch of the human-normalized score implied by this evaluation protocol appears after the table.)
Dataset Splits | No | No explicit mention of traditional training/validation/test dataset splits (e.g., percentages or counts) was found for the datasets used in the reinforcement learning setup.
Hardware Specification | Yes | On a server with an NVIDIA RTX 8000 GPU and an AMD EPYC 7502 CPU, each seed of our implementation of Double DQN, Bootstrapped DQN (N = 10, share 0 layers), and Bootstrapped DQN (N = 10, share 3 layers) takes roughly 2, 3.5, and 3 days respectively for 200M frames.
Software Dependencies | Yes | The SAC and ensemble SAC algorithms are built on JAXRL (Kostrikov, 2021) and use the default hyperparameters. Double DQN and Bootstrapped DQN are implemented using the Dopamine (Castro et al., 2018) framework in JAX (Bradbury et al., 2018).
Experiment Setup | Yes | The hyperparameters for Double DQN are listed in Table 1.
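
The abstract above mentions aggregating ensemble members' policies via voting (discrete actions) or averaging (continuous actions). Below is a minimal sketch of what such aggregation could look like; it is not the authors' code, and the function names and array shapes are illustrative assumptions.

```python
import numpy as np

def vote_discrete(q_values):
    """Majority vote over per-member greedy actions.

    q_values: array of shape (n_members, n_actions) with each member's
    Q-values for the current state.
    """
    greedy = q_values.argmax(axis=1)                        # each member's greedy action
    votes = np.bincount(greedy, minlength=q_values.shape[1])
    return int(votes.argmax())                              # most-voted action index

def average_continuous(member_actions):
    """Average the continuous actions proposed by each member.

    member_actions: array of shape (n_members, action_dim).
    """
    return member_actions.mean(axis=0)
```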
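The pseudocode row refers to the ensemble algorithms given in Appendix A of the paper. The following is only a schematic sketch of the general data-sharing ensemble exploration scheme the paper studies (one member acts per episode, all members train on the shared replay buffer); `env`, `members`, and `buffer` are assumed interfaces (old-style Gym step API), not the paper's actual implementation.

```python
import random

def train_data_sharing_ensemble(env, members, buffer, num_episodes, batch_size=32):
    """Sketch: per-episode member selection with a shared replay buffer."""
    for _ in range(num_episodes):
        actor = random.choice(members)            # member used to collect data this episode
        obs, done = env.reset(), False
        while not done:
            action = actor.select_action(obs)
            next_obs, reward, done, info = env.step(action)
            buffer.add(obs, action, reward, next_obs, done)
            obs = next_obs
            if len(buffer) >= batch_size:
                batch = buffer.sample(batch_size)
                for member in members:            # all members learn from the shared data
                    member.update(batch)
```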
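The datasets row notes that human and random reference scores are taken from DQN Zoo, which suggests the standard human-normalized score used in the Agarwal et al. (2021) Atari evaluation protocol. A minimal sketch of that normalization (the specific numbers in the example are made up for illustration):

```python
def human_normalized_score(agent_score, random_score, human_score):
    """0.0 corresponds to the random agent, 1.0 to the human reference score."""
    return (agent_score - random_score) / (human_score - random_score)

# Example: an agent scoring 5000 on a game where random = 200 and human = 10000
# gets (5000 - 200) / (10000 - 200) ≈ 0.49.
```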