The Curse of Diversity in Ensemble-Based Exploration
Authors: Zhixuan Lin, Pierluca D'Oro, Evgenii Nikishin, Aaron Courville
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify our hypothesis in the Arcade Learning Environment (Bellemare et al., 2012) with the Bootstrapped DQN (Osband et al., 2016) algorithm and the Gym MuJoCo benchmark (Towers et al., 2023) with an ensemble SAC (Haarnoja et al., 2018a) algorithm. We show that, in many environments, the individual members of a data-sharing ensemble significantly underperform their single-agent counterparts. Moreover, while aggregating the policies of all ensemble members via voting or averaging sometimes compensates for the degradation in individual members' performance, it is not always the case. These results suggest that ensemble-based exploration has a hidden negative effect that might weaken or even completely eliminate its advantages. We perform a series of experiments to confirm the connection between the observed performance degradation and the off-policy learning challenge posed by a diverse ensemble. |
| Researcher Affiliation | Academia | Zhixuan Lin, Pierluca D'Oro, Evgenii Nikishin & Aaron Courville, Mila Québec AI Institute, Université de Montréal |
| Pseudocode | Yes | Appendix A (Algorithms): "In this section, we provide the pseudocode for the ensemble algorithms used in this work." (A minimal illustrative sketch of the ensemble acting and aggregation scheme follows the table.) |
| Open Source Code | Yes | The source code is available at the following repositories: Atari: https://github.com/zhixuan-lin/ensemble-rl-discrete MuJoCo: https://github.com/zhixuan-lin/ensemble-rl-continuous |
| Open Datasets | Yes | We use 55 Atari games from the Arcade Learning Environment and 4 Gym MuJoCo tasks for our analysis. We use the same set of 55 games as Agarwal et al. (2021). The scores of the human and random agents are taken from DQN Zoo (Quan & Ostrovski, 2020). Gym MuJoCo benchmark (Towers et al., 2023). |
| Dataset Splits | No | No explicit mention of traditional training/validation/test dataset splits (e.g., percentages or counts) was found for the datasets used in the reinforcement learning setup. |
| Hardware Specification | Yes | On a server with an NVIDIA RTX 8000 GPU and an AMD EPYC 7502 CPU, each seed of our implementation of Double DQN, Bootstrapped DQN (N = 10, share 0 layers), and Bootstrapped DQN (N = 10, share 3 layers) takes roughly 2, 3.5, and 3 days, respectively, for 200M frames. |
| Software Dependencies | Yes | The SAC and ensemble SAC algorithms are built on JAXRL (Kostrikov, 2021) and use the default hyperparameters. Double DQN and Bootstrapped DQN are implemented using the Dopamine (Castro et al., 2018) framework in JAX (Bradbury et al., 2018). |
| Experiment Setup | Yes | The hyperparameters for Double DQN are listed in Table 1. |
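
The excerpts above describe the ensemble-exploration scheme under study: each member acts with its own policy for a whole episode, all members learn from shared data, and at evaluation time the member policies are aggregated by voting (discrete control) or averaging (continuous control). The following is a minimal, self-contained Python sketch of that acting and aggregation side only, not the authors' released implementation (see the repositories linked above); the names `q_values`, `episode_action`, and `vote_action`, the dummy linear Q-function, and the action-space size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MEMBERS = 10  # ensemble size, as in the Bootstrapped DQN runs quoted above
N_ACTIONS = 6   # hypothetical discrete action space (e.g., an Atari game)


def q_values(member: int, obs: np.ndarray) -> np.ndarray:
    """Stand-in for one ensemble member's Q-network.

    In the real implementations this would be a Dopamine/JAX network (Atari)
    or a JAXRL critic (MuJoCo); here it is a fixed random linear map so the
    sketch stays self-contained and runnable.
    """
    weights = np.random.default_rng(member).normal(size=(obs.size, N_ACTIONS))
    return obs @ weights


def episode_action(active_member: int, obs: np.ndarray) -> int:
    """Training-time exploration (Bootstrapped DQN style).

    At the start of each episode one member is sampled uniformly and acts
    greedily w.r.t. its own Q-values for the whole episode, while all members
    are trained on the shared replay data.
    """
    return int(np.argmax(q_values(active_member, obs)))


def vote_action(obs: np.ndarray) -> int:
    """Evaluation-time aggregation: majority vote over members' greedy actions.

    For continuous control one would instead average the member policies'
    actions.
    """
    greedy = [int(np.argmax(q_values(m, obs))) for m in range(N_MEMBERS)]
    return int(np.bincount(greedy, minlength=N_ACTIONS).argmax())


if __name__ == "__main__":
    obs = rng.normal(size=84 * 84)         # flattened dummy observation
    member = int(rng.integers(N_MEMBERS))  # sampled once per episode
    print("exploration action:", episode_action(member, obs))
    print("aggregated (voted) action:", vote_action(obs))
```

The sketch deliberately omits the learning update: in the actual algorithms all members are trained on the same shared replay data, so every member except the one that collected a given episode learns off-policy, which is the mechanism the paper connects to the observed performance degradation.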