REValueD: Regularised Ensemble Value-Decomposition for Factorisable Markov Decision Processes

Authors: David Ireland, Giovanni Montana

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our work culminates in an approach we call REValueD: Regularised Ensemble Value-Decomposition. We benchmark REValueD against DecQN and Branching Dueling Q-Networks (BDQ) (Tavakoli et al., 2018), utilising the discretised variants of the DeepMind Control Suite tasks (Tunyasuvunakool et al., 2020) used by Seyde et al. (2022) for comparison. The experimental outcomes show that REValueD consistently surpasses DecQN and BDQ across a majority of tasks. Of significant note is the marked outperformance of REValueD in the humanoid and dog tasks, where the number of sub-action spaces is exceedingly high (N = 21 and 38, respectively). Further, we perform several ablations on the distinct components of REValueD to evaluate their individual contributions. (A hedged sketch of such a value-decomposed critic appears after the table.)
Researcher Affiliation | Academia | David Ireland, Giovanni Montana ({david.ireland, g.montana}@warwick.ac.uk), University of Warwick; Alan Turing Institute
Pseudocode | No | The paper describes algorithms and methods but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | No | The paper does not contain any explicit statement about releasing open-source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | We benchmark REValueD on the discretised versions of the DeepMind Control Suite tasks (Tunyasuvunakool et al., 2020), as utilised by Seyde et al. (2022). These tasks represent challenging control problems, and when discretised, they can incorporate up to 38 distinct sub-action spaces. We also provide results for a selection of discretised Meta-World tasks (Yu et al., 2020) in Appendix J.
Dataset Splits | No | The paper describes evaluation on 'test episodes' after training updates and refers to a replay buffer for sampling. However, it does not specify traditional train/validation/test dataset splits (e.g., percentages or counts) as would be common for static datasets in supervised learning.
Hardware Specification | No | The paper states that experiments were run 'on the same machine' (Table 3 caption) but does not provide any specific details about the hardware used, such as GPU/CPU models, memory, or cloud resources.
Software Dependencies | No | The paper mentions 'Optimizer Adam' in Table 4, but it does not specify any software names with version numbers, such as Python, PyTorch, TensorFlow, or other libraries that are critical for reproducibility.
Experiment Setup | Yes | Hyperparameters: We largely employ the same hyperparameters as the original DecQN study, as detailed in Table 4, along with those specific to REValueD. Exceptions include the decay of the exploration parameter (ϵ) to a minimum value instead of keeping it constant, and the use of Polyak averaging for updating the target network parameters, as opposed to a hard reset after every specified number of updates. We maintain the same hyperparameters across all our experiments. During action selection in REValueD, we follow a deep exploration technique similar to that proposed by Osband et al. (2016), where we sample a single critic from the ensemble at each time-step during training and follow an ϵ-greedy policy based on that critic's utility estimates. For test-time action selection, we average over the ensemble and then act greedily according to the mean utility values. Table 4: Hyperparameters used for the experiments presented in Section 5. (A hedged sketch of this action-selection scheme appears after the table.)
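
The Research Type row above describes REValueD as an ensemble of value-decomposed critics benchmarked against DecQN and BDQ. The sketch below illustrates what a value-decomposed (DecQN-style) Q-network can look like: a shared torso with one utility head per sub-action dimension, with the global Q-value taken as the mean of the selected per-dimension utilities. The PyTorch framework, class and variable names, and layer sizes are assumptions made for illustration; this is not the authors' released implementation.

```python
# Minimal sketch (assumed, not the authors' code) of a value-decomposed Q-network:
# a shared torso, one utility head per sub-action dimension, and the global
# Q-value computed as the mean of the chosen per-dimension utilities.
import torch
import torch.nn as nn


class DecomposedQNetwork(nn.Module):
    def __init__(self, state_dim: int, n_dims: int, n_bins: int, hidden: int = 256):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One head per sub-action dimension, each scoring its discrete bins.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_bins) for _ in range(n_dims)]
        )

    def utilities(self, state: torch.Tensor) -> torch.Tensor:
        """Per-dimension utilities, shape (batch, n_dims, n_bins)."""
        z = self.torso(state)
        return torch.stack([head(z) for head in self.heads], dim=1)

    def q_value(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        """Global Q(s, a) as the mean of the selected per-dimension utilities."""
        util = self.utilities(state)                    # (batch, n_dims, n_bins)
        chosen = util.gather(-1, action.unsqueeze(-1))  # (batch, n_dims, 1)
        return chosen.squeeze(-1).mean(dim=1)           # (batch,)
```

Under this decomposition, the greedy action is obtained per dimension as the argmax of each head's utilities, so the joint action space never has to be enumerated.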
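
The Experiment Setup row describes the ensemble action-selection scheme: during training a single critic is sampled from the ensemble and an ϵ-greedy action is taken based on its utility estimates; at test time utilities are averaged over the ensemble before acting greedily. The sketch below follows that description under stated assumptions; applying ϵ-greedy independently to each sub-action dimension is one natural reading of the excerpt, not something it states explicitly, and all names are illustrative.

```python
# Hedged sketch of the ensemble action-selection scheme quoted above.
# `ensemble` is assumed to be a list of DecomposedQNetwork critics (see the
# previous sketch); none of these names come from the paper.
import random
import torch


def select_action(ensemble, state, epsilon: float, n_bins: int, training: bool):
    with torch.no_grad():
        if training:
            # Deep-exploration style: sample one critic per time-step.
            critic = random.choice(ensemble)
            util = critic.utilities(state.unsqueeze(0))[0]   # (n_dims, n_bins)
            greedy = util.argmax(dim=-1)                     # greedy bin per dimension
            # Assumed: epsilon-greedy applied independently per sub-action dimension.
            explore = torch.rand(greedy.shape) < epsilon
            random_bins = torch.randint(n_bins, greedy.shape)
            return torch.where(explore, random_bins, greedy)
        # Test time: average utilities across the ensemble, then act greedily.
        mean_util = torch.stack(
            [c.utilities(state.unsqueeze(0))[0] for c in ensemble]
        ).mean(dim=0)
        return mean_util.argmax(dim=-1)
```

The Polyak-averaged target update mentioned in the same row would correspond to blending target and online parameters after each learning step (target ← τ · online + (1 − τ) · target), in place of a periodic hard copy; the value of τ is given in the paper's Table 4, not here.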