REValueD: Regularised Ensemble Value-Decomposition for Factorisable Markov Decision Processes
Authors: David Ireland, Giovanni Montana
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work culminates in an approach we call REValueD: Regularised Ensemble Value-Decomposition. We benchmark REValueD against DecQN and Branching Dueling Q-Networks (BDQ) (Tavakoli et al., 2018), utilising the discretised variants of DeepMind Control Suite tasks (Tunyasuvunakool et al., 2020) used by Seyde et al. (2022) for comparison. The experimental outcomes show that REValueD consistently surpasses DecQN and BDQ across a majority of tasks. Of significant note is the marked outperformance of REValueD in the humanoid and dog tasks, where the number of sub-action spaces is exceedingly high (N = 21 and 38, respectively). Further, we perform several ablations on the distinct components of REValueD to evaluate their individual contributions. |
| Researcher Affiliation | Academia | David Ireland, Giovanni Montana ({david.ireland, g.montana}@warwick.ac.uk), University of Warwick; Alan Turing Institute |
| Pseudocode | No | The paper describes algorithms and methods but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing open-source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We benchmark REValueD on the discretised versions of the DeepMind Control Suite tasks (Tunyasuvunakool et al., 2020), as utilised by Seyde et al. (2022). These tasks represent challenging control problems, and when discretised, they can incorporate up to 38 distinct sub-action spaces. We also provide results for a selection of discretised Meta-World tasks (Yu et al., 2020) in Appendix J. |
| Dataset Splits | No | The paper describes evaluation on 'test episodes' after training updates and refers to a replay buffer for sampling. However, it does not specify traditional train/validation/test dataset splits (e.g., percentages or counts) as would be common for static datasets in supervised learning. |
| Hardware Specification | No | The paper states that experiments were run 'on the same machine' (Table 3 caption) but does not provide any specific details about the hardware used, such as GPU/CPU models, memory, or cloud resources. |
| Software Dependencies | No | The paper mentions 'Optimizer Adam' in Table 4, but it does not specify any software names with version numbers, such as Python, PyTorch, TensorFlow, or other libraries that are critical for reproducibility. |
| Experiment Setup | Yes | Hyperparameters: We largely employ the same hyperparameters as the original DecQN study, as detailed in Table 4, along with those specific to REValueD. Exceptions include the decay of the exploration parameter (ϵ) to a minimum value instead of keeping it constant, and the use of Polyak averaging for updating the target network parameters, as opposed to a hard reset after every specified number of updates. We maintain the same hyperparameters across all our experiments. During action selection in REValueD, we follow a deep exploration technique similar to that proposed by Osband et al. (2016), where we sample a single critic from the ensemble at each time-step during training and follow an ϵ-greedy policy based on that critic's utility estimates. For test-time action selection, we average over the ensemble and then act greedily according to the mean utility values. (An illustrative sketch of this action-selection scheme is given below the table.) Table 4: Hyperparameters used for the experiments presented in Section 5. |
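
The action-selection and target-update procedures quoted in the Experiment Setup row are concrete enough to sketch. The snippet below is a minimal illustration only, not the authors' implementation or released code: the `critic_utilities` stub, the array shapes, the ensemble size, the `polyak_update` helper, and the choice to apply ϵ independently per sub-action space are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SUB_ACTIONS = 6   # assumed number of sub-action spaces (e.g. joints)
N_BINS = 3          # assumed discrete bins per sub-action space
ENSEMBLE_SIZE = 5   # assumed number of critics in the ensemble


def critic_utilities(critic_params, state):
    """Stand-in for one critic network: returns per-sub-action utility
    estimates with shape (N_SUB_ACTIONS, N_BINS). Random values are used
    here purely so the sketch runs end to end."""
    return rng.standard_normal((N_SUB_ACTIONS, N_BINS))


def train_action(ensemble, state, epsilon):
    """Deep-exploration-style selection: sample a single critic from the
    ensemble and act epsilon-greedily on its utility estimates. Applying
    epsilon per sub-action dimension is an assumption of this sketch."""
    critic = ensemble[rng.integers(len(ensemble))]
    utilities = critic_utilities(critic, state)             # (N, bins)
    greedy_bins = utilities.argmax(axis=1)                   # greedy bin per dim
    random_bins = rng.integers(N_BINS, size=N_SUB_ACTIONS)   # random bin per dim
    explore = rng.random(N_SUB_ACTIONS) < epsilon
    return np.where(explore, random_bins, greedy_bins)


def test_action(ensemble, state):
    """Test-time selection: average utilities over the ensemble, then act
    greedily according to the mean utility values."""
    mean_utilities = np.mean(
        [critic_utilities(c, state) for c in ensemble], axis=0
    )
    return mean_utilities.argmax(axis=1)


def polyak_update(target_params, online_params, tau=0.005):
    """Soft target-network update: target <- (1 - tau) * target + tau * online.
    The tau value is a placeholder, not taken from the paper's Table 4."""
    return {
        name: (1.0 - tau) * value + tau * online_params[name]
        for name, value in target_params.items()
    }


if __name__ == "__main__":
    ensemble = [None] * ENSEMBLE_SIZE          # placeholder critic "parameters"
    state = np.zeros(10)
    print(train_action(ensemble, state, epsilon=0.1))
    print(test_action(ensemble, state))

    target = {"w": np.zeros(3)}
    online = {"w": np.ones(3)}
    print(polyak_update(target, online)["w"])
```

The sketch only mirrors the two behaviours the report quotes: sampling one critic per training time-step for ϵ-greedy exploration, and averaging the ensemble's utilities before acting greedily at test time; everything else (network architecture, replay buffer, loss, and the REValueD regularisation itself) is omitted.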