Soft Q-Learning with Mutual-Information Regularization
Authors: Jordi Grau-Moya, Felix Leibfried, Peter Vrancx
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our MIRL agent both in the tabular setting using a grid world domain, and in the parametric function approximator setting using the Atari domain. |
| Researcher Affiliation | Industry | Jordi Grau-Moya, Felix Leibfried and Peter Vrancx PROWLER.io Cambridge, United Kingdom {jordi}@prowler.io |
| Pseudocode | Yes | The pseudocode of our proposed algorithm is outlined in Algorithm 1 |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the methodology is openly available. |
| Open Datasets | Yes | We conduct experiments on 19 Atari games (Brockman et al., 2016) |
| Dataset Splits | No | The paper describes training and evaluation procedures but does not explicitly describe a distinct validation split or any other data partitioning. |
| Hardware Specification | No | The paper mentions using a neural network and running experiments on Atari, but it does not specify any particular hardware details such as GPU models, CPU types, or cloud computing specifications. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as programming language versions or library versions (e.g., PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | Parameter β Updates: The parameter β can be seen as a Lagrange multiplier that quantifies the magnitude of penalization for deviating from the prior. As such, a small fixed value of β would restrict the class of available policies and evidently constrain the asymptotic performance of MIRL. In order to remedy this problem and obtain better asymptotic performance, we use the same adaptive β-scheduling over rounds i from (Fox et al., 2016) in which βᵢ is updated linearly according to βᵢ₊₁ = c · i with some positive constant c. |
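
The linear β-schedule quoted in the Experiment Setup row can be sketched as below. This is a minimal illustration, not the authors' code: the function name `beta_schedule`, the choice to start counting rounds at i = 1, and the example value c = 0.5 are all assumptions. The quoted text only states the update rule β_{i+1} = c · i with a positive constant c.

```python
def beta_schedule(c, num_rounds):
    """Return [beta_2, ..., beta_{num_rounds + 1}] where beta_{i+1} = c * i.

    Assumption: rounds are indexed i = 1, 2, ..., num_rounds. The quoted
    text does not say where the index starts or what the initial beta is.
    """
    if c <= 0:
        raise ValueError("c must be a positive constant")
    # beta grows linearly with the round index, so the penalty for
    # deviating from the prior is relaxed as training proceeds.
    return [c * i for i in range(1, num_rounds + 1)]


# Hypothetical example value; the quoted text does not give c.
betas = beta_schedule(c=0.5, num_rounds=4)
print(betas)
```

Because β grows without bound under this rule, the restriction on the policy class that a small fixed β would impose is lifted gradually, which is the motivation the quoted passage gives for the schedule.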