Soft Q-Learning with Mutual-Information Regularization

Authors: Jordi Grau-Moya, Felix Leibfried, Peter Vrancx

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our MIRL agent both in the tabular setting using a grid world domain, and in the parametric function approximator setting using the Atari domain."
Researcher Affiliation | Industry | "Jordi Grau-Moya, Felix Leibfried and Peter Vrancx, PROWLER.io, Cambridge, United Kingdom, {jordi}@prowler.io"
Pseudocode | Yes | "The pseudocode of our proposed algorithm is outlined in Algorithm 1."
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the methodology is openly available.
Open Datasets | Yes | "We conduct experiments on 19 Atari games (Brockman et al., 2016)."
Dataset Splits | No | The paper describes training and evaluation/testing procedures but does not explicitly mention a separate validation split or any other explicit data partitioning.
Hardware Specification | No | The paper mentions using a neural network and running experiments on Atari, but it does not specify any particular hardware details such as GPU models, CPU types, or cloud computing specifications.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as programming language versions or library versions (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | "Parameter β Updates: The parameter β can be seen as a Lagrange multiplier that quantifies the magnitude of penalization for deviating from the prior. As such, a small fixed value of β would restrict the class of available policies and evidently constrain the asymptotic performance of MIRL. In order to remedy this problem and obtain better asymptotic performance, we use the same adaptive β-scheduling over rounds i from (Fox et al., 2016) in which β_i is updated linearly according to β_{i+1} = c * i with some positive constant c."
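
To make the quoted setup concrete, the sketch below shows how the linear β schedule (β_{i+1} = c * i) and a soft Q-backup regularized toward a learned action prior might look in the tabular setting. This is a hedged illustration, not the paper's Algorithm 1: the function names (beta_schedule, soft_value, mirl_step), the learning rates, the β floor added for numerical safety, and the running-average update of the prior are assumptions introduced here; only the linear β rule and the idea of penalizing deviation from a learned prior come from the paper.

```python
# Minimal tabular sketch (assumptions noted in comments); not the paper's Algorithm 1.
import numpy as np

def beta_schedule(i, c=1e-3, beta_min=1e-6):
    """Linear schedule beta_{i+1} = c * i; the beta_min floor is a
    numerical-safety assumption so the soft value stays well defined."""
    return max(c * i, beta_min)

def soft_value(q_row, rho, beta):
    """Prior-weighted soft value: V(s) = (1/beta) * log sum_a rho(a) * exp(beta * Q(s, a))."""
    z = beta * q_row + np.log(rho)   # log of rho(a) * exp(beta * Q(s, a))
    m = z.max()                      # stabilize the log-sum-exp
    return (m + np.log(np.exp(z - m).sum())) / beta

def policy(q_row, rho, beta):
    """Boltzmann-like policy proportional to rho(a) * exp(beta * Q(s, a))."""
    z = beta * q_row + np.log(rho)
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def mirl_step(Q, rho, s, a, r, s_next, done, beta,
              lr_q=0.1, lr_rho=0.01, gamma=0.99):
    """One illustrative update: soft Q-backup plus a running estimate of the
    marginal action distribution used as the prior (update rule is an assumption)."""
    target = r if done else r + gamma * soft_value(Q[s_next], rho, beta)
    Q[s, a] += lr_q * (target - Q[s, a])
    rho[:] = (1.0 - lr_rho) * rho + lr_rho * policy(Q[s], rho, beta)
    return Q, rho

# Usage sketch with assumed sizes (5 states, 3 actions):
# Q = np.zeros((5, 3)); rho = np.ones(3) / 3
# for i in range(1, num_rounds):
#     beta = beta_schedule(i)
#     ...collect a transition (s, a, r, s_next, done), then call mirl_step(...)
```

Growing β over rounds weakens the prior penalty over time, so the prior-weighted log-sum-exp backup approaches a standard max backup asymptotically, which matches the quoted motivation of not constraining asymptotic performance.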