Extreme Q-Learning: MaxEnt RL without Entropy

Authors: Divyansh Garg, Joey Hejna, Matthieu Geist, Stefano Ermon

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our method obtains consistently strong performance in the D4RL benchmark, outperforming prior works by 10+ points on the challenging Franka Kitchen tasks while offering moderate improvements over SAC and TD3 on online DM Control tasks. Visualizations and code can be found on our website." (Section 4, Experiments) "We compare our Extreme Q-Learning (X-QL) approach to state-of-the-art algorithms across a wide set of continuous control tasks in both online and offline settings."
Researcher Affiliation | Collaboration | Divyansh Garg (Stanford University, divgarg@stanford.edu), Joey Hejna (Stanford University, jhejna@stanford.edu), Matthieu Geist (Google Brain, mfgeist@google.com), Stefano Ermon (Stanford University, ermon@stanford.edu)
Pseudocode | Yes | "Algorithm 1: Extreme Q-learning (X-QL) (Under Stochastic Dynamics)" (a hedged sketch of the core Gumbel-regression value loss appears below the table)
Open Source Code | Yes | "Visualizations and code can be found on our website: https://div99.github.io/XQL/"
Open Datasets | Yes | "Our method obtains consistently strong performance in the D4RL benchmark, outperforming prior works by 10+ points on the challenging Franka Kitchen tasks while offering moderate improvements over SAC and TD3 on online DM Control tasks." "We compare our Extreme Q-Learning (X-QL) approach to state-of-the-art algorithms across a wide set of continuous control tasks in both online and offline settings. In practice, the exponential nature of the Gumbel regression poses difficult optimization challenges. We provide offline results on Adroit, details of loss implementation, ablations, and hyperparameters in Appendix D."
Dataset Splits | No | The paper mentions training data and reports performance on benchmarks, but does not explicitly specify validation splits or procedures for its experiments in the main text. It refers to a 'standard split' in some contexts but does not define its own validation split.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud computing specifications).
Software Dependencies | No | The paper mentions implementations based on 'pytorch_sac (Yarats & Kostrikov, 2020)' and 'TD3 on the original author's code from Fujimoto et al. (2018)', and uses 'Scipy' for fitting curves, but it does not provide version numbers for these software dependencies (e.g., PyTorch or Python versions).
Experiment Setup | Yes | "Full hyper-parameters we used for experiments are given in Table 4." "Table 4: Offline RL hyperparameters used for X-QL. The first values given are for the non-per-environment-tuned version of X-QL, and the values in parentheses are for the tuned offline results, X-QL-T. V-updates gives the number of value updates per Q update; increasing it reduces the variance of value updates using the Gumbel loss on some hard environments." (see the update-loop sketch below the table)
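
For reference, the quoted Algorithm 1 centers on a Gumbel (extreme-value) regression for the value function. The snippet below is a minimal PyTorch sketch of that loss, not the authors' released implementation; the temperature `beta` and the clamp on the exponent are illustrative choices for numerical stability, motivated by the paper's remark that the exponential term makes optimization difficult.

```python
import torch


def gumbel_regression_loss(q_values: torch.Tensor,
                           v_values: torch.Tensor,
                           beta: float = 2.0,
                           max_exponent: float = 7.0) -> torch.Tensor:
    """Gumbel (extreme-value) regression loss for the X-QL value update (sketch).

    Minimizing E[exp(z) - z - 1], with z = (Q(s, a) - V(s)) / beta, pushes
    V(s) toward a soft maximum of Q(s, a) over the dataset actions without
    sampling from a policy. Clamping z before exponentiation is a common
    safeguard against overflow; the clamp value and beta here are illustrative.
    """
    z = (q_values - v_values) / beta
    z = torch.clamp(z, max=max_exponent)  # keep exp(z) from blowing up
    return (torch.exp(z) - z - 1.0).mean()
```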
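The 'V-updates' hyperparameter from Table 4 controls how many Gumbel-regression value steps are taken per Q update. The sketch below shows one way such an alternating update could be wired up, assuming hypothetical `q_net`/`v_net` modules and their optimizers; the defaults for `beta`, `gamma`, and `v_updates` are placeholders, not the paper's tuned values.

```python
import torch


def xql_offline_update(batch, q_net, v_net, q_opt, v_opt,
                       beta: float = 2.0, gamma: float = 0.99,
                       v_updates: int = 4):
    """One alternating offline update in the style of X-QL (a sketch).

    Several Gumbel-regression steps fit V(s) to a soft maximum of Q(s, a),
    then a single TD step regresses Q(s, a) toward r + gamma * V(s').
    `v_updates` plays the role of the V-updates knob described in Table 4.
    """
    obs, act, rew, next_obs, done = batch  # done is a 0/1 float tensor

    # Value updates: Gumbel regression of V(s) against frozen Q(s, a).
    for _ in range(v_updates):
        with torch.no_grad():
            q = q_net(obs, act)
        v_loss = gumbel_regression_loss(q, v_net(obs), beta)  # loss from the sketch above
        v_opt.zero_grad()
        v_loss.backward()
        v_opt.step()

    # Q update: plain TD regression toward the bootstrapped value target.
    with torch.no_grad():
        target = rew + gamma * (1.0 - done) * v_net(next_obs)
    q_loss = ((q_net(obs, act) - target) ** 2).mean()
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()
```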