Offline Reinforcement Learning with Implicit Q-Learning

Authors: Ilya Kostrikov, Ashvin Nair, Sergey Levine

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our algorithm, implicit Q-Learning (IQL), aims to estimate this objective while evaluating the Q-function only on the state-action pairs in the dataset. ... Furthermore, our approach demonstrates the state-of-the-art performance on D4RL, a popular benchmark for offline reinforcement learning. ... Our experiments aim to evaluate our method comparatively, in contrast to prior offline RL methods..."
Researcher Affiliation | Academia | "Ilya Kostrikov, Ashvin Nair & Sergey Levine, Department of Electrical Engineering and Computer Science, University of California, Berkeley. {kostrikov,anair17}@berkeley.edu, svlevine@eecs.berkeley.edu"
Pseudocode | Yes | "Algorithm 1 Implicit Q-learning" (a hedged sketch of the losses this algorithm alternates is given after the table)
Open Source Code | Yes | "For the baselines, we measure runtime for our reimplementations of the methods in JAX (Bradbury et al., 2018) built on top of JAXRL (Kostrikov, 2021), which are typically faster than the original implementations."
Open Datasets | Yes | "Furthermore, our approach demonstrates the state-of-the-art performance on D4RL, a popular benchmark for offline reinforcement learning. ... Our approach on the D4RL (Fu et al., 2020) benchmark tasks..." (a minimal loading sketch is given after the table)
Dataset Splits | No | No specific percentages, sample counts, or explicit methodology for training/validation/test splits are reported for the authors' own experiments. The paper refers to standard environments and general RL terminology, but does not specify reproducible splits for its own work.
Hardware Specification | Yes | "First, our algorithm is computationally efficient: we can perform 1M updates on one GTX1080 GPU in less than 20 minutes."
Software Dependencies | No | The paper mentions JAX (Bradbury et al., 2018) and cites JAXRL (Kostrikov, 2021), a GitHub repository, but does not provide version numbers for JAX or any other software library.
Experiment Setup | No | The paper discusses the effect of the expectile parameter τ and the inverse temperature β used for policy extraction, but a comprehensive list of hyperparameters (e.g., learning rates, batch sizes, optimizer settings) is not given in the main text. For online fine-tuning it states "Exact experimental details are provided in Appendix C", so those details appear only in the appendix. (An illustrative configuration skeleton is given after the table.)
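
The Pseudocode row points to Algorithm 1 of the paper. Below is a minimal, hedged sketch of the three losses that algorithm alternates (expectile regression for the value function, a TD backup for the Q-function, and advantage-weighted policy extraction), written with jax.numpy to match the tooling the paper reports. The function names and the default hyperparameter values here are illustrative placeholders, not the paper's reported settings.

```python
# Hedged sketch of the IQL losses (Algorithm 1); names and defaults are illustrative.
import jax.numpy as jnp

def expectile_loss(diff, tau=0.7):
    """Asymmetric L2 loss: residuals above zero are weighted by tau, below by 1 - tau."""
    weight = jnp.where(diff > 0, tau, 1.0 - tau)
    return weight * diff ** 2

def value_loss(q_target, v):
    """L_V: regress V(s) toward an expectile of the target Q(s, a) over dataset actions."""
    return expectile_loss(q_target - v).mean()

def q_loss(rewards, v_next, q, discount=0.99):
    """L_Q: TD backup that bootstraps from the value network at the next state."""
    td_target = rewards + discount * v_next
    return ((td_target - q) ** 2).mean()

def awr_weights(q_target, v, beta=3.0, max_weight=100.0):
    """Advantage-weighted regression weights exp(beta * (Q - V)) for policy extraction,
    clipped for numerical stability (a common implementation choice, not a paper claim)."""
    return jnp.minimum(jnp.exp(beta * (q_target - v)), max_weight)
```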
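
The Open Datasets row confirms the experiments use the public D4RL benchmark (Fu et al., 2020). The snippet below is a hedged sketch of loading one D4RL task with the open-source d4rl package; the task name is only an example and is not meant to enumerate the tasks the paper evaluates.

```python
# Hedged sketch of pulling a D4RL task for offline training; the task name is an example.
import gym
import d4rl  # importing registers the offline environments with gym

env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)
# dataset is a dict of NumPy arrays: observations, actions, rewards,
# next_observations, terminals.
print({key: value.shape for key, value in dataset.items()})
```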
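
Because the Experiment Setup row finds no comprehensive hyperparameter list in the main text, the skeleton below only illustrates which settings a reimplementation would need to pin down. Every value is a placeholder for illustration; the paper's actual settings are deferred to its appendix (Appendix C for online fine-tuning).

```python
# Purely illustrative configuration skeleton for an IQL reimplementation.
# All values are placeholders, not the paper's reported numbers.
config = {
    "expectile_tau": 0.7,          # placeholder: asymmetry of the expectile loss
    "inverse_temperature": 3.0,    # placeholder: beta in advantage-weighted extraction
    "discount": 0.99,              # placeholder
    "learning_rate": 3e-4,         # placeholder: actor, critic, and value networks
    "batch_size": 256,             # placeholder
    "gradient_steps": 1_000_000,   # placeholder; the runtime quote above measures 1M updates
}
```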