SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards
Authors: Siddharth Reddy, Anca D. Dragan, Sergey Levine
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that SQIL outperforms BC and achieves competitive results compared to GAIL, on a variety of image-based and low-dimensional tasks in Box2D, Atari, and MuJoCo. This paper is a proof of concept that illustrates how a simple imitation method based on RL with constant rewards can be as effective as more complex methods that use learned rewards. |
| Researcher Affiliation | Academia | Siddharth Reddy, Anca D. Dragan, Sergey Levine; Department of Electrical Engineering and Computer Science, University of California, Berkeley; {sgr,anca,svlevine}@berkeley.edu |
| Pseudocode | Yes | Algorithm 1 Soft Q Imitation Learning (SQIL) |
| Open Source Code | No | The paper mentions using and adapting existing open-source implementations (e.g., OpenAI Baselines) and pretrained policies, but it does not state that the code for SQIL or their specific modifications is released or publicly available. |
| Open Datasets | Yes | We run experiments in four image-based environments (Car Racing, Pong, Breakout, and Space Invaders) and three low-dimensional environments (Humanoid, HalfCheetah, and Lunar Lander) (Brockman et al., 2016; Bellemare et al., 2013; Todorov et al., 2012). |
| Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits (e.g., percentages or exact counts) for reproducibility, nor does it reference standard splits with specific details for the environments used beyond general mentions of expert demonstrations and collected experiences. |
| Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU types, or memory sizes used for running the experiments. |
| Software Dependencies | No | The paper mentions the use of algorithms and frameworks like Adam, Deep Q-learning, and Soft Actor-Critic, but it does not provide specific version numbers for any software libraries or dependencies (e.g., Python, PyTorch, TensorFlow, or specific library versions). |
| Experiment Setup | Yes | For Lunar Lander, we set λ_samp = 10^-6. For Car Racing, we set λ_samp = 0.01. For all other environments, we set λ_samp = 1. For Lunar Lander, we used a network architecture with two fully-connected layers containing 128 hidden units each to represent the Q network in SQIL, the policy and discriminator networks in GAIL, and the policy network in BC. (A minimal code sketch of this setup follows the table.) |
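
The experiment-setup row maps directly onto Algorithm 1 (SQIL): soft Q-learning with a constant reward of 1 on demonstration transitions and 0 on the agent's own transitions, with the sampled-experience term weighted by λ_samp. The sketch below illustrates that loss for the discrete-action (Lunar Lander) case with the two-layer, 128-unit Q network described above. It is a minimal illustration only: the names `QNet`, `soft_value`, and `sqil_loss`, the PyTorch framing, and the default `lambda_samp=1e-6` are assumptions for this sketch, not the authors' code, which (per the Open Source Code row) is not released.

```python
# Minimal SQIL sketch for discrete actions (illustrative; not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    """Two fully connected layers with 128 hidden units each (Lunar Lander setup)."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

def soft_value(q_values):
    # Soft (max-entropy) state value: V(s') = log sum_a' exp(Q(s', a')).
    return torch.logsumexp(q_values, dim=-1)

def sqil_loss(q_net, demo_batch, samp_batch, gamma=0.99, lambda_samp=1e-6):
    """Squared soft Bellman error with reward 1 on demonstrations, 0 on agent samples."""
    def bellman_error(batch, reward):
        obs, act, next_obs, done = batch  # tensors: (B, obs_dim), (B,), (B, obs_dim), (B,)
        q_sa = q_net(obs).gather(1, act.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = reward + gamma * (1.0 - done) * soft_value(q_net(next_obs))
        return F.mse_loss(q_sa, target)

    # Demonstration transitions use reward 1; the agent's sampled transitions use
    # reward 0, down-weighted by lambda_samp as in the quoted hyperparameters.
    return bellman_error(demo_batch, 1.0) + lambda_samp * bellman_error(samp_batch, 0.0)
```

In a full training loop, equally sized batches would be drawn from a demonstration buffer and the agent's replay buffer at each gradient step, and the agent would act with a soft (Boltzmann) policy over the Q values, as in Algorithm 1 of the paper.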