SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards

Authors: Siddharth Reddy, Anca D. Dragan, Sergey Levine

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that SQIL outperforms BC and achieves competitive results compared to GAIL, on a variety of image-based and low-dimensional tasks in Box2D, Atari, and MuJoCo. This paper is a proof of concept that illustrates how a simple imitation method based on RL with constant rewards can be as effective as more complex methods that use learned rewards.
Researcher Affiliation | Academia | Siddharth Reddy, Anca D. Dragan, Sergey Levine; Department of Electrical Engineering and Computer Science, University of California, Berkeley; {sgr,anca,svlevine}@berkeley.edu
Pseudocode | Yes | Algorithm 1: Soft Q Imitation Learning (SQIL). A hedged Python sketch of this training loop appears below the table.
Open Source Code | No | The paper mentions using and adapting existing open-source implementations (e.g., OpenAI Baselines) and pretrained policies, but it does not state that the code for SQIL or the authors' specific modifications is publicly released.
Open Datasets | Yes | We run experiments in four image-based environments (Car Racing, Pong, Breakout, and Space Invaders) and three low-dimensional environments (Humanoid, HalfCheetah, and Lunar Lander) (Brockman et al., 2016; Bellemare et al., 2013; Todorov et al., 2012).
Dataset Splits | No | The paper does not specify explicit train/validation/test splits (e.g., percentages or exact counts) needed for reproducibility, nor does it reference standard splits for the environments used; it only makes general mention of expert demonstrations and collected experience.
Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU types, or memory sizes used for running the experiments.
Software Dependencies | No | The paper mentions algorithms and frameworks such as Adam, deep Q-learning, and soft actor-critic, but it does not provide version numbers for any software libraries or dependencies (e.g., Python, PyTorch, or TensorFlow).
Experiment Setup | Yes | For Lunar Lander, we set λsamp = 10^-6. For Car Racing, we set λsamp = 0.01. For all other environments, we set λsamp = 1. For Lunar Lander, we used a network architecture with two fully-connected layers containing 128 hidden units each to represent the Q network in SQIL, the policy and discriminator networks in GAIL, and the policy network in BC. An illustrative reconstruction of this setup appears below the table.
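As a reference for the Pseudocode row, below is a minimal, hedged Python sketch of the training loop described in Algorithm 1 (SQIL): expert demonstrations are stored with a constant reward of +1, newly sampled transitions with a constant reward of 0, and a soft Q-learning update is applied to balanced mini-batches drawn from both buffers. The `sqil_train` function, the `(state, action, next_state, done)` demonstration format, and the `agent.act` / `agent.soft_q_update` interfaces are illustrative assumptions, not the authors' released code; the environment is assumed to follow the classic Gym `reset`/`step` API.

```python
import random
from collections import deque

def sqil_train(env, agent, demonstrations, num_steps=100_000, batch_size=32):
    """Hedged sketch of SQIL (Algorithm 1): soft Q-learning with constant rewards.

    `demonstrations` is assumed to be a list of (state, action, next_state, done)
    tuples with at least `batch_size` entries; `agent` is assumed to expose
    `act(obs)` and `soft_q_update(batch)`. Both are illustrative stand-ins.
    """
    # Demonstration buffer: every expert transition gets a constant reward of +1.
    demo_buffer = [(s, a, 1.0, s2, done) for (s, a, s2, done) in demonstrations]
    # Interaction buffer: every newly sampled transition gets a constant reward of 0.
    samp_buffer = deque(maxlen=50_000)

    obs = env.reset()
    for _ in range(num_steps):
        action = agent.act(obs)                  # sample from the soft Q-derived policy
        next_obs, _, done, _ = env.step(action)  # the environment reward is discarded
        samp_buffer.append((obs, action, 0.0, next_obs, done))
        obs = env.reset() if done else next_obs

        if len(samp_buffer) >= batch_size:
            # Balanced mini-batch: half demonstrations (r = 1), half sampled (r = 0).
            batch = (random.sample(demo_buffer, batch_size)
                     + random.sample(list(samp_buffer), batch_size))
            agent.soft_q_update(batch)           # squared soft Bellman error step
    return agent
```

The constant rewards are the whole trick: rewarding demonstrated transitions with 1 and everything else with 0 pushes the learned policy back toward demonstrated states without ever learning a reward function.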
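For the Experiment Setup row, the following PyTorch sketch illustrates the reported Lunar Lander architecture (two fully-connected layers of 128 hidden units) and collects the reported λsamp values in one place. The paper does not release code, so the class name, activation choice, and the default observation/action dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# λsamp weights the sampled-experience term in the squared soft Bellman error.
# Values as reported in the paper's experiment setup:
LAMBDA_SAMP = {"LunarLander": 1e-6, "CarRacing": 0.01, "default": 1.0}

class LunarLanderQNetwork(nn.Module):
    """Two fully-connected layers of 128 hidden units each, as reported for Lunar Lander.

    Illustrative reconstruction only; ReLU activations and the 8-dimensional
    observation / 4-action defaults are assumptions, not taken from the paper.
    """
    def __init__(self, obs_dim: int = 8, num_actions: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_actions),  # one soft Q-value per discrete action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)
```

According to the paper, the same two-layer, 128-unit architecture is reused for the GAIL policy and discriminator networks and the BC policy network in the Lunar Lander comparison.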