TRAIL: Near-Optimal Imitation Learning with Suboptimal Data

Authors: Mengjiao Yang, Sergey Levine, Ofir Nachum

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the practicality of our objective through experiments on a set of navigation and locomotion tasks. Our results verify the benefits suggested by our theory and show that TRAIL is able to improve baseline imitation learning by up to 4x in performance." (Section 5, Experimental Evaluation:) "We now evaluate TRAIL on a set of navigation and locomotion tasks (Figure 2). Our evaluation is designed to study how well TRAIL can improve imitation learning with limited expert data by leveraging available suboptimal offline data."
Researcher Affiliation | Collaboration | Mengjiao Yang (UC Berkeley, Google Brain; sherryy@google.com); Sergey Levine (UC Berkeley, Google Brain); Ofir Nachum (Google Brain)
Pseudocode | No | The paper describes the algorithm steps in text and flowcharts (Figure 1), but does not provide structured pseudocode or an algorithm block.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository.
Open Datasets | Yes | "We include the challenging Ant Maze navigation tasks from D4RL (Fu et al., 2020) and low (1-DoF) to high (21-DoF) dimensional locomotion tasks from the DeepMind Control Suite (Tassa et al., 2018). For the suboptimal data in Ant Maze, we use the full D4RL datasets antmaze-large-diverse-v0, antmaze-medium-play-v0, antmaze-medium-diverse-v0, and antmaze-medium-play-v0. For the DeepMind Control Suite set of tasks, we use the RL Unplugged (Gulcehre et al., 2020) dataset."
Dataset Splits | No | The paper describes how expert and suboptimal datasets are used for training and pretraining, e.g., "we imitate from 1% or 2.5% of the expert datasets". However, it does not specify explicit training/validation/test splits of these datasets or mention a distinct validation set used for hyperparameter tuning.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions software components like the "Adam optimizer" and the "Swish (Ramachandran et al., 2017) activation function", but it does not specify versions for programming languages, libraries, or other software dependencies necessary for replication.
Experiment Setup | Yes | "During pretraining, we use the Adam optimizer with learning rate 0.0003 for 200k iterations with batch size 256 for all methods that require pretraining. Behavioral cloning for all methods including vanilla BC is trained with learning rate 0.0001 for 1M iterations."
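The reported hyperparameters can be collected into a minimal sketch. Only the values below (learning rates, iteration counts, batch size) come from the paper; the Adam update shown is a standard reference implementation (Kingma & Ba defaults), not the authors' code, and the quadratic toy objective is purely illustrative.

```python
# Hyperparameters as reported in the paper's experiment setup.
PRETRAIN = {"optimizer": "Adam", "lr": 3e-4, "iterations": 200_000, "batch_size": 256}
BC = {"optimizer": "Adam", "lr": 1e-4, "iterations": 1_000_000}


def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (standard defaults; illustrative only)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (v_hat ** 0.5 + eps)
    return param, m, v


# Toy usage: minimize f(x) = x^2 with the pretraining learning rate.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2 * x
    x, m, v = adam_step(x, grad, m, v, t, lr=PRETRAIN["lr"])
```

With a consistent gradient sign, Adam's effective step is roughly the learning rate, so 1,000 steps at lr 3e-4 move the parameter by about 0.3, illustrating why the paper's 200k/1M iteration budgets are plausible at these small learning rates.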