SEABO: A Simple Search-Based Method for Offline Imitation Learning

Authors: Jiafei Lyu, Xiaoteng Ma, Le Wan, Runze Liu, Xiu Li, Zongqing Lu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on a variety of D4RL datasets indicate that SEABO can achieve competitive performance to offline RL algorithms with ground-truth rewards, given only a single expert trajectory, and can outperform prior reward learning and offline IL methods across many tasks. Moreover, we demonstrate that SEABO also works well if the expert demonstrations contain only observations.
Researcher Affiliation | Collaboration | 1 Tsinghua Shenzhen International Graduate School, Tsinghua University; 2 Department of Automation, Tsinghua University; 3 IEG, Tencent; 4 School of Computer Science, Peking University
Pseudocode | Yes (a rough pipeline sketch follows the table) | We name the resulting method SEABO, and list its pseudo-code in Algorithm 1.
Open Source Code | Yes | Our code is publicly available at https://github.com/dmksjfl/SEABO.
Open Datasets | Yes (a D4RL loading sketch follows the table) | Experimental results on a variety of D4RL datasets indicate that SEABO can achieve competitive performance to offline RL algorithms with ground-truth rewards, given only a single expert trajectory, and can outperform prior reward learning and offline IL methods across many tasks. Moreover, we demonstrate that SEABO also works well if the expert demonstrations contain only observations. Our code is publicly available at https://github.com/dmksjfl/SEABO.
Dataset Splits | No | The paper uses D4RL datasets and refers to standard normalized scores, implying predefined splits may be used. However, it does not explicitly state train/validation/test split percentages, sample counts, or the methodology for partitioning the data into these sets; it only mentions running experiments over five different random seeds.
Hardware Specification | Yes | CPU: AMD EPYC 7452; GPU: RTX 3090 (x8); Memory: 288 GB
Software Dependencies | Yes (a query-API sketch follows the table) | In SEABO, we use the KD-tree implementation from the scipy library (Virtanen et al., 2020), i.e., scipy.spatial.KDTree. We set the number of nearest neighbors N = 1 and keep the other default hyperparameters of the KD-tree. Note that we can directly get the desired distance by querying the KD-tree. For Ball-tree, we use its implementation in the scikit-learn package (Pedregosa et al., 2011), i.e., sklearn.neighbors.BallTree, and also keep its original hyperparameters unchanged. For HNSW, we use its implementation in hnswlib. We follow the suggested hyperparameter setting on its GitHub page and set ef_construction=200 (which defines a construction time/accuracy trade-off) and M=16 (which defines the maximum number of outgoing connections in the graph). All of these search algorithms adopt the Euclidean distance as the distance measurement. In our experiments, we use MuJoCo 2.0 (Todorov et al., 2012) with Gym version 0.18.3 and PyTorch (Paszke et al., 2019) version 1.8. We use the normalized score metric recommended in the D4RL paper (Fu et al., 2020), where 0 corresponds to a random policy and 100 corresponds to an expert policy.
Experiment Setup | Yes (a squashing-function sketch follows the table) | We conduct experiments on 9 MuJoCo locomotion -v2 medium-level datasets, 6 AntMaze -v0 datasets, and 8 Adroit -v0 datasets, yielding a total of 23 tasks. We list the hyperparameter setup for IQL and TD3+BC on the MuJoCo locomotion tasks in Table 8. We keep the hyperparameter setup of the base offline RL algorithms unchanged for both IQL and TD3+BC. For IQL, we do not rescale the rewards in the datasets by 1000 / (max return - min return), as we have an additional hyperparameter α to control the reward scale. In practice, we find minor performance differences if we rescale the rewards. We generally utilize the same formula of the squashing function for most of the datasets, except that we set β = 1 in hopper-medium-replay-v2, and α = 10, β = 0.1 in hopper-medium-expert-v2 for better performance. Note that using α = 1, β = 0.5 on these tasks can also produce good performance (e.g., setting α = 1, β = 0.5 on hopper-medium-replay-v2 leads to an average performance of 87.2, still outperforming strong baselines like OTR), while slightly modifying the hyperparameter setup can result in better performance. We divide the scaled distance by the action dimension of the task to strike a balance between different tasks (as we use one set of hyperparameters); this is also adopted in the PWIL paper (Dadashi et al., 2021). For TD3+BC, we use the same type of squashing function as IQL on the locomotion tasks, with α = 1, β = 0.5, except that we use α = 10 for walker2d-medium-v2 and walker2d-medium-replay-v2 for slightly better performance. We use the official implementation of TD3+BC (https://github.com/sfujim/TD3_BC) and adopt the PyTorch (Paszke et al., 2019) version of IQL for evaluation.
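
Pseudocode: Algorithm 1 itself is not reproduced on this page. As a rough orientation, the following is a minimal Python sketch of the search-then-relabel loop the quoted text describes; the function name, the use of the concatenated (s, a, s') transition as the query key, and the generic squash callable are our assumptions, and the actual squashing function is discussed under Experiment Setup.

```python
import numpy as np
from scipy.spatial import KDTree

def seabo_relabel(dataset, expert_transitions, squash):
    """Hypothetical sketch of SEABO's labeling step (cf. Algorithm 1 in the paper).

    dataset:            dict with 'observations', 'actions', 'next_observations'
    expert_transitions: flattened (s, a, s') tuples from a single expert trajectory
    squash:             callable mapping nearest-neighbor distances to scalar rewards
    """
    # Build a KD-tree over the expert transitions.
    tree = KDTree(expert_transitions)

    # Query key for every unlabeled transition: the concatenated (s, a, s').
    queries = np.concatenate(
        [dataset["observations"], dataset["actions"], dataset["next_observations"]],
        axis=1,
    )

    # Single nearest neighbor (N = 1); the query directly returns Euclidean distances.
    distances, _ = tree.query(queries, k=1)

    # Squash distances into rewards; the relabeled dataset can then be handed to any
    # off-the-shelf offline RL algorithm (IQL or TD3+BC in the paper).
    dataset["rewards"] = squash(distances)
    return dataset
```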
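
Open Datasets: for reference, a minimal sketch of loading one of the D4RL datasets with the standard d4rl package (the task name is an arbitrary example):

```python
import gym
import d4rl  # registers the D4RL environments and datasets with gym

# Any of the locomotion / AntMaze / Adroit task names used in the paper works here.
env = gym.make("hopper-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # observations, actions, next_observations, rewards, terminals

# D4RL's normalized score: 0 corresponds to a random policy, 100 to an expert policy.
score = env.get_normalized_score(3000.0) * 100.0  # 3000.0 is a placeholder raw return
```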
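
Software Dependencies: the three search backends quoted above differ slightly in their query APIs; the sketch below shows how each would return a single Euclidean nearest-neighbor distance under the stated settings (the data arrays are placeholders):

```python
import numpy as np
from scipy.spatial import KDTree
from sklearn.neighbors import BallTree
import hnswlib

expert = np.random.randn(1000, 23).astype(np.float32)    # placeholder expert transitions
queries = np.random.randn(10000, 23).astype(np.float32)  # placeholder unlabeled transitions

# KD-tree (scipy), default hyperparameters, N = 1 neighbor; distances are Euclidean.
kd_dist, _ = KDTree(expert).query(queries, k=1)

# Ball-tree (scikit-learn), default hyperparameters.
ball_dist, _ = BallTree(expert).query(queries, k=1)

# HNSW (hnswlib) with the quoted ef_construction=200 and M=16.
index = hnswlib.Index(space="l2", dim=expert.shape[1])
index.init_index(max_elements=expert.shape[0], ef_construction=200, M=16)
index.add_items(expert)
_, hnsw_sq_dist = index.knn_query(queries, k=1)
hnsw_dist = np.sqrt(hnsw_sq_dist)  # hnswlib's "l2" space returns squared distances
```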
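
Experiment Setup: the squashing function itself is not written out in the quoted text; under one plausible reading (an exponential of the negatively scaled nearest-neighbor distance divided by the action dimension), the defaults and per-task overrides mentioned above would look like:

```python
import numpy as np

def squash(distance, action_dim, alpha=1.0, beta=0.5):
    # Assumed functional form; alpha and beta are the hyperparameters named in the
    # text, and the division by action_dim follows the description above.
    return alpha * np.exp(-beta * distance / action_dim)

# Deviations from the default alpha=1, beta=0.5 quoted in the text.
IQL_OVERRIDES = {
    "hopper-medium-replay-v2": dict(beta=1.0),
    "hopper-medium-expert-v2": dict(alpha=10.0, beta=0.1),
}
TD3BC_OVERRIDES = {
    "walker2d-medium-v2": dict(alpha=10.0),
    "walker2d-medium-replay-v2": dict(alpha=10.0),
}
```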