Optimal Transport for Offline Imitation Learning

Authors: Yicheng Luo, Zhengyao Jiang, Samuel Cohen, Edward Grefenstette, Marc Peter Deisenroth

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On D4RL benchmarks, we show that OTR with a single demonstration can consistently match the performance of offline RL with ground-truth rewards. Empirical evaluations on the D4RL (Fu et al., 2021) datasets demonstrate that OTR recovers the performance of offline RL methods with ground-truth rewards with only a single demonstration.
Researcher Affiliation | Collaboration | Yicheng Luo (University College London), Zhengyao Jiang (University College London), Samuel Cohen (University College London), Edward Grefenstette (University College London & Cohere), Marc Peter Deisenroth (University College London)
Pseudocode | Yes | The pseudo-code for our approach is given in algorithm 1.
Open Source Code | Yes | Code is available at https://github.com/ethanluoyc/optimal_transport_reward
Open Datasets | Yes | Empirical evaluations on the D4RL (Fu et al., 2021) datasets demonstrate that OTR recovers the performance of offline RL methods with ground-truth rewards with only a single demonstration.
Dataset Splits | No | The paper describes learning and evaluating policies on fixed offline datasets, but it does not specify explicit train/validation/test splits with percentages or sample counts.
Hardware Specification | Yes | Runtime measured on halfcheetah-medium-v2 with an NVIDIA GeForce RTX 3080 GPU.
Software Dependencies | No | The paper mentions software such as JAX, OTT-JAX, and Acme, but it does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Table 4 lists the hyperparameters used by OTR and IQL on the locomotion datasets. For AntMaze and Adroit, unless otherwise specified in Table 5 or Table 6, the hyperparameters follow those used on the locomotion datasets. Shared: discount 0.99; hidden layers (256, 256); no dropout; orthogonal network initialization. IQL: Adam optimizer; policy learning rate 3e-4 with cosine decay to 0; critic learning rate 3e-4; value learning rate 3e-4; target network update rate 5e-3; temperature 3.0; expectile 0.7. OTR: episode length T = 1000; cosine cost function; squashing function s(r) = 5.0 exp(5.0 T r / |A|).
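
To make the experiment-setup row concrete, below is a minimal, self-contained JAX sketch of the OT-based reward labelling that these hyperparameters parameterize. It is not the authors' released implementation (which builds on OTT-JAX); the hand-rolled Sinkhorn solver, the use of raw states as features, and the epsilon, n_iters, episode_len, and act_dim values are illustrative assumptions, while the cosine cost and the squashing s(r) = 5.0 exp(5.0 T r / |A|) mirror the table above.

import jax
import jax.numpy as jnp

def cosine_cost(x, y):
    # Pairwise cosine distance between agent and expert state features.
    x = x / (jnp.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    y = y / (jnp.linalg.norm(y, axis=-1, keepdims=True) + 1e-8)
    return 1.0 - x @ y.T  # shape [T_agent, T_expert]

def sinkhorn_plan(cost, epsilon=0.01, n_iters=100):
    # Entropy-regularised OT plan between uniform marginals (log-domain Sinkhorn).
    n, m = cost.shape
    log_a, log_b = -jnp.log(n) * jnp.ones(n), -jnp.log(m) * jnp.ones(m)
    f, g = jnp.zeros(n), jnp.zeros(m)
    for _ in range(n_iters):
        f = -epsilon * jax.nn.logsumexp((g[None, :] - cost) / epsilon + log_b[None, :], axis=1)
        g = -epsilon * jax.nn.logsumexp((f[:, None] - cost) / epsilon + log_a[:, None], axis=0)
    return jnp.exp((f[:, None] + g[None, :] - cost) / epsilon + log_a[:, None] + log_b[None, :])

def otr_rewards(agent_states, expert_states, episode_len=1000, act_dim=6):
    # Per-step reward: negative transport cost assigned to each agent timestep,
    # then the squashing s(r) = 5.0 exp(5.0 T r / |A|) from the table above.
    cost = cosine_cost(agent_states, expert_states)
    plan = sinkhorn_plan(cost)
    raw = -jnp.sum(plan * cost, axis=1)
    return 5.0 * jnp.exp(5.0 * episode_len * raw / act_dim)

In this reading, every unlabelled D4RL trajectory is relabelled against the single expert demonstration and the resulting rewards are handed to an off-the-shelf offline RL learner (IQL with the settings listed above).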