Optimal Transport for Offline Imitation Learning

Authors: Yicheng Luo, Zhengyao Jiang, Samuel Cohen, Edward Grefenstette, Marc Peter Deisenroth

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On D4RL benchmarks, we show that OTR with a single demonstration can consistently match the performance of offline RL with ground-truth rewards. Empirical evaluations on the D4RL (Fu et al., 2021) datasets demonstrate that OTR recovers the performance of offline RL methods with ground-truth rewards with only a single demonstration.
Researcher Affiliation | Collaboration | Yicheng Luo (University College London), Zhengyao Jiang (University College London), Samuel Cohen (University College London), Edward Grefenstette (University College London & Cohere), Marc Peter Deisenroth (University College London)
Pseudocode | Yes | The pseudo-code for our approach is given in algorithm 1.
Open Source Code | Yes | Code is available at https://github.com/ethanluoyc/optimal_transport_reward
Open Datasets | Yes | Empirical evaluations on the D4RL (Fu et al., 2021) datasets demonstrate that OTR recovers the performance of offline RL methods with ground-truth rewards with only a single demonstration.
Dataset Splits | No | The paper describes learning and evaluating policies on fixed offline datasets, but it does not specify explicit train/validation/test splits with percentages or sample counts.
Hardware Specification | Yes | Runtime measured on halfcheetah-medium-v2 with an NVIDIA GeForce RTX 3080 GPU.
Software Dependencies | No | The paper mentions software such as JAX, OTT-JAX, and Acme, but it does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Table 4 lists the hyperparameters used by OTR and IQL on the locomotion datasets. For AntMaze and Adroit, unless otherwise specified in Table 5 or Table 6, the hyperparameters follow those used on the locomotion datasets. Shared: discount 0.99; hidden layers (256, 256); no dropout; orthogonal network initialization. IQL: Adam optimizer; policy learning rate 3e-4 with cosine decay to 0; critic learning rate 3e-4; value learning rate 3e-4; target network update rate 5e-3; temperature 3.0; expectile 0.7. OTR: episode length T = 1000; cosine cost function; squashing function s(r) = 5.0 exp(5.0 T r / |A|).
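
To make the experiment-setup row concrete, below is a minimal, self-contained JAX sketch of the OT-based reward labelling that these hyperparameters parameterize. It is not the authors' released implementation (which builds on OTT-JAX); the hand-rolled Sinkhorn solver, the use of raw states as features, and the epsilon, n_iters, episode_len, and act_dim values are illustrative assumptions, while the cosine cost and the squashing s(r) = 5.0 exp(5.0 T r / |A|) mirror the table above.

import jax
import jax.numpy as jnp

def cosine_cost(x, y):
    # Pairwise cosine distance between agent and expert state features.
    x = x / (jnp.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    y = y / (jnp.linalg.norm(y, axis=-1, keepdims=True) + 1e-8)
    return 1.0 - x @ y.T  # shape [T_agent, T_expert]

def sinkhorn_plan(cost, epsilon=0.01, n_iters=100):
    # Entropy-regularised OT plan between uniform marginals (log-domain Sinkhorn).
    n, m = cost.shape
    log_a, log_b = -jnp.log(n) * jnp.ones(n), -jnp.log(m) * jnp.ones(m)
    f, g = jnp.zeros(n), jnp.zeros(m)
    for _ in range(n_iters):
        f = -epsilon * jax.nn.logsumexp((g[None, :] - cost) / epsilon + log_b[None, :], axis=1)
        g = -epsilon * jax.nn.logsumexp((f[:, None] - cost) / epsilon + log_a[:, None], axis=0)
    return jnp.exp((f[:, None] + g[None, :] - cost) / epsilon + log_a[:, None] + log_b[None, :])

def otr_rewards(agent_states, expert_states, episode_len=1000, act_dim=6):
    # Per-step reward: negative transport cost assigned to each agent timestep,
    # then the squashing s(r) = 5.0 exp(5.0 T r / |A|) from the table above.
    cost = cosine_cost(agent_states, expert_states)
    plan = sinkhorn_plan(cost)
    raw = -jnp.sum(plan * cost, axis=1)
    return 5.0 * jnp.exp(5.0 * episode_len * raw / act_dim)

In this reading, every unlabelled D4RL trajectory is relabelled against the single expert demonstration and the resulting rewards are handed to an off-the-shelf offline RL learner (IQL with the settings listed above).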