Optimal Transport for Offline Imitation Learning
Authors: Yicheng Luo, Zhengyao Jiang, Samuel Cohen, Edward Grefenstette, Marc Peter Deisenroth
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "On D4RL benchmarks, we show that OTR with a single demonstration can consistently match the performance of offline RL with ground-truth rewards." "Empirical evaluations on the D4RL (Fu et al., 2021) datasets demonstrate that OTR recovers the performance of offline RL methods with ground-truth rewards with only a single demonstration." |
| Researcher Affiliation | Collaboration | Yicheng Luo (University College London); Zhengyao Jiang (University College London); Samuel Cohen (University College London); Edward Grefenstette (University College London & Cohere); Marc Peter Deisenroth (University College London) |
| Pseudocode | Yes | The pseudo-code for our approach is given in algorithm 1. |
| Open Source Code | Yes | Code is available at https://github.com/ethanluoyc/optimal_transport_reward |
| Open Datasets | Yes | Empirical evaluations on the D4RL (Fu et al., 2021) datasets demonstrate that OTR recovers the performance of offline RL methods with ground-truth rewards with only a single demonstration. |
| Dataset Splits | No | The paper describes using fixed offline datasets for learning policies and evaluating performance, but it does not specify explicit train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | Runtime measured on halfcheetah-medium-v2 with an NVIDIA GeForce RTX 3080 GPU. |
| Software Dependencies | No | The paper mentions software like JAX, OTT-JAX, and Acme, but it does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | Table 4 lists the hyperparameters used by OTR and IQL on the locomotion datasets. For Antmaze and Adroit, unless otherwise specified by table 5 or table 6, the hyperparameters follow those used in the locomotion datasets. Shared: discount 0.99; hidden layers (256, 256); no dropout; orthogonal network initialization. IQL: Adam optimizer; policy learning rate 3e-4 with cosine decay to 0; critic learning rate 3e-4; value learning rate 3e-4; target network update rate 5e-3; temperature 3.0; expectile 0.7. OTR: episode length T = 1000; cosine cost function; squashing function s(r) = 5.0 · exp(5.0 · T · r / \|A\|). |
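To make the OTR setup in the last row concrete, the following is a minimal sketch of how OT-based rewards with a cosine cost and the squashing function s(r) = 5.0 · exp(5.0 · T · r / |A|) could be computed. It uses a plain-numpy Sinkhorn solver as a stand-in for the paper's OTT-JAX dependency; the function names, the entropic regularization value, and the assumed action dimensionality `A` are illustrative, not taken from the paper's code.

```python
import numpy as np

def cosine_cost(x, y):
    """Pairwise cosine distance between agent states x (T, d) and expert states y (N, d)."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - xn @ yn.T

def sinkhorn(C, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan between uniform marginals (simplified stand-in for OTT-JAX)."""
    T, N = C.shape
    a, b = np.ones(T) / T, np.ones(N) / N  # uniform marginals over timesteps
    K = np.exp(-C / eps)
    v = np.ones(N) / N
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def otr_rewards(agent_states, expert_states, alpha=5.0, beta=5.0, action_dim=6):
    """Per-timestep pseudo-reward r_t = -sum_j P[t, j] * C[t, j], squashed as
    s(r) = alpha * exp(beta * T * r / |A|). `action_dim` (|A|) is an assumed value."""
    C = cosine_cost(agent_states, expert_states)
    P = sinkhorn(C)
    r = -(P * C).sum(axis=1)  # negative transport cost: higher when closer to expert
    T_len = len(agent_states)
    return alpha * np.exp(beta * T_len * r / action_dim)
```

Since the transport cost is non-negative, r is non-positive and the squashed reward lands in (0, alpha], which keeps the relabeled rewards bounded for the downstream IQL learner.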