Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Zero-Shot Offline Imitation Learning via Optimal Transport

Authors: Thomas Rupf, Marco Bagatella, Nico Gürtler, Jonas Frey, Georg Martius

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The resulting method can learn from offline, suboptimal data, and is capable of non-myopic, zero-shot imitation, as we demonstrate in complex, continuous benchmarks. ... This section constitutes an extensive empirical evaluation of ZILOT for zero-shot IL. We first describe our experimental settings, and then present qualitative and quantitative results, as well as an ablation study. We consider a selection of 30 tasks defined over 5 environments, as summarized below and described in detail in appendices A and C."
Researcher Affiliation | Academia | "1 Universität Tübingen, Tübingen, Germany; 2 MPI for Intelligent Systems, Tübingen, Germany; 3 ETH, Zürich, Switzerland. Correspondence to: Thomas Rupf <EMAIL>."
Pseudocode | Yes | "Algorithm 1: OT cost computation for ZILOT"
Open Source Code | Yes | "The code is available at https://github.com/martius-lab/zilot."
Open Datasets | Yes | "We train our version of TD-MPC2 offline with the datasets detailed in Table 5 for 600k steps. ... Table 5 (environment description) lists, per environment, the training dataset and number of transitions, e.g. fetch push: WGCSL (Yang et al., 2022) (expert+random); pointmaze medium: D4RL (Fu et al., 2021) (expert)."
Dataset Splits | No | "The problem setting assumes access to two datasets: D_β = {(s_0^i, a_0^i, s_1^i, a_1^i, ...)}_{i=1}^{|D_β|}, consisting of full state-action trajectories from M, and D_E = {(g_0^i, g_1^i, ...)}_{i=1}^{|D_E|}, containing demonstrations of an expert in the form of goal sequences, not necessarily abiding by the dynamics of M." (No explicit train/test/validation splits are mentioned for these datasets; D_β is used for training the world model and value function, and D_E for testing zero-shot imitation.)
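To make the two-dataset setting above concrete, here is a minimal illustrative sketch of the containers involved; the class and variable names (`Trajectory`, `D_beta`, `D_E`) are our own and do not appear in the paper or code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Trajectory:
    """One behavior trajectory from D_beta: full states and actions."""
    states: np.ndarray   # shape (T + 1, state_dim): s_0, s_1, ..., s_T
    actions: np.ndarray  # shape (T, action_dim):    a_0, a_1, ..., a_{T-1}

# D_beta: offline state-action trajectories (trains world model + value fn).
D_beta: list[Trajectory] = []

# D_E: expert demonstrations given only as goal sequences (g_0, g_1, ...),
# each an array of shape (K, goal_dim); no train/val/test split is described.
D_E: list[np.ndarray] = []
```

The asymmetry is the point of the setting: D_β carries dynamics information (actions included), while D_E carries only goals, which need not be reachable in one step under the dynamics of M.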
Hardware Specification | Yes | "ZILOT runs at 2 to 4 Hz on an Nvidia RTX 4090 GPU, depending on the size of H and the size of the OT problem. ... Training took about 8 to 9 hours on a single Nvidia A100 GPU."
Software Dependencies | No | The paper mentions several algorithms and frameworks, such as TD-MPC2 (Hansen et al., 2024) and iCEM (Pinneri et al., 2020), and notes that its Sinkhorn implementation is inspired by Flamary et al. (2021) and Cuturi et al. (2022), but it does not specify version numbers for any software dependencies (e.g., Python, PyTorch, or specific library versions).
Experiment Setup | Yes | "For training we adopt all TD-MPC2 hyperparameters directly (see Table 7). ... Table 7: TD-MPC2 hyperparameters. ... Table 8: hyperparameters used for iCEM (Pinneri et al., 2020). ... We run our Sinkhorn algorithm for r = 500 iterations with a regularization factor of ϵ = 0.02. ... Table 6: environment details, giving the goal abstraction ϕ, metric h, threshold ϵ, horizon H, maximum episode length T_max, and discount factor γ used for each environment."
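For readers unfamiliar with the solver settings quoted above, the following is a minimal sketch of entropic-regularized optimal transport via Sinkhorn iterations, using the reported settings (r = 500 iterations, regularization ϵ = 0.02). It assumes uniform marginals and a plain-domain update; the actual ZILOT implementation (inspired by Flamary et al., 2021 and Cuturi et al., 2022) may differ, e.g. by using log-domain updates for numerical stability.

```python
import numpy as np

def sinkhorn_cost(cost, eps=0.02, n_iters=500):
    """Approximate OT cost <P, C> between two uniform discrete measures.

    cost: (n, m) pairwise cost matrix C.
    eps:  entropic regularization strength.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform source marginal
    b = np.full(m, 1.0 / m)          # uniform target marginal
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):         # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # transport plan
    return float((P * cost).sum())
```

With small ϵ the plan concentrates near the unregularized optimum, e.g. for a 2x2 cost matrix with zeros on the diagonal the returned cost is close to 0; note that small ϵ combined with large costs can underflow the kernel, which is why production implementations prefer log-domain Sinkhorn.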