Distance Weighted Supervised Learning for Offline Interaction Data

Authors: Joey Hejna, Jensen Gao, Dorsa Sadigh

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across all datasets we test, DWSL empirically maintains behavior cloning as a lower bound while still exhibiting policy improvement. In high-dimensional image domains, DWSL surpasses the performance of both prior goal-conditioned IL and RL algorithms. Visualizations and code can be found at https://sites.google.com/view/dwsl/home.
Researcher Affiliation | Academia | Department of Computer Science, Stanford University. Correspondence to: Joey Hejna <jhejna@cs.stanford.edu>.
Pseudocode | Yes | Appendix B. DWSL Algorithm. Algorithm 1 Distance Weighted Supervised Learning. (A hedged sketch of the weighted-cloning update appears below the table.)
Open Source Code | Yes | Visualizations and code can be found at https://sites.google.com/view/dwsl/home.
Open Datasets | Yes | Gym robotics environments from Plappert et al. (2018), including Fetch and Hand... The Franka Kitchen dataset from Lynch et al. (2019)... We take the Square and Can datasets from Mandlekar et al. (2021)...
Dataset Splits | Yes | At test time, we sample goals from the end of a held-out set of validation trajectories... We use the same train/test split as done in Cui et al. (2022), and use the final states of demo trajectories as goal states. We use the same 90% random, 10% expert split from Ma et al. (2022).
Hardware Specification | No | The paper implies GPU training through its discussion of image encoders and network architectures (e.g., 'We train encoders with gradients from the policy...'), but it does not specify particular GPU models, CPU models, or other hardware used for the experiments.
Software Dependencies | No | The paper does not provide version numbers for software dependencies such as Python, PyTorch, TensorFlow, or other libraries used in the implementation.
Experiment Setup | Yes | We use the Adam optimizer for all experiments. For all methods, we relabel goals for each state by sampling uniformly from all future states in its trajectory. For baselines that use discount factors, we use γ = 0.98 for Gym Robotics environments and γ = 0.99 for the remaining environments. We ran four seeds for all state experiments and three seeds for all image experiments. Algorithm-specific hyperparameters are listed in Table 9; max clip refers to the maximum exponentiated advantage weight we clip to. Hyperparameters common to all methods are listed in Table 10. (Sketches of the goal relabeling and the clipped advantage weighting described here appear below the table.)
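
The goal-relabeling scheme quoted in the setup row, sampling a goal uniformly from all future states in the same trajectory, can be sketched as follows. The trajectory layout, the assumption that goals are simply future states, and the function name are illustrative guesses, not the paper's implementation.

```python
import numpy as np

def relabel_goals(states, actions, rng=None):
    """Hindsight goal relabeling: for each state, sample a goal uniformly
    from all future states in its trajectory, as described in the setup row.

    states: array of shape (T, state_dim); actions: array of shape (T, act_dim).
    Returns (state, action, goal) tuples. Treating raw future states as goals
    is an assumption; the paper's goal representation may differ per environment.
    """
    rng = rng or np.random.default_rng()
    T = len(states)
    relabeled = []
    for t in range(T - 1):
        future = rng.integers(t + 1, T)  # uniform over indices t+1, ..., T-1
        relabeled.append((states[t], actions[t], states[future]))
    return relabeled
```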
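
Algorithm 1 itself is given in Appendix B of the paper. As a rough orientation only, below is a minimal sketch of an advantage-weighted, goal-conditioned behavior-cloning update of the kind the "max clip" hyperparameter refers to. The distance model `dist_model`, the policy interface, the temperature `beta`, and the default values are assumptions for illustration; DWSL's actual learned distance estimator and exact advantage definition are in the paper and are not reproduced here.

```python
import torch

def weighted_bc_loss(policy, dist_model, states, actions, next_states, goals,
                     beta=0.05, max_clip=10.0):
    """Advantage-weighted goal-conditioned behavior cloning (sketch).

    dist_model(s, g) is assumed to return an estimated number of steps from
    state s to goal g. beta and max_clip stand in for the temperature and the
    "max clip" value from Table 9; the numbers here are placeholders.
    """
    with torch.no_grad():
        # One simple proxy for the advantage of the dataset action: how much
        # the estimated distance-to-goal shrinks after taking it (the paper's
        # exact advantage definition may differ).
        advantage = dist_model(states, goals) - (1.0 + dist_model(next_states, goals))
        # Exponentiate and clip the weights, per the "max clip" description.
        weights = torch.clamp(torch.exp(advantage / beta), max=max_clip)

    # Weighted negative log-likelihood of the dataset actions under the
    # goal-conditioned policy (assumed to return a torch distribution).
    log_prob = policy(states, goals).log_prob(actions).sum(dim=-1)
    return -(weights * log_prob).mean()
```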