Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Watch Less, Do More: Implicit Skill Discovery for Video-Conditioned Policy

Authors: Jiangxing Wang, Zongqing Lu

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type Experimental To evaluate our method, we perform extensive experiments in various environments and show that our algorithm substantially outperforms baselines (up to 2x) in terms of compositional generalization ability. We propose a practical implementation of our algorithm and perform empirical evaluations in Franka Kitchen (Gupta et al., 2020) and Meta World (Yu et al., 2020) to demonstrate the effectiveness of WL-DM. The experimental results indicate that WL-DM achieves up to 2x better compositional generalization ability compared to baselines.
Researcher Affiliation Academia Jiangxing Wang School of Computer Science Peking University EMAIL Zongqing Lu School of Computer Science Peking University, BAAI EMAIL
Pseudocode Yes The pseudocode of our algorithm is summarized in Algorithm 1. It is worth noting that the skill decoder f^skill_ψ, the prior video encoder g^prior_φ, and the prior action decoder f^prior_θ are used only during training; we keep only the video encoder g_φ and the action decoder f_θ for execution. Algorithm 1 WL-DM
Open Source Code No No explicit statement or link for open-source code release of the described methodology was found. The paper mentions implementing their algorithm based on the codebase of C-bet (Cui et al., 2022) but does not state that their own code is open-source or available.
Open Datasets Yes To validate our method, we conduct empirical evaluations on two different robotic environments, Franka Kitchen (Gupta et al., 2020) and Meta World (Yu et al., 2020). The dataset from the original paper (Gupta et al., 2020) contains 566 trajectories corresponding to 24 different task combinations.
Dataset Splits Yes To evaluate the one-shot imitation learning ability, we split the dataset into a training dataset and a test dataset, where the training dataset contains 17 different task combinations and the test dataset contains 7 different task combinations, and there is no overlap of task combinations between the training set and the test set. During testing, we sample 3 different video demonstrations for each task combination, run the evaluation 10 times, and report the average performance.
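The split described above can be sketched in Python. This is an illustrative reconstruction of the evaluation protocol, not the authors' code: the combination IDs, function name, and seed are our own placeholders, while the counts (24 total task combinations, 17 train, 7 test, disjoint) come from the quoted passages.

```python
import random

# Hypothetical IDs standing in for the 24 task combinations in the
# Franka Kitchen dataset (566 trajectories); the names are illustrative.
all_combinations = [f"combo_{i:02d}" for i in range(24)]

def split_task_combinations(combos, n_train=17, n_test=7, seed=0):
    """Split task combinations into disjoint train/test sets (17 vs. 7)."""
    rng = random.Random(seed)
    shuffled = combos[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    # The paper states there is no overlap of task combinations
    # between the training set and the test set.
    assert not set(train) & set(test)
    return train, test

train_combos, test_combos = split_task_combinations(all_combinations)
```

Per the quoted protocol, testing would then sample 3 video demonstrations for each held-out combination and average over 10 evaluation runs.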
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies No The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions) needed to replicate the experiment, only mentioning that the implementation is based on the C-bet codebase.
Experiment Setup Yes For all experiments, we set the learning rate to be 3×10⁻⁴ and set the window size for the trajectory to be 20... For the Franka Kitchen environment (Gupta et al., 2020), we use decoders with 3 layers and 3 heads and set the hidden dimension to be 60... We train all methods for 10 epochs. For WL-DM, α1 is fixed to be 1×10⁻² and α2 is fixed to be 1×10⁻¹ during the training process. For the Meta World environment (Yu et al., 2020), we use decoders with 6 layers and 6 heads and set the hidden dimension to be 120... We train all methods for 30 epochs. For WL-DM, α1 is set to be 0 in the beginning and fixed to be 1×10⁻³ after 10 epochs, and α2 is fixed to be 10 during the training process.
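The hyperparameters quoted above can be collected into one configuration sketch. The dictionary layout and key names are our own (the paper does not release code), but the values are transcribed from the quoted setup; elided settings marked "..." in the quote are omitted here rather than guessed.

```python
# Hyperparameters from the paper's experiment setup; key names are
# illustrative, values transcribed from the quoted text.
CONFIGS = {
    "franka_kitchen": {
        "learning_rate": 3e-4,
        "window_size": 20,
        "decoder_layers": 3,
        "decoder_heads": 3,
        "hidden_dim": 60,
        "epochs": 10,
        "alpha1": 1e-2,   # fixed throughout training
        "alpha2": 1e-1,   # fixed throughout training
    },
    "meta_world": {
        "learning_rate": 3e-4,
        "window_size": 20,
        "decoder_layers": 6,
        "decoder_heads": 6,
        "hidden_dim": 120,
        "epochs": 30,
        # alpha1 is scheduled: 0 for the first 10 epochs, then 1e-3.
        "alpha1_schedule": {"before_epoch_10": 0.0, "after_epoch_10": 1e-3},
        "alpha2": 10.0,   # fixed throughout training
    },
}
```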