Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learning Video-Conditioned Policy on Unlabelled Data with Joint Embedding Predictive Transformer

Authors: Hao Luo, Zongqing Lu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on a series of simulated visual control tasks evaluate that JEPT can effectively leverage the mixture dataset to learn a generalizable policy. JEPT outperforms baselines in the tasks without action-labeled data and unseen tasks. We also experimentally reveal the potential of JEPT as a simple visual priors injection approach to enhance the video-conditioned policy."
Researcher Affiliation | Collaboration | Hao Luo¹ and Zongqing Lu¹,²; ¹School of Computer Science, Peking University; ²Beijing Academy of Artificial Intelligence
Pseudocode | Yes | Algorithm 1: Joint Embedding Predictive Transformer

 1: Input: Mixture dataset Ddemo ∪ Dvid
 2: Initialize: JEPT components Ψobs, Ψprt, Ψpred, Γobs, Γact with random weights
 3: for e = 1, 2, ... do
 4:   for d ∈ Ddemo ∪ Dvid do    ▷ Optimizing Joint Embedding Encoder
 5:     Encode V and O into Eprt and (h_{t-k+1}, ..., h_t) as Equations 3 and 4
 6:     Predict the joint embedding tokens and action tokens as Equations 5 and 6
 7:     Compute Ltotal as Equation 9
 8:     Update Ψobs with Ltotal
 9:   end for
10:   for d ∈ Ddemo ∪ Dvid do    ▷ Optimizing other components
11:     Repeat steps in lines 5-7
12:     Update Ψprt, Ψpred, Γobs, Γact with Ltotal
13:   end for
14: end for
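The two-phase update schedule of Algorithm 1 can be sketched in plain Python. This is a control-flow sketch only: the component names mirror the paper's symbols, while `compute_loss` and `update` are hypothetical stand-ins for the paper's Equations 3-6 and 9 and the optimizer step, not the authors' implementation.

```python
# Sketch of Algorithm 1's alternating update schedule (assumed structure,
# not the authors' code). Each epoch makes two passes over the mixture
# dataset: one updating only the joint embedding encoder, one updating
# the remaining components, both with the same total loss.

def train_jept(mixture_dataset, epochs, compute_loss, update):
    """Alternate between updating psi_obs and the other JEPT components."""
    encoder = ["psi_obs"]  # joint embedding encoder (Psi_obs)
    others = ["psi_prt", "psi_pred", "gamma_obs", "gamma_act"]
    for _ in range(epochs):
        # Phase 1: optimize the joint embedding encoder with L_total.
        for sample in mixture_dataset:
            loss = compute_loss(sample)  # stand-in for Eqs. 3-6 and 9
            update(encoder, loss)
        # Phase 2: optimize all other components with L_total.
        for sample in mixture_dataset:
            loss = compute_loss(sample)
            update(others, loss)
```

Keeping the encoder and the remaining components on separate passes matches the algorithm's split between "Optimizing Joint Embedding Encoder" (lines 4-9) and "Optimizing other components" (lines 10-13).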
Open Source Code | No | The paper mentions using a third-party codebase: "We build our framework based on PyTorch (Paszke et al., 2019) and use the implementations of Transformer modules from the codebase x-transformers (https://github.com/lucidrains/x-transformers)." However, it does not explicitly state that the authors release their own implementation of JEPT or provide a link to their specific code.
Open Datasets | Yes | "To evaluate the effectiveness of JEPT, we conduct experiments on Meta-World (Yu et al., 2020a) and Robosuite (Zhu et al., 2020)."
Dataset Splits | Yes | "We select 18 tasks for the Meta-World task set and 15 tasks for the Robosuite task set. In order to construct the mixture dataset Ddemo ∪ Dvid and evaluate on the unseen tasks, we split each task set into three subsets: (1) Tdemo: the tasks with the prompt videos and the paired expert demonstrations, (2) Tvid: the tasks with the prompt videos and the paired expert videos, and (3) Tunseen: the tasks with merely the prompt videos. ... In our Meta-World experiments, we divide the 18 tasks into three subsets: Tdemo, Tvid and Tunseen, respectively containing 8, 6 and 4 tasks. ... We split the 15 tasks into three subsets: Tdemo, Tvid, and Tunseen, containing 6, 5, and 4 tasks, respectively."
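The three-way task partition described above is straightforward to reproduce. The sketch below uses placeholder task names and a deterministic slice; the 8/6/4 counts follow the reported Meta-World setup, but the actual task assignment in the paper is not specified here.

```python
# Hypothetical helper reproducing the T_demo / T_vid / T_unseen partition.
# Task names are placeholders; only the split sizes come from the paper.

def split_tasks(tasks, n_demo, n_vid, n_unseen):
    """Partition a task list into the three subsets used by JEPT."""
    assert n_demo + n_vid + n_unseen == len(tasks), "sizes must cover all tasks"
    t_demo = tasks[:n_demo]               # prompt videos + expert demonstrations
    t_vid = tasks[n_demo:n_demo + n_vid]  # prompt videos + expert videos (no actions)
    t_unseen = tasks[n_demo + n_vid:]     # prompt videos only, held out for evaluation
    return t_demo, t_vid, t_unseen

# Meta-World setup: 18 tasks split 8 / 6 / 4.
metaworld_tasks = [f"task-{i}" for i in range(18)]
t_demo, t_vid, t_unseen = split_tasks(metaworld_tasks, 8, 6, 4)
```

The same helper covers the Robosuite setup with `split_tasks(robosuite_tasks, 6, 5, 4)` over its 15 tasks.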
Hardware Specification | Yes | "We use the 3090 Nvidia GPU and i9-12900K CPU for the JEPT training and testing."
Software Dependencies | No | The paper states: "We build our framework based on PyTorch (Paszke et al., 2019) and use the implementations of Transformer modules from the codebase x-transformers." While PyTorch is mentioned, no specific version number is provided for it, nor for the x-transformers codebase.
Experiment Setup | Yes | "The hyperparameters for the Perceiver-IOs and other modules are detailed in Table 5. During training, Eprt is initially excluded in the causal predictor as a warm-up strategy. Furthermore, the hyperparameters for the training process are listed in Table 6."