Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Latent Action Pretraining from Videos
Authors: Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we demonstrate the effectiveness of Latent Action Pretraining as a general-purpose pretraining method. Specifically, we focus on answering the following questions: Q1. How does LAPA perform when there are cross-task, cross-environment, and cross-embodiment gaps between pretraining and fine-tuning? Q2. Can LAPA learn superior priors compared to using ground-truth actions during pretraining in a multi-embodiment setting? Q3. Can we create a performant LAPA solely from raw human manipulation videos? |
| Researcher Affiliation | Collaboration | Seonghyeon Ye1 Joel Jang2 Byeongguk Jeon1 Sejune Joo1 Jianwei Yang3 Baolin Peng3 Ajay Mandlekar4 Reuben Tan3 Yu-Wei Chao4 Yuchen Lin5 Lars Liden3 Kimin Lee1 Jianfeng Gao3 Luke Zettlemoyer2 Dieter Fox2,4 Minjoon Seo1 1KAIST 2University of Washington 3Microsoft Research 4NVIDIA 5Allen Institute for AI |
| Pseudocode | No | The paper describes methods and models using text and mathematical equations, and also contains a model architecture diagram (Figure 8). However, it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will open-source the model checkpoints and code at latentactionpretraining.github.io. |
| Open Datasets | Yes | We evaluate the effectiveness of LAPA on 9 different task categories in 2 different simulation environments and 3 different real-world robotic tasks. Table 3 shows an overview of the pretraining and fine-tuning dataset for each setup and Figure 9 in Appendix B visualizes the simulation benchmark and real-world setups. More details of each evaluation setup are provided in Appendix B. Language Table (Lynch et al., 2023) is a simulation where a robot performs 2 DOF actions to push blocks (see Figure 9 (a)). [...] Bridgev2 (Walke et al., 2023), Open-X (Collaboration et al., 2023), and Something Something v2 (Goyal et al., 2017). |
| Dataset Splits | Yes | In-Domain Performance First, we assess LAPA's ability to learn from a small subset of in-domain action label data by pretraining on 181k trajectories and finetuning on 1k action-labeled trajectories (0.5%). ... Pretraining LAPA on 181k trajectories and finetuning on only separate tasks (7k), ... We pretrain LAPA on 440k real-world trajectories, and then finetune on 1k simulation trajectories... For evaluation, we evaluate on 50 evaluation rollouts for each subtask category... We filter 25 successful trajectories for each task (total of 100) and use them as the fine-tuning dataset... For evaluation, we evaluate on 24 rollouts per task... Each task involves 150 trajectories across 15 objects. |
| Hardware Specification | Yes | For pretraining LAPA (Open-X), the best-performing model, we use 8 H100 GPUs for 34 hours with a batch size of 128 (total of 272 H100-hours). In contrast, OPENVLA required a total of 21,500 A100-hours with a batch size of 2048. ... For VPT, we use ResNet18 followed by an MLP layer for the inverse dynamics model (IDM). The IDM is trained to predict an action when given two frames on a single A6000 GPU using the Adam optimizer with a learning rate of 1e-4. |
| Software Dependencies | No | The paper mentions several models and frameworks like "Large World Model (LWM-Chat-1M) (Liu et al., 2024)", "VQ-VAE objective (van den Oord et al., 2017)", "C-ViViT tokenizer (Villegas et al., 2023)", and "Polymetis robotic stack". However, it does not specify version numbers for general programming languages or common libraries (e.g., Python, PyTorch, TensorFlow) used for implementation. |
| Experiment Setup | Yes | To map latent actions to actual robot actions, we finetune LAPA on a small set of labeled trajectories that contain ground truth actions (delta end-effector). For action prediction, we discretize the continuous action space for each dimension of the robot so that the number of data points allocated for each bin is equal following Kim et al. (2024); Brohan et al. (2023). ... By default, we freeze only the vision encoder and unfreeze the language model during training. ... For all experiments, we train with a batch size of 128. We use the same inverse dynamics model as VPT during inference. ... For finetuning, we use LoRA finetuning (Hu et al., 2022) with a batch size of 32. ... We finetune the model until the training action accuracy reaches 95%. For ACTIONVLA and LAPA, we train with a batch size of 128 and with image augmentation for real-world finetuning. |
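The Experiment Setup row describes discretizing each continuous action dimension into equal-frequency bins (so every bin receives roughly the same number of data points). A minimal sketch of that quantile-based binning is below; the function name `discretize_actions` and the choice of NumPy are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def discretize_actions(actions: np.ndarray, n_bins: int = 256):
    """Equal-frequency (quantile) binning, applied per action dimension.

    Hypothetical sketch: bin edges are placed at quantiles of the training
    data, so each bin holds roughly the same number of points.

    actions: (N, D) array of continuous actions (e.g. delta end-effector).
    Returns (indices, edges): indices is (N, D) ints in [0, n_bins),
    edges is (D, n_bins + 1) bin boundaries per dimension.
    """
    n, d = actions.shape
    edges = np.empty((d, n_bins + 1))
    indices = np.empty((n, d), dtype=np.int64)
    for dim in range(d):
        # Quantile cut points yield approximately equal counts per bin.
        edges[dim] = np.quantile(actions[:, dim],
                                 np.linspace(0.0, 1.0, n_bins + 1))
        # Map each value to its bin; clip keeps the max value in the last bin.
        indices[:, dim] = np.clip(
            np.searchsorted(edges[dim], actions[:, dim], side="right") - 1,
            0, n_bins - 1,
        )
    return indices, edges
```

The resulting per-dimension bin indices can then serve as discrete action tokens for the fine-tuning stage.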