Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Latent Action Pretraining from Videos
Authors: Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we demonstrate the effectiveness of Latent Action Pretraining as a general-purpose pretraining method. Specifically, we focus on answering the following questions: Q1. How does LAPA perform when there are cross-task, cross-environment, and cross-embodiment gaps between pretraining and fine-tuning? Q2. Can LAPA learn superior priors compared to using ground-truth actions during pretraining in a multi-embodiment setting? Q3. Can we create a performant LAPA solely from raw human manipulation videos? |
| Researcher Affiliation | Collaboration | Seonghyeon Ye1 Joel Jang2 Byeongguk Jeon1 Sejune Joo1 Jianwei Yang3 Baolin Peng3 Ajay Mandlekar4 Reuben Tan3 Yu-Wei Chao4 Yuchen Lin5 Lars Liden3 Kimin Lee1 Jianfeng Gao3 Luke Zettlemoyer2 Dieter Fox2,4 Minjoon Seo1 1KAIST 2University of Washington 3Microsoft Research 4NVIDIA 5Allen Institute for AI |
| Pseudocode | No | The paper describes methods and models using text and mathematical equations, and also contains a model architecture diagram (Figure 8). However, it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will open-source the model checkpoints and code at latentactionpretraining.github.io. |
| Open Datasets | Yes | We evaluate the effectiveness of LAPA on 9 different task categories in 2 different simulation environments and 3 different real-world robotic tasks. Table 3 shows an overview of the pretraining and fine-tuning dataset for each setup and Figure 9 in Appendix B visualizes the simulation benchmark and real-world setups. More details of each evaluation setup are provided in Appendix B. Language Table (Lynch et al., 2023) is a simulation where a robot performs 2 DOF actions to push blocks (see Figure 9 (a)). [...] Bridgev2 (Walke et al., 2023), Open-X (Collaboration et al., 2023), and Something Something v2 (Goyal et al., 2017). |
| Dataset Splits | Yes | In-Domain Performance First, we assess LAPA's ability to learn from a small subset of in-domain action label data by pretraining on 181k trajectories and finetuning on 1k action-labeled trajectories (0.5%). ... Pretraining LAPA on 181k trajectories and finetuning on only separate tasks (7k), ... We pretrain LAPA on 440k real-world trajectories, and then finetune on 1k simulation trajectories... For evaluation, we evaluate on 50 evaluation rollouts for each subtask category... We filter 25 successful trajectories for each task (total of 100) and use them as the fine-tuning dataset... For evaluation, we evaluate on 24 rollouts per task... Each task involves 150 trajectories across 15 objects. |
| Hardware Specification | Yes | For pretraining LAPA (Open-X), the best-performing model, we use 8 H100 GPUs for 34 hours with a batch size of 128 (total of 272 H100-hours). In contrast, OPENVLA required a total of 21,500 A100-hours with a batch size of 2048. ... For VPT, we use ResNet18 followed by an MLP layer for the inverse dynamics model (IDM). The IDM is trained to predict an action when given two frames on a single A6000 GPU using the Adam optimizer with a learning rate of 1e-4. |
| Software Dependencies | No | The paper mentions several models and frameworks like "Large World Model (LWM-Chat-1M) (Liu et al., 2024)", "VQ-VAE objective (van den Oord et al., 2017)", "C-ViViT tokenizer (Villegas et al., 2023)", and "Polymetis robotic stack". However, it does not specify version numbers for general programming languages or common libraries (e.g., Python, PyTorch, TensorFlow) used for implementation. |
| Experiment Setup | Yes | To map latent actions to actual robot actions, we finetune LAPA on a small set of labeled trajectories that contain ground truth actions (delta end-effector). For action prediction, we discretize the continuous action space for each dimension of the robot so that the number of data points allocated for each bin is equal following Kim et al. (2024); Brohan et al. (2023). ... By default, we freeze only the vision encoder and unfreeze the language model during training. ... For all experiments, we train with a batch size of 128. We use the same inverse dynamics model as VPT during inference. ... For finetuning, we use LoRA finetuning (Hu et al., 2022) with a batch size of 32. ... We finetune the model until the training action accuracy reaches 95%. For ACTIONVLA and LAPA, we train with a batch size of 128 and with image augmentation for real-world finetuning. |
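The Experiment Setup row describes discretizing each continuous action dimension into equal-frequency bins (so every bin receives roughly the same number of data points). A minimal sketch of that quantile-based binning is below; the function name `discretize_actions` and the choice of NumPy are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def discretize_actions(actions: np.ndarray, n_bins: int = 256):
    """Equal-frequency (quantile) binning, applied per action dimension.

    Hypothetical sketch: bin edges are placed at quantiles of the training
    data, so each bin holds roughly the same number of points.

    actions: (N, D) array of continuous actions (e.g. delta end-effector).
    Returns (indices, edges): indices is (N, D) ints in [0, n_bins),
    edges is (D, n_bins + 1) bin boundaries per dimension.
    """
    n, d = actions.shape
    edges = np.empty((d, n_bins + 1))
    indices = np.empty((n, d), dtype=np.int64)
    for dim in range(d):
        # Quantile cut points yield approximately equal counts per bin.
        edges[dim] = np.quantile(actions[:, dim],
                                 np.linspace(0.0, 1.0, n_bins + 1))
        # Map each value to its bin; clip keeps the max value in the last bin.
        indices[:, dim] = np.clip(
            np.searchsorted(edges[dim], actions[:, dim], side="right") - 1,
            0, n_bins - 1,
        )
    return indices, edges
```

The resulting per-dimension bin indices can then serve as discrete action tokens for the fine-tuning stage.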