Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Self-Improving Embodied Foundation Models

Authors: Seyed Kamyar Seyed Ghasemipour, Ayzaan Wahid, Jonathan J Tompson, Pannag Sanketi, Igor Mordatch

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Through extensive experiments on real-world and simulated robot embodiments, our novel post-training recipe unveils significant results on Embodied Foundation Models. First, we demonstrate that the combination of SFT and Self-Improvement is significantly more sample-efficient than scaling imitation data collection for supervised learning, and that it leads to policies with significantly higher success rates. Further ablations highlight that the combination of web-scale pretraining and Self-Improvement is the key to this sample-efficiency.
Researcher Affiliation Industry Seyed Kamyar Seyed Ghasemipour Generalist AI EMAIL Ayzaan Wahid & Jonathan Tompson & Pannag Sanketi & Igor Mordatch Google DeepMind EMAIL
Pseudocode Yes Algorithm 1 above presents pseudocode of our proposed Stage 2 Self-Improvement procedure.
Open Source Code No Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We are unable to open-source our code. However, we include a Colab notebook as a pedagogical implementation of our algorithm.
Open Datasets Yes The datasets we use for imitation learning are existing open-sourced datasets. The dataset we use to train Stage 1 policies for the simulated Language Table domain is the one provided by the original work [34].
Dataset Splits No The dataset we use to train Stage 1 policies for the simulated Language Table domain is the one provided by the original work [34]. This dataset consists of 181,020 human-generated trajectories, with 78,623 unique instructions describing the goals of the trajectories. We subsample this dataset to create 3 new datasets 10%, 20%, and 80% of the original size.
Hardware Specification Yes Stage 1 (SFT) training was done using one of the following configurations, interchangeably: 64 TPUv4 (2x4x4); 128 TPUv3. For Stage 2 (Self-Improvement) we used: half of SFT stage resources for the learner job (since we used half batch size); 4 TPUv4 (2x2x1) for the reward model; 4 TPUv4 (2x2x1) for the success detector.
Software Dependencies No We used the AdamW optimizer, and trained the entire PaLI model (i.e. kept no component frozen)... The replay buffer is implemented as a standalone server using Google DeepMind's Reverb [10], which provides efficient distributed data storage and sampling.
Experiment Setup Yes We used batch size 128 during this stage, used the AdamW optimizer, and trained the entire PaLI model (i.e. kept no component frozen). Stage 2 (Self-Improvement): During this stage we used batch size 64 to require fewer real-world rollouts for a given number of desired training steps. We kept the ViT portion of the model frozen, intuitively believing that the model has already learned visual features for the task, and that freezing the ViT may potentially help with model stability. We did not ablate this decision. We used the same AdamW optimizer as in Stage 1. The algorithm box in Section 2.2 presents the pseudocode for our proposed Stage 2 Self-Improvement procedure. In each RL loop, we collect enough robot trajectories to perform 16 model update steps (N = 16). Intuitively, decreasing N reduces off-policiness of the RL updates, while increasing N improves the diversity of data in the replay buffer due to the larger number of trajectories being collected before performing N RL updates. We use γ = 0.9, c = 5e-2. Please refer to Appendix C for further discussion.
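
For readers who want the quoted settings in one place, the following is a minimal sketch that collects the reported hyperparameters and computes the sizes of the three Language Table subsets. It is not the authors' code, which is not released. All class, field, and function names are hypothetical, the meaning of c is not defined in the excerpt above and is recorded only as a value, and the subsampling procedure itself is not specified, so only the resulting sizes are derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Settings quoted for one training stage (field names are hypothetical)."""
    batch_size: int
    optimizer: str
    freeze_vit: bool  # True if the ViT portion of the model is kept frozen


@dataclass(frozen=True)
class SelfImprovementConfig(StageConfig):
    """Extra Stage 2 settings quoted in the Experiment Setup row."""
    updates_per_loop: int = 16  # N: model update steps per RL loop
    gamma: float = 0.9          # reported as gamma = 0.9
    c: float = 5e-2             # reported as c = 5e-2; role not defined here


# Stage 1 (SFT): batch size 128, AdamW, no component frozen.
STAGE1_SFT = StageConfig(batch_size=128, optimizer="AdamW", freeze_vit=False)

# Stage 2 (Self-Improvement): batch size 64, AdamW, ViT frozen.
STAGE2_SELF_IMPROVEMENT = SelfImprovementConfig(
    batch_size=64, optimizer="AdamW", freeze_vit=True
)

# Language Table imitation dataset size quoted in the Dataset Splits row.
FULL_DATASET_SIZE = 181_020
SUBSET_PERCENTS = (10, 20, 80)


def subset_sizes(total: int, percents: tuple[int, ...]) -> dict[int, int]:
    """Return the trajectory count for each percentage of the full dataset."""
    return {p: total * p // 100 for p in percents}


if __name__ == "__main__":
    for percent, size in subset_sizes(FULL_DATASET_SIZE, SUBSET_PERCENTS).items():
        print(f"{percent}% subset: {size} trajectories")
    print(STAGE1_SFT)
    print(STAGE2_SELF_IMPROVEMENT)
```

The Stage 2 batch size is half the Stage 1 batch size, which matches the Hardware Specification row's note that the learner job used half of the SFT resources.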