Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Whole-Body Conditioned Egocentric Video Prediction

Authors: Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
Researcher Affiliation | Collaboration | Yutong Bai (1), Danny Tran (1), Amir Bar (2), Yann LeCun (2,3), Trevor Darrell (1), Jitendra Malik (1,2). (1) UC Berkeley (BAIR); (2) FAIR, Meta; (3) New York University
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. Methods are described in prose and mathematical equations.
Open Source Code | No | The paper's NeurIPS checklist for open access to data and code states '[Yes]' but the justification only mentions 'We use open datasets for evaluations which can be downloaded from their respective websites.', which refers to data, not the code for the methodology. No explicit statement or direct link to the implementation's source code is provided in the paper's main body or supplementary material description.
Open Datasets | Yes | We use the Nymeria dataset (Ma et al., 2024), which contains synchronized egocentric video and full-body motion capture, recorded in diverse real-world settings using an XSens system (Movella, 2021).
Dataset Splits | Yes | We split the dataset 80/20 for training and evaluation, and report all metrics on the validation set.
Hardware Specification | Yes | The model was trained for a total of 57.9 hours on 16 H100 nodes, each equipped with 8 GPUs. For inference, the average time per frame is 23728 ± 207 ms, measured on a single A6000 GPU.
Software Dependencies | No | The paper mentions using specific components like 'AdamW' and 'Stable Diffusion VAE tokenizer' but does not provide specific version numbers for these or other key software dependencies (e.g., Python, PyTorch, CUDA libraries).
Experiment Setup | Yes | Training Details. We train variants of Conditional Diffusion Transformer (CDiT-S to CDiT-XXL, up to 32 layers) using a context window of 3–15 frames and predicting 64-frame trajectories. Models operate on 2×2 patches and are conditioned on both pose and temporal embeddings. We use AdamW (lr=8e-5, betas=(0.9, 0.95), grad clip=10.0) and batch size 512. Action inputs are normalized to [-1, 1] for translation and [-π, π] for rotation. All experiments use Stable Diffusion VAE tokenizer and follow NWM's hardware and evaluation setup. Metrics are averaged over 5 samples per sequence.
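
The numeric settings reported above can be collected into a small Python sketch for sanity-checking. This is an illustration only, not the authors' code: the class and function names are hypothetical, the random index split is an assumed way of realizing the 80/20 split, and the scaling and wrapping functions are one assumed way of reaching the stated [-1, 1] and [-π, π] ranges. The paper excerpt does not say how either step is implemented.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values copied from the table above; field names are hypothetical."""
    learning_rate: float = 8e-5
    betas: tuple = (0.9, 0.95)
    grad_clip: float = 10.0
    batch_size: int = 512
    train_hours: float = 57.9
    nodes: int = 16
    gpus_per_node: int = 8
    train_fraction: float = 0.8


def gpu_hours(setup):
    """Training hours times total GPU count.

    Assumes the reported 57.9 hours is wall-clock time across all nodes.
    """
    return setup.train_hours * setup.nodes * setup.gpus_per_node


def split_indices(n_items, train_fraction, seed=0):
    """Shuffle item indices and cut them into train and evaluation parts.

    Assumption: a seeded random split; the source only states the 80/20 ratio.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_items * train_fraction))
    return indices[:cut], indices[cut:]


def scale_translation(value, max_abs):
    """Assumed scheme: divide by the largest magnitude, then clamp to [-1, 1]."""
    return max(-1.0, min(1.0, value / max_abs))


def wrap_rotation(angle):
    """Wrap an angle in radians into the range [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


if __name__ == "__main__":
    setup = ReportedSetup()
    train, evaluation = split_indices(1000, setup.train_fraction)
    print(len(train), len(evaluation))
    print(gpu_hours(setup))
    print(scale_translation(0.5, max_abs=2.0), wrap_rotation(1.5 * math.pi))
```

Under the wall-clock assumption, the reported hardware figures correspond to 57.9 × 16 × 8 = 7411.2 H100 GPU-hours of training.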