Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Object-centric 3D Motion Field for Robot Learning from Human Videos
Authors: Zhao-Heng Yin, Sherry Yang, Pieter Abbeel
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the system in real world setups. Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest method, achieve 55% average success rate in diverse tasks where prior approaches fail (≲ 10%), and can even acquire fine-grained manipulation skills like insertion. Section 5 is dedicated to "Experiments", detailing system setup, evaluation of the 3D motion field estimator, and robot learning from videos with real-world tasks. |
| Researcher Affiliation | Collaboration | Zhao-Heng Yin¹, Sherry Yang¹,², Pieter Abbeel¹. ¹BAIR, UC Berkeley EECS; ²Google DeepMind. Sherry Yang is affiliated with both an academic institution (UC Berkeley EECS) and an industry entity (Google DeepMind). |
| Pseudocode | No | The paper describes methods and procedures in paragraph form and through figures, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The NeurIPS Paper Checklist states for 'Open access to data and code': 'Answer: [Yes] Justification: We will release the trained model and learning pipeline.' This indicates a future release, not concrete access at the time of publication. |
| Open Datasets | Yes | We use the objects in the ShapeNet dataset [5] and some randomly generated regular rigid bodies as training objects. ShapeNet is a well-known public dataset. |
| Dataset Splits | No | The paper mentions generating "8M samples at 256×256 resolution for training" for Phase I and collecting "around 50-150 human videos for each of these tasks" for Phase II. It also describes setting up a "test set" for the estimator. However, it does not provide specific percentages or counts for training, validation, and test splits for any of these datasets in the main text. |
| Hardware Specification | Yes | We generate 8M samples at 256×256 resolution for training, which can be produced with 1 NVIDIA L40 GPU in less than 12 hours. We use the AdamW optimizer [22] to train this model and the training procedure takes about 1 day with 16 NVIDIA A100-40GB GPUs. We use a widely-used Intel D435 RGBD camera at 640×480 resolution for video dataset collection at 30Hz. We use an XArm7 robot arm with a parallel-jaw gripper for the test dataset collection and robot experiments. |
| Software Dependencies | No | The paper mentions various models and optimizers by name such as SAM2 [31], CoTracker3 [17], AdamW optimizer [22], and UNet [33], but does not provide specific version numbers for any programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We crop the image to 480×480 and rescale it to 256×256. We apply a weighted Huber loss ($\lVert\cdot\rVert$) as a stable supervision to train this model: $L = \mathbb{E}_{x,F,M \sim D_{\mathrm{sim}}} \lVert M \odot (f_{\mathrm{depth}}(x) - F_{\mathrm{depth}}) \rVert + \alpha \lVert M \odot (f_{\mathrm{motion}}(x) - F_{\mathrm{motion}}) \rVert$. In the loss function above, ... α is a weighting hyperparameter. We use the AdamW optimizer [22] to train this model. We also find it important to apply a random masking data augmentation to objects. Besides, for diffusion model we also find it useful to use "masked noise sample" as input. |
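
The Experiment Setup row quotes a Huber loss that supervises a depth term and a motion term under an object mask, with α weighting the motion term. The NumPy sketch below illustrates that kind of loss and is not the authors' implementation. The function names, the array shapes (an H×W depth map and an H×W×3 motion field), the Huber threshold `delta`, the default `alpha`, and the normalization by the number of masked pixels are all assumptions made for this example; the paper excerpt does not specify them.

```python
import numpy as np


def huber(residual, delta=1.0):
    """Elementwise Huber penalty: quadratic for |r| <= delta, linear beyond it."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)


def masked_huber_loss(pred_depth, true_depth, pred_motion, true_motion,
                      mask, alpha=1.0, delta=1.0):
    """Masked depth + alpha * masked motion Huber loss (illustrative only).

    Assumed shapes: depth (H, W), motion (H, W, 3), mask (H, W) with 0/1 entries.
    """
    mask = mask.astype(np.float64)
    # Assumption: average over masked pixels; guard against an empty mask.
    n_pixels = max(mask.sum(), 1.0)

    # Masking the residual zeroes out pixels outside the object,
    # and huber(0) == 0, so those pixels contribute nothing.
    depth_term = huber(mask * (pred_depth - true_depth), delta).sum() / n_pixels
    motion_term = huber(mask[..., None] * (pred_motion - true_motion), delta).sum() / n_pixels
    return depth_term + alpha * motion_term


# Usage on random data at the 256x256 resolution mentioned in the table.
rng = np.random.default_rng(0)
mask = np.zeros((256, 256))
mask[64:192, 64:192] = 1.0  # hypothetical object region
loss = masked_huber_loss(
    rng.normal(size=(256, 256)), rng.normal(size=(256, 256)),
    rng.normal(size=(256, 256, 3)), rng.normal(size=(256, 256, 3)),
    mask, alpha=0.5,
)
print(float(loss))
```

Applying the mask to the residual before the penalty means errors outside the object region cannot influence training, which matches the object-centric framing of the quoted loss.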