Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness
Authors: Zihui (Sherry) Xue, Romy Luo, Changan Chen, Kristen Grauman
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive qualitative and quantitative evaluations demonstrate that HOI-Swap significantly outperforms existing methods, delivering high-quality video edits with realistic HOIs. |
| Researcher Affiliation | Collaboration | Zihui Xue (1,2), Mi Luo (1), Changan Chen (1), Kristen Grauman (1,2); (1) The University of Texas at Austin, (2) FAIR, Meta |
| Pseudocode | No | The paper describes its model design and training processes in text and with equations, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project webpage: https://vision.cs.utexas.edu/projects/HOI-Swap. We invite readers to view the Supp. video available at https://vision.cs.utexas.edu/projects/HOI-Swap, which presents additional qualitative results of HOI-Swap. (Implied by NeurIPS checklist answer stating code is on project website) |
| Open Datasets | Yes | Our training leverages two large-scale egocentric datasets, HOI4D [35] and Ego-Exo4D [15], which feature abundant HOIs, making them particularly suitable for exploring this problem. |
| Dataset Splits | Yes | We use 2,679 videos for training and hold out 292 videos for evaluation; the evaluation videos are selected based on object instance ids to ensure that the source objects during test time are unseen by the model. |
| Hardware Specification | Yes | Training for each stage takes about 3 days on one 8-NVIDIA-V100-32G GPU node. |
| Software Dependencies | No | The paper does not provide specific version numbers for the software dependencies or libraries used in its implementation; only general mentions such as 'PyTorch' and 'CUDA' appear in the references. |
| Experiment Setup | Yes | For stage-I training: image resolution is set as 512 × 512. We train the model for a total of 25K steps with a learning rate of 1e-4 and a batch size of 8. We finetune the entire 2D UNet for image editing. For stage-II training: input video resolution is set as 14 × 256 × 256, where we sample 14 frames at an fps of 7 and train the model for a total of 50K steps with a learning rate of 1e-5 and a batch size of 1. We finetune the temporal layers of the 3D UNet. A classifier-free guidance dropout rate of 0.2 is employed for all stages. |
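
For readers attempting a reproduction, the hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch, not the authors' code: the names `StageConfig`, `STAGE_1`, `STAGE_2`, and `clip_seconds` are hypothetical, and only the numeric values come from the quoted text. Whether the reported batch size is per GPU or global is not stated in the quote, so the sketch records it as given.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for one training stage (names are illustrative)."""
    name: str
    resolution: Tuple[int, ...]   # (H, W) for images, (frames, H, W) for video
    train_steps: int
    learning_rate: float
    batch_size: int               # as reported; per-GPU vs. global is not stated
    trainable: str                # which parameters are finetuned
    cfg_dropout: float = 0.2      # classifier-free guidance dropout, all stages


# Values copied from the Experiment Setup row above.
STAGE_1 = StageConfig(
    name="stage-I",
    resolution=(512, 512),
    train_steps=25_000,
    learning_rate=1e-4,
    batch_size=8,
    trainable="entire 2D UNet",
)
STAGE_2 = StageConfig(
    name="stage-II",
    resolution=(14, 256, 256),
    train_steps=50_000,
    learning_rate=1e-5,
    batch_size=1,
    trainable="temporal layers of the 3D UNet",
)

VIDEO_FPS = 7  # sampling rate quoted for stage II


def clip_seconds(cfg: StageConfig, fps: int = VIDEO_FPS) -> float:
    """Duration of one training clip, assuming resolution is (frames, H, W)."""
    if len(cfg.resolution) != 3:
        raise ValueError(f"{cfg.name} is not a video stage")
    return cfg.resolution[0] / fps


if __name__ == "__main__":
    for cfg in (STAGE_1, STAGE_2):
        print(cfg)
    print("stage-II clip length (s):", clip_seconds(STAGE_2))
```

Under these values, a stage-II clip of 14 frames sampled at 7 fps spans two seconds of video. Details the quoted text does not give, such as the optimizer or learning-rate schedule, are left out of the sketch.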