HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness
Authors: Zihui (Sherry) Xue, Mi (Romy) Luo, Changan Chen, Kristen Grauman
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive qualitative and quantitative evaluations demonstrate that HOI-Swap significantly outperforms existing methods, delivering high-quality video edits with realistic HOIs. |
| Researcher Affiliation | Collaboration | Zihui Xue (1,2), Mi Luo (1), Changan Chen (1), Kristen Grauman (1,2); (1) The University of Texas at Austin, (2) FAIR, Meta |
| Pseudocode | No | The paper describes its model design and training processes in text and with equations, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project webpage: https://vision.cs.utexas.edu/projects/HOI-Swap. We invite readers to view the Supp. video available at https://vision.cs.utexas.edu/projects/HOI-Swap, which presents additional qualitative results of HOI-Swap. (Implied by NeurIPS checklist answer stating code is on project website) |
| Open Datasets | Yes | Our training leverages two large-scale egocentric datasets, HOI4D [35] and Ego-Exo4D [15], which feature abundant HOIs, making them particularly suitable for exploring this problem. |
| Dataset Splits | Yes | We use 2,679 videos for training and hold out 292 videos for evaluation; the evaluation videos are selected based on object instance ids to ensure that the source objects during test time are unseen by the model. (A minimal split sketch follows the table.) |
| Hardware Specification | Yes | Training for each stage takes about 3 days on one 8-NVIDIA-V100-32G GPU node. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in its implementation, only general mentions like 'PyTorch' and 'CUDA' are present in the references. |
| Experiment Setup | Yes | For stage-I training: image resolution is set as 512 × 512. We train the model for a total of 25K steps with a learning rate of 1e-4 and a batch size of 8. We finetune the entire 2D UNet for image editing. For stage-II training: input video resolution is set as 14 × 256 × 256, where we sample 14 frames at an fps of 7 and train the model for a total of 50K steps with a learning rate of 1e-5 and a batch size of 1. We finetune the temporal layers of the 3D UNet. A classifier-free guidance dropout rate of 0.2 is employed for all stages. (Hedged config and CFG-dropout sketches follow the table.) |
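
The instance-level split described in the Dataset Splits row can be made concrete with a short sketch. This is not the authors' code: the video records, the `instance_id` field, and the holdout fraction are assumptions; only the principle (hold out whole object instances so test-time source objects are unseen during training) comes from the paper.

```python
import random

def split_by_instance_id(videos, holdout_fraction=0.1, seed=0):
    """Hold out entire object instances so test-time objects are unseen.

    `videos` is assumed to be a list of dicts with an "instance_id" key;
    both the schema and the holdout fraction are illustrative choices.
    """
    # Group all videos by the object instance they depict.
    by_instance = {}
    for v in videos:
        by_instance.setdefault(v["instance_id"], []).append(v)

    # Hold out a random subset of whole instances, not individual videos.
    instance_ids = sorted(by_instance)
    random.Random(seed).shuffle(instance_ids)
    n_holdout = max(1, int(len(instance_ids) * holdout_fraction))
    held_out = set(instance_ids[:n_holdout])

    train = [v for iid, vs in by_instance.items() if iid not in held_out for v in vs]
    evaluation = [v for iid, vs in by_instance.items() if iid in held_out for v in vs]
    return train, evaluation
```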
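
For reference, the two-stage setup from the Experiment Setup row can be summarized as a plain configuration sketch. The numbers are transcribed from the paper; the dictionary layout and key names are mine, not the released code.

```python
# Stage-I: image editing with a 2D UNet (entire network finetuned).
STAGE_I = dict(
    image_resolution=(512, 512),
    steps=25_000,
    learning_rate=1e-4,
    batch_size=8,
    trainable="entire 2D UNet",
)

# Stage-II: video editing with a 3D UNet (only temporal layers finetuned).
STAGE_II = dict(
    video_resolution=(14, 256, 256),  # 14 frames sampled at 7 fps
    steps=50_000,
    learning_rate=1e-5,
    batch_size=1,
    trainable="temporal layers of the 3D UNet",
)

CFG_DROPOUT = 0.2  # classifier-free guidance dropout, applied in all stages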
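
The classifier-free guidance dropout rate of 0.2 refers to the standard conditioning-dropout trick used when training diffusion models for CFG sampling. A minimal sketch, assuming a learned null embedding, is below; `maybe_drop_condition` is a hypothetical helper, not from the paper's code.

```python
import torch

def maybe_drop_condition(cond_emb: torch.Tensor,
                         null_emb: torch.Tensor,
                         p: float = 0.2) -> torch.Tensor:
    # With probability p, replace the conditioning embedding with a null
    # embedding so the same network also learns the unconditional
    # distribution (standard CFG training; p=0.2 matches the paper).
    if torch.rand(()).item() < p:
        return null_emb
    return cond_emb
```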