HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

Authors: Zihui (Sherry) Xue, Romy Luo, Changan Chen, Kristen Grauman

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive qualitative and quantitative evaluations demonstrate that HOI-Swap significantly outperforms existing methods, delivering high-quality video edits with realistic HOIs.
Researcher Affiliation | Collaboration | Zihui Xue (1,2), Mi Luo (1), Changan Chen (1), Kristen Grauman (1,2); 1: The University of Texas at Austin, 2: FAIR, Meta
Pseudocode | No | The paper describes its model design and training processes in text and with equations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Project webpage: https://vision.cs.utexas.edu/projects/HOI-Swap. We invite readers to view the Supp. video available at https://vision.cs.utexas.edu/projects/HOI-Swap, which presents additional qualitative results of HOI-Swap. (Implied by the NeurIPS checklist answer stating that code is available on the project website.)
Open Datasets | Yes | Our training leverages two large-scale egocentric datasets, HOI4D [35] and Ego-Exo4D [15], which feature abundant HOIs, making them particularly suitable for exploring this problem.
Dataset Splits | Yes | We use 2,679 videos for training and hold out 292 videos for evaluation; the evaluation videos are selected based on object instance ids to ensure that the source objects during test time are unseen by the model. (A grouped-split sketch follows the table.)
Hardware Specification | Yes | Training for each stage takes about 3 days on one 8-NVIDIA-V100-32G GPU node.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries; only general mentions such as 'PyTorch' and 'CUDA' appear in the references.
Experiment Setup | Yes | For stage-I training: image resolution is set as 512 × 512. We train the model for a total of 25K steps with a learning rate of 1e-4 and a batch size of 8. We finetune the entire 2D UNet for image editing. For stage-II training: input video resolution is set as 14 × 256 × 256, where we sample 14 frames at an fps of 7 and train the model for a total of 50K steps with a learning rate of 1e-5 and a batch size of 1. We finetune the temporal layers of the 3D UNet. A classifier-free guidance dropout rate of 0.2 is employed for all stages. (A configuration sketch follows the table.)
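The Dataset Splits row describes holding out evaluation videos by object instance id so that test-time source objects are never seen during training. The sketch below is a minimal illustration of that kind of grouped split; the field names (video_id, object_instance_id) and the selection logic are assumptions for readability, not the authors' released code. Only the 292-video evaluation count comes from the paper.

import random

def split_by_object_instance(videos, num_eval=292, seed=0):
    """Hold out whole object-instance groups so evaluation objects are unseen in training.

    `videos` is assumed to be a list of dicts with 'video_id' and
    'object_instance_id' keys; the grouping rule follows the paper's
    description, but the exact selection procedure here is illustrative.
    """
    by_object = {}
    for v in videos:
        by_object.setdefault(v["object_instance_id"], []).append(v)

    rng = random.Random(seed)
    object_ids = list(by_object)
    rng.shuffle(object_ids)

    eval_videos, train_videos = [], []
    for oid in object_ids:
        # Assign an entire object-instance group to eval until the quota is met,
        # so no evaluation object ever appears in the training set.
        if len(eval_videos) < num_eval:
            eval_videos.extend(by_object[oid])
        else:
            train_videos.extend(by_object[oid])
    return train_videos, eval_videos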
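The Experiment Setup row lists the stage-wise hyperparameters in prose. The configuration sketch below simply collects those reported values into one structure for easier comparison; the dataclass and field names are hypothetical, and only the numeric values are taken from the paper.

from dataclasses import dataclass

@dataclass
class StageConfig:
    # Field names are illustrative; values are the ones reported in the paper.
    resolution: tuple          # spatial (or frames x spatial) resolution
    train_steps: int
    learning_rate: float
    batch_size: int
    finetuned_modules: str
    cfg_dropout: float = 0.2   # classifier-free guidance dropout, shared across stages

stage1 = StageConfig(
    resolution=(512, 512),             # image editing with a 2D UNet
    train_steps=25_000,
    learning_rate=1e-4,
    batch_size=8,
    finetuned_modules="entire 2D UNet",
)

stage2 = StageConfig(
    resolution=(14, 256, 256),         # 14 frames sampled at 7 fps
    train_steps=50_000,
    learning_rate=1e-5,
    batch_size=1,
    finetuned_modules="temporal layers of the 3D UNet",
)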