HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

Authors: Zihui (Sherry) Xue, Romy Luo, Changan Chen, Kristen Grauman

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive qualitative and quantitative evaluations demonstrate that HOI-Swap significantly outperforms existing methods, delivering high-quality video edits with realistic HOIs.
Researcher Affiliation | Collaboration | Zihui Xue (1,2), Mi Luo (1), Changan Chen (1), Kristen Grauman (1,2); 1: The University of Texas at Austin, 2: FAIR, Meta
Pseudocode | No | The paper describes its model design and training processes in text and with equations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Project webpage: https://vision.cs.utexas.edu/projects/HOI-Swap. We invite readers to view the Supp. video available at https://vision.cs.utexas.edu/projects/HOI-Swap, which presents additional qualitative results of HOI-Swap. (Implied by the NeurIPS checklist answer stating that code is available on the project website.)
Open Datasets | Yes | Our training leverages two large-scale egocentric datasets, HOI4D [35] and Ego-Exo4D [15], which feature abundant HOIs, making them particularly suitable for exploring this problem.
Dataset Splits | Yes | We use 2,679 videos for training and hold out 292 videos for evaluation; the evaluation videos are selected based on object instance ids to ensure that the source objects during test time are unseen by the model. (A grouped-split sketch follows the table.)
Hardware Specification | Yes | Training for each stage takes about 3 days on one 8-NVIDIA-V100-32G GPU node.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries; only general mentions such as 'PyTorch' and 'CUDA' appear in the references.
Experiment Setup | Yes | For stage-I training: image resolution is set as 512 × 512. We train the model for a total of 25K steps with a learning rate of 1e-4 and a batch size of 8. We finetune the entire 2D UNet for image editing. For stage-II training: input video resolution is set as 14 × 256 × 256, where we sample 14 frames at an fps of 7 and train the model for a total of 50K steps with a learning rate of 1e-5 and a batch size of 1. We finetune the temporal layers of the 3D UNet. A classifier-free guidance dropout rate of 0.2 is employed for all stages. (A configuration sketch follows the table.)
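The Dataset Splits row describes holding out evaluation videos by object instance id so that test-time source objects are never seen during training. The sketch below is a minimal illustration of that kind of grouped split; the field names (video_id, object_instance_id) and the selection logic are assumptions for readability, not the authors' released code. Only the 292-video evaluation count comes from the paper.

import random

def split_by_object_instance(videos, num_eval=292, seed=0):
    """Hold out whole object-instance groups so evaluation objects are unseen in training.

    `videos` is assumed to be a list of dicts with 'video_id' and
    'object_instance_id' keys; the grouping rule follows the paper's
    description, but the exact selection procedure here is illustrative.
    """
    by_object = {}
    for v in videos:
        by_object.setdefault(v["object_instance_id"], []).append(v)

    rng = random.Random(seed)
    object_ids = list(by_object)
    rng.shuffle(object_ids)

    eval_videos, train_videos = [], []
    for oid in object_ids:
        # Assign an entire object-instance group to eval until the quota is met,
        # so no evaluation object ever appears in the training set.
        if len(eval_videos) < num_eval:
            eval_videos.extend(by_object[oid])
        else:
            train_videos.extend(by_object[oid])
    return train_videos, eval_videos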
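The Experiment Setup row lists the stage-wise hyperparameters in prose. The configuration sketch below simply collects those reported values into one structure for easier comparison; the dataclass and field names are hypothetical, and only the numeric values are taken from the paper.

from dataclasses import dataclass

@dataclass
class StageConfig:
    # Field names are illustrative; values are the ones reported in the paper.
    resolution: tuple          # spatial (or frames x spatial) resolution
    train_steps: int
    learning_rate: float
    batch_size: int
    finetuned_modules: str
    cfg_dropout: float = 0.2   # classifier-free guidance dropout, shared across stages

stage1 = StageConfig(
    resolution=(512, 512),             # image editing with a 2D UNet
    train_steps=25_000,
    learning_rate=1e-4,
    batch_size=8,
    finetuned_modules="entire 2D UNet",
)

stage2 = StageConfig(
    resolution=(14, 256, 256),         # 14 frames sampled at 7 fps
    train_steps=50_000,
    learning_rate=1e-5,
    batch_size=1,
    finetuned_modules="temporal layers of the 3D UNet",
)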