ActAnywhere: Subject-Aware Video Background Generation

Authors: Boxiao Pan, Zhan Xu, Chun-Hao Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive evaluation, we show that our model produces videos with realistic foreground-background interaction while strictly following the guidance of the condition image. Our model generalizes to diverse scenarios including non-human subjects, gaming and animation clips, as well as videos with multiple moving subjects. Both quantitative and qualitative comparisons demonstrate that our model significantly outperforms existing methods.
Researcher Affiliation | Collaboration | Boxiao Pan (Stanford University), Zhan Xu (Adobe Research), Chun-Hao Paul Huang (Adobe Research), Krishna Kumar Singh (Adobe Research), Yang Zhou (Adobe Research), Leonidas J. Guibas (Stanford University), Jimei Yang (Runway)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The model architecture and training process are described in prose and figures.
Open Source Code | No | As stated in the response above, we provided detailed instructions on how to replicate our experiment results in the paper. We will release our code and models upon paper acceptance.
Open Datasets | Yes | We train on the large-scale dataset compiled and processed by [26], which we refer to as HiC+. The resulting dataset contains 2.4M videos of human-scene interactions. It also provides foreground segmentation and masks. We refer the reader to the original paper for more details.
Dataset Splits | Yes | We also evaluate qualitatively and perform an ablation study on held-out samples from the HiC+ dataset, following the original data splits [26].
Hardware Specification | Yes | We train on 8 NVIDIA A100-80GB GPUs with batch size 4, which takes approximately a week to fully converge.
Software Dependencies | No | We initialize the weights of our denoising network ϵθ with the pre-trained weights from the Stable Diffusion image inpainting model [34], which is fine-tuned on top of the original Stable Diffusion on the text-conditioned image inpainting task. We initialize the weights of the inserted motion modules with AnimateDiff v2. For the CLIP image encoder, we use the clip-vit-large-patch14 variant provided by OpenAI. We use the AdamW [27] optimizer with a constant learning rate of 3e-5. While specific models and optimizers are mentioned, no precise version numbers for software libraries (e.g., PyTorch, TensorFlow, or specific Stable Diffusion software versions) are provided. (See the initialization sketch after the table.)
Experiment Setup | Yes | We use the AdamW [27] optimizer with a constant learning rate of 3e-5. We train on 8 NVIDIA A100-80GB GPUs with batch size 4, which takes approximately a week to fully converge. Samples with our method are generated with 50 denoising steps, with a guidance scale [34] of 5. Random condition dropping: in order to enable classifier-free guidance at test time, we randomly drop the segmentation and the mask, the condition frame, or all of them at 10% probability each during training. (A condition-dropping sketch follows the table.)
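The dependencies quoted in the Software Dependencies row map naturally onto the Hugging Face diffusers / transformers stack. The sketch below shows one way to reproduce the stated initialization (Stable Diffusion inpainting weights, the clip-vit-large-patch14 image encoder, AdamW at 3e-5). The checkpoint identifiers, the frozen encoder, and the omitted AnimateDiff motion-module insertion are assumptions for illustration, not the authors' released code.

```python
# Minimal initialization sketch, assuming the diffusers/transformers APIs.
# The checkpoint IDs below are assumptions; the paper does not name them.
import torch
from diffusers import UNet2DConditionModel
from transformers import CLIPVisionModel

# Denoising network: start from the Stable Diffusion image-inpainting UNet.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-inpainting", subfolder="unet"
)

# CLIP image encoder: the clip-vit-large-patch14 variant from OpenAI.
image_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
image_encoder.requires_grad_(False)  # assumed frozen during fine-tuning

# Motion modules would be inserted into the UNet and initialized from
# AnimateDiff v2 weights; that model surgery is architecture-specific and
# omitted here.

# Optimizer: AdamW with a constant learning rate of 3e-5, as quoted above.
optimizer = torch.optim.AdamW(unet.parameters(), lr=3e-5)
```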
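The random condition dropping described in the Experiment Setup row is straightforward to express in code. Below is a minimal sketch of that training-time dropout (10% each for segmentation plus mask, condition frame, or everything); the function name, tensor interface, and zeroing strategy are assumptions, not the paper's implementation.

```python
import torch

def drop_conditions(seg, mask, cond_frame, p=0.1):
    """Randomly suppress conditioning signals during training so the model
    also learns an unconditional mode, enabling classifier-free guidance at
    test time. Mirrors the 10%-each schedule quoted above."""
    r = torch.rand(()).item()
    if r < p:            # drop the segmentation and the mask
        seg, mask = torch.zeros_like(seg), torch.zeros_like(mask)
    elif r < 2 * p:      # drop the condition frame
        cond_frame = torch.zeros_like(cond_frame)
    elif r < 3 * p:      # drop all conditioning signals
        seg = torch.zeros_like(seg)
        mask = torch.zeros_like(mask)
        cond_frame = torch.zeros_like(cond_frame)
    return seg, mask, cond_frame
```

At inference time, the quoted setup then samples with 50 denoising steps and a guidance scale of 5, combining the conditional and unconditional predictions.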