ActAnywhere: Subject-Aware Video Background Generation
Authors: Boxiao Pan, Zhan Xu, Chun-Hao Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive evaluation, we show that our model produces videos with realistic foreground-background interaction while strictly following the guidance of the condition image. Our model generalizes to diverse scenarios including non-human subjects, gaming and animation clips, as well as videos with multiple moving subjects. Both quantitative and qualitative comparisons demonstrate that our model significantly outperforms existing methods. |
| Researcher Affiliation | Collaboration | Boxiao Pan (Stanford University); Zhan Xu (Adobe Research); Chun-Hao Paul Huang (Adobe Research); Krishna Kumar Singh (Adobe Research); Yang Zhou (Adobe Research); Leonidas J. Guibas (Stanford University); Jimei Yang (Runway) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The model architecture and training process are described in prose and figures. |
| Open Source Code | No | As stated in the response above, we provided detailed instructions on how to replicate our experiment results in the paper. We will release our code and models upon paper acceptance. |
| Open Datasets | Yes | We train on the large-scale dataset compiled and processed by [26], which we refer to as HiC+. The resulting dataset contains 2.4M videos of human-scene interactions. It also provides foreground segmentation and masks. We refer the reader to the original paper for more details. |
| Dataset Splits | Yes | We also evaluate qualitatively and perform ablation study on held-out samples from the HiC+ dataset following the original data splits [26]. |
| Hardware Specification | Yes | We train on 8 NVIDIA A100-80GB GPUs with batch size 4, which takes approximately a week to fully converge. |
| Software Dependencies | No | We initialize the weights of our denoising network ϵθ with the pre-trained weights from the Stable Diffusion image inpainting model [34], which is fine-tuned on top of the original Stable Diffusion on the text-conditioned image inpainting task. We initialize the weights of the inserted motion modules with AnimateDiff v2. For the CLIP image encoder, we use the clip-vit-large-patch14 variant provided by OpenAI. We use the AdamW [27] optimizer with a constant learning rate of 3e-5. While specific models and optimizers are mentioned, no precise version numbers for software libraries (e.g., PyTorch, TensorFlow, or specific Stable Diffusion software versions) are provided. A sketch of assembling these components is given below the table. |
| Experiment Setup | Yes | We use the AdamW [27] optimizer with a constant learning rate of 3e-5. We train on 8 NVIDIA A100-80GB GPUs with batch size 4, which takes approximately a week to fully converge. Samples with our method are generated with 50 denoising steps, with a guidance scale [34] of 5. Random condition dropping: in order to enable classifier-free guidance at test time, we randomly drop the segmentation and the mask, the condition frame, or all of them at 10% probability each during training. A sketch of this condition-dropping scheme is given below the table. |
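
The Software Dependencies row names the pretrained components but no library versions. The sketch below shows one plausible way to assemble them with Hugging Face diffusers and transformers; the Hub checkpoint identifiers (`runwayml/stable-diffusion-inpainting`, `guoyww/animatediff-motion-adapter-v1-5-2`) are assumptions, since the paper names the models but not their repository ids, and this is not the authors' released code.

```python
import torch
from diffusers import UNet2DConditionModel, MotionAdapter
from transformers import CLIPVisionModelWithProjection

# Denoising network initialized from Stable Diffusion inpainting weights
# (checkpoint id assumed; the paper does not pin an exact version).
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-inpainting", subfolder="unet"
)

# Inserted motion modules initialized from AnimateDiff v2
# (motion-adapter Hub id is an assumption).
motion_adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2"
)

# CLIP image encoder, clip-vit-large-patch14 variant from OpenAI.
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14"
)

# AdamW with a constant learning rate of 3e-5, as reported in the paper.
params = list(unet.parameters()) + list(motion_adapter.parameters())
optimizer = torch.optim.AdamW(params, lr=3e-5)
```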
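
The random condition dropping quoted in the Experiment Setup row is what enables classifier-free guidance at test time. Below is a minimal sketch of that scheme under stated assumptions: the tensor names (`fg_seq`, `mask_seq`, `cond_frame_emb`) are hypothetical, and dropped conditions are represented here by zero tensors, a detail the paper does not specify.

```python
import torch

def drop_conditions(fg_seq, mask_seq, cond_frame_emb, p=0.1):
    """Randomly drop conditioning signals during training (10% each) so that
    classifier-free guidance can be applied at sampling time.

    With probability p the foreground segmentation and masks are dropped,
    with probability p the condition-frame embedding is dropped, and with
    probability p all of them are dropped.
    """
    r = torch.rand(()).item()
    if r < p:            # drop segmentation and masks
        fg_seq = torch.zeros_like(fg_seq)
        mask_seq = torch.zeros_like(mask_seq)
    elif r < 2 * p:      # drop the condition frame
        cond_frame_emb = torch.zeros_like(cond_frame_emb)
    elif r < 3 * p:      # drop all conditions
        fg_seq = torch.zeros_like(fg_seq)
        mask_seq = torch.zeros_like(mask_seq)
        cond_frame_emb = torch.zeros_like(cond_frame_emb)
    return fg_seq, mask_seq, cond_frame_emb
```

At sampling time, the conditional and unconditional predictions would then be combined with the reported guidance scale of 5 over 50 denoising steps.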