Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
ActAnywhere: Subject-Aware Video Background Generation
Authors: Boxiao Pan, Zhan Xu, Chun-Hao Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive evaluation, we show that our model produces videos with realistic foreground-background interaction while strictly following the guidance of the condition image. Our model generalizes to diverse scenarios including non-human subjects, gaming and animation clips, as well as videos with multiple moving subjects. Both quantitative and qualitative comparisons demonstrate that our model significantly outperforms existing methods |
| Researcher Affiliation | Collaboration | Boxiao Pan Stanford University Zhan Xu Adobe Research Chun-Hao Paul Huang Adobe Research Krishna Kumar Singh Adobe Research Yang Zhou Adobe Research Leonidas J. Guibas Stanford University Jimei Yang Runway |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The model architecture and training process are described in prose and figures. |
| Open Source Code | No | As stated in the response above, we provided detailed instructions on how to replicate our experiment results in the paper. We will release our code and models upon paper acceptance. |
| Open Datasets | Yes | We train on the large-scale dataset compiled and processed by [26], which we refer to as HiC+. The resulting dataset contains 2.4M videos of human-scene interactions. It also provides foreground segmentation and masks. We refer the reader to the original paper for more details. |
| Dataset Splits | Yes | We also evaluate qualitatively and perform ablation study on held-out samples from the HiC+ dataset following the original data splits [26]. |
| Hardware Specification | Yes | We train on 8 NVIDIA A100-80GB GPUs with batch size 4, which takes approximately a week to fully converge. |
| Software Dependencies | No | We initialize the weights of our denoising network ϵθ with the pre-trained weights from the Stable Diffusion image inpainting model [34], which is fine-tuned on top of the original Stable Diffusion on the text-conditioned image inpainting task. We initialize the weights of the inserted motion modules with AnimateDiff v2. For the CLIP image encoder, we use the clip-vit-large-patch14 variant provided by OpenAI. We use the AdamW [27] optimizer with a constant learning rate of 3e-5. While specific models and optimizers are mentioned, no precise version numbers for software libraries (e.g., PyTorch, TensorFlow, or specific Stable Diffusion software versions) are provided. |
| Experiment Setup | Yes | We use the AdamW [27] optimizer with a constant learning rate of 3e-5. We train on 8 NVIDIA A100-80GB GPUs with batch size 4, which takes approximately a week to fully converge. Samples with our method are generated with 50 denoising steps, with a guidance scale [34] of 5. Random condition dropping. In order to enable classifier-free guidance at test time, we randomly drop the segmentation and the mask, the condition frame, or all of them at 10% probability each during training. |
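
The Experiment Setup row mentions random condition dropping during training and a guidance scale of 5 at sampling time. Since the code has not been released, the following is only a minimal sketch of how those two pieces usually fit together, not the authors' implementation. It assumes the three drop cases are mutually exclusive (10% each, leaving 70% of samples fully conditioned) and uses the standard classifier-free guidance combination; the function names `sample_condition_drop` and `guided_noise` are hypothetical, and plain Python lists stand in for noise tensors.

```python
import random

DROP_PROB = 0.10      # per-case drop probability quoted in the table
GUIDANCE_SCALE = 5.0  # guidance scale quoted in the table


def sample_condition_drop(rng):
    """Decide which conditions to keep for one training sample.

    Returns (keep_seg_and_mask, keep_condition_frame).
    Assumption: the three drop cases are mutually exclusive.
    """
    u = rng.random()
    if u < DROP_PROB:
        return False, True    # drop segmentation and mask only
    if u < 2 * DROP_PROB:
        return True, False    # drop condition frame only
    if u < 3 * DROP_PROB:
        return False, False   # drop all conditions
    return True, True         # keep everything


def guided_noise(eps_uncond, eps_cond, scale=GUIDANCE_SCALE):
    """Standard classifier-free guidance: eps_u + s * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_condition_drop(rng) for _ in range(10)]
    print(draws)
    print(guided_noise([0.0, 1.0], [1.0, 1.0]))
```

With a scale of 1 the combination reduces to the conditional prediction; larger scales push the sample further toward the conditions. How the paper combines the two condition types at test time is not stated in the quoted text, so the single unconditional branch here is an assumption.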