AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Authors: Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, Bo Dai
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate AnimateDiff and MotionLoRA on several public representative personalized T2I models collected from the community. The results demonstrate that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity. We conduct the quantitative comparison through user study and CLIP metrics. The comparison focuses on three key aspects: text alignment, domain similarity, and motion smoothness. |
| Researcher Affiliation | Collaboration | The Chinese University of Hong Kong; Shanghai Artificial Intelligence Laboratory; Stanford University |
| Pseudocode | No | The paper describes algorithms and methods but does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | Codes and pre-trained weights are available at https://github.com/guoyww/AnimateDiff. We also make both the code and pre-trained weights open-sourced to facilitate further investigation and exploration. |
| Open Datasets | Yes | We implement AnimateDiff upon Stable Diffusion V1.5 and train the motion module using the WebVid-10M (Bain et al., 2021) dataset. We utilize the WebVid-10M dataset (Bain et al., 2021), a large-scale video dataset consisting of approximately 10.7 million text-video data pairs, to train the motion module. |
| Dataset Splits | No | The paper mentions uniform sampling of video clips at a stride of 4 for a length of 16 for motion module and MotionLoRA training, but it does not specify explicit train/validation/test dataset splits with percentages or counts for the primary dataset used for model training. |
| Hardware Specification | Yes | We use a learning rate of 1 × 10⁻⁴ and train the motion module with 16 NVIDIA A100s for 5 epochs. |
| Software Dependencies | No | The paper mentions 'Stable Diffusion V1.5' but does not provide specific version numbers for other ancillary software components like programming languages, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | We adopt a training resolution of 256 × 256 to balance training efficiency and motion quality. For the motion module and MotionLoRA, we uniformly sample the videos at a stride of 4 to get video clips at a length of 16. We use a learning rate of 1 × 10⁻⁴ and train the motion module with 16 NVIDIA A100s for 5 epochs. In our experiment setup, we generate animations at a resolution of 512 × 512 using a DDIM (Song et al., 2020) sampler with classifier-free guidance. We referred to the model's official web page to determine the denoising hyperparameters (guidance scale, LoRA scaler, etc.) and generally adopted the same settings. |
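
The clip sampling reported above (clips of length 16 taken at a temporal stride of 4) can be illustrated with a minimal sketch. The paper only states the stride and length; the helper name, random start selection, and input representation below are illustrative assumptions.

```python
import random

def sample_clip_indices(num_frames: int, clip_len: int = 16, stride: int = 4) -> list[int]:
    """Sample frame indices for one training clip: `clip_len` frames, `stride` apart.

    Matches the reported setting (length 16, stride 4); the uniformly random
    start position is an assumption about how clips are drawn from a video.
    """
    span = (clip_len - 1) * stride + 1  # number of source frames one clip spans
    if num_frames < span:
        raise ValueError(f"need at least {span} frames, got {num_frames}")
    start = random.randint(0, num_frames - span)
    return [start + i * stride for i in range(clip_len)]

# Example: a 120-frame WebVid video yields 16 indices spaced 4 frames apart.
print(sample_clip_indices(120))
```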
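One way to approximate the reported inference setting (512 × 512, DDIM sampler, classifier-free guidance) with the released weights is the community `diffusers` integration rather than the authors' original repository. The checkpoint IDs, prompt, and sampler settings below are assumptions, not values taken from the paper; this is a sketch of the setting, not the authors' exact pipeline.

```python
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Assumed checkpoint IDs: a community-hosted motion adapter plus Stable Diffusion V1.5.
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2",
                                        torch_dtype=torch.float16)
pipe = AnimateDiffPipeline.from_pretrained("runwayml/stable-diffusion-v1-5",
                                           motion_adapter=adapter,
                                           torch_dtype=torch.float16)
# DDIM sampling, as reported in the experiment setup.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

# Classifier-free guidance is controlled by guidance_scale; the paper defers
# this value to each personalized model's official page, so 7.5 is a placeholder.
output = pipe(
    prompt="a corgi running on the beach, highly detailed",
    negative_prompt="low quality, blurry",
    num_frames=16,
    height=512,
    width=512,
    num_inference_steps=25,
    guidance_scale=7.5,
)
export_to_gif(output.frames[0], "animation.gif")
```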
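The CLIP metrics used for text alignment and domain similarity are reported only at a high level. A minimal sketch of a frame-averaged CLIP text-alignment score, using Hugging Face `transformers` and the frame list produced by the inference sketch above, might look like this; the aggregation scheme is an assumption.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_text_alignment(frames, prompt: str) -> float:
    """Mean cosine similarity between each animation frame and the text prompt.

    `frames` is a list of PIL images (e.g. output.frames[0] from the sketch above).
    Averaging over frames is an illustrative choice; the paper does not specify
    the exact aggregation used for its CLIP metrics.
    """
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_feats = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return (image_feats @ text_feats.T).mean().item()
```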