StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

Authors: Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, Qibin Hou

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments
Researcher Affiliation | Collaboration | VCIP & TMCC, CS, Nankai University; ByteDance Inc.; NKIARI, Futian, Shenzhen
Pseudocode | Yes | "To make it clearer, we also show the pseudo code in Algorithm ?? in the Appendix." (A hedged sketch of the consistent self-attention idea appears after this table.)
Open Source Code | No | "We intend to make our code publicly available following the paper's acceptance."
Open Datasets | Yes | "Following the previous methods [12, 7], we use the WebVid-10M [2] dataset to train our transition video model." WebVid-10M [2] is a large-scale video dataset of 10 million video clips with associated textual descriptions, designed for training and evaluating models on video understanding and generation tasks. URL: www.robots.ox.ac.uk/~vgg/research/frozen-in-time/
Dataset Splits | No | The paper states "We randomly sample around 1000 videos as the test dataset" but does not specify explicit training/validation/test splits or mention a validation set.
Hardware Specification | Yes | "conduct training 100k iterations for our Semantic Motion Predictor on 8 A100 GPUs." (A hedged multi-GPU training sketch appears after this table.)
Software Dependencies | No | The paper mentions software such as Stable Diffusion XL, Stable Diffusion 1.5, and OpenCLIP ViT-H-14, but does not provide version numbers for these or other ancillary components.
Experiment Setup | Yes | "All comparison models utilize 50-step DDIM sampling [43], and the classifier-free guidance score [18] is consistently set to 5." "We then set our learning rate at 1e-4 and conduct training 100k iterations for our Semantic Motion Predictor on 8 A100 GPUs." (A hedged sampling-configuration sketch appears after this table.)
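
Since the pseudocode row only points at a broken cross-reference (Algorithm ??), the consistent self-attention mechanism named in the title is worth sketching. The following is a minimal, hedged PyTorch sketch of the idea as described in the paper: tokens sampled from the other images in a batch are appended to each image's keys and values so that self-attention is shared across the batch. The function name, the sampling ratio, and the projection modules are assumptions, not the authors' released code.

```python
# Hedged sketch of consistent self-attention, assuming a batch of per-image
# token features and plain scaled-dot-product attention. Names such as
# `consistent_self_attention` and `sample_ratio` are illustrative only.
import torch
import torch.nn.functional as F

def consistent_self_attention(tokens, to_q, to_k, to_v, sample_ratio=0.5):
    """tokens: (B, N, C) features of the B images generated in one batch.
    For each image, tokens sampled from the other images in the batch are
    appended to its keys/values, so attention is shared across the batch."""
    B, N, C = tokens.shape
    outputs = []
    for i in range(B):
        # Pool the tokens of all other images and sample a subset of them.
        others = torch.cat([tokens[j] for j in range(B) if j != i], dim=0)
        num_sampled = int(others.shape[0] * sample_ratio)
        idx = torch.randperm(others.shape[0], device=tokens.device)[:num_sampled]
        kv_tokens = torch.cat([tokens[i], others[idx]], dim=0)   # (N + S, C)
        q = to_q(tokens[i]).unsqueeze(0)                         # (1, N, C)
        k = to_k(kv_tokens).unsqueeze(0)                         # (1, N + S, C)
        v = to_v(kv_tokens).unsqueeze(0)
        outputs.append(F.scaled_dot_product_attention(q, k, v).squeeze(0))
    return torch.stack(outputs, dim=0)                           # (B, N, C)
```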
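For the hardware row, the paper fixes only the GPU count (8 A100s), the iteration budget (100k), and the learning rate (1e-4). A minimal sketch of how such a run is commonly wired up with PyTorch DistributedDataParallel follows; the model and dataloader are placeholders and the optimizer choice is an assumption, not the authors' training code.

```python
# Hedged sketch: 8-GPU data-parallel training for ~100k iterations at lr 1e-4,
# launched e.g. with `torchrun --nproc_per_node=8 train.py`. The model
# (Semantic Motion Predictor) and dataloader are placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, dataloader, max_steps=100_000, lr=1e-4):
    dist.init_process_group("nccl")
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
    model = DDP(model.to(device), device_ids=[device.index])
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # optimizer is assumed
    step = 0
    while step < max_steps:
        for batch in dataloader:
            # Placeholder: the wrapped model is assumed to return its loss.
            loss = model(**{k: v.to(device) for k, v in batch.items()})
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= max_steps:
                break
    dist.destroy_process_group()
```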
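For the comparison protocol in the experiment-setup row (50-step DDIM sampling with a classifier-free guidance scale of 5), a hedged sketch with the diffusers library could look like the following. The SDXL checkpoint id and the prompt are assumptions, and the paper's consistent self-attention hook is not included.

```python
# Minimal sketch: 50-step DDIM sampling with classifier-free guidance of 5,
# matching the comparison settings quoted above. Checkpoint id and prompt
# are assumptions; this is not the authors' released pipeline.
import torch
from diffusers import StableDiffusionXLPipeline, DDIMScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a man in a red shirt walking his dog in the park",  # hypothetical prompt
    num_inference_steps=50,   # 50-step DDIM sampling
    guidance_scale=5.0,       # classifier-free guidance score of 5
).images[0]
image.save("sample.png")
```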