StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
Authors: Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, Qibin Hou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments |
| Researcher Affiliation | Collaboration | (1) VCIP & TMCC, CS, Nankai University; (2) ByteDance Inc.; (3) NKIARI, Futian, Shenzhen |
| Pseudocode | Yes | To make it clearer, we also show the pseudo code in Algorithm ?? in the Appendix. (A hedged PyTorch sketch of the Consistent Self-Attention mechanism is given after this table.) |
| Open Source Code | No | We intend to make our code publicly available following the paper's acceptance. |
| Open Datasets | Yes | Following the previous methods [12, 7], we use the WebVid-10M [2] dataset to train our transition video model. WebVid-10M: WebVid-10M [2] is a large-scale video dataset featuring 10 million video clips with associated textual descriptions, designed for training and evaluating machine learning models on video understanding and generation tasks. URL: www.robots.ox.ac.uk/~vgg/research/frozen-in-time/ |
| Dataset Splits | No | The paper states 'We randomly sample around 1000 videos as the test dataset' but does not specify explicit training/validation/test splits or mention a validation set. |
| Hardware Specification | Yes | conduct training for 100k iterations for our Semantic Motion Predictor on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions software like Stable Diffusion XL, Stable Diffusion 1.5, and OpenCLIP ViT-H-14, but does not provide specific version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | All comparison models utilize 50-step DDIM sampling [43], and the classifier-free guidance score [18] is consistently set to 5. We then set our learning rate at 1e-4 and conduct training for 100k iterations for our Semantic Motion Predictor on 8 A100 GPUs. (See the sampling-configuration sketch after this table.) |
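
The pseudocode row above refers to an algorithm in the paper's appendix that is not reproduced here. As a rough illustration of the mechanism named in the title, the following is a minimal, single-head PyTorch sketch of Consistent Self-Attention as the paper describes it: each image in a batch attends over its own tokens plus a random sample of tokens from the other images in the batch, which lets subject features be shared across images. All names (`consistent_self_attention`, `sample_ratio`, the projection callables) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def consistent_self_attention(hidden_states, to_q, to_k, to_v, sample_ratio=0.5):
    """Single-head sketch of Consistent Self-Attention (illustrative names).

    hidden_states: (B, N, C) token features for a batch of B images that
    should depict a consistent subject. Tokens sampled from the *other*
    images in the batch are concatenated onto each image's key/value set.
    """
    B, N, C = hidden_states.shape
    n_sample = int(N * sample_ratio)

    extra_kv = []
    for i in range(B):
        # Pool tokens from all other images in the batch ...
        others = torch.cat([hidden_states[j] for j in range(B) if j != i], dim=0)
        # ... and randomly sample a subset of them.
        idx = torch.randperm(others.shape[0])[:n_sample]
        extra_kv.append(others[idx])
    extra_kv = torch.stack(extra_kv)  # (B, n_sample, C)

    q = to_q(hidden_states)                                 # (B, N, C)
    kv_input = torch.cat([hidden_states, extra_kv], dim=1)  # (B, N + n_sample, C)
    k, v = to_k(kv_input), to_v(kv_input)

    # Standard scaled dot-product attention over the augmented key/value set.
    return F.scaled_dot_product_attention(q, k, v)

if __name__ == "__main__":
    C = 64
    to_q, to_k, to_v = (torch.nn.Linear(C, C) for _ in range(3))
    x = torch.randn(4, 77, C)  # 4 images, 77 tokens each
    print(consistent_self_attention(x, to_q, to_k, to_v).shape)  # (4, 77, 64)
```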
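For the experiment-setup row, here is a minimal sketch of the reported inference configuration (50-step DDIM sampling, classifier-free guidance score of 5) expressed with the Hugging Face diffusers API. The SDXL checkpoint name and the prompt are assumptions; the paper names Stable Diffusion XL but not a specific checkpoint.

```python
import torch
from diffusers import StableDiffusionXLPipeline, DDIMScheduler

# Assumed checkpoint: the paper only says Stable Diffusion XL.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Swap in a DDIM scheduler to match the reported 50-step DDIM sampling.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a comic panel of the same character walking through a city",  # illustrative prompt
    num_inference_steps=50,  # 50-step DDIM sampling
    guidance_scale=5.0,      # classifier-free guidance score of 5
).images[0]
```

The Semantic Motion Predictor training details (learning rate 1e-4, 100k iterations, 8 A100 GPUs) are reported in the paper but depend on unreleased training code, so no sketch is attempted for them.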