LLM-grounded Video Diffusion Models
Authors: Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate LVD's ability to generate spatial layouts and temporal dynamics that align with the prompts, we propose a benchmark with five tasks, each requiring the understanding and generation of different spatial and temporal properties in the prompts. We show that LVD significantly improves the text-video alignment compared to several strong baseline models. We also evaluate LVD on common datasets such as UCF-101 (Soomro et al., 2012) and MSR-VTT (Xu et al., 2016) and conducted an evaluator-based assessment, where LVD shows consistent improvements over the base diffusion model that it uses under the hood. |
| Researcher Affiliation | Academia | ¹UC Berkeley ²UCSF {longlian,baifeng_shi,yala,trevordarrell,boyili}@berkeley.edu |
| Pseudocode | No | The paper describes the methodology with diagrams and text but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | We will also release code and benchmarks for reproducibility and future research. |
| Open Datasets | Yes | We also evaluate LVD on common datasets such as UCF-101 (Soomro et al., 2012) and MSR-VTT (Xu et al., 2016) |
| Dataset Splits | No | The paper describes using the training and test sets of UCF-101 and MSR-VTT for evaluation, but does not specify a distinct validation set split for model training or hyperparameter tuning. The method is training-free. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running experiments. |
| Software Dependencies | No | The paper mentions using "DPMSolver Multi Step scheduler" but does not provide specific version numbers for software components or libraries. |
| Experiment Setup | Yes | For DSL-grounded video generation, we use DPMSolver Multi Step scheduler (Lu et al., 2022a;b) to denoise 40 steps for each generation. We use the same hyperparams as the baselines, except that we employ DSL guidance. For DSL guidance, we scale our energy function by a factor of 5. We perform DSL guidance 5 times per step only in the first 10 steps to allow the model to freely adjust the details generated in the later steps. We apply a background weight of 4.0 and a foreground weight of 1.0 to each of the terms in the energy function, respectively. The k in Top-k was selected by counting 75% of the positions in the foreground/background in the corresponding term, inspired by previous work for image generation (Xie et al., 2023). E_CoM is weighted by 0.03 and added to E_topk to form the final energy function E. The energy terms for each object, frame, and cross-attention layer are averaged. The learning rate for the gradient descent follows 1 − ᾱ_t for each denoising step t, where the notation is introduced in Dhariwal & Nichol (2021). |
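
The Experiment Setup row quotes concrete DSL-guidance hyperparameters but the paper provides no pseudocode or released implementation, so the following is a minimal, hypothetical sketch of how those numbers could fit together. The function names (`topk_mean`, `dsl_energy`, `apply_dsl_guidance`), the exact normalization of the Top-k and center-of-mass terms, and the `energy_fn` stand-in (which is assumed to run the denoiser and average the energy over objects, frames, and cross-attention layers) are our assumptions, not the authors' code.

```python
import torch

# Illustrative constants taken directly from the quoted setup.
ENERGY_SCALE = 5.0        # energy function scaled by a factor of 5
GUIDANCE_REPEATS = 5      # DSL guidance applied 5 times per step
GUIDED_STEPS = 10         # guidance only in the first 10 denoising steps
FG_WEIGHT, BG_WEIGHT = 1.0, 4.0
TOPK_RATIO = 0.75         # k covers 75% of the foreground/background positions
COM_WEIGHT = 0.03         # weight of E_CoM added to E_topk


def topk_mean(values: torch.Tensor, ratio: float) -> torch.Tensor:
    """Mean of the top-k entries, with k a fixed fraction of the positions."""
    if values.numel() == 0:
        return values.new_zeros(())
    k = max(1, int(ratio * values.numel()))
    return values.flatten().topk(k).values.mean()


def dsl_energy(attn_map: torch.Tensor, box_mask: torch.Tensor) -> torch.Tensor:
    """E = E_topk + 0.03 * E_CoM for one object in one frame (assumed form).

    attn_map: (H, W) cross-attention map for the object's token(s), in [0, 1].
    box_mask: (H, W) binary mask of the LLM-proposed bounding box.
    """
    box = box_mask.bool()
    # Top-k term: encourage high attention inside the box, low attention outside.
    e_topk = (FG_WEIGHT * (1.0 - topk_mean(attn_map[box], TOPK_RATIO))
              + BG_WEIGHT * topk_mean(attn_map[~box], TOPK_RATIO))

    # Center-of-mass term: pull the attention centroid toward the box centroid.
    H, W = attn_map.shape
    ys = torch.arange(H, dtype=attn_map.dtype, device=attn_map.device)
    xs = torch.arange(W, dtype=attn_map.dtype, device=attn_map.device)

    def centroid(m: torch.Tensor) -> torch.Tensor:
        m = m / (m.sum() + 1e-8)
        return torch.stack([(m.sum(dim=1) * ys).sum(), (m.sum(dim=0) * xs).sum()])

    e_com = (centroid(attn_map) - centroid(box_mask.to(attn_map.dtype))).norm()
    return e_topk + COM_WEIGHT * e_com


def apply_dsl_guidance(latents, step_idx, alpha_bar_t, energy_fn):
    """Gradient descent on the latents for one denoising step (sketch only).

    energy_fn(latents) is assumed to run the denoiser, collect cross-attention
    maps, and average dsl_energy over objects, frames, and attention layers.
    """
    if step_idx >= GUIDED_STEPS:
        return latents
    lr = 1.0 - alpha_bar_t  # learning rate follows 1 - alpha_bar_t at step t
    for _ in range(GUIDANCE_REPEATS):
        latents = latents.detach().requires_grad_(True)
        energy = ENERGY_SCALE * energy_fn(latents)
        grad, = torch.autograd.grad(energy, latents)
        latents = (latents - lr * grad).detach()
    return latents
```

In an actual pipeline this latent update would be interleaved with the 40-step DPMSolver Multi Step schedule described in the quote, with guidance switched off after the first 10 steps.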