LLM-grounded Video Diffusion Models

Authors: Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate LVD's ability to generate spatial layouts and temporal dynamics that align with the prompts, we propose a benchmark with five tasks, each requiring the understanding and generation of different spatial and temporal properties in the prompts. We show that LVD significantly improves the text-video alignment compared to several strong baseline models. We also evaluate LVD on common datasets such as UCF-101 (Soomro et al., 2012) and MSR-VTT (Xu et al., 2016) and conduct an evaluator-based assessment, where LVD shows consistent improvements over the base diffusion model that it uses under the hood.
Researcher Affiliation | Academia | 1 UC Berkeley, 2 UCSF; {longlian,baifeng_shi,yala,trevordarrell,boyili}@berkeley.edu
Pseudocode | No | The paper describes the methodology with diagrams and text but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | We will also release code and benchmarks for reproducibility and future research.
Open Datasets | Yes | We also evaluate LVD on common datasets such as UCF-101 (Soomro et al., 2012) and MSR-VTT (Xu et al., 2016).
Dataset Splits | No | The paper describes using the training and test sets of UCF-101 and MSR-VTT for evaluation, but does not specify a distinct validation set split for model training or hyperparameter tuning. The method is training-free.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running experiments.
Software Dependencies | No | The paper mentions using the "DPMSolver Multi Step scheduler" but does not provide specific version numbers for software components or libraries.
Experiment Setup | Yes | For DSL-grounded video generation, we use the DPMSolver Multi Step scheduler (Lu et al., 2022a;b) to denoise 40 steps for each generation. We use the same hyperparams as the baselines, except that we employ DSL guidance. For DSL guidance, we scale our energy function by a factor of 5. We perform DSL guidance 5 times per step only in the first 10 steps to allow the model to freely adjust the details generated in the later steps. We apply a background weight of 4.0 and a foreground weight of 1.0 to each of the terms in the energy function, respectively. The k in Topk was selected by counting 75% of the positions in the foreground/background in the corresponding term, inspired by previous work for image generation (Xie et al., 2023). E_CoM is weighted by 0.03 and added to E_Topk to form the final energy function E. The energy terms for each object, frame, and cross-attention layer are averaged. The learning rate for the gradient descent follows √(1 − ᾱ_t) for each denoising step t, where the notation is introduced in Dhariwal & Nichol (2021).
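
Regarding the Software Dependencies row above: the paper names the scheduler but not library versions. Below is a minimal sketch, assuming the Hugging Face diffusers library, of how such a scheduler is commonly swapped into a text-to-video pipeline; the model ID is a placeholder and nothing here is taken from the authors' released configuration beyond the 40-step setting quoted in the table.

```python
# Minimal sketch (not the authors' code): using the DPM-Solver multi-step scheduler
# with a diffusers pipeline. The model ID is a placeholder assumption; the paper does
# not pin library versions.
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

pipe = DiffusionPipeline.from_pretrained("a-text-to-video-model-id")  # placeholder ID
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# 40 denoising steps per generation, as reported in the Experiment Setup row.
video = pipe("a brown bear walking from left to right", num_inference_steps=40)
```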
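
The Experiment Setup row packs several interacting hyperparameters, so a compact sketch may help show how they fit together. This is an illustration under stated assumptions, not the authors' implementation: the helper names (topk_energy, com_energy, attn_fn), the use of a single attention map per object, and the plain gradient-descent update are assumptions made for clarity, while the numeric values and the √(1 − ᾱ_t) learning-rate schedule come from the quote above. In the paper, the energy is averaged over objects, frames, and cross-attention layers, which is compressed here to one map.

```python
import torch

# Illustrative sketch of the reported DSL-guidance settings; not the released implementation.
CFG = dict(
    num_steps=40,     # DPM-Solver multi-step denoising steps per generation
    guided_steps=10,  # DSL guidance applied only in the first 10 steps
    repeats=5,        # guidance repetitions per guided denoising step
    scale=5.0,        # overall scale on the energy function
    fg_weight=1.0,    # foreground term weight
    bg_weight=4.0,    # background term weight
    topk_ratio=0.75,  # k = 75% of foreground/background positions
    com_weight=0.03,  # weight of the E_CoM term added to E_Topk
)

def topk_energy(attn, fg_mask, cfg):
    """Simplified E_Topk: encourage high attention inside the box, low attention outside."""
    fg, bg = attn[fg_mask], attn[~fg_mask]
    k_fg = max(1, int(cfg["topk_ratio"] * fg.numel()))
    k_bg = max(1, int(cfg["topk_ratio"] * bg.numel()))
    e_fg = cfg["fg_weight"] * (1.0 - fg.topk(k_fg).values.mean())
    e_bg = cfg["bg_weight"] * bg.topk(k_bg).values.mean()
    return e_fg + e_bg

def com_energy(attn, box_center):
    """Simplified E_CoM: pull the attention center of mass toward the box center."""
    h, w = attn.shape
    ys = torch.arange(h, dtype=attn.dtype).view(h, 1)
    xs = torch.arange(w, dtype=attn.dtype).view(1, w)
    total = attn.sum() + 1e-8
    com = torch.stack([(attn * xs).sum() / (total * w), (attn * ys).sum() / (total * h)])
    return (com - box_center).abs().sum()

def dsl_guidance(latents, attn_fn, fg_mask, box_center, alpha_bar_t, step, cfg=CFG):
    """One denoising step of DSL guidance (sketch). `attn_fn` is a placeholder callable
    that runs the video diffusion UNet on `latents` and returns a differentiable
    cross-attention map for the object of interest."""
    if step >= cfg["guided_steps"]:
        return latents  # later steps are left free to refine details
    lr = (1.0 - alpha_bar_t) ** 0.5  # learning rate follows sqrt(1 - alpha_bar_t)
    for _ in range(cfg["repeats"]):
        latents = latents.detach().requires_grad_(True)
        attn = attn_fn(latents)
        energy = cfg["scale"] * (topk_energy(attn, fg_mask, cfg)
                                 + cfg["com_weight"] * com_energy(attn, box_center))
        grad = torch.autograd.grad(energy, latents)[0]
        latents = (latents - lr * grad).detach()
    return latents
```

With a toy differentiable attn_fn, the loop runs as written; the substance of the actual method lies in reading cross-attention maps out of the video diffusion UNet and in the per-frame boxes produced by the LLM-generated dynamic scene layout.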