Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
LLM-grounded Video Diffusion Models
Authors: Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate LVD's ability to generate spatial layouts and temporal dynamics that align with the prompts, we propose a benchmark with five tasks, each requiring the understanding and generation of different spatial and temporal properties in the prompts. We show that LVD significantly improves the text-video alignment compared to several strong baseline models. We also evaluate LVD on common datasets such as UCF-101 (Soomro et al., 2012) and MSR-VTT (Xu et al., 2016) and conducted an evaluator-based assessment, where LVD shows consistent improvements over the base diffusion model that it uses under the hood. |
| Researcher Affiliation | Academia | ¹UC Berkeley ²UCSF EMAIL |
| Pseudocode | No | The paper describes the methodology with diagrams and text but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | We will also release code and benchmarks for reproducibility and future research. |
| Open Datasets | Yes | We also evaluate LVD on common datasets such as UCF-101 (Soomro et al., 2012) and MSR-VTT (Xu et al., 2016) |
| Dataset Splits | No | The paper describes using the training and test sets of UCF-101 and MSR-VTT for evaluation, but does not specify a distinct validation set split for model training or hyperparameter tuning. The method is training-free. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running experiments. |
| Software Dependencies | No | The paper mentions using "DPMSolver Multi Step scheduler" but does not provide specific version numbers for software components or libraries. |
| Experiment Setup | Yes | For DSL-grounded video generation, we use DPMSolver Multi Step scheduler (Lu et al., 2022a;b) to denoise 40 steps for each generation. We use the same hyperparams as the baselines, except that we employ DSL guidance. For DSL guidance, we scale our energy function by a factor of 5. We perform DSL guidance 5 times per step only in the first 10 steps to allow the model to freely adjust the details generated in the later steps. We apply a background weight of 4.0 and a foreground weight of 1.0 to each of the terms in the energy function, respectively. The k in Topk was selected by counting 75% of the positions in the foreground/background in the corresponding term, inspired by previous work for image generation (Xie et al., 2023). E_CoM is weighted by 0.03 and added to E_topk to form the final energy function E. The energy terms for each object, frame, and cross-attention layers are averaged. The learning rate for the gradient descent follows √(1 − α̂_t) for each denoising step t, where the notations are introduced in Dhariwal & Nichol (2021). |
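
The Experiment Setup row quotes several guidance hyperparameters in prose. As a reading aid, the sketch below collects them in one place and shows how the guidance schedule might be counted. It is not the authors' code: the class name `DSLGuidanceConfig`, the helper functions, the rounding rule for k, and the exact way the scale factor combines with the two energy terms are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DSLGuidanceConfig:
    """Hypothetical container for the values quoted in the Experiment Setup row."""
    num_denoising_steps: int = 40      # denoising steps per generation
    guided_steps: int = 10             # guidance is applied only in the first 10 steps
    guidance_iters_per_step: int = 5   # guidance updates per guided step
    energy_scale: float = 5.0          # factor applied to the energy function
    background_weight: float = 4.0     # weight on background terms
    foreground_weight: float = 1.0     # weight on foreground terms
    topk_fraction: float = 0.75        # share of positions counted for Topk
    com_weight: float = 0.03           # weight on E_CoM before adding to E_topk


def guidance_iterations(step_index: int, cfg: DSLGuidanceConfig) -> int:
    """Guidance updates at a 0-indexed denoising step (zero after the guided phase)."""
    return cfg.guidance_iters_per_step if step_index < cfg.guided_steps else 0


def topk_count(num_positions: int, cfg: DSLGuidanceConfig) -> int:
    """k as 75% of the region's positions; truncation is an assumed rounding rule."""
    return max(1, int(num_positions * cfg.topk_fraction))


def scaled_energy(e_topk: float, e_com: float, cfg: DSLGuidanceConfig) -> float:
    """Assumed combination: scale * (E_topk + com_weight * E_CoM)."""
    return cfg.energy_scale * (e_topk + cfg.com_weight * e_com)


if __name__ == "__main__":
    cfg = DSLGuidanceConfig()
    total = sum(guidance_iterations(i, cfg) for i in range(cfg.num_denoising_steps))
    print("total guidance updates:", total)
    print("k for a 100-position region:", topk_count(100, cfg))
```

Under these assumptions, guidance runs 5 times in each of the first 10 steps and not at all in the remaining 30, so the extra gradient computations are confined to the early part of sampling.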