Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search
Authors: Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our method improves the perceptual quality evaluated on the calibrated reward, VLMs, and human assessment, without model parameter update, and outputs the best generation compared to greedy search and best-of-N sampling under much more efficient computational cost. The experiments highlight that our method is beneficial to many capable generative models |
| Researcher Affiliation | Collaboration | ¹The University of Tokyo, ²Google DeepMind, EMAIL |
| Pseudocode | Yes | Algorithm 1: Diffusion Latent Beam Search (DLBS) with Stochastic DDIM. Algorithm 2: Lookahead (LA) with Deterministic DDIM. |
| Open Source Code | Yes | Code: https://github.com/shim0114/T2V-Diffusion-Search. Our implementation for the experiments is available at https://github.com/shim0114/T2V-Diffusion-Search. |
| Open Datasets | Yes | We select four prompt sets from two distinct datasets (see Appendix G). DEVIL [53] classifies its prompts into five categories... We also draw 30 random captions from the test split of MSRVTT [54], widely used as a video benchmark. Our experiments are based on the open-source dataset [53, 54, 57]. |
| Dataset Splits | Yes | We select four prompt sets from two distinct datasets (see Appendix G). DEVIL [53] classifies its prompts into five categories depending on the dynamics grade, each further divided by subject type... We also draw 30 random captions from the test split of MSRVTT [54], widely used as a video benchmark. |
| Hardware Specification | Yes | Latte: FP16 inference on a single NVIDIA A100 (40 GB), batch size 1. CogVideoX: BF16 inference on a single NVIDIA A100 (40 GB), batch size 1. Wan 2.1: FP16 inference on four NVIDIA H100s (80 GB each), batch size 1. |
| Software Dependencies | No | Latte: DDIM scheduler with a linear noise schedule ($\beta_{\text{start}} = 1.0 \times 10^{-4}$, $\beta_{\text{end}} = 2.0 \times 10^{-2}$) and classifier-free guidance scale $w_{\text{cfg}} = 7.5$. CogVideoX: DDIM scheduler with the original settings and $w_{\text{cfg}} = 6.0$. Wan 2.1: DPM-Solver++ with guidance scale $w_{\text{cfg}} = 5.0$. |
| Experiment Setup | Yes | We use the same prompts and Gemini-/GPT-calibrated rewards as in Section 4. We compare the following inference-time search methods with a noise level $\eta = 1.0$ for DDIM: ... Latte: DDIM scheduler with a linear noise schedule ($\beta_{\text{start}} = 1.0 \times 10^{-4}$, $\beta_{\text{end}} = 2.0 \times 10^{-2}$) and classifier-free guidance scale $w_{\text{cfg}} = 7.5$. |
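
The Latte settings quoted in the table (linear noise schedule, DDIM noise level η = 1.0, classifier-free guidance scale 7.5) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the 1000 training timesteps are an assumption not stated in the excerpts, and the formulas are the standard DDIM noise scale and the common classifier-free guidance combination.

```python
import math

def linear_betas(beta_start=1.0e-4, beta_end=2.0e-2, num_train_steps=1000):
    """Linearly spaced betas. num_train_steps=1000 is an assumption."""
    step = (beta_end - beta_start) / (num_train_steps - 1)
    return [beta_start + i * step for i in range(num_train_steps)]

def alphas_cumprod(betas):
    """Cumulative product of (1 - beta_t), usually written alpha-bar_t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def ddim_sigma(abar_t, abar_prev, eta=1.0):
    """Standard DDIM noise scale for a step from t to the previous timestep.

    eta = 0 gives the deterministic sampler; eta = 1 gives the fully
    stochastic one (the noise level quoted in the table).
    """
    return (eta
            * math.sqrt((1.0 - abar_prev) / (1.0 - abar_t))
            * math.sqrt(1.0 - abar_t / abar_prev))

def cfg_combine(eps_uncond, eps_cond, w_cfg=7.5):
    """Common classifier-free guidance form: uncond + w * (cond - uncond)."""
    return [u + w_cfg * (c - u) for u, c in zip(eps_uncond, eps_cond)]

betas = linear_betas()
abar = alphas_cumprod(betas)
# Noise scale for one hypothetical step between adjacent training timesteps.
sigma_stochastic = ddim_sigma(abar[500], abar[499], eta=1.0)
sigma_deterministic = ddim_sigma(abar[500], abar[499], eta=0.0)
```

With η = 0 the noise scale vanishes and each step is deterministic; with η = 1 fresh noise is injected at every step.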