Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models
Authors: Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments aim to answer the following questions: (i) How does DenseDPO perform against Vanilla DPO? (Sec. 4.2) (ii) Can we leverage existing VLMs to produce high-quality preference labels? (Sec. 4.3) (iii) What is the impact of each component in our framework? (Sec. 4.4) |
| Researcher Affiliation | Collaboration | Ziyi Wu1,2,3, Anil Kag1, Ivan Skorokhodov1, Willi Menapace1, Ashkan Mirzaei1, Igor Gilitschenski2,3, Sergey Tulyakov1, Aliaksandr Siarohin1; 1Snap Research, 2University of Toronto, 3Vector Institute |
| Pseudocode | Yes | Algorithm 1 Vanilla Paired Video Generation... Algorithm 2 Guided Paired Video Generation... E PyTorch-style Pseudo Code for Structural DPO and DenseDPO |
| Open Source Code | No | We do not have enough time to clean up the code at submission time. |
| Open Datasets | Yes | We curate a high-quality video dataset from existing large-scale video datasets [12, 81]. We mostly follow [58] to filter the length, visual quality, and motion score of videos... We utilize two benchmarks to evaluate the performance of text-to-video generation. VideoJAM-bench [8] contains 128 prompts focusing on real-world scenarios with challenging motion... |
| Dataset Splits | Yes | For Vanilla DPO, we randomly select 30k text prompts from the curated dataset, generate 2 videos of 5s per prompt with Algo. 1, and ask human labelers to annotate preferences. This leads to around 10k winning-losing pairs after removing ties. ... For Structural DPO, we use the same 30k prompts from Vanilla DPO... This again leads to around 10k winning-losing pairs. ... We utilize two benchmarks to evaluate the performance of text-to-video generation. VideoJAM-bench [8] contains 128 prompts... We also construct Motion Bench... resulting in 419 prompts. |
| Hardware Specification | Yes | We implement all models using PyTorch [54] and conduct training on 64 NVIDIA A100 GPUs, which takes around 16 hours. |
| Software Dependencies | No | We implement all models using PyTorch [54] and conduct training on 64 NVIDIA A100 GPUs, which takes around 16 hours. |
| Experiment Setup | Yes | Following prior works [46, 80], we set β to 500 and apply LoRA [27] with rank 128 to fine-tune the video model. We train with the AdamW optimizer [49] and a global batch size of 256 for 1000 steps. The peak learning rate is set to 1 × 10⁻⁵ and it is linearly warmed up from 0 in the first 250 steps. A gradient clipping of 1.0 is applied to stabilize training. |
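
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is only an illustration, not the authors' code (which is not released): the names `TrainConfig` and `lr_at_step` are hypothetical, the learning rate is assumed to stay constant at its peak after warm-up because the quoted text does not describe any decay, and the quote does not say whether the clipping threshold of 1.0 applies to the gradient norm or to individual values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    beta: float = 500.0           # DPO temperature β
    lora_rank: int = 128          # LoRA rank
    global_batch_size: int = 256
    total_steps: int = 1000
    peak_lr: float = 1e-5
    warmup_steps: int = 250
    grad_clip: float = 1.0        # norm vs. value clipping is not stated in the quote


def lr_at_step(step: int, cfg: TrainConfig) -> float:
    """Linear warm-up from 0 to peak_lr over the first warmup_steps steps.

    Assumption: the rate is held at peak_lr afterwards, since the quoted
    text does not describe a decay schedule.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < cfg.warmup_steps:
        return cfg.peak_lr * step / cfg.warmup_steps
    return cfg.peak_lr


if __name__ == "__main__":
    cfg = TrainConfig()
    for step in (0, 125, 250, cfg.total_steps - 1):
        print(step, lr_at_step(step, cfg))
```

In a PyTorch training loop, a function of this shape could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the multiplier `lr_at_step(step, cfg) / cfg.peak_lr` instead of the absolute rate.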