Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution
Authors: Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, Yulun Zhang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a 28× speed-up over existing methods such as MGLD-VSR. Code is available at: https://github.com/zhengchen1999/DOVE. ... 4 Experiments; 4.1 Experimental Settings; 4.2 Ablation Study; 4.3 Comparison with State-of-the-Art Methods |
| Researcher Affiliation | Collaboration | Zheng Chen¹, Zichen Zou², Kewei Zhang¹, Xiongfei Su³, Xin Yuan⁴, Yong Guo⁵, Yulun Zhang¹. ¹School of Computer Science, Shanghai Jiao Tong University; ²Zhiyuan College, Shanghai Jiao Tong University; ³China Mobile Research Institute; ⁴Westlake University; ⁵Huawei Consumer Business Group |
| Pseudocode | No | The paper describes the proposed method, DOVE, and its training strategy in Section 3, and a video processing pipeline in Section 3.3. These descriptions are in prose and illustrated with figures (Figure 2 and Figure 3), but no structured pseudocode blocks or algorithms are explicitly labeled or presented in a code-like format. |
| Open Source Code | Yes | Code is available at: https://github.com/zhengchen1999/DOVE. |
| Open Datasets | Yes | The training dataset comprises video and image datasets. The video dataset, HQ-VSR, includes 2,055 high-quality videos, and adopts the RealBasicVSR [5] degradation pipeline to synthesize LQ-HQ pairs. The image dataset is DIV2K [3], with 900 images, which follows the RealESRGAN [38] degradation process. For evaluation, we apply both synthetic and real-world datasets. The synthetic datasets include UDM10 [34], SPMCS [53], and YouHQ40 [63], using the same degradations as training. For real-world datasets, we apply RealVSR [51], MVSR4x [37], and VideoLQ [5]. We apply the proposed pipeline to the public dataset OpenVid-1M [25], which contains diverse scenes. |
| Dataset Splits | No | The training dataset comprises video and image datasets. The video dataset, HQ-VSR, includes 2,055 high-quality videos, and adopts the RealBasicVSR [5] degradation pipeline to synthesize LQ-HQ pairs. The image dataset is DIV2K [3], with 900 images, which follows the RealESRGAN [38] degradation process. For evaluation, we apply both synthetic and real-world datasets. The synthetic datasets include UDM10 [34], SPMCS [53], and YouHQ40 [63], using the same degradations as training. For real-world datasets, we apply RealVSR [51], MVSR4x [37], and VideoLQ [5]. The paper specifies distinct training and evaluation datasets, but does not provide explicit train/validation/test splits for the internal use of its training datasets (HQ-VSR, DIV2K). |
| Hardware Specification | Yes | Both stages are trained on 4 NVIDIA A800-80G GPUs with the total batch size 8. ... For fairness, all methods are measured running time on the same A100 GPU, generating a 33-frame 720 × 1280 video. |
| Software Dependencies | No | Our DOVE is based on the text-to-video model, CogVideoX1.5 [52]. ... We use the AdamW optimizer [21] with β1=0.9, β2=0.95, and β3=0.98. The paper mentions the optimizer and the base model, but does not provide specific version numbers for software libraries like PyTorch, TensorFlow, or Python. |
| Experiment Setup | Yes | Both stages are trained on 4 NVIDIA A800-80G GPUs with the total batch size 8. We use the AdamW optimizer [21] with β1=0.9, β2=0.95, and β3=0.98. In stage-1, training is conducted on video data. The videos have a resolution of 320 × 640 and a frame length of 25. The model is trained for 10,000 iterations with a learning rate of 2 × 10⁻⁵. In stage-2, both video and image data are used, with φ=0.8 (i.e., images comprising 80% of the input). All inputs have a resolution of 320 × 640. The model is trained for 500 iterations with a learning rate of 5 × 10⁻⁶. The loss weights λ1 and λ2 are set to 1. |
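
For readers who want the quoted experiment setup in machine-readable form, the two training stages can be written down as a small configuration sketch. This is not taken from the DOVE repository: the names `StageConfig`, `STAGE1`, `STAGE2`, and `pick_modality` are hypothetical, the even split of the total batch across GPUs is an assumption, and drawing each input independently with probability φ is only one plausible reading of "images comprising 80% of the input". The stage-2 frame length is not stated in the excerpt, so it is left unset.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
import random


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters quoted in the 'Experiment Setup' row (hypothetical container)."""
    name: str
    iterations: int
    learning_rate: float
    resolution: Tuple[int, int]      # 320 x 640 as quoted; axis order not interpreted
    image_ratio: float               # phi: fraction of inputs that are images
    num_frames: Optional[int] = None  # None = not stated in the quoted excerpt
    total_batch_size: int = 8
    num_gpus: int = 4                # 4 NVIDIA A800-80G GPUs

    @property
    def per_gpu_batch_size(self) -> int:
        # Assumption: the total batch is split evenly across GPUs.
        return self.total_batch_size // self.num_gpus

    @property
    def samples_seen(self) -> int:
        # Inputs processed over the whole stage.
        return self.iterations * self.total_batch_size


# Stage 1: video data only, 25 frames per clip.
STAGE1 = StageConfig("stage-1", 10_000, 2e-5, (320, 640),
                     image_ratio=0.0, num_frames=25)
# Stage 2: mixed video and image data with phi = 0.8.
STAGE2 = StageConfig("stage-2", 500, 5e-6, (320, 640), image_ratio=0.8)


def pick_modality(cfg: StageConfig, rng: random.Random) -> str:
    """Assumed sampling rule: each input is an image with probability phi."""
    return "image" if rng.random() < cfg.image_ratio else "video"


if __name__ == "__main__":
    for cfg in (STAGE1, STAGE2):
        print(cfg.name, cfg.per_gpu_batch_size, cfg.samples_seen)
```

Under these assumptions, stage 1 processes 10,000 × 8 = 80,000 inputs and stage 2 processes 500 × 8 = 4,000, of which roughly 3,200 would be images. The optimizer settings (β1, β2, β3) and loss weights are omitted from the sketch because the excerpt does not say how they map onto a specific library API.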