Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

Authors: Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments show that a standard 4× diffusion SR model wrapped in CoZ attains beyond 256× enlargement with high perceptual quality and fidelity. Project Page: https://bryanswkim.github.io/chain-of-zoom/.
Researcher Affiliation	Academia	Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye; KAIST AI; EMAIL
Pseudocode	Yes	C. Algorithms. The following algorithms are provided: Algorithm 1: the main algorithm for Chain-of-Zoom inference. Algorithm 2: the algorithm for GRPO-based human preference alignment training of VLMs.
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We use datasets that are open to access, and specific codes are provided. We further provide sufficient information for reproduction in the supplemental material.
Open Datasets	Yes	Evaluation is performed on the training datasets of DIV2K [1] and DIV8K [15], consisting of 800 images and 1500 images, respectively. We adopt the setup of prior work [49, 48] and train OSEDiff [48] as the backbone SR model with the LSDIR [24] dataset and 10K images from FFHQ [18].
Dataset Splits	Yes	The VLM model is GRPO fine-tuned using four NVIDIA GeForce RTX 3090 GPUs with the LSDIR dataset, with a train/validation split ratio of 0.01 (i.e., 849 images for validation).
Hardware Specification	Yes	We train using four NVIDIA GeForce RTX 3090 GPUs with the LSDIR [24] dataset and 10K images from FFHQ [18].
Software Dependencies	No	The paper mentions several models used, such as Stable Diffusion 3.0, Qwen2.5-VL-3B-Instruct, InternVL2.5-8B, and the SWIFT infrastructure. However, it does not explicitly provide specific version numbers for ancillary software components like Python, PyTorch, or CUDA, which are required for a 'Yes' answer according to the prompt.
Experiment Setup	Yes	We adopt the setup of prior work [49, 48] and train OSEDiff [48] as the backbone SR model with the LSDIR [24] dataset and 10K images from FFHQ [18]. We use Stable Diffusion 3.0 [13] as the backbone diffusion model and adopt a coarse-to-fine training strategy: first training on random degradation, and then training specifically for 4× magnifications. Coarse-to-fine training is used: random degradation (same setting as OSEDiff) for 25K iterations, then 4× specific upscaling for 20K iterations. Other settings (e.g., batch size, learning rate, etc.) follow the default settings of OSEDiff. Specifically, the Qwen2.5-VL-3B-Instruct model is LoRA fine-tuned (Rank: 8, Alpha: 32, Dropout: 0.05), with two generations per prompt for 10K global steps. Weights are given as: w_critic = 1.0, w_phrase = 0.5, w_rep = 0.5.
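
The hyperparameters quoted in the Experiment Setup and Dataset Splits rows can be collected into a small Python sketch for anyone attempting a reproduction. This is an illustration, not the authors' code: the class and function names are hypothetical, combining the three weights as a plain weighted sum is an assumption (the row lists only the weight values), and the total image count passed to `validation_size` is an assumed figure chosen to be consistent with the reported 849 validation images.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """LoRA values as reported in the Experiment Setup row."""
    rank: int = 8
    alpha: int = 32
    dropout: float = 0.05


@dataclass(frozen=True)
class RewardWeights:
    """Weights as reported in the row: w_critic, w_phrase, w_rep."""
    critic: float = 1.0
    phrase: float = 0.5
    rep: float = 0.5


def combined_reward(critic: float, phrase: float, rep: float,
                    weights: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of three reward terms.

    ASSUMPTION: the row gives only the weights; a linear combination
    is assumed here for illustration.
    """
    return (weights.critic * critic
            + weights.phrase * phrase
            + weights.rep * rep)


def validation_size(total_images: int, ratio: float = 0.01) -> int:
    """Held-out image count for a given split ratio, rounded down."""
    return math.floor(total_images * ratio)


if __name__ == "__main__":
    print(LoraSettings())
    print(combined_reward(critic=1.0, phrase=1.0, rep=1.0))
    # ASSUMPTION: 84_991 is an illustrative pool size, not stated in the row.
    print(validation_size(84_991))
```

Keeping the reported values as frozen dataclass defaults makes it easy to compare them against whatever configuration a released training script actually uses.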