Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion

Authors: Sungho Koh, SeungJu Cha, Hyunwoo Oh, Kwanyoung Lee, Dong-Jin Kim

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures. 4 Experiments 4.1 Experimental Settings Implementation Details. We evaluate our proposed method, ScaleDiff, on both FLUX [21] and SDXL [30] within an iterative 1024² → 2048² → 4096² generation pipeline. For FLUX, we use a noise timestep τ = 600, and a structure guidance strength of γt = t. This setup uses 30 denoising steps with a guidance scale of 3.5. For SDXL, we set τ = 400, and the structure guidance strength to γt = 1 αt. This configuration uses 50 denoising steps with a classifier-free guidance (CFG) [14] scale of 7.5. All experiments are conducted on a single NVIDIA A6000 GPU. Baselines. We compare our method with recent training-free methods (ScaleCrafter [12], HiDiffusion [49], DiffuseHigh [20], FreeScale [31], DemoFusion [8], AccDiffusion v2 [24]), super-resolution models (BSRGAN [48], OSEDiff [46]), and a training-based model UltraPixel [33]. Table 2: Quantitative comparison results. The best results are shown in bold, and the second best results are underlined. All time measurements are expressed in seconds. Evaluation. For quantitative evaluation, we randomly sample 1,000 image-text pairs from the LAION-5B [38] dataset and generate one image per prompt using each method. We compute the Fréchet Inception Distance (FID) [13], Kernel Inception Distance (KID) [4], and Inception Score (IS) [37] between generated images and real images. However, these metrics typically require resizing images to 299² pixels, thereby limiting the evaluation of fine-grained details. To better assess detail fidelity, we extract multiple patches from each image and calculate patch-level FID_p, KID_p, and IS_p following [8]. We also measure the CLIP Score [32] to evaluate text-image alignment.
Researcher Affiliation Academia Sungho Koh (Hanyang University); SeungJu Cha (Hanyang University); Hyunwoo Oh (Hanyang University); Kwanyoung Lee (Hanyang University); Dong-Jin Kim (Hanyang University)
Pseudocode Yes Algorithm 1 NPA: Query/Key/Value Patch Extraction
1: Input: Q, K, V ∈ R^(mh × nw × d) ▷ Full query, key, value tensor
2: Parameters: h, w ▷ Native height and width
3: Output: {Q_i}, i = 1..N ▷ Set of non-overlapping query patches
4: {K_i}, {V_i}, i = 1..N ▷ Set of overlapping key, value patches
5: N_r ← (mh − h/2) / (h/2) + 1 ▷ Number of patch rows
6: N_c ← (nw − w/2) / (w/2) + 1 ▷ Number of patch columns
7: N ← N_r · N_c ▷ Total number of patches
8: for i ← 1 to N do
9: hq_start ← ⌊i / N_r⌋ · h/2 ▷ Top-left coordinate of the query patch
10: wq_start ← (i mod N_r) · w/2
11: hq_end ← hq_start + h/2 ▷ Bottom-right coordinate of the query patch
12: wq_end ← wq_start + w/2
13: hkv_start ← clamp(hq_start − h/4, 0, sh − h) ▷ Center K/V patch around query patch
14: wkv_start ← clamp(wq_start − w/4, 0, sw − w) ▷ Clamp for window shifting at the edge
15: hkv_end ← hkv_start + h
16: wkv_end ← wkv_start + w
17: Q_i ← Q[hq_start : hq_end, wq_start : wq_end, :] ▷ Non-overlapping query patch extraction
18: K_i ← K[hkv_start : hkv_end, wkv_start : wkv_end, :] ▷ Overlapping key, value patch extraction
19: V_i ← V[hkv_start : hkv_end, wkv_start : wkv_end, :]
20: end for
21: return {Q_i}, {K_i}, {V_i}, i = 1..N
Open Source Code No Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We will release our code after submission.
Open Datasets Yes For quantitative evaluation, we randomly sample 1,000 image-text pairs from the LAION-5B [38] dataset and generate one image per prompt using each method. We compute the Fréchet Inception Distance (FID) [13], Kernel Inception Distance (KID) [4], and Inception Score (IS) [37] between generated images and real images.
Dataset Splits No Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We do not train any models. Details about evaluation are provided in section 3.1. We cannot share the subset used for evaluation because we do not hold the rights.
Hardware Specification Yes All experiments are conducted on a single NVIDIA A6000 GPU.
Software Dependencies No The paper does not explicitly provide specific version numbers for software dependencies beyond the pre-trained models themselves.
Experiment Setup Yes 4.1 Experimental Settings Implementation Details. We evaluate our proposed method, ScaleDiff, on both FLUX [21] and SDXL [30] within an iterative 1024² → 2048² → 4096² generation pipeline. For FLUX, we use a noise timestep τ = 600, and a structure guidance strength of γt = t. This setup uses 30 denoising steps with a guidance scale of 3.5. For SDXL, we set τ = 400, and the structure guidance strength to γt = 1 αt. This configuration uses 50 denoising steps with a classifier-free guidance (CFG) [14] scale of 7.5.
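
The patch extraction quoted in the Pseudocode row can be illustrated with a small index-only sketch. This is not the authors' code (the paper states code will be released after submission). It assumes a 0-based, row-major patch index, a feature map of height H and width W that are multiples of h/2 and w/2, and h, w divisible by 4; the names `npa_windows`, `clamp`, and `Box` are hypothetical. It returns, for each patch, the non-overlapping query box and the clamped key/value window centered on it, which a caller could then use to slice Q, K, and V.

```python
from typing import List, Tuple

# (row_start, row_end, col_start, col_end), end-exclusive like Python slices
Box = Tuple[int, int, int, int]


def clamp(x: int, lo: int, hi: int) -> int:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def npa_windows(H: int, W: int, h: int, w: int) -> List[Tuple[Box, Box]]:
    """Return (query_box, kv_box) pairs for an H x W feature map.

    Query boxes have size (h/2, w/2) and tile the map without overlap.
    Key/value boxes have size (h, w), are centered on the query box,
    and are shifted inward at the borders so they stay inside the map.
    Assumption: patches are indexed 0-based in row-major order.
    """
    qh, qw = h // 2, w // 2
    n_rows = (H - qh) // qh + 1  # number of patch rows
    n_cols = (W - qw) // qw + 1  # number of patch columns
    windows = []
    for i in range(n_rows * n_cols):
        r, c = divmod(i, n_cols)
        hq, wq = r * qh, c * qw                  # top-left of the query box
        hkv = clamp(hq - h // 4, 0, H - h)       # center K/V window, clamp at edges
        wkv = clamp(wq - w // 4, 0, W - w)
        windows.append(((hq, hq + qh, wq, wq + qw),
                        (hkv, hkv + h, wkv, wkv + w)))
    return windows


if __name__ == "__main__":
    # Toy example: an 8 x 8 map with a native size of 4 x 4.
    for q_box, kv_box in npa_windows(8, 8, 4, 4)[:3]:
        print(q_box, kv_box)
```

Each query position attends only to a native-size key/value window, so the per-patch attention cost depends on h and w and not on the full upscaled resolution; the clamping keeps border windows at full size instead of truncating them.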