Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Hierarchical Koopman Diffusion: Fast Generation with Interpretable Diffusion Trajectory
Authors: Hanru Bai, Weiyang Ding, Difan Zou
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted experiments on the CIFAR-10 and FFHQ datasets to demonstrate the competitive one-step generation performance of our proposed framework. Beyond generation quality, we interpret the underlying generative dynamics via principled spectral analysis, revealing a quantitative correspondence between spectral components and semantic image attributes. Moreover, we further justify the interpretability of our framework through an image editing experiment that targets frequency-specific intervention at the intermediate stage of the diffusion trajectory. An ablation study was finally performed to show the key contributions of our framework (Sec. 4.4). Extensive results highlight both the generation quality and interpretability advantages of our framework, distinguishing it from distillation- and consistency-model-based paradigms. |
| Researcher Affiliation | Academia | Hanru Bai Fudan University EMAIL Weiyang Ding Fudan University EMAIL Difan Zou The University of Hong Kong EMAIL |
| Pseudocode | Yes | Algorithm 1 The training algorithm. 1: Inputs: Trajectory $\{x_t\}_{t \in [0,T]}$ from a well-trained diffusion model; the number of samples $s$. 2: Outputs: Network parameters $\theta$ and $\phi$; Koopman matrices $\{A^{(l)}\}_{l=1}^{L}$. 3: Initialize $\theta$ and $\phi$ by a pre-trained U-Net. 4: Initialize $A^{(l)} \leftarrow O$, $l = 1, \dots, L$. 5: for $i = 0$ to num_iter $- 1$ do. 6: Sample $S_t = \{t_i \mid t_i \sim U[0, T]\}_{i=1}^{s-1} \cup \{T\}$ (uniformly sample the intermediate time). 7: Reconstruct $\hat{x}_\epsilon = D_\phi(\{e^{(\epsilon - t)A^{(l)}} E_\theta^{(l)}(x_t)\}_{l=1}^{L})$ for $t \in S_t$ (apply the HKD model). 8: Let $\mathcal{L} = \sum \big[ \lVert \hat{x}_\epsilon - x_0 \rVert + \lVert F(\hat{x}_\epsilon) - F(x_0) \rVert \big]$ ($F$ is the feature extractor in LPIPS). 9: Update $\theta$, $\phi$ and $A^{(l)}$ by the gradients of $\mathcal{L}$. 10: end for. 11: Return $\theta$, $\phi$ and $\{A^{(l)}\}_{l=1}^{L}$. |
| Open Source Code | No | After organizing the code, we will release the code to support full reproducibility. Details for reproducing results are described in the Appendix. |
| Open Datasets | Yes | We conducted experiments to present the promising one-step generation capability of our framework compared with other one-step baseline methods (Sec. 4.1). Notably, beyond generating high-quality images, we further provide empirical insights toward understanding the underlying dynamics of diffusion models through Koopman spectral analysis (Sec. 4.2). This analysis demonstrates the dynamical interpretability of our approach and further enables controllable image editing (Sec. 4.3). An ablation study was finally performed to show the key contributions of our framework (Sec. 4.4). Extensive results highlight both the generation quality and interpretability advantages of our framework, distinguishing it from distillation- and consistency-model-based paradigms. |
| Dataset Splits | No | We conducted experiments to present the promising one-step generation capability of our framework compared with other one-step baseline methods (Sec. 4.1). Notably, beyond generating high-quality images, we further provide empirical insights toward understanding the underlying dynamics of diffusion models through Koopman spectral analysis (Sec. 4.2). This analysis demonstrates the dynamical interpretability of our approach and further enables controllable image editing (Sec. 4.3). An ablation study was finally performed to show the key contributions of our framework (Sec. 4.4). Extensive results highlight both the generation quality and interpretability advantages of our framework, distinguishing it from distillation- and consistency-model-based paradigms. |
| Hardware Specification | Yes | Our method achieved comparable performance within just 2-3 days on 8 V100 GPUs. ... We conduct experiments using 8 NVIDIA V100 GPUs, with a batch size of 256 for the CIFAR-10 dataset and 64 for the FFHQ dataset. |
| Software Dependencies | No | For network architecture, we adopt the mature U-Net encoder and decoder backbone used in diffusion models for both the encoder and decoder modules in our formulation to leverage existing diffusion models' design efficacy. To improve training efficiency and reduce mode collapse, we further initialize the encoder and decoder with pre-trained weights from diffusion models, which provide structured latent-to-output mappings, and retain rich hierarchical features acquired during large-scale training. The entire model, including $E_\theta$, $D_\phi$, and $\{A^{(l)}\}_{l=1}^{L}$, is trained end-to-end using the loss $\mathcal{L}_{\text{total}}$. |
| Experiment Setup | Yes | All models were trained using the Adam optimizer [11] with a constant learning rate of $1 \times 10^{-3}$ and a weight decay of 0.95. ... We conduct experiments using 8 NVIDIA V100 GPUs, with a batch size of 256 for the CIFAR-10 dataset and 64 for the FFHQ dataset. The loss function weights are set as $\lambda_2 = 1$ and $\lambda_1 = 10^{-3 (\text{current\_epoch}/\text{overall\_epoch})}$, where $\lambda_1$ decays exponentially over the course of training. For the implementation of the trajectory consistency loss, we employ a Monte Carlo sampling strategy: in each training iteration, four intermediate timesteps are randomly sampled uniformly, and the loss is computed as the average over these sampled time points. |
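
The Experiment Setup row describes an exponentially decaying loss weight and a Monte Carlo choice of timesteps. The sketch below illustrates both in plain Python. It is not the authors' code, which has not been released: the function names (`lambda1`, `sample_timesteps`, `average_loss`) are hypothetical, and the convention of appending the terminal time T to the uniformly drawn samples is taken from step 6 of the quoted Algorithm 1, so that four intermediate timesteps would correspond to s = 5 under that reading.

```python
import random


def lambda1(current_epoch: int, overall_epoch: int) -> float:
    """Decaying weight 10 ** (-3 * current_epoch / overall_epoch).

    Equals 1 at the start of training and 1e-3 at the end.
    """
    return 10.0 ** (-3.0 * current_epoch / overall_epoch)


def sample_timesteps(T: float, s: int, rng: random.Random) -> list[float]:
    """Draw s - 1 intermediate times uniformly from [0, T], then append T.

    Assumption: this mirrors step 6 of the quoted Algorithm 1.
    """
    times = [rng.uniform(0.0, T) for _ in range(s - 1)]
    times.append(T)
    return times


def average_loss(per_time_losses: list[float]) -> float:
    """Monte Carlo estimate: the mean of the losses at the sampled times."""
    return sum(per_time_losses) / len(per_time_losses)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for a repeatable illustration only
    times = sample_timesteps(T=1.0, s=5, rng=rng)
    print("sampled times:", times)
    print("lambda1 at the halfway point:", lambda1(50, 100))
```

The per-timestep losses passed to `average_loss` would come from the model itself; they are left abstract here because the excerpt does not say which loss term each weight multiplies.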