Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

Authors: Songhua Liu, Zhenxiong Tan, Xinchao Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments indicate that by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to those of the teacher model. Simultaneously, it reduces attention computations by 99.5% and accelerates generation by 6.3 times for generating 8K-resolution images. We quantitatively study the proposed method on the validation set of COCO2014 [37] and randomly sample 5,000 images along with their prompts for evaluation. Following conventions [53, 33, 60, 66], we consider FID [24], LPIPS [70], CLIP image similarity [47], and DINO image similarity [4] in this setting as metrics.
Researcher Affiliation | Academia | Songhua Liu, Zhenxiong Tan, and Xinchao Wang. School of Artificial Intelligence, Shanghai Jiao Tong University; National University of Singapore. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology and training objectives using mathematical equations (Eq. 1-7) and textual descriptions, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Models and codes are available here. We attach the source codes to reproduce the main experimental results as supplementary materials.
Open Datasets | Yes | We quantitatively study the proposed method on the validation set of COCO2014 [37] and randomly sample 5,000 images along with their prompts for evaluation.
Dataset Splits | Yes | We fine-tune parameters in attention layers on 10K samples with 1024×1024 resolution generated by FLUX-1.dev itself for 10K iterations under a total batch size 32. We quantitatively study the proposed method on the validation set of COCO2014 [37] and randomly sample 5,000 images along with their prompts for evaluation. [...] we also benchmark our method against the results by the original DiT using consistent random seeds. [...] We additionally deploy our method on DiT models other than FLUX used in the main manuscript to demonstrate the universality of the proposed CLEAR. Here, we consider Stable Diffusion 3.5-Large [17] (SD3.5-L), another state-of-the-art text-to-image generation DiT. We use the default setting of r = 16, which yields the best trade-off between quality and efficiency according to our experiments. Results on the COCO2014 validation dataset are shown in Tab. 16.
Hardware Specification | Yes | The training is conducted on 4 H100 GPUs supported by DeepSpeed ZeRO-2 [48], which takes 1 day to finish. Unless otherwise specified, all inference is conducted on a single H100 GPU.
Software Dependencies | No | Leveraging FlexAttention in PyTorch [43], CLEAR, as a sparse attention mechanism, can be efficiently implemented on GPUs with low-level optimizations. [...] Other hyper-parameters, including schedulers, optimizers, etc., follow the default settings provided by Diffusers [57]. [...] The training is conducted on 4 H100 GPUs supported by DeepSpeed ZeRO-2 [48].
Experiment Setup | Yes | We fine-tune parameters in attention layers on 10K samples with 1024×1024 resolution generated by FLUX-1.dev itself for 10K iterations under a total batch size 32 using the loss function defined in Eq. 6. L_attn is applied on single_transformer_blocks of FLUX, whose layer indices are 20-57. Following previous works on architectural distillation for diffusion models [32, 39], both hyper-parameters α and β are set as 0.5. Other hyper-parameters, including schedulers, optimizers, etc., follow the default settings provided by Diffusers [57].
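
The numbers quoted in the Experiment Setup and Hardware Specification rows can be turned into a small training-budget calculation. This is a minimal sketch, not the authors' code: the class name `FineTuneBudget` and its field names are hypothetical, and the per-GPU batch size assumes the total batch of 32 is split evenly across the 4 GPUs with no gradient accumulation, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneBudget:
    """Hypothetical container for the fine-tuning settings quoted above."""

    num_samples: int = 10_000      # self-generated training images
    resolution: int = 1024         # images are 1024x1024
    iterations: int = 10_000       # optimizer steps
    total_batch_size: int = 32     # summed over all GPUs
    num_gpus: int = 4              # H100 GPUs used for training
    alpha: float = 0.5             # loss weight quoted for Eq. 6
    beta: float = 0.5              # loss weight quoted for Eq. 6

    @property
    def per_gpu_batch_size(self) -> int:
        # Assumption: even split across GPUs, no gradient accumulation.
        if self.total_batch_size % self.num_gpus != 0:
            raise ValueError("total batch size must divide evenly across GPUs")
        return self.total_batch_size // self.num_gpus

    @property
    def samples_seen(self) -> int:
        # Total number of training examples consumed over the whole run.
        return self.iterations * self.total_batch_size

    @property
    def passes_over_data(self) -> float:
        # How many times the training set is traversed on average.
        return self.samples_seen / self.num_samples


if __name__ == "__main__":
    budget = FineTuneBudget()
    print("per-GPU batch size:", budget.per_gpu_batch_size)
    print("samples seen:", budget.samples_seen)
    print("passes over data:", budget.passes_over_data)
```

Under these assumptions each GPU processes 8 samples per step, and the 10K-sample training set is traversed about 32 times over the 10K iterations.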