Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
SANA: Efficient High-Resolution Text-to-Image Synthesis with Linear Diffusion Transformers
Authors: Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use five mainstream evaluation metrics to evaluate the performance of our Sana, namely FID, Clip Score, GenEval (Ghosh et al., 2024), DPG-Bench (Hu et al., 2024), and ImageReward (Xu et al., 2024), comparing it with SOTA methods. FID and Clip Score are evaluated on the MJHQ-30K (Li et al., 2024a) dataset, which contains 30K images from Midjourney. |
| Researcher Affiliation | Collaboration | ¹NVIDIA, ²MIT, ³Tsinghua University |
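| Pseudocode | Yes | Algorithm 1 Flow-DPM-Solver (Modified from DPM-Solver++). Require: initial value $x_T$, time steps $\{t_i\}_{i=0}^{M}$, data prediction model $x_\theta$, velocity prediction model $v_\theta$, timestep shift factor $s$. 1: Denote $h_i := \lambda_{t_i} - \lambda_{t_{i-1}}$ for $i = 1, \dots, M$. 2: $\sigma_{t_i} = \frac{s \cdot \sigma_{t_i}}{1 + (s-1) \cdot \sigma_{t_i}}$, $\alpha_{t_i} = 1 - \sigma_{t_i}$ ▹ Hyper-parameter and time-step transformation. 3: $x_\theta(\tilde{x}_{t_i}, t_i) = \tilde{x}_{t_i} - \sigma_{t_i} v_\theta(\tilde{x}_{t_i}, t_i)$ ▹ Model output transformation. 4: $\tilde{x}_{t_0} \leftarrow x_T$. Initialize an empty buffer $Q$. 5: $Q \xleftarrow{\text{buffer}} x_\theta(\tilde{x}_{t_0}, t_0)$. 6: $\tilde{x}_{t_1} \leftarrow \frac{\sigma_{t_1}}{\sigma_{t_0}} \tilde{x}_{t_0} - \alpha_{t_1} \left(e^{-h_1} - 1\right) x_\theta(\tilde{x}_{t_0}, t_0)$. 7: $Q \xleftarrow{\text{buffer}} x_\theta(\tilde{x}_{t_1}, t_1)$. 8: for $i = 2$ to $M$ do. 9: $r_i \leftarrow \frac{h_{i-1}}{h_i}$. 10: $D_i \leftarrow \left(1 + \frac{1}{2 r_i}\right) x_\theta(\tilde{x}_{t_{i-1}}, t_{i-1}) - \frac{1}{2 r_i} x_\theta(\tilde{x}_{t_{i-2}}, t_{i-2})$. 11: $\tilde{x}_{t_i} \leftarrow \frac{\sigma_{t_i}}{\sigma_{t_{i-1}}} \tilde{x}_{t_{i-1}} - \alpha_{t_i} \left(e^{-h_i} - 1\right) D_i$. 12: if $i < M$ then. 13: $Q \xleftarrow{\text{buffer}} x_\theta(\tilde{x}_{t_i}, t_i)$. 14: end if. 15: end for. 16: return $\tilde{x}_{t_M}$. |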
| Open Source Code | No | Code and model will be publicly released. |
| Open Datasets | Yes | FID and Clip Score are evaluated on the MJHQ-30K (Li et al., 2024a) dataset, which contains 30K images from Midjourney. |
| Dataset Splits | No | The paper evaluates on the MJHQ-30K dataset but does not explicitly describe the training, validation, or test splits used for this dataset. It mentions training steps and resolutions but not data partitioning. |
| Hardware Specification | Yes | Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024x1024 resolution image. It takes only 0.37s to generate a 1024x1024 resolution image on a customer-grade 4090 GPU, providing a powerful foundation model for real-time image generation. The speed is tested on one A100 GPU with FP16 Precision. |
| Software Dependencies | No | The paper mentions using Triton (Tillet et al., 2019) and CUDA C++ for kernel implementation but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We train all the models with the same training setting with 52K iterations. [...] multi-stage training strategy to improve training stability, which involving finetune our AE-F32C32 on 1024 × 1024 images [...] we discover a useful trick that further accelerates model convergence by initializing a small learnable scale factor (e.g., 0.01) and multiplying it by the text embedding. [...] This adaptation occurs within merely 10K training steps, using a total batch size of 1024. |
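
The Flow-DPM-Solver pseudocode quoted in the Pseudocode row can be sketched in plain Python to make the control flow concrete. This is a minimal illustration under several assumptions, not the authors' implementation: it works on scalars rather than image latents, it takes $\lambda_t = \log(\alpha_t / \sigma_t)$ as in DPM-Solver++, it passes the shifted noise level $\sigma$ to the model as its timestep, and the names `shift_sigma`, `flow_dpm_solver`, and `toy_velocity`, the shift value `s=3.0`, and the noise schedule are all hypothetical choices made for the example. The noise levels must lie strictly between 0 and 1 so that $\lambda$ stays finite.

```python
import math
from typing import Callable, List


def shift_sigma(sigma: float, s: float) -> float:
    """Step 2 of the pseudocode: sigma <- s*sigma / (1 + (s-1)*sigma)."""
    return s * sigma / (1.0 + (s - 1.0) * sigma)


def flow_dpm_solver(
    x_T: float,
    sigmas: List[float],
    v_theta: Callable[[float, float], float],
    s: float = 3.0,  # illustrative value only, not taken from the page
) -> float:
    """Scalar sketch of the quoted Algorithm 1 (assumed lambda = log(alpha/sigma))."""
    if len(sigmas) < 2 or not all(0.0 < g < 1.0 for g in sigmas):
        raise ValueError("need at least two noise levels strictly inside (0, 1)")

    # Step 2: shift the noise levels, then alpha = 1 - sigma.
    sig = [shift_sigma(g, s) for g in sigmas]
    alpha = [1.0 - g for g in sig]
    lam = [math.log(a / g) for a, g in zip(alpha, sig)]
    M = len(sig) - 1

    def x_theta(x: float, i: int) -> float:
        # Step 3: convert the velocity prediction into a data prediction.
        return x - sig[i] * v_theta(x, sig[i])

    # Steps 4-5: initialise the state and the buffer of data predictions.
    x = x_T
    buffer = [x_theta(x, 0)]

    # Step 6: first-order update; expm1(-h) equals e^{-h} - 1.
    h_prev = lam[1] - lam[0]
    x = (sig[1] / sig[0]) * x - alpha[1] * math.expm1(-h_prev) * buffer[-1]
    buffer.append(x_theta(x, 1))  # step 7

    # Steps 8-15: second-order multistep updates.
    for i in range(2, M + 1):
        h = lam[i] - lam[i - 1]
        r = h_prev / h
        D = (1.0 + 1.0 / (2.0 * r)) * buffer[-1] - (1.0 / (2.0 * r)) * buffer[-2]
        x = (sig[i] / sig[i - 1]) * x - alpha[i] * math.expm1(-h) * D
        if i < M:
            buffer.append(x_theta(x, i))
        h_prev = h
    return x  # step 16


# Toy check: if all data equals X0, the exact velocity is (x - X0) / sigma.
X0 = 2.0


def toy_velocity(x: float, sigma: float) -> float:
    return (x - X0) / sigma


M = 10
sigmas = [0.999 + (0.001 - 0.999) * i / M for i in range(M + 1)]
sample = flow_dpm_solver(0.5, sigmas, toy_velocity, s=3.0)
```

With this toy velocity field the data prediction is the constant `X0` at every step, so `sample` ends close to `X0`, with a residual governed by the final shifted noise level. A real sampler would replace the scalars with tensors and `toy_velocity` with a trained network.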