Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation
Authors: Yao Teng, Fu-Yun Wang, Xian Liu, Zhekai Chen, Han Shi, Yu Wang, Zhenguo Li, Weiyang Liu, Difan Zou, Xihui Liu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our method, we conduct experiments on two open-source large autoregressive models, Lumina-mGPT [11] and Emu3 [4]. Experimental results demonstrate that our method reduces the number of forward passes by about 4× on Lumina-mGPT and more than 5× on Emu3 and thus achieves latency speedup by more than 2×. We further verify image quality to show that our method accelerates autoregressive text-to-image generation without compromising image quality. Our SJD2 significantly reduces the steps for autoregressive text-to-image generation while maintaining visual quality, as demonstrated by our evaluation on MS-COCO [65] validation sets in Table 1. We also find that our method achieves a higher step compression ratio on Emu3 [4] than that on Lumina-mGPT [11]. For visual quality, we further compare our method to autoregressive decoding and SJD [6] on the GenEval benchmark [68] with Lumina-mGPT [11] as the baseline in Table 2. We perform experiments for our method with Lumina-mGPT as the baseline and on one RTX 4090 by default. The selected 100 prompts used in Table 3 are for the evaluation in the ablation studies. |
| Researcher Affiliation | Collaboration | Yao Teng¹, Fuyun Wang², Xian Liu², Zhekai Chen¹, Han Shi³, Yu Wang⁴, Zhenguo Li³, Weiyang Liu², Difan Zou¹, Xihui Liu¹. ¹The University of Hong Kong, ²CUHK, ³Huawei Noah's Ark Lab, ⁴Tsinghua University |
| Pseudocode | No | The paper describes the decoding process in detail, but it does not present it in a structured pseudocode or algorithm block. For example, Section 4.2 describes "Speculative Jacobi-Denoising Decoding" step-by-step but not in a pseudocode format. |
| Open Source Code | No | Justification: We only utilized publicly available datasets. We will release the code and models as soon as possible. |
| Open Datasets | Yes | To validate our method, we conduct experiments on two open-source large autoregressive models, Lumina-mGPT [11] and Emu3 [4]. Our SJD2 significantly reduces the steps for autoregressive text-to-image generation while maintaining visual quality, as demonstrated by our evaluation on MS-COCO [65] validation sets in Table 1. For visual quality, we further compare our method to autoregressive decoding and SJD [6] on the GenEval benchmark [68] with Lumina-mGPT [11] as the baseline in Table 2. |
| Dataset Splits | Yes | Our SJD2 significantly reduces the steps for autoregressive text-to-image generation while maintaining visual quality, as demonstrated by our evaluation on MS-COCO [65] validation sets in Table 1. Configuration COCO2017 (5k) COCO2014 (30k). |
| Hardware Specification | Yes | For each fine-tuning, 8 GPUs with 80G memory are required for each model. We tune each model only within 6 epochs, costing approximately 14 × 8 A100 hours for Lumina-mGPT and 26 × 8 H100 hours for Emu3. We perform experiments for our method with Lumina-mGPT as the baseline and on one RTX 4090 by default. According to the latency reported in Table 3, our method is still faster than other decoding methods on the real server by more than 2×. Table 3: The computational cost of decoding methods on a subset of COCO prompts. ... Lumina-mGPT [11] ... 17G ... Emu3 [4] ... 20G ... SJD2 ... 20G ... SJD2 ... 23G. |
| Software Dependencies | No | The paper mentions "DeepSpeed ZeRO-3 or FSDP with gradient checkpointing to save GPU memory at the cost of increased training time. The global batch size is set to 64, and the learning rate is set to 2 × 10⁻⁵ with the AdamW optimizer." However, specific version numbers for these or other key software components (like Python, PyTorch, CUDA, etc.) are not provided. |
| Experiment Setup | Yes | The global batch size is set to 64, and the learning rate is set to 2 × 10⁻⁵ with the AdamW optimizer. We tune each model only within 6 epochs, costing approximately 14 × 8 A100 hours for Lumina-mGPT and 26 × 8 H100 hours for Emu3. By default, we set classifier-free guidance to 3.0 and use top-2000 for the quantitative results of our method. By default, the length of the Jacobi window of our method is set to 96 for Lumina-mGPT and 128 for Emu3. The number of denoising steps in SJD2 is set to 25. |
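
The setup values quoted in the table can be collected into a small configuration record, which may help when attempting a reproduction. The following is only a sketch: the class and field names (`FineTuneSetup`, `wall_clock_hours`, and so on) are hypothetical and do not come from the paper or its code, and it assumes the reported training cost means roughly 14 hours on 8 A100s for Lumina-mGPT and 26 hours on 8 H100s for Emu3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSetup:
    """Hypothetical container for the hyperparameters quoted in the table."""
    model: str
    gpu_type: str
    num_gpus: int             # 8 GPUs with 80G memory per fine-tuning run
    wall_clock_hours: float   # assumption: the first factor of the reported cost
    jacobi_window: int        # Jacobi window length used at inference
    global_batch_size: int = 64
    learning_rate: float = 2e-5
    optimizer: str = "AdamW"
    max_epochs: int = 6       # "within 6 epochs" is treated as an upper bound
    cfg_scale: float = 3.0    # classifier-free guidance
    top_k: int = 2000
    denoising_steps: int = 25

    @property
    def gpu_hours(self) -> float:
        # Total GPU-hours = hours per GPU times number of GPUs.
        return self.wall_clock_hours * self.num_gpus


SETUPS = [
    FineTuneSetup("Lumina-mGPT", "A100", num_gpus=8, wall_clock_hours=14, jacobi_window=96),
    FineTuneSetup("Emu3", "H100", num_gpus=8, wall_clock_hours=26, jacobi_window=128),
]

for s in SETUPS:
    print(f"{s.model}: ~{s.gpu_hours:.0f} {s.gpu_type} GPU-hours, window={s.jacobi_window}")
```

Under that reading, the fine-tuning budget comes to about 112 A100 GPU-hours for Lumina-mGPT and 208 H100 GPU-hours for Emu3. Software versions are not reported, so they are left out of the record.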