Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Denoising Autoregressive Transformers for Scalable Text-to-Image Generation

Authors: Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Joshua Susskind, Shuangfei Zhai

ICLR 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Our approach demonstrates competitive performance on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models." "DART achieves competitive performance in both class-conditioned and text-to-image generation tasks, offering a scalable and unified approach for high-quality, controllable image synthesis." The paper includes a section titled "4 EXPERIMENTS" detailing dataset usage, evaluation metrics (FID, CLIP score), and quantitative results comparing DART with baseline models (Figure 7, Table 2). |
| Researcher Affiliation | Collaboration | Apple, δThe Chinese University of Hong Kong, γMila. EMAIL δqhzhang@link.cuhk.edu.hk γdinghuaiEMAIL |
| Pseudocode | No | The paper describes the proposed methods and models using textual explanations and mathematical formulations (e.g., equations 1-10) and architectural diagrams (Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code, nor does it include links to code repositories. |
| Open Datasets | Yes | "We experiment with DART on both class-conditioned image generation on ImageNet (Deng et al., 2009) and text-to-image generation on CC12M (Changpinyo et al., 2021)... To assess the zero-shot capabilities of the models, we report scores based on the MSCOCO 2017 (Lin et al., 2014) validation set." |
| Dataset Splits | Yes | "We experiment with DART on both class-conditioned image generation on ImageNet (Deng et al., 2009) and text-to-image generation on CC12M (Changpinyo et al., 2021)... To assess the zero-shot capabilities of the models, we report scores based on the MSCOCO 2017 (Lin et al., 2014) validation set." The use of "validation set" for MSCOCO implies a predefined, standard split, and training on ImageNet/CC12M implies the use of their respective training sets. |
| Hardware Specification | Yes | "We compare both the actual inference speed (measured by wall-clock time with batch size 32 on a single H100) as well as the theoretical computation (measured by GFlops) in Figure 9(a)." |
| Software Dependencies | No | The paper mentions using components such as rotary positional encodings (RoPE), SwiGLU activation, the AdamW optimizer, and mixed-precision training (bf16). However, it does not specify concrete version numbers for any key software libraries or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA) that would be needed for replication. |
| Experiment Setup | Yes | "We train all models with a batch size of 128 images, resulting in a total of 0.5M image tokens per update. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with a cosine learning rate schedule, setting the maximum learning rate to 3e-4." Appendix B.2 (TRAINING) default config: batch_size=128, optimizer=AdamW, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-8, learning_rate=3e-4, warmup_steps=10_000, weight_decay=0.01, gradient_clip_norm=2.0, ema_decay=0.9999, mixed_precision_training=bf16. |
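For readers attempting replication, the training configuration quoted in the Experiment Setup row can be gathered into a single structure. The sketch below is illustrative only: the paper releases no code, so the dictionary name, the helper `cosine_lr`, and the exact warmup/decay shape and `total_steps` of the cosine schedule are assumptions, not the authors' implementation; only the hyperparameter values come from Appendix B.2.

```python
import math

# Training hyperparameters as quoted from Appendix B.2 of the paper.
DART_TRAIN_CONFIG = {
    "batch_size": 128,                  # images per update (~0.5M image tokens)
    "optimizer": "AdamW",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "adam_eps": 1e-8,
    "learning_rate": 3e-4,              # peak LR of the cosine schedule
    "warmup_steps": 10_000,
    "weight_decay": 0.01,
    "gradient_clip_norm": 2.0,
    "ema_decay": 0.9999,
    "mixed_precision_training": "bf16",
}

def cosine_lr(step, peak_lr=3e-4, warmup_steps=10_000, total_steps=500_000):
    """Linear warmup followed by cosine decay to zero.

    The paper states only 'cosine learning rate schedule' with a 3e-4 peak
    and 10k warmup steps; the decay floor of zero and total_steps here are
    assumptions.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

Keeping the config as a flat dictionary makes it easy to diff a replication attempt against the reported values, which is the main use of the Appendix B.2 listing.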