Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Denoising Autoregressive Transformers for Scalable Text-to-Image Generation

Authors: Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Joshua Susskind, Shuangfei Zhai

ICLR 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Our approach demonstrates competitive performance on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models." "DART achieves competitive performance in both class-conditioned and text-to-image generation tasks, offering a scalable and unified approach for high-quality, controllable image synthesis." The paper includes a section titled "4 EXPERIMENTS" detailing dataset usage, evaluation metrics (FID, CLIP score), and quantitative results comparing DART with baseline models (Figure 7, Table 2). |
| Researcher Affiliation | Collaboration | Apple, δThe Chinese University of Hong Kong, γMila. EMAIL δqhzhang@link.cuhk.edu.hk γdinghuaiEMAIL |
| Pseudocode | No | The paper describes the proposed methods and models using textual explanations and mathematical formulations (e.g., equations 1-10) and architectural diagrams (Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code, nor does it include links to code repositories. |
| Open Datasets | Yes | "We experiment with DART on both class-conditioned image generation on ImageNet (Deng et al., 2009) and text-to-image generation on CC12M (Changpinyo et al., 2021)... To assess the zero-shot capabilities of the models, we report scores based on the MSCOCO 2017 (Lin et al., 2014) validation set." |
| Dataset Splits | Yes | "We experiment with DART on both class-conditioned image generation on ImageNet (Deng et al., 2009) and text-to-image generation on CC12M (Changpinyo et al., 2021)... To assess the zero-shot capabilities of the models, we report scores based on the MSCOCO 2017 (Lin et al., 2014) validation set." The use of "validation set" for MSCOCO implies a predefined, standard split, and training on ImageNet/CC12M implies the use of their respective training sets. |
| Hardware Specification | Yes | "We compare both the actual inference speed (measured by wall-clock time with batch size 32 on a single H100) as well as the theoretical computation (measured by GFlops) in Figure 9(a)." |
| Software Dependencies | No | The paper mentions using components such as rotary positional encodings (RoPE), SwiGLU activation, the AdamW optimizer, and mixed-precision training (bf16). However, it does not specify concrete version numbers for any key software libraries or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA) that would be needed for replication. |
| Experiment Setup | Yes | "We train all models with a batch size of 128 images, resulting in a total of 0.5M image tokens per update. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with a cosine learning rate schedule, setting the maximum learning rate to 3e-4." Appendix B.2 (TRAINING) default config: batch_size=128, optimizer=AdamW, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-8, learning_rate=3e-4, warmup_steps=10_000, weight_decay=0.01, gradient_clip_norm=2.0, ema_decay=0.9999, mixed_precision_training=bf16. |
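For readers attempting replication, the training configuration quoted in the Experiment Setup row can be gathered into a single structure. The sketch below is illustrative only: the paper releases no code, so the dictionary name, the helper `cosine_lr`, and the exact warmup/decay shape and `total_steps` of the cosine schedule are assumptions, not the authors' implementation; only the hyperparameter values come from Appendix B.2.

```python
import math

# Training hyperparameters as quoted from Appendix B.2 of the paper.
DART_TRAIN_CONFIG = {
    "batch_size": 128,                  # images per update (~0.5M image tokens)
    "optimizer": "AdamW",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "adam_eps": 1e-8,
    "learning_rate": 3e-4,              # peak LR of the cosine schedule
    "warmup_steps": 10_000,
    "weight_decay": 0.01,
    "gradient_clip_norm": 2.0,
    "ema_decay": 0.9999,
    "mixed_precision_training": "bf16",
}

def cosine_lr(step, peak_lr=3e-4, warmup_steps=10_000, total_steps=500_000):
    """Linear warmup followed by cosine decay to zero.

    The paper states only 'cosine learning rate schedule' with a 3e-4 peak
    and 10k warmup steps; the decay floor of zero and total_steps here are
    assumptions.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

Keeping the config as a flat dictionary makes it easy to diff a replication attempt against the reported values, which is the main use of the Appendix B.2 listing.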