Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Denoising Autoregressive Transformers for Scalable Text-to-Image Generation
Authors: Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Joshua Susskind, Shuangfei Zhai
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | DART achieves competitive performance on both class-conditioned and text-to-image generation tasks, offering a scalable, unified alternative to traditional diffusion models for high-quality, controllable image synthesis. The paper includes a section titled '4 EXPERIMENTS' detailing dataset usage, evaluation metrics (FID, CLIP score), and quantitative results comparing DART with baseline models (Figure 7, Table 2). |
| Researcher Affiliation | Collaboration | Apple; The Chinese University of Hong Kong; Mila |
| Pseudocode | No | The paper describes the proposed methods and models using textual explanations and mathematical formulations (e.g., equations 1-10) and architectural diagrams (Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code, nor does it include links to code repositories. |
| Open Datasets | Yes | We experiment with DART on both class-conditioned image generation on ImageNet (Deng et al., 2009) and text-to-image generation on CC12M (Changpinyo et al., 2021)... To assess the zero-shot capabilities of the models, we report scores based on the MSCOCO 2017 (Lin et al., 2014) validation set. |
| Dataset Splits | Yes | We experiment with DART on both class-conditioned image generation on ImageNet (Deng et al., 2009) and text-to-image generation on CC12M (Changpinyo et al., 2021)... To assess the zero-shot capabilities of the models, we report scores based on the MSCOCO 2017 (Lin et al., 2014) validation set. The use of 'validation set' for MSCOCO implies a predefined, standard split, and training on ImageNet/CC12M implies the use of their respective training sets. |
| Hardware Specification | Yes | We compare both the actual inference speed (measured by wall-clock time with batch size 32 on a single H100) as well as the theoretical computation (measured by GFlops) in Figure 9(a). |
| Software Dependencies | No | The paper mentions using components like rotary positional encodings (RoPE) and SwiGLU activation, and the AdamW optimizer, along with mixed-precision training (bf16). However, it does not specify concrete version numbers for any key software libraries or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA) that would be needed for replication. |
| Experiment Setup | Yes | We train all models with a batch size of 128 images, resulting in a total of 0.5M image tokens per update. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with a cosine learning rate schedule, setting the maximum learning rate to 3e-4. Appendix B.2 TRAINING default config: batch_size=128, optimizer=AdamW, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-8, learning_rate=3e-4, warmup_steps=10_000, weight_decay=0.01, gradient_clip_norm=2.0, ema_decay=0.9999, mixed_precision_training=bf16 |
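The training configuration quoted above can be made concrete with a short sketch. This is a minimal illustration, not the authors' code (which is not released): the config keys mirror the Appendix B.2 names, and `lr_at` assumes linear warmup over the stated 10k steps followed by cosine decay; the total step count is an assumption, as the excerpt does not state one.

```python
import math

# Hyperparameters transcribed from the paper's Appendix B.2.
# The dict keys follow the paper's naming; the structure itself is illustrative.
config = {
    "batch_size": 128,
    "optimizer": "AdamW",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "adam_eps": 1e-8,
    "learning_rate": 3e-4,
    "warmup_steps": 10_000,
    "weight_decay": 0.01,
    "gradient_clip_norm": 2.0,
    "ema_decay": 0.9999,
    "mixed_precision_training": "bf16",
}

def lr_at(step, total_steps=100_000):
    """Learning rate at a given step: linear warmup to the max learning
    rate, then cosine decay to zero.

    The paper specifies a cosine schedule and 10k warmup steps; the
    default total_steps here is a placeholder for illustration.
    """
    peak = config["learning_rate"]
    warmup = config["warmup_steps"]
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))
```

For example, `lr_at(0)` is 0, `lr_at(10_000)` reaches the 3e-4 peak, and `lr_at(100_000)` decays to 0, matching the warmup-then-cosine shape the paper describes.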