Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AR-RAG: Autoregressive Retrieval Augmentation for Image Generation

Authors: Jingyuan Qi, Zhiyang Xu, Qifan Wang, Lifu Huang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We validate the effectiveness of AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval, RareBench, T2I-Bench, and DPG-Bench, demonstrating significant performance gains over state-of-the-art image generation models.
Researcher Affiliation	Collaboration	¹Virginia Tech ²Meta ³UC Davis EMAIL EMAIL
Pseudocode	Yes	Algorithm 1: Multi-Scale Feature Smoothing
Open Source Code	No	Will release after acceptance
Open Datasets	Yes	We build a patch-based retrieval database based on several large-scale, real-world image datasets, including CC12M [5] and JourneyDB [38]. Specifically, for each image $I$, we encode it into $N$ patches using the quantized autoencoder [39], $\theta_{\mathrm{enc}}$, from Janus-Pro: $V = \theta_{\mathrm{enc}}(I) \in \mathbb{R}^{N \times d}$, where $d$ is the hidden dimension, and $V_{ij}$ corresponds to the latent representation of the patch at position $(i, j)$. Evaluation Benchmarks and Metrics: To comprehensively evaluate our proposed methods, we employ five benchmarks: (1) GenEval [14]... (2) DPG-Bench [18]... (3) RareBench [29]... (4) T2I-Bench [19]... and (5) Midjourney-30k [42]...
Dataset Splits	Yes	To construct our patch-level retrieval database, we randomly sample 5.7 million images from CC12M [5], 3.3 million from JourneyDB [38], and 4.6 million from DataComp [12], while ensuring that any samples included in the testing set are excluded to prevent data leakage. ... For model training, we utilize two large-scale image-caption datasets: CC12M [5] and Midjourney-v6. From the training sets of these datasets, we randomly sample a total of 50,000 image-caption pairs (25,000 from each dataset) to fine-tune our model.
Hardware Specification	Yes	The fine-tuning process is conducted on 4 NVIDIA A100 (80GB) GPUs with a global batch size of 256 for a single epoch. Table 6: Inference time for generating 100 images on a single L40 card.
Software Dependencies	No	For efficient similarity search, we implement our retriever using the FAISS library [22].
Experiment Setup	Yes	The fine-tuning process is conducted on 4 NVIDIA A100 (80GB) GPUs with a global batch size of 256 for a single epoch. We utilize the AdamW optimizer without weight decay, incorporating a 10% linear warm-up schedule followed by a constant learning rate of 2e-4. Based on this analysis, we selected λ = 0.05 and τ = 0.6 for DAiD, and hop levels 12 with 2 blender modules for FAiD, achieving FID scores of 14.12 and 13.13, respectively.
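
The Experiment Setup excerpt describes a 10% linear warm-up followed by a constant learning rate of 2e-4, over a single epoch of 50,000 pairs at a global batch size of 256. The sketch below shows one way such a schedule could be computed; it is not the authors' code. The function name `learning_rate`, the choice to keep the final partial batch (giving 196 steps), and the choice to start the warm-up at `peak_lr / warmup_steps` rather than at zero are all assumptions made for illustration.

```python
import math


def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 2e-4, warmup_frac: float = 0.10) -> float:
    """Linear warm-up over the first `warmup_frac` of steps, then constant.

    Hypothetical helper: `step` is 0-indexed, and the warm-up ramps from
    peak_lr / warmup_steps up to peak_lr (an assumed convention).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


# One epoch over 50,000 pairs at global batch size 256.
# Assumption: the last partial batch is kept, hence the ceiling.
TOTAL_STEPS = math.ceil(50_000 / 256)
schedule = [learning_rate(s, TOTAL_STEPS) for s in range(TOTAL_STEPS)]
```

In a PyTorch training loop, the same function could be divided by `peak_lr` and passed to `torch.optim.lr_scheduler.LambdaLR` as a multiplier on an `AdamW` optimizer created with `lr=2e-4` and `weight_decay=0.0`, matching the "without weight decay" wording in the excerpt.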