Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
AR-RAG: Autoregressive Retrieval Augmentation for Image Generation
Authors: Jingyuan Qi, Zhiyang Xu, Qifan Wang, Lifu Huang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness of AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval, RareBench, T2I-Bench, and DPG-Bench, demonstrating significant performance gains over state-of-the-art image generation models. |
| Researcher Affiliation | Collaboration | (1) Virginia Tech, (2) Meta, (3) UC Davis |
| Pseudocode | Yes | Algorithm 1: Multi-Scale Feature Smoothing |
| Open Source Code | No | Will release after acceptance |
| Open Datasets | Yes | We build a patch-based retrieval database based on several large-scale, real-world image datasets, including CC12M [5] and JourneyDB [38]. Specifically, for each image $I$, we encode it into $N$ patches using the quantized autoencoder [39], $\theta_{\mathrm{Enc}}$, from Janus-Pro: $V = \theta_{\mathrm{Enc}}(I) \in \mathbb{R}^{N \times d}$, where $d$ is the hidden dimension, and $V_{ij}$ corresponds to the latent representation of the patch at position $(i, j)$. Evaluation Benchmarks and Metrics: To comprehensively evaluate our proposed methods, we employ five benchmarks: (1) GenEval [14]... (2) DPG-Bench [18]... (3) RareBench [29]... (4) T2I-Bench [19]... and (5) Midjourney-30k [42]... |
| Dataset Splits | Yes | To construct our patch-level retrieval database, we randomly sample 5.7 million images from CC12M [5], 3.3 million from JourneyDB [38], and 4.6 million from DataComp [12], while ensuring that any samples included in the testing set are excluded to prevent data leakage. ... For model training, we utilize two large-scale image-caption datasets: CC12M [5] and Midjourney-v6. From the training sets of these datasets, we randomly sample a total of 50,000 image-caption pairs (25,000 from each dataset) to fine-tune our model. |
| Hardware Specification | Yes | The fine-tuning process is conducted on 4 NVIDIA A100 (80GB) GPUs with a global batch size of 256 for a single epoch. Table 6: Inference time for generating 100 images on a single L40 card. |
| Software Dependencies | No | For efficient similarity search, we implement our retriever using the FAISS library [22]. |
| Experiment Setup | Yes | The fine-tuning process is conducted on 4 NVIDIA A100 (80GB) GPUs with a global batch size of 256 for a single epoch. We utilize the AdamW optimizer without weight decay, incorporating a 10% linear warm-up schedule followed by a constant learning rate of 2e-4. Based on this analysis, we selected λ = 0.05 and τ = 0.6 for DAiD, and hop levels 12 with 2 blender modules for FAiD, achieving FID scores of 14.12 and 13.13, respectively. |
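
The learning-rate schedule quoted under Experiment Setup (10% linear warm-up, then a constant 2e-4) can be sketched as below. This is a minimal illustration, not the authors' training code: the function name `warmup_constant_lr` is hypothetical, the step count assumes the 50,000 sampled pairs are consumed once at the global batch size of 256 with the last partial batch kept, and the warm-up is assumed to ramp from `peak_lr / warmup_steps` rather than from zero. None of these details are stated in the source.

```python
import math


def warmup_constant_lr(step: int, total_steps: int,
                       peak_lr: float = 2e-4,
                       warmup_frac: float = 0.10) -> float:
    """Return the learning rate for a 0-indexed optimizer step.

    Linear warm-up over the first `warmup_frac` of training, then constant.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = int(total_steps * warmup_frac)
    if warmup_steps > 0 and step < warmup_steps:
        # Assumption: ramp linearly and reach peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


# Assumption: one epoch over 50,000 pairs at global batch size 256,
# keeping the final partial batch.
total_steps = math.ceil(50_000 / 256)
schedule = [warmup_constant_lr(s, total_steps) for s in range(total_steps)]
```

With four GPUs and no gradient accumulation, a global batch of 256 would correspond to 64 samples per device; the source does not say whether accumulation was used.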