Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis
Authors: Patrick Esser, Robin Rombach, Andreas Blattmann, Bjorn Ommer
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show greatly improved image modification capabilities over autoregressive models while also providing high-fidelity image generation, both of which are enabled through efficient training in a compressed latent space. In this section we present qualitative and quantitative results on images synthesized by our approach. We train models at resolution 256x256 for unconditional generation on FFHQ [33], LSUN-Cats, -Churches and -Bedrooms [79] and on class-conditional synthesis on ImageNet (cIN) [13]. |
| Researcher Affiliation | Academia | Ludwig Maximilian University of Munich & IWR, Heidelberg University, Germany |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (e.g., labeled 'Algorithm' or 'Pseudocode'). |
| Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] Including the code while maintaining anonymity is difficult; it will be published after deanonymization. |
| Open Datasets | Yes | We train models at resolution 256x256 for unconditional generation on FFHQ [33], LSUN-Cats, -Churches and -Bedrooms [79] and on class-conditional synthesis on ImageNet (cIN) [13]. We also learn a text-conditional model on Conceptual Captions (CC) [64, 51]. |
| Dataset Splits | Yes | As the majority of codebook entries remains unused, we shrink the codebook to those entries which are actually used (evaluated on the validation split of ImageNet) and assign a random entry for eventual outliers. All models were trained with the same computational budget and evaluated at the best validation checkpoint. |
| Hardware Specification | Yes | Experiments were conducted on a single NVIDIA A100 and are reported averaged over 1000 samples with a batch size of 50, evaluated on FFHQ while using the same number of trainable parameters (800m) for all AR models. |
| Software Dependencies | No | The paper mentions 'PyTorch implementations' and credits a GitHub repository for 'x-transformers' but does not specify exact version numbers for any software dependencies, such as 'PyTorch 1.9' or 'CUDA 11.1'. |
| Experiment Setup | Yes | We train models at resolution 256x256 for unconditional generation on FFHQ [33], LSUN-Cats, -Churches and -Bedrooms [79] and on class-conditional synthesis on ImageNet (cIN) [13]. For FFHQ we choose a chain of length T = 3, such that the total model consists of (i) the compression stage and (ii) n = 2 transformer models trained in parallel via the objective described in Eq. (7). Similarly, we set n = 3 for each of the LSUN models and n = 5 for the ImageNet model. We identify a favorable trade-off between four and six decoder layers and transfer this setting to our other experiments. |
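
The per-dataset settings quoted in the Experiment Setup row can be collected in a small lookup structure for anyone attempting a reproduction. The following is only a sketch: the class name, field names, and helper function are hypothetical and do not come from the authors' code, and the chain length T is recorded only for FFHQ because the quoted text states it for that dataset alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SetupSketch:
    """Illustrative record of one quoted training setup (names are assumptions)."""
    dataset: str
    resolution: int               # quoted text: models trained at 256x256
    num_transformers: int         # "n" in the quoted text
    chain_length: Optional[int]   # "T" in the quoted text; None where not stated


# Values copied from the quoted Experiment Setup text; nothing else is filled in.
SETUPS = [
    SetupSketch("FFHQ", 256, 2, 3),
    SetupSketch("LSUN-Cats", 256, 3, None),
    SetupSketch("LSUN-Churches", 256, 3, None),
    SetupSketch("LSUN-Bedrooms", 256, 3, None),
    SetupSketch("ImageNet", 256, 5, None),
]


def transformers_for(dataset: str) -> int:
    """Return the quoted number of transformer models for a dataset."""
    for setup in SETUPS:
        if setup.dataset == dataset:
            return setup.num_transformers
    raise KeyError(f"no quoted setup for dataset: {dataset}")
```

Details the table does not quote, such as optimizer, learning rate, or batch size during training, are deliberately left out rather than guessed.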