ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis

Authors: Patrick Esser, Robin Rombach, Andreas Blattmann, Björn Ommer

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments show greatly improved image modification capabilities over autoregressive models while also providing high-fidelity image generation, both of which are enabled through efficient training in a compressed latent space." "In this section we present qualitative and quantitative results on images synthesized by our approach. We train models at resolution 256x256 for unconditional generation on FFHQ [33], LSUN-Cats, -Churches and -Bedrooms [79] and on class-conditional synthesis on ImageNet (cIN) [13]."
Researcher Affiliation | Academia | "Ludwig Maximilian University of Munich & IWR, Heidelberg University, Germany"
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (e.g., labeled 'Algorithm' or 'Pseudocode').
Open Source Code | No | "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] Including the code while maintaining anonymity is difficult; it will be published after deanonymization."
Open Datasets | Yes | "We train models at resolution 256x256 for unconditional generation on FFHQ [33], LSUN-Cats, -Churches and -Bedrooms [79] and on class-conditional synthesis on ImageNet (cIN) [13]." "we also learn a text-conditional model on Conceptual Captions (CC) [64, 51]."
Dataset Splits | Yes | "As the majority of codebook entries remains unused, we shrink the codebook to those entries which are actually used (evaluated on the validation split of ImageNet) and assign a random entry for eventual outliers." "All models were trained with the same computational budget and evaluated at the best validation checkpoint." (See the codebook-remapping sketch after this table.)
Hardware Specification | Yes | "Experiments were conducted on a single NVIDIA A100 and are reported averaged over 1000 samples with a batch size of 50, evaluated on FFHQ while using the same number of trainable parameters (800M) for all AR models."
Software Dependencies | No | The paper mentions 'PyTorch implementations' and credits a GitHub repository for 'x-transformers' but does not specify exact version numbers for any software dependencies, such as 'PyTorch 1.9' or 'CUDA 11.1'.
Experiment Setup | Yes | "We train models at resolution 256x256 for unconditional generation on FFHQ [33], LSUN-Cats, -Churches and -Bedrooms [79] and on class-conditional synthesis on ImageNet (cIN) [13]." "For FFHQ we choose a chain of length T = 3, such that the total model consists of (i) the compression stage and (ii) n = 2 transformer models trained in parallel via the objective described in Eq. (7). Similarly, we set n = 3 for each of the LSUN models and n = 5 for the ImageNet model." "We identify a favorable trade-off between four and six decoder layers and transfer this setting to our other experiments." (See the diffusion-chain sketch after this table.)
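The codebook-shrinking step quoted under Dataset Splits (keep only the entries observed on the ImageNet validation split, map outliers to a random kept entry) can be sketched in a few lines. This is a minimal illustration under assumed interfaces; the function names `build_codebook_remap` and `remap_codes` and the tensor layout are hypothetical and not taken from the authors' released code.

```python
import torch

def build_codebook_remap(val_codes: torch.Tensor, codebook_size: int) -> torch.Tensor:
    """Build a lookup from original codebook ids to a dense range of used ids.

    `val_codes` holds the discrete VQGAN indices observed on the validation
    split; ids never observed there are marked with -1 and treated as outliers.
    """
    used = torch.unique(val_codes)                 # ids actually in use
    remap = torch.full((codebook_size,), -1, dtype=torch.long)
    remap[used] = torch.arange(used.numel())       # assign dense new ids
    return remap

def remap_codes(codes: torch.Tensor, remap: torch.Tensor) -> torch.Tensor:
    """Shrink `codes` to the reduced codebook; outliers get a random used id."""
    new_codes = remap[codes]
    n_used = int(remap.max().item()) + 1
    outliers = new_codes < 0
    new_codes[outliers] = torch.randint(0, n_used, (int(outliers.sum()),))
    return new_codes
```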
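For the Experiment Setup row, the chain of length T = 3 with n = 2 reversal transformers can be pictured via the forward (corruption) side of a multinomial diffusion over discrete code indices: at each step, a token is resampled uniformly from the codebook with some probability and kept otherwise. The sketch below is an assumption-laden illustration (the β schedule, codebook size, and function name are invented for the example), not the authors' implementation.

```python
import torch

def multinomial_diffusion_step(codes: torch.Tensor, beta_t: float,
                               codebook_size: int) -> torch.Tensor:
    """One forward step q(x_t | x_{t-1}): with probability beta_t a token is
    resampled uniformly from the codebook, otherwise it is kept unchanged."""
    resample = torch.rand(codes.shape) < beta_t
    random_codes = torch.randint(0, codebook_size, codes.shape)
    return torch.where(resample, random_codes, codes)

# Chain of length T = 3 as for FFHQ: x_0 (clean VQGAN indices) -> x_1 -> x_2.
# One transformer per reversal step x_t -> x_{t-1}, i.e. n = T - 1 = 2 models.
codebook_size = 1024                    # hypothetical reduced codebook size
betas = [0.3, 0.7]                      # illustrative noise schedule
x = torch.randint(0, codebook_size, (1, 16 * 16))  # a 16x16 grid of indices
chain = [x]
for beta in betas:
    chain.append(multinomial_diffusion_step(chain[-1], beta, codebook_size))
```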