ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis
Authors: Patrick Esser, Robin Rombach, Andreas Blattmann, Björn Ommer
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show greatly improved image modification capabilities over autoregressive models while also providing high-fidelity image generation, both of which are enabled through efficient training in a compressed latent space. In this section we present qualitative and quantitative results on images synthesized by our approach. We train models at resolution 256x256 for unconditional generation on FFHQ [33], LSUN-Cats, -Churches and -Bedrooms [79] and on class-conditional synthesis on ImageNet (cIN) [13]. |
| Researcher Affiliation | Academia | Ludwig Maximilian University of Munich & IWR, Heidelberg University, Germany |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (e.g., labeled 'Algorithm' or 'Pseudocode'). |
| Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] Including the code while maintaining anonymity is difficult; it will be published after deanonymization. |
| Open Datasets | Yes | We train models at resolution 256x256 for unconditional generation on FFHQ [33], LSUN-Cats, -Churches and -Bedrooms [79] and on class-conditional synthesis on ImageNet (cIN) [13]. We also learn a text-conditional model on Conceptual Captions (CC) [64, 51]. |
| Dataset Splits | Yes | As the majority of codebook entries remains unused, we shrink the codebook to those entries which are actually used (evaluated on the validation split of ImageNet) and assign a random entry for eventual outliers. All models were trained with the same computational budget and evaluated at the best validation checkpoint. (A sketch of this codebook pruning follows the table.) |
| Hardware Specification | Yes | Experiments were conducted on a single NVIDIA A100 and are reported averaged over 1000 samples with a batch size of 50, evaluated on FFHQ while using the same number of trainable parameters (800M) for all AR models. |
| Software Dependencies | No | The paper mentions 'PyTorch implementations' and credits a GitHub repository for 'x-transformers' but does not specify exact version numbers for any software dependencies, such as 'PyTorch 1.9' or 'CUDA 11.1'. |
| Experiment Setup | Yes | We train models at resolution 256x256 for unconditional generation on FFHQ [33], LSUN-Cats, -Churches and -Bedrooms [79] and on class-conditional synthesis on ImageNet (cIN) [13]. For FFHQ we choose a chain of length T = 3, such that the total model consists of (i) the compression stage and (ii) n = 2 transformer models trained in parallel via the objective described in Eq. (7). Similarly, we set n = 3 for each of the LSUN models and n = 5 for the ImageNet model. We identify a favorable trade-off between four and six decoder layers and transfer this setting to our other experiments. (A sketch of this chain setup follows the table.) |
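The "Dataset Splits" row quotes the paper's codebook-pruning step: shrink the learned VQ codebook to the entries actually used on the ImageNet validation split, and map any leftover (outlier) ids to a random kept entry. Below is a minimal PyTorch sketch of one way to do this; the function name `shrink_codebook`, the `(K, d)` codebook layout, and the random-remap scheme are assumptions for illustration, not the authors' released code.

```python
import torch

def shrink_codebook(codebook, code_indices):
    """Prune a VQ codebook to the entries actually observed.

    codebook:     (K, d) embedding matrix of the compression stage.
    code_indices: flat tensor of code ids seen when encoding a
                  reference split (e.g. the ImageNet validation split).
    Returns the pruned table and an old-id -> new-id remap vector.
    """
    used = torch.unique(code_indices)          # ids that actually occur
    shrunk = codebook[used]                    # (K_used, d) pruned table

    # Old id -> new id. Unseen ("outlier") ids are assigned a random
    # kept entry, as described in the quoted passage; used ids map to
    # their position in the pruned table.
    remap = torch.randint(len(used), (codebook.size(0),))
    remap[used] = torch.arange(len(used))
    return shrunk, remap
```

Applying `new_ids = remap[old_ids]` then re-indexes any token sequence so the downstream transformers only ever see the pruned vocabulary.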
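The "Experiment Setup" row describes a reverse multinomial-diffusion chain in which n separate transformers each learn one denoising step p(z_{t-1} | z_t) over the discrete codes and can therefore be trained in parallel. The hedged PyTorch sketch below illustrates that structure under stated assumptions: `make_transformer` is a hypothetical factory for an encoder-decoder model with a `(src, tgt) -> logits` interface, and `z_chain[t]` holds the precomputed codes after t diffusion steps. It is an illustration of the setup, not the paper's implementation of Eq. (7).

```python
import torch
import torch.nn as nn

class ReverseChain(nn.Module):
    """One independently trainable denoiser per reverse step t = 1..n.

    The models share no weights, which is what allows the paper's
    per-step transformers to be trained in parallel.
    """

    def __init__(self, n_steps, vocab_size, make_transformer):
        super().__init__()
        self.denoisers = nn.ModuleList(
            [make_transformer(vocab_size) for _ in range(n_steps)]
        )

    def loss(self, z_chain):
        # z_chain[0] are the clean codes; z_chain[t] the codes after
        # t multinomial diffusion steps. Each model conditions on the
        # noisier codes and predicts the less-noisy ones.
        losses = []
        for t, model in enumerate(self.denoisers, start=1):
            logits = model(src=z_chain[t], tgt=z_chain[t - 1])  # (B, L, V)
            losses.append(nn.functional.cross_entropy(
                logits.flatten(0, 1), z_chain[t - 1].flatten()))
        return torch.stack(losses).sum()

# Per the row above: n_steps = 2 for FFHQ, 3 for LSUN, 5 for ImageNet.
```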