Image Transformer

Authors: Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, Dustin Tran

ICML 2018

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | While conceptually simple, our generative models significantly outperform the current state of the art in image generation on ImageNet, improving the best published negative log-likelihood on ImageNet from 3.83 to 3.77. We also present results on image super-resolution with a large magnification ratio, applying an encoder-decoder configuration of our architecture. In a human evaluation study, we find that images generated by our super-resolution model fool human observers three times more often than the previous state of the art. (See the bits/dim note below the table.) |
| Researcher Affiliation | Collaboration | ¹Google Brain, Mountain View, USA; ²Department of Electrical Engineering and Computer Sciences, University of California, Berkeley; ³Work done during an internship at Google Brain; ⁴Google AI, Mountain View, USA. |
| Pseudocode | No | The paper includes a diagram (Figure 1) illustrating a slice of the Image Transformer, along with equations describing its operations, but it contains no dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | All code we used to develop, train, and evaluate our models is available in Tensor2Tensor (Vaswani et al., 2018). |
| Open Datasets | Yes | modeling images from the standard ImageNet data set, as measured by log-likelihood. |
| Dataset Splits | Yes | Table 4. Bits/dim on CIFAR-10 test and ImageNet validation sets. The Image Transformer outperforms all models and matches PixelCNN++, achieving a new state of the art on ImageNet. |
| Hardware Specification | Yes | We train our models on both P100 and K40 GPUs, with batch sizes ranging from 1 to 8 per GPU. |
| Software Dependencies | No | The paper mentions 'TensorFlow' for image resizing and 'Tensor2Tensor' as the framework where the code is available, but it does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For categorical, we use 12 layers with d = 512, heads = 4, feed-forward dimension 2048, and a dropout of 0.3. In DMOL, our best config uses 14 layers, d = 256, heads = 8, feed-forward dimension 512, and a dropout of 0.2. (Restated as a config sketch below the table.) |
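The log-likelihoods quoted in the Research Type and Dataset Splits rows (3.83 and 3.77 on ImageNet) are reported in bits per dimension, the standard metric for pixel-level density models. As a minimal sketch of the conversion, assuming a total negative log-likelihood measured in nats over a 32x32 RGB image (the numeric values here are illustrative, not taken from the paper):

```python
import math

def bits_per_dim(total_nll_nats: float, num_dims: int) -> float:
    """Convert a total negative log-likelihood in nats to bits per dimension."""
    return total_nll_nats / (num_dims * math.log(2.0))

# Illustrative numbers only: a 32x32 RGB image has 32 * 32 * 3 = 3072
# dimensions, so a total NLL of about 8030 nats works out to
# 8030 / (3072 * ln 2) ≈ 3.77 bits/dim, the figure reported above.
print(bits_per_dim(8030.0, 32 * 32 * 3))  # ≈ 3.77
```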
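For reference, the two configurations quoted in the Experiment Setup row can be restated as plain Python dictionaries. This is a hypothetical sketch: the key names below are illustrative and are not necessarily the hparams identifiers used in Tensor2Tensor.

```python
# Hypothetical restatement of the reported hyperparameters; key names are
# illustrative, not Tensor2Tensor's actual hparams identifiers.
categorical_config = {
    "num_layers": 12,    # decoder layers
    "hidden_size": 512,  # model dimension d
    "num_heads": 4,      # attention heads
    "filter_size": 2048, # feed-forward (inner) dimension
    "dropout": 0.3,
}

dmol_config = {          # discretized mixture of logistics output
    "num_layers": 14,
    "hidden_size": 256,
    "num_heads": 8,
    "filter_size": 512,
    "dropout": 0.2,
}
```

Per the Hardware Specification row, these models were trained on P100 and K40 GPUs with batch sizes of 1 to 8 per GPU.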