Generative Pretraining From Pixels

Authors: Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pretrained models. We are also competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0% top-1 accuracy on a linear probe of our features. (See the pixel-prediction sketch after the table.)
Researcher Affiliation | Industry | OpenAI, San Francisco, CA, USA. Correspondence to: Mark Chen <mark@openai.com>.
Pseudocode | No | No structured pseudocode or algorithm blocks labeled 'Pseudocode' or 'Algorithm' were found.
Open Source Code | No | No explicit statement providing concrete access to source code (e.g., a specific repository link or explicit code release statement) for the methodology described in this paper was found.
Open Datasets | Yes | We use the ImageNet ILSVRC 2012 training dataset, splitting off 4% as our experimental validation set and report results on the ILSVRC 2012 validation set as our test set. For CIFAR-10, CIFAR-100 and STL-10, we split off 10% of the provided training set instead.
Dataset Splits | Yes | We use the ImageNet ILSVRC 2012 training dataset, splitting off 4% as our experimental validation set and report results on the ILSVRC 2012 validation set as our test set. For CIFAR-10, CIFAR-100 and STL-10, we split off 10% of the provided training set instead. (See the holdout-split sketch after the table.)
Hardware Specification | Yes | When using a flexible architecture (Vaswani et al., 2017), a tractable and efficient likelihood based training objective (Larochelle & Murray, 2011; Oord et al., 2016), and significant compute resources (1024 TPU cores), generative pre-training is competitive with other self-supervised approaches and learns [...]
Software Dependencies | No | No specific software dependencies with version numbers (e.g., library names with versions) were found. The paper mentions using 'Adam' and the 'GPT-2 formulation of the transformer decoder block', but not specific software environments or libraries.
Experiment Setup | Yes | When pre-training, we use a batch size of 128 and train for 1000000 iterations using Adam with β1 = 0.9 and β2 = 0.95. We sequentially try the learning rates 0.01, 0.003, 0.001, 0.0003, ..., stopping at the first local minimum. The learning rate is warmed up for one epoch, and then decays to 0 following a cosine schedule. No dropout is used. When fine-tuning, we use the same batch size and Adam hyperparameters. Here, we do not employ a cosine schedule, and early stop once we reach the maximum validation accuracy. Again, no dropout is used. When running a linear probe on ImageNet, we follow recent literature and use SGD with momentum 0.9 and a high learning rate (we try the values 30, 10, 3, ... in the manner described above) (He et al., 2019). We train for 1000000 iterations with a cosine learning rate schedule. Finally, when running a linear probe on CIFAR-10, CIFAR-100, or STL-10, we use the L-BFGS algorithm for consistency with prior results (Pedregosa et al., 2011). In Table 5, we present the learning rates used to train each model in the paper. (See the warmup-plus-cosine optimizer sketch after the table.)
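
The method quoted in the Research Type row reduces to next-token prediction over a flattened pixel sequence with a decoder-only Transformer. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, not the authors' released implementation; the module names, the model sizes, and the 512-entry pixel vocabulary (the paper quantizes colors to a 9-bit palette) are illustrative assumptions standing in for details the excerpt does not specify.

```python
# Minimal sketch of autoregressive pixel modeling in the spirit of iGPT:
# a causal Transformer over a flattened, palette-quantized pixel sequence.
# Illustrative only; not the paper's code. Requires PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelGPT(nn.Module):
    """Decoder-only Transformer that predicts the next pixel token."""

    def __init__(self, vocab_size=512, seq_len=32 * 32, d_model=256,
                 n_heads=8, n_layers=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), 1)
        self.register_buffer("causal_mask", mask)

    def forward(self, tokens):                       # tokens: (B, T) int64
        T = tokens.size(1)
        x = self.tok_emb(tokens) + self.pos_emb[:, :T]
        x = self.blocks(x, mask=self.causal_mask[:T, :T])
        return self.head(x)                          # (B, T, vocab_size)


def training_step(model, tokens):
    """Next-token cross-entropy: predict pixel t from pixels < t."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
```

A linear probe, as used for the reported CIFAR-10 and ImageNet numbers, would then train only a linear classifier on features extracted from a frozen model of this kind.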
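The 4% / 10% holdout described in the Open Datasets and Dataset Splits rows can be reproduced with a simple random split. This is a minimal sketch assuming PyTorch-style datasets; the helper name, the use of random_split, and the fixed seed are assumptions, since the excerpt does not say how the split was drawn.

```python
# Hold out a validation fraction from a training set: 4% for ImageNet,
# 10% for CIFAR-10, CIFAR-100, and STL-10 (per the quoted text).
import torch
from torch.utils.data import random_split


def holdout_split(train_set, val_fraction=0.04, seed=0):
    n_val = int(len(train_set) * val_fraction)
    n_train = len(train_set) - n_val
    return random_split(train_set, [n_train, n_val],
                        generator=torch.Generator().manual_seed(seed))
```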
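The Experiment Setup row pins down the pre-training optimizer (Adam with β1 = 0.9, β2 = 0.95), a one-epoch warmup, and a cosine decay of the learning rate to 0. Below is a minimal sketch of that schedule, assuming PyTorch; the helper name and the use of LambdaLR are assumptions, and only the hyperparameter values come from the quoted text.

```python
# Adam with betas (0.9, 0.95), linear warmup for one epoch, then cosine
# decay to 0 over the remaining training steps. Illustrative sketch only.
import math
import torch


def make_optimizer_and_schedule(model, base_lr, warmup_steps, total_steps):
    opt = torch.optim.Adam(model.parameters(), lr=base_lr, betas=(0.9, 0.95))

    def lr_factor(step):
        if step < warmup_steps:                      # linear warmup (one epoch)
            return (step + 1) / warmup_steps
        # cosine decay from 1 down to 0 over the remaining steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_factor)
    return opt, sched
```

Following the quoted sweep, one would call this helper once per candidate learning rate (0.01, 0.003, 0.001, 0.0003, ...) and stop at the first local minimum, as the paper describes.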