Generative Pretraining From Pixels
Authors: Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pretrained models. We are also competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0% top-1 accuracy on a linear probe of our features. (A minimal sketch of this next-pixel prediction objective appears after the table.) |
| Researcher Affiliation | Industry | OpenAI, San Francisco, CA, USA. Correspondence to: Mark Chen <mark@openai.com>. |
| Pseudocode | No | No structured pseudocode or algorithm blocks labeled 'Pseudocode' or 'Algorithm' were found. |
| Open Source Code | No | No explicit statement providing concrete access to source code (e.g., a specific repository link or explicit code release statement) for the methodology described in this paper was found. |
| Open Datasets | Yes | We use the ImageNet ILSVRC 2012 training dataset, splitting off 4% as our experimental validation set and report results on the ILSVRC 2012 validation set as our test set. For CIFAR-10, CIFAR-100 and STL-10, we split off 10% of the provided training set instead. |
| Dataset Splits | Yes | We use the ImageNet ILSVRC 2012 training dataset, splitting off 4% as our experimental validation set and report results on the ILSVRC 2012 validation set as our test set. For CIFAR-10, CIFAR-100 and STL-10, we split off 10% of the provided training set instead. |
| Hardware Specification | Yes | When using a flexible architecture (Vaswani et al., 2017), a tractable and efficient likelihood based training objective (Larochelle & Murray, 2011; Oord et al., 2016), and significant compute resources (1024 TPU cores), generative pre-training is competitive with other self-supervised approaches and learns [...] |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., library names with versions) were found. The paper mentions using 'Adam' and the 'GPT-2 formulation of the transformer decoder block', but not specific software environments or libraries. |
| Experiment Setup | Yes | When pre-training, we use a batch size of 128 and train for 1000000 iterations using Adam with β1 = 0.9 and β2 = 0.95. We sequentially try the learning rates 0.01, 0.003, 0.001, 0.0003, ..., stopping at the first local minimum. The learning rate is warmed up for one epoch, and then decays to 0 following a cosine schedule. No dropout is used. When fine-tuning, we use the same batch size and Adam hyperparameters. Here, we do not employ a cosine schedule, and early stop once we reach the maximum validation accuracy. Again, no dropout is used. When running a linear probe on ImageNet, we follow recent literature and use SGD with momentum 0.9 and a high learning rate (we try the values 30, 10, 3, ... in the manner described above) (He et al., 2019). We train for 1000000 iterations with a cosine learning rate schedule. Finally, when running a linear probe on CIFAR-10, CIFAR-100, or STL-10, we use the L-BFGS algorithm for consistency with prior results (Pedregosa et al., 2011). In Table 5, we present the learning rates used to train each model in the paper. (Illustrative sketches of the pre-training schedule and the L-BFGS linear probe appear after the table.) |
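
The Research Type row quotes the paper's core objective: flatten a low-resolution image into a 1D sequence of pixel tokens and train a decoder-only Transformer to predict each next pixel. The sketch below is a minimal, hypothetical PyTorch illustration of that training signal, not the authors' GPT-2-scale model; the `PixelTransformer` class, its layer sizes, and the 512-token vocabulary (standing in for a reduced color palette) are assumptions made here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 512        # assumed reduced color-palette size; not taken from the quoted text
SEQ_LEN = 32 * 32  # a low-resolution image flattened in raster order

class PixelTransformer(nn.Module):
    """Illustrative decoder-only Transformer over pixel tokens (not the authors' model)."""
    def __init__(self, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, x):  # x: (batch, seq) of integer pixel tokens
        T = x.size(1)
        h = self.tok(x) + self.pos(torch.arange(T, device=x.device))
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(x.device)
        h = self.blocks(h, mask=causal)      # each position attends only to earlier pixels
        return self.head(h), h               # next-token logits and features

def next_pixel_loss(model, tokens):
    """Cross-entropy for predicting pixel t from pixels < t."""
    logits, _ = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
```

The linear-probe results quoted in the same row come from freezing such a model and training only a linear classifier on its intermediate features, which is why the forward pass above also returns the hidden states.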
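The pre-training recipe in the Experiment Setup row (batch size 128, 1,000,000 iterations, Adam with β1 = 0.9 and β2 = 0.95, one epoch of linear warmup, cosine decay to 0, no dropout) maps onto a standard optimizer/scheduler pairing. The sketch below is one way to express it in PyTorch; `steps_per_epoch` and the chosen learning rate are placeholders, since the paper selects the learning rate by sweeping 0.01, 0.003, 0.001, ... and stopping at the first local minimum of validation loss.

```python
import math
import torch

TOTAL_ITERS = 1_000_000  # iteration count quoted from the paper

def make_optimizer_and_schedule(model, lr, steps_per_epoch):
    """Adam with the quoted betas, one epoch of linear warmup, then cosine decay to 0."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.95))

    def lr_lambda(step):
        if step < steps_per_epoch:                        # linear warmup for one epoch
            return step / max(1, steps_per_epoch)
        progress = (step - steps_per_epoch) / max(1, TOTAL_ITERS - steps_per_epoch)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # cosine to zero

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```

For fine-tuning, the quoted setup keeps the same batch size and Adam hyperparameters but drops the cosine schedule and relies on early stopping at maximum validation accuracy, so only the optimizer half of this sketch would carry over.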
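For the CIFAR-10, CIFAR-100, and STL-10 probes, the quoted setup fits the linear classifier with L-BFGS and cites scikit-learn (Pedregosa et al., 2011). A hedged reading of that is a multinomial logistic regression trained on frozen features; the regularization strength, iteration cap, and the feature-extraction step are not specified in the quote, so the values and the `*_feats` inputs below are assumptions.

```python
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear classifier with the L-BFGS solver on pre-extracted, frozen features."""
    clf = LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000)  # assumed settings
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)  # top-1 accuracy on the held-out split
```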