Generative Pretraining From Pixels

Authors: Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pretrained models. We are also competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0% top-1 accuracy on a linear probe of our features. (See the pixel-prediction sketch after the table.)
Researcher Affiliation | Industry | OpenAI, San Francisco, CA, USA. Correspondence to: Mark Chen <mark@openai.com>.
Pseudocode | No | No structured pseudocode or algorithm blocks labeled 'Pseudocode' or 'Algorithm' were found.
Open Source Code | No | No explicit statement providing concrete access to source code (e.g., a specific repository link or explicit code release statement) for the methodology described in this paper was found.
Open Datasets | Yes | We use the ImageNet ILSVRC 2012 training dataset, splitting off 4% as our experimental validation set and report results on the ILSVRC 2012 validation set as our test set. For CIFAR-10, CIFAR-100 and STL-10, we split off 10% of the provided training set instead.
Dataset Splits | Yes | We use the ImageNet ILSVRC 2012 training dataset, splitting off 4% as our experimental validation set and report results on the ILSVRC 2012 validation set as our test set. For CIFAR-10, CIFAR-100 and STL-10, we split off 10% of the provided training set instead. (See the holdout-split sketch after the table.)
Hardware Specification | Yes | When using a flexible architecture (Vaswani et al., 2017), a tractable and efficient likelihood based training objective (Larochelle & Murray, 2011; Oord et al., 2016), and significant compute resources (1024 TPU cores), generative pre-training is competitive with other self-supervised approaches and learns [...]
Software Dependencies | No | No specific software dependencies with version numbers (e.g., library names with versions) were found. The paper mentions using 'Adam' and the 'GPT-2 formulation of the transformer decoder block', but not specific software environments or libraries.
Experiment Setup | Yes | When pre-training, we use a batch size of 128 and train for 1000000 iterations using Adam with β1 = 0.9 and β2 = 0.95. We sequentially try the learning rates 0.01, 0.003, 0.001, 0.0003, ..., stopping at the first local minimum. The learning rate is warmed up for one epoch, and then decays to 0 following a cosine schedule. No dropout is used. When fine-tuning, we use the same batch size and Adam hyperparameters. Here, we do not employ a cosine schedule, and early stop once we reach the maximum validation accuracy. Again, no dropout is used. When running a linear probe on ImageNet, we follow recent literature and use SGD with momentum 0.9 and a high learning rate (we try the values 30, 10, 3, ... in the manner described above) (He et al., 2019). We train for 1000000 iterations with a cosine learning rate schedule. Finally, when running a linear probe on CIFAR-10, CIFAR-100, or STL-10, we use the L-BFGS algorithm for consistency with prior results (Pedregosa et al., 2011). In Table 5, we present the learning rates used to train each model in the paper. (See the warmup-plus-cosine optimizer sketch after the table.)
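
The method quoted in the Research Type row reduces to next-token prediction over a flattened pixel sequence with a decoder-only Transformer. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, not the authors' released implementation; the module names, the model sizes, and the 512-entry pixel vocabulary (the paper quantizes colors to a 9-bit palette) are illustrative assumptions standing in for details the excerpt does not specify.

```python
# Minimal sketch of autoregressive pixel modeling in the spirit of iGPT:
# a causal Transformer over a flattened, palette-quantized pixel sequence.
# Illustrative only; not the paper's code. Requires PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelGPT(nn.Module):
    """Decoder-only Transformer that predicts the next pixel token."""

    def __init__(self, vocab_size=512, seq_len=32 * 32, d_model=256,
                 n_heads=8, n_layers=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), 1)
        self.register_buffer("causal_mask", mask)

    def forward(self, tokens):                       # tokens: (B, T) int64
        T = tokens.size(1)
        x = self.tok_emb(tokens) + self.pos_emb[:, :T]
        x = self.blocks(x, mask=self.causal_mask[:T, :T])
        return self.head(x)                          # (B, T, vocab_size)


def training_step(model, tokens):
    """Next-token cross-entropy: predict pixel t from pixels < t."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
```

A linear probe, as used for the reported CIFAR-10 and ImageNet numbers, would then train only a linear classifier on features extracted from a frozen model of this kind.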
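The 4% / 10% holdout described in the Open Datasets and Dataset Splits rows can be reproduced with a simple random split. This is a minimal sketch assuming PyTorch-style datasets; the helper name, the use of random_split, and the fixed seed are assumptions, since the excerpt does not say how the split was drawn.

```python
# Hold out a validation fraction from a training set: 4% for ImageNet,
# 10% for CIFAR-10, CIFAR-100, and STL-10 (per the quoted text).
import torch
from torch.utils.data import random_split


def holdout_split(train_set, val_fraction=0.04, seed=0):
    n_val = int(len(train_set) * val_fraction)
    n_train = len(train_set) - n_val
    return random_split(train_set, [n_train, n_val],
                        generator=torch.Generator().manual_seed(seed))
```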
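The Experiment Setup row pins down the pre-training optimizer (Adam with β1 = 0.9, β2 = 0.95), a one-epoch warmup, and a cosine decay of the learning rate to 0. Below is a minimal sketch of that schedule, assuming PyTorch; the helper name and the use of LambdaLR are assumptions, and only the hyperparameter values come from the quoted text.

```python
# Adam with betas (0.9, 0.95), linear warmup for one epoch, then cosine
# decay to 0 over the remaining training steps. Illustrative sketch only.
import math
import torch


def make_optimizer_and_schedule(model, base_lr, warmup_steps, total_steps):
    opt = torch.optim.Adam(model.parameters(), lr=base_lr, betas=(0.9, 0.95))

    def lr_factor(step):
        if step < warmup_steps:                      # linear warmup (one epoch)
            return (step + 1) / warmup_steps
        # cosine decay from 1 down to 0 over the remaining steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_factor)
    return opt, sched
```

Following the quoted sweep, one would call this helper once per candidate learning rate (0.01, 0.003, 0.001, 0.0003, ...) and stop at the first local minimum, as the paper describes.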