Frozen Pretrained Transformers as Universal Computation Engines

Authors: Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning, in particular without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction.
Researcher Affiliation | Collaboration | 1 UC Berkeley, 2 Facebook AI Research, 3 UCLA, 4 Google Brain
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Our code is available at: github.com/kzl/universal-computation
Open Datasets | Yes | On MNIST, the model must classify a handwritten digit from a 32 × 32 black-and-white image. On CIFAR-10 (Krizhevsky et al. 2009), the model will be given 4 × 4 image patches... We use the datasets provided by TAPE (Rao et al. 2019; Fox, Brenner, and Chandonia 2013; Hou, Adhikari, and Cheng 2018).
Dataset Splits | No | The paper mentions a 'train/test split' for the remote homology detection task but does not give specific percentages or example counts, and it does not explicitly mention a validation split for any of the datasets used.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions 'PyTorch' but does not provide specific version numbers for PyTorch or any other software dependency.
Experiment Setup | Yes | We minimally tune the FPT models, using the standard pretrained GPT-2 and the default PyTorch learning rate of 10^-3. We compare to fully training a transformer from scratch, without pretraining, using the same learning rate and batch size; we swept over layer sizes of 3 or 12, as some of the fully trained models benefited from smaller size due to optimization challenges. All model sizes are the base model size (12 layers, 768 hidden dimension), unless stated otherwise. (A hedged code sketch of this setup appears below the table.)
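
The "Research Type" and "Experiment Setup" rows above jointly describe the FPT recipe: keep the pretrained GPT-2's self-attention and feedforward layers frozen, attach small task-specific input and output layers, and finetune the remaining parameters at a learning rate of 10^-3. The sketch below is a minimal, hedged illustration of that recipe rather than the authors' released code (see github.com/kzl/universal-computation for the latter). It assumes the Hugging Face transformers GPT-2 backbone; the class name FrozenPretrainedTransformer, the input_dim and num_classes values, the Adam optimizer choice, and the last-token readout are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the FPT recipe, assuming the Hugging Face `transformers`
# GPT-2 backbone. Not the authors' implementation; names such as
# FrozenPretrainedTransformer, input_dim, num_classes, and the last-token
# readout are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import GPT2Model


class FrozenPretrainedTransformer(nn.Module):
    def __init__(self, input_dim: int, num_classes: int, hidden_dim: int = 768):
        super().__init__()
        # Base model size: 12 layers, 768 hidden dimension ("gpt2").
        self.backbone = GPT2Model.from_pretrained("gpt2")
        # Freeze the self-attention and feedforward (MLP) weights of every
        # residual block; layer norms and positional embeddings stay trainable,
        # as do the new per-task input and output layers below. The token
        # embedding `wte` is bypassed entirely (we feed inputs_embeds), so it
        # is frozen as well.
        for name, param in self.backbone.named_parameters():
            if "attn" in name or "mlp" in name or "wte" in name:
                param.requires_grad = False
        self.input_proj = nn.Linear(input_dim, hidden_dim)     # new input layer
        self.output_head = nn.Linear(hidden_dim, num_classes)  # new output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim), e.g. a sequence of flattened
        # image patches or other task tokens.
        hidden = self.backbone(inputs_embeds=self.input_proj(x)).last_hidden_state
        # Classify from the final token's representation (one simple readout
        # choice for this sketch).
        return self.output_head(hidden[:, -1])


# Finetuning setup mirroring the reported learning rate of 1e-3; the choice of
# Adam here is an assumption of this sketch. Only unfrozen parameters are
# handed to the optimizer.
model = FrozenPretrainedTransformer(input_dim=16, num_classes=10)  # toy dimensions
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```

With roughly this setup, only the input projection, output head, layer norm parameters, and positional embeddings receive gradient updates, which is what makes the finetuning "minimal" relative to training the full transformer from scratch.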