Frozen Pretrained Transformers as Universal Computation Engines
Authors: Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch (pp. 7628-7636)
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning, in particular without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. (A sketch of this freezing scheme appears after the table.) |
| Researcher Affiliation | Collaboration | UC Berkeley, Facebook AI Research, UCLA, Google Brain |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Our code is available at: github.com/kzl/universal-computation |
| Open Datasets | Yes | On MNIST, the model must classify a handwritten digit from a 32×32 black-and-white image. On CIFAR-10 (Krizhevsky et al. 2009), the model will be given 4×4 image patches... We use the datasets provided by TAPE (Rao et al. 2019; Fox, Brenner, and Chandonia 2013; Hou, Adhikari, and Cheng 2018). |
| Dataset Splits | No | The paper mentions 'train/test split' for the Remote homology detection task but does not provide specific percentages, counts, or explicit mention of a validation split for any of the datasets used. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not provide specific version numbers for PyTorch or any other software dependencies used. |
| Experiment Setup | Yes | We minimally tune the FPT models, using the standard pretrained GPT-2 and the default PyTorch learning rate of 10^-3. We compare to fully training a transformer from scratch, without pretraining, using the same learning rate and batch size; we swept over layer sizes of 3 or 12, as some of the fully trained models benefited from smaller size due to optimization challenges. All model sizes are the base model size (12 layers, 768 hidden dimension), unless stated otherwise. (A minimal sketch of this setup appears after the table.) |
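
The freezing scheme referenced in the Research Type row can be sketched as follows. This is a minimal illustration assuming the HuggingFace `transformers` GPT-2 implementation rather than the authors' released code (github.com/kzl/universal-computation); `FrozenPretrainedTransformer`, `input_dim`, and `num_classes` are illustrative names, not identifiers from the paper.

```python
# Sketch of the FPT idea: freeze GPT-2's self-attention and feedforward weights,
# keep layer norms and positional embeddings trainable, and learn new task-specific
# input/output layers. Assumes the HuggingFace GPT-2 model, not the authors' code.
import torch.nn as nn
from transformers import GPT2Model


class FrozenPretrainedTransformer(nn.Module):
    """Illustrative wrapper; names are hypothetical, not from the paper."""

    def __init__(self, input_dim, num_classes, hidden_dim=768):
        super().__init__()
        # GPT-2 base size: 12 layers, 768 hidden dimension.
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        # Freeze self-attention and feedforward parameters in every residual block;
        # the per-block layer norms, final layer norm, and positional embeddings
        # remain trainable.
        for block in self.gpt2.h:
            for param in block.attn.parameters():
                param.requires_grad = False
            for param in block.mlp.parameters():
                param.requires_grad = False
        self.embed_in = nn.Linear(input_dim, hidden_dim)  # trainable input layer
        self.head = nn.Linear(hidden_dim, num_classes)    # trainable output layer

    def forward(self, x):
        # x: (batch, sequence length, input_dim), e.g. flattened image patches.
        hidden = self.gpt2(inputs_embeds=self.embed_in(x)).last_hidden_state
        return self.head(hidden[:, -1])  # classify from the final token position
```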
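
Per the Experiment Setup row, only the unfrozen parameters are finetuned with the default PyTorch learning rate of 10^-3 (the Adam default of 1e-3). A minimal training loop under that assumption might look like the following; the synthetic data, batch size, and sequence length are placeholders, not the paper's tasks or hyperparameters.

```python
# Hedged sketch of the finetuning loop: optimize only the parameters left trainable
# by the freezing scheme above, at the default PyTorch Adam learning rate of 1e-3.
# The synthetic data below stands in for the paper's bit/vision/protein tasks.
import torch
import torch.nn as nn

model = FrozenPretrainedTransformer(input_dim=16, num_classes=10)
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),  # skip frozen attn/mlp weights
    lr=1e-3,
)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batches: 8 sequences of length 64 with 16-dimensional tokens.
dataloader = [(torch.randn(8, 64, 16), torch.randint(0, 10, (8,))) for _ in range(4)]

for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
```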