Frozen Pretrained Transformers as Universal Computation Engines

Authors: Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning, in particular without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction.
Researcher Affiliation | Collaboration | 1 UC Berkeley, 2 Facebook AI Research, 3 UCLA, 4 Google Brain
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Our code is available at: github.com/kzl/universal-computation
Open Datasets | Yes | On MNIST, the model must classify a handwritten digit from a 32 × 32 black-and-white image. On CIFAR-10 (Krizhevsky et al. 2009), the model will be given 4 × 4 image patches... We use the datasets provided by TAPE (Rao et al. 2019; Fox, Brenner, and Chandonia 2013; Hou, Adhikari, and Cheng 2018).
Dataset Splits | No | The paper mentions a 'train/test split' for the remote homology detection task but does not give specific percentages or example counts, and it does not explicitly mention a validation split for any of the datasets used.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions 'PyTorch' but does not provide specific version numbers for PyTorch or any other software dependency.
Experiment Setup | Yes | We minimally tune the FPT models, using the standard pretrained GPT-2 and the default PyTorch learning rate of 10^-3. We compare to fully training a transformer from scratch, without pretraining, using the same learning rate and batch size; we swept over layer sizes of 3 or 12, as some of the fully trained models benefited from smaller size due to optimization challenges. All model sizes are the base model size (12 layers, 768 hidden dimension), unless stated otherwise. (A hedged code sketch of this setup appears below the table.)
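
The "Research Type" and "Experiment Setup" rows above jointly describe the FPT recipe: keep the pretrained GPT-2's self-attention and feedforward layers frozen, attach small task-specific input and output layers, and finetune the remaining parameters at a learning rate of 10^-3. The sketch below is a minimal, hedged illustration of that recipe rather than the authors' released code (see github.com/kzl/universal-computation for the latter). It assumes the Hugging Face transformers GPT-2 backbone; the class name FrozenPretrainedTransformer, the input_dim and num_classes values, the Adam optimizer choice, and the last-token readout are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the FPT recipe, assuming the Hugging Face `transformers`
# GPT-2 backbone. Not the authors' implementation; names such as
# FrozenPretrainedTransformer, input_dim, num_classes, and the last-token
# readout are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import GPT2Model


class FrozenPretrainedTransformer(nn.Module):
    def __init__(self, input_dim: int, num_classes: int, hidden_dim: int = 768):
        super().__init__()
        # Base model size: 12 layers, 768 hidden dimension ("gpt2").
        self.backbone = GPT2Model.from_pretrained("gpt2")
        # Freeze the self-attention and feedforward (MLP) weights of every
        # residual block; layer norms and positional embeddings stay trainable,
        # as do the new per-task input and output layers below. The token
        # embedding `wte` is bypassed entirely (we feed inputs_embeds), so it
        # is frozen as well.
        for name, param in self.backbone.named_parameters():
            if "attn" in name or "mlp" in name or "wte" in name:
                param.requires_grad = False
        self.input_proj = nn.Linear(input_dim, hidden_dim)     # new input layer
        self.output_head = nn.Linear(hidden_dim, num_classes)  # new output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim), e.g. a sequence of flattened
        # image patches or other task tokens.
        hidden = self.backbone(inputs_embeds=self.input_proj(x)).last_hidden_state
        # Classify from the final token's representation (one simple readout
        # choice for this sketch).
        return self.output_head(hidden[:, -1])


# Finetuning setup mirroring the reported learning rate of 1e-3; the choice of
# Adam here is an assumption of this sketch. Only unfrozen parameters are
# handed to the optimizer.
model = FrozenPretrainedTransformer(input_dim=16, num_classes=10)  # toy dimensions
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```

With roughly this setup, only the input projection, output head, layer norm parameters, and positional embeddings receive gradient updates, which is what makes the finetuning "minimal" relative to training the full transformer from scratch.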