Learning Transformer Programs

Authors: Dan Friedman, Alexander Wettig, Danqi Chen

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate our approach, we learn Transformer Programs for a variety of problems, including an in-context learning task, a suite of algorithmic problems (e.g. sorting, recognizing Dyck-languages), and NLP tasks including named entity recognition and text classification. The Transformer Programs can automatically find reasonable solutions, performing on par with standard Transformers of comparable size; and, more importantly, they are easy to interpret.
Researcher Affiliation | Academia | Dan Friedman, Alexander Wettig, Danqi Chen; Department of Computer Science & Princeton Language and Intelligence, Princeton University; {dfriedman,awettig,danqic}@cs.princeton.edu
Pseudocode | No | The paper includes Python code snippets for the learned Transformer Programs, but it does not provide pseudocode or algorithm blocks for the method of learning Transformer Programs itself. (An illustrative example of such a snippet is sketched below the table.)
Open Source Code | Yes | Our code is available at https://github.com/princeton-nlp/TransformerPrograms, along with a number of example Transformer Programs.
Open Datasets | Yes | We validate our approach by learning Transformer Programs for a variety of problems, including an in-context learning task; the set of algorithmic problems introduced by Weiss et al. [2021]; and NLP benchmarks for named entity recognition and text classification. The NLP experiments use the CoNLL-2003 Named Entity Recognition task [Sang and De Meulder, 2003], with the distribution from Hugging Face Datasets [Lhoest et al., 2021].
Dataset Splits | Yes | For each RASP task, we sample 20,000 inputs without replacement and partition them into train, validation, and test sets containing 16,000/2,000/2,000 instances respectively. For the NER task, we use the standard train/validation/test split and evaluate the results using a Python implementation of the standard CoNLL evaluation script [Nakayama, 2018]. (A sketch of the RASP-task split follows the table.)
Hardware Specification | Yes | Each model takes between five and fifteen minutes to train on an Nvidia RTX 2080 GPU, depending on the number of layers.
Software Dependencies | No | The paper mentions implementing models in PyTorch [Paszke et al., 2019], but does not provide a specific version number for PyTorch or other software dependencies.
Experiment Setup | Yes | We train each model for 250 epochs with a batch size of 512 and a learning rate of 0.05, annealing the Gumbel temperature geometrically from 3.0 to 0.01 by decreasing the temperature at each training step. (A sketch of such a schedule follows the table.)
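
The Pseudocode row notes that the paper ships Python code snippets for the learned programs rather than pseudocode for the learning method. As a purely hypothetical illustration (not taken from the paper or its repository), the RASP-style select/aggregate primitives that such snippets build on can be mimicked in a few lines of plain Python; `select`, `aggregate`, and the toy "shift right" program below are invented for this sketch.

```python
# Hypothetical illustration (not from the paper): RASP-style primitives in plain Python.

def select(keys, queries, predicate):
    # Hard attention pattern: entry [q][k] is True when predicate(keys[k], queries[q]) holds.
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(attention, values, default=None):
    # For each query position, read the value at the first selected key position.
    out = []
    for row in attention:
        selected = [v for keep, v in zip(row, values) if keep]
        out.append(selected[0] if selected else default)
    return out

# Toy "shift right" program: every position attends to the position directly before it.
tokens = ["<s>", "a", "b", "c"]
positions = list(range(len(tokens)))
attn = select(positions, positions, lambda k, q: k == q - 1)
print(aggregate(attn, tokens, default="<pad>"))  # ['<pad>', '<s>', 'a', 'b']
```

Roughly speaking, each attention head in a learned Transformer Program corresponds to one such hard select/aggregate pair, with the predicate and value variables learned during training.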
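The Dataset Splits row describes sampling 20,000 unique inputs per RASP task and partitioning them 16,000/2,000/2,000. A minimal sketch of that kind of split, assuming an illustrative vocabulary and sequence-length range (`sample_unique_sequences` is hypothetical, not the authors' data pipeline):

```python
# Minimal sketch (not the authors' pipeline): sample 20,000 unique inputs for a
# RASP-style task and split them into 16,000 / 2,000 / 2,000 train/val/test sets.
import random

def sample_unique_sequences(n, vocab, min_len=1, max_len=8, seed=0):
    # Draw n distinct token sequences without replacement; vocab and lengths are illustrative.
    rng = random.Random(seed)
    seqs = set()
    while len(seqs) < n:
        length = rng.randint(min_len, max_len)
        seqs.add(tuple(rng.choices(vocab, k=length)))
    return sorted(seqs)

data = sample_unique_sequences(20_000, vocab=list("abcdefgh"))
random.Random(0).shuffle(data)
train, val, test = data[:16_000], data[16_000:18_000], data[18_000:]
assert (len(train), len(val), len(test)) == (16_000, 2_000, 2_000)
```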
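The Experiment Setup row specifies geometric annealing of the Gumbel temperature from 3.0 to 0.01, lowered at every training step. A minimal sketch of such a schedule, assuming a constant per-step decay factor and illustrative step-count arithmetic (the exact formula and number of steps are not stated above):

```python
# Minimal sketch, assuming a constant per-step decay factor; only the endpoints
# 3.0 -> 0.01 and the per-step decrease are taken from the description above.
import math

def gumbel_temperature(step, total_steps, t_start=3.0, t_end=0.01):
    # Geometric interpolation from t_start (step 0) to t_end (final step).
    ratio = step / max(total_steps - 1, 1)
    return t_start * (t_end / t_start) ** ratio

# Illustrative step count (assumption, not stated above): 250 epochs at batch
# size 512 over 16,000 RASP training examples is roughly ceil(16000 / 512) * 250 steps.
total_steps = math.ceil(16_000 / 512) * 250
print(gumbel_temperature(0, total_steps))                          # 3.0
print(round(gumbel_temperature(total_steps - 1, total_steps), 4))  # 0.01
```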