Transformers Learn Shortcuts to Automata

Authors: Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Empirically, we perform synthetic experiments by training Transformers to simulate a wide variety of automata, and show that shortcut solutions can be learned via standard training. We accompany these theoretical findings with an extensive set of experiments: end-to-end learnability of shortcuts via SGD (Section 4); more challenging settings (Section 5). |
| Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, ²Microsoft Research NYC, ³University of Pennsylvania |
| Pseudocode | Yes | We first provide pseudocode (rather than Transformer weights) for computing the final state q_T (rather than the entire state sequence). Algorithm 1: 1D gridworld: computing the final state. |
| Open Source Code | No | We intend to release our code as open source prior to publication. |
| Open Datasets | No | For the empirical results, all our datasets are derived from synthetic distributions, which are clearly described in Appendix B.1 and B.2. |
| Dataset Splits | No | The paper specifies sequence lengths and mentions using 'independent (unseen) samples' for evaluation and different lengths for training and testing in generalization experiments. However, it explicitly states that data is 'freshly-sampled' for each minibatch and does not provide fixed, reproducible training/validation/test splits from a single dataset. |
| Hardware Specification | Yes | The experiments were performed on an internal cluster with NVIDIA Tesla P40, P100, V100, and A100 GPUs. |
| Software Dependencies | No | Our experiments are implemented with PyTorch (Paszke et al., 2019). The Transformer architectures are taken from the Hugging Face Transformers library (Wolf et al., 2019), using the GPT-2 configuration as a base. |
| Experiment Setup | Yes | For GPT-2 models, we fix the embedding dimension and MLP width to 512 and the number of heads to 8 in all experiments in Section 4, and vary the number of layers from 1 to 16. For LSTM, we fix the embedding dimension to 64, the hidden dimension to 128, and the number of layers to 1. We use the AdamW optimizer (Loshchilov & Hutter, 2017), with learning rate in {3e-5, 1e-4, 3e-4} for GPT-2 or {1e-3, 3e-3} for LSTM, weight decay 1e-4 for GPT-2 or 1e-9 for LSTM, and batch size 16 for GPT-2 or 64 for LSTM. |
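
The Pseudocode row quotes the paper's Algorithm 1 ("1D gridworld: computing the final state"). The paper's exact algorithm is not reproduced here; the following is a minimal Python sketch, under my own assumptions (states {0, ..., n_states-1}, moves in {-1, 0, +1}, clamping at both boundaries), of why such a final-state shortcut exists: each input token acts on the state as a clamped shift, and these maps compose associatively, so the T per-token maps can be combined in a balanced tree of logarithmic depth instead of being applied strictly one step at a time. The `reduce` below combines them left to right only for simplicity; a parallel scan would use the same `compose` rule.

```python
# Illustrative sketch (not the paper's Algorithm 1): computing the final state
# q_T of a 1D gridworld automaton without storing every intermediate state.
# Assumptions (mine, not from the paper): states are {0, ..., n_states - 1},
# each input token is a move in {-1, 0, +1}, and moves past a boundary are clamped.

from functools import reduce

def step_fn(move, n_states):
    """Each token acts on the state as q -> min(hi, max(lo, q + shift)).
    Represent that map by the triple (shift, lo, hi)."""
    return (move, 0, n_states - 1)

def compose(f, g):
    """Return the map 'apply f, then g', which is again of clamped-shift form.
    Composition is associative, so the per-token maps can be combined in any
    bracketing -- e.g. a log-depth balanced tree, the kind of shortcut the
    paper argues a shallow Transformer can express."""
    s1, lo1, hi1 = f
    s2, lo2, hi2 = g
    shift = s1 + s2
    lo = min(hi2, max(lo2, lo1 + s2))
    hi = min(hi2, max(lo2, hi1 + s2))
    return (shift, lo, hi)

def final_state(q0, moves, n_states):
    """Fold the per-token maps into a single map and apply it to the start state."""
    shift, lo, hi = reduce(compose, (step_fn(m, n_states) for m in moves))
    return min(hi, max(lo, q0 + shift))

def final_state_sequential(q0, moves, n_states):
    """Reference implementation: simulate the automaton one step at a time."""
    q = q0
    for m in moves:
        q = min(n_states - 1, max(0, q + m))
    return q

if __name__ == "__main__":
    import random
    random.seed(0)
    for _ in range(1000):
        moves = [random.choice([-1, 0, 1]) for _ in range(50)]
        assert final_state(0, moves, 8) == final_state_sequential(0, moves, 8)
```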
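The Software Dependencies and Experiment Setup rows pin down the architectures and optimizer hyperparameters. As a hedged sketch (not the authors' released code), the reported configuration could be instantiated with Hugging Face Transformers and PyTorch roughly as follows; the vocabulary size, sequence length, layer count, number of automaton states, and the particular learning rate drawn from the reported grids are placeholders I chose, not values confirmed by the paper.

```python
# Hedged sketch: models and optimizers matching the hyperparameters quoted in the
# Experiment Setup row. VOCAB_SIZE, SEQ_LEN, n_layer, n_states, and the specific
# learning rates picked from the reported grids are placeholders.

import torch
from torch import nn
from transformers import GPT2Config, GPT2LMHeadModel

VOCAB_SIZE = 16   # assumption: depends on the automaton's input alphabet
SEQ_LEN = 100     # assumption: depends on the training sequence length

# GPT-2: embedding dim and MLP width 512, 8 heads, 1-16 layers (here: 4).
gpt2_config = GPT2Config(
    vocab_size=VOCAB_SIZE,
    n_positions=SEQ_LEN,
    n_embd=512,
    n_inner=512,   # MLP width
    n_head=8,
    n_layer=4,
)
gpt2 = GPT2LMHeadModel(gpt2_config)
gpt2_opt = torch.optim.AdamW(gpt2.parameters(), lr=3e-4, weight_decay=1e-4)
# reported GPT-2 grid: lr in {3e-5, 1e-4, 3e-4}, batch size 16

# LSTM baseline: embedding dim 64, hidden dim 128, 1 layer.
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, n_states):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        self.lstm = nn.LSTM(input_size=64, hidden_size=128,
                            num_layers=1, batch_first=True)
        self.head = nn.Linear(128, n_states)  # predict the automaton state per position

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.head(h)

lstm = LSTMModel(VOCAB_SIZE, n_states=8)  # n_states=8 is a placeholder
lstm_opt = torch.optim.AdamW(lstm.parameters(), lr=1e-3, weight_decay=1e-9)
# reported LSTM grid: lr in {1e-3, 3e-3}, batch size 64
```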