Transformers Learn Shortcuts to Automata
Authors: Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we perform synthetic experiments by training Transformers to simulate a wide variety of automata, and show that shortcut solutions can be learned via standard training. We accompany these theoretical findings with an extensive set of experiments, including end-to-end learnability of shortcuts via SGD (Section 4) and more challenging settings (Section 5). |
| Researcher Affiliation | Collaboration | 1Carnegie Mellon University 2Microsoft Research NYC 3University of Pennsylvania |
| Pseudocode | Yes | We first provide pseudocode (rather than Transformer weights) for computing the final state q_T (rather than the entire state sequence). Algorithm 1: 1D gridworld: computing the final state. (A minimal reference sketch of this task appears after the table.) |
| Open Source Code | No | We intend to release our code as open source prior to publication. |
| Open Datasets | No | For the empirical results, all our datasets are derived from synthetic distributions, which are clearly described in Appendix B.1 and B.2. |
| Dataset Splits | No | The paper specifies sequence lengths and mentions using 'independent (unseen) samples' for evaluation, with different sequence lengths for training and testing in the generalization experiments. However, it explicitly states that data are freshly sampled for each minibatch and does not provide fixed, reproducible train/validation/test splits from a single dataset. (A sketch of this fresh-sampling protocol appears after the table.) |
| Hardware Specification | Yes | The experiments were performed on an internal cluster with NVIDIA Tesla P40, P100, V100, and A100 GPUs. |
| Software Dependencies | No | Our experiments are implemented with PyTorch (Paszke et al., 2019). The Transformer architectures are taken from the Hugging Face Transformers library (Wolf et al., 2019), using the GPT-2 configuration as a base. |
| Experiment Setup | Yes | For GPT-2 models, we fix the embedding dimension and MLP width to 512 and the number of heads to 8 in all experiments in Section 4, and vary the number of layers from 1 to 16. For LSTM, we fix the embedding dimension to 64, the hidden dimension to 128, and the number of layers to 1. We use the AdamW optimizer (Loshchilov & Hutter, 2017), with learning rate in {3e-5, 1e-4, 3e-4} for GPT-2 or {1e-3, 3e-3} for LSTM, weight decay 1e-4 for GPT-2 or 1e-9 for LSTM, and batch size 16 for GPT-2 or 64 for LSTM. (A configuration sketch appears after the table.) |
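
The Pseudocode row refers to the paper's Algorithm 1, which computes only the final state q_T of a 1D gridworld automaton rather than the whole state sequence. The snippet below is not the paper's algorithm; it is a minimal sequential reference for what the task computes, with the function name and example inputs chosen purely for illustration.

```python
# A minimal sequential reference for the 1D gridworld task (not the paper's
# Algorithm 1): states are {0, ..., num_states - 1}, each input moves the
# state by +1 or -1, and moves past either boundary are clamped.
def gridworld_final_state(moves, num_states, q0=0):
    """Return only the final state q_T reached after applying `moves` from q0."""
    q = q0
    for m in moves:
        q = min(max(q + m, 0), num_states - 1)  # clamp to the grid
    return q

# Example: 5 states, starting at 0.
print(gridworld_final_state([+1, +1, -1, +1, +1, +1, +1], num_states=5))  # -> 4
```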
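The Dataset Splits row describes an online data protocol: every minibatch is drawn anew from the synthetic distribution, so there is no fixed split to release. Below is a hedged sketch of such a sampler for the same 1D gridworld; the sequence length and number of states are illustrative defaults, not values taken from the paper.

```python
import torch

# A hedged sketch of "freshly sampled" minibatches for the 1D gridworld task.
# seq_len and num_states are illustrative choices, not values from the paper.
def sample_batch(batch_size=16, seq_len=64, num_states=5, q0=0):
    """Sample random +/-1 move sequences and their per-step gridworld states."""
    moves = torch.randint(0, 2, (batch_size, seq_len)) * 2 - 1  # entries in {-1, +1}
    states = torch.empty(batch_size, seq_len, dtype=torch.long)
    q = torch.full((batch_size,), q0)
    for t in range(seq_len):
        q = (q + moves[:, t]).clamp(0, num_states - 1)  # clamped transition
        states[:, t] = q
    return moves, states  # inputs and state-sequence labels for one fresh minibatch
```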
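The Software Dependencies and Experiment Setup rows together suggest how the models might be instantiated. The sketch below uses the Hugging Face GPT-2 configuration and PyTorch's AdamW with the hyperparameters quoted in the table; since the training code is not released, vocab_size, n_positions, and the particular layer count are assumptions for illustration.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# A sketch of the GPT-2 setup quoted above; not the authors' released code.
config = GPT2Config(
    n_embd=512,       # embedding dimension (table: 512)
    n_inner=512,      # MLP width (table: 512)
    n_head=8,         # attention heads (table: 8)
    n_layer=4,        # the paper varies this from 1 to 16
    vocab_size=16,    # assumption: small synthetic automaton alphabet
    n_positions=128,  # assumption: maximum sequence length
)
model = GPT2LMHeadModel(config)

# AdamW with one learning rate from the quoted grid and the GPT-2 weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
```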