Transformers Learn Shortcuts to Automata

Authors: Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Empirically, we perform synthetic experiments by training Transformers to simulate a wide variety of automata, and show that shortcut solutions can be learned via standard training. We accompany these theoretical findings with an extensive set of experiments: end-to-end learnability of shortcuts via SGD (Section 4); more challenging settings (Section 5). |
| Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, ²Microsoft Research NYC, ³University of Pennsylvania |
| Pseudocode | Yes | We first provide pseudocode (rather than Transformer weights) for computing the final state q_T (rather than the entire state sequence). Algorithm 1: 1D gridworld: computing the final state. |
| Open Source Code | No | We intend to release our code as open source prior to publication. |
| Open Datasets | No | For the empirical results, all our datasets are derived from synthetic distributions, which are clearly described in Appendix B.1 and B.2. |
| Dataset Splits | No | The paper specifies sequence lengths and mentions using 'independent (unseen) samples' for evaluation and different lengths for training and testing in generalization experiments. However, it explicitly states that data is 'freshly-sampled' for each minibatch and does not provide fixed, reproducible training/validation/test splits from a single dataset. |
| Hardware Specification | Yes | The experiments were performed on an internal cluster with NVIDIA Tesla P40, P100, V100, and A100 GPUs. |
| Software Dependencies | No | Our experiments are implemented with PyTorch (Paszke et al., 2019). The Transformer architectures are taken from the Hugging Face Transformers library (Wolf et al., 2019), using the GPT-2 configuration as a base. |
| Experiment Setup | Yes | For GPT-2 models, we fix the embedding dimension and MLP width to 512 and the number of heads to 8 in all experiments in Section 4, and vary the number of layers from 1 to 16. For LSTM, we fix the embedding dimension to 64, the hidden dimension to 128, and the number of layers to 1. We use the AdamW optimizer (Loshchilov & Hutter, 2017), with learning rate in {3e-5, 1e-4, 3e-4} for GPT-2 or {1e-3, 3e-3} for LSTM, weight decay 1e-4 for GPT-2 or 1e-9 for LSTM, and batch size 16 for GPT-2 or 64 for LSTM. |
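
The Pseudocode row quotes the paper's Algorithm 1 ("1D gridworld: computing the final state"). The paper's exact algorithm is not reproduced here; the following is a minimal Python sketch, under my own assumptions (states {0, ..., n_states-1}, moves in {-1, 0, +1}, clamping at both boundaries), of why such a final-state shortcut exists: each input token acts on the state as a clamped shift, and these maps compose associatively, so the T per-token maps can be combined in a balanced tree of logarithmic depth instead of being applied strictly one step at a time. The `reduce` below combines them left to right only for simplicity; a parallel scan would use the same `compose` rule.

```python
# Illustrative sketch (not the paper's Algorithm 1): computing the final state
# q_T of a 1D gridworld automaton without storing every intermediate state.
# Assumptions (mine, not from the paper): states are {0, ..., n_states - 1},
# each input token is a move in {-1, 0, +1}, and moves past a boundary are clamped.

from functools import reduce

def step_fn(move, n_states):
    """Each token acts on the state as q -> min(hi, max(lo, q + shift)).
    Represent that map by the triple (shift, lo, hi)."""
    return (move, 0, n_states - 1)

def compose(f, g):
    """Return the map 'apply f, then g', which is again of clamped-shift form.
    Composition is associative, so the per-token maps can be combined in any
    bracketing -- e.g. a log-depth balanced tree, the kind of shortcut the
    paper argues a shallow Transformer can express."""
    s1, lo1, hi1 = f
    s2, lo2, hi2 = g
    shift = s1 + s2
    lo = min(hi2, max(lo2, lo1 + s2))
    hi = min(hi2, max(lo2, hi1 + s2))
    return (shift, lo, hi)

def final_state(q0, moves, n_states):
    """Fold the per-token maps into a single map and apply it to the start state."""
    shift, lo, hi = reduce(compose, (step_fn(m, n_states) for m in moves))
    return min(hi, max(lo, q0 + shift))

def final_state_sequential(q0, moves, n_states):
    """Reference implementation: simulate the automaton one step at a time."""
    q = q0
    for m in moves:
        q = min(n_states - 1, max(0, q + m))
    return q

if __name__ == "__main__":
    import random
    random.seed(0)
    for _ in range(1000):
        moves = [random.choice([-1, 0, 1]) for _ in range(50)]
        assert final_state(0, moves, 8) == final_state_sequential(0, moves, 8)
```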
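The Software Dependencies and Experiment Setup rows pin down the architectures and optimizer hyperparameters. As a hedged sketch (not the authors' released code), the reported configuration could be instantiated with Hugging Face Transformers and PyTorch roughly as follows; the vocabulary size, sequence length, layer count, number of automaton states, and the particular learning rate drawn from the reported grids are placeholders I chose, not values confirmed by the paper.

```python
# Hedged sketch: models and optimizers matching the hyperparameters quoted in the
# Experiment Setup row. VOCAB_SIZE, SEQ_LEN, n_layer, n_states, and the specific
# learning rates picked from the reported grids are placeholders.

import torch
from torch import nn
from transformers import GPT2Config, GPT2LMHeadModel

VOCAB_SIZE = 16   # assumption: depends on the automaton's input alphabet
SEQ_LEN = 100     # assumption: depends on the training sequence length

# GPT-2: embedding dim and MLP width 512, 8 heads, 1-16 layers (here: 4).
gpt2_config = GPT2Config(
    vocab_size=VOCAB_SIZE,
    n_positions=SEQ_LEN,
    n_embd=512,
    n_inner=512,   # MLP width
    n_head=8,
    n_layer=4,
)
gpt2 = GPT2LMHeadModel(gpt2_config)
gpt2_opt = torch.optim.AdamW(gpt2.parameters(), lr=3e-4, weight_decay=1e-4)
# reported GPT-2 grid: lr in {3e-5, 1e-4, 3e-4}, batch size 16

# LSTM baseline: embedding dim 64, hidden dim 128, 1 layer.
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, n_states):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        self.lstm = nn.LSTM(input_size=64, hidden_size=128,
                            num_layers=1, batch_first=True)
        self.head = nn.Linear(128, n_states)  # predict the automaton state per position

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.head(h)

lstm = LSTMModel(VOCAB_SIZE, n_states=8)  # n_states=8 is a placeholder
lstm_opt = torch.optim.AdamW(lstm.parameters(), lr=1e-3, weight_decay=1e-9)
# reported LSTM grid: lr in {1e-3, 3e-3}, batch size 64
```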