Exposing Attention Glitches with Flip-Flop Language Modeling
Authors: Bingbin Liu, Jordan Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our main set of synthetic experiments, we train neural language models to generate strings from the flip-flop language FFL(T = 512, p = (0.1, 0.1, 0.8)) (for short, FFL(0.8)), and probe whether the networks robustly learn the language. (See the FFL sampling sketch after the table.) |
| Researcher Affiliation | Collaboration | Bingbin Liu (Carnegie Mellon University), Jordan T. Ash (Microsoft Research NYC), Surbhi Goel (University of Pennsylvania), Akshay Krishnamurthy (Microsoft Research NYC), Cyril Zhang (Microsoft Research NYC) |
| Pseudocode | No | The paper describes models and processes but does not include any clearly structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to a dataset ('https://huggingface.co/datasets/synthseq/flipflop') and mentions using existing open-source libraries (x-transformers, Hugging Face's Transformer implementation), but it does not explicitly state that the code for their specific FFLM benchmark or experimental methodology is open-source or provide a link to it. |
| Open Datasets | Yes | For reproducibility, we publish this synthetic data at https://huggingface.co/datasets/synthseq/flipflop: 16M FFL(0.8) training sequences, 16K FFL(0.8) in-distribution test sequences, 160K sparse o.o.d. sequences from FFL(0.98), and 4K dense o.o.d. sequences from FFL(0.1). (See the dataset-loading sketch after the table.) |
| Dataset Splits | No | The paper describes the training data and test sets, but it does not explicitly mention validation data splits or their percentages/counts. |
| Hardware Specification | Yes | Each training run was performed on one GPU in an internal cluster, with NVIDIA P40, P100, V100, and RTX A6000 GPUs, with at least 16GB of VRAM. |
| Software Dependencies | No | The paper mentions software like PyTorch, x-transformers, and Hugging Face's Transformer implementation, but it does not provide specific version numbers for these components, which are needed for reproducibility. |
| Experiment Setup | Yes | Other hyperparameter choices. We use a sequence length of T = 512... We use a canonical set of training hyperparameters for this sweep: the AdamW (Loshchilov and Hutter, 2017) optimizer, with (β1, β2) = (0.9, 0.999), learning rate 3 × 10^−4, weight decay 0.1, 50 steps of linear learning rate warmup, and linear learning rate decay (setting the would-be 10001st step to 0). We train for 10000 steps on freshly sampled data, and choose a minibatch size of 16... (See the training-setup sketch after the table.) |
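
The flip-flop language quoted in the Research Type row consists of write/read/ignore instructions paired with bits, where each read must echo the most recently written bit and p = (p_write, p_read, p_ignore) sets the instruction frequencies. Below is a minimal sampling sketch, assuming that T counts individual symbols (so 512 symbols form 256 instruction-bit pairs), that each sequence opens with a write, and an illustrative `w`/`r`/`i`/bit token encoding; the published dataset's exact encoding may differ.

```python
import random

def sample_ffl(T=512, p=(0.1, 0.1, 0.8), seed=None):
    """Sample one FFL(T, p) string as a flat list of tokens.

    p = (p_write, p_read, p_ignore). T counts symbols, so the string holds
    T // 2 instruction-bit pairs. Token names here are illustrative.
    """
    rng = random.Random(seed)
    tokens = []
    memory = 0  # bit held by the flip-flop; set by the opening write below
    for t in range(T // 2):
        # Assumption: the sequence opens with a write so reads are well defined.
        instr = "w" if t == 0 else rng.choices(["w", "r", "i"], weights=p)[0]
        if instr == "w":
            bit = rng.randint(0, 1)
            memory = bit                 # writes update the stored bit
        elif instr == "r":
            bit = memory                 # reads must echo the most recent write
        else:
            bit = rng.randint(0, 1)      # ignores carry a uniformly random bit
        tokens += [instr, str(bit)]
    return tokens

# Example: the first few symbols of one FFL(0.8) sequence (512 symbols total).
print(" ".join(sample_ffl(seed=0)[:16]))
```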
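
The published FFL data sits on the Hugging Face Hub at the repository named in the Open Datasets row. A minimal loading sketch with the `datasets` library follows; the split and column names are not specified in the quote above, so the snippet only inspects whatever the repository exposes.

```python
from datasets import load_dataset

# Repository name comes from the paper; split and column layout are unknowns here.
ffl = load_dataset("synthseq/flipflop")
print(ffl)                   # shows the available splits and their sizes

first_split = next(iter(ffl))
print(ffl[first_split][0])   # one raw record from the first split
```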
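
The hyperparameters in the Experiment Setup row (AdamW with (β1, β2) = (0.9, 0.999), learning rate 3 × 10^−4, weight decay 0.1, 50 warmup steps, linear decay reaching zero at the would-be 10001st step, 10000 training steps, minibatch size 16) map onto the following PyTorch sketch. The model and data below are placeholders, not the paper's Transformer language model.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Placeholder model: the paper trains Transformer LMs, which are not reproduced here.
model = torch.nn.Linear(512, 512)

total_steps, warmup_steps = 10_000, 50
optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.999), weight_decay=0.1)

def lr_lambda(step):
    # 50 steps of linear warmup, then linear decay that would hit 0 at step 10001.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return max(0.0, (total_steps + 1 - step) / (total_steps + 1 - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    batch = torch.randn(16, 512)        # minibatch size 16, freshly sampled data
    loss = model(batch).pow(2).mean()   # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```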