Exposing Attention Glitches with Flip-Flop Language Modeling

Authors: Bingbin Liu, Jordan Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our main set of synthetic experiments, we train neural language models to generate strings from the flip-flop language FFL(T = 512, p = (0.1, 0.1, 0.8)) (for short, FFL(0.8)), and probe whether the networks robustly learn the language. (A minimal sampling sketch for FFL is given below the table.) |
| Researcher Affiliation | Collaboration | Bingbin Liu (Carnegie Mellon University), Jordan T. Ash (Microsoft Research NYC), Surbhi Goel (University of Pennsylvania), Akshay Krishnamurthy (Microsoft Research NYC), Cyril Zhang (Microsoft Research NYC) |
| Pseudocode | No | The paper describes models and processes but does not include any clearly structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to a dataset (https://huggingface.co/datasets/synthseq/flipflop) and mentions using existing open-source libraries (x-transformers, Hugging Face's Transformer implementation), but it does not state that the code for the FFLM benchmark or the experimental methodology is open-source, nor does it provide a link to such code. |
| Open Datasets | Yes | For reproducibility, we publish this synthetic data at https://huggingface.co/datasets/synthseq/flipflop: 16M FFL(0.8) training sequences, 16K FFL(0.8) in-distribution test sequences, 160K sparse o.o.d. sequences from FFL(0.98), and 4K dense o.o.d. sequences from FFL(0.1). (A hedged loading sketch is given below the table.) |
| Dataset Splits | No | The paper describes the training data and test sets, but it does not explicitly mention validation splits or their percentages/counts. |
| Hardware Specification | Yes | Each training run was performed on one GPU in an internal cluster, with NVIDIA P40, P100, V100, and RTX A6000 GPUs, with at least 16GB of VRAM. |
| Software Dependencies | No | The paper mentions software such as PyTorch, x-transformers, and Hugging Face's Transformer implementation, but it does not provide specific version numbers for these components, which are required for reproducibility. |
| Experiment Setup | Yes | Other hyperparameter choices: We use a sequence length of T = 512... We use a canonical set of training hyperparameters for this sweep: the AdamW (Loshchilov and Hutter, 2017) optimizer, with (β1, β2) = (0.9, 0.999), learning rate 3 × 10^−4, weight decay 0.1, 50 steps of linear learning rate warmup, and linear learning rate decay (setting the would-be 10001st step to 0). We train for 10000 steps on freshly sampled data, and choose a minibatch size of 16... (An optimizer/schedule sketch is given below the table.) |
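
For the Research Type row: the following is a minimal sketch of a sampler for flip-flop sequences, assuming the generative process described in the paper, i.e. sequences alternating instruction tokens (write `w`, read `r`, ignore `i`, drawn with probabilities p) and bit tokens, where a bit following a read must repeat the most recently written bit. The function name `sample_ffl` and the choice to force the first instruction to be a write are illustrative assumptions, not the authors' released code.

```python
import random

def sample_ffl(T=512, p=(0.1, 0.1, 0.8), seed=None):
    """Sample one flip-flop sequence of T tokens (T // 2 instruction-bit pairs).

    p = (p_write, p_read, p_ignore). Bits after write/ignore are uniform;
    a bit after read must repeat the most recently written bit.
    """
    rng = random.Random(seed)
    tokens, memory = [], None
    for t in range(T // 2):
        # Assumption: force the first instruction to be a write so that every
        # subsequent read has a well-defined answer.
        instr = "w" if t == 0 else rng.choices(["w", "r", "i"], weights=p)[0]
        if instr == "r":
            bit = memory
        else:
            bit = rng.choice("01")
            if instr == "w":
                memory = bit
        tokens.extend([instr, bit])
    return tokens

print(" ".join(sample_ffl(T=16, seed=0)))
```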
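For the Open Datasets row: the published data is hosted on the Hugging Face Hub, so it should be loadable with the `datasets` library. The snippet below is a hedged sketch; the repository's split names and configuration are not documented in the paper and may differ from what is shown here.

```python
from datasets import load_dataset

# Assumption: the repository loads under its default configuration; the actual
# split names (training / in-distribution test / o.o.d. sets) may differ.
ffl = load_dataset("synthseq/flipflop")
print(ffl)
```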
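For the Experiment Setup row: the quoted hyperparameters pin down the optimizer and schedule closely enough to sketch in PyTorch. The helper `make_optimizer_and_scheduler` below is an illustrative reconstruction, not the authors' code; in particular, the exact endpoints of the warmup and decay ramps are assumptions chosen so that the would-be 10001st step lands at zero.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer_and_scheduler(model, total_steps=10_000, warmup_steps=50,
                                 lr=3e-4, weight_decay=0.1):
    # AdamW with the hyperparameters quoted in the Experiment Setup row.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  betas=(0.9, 0.999), weight_decay=weight_decay)

    def lr_lambda(step):
        # `step` is the 0-indexed count of scheduler steps taken so far.
        if step + 1 <= warmup_steps:
            # Linear warmup over the first `warmup_steps` steps.
            return (step + 1) / warmup_steps
        # Linear decay; the would-be step total_steps + 1 reaches exactly 0.
        return max(0.0, (total_steps + 1 - (step + 1)) /
                        (total_steps + 1 - warmup_steps))

    return optimizer, LambdaLR(optimizer, lr_lambda)
```

Usage sketch: call `scheduler.step()` after each `optimizer.step()`, training for 10,000 steps with minibatches of 16 freshly sampled sequences, as quoted above.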