A distributional simplicity bias in the learning dynamics of transformers

Authors: Riccardo Rende, Federica Gerace, Alessandro Laio, Sebastian Goldt

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our study, we demonstrate that transformers, trained on natural language data, also display a simplicity bias. Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions while continuing to learn high-degree interactions. To conduct this analysis, we develop a procedure to generate clones of a given natural language data set, which rigorously capture the interactions between tokens up to a specified order.
Researcher Affiliation | Academia | International School for Advanced Studies, Trieste, Italy; {rrende, fgerace, laio, sgoldt}@sissa.it
Pseudocode | No | The paper describes mechanisms and mathematical formulations, but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | We share the scripts to train and sample the networks and reproduce our results at https://doi.org/10.5281/zenodo.13992398.
Open Datasets | Yes | Given a data set containing natural language (we will use WikiText-103 [20] as a running example); Specifically, we analyze the TinyStories dataset [22].
Dataset Splits | No | The paper describes the use of training and test sets but does not specify a validation split for model training. For example: 'We then train via MLM an increasing number of factored self-attention layers (from one to three) on these data, monitoring the test loss as a function of the number of epochs.'
Hardware Specification | Yes | We trained the model for 10 hours on eight A100 GPUs.
Software Dependencies | No | The paper mentions a 'Byte Level BPE Tokenizer (as implemented in Huggingface)' and specific optimizers (SGD, AdamW), but does not provide version numbers for software libraries or frameworks such as Huggingface itself, Python, or the deep learning framework used (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | We choose the number of layers equal to 4, number of heads equal to 16, embedding dimension equal to 768 and size of the hidden layer of the MLPs equal to 1536. Similar hyper-parameters are also used for the models trained in Ref. [22]. We choose SGD for the optimiser, setting the batch size to 1024. We start with a learning rate of 0.03, annealing it with a cosine decay scheduler.
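The sketches below illustrate several rows of the table above (open datasets, tokenizer, factored self-attention, experiment setup, clone sampling). They are hedged reconstructions under stated assumptions, not the authors' released code; the Zenodo archive linked in the Open Source Code row is the authoritative reference.

For the Open Datasets row, the Hugging Face dataset identifiers `wikitext`/`wikitext-103-raw-v1` and `roneneldan/TinyStories` are assumed public mirrors, and the authors' exact preprocessing may differ. The explicit validation split is added here only to illustrate the gap flagged in the Dataset Splits row; it is not taken from the paper.

```python
from datasets import load_dataset

# Public mirrors of the two corpora named in the paper (dataset IDs are assumptions).
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")   # WikiText-103 [20]
tinystories = load_dataset("roneneldan/TinyStories")          # TinyStories [22]

# The paper reports training and test sets only; a validation split could be
# carved from the training data like this (illustrative, not from the paper).
splits = wikitext["train"].train_test_split(test_size=0.01, seed=0)
train_set, val_set = splits["train"], splits["test"]

print(len(train_set), len(val_set), len(wikitext["test"]))
```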
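The Software Dependencies row quotes a 'Byte Level BPE Tokenizer (as implemented in Huggingface)'. A minimal sketch of training such a tokenizer with the Hugging Face `tokenizers` library follows; the file path, vocabulary size, and special tokens are placeholders, not values reported in the paper.

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on raw text files (path and sizes are placeholders).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["wikitext103_train.txt"],
    vocab_size=30_000,
    special_tokens=["<s>", "</s>", "<pad>", "<mask>"],
)
tokenizer.save_model("tokenizer_out")   # writes vocab.json and merges.txt

ids = tokenizer.encode("A distributional simplicity bias in transformers.").ids
print(ids)
```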
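The Dataset Splits row quotes the training of 'factored self-attention layers' via masked language modelling. The sketch below assumes the simplest reading of factored attention: a per-head attention matrix that is learned over positions and does not depend on the token embeddings. The paper's exact parameterization (for example, attention computed from positional encodings) may differ, so treat this module as illustrative only.

```python
import torch
import torch.nn as nn

class FactoredSelfAttention(nn.Module):
    """Attention whose weights depend only on token positions, not on the
    tokens themselves (a simplified reading of 'factored' attention)."""

    def __init__(self, seq_len: int, embed_dim: int, n_heads: int):
        super().__init__()
        assert embed_dim % n_heads == 0
        self.n_heads = n_heads
        # One learnable position-position score matrix per head.
        self.pos_scores = nn.Parameter(torch.randn(n_heads, seq_len, seq_len) / seq_len**0.5)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        b, t, d = x.shape
        head_dim = d // self.n_heads
        v = self.value(x).view(b, t, self.n_heads, head_dim).transpose(1, 2)  # (b, h, t, hd)
        attn = torch.softmax(self.pos_scores[:, :t, :t], dim=-1)              # (h, t, t)
        ctx = torch.einsum("hij,bhjd->bhid", attn, v)                         # (b, h, t, hd)
        return self.out(ctx.transpose(1, 2).reshape(b, t, d))

# Quick shape check with the embedding width and head count quoted in the table.
layer = FactoredSelfAttention(seq_len=128, embed_dim=768, n_heads=16)
print(layer(torch.randn(2, 128, 768)).shape)   # torch.Size([2, 128, 768])
```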
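The Experiment Setup row lists concrete hyper-parameters (SGD, batch size 1024, initial learning rate 0.03, cosine decay). A hedged PyTorch sketch of that optimizer/scheduler configuration follows; the model and epoch count are placeholders, not values from the paper.

```python
import torch

model = torch.nn.Linear(768, 768)   # placeholder for the 4-layer, 16-head transformer
n_epochs = 100                      # placeholder; the paper does not state this here

# Hyper-parameters quoted in the Experiment Setup row.
optimizer = torch.optim.SGD(model.parameters(), lr=0.03)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=n_epochs)

for epoch in range(n_epochs):
    # ... one pass over the data with batch size 1024 would go here ...
    optimizer.step()     # placeholder update
    scheduler.step()     # anneal the learning rate with cosine decay each epoch
```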
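Finally, the Research Type row describes the clone-generation procedure only at a high level. Purely as an illustration of what 'sampling the networks' to produce a clone sequence could look like, here is a Gibbs-style resampling loop, assuming a masked-language model that maps a batch of token ids to per-position logits. The function name, its arguments, and the sampling schedule are hypothetical; the released Zenodo scripts define the actual procedure.

```python
import torch

@torch.no_grad()
def sample_clone_sequence(model, seed_tokens, mask_id, n_sweeps=10):
    """Gibbs-style resampling: repeatedly mask one position and redraw it from
    the model's conditional distribution. If the model only captures
    interactions up to a given order, the resulting 'clone' sequences inherit
    at most those interactions. Hypothetical sketch, not the authors' code."""
    tokens = seed_tokens.clone()                        # (seq_len,) of token ids
    for _ in range(n_sweeps):
        for i in torch.randperm(tokens.numel()):
            tokens[i] = mask_id                         # mask the chosen position
            logits = model(tokens.unsqueeze(0))[0, i]   # assumed output: (1, seq_len, vocab)
            probs = torch.softmax(logits, dim=-1)
            tokens[i] = torch.multinomial(probs, num_samples=1).item()
    return tokens
```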