A distributional simplicity bias in the learning dynamics of transformers

Authors: Riccardo Rende, Federica Gerace, Alessandro Laio, Sebastian Goldt

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our study, we demonstrate that transformers, trained on natural language data, also display a simplicity bias. Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions while continuing to learn high-degree interactions. To conduct this analysis, we develop a procedure to generate clones of a given natural language data set, which rigorously capture the interactions between tokens up to a specified order.
Researcher Affiliation | Academia | International School for Advanced Studies, Trieste, Italy; {rrende, fgerace, laio, sgoldt}@sissa.it
Pseudocode | No | The paper describes mechanisms and mathematical formulations, but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | We share the scripts to train and sample the networks and reproduce our results at https://doi.org/10.5281/zenodo.13992398.
Open Datasets | Yes | Given a data set containing natural language (we will use WikiText-103 [20] as a running example); Specifically, we analyze the TinyStories dataset [22].
Dataset Splits | No | The paper describes the use of training and test sets but does not specify a validation split for model training. For example: 'We then train via MLM an increasing number of factored self-attention layers (from one to three) on these data, monitoring the test loss as a function of the number of epochs.'
Hardware Specification | Yes | We trained the model for 10 hours on eight A100 GPUs.
Software Dependencies | No | The paper mentions a 'Byte Level BPE Tokenizer (as implemented in Huggingface)' and specific optimizers (SGD, AdamW), but does not provide version numbers for software libraries or frameworks such as Huggingface itself, Python, or the deep learning framework used (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | We choose the number of layers equal to 4, number of heads equal to 16, embedding dimension equal to 768 and size of the hidden layer of the MLPs equal to 1536. Similar hyper-parameters are also used for the models trained in Ref. [22]. We choose SGD for the optimiser, setting the batch size to 1024. We start with a learning rate of 0.03, annealing it with a cosine decay scheduler.
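The sketches below illustrate several rows of the table above (open datasets, tokenizer, factored self-attention, experiment setup, clone sampling). They are hedged reconstructions under stated assumptions, not the authors' released code; the Zenodo archive linked in the Open Source Code row is the authoritative reference.

For the Open Datasets row, the Hugging Face dataset identifiers `wikitext`/`wikitext-103-raw-v1` and `roneneldan/TinyStories` are assumed public mirrors, and the authors' exact preprocessing may differ. The explicit validation split is added here only to illustrate the gap flagged in the Dataset Splits row; it is not taken from the paper.

```python
from datasets import load_dataset

# Public mirrors of the two corpora named in the paper (dataset IDs are assumptions).
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")   # WikiText-103 [20]
tinystories = load_dataset("roneneldan/TinyStories")          # TinyStories [22]

# The paper reports training and test sets only; a validation split could be
# carved from the training data like this (illustrative, not from the paper).
splits = wikitext["train"].train_test_split(test_size=0.01, seed=0)
train_set, val_set = splits["train"], splits["test"]

print(len(train_set), len(val_set), len(wikitext["test"]))
```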
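The Software Dependencies row quotes a 'Byte Level BPE Tokenizer (as implemented in Huggingface)'. A minimal sketch of training such a tokenizer with the Hugging Face `tokenizers` library follows; the file path, vocabulary size, and special tokens are placeholders, not values reported in the paper.

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on raw text files (path and sizes are placeholders).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["wikitext103_train.txt"],
    vocab_size=30_000,
    special_tokens=["<s>", "</s>", "<pad>", "<mask>"],
)
tokenizer.save_model("tokenizer_out")   # writes vocab.json and merges.txt

ids = tokenizer.encode("A distributional simplicity bias in transformers.").ids
print(ids)
```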
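The Dataset Splits row quotes the training of 'factored self-attention layers' via masked language modelling. The sketch below assumes the simplest reading of factored attention: a per-head attention matrix that is learned over positions and does not depend on the token embeddings. The paper's exact parameterization (for example, attention computed from positional encodings) may differ, so treat this module as illustrative only.

```python
import torch
import torch.nn as nn

class FactoredSelfAttention(nn.Module):
    """Attention whose weights depend only on token positions, not on the
    tokens themselves (a simplified reading of 'factored' attention)."""

    def __init__(self, seq_len: int, embed_dim: int, n_heads: int):
        super().__init__()
        assert embed_dim % n_heads == 0
        self.n_heads = n_heads
        # One learnable position-position score matrix per head.
        self.pos_scores = nn.Parameter(torch.randn(n_heads, seq_len, seq_len) / seq_len**0.5)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        b, t, d = x.shape
        head_dim = d // self.n_heads
        v = self.value(x).view(b, t, self.n_heads, head_dim).transpose(1, 2)  # (b, h, t, hd)
        attn = torch.softmax(self.pos_scores[:, :t, :t], dim=-1)              # (h, t, t)
        ctx = torch.einsum("hij,bhjd->bhid", attn, v)                         # (b, h, t, hd)
        return self.out(ctx.transpose(1, 2).reshape(b, t, d))

# Quick shape check with the embedding width and head count quoted in the table.
layer = FactoredSelfAttention(seq_len=128, embed_dim=768, n_heads=16)
print(layer(torch.randn(2, 128, 768)).shape)   # torch.Size([2, 128, 768])
```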
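The Experiment Setup row lists concrete hyper-parameters (SGD, batch size 1024, initial learning rate 0.03, cosine decay). A hedged PyTorch sketch of that optimizer/scheduler configuration follows; the model and epoch count are placeholders, not values from the paper.

```python
import torch

model = torch.nn.Linear(768, 768)   # placeholder for the 4-layer, 16-head transformer
n_epochs = 100                      # placeholder; the paper does not state this here

# Hyper-parameters quoted in the Experiment Setup row.
optimizer = torch.optim.SGD(model.parameters(), lr=0.03)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=n_epochs)

for epoch in range(n_epochs):
    # ... one pass over the data with batch size 1024 would go here ...
    optimizer.step()     # placeholder update
    scheduler.step()     # anneal the learning rate with cosine decay each epoch
```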
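Finally, the Research Type row describes the clone-generation procedure only at a high level. Purely as an illustration of what 'sampling the networks' to produce a clone sequence could look like, here is a Gibbs-style resampling loop, assuming a masked-language model that maps a batch of token ids to per-position logits. The function name, its arguments, and the sampling schedule are hypothetical; the released Zenodo scripts define the actual procedure.

```python
import torch

@torch.no_grad()
def sample_clone_sequence(model, seed_tokens, mask_id, n_sweeps=10):
    """Gibbs-style resampling: repeatedly mask one position and redraw it from
    the model's conditional distribution. If the model only captures
    interactions up to a given order, the resulting 'clone' sequences inherit
    at most those interactions. Hypothetical sketch, not the authors' code."""
    tokens = seed_tokens.clone()                        # (seq_len,) of token ids
    for _ in range(n_sweeps):
        for i in torch.randperm(tokens.numel()):
            tokens[i] = mask_id                         # mask the chosen position
            logits = model(tokens.unsqueeze(0))[0, i]   # assumed output: (1, seq_len, vocab)
            probs = torch.softmax(logits, dim=-1)
            tokens[i] = torch.multinomial(probs, num_samples=1).item()
    return tokens
```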