Understanding Transformers via N-Gram Statistics
Authors: Timothy Nguyen
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper takes a first step in this direction by considering families of functions (i.e. rules) formed out of simple N-gram based statistics of the training data. By studying how well these rulesets approximate transformer predictions, we obtain a variety of novel discoveries... We perform our main investigations on the Tiny Stories [11] dataset, with supporting experiments on Wikipedia to confirm our results remain robust at larger scales. ... 3 Experimental Setup We train standard decoder-only transformer models on the Tiny Stories [11] dataset (480M tokens)... |
| Researcher Affiliation | Industry | Timothy Nguyen Google DeepMind timothycnguyen@google.com |
| Pseudocode | No | The paper defines rules using mathematical notation and descriptions but does not include structured pseudocode or algorithm blocks (e.g., labeled "Pseudocode" or "Algorithm"). |
| Open Source Code | No | The NeurIPS Paper Checklist (Q5) explicitly states: "We do not release the code used to run our experiments (though there is nothing proprietary or novel about the model or the training procedure, and both are described in detail). The datasets we use are publicly available." Although a GitHub link is provided in a footnote, it explicitly states it is for "training datasets and related N-gram statistics", not the code for the methodology itself. |
| Open Datasets | Yes | We train standard decoder-only transformer models on the Tiny Stories [11] dataset (480M tokens) ... In the Appendix, we include additional corresponding experiments on Wikipedia (from Massive Text [25]). |
| Dataset Splits | Yes | Unless stated otherwise, our experiments use a 160M parameter model trained for 4 epochs, which achieves a loss of around 1.11 nats on the validation set. ... We have train and validation splits based on choosing random sets of disjoint documents. |
| Hardware Specification | Yes | Our models are trained using TPU accelerators. The 160M and 420M models use 16 TPU accelerators while the 1.4B models use 64 TPU accelerators per run. |
| Software Dependencies | No | The paper mentions using a "weighted Adam optimizer [20]" and a "tokenizer ... trained using https://github.com/google/sentencepiece" but does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | Our transformer architecture and training procedure is based on that of Chinchilla [14]. The architecture hyperparameters are as follows: Table 3: Model specifications (columns: Model, Layers, Number Heads, d_key/d_value, d_model) ... We use a linear learning rate warmup of 1000 steps up to a maximum value of 2e-4 and then use a cosine learning rate decay. We use weighted Adam optimizer [20] with weight decay 10e-4. ... We use a batch size of 128 sequences with each sequence consisting of 2048 tokens. |
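
Since the authors do not release their code, the sketch below only illustrates the general idea quoted in the Research Type row: collect N-gram (next-token) statistics from the training corpus and treat them as simple predictive rules whose outputs can be compared against a transformer's predictions. All function names are hypothetical, and the paper's actual rule families (e.g. with backoff over subsets of context positions) are richer than this minimal version.

```python
# Minimal sketch (not the authors' unreleased code): build next-token statistics
# for length-(n-1) contexts and turn them into a simple predictive rule.
from collections import Counter, defaultdict

def build_ngram_counts(token_ids, n):
    """Count next-token occurrences for every (n-1)-token context in the corpus."""
    counts = defaultdict(Counter)
    for i in range(len(token_ids) - n + 1):
        context = tuple(token_ids[i : i + n - 1])
        next_token = token_ids[i + n - 1]
        counts[context][next_token] += 1
    return counts

def rule_distribution(counts, context):
    """Empirical next-token distribution for a context, or None if it was never seen."""
    c = counts.get(tuple(context))
    if not c:
        return None
    total = sum(c.values())
    return {tok: cnt / total for tok, cnt in c.items()}

def top1_agreement(rule_dist, model_dist):
    """1.0 if the rule's argmax matches the transformer's argmax, else 0.0."""
    if rule_dist is None:
        return 0.0
    return float(max(rule_dist, key=rule_dist.get) == max(model_dist, key=model_dist.get))

if __name__ == "__main__":
    corpus = [1, 2, 3, 1, 2, 4, 1, 2, 3]
    counts = build_ngram_counts(corpus, n=3)
    print(rule_distribution(counts, [1, 2]))  # {3: 0.666..., 4: 0.333...}
```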
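
The Dataset Splits and Software Dependencies rows mention document-level train/validation splits and a tokenizer trained with SentencePiece. A minimal sketch of both steps follows; the split fraction, vocabulary size, file names, and toy corpus are illustrative guesses, not values from the paper.

```python
# Hypothetical data-preparation sketch: split whole documents into disjoint
# train/validation sets, then train a SentencePiece tokenizer on the training split.
import random
import sentencepiece as spm

def split_documents(documents, val_fraction=0.01, seed=0):
    """Randomly assign whole documents to disjoint train / validation sets."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_val = max(1, int(len(docs) * val_fraction))
    return docs[n_val:], docs[:n_val]  # (train, validation)

if __name__ == "__main__":
    # Toy stand-in corpus; in practice these would be TinyStories or Wikipedia documents.
    documents = [f"once upon a time there was a story number {i}." for i in range(1000)]
    train_docs, val_docs = split_documents(documents)

    with open("train_corpus.txt", "w") as f:
        f.write("\n".join(train_docs))

    # Tokenizer training via https://github.com/google/sentencepiece.
    # vocab_size is a placeholder; hard_vocab_limit=False merely tolerates the toy corpus.
    spm.SentencePieceTrainer.train(
        input="train_corpus.txt",
        model_prefix="tinystories_sp",
        vocab_size=2000,
        hard_vocab_limit=False,
    )
```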
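
The Experiment Setup row fully specifies the learning-rate schedule, optimizer, and batch shape. Below is a sketch using optax, the JAX optimizer library common in Chinchilla-style training stacks; the paper does not name its software, so this library choice is an assumption, and the total step count is only an estimate derived from the quoted token counts.

```python
# Sketch of the quoted optimization settings using optax (an assumed library choice).
import optax

TOKENS_PER_STEP = 128 * 2048                    # batch of 128 sequences x 2048 tokens
TOTAL_TOKENS = 4 * 480_000_000                  # 4 epochs over the ~480M-token TinyStories set
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_STEP   # roughly 7,300 steps (an estimate)

# Linear warmup over 1000 steps to a peak of 2e-4, followed by cosine decay.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=2e-4,
    warmup_steps=1_000,
    decay_steps=TOTAL_STEPS,
)

# Adam with decoupled weight decay; the quoted "weight decay 10e-4" is read here
# as 10^-4, an interpretation rather than a value confirmed by the paper.
optimizer = optax.adamw(learning_rate=schedule, weight_decay=1e-4)
```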