Understanding Transformers via N-Gram Statistics

Authors: Timothy Nguyen

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "This paper takes a first step in this direction by considering families of functions (i.e. rules) formed out of simple N-gram based statistics of the training data. By studying how well these rulesets approximate transformer predictions, we obtain a variety of novel discoveries... We perform our main investigations on the Tiny Stories [11] dataset, with supporting experiments on Wikipedia to confirm our results remain robust at larger scales." From Section 3 (Experimental Setup): "We train standard decoder-only transformer models on the Tiny Stories [11] dataset (480M tokens)..." (An illustrative rule sketch follows the table.)
Researcher Affiliation | Industry | Timothy Nguyen, Google DeepMind, timothycnguyen@google.com
Pseudocode | No | The paper defines rules using mathematical notation and descriptions but does not include structured pseudocode or algorithm blocks (e.g., labeled "Pseudocode" or "Algorithm").
Open Source Code | No | The NeurIPS Paper Checklist (Q5) explicitly states: "We do not release the code used to run our experiments (though there is nothing proprietary or novel about the model or the training procedure, and both are described in detail). The datasets we use are publicly available." A GitHub link provided in a footnote is explicitly limited to "training datasets and related N-gram statistics", not the code for the methodology itself.
Open Datasets | Yes | "We train standard decoder-only transformer models on the Tiny Stories [11] dataset (480M tokens) ... In the Appendix, we include additional corresponding experiments on Wikipedia (from Massive Text [25])."
Dataset Splits | Yes | "Unless stated otherwise, our experiments use a 160M parameter model trained for 4 epochs, which achieves a loss of around 1.11 nats on the validation set. ... We have train and validation splits based on choosing random sets of disjoint documents." (A loading-and-splitting sketch follows the table.)
Hardware Specification | Yes | "Our models are trained using TPU accelerators. The 160M and 420M models use 16 TPU accelerators while the 1.4B models use 64 TPU accelerators per run."
Software Dependencies | No | The paper mentions using a "weighted Adam optimizer [20]" and a "tokenizer ... trained using https://github.com/google/sentencepiece" but does not specify version numbers for any software dependencies. (A tokenizer-training sketch follows the table.)
Experiment Setup | Yes | "Our transformer architecture and training procedure is based on that of Chinchilla [14]. The architecture hyperparameters are as follows:" Table 3 (Model specifications) gives, per model, the number of layers, number of heads, d_key/d_value, and d_model. "We use a linear learning rate warmup of 1000 steps up to a maximum value of 2e-4 and then use a cosine learning rate decay. We use weighted Adam optimizer [20] with weight decay 10e-4. ... We use a batch size of 128 sequences with each sequence consisting of 2048 tokens." (A schedule sketch follows the table.)
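
The paper's core object, as quoted in the Research Type row, is a ruleset built from N-gram statistics of the training data whose predictions are compared against the transformer's. The snippet below is a minimal illustrative sketch of that idea, not the authors' code (none was released): it assumes simple count-based N-gram rules with a uniform fallback and compares a rule's next-token distribution to a model's by total variation distance; all function names are hypothetical.

```python
# Minimal sketch (not the paper's code): count-based N-gram "rules" over
# training tokens, compared to a model's next-token distribution.
from collections import Counter, defaultdict
import numpy as np

def build_ngram_stats(token_ids, n):
    """Map each (n-1)-token context to a Counter of next-token counts."""
    stats = defaultdict(Counter)
    for i in range(len(token_ids) - n + 1):
        context = tuple(token_ids[i:i + n - 1])
        stats[context][token_ids[i + n - 1]] += 1
    return stats

def rule_distribution(stats, context, vocab_size):
    """Turn the counts stored for a context into a next-token distribution."""
    counts = stats.get(tuple(context))
    if not counts:  # unseen context: fall back to a uniform distribution
        return np.full(vocab_size, 1.0 / vocab_size)
    dist = np.zeros(vocab_size)
    for token, count in counts.items():
        dist[token] = count
    return dist / dist.sum()

def variation_distance(p, q):
    """Total variation distance between two next-token distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()
```

Checking whether the argmax of such a rule distribution matches the transformer's top-1 prediction, or measuring the variation distance between the two, mirrors the kind of rule-versus-model comparison the quoted abstract describes, though the paper's actual rulesets are richer than this single-order example.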
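
The Open Datasets and Dataset Splits rows report that Tiny Stories is public and that the train/validation splits are random sets of disjoint documents. A minimal reproduction of that split might look like the following; the Hugging Face dataset id `roneneldan/TinyStories`, the seed, and the 1% validation fraction are assumptions, not details from the paper.

```python
# Sketch of a document-level split: dataset id, seed, and validation fraction
# are assumptions; the paper only states that train/validation are random
# sets of disjoint documents.
import random
from datasets import load_dataset

dataset = load_dataset("roneneldan/TinyStories", split="train")

doc_ids = list(range(len(dataset)))
random.Random(0).shuffle(doc_ids)

n_valid = int(0.01 * len(doc_ids))            # assumed validation fraction
valid_docs = dataset.select(doc_ids[:n_valid])
train_docs = dataset.select(doc_ids[n_valid:])  # disjoint from validation
```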
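
The Software Dependencies row notes only that the tokenizer was trained with the google/sentencepiece library, without version numbers. For reference, training such a tokenizer typically looks like the call below; the vocabulary size, model type, and file paths are placeholders rather than values taken from the paper.

```python
# Sketch of SentencePiece tokenizer training; vocab_size, model_type, and
# paths are placeholders, as the table does not record the paper's settings.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="tinystories_train.txt",   # one document per line (placeholder path)
    model_prefix="tinystories_sp",   # writes tinystories_sp.model / .vocab
    vocab_size=32000,                # assumed; not specified in the table
    model_type="bpe",                # assumed subword algorithm
)

sp = spm.SentencePieceProcessor(model_file="tinystories_sp.model")
print(sp.encode("Once upon a time", out_type=int))
```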
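
The Experiment Setup row quotes a concrete schedule: linear warmup over 1000 steps to a peak of 2e-4, then cosine decay, with a weighted Adam optimizer [20] (weight decay 10e-4) and batches of 128 sequences of 2048 tokens. Below is a small sketch of that learning-rate schedule; the total step count and the choice to decay all the way to zero are assumptions the table does not settle.

```python
import math

def learning_rate(step, peak=2e-4, warmup_steps=1000, total_steps=100_000):
    """Linear warmup to `peak`, then cosine decay (to zero, an assumption).

    `total_steps` is a placeholder; the table does not state it.
    """
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak * (1.0 + math.cos(math.pi * min(progress, 1.0)))

# Per-step token budget implied by the quoted batch settings:
tokens_per_step = 128 * 2048  # 262,144 tokens per optimizer step
```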