Understanding Transformers via N-Gram Statistics
Authors: Timothy Nguyen
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper takes a first step in this direction by considering families of functions (i.e. rules) formed out of simple N-gram based statistics of the training data. By studying how well these rulesets approximate transformer predictions, we obtain a variety of novel discoveries... We perform our main investigations on the Tiny Stories [11] dataset, with supporting experiments on Wikipedia to confirm our results remain robust at larger scales. ... 3 Experimental Setup We train standard decoder-only transformer models on the Tiny Stories [11] dataset (480M tokens)... |
| Researcher Affiliation | Industry | Timothy Nguyen Google DeepMind timothycnguyen@google.com |
| Pseudocode | No | The paper defines rules using mathematical notation and descriptions but does not include structured pseudocode or algorithm blocks (e.g., labeled "Pseudocode" or "Algorithm"). |
| Open Source Code | No | The NeurIPS Paper Checklist (Q5) explicitly states: "We do not release the code used to run our experiments (though there is nothing proprietary or novel about the model or the training procedure, and both are described in detail). The datasets we use are publicly available." Although a GitHub link is provided in a footnote, it explicitly states it is for "training datasets and related N-gram statistics", not the code for the methodology itself. |
| Open Datasets | Yes | We train standard decoder-only transformer models on the Tiny Stories [11] dataset (480M tokens) ... In the Appendix, we include additional corresponding experiments on Wikipedia (from Massive Text [25]). |
| Dataset Splits | Yes | Unless stated otherwise, our experiments use a 160M parameter model trained for 4 epochs, which achieves a loss of around 1.11 nats on the validation set. ... We have train and validation splits based on choosing random sets of disjoint documents. |
| Hardware Specification | Yes | Our models are trained using TPU accelerators. The 160M and 420M models use 16 TPU accelerators while the 1.4B models use 64 TPU accelerators per run. |
| Software Dependencies | No | The paper mentions using a "weighted Adam optimizer [20]" and a "tokenizer ... trained using https://github.com/google/sentencepiece" but does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | Our transformer architecture and training procedure is based on that of Chinchilla [14]. The architecture hyperparameters are as follows: Table 3: Model specifications (columns: Model, Layers, Number Heads, d_key/d_value, d_model) ... We use a linear learning rate warmup of 1000 steps up to a maximum value of 2e-4 and then use a cosine learning rate decay. We use weighted Adam optimizer [20] with weight decay 10e-4. ... We use a batch size of 128 sequences with each sequence consisting of 2048 tokens. |
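
Since the authors do not release their code, the sketch below only illustrates the general idea quoted in the Research Type row: collect N-gram (next-token) statistics from the training corpus and treat them as simple predictive rules whose outputs can be compared against a transformer's predictions. All function names are hypothetical, and the paper's actual rule families (e.g. with backoff over subsets of context positions) are richer than this minimal version.

```python
# Minimal sketch (not the authors' unreleased code): build next-token statistics
# for length-(n-1) contexts and turn them into a simple predictive rule.
from collections import Counter, defaultdict

def build_ngram_counts(token_ids, n):
    """Count next-token occurrences for every (n-1)-token context in the corpus."""
    counts = defaultdict(Counter)
    for i in range(len(token_ids) - n + 1):
        context = tuple(token_ids[i : i + n - 1])
        next_token = token_ids[i + n - 1]
        counts[context][next_token] += 1
    return counts

def rule_distribution(counts, context):
    """Empirical next-token distribution for a context, or None if it was never seen."""
    c = counts.get(tuple(context))
    if not c:
        return None
    total = sum(c.values())
    return {tok: cnt / total for tok, cnt in c.items()}

def top1_agreement(rule_dist, model_dist):
    """1.0 if the rule's argmax matches the transformer's argmax, else 0.0."""
    if rule_dist is None:
        return 0.0
    return float(max(rule_dist, key=rule_dist.get) == max(model_dist, key=model_dist.get))

if __name__ == "__main__":
    corpus = [1, 2, 3, 1, 2, 4, 1, 2, 3]
    counts = build_ngram_counts(corpus, n=3)
    print(rule_distribution(counts, [1, 2]))  # {3: 0.666..., 4: 0.333...}
```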
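
The Dataset Splits and Software Dependencies rows mention document-level train/validation splits and a tokenizer trained with SentencePiece. A minimal sketch of both steps follows; the split fraction, vocabulary size, file names, and toy corpus are illustrative guesses, not values from the paper.

```python
# Hypothetical data-preparation sketch: split whole documents into disjoint
# train/validation sets, then train a SentencePiece tokenizer on the training split.
import random
import sentencepiece as spm

def split_documents(documents, val_fraction=0.01, seed=0):
    """Randomly assign whole documents to disjoint train / validation sets."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_val = max(1, int(len(docs) * val_fraction))
    return docs[n_val:], docs[:n_val]  # (train, validation)

if __name__ == "__main__":
    # Toy stand-in corpus; in practice these would be TinyStories or Wikipedia documents.
    documents = [f"once upon a time there was a story number {i}." for i in range(1000)]
    train_docs, val_docs = split_documents(documents)

    with open("train_corpus.txt", "w") as f:
        f.write("\n".join(train_docs))

    # Tokenizer training via https://github.com/google/sentencepiece.
    # vocab_size is a placeholder; hard_vocab_limit=False merely tolerates the toy corpus.
    spm.SentencePieceTrainer.train(
        input="train_corpus.txt",
        model_prefix="tinystories_sp",
        vocab_size=2000,
        hard_vocab_limit=False,
    )
```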
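
The Experiment Setup row fully specifies the learning-rate schedule, optimizer, and batch shape. Below is a sketch using optax, the JAX optimizer library common in Chinchilla-style training stacks; the paper does not name its software, so this library choice is an assumption, and the total step count is only an estimate derived from the quoted token counts.

```python
# Sketch of the quoted optimization settings using optax (an assumed library choice).
import optax

TOKENS_PER_STEP = 128 * 2048                    # batch of 128 sequences x 2048 tokens
TOTAL_TOKENS = 4 * 480_000_000                  # 4 epochs over the ~480M-token TinyStories set
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_STEP   # roughly 7,300 steps (an estimate)

# Linear warmup over 1000 steps to a peak of 2e-4, followed by cosine decay.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=2e-4,
    warmup_steps=1_000,
    decay_steps=TOTAL_STEPS,
)

# Adam with decoupled weight decay; the quoted "weight decay 10e-4" is read here
# as 10^-4, an interpretation rather than a value confirmed by the paper.
optimizer = optax.adamw(learning_rate=schedule, weight_decay=1e-4)
```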