Birth of a Transformer: A Memory Viewpoint
Authors: Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Hervé Jégou, Léon Bottou
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an induction head mechanism for the in-context bigrams. |
| Researcher Affiliation | Collaboration | Alberto Bietti (Flatiron Institute); Vivien Cabannes, Diane Bouchacourt, Hervé Jégou, Léon Bottou (FAIR, Meta). Work done while at FAIR, Meta. |
| Pseudocode | No | No structured pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | Our code is available at https://github.com/albietz/transformer-birth. |
| Open Datasets | Yes | Our experiments take π_u and π_b to be unigram and bigram character-level distributions estimated from the tiny Shakespeare dataset, with vocabulary size N = 65. |
| Dataset Splits | No | No explicit percentages or counts for training/validation/test splits are provided. The paper states, 'each batch consists of 512 fresh sequences of length T = 256 sampled from our synthetic model,' indicating dynamic data generation rather than a static dataset split. |
| Hardware Specification | No | The paper states, 'each run uses a single GPU, along with 60 CPU cores for real-time data generation,' but does not provide specific model numbers for the GPU or CPU. |
| Software Dependencies | No | The paper mentions using 'Pytorch' but does not specify a version number or other software dependencies with versions. |
| Experiment Setup | Yes | We train our models using mini-batch SGD with momentum, where each batch consists of 512 fresh sequences of length T = 256 sampled from our synthetic model. We use a fixed learning rate and weight decay. Hyperparameters are given in Appendix E. Unless otherwise noted, we use d = 128, random triggers with π_q = π_u and uniform output tokens. |
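
The "Open Datasets" and "Dataset Splits" rows above describe estimating character-level unigram (π_u) and bigram (π_b) distributions from tiny Shakespeare and then sampling fresh synthetic sequences for every batch rather than using a static split. The sketch below illustrates only that estimation and a plain bigram-chain sampler; it is not the authors' generator, which additionally uses trigger tokens and in-context output tokens (see their repository for the actual code). All function names here are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): estimate character-level unigram/bigram
# distributions from a corpus such as tiny Shakespeare, then sample fresh batches.
import numpy as np

def estimate_char_distributions(text: str):
    """Return vocabulary, unigram probs pi_u (N,), and bigram probs pi_b (N, N)."""
    vocab = sorted(set(text))                      # N = 65 for tiny Shakespeare
    stoi = {ch: i for i, ch in enumerate(vocab)}
    ids = np.array([stoi[ch] for ch in text], dtype=np.int64)

    N = len(vocab)
    unigram = np.bincount(ids, minlength=N).astype(np.float64)
    bigram = np.zeros((N, N), dtype=np.float64)
    np.add.at(bigram, (ids[:-1], ids[1:]), 1.0)    # counts of (prev, next) character pairs

    pi_u = unigram / unigram.sum()
    pi_b = bigram / np.clip(bigram.sum(axis=1, keepdims=True), 1.0, None)
    return vocab, pi_u, pi_b

def sample_fresh_batch(pi_u, pi_b, batch_size=512, T=256, rng=None):
    """Sample fresh sequences from a plain bigram chain -- a simplification of the
    paper's synthetic model, which also inserts trigger and output tokens."""
    rng = rng or np.random.default_rng()
    N = len(pi_u)
    seqs = np.empty((batch_size, T), dtype=np.int64)
    seqs[:, 0] = rng.choice(N, size=batch_size, p=pi_u)
    for t in range(1, T):
        # inverse-CDF sampling of the next token for each sequence in the batch
        cdf = np.cumsum(pi_b[seqs[:, t - 1]], axis=1)
        u = rng.random((batch_size, 1))
        seqs[:, t] = (u < cdf).argmax(axis=1)
    return seqs
```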
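
The "Experiment Setup" row can likewise be read as a standard PyTorch training loop: mini-batch SGD with momentum, a fixed learning rate, weight decay, and a fresh batch of 512 sequences of length T = 256 at every step. The sketch below assumes the sampler above; the model class and the momentum, learning-rate, and weight-decay values are placeholders, not the paper's hyperparameters (those are given in its Appendix E).

```python
# Hedged sketch of the described training setup, with illustrative hyperparameters.
import torch
import torch.nn.functional as F

def train(model, pi_u, pi_b, steps=10_000, device="cuda"):
    model = model.to(device)
    opt = torch.optim.SGD(
        model.parameters(),
        lr=0.1,            # fixed learning rate (illustrative value)
        momentum=0.9,      # illustrative value
        weight_decay=1e-4, # illustrative value
    )
    for step in range(steps):
        batch = sample_fresh_batch(pi_u, pi_b)   # (512, 256) token ids, fresh each step
        x = torch.from_numpy(batch).to(device)
        logits = model(x[:, :-1])                # next-token prediction
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            x[:, 1:].reshape(-1),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
```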