Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Birth of a Transformer: A Memory Viewpoint
Authors: Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Hervé Jégou, Léon Bottou
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an induction head mechanism for the in-context bigrams. |
| Researcher Affiliation | Collaboration | Alberto Bietti (Flatiron Institute); Vivien Cabannes (FAIR, Meta); Diane Bouchacourt (FAIR, Meta); Hervé Jégou (FAIR, Meta); Léon Bottou (FAIR, Meta). Footnote: Work done while at FAIR, Meta. |
| Pseudocode | No | No structured pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | Our code is available at https://github.com/albietz/transformer-birth. |
| Open Datasets | Yes | Our experiments take π_u and π_b to be unigram and bigram character-level distributions estimated from the tiny Shakespeare dataset, with vocabulary size N = 65. |
| Dataset Splits | No | No explicit percentages or counts for training/validation/test splits are provided. The paper states, 'each batch consists of 512 fresh sequences of length T = 256 sampled from our synthetic model,' indicating dynamic data generation rather than a static dataset split. |
| Hardware Specification | No | The paper states, 'each run uses a single GPU, along with 60 CPU cores for real-time data generation,' but does not provide specific model numbers for the GPU or CPU. |
| Software Dependencies | No | The paper mentions using 'Pytorch' but does not specify a version number or other software dependencies with versions. |
| Experiment Setup | Yes | We train our models using mini-batch SGD with momentum, where each batch consists of 512 fresh sequences of length T = 256 sampled from our synthetic model. We use a fixed learning rate and weight decay. Hyperparameters are given in Appendix E. Unless otherwise noted, we use d = 128, random triggers with π_q = π_u and uniform output tokens. |
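
The Open Datasets and Dataset Splits rows describe character-level unigram and bigram distributions and batches of freshly sampled sequences. The sketch below is one way to illustrate that idea. It is not the authors' code (their implementation is at the repository linked in the table). The function names `estimate_distributions`, `sample_sequence`, and `sample_batch` are hypothetical. The sampler draws from a plain bigram Markov chain only, and it omits the trigger and output-token mechanism of the paper's synthetic model.

```python
import random
from collections import Counter, defaultdict


def estimate_distributions(text):
    """Estimate character-level unigram and bigram distributions from text.

    Returns (vocab, unigram, bigram), where unigram maps char -> probability
    and bigram maps previous char -> {next char -> probability}.
    """
    vocab = sorted(set(text))
    counts = Counter(text)
    unigram = {c: counts[c] / len(text) for c in vocab}

    # Count how often each character follows each other character.
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        pair_counts[prev][nxt] += 1

    # Normalize each row of counts into a conditional distribution.
    bigram = {}
    for prev, nxt_counts in pair_counts.items():
        row_total = sum(nxt_counts.values())
        bigram[prev] = {c: k / row_total for c, k in nxt_counts.items()}
    return vocab, unigram, bigram


def sample_sequence(unigram, bigram, length, rng):
    """Sample one sequence: first char from the unigram, the rest from the bigram chain."""
    chars = list(unigram)
    weights = [unigram[c] for c in chars]
    seq = [rng.choices(chars, weights)[0]]
    while len(seq) < length:
        row = bigram.get(seq[-1])
        if row is None:
            # Assumption: a char with no observed successor falls back to the unigram.
            seq.append(rng.choices(chars, weights)[0])
        else:
            seq.append(rng.choices(list(row), list(row.values()))[0])
    return "".join(seq)


def sample_batch(unigram, bigram, batch_size=512, length=256, seed=None):
    """Draw a fresh batch; defaults mirror the batch size and T quoted in the table."""
    rng = random.Random(seed)
    return [sample_sequence(unigram, bigram, length, rng) for _ in range(batch_size)]
```

For example, with a toy string standing in for the tiny Shakespeare text, `vocab, uni, bi = estimate_distributions("abracadabra")` followed by `sample_batch(uni, bi, batch_size=4, length=16, seed=0)` returns four 16-character strings over the observed vocabulary. Because every batch is sampled anew, there is no static train/validation/test split to report, which matches the "Dataset Splits" entry.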