Birth of a Transformer: A Memory Viewpoint

Authors: Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Hervé Jégou, Léon Bottou

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an induction head mechanism for the in-context bigrams.
Researcher Affiliation | Collaboration | Alberto Bietti (Flatiron Institute); Vivien Cabannes (FAIR, Meta); Diane Bouchacourt (FAIR, Meta); Hervé Jégou (FAIR, Meta); Léon Bottou (FAIR, Meta). Work done while at FAIR, Meta.
Pseudocode | No | No structured pseudocode or algorithm blocks are present in the paper.
Open Source Code | Yes | Our code is available at https://github.com/albietz/transformer-birth.
Open Datasets | Yes | Our experiments take π_u and π_b to be unigram and bigram character-level distributions estimated from the tiny Shakespeare dataset, with vocabulary size N = 65.
Dataset Splits | No | No explicit percentages or counts for training/validation/test splits are provided. The paper states, 'each batch consists of 512 fresh sequences of length T = 256 sampled from our synthetic model,' indicating dynamic data generation rather than a static dataset split (see the sampling sketch after this table).
Hardware Specification | No | The paper states, 'each run uses a single GPU, along with 60 CPU cores for real-time data generation,' but does not provide specific model numbers for the GPU or CPU.
Software Dependencies | No | The paper mentions using 'PyTorch' but does not specify a version number or other software dependencies with versions.
Experiment Setup | Yes | We train our models using mini-batch SGD with momentum, where each batch consists of 512 fresh sequences of length T = 256 sampled from our synthetic model. We use a fixed learning rate and weight decay. Hyperparameters are given in Appendix E. Unless otherwise noted, we use d = 128, random triggers with π_q = π_u and uniform output tokens. (A hedged training-loop sketch based on this description follows the table.)
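
The synthetic data described in the Open Datasets and Dataset Splits rows can be made concrete with a short sketch. The code below is a minimal illustration under stated assumptions, not the authors' released code (see the GitHub link above for that): it estimates character-level unigram (π_u) and bigram (π_b) distributions from a text corpus, then samples sequences of length T = 256 in which randomly chosen trigger tokens (drawn with probabilities π_q = π_u) are always followed by a per-sequence, uniformly drawn output token, while all other transitions follow π_b. The function names, smoothing constant, and number of triggers per sequence are illustrative choices, not values taken from the paper.

```python
# Minimal sketch (assumptions noted above), not the authors' implementation.
import numpy as np

def estimate_char_bigrams(text):
    """Estimate unigram (pi_u) and bigram (pi_b) character distributions from a corpus."""
    chars = sorted(set(text))                      # vocabulary; N = 65 for tiny Shakespeare
    idx = {c: i for i, c in enumerate(chars)}
    N = len(chars)
    uni = np.zeros(N)
    bi = np.full((N, N), 1e-6)                     # small smoothing so every row is a valid distribution
    for a, b in zip(text[:-1], text[1:]):
        uni[idx[a]] += 1
        bi[idx[a], idx[b]] += 1
    pi_u = uni / uni.sum()
    pi_b = bi / bi.sum(axis=1, keepdims=True)      # row-conditional next-character distribution
    return pi_u, pi_b

def sample_sequence(pi_u, pi_b, T=256, n_triggers=5, rng=np.random):
    """Sample one sequence: bigram transitions, except each trigger token q_k
    is always followed by its (uniformly drawn, per-sequence) output token o_k."""
    N = len(pi_u)
    triggers = rng.choice(N, size=n_triggers, replace=False, p=pi_u)  # pi_q = pi_u
    outputs = rng.choice(N, size=n_triggers)                          # uniform output tokens
    out_of = dict(zip(triggers, outputs))
    seq = [rng.choice(N, p=pi_u)]
    while len(seq) < T:
        prev = seq[-1]
        if prev in out_of:
            seq.append(out_of[prev])               # in-context bigram: trigger -> fixed output
        else:
            seq.append(rng.choice(N, p=pi_b[prev]))  # global bigram transition
    return np.array(seq)
```

In this reading, "fresh sequences" means every training batch is sampled anew from this generator, which is why the paper has no static train/validation/test split to report.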
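The Experiment Setup row can likewise be turned into a hedged training-loop sketch. The model below is a generic two-layer causal transformer built from PyTorch's nn.TransformerEncoder, standing in for the authors' simplified architecture rather than reproducing it; the learning rate, momentum, weight decay, head count, and step count are placeholders (the paper defers exact hyperparameters to its Appendix E), and pi_u, pi_b, and sample_sequence are assumed to come from the sampling sketch above.

```python
# Minimal online-training sketch (placeholder hyperparameters), not the authors' code.
import torch
import torch.nn as nn

N, T, d, batch_size = 65, 256, 128, 512            # vocab size, sequence length, model dim, batch size

class TinyTransformerLM(nn.Module):
    def __init__(self, vocab=N, dim=d, n_layers=2, n_heads=1):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(T, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim_feedforward=4 * dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):                           # x: (batch, seq_len) token ids
        pos = torch.arange(x.size(1), device=x.device)
        h = self.tok(x) + self.pos(pos)
        # additive causal mask: -inf strictly above the diagonal
        mask = torch.full((x.size(1), x.size(1)), float("-inf"), device=x.device).triu(1)
        return self.head(self.blocks(h, mask=mask))

model = TinyTransformerLM()
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

for step in range(1000):                            # placeholder step count
    # 512 *fresh* sequences per batch, sampled from the synthetic model above
    batch = torch.stack([torch.from_numpy(sample_sequence(pi_u, pi_b, T=T))
                         for _ in range(batch_size)]).long()
    logits = model(batch[:, :-1])
    loss = nn.functional.cross_entropy(logits.reshape(-1, N), batch[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

The next-token cross-entropy loss and the fixed-learning-rate SGD-with-momentum optimizer match the setup quoted in the table; everything else (architecture details, numeric hyperparameters) should be taken from the paper's Appendix E and the linked repository rather than from this sketch.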