Birth of a Transformer: A Memory Viewpoint
Authors: Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Hervé Jégou, Léon Bottou
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an induction head mechanism for the in-context bigrams. |
| Researcher Affiliation | Collaboration | Alberto Bietti (Flatiron Institute); Vivien Cabannes, Diane Bouchacourt, Hervé Jégou, Léon Bottou (FAIR, Meta). Work done while at FAIR, Meta. |
| Pseudocode | No | No structured pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | Our code is available at https://github.com/albietz/transformer-birth. |
| Open Datasets | Yes | Our experiments take π_u and π_b to be unigram and bigram character-level distributions estimated from the tiny Shakespeare dataset, with vocabulary size N = 65. |
| Dataset Splits | No | No explicit percentages or counts for training/validation/test splits are provided. The paper states, 'each batch consists of 512 fresh sequences of length T = 256 sampled from our synthetic model,' indicating dynamic data generation rather than a static dataset split. |
| Hardware Specification | No | The paper states, 'each run uses a single GPU, along with 60 CPU cores for real-time data generation,' but does not provide specific model numbers for the GPU or CPU. |
| Software Dependencies | No | The paper mentions using 'Pytorch' but does not specify a version number or other software dependencies with versions. |
| Experiment Setup | Yes | We train our models using mini-batch SGD with momentum, where each batch consists of 512 fresh sequences of length T = 256 sampled from our synthetic model. We use a fixed learning rate and weight decay. Hyperparameters are given in Appendix E. Unless otherwise noted, we use d = 128, random triggers with π_q = π_u and uniform output tokens. |
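
The "Open Datasets" and "Dataset Splits" rows above describe estimating character-level unigram (π_u) and bigram (π_b) distributions from tiny Shakespeare and then sampling fresh synthetic sequences for every batch rather than using a static split. The sketch below illustrates only that estimation and a plain bigram-chain sampler; it is not the authors' generator, which additionally uses trigger tokens and in-context output tokens (see their repository for the actual code). All function names here are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): estimate character-level unigram/bigram
# distributions from a corpus such as tiny Shakespeare, then sample fresh batches.
import numpy as np

def estimate_char_distributions(text: str):
    """Return vocabulary, unigram probs pi_u (N,), and bigram probs pi_b (N, N)."""
    vocab = sorted(set(text))                      # N = 65 for tiny Shakespeare
    stoi = {ch: i for i, ch in enumerate(vocab)}
    ids = np.array([stoi[ch] for ch in text], dtype=np.int64)

    N = len(vocab)
    unigram = np.bincount(ids, minlength=N).astype(np.float64)
    bigram = np.zeros((N, N), dtype=np.float64)
    np.add.at(bigram, (ids[:-1], ids[1:]), 1.0)    # counts of (prev, next) character pairs

    pi_u = unigram / unigram.sum()
    pi_b = bigram / np.clip(bigram.sum(axis=1, keepdims=True), 1.0, None)
    return vocab, pi_u, pi_b

def sample_fresh_batch(pi_u, pi_b, batch_size=512, T=256, rng=None):
    """Sample fresh sequences from a plain bigram chain -- a simplification of the
    paper's synthetic model, which also inserts trigger and output tokens."""
    rng = rng or np.random.default_rng()
    N = len(pi_u)
    seqs = np.empty((batch_size, T), dtype=np.int64)
    seqs[:, 0] = rng.choice(N, size=batch_size, p=pi_u)
    for t in range(1, T):
        # inverse-CDF sampling of the next token for each sequence in the batch
        cdf = np.cumsum(pi_b[seqs[:, t - 1]], axis=1)
        u = rng.random((batch_size, 1))
        seqs[:, t] = (u < cdf).argmax(axis=1)
    return seqs
```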
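
The "Experiment Setup" row can likewise be read as a standard PyTorch training loop: mini-batch SGD with momentum, a fixed learning rate, weight decay, and a fresh batch of 512 sequences of length T = 256 at every step. The sketch below assumes the sampler above; the model class and the momentum, learning-rate, and weight-decay values are placeholders, not the paper's hyperparameters (those are given in its Appendix E).

```python
# Hedged sketch of the described training setup, with illustrative hyperparameters.
import torch
import torch.nn.functional as F

def train(model, pi_u, pi_b, steps=10_000, device="cuda"):
    model = model.to(device)
    opt = torch.optim.SGD(
        model.parameters(),
        lr=0.1,            # fixed learning rate (illustrative value)
        momentum=0.9,      # illustrative value
        weight_decay=1e-4, # illustrative value
    )
    for step in range(steps):
        batch = sample_fresh_batch(pi_u, pi_b)   # (512, 256) token ids, fresh each step
        x = torch.from_numpy(batch).to(device)
        logits = model(x[:, :-1])                # next-token prediction
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            x[:, 1:].reshape(-1),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
```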