Provable Memorization Capacity of Transformers

Authors: Junghwan Kim, Michelle Kim, Barzan Mozafari

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide experiments validating the memorization capacity of Transformers for token classification and sequence classification tasks. ... We complement our theory with experiments on real-world datasets. We train encoder-only Transformer models (Vaswani et al., 2017) on a token classification task, where each token is assigned a label as in Theorem 3.1, and a sequence classification task, where each sequence is assigned a label as in Theorem 4.1. We study the relationship between the memorized dataset size and the model size. For token classification, we use 14,000 randomly selected examples among the 14,041 training examples in the named entity recognition dataset from CoNLL-2003 (Tjong Kim Sang & De Meulder, 2003). For sequence classification, we use 50,000 randomly selected examples among the 392,702 training examples in the MNLI dataset from the GLUE benchmark (Wang et al., 2019). Figure 1 shows heatmaps of training errors as the dataset size and the model size vary. Figure 2 reports the number of parameters required for memorization.
Researcher Affiliation | Academia | Junghwan Kim, CSE Department, University of Michigan, Ann Arbor, MI (kimjhj@umich.edu); Michelle Young Jin Kim, CSE Department, Michigan State University, East Lansing, MI (kimmic16@msu.edu); Barzan Mozafari, CSE Department, University of Michigan, Ann Arbor, MI (mozafari@umich.edu)
Pseudocode | No | The paper describes its theoretical construction and proof in a structured manner, outlining 'stages' and providing mathematical lemmas and definitions (e.g., 'Lemma A.1', 'Lemma A.2', 'Construction of N1', 'Construction of N2'). However, it does not include any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present its methods in a code-like, structured algorithmic format.
Open Source Code | No | The paper does not provide a link or an explicit statement about releasing the source code for the methodology or theoretical constructions presented. It mentions using the 'Hugging Face PyTorch implementation of the BERT model' for experiments, but this refers to a third-party library, not their own original code for this paper.
Open Datasets | Yes | For token classification, we use 14,000 randomly selected examples among the 14,041 training examples in the named entity recognition dataset from CoNLL-2003 (Tjong Kim Sang & De Meulder, 2003). For sequence classification, we use 50,000 randomly selected examples among the 392,702 training examples in the MNLI dataset from the GLUE benchmark (Wang et al., 2019). (A hedged data-loading sketch is given after the table.)
Dataset Splits | No | The paper describes using 'randomly selected examples' for experiments and varying the dataset size by randomly ordering the examples and picking the first p%. While it mentions 'training examples' and 'training errors', it does not specify explicit training, validation, or test splits (e.g., percentages, counts, or a split methodology such as k-fold cross-validation or standard benchmark splits) needed for reproducibility. (The subsampling procedure is sketched after the table.)
Hardware Specification | Yes | All experiments are conducted on an Nvidia Quadro RTX 5000 GPU with 16 GB of memory, in a machine with an Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz.
Software Dependencies | No | The paper names key software components: 'We use Hugging Face PyTorch implementation of the BERT model for our experiments.' and 'We optimize using Adam optimizer (Kingma & Ba, 2015)'. However, it does not specify version numbers for the Hugging Face library or PyTorch, which are necessary for reproducible software dependencies.
Experiment Setup | Yes | We vary the model size through the embedding size m while fixing the number of layers at L = 6. We fix the number of attention heads at h = 12, the embedding-to-head-size ratio at m/k = h = 12, and the feedforward-to-embedding-size ratio at q/m = 4, as commonly done in practice. We optimize using the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.00002, batch size 32, and dropout rate 10%. We train our models for 1,500 and 7,500 steps for token and sequence classification, respectively. (A configuration and training sketch follows the table.)
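
The dataset selection and subsampling described in the Open Datasets and Dataset Splits rows could be reproduced roughly as below. This is a minimal sketch, assuming the Hugging Face `datasets` library and its `conll2003` and `glue`/`mnli` loaders; the paper does not state which loader it used, and `SEED` and `first_p_percent` are hypothetical names introduced here for illustration.

```python
# Minimal sketch (assumption: Hugging Face `datasets` serves the same
# CoNLL-2003 and MNLI training sets the paper samples from).
from datasets import load_dataset

SEED = 0  # hypothetical; the paper does not report a random seed

# Token classification: 14,000 of the 14,041 CoNLL-2003 training examples.
conll = load_dataset("conll2003", split="train")
conll_subset = conll.shuffle(seed=SEED).select(range(14_000))

# Sequence classification: 50,000 of the 392,702 MNLI training examples.
mnli = load_dataset("glue", "mnli", split="train")
mnli_subset = mnli.shuffle(seed=SEED).select(range(50_000))

def first_p_percent(dataset, p):
    """Vary the memorized dataset size as the paper describes:
    randomly order the examples, then keep the first p% of them."""
    n = int(len(dataset) * p / 100)
    return dataset.shuffle(seed=SEED).select(range(n))
```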
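The hyperparameters in the Experiment Setup row map naturally onto a Hugging Face `BertConfig`. The sketch below is an assumption about how the reported ratios translate into config fields, not the authors' code; `EMBEDDING_SIZE` stands in for the swept embedding size m, and the label counts are illustrative.

```python
# Sketch under stated assumptions: encoder-only BERT with L = 6 layers,
# h = 12 heads, head size k = m / 12, and feedforward size q = 4m.
from transformers import BertConfig, BertForTokenClassification

EMBEDDING_SIZE = 768  # illustrative value of m; the paper sweeps m

config = BertConfig(
    hidden_size=EMBEDDING_SIZE,            # embedding size m
    num_hidden_layers=6,                   # L = 6
    num_attention_heads=12,                # h = 12, so k = m / 12
    intermediate_size=4 * EMBEDDING_SIZE,  # q / m = 4
    hidden_dropout_prob=0.1,               # 10% dropout
    attention_probs_dropout_prob=0.1,
    num_labels=9,                          # e.g., the CoNLL-2003 NER tag set
)
model = BertForTokenClassification(config)
# For the MNLI experiment one would instead use BertForSequenceClassification
# with num_labels=3.
```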
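Finally, a hedged sketch of the optimization settings and the training-error (memorization) measurement: Adam with learning rate 2e-5, batch size 32, and a fixed step budget. The batching details and `train_dataset` are hypothetical; the paper does not release its training loop.

```python
import torch
from torch.utils.data import DataLoader

# `train_dataset` is a hypothetical, already-tokenized training subset whose
# batches are dicts of tensors (input_ids, attention_mask, labels, ...).
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

model.train()
step, max_steps = 0, 1_500  # 7,500 steps for sequence classification
while step < max_steps:
    for batch in loader:
        out = model(**batch)  # Hugging Face models return a loss when labels are passed
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step >= max_steps:
            break

# Memorization is then read off as the training error: the fraction of
# training labels the fitted model still predicts incorrectly (0 means the
# selected subset has been memorized).
```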