How Smooth Is Attention?

Authors: Valérie Castin, Pierre Ablin, Gabriel Peyré

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.
Researcher Affiliation | Collaboration | 1 École Normale Supérieure PSL, Paris, France; 2 Apple, Paris, France; 3 CNRS.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link for open-source code implementing its methodology. It mentions using existing models like the "pretrained Huggingface model bert-base-uncased" but does not offer its own implementation code.
Open Datasets | Yes | We take our data from two test datasets, Alice in Wonderland from the NLTK corpus Gutenberg (Bird et al., 2009), and AG_NEWS from the PyTorch package torchtext (Zhang et al., 2015).
Dataset Splits | No | The paper describes how input sequences are constructed for analysis (e.g., "For each even integer n in {2, ..., 100}, we build 10 sequences with n tokens") but does not specify traditional training, validation, and test dataset splits for model training or tuning. As the paper analyzes pre-trained models, these splits are not relevant to its experimental methodology.
Hardware Specification | No | The paper mentions using "a BERT model and a GPT-2 model" for experiments but does not specify any particular hardware (e.g., GPU/CPU models, memory) used to run these experiments.
Software Dependencies | No | The paper mentions using "the pretrained Huggingface model bert-base-uncased" and "the PyTorch package torchtext" but does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | For each even integer n in {2, ..., 100}, we build 10 sequences with n tokens, so that none of the constructed sequences (s_1, ..., s_n) overlap. Then, for each input sequence (s_1, ..., s_n), we do a forward pass of the model, and get with a forward hook the intermediate activations just before the attention layer of interest f_model. This gives us a batch of sequences (x_1, ..., x_n) ∈ (R^d)^n that are fed to f_model when (s_1, ..., s_n) goes through the model. The local Lipschitz constant of f_model at an input sequence X = (x_1, ..., x_n) is equal to ||D_X f_model||_2. As D_X f_model, which we denote J_X to alleviate notations, is of shape nd × nd with d = 768, we do not compute it explicitly but use a power method on the matrix J_X^T J_X by alternating Jacobian-vector products and vector-Jacobian products (see Appendix E.1).
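The matrix-free power method described in that quote is straightforward to sketch. The snippet below is a minimal, hypothetical reconstruction, not the authors' released code: it uses a forward hook on one BertAttention block of the Hugging Face bert-base-uncased model to capture the hidden states entering that block, then estimates ||D_X f_model||_2 by alternating torch.func.jvp and torch.func.vjp so the nd × nd Jacobian is never materialized. The choice of layer, the example sentence, the eager attention implementation, and treating the full attention block (output projection and LayerNorm included) as f_model are illustrative assumptions.

```python
import torch
from torch.func import jvp, vjp
from transformers import BertModel, BertTokenizer


def jacobian_spectral_norm(f, X, n_iter=100):
    """Estimate ||D_X f||_2 with a power method on J_X^T J_X,
    alternating Jacobian-vector and vector-Jacobian products so the
    nd x nd Jacobian (d = 768 for BERT) is never built explicitly."""
    v = torch.randn_like(X)
    v = v / v.norm()
    for _ in range(n_iter):
        _, u = jvp(f, (X,), (v,))   # u = J_X v            (Jacobian-vector product)
        _, pullback = vjp(f, X)
        (w,) = pullback(u)          # w = J_X^T u = (J_X^T J_X) v  (vector-Jacobian product)
        v = w / w.norm()
    _, u = jvp(f, (X,), (v,))       # v is now close to the top right singular vector of J_X
    return u.norm().item()          # ~ largest singular value = local Lipschitz constant


# Capture the activations fed to one attention block with a forward hook.
# attn_implementation="eager" keeps the classic softmax attention, which autograd
# can differentiate through in both forward and reverse mode.
model = BertModel.from_pretrained("bert-base-uncased", attn_implementation="eager").eval()
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
attn = model.encoder.layer[6].attention   # hypothetical choice of layer

captured = {}

def hook(module, inputs, output):
    captured["X"] = inputs[0].detach()    # hidden states (x_1, ..., x_n) entering the block

handle = attn.register_forward_hook(hook)
tokens = tokenizer("Alice was beginning to get very tired.", return_tensors="pt")
with torch.no_grad():
    model(**tokens)
handle.remove()

X = captured["X"].squeeze(0)              # shape (n, d)

def f_model(Z):
    # The attention block viewed as a map (R^d)^n -> (R^d)^n.
    return attn(Z.unsqueeze(0))[0].squeeze(0)

print(jacobian_spectral_norm(f_model, X))
```

Since J_X has nd rows and columns with d = 768, storing it explicitly would be prohibitive for longer sequences; the power method above only handles vectors of size nd per iteration, which matches the matrix-free strategy quoted from the paper.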