How Smooth Is Attention?
Authors: Valérie Castin, Pierre Ablin, Gabriel Peyré
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings. |
| Researcher Affiliation | Collaboration | 1 École Normale Supérieure PSL, Paris, France; 2 Apple, Paris, France; 3 CNRS. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of the methodology described in the paper. It mentions using existing models like "pretrained Huggingface model bert-base-uncased" but does not offer its own implementation code. |
| Open Datasets | Yes | We take our data from two test datasets, Alice in Wonderland from the NLTK corpus Gutenberg (Bird et al., 2009), and AG_NEWS from the PyTorch package torchtext (Zhang et al., 2015). (A data-loading sketch appears after the table.) |
| Dataset Splits | No | The paper describes how input sequences are constructed for analysis (e.g., "For each even integer n in {2, . . . , 100}, we build 10 sequences with n tokens") but does not specify traditional training, validation, and test dataset splits for model training or tuning. As the paper analyzes pre-trained models, these splits are not relevant to its experimental methodology. |
| Hardware Specification | No | The paper mentions using "a BERT model and a GPT-2 model" for experiments but does not specify any particular hardware (e.g., GPU/CPU models, memory) used to run these experiments. |
| Software Dependencies | No | The paper mentions using "the pretrained Huggingface model bert-base-uncased" and "the PyTorch package torchtext" but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For each even integer n in {2, ..., 100}, we build 10 sequences with n tokens, so that none of the constructed sequences (s_1, ..., s_n) overlap. Then, for each input sequence (s_1, ..., s_n), we do a forward pass of the model, and get with a forward hook the intermediate activations just before the attention layer of interest f_model. This gives us a batch of sequences (x_1, ..., x_n) ∈ (R^d)^n that are fed to f_model when (s_1, ..., s_n) goes through the model. The local Lipschitz constant of f_model at an input sequence X = (x_1, ..., x_n) is equal to ‖D_X f_model‖_2. As D_X f_model, which we denote J_X to alleviate notations, is of shape nd × nd with d = 768, we do not compute it explicitly but use a power method on the matrix J_X^⊤ J_X by alternating Jacobian-vector products and vector-Jacobian products (see Appendix E.1). |
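Referring to the Open Datasets row, the sketch below shows one plausible way to obtain the two test datasets the paper names. This is not the authors' code; it assumes `nltk` and `torchtext` are installed, and the exact `AG_NEWS` loading call may differ across torchtext versions.

```python
# Illustrative sketch only: loading the two test datasets named in the paper.
import nltk
from nltk.corpus import gutenberg
from torchtext.datasets import AG_NEWS

nltk.download("gutenberg")                       # fetch the Gutenberg corpus once
alice_text = gutenberg.raw("carroll-alice.txt")  # Alice in Wonderland, raw text

# AG_NEWS yields (label, text) pairs; the paper uses its test split.
ag_news_test = list(AG_NEWS(split="test"))
```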
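The Experiment Setup row describes a matrix-free power method on J_X^⊤ J_X that alternates Jacobian-vector products and vector-Jacobian products. The following is a minimal sketch of that idea, not the authors' implementation (their exact procedure is in Appendix E.1 of the paper). Here `f` is a placeholder for the attention-layer map f_model, assumed to be a pure function from an (n, d) activation tensor (captured with a forward hook) to an (n, d) tensor, compatible with `torch.func` transforms.

```python
# Sketch: estimate the local Lipschitz constant ||D_X f||_2 by power iteration
# on J_X^T J_X, without ever materialising the nd x nd Jacobian.
import torch
from torch.func import jvp, vjp


def local_lipschitz_constant(f, X, n_iter=100):
    """Estimate the largest singular value of D_X f by power iteration."""
    v = torch.randn_like(X)
    v = v / v.norm()
    for _ in range(n_iter):
        _, u = jvp(f, (X,), (v,))   # u = J_X v   (Jacobian-vector product)
        _, pullback = vjp(f, X)
        (w,) = pullback(u)          # w = J_X^T u (vector-Jacobian product)
        v = w / w.norm()            # next power-iteration direction
    _, u = jvp(f, (X,), (v,))
    return u.norm().item()          # ||J_X v|| approximates ||D_X f||_2


# Hypothetical usage, assuming `attention_block` wraps the layer of interest
# and `X` holds the pre-attention activations (e.g. shape (n, 768) for BERT):
# lipschitz = local_lipschitz_constant(attention_block, X)
```

Because `jvp` and `vjp` only need matrix-free access to J_X, the nd × nd Jacobian never has to be formed explicitly, which is the point of the procedure quoted above.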