On the Expressive Flexibility of Self-Attention Matrices
Authors: Valerii Likhosherstov, Krzysztof Choromanski, Adrian Weller
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we present experimental simulations, discuss related work and make concluding remarks. Experiments: Theorem 0.4 suggests an upper bound (the r.h.s. of Equation 22) for d_min(ϵ1, ϵ2), i.e., the minimal d which results in M satisfying (23, 24) for fixed ϵ1, ϵ2. A question which we address in the experimental section is, therefore: what is the actual d_min(ϵ1, ϵ2) in practice? Does it satisfy the logarithmic law d = O(log n)? To answer this question, we perform the following simulation. (The search over d is sketched in code after this table.) |
| Researcher Affiliation | Collaboration | Valerii Likhosherstov¹\*, Krzysztof Choromanski²\*, Adrian Weller¹,³ (¹University of Cambridge, ²Google Brain, ³The Alan Turing Institute) |
| Pseudocode | Yes | Algorithm 1: Algorithm for solving (1, d_hid)-WEAK-SAMAPPROX (V denotes {1, ..., n}). |
| Open Source Code | No | The paper does not contain any statement about releasing source code for the described methodology or provide a link to a code repository. |
| Open Datasets | No | The paper describes sampling synthetic matrices for its experiments ("For each n we sample the matrix A"), and while Figure 1 refers to "randomly sampled input sentences from the text corpus", this is for an illustrative example from a trained model (DistilBERT), not a dataset used in their core experiments. No concrete access information for a public dataset is provided for the main experiments. |
| Dataset Splits | No | The paper describes a simulation process where matrices are sampled and parameters are iterated over ("For each n we sample the matrix A and iterate over a uniform grid of d values in ascending order until we find such d which results in M satisfying (23,24)."), but it does not define specific train, validation, or test dataset splits in the conventional sense. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, processors, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., "Python 3.8, PyTorch 1.9"). |
| Experiment Setup | Yes | For each set of parameters, we iterate over n on a uniform grid from 512 to 3072 with a step size of 256. For each n we sample the matrix A and iterate over a uniform grid of d values in ascending order until we find a d for which M satisfies (23, 24). We sample the nonzero positions of A by taking a union over k random permutation matrices; each nonzero value is set to either 1 or γ by a coin flip. To check whether for the current d there is an M satisfying (23, 24), we construct Y, X^(1), X^(2) and M using the algorithm implied by the proof of Theorem 0.4. To sample the Stiefel matrices Y, we use the algorithm based on QR decompositions of random Gaussian matrices from (Stewart 1980). We redraw the Y matrix Qn times, with Q = 1, in the spirit of Lemma 0.6, which suggests that O(n) redraws should be enough to find the right Y, X^(1), X^(2), X with constant probability (when d is big enough). (Both sampling steps and the grid search over d are sketched in code below.) |
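The Experiment Setup row quotes how the input matrix A is sampled: its nonzero positions are a union of k random permutation matrices, and each nonzero entry is set to 1 or γ by a coin flip. Below is a minimal NumPy sketch of that sampling step; the function and parameter names are illustrative, and the paper's actual values of k and γ are not quoted in this report.

```python
import numpy as np

def sample_sparse_matrix(n: int, k: int, gamma: float,
                         rng: np.random.Generator) -> np.ndarray:
    """Sample A with nonzero pattern given by a union of k random n-by-n
    permutation matrices; each nonzero entry is 1 or gamma by a coin flip."""
    A = np.zeros((n, n))
    for _ in range(k):
        cols = rng.permutation(n)  # column indices of one random permutation matrix
        # Overlapping positions are simply overwritten, so the support of A
        # is the union of the k permutation supports.
        A[np.arange(n), cols] = rng.choice([1.0, gamma], size=n)
    return A
```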
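The Stiefel matrices Y are sampled via QR decompositions of random Gaussian matrices, citing (Stewart 1980). This is a standard construction; a sketch follows, including the sign correction that makes the resulting columns Haar-distributed rather than biased by the library's QR sign convention.

```python
import numpy as np

def sample_stiefel(n: int, d: int, rng: np.random.Generator) -> np.ndarray:
    """Draw an n-by-d matrix with orthonormal columns, uniformly on the
    Stiefel manifold, via QR of a Gaussian matrix (Stewart 1980)."""
    G = rng.standard_normal((n, d))
    Q, R = np.linalg.qr(G)
    # Flip column signs so that diag(R) > 0; this removes the sign
    # ambiguity of QR and yields an exactly Haar-distributed Q.
    return Q * np.sign(np.diag(R))
```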
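Putting the pieces together, the simulation that estimates d_min(ϵ1, ϵ2) is an ascending grid search over d with Qn redraws of Y per candidate (the paper sets Q = 1). The actual acceptance test, constructing Y, X^(1), X^(2), M via the proof of Theorem 0.4 and verifying (23, 24), is not reproduced in this report, so the sketch below takes it as a caller-supplied callable; `check` and `find_d_min` are hypothetical names, and the helper `sample_stiefel` comes from the previous sketch.

```python
from typing import Callable, Optional
import numpy as np

def find_d_min(A: np.ndarray, d_grid: list[int], Q: int,
               check: Callable[[np.ndarray, np.ndarray], bool],
               rng: np.random.Generator) -> Optional[int]:
    """Return the smallest d on the grid for which some redraw of Y yields
    an M satisfying the target bounds (23, 24), or None if the grid is
    exhausted. `check(A, Y)` stands in for the construction implied by the
    proof of Theorem 0.4, which this report does not reproduce."""
    n = A.shape[0]
    for d in d_grid:            # ascending grid of candidate widths d
        for _ in range(Q * n):  # Q*n redraws of Y, per Lemma 0.6 (Q = 1 in the paper)
            Y = sample_stiefel(n, d, rng)
            if check(A, Y):
                return d
    return None
```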