On the Expressive Flexibility of Self-Attention Matrices

Authors: Valerii Likhosherstov, Krzysztof Choromanski, Adrian Weller

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we present experimental simulations, discuss related work and make concluding remarks." From the Experiments section: "Theorem 0.4 suggests an upper bound (r.h.s. in Equation 22) for d_min(ϵ1, ϵ2), i.e. the minimal d which results in M satisfying (23, 24) for fixed ϵ1, ϵ2. A question which we address in the experimental section is, therefore: What is the actual d_min(ϵ1, ϵ2) in practice? Does it satisfy the logarithmic law d = O(log n)? To answer this question, we perform the following simulation."
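The question quoted above reduces to a grid search: for each n, increase d until conditions (23, 24) hold. Below is a minimal sketch of that search, where satisfies_bounds is a hypothetical placeholder for the paper's construction-and-verification step, and the d grid and ϵ values are illustrative assumptions:

```python
import numpy as np

def satisfies_bounds(A, d, eps1, eps2):
    """Placeholder for the paper's acceptance test: construct Y, X(1), X(2)
    and M for this d (as in the proof of Theorem 0.4) and check whether M
    approximates A within the tolerances of conditions (23, 24)."""
    raise NotImplementedError

def find_d_min(A, d_grid, eps1, eps2):
    """Return the smallest d on the grid satisfying the bounds, else None."""
    for d in sorted(d_grid):
        if satisfies_bounds(A, d, eps1, eps2):
            return d
    return None
```

Sweeping n from 512 to 3072 in steps of 256, calling find_d_min for each sampled A, and plotting the returned d_min against log n would then probe the d = O(log n) law empirically.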
Researcher Affiliation | Collaboration | Valerii Likhosherstov (University of Cambridge), Krzysztof Choromanski (Google Brain), Adrian Weller (University of Cambridge and The Alan Turing Institute)
Pseudocode | Yes | Algorithm 1: "Algorithm for solving (1, d_hid)-WEAK-SAMAPPROX (V denotes {1, ..., n})."
Open Source Code | No | The paper does not contain any statement about releasing source code for the described methodology or provide a link to a code repository.
Open Datasets | No | The paper samples synthetic matrices for its experiments ("For each n we sample the matrix A"). Figure 1 refers to "randomly sampled input sentences from the text corpus", but this is an illustrative example from a trained model (DistilBERT), not a dataset used in the core experiments. No access information for a public dataset is provided for the main experiments.
Dataset Splits | No | The paper describes a simulation in which matrices are sampled and parameters are swept ("For each n we sample the matrix A and iterate over a uniform grid of d values in ascending order until we find such d which results in M satisfying (23, 24)."), but it does not define train, validation, or test splits in the conventional sense.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, processors, or cloud instance types used for running experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., "Python 3.8, PyTorch 1.9").
Experiment Setup | Yes | "For each set of parameters, we iterate over n on a uniform grid from 512 to 3072 with a step size 256. For each n we sample the matrix A and iterate over a uniform grid of d values in ascending order until we find such d which results in M satisfying (23, 24). We sample nonzero positions of A by taking a union over k random permutation matrices. The nonzero value is set to either 1 or γ by a coin flip. To check whether for the current d there is M satisfying (23, 24), we construct Y, X(1), X(2) and M using the algorithm implied by the proof of Theorem 0.4. To sample Stiefel matrices Y, we use the algorithm based on QR decompositions of random Gaussian matrices from (Stewart 1980). We redraw the Y matrix Q·n times, Q = 1, in the spirit of Lemma 0.6 suggesting that O(n) redraws should be enough to find the right Y, X(1), X(2), X with a constant probability (when d is big enough)."
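A minimal NumPy sketch of the sampling steps quoted above; the helper names (sample_sparse_A, sample_stiefel) are hypothetical, and the construction of X(1), X(2), M and the acceptance test for (23, 24) are paper-specific and left elided:

```python
import numpy as np

def sample_sparse_A(n, k, gamma, rng):
    """Sample A whose nonzero pattern is a union of k random n x n
    permutation matrices; each nonzero is set to 1 or gamma by a coin flip."""
    A = np.zeros((n, n))
    for _ in range(k):
        perm = rng.permutation(n)                # one random permutation matrix
        vals = rng.choice([1.0, gamma], size=n)  # coin flip per nonzero entry
        A[np.arange(n), perm] = vals
    return A

def sample_stiefel(n, d, rng):
    """Sample a random n x d Stiefel matrix (orthonormal columns) via the
    QR decomposition of a Gaussian matrix (Stewart 1980)."""
    G = rng.standard_normal((n, d))
    Q, R = np.linalg.qr(G)
    # Fix column signs so the distribution is uniform on the Stiefel manifold.
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
n, d, k, gamma, Q_factor = 512, 32, 3, 0.5, 1  # illustrative parameter values
A = sample_sparse_A(n, k, gamma, rng)
# Redraw Y up to Q·n times (Q = 1), per Lemma 0.6, searching for a draw that
# yields M satisfying (23, 24); the acceptance test itself is paper-specific.
for _ in range(Q_factor * n):
    Y = sample_stiefel(n, d, rng)
    # ... construct X(1), X(2), M from Y as in the proof of Theorem 0.4 ...
```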