Inductive Biases and Variable Creation in Self-Attention Mechanisms

Authors: Benjamin L Edelman, Surbhi Goel, Sham Kakade, Cyril Zhang

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work provides a theoretical analysis of the inductive biases of self-attention modules. Our main result shows that bounded-norm Transformer networks create sparse variables: a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length. To support our analysis, we present synthetic experiments to probe the sample complexity of learning sparse Boolean functions with Transformers.
Researcher Affiliation | Collaboration | ¹Department of Computer Science, Harvard University, Cambridge, MA, USA; ²Microsoft Research, New York, NY, USA. Correspondence to: Cyril Zhang <cyrilzhang@microsoft.com>.
Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states, 'our experimental setup is based on a popular PyTorch implementation (https://github.com/karpathy/minGPT)'. This refers to a third-party implementation the authors used, not a release of their own code for their specific methodology.
Open Datasets | No | We introduce a synthetic benchmark to support our analysis, in which we measure the statistical limit for learning sparse Boolean functions with Transformers. We choose a distribution D on {0,1}^T...
Dataset Splits | Yes | m samples were drawn from this distribution to form a training set (rejecting training sets that were compatible with multiple hypotheses), and 10^4 samples were drawn from the same distribution as a holdout validation set.
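The split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the sparse target (a k-sparse parity on hypothetical parameters T, k, m) and the random seed are assumptions, and the paper's rejection step for ambiguous training sets is omitted for brevity.

```python
import numpy as np

def sample_split(T=30, k=3, m=200, m_holdout=10_000, seed=0):
    """Draw m training samples and 10^4 holdout samples from a
    distribution over {0,1}^T, labeled by a k-sparse Boolean function
    (here, illustratively, the parity of k hidden coordinates)."""
    rng = np.random.default_rng(seed)
    support = rng.choice(T, size=k, replace=False)  # hidden relevant bits

    def draw(n):
        X = rng.integers(0, 2, size=(n, T))         # uniform inputs in {0,1}^T
        y = X[:, support].sum(axis=1) % 2           # k-sparse parity label
        return X, y

    return draw(m), draw(m_holdout)                 # (train, holdout) pairs
```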
Hardware Specification | Yes | All experiments were performed on an internal cluster, with NVIDIA Tesla P100, NVIDIA Tesla P40, and NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper mentions using a PyTorch implementation and the Adam optimizer, but does not specify version numbers for these or other software components.
Experiment Setup | Yes | A fixed architecture was used (d = 64, k = 4, 16 parallel heads), with trainable positional embeddings initialized with Gaussian entries N(0, σ²), σ = 0.02, 3 input token embeddings (corresponding to 0, 1, [CLS]), and 2 output embeddings (corresponding to the possible labels 0, 1). For regularization mechanisms, typical choices were used: 0.1 for {attention, embedding, output} dropout; 10⁻⁴ weight decay. The Adam optimizer was instantiated with typical parameters η = 10⁻³, β1 = 0.9, β2 = 0.999.
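The quoted hyperparameters translate into PyTorch as sketched below. This is a hedged illustration only: the minGPT-style Transformer itself is stood in for by a placeholder linear layer, and the context length 128 is a hypothetical value not taken from the paper.

```python
import torch

# Placeholder for the d=64, 16-head Transformer (architecture not reproduced here).
model = torch.nn.Linear(64, 2)

# Trainable positional embeddings with Gaussian init N(0, sigma^2), sigma = 0.02.
# Context length 128 is an assumption for illustration.
pos_emb = torch.nn.Parameter(torch.randn(128, 64) * 0.02)

# Adam with the quoted hyperparameters: eta = 1e-3, betas = (0.9, 0.999),
# and 1e-4 weight decay.
optimizer = torch.optim.Adam(
    list(model.parameters()) + [pos_emb],
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=1e-4,
)

# 0.1 dropout, as applied to attention, embedding, and output layers.
dropout = torch.nn.Dropout(p=0.1)
```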