The Lipschitz Constant of Self-Attention
Authors: Hyunjik Kim, George Papamakarios, Andriy Mnih
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task. We compare its test log-likelihood and stability to dot-product self-attention. (A hedged sketch contrasting dot-product and L2-distance attention scores follows the table.) |
| Researcher Affiliation | Industry | DeepMind, UK. Correspondence to: Hyunjik Kim <hyunjikk@google.com>. |
| Pseudocode | No | The paper describes algorithms and mathematical formulations in text and equations but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | ...at character-level language modelling on the Penn Treebank dataset (Marcus et al., 1993). |
| Dataset Splits | Yes | tuning the hyperparameters on a validation set. |
| Hardware Specification | No | This was the deepest model we could fit on a single GPU, and we expect to be able to train even deeper models with these two. (Section 5.4) - A GPU is mentioned, but no specific model or hardware specifications are provided. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | In practice, this leads to instabilities in training for DP-MHA, hence requiring careful tuning of the learning rate schedule for training deeper Transformer models: linear warmup and square root decay, as detailed in Appendix H. (A sketch of this schedule family also follows the table.) |
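The Research Type row quotes the paper's comparison between its invertible self-attention and standard dot-product self-attention. Below is a minimal NumPy sketch contrasting the two score functions: dot-product scores versus negative squared L2 distances between queries and keys. It is a single-head simplification with illustrative dimensions and a tied query/key projection, not the paper's exact L2-MHA parameterisation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(X, W_q, W_k, W_v):
    """Standard single-head dot-product self-attention.

    Scores are scaled pairwise dot products of queries and keys; they grow
    without bound with the input norm, which is the behaviour the paper's
    analysis targets.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # (N, N) pairwise dot products
    return softmax(scores) @ V

def l2_distance_attention(X, W_q, W_v):
    """Sketch of L2-distance-based self-attention scores.

    Scores are negative squared Euclidean distances between queries and
    keys, with the key projection tied to the query projection (a
    simplifying assumption of this sketch).
    """
    Q = X @ W_q
    K = Q                                     # tied projection (assumption)
    V = X @ W_v
    d = Q.shape[-1]
    sq_dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)  # (N, N)
    scores = -sq_dists / np.sqrt(d)
    return softmax(scores) @ V

# Toy usage with arbitrary sizes: N=5 tokens, width D=8.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q = rng.normal(size=(8, 8)) * 0.1
W_k = rng.normal(size=(8, 8)) * 0.1
W_v = rng.normal(size=(8, 8)) * 0.1
print(dot_product_attention(X, W_q, W_k, W_v).shape)  # (5, 8)
print(l2_distance_attention(X, W_q, W_v).shape)       # (5, 8)
```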
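The Experiment Setup row cites a learning-rate schedule with linear warmup and square-root decay, detailed in the paper's Appendix H (not reproduced here). The sketch below implements that schedule family; `warmup_steps` and `peak_lr` are placeholder values, not the paper's settings.

```python
def lr_schedule(step, warmup_steps=4000, peak_lr=1e-3):
    """Linear warmup to peak_lr, then inverse-square-root decay.

    warmup_steps and peak_lr are illustrative placeholders; the paper's
    actual hyperparameters are given in its Appendix H.
    """
    step = max(step, 1)
    warmup = step / warmup_steps              # linear ramp up to 1.0
    decay = (warmup_steps / step) ** 0.5      # 1/sqrt(step) decay afterwards
    return peak_lr * min(warmup, decay)

# The rate ramps linearly, peaks at warmup_steps, then decays as 1/sqrt(step).
for s in (1, 1000, 4000, 16000, 64000):
    print(s, lr_schedule(s))
```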