Weight decay induces low-rank attention layers
Authors: Seijin Kobayashi, Yassir Akram, Johannes von Oswald
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our result on various experimental settings, including when optimizing with decoupled weight decay [Loshchilov and Hutter, 2019], on models ranging from deep linear networks to language models as well as Vision Transformers [Dosovitskiy et al., 2021]. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, ETH Zürich 2Google, Paradigms of Intelligence Team |
| Pseudocode | No | The paper contains mathematical derivations and equations but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | We aim to collect the code as soon as possible in a Git repository. |
| Open Datasets | Yes | Trained on the Pile [Gao et al., 2020], a common language modeling dataset, and train a Vision Transformer on the ImageNet dataset [Deng et al., 2009]. |
| Dataset Splits | No | The paper uses standard datasets and mentions a 'test set', but it does not explicitly provide the train/validation/test dataset split percentages or sample counts for reproduction (e.g., '80% training, 10% validation, 10% test'). |
| Hardware Specification | Yes | We estimate the total compute budget at 4 Nvidia RTX 4090 GPUs for two months. The LLMs were additionally trained on a cluster of 16 A100 GPUs for 4 days. |
| Software Dependencies | No | Table 2 lists 'Optimizer Adam [Kingma and Ba, 2015] with ϵ = 1e-8, β1 = 0.9, β2 = 0.95' but does not provide specific version numbers for programming languages (e.g., Python), machine learning frameworks (e.g., PyTorch, TensorFlow), or other relevant libraries. |
| Experiment Setup | Yes | Table 2: Hyperparameters for language modelling experiments, including 'Context size 756', 'Batch size 128', 'Optimizer Adam [Kingma and Ba, 2015] with ϵ = 1e-8, β1 = 0.9, β2 = 0.95', 'Gradient clipping Global norm of 1', 'Learning rate scheduler Linear warm-up starting from 1e-6 to 1e-3 in the first 8000 training steps, cosine annealing to 10% of the learning rate after warm-up for the end of training'. A hedged sketch of this configuration, and of a rank diagnostic for the paper's headline claim, follows the table. |
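
As a reading aid for the Experiment Setup row, here is a minimal PyTorch sketch of the Table 2 training configuration (Adam with ϵ = 1e-8, β = (0.9, 0.95), global-norm gradient clipping at 1, linear warm-up from 1e-6 to 1e-3 over 8000 steps, then cosine annealing to 10% of the peak rate). The model, the total step count, and the use of `LambdaLR` are illustrative assumptions, not the authors' code (which is not released).

```python
# Hedged sketch of the Table 2 training configuration, assuming PyTorch.
import math
import torch

context_size = 756   # Table 2 (not used further in this minimal sketch)
batch_size = 128     # Table 2 (not used further in this minimal sketch)

model = torch.nn.Linear(64, 64)  # placeholder model; the paper trains Transformers
total_steps = 100_000            # assumed; the paper's total step count is not quoted here

peak_lr = 1e-3
warmup_start_lr = 1e-6
warmup_steps = 8_000
final_lr_fraction = 0.10         # cosine anneal to 10% of the peak learning rate

optimizer = torch.optim.Adam(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), eps=1e-8
)

def lr_lambda(step: int) -> float:
    """Linear warm-up from 1e-6 to 1e-3, then cosine annealing to 10% of peak."""
    if step < warmup_steps:
        start = warmup_start_lr / peak_lr
        return start + (1.0 - start) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return final_lr_fraction + (1.0 - final_lr_fraction) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def training_step(batch_loss: torch.Tensor) -> None:
    optimizer.zero_grad()
    batch_loss.backward()
    # Gradient clipping: global norm of 1, as listed in Table 2.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```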
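
The paper's headline claim is that (decoupled) weight decay drives the key-query product W_K^T W_Q of attention layers toward low rank. Below is a minimal sketch of how one might probe this on a trained model; the `effective_rank` helper, the threshold, and the `attn`/`q_proj`/`k_proj` attribute names are illustrative assumptions, not the authors' method or code.

```python
# Hedged diagnostic sketch (not the authors' code): estimate the effective rank of
# the key-query product W_K^T W_Q of an attention layer.
import torch

def effective_rank(W_q: torch.Tensor, W_k: torch.Tensor, threshold: float = 1e-3) -> int:
    """Count singular values of W_K^T W_Q above a fraction of the largest one."""
    product = W_k.T @ W_q                         # the key-query product analysed in the paper
    singular_values = torch.linalg.svdvals(product)  # returned in descending order
    return int((singular_values > threshold * singular_values[0]).sum())

# Hypothetical usage on a trained model's first attention block:
# attn = model.layers[0].self_attn
# rank = effective_rank(attn.q_proj.weight, attn.k_proj.weight)
# Comparing runs trained with and without weight decay should show a lower effective
# rank under weight decay, per the paper's thesis.
```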