Weight decay induces low-rank attention layers

Authors: Seijin Kobayashi, Yassir Akram, Johannes von Oswald

NeurIPS 2024
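
The title's claim concerns the key-query products inside self-attention layers. As a purely illustrative sketch (not code from the paper), the snippet below shows one way to probe the rank of such a product from a trained checkpoint; the weight shapes, the random stand-in matrices, and the entropy-based effective-rank measure are assumptions rather than the authors' evaluation protocol.

```python
# Illustrative only: probe the spectrum of the key-query product W_K W_Q^T
# of one attention head to gauge how low-rank it is.
import torch

def effective_rank(M: torch.Tensor, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a matrix (Roy & Vetterli, 2007)."""
    s = torch.linalg.svdvals(M)
    p = s / (s.sum() + eps)                      # normalized singular values
    entropy = -(p * torch.log(p + eps)).sum()
    return float(torch.exp(entropy))

# Hypothetical projections for one head; in practice these would be loaded
# from a checkpoint trained with a given weight-decay coefficient.
d_model, d_head = 512, 64
W_q = torch.randn(d_model, d_head)               # stand-in query projection
W_k = torch.randn(d_model, d_head)               # stand-in key projection

kq_product = W_k @ W_q.T                         # (d_model, d_model), rank <= d_head
print(f"effective rank of W_K W_Q^T: {effective_rank(kq_product):.1f}")
```

On the random stand-ins above the effective rank sits near d_head; the comparison of interest is between checkpoints trained with and without weight decay.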

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate our result in various experimental settings, including when optimizing with decoupled weight decay [Loshchilov and Hutter, 2019], on models ranging from deep linear networks to language models as well as Vision Transformers [Dosovitskiy et al., 2021].
Researcher Affiliation | Collaboration | Department of Computer Science, ETH Zürich; Google, Paradigms of Intelligence Team
Pseudocode | No | The paper contains mathematical derivations and equations but does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | We aim to collect the code as soon as possible in a Git repository.
Open Datasets | Yes | Trained on the Pile [Gao et al., 2020], a common language modeling dataset, and trained a Vision Transformer on the ImageNet dataset [Deng et al., 2009].
Dataset Splits | No | The paper uses standard datasets and mentions a 'test set', but it does not explicitly provide the train/validation/test dataset split percentages or sample counts for reproduction (e.g., '80% training, 10% validation, 10% test').
Hardware Specification | Yes | We estimate the total compute budget at 4 Nvidia RTX 4090 GPUs for two months. The LLMs were additionally trained on a cluster of 16x A100 GPUs for 4 days.
Software Dependencies | No | Table 2 lists 'Optimizer Adam [Kingma and Ba, 2015] with ϵ = 1e-8, β1 = 0.9, β2 = 0.95' but does not provide specific version numbers for programming languages (e.g., Python), machine learning frameworks (e.g., PyTorch, TensorFlow), or other relevant libraries.
Experiment Setup | Yes | Table 2: Hyperparameters for language modelling experiments, including 'Context size 756', 'Batchsize 128', 'Optimizer Adam [Kingma and Ba, 2015] with ϵ = 1e-8, β1 = 0.9, β2 = 0.95', 'Gradient clipping Global norm of 1', 'Learning rate scheduler Linear warm-up starting from 1e-6 to 1e-3 in the first 8000 training steps, cosine annealing to 10% of the learning rate after warm-up for the end of training'.
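
For concreteness, the Table 2 settings quoted in the last row can be assembled into a short configuration sketch. This is not the authors' code: the model, total step count, weight-decay coefficient, and loss are placeholders, and torch.optim.AdamW is used as one standard implementation of decoupled weight decay [Loshchilov and Hutter, 2019].

```python
# Sketch of the quoted Table 2 configuration; placeholders are marked in comments.
import math
import torch

model = torch.nn.Linear(512, 512)     # placeholder for the actual Transformer
total_steps = 100_000                 # total training length is not quoted above
warmup_steps, lr_start, lr_peak = 8_000, 1e-6, 1e-3

# Adam with eps=1e-8, beta1=0.9, beta2=0.95; decoupled weight decay via AdamW.
# The weight-decay coefficient itself is a placeholder value.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=lr_peak, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1
)

def lr_factor(step: int) -> float:
    """Linear warm-up from 1e-6 to 1e-3 over the first 8000 steps, then cosine
    annealing down to 10% of the peak learning rate by the end of training."""
    if step < warmup_steps:
        return (lr_start + (lr_peak - lr_start) * step / warmup_steps) / lr_peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

# One illustrative step with batch size 128 and global-norm gradient clipping at 1.
x = torch.randn(128, 512)
loss = model(x).pow(2).mean()         # placeholder loss, not the LM objective
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```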