Weight decay induces low-rank attention layers
Authors: Seijin Kobayashi, Yassir Akram, Johannes von Oswald
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our result on various experimental settings, including when optimizing with decoupled weight decay [Loshchilov and Hutter, 2019], on models ranging from deep linear networks to language models as well as Vision Transformers [Dosovitskiy et al., 2021]. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, ETH Zürich 2Google, Paradigms of Intelligence Team |
| Pseudocode | No | The paper contains mathematical derivations and equations but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | We aim to collect the code as soon as possible in a Git repository. |
| Open Datasets | Yes | Trained on the Pile [Gao et al., 2020], a common language modeling dataset, and train a Vision Transformer on the ImageNet dataset [Deng et al., 2009]. |
| Dataset Splits | No | The paper uses standard datasets and mentions a 'test set', but it does not explicitly provide the train/validation/test dataset split percentages or sample counts for reproduction (e.g., '80% training, 10% validation, 10% test'). |
| Hardware Specification | Yes | We estimate the total compute budget at 4 Nvidia RTX 4090 GPUs for two months. The LLMs were additionally trained on a cluster of 16 A100 GPUs for 4 days. |
| Software Dependencies | No | Table 2 lists 'Optimizer Adam [Kingma and Ba, 2015] with ϵ = 1e-8, β1 = 0.9, β2 = 0.95' but does not provide specific version numbers for programming languages (e.g., Python), machine learning frameworks (e.g., PyTorch, TensorFlow), or other relevant libraries. |
| Experiment Setup | Yes | Table 2: Hyperparameters for language modelling experiments, including 'Context size 756', 'Batch size 128', 'Optimizer Adam [Kingma and Ba, 2015] with ϵ = 1e-8, β1 = 0.9, β2 = 0.95', 'Gradient clipping Global norm of 1', 'Learning rate scheduler Linear warm-up starting from 1e-6 to 1e-3 in the first 8000 training steps, cosine annealing to 10% of the learning rate after warm-up for the end of training'. A hedged sketch of this configuration, and of a rank diagnostic for the paper's headline claim, follows the table. |
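
As a reading aid for the Experiment Setup row, here is a minimal PyTorch sketch of the Table 2 training configuration (Adam with ϵ = 1e-8, β = (0.9, 0.95), global-norm gradient clipping at 1, linear warm-up from 1e-6 to 1e-3 over 8000 steps, then cosine annealing to 10% of the peak rate). The model, the total step count, and the use of `LambdaLR` are illustrative assumptions, not the authors' code (which is not released).

```python
# Hedged sketch of the Table 2 training configuration, assuming PyTorch.
import math
import torch

context_size = 756   # Table 2 (not used further in this minimal sketch)
batch_size = 128     # Table 2 (not used further in this minimal sketch)

model = torch.nn.Linear(64, 64)  # placeholder model; the paper trains Transformers
total_steps = 100_000            # assumed; the paper's total step count is not quoted here

peak_lr = 1e-3
warmup_start_lr = 1e-6
warmup_steps = 8_000
final_lr_fraction = 0.10         # cosine anneal to 10% of the peak learning rate

optimizer = torch.optim.Adam(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), eps=1e-8
)

def lr_lambda(step: int) -> float:
    """Linear warm-up from 1e-6 to 1e-3, then cosine annealing to 10% of peak."""
    if step < warmup_steps:
        start = warmup_start_lr / peak_lr
        return start + (1.0 - start) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return final_lr_fraction + (1.0 - final_lr_fraction) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def training_step(batch_loss: torch.Tensor) -> None:
    optimizer.zero_grad()
    batch_loss.backward()
    # Gradient clipping: global norm of 1, as listed in Table 2.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```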
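
The paper's headline claim is that (decoupled) weight decay drives the key-query product W_K^T W_Q of attention layers toward low rank. Below is a minimal sketch of how one might probe this on a trained model; the `effective_rank` helper, the threshold, and the `attn`/`q_proj`/`k_proj` attribute names are illustrative assumptions, not the authors' method or code.

```python
# Hedged diagnostic sketch (not the authors' code): estimate the effective rank of
# the key-query product W_K^T W_Q of an attention layer.
import torch

def effective_rank(W_q: torch.Tensor, W_k: torch.Tensor, threshold: float = 1e-3) -> int:
    """Count singular values of W_K^T W_Q above a fraction of the largest one."""
    product = W_k.T @ W_q                         # the key-query product analysed in the paper
    singular_values = torch.linalg.svdvals(product)  # returned in descending order
    return int((singular_values > threshold * singular_values[0]).sum())

# Hypothetical usage on a trained model's first attention block:
# attn = model.layers[0].self_attn
# rank = effective_rank(attn.q_proj.weight, attn.k_proj.weight)
# Comparing runs trained with and without weight decay should show a lower effective
# rank under weight decay, per the paper's thesis.
```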