Transformers learn through gradual rank increase

Authors: Enric Boix-Adsera, Etai Littwin, Emmanuel Abbe, Samy Bengio, Joshua Susskind

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our experiments support the theory and also show that the phenomenon can occur in practice without the simplifying assumptions. Our results have a theoretical component and an experimental component." |
| Researcher Affiliation | Collaboration | Apple, MIT, and EPFL. Contact: eboix@mit.edu, emmanuel.abbe@epfl.ch, {elittwin,bengio,jsusskind}@apple.com |
| Pseudocode | Yes | "Algorithm 1: Incremental learning in networks with diagonal weights" |
| Open Source Code | No | The paper contains no explicit statement about releasing source code and provides no link to a code repository for the described methodology. |
| Open Datasets | Yes | "We conduct experiments on vision transformers (ViT) [DBK+20] trained on the CIFAR-10/100 and ImageNet datasets, and with the GPT-2 language transformer [BMR+20] trained on the Wikitext-103 dataset." |
| Dataset Splits | No | The paper uses well-known datasets (CIFAR-10/100, ImageNet) that have standard splits, but it does not explicitly state train/validation/test splits (percentages, sample counts, or a citation to a predefined split). |
| Hardware Specification | Yes | "Each run took 2 hours on one A100 GPU." ... "trained for 3 epochs on 2 A100 GPUs, which took 12 hours." |
| Software Dependencies | No | The paper mentions the Adam and SGD optimizers and the Hugging Face training script, but it gives no version numbers for these or for any other software components. |
| Experiment Setup | Yes | "We train all layers (including the feedforward layers). ... We use a ViT of depth 6, with 8 self-attention heads per layer (with layer normalization). We use an embedding and MLP dimension of d_emb = 512, and a head dimension of d_h = 128. ... We train the transformer using Adam with the cross-entropy loss. ... For the CIFAR-10/100 datasets we use a ViT with 6 layers, a patch size of 4, 8 heads per self-attention layer, an embedding and MLP dimension of 512, and a head dimension of 128. We train the model using the Adam optimizer for 500 epochs with a base learning rate of 1e-4, a cyclic learning-rate decay with a linear warmup schedule for 15 epochs, and a batch size of 512." See the configuration sketch after the table. |
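Because the paper does not release code, the following is a minimal sketch that assembles the quoted CIFAR-10/100 hyperparameters into a PyTorch-style training configuration. The module layout, the torchvision data pipeline, the specific warmup-then-decay schedule, and all function names below are assumptions; only the numeric values are taken from the paper's text.

```python
# Minimal sketch of the reported CIFAR-10/100 ViT training setup.
# Only the numeric hyperparameters come from the paper; everything else
# (scheduler choice, augmentation, function names) is an assumption.
from dataclasses import dataclass

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR
from torchvision import datasets, transforms


@dataclass
class ViTConfig:
    depth: int = 6           # transformer layers
    num_heads: int = 8       # self-attention heads per layer
    embed_dim: int = 512     # embedding and MLP dimension
    head_dim: int = 128      # per-head dimension
    patch_size: int = 4      # CIFAR images are 32x32 -> 8x8 patches
    num_classes: int = 10    # 100 for CIFAR-100


@dataclass
class TrainConfig:
    epochs: int = 500
    warmup_epochs: int = 15  # linear warmup, then decay
    base_lr: float = 1e-4
    batch_size: int = 512


def make_cifar10_loader(train_cfg: TrainConfig, root: str = "./data"):
    """CIFAR-10 via torchvision; the paper does not state its augmentation."""
    tfm = transforms.Compose([transforms.ToTensor()])
    ds = datasets.CIFAR10(root=root, train=True, download=True, transform=tfm)
    return torch.utils.data.DataLoader(
        ds, batch_size=train_cfg.batch_size, shuffle=True, num_workers=4
    )


def make_optimizer_and_scheduler(model: torch.nn.Module, train_cfg: TrainConfig):
    """Adam with linear warmup then linear decay (one assumed cycle), stepped per epoch."""
    opt = Adam(model.parameters(), lr=train_cfg.base_lr)

    def lr_lambda(epoch: int) -> float:
        if epoch < train_cfg.warmup_epochs:
            return (epoch + 1) / train_cfg.warmup_epochs
        progress = (epoch - train_cfg.warmup_epochs) / max(
            1, train_cfg.epochs - train_cfg.warmup_epochs
        )
        return max(0.0, 1.0 - progress)

    return opt, LambdaLR(opt, lr_lambda=lr_lambda)
```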
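The paper's central claim is that the deviation of each attention head's key-query product from its value at initialization gains rank gradually during training. As a hedged illustration only (not the authors' code), one way to monitor this is to count the singular values of that deviation above a small threshold; the tolerance, the convention for the key-query product, and the function names here are assumptions.

```python
import torch


def numerical_rank(delta: torch.Tensor, rel_tol: float = 1e-3) -> int:
    """Count singular values above rel_tol times the largest one."""
    s = torch.linalg.svdvals(delta)
    if s.numel() == 0 or s[0] == 0:
        return 0
    return int((s > rel_tol * s[0]).sum())


def kq_rank_increase(W_K: torch.Tensor, W_Q: torch.Tensor,
                     W_K0: torch.Tensor, W_Q0: torch.Tensor) -> int:
    """Rank of the deviation of W_K @ W_Q^T from its value at initialization."""
    delta = W_K @ W_Q.T - W_K0 @ W_Q0.T
    return numerical_rank(delta)
```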