Transformers learn through gradual rank increase
Authors: Enric Boix-Adsera, Etai Littwin, Emmanuel Abbe, Samy Bengio, Joshua Susskind
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments support the theory and also show that the phenomenon can occur in practice without the simplifying assumptions. Our results have a theoretical component and an experimental component. |
| Researcher Affiliation | Collaboration | ¹Apple, ²MIT, ³EPFL; eboix@mit.edu, emmanuel.abbe@epfl.ch, {elittwin, bengio, jsusskind}@apple.com |
| Pseudocode | Yes | Algorithm 1 Incremental learning in networks with diagonal weights |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository for the described methodology. |
| Open Datasets | Yes | We conduct experiments on vision transformers (ViT) [DBK+20] trained on the CIFAR-10/100 and ImageNet datasets, and with the GPT-2 language transformer [BMR+20] trained on the Wikitext-103 dataset. |
| Dataset Splits | No | The paper mentions using well-known datasets like CIFAR-10/100 and ImageNet, which often have standard splits, but it does not explicitly state the train/validation/test dataset splits (e.g., percentages, sample counts, or explicit citation to a predefined split) within its text. |
| Hardware Specification | Yes | Each run took 2 hours on one A100 GPU. ... trained for 3 epochs on 2 A100 GPUs, which took 12 hours. |
| Software Dependencies | No | The paper mentions using Adam and SGD optimizers, and the Hugging Face training script. However, it does not provide specific version numbers for these software components or any other libraries used. |
| Experiment Setup | Yes | We train all layers (including the feedforward layers). ... We use a ViT of depth 6, with 8 self-attention heads per layer (with layer normalization). We use an embedding and MLP dimension of d_emb = 512, and a head dimension of d_h = 128. ... We train the transformer using Adam with the cross-entropy loss. ... For the CIFAR-10/100 datasets we use a ViT with 6 layers, a patch size of 4, 8 heads per self-attention layer, an embedding and MLP dimension of 512, and a head dimension of 128. We train the model using the Adam optimizer for 500 epochs with a base learning rate of 1e-4, a cyclic learning rate decay with a linear warmup schedule for 15 epochs, and a batch size of 512. (The reported hyperparameters are collected in the configuration sketch after this table.) |
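For convenience, the ViT hyperparameters quoted in the Experiment Setup row can be gathered into one configuration object. This is a minimal sketch under the stated values only: the paper does not release code, so the class and field names below are our own and purely illustrative.

```python
# Hypothetical configuration sketch (not the authors' code).
# Only the numeric values come from the paper's reported setup;
# all identifiers here are assumptions made for illustration.
from dataclasses import dataclass


@dataclass
class ViTTrainingConfig:
    # Architecture (CIFAR-10/100 experiments)
    depth: int = 6            # transformer layers
    num_heads: int = 8        # self-attention heads per layer (with layer norm)
    embed_dim: int = 512      # embedding and MLP dimension (d_emb)
    head_dim: int = 128       # per-head dimension (d_h)
    patch_size: int = 4       # image patch size
    # Optimization (Adam with cross-entropy loss)
    base_lr: float = 1e-4     # base learning rate
    warmup_epochs: int = 15   # linear warmup, then cyclic decay
    epochs: int = 500
    batch_size: int = 512


if __name__ == "__main__":
    # All layers, including the feedforward layers, are trained.
    print(ViTTrainingConfig())
```

A structured record like this makes it easy to check a reimplementation against the reported values, but it does not substitute for the missing software versions or dataset-split details noted above.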