The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

Authors: Lorenzo Noci, Chuning Li, Mufan Li, Bobby He, Thomas Hofmann, Chris J. Maddison, Dan Roy

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. Through simulations (e.g., Figure 1), we show that the limiting neural covariance SDE approximates the distribution of finite-size Transformers with shaped attention mechanism surprisingly well. We also provide preliminary training experiments for our proposed shaped attention architecture on standard language modeling tasks, demonstrating the feasibility of the new architecture in practice (see Section 5 and Appendix D).
Researcher Affiliation | Collaboration | Lorenzo Noci, Chuning Li, Mufan (Bill) Li, Bobby He, Thomas Hofmann, Chris Maddison, Daniel M. Roy (ETH Zurich; University of Toronto and Vector Institute; University of Oxford)
Pseudocode | No | The paper does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper does not provide a direct link to a source-code repository or an explicit statement about the release of code for the described methodology. The text mentions 'Given limited computing resources, we chose to only briefly test the feasibility of training the shaped Transformer' but makes no commitment to release code.
Open Datasets | Yes | We use a subset of the English Wikipedia 20220301.en and English BookCorpus datasets [69, 70]. (A hedged loading sketch is given after the table.)
Dataset Splits | No | The paper mentions 'train/test loss' but does not specify split percentages or sample counts, does not describe the splitting methodology used for the datasets, and does not state that predefined splits (with proper citation) were used for reproduction.
Hardware Specification | Yes | The experiments are executed on Nvidia DGX-1 GPU nodes equipped with 4 20-core Xeon E5-2698v4 processors, 512 GB of memory and 8 Nvidia V100 GPUs.
Software Dependencies | No | The paper mentions using 'Adam [71]' and refers to the 'Hugging face implementation of Bert [73]' but does not provide specific version numbers for these or other ancillary software components such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | All models including the baselines use Adam [71] with a learning rate warmup of 4000 steps; the learning rate is tuned over the grid (0.0001, 0.0005, 0.001, 0.005). The batch size is fixed to 32 sequences. We train using Adam [71] with betas (0.9, 0.999) and a learning rate chosen from the grid (0.0001, 0.0005, 0.001, 0.005). We do not use weight decay. We also add scalar multipliers γ1, γ2 ∈ ℝ to both the identity and centering terms of the shaped attention... and propose two ways to set γ1, γ2 during training... In both alternatives we initialize γ1, γ2 = 1... initialize γ in the grid (0.05, 0.1, 0.2) and set τ0 = 1. (Hedged sketches of the optimizer and shaped-attention settings are given after the table.)
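The Open Datasets row quotes subsets of English Wikipedia (20220301.en) and BookCorpus. Below is a minimal sketch of how such subsets could be assembled with the Hugging Face `datasets` library; the 1% slices, the concatenation, and the 90/10 train/test split are illustrative assumptions, since the paper does not report the subset size or the split.

```python
# Hypothetical loading sketch; the subset size and split are NOT specified in the paper,
# so the slices and the 90/10 split below are placeholder assumptions.
from datasets import load_dataset, concatenate_datasets

wiki = load_dataset("wikipedia", "20220301.en", split="train[:1%]")   # assumed subset size
books = load_dataset("bookcorpus", split="train[:1%]")                # assumed subset size

# Keep only the raw text column so the two corpora can be concatenated.
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])
books = books.remove_columns([c for c in books.column_names if c != "text"])

corpus = concatenate_datasets([wiki, books]).shuffle(seed=0)
splits = corpus.train_test_split(test_size=0.1, seed=0)               # assumed 90/10 split
train_ds, test_ds = splits["train"], splits["test"]
```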
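The Experiment Setup row specifies Adam with betas (0.9, 0.999), no weight decay, a 4000-step learning-rate warmup, and a peak learning rate swept over (0.0001, 0.0005, 0.001, 0.005) at batch size 32. A minimal PyTorch sketch of that optimizer configuration follows, assuming a linear warmup that then holds the rate constant; the paper only states the warmup length, not the schedule after it.

```python
# Optimizer configuration sketch based on the quoted setup; the post-warmup schedule
# (constant here) is an assumption.
import torch

def build_optimizer(model: torch.nn.Module, peak_lr: float = 5e-4, warmup_steps: int = 4000):
    optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr,
                                 betas=(0.9, 0.999), weight_decay=0.0)
    # Linearly ramp the learning rate from 0 to peak_lr over `warmup_steps`, then hold it.
    schedule = lambda step: min(1.0, (step + 1) / warmup_steps)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=schedule)
    return optimizer, scheduler
```

In a sweep, `peak_lr` would take each value in the quoted grid; the batch size of 32 sequences would be set in the data loader rather than in the optimizer.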
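The same row also describes adding scalar multipliers γ1 (identity term) and γ2 (centering term) to the shaped attention, initialized to 1, with τ0 = 1. The sketch below is one plausible reading of that description: it treats γ1 and γ2 as trainable scalars (one of the two alternatives the paper mentions) and assumes the softmax temperature enters as τ = τ0·√(n_k); both choices are assumptions, not the authors' verified implementation.

```python
# Illustrative sketch of shaped attention with gamma1/gamma2 multipliers:
#   gamma1 * I + Softmax(scores) - gamma2 * (1/T) * 11^T
# Temperature placement and trainable gammas are assumptions, not the authors' exact code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, gamma_init: float = 1.0, tau0: float = 1.0):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # gamma1 scales the identity term, gamma2 the uniform centering term;
        # both start at 1 as in the quoted setup and are trained here (assumed alternative).
        self.gamma1 = nn.Parameter(torch.tensor(float(gamma_init)))
        self.gamma2 = nn.Parameter(torch.tensor(float(gamma_init)))
        self.tau0 = tau0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        # Softmax attention with an assumed width-dependent temperature tau0 * sqrt(n_k).
        scores = q @ k.transpose(-2, -1) / (self.tau0 * math.sqrt(self.d_head))
        attn = F.softmax(scores, dim=-1)

        # Shaped attention: gamma1 * I + Softmax(...) - gamma2 * (1/T) * 11^T.
        eye = torch.eye(T, device=x.device, dtype=x.dtype)
        center = torch.full((T, T), 1.0 / T, device=x.device, dtype=x.dtype)
        shaped = self.gamma1 * eye + attn - self.gamma2 * center

        y = (shaped @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(y)
```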