The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

Authors: Lorenzo Noci, Chuning Li, Mufan Li, Bobby He, Thomas Hofmann, Chris J. Maddison, Dan Roy

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. Through simulations (e.g., Figure 1), we show that the limiting neural covariance SDE approximates the distribution of finite-size Transformers with shaped attention mechanism surprisingly well. We also provide preliminary training experiments for our proposed shaped attention architecture on standard language modeling tasks, demonstrating the feasibility of the new architecture in practice (see Section 5 and Appendix D).
Researcher Affiliation | Collaboration | Lorenzo Noci, Chuning Li, Mufan (Bill) Li, Bobby He, Thomas Hofmann, Chris Maddison, Daniel M. Roy (ETH Zurich; University of Toronto and Vector Institute; University of Oxford)
Pseudocode | No | The paper does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper does not provide a direct link to a source-code repository or an explicit statement about the release of code for the described methodology. The text mentions 'Given limited computing resources, we chose to only briefly test the feasibility of training the shaped Transformer' but makes no commitment to release code.
Open Datasets | Yes | We use a subset of the English Wikipedia 20220301.en and English BookCorpus datasets [69, 70]. (A hedged loading sketch is given after the table.)
Dataset Splits | No | The paper mentions 'train/test loss' but does not specify split percentages or sample counts, does not describe the splitting methodology used for the datasets, and does not state that predefined splits (with proper citation) were used for reproduction.
Hardware Specification | Yes | The experiments are executed on Nvidia DGX-1 GPU nodes equipped with 4 20-core Xeon E5-2698v4 processors, 512 GB of memory and 8 Nvidia V100 GPUs.
Software Dependencies | No | The paper mentions using 'Adam [71]' and refers to the 'Hugging face implementation of Bert [73]' but does not provide specific version numbers for these or other ancillary software components such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | All models including the baselines use Adam [71] with a learning rate warmup of 4000 steps; the learning rate is tuned over the grid (0.0001, 0.0005, 0.001, 0.005). The batch size is fixed to 32 sequences. We train using Adam [71] with betas (0.9, 0.999) and a learning rate chosen from the grid (0.0001, 0.0005, 0.001, 0.005). We do not use weight decay. We also add scalar multipliers γ1, γ2 ∈ ℝ to both the identity and centering terms of the shaped attention... and propose two ways to set γ1, γ2 during training... In both alternatives we initialize γ1, γ2 = 1... initialize γ in the grid (0.05, 0.1, 0.2) and set τ0 = 1. (Hedged sketches of the optimizer and shaped-attention settings are given after the table.)
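The Open Datasets row quotes subsets of English Wikipedia (20220301.en) and BookCorpus. Below is a minimal sketch of how such subsets could be assembled with the Hugging Face `datasets` library; the 1% slices, the concatenation, and the 90/10 train/test split are illustrative assumptions, since the paper does not report the subset size or the split.

```python
# Hypothetical loading sketch; the subset size and split are NOT specified in the paper,
# so the slices and the 90/10 split below are placeholder assumptions.
from datasets import load_dataset, concatenate_datasets

wiki = load_dataset("wikipedia", "20220301.en", split="train[:1%]")   # assumed subset size
books = load_dataset("bookcorpus", split="train[:1%]")                # assumed subset size

# Keep only the raw text column so the two corpora can be concatenated.
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])
books = books.remove_columns([c for c in books.column_names if c != "text"])

corpus = concatenate_datasets([wiki, books]).shuffle(seed=0)
splits = corpus.train_test_split(test_size=0.1, seed=0)               # assumed 90/10 split
train_ds, test_ds = splits["train"], splits["test"]
```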
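The Experiment Setup row specifies Adam with betas (0.9, 0.999), no weight decay, a 4000-step learning-rate warmup, and a peak learning rate swept over (0.0001, 0.0005, 0.001, 0.005) at batch size 32. A minimal PyTorch sketch of that optimizer configuration follows, assuming a linear warmup that then holds the rate constant; the paper only states the warmup length, not the schedule after it.

```python
# Optimizer configuration sketch based on the quoted setup; the post-warmup schedule
# (constant here) is an assumption.
import torch

def build_optimizer(model: torch.nn.Module, peak_lr: float = 5e-4, warmup_steps: int = 4000):
    optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr,
                                 betas=(0.9, 0.999), weight_decay=0.0)
    # Linearly ramp the learning rate from 0 to peak_lr over `warmup_steps`, then hold it.
    schedule = lambda step: min(1.0, (step + 1) / warmup_steps)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=schedule)
    return optimizer, scheduler
```

In a sweep, `peak_lr` would take each value in the quoted grid; the batch size of 32 sequences would be set in the data loader rather than in the optimizer.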
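The same row also describes adding scalar multipliers γ1 (identity term) and γ2 (centering term) to the shaped attention, initialized to 1, with τ0 = 1. The sketch below is one plausible reading of that description: it treats γ1 and γ2 as trainable scalars (one of the two alternatives the paper mentions) and assumes the softmax temperature enters as τ = τ0·√(n_k); both choices are assumptions, not the authors' verified implementation.

```python
# Illustrative sketch of shaped attention with gamma1/gamma2 multipliers:
#   gamma1 * I + Softmax(scores) - gamma2 * (1/T) * 11^T
# Temperature placement and trainable gammas are assumptions, not the authors' exact code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, gamma_init: float = 1.0, tau0: float = 1.0):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # gamma1 scales the identity term, gamma2 the uniform centering term;
        # both start at 1 as in the quoted setup and are trained here (assumed alternative).
        self.gamma1 = nn.Parameter(torch.tensor(float(gamma_init)))
        self.gamma2 = nn.Parameter(torch.tensor(float(gamma_init)))
        self.tau0 = tau0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        # Softmax attention with an assumed width-dependent temperature tau0 * sqrt(n_k).
        scores = q @ k.transpose(-2, -1) / (self.tau0 * math.sqrt(self.d_head))
        attn = F.softmax(scores, dim=-1)

        # Shaped attention: gamma1 * I + Softmax(...) - gamma2 * (1/T) * 11^T.
        eye = torch.eye(T, device=x.device, dtype=x.dtype)
        center = torch.full((T, T), 1.0 / T, device=x.device, dtype=x.dtype)
        shaped = self.gamma1 * eye + attn - self.gamma2 * center

        y = (shaped @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(y)
```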