On the Convergence of Encoder-only Shallow Transformers

Authors: Yongtao Wu, Fanghui Liu, Grigorios Chrysos, Volkan Cevher

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments are organized as follows: In Section 5.1, we conduct experiments with the model Eq. (1.2) on synthetic data and study the training dynamics. Next, we show convergence results on ViT [Dosovitskiy et al., 2021] on the standard MNIST dataset in Section 5.2.
Researcher Affiliation | Academia | Yongtao Wu, LIONS, EPFL, yongtao.wu@epfl.ch; Fanghui Liu, University of Warwick, fanghui.liu@warwick.ac.uk; Grigorios G. Chrysos, LIONS, EPFL / University of Wisconsin-Madison, chrysos@wisc.edu; Volkan Cevher, LIONS, EPFL, volkan.cevher@epfl.ch
Pseudocode | Yes | Algorithm 1: Gradient descent training (see the first sketch after the table)
Open Source Code | No | The paper does not provide any explicit statement or link indicating that its source code is open or publicly available.
Open Datasets | Yes | Here we experimentally validate this assumption under a standard language IMDB dataset [Maas et al., 2011]... Next, we show convergence results on ViT [Dosovitskiy et al., 2021] on the standard MNIST dataset [LeCun et al., 1998]
Dataset Splits | No | The paper mentions generating 100 data points for synthetic data and using the MNIST dataset, but it does not specify explicit training, validation, or test split percentages or sample counts in the main text.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | We apply gradient descent on the shallow Transformer defined in Eq. (1.2) with LeCun initialization and τ0 = d_m^{1/2} for 400 epochs with a fixed step size γ = 1. We test different widths of the network, including d_m ∈ {10, 100, 1000, 4000}... The dimension d is 64. We change the dimension of the query, key, and value from 16 to 1024 and 16384. The network is optimized with SGD with step size 0.1 and momentum 0.9 for 50 epochs. (Hedged sketches of both setups follow the table.)
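
For readers who want a concrete picture of Algorithm 1 ("Gradient descent training") and the Section 5.1 run, the following is a minimal sketch, assuming Eq. (1.2) is a single-head softmax self-attention layer of width d_m with a linear readout trained by full-batch gradient descent on the squared loss. The exact model, its τ0 normalization, the input dimension, and the sequence length are assumptions, not the authors' implementation; only n = 100 points, 400 epochs, the step size γ = 1, and the tested widths come from the setup row.

```python
# Hedged sketch of full-batch gradient descent training on an assumed shallow
# single-head attention model; not the paper's exact Eq. (1.2).
import torch

def init_params(d, d_m):
    # LeCun-style initialization (Gaussian entries, variance 1/fan_in) -- an assumption.
    params = [torch.randn(d, d_m) / d ** 0.5,    # W_Q
              torch.randn(d, d_m) / d ** 0.5,    # W_K
              torch.randn(d, d_m) / d ** 0.5,    # W_V
              torch.randn(d_m) / d_m ** 0.5]     # readout w
    for p in params:
        p.requires_grad_(True)
    return params

def forward(X, params):
    WQ, WK, WV, w = params
    Q, K, V = X @ WQ, X @ WK, X @ WV
    A = torch.softmax(Q @ K.transpose(-1, -2) / WQ.shape[1] ** 0.5, dim=-1)
    return (A @ V).mean(dim=1) @ w               # one scalar output per sequence

# Reported: n = 100 synthetic points, 400 epochs, step size gamma = 1,
# widths d_m in {10, 100, 1000, 4000}. Input dim d and seq_len are assumptions.
n, seq_len, d, d_m = 100, 8, 16, 100
gamma = 1.0  # reported step size; may need adjustment under this assumed normalization
X, y = torch.randn(n, seq_len, d), torch.randn(n)
params = init_params(d, d_m)
for epoch in range(400):
    loss = 0.5 * ((forward(X, params) - y) ** 2).mean()   # full-batch squared loss
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= gamma * g                                 # plain gradient descent step
```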
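The Section 5.2 ViT-on-MNIST run could look roughly like the sketch below, assuming a single attention block with embedding dimension 64 and a separately sized query/key/value projection; the patching scheme, network depth, and batch size are assumptions, while the optimizer settings (SGD, step size 0.1, momentum 0.9, 50 epochs) and the swept Q/K/V dimension come from the setup row.

```python
# Hedged sketch of a ViT-style classifier on MNIST with the quoted SGD settings;
# architecture details beyond the quoted hyperparameters are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class TinyViT(nn.Module):
    def __init__(self, d=64, d_qkv=1024, patch=7, num_classes=10):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(patch * patch, d)   # 16 patches of 7x7 for 28x28 MNIST
        self.wq, self.wk, self.wv = (nn.Linear(d, d_qkv) for _ in range(3))
        self.proj = nn.Linear(d_qkv, d)
        self.head = nn.Linear(d, num_classes)

    def forward(self, x):                          # x: (B, 1, 28, 28)
        B = x.shape[0]
        patches = x.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        patches = patches.reshape(B, -1, self.patch * self.patch)   # (B, 16, 49)
        h = self.embed(patches)                    # (B, 16, d)
        q, k, v = self.wq(h), self.wk(h), self.wv(h)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        h = h + self.proj(attn @ v)                # single attention block with residual
        return self.head(h.mean(dim=1))            # pooled classification logits

train_set = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=128, shuffle=True)
model = TinyViT(d=64, d_qkv=1024)                  # d_qkv swept from 16 to 1024 and 16384 in the paper
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(50):
    for xb, yb in loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
```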