On the Convergence of Encoder-only Shallow Transformers

Authors: Yongtao Wu, Fanghui Liu, Grigorios Chrysos, Volkan Cevher

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments are organized as follows: In Section 5.1, we conduct experiments with the model Eq. (1.2) on synthetic data and study the training dynamics. Next, we show convergence results on ViT [Dosovitskiy et al., 2021] on the standard MNIST dataset in Section 5.2.
Researcher Affiliation | Academia | Yongtao Wu, LIONS, EPFL, yongtao.wu@epfl.ch; Fanghui Liu, University of Warwick, fanghui.liu@warwick.ac.uk; Grigorios G. Chrysos, LIONS, EPFL / University of Wisconsin-Madison, chrysos@wisc.edu; Volkan Cevher, LIONS, EPFL, volkan.cevher@epfl.ch
Pseudocode | Yes | Algorithm 1: Gradient descent training (see the first sketch after the table)
Open Source Code | No | The paper does not provide any explicit statement or link indicating that its source code is open or publicly available.
Open Datasets | Yes | Here we experimentally validate this assumption under a standard language IMDB dataset [Maas et al., 2011]... Next, we show convergence results on ViT [Dosovitskiy et al., 2021] on the standard MNIST dataset [LeCun et al., 1998]
Dataset Splits | No | The paper mentions generating 100 data points for synthetic data and using the MNIST dataset, but it does not specify explicit training, validation, or test split percentages or sample counts in the main text.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | We apply gradient descent on the shallow Transformer defined in Eq. (1.2) with LeCun initialization and τ0 = d_m^{1/2} for 400 epochs with a fixed step size γ = 1. We test different widths of the network, including d_m ∈ {10, 100, 1000, 4000}... The dimension d is 64. We change the dimension of the query, key, and value from 16 to 1024 and 16384. The network is optimized with SGD with step size 0.1 and momentum 0.9 for 50 epochs. (Hedged sketches of both setups follow the table.)
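
For readers who want a concrete picture of Algorithm 1 ("Gradient descent training") and the Section 5.1 run, the following is a minimal sketch, assuming Eq. (1.2) is a single-head softmax self-attention layer of width d_m with a linear readout trained by full-batch gradient descent on the squared loss. The exact model, its τ0 normalization, the input dimension, and the sequence length are assumptions, not the authors' implementation; only n = 100 points, 400 epochs, the step size γ = 1, and the tested widths come from the setup row.

```python
# Hedged sketch of full-batch gradient descent training on an assumed shallow
# single-head attention model; not the paper's exact Eq. (1.2).
import torch

def init_params(d, d_m):
    # LeCun-style initialization (Gaussian entries, variance 1/fan_in) -- an assumption.
    params = [torch.randn(d, d_m) / d ** 0.5,    # W_Q
              torch.randn(d, d_m) / d ** 0.5,    # W_K
              torch.randn(d, d_m) / d ** 0.5,    # W_V
              torch.randn(d_m) / d_m ** 0.5]     # readout w
    for p in params:
        p.requires_grad_(True)
    return params

def forward(X, params):
    WQ, WK, WV, w = params
    Q, K, V = X @ WQ, X @ WK, X @ WV
    A = torch.softmax(Q @ K.transpose(-1, -2) / WQ.shape[1] ** 0.5, dim=-1)
    return (A @ V).mean(dim=1) @ w               # one scalar output per sequence

# Reported: n = 100 synthetic points, 400 epochs, step size gamma = 1,
# widths d_m in {10, 100, 1000, 4000}. Input dim d and seq_len are assumptions.
n, seq_len, d, d_m = 100, 8, 16, 100
gamma = 1.0  # reported step size; may need adjustment under this assumed normalization
X, y = torch.randn(n, seq_len, d), torch.randn(n)
params = init_params(d, d_m)
for epoch in range(400):
    loss = 0.5 * ((forward(X, params) - y) ** 2).mean()   # full-batch squared loss
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= gamma * g                                 # plain gradient descent step
```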
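The Section 5.2 ViT-on-MNIST run could look roughly like the sketch below, assuming a single attention block with embedding dimension 64 and a separately sized query/key/value projection; the patching scheme, network depth, and batch size are assumptions, while the optimizer settings (SGD, step size 0.1, momentum 0.9, 50 epochs) and the swept Q/K/V dimension come from the setup row.

```python
# Hedged sketch of a ViT-style classifier on MNIST with the quoted SGD settings;
# architecture details beyond the quoted hyperparameters are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class TinyViT(nn.Module):
    def __init__(self, d=64, d_qkv=1024, patch=7, num_classes=10):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(patch * patch, d)   # 16 patches of 7x7 for 28x28 MNIST
        self.wq, self.wk, self.wv = (nn.Linear(d, d_qkv) for _ in range(3))
        self.proj = nn.Linear(d_qkv, d)
        self.head = nn.Linear(d, num_classes)

    def forward(self, x):                          # x: (B, 1, 28, 28)
        B = x.shape[0]
        patches = x.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        patches = patches.reshape(B, -1, self.patch * self.patch)   # (B, 16, 49)
        h = self.embed(patches)                    # (B, 16, d)
        q, k, v = self.wq(h), self.wk(h), self.wv(h)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        h = h + self.proj(attn @ v)                # single attention block with residual
        return self.head(h.mean(dim=1))            # pooled classification logits

train_set = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=128, shuffle=True)
model = TinyViT(d=64, d_qkv=1024)                  # d_qkv swept from 16 to 1024 and 16384 in the paper
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(50):
    for xb, yb in loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
```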