Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot

Authors: Zixuan Wang, Stanley Wei, Daniel Hsu, Jason D. Lee

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide empirical simulations to justify our theoretical findings. ... 4. Experiments: In this section, we describe our experimental setup on synthetic data, which numerically justifies our theoretical guarantees for convergence. In addition, we devise several length generalization tasks for our model, in which we are able to highlight the benefits of our stochastic architecture.
Researcher Affiliation | Academia | (1) Department of Electrical and Computer Engineering, Princeton University, NJ, USA; (2) Department of Computer Science, Columbia University, NY, USA.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the described methodology.
Open Datasets | No | The paper uses "synthetic data" generated from a specified distribution ("X is sampled from standard Gaussian distribution, and the q-sparse subset y containing all the averaging indices is uniformly sampled from all q-subsets of [T]") but does not provide concrete access information (link, DOI, repository, or formal citation to an established benchmark) for a publicly available dataset. (A sketch of this sampling procedure is given in the first code block after this table.)
Dataset Splits | No | The paper describes generating data on-the-fly ("resampling a fresh batch of n = 256 datapoints (X, y) at each iteration") and fixing a validation set for OOD tasks ("fix before training a validation set of n_test = 128 out-of-distribution datapoints"), but it does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning from a static dataset.
Hardware Specification | Yes | For all of our experiments, we use PyTorch (Paszke et al., 2019), run on NVIDIA RTX A6000s.
Software Dependencies | No | The paper mentions using "PyTorch (Paszke et al., 2019)" but does not provide specific version numbers for PyTorch or any other software dependencies needed to replicate the experiment.
Experiment Setup | Yes | In particular, we choose T = 200 for our sequence length, q = 3, d = 5, and d_e = 170. In addition, to simulate the population loss training, we train using online stochastic gradient descent (SGD) by resampling a fresh batch of n = 256 datapoints (X, y) at each iteration to use for our gradient estimate. ... When we attend [Z, z_query] and train with GD, we run with η = 1, then annealing to η = 1/3 at iteration 50000. We run until iteration 100000. (A training-loop sketch using these hyperparameters is given in the second code block after this table.)
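
The sampling procedure quoted in the Open Datasets and Dataset Splits rows is specific enough to sketch in code. Since the paper releases no implementation, the following is a hypothetical PyTorch reconstruction: the function name, the tensor layout, and the use of the selected-token average as the regression target are our assumptions, not the authors' code.

```python
import torch

def sample_batch(n, T, d, q, generator=None):
    """Hypothetical sketch of the q-sparse token-averaging data described in
    the paper: each X is a length-T sequence of d-dimensional tokens drawn
    from a standard Gaussian, and y is a q-subset of [T] sampled uniformly
    from all q-subsets (the indices to be averaged)."""
    X = torch.randn(n, T, d, generator=generator)  # tokens ~ N(0, I_d)
    # A uniformly random q-subset per example: first q entries of a random
    # permutation of [T].
    y = torch.stack([torch.randperm(T, generator=generator)[:q] for _ in range(n)])
    # Assumed regression target: the average of the q selected tokens.
    target = X.gather(1, y.unsqueeze(-1).expand(-1, -1, d)).mean(dim=1)
    return X, y, target

# Protocol from the paper: resample a fresh batch of n = 256 datapoints at
# every SGD iteration; fix a validation set of n_test = 128 datapoints once
# before training (in the paper these are out-of-distribution, e.g. for the
# length-generalization tasks; here we only illustrate fixing a held-out set).
X_val, y_val, target_val = sample_batch(n=128, T=200, d=5, q=3)
```

Training on a fresh batch at every step is how the paper simulates gradient descent on the population loss, which is the regime its convergence guarantees address.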
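
The hyperparameters quoted in the Experiment Setup row can likewise be wired into an online-SGD loop with the stated learning-rate schedule. The sketch below reuses `sample_batch` from the previous block; `OneLayerAttention` is a stand-in of our own design (the paper's exact query-token and positional-encoding parameterization may differ), so it is illustrative only.

```python
import torch

class OneLayerAttention(torch.nn.Module):
    """Minimal single-head attention stand-in (not the authors' exact model):
    the target subset y is encoded as a multi-hot vector over positions and
    projected to a query; keys are learned positional embeddings; the output
    is the attention-weighted average of the input tokens."""
    def __init__(self, T, d_e):
        super().__init__()
        self.pos_keys = torch.nn.Parameter(torch.randn(T, d_e) / d_e ** 0.5)
        self.query_proj = torch.nn.Linear(T, d_e, bias=False)

    def forward(self, X, y):
        multi_hot = torch.zeros(X.shape[0], X.shape[1]).scatter_(1, y, 1.0)
        scores = self.query_proj(multi_hot) @ self.pos_keys.T   # (n, T)
        attn = torch.softmax(scores, dim=-1)                    # attention over positions
        return torch.einsum("nt,ntd->nd", attn, X)              # weighted token average

# Hyperparameters quoted in the Experiment Setup row.
T, q, d, d_e = 200, 3, 5, 170
batch_size, total_iters, anneal_iter = 256, 100_000, 50_000

model = OneLayerAttention(T=T, d_e=d_e)
opt = torch.optim.SGD(model.parameters(), lr=1.0)               # eta = 1

for it in range(total_iters):
    if it == anneal_iter:                                       # anneal eta to 1/3
        for group in opt.param_groups:
            group["lr"] = 1.0 / 3.0
    X, y, target = sample_batch(batch_size, T=T, d=d, q=q)      # fresh batch each step
    loss = torch.nn.functional.mse_loss(model(X, y), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Only the schedule (η = 1 for the first 50,000 iterations, then η = 1/3 up to iteration 100,000) and the fresh-batch protocol are taken from the paper; everything about the model itself is a placeholder.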