Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
Authors: Zixuan Wang, Stanley Wei, Daniel Hsu, Jason D. Lee
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide empirical simulations to justify our theoretical findings. ... [Section 4, Experiments] In this section, we describe our experimental setup on synthetic data, which numerically justifies our theoretical guarantees for convergence. In addition, we devise several length generalization tasks for our model, in which we are able to highlight the benefits of our stochastic architecture. |
| Researcher Affiliation | Academia | 1Department of Electrical and Computer Engineering, Princeton University, NJ, USA 2Department of Computer Science, Columbia University, NY, USA. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper. |
| Open Datasets | No | The paper uses "synthetic data" generated from a specified distribution ("X is sampled from standard Gaussian distribution, and the q-sparse subset y containing all the averaging indices is uniformly sampled from all q-subsets of [T]") but does not provide concrete access information (link, DOI, repository, or formal citation to an established benchmark) for a publicly available dataset. (See the data-generation sketch below the table.) |
| Dataset Splits | No | The paper describes generating data on-the-fly ("resampling a fresh batch of n = 256 datapoints (X, y) at each iteration") and fixing a validation set for OOD tasks ("fix before training a validation set of ntest = 128 out-of-distribution datapoints"), but it does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning from a static dataset. |
| Hardware Specification | Yes | For all of our experiments, we use PyTorch (Paszke et al., 2019), run on NVIDIA RTX A6000s. |
| Software Dependencies | No | The paper mentions using "PyTorch (Paszke et al., 2019)" but does not provide specific version numbers for PyTorch or any other software dependencies needed to replicate the experiment. |
| Experiment Setup | Yes | In particular, we choose T = 200 for our sequence length, q = 3, d = 5, and d_e = 170. In addition, to simulate the population loss training, we train using online stochastic gradient descent (SGD) by resampling a fresh batch of n = 256 datapoints (X, y) at each iteration to use for our gradient estimate. ... When we attend to [Z, z_query] and train with GD, we run with η = 1, annealing to η = 1/3 at iteration 50000. We run until iteration 100000. (See the training-loop sketch below the table.) |
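
The Open Datasets row describes the synthetic distribution fully enough to regenerate it. The following is a minimal sketch, not the authors' code: it assumes the regression target is the average of the q selected token vectors (as suggested by "averaging indices"), and the function name `sample_batch` and its signature are placeholders.

```python
import torch

# Hedged sketch of the synthetic q-sparse averaging task described in the paper:
# X is a T x d matrix of i.i.d. standard Gaussians, and y is a uniformly random
# q-subset of [T]. We assume (not confirmed by the table) that the target is the
# average of the q selected rows of X.
def sample_batch(n=256, T=200, d=5, q=3, generator=None):
    X = torch.randn(n, T, d, generator=generator)  # tokens ~ N(0, I)
    # For each example, draw a uniformly random q-subset of [T].
    idx = torch.stack([torch.randperm(T, generator=generator)[:q] for _ in range(n)])
    # Gather the selected tokens and average them to form the target.
    selected = torch.gather(X, 1, idx.unsqueeze(-1).expand(n, q, d))
    target = selected.mean(dim=1)                  # shape (n, d)
    return X, idx, target
```

Because every batch is a fresh draw from this distribution, there is no static train/test split to reproduce, which matches the Dataset Splits row; a fixed out-of-distribution validation set (e.g., n_test = 128 examples with a longer sequence length) could be drawn once with the same sampler before training.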
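The Experiment Setup row fixes the optimization schedule (online SGD, batch size n = 256, η = 1 annealed to 1/3 at iteration 50000, 100000 iterations total) but not the model code. The sketch below outlines only that schedule under stated assumptions: `model` is a placeholder for whatever attention module is trained on [Z, z_query], its call signature `model(X, idx)` is invented for illustration, and the MSE loss is an assumption rather than something stated in the table.

```python
import torch

def train(model, sample_batch, total_iters=100_000, anneal_at=50_000):
    # Online SGD: a fresh batch of n = 256 datapoints is drawn at every iteration,
    # approximating population-loss training as described in the paper.
    opt = torch.optim.SGD(model.parameters(), lr=1.0)
    loss_fn = torch.nn.MSELoss()            # assumed loss; not stated in the table
    for it in range(total_iters):
        if it == anneal_at:                 # anneal eta: 1 -> 1/3 at iteration 50000
            for group in opt.param_groups:
                group["lr"] = 1.0 / 3.0
        X, idx, target = sample_batch(n=256, T=200, d=5, q=3)
        pred = model(X, idx)                # placeholder signature for the model
        loss = loss_fn(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

This is a reconstruction of the reported schedule only; dimensions such as d_e = 170 belong to the (unreleased) model definition and are not pinned down here.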