Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context

Authors: Xiang Cheng, Yuxin Chen, Suvrit Sra

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To experimentally verify Proposition 3.4, we compare the performance of different choices of h against different choices of generating kernel K. We present our findings in Figures 1 and 2.
Researcher Affiliation | Academia | 1) Massachusetts Institute of Technology; 2) University of California, Davis; 3) Technical University of Munich.
Pseudocode | No | The paper describes algorithms and derivations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any statement about releasing open-source code or a link to a code repository.
Open Datasets | No | The covariates x^(i) are drawn i.i.d. from the unit sphere, and the labels y^(i) are drawn from one of the three K-Gaussian processes. We consider three choices of kernels: K_linear(u, v) = ⟨u, v⟩, K_relu(u, v) = relu(⟨u, v⟩), and K_exp(u, v) = exp(⟨u, v⟩) (as defined in (11)). (A sampling sketch follows this table.)
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits or a cross-validation setup.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions using ADAM for training but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | Training Algorithm: We train the Transformer using ADAM with gradient clipping. Each gradient step is computed from a minibatch of size 30000, and we resample the minibatch every 10 steps. All plots are averaged over 3 runs with different U (i.e., Σ) sampled each time, and different seeds for sampling training data. (A training-loop sketch follows this table.)
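The data-generation description in the "Open Datasets" row is concrete enough to sketch. The following Python/NumPy sketch samples covariates uniformly from the unit sphere and draws labels from a zero-mean Gaussian process whose covariance comes from one of the three generating kernels. It is an illustrative reconstruction, not the authors' code (which is not released): the function names (sample_unit_sphere, sample_task, etc.) and the eigenvalue-clipping safeguard are assumptions.

```python
# Illustrative sketch (not the authors' code) of the data generation in the
# "Open Datasets" row: covariates drawn i.i.d. from the unit sphere, labels
# drawn from a zero-mean Gaussian process with one of the generating kernels.
import numpy as np

def sample_unit_sphere(n_points, dim, rng):
    """Draw n_points covariates i.i.d. uniformly from the unit sphere in R^dim."""
    x = rng.standard_normal((n_points, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# The three generating kernels, applied elementwise to the Gram matrix of
# inner products <x_i, x_j>.
KERNELS = {
    "linear": lambda g: g,
    "relu": lambda g: np.maximum(g, 0.0),
    "exp": lambda g: np.exp(g),
}

def sample_task(n_points, dim, kernel, rng, jitter=1e-8):
    """Sample one in-context task: covariates x and GP-distributed labels y ~ N(0, K)."""
    x = sample_unit_sphere(n_points, dim, rng)
    cov = KERNELS[kernel](x @ x.T)                    # K(x_i, x_j)
    # Eigenvalue clipping and jitter are numerical safeguards added in this
    # sketch; they are not steps described in the paper.
    eigvals, eigvecs = np.linalg.eigh(cov)
    sqrt_cov = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None) + jitter)
    y = sqrt_cov @ rng.standard_normal(n_points)
    return x, y

rng = np.random.default_rng(0)
x, y = sample_task(n_points=20, dim=5, kernel="relu", rng=rng)
```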
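Similarly, the "Experiment Setup" row pins down a few training details: Adam with gradient clipping, minibatches of 30000 prompts, and a fresh minibatch resampled every 10 steps. The PyTorch sketch below follows those settings, but the architecture, prompt encoding, learning rate, clipping threshold, and step count are placeholder assumptions; only the optimizer, batch size, clipping, and resampling schedule come from the paper.

```python
# Illustrative training-loop sketch for the "Experiment Setup" row. Only the
# optimizer (Adam), gradient clipping, minibatch size (30000), and the
# resample-every-10-steps schedule are taken from the paper; everything else
# is a placeholder assumption.
import numpy as np
import torch
import torch.nn as nn

def make_prompt_batch(batch_size, n_points, dim, kernel, rng):
    """Build a batch of in-context prompts using sample_task from the sketch above."""
    tokens, targets = [], []
    for _ in range(batch_size):
        x, y = sample_task(n_points, dim, kernel, rng)
        tok = np.column_stack([x, y])        # each token is an (x_i, y_i) pair
        tok[-1, -1] = 0.0                    # hide the query label from the model
        tokens.append(tok)
        targets.append(y[-1])                # target: label of the query point
    return (torch.tensor(np.stack(tokens), dtype=torch.float32),
            torch.tensor(np.array(targets), dtype=torch.float32))

dim, n_points = 5, 20
encoder = nn.TransformerEncoder(             # stand-in for the paper's architecture
    nn.TransformerEncoderLayer(d_model=dim + 1, nhead=1, batch_first=True),
    num_layers=3,
)
readout = nn.Linear(dim + 1, 1)
params = list(encoder.parameters()) + list(readout.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

rng = np.random.default_rng(0)
batch_size = 30000                           # minibatch size reported in the paper; shrink for a quick local run
for step in range(1000):
    if step % 10 == 0:                       # resample the minibatch every 10 steps
        tokens, targets = make_prompt_batch(batch_size, n_points, dim, "relu", rng)
    pred = readout(encoder(tokens)[:, -1, :]).squeeze(-1)
    loss = ((pred - targets) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # gradient clipping
    optimizer.step()
```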