Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context
Authors: Xiang Cheng, Yuxin Chen, Suvrit Sra
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To experimentally verify Proposition 3.4, we compare the performance of different choices of h against different choices of generating kernel K. We present our findings in Figures 1 and 2. |
| Researcher Affiliation | Academia | ¹Massachusetts Institute of Technology, ²University of California, Davis, ³Technical University of Munich. |
| Pseudocode | No | The paper describes algorithms and derivations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statement about releasing open-source code or links to a code repository. |
| Open Datasets | No | The covariates $x^{(i)}$ are drawn iid from the unit sphere, and the labels $y^{(i)}$ are drawn from one of the three $K$-Gaussian processes. We consider three choices of kernels: $K_{\mathrm{linear}}(u, v) = \langle u, v \rangle$, $K_{\mathrm{relu}}(u, v) = \mathrm{relu}(\langle u, v \rangle)$, and $K_{\exp}(u, v) = \exp(\langle u, v \rangle)$ (as defined in (11)). See the illustrative data-generation sketch below the table. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits or cross-validation setup. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running experiments. |
| Software Dependencies | No | The paper mentions using ADAM for training but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | Training Algorithm: We train the Transformer using ADAM with gradient clipping. Each gradient step is computed from a minibatch of size 30000, and we resample the minibatch every 10 steps. All plots are averaged over 3 runs, with a different $U$ (i.e. $\Sigma$) sampled each time and different seeds for sampling the training data. See the illustrative training-loop sketch below the table. |
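
The data-generation setup quoted above (unit-sphere covariates, labels drawn jointly from a kernel Gaussian process) can be summarized in a minimal sketch. This is an illustration under stated assumptions, not the authors' code: the zero-mean GP sampling, the eigenvalue clipping, and the NumPy implementation are choices made here for concreteness.

```python
import numpy as np

def sample_task(n_points, dim, kernel="relu", rng=None):
    """Sample one in-context task: covariates on the unit sphere,
    labels drawn from a zero-mean K-Gaussian process (assumption:
    joint draw with covariance matrix K(x_i, x_j))."""
    rng = np.random.default_rng() if rng is None else rng

    # Covariates x^(i): iid uniform on the unit sphere in R^dim.
    X = rng.standard_normal((n_points, dim))
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    # Gram matrix of inner products <x_i, x_j> under the chosen kernel.
    G = X @ X.T
    if kernel == "linear":
        K = G
    elif kernel == "relu":
        K = np.maximum(G, 0.0)
    elif kernel == "exp":
        K = np.exp(G)
    else:
        raise ValueError(f"unknown kernel: {kernel}")

    # Labels y^(i) ~ N(0, K).  Eigen-decomposition with clipped
    # eigenvalues keeps the sampling numerically safe.
    w, V = np.linalg.eigh(K)
    A = V * np.sqrt(np.clip(w, 0.0, None))
    y = A @ rng.standard_normal(n_points)
    return X, y
```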
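
Similarly, a minimal training-loop sketch matching the quoted experiment setup might look as follows. Only the optimizer (ADAM), gradient clipping, the minibatch size of 30000, and the resampling interval of 10 steps come from the quoted text; PyTorch, the learning rate, step count, clipping norm, MSE loss, and the hypothetical `sample_minibatch` helper are assumptions.

```python
import torch

def train(model, sample_minibatch, steps=10_000, lr=1e-3,
          batch_size=30_000, resample_every=10, clip_norm=1.0):
    """Adam with gradient clipping; the minibatch of 30000 tasks is
    resampled every 10 steps.  lr, steps, clip_norm, and the use of
    norm-based clipping are illustrative, not from the paper."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    batch = None
    for step in range(steps):
        if step % resample_every == 0:
            # Fresh minibatch of in-context regression tasks
            # (hypothetical helper returning (prompts, targets)).
            batch = sample_minibatch(batch_size)
        prompts, targets = batch
        preds = model(prompts)
        loss = torch.nn.functional.mse_loss(preds, targets)
        opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        opt.step()
    return model
```

Reusing each minibatch for 10 consecutive steps, as described in the quote, amortizes the cost of regenerating 30000 Gaussian-process tasks per gradient step.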