Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context
Authors: Xiang Cheng, Yuxin Chen, Suvrit Sra
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To experimentally verify Proposition 3.4, we compare the performance of different choices of h against different choices of generating kernel K. We present our findings in Figures 1 and 2. |
| Researcher Affiliation | Academia | ¹Massachusetts Institute of Technology; ²University of California, Davis; ³Technical University of Munich. |
| Pseudocode | No | The paper describes algorithms and derivations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statement about releasing open-source code or links to a code repository. |
| Open Datasets | No | The covariates $x^{(i)}$ are drawn iid from the unit sphere, and the labels $y^{(i)}$ are drawn from one of the three $K$-Gaussian Processes. We consider three choices of kernels: $K_{\text{linear}}(u, v) = \langle u, v \rangle$, $K_{\text{relu}}(u, v) = \mathrm{relu}(\langle u, v \rangle)$, and $K_{\exp}(u, v) = \exp(\langle u, v \rangle)$ (as defined in (11)). |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits or cross-validation setup. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running experiments. |
| Software Dependencies | No | The paper mentions using ADAM for training but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | Training Algorithm: We train the Transformer using ADAM with gradient clipping. Each gradient step is computed from a minibatch of size 30000, and we resample the minibatch every 10 steps. All plots are averaged over 3 runs with different U (i.e. Σ) sampled each time, and different seeds for sampling training data. |
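
Because the data is synthetic rather than a released dataset, it may help to see roughly how such data could be generated. The following is a minimal sketch of the process quoted in the Open Datasets row, not the authors' code: covariates are drawn from the unit sphere by normalizing Gaussian vectors, and labels are drawn from a zero-mean Gaussian process through a Cholesky factor of the kernel Gram matrix. All function names, the sizes `n=8` and `d=3`, the seed, and the diagonal `jitter` are assumptions made for illustration. Cholesky sampling needs a positive-definite Gram matrix, so the sketch samples only with the exponential kernel; how the paper constructs its three processes (its definition (11)) is not reproduced here.

```python
import math
import random


def sample_unit_sphere(n, d, rng):
    """Draw n points uniformly from the unit sphere in R^d (normalized Gaussians)."""
    points = []
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in g))
        points.append([c / norm for c in g])
    return points


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# The three kernels named in the report.
def k_linear(u, v):
    return dot(u, v)


def k_relu(u, v):
    return max(dot(u, v), 0.0)


def k_exp(u, v):
    return math.exp(dot(u, v))


def gram_matrix(kernel, xs, jitter=1e-8):
    """Kernel Gram matrix with a small diagonal jitter (an assumption, for stability)."""
    n = len(xs)
    return [
        [kernel(xs[i], xs[j]) + (jitter if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def cholesky(a):
    """Lower-triangular factor low with low * low^T = a, for positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_gp_labels(kernel, xs, rng):
    """Draw y ~ N(0, K) by computing low * z with z standard normal."""
    low = cholesky(gram_matrix(kernel, xs))
    z = [rng.gauss(0.0, 1.0) for _ in xs]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(xs))]


rng = random.Random(0)  # hypothetical seed
xs = sample_unit_sphere(n=8, d=3, rng=rng)
ys = sample_gp_labels(k_exp, xs, rng)
```

Each run of the experiment would presumably redraw both `xs` and `ys` with a fresh seed, in line with the report's note that plots are averaged over 3 runs with different seeds for the training data.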