Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Nonlinear transformers can perform inference-time feature learning
Authors: Naoki Nishikawa, Yujin Song, Kazusato Oko, Denny Wu, Taiji Suzuki
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that transformers pretrained by gradient-based optimization can perform inference-time feature learning, i.e., extract information of the target features β solely from test prompts... We conduct numerical experiments on synthetic data to compare the in-context learning algorithm implemented by nonlinear transformers against non-adaptive kernel methods. |
| Researcher Affiliation | Academia | ¹The University of Tokyo, Tokyo, Japan; ²RIKEN AIP, Tokyo, Japan; ³University of California, Berkeley; ⁴New York University; ⁵Flatiron Institute. Correspondence to: Naoki Nishikawa <EMAIL>, Yujin Song <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Gradient-based training of transformer. Input: Learning rate η₁, η₂, regularization rate λ₁, λ₂, initialization scale α, temperature ρ |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. It mentions using a '6-layer GPT-2 model', which is a third-party tool. |
| Open Datasets | No | We conduct numerical experiments on synthetic data to compare the in-context learning algorithm... For each test task $t$, we generate data as $x^t_1, \dots, x^t_{N_{\mathrm{test}}}, x \sim \mathcal{N}(0, I_d)$, $\beta^t \sim \mathrm{Unif}(\mathbb{S}^{d-1})$... |
| Dataset Splits | No | For each test task $t$, we generate data as $x^t_1, \dots, x^t_{N_{\mathrm{test}}}, x \sim \mathcal{N}(0, I_d)$, $\beta^t \sim \mathrm{Unif}(\mathbb{S}^{d-1})$ (i.e., $r = d$), with $y^t_i = \sigma(\langle \beta^t, x^t_i \rangle)$ for $i \in [N]$. We compare the performance of two approaches... The paper describes a process of generating synthetic data for each task, rather than using a fixed dataset with explicit train/test/validation splits. |
| Hardware Specification | No | The paper mentions training a '6-layer GPT-2 model' and performing 'numerical experiments' but does not specify any particular hardware details such as GPU models, CPU types, or memory. |
| Software Dependencies | No | We train a 6-layer GPT-2 model (Radford et al., 2019)... We pretrain the GPT-2 model using the Adam (Kingma & Ba, 2015) optimizer... The paper mentions software components like GPT-2 and Adam optimizer, but does not provide specific version numbers for these or other relevant libraries. |
| Experiment Setup | Yes | We pretrain the GPT-2 model using the Adam (Kingma & Ba, 2015) optimizer with learning rate 0.0001 on the mean-squared loss calculated over all the positions. |
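
The synthetic data generation quoted in the table can be sketched roughly as follows. This is a minimal illustration, not the authors' code (none was released): the function name `sample_task`, its parameters, and the default link function `tanh` are assumptions made here for illustration, since the quoted excerpts do not specify σ.

```python
import math
import random


def sample_task(d, n, link=math.tanh, seed=None):
    """Draw one synthetic task in the spirit of the quoted setup.

    x_i ~ N(0, I_d), beta ~ Unif(S^{d-1}), y_i = link(<beta, x_i>).
    `link` stands in for the paper's sigma; tanh is only a placeholder
    assumption, not the link function used in the paper.
    """
    rng = random.Random(seed)

    # A standard Gaussian vector scaled to unit norm is uniform on the sphere.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    beta = [v / norm for v in g]

    # n isotropic Gaussian inputs of dimension d.
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

    # Labels are the link function applied to the inner product <beta, x_i>.
    ys = [link(sum(b * x for b, x in zip(beta, row))) for row in xs]
    return beta, xs, ys


if __name__ == "__main__":
    beta, xs, ys = sample_task(d=8, n=5, seed=0)
    print(len(beta), len(xs), len(ys))
```

Because every task is freshly sampled from the same distribution, there is no fixed dataset to split, which is consistent with the "No" results for Open Datasets and Dataset Splits.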