Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Nonlinear transformers can perform inference-time feature learning

Authors: Naoki Nishikawa, Yujin Song, Kazusato Oko, Denny Wu, Taiji Suzuki

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove that transformers pretrained by gradient-based optimization can perform inference-time feature learning, i.e., extract information of the target features $\beta$ solely from test prompts... We conduct numerical experiments on synthetic data to compare the in-context learning algorithm implemented by nonlinear transformers against non-adaptive kernel methods.
Researcher Affiliation | Academia | (1) The University of Tokyo, Tokyo, Japan; (2) RIKEN AIP, Tokyo, Japan; (3) University of California, Berkeley; (4) New York University; (5) Flatiron Institute. Correspondence to: Naoki Nishikawa <EMAIL>, Yujin Song <EMAIL>.
Pseudocode | Yes | Algorithm 1: Gradient-based training of transformer. Input: learning rates $\eta_1, \eta_2$, regularization rates $\lambda_1, \lambda_2$, initialization scale $\alpha$, temperature $\rho$
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. It mentions using a '6-layer GPT-2 model', which is a third-party tool.
Open Datasets | No | We conduct numerical experiments on synthetic data to compare the in-context learning algorithm... For each test task $t$, we generate data as $x^t_1, \dots, x^t_{N_{\mathrm{test}}}, x \sim \mathcal{N}(0, I_d)$, $\beta^t \sim \mathrm{Unif}(S^{d-1})$...
Dataset Splits | No | For each test task $t$, we generate data as $x^t_1, \dots, x^t_{N_{\mathrm{test}}}, x \sim \mathcal{N}(0, I_d)$, $\beta^t \sim \mathrm{Unif}(S^{d-1})$ (i.e., $r = d$), with $y^t_i = \sigma(\langle \beta^t, x^t_i \rangle)$ for $i \in [N]$. We compare the performance of two approaches... The paper describes a process of generating synthetic data for each task, rather than using a fixed dataset with explicit train/test/validation splits.
Hardware Specification | No | The paper mentions training a '6-layer GPT-2 model' and performing 'numerical experiments' but does not specify any particular hardware details such as GPU models, CPU types, or memory.
Software Dependencies | No | We train a 6-layer GPT-2 model (Radford et al., 2019)... We pretrain the GPT-2 model using the Adam (Kingma & Ba, 2015) optimizer... The paper mentions software components like GPT-2 and Adam optimizer, but does not provide specific version numbers for these or other relevant libraries.
Experiment Setup | Yes | We pretrain the GPT-2 model using the Adam (Kingma & Ba, 2015) optimizer with learning rate 0.0001 on the mean-squared loss calculated over all the positions.
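
The synthetic data generation quoted in the Open Datasets and Dataset Splits rows can be illustrated with a minimal sketch. This is not the authors' code (none was released). The function name `sample_task`, its parameters, and the default link function `math.tanh` are assumptions made for illustration; the quoted text does not specify the link function σ, so a placeholder is used. The sketch draws Gaussian inputs, a direction β uniform on the unit sphere, and labels y_i = σ(⟨β, x_i⟩).

```python
import math
import random


def sample_task(d, n, link=math.tanh, rng=None):
    """Draw one synthetic single-index task (illustrative sketch only).

    Returns (beta, xs, ys) with x_i ~ N(0, I_d), beta ~ Unif(S^{d-1}),
    and y_i = link(<beta, x_i>). The default link is a placeholder
    assumption, not the link function used in the paper.
    """
    if rng is None:
        rng = random.Random()

    # A standard Gaussian vector divided by its norm is uniform on the sphere.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    beta = [v / norm for v in g]

    # Each input has d independent standard normal coordinates.
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

    # Labels depend on x only through the one-dimensional projection <beta, x>.
    ys = [link(sum(b * x for b, x in zip(beta, row))) for row in xs]
    return beta, xs, ys
```

Because a fresh β is drawn for every task, there is no fixed dataset to split, which is consistent with the "No" results recorded for Open Datasets and Dataset Splits.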