Pretrained Transformer Efficiently Learns Low-Dimensional Target Functions In-Context
Authors: Kazusato Oko, Yujin Song, Taiji Suzuki, Denny Wu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pretrain a GPT-2 model [RWC+19] (with the same configuration as the in-context linear regression setting in [GTLV22]) to learn the Gaussian single-index task (1.1) with a degree-3 link function, and compare its in-context sample complexity against baseline algorithms (see Section 4 for details). In Figure 1 we observe that the pretrained transformer achieves low prediction risk using fewer in-context examples than two baseline algorithms: kernel ridge regression, and a neural network trained by gradient descent. |
| Researcher Affiliation | Collaboration | Kazusato Oko (1,3), Yujin Song (2,3), Taiji Suzuki (2,3), Denny Wu (4,5); 1: University of California, Berkeley; 2: University of Tokyo; 3: RIKEN AIP; 4: New York University; 5: Flatiron Institute |
| Pseudocode | Yes | Algorithm 1: Gradient-based training of transformer with MLP layer |
| Open Source Code | No | The answer No means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). |
| Open Datasets | No | The paper describes generating synthetic data for experiments and does not specify a publicly available dataset with concrete access information. For example, in Section 4.1, it states: "The pretraining data is generated from random single-index models: for each task $t$, the context $\{(x^t_i, y^t_i)\}_{i=1}^{N+1}$ is generated as $x^t_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$ and $y^t_i = \sum_{j=Q}^{P} \frac{c^t_j}{j!}\,\mathrm{He}_j(\langle x^t_i, \beta^t\rangle)$" |
| Dataset Splits | No | The paper refers to using "training data" and mentions "test prompt length" and "validation tasks" but does not specify explicit dataset splits (e.g., 80/10/10 percentages or sample counts). For example, Section 4.1 states: "The pretraining data is generated from random single-index models..." and "test loss was averaged over 128 independent tasks... During the validation of the experiment of Figure 1, the coefficients $\{c_i\}$ in the single-index model were fixed to be $(c_2, c_3) = (2, 3!/2)$ to reduce the variance in the baseline methods." It does not provide specific train/validation/test splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for experiments. It mentions pretraining a GPT-2 model but does not specify the hardware on which its experiments were run. The NeurIPS checklist confirms: "We only have toy experiments and it is not a central part of this paper." |
| Software Dependencies | No | The paper mentions "Adam optimizer [KB15]" and "GPT-2 model [RWC+19]" but does not specify version numbers for these or other software libraries/frameworks used for the experiments. It does not provide a reproducible list of software dependencies with versions. |
| Experiment Setup | Yes | We used the Adam optimizer [KB15] with a learning rate of 0.0001. The training loss at each step was set as $\frac{1}{B}\sum_{t=1}^{B}\sum_{k=1}^{N+1}\left(y^t_k - \hat{y}_k(x^t_1, y^t_1, \ldots, x^t_{k-1}, y^t_{k-1}, x^t_k)\right)^2$ where $B = 8$ is the minibatch size. |
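
The synthetic data generation and training objective quoted above can be sketched as follows. This is a hedged illustration, not the authors' code: the dimensions `d`, `N`, the degree range `Q, P`, the coefficient distribution, and the zero-output placeholder predictor are all assumptions; only the single-index form $y_i = \sum_{j=Q}^{P} \frac{c_j}{j!}\mathrm{He}_j(\langle x_i, \beta\rangle)$, the minibatch size $B=8$, and the averaged autoregressive squared loss come from the paper's description.

```python
# Sketch of the Gaussian single-index pretraining task and minibatch loss.
# Assumptions (not from the paper): d=8, N=16, unit-norm Gaussian beta,
# standard-normal coefficients c_j, and a dummy predictor in place of GPT-2.
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval  # probabilists' Hermite He_j

rng = np.random.default_rng(0)
d, N, B = 8, 16, 8   # input dim, context length, minibatch size (B=8 per paper)
Q, P = 2, 3          # degree range of the Hermite expansion

def sample_task():
    """One random task: direction beta on the sphere, coefficients c_Q..c_P."""
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)
    c = {j: rng.standard_normal() for j in range(Q, P + 1)}
    return beta, c

def sample_context(beta, c):
    """x_i ~ N(0, I_d); y_i = sum_{j=Q}^P (c_j / j!) He_j(<x_i, beta>)."""
    x = rng.standard_normal((N + 1, d))
    z = x @ beta
    coeffs = np.zeros(P + 1)
    for j, cj in c.items():
        coeffs[j] = cj / math.factorial(j)
    y = hermeval(z, coeffs)  # evaluates the Hermite series at each z_i
    return x, y

def minibatch_loss(predict):
    """(1/B) sum_t sum_{k=1}^{N+1} (y_k^t - yhat_k(prompt up to x_k^t))^2."""
    total = 0.0
    for _ in range(B):
        beta, c = sample_task()
        x, y = sample_context(beta, c)
        for k in range(N + 1):
            # the model sees the first k (x, y) pairs plus the query x_k
            yhat = predict(x[:k], y[:k], x[k])
            total += (y[k] - yhat) ** 2
    return total / B

# Placeholder predictor (always 0) standing in for the pretrained transformer.
loss = minibatch_loss(lambda ctx_x, ctx_y, query: 0.0)
```

In the paper the predictor is the GPT-2 transformer's autoregressive output $\hat{y}_k$; here a trivial lambda only demonstrates the loss's calling convention.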