Pretrained Transformer Efficiently Learns Low-Dimensional Target Functions In-Context
Authors: Kazusato Oko, Yujin Song, Taiji Suzuki, Denny Wu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pretrain a GPT-2 model [RWC+19] (with the same configuration as the in-context linear regression setting in [GTLV22]) to learn the Gaussian single-index task (1.1) with a degree-3 link function, and compare its in-context sample complexity against baseline algorithms (see Section 4 for details). In Figure 1 we observe that the pretrained transformer achieves low prediction risk using fewer in-context examples than two baseline algorithms: kernel ridge regression, and a neural network trained by gradient descent. |
| Researcher Affiliation | Collaboration | Kazusato Oko (1,3), Yujin Song (2,3), Taiji Suzuki (2,3), Denny Wu (4,5); 1: University of California, Berkeley; 2: University of Tokyo; 3: RIKEN AIP; 4: New York University; 5: Flatiron Institute |
| Pseudocode | Yes | Algorithm 1: Gradient-based training of transformer with MLP layer |
| Open Source Code | No | The answer No means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). |
| Open Datasets | No | The paper describes generating synthetic data for experiments and does not specify a publicly available dataset with concrete access information. For example, in Section 4.1, it states: "The pretraining data is generated from random single-index models: for each task $t$, the context $\{(x^t_i, y^t_i)\}_{i=1}^{N+1}$ is generated as $x^t_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$ and $y^t_i = \sum_{j=Q}^{P} \frac{c^t_j}{j!}\,\mathrm{He}_j(\langle x^t_i, \beta^t\rangle)$" |
| Dataset Splits | No | The paper refers to using "training data" and mentions "test prompt length" and "validation tasks" but does not specify explicit dataset splits (e.g., 80/10/10 percentages or sample counts). For example, Section 4.1 states: "The pretraining data is generated from random single-index models..." and "test loss was averaged over 128 independent tasks... During the validation of the experiment of Figure 1, the coefficients $\{c_i\}$ in the single-index model were fixed to be $(c_2, c_3) = (2, 3!/2)$ to reduce the variance in the baseline methods." It does not provide specific train/validation/test splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for experiments. It mentions pretraining a GPT-2 model but does not specify the hardware on which its experiments were run. The NeurIPS checklist confirms: "We only have toy experiments and it is not a central part of this paper." |
| Software Dependencies | No | The paper mentions "Adam optimizer [KB15]" and "GPT-2 model [RWC+19]" but does not specify version numbers for these or other software libraries/frameworks used for the experiments. It does not provide a reproducible list of software dependencies with versions. |
| Experiment Setup | Yes | We used the Adam optimizer [KB15] with a learning rate of 0.0001. The training loss at each step was set as $\frac{1}{B}\sum_{t=1}^{B}\sum_{k=1}^{N+1}\left(y^t_k - \hat{y}_k(x^t_1, y^t_1, \ldots, x^t_{k-1}, y^t_{k-1}, x^t_k)\right)^2$ where $B = 8$ is the minibatch size. |
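
The synthetic data generation and training objective quoted above can be sketched as follows. This is a hedged illustration, not the authors' code: the dimensions `d`, `N`, the degree range `Q, P`, the coefficient distribution, and the zero-output placeholder predictor are all assumptions; only the single-index form $y_i = \sum_{j=Q}^{P} \frac{c_j}{j!}\mathrm{He}_j(\langle x_i, \beta\rangle)$, the minibatch size $B=8$, and the averaged autoregressive squared loss come from the paper's description.

```python
# Sketch of the Gaussian single-index pretraining task and minibatch loss.
# Assumptions (not from the paper): d=8, N=16, unit-norm Gaussian beta,
# standard-normal coefficients c_j, and a dummy predictor in place of GPT-2.
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval  # probabilists' Hermite He_j

rng = np.random.default_rng(0)
d, N, B = 8, 16, 8   # input dim, context length, minibatch size (B=8 per paper)
Q, P = 2, 3          # degree range of the Hermite expansion

def sample_task():
    """One random task: direction beta on the sphere, coefficients c_Q..c_P."""
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)
    c = {j: rng.standard_normal() for j in range(Q, P + 1)}
    return beta, c

def sample_context(beta, c):
    """x_i ~ N(0, I_d); y_i = sum_{j=Q}^P (c_j / j!) He_j(<x_i, beta>)."""
    x = rng.standard_normal((N + 1, d))
    z = x @ beta
    coeffs = np.zeros(P + 1)
    for j, cj in c.items():
        coeffs[j] = cj / math.factorial(j)
    y = hermeval(z, coeffs)  # evaluates the Hermite series at each z_i
    return x, y

def minibatch_loss(predict):
    """(1/B) sum_t sum_{k=1}^{N+1} (y_k^t - yhat_k(prompt up to x_k^t))^2."""
    total = 0.0
    for _ in range(B):
        beta, c = sample_task()
        x, y = sample_context(beta, c)
        for k in range(N + 1):
            # the model sees the first k (x, y) pairs plus the query x_k
            yhat = predict(x[:k], y[:k], x[k])
            total += (y[k] - yhat) ** 2
    return total / B

# Placeholder predictor (always 0) standing in for the pretrained transformer.
loss = minibatch_loss(lambda ctx_x, ctx_y, query: 0.0)
```

In the paper the predictor is the GPT-2 transformer's autoregressive output $\hat{y}_k$; here a trivial lambda only demonstrates the loss's calling convention.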