In-Context Learning with Representations: Contextual Generalization of Trained Transformers

Authors: Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks. We conduct experiments on a synthetic dataset, where we randomly generate each token $v_k$ and its representation $f(v_k)$ from the standard Gaussian distribution. Figure 2 shows the training and inference losses of both 1-layer and 4-layer transformers, where we measure the inference loss by $\frac{1}{K}\sum_{k=1}^{K}\|\hat{y}_k - y_k^{\star}\|_2^2$ to validate (22): after sufficient training, the output $\hat{y}_k$ of the transformer converges to $y_k^{\star}$.
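As a rough illustration of the setup quoted above, the following PyTorch sketch generates the synthetic tokens and representations and evaluates the inference loss as the average squared error over the K query tokens. The target construction (labels linear in the representation via a task vector lambda) and the `model` placeholder are assumptions for illustration, not the authors' released code (none is provided).

    import torch

    d, m, K = 100, 20, 200            # dimensions and number of tokens reported in the paper
    V = torch.randn(K, d)             # tokens v_k drawn from a standard Gaussian
    F = torch.randn(K, m)             # representations f(v_k) drawn from a standard Gaussian
    lam = torch.randn(m)              # task vector lambda (assumed label form y_k = <lambda, f(v_k)>)
    y = F @ lam                       # per-token labels under that assumption

    def inference_loss(model, V, y):
        # Average squared error over the K query tokens; `model` stands in for
        # the trained 1-layer or 4-layer transformer.
        with torch.no_grad():
            y_hat = model(V)
        return ((y_hat - y) ** 2).mean()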
Researcher Affiliation Academia Tong Yang (CMU), Yu Huang (UPenn), Yingbin Liang (OSU), Yuejie Chi (CMU). Tong Yang: Department of Electrical and Computer Engineering, Carnegie Mellon University; email: tongyang@andrew.cmu.edu. Yu Huang: Department of Statistics and Data Science, Wharton School, University of Pennsylvania; email: yuh42@wharton.upenn.edu. Yingbin Liang: Department of Electrical and Computer Engineering, The Ohio State University; email: liang.889@osu.edu. Yuejie Chi: Department of Electrical and Computer Engineering, Carnegie Mellon University; email: yuejiechi@cmu.edu.
Pseudocode No The paper does not include any explicitly labeled pseudocode or algorithm blocks. The methods are described in prose and mathematical formulations.
Open Source Code No From the paper's checklist: "Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The experiments are very simple and can be easily reproduced by following the instructions in the paper."
Open Datasets No We conduct experiments on a synthetic dataset, where we randomly generate each token $v_k$ and its representation $f(v_k)$ from the standard Gaussian distribution. The paper does not provide concrete access information (link, DOI, formal citation) for this synthetic dataset.
Dataset Splits No We generate $\lambda$ from the standard Gaussian distribution to create the training set with 10000 samples and an in-domain test set with 200 samples; we also create an out-of-domain (OOD) test set with 200 samples by sampling $\lambda$ from $N(\mathbf{1}_m, 4I_m)$. The paper explicitly mentions training and test sets but does not specify a validation dataset split.
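A minimal sketch of the split described in this row, assuming PyTorch and treating each sampled lambda as one task instance; the sample counts and distributions are taken from the quote, everything else is illustrative.

    import torch

    m = 20
    lam_train    = torch.randn(10000, m)               # training tasks, lambda ~ N(0, I_m)
    lam_test_id  = torch.randn(200, m)                 # in-domain test tasks, lambda ~ N(0, I_m)
    lam_test_ood = 1.0 + 2.0 * torch.randn(200, m)     # OOD test tasks, lambda ~ N(1_m, 4 I_m)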
Hardware Specification No The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. It only mentions general terms like "1-layer transformer" and "4-layer transformer".
Software Dependencies No All experiments use the Adam optimizer with a learning rate of $1\times 10^{-4}$. The paper mentions the Adam optimizer and a learning rate but does not provide specific version numbers for any software, libraries, or programming languages used.
Experiment Setup Yes We set $N = 30$, $K = 200$, $d = 100$, $m = 20$, and set $H$ to be 64 and 8 for the 1-layer and 4-layer transformers, respectively. We set the training loss to be the population loss defined in (9), initialize $\{Q_h^{(0)}\}_{h\in[H]}$ using the standard Gaussian, and set $\{w_h^{(0)}\}_{h\in[H]}$ to 0, identical to what is specified in Section 3. We train with a batch size of 256. All experiments use the Adam optimizer with a learning rate of $1\times 10^{-4}$.
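Putting the quoted hyperparameters together, a hedged PyTorch sketch of the training configuration might look as follows. The shapes chosen for $Q_h$ and $w_h$ and the surrounding model/loss code are assumptions, since the paper describes its parameterization only mathematically.

    import torch

    N, K, d, m = 30, 200, 100, 20
    H = 64                                              # 64 heads for the 1-layer model, 8 for the 4-layer model
    Q = [torch.randn(d, d, requires_grad=True) for _ in range(H)]   # Q_h^(0) initialized from a standard Gaussian
    w = [torch.zeros(1, requires_grad=True) for _ in range(H)]      # w_h^(0) initialized to 0
    optimizer = torch.optim.Adam(Q + w, lr=1e-4)        # Adam with learning rate 1e-4, as in the paper
    batch_size = 256                                    # batch size used for training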