In-Context Learning with Representations: Contextual Generalization of Trained Transformers
Authors: Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper investigates the training dynamics of transformers trained by gradient descent through the lens of non-linear regression tasks. We conduct experiments on a synthetic dataset, where we randomly generate each token $v_k$ and its representation $f(v_k)$ from the standard Gaussian distribution. Figure 2 shows the training and inference losses of both 1-layer and 4-layer transformers, where we measure the inference loss by $\frac{1}{K}\sum_{k=1}^{K}\|\hat{y}_k - y_k\|_2^2$ to validate (22): after sufficient training, the output of the transformer $\hat{y}_k$ converges to $y_k$. |
| Researcher Affiliation | Academia | Tong Yang, Department of Electrical and Computer Engineering, Carnegie Mellon University (tongyang@andrew.cmu.edu); Yu Huang, Department of Statistics and Data Science, Wharton School, University of Pennsylvania (yuh42@wharton.upenn.edu); Yingbin Liang, Department of Electrical and Computer Engineering, The Ohio State University (liang.889@osu.edu); Yuejie Chi, Department of Electrical and Computer Engineering, Carnegie Mellon University (yuejiechi@cmu.edu). |
| Pseudocode | No | The paper does not include any explicitly labeled pseudocode or algorithm blocks. The methods are described in prose and mathematical formulations. |
| Open Source Code | No | The paper's reproducibility checklist answers the question "Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?" with [No], giving the justification: "The experiments are very simple and can be easily reproduced by following the instructions in the paper." |
| Open Datasets | No | We conduct experiments on a synthetic dataset, where we randomly generate each token $v_k$ and its representation $f(v_k)$ from the standard Gaussian distribution. The paper does not provide concrete access information (link, DOI, formal citation) for this synthetic dataset. |
| Dataset Splits | No | We generate $\lambda$ from the standard Gaussian distribution to create the training set with 10,000 samples and an in-domain test set with 200 samples; we also create an out-of-domain (OOD) test set with 200 samples by sampling $\lambda$ from $\mathcal{N}(\mathbf{1}_m, 4I_m)$. The paper explicitly mentions training and test sets but does not specify a validation dataset split. (See the data-generation sketch below the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. It only mentions general terms like "1-layer transformer" and "4-layer transformer". |
| Software Dependencies | No | All experiments use the Adam optimizer with a learning rate of $1 \times 10^{-4}$. The paper mentions the Adam optimizer and a learning rate but does not provide specific version numbers for any software, libraries, or programming languages used. |
| Experiment Setup | Yes | We set $N = 30$, $K = 200$, $d = 100$, $m = 20$, and set $H$ to be 64 and 8 for the 1-layer and 4-layer transformers, respectively. We set the training loss to be the population loss defined in (9), initialize $\{Q_h^{(0)}\}_{h\in[H]}$ using the standard Gaussian, and set $\{w_h^{(0)}\}_{h\in[H]}$ to 0, identical to what is specified in Section 3. We train with a batch size of 256. All experiments use the Adam optimizer with a learning rate of $1 \times 10^{-4}$. (See the training-setup sketch below the table.) |
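To make the synthetic-data description in the rows above concrete, here is a minimal sketch of how the dataset and its splits could be generated. It assumes the label rule $y_k = f(v_k)^\top \lambda$ and uses the dimensions reported in the paper ($d=100$, $m=20$, $K=200$, $N=30$); all variable and function names (`V`, `F`, `make_split`, `lam`) are illustrative, not taken from the authors' code.

```python
# Hedged sketch of the synthetic data described above (not the authors' code).
# Assumes labels of the form y_k = f(v_k)^T lambda; the exact prompt format may differ.
import torch

d, m, K, N = 100, 20, 200, 30   # token dim, representation dim, vocabulary size, context length

# Tokens v_k and their representations f(v_k), each drawn from a standard Gaussian.
V = torch.randn(K, d)           # v_1, ..., v_K
F = torch.randn(K, m)           # f(v_1), ..., f(v_K)

def make_split(num_samples: int, ood: bool = False):
    """Sample coefficient vectors lambda and build (context, label) pairs.

    In-domain: lambda ~ N(0, I_m).  Out-of-domain: lambda ~ N(1_m, 4 I_m),
    matching the splits quoted in the 'Dataset Splits' row.
    """
    lam = torch.randn(num_samples, m)
    if ood:
        lam = 1.0 + 2.0 * lam                        # mean 1_m, variance 4 (std 2)
    idx = torch.randint(K, (num_samples, N))         # N context tokens per prompt
    x = V[idx]                                       # (num_samples, N, d)
    y = torch.einsum('snm,sm->sn', F[idx], lam)      # y_k = f(v_k)^T lambda
    return x, y, lam

train_set = make_split(10_000)
test_in = make_split(200)
test_ood = make_split(200, ood=True)
```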
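The "Experiment Setup" and "Software Dependencies" rows fix the optimizer, initialization, and batch size but not the code itself; the sketch below shows one way those settings could be wired up in PyTorch. The parameter shapes and the `model_forward`/`loss_fn` placeholders are assumptions; only $H$, the Gaussian initialization of $Q_h^{(0)}$, the zero initialization of $w_h^{(0)}$, the batch size 256, and Adam with learning rate $1 \times 10^{-4}$ come from the paper.

```python
# Hedged sketch of the reported training configuration (not the authors' code).
import torch

H = 64           # 64 heads for the 1-layer model; 8 for the 4-layer model
d, m = 100, 20   # parameter shapes below are illustrative assumptions

# Per-head parameters, initialized as reported: Q_h^(0) ~ standard Gaussian, w_h^(0) = 0.
Q = [torch.randn(d, d, requires_grad=True) for _ in range(H)]
w = [torch.zeros(m, requires_grad=True) for _ in range(H)]

optimizer = torch.optim.Adam(Q + w, lr=1e-4)   # Adam with learning rate 1e-4
batch_size = 256

# One illustrative optimization step; `model_forward` and `loss_fn` stand in for the
# paper's transformer and its population loss (9), neither of which is released.
# for x, y in train_loader:
#     pred = model_forward(x, Q, w)
#     loss = loss_fn(pred, y)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
```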