Exact Conversion of In-Context Learning to Model Weights in Linearized-Attention Transformers
Authors: Brian K Chen, Tianyang Hu, Hui Jin, Hwee Kuan Lee, Kenji Kawaguchi
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our approach through experiments that show the exact incorporation of ICL tokens into a linear transformer. We further suggest how our method can be adapted to achieve cheap approximate conversion of ICL tokens, even in regular transformer networks that are not linearized. Our experiments on GPT-2 show that, even though the conversion is only approximate, the model still gains valuable context from the included bias terms. |
| Researcher Affiliation | Collaboration | 1 National University of Singapore; 2 Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR); 3 Huawei Noah's Ark Lab; 4 Nanyang Technological University; 5 Singapore Eye Research Institute; 6 Singapore International Research Laboratory on Artificial Intelligence; 7 Singapore Institute for Clinical Sciences. |
| Pseudocode | Yes | Algorithm 1 ICL conversion algorithm (ICLCA) |
| Open Source Code | No | The paper does not include any explicit statements about making the source code for their described methodology publicly available, nor does it provide a direct link to a code repository. |
| Open Datasets | No | The paper mentions experiments on a 'synthetic induction head task' and a 'pretrained GPT-2 model'. For the synthetic task, it describes the data model but provides no access information. For GPT-2, it uses a pretrained model but does not describe the dataset used for its own training or provide access to any custom dataset for training their new architecture from scratch. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages or sample counts). It refers to evaluation on '100 randomly generated ICL prompts and input prompts' but does not define this as a specific validation split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU models, or cloud instance types) used for running its experiments. It only mentions training models. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., programming language versions, library versions, or solver versions) needed to replicate the experiment. |
| Experiment Setup | No | The paper mentions 'We use a 12-layer linear attention transformer with embedding dimension 128 and RoPE', which is a model architecture detail. However, it does not provide specific hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings for training. |
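
The conversion quoted in the table (Algorithm 1, ICLCA) rests on the observation that, in linear attention, the ICL prompt's contribution to the accumulated key-value and normalizer sums is a fixed additive term, so it can be folded into the layer as a bias. The snippet below is a minimal sketch of that idea for a single non-causal linear-attention layer without projections or RoPE; the feature map `phi`, the function names, and the bias handling are illustrative assumptions, not the paper's exact ICLCA.

```python
import numpy as np

def phi(x):
    # Feature map for linear attention (ELU + 1); an illustrative choice,
    # not necessarily the one used in the paper.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v, kv_bias=None, z_bias=None):
    """Single-head, non-causal linear attention.

    kv_bias / z_bias stand in for an ICL context whose contribution has
    been absorbed into additive terms (a sketch of the conversion idea).
    """
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                 # accumulated key-value statistics, (d, d)
    z = kf.sum(axis=0)            # accumulated normalizer, (d,)
    if kv_bias is not None:
        kv = kv + kv_bias         # add the context's contribution back in
        z = z + z_bias
    return (qf @ kv) / (qf @ z)[:, None]

rng = np.random.default_rng(0)
d, n_ctx, n_qry = 8, 5, 3
ctx = rng.normal(size=(n_ctx, d))   # ICL demonstration tokens
qry = rng.normal(size=(n_qry, d))   # query tokens

# (a) attend over the concatenated [context; query] sequence and keep
#     only the query positions
seq = np.vstack([ctx, qry])
with_context = linear_attention(seq, seq, seq)[n_ctx:]

# (b) drop the context tokens and fold their contribution into bias terms
kv_bias = phi(ctx).T @ ctx
z_bias = phi(ctx).sum(axis=0)
without_context = linear_attention(qry, qry, qry, kv_bias, z_bias)

# In this simplified (non-causal, projection-free) setting the match is exact.
assert np.allclose(with_context, without_context)
```

In this toy setting the converted output matches the full-context output exactly; the paper's algorithm performs the analogous bookkeeping in a deep linear-attention transformer with RoPE, and extends it to an approximate conversion via bias terms for standard softmax attention in GPT-2.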