Exact Conversion of In-Context Learning to Model Weights in Linearized-Attention Transformers

Authors: Brian K Chen, Tianyang Hu, Hui Jin, Hwee Kuan Lee, Kenji Kawaguchi

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the efficacy of our approach through experiments that show the exact incorporation of ICL tokens into a linear transformer. We further suggest how our method can be adapted to achieve cheap approximate conversion of ICL tokens, even in regular transformer networks that are not linearized. Our experiments on GPT-2 show that, even though the conversion is only approximate, the model still gains valuable context from the included bias terms.
Researcher Affiliation | Collaboration | 1 National University of Singapore; 2 Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR); 3 Huawei Noah's Ark Lab; 4 Nanyang Technological University; 5 Singapore Eye Research Institute; 6 Singapore International Research Laboratory on Artificial Intelligence; 7 Singapore Institute for Clinical Sciences.
Pseudocode | Yes | Algorithm 1: ICL conversion algorithm (ICLCA). A minimal sketch of the underlying linear-attention identity follows this table.
Open Source Code | No | The paper does not include any explicit statement about making the source code for its described methodology publicly available, nor does it provide a direct link to a code repository.
Open Datasets | No | The paper mentions experiments on a 'synthetic induction head task' and a 'pretrained GPT-2 model'. For the synthetic task, it describes the data model but provides no access information. For GPT-2, it uses a pretrained model but does not describe the dataset used for its own training or provide access to any custom dataset for training its new architecture from scratch.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages or sample counts). It refers to evaluation on '100 randomly generated ICL prompts and input prompts' but does not define this as a specific validation split.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU models, or cloud instance types) used for running its experiments. It only mentions that models were trained.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., programming language versions, library versions, or solver versions) needed to replicate the experiment.
Experiment Setup | No | The paper mentions 'We use a 12-layer linear attention transformer with embedding dimension 128 and RoPE', which is a model architecture detail. However, it does not provide specific hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings for training.
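The exact conversion described in the Research Type and Pseudocode rows rests on the fact that, without a softmax, attention over a prepended ICL prompt reduces to an additive key-value state that can be folded into the layer's parameters. The sketch below is a minimal illustration of that identity, not the authors' ICLCA implementation: it uses plain NumPy, a single head, no RoPE or normalization, and all names (`linear_attention`, `Wq`, `Wk`, `Wv`, `S_ctx`) are illustrative assumptions rather than identifiers from the paper.

```python
# Minimal sketch (not the authors' code): absorbing an ICL prompt into the
# running state of a causal linear-attention layer, so that the model run on
# the input alone reproduces the prompted outputs exactly.
import numpy as np

def linear_attention(Q, K, V, S0=None):
    """Causal linear attention without softmax.

    y_i = q_i @ S_i, where S_i = S0 + sum_{j <= i} outer(k_j, v_j).
    Returns the outputs and the final accumulated state S.
    """
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v)) if S0 is None else S0.copy()
    out = []
    for q, k, v in zip(Q, K, V):
        S = S + np.outer(k, v)   # accumulate key-value outer products
        out.append(q @ S)        # read out with the current query
    return np.stack(out), S

rng = np.random.default_rng(0)
d = 8
ctx = rng.normal(size=(4, d))    # ICL demonstration tokens (hypothetical)
x = rng.normal(size=(3, d))      # query-time input tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# (a) Run with the ICL prompt prepended to the input.
full = np.concatenate([ctx, x])
y_icl, _ = linear_attention(full @ Wq, full @ Wk, full @ Wv)

# (b) Convert the ICL tokens into a fixed additive state (a bias-like term)
#     and run on the input alone; the outputs on x match exactly.
_, S_ctx = linear_attention(ctx @ Wq, ctx @ Wk, ctx @ Wv)
y_conv, _ = linear_attention(x @ Wq, x @ Wk, x @ Wv, S0=S_ctx)

assert np.allclose(y_icl[len(ctx):], y_conv)
```

In this simplified setting the match is exact because the context contributes only the fixed term `S_ctx`; with softmax attention the prompt cannot be separated out this way, which is consistent with the paper reporting only an approximate, bias-term-based conversion for GPT-2.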