Where does In-context Learning Happen in Large Language Models?

Authors: Suzanna Sia, David Mueller, Kevin Duh

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a series of layer-wise context-masking experiments on GPTNEO2.7B, BLOOM3B, STARCODER2-7B, LLAMA3.1-8B, and LLAMA3.1-8B-INSTRUCT, on Machine Translation and Code generation, we demonstrate evidence of a "task recognition" point where the task is encoded into the input representations and attention to context is no longer necessary. (A schematic sketch of layer-wise context masking appears after the table.)
Researcher Affiliation | Academia | Suzanna Sia, Johns Hopkins University (ssia1@jhu.edu); David Mueller, Johns Hopkins University (dam@cs.jhu.edu); Kevin Duh, Johns Hopkins University (kevinduh@cs.jhu.edu)
Pseudocode | No | The paper describes the computational steps involved in its methodology (e.g., attention weight computation) but does not include a formally labeled "Pseudocode" or "Algorithm" block.
Open Source Code | Yes | Corresponding author: suzyahyah@gmail.com. Code repository: https://github.com/suzyahyah/where_does_in-context-learning_happen_in_LLMs
Open Datasets | Yes | Data: We test our models using two datasets, FLORES [29] for Translation and HUMANEVAL for Code generation. For FLORES, we experiment with en→fr (main paper) and en→pt (appendix).
Dataset Splits | Yes | We split the dev set of FLORES into 400 and 800 training examples and 200 dev examples, and we repeated the experiments with 2 random seed initialisations. Note that this setup is designed to tune the layers for task location; it is highly unlikely that the model can learn translation knowledge from this small amount of supervision. The LoRA layers were trained for up to 50 epochs with batch size = 32, learning rate = 1e-4, early stopping patience = 5 and threshold = 0.01, with α = 32, r = 8 and dropout = 0.05. These values are defaults and there was no hyper-parameter optimisation over the training parameters. The cross-entropy loss was computed across the entire sequence, and we used the best checkpoint on the 200 held-out dev examples for evaluation.
Hardware Specification | Yes | We provide sufficient information to reproduce all of our experiments (all of our experiments can be run on consumer-grade GPUs such as the RTX 6000).
Software Dependencies | No | The paper mentions using models from Meta AI and the 'transformers library [67]' but does not specify version numbers for these or any other software dependencies, such as Python or PyTorch versions.
Experiment Setup | Yes | When examples are provided in-context, we use 5 examples per prompt and we re-sample these examples to control for variance in example selection. The LoRA layers were trained for up to 50 epochs with batch size = 32, learning rate = 1e-4, early stopping patience = 5 and threshold = 0.01, with α = 32, r = 8 and dropout = 0.05. (A hedged LoRA configuration sketch matching these values appears after the table.)
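
The layer-wise context-masking procedure quoted in the Research Type row can be illustrated with a toy, self-contained sketch. This is not the authors' implementation (their repository linked above is authoritative): the single-head attention with identity projections, the layer count, and the token counts are illustrative assumptions. The only point it demonstrates is that from a chosen layer onward, test-input positions no longer attend to the in-context example positions.

```python
# Toy sketch of layer-wise context masking (illustrative only, not the paper's code).
import torch
import torch.nn.functional as F

def masked_self_attention(x, mask_context, context_len):
    """Single-head causal self-attention over x of shape (seq_len, d_model).

    If mask_context is True, query positions after the in-context examples
    (the test input) cannot attend to the first `context_len` tokens.
    """
    seq_len, d_model = x.shape
    q, k, v = x, x, x                                  # identity projections keep the toy minimal
    scores = q @ k.T / d_model ** 0.5                  # (seq_len, seq_len)

    allowed = torch.tril(torch.ones(seq_len, seq_len)).bool()  # causal mask
    if mask_context:
        # Rows >= context_len (test input) may not look at columns < context_len (examples).
        allowed[context_len:, :context_len] = False

    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def run_layers(x, n_layers, mask_from_layer, context_len):
    """Apply attention at every layer, masking attention to the context
    from `mask_from_layer` onward."""
    for layer in range(n_layers):
        x = x + masked_self_attention(x, layer >= mask_from_layer, context_len)
    return x

# Usage: 8 in-context example tokens + 4 test-input tokens, masking from layer 2 of 4.
x = torch.randn(12, 16)
out = run_layers(x, n_layers=4, mask_from_layer=2, context_len=8)
print(out.shape)  # torch.Size([12, 16])
```

In the paper's setting, sweeping `mask_from_layer` over the model's depth is what reveals the "task recognition" point: once masking starts late enough, task performance is unaffected because the task is already encoded in the input representations.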
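The LoRA fine-tuning details quoted in the Dataset Splits and Experiment Setup rows translate into roughly the following configuration. This is a minimal sketch using the Hugging Face peft and transformers libraries under stated assumptions: the target modules and output directory are not given in the excerpt and are placeholders, and argument names may differ slightly across library versions.

```python
# Sketch of the quoted LoRA hyperparameters (r=8, alpha=32, dropout=0.05, lr=1e-4,
# batch size 32, up to 50 epochs, early stopping patience=5, threshold=0.01).
from peft import LoraConfig
from transformers import EarlyStoppingCallback, TrainingArguments

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumption: not specified in the excerpt
    task_type="CAUSAL_LM",
)

train_args = TrainingArguments(
    output_dir="lora_task_location",       # hypothetical path
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    num_train_epochs=50,
    eval_strategy="epoch",                 # called `evaluation_strategy` on transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,           # "best checkpoint on the 200 held-out dev examples"
    metric_for_best_model="eval_loss",
)

# Early stopping as described: patience 5, threshold 0.01 on the dev loss.
early_stop = EarlyStoppingCallback(early_stopping_patience=5,
                                   early_stopping_threshold=0.01)

# These objects plug into the standard peft.get_peft_model(base_model, lora_cfg)
# and transformers.Trainer(..., args=train_args, callbacks=[early_stop]) recipe.
```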