Dual Operating Modes of In-Context Learning

Authors: Ziqian Lin, Kangwook Lee

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We corroborate our analysis and predictions with extensive experiments with Transformers and LLMs.
Researcher Affiliation | Academia | Department of Computer Science, University of Wisconsin-Madison, Madison, Wisconsin, USA; Department of Electrical & Computer Engineering, University of Wisconsin-Madison, Madison, Wisconsin, USA.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at: https://github.com/UW-Madison-Lee-Lab/Dual_Operating_Modes_of_ICL.
Open Datasets | Yes | The classification task adopts five datasets: (i) glue-mrpc (Dolan & Brockett, 2005), (ii) glue-rte (Dagan et al., 2005), (iii) tweet_eval-hate (Barbieri et al., 2020), (iv) sick (Marelli et al., 2014), and (v) poem-sentiment (Sheng & Uthus, 2020). (A loading sketch follows the table.)
Dataset Splits | No | The paper evaluates ICL performance using in-context examples, but it does not specify explicit training, validation, or test splits in the traditional machine-learning sense.
Hardware Specification | Yes | We perform inference on large models with 8 H100 GPUs using the vllm package. (See the inference sketch below.)
Software Dependencies | No | The paper mentions software such as GPT2Model from the Hugging Face Transformers package, AdamW, the GitHub code released by Min et al. (2022), and vllm, but does not provide version numbers for these dependencies.
Experiment Setup | Yes | We use a 10-layer, 8-head Transformer decoder with 1024-dimensional feedforward layers, and the input dimension is set to d, equal to the dimension of x. We train the model over three epochs, each consisting of 10,000 batches, with every batch containing 256 samples. We use AdamW (Loshchilov & Hutter, 2019) as the optimizer with a weight decay of 0.00001 and a learning rate of 0.00001. (See the training-setup sketch below.)
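
As a companion to the Open Datasets row, the following is a minimal sketch of how the five listed classification datasets could be pulled from the Hugging Face Hub with the `datasets` library. The Hub identifiers and subset names are assumptions based on the standard dataset names, not paths taken from the paper's released code.

```python
# Hypothetical sketch: loading the five classification datasets listed above
# via the Hugging Face `datasets` library. Hub identifiers and subset names
# are assumptions, not taken from the paper's repository.
from datasets import load_dataset

datasets = {
    "glue-mrpc": load_dataset("glue", "mrpc"),
    "glue-rte": load_dataset("glue", "rte"),
    "tweet_eval-hate": load_dataset("tweet_eval", "hate"),
    "sick": load_dataset("sick"),
    "poem-sentiment": load_dataset("poem_sentiment"),
}

# Print the available splits and their sizes for each dataset.
for name, ds in datasets.items():
    print(name, {split: len(ds[split]) for split in ds})
```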
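
For the Hardware Specification row, the sketch below shows one way to run batched inference with the vllm package across 8 GPUs via tensor parallelism. The model checkpoint, prompt, and sampling settings are placeholders, since the paper's exact inference configuration is not quoted here.

```python
# Hypothetical sketch: batched inference with vLLM sharded across 8 GPUs.
# The checkpoint, prompt, and sampling parameters are placeholders, not
# settings taken from the paper.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder checkpoint
    tensor_parallel_size=8,             # shard the model across 8 H100s
)
sampling = SamplingParams(temperature=0.0, max_tokens=16)

prompts = [
    "Review: a beautiful, quiet poem.\nSentiment:",  # illustrative ICL-style prompt
]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```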
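
Finally, the Experiment Setup row maps naturally onto a Hugging Face GPT2Config (the paper reports building on GPT2Model from the Transformers package). The sketch below is a hedged reconstruction, not the authors' code: the hidden size, sequence length, and the linear read-in/read-out layers are placeholders, while the layer count, head count, feedforward width, optimizer, and schedule follow the quoted settings.

```python
# Hedged reconstruction of the quoted training setup with GPT2Config/GPT2Model.
# Hidden size, sequence length, and the read-in/read-out layers are placeholders.
import torch
from transformers import GPT2Config, GPT2Model

d = 16  # placeholder for the dimension of x

config = GPT2Config(
    n_layer=10,    # 10-layer Transformer decoder
    n_head=8,      # 8 attention heads
    n_embd=64,     # hidden size (placeholder; must be divisible by n_head)
    n_inner=1024,  # 1024-dimensional feedforward layers
)
model = GPT2Model(config)

# Assumed linear maps so the model consumes d-dimensional inputs and
# produces a scalar prediction per position.
read_in = torch.nn.Linear(d, config.n_embd)
read_out = torch.nn.Linear(config.n_embd, 1)

params = list(model.parameters()) + list(read_in.parameters()) + list(read_out.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-5, weight_decay=1e-5)  # quoted settings

# Quoted schedule: 3 epochs x 10,000 batches x 256 samples per batch.
num_epochs, batches_per_epoch, batch_size = 3, 10_000, 256

# One dummy forward pass to show the plumbing.
x = torch.randn(batch_size, 32, d)                        # 32-token placeholder sequences
hidden = model(inputs_embeds=read_in(x)).last_hidden_state
pred = read_out(hidden)                                   # shape: (256, 32, 1)
```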