Dual Operating Modes of In-Context Learning
Authors: Ziqian Lin, Kangwook Lee
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We corroborate our analysis and predictions with extensive experiments with Transformers and LLMs. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, University of Wisconsin-Madison, Madison, Wisconsin, USA; (2) Department of Electrical & Computer Engineering, University of Wisconsin-Madison, Madison, Wisconsin, USA. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at: https://github.com/UW-Madison-Lee-Lab/Dual_Operating_Modes_of_ICL |
| Open Datasets | Yes | The classification task adopts five datasets: (i) glue-mrpc (Dolan & Brockett, 2005), (ii) glue-rte (Dagan et al., 2005), (iii) tweet_eval-hate (Barbieri et al., 2020), (iv) sick (Marelli et al., 2014), and (v) poem-sentiment (Sheng & Uthus, 2020). (A dataset-loading sketch follows the table.) |
| Dataset Splits | No | The paper describes using in-context examples for language models and evaluates ICL performance, but it does not specify explicit training, validation, or test dataset splits in the traditional machine learning sense for its experiments. |
| Hardware Specification | Yes | We perform inference on large models with 8 H100 GPUs with the package vLLM. (An inference sketch follows the table.) |
| Software Dependencies | No | The paper mentions software such as 'GPT2Model from the package Transformers supported by Hugging Face', 'AdamW', 'GitHub code released by Min et al. (2022)', and 'vLLM', but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We use a 10-layer, 8-head Transformer decoder with 1024-dimensional feedforward layers, and the input dimension is set to d, equal to the dimension of x. We train the model over three epochs, each consisting of 10,000 batches, with every batch containing 256 samples. We use AdamW (Loshchilov & Hutter, 2019) as the optimizer with a weight decay of 0.00001 and a learning rate of 0.00001. (A training-setup sketch follows the table.) |
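
The five classification datasets listed above are all publicly hosted on the Hugging Face Hub. Below is a minimal loading sketch using the `datasets` library; the hub identifiers are assumed from the dataset names quoted in the paper, and the split handling follows standard library defaults rather than the paper's exact pipeline (which builds on the code of Min et al., 2022).

```python
# Minimal sketch: loading the five classification datasets via the
# Hugging Face `datasets` library. Hub identifiers are assumed from the
# dataset names cited in the paper; split usage is not the paper's exact pipeline.
from datasets import load_dataset

corpora = {
    "glue-mrpc": load_dataset("glue", "mrpc"),
    "glue-rte": load_dataset("glue", "rte"),
    "tweet_eval-hate": load_dataset("tweet_eval", "hate"),
    "sick": load_dataset("sick"),
    "poem-sentiment": load_dataset("poem_sentiment"),
}

# Print the available splits and their sizes for each corpus.
for name, ds in corpora.items():
    print(name, {split: len(ds[split]) for split in ds})
```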
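
For the hardware row, the paper reports running inference on large models with 8 H100 GPUs via vLLM. The sketch below shows one plausible way to set that up with vLLM's tensor parallelism; the model name and prompt are placeholders, not necessarily those used in the paper.

```python
# Minimal sketch of multi-GPU inference with vLLM, assuming 8 GPUs used
# via tensor parallelism. The model name and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=8)
params = SamplingParams(temperature=0.0, max_tokens=8)  # greedy label prediction

# A toy in-context classification prompt with one demonstration.
prompts = [
    "Review: a tender, heartfelt family drama.\nSentiment: positive\n"
    "Review: a boring, overlong mess.\nSentiment:"
]

outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text.strip())
```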
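
The experiment-setup row, combined with the paper's mention of GPT2Model from Hugging Face Transformers, suggests roughly the following training loop. This is a sketch under assumptions: the value of d, the data generator, and the loss are placeholders, and only the quoted hyperparameters (10 layers, 8 heads, 1024-dimensional feedforward, 3 epochs of 10,000 batches of 256 samples, AdamW with lr and weight decay 1e-5) are taken from the paper.

```python
# Minimal sketch of the reported training setup, assuming a GPT2Model
# decoder backbone fed with continuous embeddings. d, the data, and the
# loss below are placeholders; hyperparameters follow the quoted values.
import torch
from torch.optim import AdamW
from transformers import GPT2Config, GPT2Model

d = 16  # placeholder for the dimension of x (must be divisible by n_head here)
config = GPT2Config(
    n_layer=10,      # 10-layer Transformer decoder
    n_head=8,        # 8 attention heads
    n_embd=d,        # input/model dimension set to d
    n_inner=1024,    # 1024-dimensional feedforward layers
    vocab_size=1,    # unused: inputs are passed as embeddings, not token ids
)
model = GPT2Model(config)
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=1e-5)

# 3 epochs x 10,000 batches x 256 samples per batch, as reported.
for epoch in range(3):
    for step in range(10_000):
        batch = torch.randn(256, 32, d)              # placeholder in-context sequences
        hidden = model(inputs_embeds=batch).last_hidden_state
        loss = hidden.pow(2).mean()                  # placeholder objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```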