Understanding In-Context Learning from Repetitions
Authors: Jianhao Yan, Jin Xu, Chiyu Song, Chenming Wu, Yafu Li, Yue Zhang
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We quantitatively investigate the role of surface features in text generation, and empirically establish the existence of token co-occurrence reinforcement, a principle that strengthens the relationship between two tokens based on their contextual co-occurrences. Furthermore, we find similar reinforcements lie behind the pretraining corpus, revealing that their existence is due to LLMs' efforts to maximize likelihood. By investigating the dual impacts of these features, our research illuminates the internal workings of in-context learning and expounds on the reasons for its failures. This paper provides an essential contribution to the understanding of in-context learning and its potential limitations, offering a fresh perspective on this exciting capability. The experiments in this section are conducted over the dataset of randomly generated sentences as in Section 2 and with the four LLaMA models. The results on Wikitext-103 and BookCorpus, and results with OPT and various other LLMs, can be found in Appendix D. |
| Researcher Affiliation | Collaboration | Jianhao Yan (1,2), Jin Xu (4), Chiyu Song (1,2), Chenming Wu (5), Yafu Li (1,2), Yue Zhang (2,3); 1 Zhejiang University, 2 School of Engineering, Westlake University, 3 Institute of Advanced Technology, Westlake Institute for Advanced Study, 4 Tsinghua University, 5 Baidu Research |
| Pseudocode | No | The paper does not contain any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | https://github.com/ElliottYan/understand-icl-from-repetition |
| Open Datasets | Yes | We use 1,000 sentences from each of the three different datasets: Wikitext-103 (Merity et al., 2016), BookCorpus (Zhu et al., 2015), and sequences of random words. The experiments in this section are conducted over MMLU (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021). We compute these probabilities over a commonly used pretraining corpus, wikipedia-english-2022. |
| Dataset Splits | No | The paper does not explicitly mention a validation dataset split. It mentions "7500 training problems and 1000 test problems" for GSM8K and "1140 test samples" for MMLU, but gives no explicit validation-set details needed for reproduction. |
| Hardware Specification | Yes | We run our experiments on 4 NVIDIA A100 GPUs, and each experiment takes about 30 hours to finish. |
| Software Dependencies | No | The paper mentions "LLaMA-Factory (hiyouga, 2023)" as the codebase used, and that "The tokenizer is the same as LLaMA". However, specific version numbers for LLaMA-Factory or for other software dependencies such as Python, PyTorch, or the Hugging Face Transformers library are not provided. |
| Experiment Setup | Yes | The base architecture of our model is the same as LLaMA. Due to the limitation of computational cost, we make several changes to the hyperparameters to obtain a smaller model. We set the hidden size to 1024 and the FFN size to 4096, and we incorporate 12 layers. The tokenizer is the same as LLaMA and the vocabulary size is 32000. This configuration results in a model with 267M trainable parameters. ... We pretrain each model for 50k steps, with a total batch size of 160 sentences per step. Each sentence contains 1024 tokens. |
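
The token co-occurrence reinforcement described in the "Research Type" row can be probed directly by repeating a token pair in the context and measuring how the probability of the second token changes. The following is a minimal sketch, not the paper's released code: the model (`gpt2` as a small stand-in rather than one of the paper's LLaMA models), the token pair, and the repetition counts are all illustrative assumptions.

```python
# Hedged sketch of a co-occurrence reinforcement probe (model and token pair
# are illustrative, not taken from the paper's experiments).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def prob_of_pair(token_a: str, token_b: str, n_repeats: int) -> float:
    """Return P(token_b | context) where the context repeats "token_a token_b"
    n_repeats times and then ends with token_a."""
    context = (f"{token_a} {token_b} " * n_repeats) + token_a
    input_ids = tokenizer(context, return_tensors="pt").input_ids
    # First sub-token of " token_b" stands in for the whole word.
    target_id = tokenizer(" " + token_b, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]
    return torch.softmax(next_token_logits, dim=-1)[target_id].item()

# If token co-occurrence reinforcement holds, the probability of the second
# token should grow as the pair is repeated more often in the context.
for k in (0, 1, 2, 4, 8):
    print(f"{k} repetitions -> P = {prob_of_pair('blue', 'dog', k):.4f}")
```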
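
The data listed in the "Open Datasets" and "Dataset Splits" rows can be assembled roughly as sketched below. This is a hedged approximation, not the paper's released pipeline: the Hugging Face dataset IDs, the slicing, the random-word vocabulary, and the sequence length are assumptions.

```python
# Hedged sketch of the evaluation data described in the table above.
import random
from datasets import load_dataset

# 1,000 natural-language sentences from each corpus (dataset IDs assumed).
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1", split="train[:5000]")
wiki_sents = [line.strip() for line in wikitext["text"] if line.strip()][:1000]
bookcorpus = load_dataset("bookcorpus", split="train[:1000]", trust_remote_code=True)
book_sents = list(bookcorpus["text"])

# Sequences of random words; vocabulary and length are purely illustrative.
rng = random.Random(0)
word_pool = ["apple", "river", "blue", "dog", "seven", "quiet", "stone", "run"]
random_sents = [" ".join(rng.choices(word_pool, k=20)) for _ in range(1000)]

# Task-level data referenced in the table: GSM8K and MMLU (test split only).
gsm8k = load_dataset("gsm8k", "main")
mmlu_test = load_dataset("cais/mmlu", "all", split="test")

print(len(wiki_sents), len(book_sents), len(random_sents),
      {name: len(split) for name, split in gsm8k.items()}, len(mmlu_test))
```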
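
The "Experiment Setup" row specifies the reduced LLaMA-style configuration almost completely; only details such as the attention head count are left unstated. A minimal sketch of that configuration, assuming the Hugging Face `LlamaConfig`/`LlamaForCausalLM` classes and an arbitrary head count, reproduces the reported parameter budget.

```python
# Minimal sketch of the 267M-parameter LLaMA-style model described above;
# the head count is an assumption, the other hyperparameters are from the paper.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,              # same tokenizer/vocabulary as LLaMA
    hidden_size=1024,              # reduced hidden size
    intermediate_size=4096,        # reduced FFN size
    num_hidden_layers=12,          # 12 transformer layers
    num_attention_heads=16,        # assumption: head count is not reported
    max_position_embeddings=1024,  # sentences of 1,024 tokens
)

model = LlamaForCausalLM(config)
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params / 1e6:.0f}M trainable parameters")  # ~267M with these settings
```

With untied input and output embeddings, the two 32,000 × 1,024 projections contribute roughly 65M parameters and each of the 12 layers roughly 17M (4 × 1024² attention weights plus 3 × 1024 × 4096 FFN weights), which matches the reported 267M total.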