The Inductive Bias of In-Context Learning: Rethinking Pretraining Example Design
Authors: Yoav Levine, Noam Wies, Daniel Jannai, Dan Navon, Yedid Hoshen, Amnon Shashua
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | After the presentation of our theoretical results in section 2, we detail in section 3 two controlled setting exemplifications of new methods that directly leverage the in-context bias... Table 1 shows zero-shot SentEval sentence similarity scores, attained by using the average word embedding of an inserted sentence as its examined sentence representation (shown by Reimers & Gurevych (2019) to be most meaningful in zero shot)... In order to directly probe the acquired ability to integrate non-neighboring sentences, we evaluated the resultant models on the very challenging setup of zero-shot closed-book open domain question answering. |
| Researcher Affiliation | Academia | Yoav Levine, Noam Wies, Daniel Jannai, Dan Navon, Yedid Hoshen & Amnon Shashua, The Hebrew University of Jerusalem, {yoav.levine, noam.wies, daniel.jannai, dan.nav}@mail.huji.ac.il |
| Pseudocode | No | The paper contains detailed mathematical proofs and definitions of functions and terms in its theoretical sections (e.g., Appendix B), but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states it used "pretrained RoBERTa-base weights from the Hugging Face Transformers repository" and a "Hugging Face Transformers implementation of GPT-2," but it does not include an explicit statement or link indicating that the authors' own code for their methodology is being released or is publicly available. |
| Open Datasets | Yes | We perform TAPT on the SentEval sentence similarity benchmark (Conneau & Kiela, 2018)... We evaluated the models on questions from the Natural Questions (NQ) benchmark (Kwiatkowski et al., 2019)... We test how kNN-Pretraining affects other NLU tasks, by examining several tasks from the GLUE benchmark (Wang et al., 2018)... |
| Dataset Splits | No | The paper does not explicitly provide specific training, validation, and test dataset splits with percentages or counts. It describes training for a certain number of epochs and evaluates models on zero-shot benchmarks, implying standard splits of those benchmarks, but does not detail its own data partitioning for training and validation. |
| Hardware Specification | No | The paper states 'Experiments were performed with Cloud TPUs and supported by Google's TensorFlow Research Cloud (TFRC),' but it does not specify exact TPU models (e.g., TPU v2, v3) or other detailed hardware specifications like CPU models or memory. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer,' 'Hugging Face Transformers,' and the 'FAISS library,' but it does not provide specific version numbers for these or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We continued training a pretrained RoBERTa-base model for 5 epochs, using the first epoch for learning-rate warmup and examining peak learning rates of {1, 3, 5, 7}×10⁻⁵... AdamW optimizer (with the parameters suggested in the original RoBERTa paper: β₁ = 0.9, β₂ = 0.98, ε = 10⁻⁶ and weight decay of 0.01), with batch sizes of 128 or 256 (depending on model size) and sequences of 256 tokens each... AdamW optimizer (with the parameters suggested in the original GPT-2 paper: β₁ = 0.9, β₂ = 0.95, ε = 10⁻⁸ and weight decay of 0.1), with batch size of 512 and sequences of 256 tokens each. |
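
The hyperparameters quoted in the Experiment Setup row are concrete enough to reconstruct the continued-pretraining optimizer configuration. Below is a minimal sketch, assuming PyTorch and Hugging Face Transformers (libraries the paper names, though without versions); the linear warmup scheduler, the `steps_per_epoch` value, and the specific peak learning rate are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch (not the authors' released code) of the RoBERTa-base
# continued-pretraining setup quoted above: AdamW with beta1=0.9, beta2=0.98,
# eps=1e-6, weight decay 0.01, peak LR swept over {1, 3, 5, 7} x 10^-5,
# 5 epochs with the first epoch used for learning-rate warmup.
import torch
from transformers import RobertaForMaskedLM, get_linear_schedule_with_warmup

model = RobertaForMaskedLM.from_pretrained("roberta-base")

peak_lr = 1e-5  # one point of the {1, 3, 5, 7} x 10^-5 sweep
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)

# 5 epochs total, first epoch used for warmup; steps_per_epoch is hypothetical
# and depends on corpus size, batch size (128 or 256), and sequence length (256).
steps_per_epoch = 1000
num_epochs = 5
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=steps_per_epoch,               # one warmup epoch
    num_training_steps=steps_per_epoch * num_epochs,
)
```

For the GPT-2 experiments, the quoted setup swaps in β₂ = 0.95, ε = 10⁻⁸, weight decay 0.1, and a batch size of 512; the same skeleton applies with those values substituted.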