The Inductive Bias of In-Context Learning: Rethinking Pretraining Example Design
Authors: Yoav Levine, Noam Wies, Daniel Jannai, Dan Navon, Yedid Hoshen, Amnon Shashua
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | After the presentation of our theoretical results in section 2, we detail in section 3 two controlled setting exemplifications of new methods that directly leverage the in-context bias... Table 1 shows zero-shot SentEval sentence similarity scores, attained by using the average word embedding of an inserted sentence as its examined sentence representation (shown by Reimers & Gurevych (2019) to be most meaningful in zero shot)... In order to directly probe the acquired ability to integrate non-neighboring sentences, we evaluated the resultant models on the very challenging setup of zero-shot closed-book open domain question answering. |
| Researcher Affiliation | Academia | Yoav Levine, Noam Wies, Daniel Jannai, Dan Navon, Yedid Hoshen & Amnon Shashua, The Hebrew University of Jerusalem, {yoav.levine, noam.wies, daniel.jannai, dan.nav}@mail.huji.ac.il |
| Pseudocode | No | The paper contains detailed mathematical proofs and definitions of functions and terms in its theoretical sections (e.g., Appendix B), but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states it used "pretrained RoBERTa-base weights from the Hugging Face Transformers repository" and a "Hugging Face Transformers implementation of GPT-2," but it does not include an explicit statement or link indicating that the authors' own code for their methodology is being released or is publicly available. |
| Open Datasets | Yes | We perform TAPT on the SentEval sentence similarity benchmark (Conneau & Kiela, 2018)... We evaluated the models on questions from the Natural Questions (NQ) benchmark (Kwiatkowski et al., 2019)... We test how kNN-Pretraining affects other NLU tasks, by examining several tasks from the GLUE benchmark (Wang et al., 2018)... |
| Dataset Splits | No | The paper does not explicitly provide specific training, validation, and test dataset splits with percentages or counts. It describes training for a certain number of epochs and evaluates models on zero-shot benchmarks, implying standard splits of those benchmarks, but does not detail its own data partitioning for training and validation. |
| Hardware Specification | No | The paper states 'Experiments were performed with Cloud TPUs and supported by Google's TensorFlow Research Cloud (TFRC),' but it does not specify exact TPU models (e.g., TPU v2, v3) or other detailed hardware specifications like CPU models or memory. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer,' 'Hugging Face Transformers,' and the 'FAISS library,' but it does not provide specific version numbers for these or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We continued training a pretrained RoBERTa-base model for 5 epochs, using the first epoch for learning-rate warmup and examining peak learning rates of {1, 3, 5, 7}×10⁻⁵... AdamW optimizer (with the parameters suggested in the original RoBERTa paper: β₁ = 0.9, β₂ = 0.98, ε = 10⁻⁶ and weight decay of 0.01), with batch sizes of 128 or 256 (depending on model size) and sequences of 256 tokens each... AdamW optimizer (with the parameters suggested in the original GPT-2 paper: β₁ = 0.9, β₂ = 0.95, ε = 10⁻⁸ and weight decay of 0.1), with batch size of 512 and sequences of 256 tokens each. |
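
The hyperparameters quoted in the Experiment Setup row are concrete enough to reconstruct the continued-pretraining optimizer configuration. Below is a minimal sketch, assuming PyTorch and Hugging Face Transformers (libraries the paper names, though without versions); the linear warmup scheduler, the `steps_per_epoch` value, and the specific peak learning rate are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch (not the authors' released code) of the RoBERTa-base
# continued-pretraining setup quoted above: AdamW with beta1=0.9, beta2=0.98,
# eps=1e-6, weight decay 0.01, peak LR swept over {1, 3, 5, 7} x 10^-5,
# 5 epochs with the first epoch used for learning-rate warmup.
import torch
from transformers import RobertaForMaskedLM, get_linear_schedule_with_warmup

model = RobertaForMaskedLM.from_pretrained("roberta-base")

peak_lr = 1e-5  # one point of the {1, 3, 5, 7} x 10^-5 sweep
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)

# 5 epochs total, first epoch used for warmup; steps_per_epoch is hypothetical
# and depends on corpus size, batch size (128 or 256), and sequence length (256).
steps_per_epoch = 1000
num_epochs = 5
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=steps_per_epoch,               # one warmup epoch
    num_training_steps=steps_per_epoch * num_epochs,
)
```

For the GPT-2 experiments, the quoted setup swaps in β₂ = 0.95, ε = 10⁻⁸, weight decay 0.1, and a batch size of 512; the same skeleton applies with those values substituted.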