BERTs are Generative In-Context Learners

Authors: David Samuel

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our evaluation reveals that the masked and causal language models behave very differently, as they clearly outperform each other on different categories of tasks. These complementary strengths suggest that the field's focus on causal models for in-context learning may be limiting: both architectures can develop these capabilities, but with distinct advantages, pointing toward promising hybrid approaches that combine the strengths of both objectives."
Researcher Affiliation | Academia | David Samuel, Language Technology Group, University of Oslo, davisamu@uio.no
Pseudocode | No | The paper illustrates methods with diagrams (Figure 2) but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "We have converted the officially available DeBERTa checkpoint into a Hugging Face (Wolf et al., 2020) implementation of AutoModelForCausalLM (following the method in Section 2.1), and released it openly at https://hf.co/ltg/deberta-xxlarge-fixed." (A minimal loading sketch follows the table.)
Open Datasets | Yes | "DeBERTa was pretrained on a relatively small and clean text corpus totalling 78GB of data after deduplication; the corpus is comprised of the English Wikipedia (12GB), BookCorpus (6GB; Zhu et al., 2015), OpenWebText (38GB; Gokaslan and Cohen, 2019), and STORIES (31GB; Trinh and Le, 2019)."
Dataset Splits | Yes | "For each evaluated sample, the example demonstrations are randomly selected (without replacement) from the training set of each task; if the training set is not available, we sample from the only available dataset split, making sure not to select the same sample as the evaluated one. We format these examples using the respective prompt templates and concatenate them together, joined by two newline characters. The numbers of shots used for each task are given in Appendix F." (A prompt-construction sketch follows the table.)
Hardware Specification | No | The paper states: "This paper does not involve any training, only inference with negligible cost. However, we still give the compute cost of the pretrained language models (even though they were not pretrained as part of this paper)." No specific hardware details for the experiments performed in this paper are provided.
Software Dependencies | Yes | "We have converted the officially available DeBERTa checkpoint into a Hugging Face (Wolf et al., 2020) implementation of AutoModelForCausalLM... The models are then asked to complete the prompt using beam search decoding with 4 beams and the default Hugging Face hyperparameters, and the generation is stopped after producing the special newline character \n." (Footnote 8: https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/text_generation#transformers.GenerationMixin.generate; a generation sketch follows the table.)
Experiment Setup | Yes | "Generation is performed with beam search (4 candidate beams), and ranking uses the modified PLL scores (and the normalized unconditional probability of completions, P(completion | context) / P(completion | answer context), for ARC and OpenBookQA), again replicating the choices for GPT-3. We also use the exact same prompt templates, with the exception of the machine translation task: its template did not produce any meaningful output, and so we decided to use the simple prompt template from Garcia et al. (2023) instead." (A scoring sketch for the normalized option ranking follows the table.)
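
The Open Source Code row points to the converted checkpoint at https://hf.co/ltg/deberta-xxlarge-fixed. As a minimal sketch (not the authors' conversion code), the released model should be loadable through the standard Hugging Face Auto classes; the trust_remote_code flag is an assumption, needed only if the causal wrapper around DeBERTa ships as custom model code in the repository.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the publicly released checkpoint referenced in the paper.
tokenizer = AutoTokenizer.from_pretrained("ltg/deberta-xxlarge-fixed")
model = AutoModelForCausalLM.from_pretrained(
    "ltg/deberta-xxlarge-fixed",
    trust_remote_code=True,  # assumption: the causal wrapper may be custom model code
)
model.eval()
```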
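
The Dataset Splits row describes how few-shot demonstrations are sampled and concatenated. Below is a minimal sketch of that procedure under stated assumptions: `format_example` is a hypothetical per-task template function and `example_pool` is an in-memory list of examples from the training split (or the only available split).

```python
import random

def build_few_shot_prompt(eval_example, example_pool, format_example, num_shots, seed=0):
    # Sample demonstrations without replacement, never reusing the evaluated example.
    rng = random.Random(seed)
    candidates = [ex for ex in example_pool if ex is not eval_example]
    demonstrations = rng.sample(candidates, num_shots)

    # Format each demonstration with the task template, join with two newline
    # characters, then append the evaluated example without its answer.
    shots = "\n\n".join(format_example(ex, include_answer=True) for ex in demonstrations)
    query = format_example(eval_example, include_answer=False)
    return shots + "\n\n" + query
```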
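
The Software Dependencies and Experiment Setup rows describe generation with beam search (4 beams) and default generate() settings, stopped at the first newline. A sketch of that setup follows; treating the newline token id as an end-of-sequence id is an approximation of "stopped after producing the special newline character", since the exact stopping mechanism is not specified in the quoted text, and the `max_new_tokens` cap is an assumption.

```python
import torch

def generate_completion(model, tokenizer, prompt, max_new_tokens=64):
    # Beam search with 4 beams and otherwise default generate() hyperparameters.
    inputs = tokenizer(prompt, return_tensors="pt")
    newline_ids = tokenizer("\n", add_special_tokens=False)["input_ids"]
    stop_id = newline_ids[-1] if newline_ids else tokenizer.eos_token_id

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            num_beams=4,
            eos_token_id=stop_id,        # approximation: stop when "\n" is produced
            max_new_tokens=max_new_tokens,  # assumption: a cap not stated in the paper
        )
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```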
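
The Experiment Setup row mentions ranking answer options by the normalized unconditional probability P(completion | context) / P(completion | answer context) for ARC and OpenBookQA. The sketch below shows only this GPT-3-style normalization with ordinary left-to-right scoring; the paper itself scores the masked model with modified PLL scores, which this sketch does not reproduce, and the "Answer: " string follows the GPT-3 convention rather than anything quoted in the report. Token boundaries between context and completion are treated approximately.

```python
import torch
import torch.nn.functional as F

def completion_logprob(model, tokenizer, context, completion):
    # Sum of log-probabilities of the completion tokens given the context
    # (special tokens omitted for brevity; boundaries are approximate).
    ctx_ids = tokenizer(context, add_special_tokens=False, return_tensors="pt")["input_ids"]
    full_ids = tokenizer(context + completion, add_special_tokens=False, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids=full_ids).logits
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    per_token = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return per_token[:, ctx_ids.shape[1] - 1:].sum().item()

def rank_by_normalized_probability(model, tokenizer, context, options, answer_context="Answer: "):
    # GPT-3-style normalization: log P(option | context) - log P(option | answer_context).
    scores = [
        completion_logprob(model, tokenizer, context, opt)
        - completion_logprob(model, tokenizer, answer_context, opt)
        for opt in options
    ]
    return max(range(len(options)), key=lambda i: scores[i])
```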