Can discrete information extraction prompts generalize across language models?

Authors: Nathanaël Carraz Rakotonirina, Roberto Dessì, Fabio Petroni, Sebastian Riedel, Marco Baroni

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a systematic study of the extent to which LM query protocols, that, following current usage, we call prompting methods, generalize across LMs. Extending and confirming prior results, we find that discrete prompts that are automatically induced through an existing optimization procedure (Shin et al., 2020) outperform manually and semi-manually crafted prompts, reaching a good performance level when tested with the same LM used for prompt induction. While the automatically induced discrete prompts also generalize better to other LMs than (semi-)manual prompts and currently popular soft prompts, their overall generalization performance is quite poor. We next show that a simple change to the original training procedure, namely using more than one LM at prompt induction time, leads to discrete prompts that better generalize to new LMs. The proposed procedure, however, is brittle, crucially relying on the right choice of LMs to mix at prompt induction. We finally conduct the first extensive analysis of automatically induced discrete prompts, tentatively identifying a set of properties characterizing the more general prompts, such as a higher incidence of existing English words and robustness to token shuffling and deletion.
Researcher Affiliation | Collaboration | 1 Universitat Pompeu Fabra, 2 Meta AI, 3 Samaya AI, 4 University College London, 5 ICREA
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The code to reproduce our analysis is available at https://github.com/ncarraz/prompt_generalization.
Open Datasets | Yes | We focus on the task of slot-filling which, since its introduction in LM evaluation through the LAMA benchmark (Petroni et al., 2019a), has been extensively used to probe the knowledge contained in LMs (AlKhamissi et al., 2022). More specifically, we use the T-REx split (Elsahar et al., 2018) of LAMA.
Dataset Splits | No | The paper mentions training and test sets but does not explicitly describe a separate validation split, which would be needed for reproduction.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper states that the pre-trained models and prompting methods are publicly available, but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | In our experiments, we use 5-token prompts and run the algorithm for 1,000 iterations. ... Except for the learning rate, which is increased to 3e-2 for the T5 models for proper convergence, we use the same hyperparameters as the original implementation. We initialize vectors randomly.
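The Open Datasets row above refers to LAMA-style slot filling: a relation template with the subject filled in and the object replaced by a mask token, which the LM must predict. Below is a minimal sketch of such a cloze query using a Hugging Face fill-mask pipeline; the model name and template are assumptions for illustration and this is not the paper's evaluation code.

```python
# Minimal LAMA-style slot-filling illustration (not the paper's evaluation code).
# Assumptions: model name and template are chosen only for the example.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")

# "capital of" style cloze query: the subject is given, the object is masked.
for pred in fill("The capital of France is [MASK].", top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```

The Research Type row summarizes the paper's key procedural idea: inducing discrete prompts with more than one LM in the loop so that they transfer better to unseen LMs. The sketch below illustrates only that idea and is not the authors' implementation: it replaces the gradient-guided candidate search of AutoPrompt (Shin et al., 2020) with a greedy per-position search over a tiny hand-picked candidate vocabulary, and it assumes two masked LMs that share a tokenizer; the model names and toy relation data are also assumptions.

```python
# Simplified multi-LM discrete prompt search (illustrative sketch only).
# Assumptions: two masked LMs sharing a WordPiece vocabulary, a toy
# "capital of" relation, a hand-picked candidate vocabulary, and greedy
# per-position search instead of AutoPrompt's gradient-guided candidates.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAMES = ["bert-base-cased", "bert-large-cased"]  # assumed LM mix
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAMES[0])
models = [AutoModelForMaskedLM.from_pretrained(n).eval() for n in MODEL_NAMES]

# Toy training pairs for one relation (not the T-REx training split).
train = [("France", "Paris"), ("Italy", "Rome"), ("Japan", "Tokyo")]

PROMPT_LEN = 5
prompt = ["the"] * PROMPT_LEN  # discrete trigger tokens being searched
candidates = ["capital", "city", "of", "country", "is", "the", "in", "its"]

def avg_loss(prompt_tokens):
    """Cross-entropy of the gold object, averaged over examples and over LMs."""
    total, count = 0.0, 0
    for subj, obj in train:
        text = f"{subj} {' '.join(prompt_tokens)} {tokenizer.mask_token}."
        inputs = tokenizer(text, return_tensors="pt")
        mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
        gold = torch.tensor([tokenizer.convert_tokens_to_ids(obj)])
        for model in models:
            with torch.no_grad():
                logits = model(**inputs).logits[0, mask_pos].unsqueeze(0)
            total += torch.nn.functional.cross_entropy(logits, gold).item()
            count += 1
    return total / count

# Greedy coordinate search: replace one prompt position at a time whenever the
# loss averaged across the mixed LMs improves.
for position in range(PROMPT_LEN):
    best_tok, best = prompt[position], avg_loss(prompt)
    for tok in candidates:
        trial = list(prompt)
        trial[position] = tok
        loss = avg_loss(trial)
        if loss < best:
            best_tok, best = tok, loss
    prompt[position] = best_tok

print("induced prompt:", " ".join(prompt))
```

Scoring candidates against the averaged loss of several LMs, rather than a single one, is the "mixing" step the paper proposes; everything else in the sketch is simplified.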
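For readability, the values quoted in the Experiment Setup row can be gathered into a single configuration sketch. The key names and grouping below are editorial assumptions; the quoted text does not tie every value to a specific prompting method, and nothing beyond the stated numbers is filled in.

```python
# Setup values quoted in the Experiment Setup row, collected in one place.
# Key names are assumptions; omitted hyperparameters are not guessed.
SETUP = {
    "prompt_length_tokens": 5,   # "we use 5-token prompts"
    "search_iterations": 1000,   # "run the algorithm for 1,000 iterations"
    "t5_learning_rate": 3e-2,    # raised for T5 models "for proper convergence"
    "vector_init": "random",     # "We initialize vectors randomly"
    # all other hyperparameters follow the original implementation
}
```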
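A usage note on the sketches above: they are meant only to make the quoted rows concrete. For the actual prompt-induction code, data splits, and hyperparameters, the repository linked in the Open Source Code row (https://github.com/ncarraz/prompt_generalization) is the authoritative reference.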