Prompting GPT-3 To Be Reliable

Authors: Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Lee Boyd-Graber, Lijuan Wang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our systematic empirical study not only sheds new insights on the reliability of prompting LLMs, but more importantly, our prompting strategies can help practitioners more reliably use LLMs like GPT-3. |
| Researcher Affiliation | Collaboration | 1 University of Maryland 2 Microsoft |
| Pseudocode | No | The paper describes its methods in prose and through experimental setups, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release all processed datasets, evaluation scripts, and model predictions. https://github.com/NoviScl/GPT3-Reliability |
| Open Datasets | Yes | For domain shift, MRQA (Fisch et al., 2019) trains on six machine reading datasets from the source domain and tests on six different target domains; for perturbations, AdvGLUE (Wang et al., 2021) crafts adversarial versions of GLUE (Wang et al., 2018) based on automatic adversarial perturbations and human filtering, and Contrast Sets (Gardner et al., 2020) are expert-authored minimal edits that change the label; for spurious correlation, HANS (McCoy et al., 2019) and PAWS (Zhang et al., 2019) are challenge sets designed for models trained on MNLI and QQP. |
| Dataset Splits | No | The paper describes using test sets and mentions some data characteristics, but it does not specify explicit training, validation, and test splits (e.g., percentages or counts) for reproducibility, nor does it refer to specific predefined splits with citations for all datasets. |
| Hardware Specification | No | The paper mentions using GPT-3 models (CODE-DAVINCI-002, Text-Davinci-001, Text-Curie-001) but does not provide details about the specific hardware (e.g., GPU models, CPU types, or memory) used by the authors to run their experiments or evaluations. |
| Software Dependencies | No | The paper mentions various models and tools used (e.g., GPT-3, DPR-BERT, T5, Contriever) but does not specify the version numbers of any general software dependencies or libraries (e.g., Python, PyTorch, TensorFlow) that would be needed to reproduce the experimental environment. |
| Experiment Setup | Yes | For each of these settings, we evaluate a simple prompting strategy by sampling examples from the source domains (for MRQA, we use a fixed prompt consisting of eight randomly sampled examples from the source domain on all target datasets; for perturbations and spurious correlation, we randomly sample 16 demos from the original clean training data from GLUE, MNLI, and QQP respectively). |
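The demo-sampling strategy quoted in the Experiment Setup row can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the `build_fewshot_prompt` helper, the `Input:`/`Label:` template, and the toy training data are all hypothetical; only the idea of randomly sampling k demonstrations from clean training data into a fixed prompt comes from the paper.

```python
import random

def build_fewshot_prompt(train_examples, test_input, k=16, seed=0):
    """Randomly sample k demonstrations from clean training data and
    format them, followed by the test input, into a single prompt."""
    rng = random.Random(seed)  # fixed seed -> a fixed prompt across test sets
    demos = rng.sample(train_examples, k)
    blocks = [f"Input: {ex['text']}\nLabel: {ex['label']}" for ex in demos]
    blocks.append(f"Input: {test_input}\nLabel:")  # model completes the label
    return "\n\n".join(blocks)

# Hypothetical toy training data standing in for GLUE/MNLI/QQP examples.
train = [{"text": f"example {i}", "label": "yes" if i % 2 else "no"}
         for i in range(40)]
prompt = build_fewshot_prompt(train, "new sentence to classify", k=4)
print(prompt)
```

Because the seed is fixed, the same sampled demonstrations are reused for every test instance, matching the "fixed prompt" setup the row describes for MRQA.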