Understanding the Effects of Iterative Prompting on Truthfulness

Authors: Satyapriya Krishna, Chirag Agarwal, Himabindu Lakkaraju

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments explore the intricacies of iterative prompting variants, examining their influence on the accuracy and calibration of model responses.
Researcher Affiliation | Academia | Satyapriya Krishna, Chirag Agarwal, Himabindu Lakkaraju (Harvard University). Correspondence to: Satyapriya Krishna <skrishna@g.harvard.edu>.
Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper.
Open Source Code | No | No explicit statement about providing open-source code for their methodology, or a link to a code repository, is found in the paper.
Open Datasets | Yes | To empirically analyze the impact of iterative prompting on truthfulness, we use TruthfulQA (Lin et al., 2021), which serves as a benchmark designed to evaluate the truthfulness of LLM responses.
Dataset Splits | No | For our experiments, we focused on the multiple-choice test samples of TruthfulQA in which there is only one correct answer among all the provided options (named mc1_targets in (TruthfulQA, 2021)).
Hardware Specification | No | We experimented with OpenAI GPT-3.5 (OpenAI-GPT, 2022) for all our experiments, using the gpt-3.5-turbo-16k-0613 API endpoint. The results for open-source, instruction-tuned models such as Llama-2-70b-chat-hf (Touvron et al., 2023) and Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024) are provided in Appendix B.
Software Dependencies | No | We compute the confidence of the prediction using the logprobs returned for the answer token by the OpenAI API response, similar to the setup used in (Zhang et al., 2023; Hills & Anadkat, 2023).
Experiment Setup | Yes | We iteratively prompt the model using various strategies for 10 iterations, aiming to discern the trend in performance with an increasing number of prompt iterations. We set the sampling temperature to 1.0 to allow for sufficient exploration, as commonly employed in setups related to truthfulness (Chen et al., 2023a).
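Since the paper releases no code, the experiment setup above can only be approximated. The following is a minimal sketch of that loop, assuming a hypothetical `ask` callable in place of the actual OpenAI API call and prompt templates; the confidence computation (exponentiating the answer token's logprob) follows the setup the paper cites.

```python
# Sketch of the iterative-prompting evaluation described above: re-prompt
# the model for 10 iterations and score confidence from the answer
# token's logprob. The `ask` callable, prompt handling, and answer
# parsing are hypothetical stand-ins, not the authors' code.
import math

def token_confidence(logprob):
    """Convert an answer-token logprob (as returned in the API's
    logprobs field) into a probability in [0, 1]."""
    return math.exp(logprob)

def iterative_prompt(ask, question, n_iters=10):
    """Query the model n_iters times on the same question, keeping a
    running history of answers. With the sampling temperature at 1.0,
    the sampled answer (and its confidence) can drift across
    iterations, which is the trend the paper measures."""
    results, history = [], []
    for _ in range(n_iters):
        answer, logprob = ask(question, history)  # one model call
        history.append(answer)
        results.append((answer, token_confidence(logprob)))
    return results
```

In a real run, `ask` would wrap a chat-completion request with `logprobs` enabled and return the sampled answer option together with the logprob of its token.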