Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Understanding the Effects of Iterative Prompting on Truthfulness
Authors: Satyapriya Krishna, Chirag Agarwal, Himabindu Lakkaraju
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments explore the intricacies of iterative prompting variants, examining their influence on the accuracy and calibration of model responses. |
| Researcher Affiliation | Academia | Satyapriya Krishna¹, Chirag Agarwal¹, Himabindu Lakkaraju¹. ¹Harvard University. Correspondence to: Satyapriya Krishna <EMAIL>. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | No | No explicit statement about providing open-source code for their methodology or a link to a code repository is found in the paper. |
| Open Datasets | Yes | To empirically analyse the impact of iterative prompting on truthfulness, we use TruthfulQA (Lin et al., 2021), which serves as a benchmark designed to evaluate the truthfulness of LLM responses. |
| Dataset Splits | No | For our experiments, we focused on the multiple-choice test samples of TruthfulQA in which there is only one correct answer out of all the provided options (named mc1_targets in (TruthfulQA, 2021)). |
| Hardware Specification | No | We experimented with OpenAI GPT-3.5 (OpenAI-GPT, 2022) for all our experiments, using the gpt-3.5-turbo-16k-0613 API endpoint. The results for open-source, instruction-tuned models such as Llama-2-70b-chat-hf (Touvron et al., 2023) and Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024) are provided in Appendix B. |
| Software Dependencies | No | We compute the confidence of the prediction by using the logprobs returned for the answer token by the OpenAI API response, similar to the setup used in (Zhang et al., 2023; Hills & Anadkat, 2023). |
| Experiment Setup | Yes | We iteratively prompt the model using various strategies for 10 iterations, aiming to discern the trend in performance with an increasing number of prompt iterations. We set the sampling temperature to 1.0 to allow for sufficient exploration, as commonly employed in setups related to truthfulness (Chen et al., 2023a). |
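The setup described in the last two rows can be sketched as follows. This is a minimal illustration, not the authors' code: `stub_model`, `answer_confidence`, and `iterative_prompting` are hypothetical names, and the stub stands in for a real chat-completions call with logprobs enabled and temperature set to 1.0. The key idea is that the probability of the answer token is recovered as the exponential of its returned log-probability.

```python
import math

def answer_confidence(logprob: float) -> float:
    """Convert the log-probability of the answer token to a confidence in [0, 1]."""
    return math.exp(logprob)

def iterative_prompting(model, question, n_iterations=10):
    """Re-prompt the model n_iterations times, collecting (answer, confidence) pairs.

    `model` is any callable mapping a question to (answer_token, answer_logprob);
    in the paper's setup this would be an API call to gpt-3.5-turbo-16k-0613
    with logprobs requested and temperature 1.0.
    """
    history = []
    for _ in range(n_iterations):
        answer, logprob = model(question)
        history.append((answer, answer_confidence(logprob)))
    return history

# Stub standing in for the API: always answers "A" with probability 0.8.
def stub_model(question):
    return "A", math.log(0.8)

runs = iterative_prompting(stub_model, "Which option is true?", n_iterations=3)
print(runs[0])  # ('A', 0.8) up to floating-point rounding
```

In the actual experiments the model's answer varies across iterations because sampling is stochastic at temperature 1.0, which is what makes the per-iteration trend in accuracy and calibration informative.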