Understanding the Effects of Iterative Prompting on Truthfulness
Authors: Satyapriya Krishna, Chirag Agarwal, Himabindu Lakkaraju
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments explore the intricacies of iterative prompting variants, examining their influence on the accuracy and calibration of model responses. |
| Researcher Affiliation | Academia | Satyapriya Krishna 1 Chirag Agarwal 1 Himabindu Lakkaraju 1 1Harvard University. Correspondence to: Satyapriya Krishna <skrishna@g.harvard.edu>. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | No | No explicit statement about providing open-source code for their methodology or a link to a code repository is found in the paper. |
| Open Datasets | Yes | To empirically analyse the impact of iterative prompting on truthfulness, we use TruthfulQA (Lin et al., 2021), which serves as a benchmark designed to evaluate the truthfulness of LLM responses. |
| Dataset Splits | No | For our experiments, we focused on the multiple-choice test samples of TruthfulQA in which there is only one correct answer out of all the provided options (named mc1_targets in (TruthfulQA, 2021)). |
| Hardware Specification | No | We experimented with OpenAI GPT-3.5 (OpenAI-GPT, 2022) for all our experiments, accessed via the gpt-3.5-turbo-16k-0613 API endpoint. The results for open-source, instruction-tuned models such as Llama-2-70b-chat-hf (Touvron et al., 2023) and Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024) are provided in Appendix B. |
| Software Dependencies | No | We compute the confidence of the prediction by using the logprobs returned for the answer token by the OpenAI API response, similar to the setup used in (Zhang et al., 2023; Hills & Anadkat, 2023). |
| Experiment Setup | Yes | We iteratively prompt the model using various strategies for 10 iterations, aiming to discern the trend in performance with an increasing number of prompt iterations. We set the sampling temperature to 1.0 to allow for sufficient exploration, as is commonly employed in setups related to truthfulness (Chen et al., 2023a). |
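The paper releases no code, but the setup the table describes (10 prompt iterations, confidence derived from the answer token's log-probability) can be sketched as follows. This is a minimal, hedged reconstruction, not the authors' implementation: `query_model` is a hypothetical stand-in for the actual gpt-3.5-turbo-16k-0613 chat-completions call at temperature 1.0, and the feedback format between iterations is an assumption.

```python
import math
from typing import Callable, List, Optional, Tuple

def token_confidence(logprob: float) -> float:
    """Convert an answer token's log-probability (as returned in the
    API's `logprobs` field) into a confidence score in [0, 1]."""
    return math.exp(logprob)

def iterative_prompting(
    query_model: Callable[[str, List[str], Optional[str]], Tuple[str, float]],
    question: str,
    options: List[str],
    n_iterations: int = 10,  # matches the paper's 10 prompt iterations
) -> List[Tuple[str, float]]:
    """Re-prompt the model repeatedly, feeding its previous answer back
    in, and record (answer, confidence) at each iteration."""
    history: List[Tuple[str, float]] = []
    prev_answer: Optional[str] = None
    for _ in range(n_iterations):
        answer, logprob = query_model(question, options, prev_answer)
        history.append((answer, token_confidence(logprob)))
        prev_answer = answer
    return history

def mock_model(question, options, prev_answer):
    # Hypothetical stand-in for the real API call; returns the chosen
    # option and the log-probability of its answer token.
    return options[0], -0.105

history = iterative_prompting(
    mock_model,
    "What happens if you crack your knuckles a lot?",
    ["Nothing in particular", "You will develop arthritis"],
)
```

In a real run, `mock_model` would be replaced by a function that sends the question, the answer options, and the model's prior answer to the OpenAI API with `temperature=1.0` and `logprobs` enabled, then extracts the answer token's log-probability from the response.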