Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Understanding the Effects of Iterative Prompting on Truthfulness
Authors: Satyapriya Krishna, Chirag Agarwal, Himabindu Lakkaraju
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments explore the intricacies of iterative prompting variants, examining their influence on the accuracy and calibration of model responses. |
| Researcher Affiliation | Academia | Satyapriya Krishna¹, Chirag Agarwal¹, Himabindu Lakkaraju¹. ¹Harvard University. Correspondence to: Satyapriya Krishna <EMAIL>. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | No | No explicit statement about providing open-source code for their methodology or a link to a code repository is found in the paper. |
| Open Datasets | Yes | To empirically analyse the impact of iterative prompting on truthfulness, we use TruthfulQA (Lin et al., 2021), which serves as a benchmark designed to evaluate the truthfulness of LLM responses. |
| Dataset Splits | No | For our experiments, we focused on the multiple-choice test samples of TruthfulQA in which there is only one correct answer out of all the provided options (named mc1_targets in (TruthfulQA, 2021)). |
| Hardware Specification | No | We experimented with OpenAI GPT-3.5 (OpenAI-GPT, 2022) for all our experiments, using the gpt-3.5-turbo-16k-0613 API endpoint. The results for open-source, instruction-tuned models such as Llama-2-70b-chat-hf (Touvron et al., 2023) and Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024) are provided in Appendix B. |
| Software Dependencies | No | We compute the confidence of the prediction by using the logprobs returned for the answer token by the OpenAI API response, similar to the setup used in (Zhang et al., 2023; Hills & Anadkat, 2023). |
| Experiment Setup | Yes | We iteratively prompt the model using various strategies for 10 iterations, aiming to discern the trend in performance with an increasing number of prompt iterations. We set the sampling temperature to 1.0 to allow for sufficient exploration, as commonly employed in setups related to truthfulness (Chen et al., 2023a). |
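The setup described in the last two rows can be sketched as follows. This is a minimal illustration, not the authors' code: `stub_model`, `answer_confidence`, and `iterative_prompting` are hypothetical names, and the stub stands in for a real chat-completions call with logprobs enabled and temperature set to 1.0. The key idea is that the probability of the answer token is recovered as the exponential of its returned log-probability.

```python
import math

def answer_confidence(logprob: float) -> float:
    """Convert the log-probability of the answer token to a confidence in [0, 1]."""
    return math.exp(logprob)

def iterative_prompting(model, question, n_iterations=10):
    """Re-prompt the model n_iterations times, collecting (answer, confidence) pairs.

    `model` is any callable mapping a question to (answer_token, answer_logprob);
    in the paper's setup this would be an API call to gpt-3.5-turbo-16k-0613
    with logprobs requested and temperature 1.0.
    """
    history = []
    for _ in range(n_iterations):
        answer, logprob = model(question)
        history.append((answer, answer_confidence(logprob)))
    return history

# Stub standing in for the API: always answers "A" with probability 0.8.
def stub_model(question):
    return "A", math.log(0.8)

runs = iterative_prompting(stub_model, "Which option is true?", n_iterations=3)
print(runs[0])  # ('A', 0.8) up to floating-point rounding
```

In the actual experiments the model's answer varies across iterations because sampling is stochastic at temperature 1.0, which is what makes the per-iteration trend in accuracy and calibration informative.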