Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in LLMs

Authors: Zhiyuan Hu, Chumin Liu, Xidong Feng, Yilun Zhao, See-Kiong Ng, Anh Tuan Luu, Junxian He, Pang Wei Koh, Bryan Hooi

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments on medical diagnosis, troubleshooting, and the 20 Questions game, UoT achieves an average performance improvement of 38.1% in the rate of successful task completion across multiple LLMs compared with direct prompting, and also improves efficiency (i.e., the number of questions needed to complete the task).
Researcher Affiliation | Academia | Zhiyuan Hu (National University of Singapore), Chumin Liu (Nanyang Technological University), Xidong Feng (University College London), Yilun Zhao (Yale University), See-Kiong Ng (National University of Singapore), Anh Tuan Luu (Nanyang Technological University), Junxian He (The Hong Kong University of Science and Technology), Pang Wei Koh (University of Washington), Bryan Hooi (National University of Singapore)
Pseudocode | No | The paper describes the UoT algorithm in prose in Section 2 'Methodology' and illustrates it with Figure 2, but it does not present a formal pseudocode block or algorithm listing. (A hedged reconstruction of the planning loop is sketched after this table.)
Open Source Code | Yes | "Our code is released" at https://github.com/zhiyuanhubj/UoT
Open Datasets | Yes | To close this gap, we first introduce a benchmark comprising 5 datasets on 3 tasks: 20 Questions, a simplified medical diagnosis task, and a basic troubleshooting task. [...] We use two datasets, Common (collected by us; refer to Appendix I.2 for more details) and Things [14], including 111 and 1854 items respectively. For Medical Diagnosis, [...] we use two datasets: DX [37], with 104 doctor-patient dialogues and 5 diseases in the test set, and MedDG [19], with over 17K conversations across 15 disease types. For Troubleshooting, [...] Raghu et al. [27] introduce FloDial with 894 dialogues.
Dataset Splits | No | The paper mentions using a 'test set' for the DX dataset in Section 3.1, but it does not provide explicit training, validation, or test splits (e.g., percentages or sample counts) for all datasets, describe a general splitting methodology, or point to standard predefined splits for every dataset used.
Hardware Specification | No | The paper mentions 'computation resources and OpenAI API budgets' and 'GPT-4 token use' but does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper lists the large language models used in experiments (e.g., Llama-3-70B-Instruct, Mistral-Large, Gemini-1.5-Pro, GPT-4), but it does not pin specific model versions or list any ancillary software dependencies (e.g., programming language, libraries, frameworks) required to reproduce the experiments.
Experiment Setup | Yes | Empirically, we set the plan (simulation) steps as 3 and the number of questions during the simulation is 3. The hyperparameter λ in uncertainty-based reward is 0.4.
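
The 'Experiment Setup' row quotes three hyperparameters: 3 plan (simulation) steps, 3 candidate questions per step, and λ = 0.4 in the uncertainty-based reward. The paper's exact reward formula is not quoted in this review, so the sketch below is only an illustrative assumption: it models the reward of a yes/no question as its one-step entropy reduction over a uniform belief on the remaining possibility set (the natural reading for 20 Questions). The function names and the set-valued question model are ours, not the authors'.

```python
import math

def entropy(n: int) -> float:
    """Entropy in bits of a uniform belief over n remaining candidates."""
    return math.log2(n) if n > 1 else 0.0

def information_gain(candidates: frozenset, yes_set: frozenset) -> float:
    """Expected entropy reduction of one yes/no question, where the question
    is modeled as the subset of candidates for which the answer is 'yes'.
    (Assumed reward form; the paper's exact formula may differ.)"""
    n, n_yes = len(candidates), len(candidates & yes_set)
    n_no = n - n_yes
    if n_yes == 0 or n_no == 0:
        return 0.0  # a question that cannot split the remaining set carries no information
    p = n_yes / n
    return entropy(n) - (p * entropy(n_yes) + (1 - p) * entropy(n_no))
```

Under this toy model, a question that splits the remaining candidates close to 50/50 maximizes the gain, which matches the paper's motivation for uncertainty-aware question selection.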
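Likewise, since the paper provides no formal algorithm listing (see the 'Pseudocode' row), here is a hedged reconstruction of the planning loop in the same toy setting: simulate a shallow tree of questions (depth = the quoted 3 plan steps), accumulate the expected reward along both answer branches, and ask the question with the best simulated subtree. In the real method the candidate questions at each node are generated by the LLM; here a fixed pool of set-valued questions stands in. Treating λ as a discount on propagated rewards is an assumption from the prose description, not the authors' code.

```python
def subtree_reward(candidates: frozenset, questions: list[frozenset],
                   depth: int, lam: float = 0.4) -> float:
    """Best accumulated reward reachable from this belief state when
    simulating up to `depth` further questions (the paper uses depth = 3)."""
    if depth == 0 or len(candidates) <= 1:
        return 0.0
    best = 0.0
    for yes_set in questions:
        n_yes = len(candidates & yes_set)
        if n_yes in (0, len(candidates)):
            continue  # degenerate question: skip
        p = n_yes / len(candidates)
        # Expected continuation over the 'yes' and 'no' branches,
        # discounted by lambda before being propagated up the tree (assumed).
        cont = (p * subtree_reward(candidates & yes_set, questions, depth - 1, lam)
                + (1 - p) * subtree_reward(candidates - yes_set, questions, depth - 1, lam))
        best = max(best, information_gain(candidates, yes_set) + lam * cont)
    return best

def select_question(candidates: frozenset, questions: list[frozenset],
                    depth: int = 3, lam: float = 0.4) -> frozenset:
    """Ask the question whose simulated subtree accumulates the most reward."""
    def score(yes_set: frozenset) -> float:
        n_yes = len(candidates & yes_set)
        if n_yes in (0, len(candidates)):
            return 0.0
        p = n_yes / len(candidates)
        cont = (p * subtree_reward(candidates & yes_set, questions, depth - 1, lam)
                + (1 - p) * subtree_reward(candidates - yes_set, questions, depth - 1, lam))
        return information_gain(candidates, yes_set) + lam * cont
    return max(questions, key=score)
```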