Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models
Authors: Zhiyuan Hu, Chumin Liu, Xidong Feng, Yilun Zhao, See-Kiong Ng, Anh Tuan Luu, Junxian He, Pang Wei Koh, Bryan Hooi
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments on medical diagnosis, troubleshooting, and the 20 Questions game, UoT achieves an average performance improvement of 38.1% in the rate of successful task completion across multiple LLMs compared with direct prompting, and also improves efficiency (i.e., the number of questions needed to complete the task). |
| Researcher Affiliation | Academia | Zhiyuan Hu¹, Chumin Liu², Xidong Feng³, Yilun Zhao⁴, See-Kiong Ng¹, Anh Tuan Luu², Junxian He⁵, Pang Wei Koh⁶, Bryan Hooi¹; ¹National University of Singapore, ²Nanyang Technological University, ³University College London, ⁴Yale University, ⁵The Hong Kong University of Science and Technology, ⁶University of Washington |
| Pseudocode | No | The paper describes the UoT algorithm in prose within Section 2 'Methodology' and illustrates it with Figure 2, but it does not present a formal pseudocode block or algorithm listing. |
| Open Source Code | Yes | Our code is released at https://github.com/zhiyuanhubj/UoT |
| Open Datasets | Yes | To close this gap, we first introduce a benchmark comprising 5 datasets on 3 tasks: 20 Questions, a simplified medical diagnosis task, and a basic troubleshooting task. [...] We use two datasets, Common (collected by us; refer to Appendix I.2 for more details) and Things [14], including 111 and 1854 items respectively. In Medical Diagnosis, [...] We use two datasets: DX [37], with 104 doctor-patient dialogues and 5 diseases in the test set, and MedDG [19] with over 17K conversations across 15 disease types. Troubleshooting [...] Raghu et al. [27] introduce FloDial with 894 dialogues |
| Dataset Splits | No | The paper mentions using a 'test set' for the DX dataset in Section 3.1, but it does not provide explicit train/validation/test splits (e.g., percentages or sample counts) for all datasets used, nor does it describe a general splitting methodology or point to standard predefined splits. |
| Hardware Specification | No | The paper mentions 'computation resources and OpenAI API budgets' and 'GPT-4 token use' but does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper lists various large language models used in experiments (e.g., 'Llama-3-70B-Instruct', 'Mistral-Large', 'Gemini-1.5-Pro', 'GPT-4'), but it does not provide specific version numbers for these models or for any other ancillary software dependencies (e.g., programming languages, libraries, frameworks) required to reproduce the experiments. |
| Experiment Setup | Yes | Empirically, we set the planning (simulation) steps to 3 and the number of questions during the simulation to 3. The hyperparameter λ in the uncertainty-based reward is 0.4. (A hedged sketch of this reward follows the table.) |
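
The setup row above names a 3-step simulation and an uncertainty-based reward weighted by λ = 0.4; the authors' released repository is the authoritative implementation. Below is a minimal sketch, assuming a uniform prior over the remaining possibility set and a hypothetical λ-weighted propagation rule. The function names (`binary_entropy`, `information_gain`, `accumulated_reward`) and the exact combination rule are our assumptions, not taken from the paper or its code.

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def information_gain(n_yes: int, n_total: int) -> float:
    """Expected entropy reduction from a yes/no question that splits the
    current possibility set (size n_total) into n_yes vs. n_total - n_yes,
    under a uniform prior: maximal (1 bit) at an even split."""
    return binary_entropy(n_yes / n_total)

def accumulated_reward(n_yes: int, n_total: int,
                       child_rewards: list[float],
                       lam: float = 0.4) -> float:
    """Hypothetical propagation over the 3-step simulation tree described
    in the setup row: a question node's reward is its own information gain
    plus a lambda-discounted best reward among its simulated descendants.
    lambda = 0.4 matches the reported hyperparameter; this combination
    rule is an assumption, not the paper's formula."""
    own_gain = information_gain(n_yes, n_total)
    future = max(child_rewards) if child_rewards else 0.0
    return own_gain + lam * future

if __name__ == "__main__":
    # A question affirming 8 of 16 remaining candidates earns the full
    # 1-bit immediate gain plus a discounted share of the best descendant.
    print(accumulated_reward(8, 16, child_rewards=[0.9, 0.6]))  # 1.0 + 0.4 * 0.9 = 1.36
    # A lopsided 2-of-16 split is worth much less (~0.544 bits on its own).
    print(accumulated_reward(2, 16, child_rewards=[]))
```

Under this reading, λ trades off a question's immediate uncertainty reduction against gains that only materialize deeper in the simulated dialogue: questions that roughly halve the candidate set dominate, which is the classic 20-Questions heuristic the benchmark tasks are built around.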