Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling
Authors: Bairu Hou, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang, Yang Zhang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations demonstrate that input clarification ensembling provides accurate and reliable uncertainty quantification on several language processing tasks. (A minimal sketch of the uncertainty decomposition appears after this table.) |
| Researcher Affiliation | Collaboration | UC Santa Barbara; MIT-IBM Watson AI Lab, IBM Research; MIT CSAIL. Correspondence to: Bairu Hou <bairu@ucsb.edu>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data are available at https://github.com/UCSB-NLP-Chang/llm_uncertainty. |
| Open Datasets | Yes | We evaluate the total uncertainty on the Natural Questions (NQ) dataset (Kwiatkowski et al., 2019) and GSM8K (Cobbe et al., 2021). For detecting question ambiguity, we select the AmbigQA dataset (Min et al., 2020). |
| Dataset Splits | Yes | We use the full AmbigInst dataset and randomly sample 200 examples from the validation set of AmbigQA for evaluation. We fine-tune Llama-3-8B-Instruct on the full training set of the AmbigQA dataset... We evaluate the model on the validation set and take the checkpoint that achieves the lowest validation loss (epoch = 2) for testing. |
| Hardware Specification | Yes | We fine-tune Llama-3-8B-Instruct on the full training set of the AmbigQA dataset on 4 NVIDIA H100 80GB HBM3 GPUs. |
| Software Dependencies | No | The paper mentions using "PyTorch Lightning, DeepSpeed Stage 1, and flash-attention 2" but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We train the model with batch size 16, a learning rate of 2e-5, and a cosine learning rate scheduler for 5 epochs. The loss is computed only on the output tokens. (A hedged training sketch follows the table.) |
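The method named in the title ensembles an LLM's predictions over multiple clarifications of a possibly ambiguous input and decomposes the resulting uncertainty. The sketch below assumes the standard ensemble entropy decomposition (total uncertainty is the entropy of the averaged predictive distribution; the gap to the mean per-member entropy measures disagreement across clarifications). It is an illustration under that assumption, not the authors' exact formulation, and the toy distributions are invented.

```python
# Minimal sketch of uncertainty decomposition over input clarifications,
# assuming the standard ensemble entropy decomposition:
#   total = H(mean p);  disagreement-from-ambiguity = total - mean H(p).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

def decompose(answer_dists):
    """answer_dists: one predictive distribution per input clarification."""
    answer_dists = np.asarray(answer_dists, dtype=float)
    total = entropy(answer_dists.mean(axis=0))               # total uncertainty
    expected = np.mean([entropy(p) for p in answer_dists])   # per-clarification average
    ambiguity = total - expected  # uncertainty attributable to input ambiguity
    return total, expected, ambiguity

# Toy example: two clarifications of an ambiguous question lead to confident
# but conflicting answers, so most uncertainty is attributed to ambiguity.
dists = [[0.9, 0.1], [0.1, 0.9]]
print(decompose(dists))  # total ~0.693, expected ~0.325, ambiguity ~0.368
```

On this toy example each clarification yields a confident answer, but the answers disagree, so the decomposition attributes most of the total uncertainty to input ambiguity rather than to per-clarification model uncertainty.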
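The "Experiment Setup", "Hardware Specification", and "Software Dependencies" rows pin down the fine-tuning recipe. Below is a minimal sketch of that recipe; the authors report PyTorch Lightning with DeepSpeed Stage 1 and flash-attention 2, so the Hugging Face Trainer here is a substitution for brevity, and the dataset field names ("question", "answer") are assumptions not confirmed by the paper.

```python
# Hedged sketch of the reported recipe: global batch size 16, learning rate
# 2e-5, cosine schedule, 5 epochs, loss computed only on output tokens.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="bfloat16")

def preprocess(example):
    # Mask prompt tokens with -100 so the causal-LM loss covers only the
    # output tokens (token-boundary alignment is approximate here).
    prompt_ids = tokenizer(example["question"])["input_ids"]
    full_ids = tokenizer(example["question"] + example["answer"])["input_ids"]
    labels = [-100] * len(prompt_ids) + full_ids[len(prompt_ids):]
    return {"input_ids": full_ids, "labels": labels}

args = TrainingArguments(
    output_dir="llama3-ambigqa",
    per_device_train_batch_size=4,  # 4 GPUs x 4 per device = global batch 16
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=5,
    bf16=True,
    # Evaluate and checkpoint per epoch, then keep the epoch with the lowest
    # validation loss (the paper reports epoch 2).
)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_set.map(preprocess),
#                   eval_dataset=val_set.map(preprocess))
# trainer.train()
```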