Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation
Authors: Niels Mündler, Jingxuan He, Slobodan Jenko, Martin Vechev
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our primary evaluation task is open-domain text generation, but we also demonstrate the applicability of our approach to shorter question answering. Our analysis reveals the prevalence of self-contradictions, e.g., in 17.7% of all sentences produced by ChatGPT. We then propose a novel prompting-based framework designed to effectively detect and mitigate self-contradictions. Our detector achieves high accuracy, e.g., around 80% F1 score when prompting ChatGPT. The mitigation algorithm iteratively refines the generated text to remove contradictory information while preserving text fluency and informativeness. |
| Researcher Affiliation | Academia | Niels Mündler, Jingxuan He, Slobodan Jenko & Martin Vechev Department of Computer Science, ETH Zurich, Switzerland nmuendler@student.ethz.ch, jingxuan.he@inf.ethz.ch, sjenko@student.ethz.ch, martin.vechev@inf.ethz.ch |
| Pseudocode | Yes | Algorithm 1: Triggering self-contradictions for text generated by gLM. Algorithm 2: Mitigating self-contradictions for one pair of gLM-generated sentences. Algorithm 3: Iterative mitigation of self-contradictions for text generated by gLM. |
| Open Source Code | Yes | Our code and datasets are publicly available on GitHub at https://github.com/eth-sri/ChatProtect. |
| Open Datasets | Yes | Our code and datasets are publicly available on GitHub at https://github.com/eth-sri/ChatProtect. To construct evaluation data for open-domain text generation, we sample gLM to generate encyclopedic text descriptions for Wikipedia entities. |
| Dataset Splits | Yes | For validation, we use 12 entities. |
| Hardware Specification | Yes | We run Vicuna-13B with the FastChat API (LMSYS, 2023a) on NVIDIA A100 80GB GPUs. We use the service of Together AI (Together AI, 2023) for running Llama2-70B-Chat. |
| Software Dependencies | No | The paper mentions using the OpenAI API, FastChat API, and Together AI services for running models, but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or specific library versions). |
| Experiment Setup | Yes | When generating the text descriptions, we use temperature 1.0 because it aligns with practical aspects of LM usage... When running gLM.gen_sentence, aLM.detect, and aLM.revise, we use temperature 0 for ChatGPT and GPT-4. This is because we desire maximum confidence for these functions. |
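The detect-and-revise loop described in the pseudocode and experiment-setup rows above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, function names (`detect_contradiction`, `revise`, `mitigate`), and the `lm` callable are all assumptions; the real system in the ChatProtect repository prompts ChatGPT/GPT-4 at temperature 0 for detection and revision, as the setup row states.

```python
# Hedged sketch of the paper's detect-then-revise idea (cf. Algorithms 2-3).
# `lm` stands in for any chat model call; prompts here are illustrative only.

def detect_contradiction(lm, context, s1, s2):
    """Ask the analyzer LM whether two sampled sentences contradict."""
    prompt = (
        f"Context: {context}\n"
        f"Sentence A: {s1}\nSentence B: {s2}\n"
        "Do A and B contradict each other? Answer Yes or No."
    )
    # Temperature 0 mirrors the paper's "maximum confidence" setting.
    return lm(prompt, temperature=0).strip().lower().startswith("yes")

def revise(lm, context, s1, s2):
    """Ask the LM to rewrite a contradictory pair into one consistent sentence."""
    prompt = (
        f"Context: {context}\n"
        f"These sentences conflict:\nA: {s1}\nB: {s2}\n"
        "Rewrite them as a single sentence that removes the conflicting "
        "information but keeps the rest."
    )
    return lm(prompt, temperature=0)

def mitigate(lm, context, sentences, alt_sentences):
    """Iterate over aligned sentence pairs, revising only contradictory ones."""
    result = []
    for s1, s2 in zip(sentences, alt_sentences):
        if detect_contradiction(lm, context, s1, s2):
            result.append(revise(lm, context, s1, s2))
        else:
            result.append(s1)  # keep the original sentence unchanged
    return result
```

In this sketch, `alt_sentences` plays the role of alternative generations sampled from the same context (the trigger step of Algorithm 1); in the actual pipeline these come from re-prompting gLM.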