Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation

Authors: Niels Mündler, Jingxuan He, Slobodan Jenko, Martin Vechev

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our primary evaluation task is open-domain text generation, but we also demonstrate the applicability of our approach to shorter question answering. Our analysis reveals the prevalence of self-contradictions, e.g., in 17.7% of all sentences produced by ChatGPT. We then propose a novel prompting-based framework designed to effectively detect and mitigate self-contradictions. Our detector achieves high accuracy, e.g., around 80% F1 score when prompting ChatGPT. The mitigation algorithm iteratively refines the generated text to remove contradictory information while preserving text fluency and informativeness. |
| Researcher Affiliation | Academia | Niels Mündler, Jingxuan He, Slobodan Jenko & Martin Vechev, Department of Computer Science, ETH Zurich, Switzerland. nmuendler@student.ethz.ch, jingxuan.he@inf.ethz.ch, sjenko@student.ethz.ch, martin.vechev@inf.ethz.ch |
| Pseudocode | Yes | Algorithm 1: Triggering self-contradictions for text generated by gLM. Algorithm 2: Mitigating self-contradictions for one pair of gLM-generated sentences. Algorithm 3: Iterative mitigation of self-contradictions for text generated by gLM. |
| Open Source Code | Yes | Our code and datasets are publicly available on GitHub at https://github.com/eth-sri/ChatProtect. |
| Open Datasets | Yes | Our code and datasets are publicly available on GitHub at https://github.com/eth-sri/ChatProtect. To construct evaluation data for open-domain text generation, we sample gLM to generate encyclopedic text descriptions for Wikipedia entities. |
| Dataset Splits | Yes | For validation, we use 12 entities. |
| Hardware Specification | Yes | We run Vicuna-13B with the FastChat API (LMSYS, 2023a) on NVIDIA A100 80GB GPUs. We use the service of Together AI (Together AI, 2023) for running Llama2-70B-Chat. |
| Software Dependencies | No | The paper mentions using the OpenAI API, the FastChat API, and Together AI services for running models, but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions, or specific library versions). |
| Experiment Setup | Yes | When generating the text descriptions, we use temperature 1.0 because it aligns with practical aspects of LM usage... When running gLM.gen_sentence, aLM.detect, and aLM.revise, we use temperature 0 for ChatGPT and GPT-4. This is because we desire maximum confidence for these functions. |
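The rows above describe a trigger-detect-revise pipeline (Algorithms 1-3, built on gLM.gen_sentence, aLM.detect, and aLM.revise). The following is a minimal Python sketch of that iterative loop; the `llm` callable, prompt wording, and helper names are illustrative assumptions standing in for the authors' actual ChatGPT/GPT-4 prompts, not their implementation:

```python
def trigger(llm, context, sentence):
    # Assumed trigger step: sample an alternative sentence for the
    # same context, to be compared against the original sentence.
    return llm(f"Continue the text with one sentence.\nContext: {context}")

def detect(llm, s1, s2):
    # Assumed detector prompt: ask the analyzer LM for a Yes/No verdict
    # on whether the two sentences contradict each other.
    answer = llm(f"Do these two sentences contradict each other? "
                 f"Answer Yes or No.\n1: {s1}\n2: {s2}")
    return answer.strip().lower().startswith("yes")

def revise(llm, context, s1, s2):
    # Assumed reviser prompt: rewrite the sentence to drop the
    # contradictory information while keeping the consistent content.
    return llm(f"Rewrite sentence 1 so it no longer conflicts with "
               f"sentence 2, preserving consistent information.\n"
               f"Context: {context}\n1: {s1}\n2: {s2}")

def mitigate_text(llm, sentences):
    # Iterative mitigation (Algorithm 3 style): check each sentence
    # against a freshly triggered alternative and revise on conflict.
    context, output = "", []
    for s in sentences:
        alt = trigger(llm, context, s)
        if detect(llm, s, alt):
            s = revise(llm, context, s, alt)
        output.append(s)
        context += " " + s
    return output
```

With a deterministic stub in place of a real model, `mitigate_text` leaves sentences untouched when the detector answers "No" and replaces flagged ones with the reviser's output, mirroring the per-sentence refinement the paper reports (temperature 0 for the detect/revise calls, since these should be maximally confident).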