Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation

Authors: Niels Mündler, Jingxuan He, Slobodan Jenko, Martin Vechev

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our primary evaluation task is open-domain text generation, but we also demonstrate the applicability of our approach to shorter question answering. Our analysis reveals the prevalence of self-contradictions, e.g., in 17.7% of all sentences produced by ChatGPT. We then propose a novel prompting-based framework designed to effectively detect and mitigate self-contradictions. Our detector achieves high accuracy, e.g., around 80% F1 score when prompting ChatGPT. The mitigation algorithm iteratively refines the generated text to remove contradictory information while preserving text fluency and informativeness. |
| Researcher Affiliation | Academia | Niels Mündler, Jingxuan He, Slobodan Jenko & Martin Vechev, Department of Computer Science, ETH Zurich, Switzerland. nmuendler@student.ethz.ch, jingxuan.he@inf.ethz.ch, sjenko@student.ethz.ch, martin.vechev@inf.ethz.ch |
| Pseudocode | Yes | Algorithm 1: Triggering self-contradictions for text generated by gLM. Algorithm 2: Mitigating self-contradictions for one pair of gLM-generated sentences. Algorithm 3: Iterative mitigation of self-contradictions for text generated by gLM. |
| Open Source Code | Yes | Our code and datasets are publicly available on GitHub at https://github.com/eth-sri/ChatProtect. |
| Open Datasets | Yes | Our code and datasets are publicly available on GitHub at https://github.com/eth-sri/ChatProtect. To construct evaluation data for open-domain text generation, we sample gLM to generate encyclopedic text descriptions for Wikipedia entities. |
| Dataset Splits | Yes | For validation, we use 12 entities. |
| Hardware Specification | Yes | We run Vicuna-13B with the FastChat API (LMSYS, 2023a) on NVIDIA A100 80GB GPUs. We use the service of Together AI (Together AI, 2023) for running Llama2-70B-Chat. |
| Software Dependencies | No | The paper mentions using the OpenAI API, the FastChat API, and Together AI services for running models, but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions, or specific library versions). |
| Experiment Setup | Yes | When generating the text descriptions, we use temperature 1.0 because it aligns with practical aspects of LM usage... When running gLM.gen_sentence, aLM.detect, and aLM.revise, we use temperature 0 for ChatGPT and GPT-4. This is because we desire maximum confidence for these functions. |
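The rows above describe a trigger-detect-revise pipeline (Algorithms 1-3, built on gLM.gen_sentence, aLM.detect, and aLM.revise). The following is a minimal Python sketch of that iterative loop; the `llm` callable, prompt wording, and helper names are illustrative assumptions standing in for the authors' actual ChatGPT/GPT-4 prompts, not their implementation:

```python
def trigger(llm, context, sentence):
    # Assumed trigger step: sample an alternative sentence for the
    # same context, to be compared against the original sentence.
    return llm(f"Continue the text with one sentence.\nContext: {context}")

def detect(llm, s1, s2):
    # Assumed detector prompt: ask the analyzer LM for a Yes/No verdict
    # on whether the two sentences contradict each other.
    answer = llm(f"Do these two sentences contradict each other? "
                 f"Answer Yes or No.\n1: {s1}\n2: {s2}")
    return answer.strip().lower().startswith("yes")

def revise(llm, context, s1, s2):
    # Assumed reviser prompt: rewrite the sentence to drop the
    # contradictory information while keeping the consistent content.
    return llm(f"Rewrite sentence 1 so it no longer conflicts with "
               f"sentence 2, preserving consistent information.\n"
               f"Context: {context}\n1: {s1}\n2: {s2}")

def mitigate_text(llm, sentences):
    # Iterative mitigation (Algorithm 3 style): check each sentence
    # against a freshly triggered alternative and revise on conflict.
    context, output = "", []
    for s in sentences:
        alt = trigger(llm, context, s)
        if detect(llm, s, alt):
            s = revise(llm, context, s, alt)
        output.append(s)
        context += " " + s
    return output
```

With a deterministic stub in place of a real model, `mitigate_text` leaves sentences untouched when the detector answers "No" and replaces flagged ones with the reviser's output, mirroring the per-sentence refinement the paper reports (temperature 0 for the detect/revise calls, since these should be maximally confident).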