CoP: Factual Inconsistency Detection by Controlling the Preference
Authors: Shuaijie She, Xiang Geng, Shujian Huang, Jiajun Chen
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are conducted on token-level and summary-level inconsistency detection and inconsistency category detection, where CoP achieves remarkable improvements over several strong baselines in the unsupervised settings. |
| Researcher Affiliation | Academia | Shuaijie She, Xiang Geng, Shujian Huang*, Jiajun Chen National Key Laboratory for Novel Software Technology, Nanjing University {shesj, gx}@smail.nju.edu.cn, {huangsj,chenjj}@nju.edu.cn |
| Pseudocode | No | The paper describes its methods but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Code will be released at https://github.com/NJUNLP/CoP |
| Open Datasets | Yes | Dataset: XSum Hallucination Annotations (Maynez et al. 2020) sampled 500 document-summary pairs from the summarization dataset XSUM (Narayan, Cohen, and Lapata 2018) and explored the factual consistency of summaries generated by four popular models (PtGen, TConvS2S, TranS2S, BERTS2S). The annotators mark each token with 0 or 1 to indicate its factuality. In addition to the above token-level dataset, we also take two popular summary-level datasets into consideration: the QAGS (Wang, Cho, and Lewis 2020) and FRANK (Pagnoni, Balachandran, and Tsvetkov 2021) datasets. |
| Dataset Splits | Yes | In the full-shot setting, the dataset is split into three subsets: training (1200), validation (400), and test (400) sets, which is similar to DAE (Goyal and Durrett 2021). |
| Hardware Specification | Yes | The experiments are conducted on a single TITAN-RTX GPU. |
| Software Dependencies | No | The paper mentions software like 'BARTCNN', 'SpanBERT', and 'spaCy' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | The length of the prompt vector is set to 40 and 5 for the full-shot and few-shot settings respectively. The prompt tuning process uses the AdamW optimizer with a 1e-3 learning rate, and the experiments are conducted on a single TITAN-RTX GPU. |
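
The split and hyperparameters reported above can be sketched as follows. This is a minimal illustration only: the 1200/400/400 partition, prompt lengths (40 full-shot, 5 few-shot), optimizer name, and learning rate come from the table; the `split_dataset` helper, the random seed, and the config layout are assumptions, since the paper's released code is not yet available.

```python
import random

# Reported hyperparameters (prompt lengths, optimizer, learning rate).
# The dict layout itself is illustrative, not the paper's actual config.
CONFIG = {
    "prompt_length_full_shot": 40,
    "prompt_length_few_shot": 5,
    "optimizer": "AdamW",
    "learning_rate": 1e-3,
}

def split_dataset(pairs, sizes=(1200, 400, 400), seed=42):
    """Shuffle document-summary pairs and partition them into
    train/valid/test subsets of the reported sizes (DAE-style split).
    The seed is an assumption; the paper does not specify one."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train, n_valid, n_test = sizes
    train = pairs[:n_train]
    valid = pairs[n_train:n_train + n_valid]
    test = pairs[n_train + n_valid:n_train + n_valid + n_test]
    return train, valid, test

# The 2000 annotated pairs are stand-ins for real document-summary pairs.
train, valid, test = split_dataset(range(2000))
```

Any disjoint shuffle-then-slice partition would do; the fixed seed only makes the split reproducible across runs.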