CoP: Factual Inconsistency Detection by Controlling the Preference

Authors: Shuaijie She, Xiang Geng, Shujian Huang, Jiajun Chen

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments are conducted on token-level and summary-level inconsistency detection and inconsistency category detection, where CoP achieves remarkable improvements over several strong baselines in the unsupervised settings.
Researcher Affiliation | Academia | Shuaijie She, Xiang Geng, Shujian Huang*, Jiajun Chen; National Key Laboratory for Novel Software Technology, Nanjing University; {shesj, gx}@smail.nju.edu.cn, {huangsj, chenjj}@nju.edu.cn
Pseudocode | No | The paper describes its methods but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | Code will be released at https://github.com/NJUNLP/CoP
Open Datasets | Yes | Dataset: XSum Hallucination Annotations (Maynez et al. 2020) sampled 500 document-summary pairs from the summarization dataset XSUM (Narayan, Cohen, and Lapata 2018) and explored the factual consistency of summaries generated by four popular models (PtGen, TConvS2S, TranS2S, BERTS2S). The annotator marks each token with 0 or 1 to indicate its factuality. In addition to the above token-level dataset, we also take two popular summary-level datasets into consideration: QAGS (Wang, Cho, and Lewis 2020) and FRANK (Pagnoni, Balachandran, and Tsvetkov 2021).
Dataset Splits | Yes | In the full-shot setting, the dataset is split into three subsets: training (1200), validation (400), and test (400), similar to DAE (Goyal and Durrett 2021).
Hardware Specification | Yes | The experiments are conducted on a single TITAN-RTX GPU.
Software Dependencies | No | The paper mentions software such as 'BARTCNN', 'SpanBERT', and 'spaCy' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | The length of the prompt vector is set to 40 and 5 for the full-shot and few-shot settings, respectively. The prompt-tuning process uses the AdamW optimizer with a 1e-3 learning rate, and the experiments are conducted on a single TITAN-RTX GPU.
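For reference, below is a minimal PyTorch sketch of the prompt-tuning hyperparameters quoted in the Experiment Setup row. Only the prompt length (40 full-shot / 5 few-shot), the AdamW optimizer, and the 1e-3 learning rate come from the paper; the hidden size, initialization scale, and variable names are illustrative assumptions, not the authors' released implementation.

```python
import torch

# Sketch of the reported prompt-tuning configuration (assumptions noted above).
prompt_length = 40            # 40 in the full-shot setting, 5 in the few-shot setting
hidden_dim = 1024             # assumed hidden size of a BART-large-style backbone

# Learnable prompt vectors; a frozen backbone would prepend these to its input embeddings.
prompt_embeddings = torch.nn.Parameter(0.02 * torch.randn(prompt_length, hidden_dim))

# Only the prompt vectors are optimized, matching the reported prompt-tuning procedure.
optimizer = torch.optim.AdamW([prompt_embeddings], lr=1e-3)
```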