CoP: Factual Inconsistency Detection by Controlling the Preference

Authors: Shuaijie She, Xiang Geng, Shujian Huang, Jiajun Chen

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments are conducted on token-level and summary-level inconsistency detection and inconsistency category detection, where CoP achieves remarkable improvements over several strong baselines in the unsupervised settings.
Researcher Affiliation | Academia | Shuaijie She, Xiang Geng, Shujian Huang*, Jiajun Chen; National Key Laboratory for Novel Software Technology, Nanjing University; {shesj, gx}@smail.nju.edu.cn, {huangsj, chenjj}@nju.edu.cn
Pseudocode | No | The paper describes its methods but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | Code will be released at https://github.com/NJUNLP/CoP
Open Datasets | Yes | Dataset: XSum Hallucination Annotations (Maynez et al. 2020) sampled 500 document-summary pairs from the summarization dataset XSUM (Narayan, Cohen, and Lapata 2018) and explored the factual consistency of summaries generated by four popular models (PtGen, TConvS2S, TranS2S, BERTS2S). The annotator marks each token with 0 or 1 to indicate its factuality. In addition to the above token-level dataset, we also take two popular summary-level datasets into consideration: QAGS (Wang, Cho, and Lewis 2020) and FRANK (Pagnoni, Balachandran, and Tsvetkov 2021).
Dataset Splits | Yes | In the full-shot setting, the dataset is split into three subsets: training (1200), validation (400), and test (400), similar to DAE (Goyal and Durrett 2021).
Hardware Specification | Yes | The experiments are conducted on a single TITAN-RTX GPU.
Software Dependencies | No | The paper mentions software such as 'BARTCNN', 'SpanBERT', and 'spaCy' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | The length of the prompt vector is set to 40 and 5 for the full-shot and few-shot settings, respectively. The prompt-tuning process uses the AdamW optimizer with a 1e-3 learning rate, and the experiments are conducted on a single TITAN-RTX GPU.
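For reference, below is a minimal PyTorch sketch of the prompt-tuning hyperparameters quoted in the Experiment Setup row. Only the prompt length (40 full-shot / 5 few-shot), the AdamW optimizer, and the 1e-3 learning rate come from the paper; the hidden size, initialization scale, and variable names are illustrative assumptions, not the authors' released implementation.

```python
import torch

# Sketch of the reported prompt-tuning configuration (assumptions noted above).
prompt_length = 40            # 40 in the full-shot setting, 5 in the few-shot setting
hidden_dim = 1024             # assumed hidden size of a BART-large-style backbone

# Learnable prompt vectors; a frozen backbone would prepend these to its input embeddings.
prompt_embeddings = torch.nn.Parameter(0.02 * torch.randn(prompt_length, hidden_dim))

# Only the prompt vectors are optimized, matching the reported prompt-tuning procedure.
optimizer = torch.optim.AdamW([prompt_embeddings], lr=1e-3)
```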