Tracing Text Provenance via Context-Aware Lexical Substitution

Authors: Xi Yang, Jie Zhang, Kejiang Chen, Weiming Zhang, Zehua Ma, Feng Wang, Nenghai Yu (pp. 11613-11621)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate that, under both objective and subjective metrics, our watermarking scheme can well preserve the semantic integrity of original sentences and has a better transferability than existing methods. Besides, the proposed LS approach outperforms the state-of-the-art approach on the Stanford Word Substitution Benchmark."
Researcher Affiliation | Academia | University of Science and Technology of China; {yx9726@mail., zjzac@mail., chenkj@mail., zhangwm@, mzh045@mail., nishi@mail., ynh@}ustc.edu.cn
Pseudocode | Yes | Algorithm 1: Context-Aware Lexical Substitution; Algorithm 2: Sequence Incremental Watermark Embedding
Open Source Code | No | The paper does not provide an explicit statement about, or link to, open-source code for its methodology.
Open Datasets | Yes | "We choose datasets with different writing styles, namely, Novels, Wiki Text-2, IMDB, and Ag News. For Novels, we select Wuthering Heights, Dracula, and Pride and Prejudice from Project Gutenberg. For the rest datasets, we select the first 10,000 sentences each from the Wiki Text-2, IMDB, and Ag News datasets provided by Hugging Face." (https://www.gutenberg.org/, https://huggingface.co/datasets)
Dataset Splits | No | The paper names the datasets used but does not specify the train/validation/test splits (percentages or sample counts) needed for reproduction.
Hardware Specification | No | The paper does not report the hardware (CPU/GPU models, memory) used to run its experiments.
Software Dependencies | No | The paper names pre-trained models (bert-base-cased, roberta-large-mnli, stsb-roberta-base-v2) and NLTK, but gives no version numbers for general software dependencies such as Python or PyTorch.
Experiment Setup | Yes | "We set f = 1 by default in Algorithm 2 and K = 32 when generating candidates."
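The paper's Algorithms 1 and 2 are not reproduced here, but the general idea behind watermarking via lexical substitution can be sketched: each word is mapped to a pseudo-random bit, and for each substitutable position the embedder picks a candidate whose bit matches the next watermark bit. The helper names (`word_bit`, `embed_bit`) and the hash-based bit assignment below are illustrative assumptions, not the paper's exact scheme:

```python
import hashlib

def word_bit(word: str) -> int:
    """Map a word to a deterministic pseudo-random bit (hypothetical rule:
    lowest bit of the first byte of its SHA-256 digest)."""
    return hashlib.sha256(word.lower().encode("utf-8")).digest()[0] & 1

def embed_bit(original: str, candidates: list[str], target_bit: int) -> str:
    """Choose a word encoding `target_bit` at this position.

    Prefers keeping the original word if it already encodes the bit;
    otherwise scans the substitution candidates (e.g. the K = 32 words
    proposed by the context-aware LS model). Falls back to the original
    word if no candidate encodes the target bit.
    """
    for word in [original] + candidates:
        if word_bit(word) == target_bit:
            return word
    return original
```

A detector that knows the bit-assignment rule can then recover the watermark by recomputing `word_bit` over the substitutable positions of the received text, with no access to the original sentence.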