Unsupervised Paraphrasing under Syntax Knowledge

Authors: Tianyuan Liu, Yuqing Sun, Jiaqi Wu, Xi Xu, Yuchen Han, Cheng Li, Bin Gong

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed method is evaluated on several paraphrase datasets. The experimental results show that the quality of the paraphrases generated by the proposed method outperforms that of the compared methods, especially in terms of syntax correctness. Experiments are also conducted to inspect the contributions of the method's components.
Researcher Affiliation | Academia | Tianyuan Liu, Yuqing Sun*, Jiaqi Wu, Xi Xu, Yuchen Han, Cheng Li, Bin Gong. School of Software, Shandong University. Emails: zodiacg@foxmail.com, sun yuqing@sdu.edu.cn, oofelvis@163.com, {sidxu,hanyc}@mail.sdu.edu.cn, 609172827@qq.com, gb@sdu.edu.cn
Pseudocode | No | The paper describes the architecture and components of the model but does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code is available at https://splab.sdu.edu.cn/xscg/sjjydm.htm
Open Datasets | Yes | For training and evaluating the proposed method, the authors adopted two paraphrasing datasets: Quora and Simple Wiki. The Quora (Iyer et al. 2017) dataset is a widely used dataset for paraphrase generation. Simplewiki (Coster and Kauchak 2011) is a dataset for text simplification. For pretraining the word composable knowledge, they use the English Web Treebank from Universal Dependencies (https://universaldependencies.org) to obtain high-quality syntax parsing annotations. For pretraining the semantic encoder, they used the WikiText corpus.
Dataset Splits | No | The paper provides training and test set sizes in Table 1 but does not explicitly provide numerical splits or counts for a validation set.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, memory) used to run the experiments.
Software Dependencies | Yes | The method is implemented with PyTorch 1.10.2 and CUDA 11.3. The dependency parsing trees are generated with Stanza 1.4.0 (Qi et al. 2020).
Experiment Setup | Yes | The text encoder uses 300-dimensional GloVe (Pennington, Socher, and Manning 2014) embeddings and produces a 300-dimensional vector as the semantic representation. For the parsing tree encoder, the dimension size of the syntax element embeddings is chosen from {150, 200, 300, 500, 750}. For the paraphrase generator, the dimension size of the RNN hidden vector is chosen from {150, 200, 300, 500, 750}. The weight of the syntax matching loss is chosen from [0, 1.5]. The performance results are reported for the best parameter combinations, while the model analysis and ablation study experiments are conducted with both dimension sizes set to 300 and the weight set to 0.2 unless stated otherwise.
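The hyperparameter search described above can be sketched as a small grid. This is a minimal illustration, not the authors' code: the field names (syntax_dim, hidden_dim, match_weight) are our own labels, and the sampled weight values are illustrative since the paper only gives the interval [0, 1.5].

```python
# Hypothetical sketch of the hyperparameter grid from the experiment setup.
from itertools import product

dim_choices = [150, 200, 300, 500, 750]     # syntax embedding / RNN hidden sizes
weight_choices = [0.0, 0.2, 0.5, 1.0, 1.5]  # illustrative samples from [0, 1.5]

configs = [
    {"syntax_dim": s, "hidden_dim": h, "match_weight": w}
    for s, h, w in product(dim_choices, dim_choices, weight_choices)
]

# Default used for ablations in the paper: both dimensions 300, weight 0.2.
default = {"syntax_dim": 300, "hidden_dim": 300, "match_weight": 0.2}
assert default in configs
print(len(configs))  # 5 * 5 * 5 = 125 candidate combinations
```

The reported performance results would correspond to the best-scoring combination in such a grid, while ablations fix the default configuration.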