Unsupervised Paraphrasing under Syntax Knowledge
Authors: Tianyuan Liu, Yuqing Sun, Jiaqi Wu, Xi Xu, Yuchen Han, Cheng Li, Bin Gong
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed method is evaluated on paraphrase datasets. The experimental results show that the paraphrases generated by the proposed method outperform those of the compared methods, especially in terms of syntax correctness. Experiments are also conducted to inspect the contributions of the individual components of the method. |
| Researcher Affiliation | Academia | Tianyuan Liu, Yuqing Sun*, Jiaqi Wu, Xi Xu, Yuchen Han, Cheng Li, Bin Gong; School of Software, Shandong University; zodiacg@foxmail.com, sun yuqing@sdu.edu.cn, oofelvis@163.com, {sidxu,hanyc}@mail.sdu.edu.cn, 609172827@qq.com, gb@sdu.edu.cn |
| Pseudocode | No | The paper describes the architecture and components of the model but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code is available at https://splab.sdu.edu.cn/xscg/sjjydm.htm |
| Open Datasets | Yes | For training and evaluating the proposed method, we adopted two paraphrasing datasets: Quora and SimpleWiki. The Quora (Iyer et al. 2017) dataset is a widely used dataset for paraphrase generation. SimpleWiki (Coster and Kauchak 2011) is a dataset for text simplification. For pretraining the word composable knowledge, we use the English Web Treebank from Universal Dependencies (https://universaldependencies.org) to obtain high-quality syntax parsing annotations. For pretraining the semantic encoder, we used the WikiText corpus. |
| Dataset Splits | No | The paper provides training and test set sizes in Table 1 but does not explicitly provide numerical splits or counts for a validation set. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, memory) used to run the experiments. |
| Software Dependencies | Yes | Our method is implemented with PyTorch 1.10.2 and CUDA 11.3. The dependency parsing trees used are generated with Stanza 1.4.0 (Qi et al. 2020). (An illustrative parsing sketch follows the table.) |
| Experiment Setup | Yes | The text encoder uses 300-dimensional GloVe (Pennington, Socher, and Manning 2014) embeddings and produces a 300-dimensional vector as the semantic representation. For the parsing tree encoder, the dimension size of syntax element embeddings is chosen from {150, 200, 300, 500, 750}. For the paraphrase generator, the dimension size of the RNN hidden vector is chosen from {150, 200, 300, 500, 750}. The weight of the syntax matching loss is chosen from [0, 1.5]. The performance results are reported for the best parameter combinations, while the model analysis and ablation study experiments are conducted with both dimension sizes set to 300 and the weight set to 0.2 if not stated otherwise. (A configuration sketch follows the table.) |
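
Regarding the Software Dependencies row: the paper states only that dependency parsing trees are generated with Stanza 1.4.0. The following is a minimal sketch of how such parses can be produced with Stanza's standard pipeline; the example sentence and variable names are illustrative, not taken from the authors' released code.

```python
# Minimal sketch: generating dependency parses with Stanza, as the paper
# reports doing (Stanza 1.4.0). The sentence and variable names are
# illustrative, not from the authors' code.
import stanza

# Download the English models once (cached afterwards).
stanza.download("en")

# Build a pipeline with the processors needed for dependency parsing.
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

doc = nlp("The proposed method generates syntactically correct paraphrases.")

# Each word carries a head index and a dependency relation, which together
# define the parsing tree consumed by a syntax encoder.
for sentence in doc.sentences:
    for word in sentence.words:
        head = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(f"{word.text}\t{word.deprel}\t{head}")
```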
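Regarding the Experiment Setup row: the hyperparameter grid described there can be summarized as a small search-space configuration. The sketch below is hypothetical; the key names, the sampled grid points for the loss weight in [0, 1.5], and the `train_and_evaluate` entry point are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of the search space reported in the Experiment Setup row.
from itertools import product

search_space = {
    "word_embedding_dim": [300],                         # fixed: 300-d GloVe embeddings
    "syntax_embedding_dim": [150, 200, 300, 500, 750],   # parsing tree encoder
    "rnn_hidden_dim": [150, 200, 300, 500, 750],         # paraphrase generator
    "syntax_matching_loss_weight": [0.0, 0.2, 0.5, 1.0, 1.5],  # grid points sampled from [0, 1.5]
}

# Defaults used for the model analysis / ablation experiments in the paper:
# both dimension sizes set to 300 and the loss weight set to 0.2.
default_config = {
    "word_embedding_dim": 300,
    "syntax_embedding_dim": 300,
    "rnn_hidden_dim": 300,
    "syntax_matching_loss_weight": 0.2,
}

# Enumerate candidate combinations for a grid search.
for values in product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    # train_and_evaluate(config)  # hypothetical training entry point
```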