Style-transfer and Paraphrase: Looking for a Sensible Semantic Similarity Metric
Authors: Ivan P. Yamshchikov, Viacheslav Shibaev, Nikolay Khlebnikov, Alexey Tikhonov
AAAI 2021, pp. 14213–14220 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper provides a comprehensive analysis of more than a dozen such methods. Using a new dataset of fourteen thousand sentence pairs human-labeled according to their semantic similarity, we demonstrate that none of the metrics widely used in the literature is close enough to human judgment in these tasks. A number of recently proposed metrics provide comparable results, yet Word Mover's Distance is shown to be the most reasonable solution for measuring semantic similarity in reformulated texts at the moment (see the sketches after the table). |
| Researcher Affiliation | Collaboration | (1) Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany; (2) Ural Federal University, Mira 19, 620002 Ekaterinburg, Russia; (3) Yandex, Oberwallstr. 6, 10117 Berlin, Germany |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper states: "To stimulate further research of semantic similarity measurements, we publish our dataset that consists of 14 000 different pairs of sentences alongside with semantic similarity scores given by the annotators." The accompanying link (https://github.com/VAShibaev/semantic_similarity_metrics) points to the dataset, not to source code for the methodology presented in the paper. |
| Open Datasets | Yes | "To stimulate further research of semantic similarity measurements, we publish our dataset that consists of 14 000 different pairs of sentences alongside with semantic similarity scores given by the annotators." Dataset: https://github.com/VAShibaev/semantic_similarity_metrics |
| Dataset Splits | No | The paper samples 1,000 sentence pairs from various existing datasets for analysis and introduces a new 14,000-pair human-labeled dataset. However, it does not specify explicit training/validation/test splits (e.g., percentages or counts) for reproducing any model training or evaluation. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or computing environment) used for running its experiments. |
| Software Dependencies | No | The paper mentions various metrics and embeddings (e.g., GloVe, fastText, ELMo, BERTScore) but does not provide specific version numbers for any software libraries or dependencies used. |
| Experiment Setup | No | The paper focuses on comparing existing semantic similarity metrics and does not describe a specific experimental setup with hyperparameters, training configurations, or system-level settings for any model the authors trained. |
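
Since the paper singles out Word Mover's Distance as the most reasonable available measure of semantic similarity between an original sentence and its reformulation, the following is a minimal sketch of how such a score can be computed. This is not the authors' code: it assumes gensim with pretrained word2vec vectors obtained via `gensim.downloader`, and the model name, tokenization, and example sentences are illustrative choices.

```python
# Minimal sketch: Word Mover's Distance between a sentence and a
# paraphrase, using gensim's built-in wmdistance on pretrained vectors.
# (In gensim 4.x, wmdistance additionally requires the optional POT package.)
import gensim.downloader as api

# Load pretrained word embeddings (any KeyedVectors model works here).
vectors = api.load("word2vec-google-news-300")

original = "the movie was surprisingly good".split()
paraphrase = "the film turned out to be great".split()

# Lower WMD means the two sentences are semantically closer.
distance = vectors.wmdistance(original, paraphrase)
print(f"Word Mover's Distance: {distance:.4f}")
```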
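
The paper's central comparison is between automated metric scores and the human similarity labels in the published 14,000-pair dataset. Below is a hypothetical sketch of that kind of evaluation, reusing `vectors` from the previous snippet; the file path and column names are assumptions, not the dataset's actual layout, and a rank correlation is used here as one plausible agreement measure.

```python
# Hypothetical sketch: rank correlation between a metric's scores and
# human similarity annotations over a dataset of sentence pairs.
import pandas as pd
from scipy.stats import spearmanr

# Assumed file name and columns; adjust to the actual dataset layout.
pairs = pd.read_csv("semantic_similarity_pairs.csv")

# Score every pair with the metric under test (WMD here, see above).
metric_scores = [
    vectors.wmdistance(a.split(), b.split())
    for a, b in zip(pairs["sentence_a"], pairs["sentence_b"])
]

# WMD is a distance, so good agreement with human similarity scores
# shows up as a strong *negative* correlation.
rho, p_value = spearmanr(metric_scores, pairs["human_score"])
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```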