Style-transfer and Paraphrase: Looking for a Sensible Semantic Similarity Metric

Authors: Ivan P. Yamshchikov, Viacheslav Shibaev, Nikolay Khlebnikov, Alexey Tikhonov (pp. 14213-14220)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper provides a comprehensive analysis of more than a dozen such methods. Using a new dataset of fourteen thousand sentence pairs human-labeled according to their semantic similarity, we demonstrate that none of the metrics widely used in the literature is close enough to human judgment in these tasks. A number of recently proposed metrics provide comparable results, yet Word Mover Distance is shown to be the most reasonable solution for measuring semantic similarity in reformulated texts at the moment. (A minimal Word Mover Distance sketch appears after the table.)
Researcher Affiliation | Collaboration | (1) Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig, Germany 04103; (2) Ural Federal University, Mira 19, Ekaterinburg, Russia, 620002; (3) Yandex, Oberwallstr. 6, Berlin, Germany, 10117
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper states: "To stimulate further research of semantic similarity measurements, we publish [5] our dataset that consists of 14 000 different pairs of sentences alongside with semantic similarity scores given by the annotators. [5] https://github.com/VAShibaev/semantic_similarity_metrics". This link is for the dataset, not the source code for the methodology presented in the paper.
Open Datasets | Yes | "To stimulate further research of semantic similarity measurements, we publish [5] our dataset that consists of 14 000 different pairs of sentences alongside with semantic similarity scores given by the annotators. [5] https://github.com/VAShibaev/semantic_similarity_metrics" (A sketch of correlating such human labels with metric scores appears after the table.)
Dataset Splits | No | The paper describes sampling 1,000 sentence pairs from various datasets for analysis, plus the new 14,000-pair human-labeled dataset, but it does not specify explicit training/validation/test splits (e.g., percentages or counts) for reproducing any model training or evaluation.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or computing environment) used for running its experiments.
Software Dependencies | No | The paper mentions various metrics and embeddings (e.g., GloVe, fastText, ELMo, BERTScore) but does not provide specific version numbers for any software libraries or dependencies used. (A minimal BERTScore sketch appears after the table.)
Experiment Setup | No | The paper focuses on comparing existing semantic similarity metrics and does not describe a specific experimental setup with hyperparameters, training configurations, or system-level settings for a model of its own.
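
The Research Type row singles out Word Mover Distance as the most reasonable current metric. The following is a minimal, hedged sketch of computing it with gensim's wmdistance; the embedding model, example sentences, and preprocessing are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch (not the paper's exact setup) of Word Mover Distance
# between an original sentence and a paraphrase, using gensim.
import gensim.downloader as api

# Pretrained GloVe vectors; an illustrative choice -- the paper also
# evaluates metrics built on other embeddings such as fastText and ELMo.
vectors = api.load("glove-wiki-gigaword-100")

original = "the service was painfully slow".split()
paraphrase = "the staff took forever to serve us".split()

# Lower distance means higher semantic similarity.
# Recent gensim versions require the POT package (pip install POT).
distance = vectors.wmdistance(original, paraphrase)
print(f"Word Mover Distance: {distance:.4f}")
```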
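The Open Datasets row describes 14,000 sentence pairs with human similarity labels. The kind of comparison the paper's conclusion rests on is correlating a metric's scores with those judgments; the sketch below does this with scipy's Spearman correlation, using invented placeholder numbers rather than the paper's data.

```python
# A hedged sketch of the evaluation this report summarizes: correlating a
# candidate metric's scores with human similarity labels. The numbers are
# invented placeholders, not values from the paper's dataset.
from scipy.stats import spearmanr

human_labels = [0.9, 0.2, 0.7, 0.4, 0.8]        # annotator similarity scores
metric_scores = [0.85, 0.35, 0.60, 0.50, 0.75]  # e.g., BERTScore F1 or -WMD

rho, p_value = spearmanr(human_labels, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```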
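Among the metrics listed in the Software Dependencies row, BERTScore has a widely used reference package. A minimal sketch with the bert-score library follows; the sentence pair is invented, and the paper does not report which model, library, or version it actually used.

```python
# A minimal sketch using the bert-score package (pip install bert-score).
from bert_score import score

candidates = ["the staff took forever to serve us"]
references = ["the service was painfully slow"]

# Returns precision, recall, and F1 tensors over the pairs; F1 is the
# value most commonly reported in the literature.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```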