Style-transfer and Paraphrase: Looking for a Sensible Semantic Similarity Metric
Authors: Ivan P. Yamshchikov, Viacheslav Shibaev, Nikolay Khlebnikov, Alexey Tikhonov
AAAI 2021, pp. 14213–14220 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper provides a comprehensive analysis of more than a dozen such methods. Using a new dataset of fourteen thousand sentence pairs human-labeled according to their semantic similarity, we demonstrate that none of the metrics widely used in the literature is close enough to human judgment in these tasks. A number of recently proposed metrics provide comparable results, yet Word Mover's Distance is shown to be the most reasonable solution for measuring semantic similarity in reformulated texts at the moment (see the sketches after the table). |
| Researcher Affiliation | Collaboration | (1) Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany; (2) Ural Federal University, Mira 19, 620002 Ekaterinburg, Russia; (3) Yandex, Oberwallstr. 6, 10117 Berlin, Germany |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper states: "To stimulate further research of semantic similarity measurements, we publish our dataset that consists of 14 000 different pairs of sentences alongside with semantic similarity scores given by the annotators." The accompanying link (https://github.com/VAShibaev/semantic_similarity_metrics) points to the dataset, not to source code for the methodology presented in the paper. |
| Open Datasets | Yes | "To stimulate further research of semantic similarity measurements, we publish our dataset that consists of 14 000 different pairs of sentences alongside with semantic similarity scores given by the annotators." Dataset: https://github.com/VAShibaev/semantic_similarity_metrics |
| Dataset Splits | No | The paper samples 1,000 sentence pairs from various existing datasets for analysis and introduces a new 14,000-pair human-labeled dataset. However, it does not specify explicit training/validation/test splits (e.g., percentages or counts) for reproducing any model training or evaluation. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or computing environment) used for running its experiments. |
| Software Dependencies | No | The paper mentions various metrics and embeddings (e.g., GloVe, fastText, ELMo, BERTScore) but does not provide specific version numbers for any software libraries or dependencies used. |
| Experiment Setup | No | The paper focuses on comparing existing semantic similarity metrics and does not describe a specific experimental setup with hyperparameters, training configurations, or system-level settings for any model the authors trained. |
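
Since the paper singles out Word Mover's Distance as the most reasonable available measure of semantic similarity between an original sentence and its reformulation, the following is a minimal sketch of how such a score can be computed. This is not the authors' code: it assumes gensim with pretrained word2vec vectors obtained via `gensim.downloader`, and the model name, tokenization, and example sentences are illustrative choices.

```python
# Minimal sketch: Word Mover's Distance between a sentence and a
# paraphrase, using gensim's built-in wmdistance on pretrained vectors.
# (In gensim 4.x, wmdistance additionally requires the optional POT package.)
import gensim.downloader as api

# Load pretrained word embeddings (any KeyedVectors model works here).
vectors = api.load("word2vec-google-news-300")

original = "the movie was surprisingly good".split()
paraphrase = "the film turned out to be great".split()

# Lower WMD means the two sentences are semantically closer.
distance = vectors.wmdistance(original, paraphrase)
print(f"Word Mover's Distance: {distance:.4f}")
```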
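
The paper's central comparison is between automated metric scores and the human similarity labels in the published 14,000-pair dataset. Below is a hypothetical sketch of that kind of evaluation, reusing `vectors` from the previous snippet; the file path and column names are assumptions, not the dataset's actual layout, and a rank correlation is used here as one plausible agreement measure.

```python
# Hypothetical sketch: rank correlation between a metric's scores and
# human similarity annotations over a dataset of sentence pairs.
import pandas as pd
from scipy.stats import spearmanr

# Assumed file name and columns; adjust to the actual dataset layout.
pairs = pd.read_csv("semantic_similarity_pairs.csv")

# Score every pair with the metric under test (WMD here, see above).
metric_scores = [
    vectors.wmdistance(a.split(), b.split())
    for a, b in zip(pairs["sentence_a"], pairs["sentence_b"])
]

# WMD is a distance, so good agreement with human similarity scores
# shows up as a strong *negative* correlation.
rho, p_value = spearmanr(metric_scores, pairs["human_score"])
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```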