Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Re-evaluating Word Mover’s Distance

Authors: Ryoma Sato, Makoto Yamada, Hisashi Kashima

ICML 2022 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The original study on WMD reported that WMD outperforms classical baselines such as bag-of-words (BOW) and TF-IDF by significant margins in various datasets. In this paper, we point out that the evaluation in the original study could be misleading. We re-evaluate the performances of WMD and the classical baselines and find that the classical baselines are competitive with WMD if we employ an appropriate preprocessing, i.e., L1 normalization.
Researcher Affiliation	Academia	¹Kyoto University, ²RIKEN AIP.
Pseudocode	No	The paper describes methods and formulas but does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code	Yes	The code is available at https://github.com/joisino/reeval-wmd.
Open Datasets	Yes	We use the same datasets (Greene & Cunningham, 2006; Sanders, 2011; Joachims, 1998; Sebastiani, 2002; Lang, 1995) as in the original paper (Kusner et al., 2015) and use the same train/test splits as in the original paper. ... Table 1. Dataset statistics.
Dataset Splits	Yes	We split the training set into an 80/20 train/validation set uniformly and randomly and select the neighborhood size from {1, 2, ..., 19} using the validation data.
Hardware Specification	Yes	We use a server cluster to compute WMD. Each node has two 2.4GHz Intel Xeon Gold 6148 CPUs. We use a Linux server with Intel Xeon E7-4830 v4 CPUs to evaluate the performances.
Software Dependencies	No	The paper mentions using "word2vec embeddings" and "GloVe" embeddings, but it does not specify exact version numbers for these or any other software dependencies such as libraries, frameworks, or operating systems.
Experiment Setup	Yes	We select the neighborhood size from {1, 2, ..., 19} using the validation data. ... we fix k of wkNN to 19 and tune only γ in the hyperparameter tuning. We select γ from Γ = {0.005, 0.010, ..., 0.095, 0.1}
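
The protocol quoted in the Research Type, Dataset Splits, and Experiment Setup rows can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code (which is at the repository linked above). All function names are hypothetical. The use of L1 distance between normalized vectors is an assumption, and so is the 0.005 step used to fill in the γ grid.

```python
import random
from collections import Counter


def l1_normalize(vec):
    """Scale a bag-of-words count vector so its absolute values sum to 1."""
    total = sum(abs(v) for v in vec)
    return [v / total for v in vec] if total > 0 else list(vec)


def split_train_val(n_items, val_fraction=0.2, seed=0):
    """Shuffle indices uniformly at random and return (train_idx, val_idx)."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    n_val = int(round(n_items * val_fraction))
    return idx[n_val:], idx[:n_val]


def l1_distance(a, b):
    """Assumed document distance: L1 distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training items closest to the query."""
    order = sorted(range(len(train_x)),
                   key=lambda i: l1_distance(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def select_k(train_x, train_y, val_x, val_y, candidates=range(1, 20)):
    """Pick the neighborhood size in {1, ..., 19} with the lowest validation error."""
    best_k, best_err = None, float("inf")
    for k in candidates:
        wrong = sum(knn_predict(train_x, train_y, x, k) != y
                    for x, y in zip(val_x, val_y))
        err = wrong / len(val_x)
        if err < best_err:  # strict '<' keeps the smallest k on ties
            best_k, best_err = k, err
    return best_k, best_err


# Gamma grid from 0.005 to 0.1; the uniform 0.005 step is an assumption.
GAMMA_GRID = [round(0.005 * i, 3) for i in range(1, 21)]
```

In this sketch, documents would first be passed through `l1_normalize`, the training indices split with `split_train_val`, and `select_k` run on the resulting train and validation portions before evaluating on the test split.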