Linear-Complexity Data-Parallel Earth Mover’s Distance Approximations
Authors: Kubilay Atasu, Thomas Mittelholzer
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the popular text-based 20 Newsgroups dataset, the new algorithms are four orders of magnitude faster than a multi-threaded CPU implementation of Word Mover's Distance and match its nearest-neighbors-search accuracy. On MNIST images, the new algorithms are four orders of magnitude faster than a GPU implementation of Sinkhorn's algorithm while offering a slightly higher nearest-neighbors-search accuracy. |
| Researcher Affiliation | Collaboration | 1IBM Research Zurich 2HSR Hochschule für Technik, Rapperswil. Correspondence to: Kubilay Atasu <kat@zurich.ibm.com>. |
| Pseudocode | Yes | Algorithm 1 Optimal Computation of OMR; Algorithm 2 Optimal Computation of ICT; Algorithm 3 Approximate Computation of ICT |
| Open Source Code | No | No explicit statement about providing source code for their own methodology was found. The paper mentions using 'Cuturi's open-source implementation of Sinkhorn's algorithm'. |
| Open Datasets | Yes | We performed experiments on two public datasets: 20 Newsgroups is a text database of newsgroup documents, partitioned evenly across 20 different classes1, and MNIST is an image database of greyscale hand-written digits that are partitioned evenly across 10 classes2. 1http://qwone.com/~jason/20Newsgroups/; 2http://yann.lecun.com/exdb/mnist/ |
| Dataset Splits | No | No specific dataset split information for training, validation, and test sets (e.g., exact percentages, sample counts, or citations to predefined splits) was provided. The paper mentions 'We used only the first 6000 MNIST training images as our query documents and compared them with the complete set of 60000 MNIST training images,' which describes a query setup rather than a standard model training split. |
| Hardware Specification | Yes | We have developed GPU-accelerated implementations of the WCD, RWMD, OMR, ACT, and BoW cosine similarity methods and evaluated their performance on an NVIDIA® GTX 1080Ti GPU. Cuturi's Sinkhorn implementation has also been executed on the same GPU. Our multithreaded WMD implementation has been deployed on an 8-core Intel® i7-6900K CPU. |
| Software Dependencies | No | No specific software dependencies with version numbers were provided. The paper mentions the 'FastEMD library' and 'Cuturi's open-source implementation of Sinkhorn's algorithm' but without version details. |
| Experiment Setup | Yes | We used λ = 20 as the entropic regularization parameter because it offered a good trade-off between the accuracy and the speed of Sinkhorn's algorithm. Word2Vec embedding vectors are L2-normalized, but MNIST embedding vectors are not normalized in our setup. In addition, when computing our approximations, the histogram weights are always L1-normalized. Lastly, the transportation cost between two words or pixels is the Euclidean (L2) distance between their embedding vectors. |
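The setup described in this row (L2-normalized embeddings, L1-normalized histogram weights, Euclidean transportation costs, and entropy-regularized transport with λ = 20) can be sketched as follows. This is a minimal illustration using NumPy, not the authors' code; the function names and the fixed iteration count are our own assumptions, and the `sinkhorn_distance` helper follows Cuturi's (2013) scaling iterations rather than the paper's GPU implementation.

```python
import numpy as np

def preprocess(embeddings, weights, l2_normalize=True):
    """L2-normalize embedding vectors (as done for Word2Vec in the paper;
    MNIST embeddings were left unnormalized) and L1-normalize the
    histogram weights so they form a probability distribution."""
    E = np.asarray(embeddings, dtype=np.float64)
    if l2_normalize:
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
    w = np.asarray(weights, dtype=np.float64)
    return E, w / w.sum()

def cost_matrix(E1, E2):
    """Pairwise Euclidean (L2) transportation costs between embeddings."""
    diff = E1[:, None, :] - E2[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sinkhorn_distance(w1, w2, C, lam=20.0, n_iter=200):
    """Entropy-regularized transport cost via Sinkhorn scaling iterations,
    with regularization parameter lam = 20 as used in the experiments.
    A fixed iteration count is assumed here for simplicity."""
    K = np.exp(-lam * C)          # Gibbs kernel
    u = np.ones_like(w1)
    for _ in range(n_iter):       # alternate row/column scaling
        v = w2 / (K.T @ u)
        u = w1 / (K @ v)
    P = u[:, None] * K * v[None, :]  # approximate transport plan
    return (P * C).sum()
```

For example, with two unit-mass histograms over two embedding points at Euclidean distance 1, the regularized cost returned is close to zero because almost all mass stays on the diagonal of the plan.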