reproducibilityindex.ai

Practical Hash Functions for Similarity Estimation and Dimensionality Reduction

Authors: Søren Dahlgaard, Mathias Knudsen, Mikkel Thorup

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our main contribution, however, is an experimental comparison of different hashing schemes when used inside FH, OPH, and LSH.
Researcher Affiliation	Collaboration	Søren Dahlgaard University of Copenhagen / Sup Wiz s.dahlgaard@supwiz.com Mathias Bæk Tejs Knudsen University of Copenhagen / Sup Wiz m.knudsen@supwiz.com Mikkel Thorup University of Copenhagen mthorup@di.ku.dk
Pseudocode	Yes	A sample implementation with c = d = 4 and 32-bit keys and values can be found below. uint64_t mt_T1[256][4]; // Filled with random bits uint32_t mt_T2[256][4]; // Filled with random bits uint32_t mixedtab(uint32_t x) { uint64_t h=0; // This will be the final hash value for(int i = 0;i < 4;++i, x >>= 8) h = mt_T1[(uint8_t)x][i]; uint32_t drv=h >> 32; for(int i = 0;i < 4;++i, drv >>= 8) h = mt_T2[(uint8_t)drv][i]; return (uint32_t)h; }
Open Source Code	No	The paper states 'All experiments are implemented in C++11' and mentions using 'ofﬁcial implementations' for third-party hashes, but does not provide a link or explicit statement for the availability of their own source code for the mixed tabulation scheme or the experimental framework.
Open Datasets	Yes	We consider the following real-world data sets MNIST [21] Standard collection of handwritten digits... News20 [11] Collection of newsgroup documents.
Dataset Splits	No	The paper mentions 'standard partition of 60000 database points and 10000 query points' for MNIST and 'randomly split the set into two sets of roughly 10000 database and query points' for News20, which imply training and testing sets, but does not explicitly detail a separate 'validation' split with percentages or counts.
Hardware Specification	No	No specific hardware details like GPU models, CPU models, or memory amounts are provided. The paper only mentions that official implementations 'are highly optimized to the x86 and x64 architectures'.
Software Dependencies	No	The paper states 'All experiments are implemented in C++11' and mentions using 'ofﬁcial implementations of Murmur Hash3, City Hash and Blake2' but does not provide specific version numbers for these or other key software components.
Experiment Setup	Yes	We use OPH with densiﬁcation as in [32] implemented with different basic hash functions to estimate J(A, B). We generate one instance of A and B and perform 2000 independent repetitions for each different hash function on these A and B. Figure 2 shows the histogram and mean squared error (MSE) of estimates obtained with n = 2000 and k = 200. ... For FH we obtained a vector v by taking the indicator vector of a set A generated as above and normalizing the length. ... Figure 3 displays the histograms and MSE we obtained for d = 200. ... We test all combinations of K {8, 10, 12} and L {8, 10, 12}.