Practical Hash Functions for Similarity Estimation and Dimensionality Reduction
Authors: Søren Dahlgaard, Mathias Knudsen, Mikkel Thorup
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main contribution, however, is an experimental comparison of different hashing schemes when used inside FH, OPH, and LSH. |
| Researcher Affiliation | Collaboration | Søren Dahlgaard University of Copenhagen / Sup Wiz s.dahlgaard@supwiz.com Mathias Bæk Tejs Knudsen University of Copenhagen / Sup Wiz m.knudsen@supwiz.com Mikkel Thorup University of Copenhagen mthorup@di.ku.dk |
| Pseudocode | Yes | A sample implementation with c = d = 4 and 32-bit keys and values can be found below. uint64_t mt_T1[256][4]; // Filled with random bits uint32_t mt_T2[256][4]; // Filled with random bits uint32_t mixedtab(uint32_t x) { uint64_t h=0; // This will be the final hash value for(int i = 0;i < 4;++i, x >>= 8) h = mt_T1[(uint8_t)x][i]; uint32_t drv=h >> 32; for(int i = 0;i < 4;++i, drv >>= 8) h = mt_T2[(uint8_t)drv][i]; return (uint32_t)h; } |
| Open Source Code | No | The paper states 'All experiments are implemented in C++11' and mentions using 'official implementations' for third-party hashes, but does not provide a link or explicit statement for the availability of their own source code for the mixed tabulation scheme or the experimental framework. |
| Open Datasets | Yes | We consider the following real-world data sets MNIST [21] Standard collection of handwritten digits... News20 [11] Collection of newsgroup documents. |
| Dataset Splits | No | The paper mentions 'standard partition of 60000 database points and 10000 query points' for MNIST and 'randomly split the set into two sets of roughly 10000 database and query points' for News20, which imply training and testing sets, but does not explicitly detail a separate 'validation' split with percentages or counts. |
| Hardware Specification | No | No specific hardware details like GPU models, CPU models, or memory amounts are provided. The paper only mentions that official implementations 'are highly optimized to the x86 and x64 architectures'. |
| Software Dependencies | No | The paper states 'All experiments are implemented in C++11' and mentions using 'official implementations of Murmur Hash3, City Hash and Blake2' but does not provide specific version numbers for these or other key software components. |
| Experiment Setup | Yes | We use OPH with densification as in [32] implemented with different basic hash functions to estimate J(A, B). We generate one instance of A and B and perform 2000 independent repetitions for each different hash function on these A and B. Figure 2 shows the histogram and mean squared error (MSE) of estimates obtained with n = 2000 and k = 200. ... For FH we obtained a vector v by taking the indicator vector of a set A generated as above and normalizing the length. ... Figure 3 displays the histograms and MSE we obtained for d = 200. ... We test all combinations of K {8, 10, 12} and L {8, 10, 12}. |