Local Density Estimation in High Dimensions
Authors: Xian Wu, Moses Charikar, Vishnu Natchu
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our algorithm uses locality sensitive hashing to preprocess the data to accurately and efficiently estimate the answers to such questions via an unbiased estimator that uses importance sampling. [...] We demonstrate the effectiveness of our algorithm by experiments on a standard word embedding dataset. |
| Researcher Affiliation | Collaboration | ¹Stanford University, USA; ²Laserlike Inc., USA |
| Pseudocode | Yes | Theorem 3.1 (Aggregate-Counts). Given a set of K hash tables, each with 2^t hash buckets with addresses in {0, 1}^t, Aggregate-Counts (Algorithm 1) computes...; Theorem 3.2 (Sampler). ...Hamming-Distance-Sampler (Algorithm 2) generates a sample in time O(t). |
| Open Source Code | No | The paper does not provide any specific links to source code, nor does it state that the code for their methodology is publicly released or available in supplementary materials. |
| Open Datasets | Yes | We use the set of 400,000 pre-trained 50-dimensional word embedding vectors trained from Wikipedia 2014 + Gigaword 5, provided by (Pennington et al., 2014). |
| Dataset Splits | No | The paper uses a pre-trained dataset (GloVe embeddings) and evaluates an estimator on it. It does not describe standard train/validation/test splits for training a model, nor does it provide specific percentages or counts for data partitioning relevant to typical model validation. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, or specific libraries with their versions) that would be needed to reproduce the experiment setup. |
| Experiment Setup | Yes | We also fix t = 20 in all of our experiments, since we have 400,000 embeddings in total and 20 ≈ log2(400,000). [...] In this experiment, we fix our sampling budget to 1000 samples and the table budget to 20 tables. |
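
The method and pseudocode rows above describe LSH preprocessing into K hash tables of 2^t buckets and a Hamming-distance-based sampler. The sketch below is a minimal, hypothetical illustration of that general recipe, assuming random-hyperplane (SimHash) codes; it averages a kernel over uniformly retrieved candidates rather than implementing the paper's unbiased importance-weighted estimator, and none of the function names or details correspond to the paper's Algorithms 1-2.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_tables(X, K=20, t=20):
    """Hash every row of X into K tables of 2^t buckets via random hyperplanes (SimHash)."""
    tables = []
    for _ in range(K):
        H = rng.standard_normal((X.shape[1], t))        # random projection for this table
        codes = (X @ H > 0).astype(np.int64)            # t-bit code per point
        keys = codes @ (1 << np.arange(t))              # pack bits into a bucket address in {0, 1}^t
        buckets = {}
        for idx, key in enumerate(keys):
            buckets.setdefault(int(key), []).append(idx)
        tables.append((H, buckets))
    return tables

def estimate_density(q, X, tables, kernel, budget=1000):
    """Retrieve the query's buckets and average the kernel over a uniform sample.

    NOTE: a simplified stand-in, not the paper's unbiased importance-sampling estimator.
    """
    candidates = []
    for H, buckets in tables:
        code = (q @ H > 0).astype(np.int64)
        key = int(code @ (1 << np.arange(len(code))))
        candidates.extend(buckets.get(key, []))
    if not candidates:
        return 0.0
    sample = rng.choice(candidates, size=min(budget, len(candidates)), replace=True)
    return float(np.mean([kernel(q, X[i]) for i in sample]))
```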
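
A hypothetical reproduction of the quoted experiment setup (400,000 pre-trained 50-dimensional GloVe vectors, t = 20 ≈ log2(400,000), a table budget of 20 tables, and a 1000-sample budget), reusing the helpers from the sketch above. The file name `glove.6B.50d.txt` is the standard distribution name for the Wikipedia 2014 + Gigaword 5 vectors, and the cosine kernel is an illustrative choice; neither is stated in the paper.

```python
import numpy as np

def load_glove(path="glove.6B.50d.txt"):
    """Load the pre-trained 50-dimensional GloVe vectors (Pennington et al., 2014)."""
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.asarray(parts[1:], dtype=np.float32))
    return words, np.vstack(vecs)

words, X = load_glove()                        # ~400,000 x 50 embedding matrix
tables = build_tables(X, K=20, t=20)           # table budget: 20 tables, t = 20 hash bits
cosine = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(estimate_density(X[0], X, tables, kernel=cosine, budget=1000))  # 1000-sample budget
```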