Understanding Sparse JL for Feature Hashing
Authors: Meena Jagadeesan
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our result theoretically demonstrates that sparse JL with s > 1 can have significantly better norm-preservation properties on feature vectors than sparse JL with s = 1, and we also empirically support our theoretical findings in Theorem 1.5. First, we illustrate with real-world datasets the potential benefits of using small constants s > 1 for sparse JL on feature vectors. We specifically show that s ∈ {4, 8, 16} consistently outperforms s = 1 in preserving the ℓ2 norm of each vector, and that there can be up to a factor-of-ten decrease in failure probability for s = 8, 16 in comparison to s = 1. Second, we use synthetic data to illustrate phase transitions and other trends in Theorem 1.5. |
| Researcher Affiliation | Academia | Meena Jagadeesan Harvard University Cambridge, MA 02138 mjagadeesan@college.harvard.edu |
| Pseudocode | No | No pseudocode or algorithm blocks are present. |
| Open Source Code | No | The paper mentions that a variant of sparse JL is included in the Python sklearn library (footnote 3: https://scikit-learn.org/stable/modules/random_projection.html), but does not provide access to the authors' own source code for the methodology described in the paper. A hedged usage sketch of the sklearn variant is given below the table. |
| Open Datasets | Yes | We considered two bag-of-words datasets: the News20 dataset [1] (based on newsgroup documents), and the Enron email dataset [26] (based on e-mails from the senior management of Enron). |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits, only stating that the failure probability was estimated 'for each dataset'. |
| Hardware Specification | No | No specific hardware details (such as GPU/CPU models or memory) used for running the experiments are mentioned. |
| Software Dependencies | No | The paper mentions the use of the 'Python sklearn library' but does not specify version numbers for sklearn or Python, which are needed for reproducibility. |
| Experiment Setup | Yes | We consider s ∈ {1, 2, 4, 8, 16}, and choose m values so that 0.01 ≤ δ̂(1, m, ϵ) ≤ 0.04. We computed each δ̂(s, m, ϵ, w) using 100,000 samples from a block sparse JL distribution. Our synthetic data consisted of binary vectors (i.e. vectors whose entries are in {0, 1}). |
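
The Experiment Setup row quotes the paper's procedure of estimating a failure probability δ̂ by sampling sparse JL matrices and checking ℓ2-norm preservation on binary vectors. The sketch below is a minimal, hedged illustration of that kind of Monte Carlo estimate, not the authors' code: it uses the simple construction with exactly s nonzero entries of ±1/√s per column in uniformly random rows (the paper samples from a block sparse JL distribution), a far smaller trial count than the paper's 100,000 samples, and hypothetical function names (`sparse_jl_matrix`, `failure_probability`) of our own.

```python
# Minimal sketch (not the authors' code) of estimating the failure probability
# delta_hat(s, m, eps) = P( | ||Ax||_2^2 - ||x||_2^2 | > eps * ||x||_2^2 )
# for a fixed vector x. The paper uses a block sparse JL distribution; here we
# use the simpler variant with exactly s nonzeros of +/- 1/sqrt(s) per column.
import numpy as np

def sparse_jl_matrix(m, n, s, rng):
    """Draw an m x n sparse JL matrix with exactly s nonzeros per column."""
    A = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)   # random row positions
        signs = rng.choice([-1.0, 1.0], size=s)       # Rademacher signs
        A[rows, j] = signs / np.sqrt(s)
    return A

def failure_probability(x, m, s, eps, trials=100, seed=0):
    """Monte Carlo estimate of the norm-preservation failure probability."""
    rng = np.random.default_rng(seed)
    sq_norm = float(x @ x)
    failures = 0
    for _ in range(trials):
        A = sparse_jl_matrix(m, x.shape[0], s, rng)
        y = A @ x
        failures += abs(float(y @ y) - sq_norm) > eps * sq_norm
    return failures / trials

# Synthetic binary vector, in the spirit of the paper's synthetic experiments.
x = np.zeros(500)
x[:50] = 1.0   # entries in {0, 1}
for s in (1, 2, 4, 8, 16):
    print(s, failure_probability(x, m=256, s=s, eps=0.1))
```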
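
The Open Source Code row notes that only the sklearn variant of sparse JL is referenced (footnote 3). Below is a hedged usage sketch of sklearn's `SparseRandomProjection`; note that it parameterizes sparsity by an expected density of nonzeros rather than the exact per-column count s analyzed in the paper, and the data here is synthetic, not the News20 or Enron bag-of-words datasets.

```python
# Hedged usage sketch of the sklearn "variant of sparse JL" referenced in
# footnote 3. SparseRandomProjection sets sparsity via an expected density of
# nonzeros per column (density * n_components), which only approximates the
# exact per-column sparsity s analyzed in the paper.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.random_projection import SparseRandomProjection

rng = np.random.default_rng(0)
# Synthetic sparse binary rows standing in for bag-of-words feature vectors.
X = csr_matrix((rng.random((200, 5000)) < 0.01).astype(float))

m = 256                      # projection dimension
s = 8                        # target expected nonzeros per column
proj = SparseRandomProjection(n_components=m, density=s / m,
                              dense_output=True, random_state=0)
X_proj = proj.fit_transform(X)

# Empirical fraction of rows whose squared l2 norm is distorted by more than eps.
eps = 0.1
orig = np.asarray(X.multiply(X).sum(axis=1)).ravel()
new = (X_proj ** 2).sum(axis=1)
print(np.mean(np.abs(new - orig) > eps * orig))
```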