Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Understanding Sparse JL for Feature Hashing
Authors: Meena Jagadeesan
NeurIPS 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our result theoretically demonstrates that sparse JL with s > 1 can have significantly better norm-preservation properties on feature vectors than sparse JL with s = 1; we also empirically demonstrate this finding. We also empirically support our theoretical findings in Theorem 1.5. First, we illustrate with realworld datasets the potential benefits of using small constants s > 1 for sparse JL on feature vectors. We specifically show that s = {4, 8, 16} consistently outperforms s = 1 in preserving the ℓ2 norm of each vector, and that there can be up to a factor of ten decrease in failure probability for s = 8, 16 in comparison to s = 1. Second, we use synthetic data to illustrate phase transitions and other trends in Theorem 1.5. |
| Researcher Affiliation | Academia | Meena Jagadeesan Harvard University Cambridge, MA 02138 EMAIL |
| Pseudocode | No | No pseudocode or algorithm blocks are present. |
| Open Source Code | No | The paper mentions that a variant of sparse JL is included in the Python sklearn library (footnote 3: https://scikit-learn.org/stable/modules/random_projection.html), but does not provide specific access to their own source code for the methodology described in the paper. |
| Open Datasets | Yes | We considered two bag-of-words datasets: the News20 dataset [1] (based on newsgroup documents), and the Enron email dataset [26] (based on e-mails from the senior management of Enron). |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits, only stating that the failure probability was estimated 'for each dataset'. |
| Hardware Specification | No | No specific hardware details (such as GPU/CPU models or memory) used for running the experiments are mentioned. |
| Software Dependencies | No | The paper mentions the use of the 'Python sklearn library' but does not specify version numbers for sklearn or Python, which is necessary for reproducibility. |
| Experiment Setup | Yes | We consider s {1, 2, 4, 8, 16}, and choose m values so that 0.01 ˆδ(1, m, ϵ) 0.04. We computed each ˆδ(s, m, ϵ, w) using 100,000 samples from a block sparse JL distribution. Our synthetic data consisted of binary vectors (i.e. vectors whose entries are in {0, 1}). |