Understanding Sparse JL for Feature Hashing

Authors: Meena Jagadeesan

Venue: NeurIPS 2019

Reproducibility assessment: for each variable below, the result is given first, followed by the LLM response.
Research Type: Experimental. Our result theoretically demonstrates that sparse JL with s > 1 can have significantly better norm-preservation properties on feature vectors than sparse JL with s = 1, and we empirically support these theoretical findings (Theorem 1.5). First, we illustrate with real-world datasets the potential benefits of using small constants s > 1 for sparse JL on feature vectors. We specifically show that s ∈ {4, 8, 16} consistently outperforms s = 1 in preserving the ℓ2 norm of each vector, and that there can be up to a factor-of-ten decrease in failure probability for s = 8, 16 in comparison to s = 1. Second, we use synthetic data to illustrate phase transitions and other trends in Theorem 1.5.
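
The comparison above concerns how well a sparse JL transform with sparsity s preserves the ℓ2 norm of a feature vector. The following minimal sketch (illustrative only, not the authors' code) draws the standard sparse JL matrix, in which every column has exactly s nonzero entries equal to ±1/√s placed in distinct rows, and applies it to a hypothetical feature vector with one heavy coordinate; the dimensions and vector values are assumptions chosen for illustration, not settings from the paper.

```python
import numpy as np

def sparse_jl_matrix(m, n, s, rng):
    """m x n sparse JL matrix: each column has exactly s nonzero entries
    equal to +/- 1/sqrt(s), placed in s distinct rows."""
    A = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)
        A[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return A

rng = np.random.default_rng(0)
# Hypothetical feature vector with one heavy coordinate (illustrative only).
x = np.zeros(10_000)
x[0] = 1.0
x[1:200] = 0.05
for s in (1, 4, 8, 16):
    A = sparse_jl_matrix(m=512, n=x.size, s=s, rng=rng)
    ratio = np.linalg.norm(A @ x) / np.linalg.norm(x)
    print(f"s={s:2d}  ||Ax|| / ||x|| = {ratio:.3f}")
```

Repeating such draws many times and counting how often | ||Ax||² − ||x||² | exceeds ϵ||x||² gives the empirical failure probability that the paper compares across values of s.
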
Researcher Affiliation: Academia. Meena Jagadeesan, Harvard University, Cambridge, MA 02138, mjagadeesan@college.harvard.edu.
Pseudocode: No. No pseudocode or algorithm blocks are present.
Open Source Code: No. The paper notes that a variant of sparse JL is included in the Python sklearn library (footnote 3: https://scikit-learn.org/stable/modules/random_projection.html), but it does not provide access to the authors' own source code for the methodology described in the paper.
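
For context, the sklearn variant mentioned in that footnote can be exercised as follows. Note that sklearn's SparseRandomProjection makes each entry nonzero independently with probability density, rather than placing exactly s nonzeros per column, so it is only a variant of the construction analyzed in the paper; setting density = s/m gives roughly s expected nonzeros per column, and all parameter values below are illustrative assumptions.

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.default_rng(0)
X = rng.random((100, 10_000))   # 100 illustrative dense feature vectors

m, s = 512, 8                   # target dimension and nominal sparsity (illustrative)
proj = SparseRandomProjection(n_components=m, density=s / m, random_state=0)
Y = proj.fit_transform(X)       # density = s/m gives ~s expected nonzeros per column

ratios = np.linalg.norm(Y, axis=1) / np.linalg.norm(X, axis=1)
print(f"norm ratios: min={ratios.min():.3f}, max={ratios.max():.3f}")
```
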
Open Datasets: Yes. "We considered two bag-of-words datasets: the News20 dataset [1] (based on newsgroup documents), and the Enron email dataset [26] (based on e-mails from the senior management of Enron)."
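
As a stand-in (not necessarily the exact preprocessing used in the paper), a comparable bag-of-words representation of the newsgroup corpus can be loaded through sklearn's built-in fetcher; the Enron e-mail corpus would need to be obtained and vectorized separately.

```python
from sklearn.datasets import fetch_20newsgroups_vectorized

# Sparse bag-of-words matrix (documents x vocabulary); this loader serves as a
# convenient stand-in for the News20 data referenced above.
bunch = fetch_20newsgroups_vectorized(subset="all")
X = bunch.data
print("documents:", X.shape[0], "vocabulary size:", X.shape[1])
print("average nonzero features per document: %.1f" % X.getnnz(axis=1).mean())
```
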
Dataset Splits: No. The paper does not provide explicit training/validation/test splits; it states only that the failure probability was estimated "for each dataset".
Hardware Specification: No. No hardware details (such as GPU/CPU models or memory) used to run the experiments are mentioned.
Software Dependencies: No. The paper mentions use of the Python sklearn library but does not specify version numbers for sklearn or Python, which would be necessary for reproducibility.
Experiment Setup: Yes. "We consider s ∈ {1, 2, 4, 8, 16}, and choose m values so that 0.01 ≤ δ̂(1, m, ϵ) ≤ 0.04. We computed each δ̂(s, m, ϵ, w) using 100,000 samples from a block sparse JL distribution. Our synthetic data consisted of binary vectors (i.e. vectors whose entries are in {0, 1})."
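
Under this setup, the empirical failure probability δ̂(s, m, ϵ, w) can be estimated by Monte Carlo roughly as sketched below. The block construction (rows split into s blocks of m/s rows, one ±1/√s entry per block in each column) follows the standard block sparse JL distribution the quote refers to; the specific values of w, m, ϵ and the reduced sample count are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def block_sparse_jl(m, n, s, rng):
    """Block sparse JL: the m rows are split into s blocks of m/s rows; each
    column gets one +/- 1/sqrt(s) entry in a uniformly random row of every block."""
    assert m % s == 0
    block = m // s
    A = np.zeros((m, n))
    offsets = block * np.arange(s)
    for j in range(n):
        rows = offsets + rng.integers(0, block, size=s)
        A[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return A

def delta_hat(w, m, s, eps, samples, seed=0):
    """Empirical failure probability for a unit-normalized binary vector with
    w ones: fraction of draws with | ||Ax||^2 - 1 | > eps. Only the w columns
    in the vector's support affect Ax, so only those are generated."""
    rng = np.random.default_rng(seed)
    x = np.ones(w) / np.sqrt(w)
    fails = 0
    for _ in range(samples):
        y = block_sparse_jl(m, w, s, rng) @ x
        fails += abs(y @ y - 1.0) > eps
    return fails / samples

# Illustrative parameters; the paper estimates each delta-hat from 100,000 samples.
w, m, eps = 32, 256, 0.1
for s in (1, 2, 4, 8, 16):
    print(f"s={s:2d}  delta_hat={delta_hat(w, m, s, eps, samples=2000):.4f}")
```

Sweeping m (chosen so that δ̂(1, m, ϵ) falls in the stated range) and w would reproduce the kind of trends, including the phase transitions of Theorem 1.5, that the paper illustrates on synthetic binary vectors.
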