Sparse Dimensionality Reduction Revisited

Authors: Mikael Møller Høgsgaard, Lior Kamma, Kasper Green Larsen, Jelani Nelson, Chris Schwiegelshohn

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Theoretical Our first main result is an improved analysis of the random sparse embedding by Kane and Nelson (2014), reducing the O(ε^{-1} lg n) upper bound on the sparsity in the case d ≪ n. Formally we show the following. Theorem 1.1. Let 0 < ε < ε₀ for some constant ε₀. There is a distribution over s-sparse matrices in R^{m×d} with m = O(ε^{-2} lg n) and s = O(ε^{-1}(lg n / lg(1/ε) + lg^{2/3} n lg^{1/3} d)), such that for every set of n vectors X ⊆ R^d, it holds with probability at least 1 − O(1/d) that a sampled matrix is an ε-JL matrix for X. Our second result complements the upper bound in Theorem 1.1 with a tight lower bound on the sparsity of ε-JL matrices. We show that if m is sufficiently smaller than d, then every ε-JL matrix embedding d-dimensional vectors into R^m must have relatively dense columns. Formally we show the following. Theorem 1.2. Let 0 < ε < 1/4, and let m be such that m = Ω(ε^{-2} lg n) and m ≤ (εd / lg n)^{1 − o(1)}. Then there is a set of n vectors X ⊆ R^d such that any ε-JL matrix A embedding X into R^m must have a column with sparsity s satisfying s = Ω(ε^{-1} lg n / lg(m / lg n)). In this work, we provide the first proof that keeps m = O(k/ε^2) while showing a sparsity bound that is o(k/ε). A toy sketch of such an s-sparse sign embedding is given after this table.
Researcher Affiliation Academia (1) Computer Science Department, Aarhus University, Aarhus, Denmark; (2) School of Computer Science, Academic College of Tel-Aviv Yaffo, Tel-Aviv, Israel; (3) Department of EECS, UC Berkeley, Berkeley, CA, USA.
Pseudocode No The paper describes algorithms conceptually but does not include any structured pseudocode or algorithm blocks. Explanation: The paper does not contain any sections or figures explicitly labeled "Pseudocode" or "Algorithm," nor does it present structured steps in a code-like format.
Open Source Code No The paper does not mention providing open-source code for the described methodology. Explanation: There is no statement about releasing code, nor any links to a code repository for the methodology.
Open Datasets No Consider, for instance, the Sentiment140 data set consisting of 1.6M tweets (Go et al., 2009). Using a bag-of-words or tf-idf representation of the tweets, where all words occurring less than 10 times in the 1.6M tweets have been stripped, results in a data set with n = 1.6×10^6, d = 37,129 and an average of 12 words/non-zeros per tweet. Similarly, for the Kosarak data set (Benson et al., 2018), consisting of an anonymized clickstream from a Hungarian online news portal, there are n = 900,002 transactions, each consisting of several items. Explanation: While two datasets are mentioned, they are used as motivating examples rather than for conducting experiments, and no concrete access information (such as a direct link or download instructions) is provided for reproducibility within the scope of *their* work. The citations are for the original datasets, not access points for their use.
Dataset Splits No The paper is theoretical and does not conduct experiments; therefore, no training/validation/test splits are mentioned. Explanation: The paper is theoretical and does not describe experiments that would involve dataset splits for training, validation, or testing.
Hardware Specification No The paper is theoretical and does not describe empirical experiments; therefore, no hardware specifications are mentioned. Explanation: The paper does not conduct experiments and thus does not mention any hardware specifications.
Software Dependencies No For efficient use in practice, sparse dimensionality reduction techniques have, for instance, been implemented in the popular library scikit-learn (Pedregosa et al., 2011) as Sparse Random Projection. Explanation: The paper mentions `scikit-learn` as a library that implements sparse dimensionality reduction, but it does not specify any software dependencies with version numbers for the work presented in the paper itself. A usage sketch of the scikit-learn class follows the table.
Experiment Setup No The paper is theoretical and does not describe empirical experiments; therefore, no experimental setup details or hyperparameters are mentioned. Explanation: The paper is theoretical and does not describe experiments that would involve specific setup details such as hyperparameters or training configurations.
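
To make the Research Type row concrete, here is a minimal, illustrative sketch of the kind of s-sparse sign embedding referred to in Theorem 1.1: an m×d matrix whose columns each carry exactly s nonzero entries ±1/√s placed in uniformly chosen rows. This is one common variant of the Kane-Nelson construction, not code from the paper (which provides none), and the constants in the parameter choices below are arbitrary placeholders.

```python
import numpy as np

def sparse_jl_matrix(d, m, s, seed=None):
    """Sample an s-sparse sign matrix A in R^{m x d}.

    Each column gets exactly s nonzero entries, equal to +/- 1/sqrt(s),
    placed in s distinct rows chosen uniformly at random (one common
    variant of the Kane-Nelson sparse JL construction).
    """
    rng = np.random.default_rng(seed)
    A = np.zeros((m, d))
    for j in range(d):
        rows = rng.choice(m, size=s, replace=False)   # s distinct target rows
        signs = rng.choice([-1.0, 1.0], size=s)       # Rademacher signs
        A[rows, j] = signs / np.sqrt(s)               # scaling gives E||Ax||^2 = ||x||^2
    return A

# Toy check of the distortion on n random points in R^d.
n, d, eps = 1_000, 5_000, 0.25
m = int(np.ceil(4 * np.log(n) / eps**2))   # m = O(eps^-2 lg n); the constant 4 is arbitrary
s = int(np.ceil(np.log(n) / eps))          # baseline sparsity s = O(eps^-1 lg n)
X = np.random.default_rng(0).standard_normal((n, d))
A = sparse_jl_matrix(d, m, s, seed=1)
distortion = np.abs(np.linalg.norm(X @ A.T, axis=1) / np.linalg.norm(X, axis=1) - 1)
print(f"m={m}, s={s}, max relative distortion: {distortion.max():.3f}")
```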
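
Complementing the Software Dependencies row, here is a hedged usage sketch of the SparseRandomProjection class mentioned there. It relies on scikit-learn's generic sparse random projection (density chosen automatically in the style of Li et al.), which is not the specific distribution analysed in the paper, and the input matrix is a synthetic stand-in for a tf-idf matrix rather than the Sentiment140 or Kosarak data.

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection, johnson_lindenstrauss_min_dim

# Synthetic stand-in for a sparse tf-idf matrix: n points in d dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 5_000)) * (rng.random((1_000, 5_000)) < 0.01)

eps = 0.2
m = johnson_lindenstrauss_min_dim(n_samples=X.shape[0], eps=eps)  # m = O(eps^-2 lg n)
proj = SparseRandomProjection(n_components=m, density="auto", random_state=0)
X_emb = proj.fit_transform(X)
print(f"embedded {X.shape} -> {X_emb.shape}")
```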