Dimensionality Reduction of Massive Sparse Datasets Using Coresets
Authors: Dan Feldman, Mikhail Volkov, Daniela Rus
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implemented our algorithm to obtain a low-rank approximation for the term-document matrix of Wikipedia with provable error bounds. Since our streaming algorithm is also embarrassingly parallel, we run it on Amazon Cloud and receive a significantly better running time and accuracy compared to existing heuristics (e.g. Hadoop/MapReduce) that yield non-sparse solutions. |
| Researcher Affiliation | Academia | Dan Feldman, University of Haifa, Haifa, Israel (dannyf.post@gmail.com); Mikhail Volkov, CSAIL, MIT, Cambridge, MA, USA (mikhail@csail.mit.edu); Daniela Rus, CSAIL, MIT, Cambridge, MA, USA (rus@csail.mit.edu) |
| Pseudocode | Yes | Fig. 1a contains the pseudocode for Algorithm 1. [...] Fig. 5 contains the pseudocode for Algorithm 2. |
| Open Source Code | Yes | Our project codebase is open-sourced and can be found here: http://people.csail.mit.edu/mikhail/NIPS2016. |
| Open Datasets | Yes | For our large-scale grand challenge experiment, we apply our algorithm for computing Latent Semantic Analysis (LSA) on the entire English Wikipedia. |
| Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits, percentages, or detailed splitting methodology. |
| Hardware Specification | No | The paper mentions running on 'Amazon Cloud' and that existing SVD implementations 'crash on a regular laptop' but does not provide specific hardware details (e.g., CPU/GPU models, memory sizes). |
| Software Dependencies | No | The coreset construction algorithm described in Section 5 was implemented in MATLAB. We make use of the redsvd package [12] to improve performance, but it is not required to run the system. (No version numbers provided for MATLAB or redsvd). |
| Experiment Setup | Yes | We specify a nominal error of ε=0.5, which is a theoretical upper bound for N = 2k/ε iterations, and show that the coreset error remains bounded. [...] For our synthetic data experiments, we used a moderate-size sparse input of (5000 × 1000) to evaluate the relationship between the error ε and the number of iterations of the algorithm N. |
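
The Research Type and Open Datasets rows quote the paper's grand-challenge experiment: Latent Semantic Analysis (LSA) on the term-document matrix of the English Wikipedia. For context, the sketch below computes the standard rank-k LSA baseline with an off-the-shelf sparse truncated SVD on a small synthetic stand-in matrix; it is not the authors' streaming coreset algorithm, and the matrix size, density, and rank k are illustrative assumptions.

```python
# Minimal LSA baseline: truncated SVD of a sparse term-document matrix.
# This is the full-data computation the paper's coreset approximates at scale,
# NOT the authors' streaming coreset construction.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Hypothetical small term-document matrix (documents x terms); in the paper this
# role is played by the English Wikipedia term-document matrix.
A = sp.random(5000, 1000, density=0.01, format="csr", random_state=0)

k = 10                      # number of latent "topics" (rank of the approximation)
U, s, Vt = svds(A, k=k)     # k largest singular triplets of the sparse matrix

# Rank-k reconstruction and its relative Frobenius error.
A_dense = A.toarray()
A_k = (U * s) @ Vt
rel_err = np.linalg.norm(A_dense - A_k, "fro") / np.linalg.norm(A_dense, "fro")
print(f"relative rank-{k} LSA error: {rel_err:.3f}")
```

The point of the coreset approach is to avoid ever forming this decomposition on the full matrix: a small weighted subset of rows is maintained in a streaming fashion so that the rank-k cost on the subset approximates the cost on the full data up to a (1 + ε) factor.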
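The Experiment Setup row specifies ε = 0.5, N = 2k/ε iterations, and a 5000 × 1000 synthetic sparse input. A hedged sketch of that style of evaluation follows: it measures how the rank-k projection cost obtained from a row sample compares with the cost obtained from the full matrix. Uniform row sampling is used purely as a placeholder for the coreset, and k, the density, and the sample sizes are assumptions, so the numbers do not reproduce the paper's figures.

```python
# Sketch: excess rank-k projection cost of a row sample vs. the full matrix.
# Uniform sampling stands in for the paper's deterministic streaming coreset.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

rng = np.random.default_rng(1)
# Synthetic sparse input of the size quoted in the Experiment Setup row.
A = sp.random(5000, 1000, density=0.01, format="csr", random_state=1)
k, eps = 10, 0.5
A_dense = A.toarray()

def projection_cost(Vt):
    """Squared Frobenius cost of projecting the rows of A onto span(Vt)."""
    residual = A_dense - A_dense @ Vt.T @ Vt
    return np.linalg.norm(residual, "fro") ** 2

_, _, Vt_full = svds(A, k=k)
opt = projection_cost(Vt_full)   # cost of the best rank-k subspace of the full data

# Grow the sample; the first size matches the quoted N = 2k/eps bound.
for m in (int(2 * k / eps), 200, 1000, 4000):
    rows = rng.choice(A.shape[0], size=m, replace=False)
    _, _, Vt_s = svds(A[rows], k=k)          # subspace fitted on the sample only
    excess = (projection_cost(Vt_s) - opt) / opt
    print(f"rows sampled: {m:5d}  relative excess cost: {excess:.4f}")
```

As the sample grows, the excess cost should shrink; the paper's contribution is a deterministic construction whose size depends on k and ε but not on the number of rows or columns, with a provable bound on this excess.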