Dimensionality Reduction of Massive Sparse Datasets Using Coresets
Authors: Dan Feldman, Mikhail Volkov, Daniela Rus
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implemented our algorithm to obtain a low-rank approximation for the term-document matrix of Wikipedia with provable error bounds. Since our streaming algorithm is also embarrassingly parallel, we run it on Amazon Cloud and receive a significantly better running time and accuracy compared to existing heuristics (e.g. Hadoop/MapReduce) that yield non-sparse solutions. |
| Researcher Affiliation | Academia | Dan Feldman, University of Haifa, Haifa, Israel (dannyf.post@gmail.com); Mikhail Volkov, CSAIL, MIT, Cambridge, MA, USA (mikhail@csail.mit.edu); Daniela Rus, CSAIL, MIT, Cambridge, MA, USA (rus@csail.mit.edu) |
| Pseudocode | Yes | Fig. 1a contains the pseudocode for Algorithm 1. [...] Fig. 5 contains the pseudocode for Algorithm 2. |
| Open Source Code | Yes | Our project codebase is open-sourced and can be found here: http://people.csail.mit.edu/mikhail/NIPS2016. |
| Open Datasets | Yes | For our large-scale grand challenge experiment, we apply our algorithm for computing Latent Semantic Analysis (LSA) on the entire English Wikipedia. |
| Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits, percentages, or detailed splitting methodology. |
| Hardware Specification | No | The paper mentions running on 'Amazon Cloud' and that existing SVD implementations 'crash on a regular laptop' but does not provide specific hardware details (e.g., CPU/GPU models, memory sizes). |
| Software Dependencies | No | The coreset construction algorithm described in Section 5 was implemented in MATLAB. We make use of the redsvd package [12] to improve performance, but it is not required to run the system. (No version numbers provided for MATLAB or redsvd). |
| Experiment Setup | Yes | We specify a nominal error of ε=0.5, which is a theoretical upper bound for N = 2k/ε iterations, and show that the coreset error remains bounded. [...] For our synthetic data experiments, we used a moderate-size sparse input of (5000 × 1000) to evaluate the relationship between the error ε and the number of iterations of the algorithm N. |
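
The Research Type and Open Datasets rows quote the paper's grand-challenge experiment: Latent Semantic Analysis (LSA) on the term-document matrix of the English Wikipedia. For context, the sketch below computes the standard rank-k LSA baseline with an off-the-shelf sparse truncated SVD on a small synthetic stand-in matrix; it is not the authors' streaming coreset algorithm, and the matrix size, density, and rank k are illustrative assumptions.

```python
# Minimal LSA baseline: truncated SVD of a sparse term-document matrix.
# This is the full-data computation the paper's coreset approximates at scale,
# NOT the authors' streaming coreset construction.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Hypothetical small term-document matrix (documents x terms); in the paper this
# role is played by the English Wikipedia term-document matrix.
A = sp.random(5000, 1000, density=0.01, format="csr", random_state=0)

k = 10                      # number of latent "topics" (rank of the approximation)
U, s, Vt = svds(A, k=k)     # k largest singular triplets of the sparse matrix

# Rank-k reconstruction and its relative Frobenius error.
A_dense = A.toarray()
A_k = (U * s) @ Vt
rel_err = np.linalg.norm(A_dense - A_k, "fro") / np.linalg.norm(A_dense, "fro")
print(f"relative rank-{k} LSA error: {rel_err:.3f}")
```

The point of the coreset approach is to avoid ever forming this decomposition on the full matrix: a small weighted subset of rows is maintained in a streaming fashion so that the rank-k cost on the subset approximates the cost on the full data up to a (1 + ε) factor.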
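The Experiment Setup row specifies ε = 0.5, N = 2k/ε iterations, and a 5000 × 1000 synthetic sparse input. A hedged sketch of that style of evaluation follows: it measures how the rank-k projection cost obtained from a row sample compares with the cost obtained from the full matrix. Uniform row sampling is used purely as a placeholder for the coreset, and k, the density, and the sample sizes are assumptions, so the numbers do not reproduce the paper's figures.

```python
# Sketch: excess rank-k projection cost of a row sample vs. the full matrix.
# Uniform sampling stands in for the paper's deterministic streaming coreset.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

rng = np.random.default_rng(1)
# Synthetic sparse input of the size quoted in the Experiment Setup row.
A = sp.random(5000, 1000, density=0.01, format="csr", random_state=1)
k, eps = 10, 0.5
A_dense = A.toarray()

def projection_cost(Vt):
    """Squared Frobenius cost of projecting the rows of A onto span(Vt)."""
    residual = A_dense - A_dense @ Vt.T @ Vt
    return np.linalg.norm(residual, "fro") ** 2

_, _, Vt_full = svds(A, k=k)
opt = projection_cost(Vt_full)   # cost of the best rank-k subspace of the full data

# Grow the sample; the first size matches the quoted N = 2k/eps bound.
for m in (int(2 * k / eps), 200, 1000, 4000):
    rows = rng.choice(A.shape[0], size=m, replace=False)
    _, _, Vt_s = svds(A[rows], k=k)          # subspace fitted on the sample only
    excess = (projection_cost(Vt_s) - opt) / opt
    print(f"rows sampled: {m:5d}  relative excess cost: {excess:.4f}")
```

As the sample grows, the excess cost should shrink; the paper's contribution is a deterministic construction whose size depends on k and ε but not on the number of rows or columns, with a provable bound on this excess.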