Leveraging Sparsity for Efficient Submodular Data Summarization
Authors: Erik Lindgren, Shanshan Wu, Alexandros G. Dimakis
NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose the use of Locality Sensitive Hashing (LSH) and random walk methods to accelerate approximate nearest neighbor computations. Specifically, we use two types of similarity metrics: inner products and personalized PageRank (PPR). We propose the use of fast approximations for these metrics and empirically show that they dramatically improve running times. We validate our approach by demonstrating that it rapidly generates interpretable summaries. |
| Researcher Affiliation | Academia | Erik M. Lindgren, Shanshan Wu, Alexandros G. Dimakis The University of Texas at Austin Department of Electrical and Computer Engineering erikml@utexas.edu, shanshan@utexas.edu, dimakis@austin.utexas.edu |
| Pseudocode | Yes | See Algorithm 1 in the Appendix for pseudocode. |
| Open Source Code | No | The paper mentions using third-party libraries such as "MLlib library in Apache Spark [29]" and "LSH in the FALCONN library for cosine similarity [3]", but it does not provide an explicit statement or a link to the open-source code for the specific methodology or algorithms developed by the authors in this paper. |
| Open Datasets | Yes | We create our feature vectors from the MovieLens ratings data [16]. The MovieLens database has 20 million ratings for 27,000 movies from 138,000 users. Data was obtained from [19], and an actor or actress was only included if he or she was one of the top six in the cast billing. |
| Dataset Splits | No | The paper uses the Movie Lens ratings data and IMDb data but does not explicitly specify the training, validation, or test dataset splits (e.g., exact percentages, absolute sample counts, or references to predefined splits) needed to reproduce the data partitioning for its experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU or GPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the "MLlib library in Apache Spark [29]" and "LSH in the FALCONN library for cosine similarity [3]" but does not provide specific version numbers for these software components, which are crucial for reproducibility. |
| Experiment Setup | Yes | The number of elements chosen was set to 40, and for the LSH method and stochastic greedy we averaged over five trials. |
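The paper speeds up submodular summarization by replacing exact nearest-neighbor searches with LSH lookups (via FALCONN for cosine similarity). As a rough illustration of that idea, here is a minimal random-hyperplane (SimHash) sketch in Python; all function names and parameters below are illustrative, not the authors' implementation or the FALCONN API.

```python
import numpy as np

def build_lsh_table(points, n_bits=8, seed=0):
    """Hash each point to a bucket keyed by the signs of random projections.

    Points whose angle is small tend to fall on the same side of the random
    hyperplanes, so they collide into the same bucket with high probability.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, points.shape[1]))
    signs = points @ planes.T > 0          # (n_points, n_bits) boolean pattern
    table = {}
    for idx, row in enumerate(signs):
        table.setdefault(row.tobytes(), []).append(idx)
    return planes, table

def query(q, points, planes, table):
    """Return the best cosine match among candidates in q's bucket (or None)."""
    key = (planes @ q > 0).tobytes()
    candidates = table.get(key, [])
    if not candidates:
        return None
    cand = points[candidates]
    sims = cand @ q / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q))
    return candidates[int(np.argmax(sims))]

rng = np.random.default_rng(1)
points = rng.standard_normal((1000, 32))
planes, table = build_lsh_table(points)
# A stored point lands in its own bucket, so querying with it returns itself.
print(query(points[0], points, planes, table))  # → 0
```

Only the points sharing a bucket are scored exactly, which is what makes the lookup sublinear in practice; FALCONN refines this basic scheme with multi-probe hashing and tuned hash families.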