Leveraging Sparsity for Efficient Submodular Data Summarization
Authors: Erik Lindgren, Shanshan Wu, Alexandros G. Dimakis
NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose the use of Locality Sensitive Hashing (LSH) and random walk methods to accelerate approximate nearest neighbor computations. Specifically, we use two types of similarity metrics: inner products and personalized PageRank (PPR). We propose the use of fast approximations for these metrics and empirically show that they dramatically improve running times. We validate our approach by demonstrating that it rapidly generates interpretable summaries. |
| Researcher Affiliation | Academia | Erik M. Lindgren, Shanshan Wu, Alexandros G. Dimakis The University of Texas at Austin Department of Electrical and Computer Engineering erikml@utexas.edu, shanshan@utexas.edu, dimakis@austin.utexas.edu |
| Pseudocode | Yes | See Algorithm 1 in the Appendix for pseudocode. |
| Open Source Code | No | The paper mentions using third-party libraries such as "MLlib library in Apache Spark [29]" and "LSH in the FALCONN library for cosine similarity [3]", but it does not provide an explicit statement or a link to the open-source code for the specific methodology or algorithms developed by the authors in this paper. |
| Open Datasets | Yes | We create our feature vectors from the MovieLens ratings data [16]. The MovieLens database has 20 million ratings for 27,000 movies from 138,000 users. Data was obtained from [19], and an actor or actress was only included if he or she was one of the top six in the cast billing. |
| Dataset Splits | No | The paper uses the Movie Lens ratings data and IMDb data but does not explicitly specify the training, validation, or test dataset splits (e.g., exact percentages, absolute sample counts, or references to predefined splits) needed to reproduce the data partitioning for its experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU or GPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the "MLlib library in Apache Spark [29]" and "LSH in the FALCONN library for cosine similarity [3]" but does not provide specific version numbers for these software components, which are crucial for reproducibility. |
| Experiment Setup | Yes | The number of elements chosen was set to 40, and for the LSH method and stochastic greedy we averaged over five trials. |
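The paper speeds up submodular summarization by replacing exact nearest-neighbor searches with LSH lookups (via FALCONN for cosine similarity). As a rough illustration of that idea, here is a minimal random-hyperplane (SimHash) sketch in Python; all function names and parameters below are illustrative, not the authors' implementation or the FALCONN API.

```python
import numpy as np

def build_lsh_table(points, n_bits=8, seed=0):
    """Hash each point to a bucket keyed by the signs of random projections.

    Points whose angle is small tend to fall on the same side of the random
    hyperplanes, so they collide into the same bucket with high probability.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, points.shape[1]))
    signs = points @ planes.T > 0          # (n_points, n_bits) boolean pattern
    table = {}
    for idx, row in enumerate(signs):
        table.setdefault(row.tobytes(), []).append(idx)
    return planes, table

def query(q, points, planes, table):
    """Return the best cosine match among candidates in q's bucket (or None)."""
    key = (planes @ q > 0).tobytes()
    candidates = table.get(key, [])
    if not candidates:
        return None
    cand = points[candidates]
    sims = cand @ q / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q))
    return candidates[int(np.argmax(sims))]

rng = np.random.default_rng(1)
points = rng.standard_normal((1000, 32))
planes, table = build_lsh_table(points)
# A stored point lands in its own bucket, so querying with it returns itself.
print(query(points[0], points, planes, table))  # → 0
```

Only the points sharing a bucket are scored exactly, which is what makes the lookup sublinear in practice; FALCONN refines this basic scheme with multi-probe hashing and tuned hash families.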