Deletion-Robust Submodular Maximization: Data Summarization with “the Right to be Forgotten”
Authors: Baharan Mirzasoleiman, Amin Karbasi, Andreas Krause
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the effectiveness of our approach on several real-world applications, including summarizing (1) streams of geo-coordinates; (2) streams of images; and (3) clickstream log data, consisting of 45 million feature vectors from a news recommendation task. |
| Researcher Affiliation | Academia | 1ETH Zurich, Switzerland 2Yale University, New Haven, USA. |
| Pseudocode | Yes | Algorithm 1 ROBUST-STREAMING |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating that its source code is publicly available. |
| Open Datasets | Yes | We first apply ROBUST-STREAMING to a collection of 100 images from Tschiatschek et al. (2014). Our dataset consists of 3,607 geolocations, collected during a one-hour bike ride around Zurich (Fatio, 2015). We used the Yahoo! Webscope data set containing 45,811,883 user click logs for news articles displayed in the Featured Tab of the Today Module on the Yahoo! Front Page during the first ten days in May 2009 (Yahoo, 2012). |
| Dataset Splits | Yes | We considered the first 80% of the data (for the first 8 days) as our training set, and the last 20% (for the last 2 days) as our test set. |
| Hardware Specification | Yes | Due to the massive size of the dataset, we used Spark on a cluster of 15 quad-core machines with 32GB of memory each. |
| Software Dependencies | No | The paper mentions using 'Vowpal-Wabbit' and 'Spark' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Fig. 2a compares the performance of SIEVE-STREAMING with its robust version ROBUST-STREAMING for r = 3 and solution size k = 5. We fix k = 20 and r = 5. We considered the first 80% of the data (for the first 8 days) as our training set, and the last 20% (for the last 2 days) as our test set. Since only 4% of the data points are clicked, we assign a weight of 10 to each clicked vector. We fix k = 10,000 and r = 2. |
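The experiment-setup row describes a chronological 80/20 split and a weight of 10 for the rare (~4%) clicked vectors. A minimal sketch of that preprocessing, with illustrative record and function names not taken from the paper:

```python
# Hypothetical sketch of the chronological split and click weighting
# quoted in the table above. Record layout and names are assumptions;
# only the 80/20 split and the weight of 10 come from the paper.

def chronological_split(records, train_frac=0.8):
    """Split time-ordered records: first 80% train, last 20% test."""
    cut = int(len(records) * train_frac)
    return records[:cut], records[cut:]

def click_weight(clicked, weight=10.0):
    """Up-weight clicked vectors, which are rare (~4% of the data)."""
    return weight if clicked else 1.0

# Toy stand-in for the click-log stream (1 click per 25 records ~= 4%).
records = [{"t": i, "clicked": (i % 25 == 0)} for i in range(100)]
train, test = chronological_split(records)
weights = [click_weight(r["clicked"]) for r in train]
```

Splitting by time rather than at random matches the paper's setup, since the training and test sets cover the first 8 and last 2 days respectively.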