Deletion-Robust Submodular Maximization: Data Summarization with “the Right to be Forgotten”
Authors: Baharan Mirzasoleiman, Amin Karbasi, Andreas Krause
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the effectiveness of our approach on several real-world applications, including summarizing (1) streams of geo-coordinates; (2) streams of images; and (3) clickstream log data, consisting of 45 million feature vectors from a news recommendation task. |
| Researcher Affiliation | Academia | 1ETH Zurich, Switzerland 2Yale University, New Haven, USA. |
| Pseudocode | Yes | Algorithm 1 ROBUST-STREAMING |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating that its source code is publicly available. |
| Open Datasets | Yes | We first apply ROBUST-STREAMING to a collection of 100 images from Tschiatschek et al. (2014). Our dataset consists of 3,607 geolocations, collected during a one-hour bike ride around Zurich (Fatio, 2015). We used the Yahoo! Webscope data set containing 45,811,883 user click logs for news articles displayed in the Featured Tab of the Today Module on the Yahoo! Front Page during the first ten days in May 2009 (Yahoo, 2012). |
| Dataset Splits | Yes | We considered the first 80% of the data (for the first 8 days) as our training set, and the last 20% (for the last 2 days) as our test set. |
| Hardware Specification | Yes | Due to the massive size of the dataset, we used Spark on a cluster of 15 quad-core machines with 32GB of memory each. |
| Software Dependencies | No | The paper mentions using 'Vowpal-Wabbit' and 'Spark' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Fig. 2a compares the performance of SIEVE-STREAMING with its robust version ROBUST-STREAMING for r = 3 and solution size k = 5. We fix k = 20 and r = 5. We considered the first 80% of the data (for the first 8 days) as our training set, and the last 20% (for the last 2 days) as our test set. Since only 4% of the data points are clicked, we assign a weight of 10 to each clicked vector. We fix k = 10,000 and r = 2. |
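The experiment-setup row describes a chronological 80/20 split and a weight of 10 for the rare (~4%) clicked vectors. A minimal sketch of that preprocessing, with illustrative record and function names not taken from the paper:

```python
# Hypothetical sketch of the chronological split and click weighting
# quoted in the table above. Record layout and names are assumptions;
# only the 80/20 split and the weight of 10 come from the paper.

def chronological_split(records, train_frac=0.8):
    """Split time-ordered records: first 80% train, last 20% test."""
    cut = int(len(records) * train_frac)
    return records[:cut], records[cut:]

def click_weight(clicked, weight=10.0):
    """Up-weight clicked vectors, which are rare (~4% of the data)."""
    return weight if clicked else 1.0

# Toy stand-in for the click-log stream (1 click per 25 records ~= 4%).
records = [{"t": i, "clicked": (i % 25 == 0)} for i in range(100)]
train, test = chronological_split(records)
weights = [click_weight(r["clicked"]) for r in train]
```

Splitting by time rather than at random matches the paper's setup, since the training and test sets cover the first 8 and last 2 days respectively.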