Sampling Sketches for Concave Sublinear Functions of Frequencies

Authors: Edith Cohen, Ofir Geri

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We complement our work with a small-scale experimental study. We use a simple implementation of our sampling sketch to study the actual performance in terms of estimate quality and sketch size. In particular, we show that the estimate quality is even better than the (already adequate) guarantees provided by our worst-case bounds. We additionally compare the estimate quality to that of two popular sampling schemes for aggregated data, PPSWOR [32, 33] and priority (sequential Poisson) sampling [28, 13]. In the experiments, we see that the estimate quality of our sketch is close to what achieved by PPSWOR and priority sampling, while our sketch uses much less space by eliminating the need for aggregation. This paper presents our sketch structures and states our results. The full version (including proofs and additional details) can be found in the supplementary material. ... We implemented our sampling sketch in Python and report here the results of experiments on real and synthetic datasets. ... Tables 1 reports aggregated results of 200 repetitions where we used the final sample to estimate the sum π‘₯ 𝒳𝑓(𝜈π‘₯).
Researcher Affiliation Collaboration Edith Cohen Google Research, CA Tel Aviv University, Israel edith@cohenwang.com Ofir Geri Stanford University, CA ofirgeri@cs.stanford.edu
Pseudocode Yes Algorithm 1: Bottom-π‘˜Sketch Structure; Algorithm 2: PPSWOR Sampling Sketch; Algorithm 3: Sampling Sketch Structure for 𝑓; Algorithm 4: Produce a Final Sample from a Sampling Sketch Structure (Algorithm 3); Algorithm 5: Sum Max sampling sketch
Open Source Code No The paper states "We implemented our sampling sketch in Python" but does not provide any concrete access information (e.g., a link to a repository) for the source code.
Open Datasets Yes We used the following datasets: abcnews [23]: News headlines. For each word, we created an element with value 1. flicker [31]: Tags used by Flickr users to annotate images. ... [23] R. Kulkarni. A million news headlines [csv data file]. https://www.kaggle.com/therohk/ million-headlines/home, 2017. [31] A. Plangprasopchok, K. Lerman, and L. Getoor. Growing a tree in the forest: Constructing folksonomies by integrating structured metadata. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 10, pages 949 958, 2010.
Dataset Splits No The paper does not explicitly provide details about training, validation, or test dataset splits. It describes processing elements in a stream and performing 200 repetitions for estimating sums.
Hardware Specification No The paper mentions that "The computing for this project was performed on the Sherlock cluster" and acknowledges "Stanford University and the Stanford Research Computing Center" but does not provide specific hardware details such as CPU or GPU models.
Software Dependencies No The paper states "We implemented our sampling sketch in Python" but does not provide specific version numbers for Python or any other software libraries or dependencies.
Experiment Setup Yes We applied our sampling sketch with sample size parameter values π‘˜ {25, 50, 75, 100} and set the parameter πœ€= 0.5 in all experiments. We sampled according to two concave sublinear functions: the frequency moment 𝑓(𝜈) = 𝜈0.5 and 𝑓(𝜈) = ln(1 + 𝜈). Tables 1 reports aggregated results of 200 repetitions where we used the final sample to estimate the sum π‘₯ 𝒳𝑓(𝜈π‘₯).