Streaming Algorithms for Support-Aware Histograms

Authors: Justin Chen, Piotr Indyk, Tal Wagner

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental On the empirical side, we analyze the performance of our one-pass and two-pass streaming algorithms on several real-world datasets, demonstrating the practical application of support-aware histograms.
Researcher Affiliation Collaboration 1MIT, Cambridge, MA, USA 2Microsoft Research, Redmond, WA, USA.
Pseudocode Yes Algorithm 1 One-Pass Algorithm Algorithm 2 Two-Pass Algorithm
Open Source Code No The paper does not provide explicit statements or links to open-source code for the described methodology.
Open Datasets Yes The Taxi dataset (NYC Taxi and Limousine Commission, 2021) contains pickup times of yellow cabs in New York City in the month of January 2021.
Dataset Splits No The paper describes the datasets and experimental setup but does not specify the train/validation/test splits used for the experiments.
Hardware Specification No The paper does not provide specific details on the hardware used for running the experiments.
Software Dependencies No The paper mentions specific algorithms and prior work (e.g., Space Saving algorithm, Jowhari et al. for sampling) but does not list specific software libraries or their version numbers used in the implementation.
Experiment Setup Yes For the two-pass algorithm, as all of the streams in our experiments are insertion-only, we use the space-saving based implementation of hierarchical heavy hitters (Mitzenmacher et al., 2012). In the first pass, the heaviness threshold is set to n lg(n)/s, and in the second pass, the L0 samples allowed by the budget s are evenly distributed over all light intervals. All of the baselines and algorithms involve storing sampled elements from the stream along with their masses. The parameter s is varied to compare how the number of samples affects the performance of the algorithms (as shown in the x-axis of the figures). The k parameter determines the number of pieces used by the baselines and by the one-pass algorithm (after removing heavy hitters). We set k = 5 in the experiments in this section and display the same figures with k = 10 in Appendix E.