Optimal Algorithms for Augmented Testing of Discrete Distributions

Authors: Maryam Aliakbarpour, Piotr Indyk, Ronitt Rubinfeld, Sandeep Silwal

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Furthermore, experimental results show that the performance of our algorithms on real data significantly exceeds our worst-case guarantees for sample complexity, demonstrating the practicality of our approach. We empirically evaluate our augmented closeness testing algorithm on synthetic and real distributions and refer all details to Section E.
Researcher Affiliation Academia Maryam Aliakbarpour Rice University Houston, TX maryama@rice.edu Piotr Indyk MIT Cambridge, MA indyk@mit.edu Ronitt Rubinfeld MIT Cambridge, MA ronitt@csail.mit.edu Sandeep Silwal UW-Madison Madison, WI silwal@cs.wisc.edu
Pseudocode Yes Algorithm 1 An augmented tester for testing closeness of p and q with a suggested value α
Open Source Code Yes We include a link to the CAIDA dataset and a Jupyter notebook is attached to the submission.
Open Datasets Yes The data is internet traffic data collected at a backbone link of a Tier1 ISP in a data center9. Using the preprocessing in [1], we construct the empirical distribution over source IP addresses in consecutive chunks of time. ... 9From CAIDA internet traces 2019, see https://www.caida.org/catalog/datasets/monitors/ passive-equinix-nyc
Dataset Splits No The paper describes using 'synthetic and real distributions' and sets 'p' and 'q' based on time chunks of the IP traffic data, but it does not specify explicit training, validation, and test splits with percentages or sample counts.
Hardware Specification No The paper does not specify any particular hardware, such as GPU or CPU models, or cloud computing instances, used for running the experiments.
Software Dependencies No The paper does not list specific software dependencies with version numbers (e.g., Python version, library versions) in its main text or provided excerpt.
Experiment Setup Yes We set n, the domain size, to n = 10^6 and our goal is to determine if we are sampling from (p, p) or (p, q). ... We set p to be distribution for the first chunk of time and we let q be the distribution for the last chunk of time. ... In our experiments, given a sample budget, we compute the empirical distribution of Z for both algorithms under both hypothesis cases across 100 trials.