Conformal Frequency Estimation with Sketched Data

Authors: Matteo Sesia, Stefano Favaro

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The performance is compared to that of frequentist and Bayesian alternatives through simulations and experiments with data sets of SARS-CoV-2 DNA sequences and classic English literature.
Researcher Affiliation | Academia | Matteo Sesia, Department of Data Sciences and Operations, University of Southern California, Los Angeles, California, USA (sesia@marshall.usc.edu); Stefano Favaro, Department of Economics and Statistics, University of Torino and Collegio Carlo Alberto, Torino, Italy (stefano.favaro@unito.it)
Pseudocode | Yes | This procedure is outlined in Algorithm A1 (Appendix A1).
Open Source Code | Yes | Accompanying software and data are available online at https://github.com/msesia/conformalized-sketching.
Open Datasets | Yes | Experiments are performed on synthetic data sampled from two families of distributions. One application involves a data set of nucleotide sequences from SARS-CoV-2 viruses made publicly available by the National Center for Biotechnology Information [43]. A second application is based on a data set consisting of 18 open-domain classic pieces of English literature downloaded using the NLTK Python package [45] from the Gutenberg Corpus [46].
Dataset Splits | Yes | The simplest version of conformal prediction begins by randomly splitting the available observations into two disjoint subsets, assumed for simplicity to have equal size n = m/2. The first m0 = 5000 observations are stored without loss during the warm-up phase, as outlined in Algorithm A3, while the remaining 95,000 are compressed by the CMS-CU.
Hardware Specification | No | Experiments were carried out in parallel using a computing cluster; each experiment required less than a few hours with a standard CPU and less than 5 GB of memory (20 GB of memory are needed for the analysis of the SARS-CoV-2 DNA data).
Software Dependencies | No | The paper mentions the NLTK Python package but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | In particular, m = 100,000 observations are sampled i.i.d. from some distribution specified below. The first m0 = 5000 observations are stored without loss during the warm-up phase, as outlined in Algorithm A3, while the remaining 95,000 are compressed by the CMS-CU. The conformity scores are evaluated separately within L = 5 frequency bins, seeking the frequency-range conditional coverage property defined in (8). The bins are determined in a data-driven fashion so that each contains approximately the same probability mass; in practice, this is achieved by partitioning the range of frequencies for the objects tracked exactly by Algorithm A3 according to the observed empirical quantiles. Lower bounds for new queries are computed for 10,000 data points also sampled i.i.d. from the same distribution. The quality of these bounds is quantified with two metrics: the mean length of the resulting confidence intervals and the coverage, i.e., the proportion of queries for which the true frequency is correctly covered (empirical coverage). The performance is averaged over 10 independent experiments.
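For context, the CMS-CU referenced in the splits and setup rows is a count-min sketch with conservative updates: each item hashes to one counter per row, and only the counters sitting at the row-wise minimum are incremented, which curbs overestimation relative to the plain count-min sketch. A minimal standard-library sketch of this data structure (the hash construction and parameters here are illustrative assumptions, not the authors' implementation):

```python
import random

class CountMinCU:
    """Count-min sketch with conservative updates (CMS-CU), for illustration."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        # Per-row salts standing in for independent hash functions.
        self.salts = [rng.getrandbits(31) for _ in range(depth)]

    def _cols(self, item):
        # One counter index per row for the given item.
        return [hash((s, item)) % self.width for s in self.salts]

    def update(self, item):
        cols = self._cols(item)
        cmin = min(self.table[r][c] for r, c in enumerate(cols))
        # Conservative update: bump only counters at the row-wise minimum.
        for r, c in enumerate(cols):
            if self.table[r][c] == cmin:
                self.table[r][c] += 1

    def query(self, item):
        # Deterministic upper bound on the item's true frequency.
        return min(self.table[r][c] for r, c in enumerate(self._cols(item)))
```

A query never underestimates the true count; conformal calibration, as studied in the paper, supplies the matching lower bound.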
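The setup row describes quantile-based frequency bins and two evaluation metrics (empirical coverage and mean interval length). A brief standard-library sketch of how these quantities could be computed; the function names and the simple quantile rule are illustrative assumptions, not taken from the authors' package:

```python
from statistics import mean

def empirical_coverage(lower_bounds, true_freqs):
    # Fraction of queries whose true frequency is at least the lower bound.
    return mean(1.0 if t >= l else 0.0 for l, t in zip(lower_bounds, true_freqs))

def mean_interval_length(lower_bounds, upper_bounds):
    # Average width of the confidence intervals [lower, upper].
    return mean(u - l for l, u in zip(lower_bounds, upper_bounds))

def quantile_bin_edges(freqs, L=5):
    # Data-driven bin edges from empirical quantiles, so each of the
    # L bins holds roughly the same share of the tracked frequencies.
    xs = sorted(freqs)
    n = len(xs)
    return [xs[min(round(k * (n - 1) / L), n - 1)] for k in range(L + 1)]
```

For example, lower bounds (1, 2, 3) against true frequencies (1, 1, 5) give empirical coverage 2/3, since the second query falls below its bound.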