Conformal Frequency Estimation with Sketched Data
Authors: Matteo Sesia, Stefano Favaro
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The performance is compared to that of frequentist and Bayesian alternatives through simulations and experiments with data sets of SARS-Co V-2 DNA sequences and classic English literature. |
| Researcher Affiliation | Academia | Matteo Sesia Department of Data Sciences and Operations University of Southern California Los Angeles, California, USA sesia@marshall.usc.edu Stefano Favaro Department of Economics and Statistics University of Torino and Collegio Carlo Alberto Torino, Italy stefano.favaro@unito.it |
| Pseudocode | Yes | This procedure is outlined in Algorithm A1 (Appendix A1) |
| Open Source Code | Yes | Accompanying software and data are available online at https://github.com/msesia/conformalized-sketching. |
| Open Datasets | Yes | Experiments are performed on synthetic data sampled from two families of distributions. This application involves a data set of nucleotide sequences from SARS-Co V-2 viruses made publicly available by the National Center for Biotechnology Information [43]. This application is based on a data set consisting of 18 open-domain classic pieces of English literature downloaded using the NLTK Python package [45] from the Gutenberg Corpus [46]. |
| Dataset Splits | Yes | The simplest version of conformal prediction begins by randomly splitting the available observations into two disjoint subsets, assumed for simplicity to have equal size n = m/2. The first m0 = 5000 observations are stored without loss during the warm-up phase, as outlined in Algorithm A3, while the remaining 95, 000 are compressed by the CMS-CU. |
| Hardware Specification | No | Experiments were carried out in parallel using a computing cluster; each experiment required less than a few hours with a standard CPU and less than 5GB of memory (20 GB of memory are needed for the analysis of the SARS-Co V-2 DNA data). |
| Software Dependencies | No | The paper mentions the NLTK Python package but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | In particular, m = 100, 000 observations are sampled i.i.d. from some distribution specified below. The first m0 = 5000 observations are stored without loss during the warm-up phase, as outlined in Algorithm A3, while the remaining 95, 000 are compressed by the CMS-CU. The conformity scores are evaluated separately within L = 5 frequency bins, seeking the frequency-range conditional coverage property defined in (8). The bins are determined in a data-driven fashion so that each contains approximately the same probability mass; in practice, this is achieved by partitioning the range of frequencies for the objects tracked exactly by Algorithm A3 according to the observed empirical quantiles. Lower bounds for new queries are computed for 10, 000 data points also sampled i.i.d. from the same distribution. The quality of these bounds is quantified with two metrics: the mean length of the resulting confidence intervals and the coverage the proportion of queries for which the true frequency is correctly covered, or empirical coverage. The performance is averaged over 10 independent experiments. |