Conformal Frequency Estimation with Sketched Data

Authors: Matteo Sesia, Stefano Favaro

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The performance is compared to that of frequentist and Bayesian alternatives through simulations and experiments with data sets of SARS-CoV-2 DNA sequences and classic English literature.
Researcher Affiliation | Academia | Matteo Sesia, Department of Data Sciences and Operations, University of Southern California, Los Angeles, California, USA (sesia@marshall.usc.edu); Stefano Favaro, Department of Economics and Statistics, University of Torino and Collegio Carlo Alberto, Torino, Italy (stefano.favaro@unito.it)
Pseudocode | Yes | This procedure is outlined in Algorithm A1 (Appendix A1).
Open Source Code | Yes | Accompanying software and data are available online at https://github.com/msesia/conformalized-sketching.
Open Datasets | Yes | Experiments are performed on synthetic data sampled from two families of distributions. One application involves a data set of nucleotide sequences from SARS-CoV-2 viruses made publicly available by the National Center for Biotechnology Information [43]. A second application is based on a data set consisting of 18 open-domain classic pieces of English literature downloaded using the NLTK Python package [45] from the Gutenberg Corpus [46].
Dataset Splits | Yes | The simplest version of conformal prediction begins by randomly splitting the available observations into two disjoint subsets, assumed for simplicity to have equal size n = m/2. The first m0 = 5000 observations are stored without loss during the warm-up phase, as outlined in Algorithm A3, while the remaining 95,000 are compressed by the CMS-CU.
Hardware Specification | No | Experiments were carried out in parallel using a computing cluster; each experiment required less than a few hours with a standard CPU and less than 5 GB of memory (20 GB of memory are needed for the analysis of the SARS-CoV-2 DNA data).
Software Dependencies | No | The paper mentions the NLTK Python package but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | In particular, m = 100,000 observations are sampled i.i.d. from some distribution specified below. The first m0 = 5000 observations are stored without loss during the warm-up phase, as outlined in Algorithm A3, while the remaining 95,000 are compressed by the CMS-CU. The conformity scores are evaluated separately within L = 5 frequency bins, seeking the frequency-range conditional coverage property defined in (8). The bins are determined in a data-driven fashion so that each contains approximately the same probability mass; in practice, this is achieved by partitioning the range of frequencies for the objects tracked exactly by Algorithm A3 according to the observed empirical quantiles. Lower bounds for new queries are computed for 10,000 data points also sampled i.i.d. from the same distribution. The quality of these bounds is quantified with two metrics: the mean length of the resulting confidence intervals and the coverage, i.e., the proportion of queries for which the true frequency is correctly covered (empirical coverage). The performance is averaged over 10 independent experiments.
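For context, the CMS-CU referenced in the splits and setup rows is a count-min sketch with conservative updates: each item hashes to one counter per row, and only the counters sitting at the row-wise minimum are incremented, which curbs overestimation relative to the plain count-min sketch. A minimal standard-library sketch of this data structure (the hash construction and parameters here are illustrative assumptions, not the authors' implementation):

```python
import random

class CountMinCU:
    """Count-min sketch with conservative updates (CMS-CU), for illustration."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        # Per-row salts standing in for independent hash functions.
        self.salts = [rng.getrandbits(31) for _ in range(depth)]

    def _cols(self, item):
        # One counter index per row for the given item.
        return [hash((s, item)) % self.width for s in self.salts]

    def update(self, item):
        cols = self._cols(item)
        cmin = min(self.table[r][c] for r, c in enumerate(cols))
        # Conservative update: bump only counters at the row-wise minimum.
        for r, c in enumerate(cols):
            if self.table[r][c] == cmin:
                self.table[r][c] += 1

    def query(self, item):
        # Deterministic upper bound on the item's true frequency.
        return min(self.table[r][c] for r, c in enumerate(self._cols(item)))
```

A query never underestimates the true count; conformal calibration, as studied in the paper, supplies the matching lower bound.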
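The setup row describes quantile-based frequency bins and two evaluation metrics (empirical coverage and mean interval length). A brief standard-library sketch of how these quantities could be computed; the function names and the simple quantile rule are illustrative assumptions, not taken from the authors' package:

```python
from statistics import mean

def empirical_coverage(lower_bounds, true_freqs):
    # Fraction of queries whose true frequency is at least the lower bound.
    return mean(1.0 if t >= l else 0.0 for l, t in zip(lower_bounds, true_freqs))

def mean_interval_length(lower_bounds, upper_bounds):
    # Average width of the confidence intervals [lower, upper].
    return mean(u - l for l, u in zip(lower_bounds, upper_bounds))

def quantile_bin_edges(freqs, L=5):
    # Data-driven bin edges from empirical quantiles, so each of the
    # L bins holds roughly the same share of the tracked frequencies.
    xs = sorted(freqs)
    n = len(xs)
    return [xs[min(round(k * (n - 1) / L), n - 1)] for k in range(L + 1)]
```

For example, lower bounds (1, 2, 3) against true frequencies (1, 1, 5) give empirical coverage 2/3, since the second query falls below its bound.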