Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Sketch-Flip-Merge: Mergeable Sketches for Private Distinct Counting
Authors: Jonathan Hehir, Daniel Ting, Graham Cormode
ICML 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our sketches dramatically outperform existing theoretical solutions in simulations and on real-world data. We evaluate our methods on both real-world and synthetic datasets. The simulations use sketches with dimensions B = 4096, P = 24 by default, using the xx Hash64 (Collet, 2022) hash function, averaged over m = 1000 trials. |
| Researcher Affiliation | Collaboration | Jonathan Hehir 1 2 Daniel Ting 3 Graham Cormode 3 4 1Department of Statistics, Penn State University, State College, PA, USA 2Work performed while at Meta. 3Meta, Seattle, WA, USA 4University of Warwick, Coventry, UK. Correspondence to: Daniel Ting <EMAIL>. |
| Pseudocode | No | The paper presents mathematical formulas and theoretical analyses, including proofs in the appendices, but it does not contain a distinct block of pseudocode or a formally labeled algorithm. |
| Open Source Code | No | The paper does not provide any statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Real data is taken from the Bitcoin Heist paper (Akcora et al., 2020), which provides a database of Bitcoin transactions to n = 2,631,095 unique addresses. The dataset is available in the UCI repository under the CC BY 4.0 license. |
| Dataset Splits | No | The paper mentions the use of synthetic and real-world datasets for evaluation, but it does not specify any training, validation, or test split percentages or counts, nor does it reference predefined splits. |
| Hardware Specification | No | The paper does not specify the particular hardware components (e.g., CPU, GPU models) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'using the xx Hash64 (Collet, 2022) hash function' but does not provide version numbers for this or any other software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | The simulations use sketches with dimensions B = 4096, P = 24 by default, using the xx Hash64 (Collet, 2022) hash function, averaged over m = 1000 trials. We compare these methods against the sketch and estimator of Pagh & Stausholm (2021) implemented two ways: PS (loose) constructs sketches using MPS ε = Mp,q with p = 1/2, q = 1/(2 + ε) as prescribed in Pagh & Stausholm (2021), while PS (tight) uses the tightened Mxor ε (Definition 4.2). We compare private methods against the real-world Bitcoin Heist data over a variety of privacy budgets ε, ranging from 0.25 to 4. |