Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Random measure priors in Bayesian recovery from sketches
Authors: Mario Beraha, Stefano Favaro, Matteo Sesia
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 contains an empirical validation of our methods on synthetic and real data, whereas Section 6 discusses some directions for future work. |
| Researcher Affiliation | Academia | Mario Beraha (Polytechnic University of Milan, Milan, Italy); Stefano Favaro (University of Torino and Collegio Carlo Alberto, Torino, Italy); Matteo Sesia (University of Southern California, Los Angeles, California, United States) |
| Pseudocode | No | The paper describes algorithms and methods conceptually and mathematically, but does not include a dedicated section or figure presenting pseudocode or an algorithm block with structured steps. For example, it refers to 'the count-min sketch (CMS) is a popular algorithm' and 'hyperloglog algorithm' but does not provide pseudocode for its own contributions. |
| Open Source Code | Yes | A software implementation of our methods is available at https://github.com/mberaha/BNPSketching. |
| Open Datasets | Yes | The second data set comprises 18 open-domain classic pieces of English literature from the Gutenberg Corpus (Project Gutenberg, 2022). These data are pre-processed with the same approach of Sesia and Favaro (2022)... The first one was made publicly available by the National Center for Biotechnology Information (Hatcher et al., 2017) and contains 43,196 sequences... The last data set is discussed in Rojas et al. (2018) and contains a list of 3,577,296 IP addresses, which we sketch directly without pre-processing; these data were made publicly available through the Kaggle machine-learning competition website. |
| Dataset Splits | No | The paper primarily discusses the generation of synthetic data with specified parameters (e.g., 'n = 500,000 data points') and the use of full datasets or random subsets for evaluation ('random subsets of the three aforementioned data sets, as a function of the sample size'). It does not specify explicit train/validation/test splits, fixed seed for splitting, or cross-validation strategies for its experimental evaluations. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU/CPU models, memory specifications, or cloud computing instances. |
| Software Dependencies | No | The paper mentions 'standard software packages' in relation to minimizing the loss function but does not specify any software libraries, frameworks, or solvers with their corresponding version numbers required to replicate the experiments. |
| Experiment Setup | Yes | To illustrate the limitations of the DP prior, we conducted two simulations with n = 500,000 data points, simulated either from a DP with parameters θ = 5, 10, 20, 100 or from a Zipf distribution with tail parameters c = 1.18, 1.54, 1.82, 2.22. ... We set n = 50 and J = 10, considering four sketched datasets... Additionally, we assume Xn+1 is mapped into bucket h(i)(Xn+1) such that C(i)h(i)(Xn+1) = 5 for i = 1, 2, 3, 4. We consider a PYP prior with parameter γ = 1 and parameter α = 0, 0.1, 0.3, 0.5... In all the panels, a and b vary, while we fix c = 50, m = 1000, J = 50, θ = 0.3, τ = 1, λ = 1, and α = 0.25, 0.75 for the orange and green lines respectively. |
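The count-min sketch (CMS) referenced in the Pseudocode row is the data structure the paper builds on. As the report notes, the paper does not supply pseudocode, so the following is a minimal illustrative sketch in Python, not the authors' implementation (which lives at the linked BNPSketching repository); the class name, row/bucket parameters, and salted-hash scheme are all hypothetical choices for exposition.

```python
import random


class CountMinSketch:
    """Minimal count-min sketch: one salted hash per row, J buckets per row.

    add(x) increments one counter in each row; query(x) takes the minimum
    over rows, which always upper-bounds the true frequency of x.
    """

    def __init__(self, rows=4, buckets=1000, seed=0):
        rng = random.Random(seed)
        # Independent salt per row stands in for a pairwise-independent hash family.
        self.salts = [rng.getrandbits(32) for _ in range(rows)]
        self.buckets = buckets
        self.table = [[0] * buckets for _ in range(rows)]

    def _bucket(self, i, x):
        # Bucket of item x in row i (hypothetical salted-hash construction).
        return hash((self.salts[i], x)) % self.buckets

    def add(self, x):
        for i in range(len(self.table)):
            self.table[i][self._bucket(i, x)] += 1

    def query(self, x):
        # CMS never undercounts: collisions only inflate counters.
        return min(self.table[i][self._bucket(i, x)] for i in range(len(self.table)))
```

The overestimation property (`query(x)` is at least the true count of `x`) is what makes the Bayesian recovery problem studied in the paper nontrivial: the sketch compresses counts lossily, and a prior is used to invert it.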
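The Experiment Setup row describes simulating data either from a Dirichlet process (DP) with concentration θ or from a Zipf distribution with tail parameter c. A minimal sketch of that kind of data generation is below, assuming a Chinese-restaurant-process draw for the DP and NumPy's Zipf sampler; the function name and the reduced sample size (2,000 instead of the paper's 500,000, to keep the pure-Python loop fast) are illustrative choices, not the authors' code.

```python
import numpy as np


def sample_dp_crp(n, theta, rng):
    """Draw n cluster labels from a DP(theta) via the Chinese restaurant process."""
    counts = []                      # counts[k] = size of cluster k so far
    labels = np.empty(n, dtype=int)
    for i in range(n):
        # New cluster with probability theta/(theta + i);
        # otherwise join an existing cluster proportionally to its size.
        probs = np.array(counts + [theta], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels[i] = k
    return labels


rng = np.random.default_rng(0)
dp_data = sample_dp_crp(2000, theta=10, rng=rng)   # DP draw, e.g. theta = 10
zipf_data = rng.zipf(1.54, size=2000)              # Zipf tail, e.g. c = 1.54
```

The qualitative contrast the paper exploits is visible even at this scale: the DP draw produces a number of distinct values growing only logarithmically in n, while the Zipf sample has power-law frequencies.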