Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Random measure priors in Bayesian recovery from sketches
Authors: Mario Beraha, Stefano Favaro, Matteo Sesia
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 contains an empirical validation of our methods on synthetic and real data, whereas Section 6 discusses some directions for future work. |
| Researcher Affiliation | Academia | Mario Beraha (Polytechnic University of Milan, Milan, Italy); Stefano Favaro (University of Torino and Collegio Carlo Alberto, Torino, Italy); Matteo Sesia (University of Southern California, Los Angeles, California, United States) |
| Pseudocode | No | The paper describes algorithms and methods conceptually and mathematically, but does not include a dedicated section or figure presenting pseudocode or an algorithm block with structured steps. For example, it refers to 'the count-min sketch (CMS) is a popular algorithm' and 'hyperloglog algorithm' but does not provide pseudocode for its own contributions. |
| Open Source Code | Yes | A software implementation of our methods is available at https://github.com/mberaha/BNPSketching. |
| Open Datasets | Yes | The second data set comprises 18 open-domain classic pieces of English literature from the Gutenberg Corpus (Project Gutenberg, 2022). These data are pre-processed with the same approach of Sesia and Favaro (2022)... The first one was made publicly available by the National Center for Biotechnology Information (Hatcher et al., 2017) and contains 43,196 sequences... The last data set is discussed in Rojas et al. (2018) and contains a list of 3,577,296 IP addresses, which we sketch directly without pre-processing; these data were made publicly available through the Kaggle machine-learning competition website. |
| Dataset Splits | No | The paper primarily discusses the generation of synthetic data with specified parameters (e.g., 'n = 500,000 data points') and the use of full datasets or random subsets for evaluation ('random subsets of the three aforementioned data sets, as a function of the sample size'). It does not specify explicit train/validation/test splits, fixed seed for splitting, or cross-validation strategies for its experimental evaluations. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU/CPU models, memory specifications, or cloud computing instances. |
| Software Dependencies | No | The paper mentions 'standard software packages' in relation to minimizing the loss function but does not specify any software libraries, frameworks, or solvers with their corresponding version numbers required to replicate the experiments. |
| Experiment Setup | Yes | To illustrate the limitations of the DP prior, we conducted two simulations with n = 500,000 data points, simulated either from a DP with parameters θ = 5, 10, 20, 100 or from a Zipf distribution with tail parameters c = 1.18, 1.54, 1.82, 2.22. ... We set n = 50 and J = 10, considering four sketched datasets... Additionally, we assume Xn+1 is mapped into bucket h(i)(Xn+1) such that C(i)h(i)(Xn+1) = 5 for i = 1, 2, 3, 4. We consider a PYP prior with parameter γ = 1 and parameter α = 0, 0.1, 0.3, 0.5... In all the panels, a and b vary, while we fix c = 50, m = 1000, J = 50, θ = 0.3, τ = 1, λ = 1, and α = 0.25, 0.75 for the orange and green lines respectively. |
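The count-min sketch (CMS) referenced in the Pseudocode row is the data structure the paper builds on. As the report notes, the paper does not supply pseudocode, so the following is a minimal illustrative sketch in Python, not the authors' implementation (which lives at the linked BNPSketching repository); the class name, row/bucket parameters, and salted-hash scheme are all hypothetical choices for exposition.

```python
import random


class CountMinSketch:
    """Minimal count-min sketch: one salted hash per row, J buckets per row.

    add(x) increments one counter in each row; query(x) takes the minimum
    over rows, which always upper-bounds the true frequency of x.
    """

    def __init__(self, rows=4, buckets=1000, seed=0):
        rng = random.Random(seed)
        # Independent salt per row stands in for a pairwise-independent hash family.
        self.salts = [rng.getrandbits(32) for _ in range(rows)]
        self.buckets = buckets
        self.table = [[0] * buckets for _ in range(rows)]

    def _bucket(self, i, x):
        # Bucket of item x in row i (hypothetical salted-hash construction).
        return hash((self.salts[i], x)) % self.buckets

    def add(self, x):
        for i in range(len(self.table)):
            self.table[i][self._bucket(i, x)] += 1

    def query(self, x):
        # CMS never undercounts: collisions only inflate counters.
        return min(self.table[i][self._bucket(i, x)] for i in range(len(self.table)))
```

The overestimation property (`query(x)` is at least the true count of `x`) is what makes the Bayesian recovery problem studied in the paper nontrivial: the sketch compresses counts lossily, and a prior is used to invert it.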
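The Experiment Setup row describes simulating data either from a Dirichlet process (DP) with concentration θ or from a Zipf distribution with tail parameter c. A minimal sketch of that kind of data generation is below, assuming a Chinese-restaurant-process draw for the DP and NumPy's Zipf sampler; the function name and the reduced sample size (2,000 instead of the paper's 500,000, to keep the pure-Python loop fast) are illustrative choices, not the authors' code.

```python
import numpy as np


def sample_dp_crp(n, theta, rng):
    """Draw n cluster labels from a DP(theta) via the Chinese restaurant process."""
    counts = []                      # counts[k] = size of cluster k so far
    labels = np.empty(n, dtype=int)
    for i in range(n):
        # New cluster with probability theta/(theta + i);
        # otherwise join an existing cluster proportionally to its size.
        probs = np.array(counts + [theta], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels[i] = k
    return labels


rng = np.random.default_rng(0)
dp_data = sample_dp_crp(2000, theta=10, rng=rng)   # DP draw, e.g. theta = 10
zipf_data = rng.zipf(1.54, size=2000)              # Zipf tail, e.g. c = 1.54
```

The qualitative contrast the paper exploits is visible even at this scale: the DP draw produces a number of distinct values growing only logarithmically in n, while the Zipf sample has power-law frequencies.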