Estimating the unseen from multiple populations

Authors: Aditi Raghunathan, Gregory Valiant, James Zou

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We systematically validate these two algorithms on synthetic data as well as real datasets from population genetics and from English books. Moreover, we illustrate that by estimating the joint frequency distribution, we can significantly improve the discovery power under a budget constraint. 6. Experiments. Evaluating the weighted linear estimator for large m. We empirically evaluated the performance of the weighted linear estimator $\hat{U}_W$. The experiments were conducted for three types of distributions Uniform, Dirichlet and Geometric...
Researcher Affiliation | Collaboration | (1) Stanford University, Stanford, CA; (2) Chan Zuckerberg Biohub, San Francisco, CA.
Pseudocode | Yes | Estimating the multi-population histogram: Core Approach. Input: Multi-population fingerprint Φ of samples. Output: Two estimates, $\hat{H}_{\text{counts}}$ and $\hat{H}_{\text{ll}}$, of the histogram corresponding to the distributions underlying fingerprint Φ.
Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the methodology described is publicly available.
Open Datasets | Yes | We systematically validate these two algorithms on synthetic data as well as real datasets from population genetics and from English books. Additionally, we evaluate the performance of $\hat{H}_{\text{count}}$ on a real dataset, in which we sampled words from three books: Hamlet (32K total words), Treasure Island (40K) and The Sun Also Rises (72K). To illustrate, we obtained genome sequencing data of 45K individuals from the Exome Aggregation Consortium (Lek et al., 2016).
Dataset Splits | Yes | We model the multi-population unseen estimation as a two stage process. In the first period, we observe $n_j$ independent samples from the j-th population, $\{X_i^j\}_{j=1,\dots,m;\; i=1,\dots,n_j}$. This is the seen data. In period two, which is in the future, we will sample additional $t_j n_j$ samples from the j-th population, $\{Y_i^j\}_{j=1,\dots,m;\; i=1,\dots,t_j n_j}$. The period two samples are unseen and we would like to estimate some statistic $U(\{Y_i^j\}, \{X_i^j\})$. We estimated $\hat{H}_{\text{count}}$ and $\hat{H}_{\text{ll}}$ using 16K from each population, and then used Eqn. 3 to estimate the number of unseen elements in additional samples.
Hardware Specification | No | The paper mentions that 'each run of our experiments took less than 20 minutes on a single laptop,' but it does not provide specific hardware details such as CPU/GPU models, processor types, or memory specifications.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python version, library versions like PyTorch, TensorFlow, or specific solver versions) required to reproduce the experiments.
Experiment Setup | Yes | Each experiment contains $m = 100$ populations. We have a total of 3000 distinct elements. In the Uniform setting, each population has support on 100 elements that are randomly sampled from the 3000. For Dirichlet, each population also has support on 100 random elements (from the 3000), and the weights on these 100 elements are sampled from a Dirichlet prior. For the Geometric experiments, each population corresponds to a random ordering of the 3000 elements and the k-th element is assigned probability $(1-p)^k p$. In period one, ten samples are observed in each of the 100 populations. In period two, 95 randomly chosen populations have extrapolation factor $t \in [0, 1]$ and five populations have extrapolation factor $10t$.
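
The synthetic setup quoted under Experiment Setup, together with the two-period sampling model quoted under Dataset Splits, can be sketched in Python with NumPy. This is an illustrative simulation under stated assumptions, not the authors' code. The function names, the Dirichlet concentration `alpha`, the geometric parameter `p`, the renormalization of the truncated geometric weights, the rounding of each period-two sample size to an integer, and the choice of statistic (the number of distinct elements that appear in period two but in no period-one sample) are all assumptions made here, since the summary does not state them.

```python
import numpy as np

N_ELEMENTS = 3000   # total distinct elements (from the quoted setup)
M = 100             # number of populations (from the quoted setup)
SUPPORT = 100       # support size per population (Uniform, Dirichlet)
N_SEEN = 10         # period-one samples per population


def make_populations(kind, rng, alpha=1.0, p=0.01):
    """Return an (M, N_ELEMENTS) array whose rows are probability vectors.

    alpha and p are illustrative assumptions; their values are not given
    in the summary above.
    """
    probs = np.zeros((M, N_ELEMENTS))
    for j in range(M):
        if kind == "uniform":
            support = rng.choice(N_ELEMENTS, size=SUPPORT, replace=False)
            probs[j, support] = 1.0 / SUPPORT
        elif kind == "dirichlet":
            support = rng.choice(N_ELEMENTS, size=SUPPORT, replace=False)
            probs[j, support] = rng.dirichlet(np.full(SUPPORT, alpha))
        elif kind == "geometric":
            order = rng.permutation(N_ELEMENTS)
            weights = (1.0 - p) ** np.arange(N_ELEMENTS) * p
            # Assumption: renormalize, since the tail is cut off at 3000.
            probs[j, order] = weights / weights.sum()
        else:
            raise ValueError(f"unknown kind: {kind}")
    return probs


def true_unseen_count(probs, t, rng, n_boosted=5):
    """Simulate both periods and count elements first found in period two."""
    factors = np.full(M, float(t))
    boosted = rng.choice(M, size=n_boosted, replace=False)
    factors[boosted] = 10.0 * t   # five populations extrapolate by 10t
    seen, future = set(), set()
    for j in range(M):
        # Period one: the seen data for population j.
        seen.update(rng.choice(N_ELEMENTS, size=N_SEEN, p=probs[j]))
        # Period two: t_j * n_j further samples (rounded, an assumption).
        n_future = int(round(factors[j] * N_SEEN))
        future.update(rng.choice(N_ELEMENTS, size=n_future, p=probs[j]))
    return len(future - seen)
```

Calling `true_unseen_count(make_populations("geometric", rng), t, rng)` with `rng = np.random.default_rng(0)` over a grid of `t` values would give ground-truth counts against which an estimator of the unseen could be compared.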