Estimating the unseen from multiple populations
Authors: Aditi Raghunathan, Gregory Valiant, James Zou
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We systematically validate these two algorithms on synthetic data as well as real datasets from population genetics and from English books. Moreover, we illustrate that by estimating the joint frequency distribution, we can significantly improve the discovery power under a budget constraint. 6. Experiments. Evaluating the weighted linear estimator for large m. We empirically evaluated the performance of the weighted linear estimator $\hat{U}_W$. The experiments were conducted for three types of distributions: Uniform, Dirichlet and Geometric... |
| Researcher Affiliation | Collaboration | (1) Stanford University, Stanford, CA; (2) Chan Zuckerberg Biohub, San Francisco, CA. |
| Pseudocode | Yes | Estimating the multi-population histogram: Core Approach. Input: Multi-population fingerprint $\Phi$ of samples. Output: Two estimates, $\hat{H}_{\text{count}}$ and $\hat{H}_{\text{ll}}$, of the histogram corresponding to the distributions underlying fingerprint $\Phi$. |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the methodology described is publicly available. |
| Open Datasets | Yes | We systematically validate these two algorithms on synthetic data as well as real datasets from population genetics and from English books. Additionally, we evaluate the performance of $\hat{H}_{\text{count}}$ on a real dataset, in which we sampled words from three books: Hamlet (32K total words), Treasure Island (40K) and The Sun Also Rises (72K). To illustrate, we obtained genome sequencing data of 45K individuals from the Exome Aggregation Consortium (Lek et al., 2016). |
| Dataset Splits | Yes | We model the multi-population unseen estimation as a two stage process. In the first period, we observe $n_j$ independent samples from the $j$-th population, $\{X_i^j\}_{j=1,\dots,m;\ i=1,\dots,n_j}$. This is the seen data. In period two, which is in the future, we will sample additional $t_j n_j$ samples from the $j$-th population, $\{Y_i^j\}_{j=1,\dots,m;\ i=1,\dots,t_j n_j}$. The period two samples are unseen and we would like to estimate some statistic $U(\{Y_i^j\}, \{X_i^j\})$. We estimated $\hat{H}_{\text{count}}$ and $\hat{H}_{\text{ll}}$ using 16K from each population, and then used Eqn. 3 to estimate the number of unseen elements in additional samples. |
| Hardware Specification | No | The paper mentions that 'each run of our experiments took less than 20 minutes on a single laptop,' but it does not provide specific hardware details such as CPU/GPU models, processor types, or memory specifications. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python version, library versions like PyTorch, TensorFlow, or specific solver versions) required to reproduce the experiments. |
| Experiment Setup | Yes | Each experiment contains $m = 100$ populations. We have a total of 3000 distinct elements. In the Uniform setting, each population has support on 100 elements that are randomly sampled from the 3000. For Dirichlet, each population also has support on 100 random elements (from the 3000), and the weights on these 100 elements are sampled from a Dirichlet prior. For the Geometric experiments, each population corresponds to a random ordering of the 3000 elements and the $k$-th element is assigned probability $(1-p)^k p$. In period one, ten samples are observed in each of the 100 populations. In period two, 95 randomly chosen populations have extrapolation factor $t \in [0, 1]$ and five populations have extrapolation factor $10t$. |
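
The Uniform setting and the two-period sampling scheme quoted in the table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code (none was released): the function names `make_uniform_populations`, `draw`, `count_unseen` and `simulate` are hypothetical, the statistic is assumed to be the number of distinct elements that appear in period two but not in period one, and period-two sample counts are rounded to the nearest integer. The sketch only simulates the ground-truth count; it does not implement the paper's estimators.

```python
import random


def make_uniform_populations(m=100, universe=3000, support=100, rng=None):
    """Each population is uniform over `support` elements drawn from the universe."""
    rng = rng or random.Random(0)
    return [rng.sample(range(universe), support) for _ in range(m)]


def draw(populations, counts, rng):
    """Draw counts[j] i.i.d. samples from population j (uniform over its support)."""
    return [[rng.choice(pop) for _ in range(n)]
            for pop, n in zip(populations, counts)]


def count_unseen(period_one, period_two):
    """Assumed statistic: distinct elements seen in period two but not in period one."""
    seen = {x for samples in period_one for x in samples}
    new = {y for samples in period_two for y in samples}
    return len(new - seen)


def simulate(t=1.0, m=100, n=10, heavy=5, seed=0):
    """Ground-truth unseen count for one run of the Uniform setting."""
    rng = random.Random(seed)
    pops = make_uniform_populations(m=m, rng=rng)
    # Period one: n samples from every population.
    period_one = draw(pops, [n] * m, rng)
    # Period two: `heavy` random populations get factor 10t, the rest get t.
    heavy_ids = set(rng.sample(range(m), heavy))
    counts = [round((10 * t if j in heavy_ids else t) * n) for j in range(m)]
    period_two = draw(pops, counts, rng)
    return count_unseen(period_one, period_two)


if __name__ == "__main__":
    for t in (0.25, 0.5, 1.0):
        print(t, simulate(t=t))
```

Running `simulate` over several seeds for a fixed `t` gives the empirical target that an estimator of the unseen count would be compared against. The Dirichlet and Geometric settings would differ only in how each population's sampling weights are chosen.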