Testing Closeness With Unequal Sized Samples
Authors: Bhaswar Bhattacharya, Gregory Valiant
NeurIPS 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic and on natural language data. Section 4 contains empirical results suggesting that this statistic performs very well in practice, with results on synthetic data as well as an illustration of how to apply these ideas to the problem of estimating the semantic similarity of two words based on samples of the n-grams that contain the words in a corpus of text. |
| Researcher Affiliation | Academia | Bhaswar B. Bhattacharya, Department of Statistics, Stanford University, Stanford, CA 94305, bhaswar@stanford.edu; Gregory Valiant, Department of Computer Science, Stanford University, Stanford, CA 94305, valiant@stanford.edu |
| Pseudocode | Yes | Algorithm 1: The Closeness Testing Algorithm. Algorithm 2: Testing for Mixing Times in Markov Chains. (An illustrative sketch of the core statistic appears after this table.) |
| Open Source Code | No | The paper discusses the Google Books Ngram Dataset and provides a link to it, but it does not provide access to source code for the methodology described in the paper, and there is no explicit statement about releasing the authors' own code. |
| Open Datasets | Yes | Specifically, for each pair of words, a, b that we consider, we select m1 random occurrences of a and m2 random occurrences of word b from the Google Books corpus, using the Google Books Ngram Dataset. The Google Books Ngram Dataset is freely available here: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html |
| Dataset Splits | No | The paper describes statistical tests and does not explicitly provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | Specifically, for each pair of words, a, b that we consider, we select m1 random occurrences of a and m2 random occurrences of word b from the Google Books corpus, using the Google Books Ngram Dataset. The sample size of bi-grams containing the first word is fixed at m1 = 1,000, and the sample size corresponding to the second word varies from m2 = 50 through m2 = 1,000. Let b = C0 log n / m2, for an absolute constant C0. If (2) and (3) hold, then ACCEPT; otherwise, REJECT. If γ < 1/9: …/(Xi + 1) ≤ C1 m2^2/m1 (4), where C1 is an appropriately chosen absolute constant. REJECT if there exists i ∈ [n] such that Yi ≥ 3 and Xi ≤ C2 m1/(m2 n^(1/3)), where C2 is an appropriately chosen absolute constant. |
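
The Pseudocode row cites Algorithm 1 (The Closeness Testing Algorithm) without reproducing it. As a rough illustration, the Python sketch below implements the unequal-sample statistic that, on my reading of the paper, sits at the core of that algorithm: Z = Σi ((m2 Xi − m1 Yi)^2 − (m2^2 Xi + m1^2 Yi)) / (Xi + Yi). The function name `closeness_statistic` and its interface are hypothetical, and the paper's acceptance thresholds and heavy-element corrections are omitted here.

```python
import numpy as np

def closeness_statistic(x_counts, y_counts, m1, m2):
    """Unequal-sample closeness statistic (a sketch of the paper's Z).

    x_counts[i] and y_counts[i] count how often domain element i appears
    among the m1 samples from p and the m2 samples from q. Under
    Poissonized sampling, the numerator for element i has expectation
    m1^2 * m2^2 * (p_i - q_i)^2, so Z concentrates near zero when p = q
    and grows when the distributions are far apart.
    """
    x = np.asarray(x_counts, dtype=float)
    y = np.asarray(y_counts, dtype=float)
    seen = (x + y) > 0  # elements never observed contribute nothing
    num = (m2 * x[seen] - m1 * y[seen]) ** 2 - (m2**2 * x[seen] + m1**2 * y[seen])
    return float(np.sum(num / (x[seen] + y[seen])))
```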
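
To mirror the Experiment Setup row's synthetic sweep (m1 fixed at 1,000, m2 varying from 50 through 1,000), the snippet below evaluates the statistic under p = q and under an ε-far alternative. The uniform base distribution, the paired ±ε perturbation, and the value ε = 0.5 are assumptions made purely for illustration; the report does not specify the paper's synthetic distributions.

```python
# Reuses closeness_statistic from the sketch above.
import numpy as np

rng = np.random.default_rng(0)
n, m1, eps = 1000, 1000, 0.5
p = np.full(n, 1.0 / n)              # assumed base distribution (uniform)
q_far = p.copy()
q_far[0::2] *= 1 + eps               # paired +/- eps perturbation:
q_far[1::2] *= 1 - eps               # q_far is eps-far from p in L1 distance

for m2 in (50, 100, 250, 500, 1000):  # the paper's range for m2
    x = rng.multinomial(m1, p)
    z_same = closeness_statistic(x, rng.multinomial(m2, p), m1, m2)
    z_far = closeness_statistic(x, rng.multinomial(m2, q_far), m1, m2)
    print(f"m2={m2:5d}  Z(p=q)={z_same:14.1f}  Z(eps-far)={z_far:14.1f}")
```

In this toy sweep, Z stays near zero for identical distributions while growing with m2 in the ε-far case, which is the qualitative behavior the paper's synthetic experiments are reported to show.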