Testing Closeness With Unequal Sized Samples
Authors: Bhaswar Bhattacharya, Gregory Valiant
NeurIPS 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic and on natural language data. Section 4 contains empirical results suggesting that this statistic performs very well in practice, with results on synthetic data as well as an illustration of how to apply these ideas to the problem of estimating the semantic similarity of two words based on samples of the n-grams that contain the words in a corpus of text. |
| Researcher Affiliation | Academia | Bhaswar B. Bhattacharya, Department of Statistics, Stanford University, Stanford, CA 94305, bhaswar@stanford.edu; Gregory Valiant, Department of Computer Science, Stanford University, Stanford, CA 94305, valiant@stanford.edu |
| Pseudocode | Yes | Algorithm 1: The Closeness Testing Algorithm. Algorithm 2: Testing for Mixing Times in Markov Chains. (An illustrative sketch of the core statistic appears after this table.) |
| Open Source Code | No | The paper discusses the Google Books Ngram Dataset and provides a link to it, but it does not provide access to source code for the methodology described in the paper, and there is no explicit statement about releasing the authors' own code. |
| Open Datasets | Yes | Specifically, for each pair of words, a, b that we consider, we select m1 random occurrences of a and m2 random occurrences of word b from the Google Books corpus, using the Google Books Ngram Dataset. The Google Books Ngram Dataset is freely available here: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html |
| Dataset Splits | No | The paper describes statistical tests and does not explicitly provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | Specifically, for each pair of words, a, b that we consider, we select m1 random occurrences of a and m2 random occurrences of word b from the Google Books corpus, using the Google Books Ngram Dataset. The sample size of bi-grams containing the first word is fixed at m1 = 1,000, and the sample size corresponding to the second word varies from m2 = 50 through m2 = 1,000. Let b = C0 log n / m2, for an absolute constant C0. If (2) and (3) hold, then ACCEPT; otherwise, REJECT. If γ < 1/9: …/(Xi + 1) ≤ C1 m2^2/m1 (4), where C1 is an appropriately chosen absolute constant. REJECT if there exists i ∈ [n] such that Yi ≥ 3 and Xi ≤ C2 m1/(m2 n^(1/3)), where C2 is an appropriately chosen absolute constant. |
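
The Pseudocode row cites Algorithm 1 (The Closeness Testing Algorithm) without reproducing it. As a rough illustration, the Python sketch below implements the unequal-sample statistic that, on my reading of the paper, sits at the core of that algorithm: Z = Σi ((m2 Xi − m1 Yi)^2 − (m2^2 Xi + m1^2 Yi)) / (Xi + Yi). The function name `closeness_statistic` and its interface are hypothetical, and the paper's acceptance thresholds and heavy-element corrections are omitted here.

```python
import numpy as np

def closeness_statistic(x_counts, y_counts, m1, m2):
    """Unequal-sample closeness statistic (a sketch of the paper's Z).

    x_counts[i] and y_counts[i] count how often domain element i appears
    among the m1 samples from p and the m2 samples from q. Under
    Poissonized sampling, the numerator for element i has expectation
    m1^2 * m2^2 * (p_i - q_i)^2, so Z concentrates near zero when p = q
    and grows when the distributions are far apart.
    """
    x = np.asarray(x_counts, dtype=float)
    y = np.asarray(y_counts, dtype=float)
    seen = (x + y) > 0  # elements never observed contribute nothing
    num = (m2 * x[seen] - m1 * y[seen]) ** 2 - (m2**2 * x[seen] + m1**2 * y[seen])
    return float(np.sum(num / (x[seen] + y[seen])))
```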
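
To mirror the Experiment Setup row's synthetic sweep (m1 fixed at 1,000, m2 varying from 50 through 1,000), the snippet below evaluates the statistic under p = q and under an ε-far alternative. The uniform base distribution, the paired ±ε perturbation, and the value ε = 0.5 are assumptions made purely for illustration; the report does not specify the paper's synthetic distributions.

```python
# Reuses closeness_statistic from the sketch above.
import numpy as np

rng = np.random.default_rng(0)
n, m1, eps = 1000, 1000, 0.5
p = np.full(n, 1.0 / n)              # assumed base distribution (uniform)
q_far = p.copy()
q_far[0::2] *= 1 + eps               # paired +/- eps perturbation:
q_far[1::2] *= 1 - eps               # q_far is eps-far from p in L1 distance

for m2 in (50, 100, 250, 500, 1000):  # the paper's range for m2
    x = rng.multinomial(m1, p)
    z_same = closeness_statistic(x, rng.multinomial(m2, p), m1, m2)
    z_far = closeness_statistic(x, rng.multinomial(m2, q_far), m1, m2)
    print(f"m2={m2:5d}  Z(p=q)={z_same:14.1f}  Z(eps-far)={z_far:14.1f}")
```

In this toy sweep, Z stays near zero for identical distributions while growing with m2 in the ε-far case, which is the qualitative behavior the paper's synthetic experiments are reported to show.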