The Limits of Post-Selection Generalization

Authors: Jonathan Ullman, Adam Smith, Kobbi Nissim, Uri Stemmer, Thomas Steinke

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this work we show several limitations on the power of algorithms satisfying post hoc generalization. First, we show a tight lower bound on the error of any algorithm that satisfies post hoc generalization and answers adaptively chosen statistical queries, showing a strong barrier to progress in post-selection data analysis. Second, we show that post hoc generalization is not closed under composition, despite many examples of such algorithms exhibiting strong composition properties. Our first contribution is strong new lower bounds on any algorithm that satisfies post hoc generalization and answers a sequence of adaptively chosen statistical queries, the setting introduced in Dwork et al. [9] and further studied in [1, 15, 20]. To prove our theorem, we construct a joint distribution over pairs (A, P) such that when M is given too small a sample X, and A asks k − 1 statistical queries, then either M does not answer all the queries accurately or A outputs a k-th query q such that |q(P) − q(X)| > ε. Thus, M cannot be both accurate and satisfy post hoc generalization. We prove the information-theoretic result (Theorem 1.2) in Section 2. Due to space restrictions, we defer the computational result (Theorem 1.3) to the full version of this work. (An illustrative sketch of this query game appears after the table.)
Researcher Affiliation | Collaboration | Kobbi Nissim, Georgetown University, kobbi.nissim@georgetown.edu; Adam Smith, Boston University, ads22@bu.edu; Thomas Steinke, IBM Research Almaden, posel@thomas-steinke.net; Uri Stemmer, Ben-Gurion University, u@uri.co.il; Jonathan Ullman, Northeastern University, jullman@ccs.neu.edu
Pseudocode | Yes | Algorithm 1: AQ_{X,n,k}[M A]; Algorithm 2: A_{NAQ}; Algorithm 3: A_{AQ}; Algorithm 4: Encrypermute
Open Source Code | No | The paper is theoretical and focuses on mathematical proofs and analyses. It does not mention releasing any source code for its methodology or provide links to a code repository.
Open Datasets | No | The paper is theoretical and does not conduct experiments on datasets requiring public access. It defines theoretical distributions (e.g., the 'uniform distribution over pairs (i, m_i)') for its proofs, but these are not publicly available datasets in the empirical sense.
Dataset Splits | No | The paper is theoretical and does not report on empirical experiments. Therefore, there are no training, validation, or test dataset splits mentioned.
Hardware Specification | No | The paper is theoretical and does not report on empirical experiments that would require specific hardware. No hardware specifications (e.g., GPU/CPU models) are mentioned.
Software Dependencies | No | The paper is theoretical and does not describe any implementation or empirical evaluation. Therefore, no software dependencies with specific version numbers are mentioned.
Experiment Setup | No | The paper is theoretical and does not report on empirical experiments. Therefore, no experimental setup details such as hyperparameters or system-level training settings are provided.
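
To make the adaptive statistical-query game summarized in the Research Type row concrete, the following is a minimal Python sketch under stated assumptions: all names (sample_data, noisy_mechanism, post_selected_query, run_game) are hypothetical, and the toy mechanism and analyst are stand-ins, not the paper's hard construction. It only illustrates the interaction in which a mechanism M answers k − 1 adaptively chosen statistical queries on a sample X drawn from P, after which the analyst outputs a post-selected k-th query q_k; post hoc generalization requires |q_k(P) − q_k(X)| ≤ ε.

import numpy as np

# Illustrative sketch only (assumed names throughout): a toy version of the
# adaptive statistical-query game, not the paper's lower-bound construction.
# A statistical query is a function q : {0,1}^d -> [0,1]; q(P) is its mean
# over the population P and q(X) its mean over the sample X. Post hoc
# generalization asks that the analyst's final, post-selected query satisfy
# |q(P) - q(X)| <= eps.

rng = np.random.default_rng(0)

def sample_data(d=20, n=50):
    """P is uniform over {0,1}^d (so every coordinate query has mean 0.5);
    X ~ P^n is the mechanism's sample, hidden from the analyst."""
    return rng.integers(0, 2, size=(n, d))

def noisy_mechanism(X, j, sigma=0.05):
    """M: answer the coordinate query q_j(x) = x_j with Gaussian noise.
    (One common way to limit overfitting; the lower bound applies to any M.)"""
    return X[:, j].mean() + rng.normal(0.0, sigma)

def post_selected_query(answers):
    """A: after seeing answers to q_1, ..., q_{k-1}, output a k-th query that
    reweights coordinates by the sign of the observed drift from 0.5, a crude
    form of post-selection that chases the sample's noise."""
    drift = np.sign(np.asarray(answers) - 0.5)
    m = len(drift)
    def q_k(x_row):
        return float(0.5 + drift @ (x_row[:m] - 0.5) / m)  # always lies in [0, 1]
    return q_k

def run_game(k=10, eps=0.1):
    X = sample_data()
    answers = [noisy_mechanism(X, j) for j in range(k - 1)]  # k-1 adaptive queries
    q_k = post_selected_query(answers)                       # the post-selected query
    q_P = 0.5                                                # exact, by symmetry of P
    q_X = np.mean([q_k(row) for row in X])
    gap = abs(q_P - q_X)
    print(f"|q_k(P) - q_k(X)| = {gap:.3f} (post hoc generalization needs <= {eps})")

run_game()

In the paper's lower bound, the joint distribution over pairs (A, P) is constructed so that, when the sample is too small, no mechanism can both answer the first k − 1 queries accurately and keep this gap below ε for the analyst's final query.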