The Limits of Post-Selection Generalization

Authors: Jonathan Ullman, Adam Smith, Kobbi Nissim, Uri Stemmer, Thomas Steinke

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this work we show several limitations on the power of algorithms satisfying post hoc generalization. First, we show a tight lower bound on the error of any algorithm that satisfies post hoc generalization and answers adaptively chosen statistical queries, showing a strong barrier to progress in post-selection data analysis. Second, we show that post hoc generalization is not closed under composition, despite many examples of such algorithms exhibiting strong composition properties. Our first contribution is strong new lower bounds on any algorithm that satisfies post hoc generalization and answers a sequence of adaptively chosen statistical queries, the setting introduced in Dwork et al. [9] and further studied in [1, 15, 20]. To prove our theorem, we construct a joint distribution over pairs (A, P) such that when M is given too small a sample X, and A asks k − 1 statistical queries, then either M does not answer all the queries accurately or A outputs a k-th query q such that |q(P) − q(X)| > ε. Thus, M cannot be both accurate and satisfy post hoc generalization. We prove the information-theoretic result (Theorem 1.2) in Section 2. Due to space restrictions, we defer the computational result (Theorem 1.3) to the full version of this work. (An illustrative sketch of this query game appears after the table.)
Researcher Affiliation | Collaboration | Kobbi Nissim, Georgetown University, kobbi.nissim@georgetown.edu; Adam Smith, Boston University, ads22@bu.edu; Thomas Steinke, IBM Research Almaden, posel@thomas-steinke.net; Uri Stemmer, Ben-Gurion University, u@uri.co.il; Jonathan Ullman, Northeastern University, jullman@ccs.neu.edu
Pseudocode | Yes | Algorithm 1: AQ_{X,n,k}[M A]; Algorithm 2: A_{NAQ}; Algorithm 3: A_{AQ}; Algorithm 4: Encrypermute
Open Source Code | No | The paper is theoretical and focuses on mathematical proofs and analyses. It does not mention releasing any source code for its methodology or provide links to a code repository.
Open Datasets | No | The paper is theoretical and does not conduct experiments on datasets requiring public access. It defines theoretical distributions (e.g., the 'uniform distribution over pairs (i, m_i)') for its proofs, but these are not publicly available datasets in the empirical sense.
Dataset Splits | No | The paper is theoretical and does not report on empirical experiments. Therefore, there are no training, validation, or test dataset splits mentioned.
Hardware Specification | No | The paper is theoretical and does not report on empirical experiments that would require specific hardware. No hardware specifications (e.g., GPU/CPU models) are mentioned.
Software Dependencies | No | The paper is theoretical and does not describe any implementation or empirical evaluation. Therefore, no software dependencies with specific version numbers are mentioned.
Experiment Setup | No | The paper is theoretical and does not report on empirical experiments. Therefore, no experimental setup details such as hyperparameters or system-level training settings are provided.
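
To make the adaptive statistical-query game summarized in the Research Type row concrete, the following is a minimal Python sketch under stated assumptions: all names (sample_data, noisy_mechanism, post_selected_query, run_game) are hypothetical, and the toy mechanism and analyst are stand-ins, not the paper's hard construction. It only illustrates the interaction in which a mechanism M answers k − 1 adaptively chosen statistical queries on a sample X drawn from P, after which the analyst outputs a post-selected k-th query q_k; post hoc generalization requires |q_k(P) − q_k(X)| ≤ ε.

import numpy as np

# Illustrative sketch only (assumed names throughout): a toy version of the
# adaptive statistical-query game, not the paper's lower-bound construction.
# A statistical query is a function q : {0,1}^d -> [0,1]; q(P) is its mean
# over the population P and q(X) its mean over the sample X. Post hoc
# generalization asks that the analyst's final, post-selected query satisfy
# |q(P) - q(X)| <= eps.

rng = np.random.default_rng(0)

def sample_data(d=20, n=50):
    """P is uniform over {0,1}^d (so every coordinate query has mean 0.5);
    X ~ P^n is the mechanism's sample, hidden from the analyst."""
    return rng.integers(0, 2, size=(n, d))

def noisy_mechanism(X, j, sigma=0.05):
    """M: answer the coordinate query q_j(x) = x_j with Gaussian noise.
    (One common way to limit overfitting; the lower bound applies to any M.)"""
    return X[:, j].mean() + rng.normal(0.0, sigma)

def post_selected_query(answers):
    """A: after seeing answers to q_1, ..., q_{k-1}, output a k-th query that
    reweights coordinates by the sign of the observed drift from 0.5, a crude
    form of post-selection that chases the sample's noise."""
    drift = np.sign(np.asarray(answers) - 0.5)
    m = len(drift)
    def q_k(x_row):
        return float(0.5 + drift @ (x_row[:m] - 0.5) / m)  # always lies in [0, 1]
    return q_k

def run_game(k=10, eps=0.1):
    X = sample_data()
    answers = [noisy_mechanism(X, j) for j in range(k - 1)]  # k-1 adaptive queries
    q_k = post_selected_query(answers)                       # the post-selected query
    q_P = 0.5                                                # exact, by symmetry of P
    q_X = np.mean([q_k(row) for row in X])
    gap = abs(q_P - q_X)
    print(f"|q_k(P) - q_k(X)| = {gap:.3f} (post hoc generalization needs <= {eps})")

run_game()

In the paper's lower bound, the joint distribution over pairs (A, P) is constructed so that, when the sample is too small, no mechanism can both answer the first k − 1 queries accurately and keep this gap below ε for the analyst's final query.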