Generalization in Adaptive Data Analysis and Holdout Reuse
Authors: Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth
NeurIPS 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We describe a simple experiment on synthetic data that illustrates the danger of reusing a standard holdout set, and how this issue can be resolved by our reusable holdout. The design of this experiment is inspired by Freedman's classical experiment, which demonstrated the dangers of performing variable selection and regression on the same data [10]. In our first experiment, each attribute of x is drawn independently from the normal distribution N(0, 1) and we choose the class label y ∈ {−1, 1} uniformly at random so that there is no correlation between the data point and its label. We chose n = 10,000, d = 10,000 and varied the number of selected variables k. In this scenario no classifier can achieve true accuracy better than 50%. Nevertheless, reusing a standard holdout results in reported accuracy of over 63% for k = 500 on both the training set and the holdout set (the standard deviation of the error is less than 0.5%). The average and standard deviation of results obtained from 100 independent executions of the experiment are plotted above. For comparison, the plot also includes the accuracy of the classifier on another fresh data set of size n drawn from the same distribution. We then executed the same algorithm with our reusable holdout. Thresholdout was invoked with T = 0.04 and τ = 0.01, explaining why the accuracy of the classifier reported by Thresholdout is off by up to 0.04 whenever the accuracy on the holdout set is within 0.04 of the accuracy on the training set. We also used Gaussian noise instead of Laplacian noise as it has stronger concentration properties. Thresholdout prevents the algorithm from overfitting to the holdout set and gives a valid estimate of classifier accuracy. Additional experiments and discussion are presented in the full version. (A sketch reproducing this synthetic-data setup appears after the table.) |
| Researcher Affiliation | Collaboration | Cynthia Dwork (Microsoft Research); Vitaly Feldman (IBM Almaden Research Center); Moritz Hardt (Google Research); Toniann Pitassi (University of Toronto); Omer Reingold (Samsung Research America); Aaron Roth (University of Pennsylvania) |
| Pseudocode | Yes | We provide the pseudocode of Thresholdout below. Input: Training set S_t, holdout set S_h, threshold T, noise rate σ, budget B. 1. sample γ ~ Lap(2σ); T̂ ← T + γ. 2. For each query φ do: (a) if B < 1 output ⊥; (b) else: i. sample η ~ Lap(4σ); ii. if \|E_Sh[φ] − E_St[φ]\| > T̂ + η: A. sample ξ ~ Lap(σ), γ ~ Lap(2σ); B. B ← B − 1 and T̂ ← T + γ; C. output E_Sh[φ] + ξ; iii. else output E_St[φ]. (A runnable Python sketch of this pseudocode appears after the table.) |
| Open Source Code | No | The paper does not provide any specific links or explicit statements about the availability of open-source code for the described methodology. |
| Open Datasets | No | In our first experiment, each attribute of x is drawn independently from the normal distribution N(0, 1) and we choose the class label y ∈ {−1, 1} uniformly at random so that there is no correlation between the data point and its label. We chose n = 10,000, d = 10,000 and varied the number of selected variables k. This describes a synthetically generated dataset rather than a publicly available one with concrete access information (link, DOI, formal citation). |
| Dataset Splits | Yes | the analyst is given a d-dimensional labeled data set S of size 2n and splits it randomly into a training set S_t and a holdout set S_h of equal size. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | Thresholdout was invoked with T = 0.04 and τ = 0.01, explaining why the accuracy of the classifier reported by Thresholdout is off by up to 0.04 whenever the accuracy on the holdout set is within 0.04 of the accuracy on the training set. We also used Gaussian noise instead of Laplacian noise as it has stronger concentration properties. |
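
The pseudocode quoted above maps almost line-for-line onto code. Below is a minimal Python sketch of it; the class name `Thresholdout`, the query interface (callables mapping a data point to a value in [0, 1]), the default budget, the seed, and the use of NumPy are illustrative assumptions, and the noise is Laplace as in the pseudocode even though the quoted experiment substituted Gaussian noise.

```python
import numpy as np

class Thresholdout:
    """Sketch of the Thresholdout pseudocode: training set S_t, holdout set S_h,
    threshold T, noise rate sigma, budget B. Interface and defaults are assumptions."""

    def __init__(self, train, holdout, T=0.04, sigma=0.01, budget=1000, seed=0):
        self.train = train        # training set S_t (list of data points)
        self.holdout = holdout    # holdout set S_h
        self.T = T
        self.sigma = sigma
        self.budget = budget      # budget B: number of allowed holdout-backed answers
        self.rng = np.random.default_rng(seed)
        # Step 1: sample gamma ~ Lap(2*sigma) and set T_hat <- T + gamma.
        self.T_hat = T + self.rng.laplace(scale=2 * sigma)

    def query(self, phi):
        """Step 2: answer one query phi (a callable mapping a point to [0, 1])."""
        if self.budget < 1:
            return None                                            # (a) budget spent: output "bottom"
        est_train = float(np.mean([phi(z) for z in self.train]))   # E_St[phi]
        est_hold = float(np.mean([phi(z) for z in self.holdout]))  # E_Sh[phi]
        eta = self.rng.laplace(scale=4 * self.sigma)               # i.
        if abs(est_hold - est_train) > self.T_hat + eta:           # ii.
            # A-C: training and holdout disagree, so charge the budget, refresh
            # the noisy threshold, and answer with a noisy holdout estimate.
            self.budget -= 1
            self.T_hat = self.T + self.rng.laplace(scale=2 * self.sigma)
            return est_hold + self.rng.laplace(scale=self.sigma)
        return est_train                                           # iii. otherwise answer from the training set
```

Answering from the training set whenever the two empirical means agree within the noisy threshold is what lets the mechanism handle many queries before its budget is spent.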
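
To see how the quoted synthetic-data experiment exercises that mechanism, the sketch below generates uncorrelated data (attributes from N(0, 1), labels uniform in {−1, 1}) and runs a simplified analyst; it reuses the `Thresholdout` class sketched above. The sign-agreement selection rule, the reduced n, d, k, and the budget are assumptions made for illustration rather than the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
# The paper uses n = d = 10,000 and k up to 500; smaller values keep the sketch light.
n, d, k = 2_000, 2_000, 100

# Uncorrelated synthetic data: no classifier can achieve true accuracy above 50%.
X_train, y_train = rng.standard_normal((n, d)), rng.choice([-1, 1], size=n)
X_hold, y_hold = rng.standard_normal((n, d)), rng.choice([-1, 1], size=n)

corr_train = X_train.T @ y_train / n   # per-attribute correlation with the label
corr_hold = X_hold.T @ y_hold / n

# (1) Standard holdout reused directly: keep the k strongest training correlations
#     whose sign agrees on the holdout, then vote by those signs. Reusing the
#     holdout during selection is what inflates its reported accuracy.
agree = np.flatnonzero(np.sign(corr_train) == np.sign(corr_hold))
selected = agree[np.argsort(-np.abs(corr_train[agree]))[:k]]
weights = np.sign(corr_train[selected])
acc = lambda X, y: float(np.mean(np.sign(X[:, selected] @ weights) == y))
print("standard holdout -> train acc:", acc(X_train, y_train),
      "holdout acc:", acc(X_hold, y_hold))   # both spuriously above 0.5

# (2) Reusable holdout: build a classifier from the training set alone and ask
#     the Thresholdout sketch above for its accuracy. The train/holdout gap
#     exceeds T, so the answer is a noisy holdout estimate near 0.5.
sel_t = np.argsort(-np.abs(corr_train))[:k]
w_t = np.sign(corr_train[sel_t])
phi = lambda z: float(np.sign(z[0][sel_t] @ w_t) == z[1])   # per-point accuracy query
oracle = Thresholdout(list(zip(X_train, y_train)), list(zip(X_hold, y_hold)),
                      T=0.04, sigma=0.01, budget=100)
print("Thresholdout accuracy estimate:", oracle.query(phi))
```

Because the data carry no signal, the holdout accuracy reported in part (1) is spurious, while the Thresholdout answer in part (2) should stay near 50%, qualitatively matching the behavior the quoted passage describes.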