Model Similarity Mitigates Test Set Overuse

Authors: Horia Mania, John Miller, Ludwig Schmidt, Moritz Hardt, Benjamin Recht

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We proffer a new explanation for the apparent longevity of test data: Many proposed models are similar in their predictions and we prove that this similarity mitigates overfitting. Specifically, we show empirically that models proposed for the ImageNet ILSVRC benchmark agree in their predictions well beyond what we can conclude from their accuracy levels alone. Likewise, models created by large scale hyperparameter search enjoy high levels of similarity. Motivated by these empirical observations, we give a non-asymptotic generalization bound that takes similarity into account, leading to meaningful confidence bounds in practical settings. (A sketch of the prediction-agreement comparison appears after the table.)
Researcher Affiliation | Academia | Horia Mania, UC Berkeley (hmania@berkeley.edu); John Miller, UC Berkeley (miller_john@berkeley.edu); Ludwig Schmidt, UC Berkeley (ludwig@berkeley.edu); Moritz Hardt, UC Berkeley (hardt@berkeley.edu); Benjamin Recht, UC Berkeley (brecht@berkeley.edu)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide open-source code for the methodology it describes. The link in footnote 1 is to a model testbed used in their analysis, not their own code.
Open Datasets | Yes | We specifically focus on ImageNet and CIFAR-10, two widely used machine learning benchmarks that have recently been shown to exhibit little to no adaptive overfitting in spite of almost a decade of test set re-use [15].
Dataset Splits | No | The paper mentions using 'the ImageNet validation set' with n = 50,000 for evaluation and training models for hyperparameter search, but it does not provide specific training/validation split percentages or a splitting methodology for its own training runs.
Hardware Specification | No | The paper does not specify the hardware used to run its experiments.
Software Dependencies | No | The paper mentions using ResNet-110 and the DARTS pipeline but does not specify software dependencies with version numbers.
Experiment Setup | Yes | To understand the similarity between models evaluated in hyperparameter search, we ran our own random search to choose hyperparameters for a ResNet-110. The grid included properties of the architecture (e.g. type of residual block), the optimization algorithm (e.g. choice of optimizer), and the data distribution (e.g. data augmentation strategies). A full specification of the grid is included in Appendix D. (A random-search sketch also appears after the table.)
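
The abstract quoted in the Research Type row hinges on comparing the observed agreement between two models' predictions with the agreement their accuracy levels alone would suggest. The sketch below is a minimal illustration of that comparison, not the paper's own code: the prediction arrays and file names are hypothetical, and the independence baseline assumes errors fall uniformly over the incorrect labels, which is a simplification rather than the paper's exact estimator.

```python
import numpy as np

def empirical_agreement(preds_a, preds_b):
    """Fraction of test examples on which two models predict the same label."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a == preds_b))

def independent_agreement(acc_a, acc_b, num_classes=1000):
    """Agreement expected if the two models' errors were independent.

    The models agree when both are correct, or when both are wrong and pick
    the same incorrect label; wrong predictions are assumed uniform over the
    remaining labels (a simplifying assumption, not the paper's estimator).
    """
    return acc_a * acc_b + (1.0 - acc_a) * (1.0 - acc_b) / (num_classes - 1)

# Hypothetical usage with predicted labels on a shared test set:
# labels = np.load("imagenet_val_labels.npy")
# preds_a, preds_b = np.load("model_a_preds.npy"), np.load("model_b_preds.npy")
# acc_a, acc_b = np.mean(preds_a == labels), np.mean(preds_b == labels)
# print(empirical_agreement(preds_a, preds_b), independent_agreement(acc_a, acc_b))
```

An observed agreement rate well above the independence baseline is the kind of similarity the paper's generalization bound exploits.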
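
The Experiment Setup row describes a random search over a ResNet-110 hyperparameter grid spanning architecture, optimizer, and data-augmentation choices, with the full grid given in the paper's Appendix D. The sketch below only illustrates the general shape of such a search; the grid values and the train_and_evaluate callable are hypothetical placeholders, not the paper's actual configuration.

```python
import random

# Hypothetical grid in the spirit of the one described in the paper's
# Appendix D; the actual options and values used in the paper differ.
GRID = {
    "residual_block": ["basic", "pre_activation"],
    "optimizer": ["sgd", "adam"],
    "learning_rate": [0.01, 0.05, 0.1],
    "weight_decay": [1e-4, 5e-4],
    "augmentation": ["none", "flip", "flip_and_crop"],
}

def sample_config(grid, rng):
    """Draw one hyperparameter configuration uniformly at random from the grid."""
    return {name: rng.choice(values) for name, values in grid.items()}

def random_search(num_trials, train_and_evaluate, seed=0):
    """Sample configurations and score each with a caller-supplied training routine."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_trials):
        config = sample_config(GRID, rng)
        results.append((config, train_and_evaluate(config)))
    return results
```

Each sampled configuration corresponds to one trained model whose test-set predictions can then be compared for similarity as in the agreement sketch above.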