Model Similarity Mitigates Test Set Overuse

Authors: Horia Mania, John Miller, Ludwig Schmidt, Moritz Hardt, Benjamin Recht

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We proffer a new explanation for the apparent longevity of test data: Many proposed models are similar in their predictions and we prove that this similarity mitigates overfitting. Specifically, we show empirically that models proposed for the ImageNet ILSVRC benchmark agree in their predictions well beyond what we can conclude from their accuracy levels alone. Likewise, models created by large scale hyperparameter search enjoy high levels of similarity. Motivated by these empirical observations, we give a non-asymptotic generalization bound that takes similarity into account, leading to meaningful confidence bounds in practical settings. (A sketch of the prediction-agreement comparison appears after the table.)
Researcher Affiliation | Academia | Horia Mania, UC Berkeley (hmania@berkeley.edu); John Miller, UC Berkeley (miller_john@berkeley.edu); Ludwig Schmidt, UC Berkeley (ludwig@berkeley.edu); Moritz Hardt, UC Berkeley (hardt@berkeley.edu); Benjamin Recht, UC Berkeley (brecht@berkeley.edu)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide open-source code for the methodology it describes. The link in footnote 1 is to a model testbed used in their analysis, not their own code.
Open Datasets | Yes | We specifically focus on ImageNet and CIFAR-10, two widely used machine learning benchmarks that have recently been shown to exhibit little to no adaptive overfitting in spite of almost a decade of test set re-use [15].
Dataset Splits | No | The paper mentions using 'the ImageNet validation set' with n = 50,000 for evaluation and training models for hyperparameter search, but it does not provide specific training/validation split percentages or a splitting methodology for its own training runs.
Hardware Specification | No | The paper does not specify the hardware used to run its experiments.
Software Dependencies | No | The paper mentions using ResNet-110 and the DARTS pipeline but does not specify software dependencies with version numbers.
Experiment Setup | Yes | To understand the similarity between models evaluated in hyperparameter search, we ran our own random search to choose hyperparameters for a ResNet-110. The grid included properties of the architecture (e.g. type of residual block), the optimization algorithm (e.g. choice of optimizer), and the data distribution (e.g. data augmentation strategies). A full specification of the grid is included in Appendix D. (A random-search sketch also appears after the table.)
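
The abstract quoted in the Research Type row hinges on comparing the observed agreement between two models' predictions with the agreement their accuracy levels alone would suggest. The sketch below is a minimal illustration of that comparison, not the paper's own code: the prediction arrays and file names are hypothetical, and the independence baseline assumes errors fall uniformly over the incorrect labels, which is a simplification rather than the paper's exact estimator.

```python
import numpy as np

def empirical_agreement(preds_a, preds_b):
    """Fraction of test examples on which two models predict the same label."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a == preds_b))

def independent_agreement(acc_a, acc_b, num_classes=1000):
    """Agreement expected if the two models' errors were independent.

    The models agree when both are correct, or when both are wrong and pick
    the same incorrect label; wrong predictions are assumed uniform over the
    remaining labels (a simplifying assumption, not the paper's estimator).
    """
    return acc_a * acc_b + (1.0 - acc_a) * (1.0 - acc_b) / (num_classes - 1)

# Hypothetical usage with predicted labels on a shared test set:
# labels = np.load("imagenet_val_labels.npy")
# preds_a, preds_b = np.load("model_a_preds.npy"), np.load("model_b_preds.npy")
# acc_a, acc_b = np.mean(preds_a == labels), np.mean(preds_b == labels)
# print(empirical_agreement(preds_a, preds_b), independent_agreement(acc_a, acc_b))
```

An observed agreement rate well above the independence baseline is the kind of similarity the paper's generalization bound exploits.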
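
The Experiment Setup row describes a random search over a ResNet-110 hyperparameter grid spanning architecture, optimizer, and data-augmentation choices, with the full grid given in the paper's Appendix D. The sketch below only illustrates the general shape of such a search; the grid values and the train_and_evaluate callable are hypothetical placeholders, not the paper's actual configuration.

```python
import random

# Hypothetical grid in the spirit of the one described in the paper's
# Appendix D; the actual options and values used in the paper differ.
GRID = {
    "residual_block": ["basic", "pre_activation"],
    "optimizer": ["sgd", "adam"],
    "learning_rate": [0.01, 0.05, 0.1],
    "weight_decay": [1e-4, 5e-4],
    "augmentation": ["none", "flip", "flip_and_crop"],
}

def sample_config(grid, rng):
    """Draw one hyperparameter configuration uniformly at random from the grid."""
    return {name: rng.choice(values) for name, values in grid.items()}

def random_search(num_trials, train_and_evaluate, seed=0):
    """Sample configurations and score each with a caller-supplied training routine."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_trials):
        config = sample_config(GRID, rng)
        results.append((config, train_and_evaluate(config)))
    return results
```

Each sampled configuration corresponds to one trained model whose test-set predictions can then be compared for similarity as in the agreement sketch above.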