Agreement-on-the-line: Predicting the Performance of Neural Networks under Distribution Shift

Authors: Christina Baek, Yiding Jiang, Aditi Raghunathan, J. Zico Kolter

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the ID vs. OOD accuracy and agreement between pairs of models across more than 20 common OOD benchmarks and hundreds of independently trained neural networks. We present results on 8 dataset shifts in the main paper, and include results for other distribution shifts in Appendix C. In Table 2, we observe that ALine-D generally outperforms other methods on datasets where agreement-on-the-line holds. [the accuracy and agreement quantities are sketched in code below the table]
Researcher Affiliation | Collaboration | Christina Baek (1), Yiding Jiang (1), Aditi Raghunathan (1), Zico Kolter (1, 2); (1) Carnegie Mellon University, (2) Bosch Center for AI
Pseudocode | Yes | Algorithm 1 ALine-D: Predicting OOD Accuracy [a hedged sketch of the underlying estimator follows the table]
Open Source Code | Yes | Implementation of our method is available at https://github.com/kebaek/agreement-on-the-line.
Open Datasets | Yes | Datasets. We present results on 8 dataset shifts in the main paper, and include results for other distribution shifts in Appendix C. These 8 datasets span: 1. Dataset reproductions: CIFAR-10.1 [67], CIFAR-10.2 [52] reproductions of CIFAR-10 [43], and ImageNetV2 [67] reproduction of ImageNet [21]
Dataset Splits | No | The paper mentions using a 'labeled validation set' but does not specify the exact size, percentages, or split methodology for these validation sets across the various datasets, making it difficult to reproduce the data partitioning precisely.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU model, CPU type, memory) used to run the experiments. It only mentions 'hundreds of independently trained neural networks' and models from a 'testbed'.
Software Dependencies | No | The paper mentions evaluating models from the 'timm [82] package' and references 'PyTorch Image Models', but it does not provide specific version numbers for these or for other software dependencies such as PyTorch itself, Python, or CUDA.
Experiment Setup | No | The paper mentions 'probit scaling' and 'temperature scaling' as data transformations or model adjustments, but it does not specify concrete hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer details) for training the models used in the experiments. [both transforms are sketched in code below]
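
For readers checking the "Research Type" row: the two quantities the study correlates, per-model accuracy and per-pair agreement, are simple to compute. The sketch below is a minimal illustration; the arrays of predicted class indices are an assumption about how model outputs are stored, not the paper's actual data format.

```python
import numpy as np

def accuracy(preds: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of examples a model classifies correctly (needs labels)."""
    return float(np.mean(preds == labels))

def agreement(preds_a: np.ndarray, preds_b: np.ndarray) -> float:
    """Fraction of examples on which two models predict the same class.

    Unlike accuracy, this needs no ground-truth labels, so it can be
    measured on unlabeled OOD data -- the property ALine-S/D exploit.
    """
    return float(np.mean(preds_a == preds_b))
```

Agreement is computed for every pair of models in the collection, so n models yield n(n-1)/2 agreement points per distribution.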
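The "Pseudocode" row cites Algorithm 1 (ALine-D), which predicts OOD accuracy by combining per-pair constraints into a linear system. The sketch below instead shows the simpler shared-slope idea behind the companion estimator ALine-S: under agreement-on-the-line, the probit-scaled agreement line and accuracy line share slope and bias, and the agreement line can be fit without any OOD labels. Function names and the clipping constant are my own; treat this as an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def probit(p) -> np.ndarray:
    """Inverse Gaussian CDF; clip so values of exactly 0 or 1 stay finite."""
    return norm.ppf(np.clip(np.asarray(p, dtype=float), 1e-6, 1 - 1e-6))

def aline_s(id_acc, id_agreement, ood_agreement) -> np.ndarray:
    """Predict each model's OOD accuracy without OOD labels.

    id_acc:         (n,) ID accuracy per model (labeled ID data)
    id_agreement:   (m,) ID agreement per model pair
    ood_agreement:  (m,) OOD agreement per pair (unlabeled OOD data)
    """
    # Fit the agreement line in probit scale; both coordinates are
    # observable without OOD labels.
    slope, bias = np.polyfit(probit(id_agreement), probit(ood_agreement), deg=1)
    # Agreement-on-the-line: reuse the same slope and bias for accuracy.
    return norm.cdf(slope * probit(id_acc) + bias)
```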
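The "Experiment Setup" row flags two transforms. Probit scaling is the `probit` transform in the sketch above. Temperature scaling is the standard single-parameter calibration of Guo et al. (2017): fit one scalar T on held-out labeled data so that softmax(logits / T) minimizes negative log-likelihood. The optimizer, learning rate, and iteration budget below are assumptions for illustration, since the paper does not report them.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fit a scalar temperature T by minimizing NLL of softmax(logits / T).

    logits: (N, C) held-out validation logits; labels: (N,) class indices.
    Optimizing log T keeps the temperature positive.
    """
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return float(log_t.exp())
```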