Agreement-on-the-line: Predicting the Performance of Neural Networks under Distribution Shift

Authors: Christina Baek, Yiding Jiang, Aditi Raghunathan, J. Zico Kolter

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the ID vs. OOD accuracy and agreement between pairs of models across more than 20 common OOD benchmarks and hundreds of independently trained neural networks. We present results on 8 dataset shifts in the main paper, and include results for other distribution shifts in Appendix C. In Table 2, we observe that ALine-D generally outperforms other methods on datasets where agreement-on-the-line holds. [the accuracy and agreement quantities are sketched in code below the table]
Researcher Affiliation | Collaboration | Christina Baek (1), Yiding Jiang (1), Aditi Raghunathan (1), Zico Kolter (1, 2); (1) Carnegie Mellon University, (2) Bosch Center for AI
Pseudocode | Yes | Algorithm 1 ALine-D: Predicting OOD Accuracy [a hedged sketch of the underlying estimator follows the table]
Open Source Code | Yes | Implementation of our method is available at https://github.com/kebaek/agreement-on-the-line.
Open Datasets | Yes | Datasets. We present results on 8 dataset shifts in the main paper, and include results for other distribution shifts in Appendix C. These 8 datasets span: 1. Dataset reproductions: CIFAR-10.1 [67], CIFAR-10.2 [52] reproductions of CIFAR-10 [43], and ImageNetV2 [67] reproduction of ImageNet [21]
Dataset Splits | No | The paper mentions using a 'labeled validation set' but does not specify the exact size, percentages, or split methodology for these validation sets across the various datasets, making it difficult to reproduce the data partitioning precisely.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU model, CPU type, memory) used to run the experiments. It only mentions 'hundreds of independently trained neural networks' and models from a 'testbed'.
Software Dependencies | No | The paper mentions evaluating models from the 'timm [82] package' and references 'PyTorch Image Models', but it does not provide specific version numbers for these or for other software dependencies such as PyTorch itself, Python, or CUDA.
Experiment Setup | No | The paper mentions 'probit scaling' and 'temperature scaling' as data transformations or model adjustments, but it does not specify concrete hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer details) for training the models used in the experiments. [both transforms are sketched in code below]
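
For readers checking the "Research Type" row: the two quantities the study correlates, per-model accuracy and per-pair agreement, are simple to compute. The sketch below is a minimal illustration; the arrays of predicted class indices are an assumption about how model outputs are stored, not the paper's actual data format.

```python
import numpy as np

def accuracy(preds: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of examples a model classifies correctly (needs labels)."""
    return float(np.mean(preds == labels))

def agreement(preds_a: np.ndarray, preds_b: np.ndarray) -> float:
    """Fraction of examples on which two models predict the same class.

    Unlike accuracy, this needs no ground-truth labels, so it can be
    measured on unlabeled OOD data -- the property ALine-S/D exploit.
    """
    return float(np.mean(preds_a == preds_b))
```

Agreement is computed for every pair of models in the collection, so n models yield n(n-1)/2 agreement points per distribution.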
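The "Pseudocode" row cites Algorithm 1 (ALine-D), which predicts OOD accuracy by combining per-pair constraints into a linear system. The sketch below instead shows the simpler shared-slope idea behind the companion estimator ALine-S: under agreement-on-the-line, the probit-scaled agreement line and accuracy line share slope and bias, and the agreement line can be fit without any OOD labels. Function names and the clipping constant are my own; treat this as an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def probit(p) -> np.ndarray:
    """Inverse Gaussian CDF; clip so values of exactly 0 or 1 stay finite."""
    return norm.ppf(np.clip(np.asarray(p, dtype=float), 1e-6, 1 - 1e-6))

def aline_s(id_acc, id_agreement, ood_agreement) -> np.ndarray:
    """Predict each model's OOD accuracy without OOD labels.

    id_acc:         (n,) ID accuracy per model (labeled ID data)
    id_agreement:   (m,) ID agreement per model pair
    ood_agreement:  (m,) OOD agreement per pair (unlabeled OOD data)
    """
    # Fit the agreement line in probit scale; both coordinates are
    # observable without OOD labels.
    slope, bias = np.polyfit(probit(id_agreement), probit(ood_agreement), deg=1)
    # Agreement-on-the-line: reuse the same slope and bias for accuracy.
    return norm.cdf(slope * probit(id_acc) + bias)
```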
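The "Experiment Setup" row flags two transforms. Probit scaling is the `probit` transform in the sketch above. Temperature scaling is the standard single-parameter calibration of Guo et al. (2017): fit one scalar T on held-out labeled data so that softmax(logits / T) minimizes negative log-likelihood. The optimizer, learning rate, and iteration budget below are assumptions for illustration, since the paper does not report them.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fit a scalar temperature T by minimizing NLL of softmax(logits / T).

    logits: (N, C) held-out validation logits; labels: (N,) class indices.
    Optimizing log T keeps the temperature positive.
    """
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return float(log_t.exp())
```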