Automated Classification of Model Errors on ImageNet

Authors: Momchil Peychev, Mark Niklas Müller, Marc Fischer, Martin Vechev

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use our framework to comprehensively evaluate the error distribution of over 900 models.
Researcher Affiliation | Academia | Momchil Peychev, Mark Niklas Müller, Marc Fischer, Martin Vechev, Department of Computer Science, ETH Zurich, Switzerland, {momchil.peychev, mark.mueller, marc.fischer, martin.vechev}@inf.ethz.ch
Pseudocode | No | The paper describes its error classification pipeline in detail using prose and flowcharts (e.g., Figure 1 and Figure 8), but it does not include explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We release all our code at https://github.com/eth-sri/automated-error-analysis.
Open Datasets | Yes | We consider the validation set of the ILSVRC-2012 subset of ImageNet (Deng et al., 2009; Russakovsky et al., 2015), available under a non-commercial research license. More concretely, we use the subset of this validation set labeled by Shankar et al. (2020) and then Vasudevan et al. (2022), with the labels being available under Apache License 2.0. We further evaluate our pipeline on the ImageNet-A dataset (Hendrycks et al., 2021), available under MIT License.
Dataset Splits | Yes | We consider the validation set of the ILSVRC-2012 subset of ImageNet (Deng et al., 2009; Russakovsky et al., 2015).
Hardware Specification | Yes | After collecting all model outputs (6 days for ImageNet and 1 day for ImageNet-A on a single GeForce RTX 2080 Ti GPU), running our error analysis pipeline on all models takes 12 to 24 hours using a single GeForce RTX 2080 Ti GPU for ImageNet and ImageNet-A, respectively. (See the prediction-collection sketch after the table.)
Software Dependencies | No | Table 3 contains a list of all models we considered in this study and a subset of their metadata. The models were obtained from multiple sources: Torchvision, torch.hub, Hugging Face, and timm. (See the model-loading sketch after the table.)
Experiment Setup | No | The paper refers to 'full details on all 962 models we consider' in Appendix F, which lists model IDs, sources, architectures, and datasets, but it does not specify hyperparameters or detailed training configurations for these models or for the error analysis pipeline itself.
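
Model-loading sketch. The following is a minimal illustration of how pretrained ImageNet classifiers can be pulled from the four model sources named above (Torchvision, torch.hub, Hugging Face, and timm). The specific model names and weight identifiers are example choices, not necessarily members of the paper's 962-model pool; the authors' actual loading code is in the released repository.

    # Illustrative only: one pretrained ImageNet classifier from each model source.
    # Model names and weight identifiers are example choices, not the paper's exact models.
    import torch
    import torchvision.models as tv_models
    import timm
    from transformers import AutoModelForImageClassification

    # Torchvision: the weights enum selects an ImageNet-1k checkpoint.
    resnet = tv_models.resnet50(weights=tv_models.ResNet50_Weights.IMAGENET1K_V2)

    # torch.hub: fetches the model definition and weights from a GitHub repository.
    hub_model = torch.hub.load("pytorch/vision", "resnet50", weights="IMAGENET1K_V2")

    # timm: create_model with pretrained=True downloads the checkpoint.
    vit = timm.create_model("vit_base_patch16_224", pretrained=True)

    # Hugging Face: loads a hosted image-classification checkpoint.
    hf_vit = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")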
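
Prediction-collection sketch. Similarly, a minimal sketch of the output-collection step quoted in the hardware row: running one model over a local copy of the ImageNet validation images on a single GPU and caching its top-1 predictions. The dataset path, batch size, and model choice are assumptions for illustration; the paper's actual pipeline (including the ImageNet-A evaluation) is in the released repository.

    # Illustrative sketch: collect top-1 predictions of one model on the ImageNet
    # validation set using a single GPU. Path, batch size, and model are assumptions.
    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, models

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    weights = models.ResNet50_Weights.IMAGENET1K_V2
    model = models.resnet50(weights=weights).to(device).eval()

    # ImageFolder expects the validation images grouped into one folder per class.
    val_set = datasets.ImageFolder("/data/imagenet/val", transform=weights.transforms())
    loader = DataLoader(val_set, batch_size=64, num_workers=4)

    predictions = []
    with torch.no_grad():
        for images, _ in loader:
            logits = model(images.to(device))
            predictions.extend(logits.argmax(dim=1).cpu().tolist())

    # Cache the outputs so the error-analysis stage can run without re-running inference.
    torch.save(predictions, "resnet50_val_top1.pt")

ImageNet-A is distributed in the same per-class folder layout (WordNet-ID directories covering 200 of the 1,000 ImageNet classes), so an ImageFolder-style loader like the one above also applies there, with an additional mapping from folder order to ImageNet class indices.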