Automated Classification of Model Errors on ImageNet
Authors: Momchil Peychev, Mark Müller, Marc Fischer, Martin Vechev
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use our framework to comprehensively evaluate the error distribution of over 900 models. |
| Researcher Affiliation | Academia | Momchil Peychev, Mark Niklas Müller, Marc Fischer, Martin Vechev, Department of Computer Science, ETH Zurich, Switzerland {momchil.peychev, mark.mueller, marc.fischer, martin.vechev}@inf.ethz.ch |
| Pseudocode | No | The paper describes its error classification pipeline in detail using prose and flowcharts (e.g., Figure 1 and Figure 8), but it does not include explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release all our code at https://github.com/eth-sri/automated-error-analysis. |
| Open Datasets | Yes | We consider the validation set of the ILSVRC-2012 subset of ImageNet (Deng et al., 2009; Russakovsky et al., 2015), available under a non-commercial research license. More concretely, we use the subset of this validation set labeled by Shankar et al. (2020) and then Vasudevan et al. (2022), with the labels being available under Apache License 2.0. We further evaluate our pipeline on the ImageNet-A dataset (Hendrycks et al., 2021) available under MIT License. |
| Dataset Splits | Yes | We consider the validation set of the ILSVRC-2012 subset of ImageNet (Deng et al., 2009; Russakovsky et al., 2015) |
| Hardware Specification | Yes | After collecting all model outputs (6 days for ImageNet and 1 day for ImageNet-A on a single GeForce RTX 2080 Ti GPU), running our error analysis pipeline on all models takes 12 to 24 hours using a single GeForce RTX 2080 Ti GPU for ImageNet and ImageNet-A respectively. |
| Software Dependencies | No | Table 3 contains a list of all models we considered in this study and a subset of their metadata. The models were obtained from multiple sources: Torchvision, torch.hub, Hugging Face, and timm. |
| Experiment Setup | No | The paper refers to "full details on all 962 models we consider" in Appendix F, which lists model IDs, sources, architectures, and training datasets, but it does not specify hyperparameters or detailed training configurations for these models or for the error analysis pipeline itself. |