Automated Classification of Model Errors on ImageNet
Authors: Momchil Peychev, Mark Müller, Marc Fischer, Martin Vechev
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use our framework to comprehensively evaluate the error distribution of over 900 models. |
| Researcher Affiliation | Academia | Momchil Peychev, Mark Niklas Müller, Marc Fischer, Martin Vechev, Department of Computer Science, ETH Zurich, Switzerland {momchil.peychev, mark.mueller, marc.fischer, martin.vechev}@inf.ethz.ch |
| Pseudocode | No | The paper describes its error classification pipeline in detail using prose and flowcharts (e.g., Figure 1 and Figure 8), but it does not include explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release all our code at https://github.com/eth-sri/automated-error-analysis. |
| Open Datasets | Yes | We consider the validation set of the ILSVRC-2012 subset of ImageNet (Deng et al., 2009; Russakovsky et al., 2015), available under a non-commercial research license. More concretely, we use the subset of this validation set labeled by Shankar et al. (2020) and then Vasudevan et al. (2022), with the labels being available under Apache License 2.0. We further evaluate our pipeline on the ImageNet-A dataset (Hendrycks et al., 2021) available under MIT License. |
| Dataset Splits | Yes | We consider the validation set of the ILSVRC-2012 subset of ImageNet (Deng et al., 2009; Russakovsky et al., 2015) |
| Hardware Specification | Yes | After collecting all model outputs (6 days for ImageNet and 1 day for ImageNet-A on a single GeForce RTX 2080 Ti GPU), running our error analysis pipeline on all models takes 12 to 24 hours using a single GeForce RTX 2080 Ti GPU for ImageNet and ImageNet-A respectively. |
| Software Dependencies | No | Table 3 contains a list of all models we considered in this study and a subset of their metadata. The models were obtained from multiple sources: Torchvision, torch.hub, Hugging Face, and timm. |
| Experiment Setup | No | The paper refers to "full details on all 962 models we consider" in Appendix F, which lists model IDs, sources, architectures, and training datasets, but it does not specify hyperparameters or detailed training configurations for these models or for the error analysis pipeline itself. |