When does dough become a bagel? Analyzing the remaining mistakes on ImageNet

Authors: Vijay Vasudevan, Benjamin Caine, Raphael Gontijo Lopes, Sara Fridovich-Keil, Rebecca Roelofs

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To help contextualize progress on ImageNet and provide a more meaningful evaluation for today's state-of-the-art models, we manually review and categorize every remaining mistake that a few top models make and provide insights into the long-tail of errors on one of the most benchmarked datasets in computer vision.
Researcher Affiliation | Collaboration | Vijay Vasudevan, Benjamin Caine, Raphael Gontijo-Lopes, Sara Fridovich-Keil², Rebecca Roelofs; {vrv, rofls}@google.com; Google Research, Brain Team; ²University of California, Berkeley
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Dataset and analysis available at https://github.com/google-research/imagenet-mistakes. We have prepared a GitHub repository containing our mistake assessments for the two models for others to verify. We will also release the updated multi-label set for others to build upon our work. (A hedged sketch of loading such assessments appears after the table.)
Open Datasets | Yes | Image classification accuracy on the ImageNet dataset has been a barometer for progress in computer vision over the last decade. Several recent papers have questioned the degree to which the benchmark remains useful to the community [33, 3, 31, 42, 36], yet innovations continue to contribute gains to performance, with today's largest models achieving 90%+ top-1 accuracy. To help contextualize progress on ImageNet and provide a more meaningful evaluation for today's state-of-the-art models, we manually review and categorize every remaining mistake that a few top models make and provide insights into the long-tail of errors on one of the most benchmarked datasets in computer vision. We focus on the multi-label subset evaluation of ImageNet, where today's best models achieve upwards of 97% top-1 accuracy.
Dataset Splits | Yes | In this paper we analyze the ImageNet multi-label validation subsets [31], in which expert labelers were used to assess the correctness of model predictions through the year 2020, and on which a 1000-image human-evaluated subset provides a direct comparison to expert human performance. Exhaustively examining every mistake has been made more convenient and practical due to the quality of today's top models as well as the smaller subset of 20k validation images present in the multi-label set. (A minimal multi-label top-1 accuracy sketch appears after the table.)
Hardware Specification | No | The paper mentions that it 'is largely based on analyzing pre-trained models' and that it 'did no training specific to this work except for the fine-tuning required for measuring the impact of de-duplication of validation leakage examples', but it does not provide specific hardware details such as GPU/CPU models, memory, or cloud computing instance types used for these operations.
Software Dependencies | No | The paper mentions using specific models like 'ViT [6] model' and data like 'JFT-3B [34]', but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch/TensorFlow, CUDA versions).
Experiment Setup | Yes | To obtain an initial set of mistakes remaining on ImageNet, we used a standard ViT [6] model scaled to 3B parameters (ViT-3B) that was pre-trained on JFT-3B [34] and fine-tuned on ImageNet-1K [5], achieving a top-1 accuracy of 89.5% (details in the Appendix). For the ViT-3B model, we provide training details in the Appendix, though we note that reproducing the exact model is not the contribution of our paper. (A hedged evaluation sketch using a public ViT checkpoint appears after the table.)
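
The released repository (https://github.com/google-research/imagenet-mistakes) contains the per-image mistake assessments for the two reviewed models. The sketch below shows one plausible way to load and tally such assessments with pandas; the file name and column names are assumptions for illustration, not the repository's actual layout.

```python
import pandas as pd

# Hypothetical file and column names; consult the repository README
# for the actual layout of the released assessments.
df = pd.read_csv("imagenet-mistakes/mistake_assessments.csv")

# Tally the reviewed mistakes by assessed category (e.g., fine-grained
# confusion, spurious correlation, annotation problem).
print(df["category"].value_counts())
```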
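Under the multi-label evaluation described in the Dataset Splits row, a top-1 prediction counts as correct if it appears anywhere in the image's expert-curated set of acceptable labels. A minimal sketch, assuming hypothetical JSON files mapping image IDs to a predicted class and to a list of acceptable classes:

```python
import json

def multilabel_top1(predictions, multilabels):
    """Fraction of images whose top-1 prediction is in the acceptable set.

    predictions: {image_id: predicted_class}
    multilabels: {image_id: set of acceptable classes}
    """
    correct = sum(
        pred in multilabels.get(img, ())
        for img, pred in predictions.items()
    )
    return correct / len(predictions)

# Hypothetical per-image files; the formats are assumptions.
with open("predictions.json") as f:
    preds = json.load(f)
with open("multilabels.json") as f:
    ml = {k: set(v) for k, v in json.load(f).items()}

print(f"multi-label top-1: {multilabel_top1(preds, ml):.2%}")
```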
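The ViT-3B checkpoint in the Experiment Setup row was pre-trained on the proprietary JFT-3B dataset and is not publicly released, so an exact reproduction is not possible. As a stand-in, the sketch below measures top-1 accuracy of a public ViT checkpoint from timm; the model name and dataset path are assumptions, and the resulting accuracy will be lower than the paper's 89.5%.

```python
import torch
import timm
from timm.data import resolve_data_config, create_transform
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Public stand-in for the unreleased ViT-3B checkpoint.
model = timm.create_model("vit_large_patch16_224", pretrained=True).eval()

# Build the preprocessing pipeline matching the checkpoint's training config.
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

val = ImageFolder("/path/to/imagenet/val", transform=transform)  # assumed path
loader = DataLoader(val, batch_size=64, num_workers=8)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"top-1 accuracy: {correct / total:.2%}")
```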