When does dough become a bagel? Analyzing the remaining mistakes on ImageNet
Authors: Vijay Vasudevan, Benjamin Caine, Raphael Gontijo Lopes, Sara Fridovich-Keil, Rebecca Roelofs
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To help contextualize progress on ImageNet and provide a more meaningful evaluation for today's state-of-the-art models, we manually review and categorize every remaining mistake that a few top models make and provide insights into the long-tail of errors on one of the most benchmarked datasets in computer vision. |
| Researcher Affiliation | Collaboration | Vijay Vasudevan, Benjamin Caine, Raphael Gontijo-Lopes, Sara Fridovich-Keil², Rebecca Roelofs ({vrv, rofls}@google.com), Google Research, Brain Team; ²University of California, Berkeley |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Dataset and analysis available at https://github.com/google-research/imagenet-mistakes. We have prepared a GitHub repository containing our mistake assessments for the two models for others to verify. We will also release the updated multi-label set for others to build upon our work. |
| Open Datasets | Yes | Image classification accuracy on the ImageNet dataset has been a barometer for progress in computer vision over the last decade. Several recent papers have questioned the degree to which the benchmark remains useful to the community [33, 3, 31, 42, 36], yet innovations continue to contribute gains to performance, with today's largest models achieving 90%+ top-1 accuracy. We focus on the multi-label subset evaluation of ImageNet, where today's best models achieve upwards of 97% top-1 accuracy. |
| Dataset Splits | Yes | In this paper we analyze the ImageNet multi-label validation subsets [31], in which expert labelers were used to assess the correctness of model predictions through the year 2020, and on which a 1000-image human-evaluated subset provides a direct comparison to expert human performance. Exhaustively examining every mistake has been made more convenient and practical due to the quality of today's top models as well as the smaller subset of 20k validation images present in the multi-label set. |
| Hardware Specification | No | The paper mentions that it 'is largely based on analyzing pre-trained models' and that it 'did no training specific to this work except for the fine-tuning required for measuring the impact of de-duplication of validation leakage examples', but it does not provide specific hardware details such as GPU/CPU models, memory, or cloud computing instance types used for these operations. |
| Software Dependencies | No | The paper mentions using specific models like the 'ViT [6] model' and data like 'JFT-3B [34]', but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch/TensorFlow, CUDA versions). |
| Experiment Setup | Yes | To obtain an initial set of mistakes remaining on ImageNet, we used a standard ViT [6] model scaled to 3B parameters (ViT-3B) that was pre-trained on JFT-3B [34] and fine-tuned on ImageNet-1K [5], achieving a top-1 accuracy of 89.5% (details in Appendix ??). For the ViT-3B model, we provide training details in the Appendix, though we note that reproducing the exact model is not the contribution of our paper. |
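
The Dataset Splits and Open Datasets rows describe the multi-label evaluation protocol: a prediction counts as correct when the top-1 class falls within the image's set of expert-reviewed acceptable labels, rather than only matching the original single ImageNet label. Below is a minimal sketch of that scoring rule; the `predictions` and `multi_labels` structures are hypothetical placeholders, and the actual file formats released in the imagenet-mistakes repository may differ.

```python
# Minimal sketch of multi-label top-1 accuracy: a prediction is correct
# if it appears in the image's set of acceptable labels.
# `predictions` and `multi_labels` are hypothetical structures; the real
# formats in google-research/imagenet-mistakes may differ.
from typing import Dict, Set


def multi_label_accuracy(predictions: Dict[str, int],
                         multi_labels: Dict[str, Set[int]]) -> float:
    """Fraction of images whose top-1 prediction is among the valid labels."""
    correct = sum(
        1 for image_id, pred in predictions.items()
        if pred in multi_labels.get(image_id, set())
    )
    return correct / len(predictions)


# Toy usage: one correct and one incorrect prediction.
preds = {"ILSVRC2012_val_00000001": 65, "ILSVRC2012_val_00000002": 970}
labels = {"ILSVRC2012_val_00000001": {65, 58}, "ILSVRC2012_val_00000002": {795}}
print(multi_label_accuracy(preds, labels))  # 0.5
```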
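
The Experiment Setup row relies on a ViT-3B checkpoint pre-trained on JFT-3B, which is not publicly released. As a rough illustration only, the step of collecting top-1 predictions on ImageNet validation images can be sketched with an openly available ViT checkpoint from the timm library as a stand-in; this is not the paper's model or its reported accuracy.

```python
# Sketch of collecting top-1 ImageNet predictions from a pre-trained ViT.
# The paper's ViT-3B (pre-trained on JFT-3B) is not public, so a standard
# timm checkpoint is used here purely as a stand-in.
import timm
import torch
from PIL import Image

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
config = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**config)


@torch.no_grad()
def top1_prediction(image_path: str) -> int:
    """Return the model's top-1 ImageNet class index for a single image."""
    image = Image.open(image_path).convert("RGB")
    logits = model(transform(image).unsqueeze(0))
    return int(logits.argmax(dim=-1))
```

Predictions collected this way can then be scored against the multi-label set instead of the original single labels, mirroring the evaluation from which the paper's manual mistake review starts.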