Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
The Robustness Limits of SoTA Vision Models to Natural Variation
Authors: Mark Ibrahim, Quentin Garrido, Ari S. Morcos, Diane Bouchacourt
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To study this question, we develop a dataset of more than 7 million images with controlled changes in pose, position, background, lighting color, and size. We study not only how robust recent state-of-the-art models are, but also the extent to which models can generalize to variation in each of these factors. We consider a catalog of recent vision models, including vision transformers (ViT), self-supervised models such as masked autoencoders (MAE), and models trained on larger datasets such as CLIP. We find that even today's best models are not robust to common changes in pose, size, and background. When some samples varied during training, we found models required a significant portion of instances seen varying to generalize, though eventually robustness did improve. When variability is only witnessed for some classes, however, we found that models did not generalize to other classes unless the classes were very similar to those seen varying during training. |
| Researcher Affiliation | Industry | Mark Ibrahim, Quentin Garrido, Diane Bouchacourt, Fundamental AI Research (FAIR), Meta |
| Pseudocode | No | The paper describes methods and processes in narrative text and refers to figures and tables for results, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions existing codebases and models (e.g., 'Openclip, July 2021. URL https://doi.org/10.5281/zenodo.5143773') that were used as part of their study, but it does not provide an explicit statement or link for the source code related to the methodology or contributions described in this paper. |
| Open Datasets | No | Therefore we develop our own dataset based on 3D Warehouse (Trimble Inc) objects that we place in non-uniform backgrounds. We use 54 synsets from 3D Warehouse, and 50 objects for each synset. For the first 4 scalar factors (position, pose, size, lighting color), we use equally spaced scalar values. For the background, we use 5 background types (sky, water, city, home, grass) and 5 different backgrounds per type, with natural images coming from Li et al. (2022; 2021b;a). |
| Dataset Splits | No | The paper describes the division of the *generated* dataset into categories like "single factor (1.1M), paired factors (3.1M), and all factors (2.7M)" for different variation types. |
| Hardware Specification | Yes | The generation of all images takes around 1500 hours on a single NVIDIA V100 GPU, but can be easily parallelized. |
| Software Dependencies | No | To generate the scenes and the renderings we rely on Blender (Blender Online Community) and use BlenderProc (Denninger et al., 2019) to simplify the automation of the generation process. |
| Experiment Setup | Yes | Tables 1a and 1b show results for the best model after 10k steps of training with Adam on 6 log-scale learning rates (1e-2 to 1e-6), cross-validated on canonical top-1 accuracy for validation images. |
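The learning-rate sweep quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: `train_and_evaluate` is a hypothetical stand-in for the 10k Adam training steps, and its toy scoring function exists only so the sketch is runnable.

```python
import numpy as np

def train_and_evaluate(lr: float) -> float:
    """Hypothetical placeholder for 10k Adam steps at this learning rate.

    Returns a validation top-1 proxy (a toy curve peaking near lr = 1e-4)
    purely so the sweep below is executable; it is not the paper's model.
    """
    return 1.0 / (1.0 + abs(np.log10(lr) + 4.0))

# 6 log-scale learning rates spanning 1e-2 to 1e-6, as described in the row.
learning_rates = np.logspace(-2, -6, num=6)

# Cross-validate: train at each rate, keep the best by validation top-1.
scores = {float(lr): train_and_evaluate(lr) for lr in learning_rates}
best_lr = max(scores, key=scores.get)
print(f"best learning rate: {best_lr:.1e}")
```

The grid is log-spaced rather than linear because useful learning rates typically differ by orders of magnitude; selecting on canonical (unvaried) validation accuracy matches the cross-validation criterion quoted above.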