Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
The Robustness Limits of SoTA Vision Models to Natural Variation
Authors: Mark Ibrahim, Quentin Garrido, Ari S. Morcos, Diane Bouchacourt
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To study this question, we develop a dataset of more than 7 million images with controlled changes in pose, position, background, lighting color, and size. We study not only how robust recent state-of-the-art models are, but also the extent to which models can generalize to variation in each of these factors. We consider a catalog of recent vision models, including vision transformers (ViT), self-supervised models such as masked autoencoders (MAE), and models trained on larger datasets such as CLIP. We find that even today's best models are not robust to common changes in pose, size, and background. When some samples varied during training, we found models required a significant portion of instances seen varying to generalize, though eventually robustness did improve. When variability is only witnessed for some classes, however, we found that models did not generalize to other classes unless the classes were very similar to those seen varying during training. |
| Researcher Affiliation | Industry | Mark Ibrahim, Quentin Garrido, Diane Bouchacourt, Fundamental AI Research (FAIR), Meta |
| Pseudocode | No | The paper describes methods and processes in narrative text and refers to figures and tables for results, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions existing codebases and models (e.g., 'Openclip, July 2021. URL https://doi.org/10.5281/zenodo.5143773') that were used as part of their study, but it does not provide an explicit statement or link for the source code related to the methodology or contributions described in this paper. |
| Open Datasets | No | Therefore we develop our own dataset based on 3D Warehouse (Trimble Inc) objects that we place in non-uniform backgrounds. We use 54 synsets from 3D Warehouse, and 50 objects for each synset. For the first 4 scalar factors (position, pose, size, lighting color), we use equally spaced scalar values. For the background, we use 5 background types (sky, water, city, home, grass) and 5 different backgrounds per type, with natural images coming from Li et al. (2022; 2021b;a). |
| Dataset Splits | No | The paper describes the division of the *generated* dataset into categories like "single factor (1.1M), paired factors (3.1M), and all factors (2.7M)" for different variation types. |
| Hardware Specification | Yes | The generation of all images takes around 1500 hours on a single NVIDIA V100 GPU, but can be easily parallelized. |
| Software Dependencies | No | To generate the scenes and the renderings we rely on Blender (Blender Online Community) and use BlenderProc (Denninger et al., 2019) to simplify the automation of the generation process. |
| Experiment Setup | Yes | Tables 1a and 1b show results for the best model after 10k steps of training with Adam on 6 log-scale learning rates (1e-2 to 1e-6), cross-validated on canonical top-1 accuracy for validation images. |
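The learning-rate sweep quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: `train_and_evaluate` is a hypothetical stand-in for the 10k Adam training steps, and its toy scoring function exists only so the sketch is runnable.

```python
import numpy as np

def train_and_evaluate(lr: float) -> float:
    """Hypothetical placeholder for 10k Adam steps at this learning rate.

    Returns a validation top-1 proxy (a toy curve peaking near lr = 1e-4)
    purely so the sweep below is executable; it is not the paper's model.
    """
    return 1.0 / (1.0 + abs(np.log10(lr) + 4.0))

# 6 log-scale learning rates spanning 1e-2 to 1e-6, as described in the row.
learning_rates = np.logspace(-2, -6, num=6)

# Cross-validate: train at each rate, keep the best by validation top-1.
scores = {float(lr): train_and_evaluate(lr) for lr in learning_rates}
best_lr = max(scores, key=scores.get)
print(f"best learning rate: {best_lr:.1e}")
```

The grid is log-spaced rather than linear because useful learning rates typically differ by orders of magnitude; selecting on canonical (unvaried) validation accuracy matches the cross-validation criterion quoted above.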