Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

LANCE: Stress-testing Visual Models by Generating Language-guided Counterfactual Images

Authors: Viraj Prabhu, Sriram Yenamandra, Prithvijit Chattopadhyay, Judy Hoffman

NeurIPS 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We benchmark the performance of a diverse set of pretrained models on our generated data and observe significant and consistent performance drops. We further analyze model sensitivity across different types of edits, and demonstrate its applicability at surfacing previously unknown class-level model biases in ImageNet. (Abstract) In Section 4.1, we overview our experimental setup, describing the data, metrics, baselines, and implementation details used. Next, we present our results (Section 4.2), comparing the performance of a diverse set of pretrained models on the subset of the ImageNet test set, and on our generated counterfactual test sets. (Section 4 introduction)
Researcher Affiliation	Academia	Viraj Prabhu Sriram Yenamandra Prithvijit Chattopadhyay Judy Hoffman Georgia Institute of Technology EMAIL
Pseudocode	Yes	Algorithm 1 Generating Language-guided Counterfactual Images (Page 5)
Open Source Code	Yes	Code: https://github.com/virajprabhu/lance. (Abstract)
Open Datasets	Yes	Dataset. We evaluate LANCE on a subset of the ImageNet [2] validation set. (Section 4.1) All source images belong to the ImageNet dataset [2], which is distributed under a BSD-3 license that permits research and commercial use. (Appendix A)
Dataset Splits	Yes	We evaluate LANCE on a subset of the ImageNet [2] validation set. Specifically, we study the 15 classes included in the Hard ImageNet benchmark [62]. ... We consider the original ImageNet validation sets for these 15 classes, with 50 images/class, as our base set. (Section 4.1)
Hardware Specification	Yes	We run all experiments on a single NVIDIA A40 GPU. (Appendix F)
Software Dependencies	Yes	We use PyTorch [65] for all experiments. (Section 4.1) Stable Diffusion [16] version 1.4 (Table 6, Image Editing)
Experiment Setup	Yes	We include additional implementation details for hyperparameters used by LANCE for caption and image editing in Table 6. (Appendix F) Table 6: Hyperparameter values used for caption (top left), LLAMA finetuning (top right) and image editing (bottom). (Page 13)