Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent

Authors: Christy Li, Josep Lopez Camuñas, Jake Touchet, Jacob Andreas, Agata Lapedriza, Antonio Torralba, Tamar Rott Shaham

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our approach on a novel benchmark of 130 models designed to exhibit diverse visual attribute dependencies across 18 categories. Our results show that the agent's performance consistently improves with self-reflection, with a significant performance increase over non-reflective baselines. We further demonstrate that the agent identifies real-world visual attribute dependencies in state-of-the-art models, including CLIP's vision encoder and the YOLOv8 object detector.
Researcher Affiliation	Academia	1 MIT CSAIL, 2 Universitat Oberta de Catalunya, 3 Louisiana Tech, 4 Northeastern University
Pseudocode	Yes	A.9 API prompt class System: """ A Python class containing the vision model and the specific classifier to interact with. ... class Tools: """ A Python class containing tools to interact with the units implemented in the system class, in order to run experiments on it. ...
Open Source Code	No	The code and model benchmark will be made available upon acceptance.
Open Datasets	Yes	Prior evaluations often use models trained on datasets with known biases, such as WaterBirds [Wah et al., 2011] or CelebA [Liu et al., 2015], using label co-occurrence as a proxy for ground truth [Sagawa et al., 2020].
Dataset Splits	Yes	Similar to the self-evaluation score, given a candidate explanation, we start by generating 10 synthetic images that are expected to elicit high model scores and 10 that are expected to elicit low scores. We then pass these images through the model and record its actual responses.
Hardware Specification	Yes	All our experiments were conducted on a single NVIDIA RTX 3090 (24 GB) GPU.
Software Dependencies	Yes	We implement SAIA with a Claude-Sonnet-3.5 backbone. Please refer to Appendix A for implementation details, full prompts, and API.
Experiment Setup	Yes	In practice, we cap the total number of agent rounds (hypothesis-testing followed by self-reflection) to 10. If no hypothesis meets the self-evaluation threshold by that point, SAIA returns the hypothesis that achieved the best alignment between predicted and actual model behavior, typically the most recent one.
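
The round cap and fallback rule quoted under Experiment Setup can be illustrated with a minimal sketch. This is not the authors' implementation: `run_agent`, `propose`, `self_evaluate`, `Round`, and the numeric threshold are hypothetical names and assumptions introduced here, and the score is assumed to be a single number where higher means better alignment between predicted and actual model behavior.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_ROUNDS = 10  # cap on agent rounds stated in the quoted text


@dataclass
class Round:
    """One hypothesis-testing plus self-reflection round (hypothetical record)."""
    hypothesis: str
    score: float  # assumed self-evaluation score, higher is better


def run_agent(
    propose: Callable[[Optional[str]], str],
    self_evaluate: Callable[[str], float],
    threshold: float,
    max_rounds: int = MAX_ROUNDS,
) -> Round:
    """Iterate until a hypothesis meets the threshold or the cap is reached.

    `propose` and `self_evaluate` are placeholders for the agent's
    hypothesis-generation and self-reflection steps.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    history: List[Round] = []
    feedback: Optional[str] = None  # no feedback before the first round

    for _ in range(max_rounds):
        hypothesis = propose(feedback)
        score = self_evaluate(hypothesis)
        history.append(Round(hypothesis, score))
        if score >= threshold:
            return history[-1]  # stop early: threshold met
        # Feed the failed attempt back into the next proposal (assumed format).
        feedback = f"Previous hypothesis scored {score:.2f}: {hypothesis}"

    # Fallback: best-scoring hypothesis. Scanning newest-first means ties
    # resolve to the most recent round.
    return max(reversed(history), key=lambda r: r.score)
```

Calling `run_agent` with stub callables for `propose` and `self_evaluate` is enough to exercise both exit paths: the early return when the threshold is met, and the fallback after the tenth round.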