Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent

Authors: Christy Li, Josep Lopez Camuñas, Jake Touchet, Jacob Andreas, Agata Lapedriza, Antonio Torralba, Tamar Rott Shaham

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach on a novel benchmark of 130 models designed to exhibit diverse visual attribute dependencies across 18 categories. Our results show that the agent's performance consistently improves with self-reflection, with a significant performance increase over non-reflective baselines. We further demonstrate that the agent identifies real-world visual attribute dependencies in state-of-the-art models, including CLIP's vision encoder and the YOLOv8 object detector.
Researcher Affiliation | Academia | (1) MIT CSAIL; (2) Universitat Oberta de Catalunya; (3) Louisiana Tech; (4) Northeastern University
Pseudocode | Yes | A.9 API prompt: class System: """ A Python class containing the vision model and the specific classifier to interact with. ... class Tools: """ A Python class containing tools to interact with the units implemented in the system class, in order to run experiments on it. ...
Open Source Code | No | The code and model benchmark will be made available upon acceptance.
Open Datasets | Yes | Prior evaluations often use models trained on datasets with known biases, such as WaterBirds [Wah et al., 2011] or CelebA [Liu et al., 2015], using label co-occurrence as a proxy for ground truth [Sagawa et al., 2020].
Dataset Splits | Yes | Similar to the self-evaluation score, given a candidate explanation, we start by generating 10 synthetic images that are expected to elicit high model scores and 10 that are expected to elicit low scores. We then pass these images through the model and record its actual responses.
Hardware Specification | Yes | All our experiments were conducted on a single NVIDIA RTX 3090 (24 GB) GPU.
Software Dependencies | Yes | We implement SAIA with a Claude-Sonnet-3.5 backbone. Please refer to Appendix A for implementation details, full prompts, and API.
Experiment Setup | Yes | In practice, we cap the total number of agent rounds (hypothesis-testing followed by self-reflection) to 10. If no hypothesis meets the self-evaluation threshold by that point, SAIA returns the hypothesis that achieved the best alignment between predicted and actual model behavior, typically the most recent one.
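
The round cap quoted in the Experiment Setup row, together with the 10-high / 10-low image check quoted in the Dataset Splits row, can be illustrated with a small sketch. This is not the authors' implementation, which was not public at the time of the quoted response. The names `run_capped_loop`, `separation_score`, `propose`, `evaluate`, and `RoundResult` are hypothetical, and the pairwise scoring rule is an assumption chosen for illustration; the quoted text does not say how alignment between predicted and actual responses is computed. Only the cap of 10 rounds and the fallback to the best-scoring hypothesis come from the quoted text.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

MAX_ROUNDS = 10  # cap on agent rounds, as stated in the quoted excerpt


@dataclass
class RoundResult:
    hypothesis: str
    score: float


def separation_score(high_responses: Sequence[float],
                     low_responses: Sequence[float]) -> float:
    """Fraction of (high, low) pairs where the 'expected high' image really
    scores higher. ASSUMPTION: an illustrative metric, not the paper's."""
    pairs = [(h, l) for h in high_responses for l in low_responses]
    if not pairs:
        raise ValueError("both response lists must be non-empty")
    return sum(h > l for h, l in pairs) / len(pairs)


def run_capped_loop(propose: Callable[[List[RoundResult]], str],
                    evaluate: Callable[[str], float],
                    threshold: float,
                    max_rounds: int = MAX_ROUNDS
                    ) -> Tuple[RoundResult, List[RoundResult]]:
    """Run hypothesis-testing rounds until one meets `threshold` or the cap
    is reached. `propose` and `evaluate` are hypothetical callbacks."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    history: List[RoundResult] = []
    for _ in range(max_rounds):
        hypothesis = propose(history)   # new hypothesis, given past rounds
        score = evaluate(hypothesis)    # self-evaluation of that hypothesis
        history.append(RoundResult(hypothesis, score))
        if score >= threshold:
            return history[-1], history
    # No hypothesis met the threshold: fall back to the best-scoring one.
    # Iterating in reverse makes ties resolve to the most recent round
    # (a choice made for this sketch).
    best = max(reversed(history), key=lambda r: r.score)
    return best, history
```

In this sketch, `evaluate` would wrap image generation and model queries and end by calling something like `separation_score` on the 10 expected-high and 10 expected-low responses; the threshold value is left to the caller because the quoted text does not give one.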