Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Detecting Systematic Weaknesses in Vision Models along Predefined Human-Understandable Dimensions
Authors: Sujan Sai Gannamaneni, Rohil Prakash Rao, Michael Mock, Maram Akila, Stefan Wrobel
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our algorithm on both synthetic and real-world datasets, demonstrating its ability to recover human-understandable systematic weaknesses. Furthermore, using our approach, we identify systematic weaknesses of multiple pre-trained and publicly available state-of-the-art computer vision DNNs. |
| Researcher Affiliation | Academia | Sujan Sai Gannamaneni EMAIL Fraunhofer IAIS, Lamarr Institute Rohil Prakash Rao EMAIL Fraunhofer IAIS Michael Mock EMAIL Fraunhofer IAIS Maram Akila EMAIL Fraunhofer IAIS, Lamarr Institute Stefan Wrobel EMAIL Fraunhofer IAIS, University of Bonn |
| Pseudocode | Yes | Algorithm 1: Systematic Weakness Detector (SWD) |
| Open Source Code | Yes | Our implementation is available at https://github.com/sujan-sai-g/Systematic-Weakness-Detection. |
| Open Datasets | Yes | Five pre-trained models, ViT-B-16 (Dosovitskiy et al., 2021), Faster R-CNN (Ren et al., 2015), SETR PUP (Zheng et al., 2021), Panoptic FCN (Li et al., 2021), and YOLOv11m (Jocher & Qiu, 2024) are evaluated using five public datasets (CelebA (Liu et al., 2015), BDD100k (Yu et al., 2020), Cityscapes (Cordts et al., 2016), RailSem19 (Zendel et al., 2019), and EuroCity Persons (Braun et al., 2019)), respectively. |
| Dataset Splits | Yes | We obtain an accuracy of 94.48% on the 202,599 images in the CelebA dataset. The models are evaluated on their respective datasets, i.e., BDD100k, Cityscapes, and RailSem19. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or cloud configurations) used for running experiments are provided in the paper. |
| Software Dependencies | No | The paper refers to various models and methods like CLIP, SliceLine, Faster R-CNN, and YOLOv11m, but does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch versions, specific library versions) used for their implementation. |
| Experiment Setup | Yes | We restrict the number of combinations (level) to 2 in this work. We used the cutoff for the slice error as 1.5 · ē_D (where ē_D denotes the global average error over the dataset) for all experiments except the Panoptic FCN model evaluation. In the Panoptic FCN evaluation, we utilize the cutoff point for the slice error as 1.0 · ē_D, as the global average error is already quite high. |
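The slice-error cutoff quoted above can be illustrated as a simple filter: a slice is flagged as a systematic weakness when its mean error exceeds a multiple (1.5, or 1.0 for Panoptic FCN) of the global average error ē_D. This is a minimal sketch under assumed data structures, not the paper's implementation; the slice names and error values below are hypothetical.

```python
# Minimal sketch (not the paper's code): flag slices whose mean error
# exceeds cutoff_factor times the global average error over the dataset.
def flag_weak_slices(slice_errors, global_error, cutoff_factor=1.5):
    """Return names of slices with error > cutoff_factor * global_error."""
    return [name for name, err in slice_errors.items()
            if err > cutoff_factor * global_error]

# Hypothetical level-2 slices (pairs of metadata attributes) and errors.
slices = {"night+rain": 0.42, "day+clear": 0.10, "night+clear": 0.25}
global_err = 0.15  # assumed global average error over the dataset

print(flag_weak_slices(slices, global_err))  # cutoff is 1.5 * 0.15 = 0.225
```

Lowering `cutoff_factor` to 1.0, as done for the Panoptic FCN evaluation, admits more slices when the global error is already high.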