Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Detecting Systematic Weaknesses in Vision Models along Predefined Human-Understandable Dimensions
Authors: Sujan Sai Gannamaneni, Rohil Prakash Rao, Michael Mock, Maram Akila, Stefan Wrobel
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our algorithm on both synthetic and real-world datasets, demonstrating its ability to recover human-understandable systematic weaknesses. Furthermore, using our approach, we identify systematic weaknesses of multiple pre-trained and publicly available state-of-the-art computer vision DNNs. |
| Researcher Affiliation | Academia | Sujan Sai Gannamaneni EMAIL Fraunhofer IAIS, Lamarr Institute Rohil Prakash Rao EMAIL Fraunhofer IAIS Michael Mock EMAIL Fraunhofer IAIS Maram Akila EMAIL Fraunhofer IAIS, Lamarr Institute Stefan Wrobel EMAIL Fraunhofer IAIS, University of Bonn |
| Pseudocode | Yes | Algorithm 1: Systematic Weakness Detector (SWD) |
| Open Source Code | Yes | Our implementation is available at https://github.com/sujan-sai-g/Systematic-Weakness-Detection. |
| Open Datasets | Yes | Five pre-trained models, ViT-B-16 (Dosovitskiy et al., 2021), Faster R-CNN (Ren et al., 2015), SETR PUP (Zheng et al., 2021), Panoptic FCN (Li et al., 2021), and YOLOv11m (Jocher & Qiu, 2024) are evaluated using five public datasets (CelebA (Liu et al., 2015), BDD100k (Yu et al., 2020), Cityscapes (Cordts et al., 2016), RailSem19 (Zendel et al., 2019), and EuroCity Persons (Braun et al., 2019)), respectively. |
| Dataset Splits | Yes | We obtain an accuracy of 94.48% on the 202,599 images in the CelebA dataset. The models are evaluated on their respective datasets, i.e., BDD100k, Cityscapes, and RailSem19. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or cloud configurations) used for running experiments are provided in the paper. |
| Software Dependencies | No | The paper refers to various models and methods like CLIP, SliceLine, Faster R-CNN, and YOLOv11m, but does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch versions, specific library versions) used for their implementation. |
| Experiment Setup | Yes | We restrict the number of combinations (level) to 2 in this work. We used the cutoff for the slice error as 1.5 · ē_D (where ē_D denotes the global average error over the dataset) for all experiments except the Panoptic FCN model evaluation. In the Panoptic FCN evaluation, we utilize the cutoff point for the slice error as 1.0 · ē_D, as the global average error is already quite high. |
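The slice-error cutoff quoted above can be illustrated as a simple filter: a slice is flagged as a systematic weakness when its mean error exceeds a multiple (1.5, or 1.0 for Panoptic FCN) of the global average error ē_D. This is a minimal sketch under assumed data structures, not the paper's implementation; the slice names and error values below are hypothetical.

```python
# Minimal sketch (not the paper's code): flag slices whose mean error
# exceeds cutoff_factor times the global average error over the dataset.
def flag_weak_slices(slice_errors, global_error, cutoff_factor=1.5):
    """Return names of slices with error > cutoff_factor * global_error."""
    return [name for name, err in slice_errors.items()
            if err > cutoff_factor * global_error]

# Hypothetical level-2 slices (pairs of metadata attributes) and errors.
slices = {"night+rain": 0.42, "day+clear": 0.10, "night+clear": 0.25}
global_err = 0.15  # assumed global average error over the dataset

print(flag_weak_slices(slices, global_err))  # cutoff is 1.5 * 0.15 = 0.225
```

Lowering `cutoff_factor` to 1.0, as done for the Panoptic FCN evaluation, admits more slices when the global error is already high.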