Position: Do Not Explain Vision Models Without Context
Authors: Paulina Tomaszewska, Przemyslaw Biecek
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We fine-tuned ResNet-50 and ViT base models in a low-data regime, as is done in (Jia et al., 2022), using datasets from the Visual Task Adaptation Benchmark (VTAB) (Zhai et al., 2019). The models were pretrained on ImageNet: ResNet-50 in a supervised manner, ViT base in a contrastive manner (MoCo v3). We focused on the subset of VTAB called structured, where the labels depend on spatial context. In the experiments, we use two datasets: KITTI (Geiger et al., 2012), where images were collected using sensors in a car and the task is to predict the binned distance to the closest vehicle in the scene, and dSprites (Matthey et al., 2017), where images of simple shapes undergo rotations and other shifts in space and the task is to predict binned orientation. Hence, we analyze the only real-life dataset and one of the few synthetic datasets in VTAB. Having fine-tuned models of satisfactory performance (similar to that claimed by Jia et al. (2022)), we applied 5 popular XAI techniques: Gradient SHAP (Lundberg & Lee, 2017), Integrated Gradients (Sundararajan et al., 2017), Occlusion (Zeiler & Fergus, 2014), Saliency (Simonyan et al., 2013), and LIME (Ribeiro et al., 2016) to check whether they manage to explain the model's decisions correctly. (An illustrative code sketch follows the table.) |
| Researcher Affiliation | Academia | (1) Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland; (2) Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the methodology described is publicly available. |
| Open Datasets | Yes | We fine-tuned ResNet-50 and ViT base models... using datasets from the Visual Task Adaptation Benchmark (VTAB) (Zhai et al., 2019). ... In the experiments, we use two datasets: KITTI (Geiger et al., 2012)... dSprites (Matthey et al., 2017)... |
| Dataset Splits | No | The paper mentions using VTAB, KITTI, and dsprites datasets but does not specify the exact train/validation/test split percentages, sample counts, or refer to predefined splits with citations for the splits themselves. It only mentions fine-tuning models on these datasets. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or other computing infrastructure used for the experiments. |
| Software Dependencies | No | The paper mentions using Resnet50, ViT, and XAI techniques (Gradient SHAP, Integrated gradients, Occlusion, Saliency, LIME), but it does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | No | The paper mentions 'fine-tuned ResNet-50 and ViT base models in low data regime as it is done in work (Jia et al., 2022)' but does not provide specific hyperparameters (e.g., learning rate, batch size, epochs) or detailed training configurations for their experiments in the main text. |
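
The Research Type row above describes the paper's pipeline: fine-tune ImageNet-pretrained ResNet-50 and ViT models on VTAB structured tasks, then run five attribution methods over the fine-tuned models. As a rough illustration of that pipeline, the sketch below swaps in a new classification head on a ResNet-50 and applies four of the named methods. It is a minimal sketch, not the authors' code: the attribution library (Captum), the class count, the zero baseline, the occlusion window settings, and the input tensor are all assumptions not stated in the paper.

```python
# Illustrative sketch (not the authors' code): applying the attribution methods
# named in the paper to a fine-tuned ResNet-50, using Captum as an assumed library.
import torch
from torchvision.models import resnet50
from captum.attr import GradientShap, IntegratedGradients, Occlusion, Saliency

num_classes = 4                       # assumed: e.g. binned-distance classes for KITTI
model = resnet50(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)  # new head for fine-tuning
model.eval()

x = torch.randn(1, 3, 224, 224)       # placeholder for a preprocessed input image
target = model(x).argmax(dim=1)       # explain the predicted class
baseline = torch.zeros_like(x)        # assumed all-zero baseline

attributions = {
    "Gradient SHAP": GradientShap(model).attribute(x, baselines=baseline, target=target),
    "Integrated Gradients": IntegratedGradients(model).attribute(x, baselines=baseline, target=target),
    "Occlusion": Occlusion(model).attribute(
        x, sliding_window_shapes=(3, 16, 16), strides=(3, 8, 8), target=target),
    "Saliency": Saliency(model).attribute(x, target=target),
}
# LIME is available as captum.attr.Lime but typically needs a feature mask
# (e.g. superpixels), so it is omitted from this minimal sketch.
```

The occlusion window and stride sizes above are illustrative only; the paper does not report the perturbation settings, baselines, or attribution hyperparameters it used, which is consistent with the "No" entries for Software Dependencies and Experiment Setup in the table.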