Optimizing Relevance Maps of Vision Transformers Improves Robustness
Authors: Hila Chefer, Idan Schwartz, Lior Wolf
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In an extensive battery of experiments we show that (i) the classification accuracy on datasets from shifted domains increases considerably. This includes real-world unbiased and adversarial datasets, as well as synthetic ones that were created specifically to measure the robustness of the classification model, (ii) the resulting relevance maps demonstrate a significant improvement in focusing on the foreground of the image, i.e. the object, rather than on its background. |
| Researcher Affiliation | Academia | Hila Chefer Idan Schwartz Lior Wolf School of Computer Science Tel-Aviv University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at: https://github.com/hila-chefer/RobustViT. |
| Open Datasets | Yes | We conduct our experiments on ImageNet-v2 [39], ImageNet-A [26], ImageNet-R [25], ImageNet-Sketch [56], ObjectNet [4], and SI-Score [13]. |
| Dataset Splits | Yes | We use 3 training images from 500 ImageNet classes for our finetuning (overall 1,500 samples), and another 414 images as a validation set. |
| Hardware Specification | Yes | The small, base models are finetuned on a single RTX 2080 Ti GPU, and the large models on a single Tesla V100 GPU. |
| Software Dependencies | No | The paper mentions using "the implementation and pre-trained weights from [59]", which refers to the "PyTorch Image Models" library, but does not specify version numbers for PyTorch or other libraries needed to reproduce the software environment. |
| Experiment Setup | Yes | All models are finetuned as described in Sec. 3 for 50 epochs, with a batch size of 8. We use 3 training images from 500 ImageNet classes for our finetuning (overall 1,500 samples), and another 414 images as a validation set. The learning rate of each model is determined using a grid search between the values 5e-7 and 5e-6. All our experiments apply the same choice of λ_bg = 2, λ_fg = 0.3. The overall loss for the finetuning process is, therefore: L = λ_relevance · L_relevance + λ_classification · L_classification, where λ_relevance = 0.8 and λ_classification = 0.2 remain constant in all our experiments. |
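The Experiment Setup row quotes the weights used to combine the finetuning losses. Below is a minimal sketch of how that combination could be assembled in PyTorch, assuming the relevance loss is the weighted sum of its background and foreground terms with λ_bg and λ_fg, as the quoted setup suggests; the function and variable names are illustrative and are not taken from the authors' RobustViT repository.

```python
import torch

# Hyperparameters quoted in the Experiment Setup row above.
LAMBDA_RELEVANCE = 0.8
LAMBDA_CLASSIFICATION = 0.2
LAMBDA_BG = 2.0   # weight on the background relevance term (λ_bg)
LAMBDA_FG = 0.3   # weight on the foreground relevance term (λ_fg)


def combined_loss(loss_bg: torch.Tensor,
                  loss_fg: torch.Tensor,
                  loss_classification: torch.Tensor) -> torch.Tensor:
    """Combine relevance and classification terms with the quoted weights.

    loss_bg / loss_fg are the background / foreground relevance terms,
    and loss_classification is the standard classification loss; each is
    assumed to be a scalar tensor computed elsewhere in the training loop.
    """
    loss_relevance = LAMBDA_BG * loss_bg + LAMBDA_FG * loss_fg
    return (LAMBDA_RELEVANCE * loss_relevance
            + LAMBDA_CLASSIFICATION * loss_classification)
```

With these constants fixed across all experiments, the only per-model tuning reported is the learning-rate grid search between 5e-7 and 5e-6.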