Regularizing Attention Networks for Anomaly Detection in Visual Question Answering

Authors: Doyup Lee, Yeongjae Cheon, Wook-Shin Han (pp. 1845-1853)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the robustness of state-of-the-art VQA models to five different anomalies, including worst-case scenarios, the most frequent scenarios, and the current limitation of VQA models. Unlike the results in unimodal tasks, the maximum confidence of answers in VQA models cannot detect anomalous inputs, and post-training of the outputs, such as outlier exposure, is ineffective for VQA models. Thus, we propose an attention-based method, which uses the confidence of reasoning between input images and questions and shows much more promising results than the previous methods in unimodal tasks. In addition, we show that a maximum-entropy regularization of attention networks can significantly improve the attention-based anomaly detection of VQA models. Thanks to their simplicity, attention-based anomaly detection and the regularization are model-agnostic methods, which can be used for the various cross-modal attentions in state-of-the-art VQA models. The results imply that cross-modal attention in VQA is important to improve not only VQA accuracy, but also the robustness to various anomalies. (A minimal sketch of the attention score and the entropy regularizer follows the table.)
Researcher Affiliation | Collaboration | Doyup Lee1, Yeongjae Cheon2, Wook-Shin Han1*; POSTECH1, South Korea; Kakao Brain2, South Korea; {doyup.lee, wshan}@postech.ac.kr1, yeongjae.cheon@kakaobrain.com2
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | All codes are implemented with PyTorch 0.4.1 and available at https://github.com/LeeDoYup/AnomalyDetectionVQA.
Open Datasets | Yes | The VQA v2 dataset (Goyal et al. 2017) is used for training and is considered normal. Test samples of MNIST, SVHN, Fashion MNIST, CIFAR-10, and Tiny ImageNet are used for OOD images. The 20 Newsgroups, Reuters 52, and IMDB movie review datasets are used for OOD questions. For irrelevant questions, two test datasets are used: 1) Visual vs. Non-visual Question (VNQ) (Ray et al. 2016) contains general knowledge or philosophical questions. 2) Question Relevance Prediction and Explanation (QRPE) (Mahendru et al. 2017) contains questions with false premises about the existence of visual objects in the VQA v2 images.
Dataset Splits | Yes | We use 10% of training samples to determine the increased temperature T and δ, maximizing AUROC scores on these samples. (A sketch of the grid search over T and δ follows the table.)
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, cloud instance types) used for running the experiments were provided in the paper.
Software Dependencies | Yes | All codes are implemented with PyTorch 0.4.1.
Experiment Setup | Yes | K = 36 objects are detected by a pretrained Faster R-CNN (Ren et al. 2015), and a 2048-dimensional vector for each object is extracted by a pretrained ResNet-152 (He et al. 2016). Question tokens are trimmed to a maximum of 14 words, and pretrained GloVe (Pennington, Socher, and Manning 2014) is used for word embedding. The batch size is 256. For regularization of the attention network, we use training samples of Tiny ImageNet, VNQ, and QRPE for P_anomaly in Eq. (5). We fine-tune the pretrained VQA models for 15 epochs, and the λ in Eq. (5) is set to 0.00001. The temperature is 1.0 in training, and increasing T at test time is known to improve confidence calibration and OOD detection. (A sketch assembling these settings into a fine-tuning step follows the table.)
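The attention-based detection and maximum-entropy regularization described in the Research Type row can be sketched in a few lines of PyTorch. This is a minimal sketch, not the authors' code: the function names, the shape of the attention logits, and the epsilon used for numerical stability are assumptions; only the idea (score a sample by how peaked its cross-modal attention is, and push attention toward uniform on anomalous inputs) comes from the paper.

```python
import torch
import torch.nn.functional as F

def attention_anomaly_score(attn_logits: torch.Tensor) -> torch.Tensor:
    """Score each sample by the confidence of its cross-modal attention.

    attn_logits: (batch, num_objects) unnormalized attention over image regions.
    A peaked (confident) attention distribution suggests a normal image-question
    pair; a flat one suggests an anomaly. Higher score = more likely normal.
    """
    attn = F.softmax(attn_logits, dim=-1)
    return attn.max(dim=-1)[0]

def max_entropy_attention_penalty(attn_logits: torch.Tensor) -> torch.Tensor:
    """Negative entropy of the attention distribution (to be minimized).

    Minimizing this term maximizes attention entropy on anomalous inputs,
    i.e. it pushes their attention toward uniform, so that confident
    attention remains a signature of normal inputs.
    """
    attn = F.softmax(attn_logits, dim=-1)
    return (attn * torch.log(attn + 1e-12)).sum(dim=-1).mean()
```

At test time a sample would be flagged as anomalous when attention_anomaly_score falls below a threshold chosen on held-out data.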
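For the Dataset Splits row, the search over the test-time temperature T and δ on the 10% held-out split might look like the sketch below. The candidate grids, the score_fn interface, and the use of scikit-learn's roc_auc_score are illustrative assumptions rather than the paper's procedure.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def select_T_and_delta(score_fn, val_normal, val_anomaly,
                       temps=(1, 10, 100, 1000),
                       deltas=(0.0, 0.0005, 0.001, 0.002)):
    """Grid-search the test-time temperature T and perturbation delta.

    score_fn(samples, T, delta) -> per-sample confidence scores (higher = normal).
    The (T, delta) pair maximizing AUROC on the held-out 10% split is kept.
    """
    labels = np.concatenate([np.ones(len(val_normal)), np.zeros(len(val_anomaly))])
    best_T, best_delta, best_auroc = None, None, -1.0
    for T in temps:
        for delta in deltas:
            scores = np.concatenate([score_fn(val_normal, T, delta),
                                     score_fn(val_anomaly, T, delta)])
            auroc = roc_auc_score(labels, scores)
            if auroc > best_auroc:
                best_T, best_delta, best_auroc = T, delta, auroc
    return best_T, best_delta, best_auroc
```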
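Finally, the Experiment Setup row can be assembled into a fine-tuning sketch. The model interface (returning answer logits and attention logits), the anomaly_loader, and the optimizer are assumptions; the constants (36 objects with 2048-d ResNet-152 features, 14-token GloVe questions, batch size 256, 15 epochs, λ = 1e-5, training temperature 1.0) are the values reported above.

```python
import torch
import torch.nn.functional as F

K_OBJECTS = 36    # Faster R-CNN regions per image, each a 2048-d ResNet-152 feature
MAX_Q_LEN = 14    # question tokens, embedded with pretrained GloVe
BATCH_SIZE = 256
EPOCHS = 15       # fine-tuning epochs for the pretrained VQA models
LAMBDA = 1e-5     # weight of the max-entropy regularizer in Eq. (5)

def fine_tune(model, vqa_loader, anomaly_loader, optimizer, device="cuda"):
    """Fine-tune a pretrained VQA model with attention entropy regularization.

    Assumed interfaces (hypothetical, for illustration):
      vqa_loader yields (img_feats [B, 36, 2048], q_tokens [B, 14], answers);
      anomaly_loader yields anomalous (image, question) pairs from P_anomaly;
      model(img, q) returns (answer_logits, attention_logits).
    """
    model.train()
    for _ in range(EPOCHS):
        for (img, q, ans), (a_img, a_q) in zip(vqa_loader, anomaly_loader):
            img, q, ans = img.to(device), q.to(device), ans.to(device)
            a_img, a_q = a_img.to(device), a_q.to(device)

            logits, _ = model(img, q)           # temperature is 1.0 during training
            vqa_loss = F.binary_cross_entropy_with_logits(logits, ans)

            _, attn_logits = model(a_img, a_q)  # attention over the K_OBJECTS regions
            attn = F.softmax(attn_logits, dim=-1)
            neg_entropy = (attn * torch.log(attn + 1e-12)).sum(dim=-1).mean()

            loss = vqa_loss + LAMBDA * neg_entropy  # Eq. (5): VQA loss + lambda * (-H(attn))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```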