Semantic-Guided Multi-Attention Localization for Zero-Shot Learning

Authors: Yizhe Zhu, Jianwen Xie, Zhiqiang Tang, Xi Peng, Ahmed Elgammal

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments on three widely used zero-shot learning benchmarks, we show the efficacy of the multi-attention localization and our proposed approach improves the state-of-the-art results by a considerable margin.
Researcher Affiliation | Collaboration | Yizhe Zhu, Rutgers University, yizhe.zhu@rutgers.edu; Jianwen Xie, Hikvision Research Institute, jianwen@ucla.edu; Zhiqiang Tang, Rutgers University, zhiqiang.tang@rutgers.edu; Xi Peng, University of Delaware, xipeng@udel.edu; Ahmed Elgammal, Rutgers University, elgammal@cs.rutgers.edu
Pseudocode | No | The paper includes mathematical equations and descriptions of the model, but no explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement about open-sourcing code or a link to a code repository for the described methodology.
Open Datasets | Yes | We use three widely used zero-shot learning datasets: Caltech-UCSD-Birds 200-2011 (CUB) [37], Oxford Flowers (FLO) [38], Animals with Attributes (AwA) [22].
Dataset Splits | Yes | Hyper-parameters in our models are obtained by grid search on the validation set.
Hardware Specification | Yes | We consistently adopt VGG19 as the backbone and train the model with a batch size of 32 on two GPUs (Titan X).
Software Dependencies | No | We implement our approach on the Pytorch Framework. No specific version number for PyTorch or other software dependencies is provided.
Experiment Setup | Yes | We implement our approach on the Pytorch Framework. For the multi-attention subnet, we take the images of size 448×448 as input in order to achieve high-resolution attention maps. For the joint feature embedding subnet, we resize all the input images to the size of 224×224. We consistently adopt VGG19 as the backbone and train the model with a batch size of 32 on two GPUs (Titan X). We use the SGD optimizer with the learning rate of 0.05, the momentum of 0.9, and weight decay of 5×10⁻⁴ to optimize the objective functions. The learning rate is decayed by 0.1 on the plateau, and the minimum one is set to be 5×10⁻⁴. Hyper-parameters in our models are obtained by grid search on the validation set. The margins in Eq. 7 and Eq. 10 are set to be 0.2 and 0.8, respectively. k in Eq. 8 is set to be 10. The number of parts is set to be 2, since we find that increasing the number of parts will result in little improvement on the zero-shot learning performance and lead to attention redundancy, i.e., maps attend to the same region.
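
The settings quoted in the Experiment Setup row are concrete enough to express in code. The following is a minimal PyTorch sketch of that configuration, not the authors' released implementation: the plain torchvision VGG19 stand-in, the transform pipelines, and the monitored metric are assumptions, and the paper's multi-attention and joint embedding subnets are not reproduced.

```python
# Minimal sketch of the training configuration reported in the paper.
# Assumptions: a plain torchvision VGG19 stands in for the full model,
# and ReduceLROnPlateau approximates "decayed by 0.1 on the plateau".
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torchvision import models, transforms

# Input pipelines: 448x448 images for the multi-attention subnet,
# 224x224 images for the joint feature embedding subnet.
attention_transform = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])
embedding_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# VGG19 backbone stand-in (the paper adds attention and embedding
# heads on top of this backbone; those are omitted here).
model = models.vgg19(weights=None)

# SGD with lr=0.05, momentum=0.9, and weight decay 5e-4, as reported.
optimizer = SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=5e-4)

# Decay the learning rate by 0.1 when the monitored metric plateaus,
# with a floor of 5e-4, matching the reported schedule.
scheduler = ReduceLROnPlateau(optimizer, factor=0.1, min_lr=5e-4)

# Per epoch (batch size 32 in the paper), after computing a validation
# metric, step the scheduler:
#   scheduler.step(val_loss)
```

ReduceLROnPlateau with the default mode='min' matches a loss-based plateau criterion; the paper does not state which metric the authors monitored, so val_loss above is a placeholder.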