Multi-Attention Based Visual-Semantic Interaction for Few-Shot Learning

Authors: Peng Zhao, Yin Wang, Wei Wang, Jie Mu, Huiting Liu, Cong Wang, Xiaochun Cao

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on four benchmark datasets demonstrate that our proposed MAVSI could outperform existing state-of-the-art FSL methods.
Researcher Affiliation | Academia | (1) School of Computer Science and Technology, Anhui University; (2) School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University; (3) School of Data Science and Artificial Intelligence, Dongbei University of Finance and Economics; (4) Department of Computing, The Hong Kong Polytechnic University
Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing its source code or a link to a code repository.
Open Datasets | Yes | miniImageNet [Vinyals et al., 2016] consists of 100 classes. These classes are divided into 64, 16, and 20 for training, validation, and testing. tieredImageNet [Ren et al., 2018] contains 608 classes, split into 351, 97, and 160 for training, validation, and testing. CIFAR-FS [Bertinetto et al., 2019] consists of 100 classes. These classes are divided into 64, 16, and 20 for training, validation, and testing. CUB-200-2011 [Wah et al., 2011] contains images from 200 bird species, where 200 species are divided into 100, 50, and 50 for training, validation, and testing, respectively.
Dataset Splits | Yes | Same supporting passage as Open Datasets above: miniImageNet 64/16/20, tieredImageNet 351/97/160, CIFAR-FS 64/16/20, and CUB-200-2011 100/50/50 classes for training/validation/testing. (The episodic sampling these splits imply is sketched after this table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models) used for running its experiments.
Software Dependencies | No | The paper mentions using GloVe as a semantic extractor but does not provide specific version numbers for software dependencies such as programming languages or libraries. (A sketch of loading GloVe class-name embeddings follows the table.)
Experiment Setup | Yes | Similar to previous works [Xing et al., 2019; Schwartz et al., 2022; Yang et al., 2022], we utilize ResNet-12 as the backbone network and modify the number of convolutional filters from [64, 128, 256, 512] to [64, 160, 320, 640]. In all cases, the comparison network F is an MLP with a Leaky ReLU-activated hidden layer, and the relation network consists of convolutional layers and the MLP. We use GloVe [Pennington et al., 2014] as the semantic extractor, which is pre-trained on a large corpus. Our experiments are implemented under 5-way 1-shot and 5-way 5-shot settings. The input image size is 84×84. Following [Peng et al., 2019], we train the model for 150 epochs, with 800 episodes in each epoch. We use the Adam optimizer with a learning rate of 5e-3 and weight decay of 5e-6. The learning rate is dropped by half every 6,000 episodes, and other parameters such as λ, γ, and the temperature parameter τ are adjusted during end-to-end training. (The quoted optimizer and schedule are sketched in code after the table.)
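
To make the episodic protocol behind the quoted splits concrete, here is a minimal Python sketch of 5-way K-shot episode sampling. The images_by_class dictionary, the sample_episode name, and the 15-query default are illustrative assumptions; the paper does not describe its data loader.

    import random

    def sample_episode(images_by_class, n_way=5, k_shot=1, n_query=15):
        # images_by_class: dict of class id -> list of image paths
        # (hypothetical structure; not specified in the paper).
        classes = random.sample(sorted(images_by_class), n_way)
        support, query = [], []
        for label, cls in enumerate(classes):
            picks = random.sample(images_by_class[cls], k_shot + n_query)
            support += [(path, label) for path in picks[:k_shot]]
            query += [(path, label) for path in picks[k_shot:]]
        return support, query

Under the paper's splits, training episodes would be drawn from the 64 miniImageNet training classes, while validation and test episodes come from the disjoint 16- and 20-class pools.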
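The paper names GloVe [Pennington et al., 2014] as its semantic extractor without pinning a release, so the following sketch loads the standard GloVe text format. The glove.6B.300d.txt filename and the word-averaging of multi-word class names are assumptions, not details from the paper.

    import numpy as np

    def load_glove(path="glove.6B.300d.txt"):  # assumed release, not stated in the paper
        # Parse the standard GloVe text format: "word v1 v2 ... vd" per line.
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, *values = line.rstrip().split(" ")
                vectors[word] = np.asarray(values, dtype=np.float32)
        return vectors

    def class_embedding(name, vectors):
        # Average word vectors for multi-word class names, e.g. "house finch"
        # (a common convention; the paper does not say how it handles these).
        words = [w for w in name.lower().split() if w in vectors]
        return np.mean([vectors[w] for w in words], axis=0)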
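The quoted optimization details translate directly into PyTorch, as wired together below. The stand-in module and placeholder loss are assumptions, since MAVSI itself is not released; only the optimizer, weight decay, and step schedule come from the paper.

    import torch
    import torch.nn as nn

    # Stand-in for the MAVSI model; the paper uses a ResNet-12 backbone with
    # filter widths widened from [64, 128, 256, 512] to [64, 160, 320, 640].
    model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.LeakyReLU())
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-3, weight_decay=5e-6)
    # "Dropped by half every 6,000 episodes": stepping the scheduler once per
    # episode (800 episodes/epoch, 150 epochs) halves the rate every 6,000 steps.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6000, gamma=0.5)

    for epoch in range(150):
        for episode in range(800):
            # Placeholder loss on an 84x84 input; the real episodic loss
            # over support/query sets is not reproducible from the paper.
            loss = model(torch.randn(5, 3, 84, 84)).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()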