Rethinking Generalization in Few-Shot Classification

Authors: Markus Hiller, Rongkai Ma, Mehrtash Harandi, Tom Drummond

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show the competitiveness of our approach, achieving new state-of-the-art results on four popular few-shot classification benchmarks for 5-shot and 1-shot scenarios.
Researcher Affiliation | Academia | Markus Hiller¹, Rongkai Ma², Mehrtash Harandi², Tom Drummond¹; ¹School of Computing and Information Systems, The University of Melbourne; ²Department of Electrical and Computer Systems Engineering, Monash University; markus.hiller@student.unimelb.edu.au, {rongkai.ma, mehrtash.harandi}@monash.edu, tom.drummond@unimelb.edu.au
Pseudocode | No | The paper describes the method using figures (Figure 2 and Figure 3) but does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | [...] To alleviate the negative influence of image-level annotations and to avoid supervision collapse, we decompose the images into patches representing local regions, each having a higher likelihood of being dominated by only one entity. To overcome the lack of such fine-grained annotations, we employ self-supervised training with Masked Image Modelling as pretext task [60] and use a Vision Transformer architecture [10] as encoder due to its patch-based nature. We build our classification around the concept of learning task-specific similarity between local regions as a function of the support set at inference time. To this end, we first create a prior similarity map by establishing semantic patch correspondences between all support set samples irrespective of their class, i.e., also between entities that might not be relevant or potentially even harmful for correct classification (Figure 1, step (1)). Consider the depicted support set with only two classes: person and cat. The lower-right image is part of our support set for cat, and the dog just happens to be in the image. Now in the query sample that shall be classified, the image depicts a person patting their dog. We will thus correctly detect a correspondence of the two dogs across those two images, as well as between the person patches and the other samples of the person support set class. While the correspondences between the person regions are helpful, there is no dog class in the actual support set (i.e., dog is out-of-task information), rendering this correspondence harmful for classification since it would indicate that the query is connected to the image with the cat label. This is where our token importance weighting comes into play. We infer an importance weight for each token based on its contribution towards correct classification of the other support set samples, actively strengthening intra-class similarities and inter-class differences by jointly considering all available information; in other words, we learn which tokens help or harm our classification objective (Figure 1, step (2)). These importance-reweighted support set embeddings are then used as the basis for our similarity-based query sample classification (step (3)). Our main contributions include the following: [...] Footnote 2: Our code is publicly available at https://github.com/mrkshllr/FewTURE
Open Datasets | Yes | We train and evaluate our methods using four popular few-shot classification benchmarks, namely miniImageNet [49], tieredImageNet [38], CIFAR-FS [4] and FC-100 [35].
Dataset Splits | Yes | We follow the meta-learning protocol of previous works [49] to formulate the few-shot classification problem with episodic training and testing. An episode $E$ is composed of a support set $X_s = \{(x_s^{nk}, y_s^{nk}) \mid n = 1, \dots, N;\ k = 1, \dots, K;\ y_s^{nk} \in C_{train}\}$, where $x_s^{nk}$ denotes the $k$-th sample of class $n$ with label $y_s^{nk}$, and a query set $X_q = \{(x_q^{n}, y_q^{n}) \mid n = 1, \dots, N\}$, where $x_q^{n}$ denotes a query sample of class $n$ with label $y_q^{n}$. We evaluate at each epoch on 600 randomly sampled episodes from the respective validation set to select the best set of parameters. During test time, we randomly sample 600 episodes from the test set to evaluate our model.
Hardware Specification | Yes | We use 4 Nvidia A100 GPUs with 40GB each for our ViT and 8 such GPUs for our Swin models.
Software Dependencies | No | The paper does not provide specific software dependencies, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9), used to replicate the experiment.
Experiment Setup | Yes | Our ViT and Swin architectures are trained with a batch size of 512 for 1600 and 800 epochs, respectively. We use 4 Nvidia A100 GPUs with 40GB each for our ViT and 8 such GPUs for our Swin models. [...] We generally train for up to 200 epochs but find most architectures to converge earlier. [...] We use SGD as optimizer with a learning rate of 0.1 for the token importance weight generation. [...] We found a local window of m = 5 to work well throughout our experiments for both architectures.
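
To make the token importance weighting quoted under "Open Source Code" more concrete, the following is a minimal PyTorch sketch of what such inference-time reweighting could look like. It is a simplification under assumed tensor shapes and a hypothetical leave-one-out objective, not the authors' exact formulation (see https://github.com/mrkshllr/FewTURE for the real implementation); the function name token_importance_weights and all argument names are illustrative.

import torch
import torch.nn.functional as F

def token_importance_weights(support, labels, steps=15, lr=0.1):
    # support: (S, P, D) patch embeddings of S support images (P patches,
    # D dims); labels: (S,) episode-local class ids in {0, ..., N-1}.
    # Learns one weight per support token by SGD (lr=0.1 as quoted above),
    # rewarding tokens that make each held-out support image most similar
    # to the remaining images of its own class. Assumes K > 1 shots.
    S, P, _ = support.shape
    feats = F.normalize(support, dim=-1)           # cosine-similarity space
    logw = torch.zeros(S, P, requires_grad=True)   # per-token log-weights
    opt = torch.optim.SGD([logw], lr=lr)
    n_classes = int(labels.max()) + 1
    eye = torch.eye(S, dtype=torch.bool)
    for _ in range(steps):
        w = logw.softmax(dim=-1)                   # (S, P) importance weights
        # all patch-to-patch similarities between support images
        sim = torch.einsum('spd,tqd->stpq', feats, feats)
        matched = sim.max(dim=2).values            # best match per token of image t
        img_sim = torch.einsum('stq,tq->st', matched, w)
        img_sim = img_sim.masked_fill(eye, float('-inf'))  # leave-one-out
        # class logit = best-matching other support image of that class
        logits = torch.stack(
            [img_sim[:, labels == c].max(dim=1).values for c in range(n_classes)],
            dim=1)
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return logw.softmax(dim=-1).detach()

# Example: 5-way 5-shot episode with random 384-d ViT patch embeddings
w = token_importance_weights(torch.randn(25, 196, 384),
                             torch.arange(5).repeat_interleave(5))

In this sketch, tokens such as the out-of-task dog patches in the example above would receive low weights, since matching on them pulls a held-out support image toward the wrong class.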
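The episodic protocol quoted under "Dataset Splits" follows the standard N-way K-shot setup of [49]; a minimal sketch of an episode sampler under that protocol might look as follows. The function and argument names are illustrative, and n_query=1 simply mirrors the quoted notation for $X_q$ (the paper's actual query-set size may differ).

import random
from collections import defaultdict

def sample_episode(split, n_way=5, k_shot=5, n_query=1, rng=random):
    # split: iterable of (sample, class_label) pairs from one of the
    # disjoint train/val/test class splits. Returns episode-local
    # labels 0..n_way-1 as in the quoted notation.
    by_class = defaultdict(list)
    for x, y in split:
        by_class[y].append(x)
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for n, c in enumerate(classes):
        chosen = rng.sample(by_class[c], k_shot + n_query)
        support += [(x, n) for x in chosen[:k_shot]]
        query += [(x, n) for x in chosen[k_shot:]]
    return support, query

# Model selection and testing as quoted: 600 random episodes per evaluation
# val_episodes = [sample_episode(val_split) for _ in range(600)]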
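Finally, the hyperparameters scattered across the "Experiment Setup" row can be collected into a single configuration sketch. Only the values quoted above come from the paper; the grouping and key names are illustrative, and everything the paper's setup contains beyond these quotes (schedulers, augmentations, etc.) is deliberately not reproduced here.

# Self-supervised pretraining (Masked Image Modelling pretext task)
PRETRAIN = {
    'ViT':  {'batch_size': 512, 'epochs': 1600, 'gpus': 4},  # A100, 40GB each
    'Swin': {'batch_size': 512, 'epochs': 800,  'gpus': 8},  # A100, 40GB each
}
# Meta-training / inference-time adaptation
META_TRAIN = {
    'max_epochs': 200,         # "most architectures converge earlier"
    'inner_optimizer': 'SGD',  # for token importance weight generation
    'inner_lr': 0.1,
    'local_window_m': 5,       # works well for both architectures
}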