Information Maximization for Few-Shot Learning
Authors: Malik Boudiaf, Imtiaz Ziko, Jérôme Rony, Jose Dolz, Pablo Piantanida, Ismail Ben Ayed
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Following standard transductive few-shot settings, our comprehensive experiments demonstrate that TIM outperforms state-of-the-art methods significantly across various datasets and networks |
| Researcher Affiliation | Academia | Malik Boudiaf (ÉTS Montreal); Ziko Imtiaz Masud (ÉTS Montreal); Jérôme Rony (ÉTS Montreal); Jose Dolz (ÉTS Montreal); Pablo Piantanida (CentraleSupélec-CNRS, Université Paris-Saclay); Ismail Ben Ayed (ÉTS Montreal) |
| Pseudocode | No | The paper describes the mathematical formulations and propositions for optimization but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code publicly available at https://github.com/mboudiaf/TIM |
| Open Datasets | Yes | Datasets: We resort to 3 few-shot learning datasets to benchmark the proposed models. As standard few-shot benchmarks, we use the mini-Imagenet [45] dataset, with 100 classes split as in [35], the Caltech-UCSD Birds 200 [47] (CUB) dataset, with 200 classes, split following [5], and finally the larger tiered-Imagenet dataset, with 608 classes split as in [36]. |
| Dataset Splits | Yes | The few-shot scenario assumes that we are given a test dataset X_test := {x_i, y_i}, i = 1, …, N_test, with a completely new set of classes Y_test such that Y_base ∩ Y_test = ∅, from which we create randomly sampled few-shot tasks, each with a few labeled examples. Specifically, each K-way N_S-shot task involves sampling N_S labeled examples from each of K different classes, also chosen at random. Let S denote the set of these labeled examples, referred to as the support set with size \|S\| = N_S · K. Furthermore, each task has a query set denoted by Q composed of \|Q\| = N_Q · K unlabeled (unseen) examples from each of the K classes. |
| Hardware Specification | Yes | Our methods were run on the same GTX 1080 Ti GPU, while the run-time of [7] is directly reported from the paper. |
| Software Dependencies | No | The paper mentions using the ADAM optimizer and standard networks like ResNet-18 and WRN28-10, but it does not provide specific version numbers for software dependencies (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | Hyperparameters: To keep our experiments as simple as possible, our hyperparameters are kept fixed across all the experiments and methods (TIM-GD and TIM-ADM). The conditional entropy weight α and the cross-entropy weight λ in Objective (3) are both set to 0.1. The temperature parameter τ in the classifier is set to 15. In our TIM-GD method, we use the ADAM optimizer with the recommended parameters [20], and run 1000 iterations for each task. For TIM-ADM, we run 150 iterations. Base-training procedure: The feature extractors are trained following the same simple base-training procedure as in [51] and using standard networks (ResNet-18 and WRN28-10), for all the experiments. Specifically, they are trained using the standard cross-entropy loss on the base classes, with label smoothing. The label-smoothing parameter is set to 0.1. We emphasize that base training does not involve any meta-learning or episodic training strategy. The models are trained for 90 epochs, with the learning rate initialized to 0.1, and divided by 10 at epochs 45 and 66. Batch size is set to 256 for ResNet-18, and to 128 for WRN28-10. |
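The K-way N_S-shot episode construction quoted in the Dataset Splits row can be sketched as below. This is a minimal illustration of the sampling protocol described in the paper, not the authors' code; `sample_task` and its parameter names are hypothetical.

```python
import numpy as np

def sample_task(labels, K=5, n_support=1, n_query=15, rng=None):
    """Sample one K-way n_support-shot episode from a test set.

    `labels` is an array of class ids for the test examples. Returns
    index arrays for the support set S (|S| = n_support * K) and the
    query set Q (|Q| = n_query * K), sampled from K random classes.
    """
    rng = rng or np.random.default_rng()
    # Choose K classes at random, then n_support + n_query examples per class.
    classes = rng.choice(np.unique(labels), size=K, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:n_support])
        query.extend(idx[n_support:n_support + n_query])
    return np.array(support), np.array(query)
```

For the standard 5-way 1-shot setting with 15 query examples per class, this yields |S| = 5 and |Q| = 75, matching the sizes implied by the quoted protocol.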
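The hyperparameters quoted in the Experiment Setup row (α = 0.1, λ = 0.1) weight the two parts of Objective (3): a support-set cross-entropy and a query-set mutual-information term. A minimal numpy sketch, assuming the loss takes the form λ·CE − (Ĥ(Y_Q) − α·Ĥ(Y_Q|X_Q)) as described in the TIM paper; function and variable names are illustrative and not taken from the authors' repository.

```python
import numpy as np

def tim_loss(support_logits, support_labels, query_logits,
             alpha=0.1, lam=0.1):
    """Sketch of the TIM objective: weighted support cross-entropy
    minus the query mutual-information surrogate H(Y) - alpha * H(Y|X)."""
    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    # Cross-entropy on the labeled support set.
    p_s = softmax(support_logits)
    ce = -np.mean(np.log(p_s[np.arange(len(support_labels)),
                             support_labels] + 1e-12))

    # Mutual-information surrogate on the unlabeled query set:
    # marginal entropy H(Y) encourages balanced class usage,
    # conditional entropy H(Y|X) encourages confident predictions.
    p_q = softmax(query_logits)
    marginal = p_q.mean(axis=0)
    h_y = -np.sum(marginal * np.log(marginal + 1e-12))
    h_y_x = -np.mean(np.sum(p_q * np.log(p_q + 1e-12), axis=1))

    return lam * ce - (h_y - alpha * h_y_x)
```

With this form, confident and class-balanced query predictions lower the loss relative to uniform ones, which is the behavior TIM-GD optimizes by gradient descent over each task; the logits would come from the paper's temperature-scaled classifier (τ = 15).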