A Universal Representation Transformer Layer for Few-Shot Image Classification

Authors: Lu Liu, William L. Hamilton, Guodong Long, Jing Jiang, Hugo Larochelle

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we show that URT sets a new state-of-the-art result on Meta-Dataset. Specifically, it achieves top-performance on the highest number of data sources compared to competing methods. We analyze variants of URT and present a visualization of the attention score heatmaps that sheds light on how the model performs cross-domain generalization.
Researcher Affiliation | Collaboration | Lu Liu 1,2, William Hamilton 1,3, Guodong Long 2, Jing Jiang 2, Hugo Larochelle 1,4. 1 Mila, 2 Australian AI Institute, UTS, 3 McGill University, 4 Google Research, Brain Team. Correspondence to lu.liu.cs@icloud.com
Pseudocode | Yes | Algorithm 1: Training of URT layer
Open Source Code | Yes | Our code is available at https://github.com/liulu112601/URT.
Open Datasets | Yes | We test our methods on the large-scale few-shot learning benchmark Meta-Dataset (Triantafillou et al., 2020).
Dataset Splits | Yes | Meta-Dataset includes ten datasets (domains), with eight of them available for training. Additionally, each task sampled in the benchmark varies in the number of classes N, with each class also varying in the number of shots K. As in all few-shot learning benchmarks, the classes used for training and testing do not overlap. ... We chose the hyper-parameters based on the performance of the validation set.
Hardware Specification | Yes | Of note, the average inference time for URT is 0.04 seconds per task, compared to 0.43 for SUR, on a single V100.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python version, library versions like PyTorch or TensorFlow, or CUDA version).
Experiment Setup | Yes | Then, we freeze the backbone and train the URT layer for 10,000 episodes, with an initial learning rate of 0.01 and a cosine learning rate scheduler. ... URT is trained with parameter weight decay of 1e-5 and with a regularization factor λ = 0.1. The number of heads (H in Equation 7) is set to 2, and the dimension of the keys and queries (l in Equation 4) is set to 1024.
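
To make the Dataset Splits row above concrete: Meta-Dataset tasks vary both in the number of classes N and in the number of shots K per class, and the classes used for training and testing do not overlap. The snippet below is a minimal, hypothetical episode sampler illustrating that structure only; the benchmark ships its own sampling pipeline, so the helper name `sample_episode` and the toy class lists are assumptions made purely for illustration.

```python
import random

def sample_episode(class_to_images, n_min=2, n_max=10, k_min=1, k_max=5, n_query=10):
    """Hypothetical sampler: each episode draws a variable number of classes N
    and, for each class, a variable number of support shots K, mimicking the
    variable-way, variable-shot structure of Meta-Dataset tasks."""
    n_way = random.randint(n_min, min(n_max, len(class_to_images)))
    classes = random.sample(sorted(class_to_images), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        images = list(class_to_images[cls])
        random.shuffle(images)
        k_shot = random.randint(k_min, min(k_max, len(images) - 1))
        support += [(img, label) for img in images[:k_shot]]
        query += [(img, label) for img in images[k_shot:k_shot + n_query]]
    return support, query

# Classes seen at meta-training time are disjoint from those used at test time,
# as in all few-shot learning benchmarks (toy class/image names for illustration).
train_classes = {f"train_class_{i}": [f"img_{i}_{j}" for j in range(30)] for i in range(50)}
test_classes = {f"test_class_{i}": [f"img_{i}_{j}" for j in range(30)] for i in range(20)}

support, query = sample_episode(train_classes)
print(len(support), "support examples,", len(query), "query examples")
```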
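
Likewise, the hyper-parameters quoted in the Experiment Setup row (10,000 training episodes, initial learning rate 0.01 with a cosine schedule, weight decay 1e-5, regularization factor λ = 0.1, H = 2 heads, key/query dimension l = 1024) could be wired together roughly as below. This is a sketch assuming a PyTorch implementation; the `URTHead` module, the SGD choice, and the placeholder loss are stand-ins rather than the authors' actual code, which is available in the linked repository.

```python
import torch
from torch import nn, optim

class URTHead(nn.Module):
    """Stand-in for one URT attention head: it scores the frozen, domain-specific
    backbone features with a query/key dot product and returns their weighted
    combination. A sketch of the general idea, not the paper's exact Equations 4-7."""

    def __init__(self, feat_dim: int = 512, key_dim: int = 1024):
        super().__init__()
        self.to_query = nn.Linear(feat_dim, key_dim)
        self.to_key = nn.Linear(feat_dim, key_dim)
        self.scale = key_dim ** -0.5

    def forward(self, domain_feats: torch.Tensor) -> torch.Tensor:
        # domain_feats: (n_domains, feat_dim), one vector per frozen backbone.
        task_query = self.to_query(domain_feats.mean(dim=0))         # (key_dim,)
        keys = self.to_key(domain_feats)                             # (n_domains, key_dim)
        attn = torch.softmax(keys @ task_query * self.scale, dim=0)  # (n_domains,)
        return attn @ domain_feats                                   # (feat_dim,)

feat_dim, n_domains, num_episodes = 512, 8, 10_000
heads = nn.ModuleList([URTHead(feat_dim) for _ in range(2)])  # H = 2 heads

# The backbones stay frozen; only the URT parameters are trained,
# with weight decay 1e-5 and a cosine schedule over the 10,000 episodes.
optimizer = optim.SGD(heads.parameters(), lr=0.01, weight_decay=1e-5)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_episodes)
lam = 0.1  # regularization factor λ from the paper

for episode in range(num_episodes):
    domain_feats = torch.randn(n_domains, feat_dim)  # placeholder for real backbone features
    universal = torch.cat([h(domain_feats) for h in heads])
    reg = sum(h.to_query.weight.pow(2).mean() for h in heads)  # placeholder for the paper's regularizer
    loss = universal.pow(2).mean() + lam * reg                  # placeholder for the episodic loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```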