Supervised Metric Learning to Rank for Retrieval via Contextual Similarity Optimization

Authors: Christopher Liao, Theodoros Tsiligkaridis, Brian Kulis

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our method achieves a new state-of-the-art across four image retrieval benchmarks and multiple different evaluation settings.
Researcher Affiliation | Collaboration | 1 Department of Electrical and Computer Engineering, Boston University; 2 MIT Lincoln Laboratory.
Pseudocode | Yes | Algorithm 1 Pseudo-code, PyTorch-like
Open Source Code | Yes | Code is available at: https://github.com/Chris210634/metric-learning-using-contextual-similarity
Open Datasets | Yes | We experiment on two small-scale and two large-scale datasets: Caltech-UCSD Birds (CUB-200) (Wah et al., 2011), Stanford Cars-196 (Krause et al., 2013), Stanford Online Products (SOP) (Oh Song et al., 2016), and mini-iNaturalist-2021 (Van Horn et al., 2018).
Dataset Splits | No | While the paper mentions using the best test R@1 metric and train-test splits, it does not explicitly provide percentages or counts for a separate validation set.
Hardware Specification | Yes | We run experiments on 1 V100 GPU with 16 GB of memory. The CUB and Cars experiments take under one hour. The SOP and iNaturalist experiments take 4 hours and 6 hours, respectively. Some of our code is borrowed from ROADMAP. For faster experimentation, we use mixed precision floating point. Experiments take more than 12 hours on P100 GPUs, partially because mixed precision arithmetic does not appear to speed up experiments as much on P100 GPUs as on V100 GPUs.
Software Dependencies | No | The paper mentions "PyTorch-like pseudo-code" and Adam as the optimizer but does not specify version numbers for PyTorch, Python, CUDA, or other key libraries.
Experiment Setup | Yes | For our method, we tune λ and ϵ separately for each dataset. We use fixed values for the remaining hyperparameters: γ = 0.1, α = 10.0, k = 4, δ+ = 0.75, δ− = 0.6, s = 0.3. The results in the main paper all use 224×224 image resolution. Some recent studies use 256×256 image resolution; comparisons in this setting are included in Appendix G, Table 4. We use Adam with a decaying learning-rate schedule. We report results on the model with the best test R@1 metric, as is standard in the literature. We tune learning rates separately for each method and dataset combination. We use a batch size of 256 for iNaturalist, 128 for SOP and CUB, and 64 for Cars; the larger batch size is necessary to achieve reasonable performance on iNaturalist, while the smaller batch size appears to reduce overfitting on Cars. We use a 4-per-class balanced sampler.
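
The Hardware Specification row notes that training uses mixed-precision floating point for speed on V100 GPUs. Below is a minimal sketch of how mixed-precision training is typically enabled in PyTorch via torch.cuda.amp; the model, data, and loss here are placeholders for illustration, not the authors' code.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Placeholder model, optimizer, and data -- illustrative only, not the authors' code.
model = torch.nn.Linear(2048, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
batches = [torch.randn(128, 2048) for _ in range(10)]

scaler = GradScaler()  # rescales the loss so fp16 gradients do not underflow

for features in batches:
    features = features.cuda()
    optimizer.zero_grad()
    with autocast():                        # forward pass runs in mixed precision
        embeddings = model(features)
        loss = embeddings.pow(2).mean()     # stand-in for the retrieval loss
    scaler.scale(loss).backward()           # backward on the scaled loss
    scaler.step(optimizer)                  # unscale gradients, then take an Adam step
    scaler.update()                         # adapt the loss-scale factor
```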
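
The Experiment Setup row describes Adam with a decaying learning-rate schedule, dataset-dependent batch sizes, and a 4-per-class balanced sampler. The sketch below shows one way such a configuration could be wired together in PyTorch; the sampler implementation, the exponential decay schedule, and the learning rate are illustrative assumptions, not the authors' released configuration.

```python
import random
from collections import defaultdict

import torch
from torch.utils.data import Dataset, DataLoader, Sampler

class BalancedBatchSampler(Sampler):
    """Yields index batches containing m samples per class (m = 4 in the paper)."""
    def __init__(self, labels, m=4, batch_size=128):
        self.by_class = defaultdict(list)
        for idx, y in enumerate(labels):
            self.by_class[y].append(idx)
        self.m = m
        self.classes_per_batch = batch_size // m
        self.num_batches = len(labels) // batch_size

    def __iter__(self):
        for _ in range(self.num_batches):
            batch = []
            for c in random.sample(list(self.by_class), self.classes_per_batch):
                batch.extend(random.choices(self.by_class[c], k=self.m))
            yield batch

    def __len__(self):
        return self.num_batches

# Hypothetical toy data; batch size 128 matches the SOP/CUB setting described above.
labels = [i % 100 for i in range(12800)]

class ToyDataset(Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        return torch.randn(3, 224, 224), labels[i]

loader = DataLoader(ToyDataset(),
                    batch_sampler=BalancedBatchSampler(labels, m=4, batch_size=128))

model = torch.nn.Conv2d(3, 8, kernel_size=3)              # placeholder backbone
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# "Decaying learning-rate schedule": an exponential decay is assumed here.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for images, targets in loader:                             # one epoch
    optimizer.zero_grad()
    loss = model(images).mean()                            # stand-in for the retrieval loss
    loss.backward()
    optimizer.step()
scheduler.step()                                           # decay the learning rate per epoch
```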