Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation

Authors: Bichen Wu, Ruizhe Cheng, Peizhao Zhang, Tianren Gao, Joseph E. Gonzalez, Peter Vajda

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs. Compared with InfoNCE loss, label smoothing, and knowledge distillation, OTTER consistently outperforms these baselines in zero-shot evaluation on Google Open Images (19,958 classes) and multi-labeled ImageNet-10K (10,032 classes) from Tencent ML-Images. Over 42 evaluations (7 dataset/architecture settings x 6 metrics), OTTER outperforms (32) or ties (2) all baselines in 34 of them.
Researcher Affiliation | Collaboration | 1 Meta Reality Labs, 2 UC Berkeley; {wbc,stzpz,vajdap}@fb.com, {chengruizhe,jegonzal}@berkeley.edu
Pseudocode | Yes | Algorithm 1: PyTorch pseudocode for OTTER; Algorithm 2: PyTorch pseudocode for Sinkhorn-Knopp (a hedged sketch of both follows the table).
Open Source Code | Yes | Our source code is open sourced at https://github.com/facebookresearch/OTTER.
Open Datasets | Yes | We train on three publicly available datasets: Conceptual Captions 3M (CC) (Sharma et al., 2018), the Wikipedia-based Image Text dataset (WIT), and YFCC 15M (Thomee et al., 2016). We evaluate the image encoder's zero-shot recognition of common visual concepts on Google Open Images (GOI) (Kuznetsova et al., 2020) and multi-labeled ImageNet-10K (Wu et al., 2019a).
Dataset Splits | No | The paper specifies the datasets used and mentions evaluation on test sets, but it does not explicitly provide training/validation split percentages or sample counts needed to reproduce the data partitioning. It mentions "test images" but not a distinct validation split or its size.
Hardware Specification | Yes | We train on 8 V100 GPUs using PyTorch (Paszke et al., 2019) distributed data parallel with a total batch size of 512 (64 per GPU) for 10 epochs.
Software Dependencies | No | The paper mentions "PyTorch (Paszke et al., 2019)" but does not specify a version number for PyTorch or any other software dependency, which is required for reproducibility.
Experiment Setup | Yes | We use SGD with an initial learning rate of 3e-3, a cosine annealing scheduler, momentum 0.9, and no weight decay. Input images are resized to 256x256 and randomly cropped to 224x224, while test images are resized to 256x256 and center-cropped to 224x224. We train on 8 V100 GPUs using PyTorch... with a total batch size of 512 (64 per GPU) for 10 epochs. We set the loss coefficient α = 0.5 and γv = γt = 1 for the similarity matrix. We use the exponential moving average (EMA) of the image/text encoders as teachers and set the EMA decay to 0.999. For Sinkhorn-Knopp, we set λ = 0.15 and the number of iterations to 5. (A training-setup sketch also follows the table.)
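
As referenced in the Pseudocode row, the paper provides PyTorch pseudocode for OTTER (Algorithm 1) and Sinkhorn-Knopp (Algorithm 2). The following is a minimal sketch of the two pieces, not the authors' code: the function and variable names (`sinkhorn`, `otter_loss`, `temp`) are illustrative, the student temperature is an assumed value, and the teacher similarity is simplified to the cross-modal product, omitting the γv/γt within-modality weighting mentioned in the Experiment Setup row.

```python
import torch
import torch.nn.functional as F

def sinkhorn(sim: torch.Tensor, lam: float = 0.15, n_iters: int = 5) -> torch.Tensor:
    """Approximate optimal-transport normalization of a similarity matrix.

    Exponentiates sim/lam (lam = 0.15 and 5 iterations, as quoted above),
    then alternately normalizes rows and columns toward uniform marginals.
    """
    Q = torch.exp(sim / lam)
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=1, keepdim=True)  # rows sum to 1
        Q = Q / Q.sum(dim=0, keepdim=True)  # columns sum to 1
    return Q / Q.sum(dim=1, keepdim=True)   # final row-stochastic targets

def otter_loss(img_emb, txt_emb, t_img_emb, t_txt_emb, alpha=0.5, temp=0.07):
    """Cross-entropy between student logits and OT-smoothed teacher targets.

    alpha = 0.5 mirrors the paper's loss coefficient; it mixes the paired
    (identity) target with the transport plan, so at 0.5 the ordering of
    the mix is immaterial. temp = 0.07 is an assumed student temperature.
    """
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temp
    with torch.no_grad():
        # Teacher (EMA encoder) similarities define the soft targets.
        sim = F.normalize(t_img_emb, dim=1) @ F.normalize(t_txt_emb, dim=1).t()
        eye = torch.eye(sim.shape[0], device=sim.device)
        targets = (1 - alpha) * eye + alpha * sinkhorn(sim)
    loss_i = -(targets * F.log_softmax(logits, dim=1)).sum(1).mean()
    loss_t = -(targets.t() * F.log_softmax(logits.t(), dim=1)).sum(1).mean()
    return 0.5 * (loss_i + loss_t)
```

The Sinkhorn loop runs only on the teacher similarities of a single batch, so the five iterations reported in the paper add negligible cost per step.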
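
Likewise, here is a minimal sketch of the optimizer, schedule, transforms, and EMA update quoted in the Experiment Setup row. The `torch.nn.Linear` model is a stand-in placeholder for the image/text encoders, and details the paper does not quote here (warmup, per-step vs. per-epoch annealing) are filled with one plausible choice.

```python
import copy
import torch
from torchvision import transforms

# Augmentations quoted above: resize to 256x256, then random (train)
# or center (test) crop to 224x224.
train_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])
test_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

model = torch.nn.Linear(512, 512)      # placeholder for the encoders
teacher = copy.deepcopy(model).eval()  # EMA teacher, per the paper

# SGD: lr 3e-3, momentum 0.9, no weight decay; cosine annealing over 10 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=3e-3,
                            momentum=0.9, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Exponential moving average of student weights (decay 0.999)."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)
```

The paper trains with distributed data parallel at a total batch size of 512 (64 per GPU on 8 V100s); the per-GPU data loaders and DDP wrapping are omitted from this sketch.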