Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation
Authors: Bichen Wu, Ruizhe Cheng, Peizhao Zhang, Tianren Gao, Joseph E. Gonzalez, Peter Vajda
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs. Compared with the InfoNCE loss, label smoothing, and knowledge distillation, OTTER consistently outperforms these baselines in zero-shot evaluation on Google Open Images (19,958 classes) and multi-labeled ImageNet 10K (10,032 classes) from Tencent ML-Images. Across 42 evaluations (7 dataset/architecture settings × 6 metrics), OTTER outperforms (32) or ties (2) all baselines in 34 of them. |
| Researcher Affiliation | Collaboration | Meta Reality Labs; UC Berkeley. Emails: {wbc,stzpz,vajdap}@fb.com, {chengruizhe,jegonzal}@berkeley.edu |
| Pseudocode | Yes | Algorithm 1: PyTorch pseudocode for OTTER; Algorithm 2: PyTorch pseudocode for Sinkhorn-Knopp (see the hedged sketch following this table). |
| Open Source Code | Yes | Our source code is open sourced at https://github.com/facebookresearch/OTTER. |
| Open Datasets | Yes | We train on three publicly available datasets: Conceptual Captions 3M (CC) (Sharma et al., 2018), the Wikipedia-based Image Text dataset (WIT), and YFCC 15M (Thomee et al., 2016). We evaluate the image encoder's zero-shot recognition of common visual concepts on Google Open Images (GOI) (Kuznetsova et al., 2020) and multi-labeled ImageNet 10K (Wu et al., 2019a). |
| Dataset Splits | No | The paper specifies the datasets used and mentions evaluation on test sets, but it does not explicitly provide training/validation split percentages or sample counts to reproduce the data partitioning. It mentions "test images" but not a distinct validation split or its size. |
| Hardware Specification | Yes | We train on 8 V100 GPUs using PyTorch (Paszke et al., 2019) distributed data parallel with a total batch size of 512 (64 per GPU) for 10 epochs. |
| Software Dependencies | No | The paper mentions "Pytorch (Paszke et al., 2019)" but does not specify a version number for PyTorch or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We use SGD with an initial learning rate of 3e-3, a cosine annealing scheduler, momentum 0.9, and no weight decay. Input images are resized to 256×256 and randomly cropped to 224×224, while test images are resized to 256×256 and center-cropped to 224×224. We train on 8 V100 GPUs using PyTorch... with a total batch size of 512 (64 per GPU) for 10 epochs. We set the loss coefficient α = 0.5 and γ_v = γ_t = 1 for the similarity matrix. We use the exponential-moving average (EMA) of the image/text encoders as teachers and set the EMA decay to 0.999. For Sinkhorn-Knopp, we set λ = 0.15 and the number of iterations to 5. An illustrative configuration sketch follows the table. |
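
The Sinkhorn-Knopp settings reported above (λ = 0.15, 5 iterations) and the loss coefficient α = 0.5 are enough to sketch how the optimal-transport soft targets could be formed and used. The following is a minimal, hedged PyTorch sketch, not the authors' Algorithm 2: the function names (`sinkhorn_knopp`, `soft_target_loss`) and the exact kernel/marginal and blending choices are assumptions; the open-sourced code at https://github.com/facebookresearch/OTTER is the reference implementation.

```python
import torch
import torch.nn.functional as F


def sinkhorn_knopp(sim, lam=0.15, n_iters=5):
    """Entropy-regularized matching from a batch similarity matrix.

    Hedged sketch: `sim` is the (B, B) image-text similarity matrix of a
    batch; `lam` and `n_iters` follow the values quoted in the table. The
    precise kernel and marginals of the paper's Algorithm 2 may differ.
    """
    Q = torch.exp(sim / lam)                  # positive kernel from similarities
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=1, keepdim=True)    # normalize rows
        Q = Q / Q.sum(dim=0, keepdim=True)    # normalize columns
    return Q / Q.sum(dim=1, keepdim=True)     # each row is a soft target distribution


def soft_target_loss(logits, sim_teacher, alpha=0.5):
    """Blend one-hot InfoNCE targets with Sinkhorn soft targets (assumed form)."""
    B = logits.shape[0]
    hard = torch.eye(B, device=logits.device)        # matched pairs on the diagonal
    soft = sinkhorn_knopp(sim_teacher).detach()       # teacher-derived OT targets
    targets = (1.0 - alpha) * hard + alpha * soft
    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()
```

The blend weight α = 0.5 matches the coefficient quoted in the Experiment Setup row; whether the paper applies it in exactly this form, and symmetrically over the image-to-text and text-to-image directions, should be checked against the released code.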
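
Similarly, the optimizer, scheduler, and preprocessing reported in the Experiment Setup and Hardware rows can be written down directly. This is an illustrative sketch under the stated hyperparameters (SGD, lr 3e-3, momentum 0.9, no weight decay, cosine annealing over 10 epochs, 256→224 crops); `build_optimizer` and `steps_per_epoch` are hypothetical names, and the distributed data parallel wiring, the EMA teachers (decay 0.999), and the loss itself are omitted.

```python
import torch
from torchvision import transforms


def build_optimizer(model, steps_per_epoch, epochs=10):
    """Optimizer and scheduler per the reported setup (values from the table)."""
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=3e-3, momentum=0.9, weight_decay=0.0)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * steps_per_epoch)
    return optimizer, scheduler


# Train: resize to 256x256, random-crop to 224x224; test: resize, center-crop.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])
test_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```

The total batch size of 512 in the Hardware row corresponds to 64 per GPU across 8 V100s with PyTorch distributed data parallel; normalization statistics are not quoted in the excerpts above and are therefore left out of the transforms.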