Unicom: Universal and Compact Representation Learning for Image Retrieval
Authors: Xiang An, Jiankang Deng, Kaicheng Yang, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method significantly outperforms state-of-the-art unsupervised and supervised image retrieval approaches on multiple benchmarks. The code and pre-trained models are released to facilitate future research https://github.com/deepglint/unicom. |
| Researcher Affiliation | Collaboration | Xiang An1, Jiankang Deng2, Kaicheng Yang1, Jiawei Li1, Ziyong Feng1, Jia Guo3, Jing Yang4, Tongliang Liu5. 1Deep Glint, 2Huawei, 3InsightFace, 4University of Cambridge, 5University of Sydney |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The code and pre-trained models are released to facilitate future research https://github.com/deepglint/unicom. |
| Open Datasets | Yes | We first cluster the large-scale LAION 400M dataset into one million pseudo classes based on the joint textual and visual features extracted by the CLIP model. Table 10: List of linear probe datasets with the data distribution and evaluation metrics. Table 11: Dataset composition for training and evaluation in the image retrieval task. |
| Dataset Splits | Yes | For supervised retrieval, we follow the data-split settings of the baseline methods (Patel et al., 2022; Ermolov et al., 2022) to fine-tune models. Table 10: List of linear probe datasets with the data distribution and evaluation metrics. Table 11: Dataset composition for training and evaluation in the image retrieval task. |
| Hardware Specification | Yes | The training is conducted on 128 NVIDIA V100 GPUs across 16 nodes. |
| Software Dependencies | No | The paper mentions 'AdamW (Loshchilov & Hutter, 2018) as the optimizer' and 'ArcFace (Deng et al., 2019; 2020) for both pre-training and image retrieval tasks' but does not provide specific version numbers for these or other software components. |
| Experiment Setup | Yes | Unless otherwise specified, all ViT models in our experiments follow the same architecture designs in CLIP, and are trained from scratch for 32 epochs on the automatically clustered LAION 400M dataset (Section 3.2) with cluster number k = 1M. During training, we randomly crop and horizontally flip each image to get the input image with 224×224 resolution. We set the random class sampling ratio r1 as 0.1 in the pre-training step. We use AdamW (Loshchilov & Hutter, 2018) as the optimizer with an initial learning rate of 0.001, and a weight decay of 0.05. We employ margin-based softmax loss, ArcFace (Deng et al., 2019; 2020), for both pre-training and image retrieval tasks. The margin value is set to 0.3 and the feature scale is set to 64. |
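The margin-based softmax loss named in the setup row (ArcFace, margin 0.3, feature scale 64) can be sketched as follows. This is a minimal NumPy illustration, not the authors' released implementation: the function names are hypothetical, and details of their training pipeline (distributed partial-FC class sampling, mixed precision, numerical easing when θ + m exceeds π) are omitted.

```python
import numpy as np

def arcface_logits(features, class_weights, labels, margin=0.3, scale=64.0):
    """Cosine logits with an additive angular margin on the target class.

    features: (B, D) raw embeddings; class_weights: (C, D) class centers.
    The paper's reported hyperparameters are margin=0.3, scale=64.
    """
    # L2-normalize embeddings and class centers so logits are cosines.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(f @ w.T, -1.0, 1.0)          # (B, C) cosine similarities
    theta = np.arccos(cos)                      # angles between f and centers
    # Add the angular margin m only to each sample's ground-truth class.
    target = np.zeros(cos.shape, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    cos_margin = np.where(target, np.cos(theta + margin), cos)
    return scale * cos_margin                   # re-scaled logits for softmax

def softmax_cross_entropy(logits, labels):
    """Numerically stable mean cross-entropy over integer labels."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```

Because the margin shrinks the target-class cosine, the loss with margin 0.3 is strictly harder than plain normalized softmax, which is what forces intra-class compactness during pre-training.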