Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching

Authors: Byoungjip Kim, Sungik Choi, Dasol Hwang, Moontae Lee, Honglak Lee

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that unsupervised representation transfer of a pre-trained vision-language model enables a small ResNet-18 to achieve better ImageNet-1K top-1 linear probe accuracy (66.2%) than vision-only self-supervised learning (SSL) methods (e.g., SimCLR: 51.8%, SwAV: 63.7%), while closing the gap with supervised learning (69.8%).
Researcher Affiliation | Collaboration | Byoungjip Kim (1), Sungik Choi (1), Dasol Hwang (1), Moontae Lee (1,2), Honglak Lee (1); (1) LG AI Research, (2) University of Illinois Chicago; {bjkim, sungik.choi, dasol.hwang, moontae.lee, honglak}@lgresearch.ai
Pseudocode | No | The paper describes its methods using text and mathematical formulations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | We provide the code, data, and instructions in the supplemental material.
Open Datasets | Yes | We evaluate BeamCLIP on six standard benchmark datasets: CIFAR10 [23], CIFAR100 [23], STL10 [9], Flowers102 [28], Pets37 [31], and ImageNet-1K [10].
Dataset Splits | Yes | Following convention, we split the datasets into train, validation, and test sets. We use the train set for transfer and the test set for evaluation. For ImageNet, we use the validation set as the test set, since its test set does not provide labels.
Hardware Specification | Yes | We perform our experiments on 8 NVIDIA A100 GPUs, and 200-epoch training takes about 30 hours.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) in its main text or appendices.
Experiment Setup | Yes | The hyperparameter C is determined through a coarse-grained search on the validation split, and accuracy is evaluated on the test split; the best linear probe accuracy is obtained with C set to 30. Experiments are run on 8 NVIDIA A100 GPUs, and 200-epoch training takes about 30 hours. Optimization uses SGD with a cosine annealing schedule (SGDR) [25]. The momentum encoder of the student, Ŝ, is updated by the rule Ŝ ← m·Ŝ + (1 − m)·S (Eq. 9), where S is the image encoder of the student model and m is a momentum hyperparameter set to 0.99 in the experiments. Model hyperparameters are summarized in Table 12 in Appendix B.5.
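
To make the quoted setup concrete, below is a minimal PyTorch sketch of the Eq. (9) momentum update together with SGD under a cosine-annealing (SGDR) schedule. Only the update rule, the optimizer choice, and m = 0.99 come from the quoted text; the module names (student, momentum_student), the learning rate, and the schedule period are illustrative assumptions, not the authors' released code.

    import torch
    import torch.nn as nn

    # Stand-ins for the student image encoder S and its momentum copy S_hat.
    # (The paper's student is a ResNet-18; a Linear layer keeps the sketch small.)
    student = nn.Linear(512, 128)
    momentum_student = nn.Linear(512, 128)
    momentum_student.load_state_dict(student.state_dict())
    for p in momentum_student.parameters():
        p.requires_grad_(False)  # the momentum encoder is never trained directly

    # SGD with a cosine-annealing schedule; SGDR [25] corresponds to
    # cosine annealing with warm restarts in PyTorch.
    optimizer = torch.optim.SGD(student.parameters(), lr=0.05, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=200)

    @torch.no_grad()
    def momentum_update(m: float = 0.99) -> None:
        """Eq. (9), applied parameter-wise: S_hat <- m * S_hat + (1 - m) * S."""
        for p_hat, p in zip(momentum_student.parameters(), student.parameters()):
            p_hat.mul_(m).add_(p, alpha=1.0 - m)

    # Per training step (loss computation omitted):
    #   optimizer.step(); scheduler.step(); momentum_update(m=0.99)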
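
The linear probe numbers quoted above (e.g., 66.2% ImageNet-1K top-1) refer to the standard protocol of training a linear classifier on frozen encoder features. The following is a minimal sketch of that protocol under stated assumptions; encoder, feat_dim, and loader are placeholders, and this is not the paper's evaluation code.

    import torch
    import torch.nn as nn

    def linear_probe(encoder: nn.Module, loader, feat_dim: int = 512, num_classes: int = 1000):
        # Freeze the encoder; only the linear head is trained.
        encoder.eval()
        for p in encoder.parameters():
            p.requires_grad_(False)

        probe = nn.Linear(feat_dim, num_classes)
        optimizer = torch.optim.SGD(probe.parameters(), lr=0.1, momentum=0.9)
        criterion = nn.CrossEntropyLoss()

        for images, labels in loader:
            with torch.no_grad():
                feats = encoder(images)  # frozen features
            loss = criterion(probe(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return probe

Top-1 accuracy is then measured by running the frozen encoder plus trained probe over the test split.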