Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching
Authors: Byoungjip Kim, Sungik Choi, Dasol Hwang, Moontae Lee, Honglak Lee
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that unsupervised representation transfer of a pre-trained vision-language model enables a small ResNet-18 to achieve a better ImageNet-1K top-1 linear probe accuracy (66.2%) than vision-only self-supervised learning (SSL) methods (e.g., SimCLR: 51.8%, SwAV: 63.7%), while closing the gap with supervised learning (69.8%). |
| Researcher Affiliation | Collaboration | Byoungjip Kim1, Sungik Choi1, Dasol Hwang1, Moontae Lee1,2, Honglak Lee1. LG AI Research1, University of Illinois Chicago2. {bjkim, sungik.choi, dasol.hwang, moontae.lee, honglak}@lgresearch.ai |
| Pseudocode | No | The paper describes its methods using text and mathematical formulations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | We provide the code, data, and instructions in the supplemental material. |
| Open Datasets | Yes | We evaluate BeamCLIP on six standard benchmark datasets: CIFAR10 [23], CIFAR100 [23], STL10 [9], Flowers102 [28], Pets37 [31], and ImageNet-1K [10]. |
| Dataset Splits | Yes | Following convention, we split the datasets into train, validation, and test sets. We then use the train set for transfer and the test set for evaluation. For ImageNet, we use the validation set as the test set, since its test set does not provide labels. |
| Hardware Specification | Yes | We perform our experiments on 8 NVIDIA A100 GPUs, and 200-epoch training takes about 30 hours. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) in its main text or appendices. |
| Experiment Setup | Yes | The hyperparameter C is determined through a coarse-grained hyperparameter search on the validation split, and accuracy is evaluated on the test split; we found that C = 30 gives the best linear probe accuracy. We perform our experiments on 8 NVIDIA A100 GPUs, and 200-epoch training takes about 30 hours. For optimization, we use SGD with a cosine annealing schedule (SGDR) [25]. The momentum encoder of the student, $\hat{S}$, is updated using the rule $\hat{S} \leftarrow m \hat{S} + (1 - m) S$ (Eq. 9), where $S$ is the image encoder of the student model and $m$ is a momentum hyperparameter set to 0.99 in our experiments. The model hyperparameters are summarized in Table 12 in Appendix B.5. |
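
The momentum-encoder rule in Eq. (9) is an exponential moving average (EMA) of the student encoder's weights. The following PyTorch sketch shows how such an update could be wired together with the reported SGD + cosine-annealing (SGDR) schedule and m = 0.99. It is a minimal illustration, not the authors' released code: the encoder choice (`resnet18`), the learning rate, and the placeholder loss are our own assumptions, and the placeholder loss does not implement the paper's cross-modal similarity matching objective.

```python
import copy
import torch
import torchvision.models as models

# Sketch of the momentum-encoder update (Eq. 9) plus SGD with a cosine
# schedule, as described in the setup above. The loss below is a random
# placeholder, NOT the paper's cross-modal similarity matching objective.

m = 0.99  # momentum hyperparameter reported in the paper

student = models.resnet18()                # student image encoder S
momentum_student = copy.deepcopy(student)  # momentum encoder \hat{S}
for p in momentum_student.parameters():
    p.requires_grad_(False)                # the EMA copy is never trained by SGD

@torch.no_grad()
def ema_update(ema_model, model, momentum):
    """Eq. (9): S_hat <- momentum * S_hat + (1 - momentum) * S."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(momentum).add_(p, alpha=1.0 - momentum)

# SGD with a cosine annealing schedule; the paper cites SGDR [25]
# (warm restarts), which PyTorch exposes as CosineAnnealingWarmRestarts.
# The learning rate here is an assumed value, not taken from the paper.
optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=200)

for step in range(3):                      # stand-in for the 200-epoch loop
    images = torch.randn(4, 3, 224, 224)   # dummy batch of images
    loss = student(images).pow(2).mean()   # placeholder loss only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(momentum_student, student, m)  # EMA update after each step
    scheduler.step()
```

Freezing the EMA copy's parameters and updating them only through `ema_update` mirrors the usual momentum-encoder design (as in MoCo or BYOL), where the slowly moving copy provides stable targets while the student is trained by gradient descent.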