CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
Authors: Rakshith Sharma Srinivasa, Jaejin Cho, Chouchang Yang, Yashas Malur Saidutta, Ching-Hua Lee, Yilin Shen, Hongxia Jin
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we provide experimental results that demonstrate that CWCL leads to better zero-shot transfer performance. We study two pairs of domains, namely image-text and speech-text. |
| Researcher Affiliation | Industry | Samsung Research America, Mountain View, CA. Contact: {r.srinivasa, jaejin.cho, c.yang1}@samsung.com; {ym.saidutta, chinghua.l, yilin.shen, hongxia.jin}@samsung.com |
| Pseudocode | No | No section or figure explicitly labeled 'Pseudocode' or 'Algorithm' was found. |
| Open Source Code | No | The paper states 'We build upon the code repository in [56]', which refers to an external repository the authors build on, not a release of their own source code. |
| Open Datasets | Yes | All our experiments are based on the combination of two publicly available datasets, CC12M and YFCC15M. The CC12M dataset is a subset of the Conceptual Captions dataset [34] defined in [35]. We use the roughly 10 million images whose URLs are still accessible (the rest have been taken down). The YFCC15M dataset is a subset of the Yahoo Flickr Creative Commons dataset [36], defined by [1] by filtering for high-quality English text; it contains 15 million image-text pairs. For cross-modal training, we used the Common Voice Corpus 13.0 [50]. |
| Dataset Splits | Yes | We use the MS-COCO validation dataset [43] to study zero-shot retrieval performance of these models. Dataset: We used the SLURP [8] and STOP [51] datasets for evaluation. In the SLURP dataset, we used all the text sentences in the train subset to generate the class embeddings for 60 intent classes... The evaluation was done on the devel and test subsets. In the STOP dataset... The evaluation was done on the validation and test sets. (A minimal sketch of this class-embedding procedure appears after the table.) |
| Hardware Specification | Yes | We train on 4 A100 GPUs. |
| Software Dependencies | No | The paper mentions building upon 'OpenCLIP' [56] and using the 'AdamW optimizer', but does not provide version numbers for core software dependencies such as Python, PyTorch, or CUDA, which are needed for full reproducibility. |
| Experiment Setup | Yes | We train our models for a total of 70 epochs, where each epoch uses a subset of 6 million images. The batch size is set to 16,000. ... We use a learning rate of 0.001, the AdamW optimizer with β1 = 0.9, β2 = 0.999, and a weight decay of 0.0001 [57]. We train each model for a total of 20 epochs... We use a batch size of 20 with 12,500 warmup steps... We use a learning rate of 0.00003, the AdamW optimizer with β1 = 0.9, β2 = 0.999, a weight decay of 0.0001, and a gradient clipping norm of 10. (See the training-loop sketch after the table.) |
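For concreteness, the cross-modal training hyperparameters quoted in the 'Experiment Setup' row can be assembled into a training loop. The sketch below is a hypothetical PyTorch reconstruction, not the authors' code: the placeholder model, dummy loss, and function names are our own assumptions, while the AdamW settings (β1 = 0.9, β2 = 0.999, weight decay 0.0001), learning rate 0.00003, 12,500 warmup steps, batch size 20, and gradient clipping norm of 10 are taken from the quoted text.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Placeholder encoder; the paper's actual architecture is not given in this section.
model = nn.Linear(512, 512)

# Quoted cross-modal settings: lr = 0.00003, betas = (0.9, 0.999), weight decay = 0.0001.
optimizer = AdamW(model.parameters(), lr=3e-5, betas=(0.9, 0.999), weight_decay=1e-4)

# Linear warmup over the quoted 12,500 steps, then a constant learning rate.
WARMUP_STEPS = 12_500
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / WARMUP_STEPS))

def training_step(batch: torch.Tensor) -> float:
    """One optimization step with the quoted hyperparameters."""
    loss = model(batch).pow(2).mean()  # dummy loss; the CWCL loss is not reproduced here
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping norm of 10, as quoted.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
    scheduler.step()
    return loss.item()

training_step(torch.randn(20, 512))  # batch size 20, as quoted
```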
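The 'Dataset Splits' row describes zero-shot intent classification: class embeddings are generated from all training-subset sentences of each intent class (e.g., the 60 SLURP classes) and then used to label held-out speech. The sketch below is a minimal reconstruction of that procedure, assuming each class embedding is the mean of normalized text embeddings and that classification uses cosine similarity; the function names, embedding dimension, and toy data are all hypothetical.

```python
import torch
import torch.nn.functional as F

def build_class_embeddings(text_embs: torch.Tensor, labels: torch.Tensor,
                           num_classes: int) -> torch.Tensor:
    """Average the text embeddings of all training sentences in each intent class."""
    class_embs = torch.zeros(num_classes, text_embs.size(1))
    for c in range(num_classes):
        class_embs[c] = text_embs[labels == c].mean(dim=0)
    return F.normalize(class_embs, dim=-1)

def zero_shot_classify(speech_embs: torch.Tensor,
                       class_embs: torch.Tensor) -> torch.Tensor:
    """Assign each speech embedding to the class with the highest cosine similarity."""
    speech_embs = F.normalize(speech_embs, dim=-1)
    return (speech_embs @ class_embs.T).argmax(dim=-1)

# Toy usage: 60 intent classes, 512-dimensional embeddings.
text_embs = F.normalize(torch.randn(600, 512), dim=-1)
labels = torch.arange(600) % 60          # ensures every class has examples
class_embs = build_class_embeddings(text_embs, labels, num_classes=60)
preds = zero_shot_classify(torch.randn(8, 512), class_embs)
```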