SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger
Authors: Yuting Gao, Jinfeng Liu, Zihan Xu, Tong Wu, Enwei Zhang, Ke Li, Jie Yang, Wei Liu, Xing Sun
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of SoftCLIP. |
| Researcher Affiliation | Collaboration | Tencent Youtu Lab; Department of Automation, Shanghai Jiao Tong University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 2 shows an overall framework, not a detailed algorithm. |
| Open Source Code | No | The paper does not provide concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described in this paper. |
| Open Datasets | Yes | SoftCLIP is pre-trained on three datasets: CC3M (Sharma et al. 2018), CC12M (Changpinyo et al. 2021), and YFCC15M-V2 (Li et al. 2021b). These datasets are listed in Table 1. |
| Dataset Splits | No | The paper mentions training for a certain number of epochs and using 'automatic mixed-precision', but it does not provide specific dataset split information (exact percentages, sample counts, or citations to predefined splits) needed to reproduce the data partitioning into train/validation/test sets. |
| Hardware Specification | Yes | We use 8 V100 GPUs for our experiments. |
| Software Dependencies | No | The paper mentions the use of AdamW optimizer and automatic mixed-precision but does not provide specific version numbers for software libraries or dependencies (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The input resolution of the image encoder is 224×224 and the maximum context length of the text encoder is 77. ... We train our SoftCLIP using an AdamW (Loshchilov and Hutter 2017) optimizer and a cosine learning rate scheduler with a linear warm-up. Specifically, the learning rate linearly increases from 0 to the peak value within 10% of the total steps, and then decreases with a cosine annealing strategy. The weight decay rate of AdamW is set to 0.2. ... The models are trained from scratch for either 8 or 32 epochs in our experiments, i.e., 8 epochs for ablation and 32 epochs for comparison. ... the batch size is set to 2048, while with the image encoder ViT-B/16, the batch size is 1024. (A hedged configuration sketch follows the table.) |
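The quoted setup fully specifies the optimizer and learning-rate schedule, so a short sketch can make it concrete. The snippet below is a minimal PyTorch illustration, not the authors' code: the peak learning rate, total step count, and the toy model are placeholder assumptions; only the AdamW weight decay of 0.2 and the 10% linear warm-up followed by cosine annealing come from the paper's reported configuration.

```python
# Hedged sketch of the reported optimization schedule: AdamW (weight decay 0.2)
# with a linear warm-up over the first 10% of steps followed by cosine annealing.
# Peak learning rate, total step count, and the model below are placeholders,
# not values taken from the paper.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model, total_steps, peak_lr=5e-4, weight_decay=0.2):
    # AdamW with the weight decay stated in the experiment setup.
    optimizer = AdamW(model.parameters(), lr=peak_lr, weight_decay=weight_decay)

    warmup_steps = int(0.10 * total_steps)  # linear warm-up over 10% of training

    def lr_lambda(step):
        if step < warmup_steps:
            # Linearly increase from 0 to the peak value.
            return step / max(1, warmup_steps)
        # Cosine-anneal from the peak value toward 0 over the remaining steps.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Example usage with a toy model; call scheduler.step() once per optimization step.
if __name__ == "__main__":
    model = torch.nn.Linear(512, 512)  # stand-in for the CLIP image/text encoders
    optimizer, scheduler = build_optimizer_and_scheduler(model, total_steps=10_000)
    for _ in range(3):
        optimizer.step()
        scheduler.step()
```

Under these assumptions, the multiplicative `lr_lambda` reproduces the described shape of the schedule (0 → peak over the first 10% of steps, then cosine decay), while batch size and epoch count would be set per encoder as quoted above.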