SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger

Authors: Yuting Gao, Jinfeng Liu, Zihan Xu, Tong Wu, Enwei Zhang, Ke Li, Jie Yang, Wei Liu, Xing Sun

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the effectiveness of SoftCLIP.
Researcher Affiliation | Collaboration | Tencent Youtu Lab; Department of Automation, Shanghai Jiao Tong University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks; Figure 2 shows the overall framework, not a detailed algorithm.
Open Source Code | No | The paper does not provide concrete access to source code (a repository link, an explicit code-release statement, or code in supplementary materials) for the methodology it describes.
Open Datasets | Yes | SoftCLIP is pre-trained on three datasets: CC3M (Sharma et al. 2018), CC12M (Changpinyo et al. 2021), and YFCC15M-V2 (Li et al. 2021b). These datasets are listed in Table 1.
Dataset Splits | No | The paper reports training for a fixed number of epochs with automatic mixed precision, but does not provide the dataset split information (exact percentages, sample counts, or citations to predefined splits) needed to reproduce the train/validation/test partitioning.
Hardware Specification | Yes | We use 8 V100 GPUs for experiments.
Software Dependencies | No | The paper mentions the AdamW optimizer and automatic mixed precision but does not provide version numbers for software libraries or dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | The input resolution of the image encoder is 224×224 and the maximum context length of the text encoder is 77. ... We train SoftCLIP with an AdamW (Loshchilov and Hutter 2017) optimizer and a cosine learning-rate scheduler with linear warm-up: the learning rate increases linearly from 0 to the peak value within the first 10% of the total steps, then decays with a cosine annealing strategy. The weight decay of AdamW is set to 0.2. ... Models are trained from scratch for either 8 or 32 epochs, i.e., 8 epochs for ablations and 32 epochs for comparisons. ... The batch size is set to 2048, while with the ViT-B/16 image encoder the batch size is 1024.
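The warm-up-plus-cosine schedule quoted in the Experiment Setup row can be sketched as a small helper; this is a minimal illustration of the described behavior, and the function name, `peak_lr`, and `warmup_frac` are hypothetical labels, not identifiers from the paper.

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_frac: float = 0.1) -> float:
    """Linear warm-up then cosine annealing, per the reported setup."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Learning rate increases linearly from 0 to the peak value
        # within the first 10% of the total steps.
        return peak_lr * step / warmup_steps
    # Afterwards it decays to 0 along a cosine curve.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, with 1000 total steps and a peak of 1.0, the rate reaches its peak at step 100 (end of warm-up), is at roughly half the peak midway through the cosine phase, and returns to 0 at the final step.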