Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger
Authors: Yuting Gao, Jinfeng Liu, Zihan Xu, Tong Wu, Enwei Zhang, Ke Li, Jie Yang, Wei Liu, Xing Sun
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of SoftCLIP. |
| Researcher Affiliation | Collaboration | ¹Tencent Youtu Lab; ²Department of Automation, Shanghai Jiao Tong University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 2 shows an overall framework, not a detailed algorithm. |
| Open Source Code | No | The paper does not provide concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described in this paper. |
| Open Datasets | Yes | And SoftCLIP is pre-trained on three datasets, CC3M (Changpinyo et al. 2021), CC12M (Sharma et al. 2018) and YFCC15M-V2 (Li et al. 2021b). These datasets are listed in Table 1. |
| Dataset Splits | No | The paper mentions training for a certain number of epochs and using 'automatic mixed-precision', but it does not provide specific dataset split information (exact percentages, sample counts, or citations to predefined splits) needed to reproduce the data partitioning into train/validation/test sets. |
| Hardware Specification | Yes | We use 8 V100 GPUs for experiments |
| Software Dependencies | No | The paper mentions the use of AdamW optimizer and automatic mixed-precision but does not provide specific version numbers for software libraries or dependencies (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The input resolution of image encoder is 224 × 224 and the maximum context length of text encoder is 77. ... We train our SoftCLIP using an AdamW (Loshchilov and Hutter 2017) optimizer and the cosine learning rate scheduler with a linear warm-up. Specifically, the learning rate linearly increases from 0 to the peak value within 10% of the total steps, and then decreases with a cosine anneal strategy. The weight decay rate of AdamW is set to 0.2. ... The models are trained from scratch for either 8 or 32 epochs in our experiments, i.e., 8 epochs for ablation and 32 epochs for comparison. ... the batch size is set to 2048, while with the image encoder ViT-B/16, the batch size is 1024. |
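
The learning rate schedule quoted in the Experiment Setup row (linear warm-up from 0 to the peak over the first 10% of steps, then cosine annealing) can be sketched as below. This is a minimal illustration, not the authors' code: the function name `lr_at_step`, the `min_lr` floor of 0, and the example peak value are assumptions, since the quoted text gives neither the peak learning rate nor the final value of the anneal.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.10, min_lr=0.0):
    """Learning rate at a 0-indexed optimizer step.

    Hypothetical helper: warmup_frac=0.10 follows the quoted setup;
    peak_lr and min_lr are NOT given in the quoted text and are assumptions.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))

    # Linear warm-up: 0 at step 0, peak_lr at step == warmup_steps.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps

    # Cosine anneal: progress runs from 0 (end of warm-up) to 1 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Illustrative values only; the peak learning rate here is a placeholder.
    total = 1000
    for s in (0, 50, 100, 550, 1000):
        print(s, lr_at_step(s, total, peak_lr=1e-3))
```

With these placeholder values the rate rises linearly over the first 100 steps, reaches the peak at step 100, and decays toward `min_lr` by the final step. The quoted weight decay of 0.2 belongs to the AdamW optimizer itself and is not part of this schedule.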