S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions
Authors: Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jinwoo Shin
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | S-CLIP significantly enhances the training of CLIP using only a few image-text pairs, as demonstrated in various specialist domains, including remote sensing, fashion, scientific figures, and comics. For instance, S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark, matching the performance of supervised CLIP while using three times fewer image-text pairs. |
| Researcher Affiliation | Collaboration | Sangwoo Mo (KAIST, University of Michigan), Minkyu Kim (KAIST, KRAFTON), Kyungmin Lee (KAIST), Jinwoo Shin (KAIST) |
| Pseudocode | No | The paper includes conceptual illustrations of the method (Figure 2 and Figure 3) but does not provide any formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | Code: https://github.com/alinlab/s-clip |
| Open Datasets | Yes | We train vision-language models using the union of RSICD [27], UCM [32], and Sydney [78], named RS-ALL, following the setup of [64]. ... For semi-supervised learning, we subsample 10% of image-text pairs as labeled data (L), while the remaining 90% of images (L = U) or unlabeled images from the RESISC45 [79] dataset (L ≠ U) served as unlabeled data. |
| Dataset Splits | Yes | For semi-supervised learning, we subsample 10% of image-text pairs as labeled data (L), while the remaining 90% of images (L = U) or unlabeled images from the RESISC45 [79] dataset (L ≠ U) served as unlabeled data. ... We use a batch size of 64 per GPU, with a total of 4 GPUs. To ensure fair GPU memory usage in semi-supervised learning, we employ 32 image-caption pairs and 32 unpaired images for each mini-batch. ... We train all models until the performance saturates, which can vary over the number of image-text pairs. Specifically, for remote sensing, we train models for 25 epochs, for fashion 10 epochs, for scientific figures 5 epochs, and for comics datasets 10 epochs. (An illustrative split sketch follows the table.) |
| Hardware Specification | No | The paper states, "We use a batch size of 64 per GPU, with a total of 4 GPUs." However, it does not specify the model or type of GPU (e.g., NVIDIA A100, Tesla V100), nor does it mention CPU, memory, or cloud instance details. |
| Software Dependencies | No | The paper mentions using the "Open CLIP [77] library" and extracting keywords using the "YAKE [74] algorithm". However, it does not provide specific version numbers for these software components, which are necessary for reproducibility. |
| Experiment Setup | Yes | Setup. ... We use a batch size of 64 per GPU, with a total of 4 GPUs. To ensure fair GPU memory usage in semi-supervised learning, we employ 32 image-caption pairs and 32 unpaired images for each mini-batch. Models are evaluated on zero-shot classification and image-text retrieval tasks, measuring Top-1 classification accuracy (%) and recall at K (R@K). We report the average and standard deviation across three random seeds. We follow the training recipe of Open CLIP [77] if not specified. Additional experimental details are in Appendix C.2: The learning rate is set to 5e-5, and we apply the default cosine learning rate scheduling with a warmup period of the first 10 steps. We train all models until the performance saturates, which can vary over the number of image-text pairs. Specifically, for remote sensing, we train models for 25 epochs, for fashion 10 epochs, for scientific figures 5 epochs, and for comics datasets 10 epochs. (Illustrative training-recipe and R@K sketches follow the table.) |
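
The 10%/90% labeled/unlabeled split quoted in the Dataset Splits row can be illustrated with a minimal sketch. This is not the authors' code: the function name `split_labeled_unlabeled`, the `pairs` argument, and the fixed `seed` are illustrative assumptions; the repository linked above contains the actual data pipeline.

```python
# Minimal sketch (not the S-CLIP repository code) of the 10%/90% split:
# keep 10% of (image, caption) pairs as labeled data L and drop the captions
# of the remaining 90% to form the unlabeled pool for the L = U setting.
import random

def split_labeled_unlabeled(pairs, labeled_fraction=0.10, seed=0):
    rng = random.Random(seed)
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    n_labeled = int(len(indices) * labeled_fraction)
    labeled = [pairs[i] for i in indices[:n_labeled]]        # (image, caption) pairs
    unlabeled = [pairs[i][0] for i in indices[n_labeled:]]   # images only
    return labeled, unlabeled

# Example: 1,000 pairs -> 100 labeled pairs and 900 unlabeled images.
# For the L ≠ U setting, the unlabeled pool would instead be RESISC45 images.
```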
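
The per-GPU mini-batch composition and optimizer settings from the Experiment Setup row can likewise be sketched. This is a hypothetical reconstruction, assuming an AdamW optimizer (the Open CLIP default) and a hand-rolled linear-warmup/cosine schedule via `LambdaLR`; the dummy tensor datasets and the placeholder model stand in for the real image-caption pipeline and CLIP encoders.

```python
# Hedged sketch of the quoted recipe: 32 image-caption pairs plus 32 unpaired
# images per GPU (batch size 64, 4 GPUs), lr 5e-5, cosine schedule with a
# 10-step warmup. Dummy data and a placeholder model keep the example runnable.
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

PAIRED_PER_GPU, UNPAIRED_PER_GPU = 32, 32  # 64 samples per GPU in total

# Stand-ins for (image, tokenized caption) pairs and unlabeled images.
paired = TensorDataset(torch.randn(320, 3, 224, 224), torch.randint(0, 1000, (320, 77)))
unpaired = TensorDataset(torch.randn(2880, 3, 224, 224))
paired_loader = DataLoader(paired, batch_size=PAIRED_PER_GPU, shuffle=True)
unpaired_loader = DataLoader(unpaired, batch_size=UNPAIRED_PER_GPU, shuffle=True)

model = torch.nn.Linear(3 * 224 * 224, 512)                # placeholder for the CLIP encoders
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # lr from Appendix C.2

epochs = 25                                                # e.g. the remote-sensing setting
total_steps, warmup_steps = epochs * len(paired_loader), 10

def warmup_cosine(step):
    # Linear warmup over the first 10 steps, then cosine decay to zero.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
```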
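
Finally, a short sketch of the recall-at-K (R@K) retrieval metric named in the evaluation sentence, assuming cosine similarities between L2-normalized image and text embeddings with a one-to-one ground-truth pairing; the exact evaluation protocol is the one in the linked repository.

```python
# Hypothetical R@K computation: the fraction of queries whose ground-truth
# match appears among the top-K most similar candidates.
import torch

def recall_at_k(image_emb, text_emb, k=5):
    image_emb = torch.nn.functional.normalize(image_emb, dim=-1)
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    sims = image_emb @ text_emb.t()                     # (N, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices                 # top-K text indices per image
    targets = torch.arange(sims.size(0)).unsqueeze(1)   # i-th text matches i-th image
    return (topk == targets).any(dim=-1).float().mean().item()

# Example: recall_at_k(torch.randn(100, 512), torch.randn(100, 512), k=5)
```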