S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions

Authors: Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jinwoo Shin

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | S-CLIP significantly enhances the training of CLIP using only a few image-text pairs, as demonstrated in various specialist domains, including remote sensing, fashion, scientific figures, and comics. For instance, S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark, matching the performance of supervised CLIP while using three times fewer image-text pairs.
Researcher Affiliation | Collaboration | Sangwoo Mo (1,2), Minkyu Kim (1,3), Kyungmin Lee (1), Jinwoo Shin (1); 1: KAIST, 2: University of Michigan, 3: KRAFTON
Pseudocode | No | The paper includes conceptual illustrations of the method (Figure 2 and Figure 3) but does not provide any formal pseudocode blocks or algorithms.
Open Source Code | Yes | Code: https://github.com/alinlab/s-clip
Open Datasets | Yes | "We train vision-language models using the union of RSICD [27], UCM [32], and Sydney [78], named RS-ALL, following the setup of [64]. ... For semi-supervised learning, we subsample 10% of image-text pairs as labeled data (L), while the remaining 90% of images (L = U) or unlabeled images from the RESISC45 [79] dataset (L ≠ U) served as unlabeled data."
Dataset Splits | Yes | "For semi-supervised learning, we subsample 10% of image-text pairs as labeled data (L), while the remaining 90% of images (L = U) or unlabeled images from the RESISC45 [79] dataset (L ≠ U) served as unlabeled data. ... We use a batch size of 64 per GPU, with a total of 4 GPUs. To ensure fair GPU memory usage in semi-supervised learning, we employ 32 image-caption pairs and 32 unpaired images for each mini-batch. ... We train all models until the performance saturates, which can vary over the number of image-text pairs. Specifically, for remote sensing, we train models for 25 epochs, for fashion 10 epochs, for scientific figures 5 epochs, and for comics datasets 10 epochs." (A minimal sketch of this labeled/unlabeled split is given after the table.)
Hardware Specification | No | The paper states, "We use a batch size of 64 per GPU, with a total of 4 GPUs." However, it does not specify the model or type of GPU (e.g., NVIDIA A100, Tesla V100), nor does it mention CPU, memory, or cloud instance details.
Software Dependencies | No | The paper mentions using the "Open CLIP [77] library" and extracting keywords using the "YAKE [74] algorithm". However, it does not provide version numbers for these software components, which are needed for reproducibility. (See the version-recording sketch after the table.)
Experiment Setup | Yes | "Setup. ... We use a batch size of 64 per GPU, with a total of 4 GPUs. To ensure fair GPU memory usage in semi-supervised learning, we employ 32 image-caption pairs and 32 unpaired images for each mini-batch. Models are evaluated on zero-shot classification and image-text retrieval tasks, measuring Top-1 classification accuracy (%) and recall at K (R@K). We report the average and standard deviation across three random seeds. We follow the training recipe of OpenCLIP [77] if not specified. Additional experimental details are in Appendix C.2: The learning rate is set to 5e-5, and we apply the default cosine learning rate scheduling with a warmup period of the first 10 steps. We train all models until the performance saturates, which can vary over the number of image-text pairs. Specifically, for remote sensing, we train models for 25 epochs, for fashion 10 epochs, for scientific figures 5 epochs, and for comics datasets 10 epochs." (A sketch of the stated learning-rate schedule follows the table.)
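
The 10% labeled / 90% unlabeled split quoted in the Dataset Splits row can be expressed in a few lines. This is only an illustrative sketch, not the authors' code: the `make_semi_supervised_split` helper and its arguments are hypothetical, and the released repository linked above is the authoritative implementation.

```python
import random

def make_semi_supervised_split(pairs, extra_images=None, labeled_frac=0.10, seed=0):
    """Split image-caption pairs into a small labeled set and an unlabeled image pool.

    pairs        : list of (image_path, caption) tuples, e.g. from RS-ALL
    extra_images : optional list of image paths from another dataset (e.g. RESISC45),
                   used for the L != U setting; if None, the L = U setting is used
    """
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)

    n_labeled = int(labeled_frac * len(shuffled))
    labeled = shuffled[:n_labeled]  # 10% kept as image-text pairs

    if extra_images is None:
        # L = U: captions of the remaining 90% are dropped; their images become unlabeled data
        unlabeled = [img for img, _ in shuffled[n_labeled:]]
    else:
        # L != U: unlabeled images come from a different dataset (e.g. RESISC45)
        unlabeled = list(extra_images)

    return labeled, unlabeled
```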
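
The Software Dependencies row notes that no versions are given for OpenCLIP or YAKE. One practical workaround when rerunning the experiments is to record the installed distribution versions at runtime. The sketch below uses only the standard library; the PyPI distribution names `open_clip_torch` and `yake` are assumptions, since the paper names the libraries but not the packages.

```python
from importlib.metadata import version, PackageNotFoundError

# Record the exact library versions used for a run, since the paper does not pin them.
for dist in ("torch", "open_clip_torch", "yake"):
    try:
        print(f"{dist}=={version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```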
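
The Experiment Setup row states a learning rate of 5e-5 with cosine scheduling after a 10-step warmup. The following sketch shows one way to reproduce that schedule in PyTorch; the AdamW optimizer, the `total_steps` value, and the stand-in model are assumptions, and the OpenCLIP training recipe cited by the paper should be preferred when available.

```python
import math
import torch

def cosine_with_warmup(optimizer, warmup_steps, total_steps):
    """Linear warmup for `warmup_steps`, then cosine decay to zero over the remaining steps."""
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(512, 512)  # stand-in for the CLIP model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # lr from the quoted setup
scheduler = cosine_with_warmup(optimizer, warmup_steps=10, total_steps=10_000)  # total_steps is an assumption

for step in range(100):  # training-loop skeleton; loss computation omitted
    optimizer.step()
    scheduler.step()
```

In a faithful rerun, each per-GPU mini-batch would additionally combine 32 image-caption pairs with 32 unpaired images, as quoted in the table above.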