S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions

Authors: Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jinwoo Shin

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | S-CLIP significantly enhances the training of CLIP using only a few image-text pairs, as demonstrated in various specialist domains, including remote sensing, fashion, scientific figures, and comics. For instance, S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark, matching the performance of supervised CLIP while using three times fewer image-text pairs.
Researcher Affiliation | Collaboration | Sangwoo Mo (1,2), Minkyu Kim (1,3), Kyungmin Lee (1), Jinwoo Shin (1); 1: KAIST, 2: University of Michigan, 3: KRAFTON
Pseudocode | No | The paper includes conceptual illustrations of the method (Figure 2 and Figure 3) but does not provide any formal pseudocode blocks or algorithms.
Open Source Code | Yes | Code: https://github.com/alinlab/s-clip
Open Datasets | Yes | "We train vision-language models using the union of RSICD [27], UCM [32], and Sydney [78], named RS-ALL, following the setup of [64]. ... For semi-supervised learning, we subsample 10% of image-text pairs as labeled data (L), while the remaining 90% of images (L = U) or unlabeled images from the RESISC45 [79] dataset (L ≠ U) served as unlabeled data."
Dataset Splits | Yes | "For semi-supervised learning, we subsample 10% of image-text pairs as labeled data (L), while the remaining 90% of images (L = U) or unlabeled images from the RESISC45 [79] dataset (L ≠ U) served as unlabeled data. ... We use a batch size of 64 per GPU, with a total of 4 GPUs. To ensure fair GPU memory usage in semi-supervised learning, we employ 32 image-caption pairs and 32 unpaired images for each mini-batch. ... We train all models until the performance saturates, which can vary over the number of image-text pairs. Specifically, for remote sensing, we train models for 25 epochs, for fashion 10 epochs, for scientific figures 5 epochs, and for comics datasets 10 epochs." (A minimal sketch of this labeled/unlabeled split is given after the table.)
Hardware Specification | No | The paper states, "We use a batch size of 64 per GPU, with a total of 4 GPUs." However, it does not specify the model or type of GPU (e.g., NVIDIA A100, Tesla V100), nor does it mention CPU, memory, or cloud instance details.
Software Dependencies | No | The paper mentions using the "Open CLIP [77] library" and extracting keywords using the "YAKE [74] algorithm". However, it does not provide version numbers for these software components, which are needed for reproducibility. (See the version-recording sketch after the table.)
Experiment Setup | Yes | "Setup. ... We use a batch size of 64 per GPU, with a total of 4 GPUs. To ensure fair GPU memory usage in semi-supervised learning, we employ 32 image-caption pairs and 32 unpaired images for each mini-batch. Models are evaluated on zero-shot classification and image-text retrieval tasks, measuring Top-1 classification accuracy (%) and recall at K (R@K). We report the average and standard deviation across three random seeds. We follow the training recipe of OpenCLIP [77] if not specified. Additional experimental details are in Appendix C.2: The learning rate is set to 5e-5, and we apply the default cosine learning rate scheduling with a warmup period of the first 10 steps. We train all models until the performance saturates, which can vary over the number of image-text pairs. Specifically, for remote sensing, we train models for 25 epochs, for fashion 10 epochs, for scientific figures 5 epochs, and for comics datasets 10 epochs." (A sketch of the stated learning-rate schedule follows the table.)
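
The 10% labeled / 90% unlabeled split quoted in the Dataset Splits row can be expressed in a few lines. This is only an illustrative sketch, not the authors' code: the `make_semi_supervised_split` helper and its arguments are hypothetical, and the released repository linked above is the authoritative implementation.

```python
import random

def make_semi_supervised_split(pairs, extra_images=None, labeled_frac=0.10, seed=0):
    """Split image-caption pairs into a small labeled set and an unlabeled image pool.

    pairs        : list of (image_path, caption) tuples, e.g. from RS-ALL
    extra_images : optional list of image paths from another dataset (e.g. RESISC45),
                   used for the L != U setting; if None, the L = U setting is used
    """
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)

    n_labeled = int(labeled_frac * len(shuffled))
    labeled = shuffled[:n_labeled]  # 10% kept as image-text pairs

    if extra_images is None:
        # L = U: captions of the remaining 90% are dropped; their images become unlabeled data
        unlabeled = [img for img, _ in shuffled[n_labeled:]]
    else:
        # L != U: unlabeled images come from a different dataset (e.g. RESISC45)
        unlabeled = list(extra_images)

    return labeled, unlabeled
```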
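
The Software Dependencies row notes that no versions are given for OpenCLIP or YAKE. One practical workaround when rerunning the experiments is to record the installed distribution versions at runtime. The sketch below uses only the standard library; the PyPI distribution names `open_clip_torch` and `yake` are assumptions, since the paper names the libraries but not the packages.

```python
from importlib.metadata import version, PackageNotFoundError

# Record the exact library versions used for a run, since the paper does not pin them.
for dist in ("torch", "open_clip_torch", "yake"):
    try:
        print(f"{dist}=={version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```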
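
The Experiment Setup row states a learning rate of 5e-5 with cosine scheduling after a 10-step warmup. The following sketch shows one way to reproduce that schedule in PyTorch; the AdamW optimizer, the `total_steps` value, and the stand-in model are assumptions, and the OpenCLIP training recipe cited by the paper should be preferred when available.

```python
import math
import torch

def cosine_with_warmup(optimizer, warmup_steps, total_steps):
    """Linear warmup for `warmup_steps`, then cosine decay to zero over the remaining steps."""
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(512, 512)  # stand-in for the CLIP model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # lr from the quoted setup
scheduler = cosine_with_warmup(optimizer, warmup_steps=10, total_steps=10_000)  # total_steps is an assumption

for step in range(100):  # training-loop skeleton; loss computation omitted
    optimizer.step()
    scheduler.step()
```

In a faithful rerun, each per-GPU mini-batch would additionally combine 32 image-caption pairs with 32 unpaired images, as quoted in the table above.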