Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

RECLIP: Resource-efficient CLIP by Training with Small Images

Authors: Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental Using the same batch size and training epoch, RECLIP achieves highly competitive zero-shot classification and image-text retrieval accuracy with 6 to 8× less computational resources and 7 to 9× fewer FLOPs than the baseline. Compared to the state-of-the-art contrastive learning methods, RECLIP demonstrates 5 to 59× training resource savings while maintaining highly competitive zero-shot classification and retrieval performance. Finally, RECLIP matches the state of the art in transfer learning to open-vocabulary detection tasks, achieving 32 APr on LVIS.
Researcher Affiliation Collaboration Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo — UC Riverside; Google DeepMind
Pseudocode No The paper describes the RECLIP training pipeline with two phases (low-resolution main training and high-resolution finetuning) and provides equations for computational complexity. However, it does not include any clearly labeled pseudocode or algorithm blocks.
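The paper's exact complexity equations are not reproduced in this report, but the core scaling argument behind training with small images can be sketched: a ViT splits an H×H image into (H/p)² patch tokens, and self-attention cost grows roughly quadratically with the token count. The snippet below is an illustrative sketch of that relationship, not the paper's equations; the function names and the 64-pixel example size are assumptions for illustration.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens for a square image (ignoring the CLS token)."""
    return (image_size // patch_size) ** 2


def attention_flops_ratio(small: int, large: int, patch: int = 16) -> float:
    """Rough ratio of self-attention FLOPs (~ tokens^2) between two image sizes."""
    return (num_tokens(large, patch) ** 2) / (num_tokens(small, patch) ** 2)


print(num_tokens(224, 16))             # 196 tokens at 224x224
print(num_tokens(64, 16))              # 16 tokens at 64x64
print(attention_flops_ratio(64, 224))  # ~150x more attention FLOPs at 224x224
```

This quadratic gap is why a low-resolution main phase followed by a short high-resolution finetuning phase can cut total training cost so sharply.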
Open Source Code No The paper states: 'We hope this work will pave the path for the broader research community to explore language supervised pretraining in resource-friendly settings.' However, it does not provide any explicit statements about releasing source code, a repository link, or mention of code in supplementary materials for the methodology described.
Open Datasets Yes Following existing works (Radford et al., 2021; Li et al., 2022b; Yu et al., 2022), we evaluate RECLIP on zero-shot image and text retrieval on Flickr30K (Plummer et al., 2015) and MSCOCO (Chen et al., 2015) test sets, and zero-shot image classification on ImageNet (Deng et al., 2009), ImageNet-A (Hendrycks et al., 2021b), ImageNet-R (Hendrycks et al., 2021a), ImageNet-V2 (Recht et al., 2019) and ImageNet-Sketch (Wang et al., 2019) datasets. ... We use the English subset of the WebLI dataset (Chen et al., 2022b) for training. ... We conduct evaluation on the LVIS dataset (Gupta et al., 2019) by using RECLIP for open-vocabulary detection.
Dataset Splits Yes Following existing works (Radford et al., 2021; Li et al., 2022b; Yu et al., 2022), we evaluate RECLIP on zero-shot image and text retrieval on Flickr30K (Plummer et al., 2015) and MSCOCO (Chen et al., 2015) test sets, and zero-shot image classification on ImageNet (Deng et al., 2009), ImageNet-A (Hendrycks et al., 2021b), ImageNet-R (Hendrycks et al., 2021a), ImageNet-V2 (Recht et al., 2019) and ImageNet-Sketch (Wang et al., 2019) datasets. ... Zero-shot image-text retrieval results are averaged from image-to-text and text-to-image Recall@1 on two benchmark datasets, Flickr30K (Plummer et al., 2015) and MSCOCO (Chen et al., 2015). ... We train only on the LVIS base categories (frequent & common) and test on both the base and novel (rare) categories following the protocol of ViLD (Gu et al., 2022).
Hardware Specification Yes Our training is run on TPU-v3 infrastructure. Compared to general-purpose GPU devices, TPUs are specifically designed for large matrix operations commonly used in neural networks. Each TPU v3 device has 16GB high-bandwidth memory per core, which is comparable to that of a V100 and suitable for synchronous large-scale training.
Software Dependencies No The paper mentions using an Adafactor optimizer and a ViT-Large backbone, but it does not specify any software versions (e.g., Python, PyTorch, or TensorFlow versions, or other libraries with their versions).
Experiment Setup Yes We use a starting learning rate of 0.001, and train for 250k and 550k steps with linear LR decay using an Adafactor optimizer. We set weight decay to 0.01 and batch size to 16384. The batch size is chosen to be a multiple of 1024 and the model feature dimension (e.g. 4096) a multiple of 128, so that TPU padding would not occur on the sequence dimension. A short LR warmup of 2500 steps is used. Our high-resolution finetuning schedule starts with a learning rate of 1e-4 with 5000 steps LR warmup, and decays linearly over a total schedule of 20k or 50k iterations. We use an image size of 224 or 448 for finetuning.
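The schedule quoted above combines a short linear warmup with linear decay to zero. A minimal sketch of that schedule is below, using the main-phase numbers from the quote (base LR 0.001, 2500 warmup steps, 250k total steps); this is an assumed reconstruction from the description, not the authors' implementation.

```python
def learning_rate(step: int,
                  base_lr: float = 1e-3,
                  warmup_steps: int = 2500,
                  total_steps: int = 250_000) -> float:
    """Linear warmup to base_lr, then linear decay to zero.

    A sketch of the schedule described in the paper, not the authors' code.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * (1.0 - frac)


print(learning_rate(0))        # 0.0
print(learning_rate(2500))     # 0.001 (end of warmup)
print(learning_rate(250_000))  # 0.0 (end of schedule)
```

The finetuning phase would use the same shape with `base_lr=1e-4`, `warmup_steps=5000`, and `total_steps` of 20k or 50k.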