Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders

Authors: Bumsoo Kim, Jinhyung Kim, Yeonsik Jo, Seung Hwan Kim

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through our extensive experiments, we validate that there is a sweet spot between expedition and distillation where the partial view from the expedited online image encoder interacts complementarily with the momentum teacher. As a result, ECLIPSE outperforms its counterparts while achieving substantial acceleration in inference speed.
Researcher Affiliation | Industry | Bumsoo Kim*, Jinhyung Kim, Yeonsik Jo, Seung Hwan Kim, LG AI Research. *Correspondence to: bumsoo.kim@lgresearch.ai
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | For implementation details, our work is built on top of the open-source SLIP codebase (Mu et al. 2021). For DeCLIP (Li et al. 2022), we follow the implementation details of the official code release. The footnotes link to https://github.com/facebookresearch/SLIP and https://github.com/Sense-GVT/DeCLIP, which are external codebases, not the authors' own code for ECLIPSE.
Open Datasets | Yes | We pretrain ECLIPSE on large-scale open-source datasets, CC (Conceptual Captions) 3M (Sharma et al. 2018) and YFCC (Yahoo Flickr Creative Commons) 15M (Thomee et al. 2016).
Dataset Splits | No | The paper mentions pretraining on the CC3M and YFCC15M datasets and evaluating on downstream datasets, but it does not explicitly state training, validation, and test splits for the pretraining datasets.
Hardware Specification | Yes | All of our models are pretrained on 16 A100 GPUs.
Software Dependencies | No | The paper mentions building on the open-source SLIP codebase and following the official DeCLIP code release, but it does not specify version numbers for Python, PyTorch, CUDA, or other software libraries.
Experiment Setup | Yes | All models are pretrained on the CC3M dataset with a learning rate of 5e-4 for 40 epochs. We use κ = 0.7 for EViT with a ViT-B/16 backbone. We use m = 0.994 in our experiments.
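For reference, the momentum value m = 0.994 reported in the experiment setup corresponds to a standard exponential-moving-average (EMA) teacher update. The sketch below illustrates such an update in PyTorch; it is not the authors' released ECLIPSE code, the module names and wiring are illustrative assumptions, and the EViT token reduction with keep rate κ = 0.7 applied to the online encoder is not shown.

```python
# Minimal sketch (not the authors' ECLIPSE code) of a momentum-teacher update.
# Only m = 0.994 comes from the reported settings; everything else is assumed.
import copy
import torch

@torch.no_grad()
def update_momentum_teacher(online_encoder: torch.nn.Module,
                            teacher_encoder: torch.nn.Module,
                            m: float = 0.994) -> None:
    """EMA update: teacher <- m * teacher + (1 - m) * online."""
    for p_t, p_o in zip(teacher_encoder.parameters(), online_encoder.parameters()):
        p_t.mul_(m).add_(p_o, alpha=1.0 - m)

# Example wiring: the teacher starts as a frozen copy of the online encoder
# and is refreshed after every optimizer step.
online = torch.nn.Linear(8, 8)      # stand-in for the online image encoder
teacher = copy.deepcopy(online)
for p in teacher.parameters():
    p.requires_grad_(False)

# ... after each training step:
update_momentum_teacher(online, teacher, m=0.994)
```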