DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training

Authors: Wei Li, Linchao Zhu, Longyin Wen, Yi Yang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments show that DeCap outperforms other zero-shot captioning methods and unpaired captioning methods by a large margin on the typical image captioning benchmarks, i.e., MSCOCO and NoCaps. We apply DeCap to video captioning and achieve state-of-the-art zero-shot performance on MSR-VTT and ActivityNet-Captions.
Researcher Affiliation | Collaboration | Wei Li¹, Linchao Zhu¹, Longyin Wen², Yi Yang¹; ¹CCAI, Zhejiang University; ²ByteDance Inc., San Jose, USA; {weili6,zhulinchao,yangyics}@zju.edu.cn, longyin.wen@bytedance.com
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/dhg-wei/DeCap.
Open Datasets | Yes | We consider three webly-collected corpora for DeCap training: (1) CC3M (Sharma et al., 2018) contains three million image-description pairs collected from the web. (2) SS1M is a webly-collected corpus specifically designed for MSCOCO caption. Feng et al. (2019)... (3) BookCorpus (Zhu et al., 2015) is a large collection of free novel books.
Dataset Splits | Yes | Table 1 shows the zero-shot results on the MSCOCO Karpathy-test split and the NoCaps validation set.
Hardware Specification | Yes | The experiment is conducted on a single Nvidia RTX 2080Ti GPU.
Software Dependencies | No | The paper mentions "Optimizer AdamW (Loshchilov & Hutter, 2018)" but does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | Implementation Details. We employ a frozen pre-trained ViT-B/32 CLIP model. We adopt a 4-layer Transformer (Subramanian et al., 2018) with 4 attention heads as our language model. The size of the hidden state is 768. By default, we use all the text data in the training set to train the language model from scratch with a naive cross-entropy loss. All the text embeddings from the training corpus are stored in the support memory unless specified otherwise. At inference, the temperature τ in Eq. 2 is set to 1/150 in video captioning experiments, and 1/100 in image captioning experiments.
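
The quoted setup refers to the support memory and the temperature τ in Eq. 2, i.e., the projection step at inference in which the CLIP image embedding is mapped into the text-embedding space as a temperature-scaled, softmax-weighted combination of the stored text embeddings. Below is a minimal PyTorch sketch of that step, not the authors' released code: the function name, memory size, and the random stand-in embeddings are illustrative assumptions, while the embedding dimension (512 for ViT-B/32 CLIP) and the image-captioning temperature of 1/100 follow the details quoted above.

```python
# Minimal sketch (assumed, not the official DeCap implementation) of the
# support-memory projection: a CLIP image embedding is projected into the
# CLIP text-embedding space as a softmax-weighted average of stored text
# embeddings, using the temperature described in the implementation details.
import torch
import torch.nn.functional as F


def project_to_text_space(image_emb: torch.Tensor,
                          support_memory: torch.Tensor,
                          tau: float = 1.0 / 100) -> torch.Tensor:
    """Project a CLIP image embedding into the text-embedding space.

    image_emb:      (d,)   L2-normalized CLIP image embedding
    support_memory: (N, d) L2-normalized CLIP text embeddings of the training corpus
    tau:            softmax temperature (1/100 for image captioning per the paper)
    """
    # Cosine similarity between the image embedding and every memory entry.
    sims = support_memory @ image_emb                # (N,)
    # Temperature-scaled softmax weights over the support memory.
    weights = F.softmax(sims / tau, dim=0)           # (N,)
    # Weighted average of the stored text embeddings, re-normalized.
    projected = weights @ support_memory             # (d,)
    return F.normalize(projected, dim=0)


if __name__ == "__main__":
    d, n = 512, 10_000                               # ViT-B/32 CLIP dim; illustrative memory size
    memory = F.normalize(torch.randn(n, d), dim=-1)  # stand-in for corpus text embeddings
    img = F.normalize(torch.randn(d), dim=0)         # stand-in for a CLIP image embedding
    text_like = project_to_text_space(img, memory)
    print(text_like.shape)                           # torch.Size([512])
```

In the paper's pipeline, the projected embedding then serves as the conditioning input to the 4-layer Transformer language model described above, which was trained on text embeddings only.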