DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training
Authors: Wei Li, Linchao Zhu, Longyin Wen, Yi Yang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments show that DeCap outperforms other zero-shot captioning methods and unpaired captioning methods by a large margin on the typical image captioning benchmarks, i.e., MSCOCO and NoCaps. We apply DeCap to video captioning and achieve state-of-the-art zero-shot performance on MSR-VTT and ActivityNet-Captions. |
| Researcher Affiliation | Collaboration | Wei Li¹, Linchao Zhu¹, Longyin Wen², Yi Yang¹ — ¹CCAI, Zhejiang University; ²ByteDance Inc., San Jose, USA; {weili6,zhulinchao,yangyics}@zju.edu.cn, longyin.wen@bytedance.com |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/dhg-wei/DeCap. |
| Open Datasets | Yes | We consider three webly-collected corpora for DeCap training: (1) CC3M (Sharma et al., 2018) contains three million image-description pairs collected from the web. (2) SS1M is a webly-collected corpus specifically designed for MSCOCO caption. Feng et al. (2019)... (3) BookCorpus (Zhu et al., 2015) is a large collection of free novel books. |
| Dataset Splits | Yes | Table 1 shows the zero-shot results on the MSCOCO Karpathy-test split and the NoCaps validation set. |
| Hardware Specification | Yes | The experiment is conducted on a single Nvidia RTX2080Ti GPU. |
| Software Dependencies | No | The paper mentions "Optimizer AdamW (Loshchilov & Hutter, 2018)" but does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | Implementation Details. We employ a frozen pre-trained ViT-B/32 CLIP model. We adopt a 4-layer Transformer (Subramanian et al., 2018) with 4 attention heads as our language model. The size of the hidden state is 768. By default, we use all the text data in the training set to train the language model from scratch with a naive cross-entropy loss. All the text embeddings from the training corpus are stored in the support memory unless specified otherwise. At inference, the temperature τ in Eq. 2 is set to 1/150 in video captioning experiments, and 1/100 in image captioning experiments. (A sketch of the Eq. 2 projection follows the table.) |
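
The Experiment Setup row quotes the temperature τ used in the paper's Eq. 2, i.e., the projection of a CLIP image embedding onto the support memory of stored text embeddings at inference. Below is a minimal PyTorch sketch of that projection under our reading of the paper; the function and variable names (`project_to_text_space`, `support_memory`) are illustrative and are not taken from the official repository.

```python
import torch
import torch.nn.functional as F

def project_to_text_space(image_emb: torch.Tensor,
                          support_memory: torch.Tensor,
                          tau: float = 1.0 / 100) -> torch.Tensor:
    """Project a CLIP image embedding into the CLIP text embedding space.

    image_emb:      (d,)   L2-normalized CLIP image embedding
    support_memory: (N, d) L2-normalized CLIP text embeddings of the training corpus
    tau:            temperature (the paper reports 1/100 for image and 1/150 for video captioning)
    """
    # Cosine similarity between the image embedding and every stored text embedding.
    sims = support_memory @ image_emb              # (N,)
    # Softmax-weighted combination of the memory entries (Eq. 2 of the paper).
    weights = F.softmax(sims / tau, dim=0)         # (N,)
    projected = weights @ support_memory           # (d,)
    # The projected embedding is fed to the text decoder as its prefix.
    return F.normalize(projected, dim=0)

# Toy usage with random tensors standing in for real CLIP features.
if __name__ == "__main__":
    d, N = 512, 1000
    memory = F.normalize(torch.randn(N, d), dim=-1)
    image = F.normalize(torch.randn(d), dim=-1)
    prefix = project_to_text_space(image, memory)
    print(prefix.shape)  # torch.Size([512])
```

With τ around 0.01 the softmax is sharp, so the projection effectively blends a few nearest text embeddings from the support memory; this is how the text-only-trained decoder receives an input that lies in the text embedding space rather than the image embedding space at inference.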