Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
Authors: Longtian Qiu, Shan Ning, Xuming He
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k and VQAV2, we show that our method achieves remarkable performance improvements. |
| Researcher Affiliation | Academia | ShanghaiTech University, Shanghai, China; Shanghai Engineering Research Center of Intelligent Vision and Imaging. {qiult, ningshan2022, hexm}@shanghaitech.edu.cn |
| Pseudocode | No | The paper describes the methods using mathematical formulations and textual descriptions, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/Artanic30/MacCap. |
| Open Datasets | Yes | Our experimental evaluations are performed on two common benchmark datasets for image captioning: MSCOCO (Vinyals et al. 2016) and Flickr30k (Plummer et al. 2015). ... Additionally, we use the popular visual question answering benchmark VQAV2 (Antol et al. 2015). ... For training, we use the texts from MSCOCO, Flickr30K, and CC3M (Changpinyo et al. 2021b) datasets. |
| Dataset Splits | Yes | We follow previous works (Nukrai, Mokady, and Globerson 2022) using the widely-used Karpathy et al. split, which partitions the dataset into 5,000 images for validation and 5,000 for testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for training or inference, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using a 'frozen ViT-B/32 CLIP model' and a 'frozen pre-trained OPT (Zhang et al. 2022b) 1.3B model' but does not specify software dependencies with version numbers (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | In text reconstruction training, we set the noise variance to 0.016 as suggested in (Nukrai, Mokady, and Globerson 2022), and the region concept feature length is set to 10. In caption generation, the sampling number in inference is set to 20. The text generation strategy is beam search with 4 beams. ... Our model and the reproduced baseline are trained with a batch size of 128 and a learning rate of 4e-4. |
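
For concreteness, the hyperparameters quoted in the Experiment Setup row correspond to a standard text-only training recipe. The sketch below illustrates the CapDec-style noise-injection step (Nukrai, Mokady, and Globerson 2022) that the paper builds on: a frozen CLIP text embedding is perturbed with Gaussian noise so it stands in for the unseen image embedding at training time. This is a minimal sketch under those assumptions; `clip_text_encoder` and the variable names are illustrative, not taken from the MacCap codebase.

```python
import torch

# Hyperparameters quoted in the paper (how they are wired in is assumed here).
NOISE_VARIANCE = 0.016   # variance of the injected Gaussian noise
BATCH_SIZE = 128
LEARNING_RATE = 4e-4

def noisy_text_embedding(clip_text_encoder, tokens):
    """Encode text with a frozen CLIP text encoder, then inject noise.

    Adding zero-mean Gaussian noise to the text embedding approximates
    the (unavailable) image embedding during text-only training.
    `clip_text_encoder` is a hypothetical frozen encoder module.
    """
    with torch.no_grad():                        # CLIP weights stay frozen
        emb = clip_text_encoder(tokens)          # (batch, dim) embeddings
    noise = torch.randn_like(emb) * NOISE_VARIANCE ** 0.5  # std = sqrt(var)
    return emb + noise

# The trainable mapping into the frozen OPT decoder (a hypothetical `model`)
# would then be optimized with the quoted batch size and learning rate, e.g.:
# optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
```

At inference, the reported decoding strategy (beam search with 4 beams) maps naturally onto `generate(num_beams=4)` in Hugging Face Transformers, assuming the OPT decoder is driven through that library.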