Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
Authors: Longtian Qiu, Shan Ning, Xuming He
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k and VQAV2, we show that our method achieves remarkable performance improvements. |
| Researcher Affiliation | Academia | ShanghaiTech University, Shanghai, China; Shanghai Engineering Research Center of Intelligent Vision and Imaging. {qiult, ningshan2022, hexm}@shanghaitech.edu.cn |
| Pseudocode | No | The paper describes the methods using mathematical formulations and textual descriptions, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/Artanic30/MacCap. |
| Open Datasets | Yes | Our experimental evaluations are performed on two common benchmark datasets for image captioning: MSCOCO (Vinyals et al. 2016) and Flickr30k (Plummer et al. 2015). ... Additionally, we use the popular visual question answering benchmark VQAV2 (Antol et al. 2015). ... For training, we use the texts from MSCOCO, Flickr30K, and CC3M (Changpinyo et al. 2021b) datasets. |
| Dataset Splits | Yes | We follow previous works (Nukrai, Mokady, and Globerson 2022) using the widely-used Karpathy et al. split, which partitions the dataset into 5,000 images for validation and 5,000 for testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for training or inference, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using a 'frozen ViT-B/32 CLIP model' and a 'frozen pre-trained OPT (Zhang et al. 2022b) 1.3B model' but does not specify software dependencies with version numbers (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | In text reconstruction training, we set the noise variance to 0.016 as suggested in (Nukrai, Mokady, and Globerson 2022), and the region concept feature length is set to 10. In caption generation, the sampling number in inference is set to 20. The text generation strategy is beam search with 4 beams. ... Our model and the reproduced baseline are trained with a batch size of 128 and a learning rate of 4e-4. |
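
For concreteness, the hyperparameters quoted in the Experiment Setup row correspond to a standard text-only training recipe. The sketch below illustrates the CapDec-style noise-injection step (Nukrai, Mokady, and Globerson 2022) that the paper builds on: a frozen CLIP text embedding is perturbed with Gaussian noise so it stands in for the unseen image embedding at training time. This is a minimal sketch under those assumptions; `clip_text_encoder` and the variable names are illustrative, not taken from the MacCap codebase.

```python
import torch

# Hyperparameters quoted in the paper (how they are wired in is assumed here).
NOISE_VARIANCE = 0.016   # variance of the injected Gaussian noise
BATCH_SIZE = 128
LEARNING_RATE = 4e-4

def noisy_text_embedding(clip_text_encoder, tokens):
    """Encode text with a frozen CLIP text encoder, then inject noise.

    Adding zero-mean Gaussian noise to the text embedding approximates
    the (unavailable) image embedding during text-only training.
    `clip_text_encoder` is a hypothetical frozen encoder module.
    """
    with torch.no_grad():                        # CLIP weights stay frozen
        emb = clip_text_encoder(tokens)          # (batch, dim) embeddings
    noise = torch.randn_like(emb) * NOISE_VARIANCE ** 0.5  # std = sqrt(var)
    return emb + noise

# The trainable mapping into the frozen OPT decoder (a hypothetical `model`)
# would then be optimized with the quoted batch size and learning rate, e.g.:
# optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
```

At inference, the reported decoding strategy (beam search with 4 beams) maps naturally onto `generate(num_beams=4)` in Hugging Face Transformers, assuming the OPT decoder is driven through that library.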