Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation

Authors: Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on two mainstream image captioning datasets MSCOCO and Flickr30K demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works."
Researcher Affiliation | Academia | "Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin — Renmin University of China, {yzihao, anwenhu, zhangliang00, qjin}@ruc.edu.cn"
Pseudocode | Yes | "We present pseudocode in Algorithm 1 to better understand SMILE, which can be implemented and applied in such a straightforward manner."
Open Source Code | Yes | https://github.com/yuezih/SMILE
Open Datasets | Yes | "We evaluate our method on the two most commonly used image captioning benchmarks, MSCOCO [24] and Flickr30K [57]."
Dataset Splits | Yes | "MSCOCO contains about 120K images, and we adopt the commonly used Karpathy split [18] with 5,000 images each for the validation and test sets. Flickr30K contains about 31K images, with 1,000 images each for the test and validation sets."
Hardware Specification | Yes | "Each experiment involves fine-tuning the baseline model using a Low-Rank Adaptation (LoRA) approach [17] for parameter efficiency, employing 4 RTX A6000 nodes with a batch size of 8."
Software Dependencies | No | The paper mentions software such as GPT-2, LLaMA-13B, and the LMFlow framework, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | "For all of our models optimized with SMILE, we choose the checkpoints according to the best self-retrieval performance on the validation set, which always occurs within 3 epochs. ... employing 4 RTX A6000 nodes with a batch size of 8. ... L_overall = λ·L_MLE + (1 − λ)·L_SMILE, λ ∈ [0, 1]. ... We design two implementations, First-token MLE and First-token Shifting, to achieve such restriction."
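The hardware row notes that fine-tuning uses LoRA [17] for parameter efficiency. As a minimal, framework-free sketch of that idea (not the authors' implementation — all names, shapes, and values below are illustrative), LoRA freezes the base weight W and learns a low-rank update scaled by alpha / r, so the effective weight is W + (alpha / r) · A · B:

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """y = x @ W + (alpha / r) * (x @ A) @ B, leaving W untouched.

    W: frozen (d_in x d_out) base weight
    A: trainable (d_in x r) down-projection
    B: trainable (r x d_out) up-projection
    """
    frozen = matmul(x, W)                  # frozen path through W
    delta = matmul(matmul(x, A), B)        # low-rank trainable path
    scale = alpha / r
    return [[f + scale * d for f, d in zip(fr, dr)]
            for fr, dr in zip(frozen, delta)]

# Tiny example: d_in = d_out = 2, rank r = 1.
x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen identity weight
A = [[1.0], [1.0]]             # down-projection to rank 1
B = [[0.5, 0.5]]               # up-projection back to d_out
print(lora_forward(x, W, A, B, alpha=1.0, r=1))  # → [[2.5, 3.5]]
```

Only A and B receive gradients during fine-tuning, which is why the paper can train on modest hardware (4 RTX A6000 nodes) with a small batch size.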
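The experiment-setup row quotes the overall training objective L_overall = λ·L_MLE + (1 − λ)·L_SMILE. A minimal sketch of this convex combination — the SMILE loss itself is defined in the paper and is passed in here as a plain scalar, not reimplemented:

```python
def overall_loss(l_mle, l_smile, lam):
    """Convex combination of the standard MLE loss and the SMILE loss:
    L_overall = lam * L_MLE + (1 - lam) * L_SMILE, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * l_mle + (1.0 - lam) * l_smile

# lam = 1 recovers pure MLE training; lam = 0 trains with SMILE alone.
print(overall_loss(2.0, 4.0, lam=0.5))  # → 3.0
```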