Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation

Authors: Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on two mainstream image captioning datasets MSCOCO and Flickr30K demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works."
Researcher Affiliation | Academia | "Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin — Renmin University of China, {yzihao, anwenhu, zhangliang00, qjin}@ruc.edu.cn"
Pseudocode | Yes | "We present pseudocode in Algorithm 1 to better understand SMILE, which can be implemented and applied in such a straightforward manner."
Open Source Code | Yes | https://github.com/yuezih/SMILE
Open Datasets | Yes | "We evaluate our method on the two most commonly used image captioning benchmarks, MSCOCO [24] and Flickr30K [57]."
Dataset Splits | Yes | "MSCOCO contains about 120K images, and we adopt the commonly used Karpathy split [18] with 5,000 images each for the validation and test sets. Flickr30K contains about 31K images, with 1,000 images each for the test and validation sets."
Hardware Specification | Yes | "Each experiment involves fine-tuning the baseline model using a Low-Rank Adaptation (LoRA) approach [17] for parameter efficiency, employing 4 RTX A6000 nodes with a batch size of 8."
Software Dependencies | No | The paper mentions software such as GPT-2, LLaMA-13B, and the LMFlow framework, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | "For all of our models optimized with SMILE, we choose the checkpoints according to the best self-retrieval performance on the validation set, which always occurs within 3 epochs. ... employing 4 RTX A6000 nodes with a batch size of 8. ... L_overall = λ·L_MLE + (1 − λ)·L_SMILE, λ ∈ [0, 1]. ... We design two implementations, First-token MLE and First-token Shifting, to achieve such restriction."
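The hardware row notes that fine-tuning uses LoRA [17] for parameter efficiency. As a minimal, framework-free sketch of that idea (not the authors' implementation — all names, shapes, and values below are illustrative), LoRA freezes the base weight W and learns a low-rank update scaled by alpha / r, so the effective weight is W + (alpha / r) · A · B:

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """y = x @ W + (alpha / r) * (x @ A) @ B, leaving W untouched.

    W: frozen (d_in x d_out) base weight
    A: trainable (d_in x r) down-projection
    B: trainable (r x d_out) up-projection
    """
    frozen = matmul(x, W)                  # frozen path through W
    delta = matmul(matmul(x, A), B)        # low-rank trainable path
    scale = alpha / r
    return [[f + scale * d for f, d in zip(fr, dr)]
            for fr, dr in zip(frozen, delta)]

# Tiny example: d_in = d_out = 2, rank r = 1.
x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen identity weight
A = [[1.0], [1.0]]             # down-projection to rank 1
B = [[0.5, 0.5]]               # up-projection back to d_out
print(lora_forward(x, W, A, B, alpha=1.0, r=1))  # → [[2.5, 3.5]]
```

Only A and B receive gradients during fine-tuning, which is why the paper can train on modest hardware (4 RTX A6000 nodes) with a small batch size.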
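The experiment-setup row quotes the overall training objective L_overall = λ·L_MLE + (1 − λ)·L_SMILE. A minimal sketch of this convex combination — the SMILE loss itself is defined in the paper and is passed in here as a plain scalar, not reimplemented:

```python
def overall_loss(l_mle, l_smile, lam):
    """Convex combination of the standard MLE loss and the SMILE loss:
    L_overall = lam * L_MLE + (1 - lam) * L_SMILE, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * l_mle + (1.0 - lam) * l_smile

# lam = 1 recovers pure MLE training; lam = 0 trains with SMILE alone.
print(overall_loss(2.0, 4.0, lam=0.5))  # → 3.0
```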