Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation
Authors: Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two mainstream image captioning datasets MSCOCO and Flickr30K demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works. |
| Researcher Affiliation | Academia | Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin Renmin University of China {yzihao, anwenhu, zhangliang00, qjin}@ruc.edu.cn |
| Pseudocode | Yes | We present pseudocode in Algorithm 1 to better understand SMILE, which can be implemented and applied in such a straightforward manner. |
| Open Source Code | Yes | https://github.com/yuezih/SMILE |
| Open Datasets | Yes | We evaluate our method on the two most commonly used image captioning benchmarks, MSCOCO [24] and Flickr30K [57]. |
| Dataset Splits | Yes | MSCOCO contains about 120K images, and we adopt the commonly used Karpathy splitting [18] with 5,000 images each for the validation and test sets. Flickr30K contains about 31K images, with 1,000 images each for the test and validation sets. |
| Hardware Specification | Yes | Each experiment involves fine-tuning the baseline model using a Low-Rank Adaptation (LoRA) approach [17] for parameter efficiency, employing 4 RTX A6000 nodes with a batch size of 8. |
| Software Dependencies | No | The paper mentions software like GPT-2, LLaMA-13B, and LMFlow framework, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | For all of our models optimized with SMILE, we choose the checkpoints according to the best self-retrieval performance on the validation set, which always occurs within 3 epochs. ... employing 4 RTX A6000 nodes with a batch size of 8. ... L_overall = λ · L_MLE + (1 − λ) · L_SMILE, λ ∈ [0, 1]. ... We design two implementations, First-token MLE and First-token Shifting, to achieve such restriction: |
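The overall training objective quoted in the setup row is a simple convex interpolation of the two loss terms. The sketch below illustrates only that interpolation; it does not implement SMILE itself. The loss values and the `overall_loss` helper are hypothetical placeholders, assuming both per-batch loss terms have already been computed elsewhere (e.g. by the paper's Algorithm 1):

```python
def overall_loss(l_mle: float, l_smile: float, lam: float) -> float:
    """Interpolate the two objectives: L_overall = lam * L_MLE + (1 - lam) * L_SMILE.

    lam in [0, 1]: lam = 1 recovers plain MLE training,
    lam = 0 trains with the SMILE term alone.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * l_mle + (1.0 - lam) * l_smile


# Illustrative values only (not from the paper):
l_mle, l_smile = 2.0, 4.0
print(overall_loss(l_mle, l_smile, 0.5))  # midpoint of the two terms
print(overall_loss(l_mle, l_smile, 1.0))  # pure MLE
```

In the paper's experiments λ is a tunable hyperparameter; the same scalar interpolation applies regardless of how each term is computed.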