Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation

Authors: Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin

NeurIPS 2023 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Extensive experiments on two mainstream image captioning datasets, MSCOCO and Flickr30K, demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works." |
| Researcher Affiliation | Academia | "Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin. Renmin University of China. EMAIL" |
| Pseudocode | Yes | "We present pseudocode in Algorithm 1 to better understand SMILE, which can be implemented and applied in such a straightforward manner." |
| Open Source Code | Yes | https://github.com/yuezih/SMILE |
| Open Datasets | Yes | "We evaluate our method on the two most commonly used image captioning benchmarks, MSCOCO [24] and Flickr30K [57]." |
| Dataset Splits | Yes | "MSCOCO contains about 120K images, and we adopt the commonly used Karpathy split [18] with 5,000 images each for the validation and test sets. Flickr30K contains about 31K images, with 1,000 images each for the test and validation sets." |
| Hardware Specification | Yes | "Each experiment involves fine-tuning the baseline model using a Low-Rank Adaptation (LoRA) approach [17] for parameter efficiency, employing 4 RTX A6000 nodes with a batch size of 8." |
| Software Dependencies | No | The paper mentions software such as GPT-2, LLaMA-13B, and the LMFlow framework, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | "For all of our models optimized with SMILE, we choose the checkpoints according to the best self-retrieval performance on the validation set, which always occurs within 3 epochs. ... employing 4 RTX A6000 nodes with a batch size of 8. ... L_overall = λ·L_MLE + (1 − λ)·L_SMILE, λ ∈ [0, 1]. ... We design two implementations, First-token MLE and First-token Shifting, to achieve such restriction." |
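The training objective quoted in the Experiment Setup row linearly interpolates the standard MLE loss with the SMILE loss. A minimal sketch in plain Python, assuming `l_mle` and `l_smile` are precomputed scalar loss values (the SMILE loss itself is not reproduced here):

```python
def overall_loss(l_mle: float, l_smile: float, lam: float) -> float:
    """Combine the two losses as L_overall = lam * L_MLE + (1 - lam) * L_SMILE.

    lam = 1 recovers standard MLE training; lam = 0 trains on SMILE alone.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * l_mle + (1.0 - lam) * l_smile
```

In a real training loop these scalars would be per-batch tensor losses, but the interpolation itself is the same weighted sum.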