Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation
Authors: Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two mainstream image captioning datasets MSCOCO and Flickr30K demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works. |
| Researcher Affiliation | Academia | Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin; Renmin University of China |
| Pseudocode | Yes | We present pseudocode in Algorithm 1 to better understand SMILE, which can be implemented and applied in such a straightforward manner. |
| Open Source Code | Yes | https://github.com/yuezih/SMILE |
| Open Datasets | Yes | We evaluate our method on the two most commonly used image captioning benchmarks, MSCOCO [24] and Flickr30K [57]. |
| Dataset Splits | Yes | MSCOCO contains about 120K images, and we adopt the commonly used Karpathy splitting [18] with 5,000 images each for the validation and test sets. Flickr30K contains about 31K images, with 1,000 images each for the test and validation sets. |
| Hardware Specification | Yes | Each experiment involves fine-tuning the baseline model using a Low-Rank Adaptation (LoRA) approach [17] for parameter efficiency, employing 4 RTX A6000 nodes with a batch size of 8. |
| Software Dependencies | No | The paper mentions software like GPT-2, LLaMA-13B, and LMFlow framework, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | For all of our models optimized with SMILE, we choose the checkpoints according to the best self-retrieval performance on the validation set, which always occurs within 3 epochs. ... employing 4 RTX A6000 nodes with a batch size of 8. ... $\mathcal{L}_{\text{overall}} = \lambda \mathcal{L}_{\text{MLE}} + (1 - \lambda) \mathcal{L}_{\text{SMILE}}$, $\lambda \in [0, 1]$. ... We design two implementations, First-token MLE and First-token Shifting, to achieve such restriction: |
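
The overall objective quoted in the Experiment Setup row is a convex combination of the MLE and SMILE losses. The following is a minimal sketch of that weighting only, not the authors' implementation: the function name `combine_losses` and the numeric loss values are hypothetical, and the two loss terms are assumed to have already been computed as scalars elsewhere.

```python
def combine_losses(mle_loss: float, smile_loss: float, lam: float) -> float:
    """Return lam * L_MLE + (1 - lam) * L_SMILE, with lam restricted to [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError(f"lam must lie in [0, 1], got {lam}")
    return lam * mle_loss + (1.0 - lam) * smile_loss


# Hypothetical per-batch loss values, chosen only for illustration.
mle_loss, smile_loss = 2.0, 4.0
for lam in (0.0, 0.5, 1.0):
    # lam = 0 uses only the SMILE term; lam = 1 uses only the MLE term.
    print(lam, combine_losses(mle_loss, smile_loss, lam))
```

Under this weighting, λ = 1 recovers plain MLE training and λ = 0 optimizes the SMILE term alone; intermediate values interpolate between the two.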