Efficient Image Captioning for Edge Devices

Authors: Ning Wang, Jiangrong Xie, Hang Luo, Qinglin Cheng, Jihao Wu, Mingbo Jia, Linlin Li

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In spite of the low capacity, our model still exhibits state-of-the-art performance on prevalent datasets, e.g., 136.6 CIDEr on COCO Karpathy test split. Testing on the smartphone with only a single CPU, the proposed LightCap exhibits a fast inference speed of 188ms per image, which is ready for practical applications." (A sketch of how such single-CPU latency can be measured follows the table.)
Researcher Affiliation | Collaboration | Huawei Inc.; wn6149@mail.ustc.edu.cn, xiexjr@foxmail.com, {lhjeremy, qlincheng}@outlook.com, {wujihao, jiamingbo, lynn.lilinlin}@huawei.com
Pseudocode | No | The paper describes its methods in text and figures but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about providing open-source code or a link to a code repository.
Open Datasets | Yes | "In the experiments, we collect the image-text pairs from Google Conceptual Captions (CC3M) (Sharma et al. 2018), SBU Captions (Ordonez, Kulkarni, and Berg 2011), Open Images (Shao et al. 2019), and MS-COCO (Lin et al. 2014) to form the pre-training data."
Dataset Splits | Yes | "We evaluate the proposed method on the COCO caption of Karpathy split (Lin et al. 2014) and nocaps validation dataset (Agrawal et al. 2019)." (A sketch of loading the Karpathy split follows the table.)
Hardware Specification | Yes | "Then, we test the inference latency of LightCap model on Huawei P40 smartphone with a Kirin 990 chip."
Software Dependencies | No | The paper mentions software components such as 'TinyBERT4' and 'CLIP model (ResNet-50 version)' but does not provide specific version numbers for these or other software dependencies required for replication.
Experiment Setup | Yes | "The input image resolution is 224×224. This alignment module only contains two linear blocks (2048×1024 and 1024×1024) and is trained for 60 epochs with a learning rate of 1×10⁻⁵. In the pre-training stage, the fusion model is trained 1.0M steps with a learning rate of 5×10⁻⁵ and batch size of 512. In the fine-tuning stage, the fusion model is trained 120 epochs with a learning rate of 3×10⁻⁵." (A sketch of the alignment module follows below.)
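
The 188ms-per-image figure cited in the Research Type and Hardware Specification rows was measured on a Kirin 990 smartphone CPU, but the paper publishes no benchmarking code. Below is a minimal sketch of how single-thread CPU latency is typically measured in PyTorch; the stand-in model and iteration counts are assumptions for illustration, not the authors' setup.

```python
# Minimal sketch: timing single-image, single-thread CPU inference.
# The model below is a placeholder; LightCap itself is not public.
import time
import torch

torch.set_num_threads(1)           # restrict to a single CPU thread, as in the paper's test
model = torch.nn.Sequential(       # stand-in for the captioning model
    torch.nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(64, 1000),
).eval()

x = torch.randn(1, 3, 224, 224)    # 224x224 input resolution, per the paper

with torch.inference_mode():
    for _ in range(5):             # warm-up runs, excluded from timing
        model(x)
    n = 50
    t0 = time.perf_counter()
    for _ in range(n):
        model(x)
    print(f"{(time.perf_counter() - t0) / n * 1000:.1f} ms/image")
```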
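
For the Dataset Splits row: the Karpathy split is conventionally distributed as a single dataset_coco.json file in which every image carries a split tag. A minimal loading sketch follows, assuming that common file layout; the paper does not describe its own data pipeline.

```python
# Sketch: recovering the Karpathy train/val/test splits from the
# commonly distributed dataset_coco.json (filename and field names
# are assumptions based on the standard release, not the paper).
import json
from collections import defaultdict

with open("dataset_coco.json") as f:
    data = json.load(f)

splits = defaultdict(list)
for img in data["images"]:
    # Karpathy tags: 'train', 'restval', 'val', 'test';
    # 'restval' is conventionally folded into training.
    split = "train" if img["split"] == "restval" else img["split"]
    splits[split].append(img)

# The standard Karpathy split yields 5000 val and 5000 test images.
print({k: len(v) for k, v in splits.items()})
```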
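
For the Experiment Setup row: the quoted alignment module is just two linear blocks with 2048×1024 and 1024×1024 weights, i.e., a mapping from 2048-d CLIP ResNet-50 features into a 1024-d space. A minimal PyTorch sketch is below; the intermediate activation and the optimizer are assumptions, since the quote specifies only the weight shapes, epochs, and learning rate.

```python
# Sketch of the alignment module as quoted: two linear blocks
# projecting 2048-d visual features to a 1024-d space.
import torch
from torch import nn

align = nn.Sequential(
    nn.Linear(2048, 1024),  # 2048x1024 block
    nn.ReLU(),              # assumed non-linearity; not specified in the quote
    nn.Linear(1024, 1024),  # 1024x1024 block
)

# Quoted training configuration: 60 epochs at a learning rate of 1e-5.
# Adam is an assumed optimizer choice; the paper quote does not name one.
optimizer = torch.optim.Adam(align.parameters(), lr=1e-5)
```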