Efficient Image Captioning for Edge Devices
Authors: Ning Wang, Jiangrong Xie, Hang Luo, Qinglin Cheng, Jihao Wu, Mingbo Jia, Linlin Li
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In spite of the low capacity, our model still exhibits state-of-the-art performance on prevalent datasets, e.g., 136.6 CIDEr on COCO Karpathy test split. Testing on the smartphone with only a single CPU, the proposed LightCap exhibits a fast inference speed of 188 ms per image, which is ready for practical applications. |
| Researcher Affiliation | Collaboration | Huawei Inc. wn6149@mail.ustc.edu.cn, xiexjr@foxmail.com, {lhjeremy, qlincheng}@outlook.com, {wujihao, jiamingbo, lynn.lilinlin}@huawei.com |
| Pseudocode | No | The paper describes methods in text and uses figures but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about providing open-source code or a link to a code repository. |
| Open Datasets | Yes | In the experiments, we collect the image-text pairs from Google Conceptual Captions (CC3M) (Sharma et al. 2018), SBU Captions (Ordonez, Kulkarni, and Berg 2011), Open Images (Shao et al. 2019), and MS-COCO (Lin et al. 2014) to form the pre-training data. |
| Dataset Splits | Yes | We evaluate the proposed method on the COCO caption of Karpathy split (Lin et al. 2014) and nocaps validation dataset (Agrawal et al. 2019). |
| Hardware Specification | Yes | Then, we test the inference latency of LightCap model on Huawei P40 smartphone with a Kirin 990 chip. |
| Software Dependencies | No | The paper mentions software components such as 'TinyBERT4' and 'CLIP model (ResNet-50 version)' but does not provide specific version numbers for these or other software dependencies required for replication. |
| Experiment Setup | Yes | The input image resolution is 224×224. This alignment module only contains two linear blocks (2048→1024 and 1024→1024) and is trained for 60 epochs with a learning rate of 1×10⁻⁵. In the pre-training stage, the fusion model is trained 1.0M steps with a learning rate of 5×10⁻⁵ and batch size of 512. In the fine-tuning stage, the fusion model is trained 120 epochs with a learning rate of 3×10⁻⁵. |
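
The alignment module described in the Experiment Setup row is small enough to sketch directly. Below is a minimal PyTorch sketch using the reported layer dimensions (2048→1024 and 1024→1024) and the reported 1×10⁻⁵ learning rate; the activation between the blocks and the choice of optimizer are assumptions, since the paper only states the layer sizes.

```python
# Minimal sketch of the two-linear-block alignment module from the paper's
# Experiment Setup. Only the layer dimensions and learning rate come from
# the paper; the activation and optimizer are assumptions.
import torch
import torch.nn as nn

class AlignmentModule(nn.Module):
    def __init__(self, in_dim: int = 2048, hidden_dim: int = 1024, out_dim: int = 1024):
        super().__init__()
        # Block 1: 2048 -> 1024; Block 2: 1024 -> 1024 (dims from the paper).
        self.block1 = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.GELU())  # activation assumed
        self.block2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block2(self.block1(x))

model = AlignmentModule()
# Learning rate matches the reported 1e-5; AdamW is an assumed optimizer choice.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```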
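
For the Hardware Specification row, the on-device benchmark itself is not released, but single-CPU latency timing of the kind the paper reports on the Huawei P40 can be approximated on any machine. The following is a hedged sketch: the placeholder network standing in for LightCap, the warm-up count, and the number of timed runs are all assumptions; `torch.set_num_threads(1)` mirrors the single-CPU constraint.

```python
# Hypothetical single-CPU latency measurement, in the spirit of the paper's
# on-device test. The network below is a placeholder for LightCap, whose
# weights and benchmark pipeline are not published.
import time
import torch
import torch.nn as nn

torch.set_num_threads(1)  # single-CPU setting, mirroring the reported setup

# Placeholder network and input (stand-in for one image's 2048-d feature).
model = nn.Sequential(nn.Linear(2048, 1024), nn.GELU(), nn.Linear(1024, 1024)).eval()
x = torch.randn(1, 2048)

with torch.no_grad():
    for _ in range(5):  # warm-up runs, excluded from timing
        model(x)
    runs = 50
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    avg_ms = (time.perf_counter() - start) / runs * 1000
print(f"avg latency: {avg_ms:.1f} ms/run")
```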