Efficient Image Captioning for Edge Devices

Authors: Ning Wang, Jiangrong Xie, Hang Luo, Qinglin Cheng, Jihao Wu, Mingbo Jia, Linlin Li

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In spite of the low capacity, our model still exhibits state-of-the-art performance on prevalent datasets, e.g., 136.6 CIDEr on COCO Karpathy test split. Testing on the smartphone with only a single CPU, the proposed LightCap exhibits a fast inference speed of 188ms per image, which is ready for practical applications." (A sketch of how such single-CPU latency can be measured follows the table.)
Researcher Affiliation | Collaboration | Huawei Inc.; wn6149@mail.ustc.edu.cn, xiexjr@foxmail.com, {lhjeremy, qlincheng}@outlook.com, {wujihao, jiamingbo, lynn.lilinlin}@huawei.com
Pseudocode | No | The paper describes its methods in text and figures but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about providing open-source code or a link to a code repository.
Open Datasets | Yes | "In the experiments, we collect the image-text pairs from Google Conceptual Captions (CC3M) (Sharma et al. 2018), SBU Captions (Ordonez, Kulkarni, and Berg 2011), Open Images (Shao et al. 2019), and MS-COCO (Lin et al. 2014) to form the pre-training data."
Dataset Splits | Yes | "We evaluate the proposed method on the COCO caption of Karpathy split (Lin et al. 2014) and nocaps validation dataset (Agrawal et al. 2019)." (A sketch of loading the Karpathy split follows the table.)
Hardware Specification | Yes | "Then, we test the inference latency of LightCap model on Huawei P40 smartphone with a Kirin 990 chip."
Software Dependencies | No | The paper mentions software components such as 'TinyBERT4' and 'CLIP model (ResNet-50 version)' but does not provide specific version numbers for these or other software dependencies required for replication.
Experiment Setup | Yes | "The input image resolution is 224×224. This alignment module only contains two linear blocks (2048×1024 and 1024×1024) and is trained for 60 epochs with a learning rate of 1×10⁻⁵. In the pre-training stage, the fusion model is trained 1.0M steps with a learning rate of 5×10⁻⁵ and batch size of 512. In the fine-tuning stage, the fusion model is trained 120 epochs with a learning rate of 3×10⁻⁵." (A sketch of the alignment module follows below.)
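
The 188ms-per-image figure cited in the Research Type and Hardware Specification rows was measured on a Kirin 990 smartphone CPU, but the paper publishes no benchmarking code. Below is a minimal sketch of how single-thread CPU latency is typically measured in PyTorch; the stand-in model and iteration counts are assumptions for illustration, not the authors' setup.

```python
# Minimal sketch: timing single-image, single-thread CPU inference.
# The model below is a placeholder; LightCap itself is not public.
import time
import torch

torch.set_num_threads(1)           # restrict to a single CPU thread, as in the paper's test
model = torch.nn.Sequential(       # stand-in for the captioning model
    torch.nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(64, 1000),
).eval()

x = torch.randn(1, 3, 224, 224)    # 224x224 input resolution, per the paper

with torch.inference_mode():
    for _ in range(5):             # warm-up runs, excluded from timing
        model(x)
    n = 50
    t0 = time.perf_counter()
    for _ in range(n):
        model(x)
    print(f"{(time.perf_counter() - t0) / n * 1000:.1f} ms/image")
```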
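
For the Dataset Splits row: the Karpathy split is conventionally distributed as a single dataset_coco.json file in which every image carries a split tag. A minimal loading sketch follows, assuming that common file layout; the paper does not describe its own data pipeline.

```python
# Sketch: recovering the Karpathy train/val/test splits from the
# commonly distributed dataset_coco.json (filename and field names
# are assumptions based on the standard release, not the paper).
import json
from collections import defaultdict

with open("dataset_coco.json") as f:
    data = json.load(f)

splits = defaultdict(list)
for img in data["images"]:
    # Karpathy tags: 'train', 'restval', 'val', 'test';
    # 'restval' is conventionally folded into training.
    split = "train" if img["split"] == "restval" else img["split"]
    splits[split].append(img)

# The standard Karpathy split yields 5000 val and 5000 test images.
print({k: len(v) for k, v in splits.items()})
```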
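
For the Experiment Setup row: the quoted alignment module is just two linear blocks with 2048×1024 and 1024×1024 weights, i.e., a mapping from 2048-d CLIP ResNet-50 features into a 1024-d space. A minimal PyTorch sketch is below; the intermediate activation and the optimizer are assumptions, since the quote specifies only the weight shapes, epochs, and learning rate.

```python
# Sketch of the alignment module as quoted: two linear blocks
# projecting 2048-d visual features to a 1024-d space.
import torch
from torch import nn

align = nn.Sequential(
    nn.Linear(2048, 1024),  # 2048x1024 block
    nn.ReLU(),              # assumed non-linearity; not specified in the quote
    nn.Linear(1024, 1024),  # 1024x1024 block
)

# Quoted training configuration: 60 epochs at a learning rate of 1e-5.
# Adam is an assumed optimizer choice; the paper quote does not name one.
optimizer = torch.optim.Adam(align.parameters(), lr=1e-5)
```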