From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping

Authors: Junyang Wang, Ming Yan, Yi Zhang, Jitao Sang

IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 4 (Experiment) and Section 4.2 (Performance Comparison): "The results of image captioning are shown in Table 1." "Experimental results show that Knight achieves state-of-the-art performance."
Researcher Affiliation | Collaboration | Junyang Wang1, Ming Yan3, Yi Zhang1, Jitao Sang1,2; 1School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University; 2Peng Cheng Lab; 3DAMO Academy, Alibaba Group; {junyangwang, yi.zhang, jtsang}@bjtu.edu.cn, ym119608@alibaba-inc.com
Pseudocode | No | The paper describes the method using equations and prose, but does not contain a distinct block labeled "Pseudocode" or "Algorithm".
Open Source Code | No | The paper does not provide any specific link or statement regarding the availability of its source code.
Open Datasets | Yes | For the image captioning task, we conduct experiments on two widely used benchmarks: Flickr30k [Plummer et al., 2015] and MS-COCO [Lin et al., 2014; Chen et al., 2015]. For the video captioning task, we choose two video datasets: MSRVTT [Xu et al., 2016] and MSVD [Wu et al., 2017].
Dataset Splits | Yes | And we set up the training, validation, and test splits according to the protocols provided by [Karpathy and Fei-Fei, 2015] for both datasets. (A split-loading sketch follows the table.)
Hardware Specification | Yes | The training process is less than 6 hours with 1 Tesla A100 GPU.
Software Dependencies | No | The paper mentions GPT-2 and the Adam optimizer, but does not provide specific version numbers for software dependencies such as libraries or frameworks (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | For CLIP, we choose the Resnet50x64 architecture which encodes each image as a 1024-dimension vector. For the decoder, we choose the large version of GPT-2 [Radford et al., 2019] with a 1280-dimension embedding space. To align CLIP and decoder on the representation layer, we use a 3-layer MLP that transforms the representation of CLIP into 1280 dimensions. We optimize the decoder with the Adam optimizer [Kingma and Ba, 2014] and a learning rate of 1e-6. (A setup sketch follows the table.)
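
Regarding the Dataset Splits row: the Karpathy protocol is typically distributed as one JSON file per dataset. The sketch below is a minimal, assumed reading of that layout, not code from the paper; the file name "dataset_coco.json", the schema ("images" with "split", "filename", and "sentences" fields), and the choice to fold the "restval" portion into the training set are assumptions based on the commonly shared split files.

import json
from collections import defaultdict

# Minimal sketch (not from the paper): partition a captioning dataset
# according to the Karpathy split file. File name and schema are assumptions.
with open("dataset_coco.json") as f:
    karpathy = json.load(f)

splits = defaultdict(list)
for image in karpathy["images"]:
    split = image["split"]
    if split == "restval":          # assumption: merge restval into train
        split = "train"
    splits[split].append({
        "filename": image["filename"],
        "captions": [s["raw"] for s in image["sentences"]],
    })

# Report how many images landed in each split.
print({name: len(items) for name, items in splits.items()})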
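
Regarding the Experiment Setup row: the following PyTorch sketch shows how the reported pieces fit together, a 3-layer MLP mapping 1024-dimensional CLIP (RN50x64) features into the 1280-dimensional embedding space of GPT-2 large, optimized with Adam at a learning rate of 1e-6. It is a minimal illustration under stated assumptions, not the authors' code; the hidden width of the MLP and passing the mapper's parameters to the optimizer are assumptions, since the paper only reports the input/output dimensions and the decoder's optimizer settings.

import torch
import torch.nn as nn

class CLIPToGPT2Mapper(nn.Module):
    """3-layer MLP from CLIP's 1024-d space to GPT-2 large's 1280-d space."""
    def __init__(self, clip_dim: int = 1024, gpt2_dim: int = 1280, hidden_dim: int = 1280):
        super().__init__()
        # Hidden width is an assumption; the paper only states "3-layer MLP".
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, gpt2_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(clip_features)

mapper = CLIPToGPT2Mapper()
# The paper optimizes the decoder with Adam at lr 1e-6; including the mapper's
# parameters here is an assumption.
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-6)

# Example: map a batch of 8 CLIP embeddings into GPT-2's embedding space,
# e.g. to serve as prefix embeddings for the decoder.
dummy_clip_features = torch.randn(8, 1024)
prefix_embeddings = mapper(dummy_clip_features)
print(prefix_embeddings.shape)  # torch.Size([8, 1280])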