From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping
Authors: Junyang Wang, Ming Yan, Yi Zhang, Jitao Sang
IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4 "Experiment"; Section 4.2 "Performance Comparison"; "The results of image captioning are shown in Table 1."; "Experimental results show that Knight achieves state-of-the-art performance." |
| Researcher Affiliation | Collaboration | Junyang Wang1, Ming Yan2, Yi Zhang1, Jitao Sang1,2; 1School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University; 2Peng Cheng Lab; 3DAMO Academy, Alibaba Group; {junyangwang, yi.zhang, jtsang}@bjtu.edu.cn, ym119608@alibaba-inc.com |
| Pseudocode | No | The paper describes the method using equations and prose, but does not contain a distinct block labeled "Pseudocode" or "Algorithm". |
| Open Source Code | No | The paper does not provide any specific link or statement regarding the availability of its source code. |
| Open Datasets | Yes | For the image captioning task, we conduct experiments on two widely used benchmarks: Flickr30k [Plummer et al., 2015] and MS-COCO [Lin et al., 2014; Chen et al., 2015]. For the video captioning task, we choose two video datasets: MSRVTT [Xu et al., 2016] and MSVD [Wu et al., 2017]. |
| Dataset Splits | Yes | And we set up the training, validation, and test splits according to the protocols provided by [Karpathy and Fei-Fei, 2015] for both datasets. |
| Hardware Specification | Yes | The training process is less than 6 hours with 1 Tesla A100 GPU. |
| Software Dependencies | No | The paper mentions GPT-2 and Adam optimizer, but does not provide specific version numbers for software dependencies such as libraries or frameworks (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For CLIP, we choose the Resnet50x64 architecture which encodes each image as a 1024-dimension vector. For the decoder, we choose the large version of GPT-2 [Radford et al., 2019] with a 1280-dimension embedding space. To align CLIP and decoder on the representation layer, we use a 3-layer MLP that transforms the representation of CLIP into 1280 dimensions. We optimize the decoder with the Adam optimizer [Kingma and Ba, 2014] and a learning rate of 1e-6. |
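
For readers attempting to reproduce the setup quoted above, the following is a minimal sketch assembled from the reported hyperparameters: CLIP RN50x64 (1024-dimension features), GPT-2 large (1280-dimension embeddings), a 3-layer MLP bridging the two, and Adam with a learning rate of 1e-6. How the projected feature enters the decoder (here, as a single prefix embedding) and the text-only training step are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch of the quoted setup (CLIP RN50x64 -> 3-layer MLP -> GPT-2 large,
# Adam, lr 1e-6). Feeding the projected feature as a single prefix embedding and
# the text-only reconstruction step are assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ResNet50x64: image and text features live in a 1024-d joint space
clip_model, preprocess = clip.load("RN50x64", device=device)
clip_dim = 1024

# GPT-2 large decoder with a 1280-d embedding space
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
tokenizer.pad_token = tokenizer.eos_token
decoder = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device)
gpt2_dim = decoder.config.n_embd  # 1280

# 3-layer MLP that maps CLIP representations into the decoder's embedding space
projector = nn.Sequential(
    nn.Linear(clip_dim, gpt2_dim), nn.ReLU(),
    nn.Linear(gpt2_dim, gpt2_dim), nn.ReLU(),
    nn.Linear(gpt2_dim, gpt2_dim),
).to(device)

optimizer = torch.optim.Adam(
    list(projector.parameters()) + list(decoder.parameters()), lr=1e-6
)

def text_only_step(captions):
    """Encode captions with CLIP's text encoder, project them to a prefix
    embedding, and train GPT-2 to reconstruct the caption from that prefix."""
    with torch.no_grad():
        tokens = clip.tokenize(captions, truncate=True).to(device)
        text_feat = clip_model.encode_text(tokens).float()            # (B, 1024)
    prefix = projector(text_feat).unsqueeze(1)                        # (B, 1, 1280)

    batch = tokenizer(captions, return_tensors="pt", padding=True).to(device)
    tok_embed = decoder.transformer.wte(batch.input_ids)              # (B, T, 1280)
    inputs = torch.cat([prefix, tok_embed], dim=1)                    # (B, T+1, 1280)

    # No loss on the prefix position or on padding tokens
    ignore = torch.full((len(captions), 1), -100, device=device)
    labels = torch.cat(
        [ignore, batch.input_ids.masked_fill(batch.attention_mask == 0, -100)], dim=1
    )
    attn = torch.cat([torch.ones_like(batch.attention_mask[:, :1]),
                      batch.attention_mask], dim=1)

    loss = decoder(inputs_embeds=inputs, attention_mask=attn, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time a parallel path would encode an image with `clip_model.encode_image`, project it with the same MLP, and decode from the resulting prefix; the paper's actual decoding procedure is not reproduced here.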