From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping

Authors: Junyang Wang, Ming Yan, Yi Zhang, Jitao Sang

IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 4 (Experiment) and Section 4.2 (Performance Comparison): "The results of image captioning are shown in Table 1." "Experimental results show that Knight achieves state-of-the-art performance."
Researcher Affiliation | Collaboration | Junyang Wang1, Ming Yan3, Yi Zhang1, Jitao Sang1,2; 1School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University; 2Peng Cheng Lab; 3DAMO Academy, Alibaba Group; {junyangwang, yi.zhang, jtsang}@bjtu.edu.cn, ym119608@alibaba-inc.com
Pseudocode | No | The paper describes the method using equations and prose, but does not contain a distinct block labeled "Pseudocode" or "Algorithm".
Open Source Code | No | The paper does not provide any specific link or statement regarding the availability of its source code.
Open Datasets | Yes | For the image captioning task, we conduct experiments on two widely used benchmarks: Flickr30k [Plummer et al., 2015] and MS-COCO [Lin et al., 2014; Chen et al., 2015]. For the video captioning task, we choose two video datasets: MSRVTT [Xu et al., 2016] and MSVD [Wu et al., 2017].
Dataset Splits | Yes | And we set up the training, validation, and test splits according to the protocols provided by [Karpathy and Fei-Fei, 2015] for both datasets. (A split-loading sketch follows the table.)
Hardware Specification | Yes | The training process is less than 6 hours with 1 Tesla A100 GPU.
Software Dependencies | No | The paper mentions GPT-2 and the Adam optimizer, but does not provide specific version numbers for software dependencies such as libraries or frameworks (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | For CLIP, we choose the Resnet50x64 architecture which encodes each image as a 1024-dimension vector. For the decoder, we choose the large version of GPT-2 [Radford et al., 2019] with a 1280-dimension embedding space. To align CLIP and decoder on the representation layer, we use a 3-layer MLP that transforms the representation of CLIP into 1280 dimensions. We optimize the decoder with the Adam optimizer [Kingma and Ba, 2014] and a learning rate of 1e-6. (A setup sketch follows the table.)
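
Regarding the Dataset Splits row: the Karpathy protocol is typically distributed as one JSON file per dataset. The sketch below is a minimal, assumed reading of that layout, not code from the paper; the file name "dataset_coco.json", the schema ("images" with "split", "filename", and "sentences" fields), and the choice to fold the "restval" portion into the training set are assumptions based on the commonly shared split files.

import json
from collections import defaultdict

# Minimal sketch (not from the paper): partition a captioning dataset
# according to the Karpathy split file. File name and schema are assumptions.
with open("dataset_coco.json") as f:
    karpathy = json.load(f)

splits = defaultdict(list)
for image in karpathy["images"]:
    split = image["split"]
    if split == "restval":          # assumption: merge restval into train
        split = "train"
    splits[split].append({
        "filename": image["filename"],
        "captions": [s["raw"] for s in image["sentences"]],
    })

# Report how many images landed in each split.
print({name: len(items) for name, items in splits.items()})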
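
Regarding the Experiment Setup row: the following PyTorch sketch shows how the reported pieces fit together, a 3-layer MLP mapping 1024-dimensional CLIP (RN50x64) features into the 1280-dimensional embedding space of GPT-2 large, optimized with Adam at a learning rate of 1e-6. It is a minimal illustration under stated assumptions, not the authors' code; the hidden width of the MLP and passing the mapper's parameters to the optimizer are assumptions, since the paper only reports the input/output dimensions and the decoder's optimizer settings.

import torch
import torch.nn as nn

class CLIPToGPT2Mapper(nn.Module):
    """3-layer MLP from CLIP's 1024-d space to GPT-2 large's 1280-d space."""
    def __init__(self, clip_dim: int = 1024, gpt2_dim: int = 1280, hidden_dim: int = 1280):
        super().__init__()
        # Hidden width is an assumption; the paper only states "3-layer MLP".
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, gpt2_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(clip_features)

mapper = CLIPToGPT2Mapper()
# The paper optimizes the decoder with Adam at lr 1e-6; including the mapper's
# parameters here is an assumption.
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-6)

# Example: map a batch of 8 CLIP embeddings into GPT-2's embedding space,
# e.g. to serve as prefix embeddings for the decoder.
dummy_clip_features = torch.randn(8, 1024)
prefix_embeddings = mapper(dummy_clip_features)
print(prefix_embeddings.shape)  # torch.Size([8, 1280])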