Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning
Authors: Zhiyue Liu, Jinyuan Liu, Fanrong Ma
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our method obtains the state-of-the-art performance on benchmark datasets. |
| Researcher Affiliation | Academia | Zhiyue Liu¹,²*, Jinyuan Liu¹, Fanrong Ma¹; ¹School of Computer, Electronics and Information, Guangxi University, Nanning, China; ²Guangxi Key Laboratory of Multimedia Communications and Network Technology; liuzhy@gxu.edu.cn, {2213394017, 2213301037}@st.gxu.edu.cn |
| Pseudocode | No | The paper describes procedures and methods but does not contain any formally labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper provides a footnote '1https://arxiv.org/abs/2312.08865' which links to the arXiv preprint, not to a code repository or an explicit statement about code availability. |
| Open Datasets | Yes | Experiments are conducted on three benchmark datasets: MSCOCO (Chen et al. 2015), Flickr30k (Young et al. 2014), and SS1M (Feng et al. 2019). |
| Dataset Splits | Yes | We follow Karpathy (Karpathy and Fei-Fei 2015) to split the MSCOCO and Flickr30k datasets into the training, validation, and test sets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions specific models such as 'Stable Diffusion v1-5', 'CLIP ViT-B/32', and 'DETR', but it does not list the software dependencies (programming language, deep learning framework such as PyTorch or TensorFlow, CUDA) with the specific version numbers required for replication. |
| Experiment Setup | Yes | For synthetic image generation, we utilize Stable Diffusion v1-5 (Rombach et al. 2022) as the text-to-image model, which leverages 20 sampling steps to generate a 512×512 image for each input text. A pre-trained CLIP ViT-B/32 model is used as the feature extractor. The Adam optimizer (Kingma and Ba 2015) is employed to optimize parameters. For better alignment, we optimize pseudo image features using Eq. 2 with a temperature τ of 1/100 and a learning rate of 1e-5. We use a transformer decoder structure with 4 layers and 4 attention heads as the caption generator. The projection temperatures τ in Eq. 3 for MSCOCO, Flickr30k, and SS1M are set to 1/100, 1/80, and 1/100, respectively. |
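
The reported experiment setup maps onto standard open-source components. Below is a minimal sketch of that configuration, assuming the `diffusers` and `transformers` libraries; the model IDs (`runwayml/stable-diffusion-v1-5`, `openai/clip-vit-base-patch32`), the embedding width, and the vocabulary size are illustrative assumptions rather than details stated in the paper, and this is not the authors' released code.

```python
# Sketch of the reported setup: Stable Diffusion v1-5 with 20 sampling steps for
# 512x512 synthetic images, a frozen CLIP ViT-B/32 feature extractor, and a
# 4-layer / 4-head transformer decoder trained with Adam.
import torch
import torch.nn as nn
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image generation: one 512x512 synthetic image per input caption.
sd_pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # assumed checkpoint for "Stable Diffusion v1-5"
).to(device)
caption = "a man riding a bicycle down a city street"
synthetic_image = sd_pipe(
    caption, num_inference_steps=20, height=512, width=512
).images[0]

# Frozen CLIP ViT-B/32 as the shared feature extractor for text and images.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    img_inputs = processor(images=synthetic_image, return_tensors="pt").to(device)
    txt_inputs = processor(text=[caption], return_tensors="pt", padding=True).to(device)
    image_feat = clip.get_image_features(**img_inputs)  # (1, 512)
    text_feat = clip.get_text_features(**txt_inputs)    # (1, 512)

# Caption generator: transformer decoder with 4 layers and 4 attention heads,
# conditioned on the (pseudo) image feature via cross-attention.
d_model, vocab_size = 512, 49408  # assumed; the paper does not state these
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4).to(device)
lm_head = nn.Linear(d_model, vocab_size).to(device)

# Adam is the reported optimizer; 1e-5 is the rate the paper gives for the
# pseudo-feature alignment step (Eq. 2) and is reused here only as a placeholder.
optimizer = torch.optim.Adam(
    list(decoder.parameters()) + list(lm_head.parameters()), lr=1e-5
)

# One illustrative decoding step: placeholder token embeddings attend to the image feature.
token_embeds = torch.randn(1, 10, d_model, device=device)
memory = image_feat.unsqueeze(1)                            # (1, 1, d_model) conditioning
logits = lm_head(decoder(tgt=token_embeds, memory=memory))  # (1, 10, vocab_size)
```

The projection temperatures (1/100, 1/80, 1/100 for MSCOCO, Flickr30k, and SS1M) and the Eq. 2 / Eq. 3 objectives are paper-specific and are not reproduced in this sketch.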