Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning
Authors: Zhiyue Liu, Jinyuan Liu, Fanrong Ma
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our method obtains the state-of-the-art performance on benchmark datasets. |
| Researcher Affiliation | Academia | Zhiyue Liu¹,²*, Jinyuan Liu¹, Fanrong Ma¹; ¹School of Computer, Electronics and Information, Guangxi University, Nanning, China; ²Guangxi Key Laboratory of Multimedia Communications and Network Technology; liuzhy@gxu.edu.cn, {2213394017, 2213301037}@st.gxu.edu.cn |
| Pseudocode | No | The paper describes procedures and methods but does not contain any formally labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper provides a footnote '1https://arxiv.org/abs/2312.08865' which links to the arXiv preprint, not to a code repository or an explicit statement about code availability. |
| Open Datasets | Yes | Experiments are conducted on three benchmark datasets: MSCOCO (Chen et al. 2015), Flickr30k (Young et al. 2014), and SS1M (Feng et al. 2019). |
| Dataset Splits | Yes | We follow Karpathy (Karpathy and Fei-Fei 2015) to split the MSCOCO and Flickr30k datasets into the training, validation, and test sets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions specific models such as 'Stable Diffusion v1-5', 'CLIP ViT-B/32', and 'DETR', but it does not list the software dependencies (programming language, deep learning framework such as PyTorch or TensorFlow, CUDA) with the specific version numbers required for replication. |
| Experiment Setup | Yes | For synthetic image generation, we utilize Stable Diffusion v1-5 (Rombach et al. 2022) as the text-to-image model, which leverages 20 sampling steps to generate a 512×512 image for each input text. A pre-trained CLIP ViT-B/32 model is used as the feature extractor. The Adam optimizer (Kingma and Ba 2015) is employed to optimize parameters. For better alignment, we optimize pseudo image features using Eq. 2 with a temperature τ of 1/100 and a learning rate of 1e-5. We use a transformer decoder structure with 4 layers and 4 attention heads as the caption generator. The projection temperatures τ in Eq. 3 for MSCOCO, Flickr30k, and SS1M are set to 1/100, 1/80, and 1/100, respectively. |
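
The reported experiment setup maps onto standard open-source components. Below is a minimal sketch of that configuration, assuming the `diffusers` and `transformers` libraries; the model IDs (`runwayml/stable-diffusion-v1-5`, `openai/clip-vit-base-patch32`), the embedding width, and the vocabulary size are illustrative assumptions rather than details stated in the paper, and this is not the authors' released code.

```python
# Sketch of the reported setup: Stable Diffusion v1-5 with 20 sampling steps for
# 512x512 synthetic images, a frozen CLIP ViT-B/32 feature extractor, and a
# 4-layer / 4-head transformer decoder trained with Adam.
import torch
import torch.nn as nn
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image generation: one 512x512 synthetic image per input caption.
sd_pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # assumed checkpoint for "Stable Diffusion v1-5"
).to(device)
caption = "a man riding a bicycle down a city street"
synthetic_image = sd_pipe(
    caption, num_inference_steps=20, height=512, width=512
).images[0]

# Frozen CLIP ViT-B/32 as the shared feature extractor for text and images.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    img_inputs = processor(images=synthetic_image, return_tensors="pt").to(device)
    txt_inputs = processor(text=[caption], return_tensors="pt", padding=True).to(device)
    image_feat = clip.get_image_features(**img_inputs)  # (1, 512)
    text_feat = clip.get_text_features(**txt_inputs)    # (1, 512)

# Caption generator: transformer decoder with 4 layers and 4 attention heads,
# conditioned on the (pseudo) image feature via cross-attention.
d_model, vocab_size = 512, 49408  # assumed; the paper does not state these
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4).to(device)
lm_head = nn.Linear(d_model, vocab_size).to(device)

# Adam is the reported optimizer; 1e-5 is the rate the paper gives for the
# pseudo-feature alignment step (Eq. 2) and is reused here only as a placeholder.
optimizer = torch.optim.Adam(
    list(decoder.parameters()) + list(lm_head.parameters()), lr=1e-5
)

# One illustrative decoding step: placeholder token embeddings attend to the image feature.
token_embeds = torch.randn(1, 10, d_model, device=device)
memory = image_feat.unsqueeze(1)                            # (1, 1, d_model) conditioning
logits = lm_head(decoder(tgt=token_embeds, memory=memory))  # (1, 10, vocab_size)
```

The projection temperatures (1/100, 1/80, 1/100 for MSCOCO, Flickr30k, and SS1M) and the Eq. 2 / Eq. 3 objectives are paper-specific and are not reproduced in this sketch.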