Bridging Vision and Language Spaces with Assignment Prediction

Authors: Jungin Park, Jiyoung Lee, Kwanghoon Sohn

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that VLAP achieves substantial improvements over previous linear transformation-based approaches across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate that the learned visual representations hold a semantic taxonomy of LLMs, making visual semantic arithmetic possible.
Researcher Affiliation | Collaboration | Jungin Park (Yonsei University), Jiyoung Lee (NAVER AI Lab), Kwanghoon Sohn (Yonsei University; Korea Institute of Science and Technology (KIST))
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We provide PyTorch implementation for VLAP at https://github.com/park-jungin/vlap.
Open Datasets | Yes | We first train the model on CC3M (Sharma et al., 2018) and evaluate the performance on the following datasets for each task. For zero-shot image captioning, we evaluate the performance on MSCOCO (Lin et al., 2014) and NoCaps (Agrawal et al., 2019), following (Merullo et al., 2023). For visual question answering, we evaluate the model on the VQA2 (Goyal et al., 2017) dataset from zero-shot to 4-shot settings. In cross-modal retrieval, we use the Visual Dialog (Das et al., 2017) dataset for comparability to previous work (Koh et al., 2023).
Dataset Splits | No | The paper mentions training on CC3M and evaluating on MSCOCO, NoCaps, VQA2, and Visual Dialog, but does not explicitly provide training, validation, or test splits as percentages or sample counts.
Hardware Specification | No | The paper mentions hardware (TPUs, A100 GPUs) only in the context of related work (Flamingo, BLIP-2) and their training costs, not for the experiments conducted in this paper.
Software Dependencies | No | The paper mentions a PyTorch implementation but does not specify its version or other software dependencies with version numbers.
Experiment Setup | Yes | In Table 4, we provide hyperparameters used in training. Table 4: Hyperparameters for training VLAP corresponding to each image encoder and LLM:
  Hyperparameter | BEiT-OPT1.3B | BEiT-T5Base | CLIP-OPT1.3B | CLIP-T5Base
  Warmup steps | 1.5K | 3K | 1.5K | 3K
  Learning rate | 1e-4 | 5e-3 | 1e-4 | 5e-3
  Batch size | 128 | 256 | 128 | 256
  Total steps | 30K | 15K | 30K | 15K
  Final learning rate | 0 (all models)
  AdamW β | (0.9, 0.999) (all models)
  Text prompt | "A photo of" (all models)
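
The hyperparameters in Table 4 translate directly into an optimizer setup. The sketch below is illustrative only: it assumes a linear warmup followed by a linear decay to the reported final learning rate of 0 (the paper does not state the schedule shape), and the helper names are not taken from the VLAP repository.

```python
# Illustrative training configuration built from the values reported in Table 4.
# Scheduler shape and helper names are assumptions, not the official implementation.
import torch

CONFIGS = {
    "BEiT-OPT1.3B": dict(warmup_steps=1_500, lr=1e-4, batch_size=128, total_steps=30_000),
    "BEiT-T5Base":  dict(warmup_steps=3_000, lr=5e-3, batch_size=256, total_steps=15_000),
    "CLIP-OPT1.3B": dict(warmup_steps=1_500, lr=1e-4, batch_size=128, total_steps=30_000),
    "CLIP-T5Base":  dict(warmup_steps=3_000, lr=5e-3, batch_size=256, total_steps=15_000),
}

def build_optimizer_and_scheduler(params, model_name):
    cfg = CONFIGS[model_name]
    # AdamW with betas (0.9, 0.999) as listed in Table 4.
    optimizer = torch.optim.AdamW(params, lr=cfg["lr"], betas=(0.9, 0.999))

    def lr_lambda(step):
        # Linear warmup, then (assumed) linear decay to the reported final LR of 0.
        if step < cfg["warmup_steps"]:
            return step / max(1, cfg["warmup_steps"])
        remaining = cfg["total_steps"] - cfg["warmup_steps"]
        return max(0.0, (cfg["total_steps"] - step) / max(1, remaining))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```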
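For readers inspecting the linked repository, the paper's central component is the assignment prediction that bridges the vision and language spaces. The following is only a rough sketch, under the assumption of a SwAV-style optimal-transport (Sinkhorn) assignment of visual and text features onto the frozen LLM word embeddings; the exact loss, normalization, and architecture should be taken from https://github.com/park-jungin/vlap, not from this snippet.

```python
# Rough sketch (not the authors' implementation): assign features to the frozen
# LLM word embeddings via a Sinkhorn-style step, then train each modality to
# predict the soft assignment of its paired counterpart.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    # scores: [batch, vocab] similarities between features and word embeddings.
    q = torch.exp(scores / eps)
    q = q / q.sum()
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True)  # normalize over the batch
        q = q / q.sum(dim=1, keepdim=True)  # normalize over the vocabulary
    return q  # rows are soft assignment distributions over word embeddings

def assignment_loss(visual_feats, text_feats, word_embeds, temp=0.1):
    # All inputs are L2-normalized; word_embeds are the frozen LLM input embeddings.
    v_scores = visual_feats @ word_embeds.t()
    t_scores = text_feats @ word_embeds.t()
    q_v, q_t = sinkhorn(v_scores), sinkhorn(t_scores)
    # Swapped prediction: vision predicts the text assignment and vice versa.
    loss_v = -(q_t * F.log_softmax(v_scores / temp, dim=1)).sum(dim=1).mean()
    loss_t = -(q_v * F.log_softmax(t_scores / temp, dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_v + loss_t)
```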
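The Research Type row also mentions visual semantic arithmetic. As a toy illustration only (not the paper's evaluation protocol), arithmetic in the shared space can be decoded by nearest-neighbor lookup over the LLM word embeddings; all names below are placeholders.

```python
# Toy illustration of visual semantic arithmetic in the shared embedding space.
# Inputs are placeholders, not the paper's evaluation setup.
import torch
import torch.nn.functional as F

def semantic_arithmetic(z_a, z_b, z_c, word_embeds, vocab):
    # e.g. z_a = mapped image of "king", z_b = mapped "man", z_c = mapped "woman".
    query = F.normalize(z_a - z_b + z_c, dim=-1)
    sims = query @ F.normalize(word_embeds, dim=-1).t()
    return vocab[sims.argmax().item()]  # nearest word in the LLM vocabulary
```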