Bridging Vision and Language Spaces with Assignment Prediction
Authors: Jungin Park, Jiyoung Lee, Kwanghoon Sohn
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that VLAP achieves substantial improvements over previous linear transformation-based approaches across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate that the learned visual representations hold a semantic taxonomy of LLMs, making visual semantic arithmetic possible. |
| Researcher Affiliation | Collaboration | Jungin Park (Yonsei University), Jiyoung Lee (NAVER AI Lab), Kwanghoon Sohn (Yonsei University; Korea Institute of Science and Technology (KIST)) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide a PyTorch implementation of VLAP at https://github.com/park-jungin/vlap. |
| Open Datasets | Yes | We first train the model on CC3M (Sharma et al., 2018) and evaluate the performance on the following datasets for each task. For zero-shot image captioning, we evaluate the performance on MSCOCO (Lin et al., 2014) and NoCaps (Agrawal et al., 2019), following (Merullo et al., 2023). For visual question answering, we evaluate the model on the VQA2 (Goyal et al., 2017) dataset from zero-shot to 4-shot settings. In cross-modal retrieval, we use the Visual Dialog (Das et al., 2017) dataset for comparability with previous work (Koh et al., 2023). |
| Dataset Splits | No | The paper mentions training on CC3M and evaluating on MSCOCO, NoCaps, VQA2, and Visual Dialog, but it does not explicitly specify training, validation, or test splits (as percentages or sample counts) in the text. |
| Hardware Specification | No | The paper mentions hardware specifications (TPUs, A100 GPUs) only in the context of related works (Flamingo, BLIP-2) and their training costs, not for the experiments conducted in this paper. |
| Software Dependencies | No | The paper mentions a 'PyTorch implementation' but does not specify its version or other software dependencies with version numbers. |
| Experiment Setup | Yes | In Table 4, we provide the hyperparameters used in training. Table 4 (hyperparameters for training VLAP for each image encoder and LLM, ordered BEiT-OPT1.3B / BEiT-T5Base / CLIP-OPT1.3B / CLIP-T5Base): Warmup steps: 1.5K / 3K / 1.5K / 3K; Learning rate: 1e-4 / 5e-3 / 1e-4 / 5e-3; Batch size: 128 / 256 / 128 / 256; Total steps: 30K / 15K / 30K / 15K; Final learning rate: 0 (all); AdamW β: (0.9, 0.999) (all); Text prompt: "A photo of" (all). |
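
As a reading aid, the reported Table 4 hyperparameters can be organized as a small configuration sketch. This is not the authors' training script: the dictionary layout, the `build_optimizer_and_schedule` helper, and the linear warmup-then-decay schedule shape are assumptions for illustration; only the numeric values (warmup steps, learning rate, batch size, total steps, final learning rate of 0, AdamW betas, and the "A photo of" prompt) come from the paper's Table 4.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters reported in Table 4, keyed by (image encoder, LLM) pairing.
# The dictionary layout itself is illustrative, not taken from the released code.
VLAP_HPARAMS = {
    "BEiT-OPT1.3B": {"warmup_steps": 1_500, "lr": 1e-4, "batch_size": 128, "total_steps": 30_000},
    "BEiT-T5Base":  {"warmup_steps": 3_000, "lr": 5e-3, "batch_size": 256, "total_steps": 15_000},
    "CLIP-OPT1.3B": {"warmup_steps": 1_500, "lr": 1e-4, "batch_size": 128, "total_steps": 30_000},
    "CLIP-T5Base":  {"warmup_steps": 3_000, "lr": 5e-3, "batch_size": 256, "total_steps": 15_000},
}
ADAMW_BETAS = (0.9, 0.999)   # shared across all configurations
FINAL_LR = 0.0               # learning rate decays to zero
TEXT_PROMPT = "A photo of"   # text prompt used during training


def build_optimizer_and_schedule(params, config):
    """Hypothetical helper: AdamW with linear warmup, then linear decay to 0.

    The warmup/decay shape is an assumption; the paper only reports the
    warmup steps, total steps, and a final learning rate of 0.
    """
    optimizer = torch.optim.AdamW(params, lr=config["lr"], betas=ADAMW_BETAS)

    def lr_lambda(step):
        # Multiplier on the base learning rate at a given training step.
        if step < config["warmup_steps"]:
            return step / max(1, config["warmup_steps"])
        remaining = config["total_steps"] - config["warmup_steps"]
        return max(0.0, (config["total_steps"] - step) / max(1, remaining))

    return optimizer, LambdaLR(optimizer, lr_lambda)


# Example usage with the CLIP-OPT1.3B configuration:
# model = ...  # trainable bridging module between the frozen encoder and LLM
# optimizer, scheduler = build_optimizer_and_schedule(
#     model.parameters(), VLAP_HPARAMS["CLIP-OPT1.3B"]
# )
```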