Unifying Vision-Language Representation Space with Single-Tower Transformer

Authors: Jiho Jang, Chaerin Kong, Donghyeon Jeon, Seonhoon Kim, Nojun Kwak

AAAI 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Thorough evaluations demonstrate the potential of a unified modality-agnostic VLP framework." |
| Researcher Affiliation | Collaboration | Jiho Jang¹, Chaerin Kong¹, Donghyeon Jeon², Seonhoon Kim³, Nojun Kwak¹ (¹Seoul National University, ²NAVER, ³Coupang); {geographic,veztylord,nojunk}@snu.ac.kr, donghyeon.jeon@navercorp.com, sekim625@coupang.com |
| Pseudocode | No | The paper provides mathematical formulations and diagrams but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides no concrete access information for the methodology described, such as a repository link or an explicit statement about code release. |
| Open Datasets | Yes | "Following prior works (Li et al. 2021; Yang et al. 2022; Gan et al. 2020), we train OneR on the combination of CC3M (Sharma et al. 2018), SBU Captions (Ordonez, Kulkarni, and Berg 2011), Visual Genome (Krishna et al. 2017) and COCO (Lin et al. 2014), which sums up to 4M images and 5.1M image-text pairs." |
| Dataset Splits | No | Training datasets and test sets (MS-COCO 5K, ImageNet, CIFAR-100) are mentioned, but the paper does not report explicit train/validation/test split percentages, sample counts, or any cross-validation setup. |
| Hardware Specification | Yes | "We train our model with 32 A100 GPUs for 40 epochs under PyTorch framework." |
| Software Dependencies | No | The paper mentions the PyTorch framework but specifies neither its version number nor other software dependencies with versions. |
| Experiment Setup | No | "We train our model with 32 A100 GPUs for 40 epochs under PyTorch framework. Details on hyperparameters are listed in the supplementary." Hyperparameters are deferred to the supplementary material rather than stated in the paper itself. |
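The Open Datasets entry cites a pretraining mix of four corpora totaling roughly 5.1M image-text pairs. A minimal sketch of how such corpora are typically combined into one indexable training set, mirroring the index routing that PyTorch's `ConcatDataset` performs; the per-corpus pair counts below are illustrative assumptions, not the paper's exact figures:

```python
import bisect

# Assumed, illustrative pair counts per corpus; together they
# approximate the 5.1M image-text pairs the paper reports.
corpus_sizes = {"cc3m": 2_800_000, "sbu": 850_000, "vg": 820_000, "coco": 590_000}

names = list(corpus_sizes)
cumulative = []          # running totals, e.g. [2.8M, 3.65M, 4.47M, 5.06M]
total = 0
for n in names:
    total += corpus_sizes[n]
    cumulative.append(total)

def locate(global_idx):
    """Map a global sample index to (corpus_name, local_index).

    This is the same binary-search routing ConcatDataset uses to
    serve one virtual dataset backed by several corpora.
    """
    corpus_pos = bisect.bisect_right(cumulative, global_idx)
    offset = cumulative[corpus_pos - 1] if corpus_pos else 0
    return names[corpus_pos], global_idx - offset

print(locate(0))           # ('cc3m', 0): first sample of the first corpus
print(locate(2_800_000))   # ('sbu', 0): first sample past the CC3M block
print(total)               # 5060000, close to the reported 5.1M pairs
```

Whether OneR's training loop uses `ConcatDataset` specifically is not stated in the paper; this only illustrates the standard way the cited 4M-image / 5.1M-pair combination would be exposed as a single dataset.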