Unifying Vision-Language Representation Space with Single-Tower Transformer
Authors: Jiho Jang, Chaerin Kong, Donghyeon Jeon, Seonhoon Kim, Nojun Kwak
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Thorough evaluations demonstrate the potential of a unified modality-agnostic VLP framework. Following prior works (Li et al. 2021; Yang et al. 2022; Gan et al. 2020), we train OneR on the combination of CC3M (Sharma et al. 2018), SBU Captions (Ordonez, Kulkarni, and Berg 2011), Visual Genome (Krishna et al. 2017) and COCO (Lin et al. 2014), which sums up to 4M images and 5.1M image-text pairs. (A hedged sketch of such a single-tower setup appears after the table.) |
| Researcher Affiliation | Collaboration | Jiho Jang (1), Chaerin Kong (1), Donghyeon Jeon (2), Seonhoon Kim (3), Nojun Kwak (1); (1) Seoul National University, (2) NAVER, (3) Coupang. Contact: {geographic,veztylord,nojunk}@snu.ac.kr, donghyeon.jeon@navercorp.com, sekim625@coupang.com |
| Pseudocode | No | The paper provides mathematical formulations and diagrams but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access information, such as a repository link or an explicit statement about code release, for the methodology described. |
| Open Datasets | Yes | Following prior works (Li et al. 2021; Yang et al. 2022; Gan et al. 2020), we train OneR on the combination of CC3M (Sharma et al. 2018), SBU Captions (Ordonez, Kulkarni, and Berg 2011), Visual Genome (Krishna et al. 2017) and COCO (Lin et al. 2014), which sums up to 4M images and 5.1M image-text pairs. |
| Dataset Splits | No | The paper mentions training datasets and testing on subsets such as MS-COCO (5K) and ImageNet/CIFAR-100, but it does not explicitly provide dataset split information (e.g., percentages or sample counts) for training, validation, and test sets, nor does it describe any cross-validation setup. |
| Hardware Specification | Yes | We train our model with 32 A100 GPUs for 40 epochs under PyTorch framework. |
| Software Dependencies | No | The paper mentions 'PyTorch framework' but does not specify a version number or other software dependencies with their versions. |
| Experiment Setup | No | We train our model with 32 A100 GPUs for 40 epochs under PyTorch framework. Details on hyperparameters are listed in the supplementary. (A hedged distributed-training sketch appears after the table.) |
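Since the paper releases no code, a minimal sketch may help readers picture what "single-tower" means here: one shared transformer embeds both image patches and text tokens into a common space, trained with an in-batch contrastive objective. Everything below is an illustrative assumption, not the authors' implementation: the class name `OneTowerEncoder`, all layer sizes, the patch and vocabulary sizes, the mean-pooled projection head, and the symmetric InfoNCE loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneTowerEncoder(nn.Module):
    """Hypothetical single-tower encoder: one shared transformer processes
    both image patches and text tokens. Hyperparameters are illustrative,
    not the paper's."""

    def __init__(self, dim=512, depth=6, heads=8, vocab_size=30522,
                 patch_size=16, image_size=224, max_text_len=64):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Modality-specific embedders feed the same transformer tower.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.img_pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.txt_pos = nn.Parameter(torch.zeros(1, max_text_len, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.tower = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(dim, dim)

    def encode_image(self, pixels):            # pixels: (B, 3, H, W)
        x = self.patch_embed(pixels).flatten(2).transpose(1, 2)
        x = self.tower(x + self.img_pos)
        return F.normalize(self.proj(x.mean(dim=1)), dim=-1)

    def encode_text(self, token_ids):          # token_ids: (B, L)
        x = self.token_embed(token_ids)
        x = self.tower(x + self.txt_pos[:, :x.size(1)])
        return F.normalize(self.proj(x.mean(dim=1)), dim=-1)

    def forward(self, pixels, token_ids):
        return self.encode_image(pixels), self.encode_text(token_ids)


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over in-batch image-text pairs (an assumed
    objective; the paper's exact losses may differ)."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```

The key design point the sketch illustrates is weight sharing: both `encode_image` and `encode_text` pass through the same `self.tower`, so the two modalities are forced into one representation space rather than being aligned across two separate encoders.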
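The hardware row reports 32 A100 GPUs, 40 epochs, and PyTorch, but no versions or launch details. A plausible reproduction at that scale would use standard `torchrun` with `DistributedDataParallel`; the sketch below reuses `OneTowerEncoder` and `contrastive_loss` from the previous block and substitutes random tensors for the real CC3M/SBU Captions/Visual Genome/COCO pairs. The launch geometry (4 nodes x 8 GPUs), batch size, and optimizer settings are all assumptions.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import (ConcatDataset, DataLoader,
                              DistributedSampler, TensorDataset)

# OneTowerEncoder and contrastive_loss as defined in the previous sketch.

def main():
    # Assumed launch, e.g. 4 nodes x 8 GPUs = 32 A100s:
    #   torchrun --nnodes=4 --nproc_per_node=8 pretrain.py
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Random stand-ins for the four corpora (CC3M, SBU Captions,
    # Visual Genome, COCO); a real run would load ~5.1M image-text pairs.
    corpora = [TensorDataset(torch.randn(16, 3, 224, 224),
                             torch.randint(0, 30522, (16, 64)))
               for _ in range(4)]
    corpus = ConcatDataset(corpora)
    sampler = DistributedSampler(corpus)
    loader = DataLoader(corpus, batch_size=8, sampler=sampler)

    model = OneTowerEncoder().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed

    for epoch in range(40):        # the paper reports 40 epochs
        sampler.set_epoch(epoch)   # reshuffle shards each epoch
        for pixels, tokens in loader:
            img, txt = model(pixels.cuda(local_rank),
                             tokens.cuda(local_rank))
            loss = contrastive_loss(img, txt)
            optim.zero_grad()
            loss.backward()
            optim.step()

if __name__ == "__main__":
    main()
```

Note that the forward pass goes through the DDP wrapper (`model(...)`) rather than `model.module`, which is what keeps gradient synchronization correct across the 32 workers.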