SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment

Authors: Ziping Ma, Furong Xu, Jian Liu, Ming Yang, Qingpei Guo

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on five vision-language tasks, including image-text retrieval, image captioning, visual question answering, and zero-shot/finetuned image classification, validate the effectiveness of our proposed method.
Researcher Affiliation | Industry | Ziping Ma, Furong Xu, Jian Liu, Ming Yang, Qingpei Guo (Ant Group). Correspondence to: Qingpei Guo <qingpei.gqp@antgroup.com>.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper references an open-source implementation of CoCa ('https://github.com/mlfoundations/open_clip') but does not provide specific access to the source code for the proposed SyCoCa method.
Open Datasets | Yes | We use the Conceptual Captions 12M (Changpinyo et al., 2021) (CC12M) dataset with 12 million image-caption pairs as the multi-modal pretraining data for all models. Specifically, we collect Laion-2B (Schuhmann et al., 2022) and COYO-700M (Byeon et al., 2022).
Dataset Splits | Yes | The results in Table 2 clearly indicate that SyCoCa surpasses CoCa in all cases, where SyCoCa achieves remarkable improvements of 8%-9% on the validation, test-dev, and test splits of VQA.
Hardware Specification | Yes | We conduct our model training on two machines, each equipped with 8 NVIDIA A100 GPUs, for a total of 20 epochs.
Software Dependencies | No | The paper mentions the AdamW optimizer but does not provide specific software dependencies with version numbers (e.g., libraries, frameworks, or programming language versions).
Experiment Setup | Yes | The batch size during training is set to 2048, and the resolution of pretraining images is set to 224 × 224. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with an initial learning rate of 1e-4. The learning rate schedule follows a cosine decay, including a warm-up period of 5000 steps. In terms of hyperparameters, we simply set λ_IC = 2 following CoCa and λ_TM = 1. The masking ratios r_h and r_l are both empirically set to 50%.
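
As a concrete illustration of the reported experiment setup, the PyTorch sketch below wires up an AdamW optimizer with a 1e-4 initial learning rate and a 5000-step linear warm-up followed by cosine decay. The stand-in model, the total step count, and the variable names for the loss weights and masking ratios are assumptions for illustration only; they are not taken from a released SyCoCa implementation.

```python
# Minimal sketch of the reported pretraining configuration.
# Hyperparameter variable names and the stand-in model are assumptions,
# not taken from the authors' code (which is not publicly released).
import math
import torch

BATCH_SIZE = 2048        # global batch size reported in the paper
IMAGE_SIZE = 224         # pretraining resolution 224 x 224
LAMBDA_IC = 2.0          # weight of the image-captioning loss (λ_IC)
LAMBDA_TM = 1.0          # weight of the text-guided masking loss (λ_TM)
MASK_RATIO_H = 0.5       # masking ratio r_h, set empirically to 50%
MASK_RATIO_L = 0.5       # masking ratio r_l, set empirically to 50%
WARMUP_STEPS = 5000      # warm-up period reported in the paper
TOTAL_STEPS = 100_000    # placeholder; depends on dataset size and 20 epochs

model = torch.nn.Linear(8, 8)  # stand-in for the actual SyCoCa model

# AdamW with an initial learning rate of 1e-4, as reported.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def lr_lambda(step: int) -> float:
    """Linear warm-up for WARMUP_STEPS, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

With the reported 2 machines × 8 A100 GPUs, a global batch of 2048 would correspond to 128 samples per GPU, assuming no gradient accumulation.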