SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Authors: Ziping Ma, Furong Xu, Jian Liu, Ming Yang, Qingpei Guo
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on five vision-language tasks, including image-text retrieval, image captioning, visual question answering, and zero-shot/finetuned image classification, validate the effectiveness of our proposed method. |
| Researcher Affiliation | Industry | Ziping Ma, Furong Xu, Jian Liu, Ming Yang, Qingpei Guo (Ant Group). Correspondence to: Qingpei Guo <qingpei.gqp@antgroup.com>. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper references an open-source implementation of CoCa ('https://github.com/mlfoundations/open_clip') but does not provide specific access to the source code for the proposed SyCoCa method. |
| Open Datasets | Yes | We use the Conceptual Captions 12M (CC12M) dataset (Changpinyo et al., 2021) with 12 million image-caption pairs as the multi-modal pretraining data for all models. Specifically, we collect Laion-2B (Schuhmann et al., 2022) and COYO-700M (Byeon et al., 2022). |
| Dataset Splits | Yes | The results in Table 2 clearly indicate that SyCoCa surpasses CoCa in all cases, where SyCoCa achieves remarkable improvements of 8%-9% on the validation, test-dev, and test splits of VQA. |
| Hardware Specification | Yes | We conduct our model training on two machines, each equipped with 8 NVIDIA A100 GPUs, for a total of 20 epochs. |
| Software Dependencies | No | The paper mentions the AdamW optimizer but does not provide specific software dependencies with version numbers (e.g., libraries, frameworks, or programming language versions). |
| Experiment Setup | Yes | The batch size during training is set to 2048, and the resolution of pretraining images is set to 224 × 224. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with an initial learning rate of 1e-4. The learning rate schedule follows a cosine decay, including a warm-up period of 5000 steps. In terms of hyperparameters, we simply set λ_IC = 2 following CoCa and λ_TM = 1. The masking ratios r_h and r_l are both empirically set to 50%. (A minimal configuration sketch follows the table.) |
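
For readers reproducing the setup quoted in the last row, below is a minimal PyTorch sketch of the reported optimization settings (AdamW, initial learning rate 1e-4, cosine decay with a 5000-step warm-up, loss weights λ_IC = 2 and λ_TM = 1). The model constructor, the data loader, and the names of the returned loss terms are assumptions for illustration; the paper does not release code, so this is not the authors' implementation.

```python
# Sketch of the pretraining optimization setup reported in the table.
# `model` and `loader` are hypothetical placeholders (e.g., a SyCoCa-style
# model returning per-objective losses, and a CC12M loader with batch size
# 2048 of 224x224 images); only the optimizer/schedule/weights come from
# the paper's reported setup.
import math
import torch

def cosine_with_warmup(step, total_steps, warmup_steps=5000):
    """Learning-rate scale factor: linear warm-up, then cosine decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def train(model, loader, total_steps, device="cuda"):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: cosine_with_warmup(step, total_steps)
    )
    lambda_ic, lambda_tm = 2.0, 1.0  # loss weights reported in the table

    model.train()
    for step, (images, texts) in enumerate(loader):
        images, texts = images.to(device), texts.to(device)
        losses = model(images, texts)  # assumed to return a dict of per-objective losses
        loss = (losses["contrastive"]
                + lambda_ic * losses["captioning"]
                + lambda_tm * losses["masking"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
        if step + 1 >= total_steps:
            break
```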