SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment

Authors: Ziping Ma, Furong Xu, Jian Liu, Ming Yang, Qingpei Guo

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on five vision-language tasks, including image-text retrieval, image captioning, visual question answering, and zero-shot/finetuned image classification, validate the effectiveness of our proposed method.
Researcher Affiliation | Industry | Ziping Ma, Furong Xu, Jian Liu, Ming Yang, Qingpei Guo (Ant Group). Correspondence to: Qingpei Guo <qingpei.gqp@antgroup.com>.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper references an open-source implementation of CoCa ('https://github.com/mlfoundations/open_clip') but does not provide specific access to the source code for the proposed SyCoCa method.
Open Datasets | Yes | We use the Conceptual Captions 12M (Changpinyo et al., 2021) (CC12M) dataset with 12 million image-caption pairs as the multi-modal pretraining data for all models. Specifically, we collect Laion-2B (Schuhmann et al., 2022) and COYO-700M (Byeon et al., 2022).
Dataset Splits | Yes | The results in Table 2 clearly indicate that SyCoCa surpasses CoCa in all cases, where SyCoCa achieves remarkable improvements of 8%-9% on the validation, test-dev, and test splits of VQA.
Hardware Specification | Yes | We conduct our model training on two machines, each equipped with 8 NVIDIA A100 GPUs, for a total of 20 epochs.
Software Dependencies | No | The paper mentions the AdamW optimizer but does not provide specific software dependencies with version numbers (e.g., libraries, frameworks, or programming language versions).
Experiment Setup | Yes | The batch size during training is set to 2048, and the resolution of pretraining images is set to 224 × 224. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with an initial learning rate of 1e-4. The learning rate schedule follows a cosine decay, including a warm-up period of 5000 steps. In terms of hyperparameters, we simply set λ_IC = 2 following CoCa and λ_TM = 1. The masking ratios r_h and r_l are both empirically set to 50%.
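
As a concrete illustration of the reported experiment setup, the PyTorch sketch below wires up an AdamW optimizer with a 1e-4 initial learning rate and a 5000-step linear warm-up followed by cosine decay. The stand-in model, the total step count, and the variable names for the loss weights and masking ratios are assumptions for illustration only; they are not taken from a released SyCoCa implementation.

```python
# Minimal sketch of the reported pretraining configuration.
# Hyperparameter variable names and the stand-in model are assumptions,
# not taken from the authors' code (which is not publicly released).
import math
import torch

BATCH_SIZE = 2048        # global batch size reported in the paper
IMAGE_SIZE = 224         # pretraining resolution 224 x 224
LAMBDA_IC = 2.0          # weight of the image-captioning loss (λ_IC)
LAMBDA_TM = 1.0          # weight of the text-guided masking loss (λ_TM)
MASK_RATIO_H = 0.5       # masking ratio r_h, set empirically to 50%
MASK_RATIO_L = 0.5       # masking ratio r_l, set empirically to 50%
WARMUP_STEPS = 5000      # warm-up period reported in the paper
TOTAL_STEPS = 100_000    # placeholder; depends on dataset size and 20 epochs

model = torch.nn.Linear(8, 8)  # stand-in for the actual SyCoCa model

# AdamW with an initial learning rate of 1e-4, as reported.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def lr_lambda(step: int) -> float:
    """Linear warm-up for WARMUP_STEPS, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

With the reported 2 machines × 8 A100 GPUs, a global batch of 2048 would correspond to 128 samples per GPU, assuming no gradient accumulation.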