Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Authors: Ziping Ma, Furong Xu, Jian Liu, Ming Yang, Qingpei Guo
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on five vision-language tasks, including image-text retrieval, image-captioning, visual question answering, and zero-shot/finetuned image classification, validate the effectiveness of our proposed method. |
| Researcher Affiliation | Industry | Ziping Ma¹, Furong Xu¹, Jian Liu¹, Ming Yang¹, Qingpei Guo¹. ¹Ant Group. Correspondence to: Qingpei Guo <EMAIL>. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper references an open-source implementation of CoCa ('https://github.com/mlfoundations/open_clip') but does not provide specific access to the source code for the proposed SyCoCa method. |
| Open Datasets | Yes | We use the Conceptual Captions 12M (Changpinyo et al., 2021) (CC12M) dataset with 12 million image-caption pairs, as the multi-modal pretraining data for all models. [...] Specifically, we collect Laion-2B (Schuhmann et al., 2022) and COYO-700M (Byeon et al., 2022). |
| Dataset Splits | Yes | The results in Table 2 clearly indicate that SyCoCa surpasses CoCa in all cases, where SyCoCa achieves remarkable improvements of 8%-9% on the validation, test-dev, and test splits of VQA. |
| Hardware Specification | Yes | We conduct our model training on two machines, each equipped with 8 NVIDIA A100 GPUs, for a total of 20 epochs. |
| Software Dependencies | No | The paper mentions the AdamW optimizer but does not provide specific software dependencies with version numbers (e.g., libraries, frameworks, or programming language versions). |
| Experiment Setup | Yes | The batch size during training is set to 2048, and the resolution of pretraining images is set to 224 × 224. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with an initial learning rate of 1e-4. The learning rate schedule follows a cosine decay, including a warm-up period of 5000 steps. In terms of hyperparameters, we simply set $\lambda_{IC} = 2$ following CoCa and $\lambda_{TM} = 1$. The masking ratios $r_h$ and $r_l$ are both empirically set to 50%. |
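
The learning-rate schedule quoted in the Experiment Setup row (initial rate 1e-4, 5000 warm-up steps, cosine decay) can be illustrated with a minimal sketch. Several details are assumptions, not taken from the paper: the warm-up is taken to be linear, the decay floor `min_lr` is taken to be 0, and the total step count is a rough estimate from 12 million pairs, a batch size of 2048, and 20 epochs. The function name `lr_at_step` is hypothetical.

```python
import math

# Values quoted in the table above.
BASE_LR = 1e-4
WARMUP_STEPS = 5000
BATCH_SIZE = 2048
EPOCHS = 20

# Assumption: exactly 12M usable pairs in CC12M, incomplete final batch dropped.
STEPS_PER_EPOCH = 12_000_000 // BATCH_SIZE
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS


def lr_at_step(step, total_steps=TOTAL_STEPS, base_lr=BASE_LR,
               warmup_steps=WARMUP_STEPS, min_lr=0.0):
    """Return the learning rate at a 0-indexed optimizer step.

    Linear warm-up (assumed shape) followed by cosine decay to min_lr
    (assumed floor); neither detail is confirmed by the source.
    """
    if step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, WARMUP_STEPS - 1, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(s, lr_at_step(s))
```

Under these assumptions the rate peaks at 1e-4 at the end of warm-up and falls to the floor at the final step; a different warm-up shape or a non-zero floor would change the curve but not the overall structure.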