TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training

Authors: Chaoya Jiang, Wei Ye, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Shikun Zhang

Venue: AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results demonstrate that TiMix achieves comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods.
Researcher Affiliation | Collaboration | National Engineering Research Center for Software Engineering, Peking University, Beijing, China; Alibaba Group, Hangzhou, China
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/chaoyajiang/TiMiX/tree/main.
Open Datasets | Yes | Following the previous works (Li et al. 2021) and (Li et al. 2022a), we use the same pre-training dataset of 14M image-text pairs, which includes two in-domain datasets (MS COCO (Lin et al. 2014) and Visual Genome (Krishna et al. 2016)) and three web out-of-domain datasets (Conceptual Captions (Sharma et al. 2018a), Conceptual 12M (Changpinyo et al. 2021a), SBU Captions (Ordonez, Kulkarni, and Berg 2011)).
Dataset Splits | Yes | We evaluated our models by submitting the results to the evaluation server and report the test-dev and test-std scores in Table 1. The fine-tuning hyper-parameters and the details of downstream tasks are described in Appendix D. Tables 1, 2, and 3 use standard splits such as 'dev', 'test-dev', 'test-std', and the 'COCO Karpathy test split'.
Hardware Specification | Yes | Pre-training is run on 8 A100 GPUs (80 GB each).
Software Dependencies | No | The paper mentions specific models and loss functions but does not provide version numbers for any software dependencies such as programming languages, frameworks, or libraries.
Experiment Setup | No | The paper states that 'The fine-tuning hyper-parameters and the details of downstream tasks are described in Appendix D' and directs readers to 'Appendix C to see more detail about the pre-training dataset and pre-training setting'; however, these specific details are not present in the main body of the text provided.
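
Since the paper itself provides no pseudocode (see the Pseudocode row above), the following is a minimal, hypothetical sketch of what a mixup-style, text-aware image-mixing step could look like. The function names, the softmax weighting of the mixing ratio, and the similarity scores are all illustrative assumptions, not the authors' TiMix formulation; consult the paper and the released code for the real method.

```python
import torch

def mixup_image_pair(img_a, img_b, lam):
    # Standard mixup: convex combination of two image tensors (C, H, W).
    return lam * img_a + (1.0 - lam) * img_b

def text_aware_lambda(sim_a, sim_b):
    # Hypothetical text-aware mixing ratio: softmax over image-text
    # similarity scores, so the image whose caption matches it better
    # dominates the mix. An illustrative assumption, not the paper's rule.
    weights = torch.softmax(torch.tensor([sim_a, sim_b]), dim=0)
    return weights[0].item()

# Usage with dummy data: two random "images" and made-up similarity scores.
img_a = torch.rand(3, 224, 224)
img_b = torch.rand(3, 224, 224)
lam = text_aware_lambda(sim_a=0.72, sim_b=0.41)  # hypothetical CLIP-style scores
mixed = mixup_image_pair(img_a, img_b, lam)
print(mixed.shape, round(lam, 3))
```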