Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Emergent Corpus Pre-training Benefits Vision Language Models

Authors: Makanjuola Adekunmi Ogunleye, Chase Vickery, Ismini Lourentzou

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We pre-train a Vision-Language Model (VLM) using EC tokens generated through a referential game between two artificial agents. Across three diverse cross-modal matching and reasoning benchmarks, EC pretraining yields substantial gains, improving Visual Referring Expression (VRE) accuracy by 108.6% and Visual Entailment (VE) by 69.6%. To further validate the effectiveness of EC pretraining, we introduce LLaVA-1.5-EC, a LLaVA variant trained entirely on EC tokens. LLaVA-1.5-EC outperforms strong LVLM baselines, including BLIP-2 (13B), achieving relative gains of 104.23% on VizWiz, 34.8% on GQA, and 10.8% on VQAv2, and top performance on MMBench, a challenging instruction-following benchmark. These results highlight the transferability and generalization capacity of EC pretraining and underscore the potential of leveraging grounded EC tokens to enhance vision-language reasoning in low-resource settings, especially in settings with limited natural language data.
Researcher Affiliation Academia Makanjuola Ogunleye EMAIL Virginia Tech Chase Vickery EMAIL Virginia Tech Ismini Lourentzou EMAIL University of Illinois Urbana-Champaign
Pseudocode No The generation process ends when either of two conditions is satisfied: the special end-of-sentence symbol [EOS] is generated, or the maximum message length T_max is reached. Initially, at t = 0, m^0 = [CLS] and h^0_S = I_i. At each time step t > 0, the generation of the i-th speaker message token m^t_i can be described by

h^t_S = GRU_S(m^{t-1}_i, h^{t-1}_S),   (1)
m^t_i = Gumbel-Softmax(MLP_S(h^t_S)),   (2)

where the Gumbel-Softmax trick (Jang et al., 2017) is employed to draw samples from categorical distributions of emergent tokens in an end-to-end differentiable way. Here, h^t_S stands for the hidden state at time step t, while MLP_S denotes the multilayer perceptron that speaker S utilizes to project each hidden state into vectors with dimensionality equal to the vocabulary size of the emergent language. This section describes the process using equations and descriptive text, but it does not present a structured pseudocode or algorithm block.
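The speaker rollout described in the row above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the paper's implementation: `step_fn` is a hypothetical interface abstracting the GRU_S + MLP_S pair, and a hard Gumbel-Max draw replaces the differentiable Gumbel-Softmax relaxation, which only matters during end-to-end training.

```python
import numpy as np

def gumbel_max_sample(logits, rng=None):
    """Draw one token id from a categorical via the Gumbel-Max trick.

    The paper uses the differentiable Gumbel-Softmax relaxation
    (Jang et al., 2017); for a forward-only sketch, perturbing the
    logits with Gumbel noise and taking an argmax is equivalent.
    """
    rng = rng or np.random.default_rng()
    gumbel = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    return int(np.argmax(logits + gumbel))

def generate_message(step_fn, image_feat, eos_id, cls_id, t_max=15):
    """Roll out a speaker message: start from [CLS] and the image
    feature I_i, stop at [EOS] or after t_max tokens.

    `step_fn(token, hidden) -> (logits, hidden)` is a hypothetical
    stand-in for one GRU_S step followed by MLP_S.
    """
    token, hidden, message = cls_id, image_feat, []
    for _ in range(t_max):
        logits, hidden = step_fn(token, hidden)
        token = gumbel_max_sample(logits)
        if token == eos_id:
            break
        message.append(token)
    return message
```

With the sequence-length limit of 15 and vocabulary size of 4035 reported elsewhere in the paper, `t_max=15` and 4035-dimensional logits would reproduce the stated generation setting.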
Open Source Code No Project Website: https://plan-lab.github.io/ec-vlm/ (This website states 'Code and data coming soon' and does not provide immediate access to the code for this paper.) Additionally, the paper mentions using external codebases: 'The EC speakers used to generate the EC datasets are directly trained on the COCO image features from the codebase of Yao et al. (2022) for 2000 epochs.' (with footnote 2 pointing to https://github.com/ysymyth/ec-nl/tree/master/ec-game) and 'we adopt the codebase of OFA (Wang et al., 2022a)' (with footnote 3 pointing to https://github.com/OFA-Sys/OFA).
Open Datasets Yes Visual Referring Expression (VRE). We evaluate on the standard RefCOCO benchmark suite (Yu et al., 2016), which includes RefCOCO, RefCOCO+, and RefCOCOg, all derived from the MS-COCO image dataset (Lin et al., 2014). Visual Question Answering (VQA). We conduct experiments on the VQAv2 dataset (Goyal et al., 2017). Visual Entailment (VE). We evaluate VE performance using the SNLI-VE dataset (Xie et al., 2019; 2018). Image Captioning (IC). We evaluate image captioning performance using the Microsoft COCO dataset (Lin et al., 2014). Instruction-Following Benchmarks. We adopt the LLaVA-1.5 pretraining and fine-tuning dataset... utilizes approximately 558K images sampled from the LAION (Schuhmann et al., 2022), CC (Changpinyo et al., 2021), and SBU (Ordonez et al., 2011) synthesized captioning datasets... a mixture of 665K multimodal instruction-following examples, synthesized and sampled from a variety of VQA data sources, including GPT-generated content, GQA (Hudson & Manning, 2019), COCO (Lin et al., 2014), and TextVQA (Singh et al., 2019). In addition, we evaluate on multimodal benchmarks POPE (Li et al., 2023b), MME (Fu et al., 2023), and MM-Vet (Yu et al., 2024).
Dataset Splits Yes Each dataset [RefCOCO] is split into three subsets: val, testA, and testB... This OFA-adapted version of the VQAv2 dataset includes training, validation, and test sets, with 1.8M, 10,402, and 447,793 samples, respectively. To enable EC pretraining, we divide the training set into two halves... SNLI-VE dataset (Xie et al., 2019; 2018), which contains 529,527 training samples, 17,858 validation samples, and 17,901 test samples derived from 29,783 unique images. To examine the effect of EC pretraining under different supervision levels, we conduct experiments using the full training set as well as reduced subsets of 50,000 and 10,000 samples... MSCOCO, which is split into four subsets: caption_stage1_train, caption_stage2_train, caption_val, and caption_test. The dataset includes 566K samples in caption_stage1_train, 113K in caption_stage2_train, 5K in the validation set, and 5K in the test set. For evaluation, we follow the Karpathy split (Karpathy & Fei-Fei, 2015).
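The "divide the training set into two halves" step quoted above can be sketched as a deterministic shuffle-and-split. This is a hypothetical helper under our own assumptions (seeded shuffle, even split); the paper does not specify the exact splitting procedure.

```python
import random

def split_in_half(samples, seed=0):
    """Deterministically shuffle, then split into two disjoint halves:
    e.g. one half for EC pretraining, the other for fine-tuning.

    The seeded shuffle is our assumption for reproducibility; the paper
    does not state how the halves were drawn.
    """
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    mid = len(idx) // 2
    first = [samples[i] for i in idx[:mid]]
    second = [samples[i] for i in idx[mid:]]
    return first, second
```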
Hardware Specification Yes Training is performed on a P100 GPU... Pretraining was conducted on a single NVIDIA A100 GPU for 2 days, while the fine-tuning phase required 2 NVIDIA A100 GPUs and took 2 days to complete. For the Visual Question Answering (VQA) task... Continuous pretraining was executed on a single NVIDIA V100 GPU for 4 days... Fine-tuning for 5 epochs required around 90 hours on 2 P100 GPUs. In the Visual Entailment (VE) task... The fine-tuning process was distributed across 4 NVIDIA A40 GPU workers and took approximately 15 hours to complete.
Software Dependencies No All text is tokenized using the NLTK word tokenizer, and unigram counts are computed and sorted to generate the respective unigram distributions for each corpus. (Footnote 1: https://www.nltk.org/api/nltk.tokenize.html) This mentions NLTK but does not provide a specific version number. No other specific software dependencies with version numbers are provided.
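The tokenize-count-sort procedure quoted above can be sketched as follows. The paper uses NLTK's word tokenizer; to keep this sketch self-contained (no NLTK install or data download), a simple regex tokenizer stands in, and `unigram_distribution` is our own illustrative name.

```python
import re
from collections import Counter

def unigram_distribution(corpus_lines, tokenize=None):
    """Compute a sorted unigram distribution for a corpus.

    `tokenize` defaults to a crude regex tokenizer; swap in
    nltk.tokenize.word_tokenize to match the paper's setup.
    """
    tokenize = tokenize or (lambda s: re.findall(r"\w+|[^\w\s]", s.lower()))
    counts = Counter(tok for line in corpus_lines for tok in tokenize(line))
    total = sum(counts.values())
    # Most-frequent-first list of (token, relative frequency) pairs.
    return [(tok, n / total) for tok, n in counts.most_common()]
```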
Experiment Setup Yes The EC speakers used to generate the EC datasets are directly trained... for 2000 epochs. Training is performed on a P100 GPU, and the sequence length limit is set to 15. Generating EC sequences of length 15, the speakers draw from a vocabulary size of 4035 tokens... Initially, we pre-train the OFA model on the RefCOCO training set... The pretraining process consists of 17 epochs and 492,000 updates. Subsequently, we fine-tune the pre-trained model for 10 epochs and 18,500 updates... Continuous pretraining was executed on a single NVIDIA V100 GPU for 4 days, encompassing 960,000 updates, which corresponded to approximately 4 to 5 epochs. Fine-tuning for 5 epochs required around 90 hours on 2 P100 GPUs... For the Visual Entailment (VE) task, we utilize the pre-trained model... We fine-tune this model on the SNLI-VE dataset (Xie et al., 2019) for 5 epochs and 20,500 updates... For Image Captioning (IC), fine-tuning is conducted in two stages: (1) cross-entropy optimization for two epochs with a batch size of 128, learning rate of 1e-5, and label smoothing of 0.1; (2) CIDEr optimization for three additional epochs using a batch size of 64, disabling dropout and stochastic depth for stability.
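The two-stage image-captioning fine-tuning schedule quoted above can be summarized as a config fragment. The numeric values are taken from the text; the dict layout and key names are our own.

```python
# Hedged summary of the two-stage IC fine-tuning hyperparameters
# reported in the paper; key names are illustrative, not the authors'.
IC_FINETUNE_STAGES = {
    "cross_entropy": {
        "epochs": 2,
        "batch_size": 128,
        "learning_rate": 1e-5,
        "label_smoothing": 0.1,
    },
    "cider_optimization": {
        "epochs": 3,
        "batch_size": 64,
        "dropout": 0.0,           # disabled for stability
        "stochastic_depth": 0.0,  # disabled for stability
    },
}
```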