Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining
Authors: Raghuveer Thirukovalluru, Rui Meng, Ye Liu, Karthikeyan K, Mingyi Su, Ping Nie, Semih Yavuz, Yingbo Zhou, Wenhu Chen, Bhuwan Dhingra
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on the MMEB multimodal embedding benchmark (36 tasks) demonstrate that our method sets a new state of the art, outperforming previous best methods by +1.3 and +2.9 points at the 7B and 2B model scales, respectively. |
| Researcher Affiliation | Collaboration | Duke University, Salesforce AI Research, Independent, University of Waterloo |
| Pseudocode | Yes | Figure 1: The batch mining mechanism of B3. Initially, a teacher model generates a rank matrix R over the training set, indicating potential negative relationships. From these rankings (specifically ranks in the range [p : p + m] for each query), an undirected sparse preference graph S is constructed. Then, METIS clustering is applied to identify communities of mutually strong negatives. Finally, diverse training batches of size |B| are formed by sampling examples from |B|/K distinct communities. |
| Open Source Code | Yes | Code and Models: https://github.com/raghavlite/B3 |
| Open Datasets | Yes | Training Set: We train our models on the MMEB [8] training set. We evaluate our methods on the MMEB [8] benchmark. Following UniME [6], we perform zero-shot evaluation of B3++ on short (Flickr [23], COCO [15]) and long (Urban1k [35]) image caption retrieval. Table 13: Additional positive instruction prompts that were used in B3, B3++. For positives of other datasets, we used the existing prompts from VLM2Vec [8]. These prompts are expected to decouple diverse tasks during training. |
| Dataset Splits | Yes | We train our models on the MMEB [8] training set. We evaluate our methods on the MMEB [8] benchmark. For p, we tuned this value on heldout portions of the train set. |
| Hardware Specification | Yes | All training and evaluation were conducted on 8 H200 GPUs. |
| Software Dependencies | No | All models in this work are trained using LoRA with a rank of 8. |
| Experiment Setup | Yes | We train for 2000 steps (~2 epochs) with a batch size of 1024 unless specified. We use m = 100 following SFR-Embedding and NV-Retriever. For p, we tuned this value on heldout portions of the train set. We used p = 30 for retrieval and grounding tasks and p = 70 for VQA tasks. For classification tasks, we just filter out the golden label from the rank list. We use 5 hard negatives (h = 5) for B3++ and no hard negatives in B3. All models in this work are trained using LoRA with a rank of 8. Unless mentioned, models are trained for 2k steps with peak learning rate of 1e-4 and warmup of 10%. Temperature used was 0.02. |
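
The batch mining steps quoted in the Pseudocode row (rank window [p : p + m], undirected preference graph, communities, batches drawn from |B|/K communities) can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, the teacher rankings are assumed to be given as plain Python lists, and the METIS clustering step is replaced by a simple greedy grouping so that the example needs only the standard library.

```python
import random
from collections import defaultdict


def build_preference_graph(ranks, p, m):
    """Undirected sparse graph: link query i to the items at ranks p .. p+m-1."""
    graph = defaultdict(set)
    for i, ranked in enumerate(ranks):
        for j in ranked[p:p + m]:
            if j != i:
                graph[i].add(j)
                graph[j].add(i)  # symmetrise so the graph is undirected
    return graph


def greedy_communities(graph, n, k):
    """Assumed stand-in for METIS: grow groups of at most k nodes around seeds."""
    unassigned = set(range(n))
    communities = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        group = [seed]
        while len(group) < k and unassigned:
            # Pick the unassigned node with the most edges into the current group
            # (ties go to the lowest index).
            best = max(sorted(unassigned),
                       key=lambda v: sum(1 for u in group if v in graph[u]))
            unassigned.remove(best)
            group.append(best)
        communities.append(group)
    return communities


def form_batches(communities, batch_size, k, seed=0):
    """Each batch is the union of batch_size // k randomly chosen communities."""
    rng = random.Random(seed)
    order = communities[:]
    rng.shuffle(order)
    per_batch = batch_size // k
    return [
        [idx for group in order[s:s + per_batch] for idx in group]
        for s in range(0, len(order), per_batch)
    ]


# Toy data (made up): 8 items, each ranked by index distance to the query.
ranks = [sorted((j for j in range(8) if j != i), key=lambda j: (abs(i - j), j))
         for i in range(8)]
graph = build_preference_graph(ranks, p=1, m=2)
communities = greedy_communities(graph, n=8, k=2)
batches = form_batches(communities, batch_size=4, k=2)
```

In this toy run the window skips the single top-ranked item (p = 1) and keeps the next two (m = 2); the quoted setup uses m = 100 with p = 30 or p = 70 depending on the task. A real pipeline would presumably call a METIS binding at the clustering step instead of the greedy loop shown here.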