Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining

Authors: Raghuveer Thirukovalluru, Rui Meng, Ye Liu, Karthikeyan K, Mingyi Su, Ping Nie, Semih Yavuz, Yingbo Zhou, Wenhu Chen, Bhuwan Dhingra

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results on the MMEB multimodal embedding benchmark (36 tasks) demonstrate that our method sets a new state of the art, outperforming previous best methods by +1.3 and +2.9 points at the 7B and 2B model scales, respectively.
Researcher Affiliation	Collaboration	Duke University; Salesforce AI Research; Independent; University of Waterloo
Pseudocode	Yes	Figure 1: The batch mining mechanism of B3. Initially, a teacher model generates a rank matrix R over the training set, indicating potential negative relationships. From these rankings (specifically ranks in the range [p : p + m] for each query), an undirected sparse preference graph S is constructed. Then, METIS clustering is applied to identify communities of mutually strong negatives. Finally, diverse training batches of size |B| are formed by sampling examples from |B|/K distinct communities.
Open Source Code	Yes	Code and Models: https://github.com/raghavlite/B3
Open Datasets	Yes	Training Set: We train our models on the MMEB [8] training set. We evaluate our methods on the MMEB [8] benchmark. Following UniME [6], we perform zero-shot evaluation of B3++ on short (Flickr [23], COCO [15]) and long (Urban1k [35]) image caption retrieval. Table 13: Additional positive instruction prompts that were used in B3, B3++. For positives of other datasets, we used the existing prompts from VLM2Vec [8]. These prompts are expected to decouple diverse tasks during training.
Dataset Splits	Yes	We train our models on the MMEB [8] training set. We evaluate our methods on the MMEB [8] benchmark. For p, we tuned this value on heldout portions of the train set.
Hardware Specification	Yes	All training and evaluation were conducted on 8 H200 GPUs.
Software Dependencies	No	All models in this work are trained using LoRA with a rank of 8.
Experiment Setup	Yes	We train for 2000 steps (~2 epochs) with a batch size of 1024 unless specified. We use m = 100 following SFR-Embedding and NV-Retriever. For p, we tuned this value on heldout portions of the train set. We used p = 30 for retrieval and grounding tasks and p = 70 for VQA tasks. For classification tasks, we just filter out the golden label from the rank list. We use 5 hard negatives (h = 5) for B3++ and no hard negatives in B3. All models in this work are trained using LoRA with a rank of 8. Unless mentioned, models are trained for 2k steps with peak learning rate of 1e-4 and warmup of 10%. Temperature used was 0.02.
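
The batch mining mechanism quoted in the Pseudocode row can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes the rank matrix is given as one ranked candidate list per query, and it swaps METIS for a simple greedy grouping so the example runs without extra dependencies. All function names, the toy data, and the reading of "sampling examples from |B|/K distinct communities" as "concatenate |B|/K communities of size K" are assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_preference_graph(rank_matrix, p, m):
    """Link each query to the candidates at ranks [p : p + m] (undirected)."""
    graph = defaultdict(set)
    for query, ranked in enumerate(rank_matrix):
        for candidate in ranked[p:p + m]:
            if candidate != query:
                graph[query].add(candidate)
                graph[candidate].add(query)  # symmetrise the edge
    return graph


def greedy_communities(graph, num_nodes, community_size):
    """Stand-in for METIS (assumption): grow fixed-size groups greedily."""
    unassigned = set(range(num_nodes))
    communities = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        community = [seed]
        while len(community) < community_size and unassigned:
            members = set(community)
            # Pick the unassigned node with the most edges into the group;
            # ties go to the smallest index.
            best = max(sorted(unassigned), key=lambda n: len(graph[n] & members))
            community.append(best)
            unassigned.remove(best)
        communities.append(community)
    return communities


def form_batches(communities, batch_size, community_size, seed=0):
    """Build each batch from batch_size // community_size communities."""
    rng = random.Random(seed)
    per_batch = batch_size // community_size
    order = communities[:]
    rng.shuffle(order)
    batches = []
    for start in range(0, len(order), per_batch):
        group = order[start:start + per_batch]
        batches.append([idx for community in group for idx in community])
    return batches


def toy_rank_matrix():
    """Hypothetical teacher ranks: 8 examples in two blocks of 4.

    Each query ranks itself first, then its own block, then the other block.
    """
    blocks = [[0, 1, 2, 3], [4, 5, 6, 7]]
    ranks = []
    for q in range(8):
        own, other = blocks[q // 4], blocks[1 - q // 4]
        ranks.append([q] + [i for i in own if i != q] + other)
    return ranks


if __name__ == "__main__":
    graph = build_preference_graph(toy_rank_matrix(), p=1, m=3)
    communities = greedy_communities(graph, num_nodes=8, community_size=2)
    for batch in form_batches(communities, batch_size=4, community_size=2):
        print(batch)
```

In the toy data, skipping the top p = 1 rank drops the query itself, and the next m = 3 ranks connect each example to the rest of its block, so the graph splits into two cliques. The settings reported above (m = 100, p = 30 or 70, batch size 1024) would replace these toy values, and a real graph partitioner such as METIS would replace the greedy step.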