Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query

Authors: Wei Chow, Yuan Gao, Linfeng Li, Xian Wang, Qi Xu, Hang Song, Lingdong Kong, Ran Zhou, Yi Zeng, Yidong Cai, Botian Jiang, Shilin Xu, Jiajun Zhang, Minghui Qiu, Xiangtai Li, Tianshu Yang, Siliang Tang, Juncheng Li

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on MERIT identify the existing models' critical limitation: focusing solely on global semantic information while neglecting specific conditional elements in queries. Consequently, we propose CORAL, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction to preserve fine-grained conditional elements and contrastive learning to extract comprehensive global semantics. Experiments demonstrate that CORAL achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization capabilities validated across 8 established retrieval benchmarks.
Researcher Affiliation	Collaboration	Wei Chow¹, Yuan Gao¹, Linfeng Li¹, Xian Wang¹, Qi Xu¹, Hang Song¹, Lingdong Kong¹, Ran Zhou¹, Yi Zeng¹, Yidong Cai¹, Botian Jiang¹, Shilin Xu¹, Jiajun Zhang¹, Minghui Qiu¹, Xiangtai Li¹, Tianshu Yang¹, Siliang Tang², Juncheng Li²; ¹ByteDance Inc., ²Zhejiang University
Pseudocode	No	The paper includes mathematical formulations and describes processes verbally and with flowcharts (e.g., Figure 11), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Data & Code: MERIT-2025.github.io. To ensure reproducibility, code and data are committed to be publicly available.
Open Datasets	Yes	Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries with 135,000 products in 5 languages, covering 7 distinct product categories. Data & Code: MERIT-2025.github.io. To further validate the efficacy of CORAL, we conducted evaluations on 8 retrieval benchmarks with experimental configurations following the methodology described in [39]. The results are illustrated in Fig. 9. Comparative analyses between our approach and other foundational models, such as CLIP [68] and E5-V [38], are presented in Appendix E.4. Experimental results demonstrate that our method achieves consistent improvements across these eight retrieval tasks, with particularly notable performance on VisDial [16], where our approach exhibits a 181% enhancement over the baseline.
Dataset Splits	Yes	For convenience, the dataset is partitioned into training and test sets, containing 310,000 and 10,000 entries respectively. MERIT is divided into training and test sets, consisting of 310,000 and 10,000 queries respectively as mentioned in Sec. 3.1. To ensure equitable representation of each source dataset within the test split and maintain distributional consistency of language and product categories between the test set and the complete dataset, we implemented a stratified sampling methodology: 1. Random sampling of queries to achieve proportional representation of languages and product categories in alignment with the full dataset distribution. 2. Supplementary random sampling of remaining queries from individual source datasets according to their respective volumetric contributions to the complete corpus.
Hardware Specification	Yes	All experiments were conducted on a computing node equipped with 8 H100 GPUs.
Software Dependencies	No	The paper mentions tools and frameworks like VLMEvalKit [15, 69] and LoRA [28] but does not provide specific version numbers for any key software components or libraries.
Experiment Setup	Yes	Experiments were conducted for a single epoch with the following training configuration: A per-device batch size of 4 was employed with gradient accumulation steps set to 2, resulting in an effective global batch size of 64. The InfoNCE contrastive loss temperature parameter (τ) was fixed at 0.02. For negative sampling, we implemented in-batch negatives combined with cross-device negative sample gathering, achieving a final positive-to-negative ratio of 1 : 63. For full-parameter fine-tuning, we adopted a learning rate of 1e-5 with weight decay of 0.0005 and linear warmup ratio of 0.01. The LoRA [28] configuration employed the following parameters: learning rate of 1e-4 (10 times higher than full fine-tuning), identical weight decay (0.0005) and warmup ratio (0.01), with LoRA-specific hyperparameters set to r = 8, α = 16, no bias terms, and a dropout rate of 0.05 between LoRA layers. The CORAL framework was configured with the following hyperparameters: the loss weighting coefficients λ1 and λ2 were both set to 0.1 to balance the objective components, while maintaining uniform masking probabilities of 0.5 for both visual and linguistic modalities.
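
The batch-size arithmetic and the contrastive objective in the Experiment Setup row can be illustrated with a short sketch. This is not the authors' code: it assumes all 8 GPUs from the Hardware Specification row are used for data-parallel training (which reproduces the stated global batch size of 64), and it uses a generic InfoNCE formulation with in-batch negatives. The function names `effective_batch_size` and `info_nce` are hypothetical and chosen only for illustration.

```python
import math

# Values quoted in the Experiment Setup and Hardware Specification rows.
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 2
NUM_GPUS = 8          # assumption: all 8 H100 GPUs train in data parallel
TEMPERATURE = 0.02    # InfoNCE temperature (tau)


def effective_batch_size(per_device, accum_steps, num_devices):
    """Global batch size = per-device batch x accumulation steps x devices."""
    return per_device * accum_steps * num_devices


def info_nce(query_embs, target_embs, temperature):
    """Mean InfoNCE loss with in-batch negatives (generic formulation).

    query_embs[i] is paired with target_embs[i]; every other target in the
    batch acts as a negative for query i.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    queries = [normalize(v) for v in query_embs]
    targets = [normalize(v) for v in target_embs]

    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity to every target, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in targets]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


global_batch = effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS, NUM_GPUS)
print(global_batch)      # 64
print(global_batch - 1)  # 63 negatives per positive if all 64 items are candidates
```

Under these assumptions, each query sees one positive and 63 other items in a 64-item batch, which matches the reported 1 : 63 ratio. How the paper gathers negatives across devices and accumulation steps is not specified in this entry.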