Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Scaling Language-centric Omnimodal Representation Learning

Authors: Chenghao Xiao, Hou Pong (Ken) Chan, Hao Helen Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities evolves as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities.
Researcher Affiliation	Industry	¹DAMO Academy, Alibaba Group; ²Hupan Lab
Pseudocode	No	The paper describes methodologies and procedures through textual descriptions and mathematical formulations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured code-like steps for any procedure.
Open Source Code	Yes	Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.
Open Datasets	Yes	We consider two data settings: all-NLI and Scale-1M. The all-NLI combines MNLI [70] and SNLI [5], both frequently used for sentence representation learning. ... We further construct Scale-1M, a curated collection of 1M sentence pairs sampled from 20M multilingual parallel corpora, including Global Voice [48], MUSE [53], News Commentary [62], Tatoeba [1], Talks [52], WikiMatrix [55], and other Sentence Transformers sources [51]. ... We utilize paired datasets, i.e., PixmoCap [13] for image-text, AudioCaps [32] for audio-text, and MSR-VTT [79] for video-text, for anisotropy comparison.
Dataset Splits	Yes	We use 276k triplets from all-NLI with entailments as positives and contradictions as hard negatives. ... Building on all-NLI, we further add 94k synthetic multimodal pairs (Appendix B) to enhance alignment in the downstream task format space, yielding a final dataset of 370k triplets. ... For conducting additional OCR-intensive generative training, we construct a training set leveraging images that do not correspond to retrieval test set queries, resulting in 4k seed images.
Hardware Specification	Yes	GPU hours are benchmarked by hours * number of H20 GPUs.
Software Dependencies	No	The paper mentions 'AdamW optimizer' and 'LoRA' as components of the training process, but it does not specify concrete version numbers for any software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup	Yes	We adopt AdamW optimizer with a cosine learning rate schedule, a peak learning rate of 4 × 10⁻⁴, and a batch size of 768 to train the model for 2 epochs. The LoRA rank (r) and α are set as 64 and 16 for text-only variants and 64 and 128 for multimodal variants, respectively. For multimodal variants of Qwen2.5-Omni-7B, we use a reduced learning rate of 3 × 10⁻⁴ due to the loss spike.
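
The hyperparameters quoted in the Experiment Setup row can be made concrete with a small sketch. It is an illustration only and is not taken from the paper's released code. The helper names (`cosine_lr`, `lora_scaling`, `total_steps`) are hypothetical. The sketch assumes no warmup and decay to zero, neither of which the quoted text specifies, and it assumes the 370k-triplet multimodal set is trained at batch size 768. The LoRA scaling factor α / r is the standard LoRA convention.

```python
import math


def cosine_lr(step: int, total: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at step `total`.

    Assumption: no warmup phase and min_lr = 0; the quoted setup does not say.
    """
    progress = min(max(step / total, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update: alpha / r."""
    return alpha / rank


def total_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps, assuming the last partial batch of each epoch is kept."""
    return math.ceil(num_examples / batch_size) * epochs


# Values quoted in the table; the pairing of 370k triplets with batch size 768
# is an assumption made for illustration.
steps = total_steps(num_examples=370_000, batch_size=768, epochs=2)

print("total steps:", steps)
print("lr at start:", cosine_lr(0, steps, peak_lr=4e-4))
print("lr at midpoint:", cosine_lr(steps // 2, steps, peak_lr=4e-4))
print("LoRA scaling, text-only (alpha=16, r=64):", lora_scaling(16, 64))
print("LoRA scaling, multimodal (alpha=128, r=64):", lora_scaling(128, 64))
```

Under these assumptions, the multimodal variants apply a LoRA update scaled eight times more strongly than the text-only variants (2.0 versus 0.25), which may be relevant when reading the reduced learning rate reported for Qwen2.5-Omni-7B.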