Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Scaling Language-centric Omnimodal Representation Learning

Authors: Chenghao Xiao, Hou Pong (Ken) Chan, Hao Helen Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities evolves as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities.
Researcher Affiliation | Industry | (1) DAMO Academy, Alibaba Group; (2) Hupan Lab
Pseudocode | No | The paper describes methodologies and procedures through textual descriptions and mathematical formulations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured code-like steps for any procedure.
Open Source Code | Yes | Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.
Open Datasets | Yes | We consider two data settings: all-NLI and Scale-1M. The all-NLI combines MNLI [70] and SNLI [5], both frequently used for sentence representation learning. ... We further construct Scale-1M, a curated collection of 1M sentence pairs sampled from 20M multilingual parallel corpora, including Global Voice [48], MUSE [53], News Commentary [62], Tatoeba [1], Talks [52], WikiMatrix [55], and other Sentence Transformers sources [51]. ... We utilize paired datasets, i.e., Pixmo Cap [13] for image-text, AudioCaps [32] for audio-text, and MSR-VTT [79] for video-text, for anisotropy comparison.
Dataset Splits | Yes | We use 276k triplets from all-NLI with entailments as positives and contradictions as hard negatives. ... Building on all-NLI, we further add 94k synthetic multimodal pairs (Appendix B) to enhance alignment in the downstream task format space, yielding a final dataset of 370k triplets. ... For conducting additional OCR-intensive generative training, we construct a training set leveraging images that do not correspond to retrieval test set queries, resulting in 4k seed images.
Hardware Specification | Yes | GPU hours are benchmarked by hours * number of H20 GPUs.
Software Dependencies | No | The paper mentions 'AdamW optimizer' and 'LoRA' as components of the training process, but it does not specify concrete version numbers for any software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup | Yes | We adopt AdamW optimizer with a cosine learning rate schedule, a peak learning rate of 4 × 10⁻⁴, and a batch size of 768, to train the model for 2 epochs. The LoRA rank (r) and α are set as 64 and 16 for text-only variants and 64 and 128 for multimodal variants, respectively. For multimodal variants of Qwen2.5-Omni-7B, we use a reduced learning rate of 3 × 10⁻⁴ due to the loss spike.
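
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The Python below is illustrative only and is not taken from the paper's repository: the names `TrainConfig`, `total_steps`, and `cosine_lr` are hypothetical, the schedule assumes no warmup and a decay to zero (the quoted text does not specify either), and the 370k example count is used only as an illustrative dataset size. The `lora_scale` method uses the standard LoRA convention of scaling the low-rank update by α / r.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted in the Experiment Setup row (text-only variant defaults).
    peak_lr: float = 4e-4
    batch_size: int = 768
    epochs: int = 2
    lora_rank: int = 64
    lora_alpha: int = 16

    def lora_scale(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_rank


def total_steps(num_examples: int, cfg: TrainConfig) -> int:
    # Assumption: the last partial batch of each epoch is kept, hence the ceiling.
    return math.ceil(num_examples / cfg.batch_size) * cfg.epochs


def cosine_lr(step: int, total: int, peak_lr: float) -> float:
    # Assumption: no warmup, cosine decay from peak_lr at step 0 to 0 at the final step.
    progress = min(max(step, 0), total) / total
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    text_cfg = TrainConfig()
    multimodal_cfg = TrainConfig(lora_alpha=128)
    steps = total_steps(370_000, multimodal_cfg)  # illustrative dataset size
    print("total optimizer steps:", steps)
    print("LoRA scale (text-only):", text_cfg.lora_scale())
    print("LoRA scale (multimodal):", multimodal_cfg.lora_scale())
    for s in (0, steps // 2, steps):
        print(f"step {s}: lr = {cosine_lr(s, steps, multimodal_cfg.peak_lr):.6f}")
```

Under these assumptions, the two quoted LoRA settings differ only in the effective update scale (0.25 for text-only versus 2.0 for multimodal), which may help explain why the multimodal Qwen2.5-Omni-7B run needed a lower peak learning rate; the source attributes the reduction only to a loss spike.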