Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

VITRIX-UniViTAR: Unified Vision Transformer with Native Resolution

Authors: Limeng Qiao, Yiyang Gan, Bairui Wang, Jie Qin, Shuang Xu, Siqi Yang, Lin Ma

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations demonstrate the effectiveness of our proposed methods.
Researcher Affiliation | Collaboration | Limeng Qiao Yiyang Gan Bairui Wang Jie Qin Shuang Xu Siqi Yang Lin Ma Meituan Inc. EMAIL, EMAIL, EMAIL EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper includes architectural diagrams (Figure 2) and describes methods in prose, but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code and models are available here.
Open Datasets | Yes | We collect public accessible image-text pairs and build our Merged-1B dataset, which is composed of DataComp-1B [21], COYO [22], LAION-2B [23], LAION-400M [24], DFN-2B [22], CC12M [25] and CC3M [26]. Moreover, to further enhance the video feature extraction capabilities of UniViTAR, we meticulously constructed a dataset Merged-65M of roughly 65 million samples by randomly selecting video clips from three public accessible video datasets, i.e., Panda-70M [27], WebVid-10M [28], and InternVid-10M-FLT [29].
Dataset Splits | Yes | For cross-modal retrieval assessment, we adopt the benchmark protocols defined in [41], evaluating on Flickr [42] and MS-COCO [43] using their official partitions.
Hardware Specification | Yes | Note all experiments are conducted on H800 GPUs.
Software Dependencies | No | To enhance training efficiency, we integrated the DeepSpeed library [30] by employing ZeRO optimizer sharding [31], gradient checkpointing [32], and flash attention [33].
Experiment Setup | Yes | The detailed hyperparameter configurations for each training stage are presented in the Table 11.