Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
VITRIX-UniViTAR: Unified Vision Transformer with Native Resolution
Authors: Limeng Qiao, Yiyang Gan, Bairui Wang, Jie Qin, Shuang Xu, Siqi Yang, Lin Ma
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations demonstrate the effectiveness of our proposed methods. |
| Researcher Affiliation | Collaboration | Limeng Qiao, Yiyang Gan, Bairui Wang, Jie Qin, Shuang Xu, Siqi Yang, Lin Ma; Meituan Inc.; EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper includes architectural diagrams (Figure 2) and describes methods in prose, but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and models are available here. |
| Open Datasets | Yes | We collect public accessible image-text pairs and build our Merged-1B dataset, which is composed of DataComp-1B [21], COYO [22], LAION-2B [23], LAION-400M [24], DFN-2B [22], CC12M [25] and CC3M [26]. Moreover, to further enhance the video feature extraction capabilities of UniViTAR, we meticulously constructed a dataset Merged-65M of roughly 65 million samples by randomly selecting video clips from three public accessible video datasets, i.e., Panda-70M [27], WebVid-10M [28], and InternVid-10M-FLT [29]. |
| Dataset Splits | Yes | For cross-modal retrieval assessment, we adopt the benchmark protocols defined in [41], evaluating on Flickr [42] and MS-COCO [43] using their official partitions. |
| Hardware Specification | Yes | Note all experiments are conducted on H800 GPUs. |
| Software Dependencies | No | To enhance training efficiency, we integrated the DeepSpeed library [30] by employing ZeRO optimizer sharding [31], gradient checkpointing [32], and flash attention [33]. |
| Experiment Setup | Yes | The detailed hyperparameter configurations for each training stage are presented in the Table 11. |
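
The Software Dependencies row names DeepSpeed with ZeRO optimizer sharding but gives no versions or settings. The following is a minimal sketch of what a DeepSpeed-style configuration dictionary with ZeRO sharding might look like. The helper name `build_deepspeed_config`, the ZeRO stage, the batch size, and the bf16 setting are all illustrative assumptions and are not taken from the paper. Gradient checkpointing and flash attention are normally enabled in the model code rather than in this configuration, so they are not shown.

```python
import json


def build_deepspeed_config(zero_stage=2, micro_batch_size=8, grad_accum_steps=1):
    """Build a DeepSpeed-style config dict (hypothetical helper).

    Every default value here is an assumption for illustration only;
    the paper does not report the ZeRO stage or these batch settings.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    return {
        # Per-GPU batch size for one forward/backward pass (assumed value).
        "train_micro_batch_size_per_gpu": micro_batch_size,
        # Number of micro-batches accumulated before an optimizer step.
        "gradient_accumulation_steps": grad_accum_steps,
        # Mixed-precision setting (assumed, not stated in the source).
        "bf16": {"enabled": True},
        # ZeRO shards optimizer state (stage 1), gradients (stage 2),
        # and parameters (stage 3) across data-parallel workers.
        "zero_optimization": {"stage": zero_stage},
    }


config = build_deepspeed_config()
print(json.dumps(config, indent=2))
```

A dictionary like this is typically serialized to JSON and passed to the DeepSpeed launcher or initialization call; the actual values used by the authors would have to come from their released code or from Table 11 of the paper.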