Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

UniViT: Unifying Image and Video Understanding in One Vision Encoder

Authors: Feilong Tang, Xiang An, Haolin Yang, Yin Xie, Kaicheng Yang, Ming Hu, Zheng Cheng, Xingyu Zhou, Zimin Ran, Imran Razzak, Ziyong Feng, Behzad Bozorgtabar, Jiankang Deng, Zongyuan Ge

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across varying model scales demonstrate that UniViT achieves state-of-the-art performance on linear probing, attentive probing, question answering, and spatial understanding tasks.
Researcher Affiliation | Collaboration | 1 Monash University, 2 DeepGlint, 3 MBZUAI, 4 UTS, 5 EPFL, 6 Imperial College London
Pseudocode | No | The paper describes methods using mathematical formulations and descriptive text, but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The study is based on a proprietary dataset that is subject to confidentiality constraints. Due to these restrictions, we are currently unable to provide public access to the dataset and code. We are exploring possibilities for future release, subject to approval.
Open Datasets | Yes | Our models are pre-trained on the LAION400M [50], COYO700M [13], and InternVid [61].
Dataset Splits | No | The paper evaluates on benchmarks such as EuroSAT, Resisc45, Caltech101, ImageNet, SUN397, HMDB51, UCF101, and RareAct under few-shot attentive and linear probing settings. However, it does not give percentages, sample counts, or a detailed methodology for the training/validation/test splits of either the pre-training datasets (LAION400M, COYO700M, and InternVid) or the downstream evaluation datasets.
Hardware Specification | Yes | We use 80 H800 GPUs for the training process.
Software Dependencies | No | The paper mentions using the AdamW optimizer and Qwen2.5-7B as the language model backbone but does not list specific software dependencies with version numbers, such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | During training, we maintained a 1:1 ratio between images and video frames, with an image batch size of 16K and a video batch size of 2K (each video containing 16 frames). For our standard model, we use 224 resolution images. For the 336 resolution variant, we first train the model at 224 resolution, then increase it to 336 and continue training for an additional 1B frames. We utilize the AdamW optimizer with a learning rate of 0.001 and weight decay of 0.2. The number of classes (k) is one million, the ratio of sampled negative class centers (r) is 0.1, and the number of positive labels (l) assigned to each image and video is 8.
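
The reported experiment setup can be collected into a small configuration object, which makes the derived quantities easier to check. The following is a minimal sketch, not the authors' code (which is not public): the class name `PretrainConfig`, its field names, and the reading of "K" as 1024 are all assumptions made for illustration, and applying r directly to the full set of k class centers is likewise an assumed interpretation.

```python
from dataclasses import dataclass

K = 1024  # assumption: "16K" and "2K" are read as multiples of 1024


@dataclass(frozen=True)
class PretrainConfig:
    """Hypothetical container for the hyperparameters reported above."""

    image_batch_size: int = 16 * K          # images per step
    video_batch_size: int = 2 * K           # videos per step
    frames_per_video: int = 16
    base_resolution: int = 224              # standard model
    high_resolution: int = 336              # second-stage variant
    extra_high_res_frames: int = 1_000_000_000  # "additional 1B frames"
    optimizer: str = "AdamW"
    learning_rate: float = 1e-3
    weight_decay: float = 0.2
    num_classes: int = 1_000_000            # k
    negative_sampling_ratio: float = 0.1    # r
    positive_labels_per_sample: int = 8     # l

    @property
    def video_frames_per_step(self) -> int:
        # Total video frames in one step: videos times frames per video.
        return self.video_batch_size * self.frames_per_video

    @property
    def sampled_negative_centers(self) -> int:
        # Assumption: r is applied to all k class centers.
        return round(self.negative_sampling_ratio * self.num_classes)


cfg = PretrainConfig()
print(cfg.video_frames_per_step)      # 32768
print(cfg.sampled_negative_centers)   # 100000
```

Under this literal reading, one step contains 32,768 video frames against 16,384 images, so the reported 1:1 ratio may be defined on a basis other than per-step frame counts; the summary does not specify. With r = 0.1 and k = one million, roughly 100,000 negative class centers would be sampled per step.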