Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Authors: Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Demonstrated over 12 visual tasks and evaluated across 22 datasets, VITRON showcases its extensive capabilities in the four main vision task clusters. |
| Researcher Affiliation | Collaboration | Hao Fei (1, 2), Shengqiong Wu (1, 2), Hanwang Zhang (1, 3), Tat-Seng Chua (2), Shuicheng Yan (1). Affiliations: (1) Skywork AI, Singapore; (2) National University of Singapore; (3) Nanyang Technological University. |
| Pseudocode | No | No pseudocode or algorithm blocks are provided in the paper. |
| Open Source Code | No | We will release the code upon the acceptance of the paper. |
| Open Datasets | Yes | We utilize datasets comprising image-caption pairs (CC3M [89]), video-caption pairs (Webvid [4]), and region-caption pairs (RefCOCO [40]) drawn from existing established corpora and benchmarks. |
| Dataset Splits | Yes | Table column headers: Method; DAVIS 17 [80] Test-Dev; YouTube-VOS 2019 [119] Val. |
| Hardware Specification | Yes | All the training of VITRON is conducted on 10 A100 (80G) GPUs. |
| Software Dependencies | Yes | Our backbone LLM is Vicuna, 7B, version 1.5. |
| Experiment Setup | Yes | Our backbone LLM is Vicuna, 7B, version 1.5. The CLIP-ViT encoders for both images and videos are with a patch size of 14, and convert all images and video frames into 336px resolutions. The task discriminator in our synergy module is with a Transformer architecture, with 4 layers and each in 768-d representation. To train our model, we employ the AdamW optimizer along with a learning rate scheduler. The pre-training of VITRON unfolds in three phases, all conducted on 10 16 A100 (80G) GPUs. Initially, we train the model using a global batch size of 128 and a maximum learning rate of 3e-4, a process that takes approximately 40 hours. In the second tuning phase, we adjust the model with a maximum learning rate of 1e-5, utilizing a global batch size of 90. This stage of training lasts about 35 hours. The third phase of training employs a global batch size of 128 and maintains the maximum learning rate of 1e-5, completing in roughly 10 hours. |
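
For readers who want the three-phase schedule from the Experiment Setup row in a form they can check or reuse, the following is a minimal sketch that records the reported batch sizes, maximum learning rates, and approximate durations. The class, field, and phase names are hypothetical and are not taken from the paper or its code; only the numeric values come from the row above. The GPU count is left out because the quoted text is ambiguous on that point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One training phase as reported in the Experiment Setup row."""
    name: str               # hypothetical label, not from the paper
    global_batch_size: int
    max_lr: float
    approx_hours: float


# Numeric values copied from the table; names are illustrative assumptions.
PHASES = [
    Phase("phase_1", global_batch_size=128, max_lr=3e-4, approx_hours=40),
    Phase("phase_2", global_batch_size=90, max_lr=1e-5, approx_hours=35),
    Phase("phase_3", global_batch_size=128, max_lr=1e-5, approx_hours=10),
]


def total_hours(phases):
    """Sum the approximate wall-clock hours across all phases."""
    return sum(p.approx_hours for p in phases)


if __name__ == "__main__":
    for p in PHASES:
        print(f"{p.name}: batch={p.global_batch_size}, "
              f"max_lr={p.max_lr:g}, ~{p.approx_hours:g} h")
    print(f"total: ~{total_hours(PHASES):g} h")
```

Summing the reported durations gives roughly 85 hours of wall-clock training across the three phases.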