Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

Authors: Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Demonstrated over 12 visual tasks and evaluated across 22 datasets, VITRON showcases its extensive capabilities in the four main vision task clusters.
Researcher Affiliation | Collaboration | Hao Fei (1,2), Shengqiong Wu (1,2), Hanwang Zhang (1,3), Tat-Seng Chua (2), Shuicheng Yan (1). Affiliations: 1 Skywork AI, Singapore; 2 National University of Singapore; 3 Nanyang Technological University. Emails: haofei37@nus.edu.sg, swu@u.nus.edu, hanwangzhang@ntu.edu.sg, dcscts@nus.edu.sg, shuicheng.yan@kunlun-inc.com
Pseudocode | No | No pseudocode or algorithm blocks are provided in the paper.
Open Source Code | No | We will release the code upon the acceptance of the paper.
Open Datasets | Yes | We utilize datasets comprising image-caption pairs (CC3M [89]), video-caption pairs (Webvid [4]), and region-caption pairs (RefCOCO [40]) drawn from existing established corpora and benchmarks.
Dataset Splits | Yes | Evaluation tables report results on held-out splits, e.g., the DAVIS 17 [80] Test-Dev and Youtube-VOS 2019 [119] Val splits.
Hardware Specification | Yes | All the training of VITRON is conducted on 10 A100 (80G) GPUs.
Software Dependencies | Yes | Our backbone LLM is Vicuna, 7B, version 1.5.
Experiment Setup | Yes | Our backbone LLM is Vicuna, 7B, version 1.5. The CLIP-ViT encoders for both images and videos use a patch size of 14 and convert all images and video frames to 336px resolution. The task discriminator in our synergy module uses a Transformer architecture with 4 layers, each with 768-d representations. To train our model, we employ the AdamW optimizer along with a learning rate scheduler. The pre-training of VITRON unfolds in three phases, all conducted on 10 A100 (80G) GPUs. Initially, we train the model using a global batch size of 128 and a maximum learning rate of 3e-4, a process that takes approximately 40 hours. In the second tuning phase, we adjust the model with a maximum learning rate of 1e-5, utilizing a global batch size of 90. This stage of training lasts about 35 hours. The third phase of training employs a global batch size of 128 and maintains the maximum learning rate of 1e-5, completing in roughly 10 hours.
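
Since the code had not been released at the time of review, the following Python sketch shows how the backbone components named in the Software Dependencies and Experiment Setup rows (Vicuna 7B v1.5, a CLIP-ViT encoder with patch size 14 at 336px input resolution, and a 4-layer 768-d Transformer task discriminator) could plausibly be instantiated with Hugging Face transformers and PyTorch. The checkpoint identifiers lmsys/vicuna-7b-v1.5 and openai/clip-vit-large-patch14-336 and the attention head count are assumptions, not details confirmed by the paper.

    import torch
    import torch.nn as nn
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              CLIPImageProcessor, CLIPVisionModel)

    # Backbone LLM: Vicuna 7B, version 1.5 (checkpoint name assumed).
    tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5", use_fast=False)
    llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5",
                                               torch_dtype=torch.float16)

    # Vision encoder: CLIP-ViT with patch size 14 and 336px inputs, applied to
    # both images and sampled video frames (checkpoint name assumed).
    image_processor = CLIPImageProcessor.from_pretrained(
        "openai/clip-vit-large-patch14-336")
    vision_encoder = CLIPVisionModel.from_pretrained(
        "openai/clip-vit-large-patch14-336")

    # Task discriminator of the synergy module: 4 Transformer layers with 768-d
    # representations (nhead=12 is an assumption; the paper gives only depth and width).
    task_discriminator = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
        num_layers=4,
    )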
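
Likewise, the three training phases described in the Experiment Setup row can be summarized as a hedged configuration sketch: AdamW with a learning-rate scheduler, a per-phase global batch size, and a per-phase peak learning rate. The scheduler shape (cosine decay) and the step count are assumptions; the paper reports only the optimizer, batch sizes, peak learning rates, GPU count, and wall-clock times.

    import torch
    from torch.optim import AdamW
    from torch.optim.lr_scheduler import CosineAnnealingLR

    # Reported three-phase schedule: (name, global batch size, peak LR, approx. time).
    PHASES = [
        ("phase1_pretraining", 128, 3e-4, "~40 h"),
        ("phase2_tuning",       90, 1e-5, "~35 h"),
        ("phase3_tuning",      128, 1e-5, "~10 h"),
    ]

    def make_optimizer(model, peak_lr, total_steps):
        """AdamW plus a scheduler; cosine decay is an assumed choice."""
        optimizer = AdamW(model.parameters(), lr=peak_lr)
        scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
        return optimizer, scheduler

    if __name__ == "__main__":
        model = torch.nn.Linear(768, 768)  # stand-in for the actual VITRON model
        for name, batch_size, peak_lr, hours in PHASES:
            opt, sched = make_optimizer(model, peak_lr, total_steps=1_000)
            print(f"{name}: batch={batch_size}, peak_lr={peak_lr}, time={hours}")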