Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Authors: Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Demonstrated over 12 visual tasks and evaluated across 22 datasets, VITRON showcases its extensive capabilities in the four main vision task clusters. |
| Researcher Affiliation | Collaboration | Hao Fei (1,2), Shengqiong Wu (1,2), Hanwang Zhang (1,3), Tat-Seng Chua (2), Shuicheng Yan (1); 1 Skywork AI, Singapore; 2 National University of Singapore; 3 Nanyang Technological University. Emails: haofei37@nus.edu.sg, swu@u.nus.edu, hanwangzhang@ntu.edu.sg, dcscts@nus.edu.sg, shuicheng.yan@kunlun-inc.com |
| Pseudocode | No | No pseudocode or algorithm blocks are provided in the paper. |
| Open Source Code | No | We will release the code upon the acceptance of the paper. |
| Open Datasets | Yes | We utilize datasets comprising image-caption pairs (CC3M [89]), video-caption pairs (WebVid [4]), and region-caption pairs (RefCOCO [40]) drawn from existing established corpora and benchmarks. |
| Dataset Splits | Yes | The evaluation tables state splits explicitly, e.g., DAVIS 17 [80] Test-Dev and YouTube-VOS 2019 [119] Val. |
| Hardware Specification | Yes | All the training of VITRON is conducted on 10 A100 (80G) GPUs. |
| Software Dependencies | Yes | Our backbone LLM is Vicuna, 7B, version 1.5. |
| Experiment Setup | Yes | Our backbone LLM is Vicuna (7B, version 1.5). The CLIP-ViT encoders for both images and videos use a patch size of 14 and convert all images and video frames to 336px resolution. The task discriminator in our synergy module uses a Transformer architecture with 4 layers, each with 768-d representations. To train our model, we employ the AdamW optimizer along with a learning rate scheduler. The pre-training of VITRON unfolds in three phases, all conducted on 10 A100 (80G) GPUs. Initially, we train the model using a global batch size of 128 and a maximum learning rate of 3e-4, a process that takes approximately 40 hours. In the second tuning phase, we adjust the model with a maximum learning rate of 1e-5, utilizing a global batch size of 90; this stage lasts about 35 hours. The third phase employs a global batch size of 128 and maintains the maximum learning rate of 1e-5, completing in roughly 10 hours. (Illustrative configuration sketches based on these numbers follow the table.) |
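
The experiment-setup quote above describes the synergy module's task discriminator only at a high level (a Transformer with 4 layers and 768-d representations). The following is a minimal sketch of such a module, not the authors' implementation; the class name, number of attention heads, pooling strategy, and number of task classes are assumptions introduced for illustration.

```python
# Minimal sketch of a 4-layer, 768-d Transformer task discriminator.
# Only the depth (4 layers) and width (768-d) come from the paper; the
# class name, head count, mean pooling, and 12 task classes are assumptions.
import torch
import torch.nn as nn


class TaskDiscriminator(nn.Module):
    def __init__(self, d_model: int = 768, num_layers: int = 4,
                 num_heads: int = 8, num_tasks: int = 12):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_tasks)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, 768) features from the backbone
        encoded = self.encoder(hidden_states)
        # Pool over the sequence and predict which vision task is requested
        return self.classifier(encoded.mean(dim=1))


if __name__ == "__main__":
    disc = TaskDiscriminator()
    logits = disc(torch.randn(2, 16, 768))
    print(logits.shape)  # torch.Size([2, 12])
```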
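
The three-phase training schedule can likewise be captured as a small configuration sketch. The global batch sizes and maximum learning rates below are the numbers quoted in the table; the phase names, the cosine schedule, and the step count are assumptions, since the paper only states that AdamW is used "along with a learning rate scheduler."

```python
# Sketch of the reported three-phase training schedule as an AdamW + scheduler
# configuration. Batch sizes and max learning rates follow the quoted setup;
# phase names, the cosine schedule, and total_steps are assumptions.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

PHASES = [
    {"name": "phase1_pretraining", "global_batch_size": 128, "max_lr": 3e-4},
    {"name": "phase2_tuning",      "global_batch_size": 90,  "max_lr": 1e-5},
    {"name": "phase3_tuning",      "global_batch_size": 128, "max_lr": 1e-5},
]


def build_optimizer(model: torch.nn.Module, phase: dict, total_steps: int):
    """Return an AdamW optimizer and a cosine LR scheduler for one phase."""
    optimizer = AdamW(model.parameters(), lr=phase["max_lr"])
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler


if __name__ == "__main__":
    model = torch.nn.Linear(768, 768)  # stand-in for the full VITRON model
    for phase in PHASES:
        opt, sched = build_optimizer(model, phase, total_steps=10_000)
        print(phase["name"], opt.defaults["lr"])
```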