Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Authors: Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Demonstrated over 12 visual tasks and evaluated across 22 datasets, VITRON showcases its extensive capabilities in the four main vision task clusters. |
| Researcher Affiliation | Collaboration | Hao Fei (1,2), Shengqiong Wu (1,2), Hanwang Zhang (1,3), Tat-Seng Chua (2), Shuicheng Yan (1); 1 Skywork AI, Singapore; 2 National University of Singapore; 3 Nanyang Technological University. Emails: haofei37@nus.edu.sg, swu@u.nus.edu, hanwangzhang@ntu.edu.sg, dcscts@nus.edu.sg, shuicheng.yan@kunlun-inc.com |
| Pseudocode | No | No pseudocode or algorithm blocks are provided in the paper. |
| Open Source Code | No | We will release the code upon the acceptance of the paper. |
| Open Datasets | Yes | We utilize datasets comprising image-caption pairs (CC3M [89]), video-caption pairs (WebVid [4]), and region-caption pairs (RefCOCO [40]) drawn from existing established corpora and benchmarks. |
| Dataset Splits | Yes | The evaluation tables state splits explicitly, e.g., DAVIS 17 [80] Test-Dev and YouTube-VOS 2019 [119] Val. |
| Hardware Specification | Yes | All the training of VITRON is conducted on 10 A100 (80G) GPUs. |
| Software Dependencies | Yes | Our backbone LLM is Vicuna, 7B, version 1.5. |
| Experiment Setup | Yes | Our backbone LLM is Vicuna (7B, version 1.5). The CLIP-ViT encoders for both images and videos use a patch size of 14 and convert all images and video frames to 336px resolution. The task discriminator in our synergy module uses a Transformer architecture with 4 layers, each with 768-d representations. To train our model, we employ the AdamW optimizer along with a learning rate scheduler. The pre-training of VITRON unfolds in three phases, all conducted on 10 A100 (80G) GPUs. Initially, we train the model using a global batch size of 128 and a maximum learning rate of 3e-4, a process that takes approximately 40 hours. In the second tuning phase, we adjust the model with a maximum learning rate of 1e-5, utilizing a global batch size of 90; this stage lasts about 35 hours. The third phase employs a global batch size of 128 and maintains the maximum learning rate of 1e-5, completing in roughly 10 hours. (Illustrative configuration sketches based on these numbers follow the table.) |
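
The experiment-setup quote above describes the synergy module's task discriminator only at a high level (a Transformer with 4 layers and 768-d representations). The following is a minimal sketch of such a module, not the authors' implementation; the class name, number of attention heads, pooling strategy, and number of task classes are assumptions introduced for illustration.

```python
# Minimal sketch of a 4-layer, 768-d Transformer task discriminator.
# Only the depth (4 layers) and width (768-d) come from the paper; the
# class name, head count, mean pooling, and 12 task classes are assumptions.
import torch
import torch.nn as nn


class TaskDiscriminator(nn.Module):
    def __init__(self, d_model: int = 768, num_layers: int = 4,
                 num_heads: int = 8, num_tasks: int = 12):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_tasks)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, 768) features from the backbone
        encoded = self.encoder(hidden_states)
        # Pool over the sequence and predict which vision task is requested
        return self.classifier(encoded.mean(dim=1))


if __name__ == "__main__":
    disc = TaskDiscriminator()
    logits = disc(torch.randn(2, 16, 768))
    print(logits.shape)  # torch.Size([2, 12])
```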
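
The three-phase training schedule can likewise be captured as a small configuration sketch. The global batch sizes and maximum learning rates below are the numbers quoted in the table; the phase names, the cosine schedule, and the step count are assumptions, since the paper only states that AdamW is used "along with a learning rate scheduler."

```python
# Sketch of the reported three-phase training schedule as an AdamW + scheduler
# configuration. Batch sizes and max learning rates follow the quoted setup;
# phase names, the cosine schedule, and total_steps are assumptions.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

PHASES = [
    {"name": "phase1_pretraining", "global_batch_size": 128, "max_lr": 3e-4},
    {"name": "phase2_tuning",      "global_batch_size": 90,  "max_lr": 1e-5},
    {"name": "phase3_tuning",      "global_batch_size": 128, "max_lr": 1e-5},
]


def build_optimizer(model: torch.nn.Module, phase: dict, total_steps: int):
    """Return an AdamW optimizer and a cosine LR scheduler for one phase."""
    optimizer = AdamW(model.parameters(), lr=phase["max_lr"])
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler


if __name__ == "__main__":
    model = torch.nn.Linear(768, 768)  # stand-in for the full VITRON model
    for phase in PHASES:
        opt, sched = build_optimizer(model, phase, total_steps=10_000)
        print(phase["name"], opt.defaults["lr"])
```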