Instruction-Guided Visual Masking
Authors: Jinliang Zheng, Jianxiong Li, Sijie Cheng, Yinan Zheng, Jiaming Li, Jihao Liu, Yu Liu, Jingjing Liu, Xianyuan Zhan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM, which, as a plug-and-play tool, significantly boosts the performance of diverse multimodal models, yielding new state-of-the-art results across challenging multimodal benchmarks. |
| Researcher Affiliation | Collaboration | 1 AIR, Tsinghua University; 2 SenseTime Research; 3 MMLab, CUHK; 4 Shanghai AI Lab |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code, model and data are available at https://github.com/2toinf/IVM. |
| Open Datasets | Yes | We collect 250K labeled VG data from multiple sources including VG caption [36], Flickr30K [63], VSR [3], Open Image [30], and RefCOCO [64, 37]... We sample a 700K subset from LLaVA-Instruction-tuning [41] for VQA-type data, and a 50K subset from Open X [54] for robotics data. |
| Dataset Splits | Yes | We reported the accuracy (IOU-50%) on the validation split in Table 7. |
| Hardware Specification | Yes | We adopt 8 NVIDIA 80GB A100 GPUs and take 4 days to train our IVM model... The training can be completed on 2 NVIDIA RTX 4090 GPUs in 17 hours. |
| Software Dependencies | No | The paper mentions software components like 'deepspeed [4] engine' and 'optimizer AdamW [43]', as well as models and architectures like 'ResNet50 [22]' and 'T5 [48]'. However, it does not specify version numbers for these software components or libraries, which is required for reproducibility. |
| Experiment Setup | Yes | The training scripts are based on the deepspeed [4] engine and the training hyperparameters can be found in Table 4... Table 4 lists: training iterations 200K, optimizer AdamW [43], learning rate 1e-5, batch size 32, weight decay 0, optimizer momentum (β1, β2) = (0.9, 0.95), data augmentation Random Crop Resize. Table 6 lists: chunking size 4, optimizer AdamW [43], learning rate 1e-4, LR schedule cosine annealing, warm-up steps 2000, batch size 64, gradient steps 200K. |
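For readers checking the reported configuration, the sketch below shows one plausible way the Table 4 hyperparameters quoted in the Experiment Setup row could be instantiated with plain PyTorch and torchvision. It is an illustration built only from the values reported above, not the authors' DeepSpeed-based training script: the placeholder `model`, and the assumption that "Random Crop Resize" corresponds to `RandomResizedCrop`, are ours; distributed training and the IVM architecture itself are omitted.

```python
# A minimal sketch, assuming the Table 4 values map onto standard PyTorch calls.
# This is NOT the authors' released code; `model` is a placeholder module.
import torch
from torch import nn
from torchvision import transforms

model = nn.Linear(10, 1)  # placeholder standing in for the actual IVM network

# "Random Crop Resize" augmentation, assumed to correspond to RandomResizedCrop.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

# Table 4 optimizer settings: AdamW, learning rate 1e-5, betas (0.9, 0.95),
# weight decay 0.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-5, betas=(0.9, 0.95), weight_decay=0.0
)

TOTAL_ITERATIONS = 200_000  # training iterations from Table 4
BATCH_SIZE = 32             # batch size from Table 4

# Table 6 (the downstream robot-policy setup) instead reports lr 1e-4,
# batch size 64, 2000 warm-up steps and a cosine-annealing schedule; a plain
# cosine schedule over its 200K gradient steps would look like:
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200_000)
```

As the Hardware Specification row notes, the paper's actual runs use the DeepSpeed engine on 8 A100 GPUs, so this single-process sketch only documents the hyperparameter values, not the training infrastructure.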