VIGC: Visual Instruction Generation and Correction
Authors: Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, Conghui He
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods, but also effectively enhances the benchmark performance. |
| Researcher Affiliation | Collaboration | Bin Wang*1, Fan Wu*1, Xiao Han*1, Jiahui Peng*1, Huaping Zhong*2, Pan Zhang1, Xiaoyi Dong1,3, Weijia Li4, Wei Li1, Jiaqi Wang1, Conghui He1; 1Shanghai Artificial Intelligence Laboratory, 2SenseTime Research, 3The Chinese University of Hong Kong, 4Sun Yat-sen University |
| Pseudocode | No | The paper describes the framework and its processes textually and with diagrams, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The models, datasets, and code are available at https://opendatalab.github.io/VIGC. |
| Open Datasets | Yes | We trained the VIGC network using two types of visual-language instruction fine-tuning data. The first type, represented by the LLaVA dataset (Liu et al. 2023b), is manually curated and combined with language-only GPT-4 (OpenAI 2023b) for multimodal models. It includes 150K training samples... The second type of data is multimodal instruction fine-tuning data derived from publicly available image-text datasets. Specifically, we used OKVQA (Marino et al. 2019) and A-OKVQA (Schwenk et al. 2022) datasets, as utilized in InstructBLIP (Dai et al. 2023), for VIGC training. |
| Dataset Splits | No | The training is conducted throughout 10 epochs, with the model’s performance being validated after each epoch. The model that demonstrates the best performance is subsequently selected for data generation. |
| Hardware Specification | Yes | The entire training process, executed on 8 A100 (80GB) GPUs, completes in approximately 10 hours. |
| Software Dependencies | No | The paper mentions software components like 'Vicuna7B and Vicuna13B', 'BLIP2-FlanT5XXL', and 'EVA-CLIP' but does not specify their version numbers or any other software dependencies with specific versions (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | In terms of batch sizes, we utilize 64 for both 7B and 13B models. The entire training process, executed on 8 A100 (80GB) GPUs, completes in approximately 10 hours. The training is conducted throughout 10 epochs, with the model’s performance being validated after each epoch. |
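
The Experiment Setup row above reports concrete hyperparameters: batch size 64 for both the 7B and 13B models, 10 training epochs with validation after each epoch, and 8 A100 (80GB) GPUs for roughly 10 hours of training. The sketch below is a minimal, hypothetical Python rendering of that configuration; names such as `VigcTrainConfig` and `select_best_checkpoint` are illustrative placeholders and are not taken from the released VIGC code at https://opendatalab.github.io/VIGC.

```python
# Hypothetical configuration sketch mirroring the reported VIGC fine-tuning
# setup (batch size 64, 10 epochs, per-epoch validation, 8x A100-80GB, ~10 h).
# All class and function names here are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict


@dataclass
class VigcTrainConfig:
    model_size: str = "7B"           # Vicuna-7B or Vicuna-13B language backbone
    batch_size: int = 64             # reported for both 7B and 13B models
    epochs: int = 10                 # training runs for 10 epochs
    num_gpus: int = 8                # A100 (80GB) cards, ~10 hours total
    validate_every_epoch: bool = True


def select_best_checkpoint(val_scores: Dict[int, float]) -> int:
    """Return the epoch with the highest validation score; per the paper,
    the best-performing checkpoint is then used for data generation."""
    return max(val_scores, key=val_scores.get)


if __name__ == "__main__":
    cfg = VigcTrainConfig()
    # Placeholder validation scores for illustration only (not real results).
    scores = {epoch: 0.50 + 0.01 * epoch for epoch in range(1, cfg.epochs + 1)}
    best_epoch = select_best_checkpoint(scores)
    print(cfg)
    print(f"Checkpoint selected for data generation: epoch {best_epoch}")
```

This mirrors the checkpoint-selection step described in the Dataset Splits and Experiment Setup rows: train for a fixed number of epochs, validate each epoch, and carry the best checkpoint forward into the instruction-generation stage.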