VIGC: Visual Instruction Generation and Correction

Authors: Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, Conghui He

AAAI 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods, but also effectively enhances the benchmark performance. |
| Researcher Affiliation | Collaboration | Bin Wang*1, Fan Wu*1, Xiao Han*1, Jiahui Peng*1, Huaping Zhong*2, Pan Zhang1, Xiaoyi Dong1,3, Weijia Li4, Wei Li1, Jiaqi Wang1, Conghui He1; 1Shanghai Artificial Intelligence Laboratory, 2SenseTime Research, 3The Chinese University of Hong Kong, 4Sun Yat-sen University |
| Pseudocode | No | The paper describes the framework and its processes textually and with diagrams, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The models, datasets, and code are available at https://opendatalab.github.io/VIGC. |
| Open Datasets | Yes | We trained the VIGC network using two types of visual-language instruction fine-tuning data. The first type, represented by the LLaVA dataset (Liu et al. 2023b), is manually curated and combined with language-only GPT-4 (OpenAI 2023b) for multimodal models. It includes 150K training samples... The second type of data is multimodal instruction fine-tuning data derived from publicly available image-text datasets. Specifically, we used the OKVQA (Marino et al. 2019) and A-OKVQA (Schwenk et al. 2022) datasets, as utilized in InstructBLIP (Dai et al. 2023), for VIGC training. |
| Dataset Splits | No | The training is conducted throughout 10 epochs, with the model's performance being validated after each epoch. The model that demonstrates the best performance is subsequently selected for data generation. |
| Hardware Specification | Yes | The entire training process, executed on 8 A100 (80GB) GPUs, completes in approximately 10 hours. |
| Software Dependencies | No | The paper mentions software components such as 'Vicuna7B and Vicuna13B', 'BLIP2-FlanT5XXL', and 'EVA-CLIP' but does not specify their version numbers or any other software dependencies with specific versions (e.g., Python or PyTorch versions). |
| Experiment Setup | Yes | In terms of batch sizes, we utilize 64 for both the 7B and 13B models. The entire training process, executed on 8 A100 (80GB) GPUs, completes in approximately 10 hours. The training is conducted throughout 10 epochs, with the model's performance being validated after each epoch. |
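The Open Datasets entry names two sources of instruction-tuning data: the 150K LLaVA instruction samples and the OKVQA / A-OKVQA data used in InstructBLIP. Below is a minimal sketch of how such a mixture could be assembled with PyTorch; the file names, JSON layout, and wrapper class are assumptions for illustration, not the authors' released data pipeline.

```python
import json
from torch.utils.data import Dataset, ConcatDataset

class InstructionSamples(Dataset):
    """Thin wrapper over a JSON list of instruction records.
    The file names and record layout used below are assumptions,
    not the released VIGC data pipeline."""
    def __init__(self, path: str):
        with open(path) as f:
            self.records = json.load(f)

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        return self.records[idx]

def build_training_mixture(data_root: str) -> ConcatDataset:
    # Illustrative placeholder paths for the two data sources named in the paper:
    # LLaVA-150K instructions, and OKVQA / A-OKVQA question-answer pairs.
    parts = [
        InstructionSamples(f"{data_root}/llava_instruct_150k.json"),
        InstructionSamples(f"{data_root}/okvqa_train.json"),
        InstructionSamples(f"{data_root}/aokvqa_train.json"),
    ]
    return ConcatDataset(parts)
```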
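The Dataset Splits, Hardware Specification, and Experiment Setup entries together pin down the reported recipe: batch size 64 for both the 7B and 13B models, 10 epochs on 8 A100 (80GB) GPUs in roughly 10 hours, with validation after every epoch and the best checkpoint kept for data generation. The sketch below mirrors that schedule under stated assumptions; the model, evaluation helper, and learning rate are placeholders, and multi-GPU sharding is omitted.

```python
import copy
import torch
from torch.utils.data import DataLoader

EPOCHS = 10      # "training is conducted throughout 10 epochs"
BATCH_SIZE = 64  # reported for both the 7B and 13B models
                 # (the paper spreads this over 8x A100 80GB GPUs; sharding omitted here)

def train_and_select_best(model, train_set, val_set, evaluate_fn, lr=1e-5):
    """Train for a fixed number of epochs, validate after each one, and keep the
    best checkpoint, which the paper then uses for data generation.
    `model` is any torch.nn.Module whose forward pass returns a scalar loss and
    `evaluate_fn(model, val_set)` returns a higher-is-better score; both are
    placeholders, not the released VIGC code. The learning rate is an assumption."""
    loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    best_score, best_state = float("-inf"), None

    for epoch in range(EPOCHS):
        model.train()
        for batch in loader:
            loss = model(batch)            # instruction-tuning loss (placeholder)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        score = evaluate_fn(model, val_set)    # per-epoch validation
        if score > best_score:
            best_score = score
            best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)          # best checkpoint kept for generation
    return model, best_score
```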