ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
Authors: Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, Xiangyu Zhang
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results showcase ChatSpot's promising performance. Project page: https://github.com/Ahnsun/ChatSpot. |
| Researcher Affiliation | Collaboration | Liang Zhao¹, En Yu², Zheng Ge¹, Jinrong Yang², Haoran Wei¹, Hongyu Zhou¹, Jianjian Sun¹, Yuang Peng³, Runpei Dong⁴, Chunrui Han¹, Xiangyu Zhang¹; ¹MEGVII Technology, ²Huazhong University of Science and Technology, ³Tsinghua University, ⁴Xi'an Jiaotong University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page: https://github.com/Ahnsun/ChatSpot. |
| Open Datasets | Yes | Based on this data generation pipeline, we build a high-quality Multi-Grained Vision-Language Instruction-following Dataset, named MGVLID... we collect a wide range of publicly available multimodal datasets... including CC595K (filtered based on CC3M [Sharma et al., 2018]), OCR-VQA [Mishra et al., 2019], ST-VQA [Biten et al., 2022], DocVQA [Mathew et al., 2021], TextVQA [Singh et al., 2019] and Object365 [Shao et al., 2019]... Object365 [Shao et al., 2019], COCO-Text [Veit et al., 2016], HierText [Long et al., 2022] and ArT [Chng et al., 2019]... We also collect the PointQA datasets from LookTwice-QA [Mani et al., 2020]. |
| Dataset Splits | No | The paper describes the construction of a training dataset (MGVLID) from multiple sources and mentions holding out datasets for evaluation. However, it does not specify explicit training/validation/test splits (e.g., percentages or counts for validation) for its primary training data (MGVLID). While it evaluates on standard test/validation sets of external benchmarks, this is for testing, not a validation split for the main training process. |
| Hardware Specification | No | The paper specifies the models used (CLIP ViT-L/14, Vicuna-7B) but does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like CLIP and Vicuna, and an optimizer (AdamW), but does not provide specific version numbers for any of these software dependencies (e.g., Python version, PyTorch version). |
| Experiment Setup | Yes | Specifically, the model is fine-tuned over 3 epochs, with a batch size of 128. The AdamW [Loshchilov and Hutter, 2019] optimizer is employed, and the learning rate is set to 2e-3 in the first training stage and 2e-5 in the second training stage. For the LLM, the maximum length of tokens is set to 2,048. A configuration sketch of these hyperparameters follows the table. |
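
The reported setup can be summarized in a short, hedged configuration sketch. Only the numbers quoted above (3 epochs, batch size 128, AdamW, learning rates 2e-3 and 2e-5, 2,048-token limit) come from the paper; the module names, feature dimensions, and which parameters are trained in each stage are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the two-stage fine-tuning configuration reported above.
# Module names and stage-wise trainable parameters are assumptions for
# illustration; only the numeric hyperparameters are taken from the paper.
from torch import nn
from torch.optim import AdamW

EPOCHS = 3                      # fine-tuned over 3 epochs
BATCH_SIZE = 128                # batch size of 128
MAX_TOKENS = 2048               # maximum LLM token length
STAGE_LR = {1: 2e-3, 2: 2e-5}   # stage 1 vs. stage 2 learning rates

# Placeholder modules standing in for the CLIP ViT-L/14 encoder, the
# vision-language projector, and the Vicuna-7B LLM (dimensions are made up).
vision_encoder = nn.Identity()
projector = nn.Linear(1024, 4096)
llm = nn.Linear(4096, 4096)

def build_optimizer(stage: int) -> AdamW:
    """Return an AdamW optimizer with the stage-specific learning rate."""
    if stage == 1:
        # Assumption: the first (alignment) stage trains only the projector.
        params = list(projector.parameters())
    else:
        # Assumption: the second (instruction-tuning) stage also updates the LLM.
        params = list(projector.parameters()) + list(llm.parameters())
    return AdamW(params, lr=STAGE_LR[stage])

if __name__ == "__main__":
    for stage in (1, 2):
        opt = build_optimizer(stage)
        print(f"stage {stage}: lr={opt.param_groups[0]['lr']:.0e}, "
              f"epochs={EPOCHS}, batch_size={BATCH_SIZE}, max_tokens={MAX_TOKENS}")
```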