CogVLM: Visual Expert for Pretrained Language Models

Authors: Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, Jiazheng Xu, Keqin Chen, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | CogVLM-17B achieves state-of-the-art performance on 15 classic cross-modal benchmarks, including 1) image captioning datasets: NoCaps, Flickr30K; 2) VQA datasets: OKVQA, ScienceQA; 3) LVLM benchmarks: MM-Vet, MMBench, SEED-Bench, LLaVA-Bench, POPE, MMMU, MathVista; 4) visual grounding datasets: RefCOCO, RefCOCO+, RefCOCOg, Visual7W. Code and checkpoints are available on GitHub.
Researcher Affiliation | Collaboration | Weihan Wang1,2, Qingsong Lv1, Wenmeng Yu1, Wenyi Hong1, Ji Qi1,2, Yan Wang1, Junhui Ji1, Zhuoyi Yang1,2, Lei Zhao1, Xixuan Song1,2, Jiazheng Xu1,2, Keqin Chen1, Bin Xu2, Juanzi Li2, Yuxiao Dong2, Ming Ding1, Jie Tang2. 1Zhipu AI, 2Tsinghua University.
Pseudocode | No | The paper describes the model architecture and formal equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and checkpoints are available on GitHub.
Open Datasets | Yes | The image-text pairs for pretraining are all publicly available, including LAION-2B and COYO-700M. After removing broken URLs, NSFW images, images with noisy captions, images with political bias, and images with an aspect ratio > 6 or < 1/6, about 1.5B images are left for pretraining. The authors also crafted a visual grounding dataset of 40M images.
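The paper only states the filtering rules, not the filtering code, so the following is a minimal sketch of the one rule that is fully specified (the aspect-ratio cutoff), assuming images are readable with Pillow; the function name and threshold parameter are illustrative, and the other filters (broken URLs, NSFW, noisy captions, political bias) would require separate models or heuristics not reproduced here.

```python
from PIL import Image


def passes_aspect_ratio_filter(image_path: str, max_ratio: float = 6.0) -> bool:
    """Return True if the image's aspect ratio lies within [1/max_ratio, max_ratio].

    Mirrors the rule described in the paper: images with an aspect ratio
    greater than 6 or less than 1/6 are discarded before pretraining.
    """
    with Image.open(image_path) as img:
        width, height = img.size
    ratio = width / height
    return (1.0 / max_ratio) <= ratio <= max_ratio
```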
Dataset Splits | Yes | Table 1: Performance on image captioning benchmarks. All tasks use CIDEr as the evaluation metric. OOD refers to the out-of-domain test set; Karp. refers to the Karpathy test split. Column headers: Method, Train Data, NoCaps val, NoCaps test, Flickr, COCO, TextCaps.
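Since all captioning results in Table 1 are reported as CIDEr scores, a short sketch of how such scores are commonly computed may help reproduction. This assumes the pycocoevalcap package rather than the authors' own evaluation scripts, and the example captions and ids below are purely illustrative.

```python
from pycocoevalcap.cider.cider import Cider

# Both dicts map an image id to a list of caption strings:
# references hold ground-truth captions, candidates hold model outputs.
references = {
    "img1": ["a dog runs across a grassy field",
             "a brown dog is running on the grass"],
}
candidates = {
    "img1": ["a dog running through a field"],
}

scorer = Cider()
corpus_score, per_image_scores = scorer.compute_score(references, candidates)
# Papers typically report CIDEr scaled by 100 (e.g., 120.6 rather than 1.206).
print(f"CIDEr: {100 * corpus_score:.1f}")
```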
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as exact GPU or CPU models. It mentions 'computational efficiency' and 'PFLOPS*days' metrics in Table 9 but lacks hardware specifications.
Software Dependencies | No | The paper mentions models like 'Vicuna-1.5-7B' and 'EVA2-CLIP-E' and tools like 'spaCy [Honnibal and Johnson, 2015]' but does not specify version numbers for general software dependencies or frameworks like Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We report the details of parameter settings during pre-training and multitask training in Table 5 and Table 6. Table 5: Hyperparameters for the pre-training model. Total steps 120,000; Warmup steps 12,000; Batch size 8,192; Learning rate 1e-4; Learning rate decay Cosine; Weight decay 0.05; Dropout ratio 0.1; Adam ϵ 1e-8; Adam β (0.9, 0.95); Textual encoder Vicuna-1.5-7B; Visual encoder EVA2-CLIP-E; Patch size 14; Input resolution 224², 224², 490².
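To make the Table 5 settings easier to reuse, here is a hedged restatement of the reported pre-training hyperparameters as a plain Python config, plus a mapping of the optimizer-related entries onto PyTorch's AdamW. The dict keys, the function name, and the use of AdamW are assumptions for illustration; they are not taken from the authors' released training code.

```python
import torch

# Values transcribed from Table 5 of the paper; key names are illustrative.
PRETRAIN_CONFIG = {
    "total_steps": 120_000,
    "warmup_steps": 12_000,
    "batch_size": 8_192,
    "learning_rate": 1e-4,
    "lr_decay": "cosine",
    "weight_decay": 0.05,
    "dropout": 0.1,
    "adam_eps": 1e-8,
    "adam_betas": (0.9, 0.95),
    "textual_encoder": "Vicuna-1.5-7B",
    "visual_encoder": "EVA2-CLIP-E",
    "patch_size": 14,
    "input_resolutions": (224, 224, 490),  # one value per pre-training stage
}


def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """Construct an AdamW optimizer using the optimizer-related entries above."""
    return torch.optim.AdamW(
        model.parameters(),
        lr=PRETRAIN_CONFIG["learning_rate"],
        betas=PRETRAIN_CONFIG["adam_betas"],
        eps=PRETRAIN_CONFIG["adam_eps"],
        weight_decay=PRETRAIN_CONFIG["weight_decay"],
    )
```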