CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers

Authors: Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, Jiaqi Wang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments have been conducted on various vision-language tasks, such as image-text retrieval, visual reasoning, image captioning, and visual question answering. We report the performance on modality-independent model CLIP (Radford et al., 2021) as well as modality-dependent models BLIP/BLIP2 (Li et al., 2022; 2023c), and mainstream tasks such as Image-Text Retrieval, Visual Reasoning, Image Captioning, and Visual Question Answering.
Researcher Affiliation | Collaboration | 1 Tsinghua University, 2 Shanghai AI Laboratory, 3 The University of Hong Kong, 4 Stanford University.
Pseudocode | Yes | Algorithm 1: Complete-Graph Soft Matching; Algorithm 2: Cross-Guided Matching and Ensemble (improvements upon Algorithm 1). An illustrative token-merging sketch is given after this table.
Open Source Code | Yes | The code is available at https://github.com/sdc17/CrossGET.
Open Datasets | Yes | We conduct experiments on the CLIP model and the Flickr30K dataset (Young et al., 2014) with the Karpathy split (Karpathy & Fei-Fei, 2015) for the Image-Text Retrieval and Text-Image Retrieval tasks. Table 2: Accelerate BLIP on the NLVR2 dataset of the Visual Reasoning task. The CoOp benchmark (Zhou et al., 2022b) consists of 11 datasets: ImageNet (1000 classes) (Deng et al., 2009), Caltech101 (100 classes) (Fei-Fei et al., 2004), Oxford Pets (37 classes) (Parkhi et al., 2012), Stanford Cars (196 classes) (Krause et al., 2013), Flowers102 (102 classes) (Nilsback & Zisserman, 2008), Food101 (101 classes) (Bossard et al., 2014), FGVCAircraft (100 classes) (Maji et al., 2013), SUN397 (397 classes) (Xiao et al., 2010), DTD (47 classes) (Cimpoi et al., 2014), EuroSAT (10 classes) (Helber et al., 2019), and UCF101 (101 classes) (Soomro et al., 2012).
Dataset Splits | Yes | We conduct experiments on the CLIP model and the Flickr30K dataset (Young et al., 2014) with the Karpathy split (Karpathy & Fei-Fei, 2015) for the Image-Text Retrieval and Text-Image Retrieval tasks. Table 2: Accelerate BLIP on the NLVR2 dataset of the Visual Reasoning task; BLIP is the original model for all approaches (table columns: Approach, Dev Acc, Test Acc).
Hardware Specification | No | The paper does not specify the particular hardware (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions components such as AdamW, a cosine LR scheduler, RandomAugment, mixed-precision training, and PyTorch, but does not provide version numbers for its software dependencies.
Experiment Setup | Yes | The hyperparameters about model training are listed in Table 9, Table 10, and Table 11. The hyperparameters about model structures are listed in Table 12. Table 9 (Training hyperparameters for accelerating BLIP-based models, excerpt): batch size 512; weight decay 0.05; epochs 15; initial learning rate 3e-6; learning rate schedule Cosine LRScheduler; training precision Mixed Precision; matching loss coefficient 10^1. A hedged sketch of this optimization setup follows the table.
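
To make the Pseudocode entry concrete, below is a minimal, illustrative sketch of greedy similarity-based token merging in the spirit of complete-graph soft matching. It is not a reproduction of the paper's Algorithm 1 or 2 (which keep the matching parallelizable and add cross-guided importance from the other modality); the greedy pair selection, plain averaging, and the shapes used here are simplifying assumptions for illustration only.

```python
# Illustrative sketch only: greedy merging of the most similar token pairs over a
# complete similarity graph. NOT the paper's Algorithm 1/2; similarities are not
# recomputed after a merge, and merged tokens are plain (unweighted) averages.
import torch


def greedy_token_merge(tokens: torch.Tensor, num_merges: int) -> torch.Tensor:
    """Merge `num_merges` highest-similarity token pairs by averaging.

    tokens: (N, D) token embeddings for one sample.
    Returns a tensor with N - num_merges tokens.
    """
    n, _ = tokens.shape
    normed = torch.nn.functional.normalize(tokens, dim=-1)
    sim = normed @ normed.T                      # pairwise cosine similarity (complete graph)
    sim.fill_diagonal_(float("-inf"))            # a token cannot merge with itself

    merged = tokens.clone()
    alive = torch.ones(n, dtype=torch.bool)
    for _ in range(num_merges):
        # Mask out tokens that were already consumed by a previous merge.
        masked = sim.masked_fill(~alive[:, None] | ~alive[None, :], float("-inf"))
        i, j = divmod(torch.argmax(masked).item(), n)
        merged[i] = 0.5 * (merged[i] + merged[j])  # ensemble the pair into token i
        alive[j] = False                           # drop token j

    return merged[alive]


# Usage with hypothetical ViT-like shapes: reduce 197 tokens by 16.
x = torch.randn(197, 768)
print(greedy_token_merge(x, num_merges=16).shape)  # torch.Size([181, 768])
```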
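
To make the Experiment Setup entry concrete, here is a minimal, runnable sketch of the reported optimization settings from Table 9 (AdamW, weight decay 0.05, initial learning rate 3e-6, 15 epochs, cosine schedule, mixed precision, matching-loss coefficient 10^1). The stand-in linear model, synthetic data, and the zero-valued matching-loss placeholder are assumptions made only so the snippet runs; they are not the paper's BLIP-based model or loss.

```python
# Minimal sketch of the reported training hyperparameters (Table 9); the model,
# data, and matching-loss term below are placeholders, not the paper's.
import torch

EPOCHS = 15
BATCH_SIZE = 512          # reported batch size
MATCHING_LOSS_COEFF = 10  # reported as 10^1

model = torch.nn.Linear(768, 2)                  # stand-in for a BLIP-based model
data = torch.randn(BATCH_SIZE, 768)              # synthetic batch
labels = torch.randint(0, 2, (BATCH_SIZE,))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
use_amp = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # mixed-precision training

for epoch in range(EPOCHS):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        logits = model(data)
        task_loss = torch.nn.functional.cross_entropy(logits, labels)
        matching_loss = torch.tensor(0.0)        # placeholder for the cross-modal matching loss
        loss = task_loss + MATCHING_LOSS_COEFF * matching_loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()                             # cosine schedule stepped once per epoch
```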