Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images

Authors: Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, Wei Liu

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments demonstrate that our verbose images can increase the length of generated sequences by 7.87× and 8.56× compared to original images on MS-COCO and ImageNet datasets, which presents potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images." |
| Researcher Affiliation | Collaboration | Kuofeng Gao (Tsinghua University), Yang Bai (Tencent Technology (Beijing) Co., Ltd.), Jindong Gu (University of Oxford), Shu-Tao Xia (Tsinghua University; Peng Cheng Laboratory), Philip Torr (University of Oxford), Zhifeng Li (Tencent Data Platform), Wei Liu (Tencent Data Platform) |
| Pseudocode | Yes | Algorithm 1: "Verbose images: Inducing high energy-latency cost of VLMs" |
| Open Source Code | Yes | "Our code is available at https://github.com/KuofengGao/Verbose_Images." |
| Open Datasets | Yes | "We randomly choose the 1,000 images from MS-COCO (Lin et al., 2014) and ImageNet (Deng et al., 2009) dataset, respectively, as our evaluation dataset." |
| Dataset Splits | No | The paper states that they "randomly choose the 1,000 images from MS-COCO and ImageNet dataset, respectively, as our evaluation dataset", which serves as their test set. However, it does not provide explicit training and validation splits, since the work attacks pre-trained models rather than training new ones. |
| Hardware Specification | Yes | "Note that every experiment is run on one NVIDIA Tesla A100 GPU with 40GB memory." |
| Software Dependencies | No | The paper states that they use "the PyTorch framework (Paszke et al., 2019) and the LAVIS library (Li et al., 2023a)", but it does not specify version numbers for these components. |
| Experiment Setup | Yes | "The perturbation magnitude is set as ε = 8 within the ℓ∞ restriction, following Carlini et al. (2019), and the step size is set as α = 1. The default maximum length of generated sequences of VLMs is set as 512 and the sampling policy is configured to use nucleus sampling (Holtzman et al., 2020) with p = 0.9 and temperature t = 1. For our verbose images, the loss weights are a1 = 10, b1 = 20, a2 = 0, b2 = 0, a3 = 0.5, and b3 = 1, and the momentum of our optimization is m = 0.9." |
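The experiment-setup row pins down the optimization constraints of the attack: perturbation budget ε = 8 under ℓ∞, step size α = 1, and momentum m = 0.9. A minimal sketch of how one momentum sign-gradient step respects those constraints is shown below; the quadratic surrogate loss is a stand-in assumption, since the paper's actual verbose-image objectives (and their weights a1..b3) are not reproduced here.

```python
import numpy as np

def verbose_image_step(image, delta, grad, momentum, m=0.9, alpha=1.0, eps=8.0):
    """One momentum sign-gradient ascent step under an l_inf budget.

    image:    original pixels in [0, 255]
    delta:    current adversarial perturbation
    grad:     gradient of the (maximized) loss w.r.t. the perturbed image
    momentum: accumulated gradient direction
    """
    # Accumulate the L1-normalized gradient with momentum m = 0.9.
    momentum = m * momentum + grad / (np.abs(grad).sum() + 1e-12)
    # Ascend with step size alpha = 1, then project back into the
    # l_inf ball of radius eps = 8 and the valid pixel range.
    delta = delta + alpha * np.sign(momentum)
    delta = np.clip(delta, -eps, eps)
    delta = np.clip(image + delta, 0.0, 255.0) - image
    return delta, momentum

# Toy run: maximize the surrogate loss 0.5 * ||x||^2, whose gradient is
# simply x (a placeholder for the paper's verbose-image losses).
image = np.full((4, 4), 128.0)
delta = np.zeros_like(image)
momentum = np.zeros_like(image)
for _ in range(10):
    grad = image + delta          # gradient of the surrogate loss
    delta, momentum = verbose_image_step(image, delta, grad, momentum)
assert np.abs(delta).max() <= 8.0  # perturbation stays within epsilon
```

The projection order matters: clipping delta first enforces the ε-ball, and the second clip keeps the perturbed image inside the valid pixel range without breaking the ℓ∞ guarantee.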
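The setup also fixes the decoding policy to nucleus sampling with p = 0.9 and temperature t = 1. A generic numpy implementation of that policy (Holtzman et al., 2020) is sketched below; it is an illustration of the sampling rule, not the paper's or LAVIS's code, and the toy logits are an assumption.

```python
import numpy as np

def nucleus_sample(logits, p=0.9, temperature=1.0, rng=None):
    """Top-p (nucleus) sampling: draw from the smallest set of tokens
    whose cumulative probability reaches p."""
    rng = rng or np.random.default_rng(0)
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # tokens by descending prob
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # smallest nucleus covering p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

# Toy vocabulary of four tokens: the last two fall outside the nucleus,
# so they can never be sampled at p = 0.9.
logits = np.array([4.0, 3.0, 1.0, 0.5])
token = nucleus_sample(logits, p=0.9, temperature=1.0)
assert token in (0, 1)
```

Truncating to the nucleus before renormalizing is what distinguishes top-p from plain temperature sampling: low-probability tail tokens are excluded entirely rather than merely down-weighted.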