Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images
Authors: Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, Wei Liu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our verbose images can increase the length of generated sequences by 7.87× and 8.56× compared to original images on MS-COCO and ImageNet datasets, which presents potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images. |
| Researcher Affiliation | Collaboration | Kuofeng Gao1, Yang Bai2, Jindong Gu3, Shu-Tao Xia1,5, Philip Torr3, Zhifeng Li4, Wei Liu4; 1 Tsinghua University; 2 Tencent Technology (Beijing) Co., Ltd.; 3 University of Oxford; 4 Tencent Data Platform; 5 Peng Cheng Laboratory |
| Pseudocode | Yes | Algorithm 1 Verbose images: Inducing high energy-latency cost of VLMs |
| Open Source Code | Yes | Our code is available at https://github.com/KuofengGao/Verbose_Images. |
| Open Datasets | Yes | We randomly choose the 1,000 images from MS-COCO (Lin et al., 2014) and ImageNet (Deng et al., 2009) dataset, respectively, as our evaluation dataset. |
| Dataset Splits | No | The paper states that they 'randomly choose the 1,000 images from MS-COCO and ImageNet dataset, respectively, as our evaluation dataset', which serves as their test set. However, it does not explicitly provide training and validation dataset splits, as their work involves attacking pre-trained models rather than training new ones. |
| Hardware Specification | Yes | Note that every experiment is run on one NVIDIA Tesla A100 GPU with 40GB memory. |
| Software Dependencies | No | The paper states that they use 'the PyTorch framework (Paszke et al., 2019) and the LAVIS library (Li et al., 2023a)', but it does not specify the version numbers for these software components. |
| Experiment Setup | Yes | the perturbation magnitude is set as ϵ = 8 within l∞ restriction, following Carlini et al. (2019), and the step size is set as α = 1. The default maximum length of generated sequences of VLMs is set as 512 and the sampling policy is configured to use nucleus sampling (Holtzman et al., 2020) with p = 0.9 and temperature t = 1. For our verbose images, the parameters of loss weights are a1 = 10, b1 = 20, a2 = 0, b2 = 0, a3 = 0.5, and b3 = 1 and the momentum of our optimization is m = 0.9. |
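The attack loop referenced in the table (Algorithm 1, with ϵ = 8, α = 1, and momentum m = 0.9 on the 0-255 pixel scale) can be sketched as a momentum-based projected gradient ascent under an l∞ constraint. This is a minimal illustration, not the authors' released code: `loss_grad_fn` is a hypothetical placeholder standing in for the gradient of the paper's combined length-inducing objective, which in practice requires backpropagating through the VLM.

```python
import numpy as np

def verbose_image_attack(image, loss_grad_fn, eps=8.0, alpha=1.0,
                         momentum=0.9, steps=10):
    """Sketch of a momentum PGD loop in the spirit of Algorithm 1:
    perturb the image (pixel scale 0-255) within an l_inf ball of
    radius eps to maximize a length-inducing loss.
    loss_grad_fn(adv) is a placeholder returning the gradient of
    that loss with respect to the adversarial image."""
    adv = image.astype(np.float64).copy()
    g = np.zeros_like(adv)  # momentum buffer
    for _ in range(steps):
        grad = loss_grad_fn(adv)
        # normalized-gradient momentum accumulation
        g = momentum * g + grad / (np.abs(grad).mean() + 1e-12)
        adv = adv + alpha * np.sign(g)                 # ascent step
        adv = np.clip(adv, image - eps, image + eps)   # project to l_inf ball
        adv = np.clip(adv, 0.0, 255.0)                 # keep valid pixel range
    return adv
```

The sign step and l∞ projection mirror the stated step size α = 1 and magnitude ϵ = 8; the real objective and step count come from the paper's full setup.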
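The sampling policy in the setup (nucleus sampling with p = 0.9 and temperature t = 1) can be illustrated with a minimal, self-contained top-p sampler; this is an assumption-level sketch of the standard technique (Holtzman et al., 2020), not the LAVIS implementation the paper uses.

```python
import numpy as np

def nucleus_sample(logits, p=0.9, temperature=1.0, rng=None):
    """Minimal top-p (nucleus) sampling, matching the evaluation
    config in the table (p = 0.9, t = 1): keep the smallest set of
    tokens whose cumulative probability exceeds p, then sample from
    the renormalized distribution over that set."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))  # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # smallest nucleus covering p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

With a sharply peaked distribution the nucleus collapses to the top token, so sampling becomes effectively greedy; flatter distributions keep more tokens in play.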