Q-VLM: Post-training Quantization for Large Vision-Language Models

Authors: Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our method compresses the memory by 2.78x and increases generation speed by 1.44x for the 13B LLaVA model without performance degradation on diverse multi-modal reasoning tasks.
Researcher Affiliation | Academia | 1) Shenzhen International Graduate School, Tsinghua University, China; 2) Department of Automation, Tsinghua University, China; 3) School of Electrical and Electronic Engineering, Nanyang Technological University
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks clearly labeled as such.
Open Source Code | Yes | Code is available at https://github.com/ChangyuanWang17/QVLM
Open Datasets | Yes | We utilize the large vision-language frameworks for post-training quantization including LLaVA [31] and MoE-LLaVA [28] with their pre-trained weights for multi-modal question answering tasks. ... The multi-modal answer reasoning dataset is ScienceQA [35], which contains 21k vision-language multiple-choice questions. We also include the VizWiz [15] and VQA-v2 [14] datasets.
Dataset Splits | Yes | For the parameter learning in LVLM quantization, we randomly select 64 vision-language pairs from the datasets for hyper-network learning, where the batch size was set to 8 for calibration set construction. (A calibration-set sketch follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (GPU/CPU models, processor types, or memory amounts) used for running its experiments within the main text.
Software Dependencies | No | The paper mentions frameworks and methods used (e.g., LLaVA, MoE-LLaVA, QLoRA, AWQ) but does not provide specific version numbers for any software components or libraries.
Experiment Setup | Yes | We set the bitwidths of the quantized weights and activations to 6 and 4 to evaluate our method under different quality-efficiency trade-offs, using a uniform quantization scheme where the interval between adjacent rounding points is equal. ... We set the maximum layer depth to 3 within a block... In the LVLM quantization exploration, we adjust the percentile hyperparameter p from 1.0 to 0.98 with a 0.005 interval... we modified the hyperparameter η... For the parameter learning in LVLM quantization, we randomly select 64 vision-language pairs from the datasets for hyper-network learning, where the batch size was set to 8 for calibration set construction. The quantization function parameters were updated for 10 epochs in the searching process... (A uniform-quantization sketch follows the table.)
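
For readers who want to mirror the calibration set described in the Dataset Splits row, here is a minimal sketch of drawing 64 random vision-language pairs and batching them with batch size 8. The function name, the fixed seed, and the generic map-style torch Dataset are illustrative assumptions, not code from the QVLM repository.

```python
import torch
from torch.utils.data import DataLoader, Dataset, Subset

def build_calibration_loader(dataset: Dataset, num_pairs: int = 64,
                             batch_size: int = 8, seed: int = 0) -> DataLoader:
    """Randomly draw a small calibration set of vision-language pairs.

    Mirrors the reported setting: 64 randomly selected pairs, batched with
    batch size 8 for calibration-set construction. Assumes a map-style
    dataset (implements __len__ and __getitem__).
    """
    generator = torch.Generator().manual_seed(seed)
    indices = torch.randperm(len(dataset), generator=generator)[:num_pairs].tolist()
    return DataLoader(Subset(dataset, indices), batch_size=batch_size, shuffle=False)
```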
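
The Experiment Setup row describes uniform W6/A4 quantization and a percentile clipping hyperparameter p swept from 1.0 down to 0.98 in steps of 0.005. The sketch below shows one plausible reading: a symmetric uniform quantizer whose clipping range is the p-th quantile of the tensor's magnitudes. All names and the toy tensors are assumptions for illustration and do not reproduce the authors' implementation (the block-wise search over layer depth is omitted).

```python
import torch

def uniform_quantize(x: torch.Tensor, bits: int, p: float = 1.0) -> torch.Tensor:
    """Symmetric uniform quantization with percentile clipping.

    The clipping threshold is the p-th quantile of |x| (p = 1.0 keeps the
    full range), and the interval between adjacent rounding points is equal.
    """
    max_val = torch.quantile(x.abs().flatten().float(), p).item()
    qmax = 2 ** (bits - 1) - 1            # e.g. 31 for 6-bit, 7 for 4-bit
    scale = max_val / qmax
    x_clipped = torch.clamp(x, -max_val, max_val)
    return torch.round(x_clipped / scale) * scale  # fake-quantized (de-quantized) output

# Toy sweep over the reported percentile range: 1.0 down to 0.98 in 0.005 steps.
if __name__ == "__main__":
    weight = torch.randn(256, 256)        # stand-in for a 6-bit weight tensor
    activation = torch.randn(8, 256)      # stand-in for a 4-bit activation tensor
    for step in range(5):
        p = 1.0 - 0.005 * step
        w_err = (weight - uniform_quantize(weight, bits=6, p=p)).pow(2).mean()
        a_err = (activation - uniform_quantize(activation, bits=4, p=p)).pow(2).mean()
        print(f"p={p:.3f}  weight MSE={w_err.item():.6f}  activation MSE={a_err.item():.6f}")
```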