Q-VLM: Post-training Quantization for Large Vision-Language Models
Authors: Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our method compresses memory by 2.78x and increases generation speed by 1.44x for the 13B LLaVA model without performance degradation on diverse multi-modal reasoning tasks. |
| Researcher Affiliation | Academia | (1) Shenzhen International Graduate School, Tsinghua University, China; (2) Department of Automation, Tsinghua University, China; (3) School of Electrical and Electronic Engineering, Nanyang Technological University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks clearly labeled as such. |
| Open Source Code | Yes | Code is available at https://github.com/ChangyuanWang17/QVLM |
| Open Datasets | Yes | We utilize the large vision-language frameworks for post-training quantization, including LLaVA [31] and MoE-LLaVA [28] with their pre-trained weights, for multi-modal question answering tasks. ... The multi-modal answer reasoning dataset is ScienceQA [35], which contains 21k vision-language multiple-choice questions. We also include the VizWiz [15] and VQA-v2 [14] datasets. |
| Dataset Splits | Yes | For parameter learning in LVLM quantization, we randomly select 64 vision-language pairs from the datasets for hyper-network learning, with a batch size of 8 for calibration set construction. |
| Hardware Specification | No | The paper does not provide specific hardware details (GPU/CPU models, processor types, or memory amounts) used for running its experiments within the main text. |
| Software Dependencies | No | The paper mentions frameworks and methods used (e.g., LLaVA, MoE-LLaVA, QLoRA, AWQ) but does not provide specific version numbers for any software components or libraries. |
| Experiment Setup | Yes | We set the bitwidths of the quantized weights and activations to 6 and 4 to evaluate our method under different quality-efficiency trade-offs, using a uniform quantization scheme where the interval between adjacent rounding points is equal. ... We set the maximum layer depth to 3 within a block... In the LVLM quantization exploration, we adjust the percentile hyperparameter p from 1.0 to 0.98 with a 0.005 interval... we modified the hyperparameter η... For parameter learning in LVLM quantization, we randomly select 64 vision-language pairs from the datasets for hyper-network learning, with a batch size of 8 for calibration set construction. The quantization function parameters were updated for 10 epochs in the search process... |
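
The experiment setup above amounts to W6A4 uniform quantization with equally spaced rounding points and a percentile-swept clipping range calibrated on a small batch of vision-language pairs. The snippet below is a minimal sketch of that configuration, not the authors' released implementation (see the repository linked above); the function names, tensor shapes, and the specific percentile value are illustrative assumptions.

```python
# Minimal sketch of the W6A4 uniform quantization setup with percentile clipping.
# Names, shapes, and the chosen percentile are illustrative assumptions, not Q-VLM code.
import torch

def uniform_quantize(x: torch.Tensor, bitwidth: int, clip_max: float) -> torch.Tensor:
    """Symmetric uniform quantization: rounding points equally spaced in [-clip_max, clip_max]."""
    n_levels = 2 ** (bitwidth - 1) - 1             # e.g. 31 positive levels for 6-bit signed
    scale = clip_max / n_levels                    # equal interval between adjacent rounding points
    x_clipped = x.clamp(-clip_max, clip_max)
    return torch.round(x_clipped / scale) * scale  # fake-quantize: round, then de-quantize

def percentile_clip(x: torch.Tensor, p: float) -> float:
    """Clipping threshold at the p-th quantile of |x|; the paper sweeps p from 1.0 down to 0.98."""
    return torch.quantile(x.abs().flatten(), p).item()

# Calibration on a small batch (the paper uses 64 vision-language pairs, batch size 8).
weights = torch.randn(512, 512)    # stand-in for one linear layer's weights
activations = torch.randn(8, 512)  # stand-in for one calibration batch of activations

w_q = uniform_quantize(weights, bitwidth=6, clip_max=percentile_clip(weights, 1.0))
a_q = uniform_quantize(activations, bitwidth=4, clip_max=percentile_clip(activations, 0.995))
```

In the paper's search procedure, the percentile p (and the hyperparameter η) would be swept over the stated ranges and the quantization parameters updated for 10 epochs on the calibration set; the sketch only shows a single fixed choice.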