Exploring Intrinsic Dimension for Vision-Language Model Pruning

Authors: Hanzhang Wang, Jiawen Zhang, Qingyuan Ma

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically study ID variations in large-scale vision-language pre-trained models and examine the contributions of different modalities to model prunability. We propose a layer importance metric based on ID, which can conveniently integrate with current metrics and enhance performance in vision-language model pruning. The experimental results show a high correlation between ID and modality prunability.
Researcher Affiliation | Academia | School of Computer Engineering and Science, Shanghai University.
Pseudocode | Yes | Algorithm 1: Iterative Pruning with Intrinsic Dimension (a hedged sketch of such a pruning loop appears after the table).
Open Source Code | Yes | The code is available at https://github.com/Nofear18/ID_VL_Pruning
Open Datasets | Yes | We evaluate image captioning performance using the MSCOCO dataset (Lin et al., 2014); we evaluate the visual reasoning task using the NLVR2 dataset (Suhr et al., 2018); the Flickr30k dataset comprises over 30,000 images; we utilize the CIFAR-100 (Krizhevsky, 2009) and ImageNet-1k (Russakovsky et al., 2015) datasets.
Dataset Splits | Yes | MSCOCO dataset (Lin et al., 2014), encompassing 80 object and 91 stuff categories, with a standard split of 118K training images and 5K images each for validation and testing.
Hardware Specification | Yes | All of our experiments are conducted on 4 NVIDIA GeForce RTX 3090 GPUs using PyTorch.
Software Dependencies | No | The paper mentions PyTorch and the AdamW optimizer but gives no version numbers for either, which exact reproduction would require.
Experiment Setup | Yes | We use a cubic pruning schedule similar to Sanh et al. (2020) and Zhang et al. (2022) for the experiments in rows 1-4 of Table 8. The schedule includes initial warm-up steps $t_i$ and final warm-up steps $t_f$, and sets the remaining ratio at step $t$ to $r(t) = r(0)$ if $0 \le t < t_i$; $r(t) = r(T) + \left(r(0) - r(T)\right)\left(1 - \frac{t - t_i}{T - t_i - t_f}\right)^3$ if $t_i \le t < T - t_f$; and $r(t) = r(T)$ otherwise, where $t_i = i \cdot l$, $t_f = f \cdot l$, and $l$ is the length of the training dataloader. All experiments use the AdamW optimizer (Loshchilov & Hutter, 2018), with additional hyperparameters detailed in Table 8. (A sketch of this schedule in code follows the table.)
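
For concreteness, here is a minimal sketch of the cubic schedule quoted above, assuming $r$ denotes the remaining (unpruned) parameter ratio; the function and argument names are illustrative, not from the paper:

```python
def cubic_schedule(t, T, r0, rT, ti, tf):
    """Cubic pruning schedule (after Sanh et al., 2020; Zhang et al., 2022).

    t  : current training step
    T  : total number of training steps
    r0 : initial remaining ratio r(0)
    rT : target remaining ratio r(T)
    ti : initial warm-up steps, ti = i * l (l = len(train_dataloader))
    tf : final warm-up steps,   tf = f * l
    """
    if t < ti:                     # initial warm-up: no pruning yet
        return r0
    if t < T - tf:                 # cubic decay from r(0) to r(T)
        progress = (t - ti) / (T - ti - tf)
        return rT + (r0 - rT) * (1.0 - progress) ** 3
    return rT                      # final warm-up: hold the target ratio
```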
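
Algorithm 1 itself is not reproduced on this page. As a rough illustration of how an ID-based layer importance could plug into iterative magnitude pruning, the following is a hedged sketch only: it assumes a TwoNN-style two-nearest-neighbor ID estimator and per-weight magnitude scores, both common choices that may not match the paper's exact method, and all names (`twonn_id`, `id_weighted_prune`, `layer_weights`) are hypothetical.

```python
import torch

def twonn_id(features, eps=1e-8):
    """Estimate the intrinsic dimension of an (n, d) feature matrix with
    the two-nearest-neighbor (TwoNN) estimator; one common choice, not
    necessarily the paper's."""
    dist = torch.cdist(features, features)        # pairwise distances
    dist.fill_diagonal_(float("inf"))
    r = dist.topk(2, largest=False).values        # r1, r2 per sample
    mu, _ = (r[:, 1] / (r[:, 0] + eps)).sort()    # sorted ratios r2 / r1
    n = mu.numel()
    x = mu[:-1].log()                             # drop last point (F = 1)
    y = -(1.0 - torch.arange(1, n, dtype=mu.dtype) / n).log()
    return (x @ y / (x @ x)).item()               # slope of fit through origin

def id_weighted_prune(model, layer_weights, remain_ratio):
    """One iterative-pruning step: scale each layer's magnitude scores by
    its ID-based importance, then keep the globally top-scoring weights."""
    params = list(model.parameters())
    scores = torch.cat([(p.abs() * w).flatten()
                        for p, w in zip(params, layer_weights)])
    k = max(1, int(remain_ratio * scores.numel()))
    threshold = scores.topk(k).values.min()
    for p, w in zip(params, layer_weights):
        p.data.mul_((p.abs() * w) >= threshold)   # zero out pruned weights
```

In this reading, `layer_weights` would come from running `twonn_id` on each layer's activations over a calibration batch; scaling existing magnitude scores by a per-layer ID term is one plausible interpretation of "integrating ID with current metrics", not a confirmed account of Algorithm 1.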