Improving Visual Prompt Tuning for Self-supervised Vision Transformers

Authors: Seungryong Yoo, Eunji Kim, Dahuin Jung, Jungbeom Lee, Sungroh Yoon

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through empirical observations, we deduce that the effectiveness of VPT hinges largely on the ViT blocks with which the prompt tokens interact. Specifically, VPT shows improved performance on image classification tasks for MAE and MoCo v3 when the prompt tokens are inserted into later blocks rather than the first block. These observations suggest that there exists an optimal location of blocks for the insertion of prompt tokens. Unfortunately, identifying the optimal blocks for prompts within each self-supervised ViT for diverse future scenarios is a costly process. To mitigate this problem, we propose a simple yet effective method that learns a gate for each ViT block to adjust its intervention into the prompt tokens. With our method, prompt tokens are selectively influenced by blocks that require steering for task adaptation. Our method outperforms VPT variants in FGVC and VTAB image classification and ADE20K semantic segmentation.
Researcher Affiliation | Academia | 1Electrical and Computer Engineering, 2Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, Korea.
Pseudocode | Yes | Algorithm 1: PyTorch-like Pseudocode for Gated Prompt Tuning (an illustrative sketch of the gating mechanism is given after the table).
Open Source Code | Yes | The code is available at https://github.com/ryongithub/GatedPromptTuning.
Open Datasets | Yes | FGVC includes five fine-grained classification tasks: CUB (Wah et al., 2011), Oxford Flowers (Nilsback & Zisserman, 2008), Stanford Cars (Gebru et al., 2017), Stanford Dogs (Khosla et al., 2011) and NABirds (Van Horn et al., 2015). ... VTAB-1K (Zhai et al., 2019), which consists of 19 diverse visual classification tasks... For semantic segmentation, we evaluate the performances on ADE20K (Zhou et al., 2017) benchmark.
Dataset Splits | No | The paper uses standard benchmarks (FGVC, VTAB-1K, ADE20K), which typically have predefined splits, but it does not explicitly detail training, validation, and test splits with specific percentages, counts, or methodologies.
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU models, memory specifications) are mentioned in the paper. It discusses computational efficiency and parameters but not the machines used.
Software Dependencies | No | The paper provides pseudocode and a link to the code repository, but it does not list specific software dependencies or library versions.
Experiment Setup | Yes | The hyperparameters used to train the models for FGVC (Table 1), VTAB-1K (Zhai et al., 2019) (Table 2), and ADE20K (Table 3) are listed in Table 5. We used the SGD optimizer, and the learning rate was searched among {0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0}. For ADE20K semantic segmentation, we used the default hyperparameters following SETR-PUP (Zheng et al., 2021). ... Table 5. Selected hyper-parameters of our method for each downstream task and SSL method. (A sketch of this learning-rate search is also given below.)
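
To make the Pseudocode row above concrete, here is a minimal PyTorch-style sketch of a per-block gate that scales how much a ViT block intervenes in the prompt tokens. The class and attribute names (GatedPromptBlock, gate_logit) and the sigmoid-interpolation form are illustrative assumptions, not the authors' exact Algorithm 1.

import torch
import torch.nn as nn

class GatedPromptBlock(nn.Module):
    # Wraps one frozen ViT block with a learnable scalar gate that controls
    # how strongly this block updates the prompt tokens (illustrative sketch,
    # not the authors' exact implementation).
    def __init__(self, vit_block):
        super().__init__()
        self.vit_block = vit_block                      # frozen pretrained ViT block
        self.gate_logit = nn.Parameter(torch.zeros(1))  # one learnable gate per block

    def forward(self, patch_tokens, prompt_tokens):
        # Process prompt and patch tokens jointly through the frozen block.
        x = torch.cat([prompt_tokens, patch_tokens], dim=1)
        out = self.vit_block(x)
        n_prompt = prompt_tokens.shape[1]
        new_prompt, new_patch = out[:, :n_prompt], out[:, n_prompt:]
        # The gate decides how much this block intervenes in the prompts:
        # g near 1 lets the block steer the prompts, g near 0 passes them through.
        g = torch.sigmoid(self.gate_logit)
        prompt_tokens = g * new_prompt + (1.0 - g) * prompt_tokens
        return new_patch, prompt_tokens

With gates learned per block, only the blocks that actually need to steer the prompts end up influencing them, which matches the paper's observation that the useful insertion points differ across self-supervised ViTs.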
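
The learning-rate grid in the Experiment Setup row can be turned into a simple search loop. The helpers build_model and train_and_evaluate are hypothetical placeholders for the task-specific pipeline, and the momentum value is an assumption; only the SGD optimizer and the grid {0.05, ..., 5.0} come from the paper.

import torch

# Learning-rate grid reported in the paper for the SGD optimizer.
LR_GRID = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]

def search_learning_rate(build_model, train_and_evaluate):
    # Returns the learning rate achieving the best validation accuracy.
    # build_model() and train_and_evaluate(model, optimizer) are hypothetical
    # helpers standing in for the task-specific training and evaluation code.
    best_lr, best_acc = None, float("-inf")
    for lr in LR_GRID:
        model = build_model()
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)  # momentum is an assumed value
        acc = train_and_evaluate(model, optimizer)
        if acc > best_acc:
            best_lr, best_acc = lr, acc
    return best_lr, best_acc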