Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models

Authors: Shuai Fu, Xiequn Wang, Qiushi Huang, Yu Zhang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To thoroughly investigate the role of soft-prompt vector norms in VLM performance, we introduce two corruption operations, REPLACE and RESCALE, to alter the norms of vectors learned by CoOp (Zhou et al., 2022b). Corrupting the learned soft prompts reveals an intriguing phenomenon: reducing the norms at specific positions within these prompts enhances performance, whereas increasing them typically degrades it, as illustrated in Figure 1(a). We term this previously uncovered phenomenon the Low-Norm Effect. Figure 1(b) explores the prevalence of the Low-Norm Effect across 11 widely used prompt-tuning VLM datasets.
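The two corruption operations described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the use of plain float lists, and the choice of a caller-supplied replacement vector are all assumptions; only the idea — perturbing the norm of the vector at one prompt position — comes from the excerpt.

```python
import math


def rescale(prompt, pos, tau):
    """RESCALE-style corruption (sketch): scale the vector at position
    `pos` by `tau`, so its L2 norm changes by the same factor while all
    other positions are left untouched."""
    corrupted = [list(v) for v in prompt]  # copy, do not mutate the input
    corrupted[pos] = [tau * x for x in corrupted[pos]]
    return corrupted


def replace(prompt, pos, new_vec):
    """REPLACE-style corruption (sketch): swap in `new_vec` at position
    `pos`; the source of the replacement vector is an assumption here."""
    corrupted = [list(v) for v in prompt]
    corrupted[pos] = list(new_vec)
    return corrupted


def l2(v):
    """L2 norm of a vector, used to inspect the effect of a corruption."""
    return math.sqrt(sum(x * x for x in v))
```

With `tau < 1`, `rescale` shrinks the norm at one position, which is exactly the manipulation the Low-Norm Effect observation is built on.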
Researcher Affiliation | Academia | (1) Department of Computer Science and Engineering, Southern University of Science and Technology; (2) Computer Science Research Centre, University of Surrey. Emails: {fus.jayce, wangxiequn, yu.zhang.ust}@gmail.com; qiushi.huang@surrey.ac.uk
Pseudocode | Yes | Algorithm 1: The training process of Nemesis adopting the PAN loss.
Open Source Code | Yes | The code is available at https://github.com/ShyFoo/Nemesis.
Open Datasets | Yes | For few-shot image classification experiments and base-to-new generalization tasks, we follow the experimental settings of CoOp and CoCoOp, respectively, and conduct experiments on 11 visual classification datasets, including Caltech101 (Fei-Fei et al., 2004) and ImageNet (Deng et al., 2009) for object recognition, EuroSAT (Helber et al., 2019) for satellite image recognition, DTD (Cimpoi et al., 2014) for texture recognition, UCF101 (Soomro et al., 2012) for action recognition, SUN397 (Xiao et al., 2010) for scene recognition, and OxfordPets (Parkhi et al., 2012), FGVCAircraft (Maji et al., 2013), Food101 (Bossard et al., 2014), Flowers102 (Nilsback & Zisserman, 2008), and StanfordCars (Krause et al., 2013) for fine-grained recognition.
Dataset Splits | No | Table A1, 'The detailed statistics of datasets', lists only 'Train' and 'Test' columns for the number of samples, with no explicit mention of a validation split or a methodology for creating one for reproducibility.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU models or CPU specifications.
Software Dependencies | No | The paper mentions software such as CLIP, CoOp, and PLOT, but it does not specify version numbers for these or other software components, which is necessary for reproducibility.
Experiment Setup | Yes | To normalize the soft prompts in VLMs, two normalization losses are proposed: the Position-Uniform Normalization (PUN) loss and the Position-Aware Normalization (PAN) loss. Both losses involve a crucial hyper-parameter ω, which controls the extent of normalization for soft prompts. Unless specified otherwise, ω is set to 1 for all datasets, except for ImageNet, where it is set to 10, and OxfordPets and Food101, where it is set to 50. ... it is varied according to the logistic function ω_E = 1 / (1 + exp(−k(E − 0.5·max_E))), where E and max_E denote the current training epoch and the maximum training epoch, respectively, and k represents the attenuation rate, fixed at 0.2. ... We set the default value of N to 1 and τ to 0.5.
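The ω schedule and a PUN-style penalty described in this excerpt can be sketched as follows. This is an illustrative reading under stated assumptions: the sign convention in the logistic schedule (ω rising toward 1 as training progresses) is reconstructed from the quoted formula, and treating the PUN loss as ω times the mean L2 norm of the prompt vectors is a simplification of "controls the extent of normalization", not the paper's exact loss.

```python
import math


def omega_schedule(epoch, max_epoch, k=0.2):
    """Logistic schedule for the normalization weight omega, following
    omega_E = 1 / (1 + exp(-k * (E - 0.5 * max_E))).  The sign of the
    exponent (omega growing over training) is an assumption."""
    return 1.0 / (1.0 + math.exp(-k * (epoch - 0.5 * max_epoch)))


def pun_loss(prompt, omega):
    """PUN-style penalty (sketch): omega times the mean L2 norm of all
    prompt vectors, so minimizing it shrinks norms uniformly across
    positions."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in prompt]
    return omega * sum(norms) / len(norms)
```

At the midpoint of training (`E = 0.5 * max_E`) the schedule yields ω = 0.5, and with k = 0.2 it saturates smoothly toward 0 and 1 at the two ends of training.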