Learning to Decompose Visual Features with Latent Textual Prompts

Authors: Feng Wang, Manling Li, Xudong Lin, Hairong Lv, Alex Schwing, Heng Ji

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical study shows DeFo's significance in improving the vision-language models. For example, DeFo obtains 73.2% test accuracy on ImageNet with a ResNet-50 backbone without tuning any pretrained weights of both the vision and language encoder, outperforming zero-shot CLIP by a large margin of 15.0%, and outperforming state-of-the-art vision-language prompt tuning by 7.6%.
Researcher Affiliation | Academia | Feng Wang (Tsinghua University), Manling Li (University of Illinois at Urbana-Champaign), Xudong Lin (Columbia University), Hairong Lv (Tsinghua University), Alexander G. Schwing (University of Illinois at Urbana-Champaign) & Heng Ji (University of Illinois at Urbana-Champaign)
Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available.
Open Datasets | Yes | We follow prior methods to select 11 publicly available datasets, i.e., ImageNet (Deng et al., 2009), Food101 (Bossard et al., 2014), Oxford Pets (Parkhi et al., 2012), Caltech101 (Fei-Fei et al., 2004), SUN397 (Xiao et al., 2010), UCF101 (Soomro et al., 2012), Stanford Cars (Krause et al., 2013), FGVCAircraft (Maji et al., 2013), DTD (Cimpoi et al., 2014), Flowers102 (Nilsback & Zisserman, 2008), and EuroSAT (Helber et al., 2019).
Dataset Splits | No | The paper mentions "few-shot training" and "full-dataset training" but does not explicitly describe a validation dataset split or percentages for hyperparameter tuning. It only specifies training data amounts for few-shot scenarios.
Hardware Specification | No | The paper describes experimental setup details, such as batch sizes and learning rates, but does not specify any particular hardware components like GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using an "SGD optimizer" and being built upon "CLIP pretrained models," but it does not specify version numbers for any software dependencies like programming languages, libraries (e.g., PyTorch, TensorFlow), or operating systems.
Experiment Setup | Yes | By default, we use simple data augmentation of random crop and flip, and train with an SGD optimizer with a minibatch size of 32, 2e-3 learning rate, 0.9 momentum, and 0.01 weight decay (following CoOp (Zhou et al., 2021)) for 50 epochs. For full-dataset training on ImageNet, we use a batch size of 256 and a learning rate of 0.01, which yields similar accuracy to the default setting but significantly reduces training time.
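The hyperparameters quoted above can be collected into a single configuration for reimplementation. The sketch below is illustrative only: the paper releases no code, so the dictionary keys and the override mechanism are assumptions, not the authors' implementation.

```python
# Hedged sketch of the two training configurations reported in the paper.
# Key names are hypothetical; only the values come from the quoted setup.
DEFAULT_CONFIG = {
    "augmentation": ["random_crop", "random_flip"],  # "simple data augmentation"
    "optimizer": "SGD",
    "batch_size": 32,
    "lr": 2e-3,
    "momentum": 0.9,
    "weight_decay": 0.01,
    "epochs": 50,           # following CoOp (Zhou et al., 2021)
}

# Full-dataset ImageNet training: larger batch and learning rate,
# reported to give similar accuracy with much less training time.
IMAGENET_FULL_CONFIG = {**DEFAULT_CONFIG, "batch_size": 256, "lr": 0.01}
```

Expressing the ImageNet setting as an override of the default makes explicit that only the batch size and learning rate differ between the two regimes.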