Boosting Vision-Language Models with Transduction

Authors: Maxime Zanella, Benoît Gérin, Ismail Ben Ayed

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We report comprehensive evaluations, comparisons, and ablation studies that demonstrate: (i) Transduction can greatly enhance the generalization capabilities of inductive pretrained zero- and few-shot VLMs; (ii) TransCLIP substantially outperforms standard transductive few-shot learning methods relying solely on vision features, notably due to the KL-based language constraint. ... Datasets. Following the setting of previous works [73, 43], we assess TransCLIP on ImageNet [11] and ten datasets for fine-grained classification of scenes (SUN397 [64]), aircraft types (Aircraft [42]), satellite imagery (EuroSAT [18]), automobiles (Cars [32]), food items (Food [4]), pet breeds (Pets [49]), flowers (Flowers [47]), general objects (Caltech101 [14]), textures (DTD [10]) and human actions (UCF101 [54]).
Researcher Affiliation | Academia | Maxime Zanella (UCLouvain, UMons), Benoît Gérin (UCLouvain), Ismail Ben Ayed (ÉTS Montréal); {maxime.zanella,benoit.gerin}@uclouvain.be
Pseudocode | Yes | Our code is available at https://github.com/MaxZanella/transduction-for-vlms and a pseudo-code in Algorithm 1 summarizes the main steps of the TransCLIP algorithm.
Open Source Code | Yes | Code: https://github.com/MaxZanella/transduction-for-vlms
Open Datasets | Yes | Datasets. Following the setting of previous works [73, 43], we assess TransCLIP on ImageNet [11] and ten datasets for fine-grained classification of scenes (SUN397 [64]), aircraft types (Aircraft [42]), satellite imagery (EuroSAT [18]), automobiles (Cars [32]), food items (Food [4]), pet breeds (Pets [49]), flowers (Flowers [47]), general objects (Caltech101 [14]), textures (DTD [10]) and human actions (UCF101 [54]).
Dataset Splits | Yes | Methods with tunable hyper-parameters are fine-tuned using the validation split provided with each dataset. In line with other work [48], validation is performed for each dataset and for every shot number, setting the number of validation shots at min(4, #shots). (A snippet illustrating this rule follows the table.)
Hardware Specification | Yes | Hardware. All our experiments were conducted on a single A100-40GB.
Software Dependencies | No | The paper mentions the use of CLIP (a vision-language model) and various other models and methods, but does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | Implementation details. The main component of our transductive formulation is the text-guided KL divergence penalty. We fix λ = 1 for all our zero-shot experiments (see ablation study in Table 6), and λ = 0.5 in all the few-shot experiments to reduce the impact of the text-driven regularization. ... In practice, TransCLIP performs 10 iterations of z, µ, Σ block-wise updates. For each z-update, we perform 5 iterations, as we found it sufficient for convergence. In the zero-shot setting, we set λ = 1 and γ = 0 (as there are no support samples). In the few-shot setting, we set λ = 0.5 and search for the value of γ in {0.002, 0.01, 0.02, 0.2}. (A hedged sketch of these block-wise updates follows the table.)
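The Experiment Setup row above describes 10 block-wise iterations over the assignments z, class means µ, and covariance Σ, with a KL penalty (weight λ) pulling z toward the text-driven zero-shot predictions. Below is a minimal sketch of that loop under simplifying assumptions: a diagonal-covariance Gaussian mixture over the image features plus the KL-to-text term only. It omits the Laplacian affinity term and the γ-weighted supervision on labelled support samples from the actual TransCLIP objective, and all function and variable names are ours, not the repository's; the exact update rules are given by Algorithm 1 in the paper.

import torch

def transclip_sketch(image_feats, text_probs, lam=1.0, n_outer=10):
    """
    image_feats: (N, D) L2-normalized image embeddings of the query set.
    text_probs:  (N, K) softmaxed zero-shot text predictions for the same samples.
    lam:         weight of the KL-to-text penalty (1.0 zero-shot, 0.5 few-shot, as reported above).
    """
    N, D = image_feats.shape

    # Initialize the assignment variables z from the zero-shot predictions.
    z = text_probs.clone()

    for _ in range(n_outer):  # 10 block-wise (z, mu, Sigma) iterations, as reported above
        # mu-update: class means as z-weighted averages of the image features.
        weights = z / z.sum(dim=0, keepdim=True).clamp_min(1e-8)      # (N, K)
        mu = weights.t() @ image_feats                                 # (K, D)

        # Sigma-update: a shared diagonal variance around the class means.
        diff = image_feats.unsqueeze(1) - mu.unsqueeze(0)              # (N, K, D)
        var = (z.unsqueeze(-1) * diff.pow(2)).sum(dim=(0, 1)) / N      # (D,)
        var = var.clamp_min(1e-6)

        # z-update: Gaussian log-likelihood combined with the text prior.
        # The paper iterates this update 5 times because its full objective
        # couples the z's of neighbouring samples through a Laplacian term;
        # with that term omitted here, a single softmax is already the fixed point.
        log_lik = -0.5 * (diff.pow(2) / var).sum(dim=-1)               # (N, K)
        z = torch.softmax(log_lik + lam * text_probs.clamp_min(1e-8).log(), dim=1)

    return z  # transductive class probabilities for the query samples

In the few-shot setting one would additionally weight a cross-entropy term on the labelled support samples by γ (searched in {0.002, 0.01, 0.02, 0.2} per the row above); that term is deliberately left out of this zero-shot-style sketch.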
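The Dataset Splits row states that validation uses min(4, #shots) shots per class and that γ is tuned on the validation split. The snippet below only restates that rule; the function name, the GAMMA_GRID constant, and the example shot counts (1, 2, 4, 8, 16 are typical few-shot settings, not quoted from this section) are illustrative assumptions.

def num_validation_shots(num_support_shots: int) -> int:
    """Validation shots are capped at min(4, #shots), as quoted above."""
    return min(4, num_support_shots)

# Few-shot gamma values searched on the validation split, as quoted above.
GAMMA_GRID = [0.002, 0.01, 0.02, 0.2]

for shots in (1, 2, 4, 8, 16):
    print(f"{shots} support shots -> {num_validation_shots(shots)} validation shots")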