Boosting Vision-Language Models with Transduction
Authors: Maxime Zanella, Benoît Gérin, Ismail Ben Ayed
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report comprehensive evaluations, comparisons, and ablation studies that demonstrate: (i) transduction can greatly enhance the generalization capabilities of inductive pretrained zero- and few-shot VLMs; (ii) TransCLIP substantially outperforms standard transductive few-shot learning methods relying solely on vision features, notably due to the KL-based language constraint. ... Datasets. Following the setting of previous works [73, 43], we assess TransCLIP on ImageNet [11] and ten datasets for fine-grained classification of scenes (SUN397 [64]), aircraft types (Aircraft [42]), satellite imagery (EuroSAT [18]), automobiles (Cars [32]), food items (Food [4]), pet breeds (Pets [49]), flowers (Flowers [47]), general objects (Caltech101 [14]), textures (DTD [10]) and human actions (UCF101 [54]). |
| Researcher Affiliation | Academia | Maxime Zanella (UCLouvain, UMons), Benoît Gérin (UCLouvain), Ismail Ben Ayed (ÉTS Montréal); {maxime.zanella,benoit.gerin}@uclouvain.be |
| Pseudocode | Yes | Our code is available at https://github.com/MaxZanella/transduction-for-vlms and a pseudo-code in Algorithm 1 summarizes the main steps of the TransCLIP algorithm. |
| Open Source Code | Yes | Code: https://github.com/MaxZanella/transduction-for-vlms |
| Open Datasets | Yes | Datasets. Following the setting of previous works [73, 43], we assess TransCLIP on ImageNet [11] and ten datasets for fine-grained classification of scenes (SUN397 [64]), aircraft types (Aircraft [42]), satellite imagery (EuroSAT [18]), automobiles (Cars [32]), food items (Food [4]), pet breeds (Pets [49]), flowers (Flowers [47]), general objects (Caltech101 [14]), textures (DTD [10]) and human actions (UCF101 [54]). |
| Dataset Splits | Yes | Methods with tunable hyper-parameters are fine-tuned using the validation split provided with each dataset. In line with other work [48], validation is performed for each dataset and for every shot number, setting the number of validation shots at min(4, #shots). |
| Hardware Specification | Yes | Hardware. All our experiments were conducted on a single A100-40 GB. |
| Software Dependencies | No | The paper mentions the use of CLIP (a vision-language model) and various other models/methods, but does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Implementation details. The main component of our transductive formulation is the text-guided KL divergence penalty. We fix λ = 1 for all our zero-shot experiments (see ablation study in Table 6), and λ = 0.5 in all the few-shot experiments to reduce the impact of the text-driven regularization. ... In practice, TransCLIP performs 10 iterations of z, µ, Σ block-wise updates. For each z-update, we perform 5 iterations, as we found it sufficient for convergence. In the zero-shot setting, we set λ = 1 and γ = 0 (as there are no support samples). In the few-shot setting, we set λ = 0.5 and search for the value of γ in {0.002, 0.01, 0.02, 0.2}. (A simplified sketch of this block-wise loop is given below the table.) |
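
The experiment-setup cell describes the core computational structure: 10 outer block-wise iterations over the assignments z, the class means µ, and the covariance Σ, with 5 inner iterations per z-update, plus the text-guided KL penalty weighted by λ. The sketch below is a hypothetical illustration of that structure for the zero-shot case (γ = 0), not the authors' released implementation: the function name `transclip_like_sketch`, the top-k affinity matrix, the shared diagonal variance, and the closed-form used for the z-update are simplifying assumptions made for illustration. The repository linked above contains the actual update rules, including the γ-weighted support-set term used in the few-shot setting.

```python
# Hypothetical sketch -- not the authors' code. It only illustrates the reported
# block-wise structure (10 outer iterations over z, mu, Sigma; 5 inner z-updates)
# for the zero-shot case (gamma = 0), with simplified update rules.
import torch
import torch.nn.functional as F


def transclip_like_sketch(feats, text_probs, lam=1.0, n_outer=10, n_inner=5, knn=3):
    """feats: (N, D) L2-normalized query image features.
    text_probs: (N, K) zero-shot text-based softmax predictions.
    lam: weight of the KL-style text-guidance term (lambda in the table)."""
    n, _ = feats.shape

    # Top-knn cosine affinities among query samples, standing in for the
    # neighbourhood term that couples the z-variables across samples.
    sim = feats @ feats.T
    sim.fill_diagonal_(-float("inf"))
    topv, topi = sim.topk(knn, dim=1)
    w = torch.zeros(n, n).scatter_(1, topi, topv.clamp(min=0))

    # Initialize assignments with the text predictions, class means with
    # prediction-weighted feature averages, and a shared diagonal variance.
    z = text_probs.clone()
    mu = (z.T @ feats) / z.sum(0, keepdim=True).T.clamp(min=1e-6)
    var = feats.var(dim=0, keepdim=True).clamp(min=1e-6)            # (1, D)

    log_y = torch.log(text_probs.clamp(min=1e-12))
    for _ in range(n_outer):
        # Diagonal-Gaussian log-likelihoods (constants dropped): (N, K)
        diff2 = (feats.unsqueeze(1) - mu.unsqueeze(0)) ** 2         # (N, K, D)
        log_g = -0.5 * ((diff2 / var.unsqueeze(0)).sum(-1) + var.log().sum())

        # z-block: a few fixed-point iterations (coupled through the affinities).
        for _ in range(n_inner):
            z = F.softmax(log_g + lam * log_y + w @ z, dim=1)

        # mu-block: assignment-weighted class means.
        mu = (z.T @ feats) / z.sum(0, keepdim=True).T.clamp(min=1e-6)

        # Sigma-block: shared diagonal variance, recomputed with the new means.
        diff2 = (feats.unsqueeze(1) - mu.unsqueeze(0)) ** 2
        var = ((z.unsqueeze(-1) * diff2).sum(dim=(0, 1)) / n).clamp(min=1e-6)
        var = var.unsqueeze(0)

    return z  # transductive class probabilities for the query samples


if __name__ == "__main__":
    torch.manual_seed(0)
    feats = F.normalize(torch.randn(64, 32), dim=1)     # stand-in CLIP features
    text_probs = F.softmax(torch.randn(64, 5), dim=1)   # stand-in text predictions
    print(transclip_like_sketch(feats, text_probs).argmax(1)[:10])
```

As described in the table, the few-shot variant would additionally keep λ = 0.5, include the γ-weighted support-set term, and grid-search γ over {0.002, 0.01, 0.02, 0.2} on a validation split built with min(4, #shots) shots per class.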