Frustratingly Easy Test-Time Adaptation of Vision-Language Models
Authors: Matteo Farina, Gianni Franchi, Giovanni Iacca, Massimiliano Mancini, Elisa Ricci
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We thoroughly evaluate our approach following the experimental protocol established in the literature and show that ZERO largely surpasses or compares favorably w.r.t. the state-of-the-art while being almost 10× faster and 13× more memory-friendly than standard Test-Time Prompt Tuning. |
| Researcher Affiliation | Academia | 1University of Trento 2U2IS, ENSTA Paris, Institut Polytechnique de Paris 3Fondazione Bruno Kessler (FBK) |
| Pseudocode | Yes | Algorithm 1: PyTorch-style code for ZERO. # z_txt = pre-computed text embeddings (C, hdim) # temp = model's original temperature # augment = takes (C,H,W) and returns (N,C,H,W) # gamma = filtering percentile (e.g., 0.1) def zero(image, z_txt, N, gamma, temp): # step 1: augment views = augment(image, num_views=N) # step 2: predict (unscaled logits) l = model.image_encoder(views) @ z_txt.t() # step 3: retain most confident preds l_filt = confidence_filter(l, temp, top=gamma) # step 4: zero temperature zero_temp = torch.finfo(l_filt.dtype).eps # step 5: marginalize p_bar = (l_filt / zero_temp).softmax(1).sum(0) return p_bar.argmax() |
| Open Source Code | Yes | The code is available at https://github.com/FarinaMatteo/zero. |
| Open Datasets | Yes | For all classes in each dataset, we first draw all images sharing the same label (Xy). Then, we compute the expected error ϵ(y) of the model on this subset, together with the error of p̄ (ideally, Eq. (6)). Lastly, we average these errors over the entire label space Y. We do not restrict to the cases where y is supported by the majority and we do not re-organize predictions in a one-versus-all scheme. Fig. 1(a) clearly shows that the error of p̄ is a lower bound to the base error of the model also in practical use cases where the label space is large and guarantees on model calibration are possibly missing. Importantly, this phenomenon persists no matter the dataset. (Referring to ImageNet-1k, ImageNet-A, ImageNet-R, ImageNet-v2, ImageNet-Sketch [4, 13, 12, 32, 46]) |
| Dataset Splits | Yes | The only hyperparameter of ZERO is the percentile for confidence-based filtering, which is set to 0.3 after validation on ImageNet (following standard practice [51]) and kept fixed for all datasets. |
| Hardware Specification | Yes | We always use 1 NVIDIA A100 GPU and FP16 Automatic Mixed Precision. To quantify the computational gain of ZERO w.r.t. other TTA methods, we report the runtime per image and peak GPU memory in Table 3 under the same hardware (i.e., 1 NVIDIA RTX 4090). |
| Software Dependencies | No | For reference, a PyTorch-like implementation [30] is reported in Algorithm 1. For the experiments with LAION Pretraining, we use the open_clip repository, i.e., the official code for [2]. |
| Experiment Setup | Yes | The augmentation pool A only contains random resized crops and random horizontal flips. The only hyperparameter of ZERO is the percentile for confidence-based filtering, which is set to 0.3 after validation on ImageNet (following standard practice [51]) and kept fixed for all datasets. We inherit the setup of TPT with N = 64, crafting 63 augmentations to collate with the source image. Results are averaged over 3 different seeds. |
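To make the flattened Algorithm 1 concrete, the following is a minimal NumPy sketch of ZERO's test-time pipeline: score the augmented views, keep the most confident fraction (the `gamma` percentile from the table above), then softmax at a near-zero temperature and marginalize, which reduces to majority voting among the retained views. The helper names `softmax` and `confidence_filter` and the tie-breaking in the filter are assumptions for illustration; the paper's reference implementation is in PyTorch and operates on CLIP image/text embeddings.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_filter(logits, temp, top=0.3):
    """Keep the `top` fraction of views with the highest max class probability.

    `logits`: (N, C) unscaled view-vs-class scores; `temp`: model temperature.
    """
    conf = softmax(logits / temp, axis=1).max(axis=1)  # per-view confidence
    k = max(1, int(top * logits.shape[0]))             # number of views kept
    keep = np.argsort(conf)[::-1][:k]                  # most confident first
    return logits[keep]

def zero(view_logits, temp, gamma=0.3):
    """ZERO: filter confident views, zero-temperature softmax, marginalize."""
    l_filt = confidence_filter(view_logits, temp, top=gamma)
    zero_temp = np.finfo(l_filt.dtype).eps             # "zero" temperature
    # At near-zero temperature each row's softmax is one-hot at its argmax,
    # so the sum over views counts votes per class.
    p_bar = softmax(l_filt / zero_temp, axis=1).sum(axis=0)
    return int(p_bar.argmax())

# Toy usage: 4 views, 3 classes; the two most confident views agree on class 1.
logits = np.array([[0.1, 0.2, 0.15],   # low-confidence view
                   [0.0, 4.0, 0.0],    # confident, class 1
                   [0.0, 5.0, 0.0],    # very confident, class 1
                   [3.0, 0.0, 0.0]])   # confident, class 0
pred = zero(logits, temp=1.0, gamma=0.5)  # keeps the top 2 of 4 views
```

Because the zero-temperature softmax collapses each retained view to a one-hot vector, ZERO needs no gradient steps at test time, which is what yields the speed and memory gains over Test-Time Prompt Tuning reported above.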