Diffusion-TTA: Test-time Adaptation of Discriminative Models via Generative Feedback

Authors: Mihir Prabhudesai, Tsung-Wei Ke, Alex Li, Deepak Pathak, Katerina Fragkiadaki

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test our approach on multiple tasks, datasets and model architectures. For classification, we test pre-trained ImageNet classifiers on ImageNet [9] and its out-of-distribution variants (C, R, A, v2, S). Further, we test large-scale open-vocabulary CLIP-based classifiers on the CIFAR100, Food101, FGVC, Oxford Pets, and Flowers102 datasets. For adapting ImageNet classifiers we use DiT [36] as our generative model, which is a diffusion model trained on ImageNet from scratch. For adapting open-vocabulary CLIP-based classifiers, we use Stable Diffusion [40] as our generative model. We show consistent improvements over the initially employed classifier, as shown in Figure 1. We also test adaptation on semantic segmentation and depth estimation tasks, where the performance of SegFormer [49] segmenters and DenseDepth [1] depth predictors is greatly improved on the ADE20K and NYU Depth v2 datasets. For segmentation and depth prediction, we use conditional latent diffusion models [40] that are trained from scratch on their respective datasets. We show extensive ablations of different components of our Diffusion-TTA method, and present analyses on how diffusion generative feedback enhances discriminative models.
Researcher Affiliation | Academia | Mihir Prabhudesai, Tsung-Wei Ke, Alexander C. Li, Deepak Pathak, Katerina Fragkiadaki, {mprabhud,tsungwek,acl2,dpathak,katef}@cs.cmu.edu, Carnegie Mellon University
Pseudocode | Yes | The architecture of the Diffusion-TTA method is shown in Figure 2 and its pseudocode is shown in Algorithm 1.
Open Source Code | Yes | Our code and trained models are publicly available on our project's website: diffusion-tta.github.io/.
Open Datasets | Yes | For classification, we test pre-trained ImageNet classifiers on ImageNet [9] and its out-of-distribution variants (C, R, A, v2, S). Further, we test large-scale open-vocabulary CLIP-based classifiers on the CIFAR100 [26], Food101 [6], Flowers102 [32], FGVC Aircraft [30], and Oxford-IIIT Pets [34] datasets. For adapting ImageNet classifiers we use DiT [36] as our generative model, which is a diffusion model trained on ImageNet from scratch. ... We also test adaptation on semantic segmentation and depth estimation tasks, where the performance of SegFormer [49] segmenters and DenseDepth [1] depth predictors is greatly improved on the ADE20K and NYU Depth v2 datasets.
Dataset Splits | No | The paper states that it uses pre-trained models and evaluates on existing validation/test sets, but it does not provide explicit training/validation/test splits for all data used in its experiments in the format requested (e.g., percentages or full counts across all three split types).
Hardware Specification | Yes | We conduct our experiments on a single NVIDIA A100 40GB VRAM GPU, with a batch size of approximately 180.
Software Dependencies | Yes | We use Stable Diffusion v2.0 [40] to adapt CLIP models. For the adaptation of ImageNet classifiers, we use pre-trained Diffusion Transformers (DiT) [36], specifically their XL/2 model at resolution 256×256, which is trained on ImageNet-1K. (See the loading sketch after this table.)
Experiment Setup | Yes | For test-time adaptation of individual images, we randomly sample 180 different pairs of noise ϵ and timestep t for each adaptation step, composing a mini-batch of size 180. The timestep t is sampled uniformly from the range 1 to 1000 and the noise ϵ is sampled from a unit Gaussian. We apply 5 test-time adaptation steps for each input image. We adopt Stochastic Gradient Descent (SGD) (or the Adam optimizer [24]), and set the learning rate, weight decay and momentum to 0.005 (or 0.00001), 0, and 0.9, respectively. (See the adaptation-loop sketch after this table.)
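
Below is a minimal loading sketch for the pre-trained generative models named in the Software Dependencies row. The use of the Hugging Face diffusers library and the specific checkpoint identifiers are assumptions on my part; the row only specifies DiT XL/2 at 256×256 (trained on ImageNet-1K) and Stable Diffusion v2.0.

```python
# Hedged sketch: load the two generative models the report names.
# The diffusers library and checkpoint ids below are assumptions, not
# something the report specifies.
import torch
from diffusers import DiTPipeline, StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# DiT XL/2 at 256x256, trained on ImageNet-1K; used to adapt ImageNet classifiers.
dit = DiTPipeline.from_pretrained(
    "facebook/DiT-XL-2-256", torch_dtype=torch.float16
).to(device)

# Stable Diffusion v2.0; used to adapt open-vocabulary CLIP-based classifiers.
sd = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to(device)

# Test-time adaptation presumably uses the components of these pipelines
# (e.g. dit.transformer, dit.vae, sd.unet, sd.vae, their schedulers)
# rather than their sampling loops.
```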
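And below is a minimal sketch of the per-image adaptation loop implied by the Research Type and Experiment Setup rows: the classifier's predicted class probabilities weight the class conditioning of the diffusion model, and the resulting diffusion (denoising) loss is backpropagated into the classifier. All helper names (classifier, eps_model, class_embeddings, encode_image, noise_scheduler) are hypothetical placeholders under my assumptions, not the authors' API; the hyperparameters (5 steps, 180 (t, ϵ) pairs, SGD with lr 0.005, momentum 0.9, weight decay 0) are taken from the Experiment Setup row.

```python
# Hedged sketch of diffusion-based test-time adaptation for one image.
# Helper objects are hypothetical placeholders, not the authors' code.
import torch
import torch.nn.functional as F


def adapt_single_image(x, classifier, eps_model, class_embeddings, encode_image,
                       noise_scheduler, steps=5, batch=180, lr=0.005):
    """x: one test image tensor (C, H, W); class_embeddings: (K, D)."""
    optim = torch.optim.SGD(classifier.parameters(), lr=lr,
                            momentum=0.9, weight_decay=0.0)

    for _ in range(steps):
        # Classifier prediction on the clean test image.
        probs = classifier(x.unsqueeze(0)).softmax(dim=-1)            # (1, K)
        # Probability-weighted class conditioning for the diffusion model.
        cond = probs @ class_embeddings                                # (1, D)

        # Sample a mini-batch of (t, eps) pairs: t ~ U{1,...,1000}, eps ~ N(0, I).
        z0 = encode_image(x.unsqueeze(0)).expand(batch, -1, -1, -1)   # latents/pixels
        t = torch.randint(1, 1001, (batch,), device=z0.device)
        eps = torch.randn_like(z0)
        z_t = noise_scheduler.add_noise(z0, eps, t)

        # Diffusion (denoising) loss; its gradient flows through the
        # conditioning back into the classifier weights.
        eps_pred = eps_model(z_t, t, cond.expand(batch, -1))
        loss = F.mse_loss(eps_pred, eps)
        optim.zero_grad()
        loss.backward()
        optim.step()

    # Prediction after generative feedback.
    return classifier(x.unsqueeze(0)).softmax(dim=-1)
```

Only the classifier parameters are stepped in this sketch; a mini-batch of 180 noised samples per step is consistent with the single A100 40GB GPU noted in the Hardware Specification row.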