Test-Time Distribution Normalization for Contrastively Learned Visual-language Models

Authors: Yifei Zhou, Juntao Ren, Fengyu Li, Ramin Zabih, Ser-Nam Lim

NeurIPS 2023

Reproducibility Variables, Results, and LLM Responses
Research Type: Experimental
"Extensive experiments on a wide variety of downstream tasks exhibit a clear advantage of DN over the dot product on top of other existing test-time augmentation methods. Our experiments are designed to answer the following questions: 1) whether our proposed DN can uniformly improve a wide range of cross-modal alignment tasks for different kinds of cross-modal representation models, and whether this gain is larger than that achieved by other zero-shot CLIP augmentations; 2) whether DN can be used in parallel with other common test-time adaptation methods compatible with CLIP; 3) how robust DN is when only scarce, unlabeled data is available to estimate the mean from; and 4) whether DN can improve the performance of fine-tuned models in addition to pre-trained models."
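Since the quoted passage contrasts DN with the plain dot product, a minimal sketch may help. This assumes precomputed, L2-normalized CLIP embeddings; the helper name dn_similarity is illustrative rather than the repository's actual API, and the factor of 0.5 reflects the half-mean subtraction the paper describes.

```python
import torch

def dn_similarity(image_feats, text_feats, image_mean, text_mean, lam=0.5):
    """Distribution-normalized score: subtract a scaled estimate of each
    modality's mean embedding, then take the usual CLIP dot product.
    lam=0.5 mirrors the half-mean subtraction described in the paper."""
    shifted_img = image_feats - lam * image_mean  # (N, d) shifted image embeddings
    shifted_txt = text_feats - lam * text_mean    # (M, d) shifted text embeddings
    return shifted_img @ shifted_txt.T            # (N, M) score matrix
```

In this reading, setting lam=0 recovers the standard dot product, which is exactly the baseline the quoted experiments compare against.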
Researcher Affiliation: Academia
Yifei Zhou, University of California, Berkeley (yifei_zhou@berkeley.edu); Juntao Ren, Cornell University (jlr429@cornell.edu); Fengyu Li, Cornell University (fl334@cornell.edu); Ramin Zabih, Cornell University (rdz@cs.cornell.edu); Ser-Nam Lim, University of Central Florida (sernam@ucf.edu)
Pseudocode: No
The paper describes its methodology using mathematical equations and textual explanations but does not include any pseudocode or algorithm blocks.
Open Source Code: Yes
"Our code is available at https://github.com/fengyuli2002/distribution-normalization."
Open Datasets: Yes
The paper uses several well-known and publicly available datasets, including COCO [39], Flickr30K [48], ImageNet1K [9], CIFAR100 [28], SUN397 [67], Stanford Cars [27], Caltech101 [31], Flowers102 [44], Flickr8k-Expert [18], Flickr8k-CF [18], THumB [24], and Pascal-50S [51], all of which are properly cited.
Dataset Splits: Yes
"For all the retrieval tasks, we estimated the mean with 100 random unlabeled samples from the validation set and calculated standard deviations and average recalls with 5 random seeds. We took a train-test split following [68] and [33], which involves a selection of 30K images for fine-tuning and 1K images for testing."
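The quoted protocol (100 random unlabeled validation samples, averaged over 5 seeds) could look roughly like the sketch below. estimate_means is a hypothetical helper, the toy tensors stand in for precomputed paired CLIP embeddings, and the resulting means would feed the DN score sketched earlier.

```python
import torch

def estimate_means(image_feats, text_feats, n=100, seed=0):
    """Estimate both modality means from n random unlabeled validation samples."""
    g = torch.Generator().manual_seed(seed)
    idx = torch.randperm(image_feats.shape[0], generator=g)[:n]
    return (image_feats[idx].mean(0, keepdim=True),
            text_feats[idx].mean(0, keepdim=True))

# Toy embeddings standing in for precomputed CLIP features (hypothetical shapes).
image_feats = torch.randn(1000, 512)
text_feats = torch.randn(1000, 512)

# One mean estimate per seed; the paper reports recall mean/std over 5 seeds.
means = [estimate_means(image_feats, text_feats, seed=s) for s in range(5)]
```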
Hardware Specification: Yes
"We fine-tune CLIP on the MSCOCO training set, for a total of 10 epochs on 4 Nvidia 2080Ti."
Software Dependencies: No
The paper mentions using the Adam optimizer and gives specific learning rates and weight decays, but does not provide version numbers for any software libraries (e.g., Python, PyTorch, CUDA, scikit-learn) used in the experiments.
Experiment Setup: Yes
"We use the Adam optimizer with a learning rate of 1e-5 and a weight decay of 0.1." "We again use the Adam optimizer with a learning rate of 1e-5, but with a weight decay of 0.02." "We fine-tune CLIP on the MSCOCO training set, for a total of 10 epochs on 4 Nvidia 2080Ti."
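For readers who want to mirror the quoted hyperparameters, a hedged PyTorch sketch follows. clip_model is a placeholder for a loaded CLIP checkpoint, and the paper's actual training script may differ.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for a loaded CLIP model (hypothetical).
clip_model = nn.Linear(512, 512)

# One quoted setting: lr 1e-5, weight decay 0.1;
# the other quoted setting uses the same lr with weight decay 0.02.
optimizer = torch.optim.Adam(clip_model.parameters(), lr=1e-5, weight_decay=0.1)
```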