Three Towers: Flexible Contrastive Learning with Pretrained Image Models

Authors: Jannik Kossen, Mark Collier, Basil Mustafa, Xiao Wang, Xiaohua Zhai, Lucas Beyer, Andreas Steiner, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, 3T consistently improves over LiT and the CLIP-style from-scratch baseline for retrieval tasks. For classification, 3T reliably improves over the from-scratch baseline, and while it underperforms relative to LiT for JFT-pretrained models, it outperforms LiT for ImageNet-21k and Places365 pretraining.
Researcher Affiliation | Collaboration | Jannik Kossen (1), Mark Collier (2), Basil Mustafa (3), Xiao Wang (3), Xiaohua Zhai (3), Lucas Beyer (3), Andreas Steiner (3), Jesse Berent (2), Rodolphe Jenatton (3), Effrosyni Kokiopoulou (2). Affiliations: (1) OATML, Department of Computer Science, University of Oxford; (2) Google Research; (3) Google DeepMind.
Pseudocode | No | The paper describes methods using text and equations (e.g., Eq. 1, 2, 3, 4) and provides architectural diagrams (Fig. 1, 2), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. (An illustrative contrastive-loss sketch is given below the table.)
Open Source Code | No | The paper mentions using 'the open-source vision transformer implementation available from Beyer et al. [4]' (the Big Vision GitHub repository), but does not state that the code for the Three Towers (3T) methodology itself is open-source or provide a link to it.
Open Datasets | Yes | We rely on the recently proposed WebLI dataset [11], a large-scale dataset of 10B image-caption pairs (unfiltered WebLI)... For image tower pretraining, we consider both proprietary JFT-3B [84] and the publicly available IN-21k checkpoints of Dosovitskiy et al. [17]... For image classification, we evaluate on IN-1k [40, 61], CIFAR-100 [40], Caltech-256 [23], Oxford-IIIT Pet [53], Describable Textures (DTD) [13], UC Merced Land Use [80], Stanford Cars [39], ColHist [37], Birds [73], ImageNet variants -C [28], -A [32], -R [31], -v2 [59], ObjectNet [3], EuroSAT [27], Oxford Flowers-102 [50], NWPU-RESISC45 [12], and SUN397 [78].
Dataset Splits | No | As we train for less than one epoch, we do not observe any overfitting, in the sense that contrastive losses are identical on the training and validation set.
Hardware Specification | Yes | We train our models on v3 and v4 TPUs. For our main experiments at L scale, we use 256 TPU chips per experiment... our g scale runs train for about the same duration on only 512 v4 TPU chips.
Software Dependencies | No | We rely on the Jax [5], Flax [26], and TensorFlow [1] Python libraries for our implementation. Additionally, we make use of the Big Vision [4] and Robustness Metrics [15] code bases.
Experiment Setup | Yes | Unless otherwise mentioned, we use Transformers of scale L, with a 16×16 patch size for the ViT image towers, i.e. L/16. We train for 5B examples seen at a batch size of 14 · 1024, i.e. for about 350 000 steps... We use a learning rate of 0.001, warming up linearly for 10 000 steps, before following a cosine decay schedule. We use the Adafactor optimizer [65] with default β1 = 0.9 and β2 = 0.99, and we clip gradients if their norm exceeds 1.0... We use weight decay of 0.001. (A hedged optimizer-configuration sketch based on this description follows below the table.)
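Because the paper expresses its objective only through equations (Eqs. 1-4) rather than pseudocode, the snippet below is a minimal sketch, in plain JAX, of a CLIP-style symmetric contrastive loss of the kind applied between pairs of towers. The function name, temperature value, and normalisation details are illustrative assumptions, not the authors' (Big Vision-based) implementation; the full 3T objective combines several such terms, including losses against the frozen, pretrained third tower.

```python
import jax
import jax.numpy as jnp


def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.01):
    """Illustrative CLIP-style loss between two towers (not the authors' code).

    img_emb, txt_emb: [batch, dim] embeddings from two towers; the i-th image
    and the i-th caption form the positive pair, and all other in-batch pairs
    act as negatives. The temperature value is an assumption.
    """
    # L2-normalise so the dot product is a cosine similarity.
    img_emb = img_emb / jnp.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt_emb = txt_emb / jnp.linalg.norm(txt_emb, axis=-1, keepdims=True)

    logits = img_emb @ txt_emb.T / temperature  # [batch, batch]

    # Positives sit on the diagonal; average the image->text and
    # text->image cross-entropy terms.
    loss_i2t = -jnp.mean(jnp.diag(jax.nn.log_softmax(logits, axis=-1)))
    loss_t2i = -jnp.mean(jnp.diag(jax.nn.log_softmax(logits.T, axis=-1)))
    return 0.5 * (loss_i2t + loss_t2i)
```

In a 3T setup, a loss of this form would be evaluated not only between the main image and text towers but also between each main tower and the (projected) embeddings of the frozen pretrained tower, per the paper's description.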
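The Experiment Setup row translates fairly directly into an optimizer configuration. The sketch below uses optax (the gradient-processing library commonly paired with JAX/Flax); the specific Adafactor variant, the assumption that the cosine decay spans all ~350 000 steps, and the placement of weight decay inside the Adafactor transform are interpretations of the quoted text, not the authors' actual training code.

```python
import optax

# ~5B examples at a batch size of 14 * 1024 gives roughly 350,000 steps
# (assumption: the cosine decay runs over the full schedule, warmup included).
TOTAL_STEPS = 350_000
WARMUP_STEPS = 10_000

# Linear warmup to the peak learning rate of 1e-3, then cosine decay.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1e-3,
    warmup_steps=WARMUP_STEPS,
    decay_steps=TOTAL_STEPS,
)

# Clip gradients at global norm 1.0, then apply an Adafactor-style update with
# beta1 = 0.9 (momentum), beta2 = 0.99 (second-moment decay rate), and weight
# decay of 1e-3. optax's adafactor is used here as a stand-in for the paper's
# (Big Vision) Adafactor variant.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adafactor(
        learning_rate=schedule,
        momentum=0.9,
        decay_rate=0.99,
        multiply_by_parameter_scale=False,
        clipping_threshold=None,  # clipping handled by the chain step above
        weight_decay_rate=1e-3,
    ),
)
```

The resulting `optimizer` is a standard optax `GradientTransformation`, so `optimizer.init(params)` and `optimizer.update(grads, opt_state, params)` plug into an ordinary JAX/Flax training loop.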