Three Towers: Flexible Contrastive Learning with Pretrained Image Models
Authors: Jannik Kossen, Mark Collier, Basil Mustafa, Xiao Wang, Xiaohua Zhai, Lucas Beyer, Andreas Steiner, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, 3T consistently improves over LiT and the CLIP-style from-scratch baseline for retrieval tasks. For classification, 3T reliably improves over the from-scratch baseline, and while it underperforms relative to LiT for JFT-pretrained models, it outperforms LiT for ImageNet-21k and Places365 pretraining. |
| Researcher Affiliation | Collaboration | Jannik Kossen (1), Mark Collier (2), Basil Mustafa (3), Xiao Wang (3), Xiaohua Zhai (3), Lucas Beyer (3), Andreas Steiner (3), Jesse Berent (2), Rodolphe Jenatton (3), Efi Kokiopoulou (2). 1: OATML, Department of Computer Science, University of Oxford; 2: Google Research; 3: Google DeepMind |
| Pseudocode | No | The paper describes methods using text and equations (e.g., Eq. 1, 2, 3, 4) and provides architectural diagrams (Fig. 1, 2), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. (A hedged sketch of the contrastive objective follows the table.) |
| Open Source Code | No | The paper mentions using 'the open-source vision transformer implementation available from Beyer et al. [4]' (Big Vision GitHub repository), but does not state that the code for the Three Towers (3T) methodology itself is open-source or provide a link to it. |
| Open Datasets | Yes | We rely on the recently proposed WebLI dataset [11], a large-scale dataset of 10B image-caption pairs (Unfiltered WebLI)... For image tower pretraining, we consider both proprietary JFT-3B [84] and the publicly available IN-21k checkpoints of Dosovitskiy et al. [17]... For image classification, we evaluate on IN-1k [40, 61], CIFAR-100 [40], Caltech-256 [23], Oxford-IIIT Pet [53], Describable Textures (DTD) [13], UC Merced Land Use [80], Stanford Cars [39], Col-Hist [37], Birds [73], ImageNet variants -C [28], -A [32], -R [31], -v2 [59], ObjectNet [3], EuroSat [27], Oxford Flowers-102 [50], NWPU-RESISC45 [12], and SUN397 [78]. |
| Dataset Splits | No | As we train for less than one epoch, we do not observe any overfitting, in the sense that contrastive losses are identical on the training and validation set. |
| Hardware Specification | Yes | We train our models on v3 and v4 TPUs. For our main experiments at L scale, we use 256 TPU chips per experiment... our g scale runs train for about the same duration on only 512 v4 TPU chips. |
| Software Dependencies | No | We rely on the Jax [5], Flax [26], and TensorFlow [1] Python libraries for our implementation. Additionally, we make use of the Big Vision [4] and Robustness Metrics [15] code bases. |
| Experiment Setup | Yes | Unless otherwise mentioned, we use Transformers of scale L, with a 16×16 patch size for the ViT image towers, i.e. L/16. We train for 5B examples seen at a batch size of 14 · 1024, i.e. for about 350,000 steps... We use a learning rate of 0.001, warming up linearly for 10,000 steps, before following a cosine decay schedule. We use the Adafactor optimizer [65] with default β1 = 0.9 and β2 = 0.99, and we clip gradients if their norm exceeds 1.0... We use weight decay of 0.001. (An optax-style sketch of this configuration follows the table.) |
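
Since the paper specifies its objective only through equations (see the Pseudocode row above), the following is a minimal, hedged sketch of the standard two-tower softmax contrastive (CLIP/LiT-style) loss that those equations build on; 3T itself adds further contrastive terms against a frozen, pretrained third tower. The function name and usage are illustrative, not the authors' code.

```python
# Minimal sketch of a CLIP/LiT-style symmetric softmax contrastive loss in JAX.
# Names and the fixed temperature are illustrative assumptions, not the 3T implementation.
import jax
import jax.numpy as jnp


def contrastive_loss(image_emb, text_emb, temperature=0.01):
    """Symmetric InfoNCE loss over a batch of aligned image/text embeddings.

    image_emb, text_emb: [batch, dim] arrays; the i-th rows form a positive pair.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / jnp.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / jnp.linalg.norm(text_emb, axis=-1, keepdims=True)

    # [batch, batch] similarity matrix, scaled by the inverse temperature.
    logits = image_emb @ text_emb.T / temperature

    labels = jnp.arange(logits.shape[0])  # matching pairs lie on the diagonal
    loss_i2t = -jnp.mean(jax.nn.log_softmax(logits, axis=-1)[labels, labels])
    loss_t2i = -jnp.mean(jax.nn.log_softmax(logits, axis=0)[labels, labels])
    return 0.5 * (loss_i2t + loss_t2i)


# Toy usage with random embeddings.
key_i, key_t = jax.random.split(jax.random.PRNGKey(0))
img = jax.random.normal(key_i, (8, 512))
txt = jax.random.normal(key_t, (8, 512))
print(contrastive_loss(img, txt))
```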
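
The training configuration in the Experiment Setup row maps naturally onto an optax optimizer chain. The sketch below is an assumption-based illustration of those hyperparameters (learning rate 0.001 with 10,000-step linear warmup and cosine decay, Adafactor with β1 = 0.9 and β2 = 0.99, gradient clipping at norm 1.0, weight decay 0.001); the actual runs use the Big Vision training code, whose Adafactor variant may differ in details.

```python
# Illustrative optax setup mirroring the reported hyperparameters.
# This is a sketch, not the Big Vision implementation used in the paper.
import optax

TOTAL_STEPS = 350_000   # ~5B examples seen at batch size 14 * 1024
WARMUP_STEPS = 10_000

# Linear warmup to the base learning rate of 0.001, then cosine decay.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1e-3,
    warmup_steps=WARMUP_STEPS,
    decay_steps=TOTAL_STEPS,
)

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),        # clip gradients if their norm exceeds 1.0
    optax.adafactor(
        learning_rate=schedule,
        momentum=0.9,                      # reported beta1 = 0.9
        decay_rate=0.99,                   # reported beta2 = 0.99
        weight_decay_rate=1e-3,            # weight decay of 0.001
    ),
)
```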