LocCa: Visual Pretraining with Location-aware Captioners
Authors: Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim M. Alabdulmohsin, Xiao Wang, André Susano Pinto, Andreas Steiner, Lucas Beyer, Xiaohua Zhai
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to evaluate LocCa. The integration of location-aware cues enables LocCa to maintain its performance on holistic image understanding tasks while achieving substantially improved outcomes on location-aware tasks. |
| Researcher Affiliation | Collaboration | 1 Google DeepMind, Zürich; 2 Google, Zürich; 3 KU Leuven |
| Pseudocode | No | The paper describes the model architecture and training process in text but does not provide a formal pseudocode block or algorithm figure. |
| Open Source Code | No | The code will be released soon. |
| Open Datasets | Yes | We use a subset of the WebLI dataset [10] corresponding to English websites and apply text-based filtering [8] to obtain 1B image/alt-text pairs. |
| Dataset Splits | Yes | We report the standard metric Acc@0.5 on the validation and test sets. |
| Hardware Specification | Yes | The pretraining of LocCa-L takes 153 hours using 256 TPUv3 chips. |
| Software Dependencies | No | Alt-texts are tokenized into a vocabulary consisting of 32,000 tokens using a SentencePiece model trained on the English segment of C4 [65]... We used the big_vision codebase [98, 99] for all experiments in this project. |
| Experiment Setup | Yes | LocCa is pretrained for about 9 billion image/alt-text seen examples, which corresponds to about 9 epochs on our tailored subset of WebLI. For the optimizer, we employ the Scaling-ViT AdaFactor variant [4], combined with a cosine schedule that includes 10,000 warmup steps. The batch size is set at 8,192, while the learning rate and decay factor are set to 10⁻³ and 10⁻⁴, respectively. During this process, images are uniformly resized to a resolution of 224×224 pixels. Alt-texts are tokenized into a vocabulary consisting of 32,000 tokens using a SentencePiece model trained on the English segment of C4 [65], with a cap on the sequence length at 64 tokens. (See the config sketch after this table.) |
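
The quoted setup can be read as a training configuration. Below is a minimal, hypothetical sketch in the style of a big_vision config: the function name `get_config`, the field names, and the optimizer key `big_vision.scale_by_adafactor` are illustrative assumptions rather than the authors' released configuration, the numeric values are those quoted above, and the "decay factor" is interpreted here as weight decay.

```python
import ml_collections


def get_config():
    """Hypothetical big_vision-style config mirroring the quoted LocCa setup."""
    config = ml_collections.ConfigDict()

    # ~9B image/alt-text examples at batch size 8,192 is roughly
    # 9e9 / 8192 ≈ 1.1M optimizer steps.
    config.total_examples_seen = 9_000_000_000
    config.batch_size = 8_192

    # Images resized to 224x224; alt-texts tokenized with a 32k-token
    # SentencePiece model (trained on the English segment of C4),
    # capped at 64 tokens.
    config.image_size = (224, 224)
    config.vocab_size = 32_000
    config.max_text_len = 64

    # Optimizer: Scaling-ViT AdaFactor variant with a cosine schedule and
    # 10,000 warmup steps; learning rate 1e-3 and "decay factor" 1e-4
    # (interpreted here as weight decay -- an assumption).
    config.optax_name = "big_vision.scale_by_adafactor"  # assumed key name
    config.lr = 1e-3
    config.wd = 1e-4
    config.schedule = dict(decay_type="cosine", warmup_steps=10_000)

    return config
```

As a quick sanity check, at 8,192 examples per step, reaching roughly 9 billion seen examples takes about 1.1 million steps, which is consistent with the multi-day TPU budget reported in the Hardware Specification row.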