LocCa: Visual Pretraining with Location-aware Captioners

Authors: Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim M. Alabdulmohsin, Xiao Wang, André Susano Pinto, Andreas Steiner, Lucas Beyer, Xiaohua Zhai

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to evaluate LocCa. The integration of location-aware cues enables LocCa to maintain its performance on holistic image understanding tasks while achieving substantially improved outcomes on location-aware tasks.
Researcher Affiliation | Collaboration | 1 Google DeepMind, Zürich; 2 Google, Zürich; 3 KU Leuven
Pseudocode | No | The paper describes the model architecture and training process in text but does not provide a formal pseudocode block or algorithm figure.
Open Source Code | No | The code will be released soon.
Open Datasets | Yes | We use a subset of the WebLI dataset [10] corresponding to English websites and apply text-based filtering [8] to obtain 1B image/alt-text pairs.
Dataset Splits | Yes | We report the standard metric Acc@0.5 on the validation and test sets.
Hardware Specification | Yes | The pretraining of LocCa-L takes 153 hours using 256 TPUv3 chips.
Software Dependencies | No | Alt-texts are tokenized into a vocabulary consisting of 32,000 tokens using a SentencePiece model trained on the English segment of C4 [65]... We used the big_vision codebase [98, 99] for all experiments in this project.
Experiment Setup | Yes | LocCa is pretrained for about 9 billion image/alt-text seen examples, which corresponds to about 9 epochs on our tailored subset of WebLI. For the optimizer, we employ the Scaling ViT AdaFactor variant [4], combined with a cosine schedule that includes 10,000 warmup steps. The batch size is set at 8,192, while the learning rate and decay factor are adjusted to 10^-3 and 10^-4, respectively. During this process, images are uniformly resized to a resolution of 224 x 224 pixels. Alt-texts are tokenized into a vocabulary consisting of 32,000 tokens using a SentencePiece model trained on the English segment of C4 [65], with a cap on the sequence length at 64 tokens.