LocCa: Visual Pretraining with Location-aware Captioners
Authors: Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim M. Alabdulmohsin, Xiao Wang, André Susano Pinto, Andreas Steiner, Lucas Beyer, Xiaohua Zhai
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to evaluate LocCa. The integration of location-aware cues enables LocCa to maintain its performance on holistic image understanding tasks while achieving substantially improved outcomes on location-aware tasks. |
| Researcher Affiliation | Collaboration | 1 Google DeepMind, Zürich; 2 Google, Zürich; 3 KU Leuven |
| Pseudocode | No | The paper describes the model architecture and training process in text but does not provide a formal pseudocode block or algorithm figure. |
| Open Source Code | No | The code will be released soon. |
| Open Datasets | Yes | We use a subset of the WebLI dataset [10] corresponding to English websites and apply text-based filtering [8] to obtain 1B image/alt-text pairs. |
| Dataset Splits | Yes | We report the standard metric Acc@0.5 on the validation and test sets. |
| Hardware Specification | Yes | The pretraining of LocCa-L takes 153 hours using 256 TPUv3 chips. |
| Software Dependencies | No | Alt-texts are tokenized into a vocabulary consisting of 32,000 tokens using a SentencePiece model trained on the English segment of C4 [65]... We used the big_vision codebase [98, 99] for all experiments in this project. |
| Experiment Setup | Yes | LocCa is pretrained for about 9 billion image/alt-text seen examples, which corresponds to about 9 epochs on our tailored subset of WebLI. For the optimizer, we employ the Scaling-ViT AdaFactor variant [4], combined with a cosine schedule that includes 10,000 warmup steps. The batch size is set at 8,192, while the learning rate and decay factor are set to 10⁻³ and 10⁻⁴, respectively. During this process, images are uniformly resized to a resolution of 224×224 pixels. Alt-texts are tokenized into a vocabulary consisting of 32,000 tokens using a SentencePiece model trained on the English segment of C4 [65], with a cap on the sequence length at 64 tokens. (See the config sketch after this table.) |
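
The quoted setup can be read as a training configuration. Below is a minimal, hypothetical sketch in the style of a big_vision config: the function name `get_config`, the field names, and the optimizer key `big_vision.scale_by_adafactor` are illustrative assumptions rather than the authors' released configuration, the numeric values are those quoted above, and the "decay factor" is interpreted here as weight decay.

```python
import ml_collections


def get_config():
    """Hypothetical big_vision-style config mirroring the quoted LocCa setup."""
    config = ml_collections.ConfigDict()

    # ~9B image/alt-text examples at batch size 8,192 is roughly
    # 9e9 / 8192 ≈ 1.1M optimizer steps.
    config.total_examples_seen = 9_000_000_000
    config.batch_size = 8_192

    # Images resized to 224x224; alt-texts tokenized with a 32k-token
    # SentencePiece model (trained on the English segment of C4),
    # capped at 64 tokens.
    config.image_size = (224, 224)
    config.vocab_size = 32_000
    config.max_text_len = 64

    # Optimizer: Scaling-ViT AdaFactor variant with a cosine schedule and
    # 10,000 warmup steps; learning rate 1e-3 and "decay factor" 1e-4
    # (interpreted here as weight decay -- an assumption).
    config.optax_name = "big_vision.scale_by_adafactor"  # assumed key name
    config.lr = 1e-3
    config.wd = 1e-4
    config.schedule = dict(decay_type="cosine", warmup_steps=10_000)

    return config
```

As a quick sanity check, at 8,192 examples per step, reaching roughly 9 billion seen examples takes about 1.1 million steps, which is consistent with the multi-day TPU budget reported in the Hardware Specification row.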