Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

LocCa: Visual Pretraining with Location-aware Captioners

Authors: Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim M. Alabdulmohsin, Xiao Wang, André Susano Pinto, Andreas Steiner, Lucas Beyer, Xiaohua Zhai

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to evaluate LocCa. The integration of location-aware cues enables LocCa to maintain its performance on holistic image understanding tasks while achieving substantially improved outcomes on location-aware tasks.
Researcher Affiliation | Collaboration | 1 Google DeepMind, Zürich; 2 Google, Zürich; 3 KU Leuven
Pseudocode | No | The paper describes the model architecture and training process in text but does not provide a formal pseudocode block or algorithm figure.
Open Source Code | No | The code will be released soon.
Open Datasets | Yes | We use a subset of the WebLI dataset [10] corresponding to English websites and apply text-based filtering [8] to obtain 1B image/alt-text pairs.
Dataset Splits | Yes | We report the standard metric Acc@0.5 on the validation and test sets.
Hardware Specification | Yes | The pretraining of LocCa-L takes 153 hours using 256 TPUv3 chips.
Software Dependencies | No | Alt-texts are tokenized into a vocabulary consisting of 32,000 tokens using a sentence piece model trained on the English segment of C4 [65]... We used the big_vision codebase [98, 99] for all experiments in this project.
Experiment Setup | Yes | LocCa is pretrained for about 9 billion image/alt-text seen examples, which corresponds to about 9 epochs on our tailored subset of WebLI. For the optimizer, we employ the Scaling-ViT AdaFactor variant [4], combined with a cosine schedule that includes 10,000 warmup steps. The batch size is set at 8,192, while the learning rate and decay factor are adjusted to 10^-3 and 10^-4, respectively. During this process, images are uniformly resized to a resolution of 224 x 224 pixels. Alt-texts are tokenized into a vocabulary consisting of 32,000 tokens using a sentence piece model trained on the English segment of C4 [65], with a cap on the sequence length at 64 tokens.
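
The figures quoted in the Hardware Specification and Experiment Setup rows can be turned into rough derived quantities. The following is a minimal sketch, not the authors' code and not the big_vision API: the function names are hypothetical, the step count assumes "about 9 billion" means exactly 9e9 examples divided by the batch size, the peak learning rate is read as 10^-3, and the schedule shape (linear warmup, cosine decay to zero) is an assumption, since the quoted text only says "cosine schedule" with 10,000 warmup steps.

```python
import math

# Values quoted in the Experiment Setup row.
SEEN_EXAMPLES = 9_000_000_000  # "about 9 billion"; exact value is an assumption
BATCH_SIZE = 8_192
WARMUP_STEPS = 10_000
PEAK_LR = 1e-3                 # assumption: learning rate read as 10^-3

# Values quoted in the Hardware Specification row (LocCa-L).
TPU_CHIPS = 256
PRETRAIN_HOURS = 153


def total_steps(seen_examples=SEEN_EXAMPLES, batch_size=BATCH_SIZE):
    """Approximate number of optimizer steps (hypothetical helper)."""
    return seen_examples // batch_size


def lr_at(step, total, warmup=WARMUP_STEPS, peak=PEAK_LR):
    """Learning rate at a step, assuming linear warmup then cosine decay to 0."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * min(1.0, progress)))


def chip_hours(chips=TPU_CHIPS, hours=PRETRAIN_HOURS):
    """Total accelerator time as chips multiplied by wall-clock hours."""
    return chips * hours


if __name__ == "__main__":
    steps = total_steps()
    print("approx. training steps:", steps)
    print("TPUv3 chip-hours:", chip_hours())
    for s in (0, WARMUP_STEPS // 2, WARMUP_STEPS, steps // 2, steps):
        print(f"step {s:>9}: lr = {lr_at(s, steps):.6f}")
```

Under these assumptions the run is on the order of 1.1 million steps and roughly 39,000 TPUv3 chip-hours; both numbers are estimates derived from the quoted text rather than values reported in the paper.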