Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Authors: Mathilde Caron, Alireza Fathi, Cordelia Schmid, Ahmet Iscen

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment. Each entry below gives the reproducibility variable, the result, and the supporting LLM response.

Research Type: Experimental
LLM response: "Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks (e.g. +6.9% improvement in OVEN entity task), underscoring the importance of high-quality training data in this domain." "We conduct extensive experiments to evaluate the effectiveness of our approach. The results demonstrate that models trained on our automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks, notably on the challenging Open-domain Visual Entity recognitioN (OVEN) benchmark [17] (e.g. +6.9% on the OVEN entity split and +3.8% on the OVEN query split)."

Researcher Affiliation: Industry
LLM response: "Mathilde Caron, Alireza Fathi, Cordelia Schmid, Ahmet Iscen (Google DeepMind)."

Pseudocode: No
LLM response: "The paper does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code."

Open Source Code: No
LLM response: "Releasing code and data is not possible in our case."

Open Datasets: Yes
LLM response: "Our training dataset builds upon the Entity-WebLI dataset [7] (see Sec. 3.1) which itself is based on WebLI [9], a dataset already deduplicated against the train, val, and test splits of 68 common vision/vision-language datasets [9]. We also validate our dataset refining methodology using the LAION dataset [46] as the image-caption base dataset."

Dataset Splits: Yes
LLM response: "Our training dataset builds upon the Entity-WebLI dataset [7] (see Sec. 3.1) which itself is based on WebLI [9], a dataset already deduplicated against the train, val, and test splits of 68 common vision/vision-language datasets [9]. OVEN validation and test splits are divided into seen and unseen entities."

Hardware Specification: Yes
LLM response: "Our models are trained on 256 TPUv3."

Software Dependencies: Yes
LLM response: "We use GiT-Large [56]: it consists of a visual encoder (CLIP-L/14 [41]) and a 6-layer text decoder with internal dimension d = 768. Following [7], the visual encoder is first pre-trained jointly on WebLI-100M [9] and Conceptual Captions-12M [47] while the decoder is randomly initialized. We use AdamW optimizer [27] and a cosine learning rate schedule with final learning rate of 0. We use standard inception crop data augmentation. For the multimodal LLM, we use Gemini Pro [15]." A sketch of this model wiring is given after the table.

Experiment Setup: Yes
LLM response: "We use batch size of 4096, learning rate of 1e-5 for the visual encoder and 1e-4 for the decoder, label smoothing of 0.2 and no weight decay. We use AdamW optimizer [27] and a cosine learning rate schedule with final learning rate of 0. We use standard inception crop data augmentation for the images. We set the maximum decoding length to 32 tokens and the maximum number of context tokens to 32 tokens as well. The decoding beam size is set to 30." A sketch of this training configuration follows the model sketch below.
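
To make the reported architecture concrete, the sketch below wires up a GiT-style captioner with the shapes from the Software Dependencies entry (CLIP-L/14 visual encoder feeding a 6-layer text decoder with internal dimension 768). This is a minimal PyTorch sketch, not the authors' implementation: the visual encoder is a stub standing in for the pretrained CLIP-L/14 backbone, and the vocabulary size, number of attention heads, and image-token count are illustrative placeholders.

    import torch
    import torch.nn as nn

    class VisualEncoderStub(nn.Module):
        """Placeholder for the CLIP-L/14 visual encoder (pre-trained on
        WebLI-100M and CC12M in the paper). It emits learned tokens so the
        sketch runs end to end without the real backbone."""
        def __init__(self, d_model: int = 768, num_tokens: int = 257):
            super().__init__()
            self.tokens = nn.Parameter(torch.randn(1, num_tokens, d_model))

        def forward(self, images: torch.Tensor) -> torch.Tensor:
            # (B, num_tokens, d_model) image tokens for the decoder to attend to
            return self.tokens.expand(images.shape[0], -1, -1)

    class GiTLargeSketch(nn.Module):
        """GiT-style captioner: visual encoder + 6-layer text decoder, d = 768."""
        def __init__(self, vocab_size: int = 32_000, d_model: int = 768,
                     decoder_layers: int = 6, nhead: int = 12):
            super().__init__()
            self.encoder = VisualEncoderStub(d_model)
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead,
                                               batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=decoder_layers)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, images: torch.Tensor, token_ids: torch.Tensor):
            memory = self.encoder(images)        # image tokens
            tgt = self.embed(token_ids)          # embedded text tokens
            causal = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
            hidden = self.decoder(tgt, memory, tgt_mask=causal)
            return self.lm_head(hidden)          # (B, T, vocab) logits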
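
And a minimal sketch of the optimizer and hyperparameters reported in the Experiment Setup entry, assuming the GiTLargeSketch class from the previous sketch; the total number of training steps is not given in this excerpt, so it is a placeholder, and the parameter grouping (encoder vs. decoder side) is an assumption about how the two learning rates are applied.

    import torch
    from torch.optim import AdamW
    from torch.optim.lr_scheduler import CosineAnnealingLR

    model = GiTLargeSketch()
    total_steps = 100_000  # placeholder: training length is not stated in this excerpt

    # Separate learning rates: 1e-5 for the visual encoder, 1e-4 for the
    # decoder side; no weight decay.
    decoder_params = (list(model.embed.parameters())
                      + list(model.decoder.parameters())
                      + list(model.lm_head.parameters()))
    optimizer = AdamW(
        [
            {"params": model.encoder.parameters(), "lr": 1e-5},
            {"params": decoder_params, "lr": 1e-4},
        ],
        weight_decay=0.0,
    )

    # Cosine learning-rate schedule decaying to a final learning rate of 0.
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=0.0)

    # Label smoothing of 0.2 on the captioning cross-entropy loss.
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.2)

    # Remaining settings from the table, recorded as constants.
    BATCH_SIZE = 4096        # global batch size
    MAX_DECODE_TOKENS = 32   # maximum decoding length
    MAX_CONTEXT_TOKENS = 32  # maximum number of context tokens
    BEAM_SIZE = 30           # beam size used at decoding time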