Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach
Authors: Mathilde Caron, Alireza Fathi, Cordelia Schmid, Ahmet Iscen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to evaluate the effectiveness of our approach. The results demonstrate that models trained on our automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks, notably on the challenging Open-domain Visual Entity recognitioN (OVEN) benchmark [17] (e.g. +6.9% on the OVEN entity split and +3.8% on the OVEN query split), underscoring the importance of high-quality training data in this domain. |
| Researcher Affiliation | Industry | Mathilde Caron, Alireza Fathi, Cordelia Schmid, Ahmet Iscen (Google DeepMind) |
| Pseudocode | No | The paper does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code. |
| Open Source Code | No | Releasing code and data is not possible in our case. |
| Open Datasets | Yes | Our training dataset builds upon the Entity-WebLI dataset [7] (see Sec. 3.1) which itself is based on WebLI [9], a dataset already deduplicated against the train, val, and test splits of 68 common vision/vision-language datasets [9]. We also validate our dataset refining methodology using the LAION dataset [46] as the image-caption base dataset. |
| Dataset Splits | Yes | Our training dataset builds upon the Entity-WebLI dataset [7] (see Sec. 3.1) which itself is based on WebLI [9], a dataset already deduplicated against the train, val, and test splits of 68 common vision/vision-language datasets [9]. OVEN validation and test splits are divided into seen and unseen entities. |
| Hardware Specification | Yes | Our models are trained on 256 TPUv3. |
| Software Dependencies | Yes | We use GiT-Large [56]: it consists of a visual encoder (CLIP-L/14 [41]) and a 6-layer text decoder with internal dimension d = 768. Following [7], the visual encoder is first pre-trained jointly on WebLI-100M [9] and Conceptual Captions-12M [47] while the decoder is randomly initialized. We use AdamW optimizer [27] and a cosine learning rate schedule with final learning rate of 0. We use standard inception crop data augmentation. For the multimodal LLM, we use Gemini Pro [15]. |
| Experiment Setup | Yes | We use batch size of 4096, learning rate of 1e-5 for the visual encoder and 1e-4 for the decoder, label smoothing of 0.2 and no weight decay. We use AdamW optimizer [27] and a cosine learning rate schedule with final learning rate of 0. We use standard inception crop data augmentation for the images. We set the maximum decoding length to 32 tokens and the maximum number of context tokens to 32 tokens as well. The decoding beam size is set to 30. (A configuration sketch of these settings is given after the table.) |
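
The paper does not release code, so the following is a minimal sketch of the quoted fine-tuning and decoding settings, assuming a JAX/optax setup (the authors train on TPUv3, but the actual framework, the `NUM_TRAIN_STEPS` value, and the helper names below are assumptions, not the authors' implementation).

```python
import jax
import optax

# Reported: batch size 4096. The total number of training steps is not stated
# in the quoted setup, so NUM_TRAIN_STEPS is a hypothetical placeholder.
BATCH_SIZE = 4096
NUM_TRAIN_STEPS = 100_000  # assumption, not reported

def make_optimizer(peak_lr: float) -> optax.GradientTransformation:
    """AdamW with a cosine schedule decaying to a final learning rate of 0."""
    schedule = optax.cosine_decay_schedule(
        init_value=peak_lr,
        decay_steps=NUM_TRAIN_STEPS,
        alpha=0.0,  # final learning rate of 0, as reported
    )
    return optax.adamw(learning_rate=schedule, weight_decay=0.0)  # no weight decay

# Reported learning rates: 1e-5 for the CLIP-L/14 visual encoder,
# 1e-4 for the 6-layer text decoder.
encoder_optimizer = make_optimizer(1e-5)
decoder_optimizer = make_optimizer(1e-4)

def caption_loss(logits, target_tokens, vocab_size):
    """Cross-entropy on decoder outputs with label smoothing of 0.2."""
    targets = optax.smooth_labels(jax.nn.one_hot(target_tokens, vocab_size), alpha=0.2)
    return optax.softmax_cross_entropy(logits, targets).mean()

# Inference-time decoding settings quoted in the table.
DECODE_CONFIG = dict(
    max_decode_len=32,   # maximum decoding length in tokens
    max_context_len=32,  # maximum number of context tokens
    beam_size=30,        # beam search width
)
```

Defining one optimizer per parameter group mirrors the two reported learning rates; in practice these would be combined over an encoder/decoder parameter labelling (e.g. with `optax.multi_transform`), which is omitted here for brevity.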