Improving fine-grained understanding in image-text pre-training
Authors: Ioana Bica, Anastasija Ilic, Matthias Bauer, Goker Erdogan, Matko Bošnjak, Christos Kaplanis, Alexey A. Gritsenko, Matthias Minderer, Charles Blundell, Razvan Pascanu, Jovana Mitrovic
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We thoroughly evaluate SPARC and show improved performance over competing approaches both on image-level tasks relying on coarse-grained information, e.g., classification, as well as region-level tasks relying on fine-grained information, e.g., retrieval, object detection, and segmentation, while also improving model faithfulness and captioning in foundational vision-language models. |
| Researcher Affiliation | Industry | 1Google Deep Mind, London, UK 2Google Deep Mind, Zurich, Switzerland. |
| Pseudocode | Yes | We provide the pseudo-code for SPARC in Appendix C. Listing 1 provides JAX-alike pseudo-code for the SPARC objective detailing the construction of both the global and local losses. (An illustrative sketch of such an objective is given after the table.) |
| Open Source Code | No | The paper states that baselines follow publicly available code, but it does not explicitly state that the code for SPARC, the method described in this paper, is open-source or provide a link to it: 'Our implementation of baselines follow the publicly available code (where available2) with a few minor differences we outline here.' Footnote 2 provides links for GLoRIA and MGCA, not SPARC. |
| Open Datasets | Yes | Datasets. We train on large-scale datasets ALIGN (Jia et al., 2021), JFT (Sun et al., 2017; Zhai et al., 2022) and LTIP (Long Text & Image Pairs) (Alayrac et al., 2022). ALIGN has 1.8 billion images paired with noisy alt-text, JFT has 4 billion images semi-automatically annotated with a class-hierarchy of 30k labels, while LTIP has 312 million higher-quality image-text pairs with richer image captions. |
| Dataset Splits | Yes | We evaluate the resulting model on the large-vocabulary dataset LVIS (Gupta et al., 2019) which is well-suited for testing the transfer of knowledge from image-level pretraining. LVIS contains 1203 categories of objects, of which 307 rare categories are excluded from the training data to measure zero-shot transfer from pretraining. Moreover, we also evaluate detection on the 80 MSCOCO classes. We run detection training three times and report mean and standard deviation in Table 3. ... We evaluate the models on captioning tasks on MSCOCO and Flickr30k. |
| Hardware Specification | Yes | To understand the computational and memory requirements of different methods, we measure the compute and peak memory usage for one update step for different batch sizes when training on 256 TPUs v3. |
| Software Dependencies | No | The paper mentions software components like the 'AdamW optimizer' (citing Loshchilov & Hutter, 2017), 'Vision Transformers (ViTs)' (citing Dosovitskiy et al., 2020), 'Transformers' (citing Vaswani et al., 2017), and 'sentencepiece tokenizer' (citing Kudo & Richardson, 2018). However, it does not provide specific version numbers for these software components, which are necessary for reproducible descriptions. |
| Experiment Setup | Yes | We resize images to 224 × 224 and tokenize the text with a 32k vocabulary sentencepiece tokenizer (Kudo & Richardson, 2018) while keeping a maximum number of 55 tokens for each caption. We train all models using the AdamW (Loshchilov & Hutter, 2017) optimizer, a cosine learning rate schedule with linear warm-up and weight decay regularization. We use a batch size of 16348 and train ViT-B models for 200k steps (~3.2 billion data points) and ViT-L models for 250k steps (~4.1 billion data points). See Appendix D for more details. ... For the other SPARC hyperparameters, we set the global loss weight λg = 0.5 and we sweep the local loss weight in λf ∈ {0.5, 1.0, 5.0, 10.0}. Moreover, we use a learned temperature parameter τ. (A hedged optax sketch of this optimiser setup also appears after the table.) |
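The pseudocode row above describes a SPARC objective built from a global contrastive loss and a fine-grained local loss. Below is a minimal JAX-style sketch of such an objective, written from that description rather than from the paper's Listing 1: the min-max normalisation, the 1/num_patches sparsification threshold, and all function and argument names (`sparc_loss`, `lambda_g`, `lambda_f`, `token_mask`, ...) are assumptions for illustration, not the authors' released implementation.

```python
# Hedged sketch of a SPARC-style objective (global CLIP-like loss plus a
# token-to-patch local loss). Details not quoted in the table above
# (thresholding, masking) are assumptions.
import jax.numpy as jnp
import optax


def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (jnp.linalg.norm(x, axis=axis, keepdims=True) + eps)


def global_loss(img_emb, txt_emb, tau):
    # img_emb, txt_emb: [B, D] pooled, L2-normalised embeddings.
    logits = img_emb @ txt_emb.T / tau                                  # [B, B]
    labels = jnp.arange(img_emb.shape[0])
    loss_i2t = optax.softmax_cross_entropy_with_integer_labels(logits, labels)
    loss_t2i = optax.softmax_cross_entropy_with_integer_labels(logits.T, labels)
    return 0.5 * (loss_i2t.mean() + loss_t2i.mean())


def local_loss(patch_emb, token_emb, token_mask, tau):
    # patch_emb: [B, P, D], token_emb: [B, T, D], token_mask: [B, T] (1.0 = real token).
    sim = jnp.einsum("btd,bpd->btp", token_emb, patch_emb)              # token-patch similarity
    # Min-max normalise per token, then sparsify: drop weights below 1/P (assumption).
    sim_min = sim.min(axis=-1, keepdims=True)
    sim_max = sim.max(axis=-1, keepdims=True)
    align = (sim - sim_min) / (sim_max - sim_min + 1e-8)
    num_patches = patch_emb.shape[1]
    align = jnp.where(align >= 1.0 / num_patches, align, 0.0)
    align = align / (align.sum(axis=-1, keepdims=True) + 1e-8)
    # Language-grouped vision embeddings: one weighted patch combination per token.
    grouped = l2_normalize(jnp.einsum("btp,bpd->btd", align, patch_emb))
    tokens = l2_normalize(token_emb)
    # Sequence-wise contrastive loss within each image-text pair.
    logits = jnp.einsum("btd,bsd->bts", tokens, grouped) / tau          # [B, T, T]
    labels = jnp.broadcast_to(jnp.arange(logits.shape[1]), logits.shape[:2])
    loss_t2v = optax.softmax_cross_entropy_with_integer_labels(logits, labels)
    loss_v2t = optax.softmax_cross_entropy_with_integer_labels(
        jnp.swapaxes(logits, 1, 2), labels)
    per_token = 0.5 * (loss_t2v + loss_v2t) * token_mask                # ignore padding tokens
    return (per_token.sum(axis=-1) / token_mask.sum(axis=-1)).mean()


def sparc_loss(img_emb, txt_emb, patch_emb, token_emb, token_mask,
               tau, lambda_g=0.5, lambda_f=1.0):
    # Weighted sum of global and local terms; the lambda values mirror the sweep quoted above.
    return (lambda_g * global_loss(img_emb, txt_emb, tau)
            + lambda_f * local_loss(patch_emb, token_emb, token_mask, tau))
```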
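Similarly, the quoted optimiser setup (AdamW with a cosine learning rate schedule, linear warm-up and weight decay) maps onto a short optax configuration. The peak learning rate, warm-up length and weight-decay value below are placeholders, since the table does not quote them.

```python
# Hedged optax sketch of the quoted training setup: AdamW with a cosine
# learning rate schedule and linear warm-up. Numeric values are placeholders.
import optax


def make_optimizer(peak_lr=1e-3, warmup_steps=10_000,
                   total_steps=200_000, weight_decay=1e-4):
    schedule = optax.warmup_cosine_decay_schedule(
        init_value=0.0,
        peak_value=peak_lr,
        warmup_steps=warmup_steps,
        decay_steps=total_steps,   # 200k steps for ViT-B, 250k for ViT-L per the table
        end_value=0.0,
    )
    return optax.adamw(learning_rate=schedule, weight_decay=weight_decay)
```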