CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?
Authors: Ibrahim Alabdulmohsin, Xiao Wang, Andreas Peter Steiner, Priya Goyal, Alexander D'Amour, Xiaohua Zhai
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our study also explores the dynamic nature of how CLIP learns/unlearns biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Also, data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval. Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g., applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with recommendations for improving the efficacy of data balancing in multimodal systems. |
| Researcher Affiliation | Industry | Google DeepMind: Zürich, Switzerland; New York, USA; Boston, USA. {ibomohsin,xzhai}@google.com |
| Pseudocode | Yes | An overview of the data balancing algorithm is shown in Figure 7. It maintains two optimization variables v ∈ ℝ^{2m(c+1)} and µ ∈ ℝ, which are used to calculate the sample weight q by solving: ... Figure 7: Pseudo-code of the data balancing algorithm in Section 5. LEFT: Single update per example (s, y, u), where u is the example's utility. RIGHT: NumPy-like implementation of the bias vector a. (A hedged NumPy sketch of this per-example update pattern is given after the table.) |
| Open Source Code | No | The paper does not provide explicit statements about the release of its own source code (e.g., 'We release our code...') nor does it include a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | We evaluate the models on ImageNet-ILSVRC2012 (Deng et al., 2009), FairFace (Karkkainen & Joo, 2021), UTKFace (Zhang et al., 2017), and MIAP (Schumann et al., 2021). |
| Dataset Splits | No | The paper describes a two-stage training process and data sizes ('1B image-text pairs... We vary the length of Stage 1 in {0%, 10%, 90%}'). However, it does not specify explicit train/validation/test dataset splits with percentages, absolute counts, or references to predefined splits for the main training dataset. |
| Hardware Specification | Yes | Hardware & Software: The model is implemented using JAX (Bradbury et al., 2018), Flax (Heek et al., 2020), and Big Vision (Beyer et al., 2022). It is trained on TPU v2. Compute Requirements: Each model is trained on 8×8 TPU v2 chips on 1B seen image-text pairs. |
| Software Dependencies | No | The paper mentions software used ('JAX', 'Flax', 'Big Vision', 'SentencePiece') and cites their origin, but it does not provide specific version numbers for these software dependencies (e.g., 'JAX 0.3.17'). |
| Experiment Setup | Yes | A.4 TRAINING CONFIGURATION: batch_size = 16_384, shuffle_buffer_size = 250_000, pp = 'decode|resize(224)|value_range(-1,1)', model.temperature_init = 10.0, optax_name = 'scale_by_adafactor', grad_clip_norm = 1.0, lr = 0.001, wd = 0.0001, schedule.decay_type = 'rsqrt', schedule.timescale = 5_000, schedule.warmup_steps = 5_000. |
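
The Pseudocode row above only quotes a fragment of the paper's Figure 7. As a reading aid, here is a minimal NumPy sketch of the general pattern that fragment describes: maintaining dual variables v and µ and deriving a non-negative per-example sample weight q from a bias vector a and the example's utility u. The `bias_vector` featurizer, the update rule, the learning rate, and the clipping used below are illustrative assumptions, not the paper's exact pseudo-code.

```python
import numpy as np


def bias_vector(y, num_constraints):
    """Hypothetical bias featurizer: maps an example's group/label y to a
    constraint-feature vector a (a simple one-hot here; the paper's a is richer)."""
    a = np.zeros(num_constraints)
    a[y % num_constraints] = 1.0
    return a


def sample_weight(a, v, mu):
    """Non-negative sample weight q derived from the dual variables v and mu."""
    return max(0.0, 1.0 + float(a @ v) + mu)


def dual_update(a, u, v, mu, lr=0.01):
    """One per-example update, in the single-pass style suggested by Figure 7.
    Illustrative only; the paper's actual objective and updates differ."""
    q = sample_weight(a, v, mu)
    v = v - lr * q * u * a      # adjust v against the weighted bias signal q*u*a
    mu = mu - lr * (q - 1.0)    # keep the average sample weight near 1
    return q, v, mu


# Single pass over a stream of (y, u) examples.
num_constraints = 4
v, mu = np.zeros(num_constraints), 0.0
rng = np.random.default_rng(0)
for _ in range(1_000):
    y = int(rng.integers(0, 10))   # group / label of the example
    u = float(rng.normal())        # example "utility" signal
    a = bias_vector(y, num_constraints)
    q, v, mu = dual_update(a, u, v, mu)
```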
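
The A.4 training configuration quoted in the last row maps naturally onto a Big Vision-style `ml_collections` config. The values below are copied verbatim from that row; the field nesting and grouping are assumptions about how such a config would be laid out, not the authors' actual config file.

```python
import ml_collections


def get_config():
    """Sketch: the reported A.4 hyper-parameters as an ml_collections.ConfigDict."""
    config = ml_collections.ConfigDict()

    # Data pipeline.
    config.batch_size = 16_384
    config.shuffle_buffer_size = 250_000
    config.pp = 'decode|resize(224)|value_range(-1,1)'

    # Model.
    config.model = ml_collections.ConfigDict()
    config.model.temperature_init = 10.0

    # Optimizer.
    config.optax_name = 'scale_by_adafactor'
    config.grad_clip_norm = 1.0
    config.lr = 0.001
    config.wd = 0.0001

    # Learning-rate schedule.
    config.schedule = ml_collections.ConfigDict()
    config.schedule.decay_type = 'rsqrt'
    config.schedule.timescale = 5_000
    config.schedule.warmup_steps = 5_000

    return config
```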