CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?

Authors: Ibrahim Alabdulmohsin, Xiao Wang, Andreas Peter Steiner, Priya Goyal, Alexander D'Amour, Xiaohua Zhai

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our study also explores the dynamic nature of how CLIP learns/unlearns biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Also, data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval. Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with recommendations for improving the efficacy of data balancing in multimodal systems.
Researcher Affiliation | Industry | Google DeepMind: Zürich, Switzerland; New York, USA; Boston, USA. {ibomohsin,xzhai}@google.com
Pseudocode | Yes | An overview of the data balancing algorithm is shown in Figure 7. It maintains two optimization variables v ∈ R^(2m(c+1)) and µ ∈ R, which are used to calculate the sample weight q by solving: ... Figure 7: Pseudo-code of the data balancing algorithm in Section 5. LEFT: Single update per example (s, y, u), where u is the example's utility. RIGHT: NumPy-like implementation of the bias vector a. (An illustrative NumPy sketch of this bias vector appears after the table.)
Open Source Code | No | The paper does not provide explicit statements about the release of its own source code (e.g., 'We release our code...'), nor does it include a direct link to a code repository for the methodology described.
Open Datasets | Yes | We evaluate the models on ImageNet-ILSVRC2012 (Deng et al., 2009), FairFace (Karkkainen & Joo, 2021), UTKFace (Zhang et al., 2017), and MIAP (Schumann et al., 2021).
Dataset Splits | No | The paper describes a two-stage training process and data sizes ('1B image-text pairs... We vary the length of Stage 1 in {0%, 10%, 90%}'). However, it does not specify explicit train/validation/test dataset splits with percentages, absolute counts, or references to predefined splits for the main training dataset.
Hardware Specification | Yes | Hardware & Software: The model is implemented using JAX (Bradbury et al., 2018), Flax (Heek et al., 2020), and Big Vision (Beyer et al., 2022). It is trained on TPU v2. Compute Requirements: Each model is trained on 8×8 TPU v2 chips on 1B seen image-text pairs.
Software Dependencies | No | The paper mentions the software used ('JAX', 'Flax', 'Big Vision', 'SentencePiece') and cites their origins, but it does not provide specific version numbers for these software dependencies (e.g., 'JAX 0.3.17').
Experiment Setup | Yes | A.4 TRAINING CONFIGURATION: batch_size = 16_384, shuffle_buffer_size = 250_000, pp = 'decode|resize(224)|value_range(-1,1)', model.temperature_init = 10.0, optax_name = 'scale_by_adafactor', grad_clip_norm = 1.0, lr = 0.001, wd = 0.0001, schedule.decay_type = 'rsqrt', schedule.timescale = 5_000, schedule.warmup_steps = 5_000. (An illustrative config sketch appears after the table.)
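
For illustration: Figure 7 itself is not reproduced in the report, but the quoted description is enough to sketch its RIGHT panel. Below is a minimal NumPy sketch, assuming s in {0, ..., m-1} indexes the sensitive group and y in {0, ..., c} indexes the label; the one-hot-plus-negation encoding is an assumption consistent with the stated 2m(c+1) dimensionality of v, not the paper's exact code.

    import numpy as np

    def bias_vector(s: int, y: int, m: int, c: int) -> np.ndarray:
        """Hypothetical bias vector a for one example, of length 2m(c+1).

        Assumes s in {0, ..., m-1} indexes the sensitive-attribute group
        and y in {0, ..., c} indexes the label (c+1 slots, e.g. one for
        "no label"). The encoding in the paper's Figure 7 may differ.
        """
        # One-hot indicator over the m * (c + 1) (group, label) cells.
        e = np.zeros(m * (c + 1))
        e[s * (c + 1) + y] = 1.0
        # Stack the indicator with its negation so a single inner product
        # v @ a can express two-sided (upper/lower) constraint violations,
        # matching the 2m(c+1) dimensionality of v quoted above.
        return np.concatenate([e, -e])

Stacking e with -e is one natural way to reach the quoted 2m(c+1) dimensionality while keeping the per-example update a single dot product against v.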
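
For illustration: the quoted hyperparameter keys follow Big Vision's config conventions, so they can be collected into an ml_collections.ConfigDict, the format that codebase consumes. This is a minimal sketch built only from the quoted values; the get_config entry point and the nesting of model and schedule are assumptions, and the rest of the training config (model definition, input pipeline, evaluators) is omitted.

    import ml_collections

    def get_config() -> ml_collections.ConfigDict:
        """Sketch of the reported A.4 training configuration."""
        config = ml_collections.ConfigDict()
        config.batch_size = 16_384
        config.shuffle_buffer_size = 250_000
        # Preprocessing pipeline: decode, resize to 224px, scale to [-1, 1].
        config.pp = 'decode|resize(224)|value_range(-1,1)'
        config.model = ml_collections.ConfigDict()
        config.model.temperature_init = 10.0
        config.optax_name = 'scale_by_adafactor'
        config.grad_clip_norm = 1.0
        config.lr = 0.001
        config.wd = 0.0001  # weight decay
        config.schedule = ml_collections.ConfigDict()
        config.schedule.decay_type = 'rsqrt'  # reciprocal-sqrt decay
        config.schedule.timescale = 5_000
        config.schedule.warmup_steps = 5_000
        return config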